INF1340 Assignment 2

Movie Sentiment Analysis

Due: 11:59 pm Thursday November 21, 2019
Worth: 13% of your final grade
Submit to: MarkUs (markus.teach.cs.toronto.edu/inf1340-2019-09)
Work: In pairs using the pair programming paradigm (en.wikipedia.org/wiki/Pair_programming)

Sentiment analysis is a Big Data problem that seeks to determine the general attitude (positive or negative) of a
piece of content based on a set of known sentiments.

In this assignment you'll use a dataset of over 8000 reviews from Rotten Tomatoes to build a simple sentiment
analysis model for movie critique.
Training a model
We will build a simple model that can reasonably predict how positive or negative a particular movie
review is. We want our model to understand that the text "the film was a breath of fresh air" is
largely positive while the text "the movie made me want to poke my eyes out" is largely negative.

To predict the sentiment of movie reviews, we need to find and process examples of existing movie
reviews for which we already know the sentiments. We can use this data to train our model of sentiment
so we can apply it to new data.

We have found 8528 reviews from the movie review site Rotten Tomatoes. Each review is matched to a
score ranging from 0 to 4, with 0 being extremely negative, 2 being neutral, and 4 being extremely
positive.

All movie reviews are contained in the file full.txt

There are also smaller testing datasets provided in the files medium.txt, small.txt, and tiny.txt.

This assignment has four parts:

● Part I: Find the sentiments of individual words by reading through the reviews file every time
● Part II: Build a sentiment model in memory using a dictionary and estimate the sentiments of
statements consisting of multiple words
● Part III: Sort the dictionary and determine the most extreme words used in movie reviews
● Part IV: Perform an analysis of your choice using the dataset and report on your findings

Version 1 (Nov. 3, 2019)

Anatomy of a dataset
All datasets conform to the following formatting rules:
● Each movie review is on a separate line ending in '\n'
● Each line begins with a single-digit rating from 0 to 4 followed by a space
● The dataset has been pre-processed to be easier to analyze: each review consists of tokens that
are separated by spaces
● Words are the most common tokens, but punctuation, contractions, numbers, etc. are also in the
review data. This means that both regular words and non-words appear in the dataset as tokens.

E.g.,

2 Bond-inspired ?
0 Staggeringly dreadful romance .
4 This method almost never fails him , and it works superbly here .
1 Weird they 're still making these ham-fisted romps .
3 Competently directed but terminally cute drama .
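
The formatting rules above mean a review line can be split into its rating and its tokens with a single call to str.split(). The following is a minimal sketch; the helper name parse_review is hypothetical and is not part of the starter code.

```python
from typing import List, Tuple


def parse_review(line: str) -> Tuple[int, List[str]]:
    """Split one dataset line into its rating and its list of tokens."""
    # split() with no arguments splits on whitespace and drops the trailing '\n'.
    # The first token is the single-digit rating; the rest are review tokens.
    rating, *tokens = line.split()
    return int(rating), tokens


rating, tokens = parse_review("0 Staggeringly dreadful romance .\n")
print(rating, tokens)  # 0 ['Staggeringly', 'dreadful', 'romance', '.']
```

Note that the punctuation token "." is still in the list at this stage; deciding which tokens count as words is a separate step.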

Words and non-words

We need to analyze individual words for their sentiments. To do so, we need to be able to recognize which
of the tokens in our dataset are not meaningful words.

We would like to ignore the following from our dataset:

● Partial words like "'re" and "n't"
● Tokens containing numerical data like "120", "60's", or "M-16"
● Tokens consisting solely of punctuation, e.g., ".", "...", "-"

However, we do want to count hyphenated words like "high-concept", "candy-coat", and
"ultra-cheesy", and words with forward slashes like "writer/director".
Fall 2019
Known Sentiment Score (KSS) Calculation

How would we determine if the statement "The movie 's downfall is to substitute plot for
personality ." is positive or negative?

One algorithm we can use is to assign a numeric value to every word in the statement based on how
positive or negative it is, then score the statement based on the combined value of the words.

But, how do we come up with our word scores in the first place? A word's Known Sentiment Score (KSS) is
the average of the review scores for all occurrences of the word in the dataset, regardless of case.

Example:

0 Brando seems uninspired by the heist script and is phoning it in
4 Marlon Brando is incredible as the patriarch of the family
1 Brando is Brando , but for this one it 's not enough

'brando' appears as a token four times:
● once in a review with score 0
● once in a review with score 4
● twice in a review with score 1

The Known Sentiment Score (KSS) for 'brando' is (0 + 4 + 1 + 1) / 4 = 1.5
'the' appears once in a 0-score review and twice in a 4-score review, so its KSS is
(0 + 4 + 4) / 3 = 2.67
'and' only appears in one 0-rated review, and so its KSS score is 0.0
Notice that 'and' should not be counted as appearing in the second and third review, even though it is a
substring of 'brando'. Only full occurrences of a word are counted in the KSS.

● We are counting every appearance of a word, not just every review in which it appears
● We are only counting occurrences of the entire word towards the KSS
● The case of the word does not matter when determining KSS

The tiny training set above has the following KSS and counts:
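
These values can be recomputed directly from the three example reviews. The sketch below is an illustration under stated assumptions, not the required implementation: the name build_kss_table is hypothetical, and str.isalpha() stands in for the full word test, which is sufficient here only because the tiny set contains no hyphenated or slashed words.

```python
from typing import Dict, List

TINY_REVIEWS = [
    "0 Brando seems uninspired by the heist script and is phoning it in",
    "4 Marlon Brando is incredible as the patriarch of the family",
    "1 Brando is Brando , but for this one it 's not enough",
]


def build_kss_table(reviews: List[str]) -> Dict[str, List[int]]:
    """Map each lowercased word to [aggregate score, count]."""
    table: Dict[str, List[int]] = {}
    for review in reviews:
        rating, *tokens = review.split()
        for token in tokens:
            # Simplified word test (an assumption): letters only.
            if not token.isalpha():
                continue
            entry = table.setdefault(token.lower(), [0, 0])
            entry[0] += int(rating)  # every occurrence adds the review rating
            entry[1] += 1
    return table


table = build_kss_table(TINY_REVIEWS)
# Print one row per word: word, KSS rounded to two decimals, count.
for word in sorted(table):
    total, count = table[word]
    print(f"{word:<12}{total / count:>5.2f}{count:>4}")
```

For example, the row for 'brando' shows a KSS of 1.50 with a count of 4, and the row for 'the' shows 2.67 with a count of 3, matching the worked example.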
Predicted Sentiment Score (PSS) Calculation

To determine the Predicted Sentiment Score (PSS) of a statement with an unknown sentiment, we
compute the average of the Known Sentiment Scores (KSS) of all words in the statement. We can assume
that statements are formatted using the same rules as the reviews, with all tokens separated by spaces.

There are two rules to follow:
● If a word appears multiple times in the statement, its KSS is added multiple times
● If a token in the statement has no KSS because it was not in the training set or is not a word, it is
excluded from the calculation.

Let's use the tiny model above to calculate the PSS of the following statement:

As incredible as the script is , the execution is uninspired

as   incredible   as   the    script   is     ,   the    execution   is     uninspired
4    4            4    2.67   0        1.67   -   2.67   -           1.67   0

● "," is not a word and so it is excluded
● "execution" is not a word with a known KSS and so it is excluded

We add up the nine valid KSS scores and divide by the number of words that have a score (nine):

(4 + 4 + 4 + 2.67 + 0 + 1.67 + 2.67 + 1.67 + 0) / 9 = 2.298

The Predicted Sentiment Score of the statement above is 2.298, which is mildly positive.
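
The same calculation can be sketched in code. This is a minimal illustration, assuming the tiny model is stored as [aggregate score, count] pairs; the names TINY_KSS and predicted_score are hypothetical. Because the code uses unrounded KSS values (for example 8/3 rather than 2.67), it gives roughly 2.296, slightly below the 2.298 obtained above from the rounded values.

```python
from typing import Dict, List, Optional

# [aggregate score, count] entries from the tiny model for the words used below.
TINY_KSS: Dict[str, List[int]] = {
    'as': [4, 1],
    'incredible': [4, 1],
    'the': [8, 3],
    'script': [0, 1],
    'is': [5, 3],
    'uninspired': [0, 1],
}


def predicted_score(statement: str, kss: Dict[str, List[int]]) -> Optional[float]:
    """Average the KSS of every token in statement that has an entry in kss."""
    scores = []
    for token in statement.lower().split():
        if token in kss:  # tokens without a KSS (',' and 'execution') are skipped
            total, count = kss[token]
            scores.append(total / count)  # repeated words are added each time
    if not scores:
        return None
    return sum(scores) / len(scores)


pss = predicted_score(
    "As incredible as the script is , the execution is uninspired", TINY_KSS)
print(round(pss, 3))  # 2.296
```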


Part I: Preparation
(String methods, File I/O, lists, tuples)
Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the
file sentiment.py.
is_word(token: str) -> bool

This function determines if token is a word or not.
Words consist of alphabetic characters and may include dashes or forward slashes.
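
One way to express this rule is sketched below. It is an illustration, not the reference solution, and it assumes that a token must contain at least one letter, so that a lone "-" is treated as punctuation rather than as a word.

```python
def is_word(token: str) -> bool:
    """Return True if token is made only of letters, hyphens, and slashes."""
    # Remove the two permitted separators, then require that what remains
    # is non-empty and purely alphabetic. str.isalpha() is False for ''.
    letters_only = token.replace('-', '').replace('/', '')
    return letters_only.isalpha()


for token in ["high-concept", "writer/director", "'re", "60's", "M-16", "...", "-"]:
    print(repr(token), is_word(token))
```

The first two tokens are reported as words; the partial word, the tokens containing digits, and the punctuation-only tokens are not.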
judge(score: float) -> str

This function returns a human-readable judgment of a passed-in sentiment score. It lets us print
things like "atrocious has a sentiment score of 0.5, which is negative"
get_word_list(statement: str) -> List(str)

This function converts a single string statement consisting of tokens separated by spaces into a list
of approved words. Tokens that are not words according to is_word are excluded from the list.
word_kss_scan(word: str, file: TextIO) -> float or None

This function accepts two parameters: a single word string and an open text file. Notice that
file is an open file-like object and NOT a filename. This is done so that we can test the function
without knowing exactly where you stored your data files.

This function scans file from start to finish, counting the number of times word appears in it, and
aggregating a sum of the ratings associated with each occurrence of word.

As long as word was seen at least once in file, the function returns the average rating of the word
(its Known Sentiment Score).

If word is not found in the file object, the function returns None.

Notice that this function needs a freshly opened file for every word it tries to scan, and it retains
none of the scanning work it's done.

If we had to determine the combined score of an entire review using word_kss_scan, we would have to
run it many times, each with a freshly opened file. This is slow and inefficient, which is why we will improve
it by extracting all words from our dataset into a dictionary in the next part.
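
The scanning approach described above can be sketched as follows. This is one possible reading of the specification, not the graded solution; io.StringIO is used here as a stand-in for an open text file so the example runs without any data files, and the constant TINY is a hypothetical name holding the three example reviews.

```python
import io
from typing import Optional, TextIO


def word_kss_scan(word: str, file: TextIO) -> Optional[float]:
    """Return the average rating over all whole-token matches of word in file."""
    word = word.lower()
    total = 0
    count = 0
    for line in file:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        rating = int(tokens[0])
        # Compare whole tokens so that 'and' does not match inside 'brando'.
        occurrences = tokens[1:].count(word)
        total += rating * occurrences
        count += occurrences
    return total / count if count else None


TINY = (
    "0 Brando seems uninspired by the heist script and is phoning it in\n"
    "4 Marlon Brando is incredible as the patriarch of the family\n"
    "1 Brando is Brando , but for this one it 's not enough\n"
)

# Each call gets its own fresh file-like object because a scan consumes it.
print(word_kss_scan("brando", io.StringIO(TINY)))  # 1.5
print(word_kss_scan("and", io.StringIO(TINY)))     # 0.0
print(word_kss_scan("film", io.StringIO(TINY)))    # None
```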
Part II: KSS Extraction and Statement PSS

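(Dictionaries, sorted function)
Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the
file sentiment.py.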
extract_kss(file: TextIO) -> Dict(str, List(int))

This function accepts a freshly opened file composed of rated movie reviews, reads through it,
and extracts the aggregate score and the count of all words in the file.
The resulting dictionary should have the words from file as keys. For each word, it should store a
list consisting of the total aggregate score for the word and the number of occurrences of the word
in the dataset.

Example: the key-value pair 'film' : [45, 20] indicates that the word 'film' was seen 20
times in the dataset and the sum of all review scores for those 20 times is 45.

We are retaining the number of occurrences for each word in the dataset because we will need
this information later.
word_kss(word: str, kss: Dict(str, List(int))) -> float or None

This function returns the same value as word_kss_scan but it does not scan through an open file
to aggregate scores for word. Instead, it looks up word in the already assembled kss dictionary and
returns the total aggregate score divided by the number of occurrences.
statement_pss(statement: str, kss: Dict(str, List(int))) -> float or None

This function accepts a movie review as statement and the dictionary kss containing our known
scores. It uses get_word_list to gather a list of sensible words, then tries to find the average
score of each of these words. If a word is not in kss, it ignores it.
It returns an average of all the scores of the words in statement that were found in kss.
If statement contained no words that were found in kss, the function returns None.
Part III: Word Frequencies

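Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the
file sentiment.py.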
most_negative_words(count: int, min_occ: int, kss: Dict(str, List(int))) -> List(str, int)

This function returns a list of the count most negative words in kss that occurred at least
min_occ times. It does so by calling most_extreme_words and returning the list it generates.
most_positive_words(count: int, min_occ: int, kss: Dict(str, List(int))) -> List(str, int)

This function returns a list of the count most positive words in kss that occurred at least min_occ
times. It does so by calling most_extreme_words and returning the list it generates.
most_extreme_words(count: int, min_occ: int, kss: Dict(str, List(int)), pos: bool) -> List(str, int)

This function accepts four parameters.
● count: how many extreme words should it return
● min_occ: at least how many times each word should have occurred in the dataset
● kss: the dataset dictionary
● pos: whether the user is looking for the most positive or the most negative words.

The function returns a list of length count. Each item is going to be a list with three items.
● The first item is a word (str)
● The second item is the Known Sentiment Score of the word (float)
● The third item is the number of times the word was seen in the dataset (int)
E.g., the return list is composed of items that look like this: ['film', 2.25, 20]
score(item: Tuple(str, List(int))) -> float

This is a helper function to help most_extreme_words sort the items in a dictionary. It receives a
tuple containing a key-value pair where the key is a string and the value is a list with two integer
members. It returns the ratio of the first member of the value to its second member.
E.g., If it receives the key-value pair ('film', [45, 20]), it will return 45 / 20 or 2.25
Hint: You should make score the key function in a call to sorted() with kss.items() as the
collection.
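
The hint above can be sketched as follows. This is a minimal illustration under assumptions the handout does not spell out: words are filtered by min_occ before sorting, ties keep dictionary order, and fewer than count items are returned if too few words qualify. The values in sample are made up for the demonstration, apart from 'film'.

```python
from typing import Dict, List, Tuple


def score(item: Tuple[str, List[int]]) -> float:
    """Return aggregate score divided by count for one (word, [total, count]) pair."""
    return item[1][0] / item[1][1]


def most_extreme_words(count: int, min_occ: int,
                       kss: Dict[str, List[int]], pos: bool) -> List[list]:
    """Return up to count [word, KSS, occurrences] lists, most extreme first."""
    # Keep only words seen at least min_occ times.
    frequent = [item for item in kss.items() if item[1][1] >= min_occ]
    # reverse=True puts the highest scores first when pos is True.
    ranked = sorted(frequent, key=score, reverse=pos)
    return [[word, score((word, stats)), stats[1]] for word, stats in ranked[:count]]


# Made-up values for illustration only.
sample = {'film': [45, 20], 'dull': [3, 6], 'superb': [19, 5], 'rare': [4, 1]}
print(most_extreme_words(2, 5, sample, True))   # [['superb', 3.8, 5], ['film', 2.25, 20]]
print(most_extreme_words(1, 5, sample, False))  # [['dull', 0.5, 6]]
```

In this sample 'rare' has the highest ratio but is excluded because it occurs only once, which is the purpose of min_occ.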
Part IV: Explore

In this part, you can do some freeform data exploration given the structures above.

Here are some thoughts and concepts to get you started. You can pick one of these or make up your own,
but you must use Python code to gather your data.

● How close are our Predicted Sentiment Scores to the scores the reviews originally came with?
● Can we generate movie reviews that sound positive or negative? Can we make them grammatically
correct too?
● How do the most positive and most negative word sets change as the minimum number of
occurrences changes? Which sets seem the most movie-relevant and accurate to us?
● Is there any way to "sharpen" the sentiment of a movie review, i.e., to make it less likely to fall in
the neutral zone around 2.0?
● Are there any words in the dataset that have scores significantly different from what we'd expect?
How did you find them?
● What are the most commonly occurring words in the dataset? Which of them are movie-specific,
and which of them are just common words?
● Could we recognize given names (e.g., 'Scorsese', 'Kidman', etc.) and could we compile a sentiment
analysis list of most positively and negatively reviewed people?
● Is our reviews dataset skewed in any way? Could we generate a non-skewed version of it?
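
As a starting point for the last question, the distribution of ratings can be tallied in a few lines. This is only a sketch of one possible first step; the function name rating_distribution is hypothetical, and io.StringIO with the five sample reviews from the dataset description stands in for an opened dataset file.

```python
import io
from collections import Counter
from typing import TextIO


def rating_distribution(file: TextIO) -> Counter:
    """Count how many reviews carry each rating from 0 to 4."""
    counts: Counter = Counter()
    for line in file:
        if line.strip():  # ignore blank lines
            counts[int(line.split()[0])] += 1
    return counts


sample = io.StringIO(
    "2 Bond-inspired ?\n"
    "0 Staggeringly dreadful romance .\n"
    "4 This method almost never fails him , and it works superbly here .\n"
    "1 Weird they 're still making these ham-fisted romps .\n"
    "3 Competently directed but terminally cute drama .\n"
)
distribution = rating_distribution(sample)
for rating in range(5):
    print(rating, distribution[rating])
```

Running the same function on full.txt would show whether some ratings are much more common than others.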

To complete this portion, write any code you may need in sentiment_explore.py. Then, write your
analysis/question/insight as a short abstract in a file with the filename "explore" and submit it on
MarkUs. The file's extension is up to you (TXT, DOC, PDF, etc.), but the file size should not exceed 5 MB.

Significant, clever, or original insights or processing of the data are required to earn a full grade for this
portion of the assignment. You may have to or choose to generate new text files, CSV files, or
visualizations.

Starter code description

The starter code package consists of the following files:
● sentiment.py containing function headers for you to complete
● sentiment_display.py is a complete file that will let you interactively test the code you
wrote in sentiment.py. It imports everything from sentiment.py.
● sentiment_explore.py imports everything from sentiment.py. Use this file for your own
data explorations in Part IV

The package includes the following datasets:
● tiny.txt contains the three reviews used as examples in the KSS and PSS sections of A2
● small.txt contains 29 reviews
● medium.txt contains 500 reviews
● full.txt contains 8528 reviews, including all reviews in small.txt and medium.txt
● most_common_english_words.txt consists of an ordered list of the 5000 most common
words in the English language according to wordfrequency.info. You don't need to use this
set unless you are dealing with general word frequencies in Part IV.
What to submit
● sentiment.py with your completed functions
● sentiment_explore.py containing the code you wrote for your data explorations in Part IV
● explore.[txt/pdf/doc/…] describing the results of your exploration in Part IV
Component weight
Component                                      %    Criteria
Code correctness                               60   Your functions pass automated tests of correctness
Coding style and variable naming conventions   20   Code is readable and robust, if statements used appropriately, minimal code repetition, informative variable names follow pothole_case
Input and output sanitation                    10   No print() calls outside of __main__ blocks, no files left unclosed
Part IV: Exploration                           10   An interesting question has been asked about the dataset. Data has been gathered using additional Python code, and the results have been described well.
How to submit
1. Find a partner and follow the pair programming paradigm.
2. Ensure you can log in to MarkUs at https://markus.teach.cs.toronto.edu/inf1340-2019-09/
3. Go to Assignment 2 and under “Group Information” invite your partner or join their group. You
both need to be in the same group in order to submit. You need to know the UTORid of your
partner to invite them.
4. You may submit as many times as you like before the due date. You may also submit after the due
date following the late submission penalties detailed in the course outline.
How to test your code
In the __main__ block or Python shell

As you work on sentiment.py, you can add testing code to the __main__ block of the file, or evaluate
(run) the file to make function calls in the live Python shell to test how your functions are working.
Use sentiment_display.py
sentiment_display.py imports from sentiment.py and repeatedly asks for input to test your code.
If you have not completed the dictionary in Part II, you can run the part_i_interface function call.
If you have completed the dictionary in Part II, you can uncomment the file opening block and the
part_ii_interface function call and run the whole file.

Note: sentiment_display.py is built to take input of the correct type and format only. There is no need
to test it or make it work with any other data types. You do not have to submit this file.
Finally: run automatic tests in MarkUs

Once you have uploaded a completed sentiment.py to your MarkUs group, you will be able to run tests
on it. This is the same framework we will use to run autotests, so checking if your code is performing well
on MarkUs will be helpful to you.

Use the “Automated Testing” tab under the assignment submission page and press “Run Tests”. Refresh
the page in a few seconds to see a report of how your code did. You can expand the report to see what
errors it returned. Your report will also be saved in your submission directory as test_output.txt.

To prevent spamming the test framework, you only have eight test tokens: you can only run these tests
six times in a 24-hour period. Once you run out, you have to wait until they refresh the following day.

Passing these tests does NOT guarantee you full marks on the assignment, but not passing them
guarantees that something is wrong with your code.
Glossary
Dataset:
A collection of reviews, each on a separate line

Sentiment:
The positive or negative feeling associated with a piece of text. In A2, it's quantified as a float
between 0 and 4, inclusive.

Statement:
A piece of text in which all tokens are separated by spaces. Statements do not begin with a review
score.

Review:
A piece of text that begins with a 0-4 score and a space, followed by tokens separated by spaces

Token:
The part of a string that can be separated from others by splitting the string on a delimiter, e.g.,
space. Tokens may be words (also hyphenated or slashed), review scores, partial words (e.g.,
"n't"), numbers, or punctuation.

Word:
A token that can be interpreted as an entire word and scored as such. Words contain alphabetic
characters, optional hyphens, and/or optional forward slashes, but no numbers and no
punctuation. Words are case-insensitive.

Aggregate score:
The sum of the review ratings for all occurrences of a word in the dataset. If a word appears
multiple times in a review, its rating is added multiple times to the aggregate.

Count:
The number of times a word appears as a token in the dataset. This does not include partial
matches, e.g., the word "land" does not appear in the review "1 iceland is beautiful"

KSS (Known Sentiment Score):
A word's aggregate score in a dataset divided by its count in the dataset

PSS (Predicted Sentiment Score):
The average of the KSS of all words in a statement.