
INF1340 Assignment 2 

  
 

Movie Sentiment Analysis 


Due:​ 11:59 pm Thursday November 21, 2019 
Worth:​ 13% of your final grade 
Submit to: MarkUs (markus.teach.cs.toronto.edu/inf1340-2019-09) 
Work: ​In pairs using the pair programming paradigm (​en.wikipedia.org/wiki/Pair_programming​) 
 
Sentiment analysis is a Big Data problem that seeks to determine the general attitude (positive or negative) of a 
piece of content based on a set of known sentiments. 
 
In this assignment you'll use a dataset of over 8000 reviews from Rotten Tomatoes to build a simple sentiment 
analysis model for movie critique. 

Training a model 
We will build a simple model that can reasonably predict how positive or negative a particular movie 
review is. We want our model to understand that the text ​"the film was a breath of fresh air"​ is 
largely positive while the text ​"the movie made me want to poke my eyes out"​ is largely negative. 
 
To predict the sentiment of movie reviews, we need to find and process examples of existing movie 
reviews for which we already know the sentiments. We can use this data to train our model of sentiment 
so we can apply it to new data. 
 
We have found 8528 reviews from the movie review site Rotten Tomatoes. Each review is matched to a 
score ranging from 0 to 4, with 0 being extremely negative, 2 being neutral, and 4 being extremely 
positive.  
 
All movie reviews are contained in the file ​full.txt
 
There are also smaller testing datasets provided in the files ​medium.txt​, ​small.txt​, and ​tiny.txt​. 
 
This assignment has four parts: 
 
● Part I:​ Find the sentiments of individual words by reading through the reviews file every time 
● Part II:​ Build a sentiment model in memory using a ​dictionary​ and estimate the sentiments of 
statements consisting of multiple words 
● Part III: ​ Sort the dictionary and determine the most extreme words used in movie reviews 
● Part IV: Perform an analysis of your choice using the dataset and report on your findings 
 

Version 1 (Nov. 3, 2019) 




Anatomy of a dataset 
All datasets conform to the following formatting rules: 
● Each movie review is on a separate line ending in ​'\n'
● Each line begins with a single-digit rating from 0 to 4 followed by a space 
● The dataset has been pre-processed to be easier to analyze: each review consists of ​tokens​ that 
are separated by spaces 
● Words are the most common tokens, but punctuation, contractions, numbers, etc. are also in the 
review data. This means that both regular words and non-words appear in the dataset as tokens. 
 
E.g.,  
 
2 Bond-inspired ​?
0 Staggeringly dreadful romance ​.
4 This method almost never fails him ​,​ and it works superbly here ​.
1 Weird they ​'re​ still making these ham-fisted romps ​.
3 Competently directed but terminally cute drama ​.
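
The formatting rules above mean a review line can be taken apart with a plain whitespace split. The sketch below is only an illustration; the helper name split_review is hypothetical and is not part of the starter code.

```python
from typing import List, Tuple


def split_review(line: str) -> Tuple[int, List[str]]:
    """Split one dataset line into its rating and its list of tokens.

    Hypothetical helper for illustration; assumes the line follows the
    dataset rules (single-digit rating, then space-separated tokens).
    """
    parts = line.strip().split()
    # The first token is the rating; everything after it is review text.
    return int(parts[0]), parts[1:]


rating, tokens = split_review("0 Staggeringly dreadful romance .\n")
print(rating, tokens)
```

Note that the final "." comes back as its own token, which is why words and non-words have to be told apart afterwards.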
 

Words and non-words 


We need to analyze individual words for their sentiments. To do so, we need to be able to recognize which 
of the tokens in our dataset are not meaningful words. 
 
We would like to ignore the following from our dataset: 
 
● Partial words like ​"'re"​ and ​"n't"
● Tokens containing numerical data like ​"120"​, ​"60's"​, or ​"M-16"
● Tokens consisting solely of punctuation, e.g., ​"."​, ​"..."​, ​"-"
 
However, we do want to count hyphenated words like ​"high-concept"​, ​"candy-coat"​, and 
"ultra-cheesy"​, and words with forward slashes like ​"writer/director"​.   

Fall 2019 


Known Sentiment Score (KSS) Calculation 


How would we determine if the statement ​"The movie 's downfall is to substitute plot for
personality ."​ is positive or negative? 
 
One algorithm we can use is to assign a numeric value to every word in the statement based on how 
positive or negative it is, then score the statement based on the combined value of the words.  
 
But, how do we come up with our word scores in the first place? A word's ​Known Sentiment Score (KSS)​ is 
the average of the review scores for all occurrences of the word in the dataset, regardless of case. 
 
Example: 
 
0 ​Brando​ seems uninspired by the heist script and is phoning it in
4 Marlon ​Brando​ is incredible as the patriarch of the family
1 ​Brando​ is ​Brando​ , but for this one it 's not enough
 
'brando'​ appears as a token four times:  
● once in a review with score 0  
● once in a review with score 4  
● twice in a review with score 1 
 
The Known Sentiment Score (​KSS​) for ​'brando'​ is ​(0 + 4 + 1 + 1) / 4 = ​1.5 
'the'​ appears once in a 0-score review and twice in a 4-score review, so its KSS is  
(0 + 4 + 4) / 3 = 2.67
'and'​ only appears in one 0-rated review, and so its KSS score is ​0.0 
Notice that 'and' is not counted as appearing in the second and third reviews, even though it is a 
substring of 'brando'. Only full occurrences of a word are counted in the KSS. 
 
● We are counting ​every​ appearance of a word, not just every review in which it appears 
● We are only counting occurrences of the ​entire word​ towards the KSS 
● The case of the word does ​not​ matter when determining KSS 
 
The tiny training set above has the following KSS and counts: 


Predicted Sentiment Score (PSS) Calculation 


To determine the ​Predicted Sentiment Score (PSS)​ of a statement with an unknown sentiment, we 
compute the average of the ​Known Sentiment Scores (KSS)​ of all words in the statement. We can assume 
that statements are formatted using the same rules as the reviews, with all tokens separated by spaces. 
 
There are two rules to follow: 
● If a word appears multiple times in the statement, its KSS is added multiple times 
● If a token in the statement has no KSS because it was not in the training set or is not a word, it is 
excluded from the calculation. 
 
Let's use the tiny model above to calculate the PSS of the following statement: 
 
As incredible as the script is , the execution is uninspired
 
as   incredible   as   the    script   is     ,   the    execution   is     uninspired
4    4            4    2.67   0        1.67       2.67               1.67   0


 
● ","​ is not a word and so it is excluded 
● "execution"​ is not a word with a known KSS and so it is excluded 
 
We add up the nine valid KSS scores and divide by nine, the number of scored words: 
 
(4 + 4 + 4 + 2.67 + 0 + 1.67 + 2.67 + 1.67 + 0) / 9 = 2.298
 
The Predicted Sentiment Score of the statement above is ​2.298​, which is mildly positive.  
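
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not the required statement_pss: the function name pss and the dictionary TINY_KSS are made up for the illustration, and the KSS values are the already-rounded ones from the worked example, which is why the result matches 2.298.

```python
from typing import Dict, Optional

# Rounded KSS values copied from the worked example (illustration only).
TINY_KSS: Dict[str, float] = {
    "as": 4, "incredible": 4, "the": 2.67,
    "script": 0, "is": 1.67, "uninspired": 0,
}


def pss(statement: str, kss: Dict[str, float]) -> Optional[float]:
    """Average the KSS of every token in statement that has a known score."""
    # A repeated word contributes once per appearance; unknown tokens
    # such as "," and "execution" are simply skipped.
    scores = [kss[token] for token in statement.lower().split() if token in kss]
    return sum(scores) / len(scores) if scores else None


statement = "As incredible as the script is , the execution is uninspired"
print(round(pss(statement, TINY_KSS), 3))
```

If no token in the statement has a known score, the sketch returns None rather than dividing by zero.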
 
   

 

Part I: Preparation 
(String methods, File I/O, lists, tuples) 
Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the 
file ​sentiment.py​.  

is_word(token: str)​ -> bool 


This function determines if token is a word or not. 
Words consist of alphabetic characters and may include dashes or forward slashes. 
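
One possible way to express this rule is sketched below. It is an assumption-laden illustration rather than the reference solution: it relies on str.isalpha(), which also accepts non-English letters, and it does not decide edge cases the handout leaves open (for example a token that starts or ends with a hyphen).

```python
def is_word(token: str) -> bool:
    """Return True if token consists of letters, optionally with '-' or '/'."""
    # Hyphens and forward slashes are allowed, so remove them before checking.
    letters = token.replace("-", "").replace("/", "")
    # str.isalpha() is False for the empty string, so "-" alone is rejected,
    # as is any token containing digits, apostrophes, or other punctuation.
    return letters.isalpha()


for token in ["high-concept", "writer/director", "n't", "M-16", "..."]:
    print(token, is_word(token))
```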

judge(score: float) -> str  


This function returns a human-readable judgment of a passed-in sentiment score. It lets us print 
things like ​"atrocious has a sentiment score of 0.5, which is ​negative​"

get_word_list(statement: str) -> List(str) 


This function converts a single string statement consisting of tokens separated by spaces into a list 
of approved words. Tokens that are not words according to ​is_word​ are excluded from the list. 

word_kss_scan(word: str, file: TextIO) -> float or None 


This function accepts two parameters: a single ​word​ string and an open text ​file​. Notice that 
file​ is an open file-like object and NOT a filename. This is done so that we can test the function 
without knowing exactly where you stored your data files. 
 
This function scans ​file​ from start to finish, counting the number of times ​word​ appears in it, and 
aggregating a sum of the ratings associated with each occurrence of ​word​. 
 
As long as ​word​ was seen at least once in ​file​, the function returns the average rating of the word 
(its Known Sentiment Score).  
 
If ​word​ is not found in the file object, the function returns ​None​. 
 
Notice that this function needs a freshly opened file for every word it tries to scan, and it retains 
none of the scanning work it's done. 
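
A rough sketch of this behaviour is shown below, using io.StringIO as a stand-in for an opened dataset file. The two review lines are taken from the dataset examples earlier in this handout; the body of the function is one possible approach, not the expected solution.

```python
import io
from typing import Optional, TextIO


def word_kss_scan(word: str, file: TextIO) -> Optional[float]:
    """Scan file once and return the average rating of word, or None."""
    total = 0
    count = 0
    for line in file:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        # Whole-token comparison: 'and' never matches inside 'brando'.
        occurrences = tokens[1:].count(word.lower())
        total += int(tokens[0]) * occurrences
        count += occurrences
    return total / count if count else None


reviews = io.StringIO(
    "0 Staggeringly dreadful romance .\n"
    "3 Competently directed but terminally cute drama .\n"
)
first = word_kss_scan("drama", reviews)
second = word_kss_scan("drama", reviews)  # same object, already read to the end
print(first, second)
```

The second call finds nothing because the file object has already been read to its end, which illustrates why every scan needs a freshly opened file.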
 
If we had to determine the combined score of an entire review using ​word_kss_scan​, we would have to 
run it many times, each with a freshly opened file. This is slow and inefficient, which is why we will improve 
it by extracting all words from our dataset into a dictionary in the next part. 


Part II: KSS Extraction and Statement PSS 


(Dictionaries, ​sorted​ function) 
Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the 
file ​sentiment.py​.  

extract_kss(file: TextIO) -> Dict(str, List(int)) 


This function accepts a freshly opened ​file​ composed of rated movie reviews, reads through it, 
and extracts the aggregate score and the count of all words in the file. 
The resulting dictionary should have the words from ​file​ as keys. For each word, it should store a 
list consisting of the total aggregate score for the word and the number of occurrences of the word 
in the dataset. 
 
Example: the key-value pair​ ​'film' : [45, 20]​ indicates that the word ​'film'​ was seen 20 
times in the dataset and the sum of all review scores for those 20 times is 45. 
 
We are retaining the number of occurrences for each word in the dataset because we will need 
this information later. 
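
The dictionary described above could be built in a single pass, roughly as sketched here. The inline word test is a simplified assumption standing in for is_word, and io.StringIO stands in for a freshly opened copy of the three-review example from the KSS section.

```python
import io
from typing import Dict, List, TextIO


def extract_kss(file: TextIO) -> Dict[str, List[int]]:
    """Map each word in file to [aggregate score, number of occurrences]."""
    kss: Dict[str, List[int]] = {}
    for line in file:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        rating = int(tokens[0])
        for token in tokens[1:]:
            # Simplified word test for this sketch: letters plus '-' and '/'.
            if not token.replace("-", "").replace("/", "").isalpha():
                continue
            entry = kss.setdefault(token, [0, 0])
            entry[0] += rating  # aggregate score
            entry[1] += 1       # number of occurrences
    return kss


tiny = io.StringIO(
    "0 Brando seems uninspired by the heist script and is phoning it in\n"
    "4 Marlon Brando is incredible as the patriarch of the family\n"
    "1 Brando is Brando , but for this one it 's not enough\n"
)
model = extract_kss(tiny)
print(model["brando"], model["the"])
```

Here 'brando' maps to [6, 4] and 'the' to [8, 3], which agrees with the KSS values of 1.5 and 2.67 computed by hand earlier.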

word_kss(word: str, kss: Dict(str, List(int))) -> float or None

This function returns the same value as ​word_kss_scan​ but it does not scan through an open file 
to aggregate scores for ​word​. Instead, it looks up ​word​ in the already assembled ​kss​ dictionary and 
returns the total aggregate score divided by the number of occurrences. 
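
As a small illustration of the lookup, assuming the dictionary layout from extract_kss and lowercase keys (an assumption of this sketch):

```python
from typing import Dict, List, Optional


def word_kss(word: str, kss: Dict[str, List[int]]) -> Optional[float]:
    """Look up word in kss and return aggregate score / occurrences."""
    entry = kss.get(word.lower())  # assumes keys were stored in lowercase
    if entry is None:
        return None
    total, count = entry
    return total / count


print(word_kss("Film", {"film": [45, 20]}))
```

A dictionary lookup like this takes roughly the same time however large the dataset is, unlike a full scan of the file.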

statement_pss(statement: str, kss: Dict(str, List(int))) -> float or None

This function accepts a movie review as ​statement​ and the dictionary ​kss​ containing our known 
scores. It uses ​get_word_list​ to gather a list of sensible words, then tries to find the average 
score of each of these words. If a word is not in ​kss​, it ignores it. 
It returns an average of all the scores of the words in ​statement​ that were found in ​kss​. 
If ​statement​ contained no words that were found in ​kss​, the function returns ​None​. 


Part III: Word Frequencies 


Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the 
file ​sentiment.py​.  

most_negative_words(count: int, min_occ: int, kss: Dict(str, List(int))) -> List(str, int)

This function returns a list of the ​count​ most negative words in ​kss​ that occurred at least 
min_occ​ times. It does so by calling ​most_extreme_words​ and returning the list it generates. 

most_positive_words(count: int, min_occ: int, kss: Dict(str, List(int))) -> List(str, int)

This function returns a list of the ​count​ most positive words in ​kss​ that occurred at least ​min_occ 
times. It does so by calling ​most_extreme_words​ and returning the list it generates. 

most_extreme_words(count: int, min_occ: int, kss: Dict(str, List(int)), pos: bool) -> List(str, int)

This function accepts four parameters. 
● count:​ how many extreme words should it return 
● min_occ:​ at least how many times each word should have occurred in the dataset 
● kss:​ the dataset dictionary 
● pos:​ whether the user is looking for the most positive or the most negative words. 
 
The function returns a ​list​ of length ​count​. Each item is going to be a ​list​ with three items. 
● The first item is a word (​str​) 
● The second item is the Known Sentiment Score of the word (​float​) 
● The third item is the number of times the word was seen in the dataset (​int​) 
E.g., the return list is composed of items that look like this: ​['film', 2.25, 20] 

score(item: Tuple(str, List(int))) -> float 


This is a helper function to help ​most_extreme_words ​sort the items in a dictionary. It receives a 
tuple containing a key-value pair where the key is a string and the value is a list with two integer 
members. It returns the ratio of the first member of the value to the second. 
E.g., ​If it receives the key-value pair ​('film', [45, 20])​, it will return ​45 / 20 ​or ​2.25 
Hint: ​You should make ​score​ the key function in a call to ​sorted()​ with ​kss.items()​ as the 
collection. 
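
The hint can be pictured with the short sketch below. Only the 'film' entry comes from this handout; the other two entries are made-up numbers for the illustration.

```python
from typing import Dict, List, Tuple


def score(item: Tuple[str, List[int]]) -> float:
    """Return aggregate score / occurrences for one (word, [total, count]) pair."""
    return item[1][0] / item[1][1]


# 'dreadful' and 'superb' are invented values, used only to show the ordering.
kss: Dict[str, List[int]] = {"film": [45, 20], "dreadful": [3, 6], "superb": [19, 5]}

most_negative_first = sorted(kss.items(), key=score)
most_positive_first = sorted(kss.items(), key=score, reverse=True)

print([word for word, _ in most_negative_first])
print([word for word, _ in most_positive_first])
```

Because kss.items() yields (key, value) tuples, sorted() passes each tuple to score and orders the pairs by the returned ratio; reverse=True flips the order without needing a second key function.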


Part IV: Explore 


In this part, you can do some freeform data exploration given the structures above.  
 
Here are some thoughts and concepts to get you started. You can pick one of these or make up your own, 
but you must use Python code to gather your data.  
 
● How close are our Predicted Sentiment Scores to the scores the reviews originally came with? 

● Can we generate movie reviews that sound positive or negative? Can we make them grammatically 
correct too? 

● How do the most positive and most negative word sets change as the minimum number of 
occurrences changes? Which sets seem the most movie-relevant and accurate to us? 

● Is there any way to "sharpen" the sentiment of a movie review, i.e., to make it less likely to fall in 
the neutral zone around 2.0? 

● Are there any words in the dataset that have scores significantly different from what we'd expect? 
How did you find them? 

● What are the most commonly occurring words in the dataset? Which of them are movie-specific, 
and which of them are just common words? 

● Could we recognize given names (e.g., 'Scorsese', 'Kidman', etc.) and could we compile a sentiment 
analysis list of most positively and negatively reviewed people? 

● Is our reviews dataset skewed in any way? Could we generate a non-skewed version of it? 

 
To complete this portion, write any code you may need in ​sentiment_explore.py​. Then, write your 
analysis/question/insight as a short abstract in a file with the filename "​explore​" and submit it on 
MarkUs. The file's extension is up to you (TXT, DOC, PDF, etc.), but the file size should not exceed 5 MB. 
 
Significant, clever, or original insights or processing of the data are required to earn a full grade for this 
portion of the assignment. You may have to or choose to generate new text files, CSV files, or 
visualizations. 
 
   


Starter code description 


The starter code package consists of the following files: 
● sentiment.py​ containing function headers for you to complete 
● sentiment_display.py​ is a complete file that will let you interactively test the code you 
wrote in sentiment.py. It imports everything from sentiment.py. 
● sentiment_explore.py​ imports everything from sentiment.py. Use this file for your own 
data explorations in Part IV 
 
The package includes the following datasets: 
● tiny.txt​ contains the three reviews used as examples in the KSS and PSS sections of A2 
● small.txt​ contains 29 reviews 
● medium.txt​ contains 500 reviews 
● full.txt​ contains 8528 reviews, including all reviews in ​small.txt​ and ​medium.txt
● most_common_english_words.txt​ consists of an ordered list of the 5000 most common 
words in the English language according to ​wordfrequency.info​. You don't need to use this 
set unless you are dealing with general word frequencies in Part IV. 

What to submit 
● sentiment.py​ with your completed functions 
● sentiment_explore.py with the code you wrote for your data explorations in Part IV 
● explore.[txt/pdf/doc/…] ​ describing the results of your exploration in Part IV 

Component weight 
Component                     %    Criteria

Code correctness              60   Your functions pass automated tests of correctness

Coding style and variable     20   Code is readable and robust, if statements used appropriately,
naming conventions                 minimal code repetition, informative variable names follow
                                   pothole_case

Input and output sanitation   10   No print() calls outside of __main__ blocks, no files left
                                   unclosed

Part IV: Exploration          10   An interesting question has been asked about the dataset. Data
                                   has been gathered using additional Python code, and the results
                                   have been described well.


How to submit 
1. Find a ​partner​ and follow the pair programming paradigm. 
2. Ensure you can log in to MarkUs at https://markus.teach.cs.toronto.edu/inf1340-2019-09/ 
3. Go to Assignment 2 and under “Group Information” invite your ​partner​ or join their group. You 
both need to be in the same group in order to submit. You need to know the ​UTORid​ of your 
partner to invite them.  
4. You may ​submit​ as many times as you like before the due date. You may also submit after the due 
date following the late submission penalties detailed in the course outline. 

How to test your code 

In the ​__main__​ block or Python shell 


As you work on ​sentiment.py​, you can add testing code to the ​__main__​ block of the file, or ​evaluate 
(run)​ the file to make function calls in the live Python shell to test how your functions are working. 
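
A minimal sketch of such a testing block follows. The helper count_reviews is hypothetical, and the location of tiny.txt next to the script is an assumption; the with statement makes sure the file is closed again, and the print() call stays inside the __main__ block.

```python
import os
from typing import Iterable


def count_reviews(lines: Iterable[str]) -> int:
    """Hypothetical helper: count the non-blank lines of a dataset."""
    return sum(1 for line in lines if line.strip())


if __name__ == "__main__":
    # This block runs only when the file is executed directly,
    # not when another file imports it.
    path = "tiny.txt"  # assumed to sit in the same folder as this file
    if os.path.exists(path):
        with open(path) as reviews:  # closed automatically after the block
            print(count_reviews(reviews))
```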

Use ​sentiment_display.py
sentiment_display.py​ imports from ​sentiment.py​ and repeatedly asks for input to test your code. 
If you ​have​ ​not​ completed the dictionary in Part II, you can run the ​part_i_interface​ function call.  
If you ​have​ completed the dictionary in Part II, you can uncomment the file opening block and the 
part_ii_interface​ function call and run the whole file. 
 
Note: ​sentiment_display.py​ is built to take input of the correct type and format only. There is no need 
to test it or make it work with any other data types. You do not have to submit this file. 

Finally: run automatic tests in MarkUs 


Once you have uploaded a completed ​sentiment.py​ to your MarkUs group, you will be able to run tests 
on it. This is the same framework we will use to run autotests, so checking if your code is performing well 
on MarkUs will be helpful to you. 
 
Use the “​Automated Testing​” tab under the assignment submission page and press “​Run Tests​”. Refresh 
the page in a few seconds to see a report of how your code did. You can expand the report to see what 
errors it returned. Your report will also be saved in your submission directory as ​test_output.txt​. 
 
To prevent spamming the test framework, ​you only have eight test tokens​: you can only run these tests 
six times in a 24-hour period. Once you run out, you have to wait until they refresh the following day. 
 
Passing these tests does NOT guarantee you full marks on the assignment, but not passing them 
guarantees that something is wrong with your code. 

Glossary 
Dataset:  
A collection of reviews, each on a separate line 
 
Sentiment:  
The positive or negative feeling associated with a piece of text. In A2, it's quantified as a float 
between 0 and 4, inclusive. 
 
Statement:  
A piece of text in which all tokens are separated by spaces. Statements do not begin with a review 
score. 
 
Review:  
A piece of text that begins with a 0-4 score and a space, followed by tokens separated by spaces  
 
Token:  
The part of a string that can be separated from others by splitting the string on a delimiter, e.g., 
space. Tokens may be words (also hyphenated or slashed), review scores, partial words (e.g., 
"​n't​"), numbers, or punctuation. 
 
Word:  
A token that can be interpreted as an entire word and scored as such. Words contain alphabetic 
characters, optional hyphens, and/or optional forward slashes, but no numbers and no 
punctuation. Words are case-insensitive. 
 
Aggregate score:  
The sum of the review ratings for all occurrences of a word in the dataset. If a word appears 
multiple times in a review, its rating is added multiple times to the aggregate. 
 
Count: 
The number of times a word appears as a token in the dataset. This does not include partial 
matches, e.g., the word "land" does not appear in the review "1 iceland is beautiful" 
 
KSS (Known Sentiment Score):  
A word's aggregate score in a dataset divided by its count in the dataset 
 
PSS (Predicted Sentiment Score):  
The average of the KSS of all words in a statement.   
