Training a model
We will build a simple model that can reasonably predict how positive or negative a particular movie
review is. We want our model to understand that the text "the film was a breath of fresh air" is
largely positive while the text "the movie made me want to poke my eyes out" is largely negative.
To predict the sentiment of movie reviews, we need to find and process examples of existing movie
reviews for which we already know the sentiments. We can use this data to train our model of sentiment
so we can apply it to new data.
We have found 8528 reviews from the movie review site Rotten Tomatoes. Each review is matched to a
score ranging from 0 to 4, with 0 being extremely negative, 2 being neutral, and 4 being extremely
positive.
All movie reviews are contained in the file full.txt.
There are also smaller testing datasets provided in the files medium.txt, small.txt, and tiny.txt.
This assignment has four parts:
● Part I: Find the sentiments of individual words by reading through the reviews file every time
● Part II: Build a sentiment model in memory using a dictionary and estimate the sentiments of
statements consisting of multiple words
● Part III: Sort the dictionary and determine the most extreme words used in movie reviews
● Part IV: Perform an analysis of your choice using the dataset and report on your findings
Anatomy of a dataset
All datasets conform to the following formatting rules:
● Each movie review is on a separate line ending in '\n'
● Each line begins with a single-digit rating from 0 to 4 followed by a space
● The dataset has been pre-processed to be easier to analyze: each review consists of tokens that
are separated by spaces
● Words are the most common tokens, but punctuation, contractions, numbers, etc. are also in the
review data. This means that both regular words and non-words appear in the dataset as tokens.
E.g.,
2 Bond-inspired ?
0 Staggeringly dreadful romance .
4 This method almost never fails him , and it works superbly here .
1 Weird they 're still making these ham-fisted romps .
3 Competently directed but terminally cute drama .
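The formatting rules above can be illustrated with a minimal sketch that splits one dataset line into its rating and its tokens. The function name parse_review is hypothetical and is not one of the functions required in sentiment.py; the sketch assumes every line follows the rules listed above.

```python
def parse_review(line):
    """Split one dataset line into (rating, tokens).

    Assumes the line starts with a single-digit rating followed by a
    space, and that all remaining tokens are separated by spaces.
    """
    parts = line.strip().split()
    rating = int(parts[0])   # the leading 0-4 score
    tokens = parts[1:]       # words, punctuation, partial words, etc.
    return rating, tokens


rating, tokens = parse_review("0 Staggeringly dreadful romance .\n")
print(rating, tokens)
```

Note that punctuation such as "." comes back as its own token, so any later scoring step has to decide which tokens count as words.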
Fall 2019
2
INF1340 Assignment 2
Part I: Preparation
(String methods, File I/O, lists, tuples)
Complete the Code portion of the Function Design Recipe (FDR) for each of the following functions in the
file sentiment.py.
● How do the most positive and most negative word sets change as the minimum number of
occurrences changes? Which sets seem the most movie-relevant and accurate to us?
● Is there any way to "sharpen" the sentiment of a movie review, i.e., to make it less likely to fall in
the neutral zone around 2.0?
● Are there any words in the dataset that have scores significantly different from what we'd expect?
How did you find them?
● What are the most commonly occurring words in the dataset? Which of them are movie-specific,
and which of them are just common words?
● Could we recognize given names (e.g., 'Scorsese', 'Kidman', etc.) and could we compile a sentiment
analysis list of most positively and negatively reviewed people?
● Is our reviews dataset skewed in any way? Could we generate a non-skewed version of it?
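As one illustration of how a question like the last one might be approached, the following sketch counts how many reviews carry each rating. The function name rating_distribution and the sample list are hypothetical; in practice the lines would be read from one of the dataset files.

```python
from collections import Counter


def rating_distribution(lines):
    """Count how many reviews carry each rating from 0 to 4."""
    counts = Counter(int(line.split()[0]) for line in lines if line.strip())
    # Report all five ratings, including any that never occur.
    return {rating: counts.get(rating, 0) for rating in range(5)}


sample = [
    "2 Bond-inspired ?",
    "0 Staggeringly dreadful romance .",
    "4 This method almost never fails him , and it works superbly here .",
    "1 Weird they 're still making these ham-fisted romps .",
    "3 Competently directed but terminally cute drama .",
]
print(rating_distribution(sample))
```

A strongly uneven distribution would suggest the dataset is skewed toward some ratings.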
To complete this portion, write any code you may need in sentiment_explore.py. Then, write your
analysis/question/insight as a short abstract in a file with the filename "explore" and submit it on
MarkUs. The file's extension is up to you (TXT, DOC, PDF, etc.), but the file size should not exceed 5 MB.
Significant, clever, or original insights or processing of the data are required to earn a full grade for this
portion of the assignment. You may need, or choose, to generate new text files, CSV files, or
visualizations.
What to submit
● sentiment.py with your completed functions
● sentiment_explore.py, which imports everything from sentiment.py, containing your own data
explorations for Part IV
● explore.[txt/pdf/doc/…] describing the results of your exploration in Part IV
Component weight
Component (%): Criteria
● Coding style and variable naming conventions (20%): Code is readable and robust, if statements
used appropriately, minimal code repetition, informative variable names follow pothole_case
● Part IV: Exploration (10%): An interesting question has been asked about the dataset. Data
has been gathered using additional Python code, and the results have been described well.
How to submit
1. Find a partner and follow the pair programming paradigm.
2. Ensure you can log in to MarkUs at https://markus.teach.cs.toronto.edu/inf1340-2019-09/
3. Go to Assignment 2 and under “Group Information” invite your partner or join their group. You
both need to be in the same group in order to submit. You need to know the UTORid of your
partner to invite them.
4. You may submit as many times as you like before the due date. You may also submit after the due
date following the late submission penalties detailed in the course outline.
Use sentiment_display.py
sentiment_display.py imports from sentiment.py and repeatedly asks for input to test your code.
If you have not completed the dictionary in Part II, you can run the part_i_interface function call.
If you have completed the dictionary in Part II, you can uncomment the file opening block and the
part_ii_interface function call and run the whole file.
Note: sentiment_display.py is built to take input of the correct type and format only. There is no need
to test it or make it work with any other data types. You do not have to submit this file.
Glossary
Dataset:
A collection of reviews, each on a separate line.
Sentiment:
The positive or negative feeling associated with a piece of text. In A2, it's quantified as a float
between 0 and 4, inclusive.
Statement:
A piece of text in which all tokens are separated by spaces. Statements do not begin with a review
score.
Review:
A piece of text that begins with a 0-4 score and a space, followed by tokens separated by spaces.
Token:
The part of a string that can be separated from others by splitting the string on a delimiter, e.g.,
space. Tokens may be words (also hyphenated or slashed), review scores, partial words (e.g.,
"n't"), numbers, or punctuation.
Word:
A token that can be interpreted as an entire word and scored as such. Words contain alphabetic
characters, optional hyphens, and/or optional forward slashes, but no numbers and no
punctuation. Words are case-insensitive.
Aggregate score:
The sum of the review ratings for all occurrences of a word in the dataset. If a word appears
multiple times in a review, its rating is added multiple times to the aggregate.
Count:
The number of times a word appears as a token in the dataset. This does not include partial
matches, e.g., the word "land" does not appear in the review "1 iceland is beautiful".
KSS (Known Sentiment Score):
A word's aggregate score in a dataset divided by its count in the dataset
PSS (Predicted Sentiment Score):
The average of the KSS of all words in a statement.
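The glossary definitions of word, aggregate score, count, KSS, and PSS can be tied together in a minimal sketch. All function names below (is_word, build_model, kss, pss) are hypothetical and need not match the functions required in sentiment.py. The sketch also assumes that words missing from the dataset are skipped when averaging and that None is returned when nothing can be scored; the glossary does not specify this behaviour.

```python
def is_word(token):
    """True if token has only letters, hyphens, or slashes, and at least one letter."""
    has_letter = any(ch.isalpha() for ch in token)
    return has_letter and all(ch.isalpha() or ch in "-/" for ch in token)


def build_model(reviews):
    """Map each lowercase word to [aggregate score, count]."""
    model = {}
    for review in reviews:
        parts = review.split()
        if not parts:
            continue
        rating = int(parts[0])
        for token in parts[1:]:
            if is_word(token):
                entry = model.setdefault(token.lower(), [0, 0])
                entry[0] += rating   # rating added once per occurrence
                entry[1] += 1
    return model


def kss(model, word):
    """Known Sentiment Score: aggregate score divided by count."""
    entry = model.get(word.lower())
    if entry is None:
        return None   # assumption: unseen words have no score
    return entry[0] / entry[1]


def pss(model, statement):
    """Predicted Sentiment Score: average KSS of the scorable words."""
    scores = [kss(model, token) for token in statement.split() if is_word(token)]
    scores = [score for score in scores if score is not None]
    if not scores:
        return None   # assumption: nothing to average
    return sum(scores) / len(scores)


reviews = ["4 a superb superb film", "0 a dreadful film", "1 iceland is beautiful"]
model = build_model(reviews)
print(kss(model, "superb"), kss(model, "land"), pss(model, "superb film"))
```

In this toy dataset "superb" appears twice in a review rated 4, so its aggregate score is 8 and its count is 2, while "land" has no score because partial matches inside "iceland" are not counted.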