Universidad de las
Américas Puebla
Neural Networks
Final Report
Sentiment Analysis with Python
and Scikit-Learn
Carmen Paola Hernández Morales
Héctor Beristain Bermúdez
ID 146873
ID 145826
Objectives
Analyze a particular application of sentiment analysis at the document
level.
Determine whether the text expresses a positive or negative sentiment
towards the product being discussed.
Theoretical framework
In general, a learning problem considers a set of n data samples and then
tries to predict properties of unknown data.
We can categorize learning problems as follows:
Supervised learning, where the data comes with additional attributes
that we want to predict. This kind of problem comprises:
Classification: samples belong to two or more classes, and we want to
learn from already-labeled data how to predict the class of unlabeled
data. An example of a classification problem would be handwritten digit
recognition, where the objective is to assign each input vector to one of a
finite number of discrete categories. Another way to think of classification
is as a discrete (as opposed to continuous) form of supervised learning,
where we have a limited number of categories and, for each of the n
samples provided, we try to label it with the correct category or class.
Regression: If the desired output consists of one or more continuous
variables, then the task is called regression. An example of a regression
problem would be predicting the length of a salmon as a function of age and
weight.
Unsupervised learning, in which the training data consist of a set of input
vectors x without any corresponding target values. The goal in this type of
problem can be to find groups of similar examples within the data, which is
called clustering; to determine the distribution of data in the input space,
known as density estimation; or to project data from a high-dimensional
space down to two or three dimensions for the purpose of visualization.
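As an illustration of the supervised classification setting described above (the handwritten-digit example), scikit-learn's bundled digits dataset can be fit in a few lines; this is only a sketch, and the SVC parameter values here are illustrative, not tuned:

```python
from sklearn import datasets, svm

# Load the bundled handwritten-digit dataset (1,797 8x8 images).
digits = datasets.load_digits()

# Train a support vector classifier on all but the last sample.
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict the class (0-9) of the held-out last sample.
print(clf.predict(digits.data[-1:]))
```

The classifier assigns the held-out input vector to one of the ten discrete categories, exactly the classification task described above.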
Development
In the words of Bonzanini, sentiment analysis can be defined as the
process of determining whether a piece of writing is positive, negative or
neutral [1]. Sentiment analysis is a field of study that analyzes people's
opinions towards entities such as products, usually expressed in written form
as online reviews. In recent years, it has received much attention in academia
and industry, thanks to the popularity of social networks, which provide a
constant stream of full-text data for analysis.
We are using Python, and in particular scikit-learn, for these experiments.
To install scikit-learn, use the following command in the console:
Last login: Thu Nov 12 10:14:40 on console
MacBook-Pro-de-Hector-5:~ HectorBeristainBermudez$ pip install -U scikit-learn
Collecting scikit-learn
Downloading scikit-learn-0.17.tar.gz (7.8MB)
100% || 7.8MB 36kB/s
The dataset used for these experiments is known as Polarity Dataset v2.0,
downloadable from the Movie Review Data link provided by Bonzanini.
The dataset contains 2,000 documents, labeled and preprocessed. In
particular, there are two labels, positive and negative, with 1,000 documents
in each block. Each line of a document is one sentence. The preprocessing
absorbs most of the work we would otherwise have to do to get started, so
we can focus on the classification problem.
Real-world data are often unstructured and need suitable
preprocessing before we can make good use of them. All we need to do in
this case is read the files and split them into words on whitespace.
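Since each line of a document is already one preprocessed sentence, splitting on whitespace is all the tokenization required; a minimal sketch, using a made-up review line:

```python
# A hypothetical line from one of the preprocessed documents.
line = "the movie was surprisingly good and the acting was solid"

# All the preprocessing we need: split into words on whitespace.
tokens = line.split()
print(tokens[:4])  # ['the', 'movie', 'was', 'surprisingly']
```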
The complete code may be found as a Gist on Marco Bonzanini's GitHub. In
the following, we explain the main tasks of the script:
# You need to install scikit-learn:
# sudo pip install scikit-learn
#
# Dataset: Polarity dataset v2.0
# http://www.cs.cornell.edu/people/pabo/movie-review-data/
#
import sys
import os
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report
def usage():
    print("Usage:")
    print("python %s <data_dir>" % sys.argv[0])

if __name__ == '__main__':

    if len(sys.argv) < 2:
        usage()
        sys.exit(1)

    data_dir = sys.argv[1]
    classes = ['pos', 'neg']
The first part reads the content of the files and creates lists of training/testing
documents and labels.
We split the data set into training (90% of the documents) and testing (10%)
by exploiting the file names (they all start with cvX, with X=[0..9]). This
structure calls for k-fold cross-validation, not implemented in the example
but fairly easy to integrate.
    # Read the data
    train_data = []
    train_labels = []
    test_data = []
    test_labels = []
    for curr_class in classes:
        dirname = os.path.join(data_dir, curr_class)
        for fname in os.listdir(dirname):
            with open(os.path.join(dirname, fname), 'r') as f:
                content = f.read()
                if fname.startswith('cv9'):
                    test_data.append(content)
                    test_labels.append(curr_class)
                else:
                    train_data.append(content)
                    train_labels.append(curr_class)
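The fixed 90/10 split above could, as noted, be replaced by k-fold cross-validation. A minimal sketch using scikit-learn's cross_val_score on a TF-IDF pipeline; the toy documents here are hypothetical stand-ins for the train_data/train_labels lists built above, and the module paths follow current scikit-learn (older releases used sklearn.cross_validation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the train_data/train_labels lists.
docs = ["great movie", "awful plot", "loved it", "terrible acting"] * 5
labels = ["pos", "neg", "pos", "neg"] * 5

# Pipeline: vectorization is re-fit inside each fold, avoiding leakage.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(pipeline, docs, labels, cv=5)
print(scores.mean())
```

Running the vectorizer inside the pipeline, rather than once over the whole dataset, keeps each fold's test documents truly unseen during training.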
The second step vectorizes the documents with TF-IDF. The script first
instantiates the vectorizer (the parameter values follow Bonzanini's script)
and then transforms the training and test data:
    vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)
    train_vectors = vectorizer.fit_transform(train_data)
    test_vectors = vectorizer.transform(test_data)
The SVC() class generates an SVM classifier with the RBF (Gaussian) kernel
as the default option (several other kernels are available).
The fit() method will perform the training and it requires the training
data processed by the vectorizer as well as the correct class labels.
The classification step consists in predicting the labels for the test data.
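The training and prediction step just described can be sketched as follows. This is a self-contained sketch, not the full script: the toy documents stand in for the lists built earlier, and the variable names (classifier_rbf, time_rbf_train, time_rbf_predict) match those used in the print statements below:

```python
import time

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the train/test document lists built earlier.
train_data = ["a great film", "a terrible film", "wonderful acting", "boring plot"]
train_labels = ["pos", "neg", "pos", "neg"]
test_data = ["great acting", "terrible plot"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

# Train the SVM (RBF kernel by default) and time both phases.
classifier_rbf = svm.SVC()
t0 = time.time()
classifier_rbf.fit(train_vectors, train_labels)
t1 = time.time()
prediction_rbf = classifier_rbf.predict(test_vectors)
t2 = time.time()
time_rbf_train = t1 - t0
time_rbf_predict = t2 - t1
```

The same pattern is repeated in the script for svm.SVC(kernel='linear') and svm.LinearSVC(), which produces the three result blocks printed below.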
After performing the classification, we print the quality results using
classification_report(), and some timing information.
# Print results in a nice table
print("Results for SVC(kernel=rbf)")
print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
print(classification_report(test_labels, prediction_rbf))
print("Results for SVC(kernel=linear)")
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
print(classification_report(test_labels, prediction_linear))
print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(test_labels, prediction_liblinear))
Following the link to the complete code on Gist/GitHub at the end of the
article, we saved the script and then called it from the command line with:
Last login: Mon Nov 23 09:13:34 on ttys000
MacBook-Pro-de-Hector-5:~ HectorBeristainBermudez$ cd desktop
MacBook-Pro-de-Hector-5:desktop HectorBeristainBermudez$ python Bonzanini.py review_polarity/txt_sentoken/
/Users/HectorBeristainBermudez/anaconda/lib/python3.5/site-packages/sklearn/utils/fixes.py:64:
DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
if 'order' in inspect.getargspec(np.copy)[0]:
Results for SVC(kernel=rbf)
Training time: 8.367630s; Prediction time: 0.805447s
             precision    recall  f1-score   support

        neg       0.86      0.75      0.80       100
        pos       0.78      0.88      0.83       100
avg / total       0.82      0.81      0.81       200

Results for SVC(kernel=linear)
             precision    recall  f1-score   support

        neg       0.91      0.92      0.92       100
        pos       0.92      0.91      0.91       100
avg / total       0.92      0.92      0.91       200

Results for LinearSVC()
             precision    recall  f1-score   support

        neg       0.92      0.94      0.93       100
        pos       0.94      0.92      0.93       100
avg / total       0.93      0.93      0.93       200

MacBook-Pro-de-Hector-5:desktop HectorBeristainBermudez$
MacBook-Pro-de-Hector-5:desktop HectorBeristainBermudez$
The default RBF kernel performs worse than the linear kernel. This opens a
discussion on Gaussian vs. linear kernels that is not really part of this report,
but as a rule of thumb, when the number of features is much higher than the
number of samples (documents), a linear kernel is probably the preferred
choice. Moreover, there are options to properly tune the parameters of an
RBF kernel.
SVC() with a linear kernel is much slower than LinearSVC(). This is easily
explained by the fact that, under the hood, scikit-learn relies on different C
libraries: in particular, SVC() is implemented using libSVM, while LinearSVC()
is implemented using liblinear, which is explicitly designed for this kind of
application.
Conclusions
We discussed an application of sentiment analysis, addressed as a
document classification problem with Python and Scikit-Learn.
The choice of the classifier and of the feature extraction process influences
the overall quality of the results, and it is always good to experiment with
different configurations.
Scikit-learn offers many options from this point of view.
Knowing the underlying implementation also allows a better choice in terms
of speed.
References
[1] Bonzanini, M. (2015, January 19). Sentiment Analysis with Python and
scikit-learn. Retrieved November 20, 2015.
[2] Scikit-Learn. (n.d.). Retrieved November 20, 2015.
[3] Sentiment Analysis. (2015). Retrieved November 20, 2015.
[4] 1.4. Support Vector Machines. (n.d.). Retrieved November 22, 2015.