1 Text Mining
Text mining refers to the process of deriving high-quality information from
text. It describes a set of linguistic, statistical, and machine learning
techniques that model and structure the information content of textual
sources for business intelligence, exploratory data analysis, research, or
investigation. Text mining is a variation of data mining, which tries to
find interesting patterns in databases. As most information is currently
stored as text, text mining is believed to have high potential commercial
value. Text analysis processes typically include:
a. Information retrieval or identification of a corpus. This step includes
collecting or identifying a set of textual materials on the web, a file
system, a database, or a content management system;
b. Applying natural language processing, such as part-of-speech tagging,
syntactic parsing, and other types of linguistic analysis;
c. Named entity recognition, which uses statistical techniques to identify
named text features: people, organizations, place names, and so on;
d. Recognition of pattern-identified entities. Features such as telephone
numbers, email addresses, and quantities can be discerned with regular
expressions;
e. Coreference resolution, the identification of noun phrases and other
terms that refer to the same object;
f. Identification of associations among entities and other information in
text;
g. Sentiment analysis, which involves discerning subjective material and
extracting various forms of attitudinal information, such as opinion, mood,
and emotion;
h. Quantitative text analysis, a set of techniques stemming from the social
sciences that is used to find out the meaning or stylistic patterns of a
casual personal text.
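As an illustration of step d, the following minimal Python sketch pulls
email addresses and telephone numbers out of free text with regular
expressions. The two patterns, the function name extract_entities, and the
sample sentence are assumptions made for this example; they are deliberately
simplified and would miss many real-world formats.

```python
import re

# Simplified, illustrative patterns (assumptions, not taken from the text).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", then at least eight characters of digits, spaces, or hyphens,
# starting and ending with a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")


def extract_entities(text):
    """Return the pattern-identified entities found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


# Hypothetical sample input.
sample = "Contact alice@example.com or call +62 21-555-0199 before noon."
entities = extract_entities(sample)
print(entities)
```

Named entities such as people or organizations (step c) generally cannot be
captured this way, which is why statistical techniques are used for them.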
Text mining is now broadly applied in various fields, including security,
biomedicine, software applications, sentiment analysis, marketing, and
academic applications.
Sentiment analysis refers to the use of natural language processing, text
analysis,
and computational linguistics to identify and extract subjective
information in
source materials. The basic task in sentiment analysis is classifying the
polarity of
posted tweets per day [24]. The service also handled 1.6 billion search
queries per
day. This high popularity has led Twitter to be used for various purposes,
such as political campaigns, learning media, and advertising, although it
also faces various issues and controversies regarding security, user
privacy, lawsuits, and censorship [17].
2.4.2 Tweets
Tweets are text messages sent by users and are limited to 140 characters.
Users may subscribe to other users' tweets; this is known as following, and
the subscribers are known as followers. Users can group posts together by
topic or type by using hashtags, words or phrases prefixed with a "#" sign.
Similarly, the "@" sign followed by a username is used for mentioning or
replying to other users. To repost a message from another user and share it
with one's own followers, users apply the retweet function, which is
symbolized by "RT" before the message.
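The conventions described above can be sketched in a few lines of Python.
This is only an illustration: the function name parse_tweet, the regular
expressions, and the sample tweet are assumptions for this example, and the
patterns treat a hashtag or username as a run of letters, digits, and
underscores, which is a simplification.

```python
import re

# Assumed, simplified patterns for hashtags and mentions.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
MAX_TWEET_LENGTH = 140  # character limit stated in the text


def parse_tweet(text):
    """Extract the basic structural features of a tweet."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        # A retweet is marked by "RT" before the message.
        "is_retweet": text.startswith("RT "),
        "within_limit": len(text) <= MAX_TWEET_LENGTH,
    }


# Hypothetical sample tweet.
tweet = "RT @rstats_user: Learning #textmining with #R today"
info = parse_tweet(tweet)
print(info)
```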
A word, phrase, or topic that is tagged at a greater rate than other tags
is said to be a trending topic. Trending topics become popular either
through a concerted effort by users or because of an event that prompts
people to talk about one specific topic. These topics help Twitter and its
users understand what is happening in the world.
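A rough way to approximate this idea is to count how often each hashtag
occurs in a batch of tweets and rank the tags by frequency. The sketch below
assumes a plain frequency count; it is not Twitter's actual trending
algorithm, and the function name top_hashtags and the sample tweets are
hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")  # assumed, simplified hashtag pattern


def top_hashtags(tweets, n=3):
    """Return the `n` most frequent hashtags as (tag, count) pairs."""
    counts = Counter()
    for text in tweets:
        # Lowercase the tags and count each tag at most once per tweet.
        counts.update({tag.lower() for tag in HASHTAG_RE.findall(text)})
    return counts.most_common(n)


# Hypothetical sample data.
tweets = [
    "Election results tonight #vote",
    "Go and #vote today! #democracy",
    "Coffee first, then #vote",
    "New #R release is out",
]
trending = top_hashtags(tweets, n=1)
print(trending)
```

Counting within a fixed time window and comparing against earlier windows
would be closer to the notion of a tagging rate, at the cost of a longer
example.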
2.5 R
R is a free programming language and software environment for statistical
computing and graphics. It supports linear and nonlinear modeling,
classical statistical tests, time-series analysis, classification,
clustering, and other techniques. The R language is widely used among
statisticians and data miners for developing statistical software and for
data analysis. Polls and surveys of data miners are showing R's
k. Weka, which allows the use of the data mining capabilities in Weka and
statistical analysis in R.
2.5.3 R Add-on Packages
The capabilities of R are extended through user-created packages, which
provide specialized statistical techniques, graphical devices, import and
export capabilities, reporting tools, and more. These packages are
developed primarily in R, and sometimes in Java, C, and Fortran. A core set
of packages is included with the installation of R, with 5300 additional
packages (as of April 2012) available at the Comprehensive R Archive
Network (CRAN), Bioconductor, and other repositories.
2.5.3.1 Add-on Packages in R
The R distribution comes with the following packages:
a. base, base R functions (and datasets before R 2.0.0);
b. compiler, R byte code compiler (added in R 2.13.0);
c. datasets, base R datasets (added in R 2.0.0);
d. grDevices, graphics devices for base and grid graphics (added in R
2.0.0);
e. graphics, R functions for base graphics;
f. grid, a rewrite of the graphics layout capabilities, plus some support for
interaction;
g. methods, formally defined methods and classes for R objects;
h. parallel, support for parallel computation, including by forking and by
sockets, and random-number generation (added in R 2.14.0);
i. splines, regression spline functions and classes;
j. stats, R statistical functions;
k. stats4, statistical functions using S4 classes;
l. tcltk, interface and language bindings to Tcl/Tk GUI elements;
m. tools, tools for package development and administration;
n. utils, R utility functions.
These base packages were substantially reorganized in R 1.9.0. The former
base package was split into four packages: base, graphics, stats, and
utils. The packages ctest, eda, modreg, mva, nls, stepfun, and ts were
merged into stats, and the package mle was moved to stats4.
2.5.3.2 Add-on Packages from CRAN
The Comprehensive R Archive Network (CRAN) is a collection of sites
which