Hongting Chen
Computer Science Department, CIMS
New York University
USA
e-mail: hc1924@nyu.edu
Abstract—In the world of finance, one key lesson is the importance of psychology in the behavior of financial markets. Many investors are irrationally exuberant when making financial decisions, but predictive analytics can generate insights that are free of investors’ emotions, and hence irrational exuberance in decision-making can be mitigated. Data sources that investors adopt in their investment decision-making are, in most cases, traditional – including quarterly earnings reports and financial statements. In this work, we propose a predictive analytics framework that aims at mining insights from two alternative data sources: news articles and micro-blogs. We investigate the predictive correlation and causation between (1) collective opinion mining in news articles fused with Twitter mood and (2) movements in financial markets. Experimental results indicate a relationship between stock market prices and collective opinion mining variations on news articles combined with Twitter’s sentiment variations. The framework introduced in this work could potentially be adopted as a supplement to the conventional analyses being used in major investment banks. This research was partially funded by the Australian government under the Awards-Endeavour research grant.

Keywords—predictive analytics, machine learning, artificial intelligence, applied data science, finance, wall street, investment banking

I. INTRODUCTION

Words matter. Even a few hundred characters on Twitter can influence stock prices, as seen in two recent instances that made headlines:

1. In March 2015, Elon Musk of Tesla tweeted about unveiling a new product: “Major new product line – not a car – will be unveiled at our Hawthorne Design Studio on Thurs 8pm, April 30.” It was reported that in just a few minutes, a one billion US dollar increase was registered in Tesla’s capitalization [1], along with a 3% increase in Tesla’s stock price. Netizens flocked to guess possible new products that Tesla might announce: some thought it might be a new solar energy product, a Tesla hovercraft, or a flying human broom. It turned out that there was no major product announcement; Elon Musk was simply referring to a software update that could help Tesla owners locate charging stations [2]. Just a few minutes after the Hawthorne event, Tesla stock fell several points.

2. Analogously, in December of 2016, President Trump tweeted: “Based on the tremendous cost and cost overruns of the Lockheed Martin F-35, I have asked Boeing to price-out a comparable F-18 Super Hornet!” After the tweet, it was reported that Lockheed Martin shares fell about two percent, while Boeing shares were up 0.5 percent. The President’s tweet might have lost Lockheed Martin about one billion dollars in market value [3].

Data is becoming a catalyst of decision-making, growth and change; indeed, it is the new oil. However, data in its raw form is useless. Dr. Eric Siegel said that “The real value of raw data is what is discovered therein” [4]. Predictive analytics can interrogate data and derive data-driven insights. It is the art and science of mining data to extract novel and ultimately useful trends to make better data-driven decisions [5]. Using predictive analytics to understand financial markets has long been a research subject in both academia and finance.

In 1999, Wysocki was the first researcher to investigate the predictive power of micro-blogs in finance. He investigated the predictive correlation between Yahoo! Finance message boards and stock prices [6] and discovered that message volume can predict changes in next day stock returns. However, in 2001, Tumarkin et al. [7] contradicted the work of Wysocki and reported no correlative or causative
be used to collect news related to the Australian market and in the end incorporated news from two sources: HotCopper and Bing News. HotCopper is an Australian stock market online chat forum that allows its users to discuss financial topics. Using these two sources, we assembled over three thousand news articles from 2017 about the top fourteen companies in the Australia Index Fund (EWA) mentioned in Table I.

C. Stock Market Data

The Yahoo Finance interface was used to collect the data for the Australia Index Fund (EWA). Data was collected for daily open, close, high, and low prices and for traded volume. An exchange-traded fund, ETF for short, is a marketable security that tracks an index or a basket of assets like an index fund. It differs from a mutual fund in that an ETF trades like a common stock on a stock exchange; an ETF can be bought and sold. Investors might find an ETF more attractive than mutual fund shares because an ETF has higher daily liquidity and lower fees than mutual funds. One of the main reasons that led us to select an ETF for our data is that it offers exposure to the Australian equity market based on social media sentiment. Studying the Australia Index Fund can be useful for financial decision making given that the underlying equities and trading region of the fund are in two different countries with two macro-economic factors.

Figure 1. Graph of the ETF’s entities

III. EXPERIMENTAL METHODS AND RESULTS

The data science problem addressed in this paper can be formulated as follows: Given (1) historical daily prices and volume of the exchange-traded fund, (2) a collection of tweets associated with the ETF, and (3) news articles published on any given day that are related to the ETF in question, predict the direction of the ETF stock for the next day (up or down). In the following sections, we explain the data pre-processing methods, opinion mining algorithms, feature engineering methods, and data classification techniques used in this study.

A. Data Pre-processing

We applied standard text processing algorithms to Twitter data to remove special characters, numbers, and stop words. We removed all punctuation aside from exclamation points and question marks, as they were used in the sentiment analysis that will be described later in this section. Stemming was applied to all tweets, reducing words with the same root to that root form. For instance, “connection”, “connections”, and “connective” were all linked to their root form, “connect”. Tweets and news articles were stored in a Hadoop cluster, and then a map-reduce version of the Porter stemming algorithm was applied. After employing these pre-processing techniques, all the stemmed terms and tweets were combined into one document-feature matrix. We adopted a time-based Term Frequency and Inverse Document Frequency (TF-IDF) as a measure to normalize terms and their frequency for a specific window of time. Equation (1) represents the inverse document frequency, where N is the total number of days under the window of observation. Term frequency tf(t, d) corresponds to the number of occurrences of a word over a day in tweet and news documents.

$$ \mathrm{idf}(t, d) = \log\left( \frac{N}{\left|\{ d \in D : t \in d \}\right|} \right) \qquad (1) $$

B. Ensemble of Opinion Mining Algorithms for Generating Mood Time Series

Opinion mining is the process of computationally categorizing opinions expressed in text [13]. The goal of a sentiment analysis algorithm is to determine the attitude and emotion of the author towards the topic mentioned in his or her text [14]. After pre-processing, we designed an ensemble of two sentiment analysis algorithms: OpinionFinder (OF) and the Stanford natural language processing programming interface (SNLP). OpinionFinder is a sentiment analysis algorithm that can be used to identify sentence-level subjectivity [15]. OF relies on a lexicon of over nine thousand positive and negative words. The algorithm is known in the literature for its capability to successfully analyze the emotional context of a large collection of text.

For every tweet and news article headline, we computed a sentiment score based on the positive and negative words contained in the text. Although OF pulls from a large dictionary of words, it ignores the word order of the sentence, which could potentially lead to inaccuracies in predicting sentiment scores. For that reason, we adopted Stanford NLP in our ensemble in addition to OpinionFinder. SNLP is a deep learning model that builds up features based on the sentence structure. It computes the sentiment based on how words compose the meaning of longer phrases. The SNLP algorithm is based on a Recursive Neural Network implementation that builds on top of grammatical structures.

As we were processing the tweets, we found an overwhelming use of emojis, relatively small digital images (typically smileys and ideograms) used to express an emotion. In order to enhance the sentiment score on Twitter, we included analysis of emojis, which were frequently used in the tweets we analyzed.
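The time-based TF-IDF weighting of Section III-A can be illustrated with a minimal sketch. It assumes each day's stemmed tweets and headlines are pooled into one token list, so that a "document" is a day and N is the number of days in the window; the function name `time_based_tfidf`, the natural logarithm, and the toy data are assumptions made for illustration, not the implementation used in this study.

```python
import math
from collections import Counter


def time_based_tfidf(daily_tokens):
    """Weight each term per day by tf(t, d) * idf(t), treating one day as one document.

    daily_tokens: dict mapping a day label to the list of stemmed tokens seen that day.
    """
    n_days = len(daily_tokens)

    # Document frequency: number of days on which a term appears at least once.
    day_frequency = Counter()
    for tokens in daily_tokens.values():
        day_frequency.update(set(tokens))

    weights = {}
    for day, tokens in daily_tokens.items():
        term_frequency = Counter(tokens)
        weights[day] = {
            term: count * math.log(n_days / day_frequency[term])
            for term, count in term_frequency.items()
        }
    return weights


# Toy three-day window (hypothetical tokens).
example = {
    "2017-03-01": ["market", "up", "bank", "up"],
    "2017-03-02": ["market", "down"],
    "2017-03-03": ["market", "bank"],
}
scores = time_based_tfidf(example)
```

In this toy window, a term that occurs on every day (here "market") receives weight zero, while a term concentrated on a single day is weighted up in proportion to its count on that day.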
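A lexicon-based score that also counts emojis, averaged into one value per day, can be sketched as follows. This is a simplified stand-in and not OpinionFinder or Stanford NLP: the tiny word lists, the emoji polarities, and the scoring rule (positive minus negative hits, divided by total hits) are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical mini-lexicon; a real lexicon would hold thousands of entries.
POSITIVE = {"gain", "strong", "up", "\U0001F600", "\U0001F680"}  # grinning face, rocket
NEGATIVE = {"loss", "weak", "down", "\U0001F621", "\U0001F4C9"}  # pouting face, falling chart


def text_score(tokens):
    """Return a score in [-1, 1]; 0.0 when no lexicon entry is matched."""
    positive_hits = sum(1 for token in tokens if token in POSITIVE)
    negative_hits = sum(1 for token in tokens if token in NEGATIVE)
    total = positive_hits + negative_hits
    return 0.0 if total == 0 else (positive_hits - negative_hits) / total


def daily_sentiment(records):
    """Average the per-text scores of each day into one value per day.

    records: iterable of (day, tokens) pairs.
    """
    scores_by_day = defaultdict(list)
    for day, tokens in records:
        scores_by_day[day].append(text_score(tokens))
    return {day: sum(s) / len(s) for day, s in sorted(scores_by_day.items())}


# Toy input (hypothetical tweets, already tokenized).
records = [
    ("2017-03-01", ["bank", "strong", "\U0001F680"]),
    ("2017-03-01", ["mine", "loss"]),
    ("2017-03-02", ["market", "down", "\U0001F4C9"]),
]
series = daily_sentiment(records)
```

Treating emojis as ordinary lexicon entries is the simplest way to let them contribute to the score; a weighted scheme would be an equally plausible design.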
Our results indicate that including emojis in the analysis enhanced sentiment measurement, particularly in Twitter data, as shown in Figure 2. Sentiment analysis was applied to both tweets and news article titles, aiming to derive one sentiment score per day for both data sources in order to generate sentiment time series.

tweets, and percentage change on the ETF from the previous day (up vs down).

D. Correlation and Lag Analysis

In order to analyze whether Twitter sentiment is correlated with, and possibly a predictor of, ETF stock prices, we investigated the synchronous correlation coefficients between the two time series at various lags. Consider the Twitter sentiment time series x = {x1, ..., xn} and the ETF stock price time series y = {y1, ..., yn}; the cross correlation γ at lag h is then defined as shown in equation 2:
(2)
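As an illustration of the lag analysis, the sketch below computes a lagged correlation between a sentiment series x and a price series y. It assumes the common sample estimator, namely the Pearson correlation between x[t] and y[t + h] over the overlapping part of the two series; this is an assumption and may differ in normalization from the definition used in this study. The function name and the toy data are hypothetical.

```python
import math


def cross_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 0."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must leave at least two overlapping points")

    xs = x[: len(x) - lag]  # sentiment on day t
    ys = y[lag:]            # price on day t + lag
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return covariance / math.sqrt(var_x * var_y)


# Toy data: the price follows the previous day's sentiment exactly.
sentiment = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6]
price = [10.0] + [10.0 + 2.0 * s for s in sentiment[:-1]]
by_lag = {h: cross_correlation(sentiment, price, h) for h in range(0, 3)}
```

In this constructed example the correlation peaks at lag 1, which is the pattern one would look for when asking whether sentiment leads price by one day.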
$$ \mathrm{Info}(N) = -\sum_{x=1}^{m} p_x \lg(p_x) \qquad (3) $$

Information needed after using feature F to split N into y partitions to classify N:

$$ \mathrm{Info}_F(N) = \sum_{y} \frac{N_y}{N} \times I(N_y) \qquad (4) $$
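Equations (3) and (4) can be evaluated on a small example as follows. The sketch assumes that lg denotes the base-2 logarithm, that p_x is the fraction of samples in class x, and that I(N_y) is the same entropy applied to partition N_y. The final difference between the two quantities (the usual information gain) is added here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of a list of class labels, as in equation (3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_after_split(partitions):
    """Size-weighted entropy of the partitions produced by a feature, as in equation (4)."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions if p)


# Toy labels: 1 for market up, 0 for market down.
labels = [1, 1, 0, 0]
perfect_split = [[1, 1], [0, 0]]    # feature separates the classes completely
useless_split = [[1, 0], [1, 0]]    # feature carries no information

gain_perfect = info(labels) - info_after_split(perfect_split)
gain_useless = info(labels) - info_after_split(useless_split)
```

A feature whose partitions are pure removes all uncertainty, while a feature whose partitions mirror the original class mix removes none.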
E. Time Series Cross Validation
We applied time series cross-validation, also known as roll-forward cross-validation, to the stock price time series and compared 1-step, 2-step, ..., 12-step forecasts using Mean Absolute Error (MAE).
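The roll-forward procedure can be sketched as below. The forecasting model is not specified here, so the sketch assumes a naive forecaster that repeats the last observed value; the function name, the `min_train` parameter, and the synthetic series are likewise assumptions for illustration.

```python
def rolling_origin_mae(series, max_horizon, min_train):
    """Roll the forecast origin forward and collect MAE per forecast horizon.

    At each origin the model sees only series[:origin]; the h-step forecast is
    compared with the observation h steps after the last training point.
    """
    errors = {h: [] for h in range(1, max_horizon + 1)}
    for origin in range(min_train, len(series)):
        train = series[:origin]
        forecast = train[-1]  # naive forecast: repeat the last observed value
        for h in range(1, max_horizon + 1):
            target = origin + h - 1
            if target < len(series):
                errors[h].append(abs(series[target] - forecast))
    return {h: sum(e) / len(e) for h, e in errors.items() if e}


# Synthetic series with a constant upward step of 1 per day.
prices = [float(i) for i in range(20)]
mae = rolling_origin_mae(prices, max_horizon=12, min_train=5)
```

On this linear toy series the naive forecast is off by exactly h at horizon h, so the error grows with the horizon, which is the kind of comparison across 1-step to 12-step forecasts described above.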
classificationTreeModel()
    while (stopCondition != True)
        createNewNode(node.left);
        createNewNode(node.right);
        if (checkStopCondition(node) == True)
            predictedLabel  // 1 for market up, 0 for market down
            print samplesUnderClass()
            stopCondition = True;
        else
            continue;

createNewNode()

checkStopCondition(node)
    if (isNodeHomogeneous() || noFeatureRemain() || noSampleRemain())
        return True
    else
        return False

Figure 6. ETF price vs. Twitter sentiment

F. Autocorrelation and Auto-regression of Time Series

In this phase of the project, the autocorrelation and auto-regression of the ETF price are assessed, as shown in Figure 8. At one time step, the correlation becomes negative and oscillates around zero. Using a one-day lag of the open price to run the regression and predict the direction of the market open price movement does not generate valuable information, as can be seen in Figure 9. This generally means that a one-day lag of the open signal, used as the only feature, has no predictive value.
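The two quantities discussed in this section can be sketched in a few lines. The code assumes the standard sample autocorrelation and an ordinary least squares fit of the price on its one-day lag (an AR(1) regression); the function names and the synthetic series are hypothetical and do not reproduce the ETF data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


def fit_ar1(series):
    """Least squares fit of series[t] = intercept + slope * series[t - 1]."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    spread = sum((a - mean_x) ** 2 for a in x)
    if spread == 0:
        raise ValueError("lagged series is constant; slope is undefined")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / spread
    return mean_y - slope * mean_x, slope


# Synthetic open prices generated by open[t] = 1.0 + 0.5 * open[t - 1].
open_prices = [0.0]
for _ in range(8):
    open_prices.append(1.0 + 0.5 * open_prices[-1])

intercept, slope = fit_ar1(open_prices)
next_open = intercept + slope * open_prices[-1]
direction = "up" if next_open > open_prices[-1] else "down"
lag_one = autocorrelation(open_prices, 1)
```

On real prices, the predicted direction from such a fit would be compared with the realized direction to judge whether the one-day lag carries any signal.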
bing/open@t+1 1 0.2041 0.6583 0.2478 0.6186
bing/open@t+1 2 2.1621 0.1615 6.2899 0.0431
bing/volume@t+1 1 0.4450 0.5156 0.5403 0.4623
bing/volume@t+1 2 0.6531 0.5395 1.8999 0.3868
market. Fusing the wisdom of the crowds learned from two alternative data sources – both Twitter and news articles – could provide the investor with an edge to position them ahead of the pack.

ACKNOWLEDGEMENTS

This work was partially supported by the Australian government department of education under the Australia Awards-Endeavour research and fellowship grant.

REFERENCES

[1] Cornell Bradford, and Aswath Damodaran. "Tesla: Anatomy of a Run-up." The Journal of Portfolio Management 41.1 (2014): 139-151.
[2] Malhotra, Claudia Kubowicz, and Arvind Malhotra. "How CEOs can leverage twitter." MIT Sloan Management Review 57.2 (2016): 73.
[3] Ge Qi, Alexander Kurov, and Marketa Halova Wolfe. "Stock Market Reactions to Presidential Social Media Usage: Evidence from Company-Specific Tweets." (2017).
[4] Siegel, Eric. Predictive analytics: The power to predict who will click, buy, lie, or die. Hoboken: Wiley, 2013.
[5] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "Knowledge Discovery and Data Mining: Towards a Unifying Framework." KDD. Vol. 96. 1996.
[6] Peter D Wysocki. "Cheap talk on the web: The determinants of postings on stock message boards." (1998).
[7] Robert Tumarkin, and Robert F. Whitelaw. "News or noise? Internet postings and stock prices." Financial Analysts Journal 57.3 (2001): 41-51.
[8] Nuno Oliveira, Paulo Cortez, and Nelson Areal. "On the predictability of stock market behavior using stocktwits sentiment and posting volume." Portuguese Conference on Artificial Intelligence. Springer, Berlin, Heidelberg, 2013.
[9] Johan Bollen, Huina Mao, and Xiaojun Zeng. "Twitter mood predicts the stock market." Journal of computational science 2.1 (2011): 1-8.
[10] Sohangir, Sahar, et al. "Big Data: Deep Learning for financial sentiment analysis." Journal of Big Data 5.1 (2018):StockMarketPredictionUsingTwitterSentimentAnalysis.pdf) 15 (2012).
[11] Huina Mao, Scott Counts, and Johan Bollen. "Predicting financial markets: Comparing survey, news, twitter and search engine data." arXiv preprint arXiv:1112.1051 (2011).
[12] Sharad Goel et al. "Predicting consumer behavior with Web search." Proceedings of the National academy of sciences 107.41 (2010): 17486-17490.
[13] Bari, Anasse, and Goktug Saatcioglu. "Emotion Artificial Intelligence Derived from Ensemble Learning." 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). IEEE, 2018.
[14] Bari, A., Chaouchi, M., & Jung, T. (2016). Predictive analytics for dummies. John Wiley & Sons.
[15] Bellaachia, A., & Bari, A. (2012, June). Flock by leader: a novel machine learning biologically inspired clustering algorithm. In International Conference in Swarm Intelligence (pp. 117-126). Springer, Berlin, Heidelberg.
[16] Bellaachia, A., & Bari, A. (2012, March). A flocking based data mining algorithm for detecting outliers in cancer gene expression microarray data. In Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on (pp. 305-311). IEEE.