
The Efficacy of Using Unstructured Data in Earnings Prediction and Alpha Generation


Aditya Dhingra

Research Intern, ALPHABETA INC

f20170642@pilani.bits-pilani.ac.in

Sathyanarayanan Palaniappan

COO, ALPHABETA INC

Sathya@alphabeta.io

ABSTRACT

In the constant race to generate ALPHA (excess return relative to a benchmark index), active asset managers, hedge funds and traders are constantly looking for new insights that will help them spot arbitrage opportunities. In this paper we examine the rise of unstructured data (combined with machine learning techniques) as a way to mine systematically for patterns and trends. We compare different kinds of unstructured data on criteria such as 1) quality of data, 2) ease of collection, 3) ability to generate alpha and 4) predictive risks. We examine the efficacy of models that use unstructured data to make forecasts. We discuss mainstream applications of analytic unstructured data, such as the use of Natural Language Processing in sentiment analysis and the use of satellite imagery in sales forecasting. We also look at the relationship between structured and unstructured data variables and the ability to predict economic variables (augmented analytics). We test our findings by building our own model using alternative data. Finally, we discuss issues concerning privacy, compliance and intellectual property in relation to unstructured data.

Keywords: unstructured data, alternative data, price forecasting, machine learning, augmented analytics

I. INTRODUCTION

The semi-strong and strong forms of the efficient market hypothesis suggest that asset prices incorporate all information about the assets that is available in the public domain. Over the last decade, information and data have become equally accessible to all market participants, and active investment management firms and hedge funds are struggling to identify ALPHA generation opportunities. ALPHA is a zero-sum game: a gain for one market participant corresponds to a loss made by another. The fees, and indeed the existence, of active managers depend on the ALPHA the firm is able to generate.

This has led to the growth of passive investment vehicles such as Exchange Traded Funds (ETFs), which often post returns as good as those of actively managed funds without the management fees. Over the last decade in the United States, fewer than 15% of actively managed large-cap funds have beaten the S&P 500 [1]. Similar trends have been seen in net money inflows to passive investment vehicles compared with active investment vehicles [2]. Consequently, active investment management firms face an existential threat.

Historically, active investment management firms have built models and spotted investment opportunities using data on company financials, price
data, macro-economic factors and industry indicators. These data sets are collectively referred to as structured data.

With the growth of technology and computational power, the cost and availability of structured data have become uniform across the investor community. The first-mover advantage of spotting an investment opportunity early using structured data alone has lost significance in the current landscape. This has led firms to look at data sources outside the financial realm to spot arbitrage opportunities. These include data from social media feeds, news sentiment, analyst ratings, earnings expectations, satellite imagery, remote sensing, etc. Together these data sets are referred to as unstructured data.

The first use of unstructured data can be traced back to 2008, when a quantitative hedge fund, the MarketPsy Long-Short Fund LP, began using sentiment analysis in its models [3]. Today such data is widely used to analyze market sentiment, forecast fundamentals, review policy and make socially responsible and sustainable investment decisions.

The rest of this paper is organized as follows: Section II discusses the types of unstructured data that are widely used in the industry. Section III discusses the relationship between structured and unstructured data, namely the lead/lag effect. Section IV discusses the efficacy and risks of developing predictive models using unstructured data. Section V provides several pertinent use cases. Section VI illustrates a design approach to developing such models and highlights challenges such as access, timeliness and quality of data. Section VII discusses legal, privacy and IP implications. Finally, in Section VIII we submit our conclusions and suggestions for further research.

II. UNSTRUCTURED DATA

In the literature, unstructured data is often classified into two distinct types: Analytic Data and Alternative Data.

● ANALYTIC data
Analytic data include insights, perspectives and opinions about a company’s performance from professional financial analysts, who often work for rating agencies, media outlets and consulting houses. Also called derivative data because it is derived from the underlying fundamental company or market performance data, it includes consensus revenue/earnings estimates from sell-side analysts tracking the performance of the company, credit ratings, etc. An example of an investment strategy utilizing analytic reports assumes that the wording and grammar of company filings and credit quality revisions are generally homogeneous until there are impending events of financial significance. Thus, a change of language generates a tick. The main advantage of this dataset is that it is based on some underlying fact that is interpreted within an industry or macro-economic context.

● ALTERNATIVE data
Alternative data sources include satellite/drone imagery, search engine trends, social media posts, data from IoT sensors, weather patterns, parking lot occupancy, etc. In many cases the data is produced by sources outside the ambit of the company; hence it is primary data. Alternative data often entails a higher acquisition cost owing to its lack of general availability and to challenges of contextualization, (in)completeness and systemic integration. An example of an investment strategy based on alternative data is the use of satellite imagery to predict oil prices and the performance of energy and utilities companies (Brittany Weiss and the Motiva Enterprises). Oil is stored in tanks with floating lids. The level of these lids casts crescent-shaped shadows on the land, and the length of the shadow is indicative of the supply of the natural resource. This is a strong, albeit nuanced, predictor of ensuing oil prices.

Below are examples of commonly used alternative data sources in the industry today.

○ Social/Sentiment
Twitter and Facebook have long been used with a high level of accuracy for event-driven strategies and sentiment analysis. This is achieved by running Natural Language Processing algorithms on tweets, comments and the level of activity. The level of a firm’s promotional activity and the public response to it have been leading drivers of expectations, as investors believe that the firm is creating new markets to drive future sales of products. Social media, trade magazines and product blogs have also been used to gauge consumer response to a firm’s products. Narges Tabari et al, 2018, also reported evidence of causality between social media tweets and stock returns [4]. They also reported a causal link between specific news or events regarding a company/stock and a higher volume of tweets about that company on the day.

○ Satellite Imagery
Earth imaging has long been used to estimate economic and financial fundamentals: shipments for international trade; parking lot traffic across major retailers (Zsolt Katona et al, March 2019) [5] for predicting retail sales; and weather patterns, availability of natural resources and crop yields for commodities and companies in the energy sector.

○ IoT/Bluetooth Beacons Sensor Activity
A major application of this is measuring footfall in retail stores and thus forecasting earnings/sales. One of the challenges, however, is the contextualization of this data to derive insight. For instance, an increase or decrease in footfall into a store may be more attributable to weather than to marketing strategy.

○ Search Engine Volumes
The volume of search engine queries has been found to be correlated with some economic and financial fundamentals. There is evidence of a strong leading relationship between internet searches for unemployment benefits provided by the Government and the unemployment rate.

III. LEAD/LAG EFFECT OF ALTERNATIVE DATA

While the usage of alternative data is slowly becoming mainstream for active investment management firms (a study by Alternativedata.org estimates a spend of $1.7 BN on alternative data by investment managers for the year 2020) [6], the cost of acquisition, sourcing challenges, data quality and infrastructure needs raise an imperative question: what is the lead (or lag) effect of the data in creating a first-mover advantage and ALPHA generation opportunity? The chart below shows the types of alternative data and the relative popularity of their usage amongst funds.

[Chart not reproduced. Source: Alternativedata.org]

Web data scraping, credit and debit card transaction details of customers, and social sentiment analysis are the most popular uses of alternative data amongst active investment management firms.

1. Web scraping: With the rise of e-commerce, online shopping behavior (searching and buying) is now tracked and analyzed for upticks/downticks in sales. Aggregated online data can reliably predict quarterly earnings for businesses that are mostly online, as demonstrated by the Eagle Alpha case study on GoPro in 2015. Eagle Alpha, an alternative data provider, accurately predicted the weakness of sales of GoPro, an action camera manufacturer, for Q3 2015 by web scraping data from electronics websites and 80 MN sources, whilst 68% of analysts maintained a “Buy” rating for GoPro [7]. A key challenge is the legality of web scraping. We will discuss this in more detail in Section VII.
2. Satellite Imagery: Satellite imagery of parking lot traffic at retail stores, used to predict the performance of retailers, provides a lead effect over a three-day trading window prior to the quarterly earnings announcement (Zsolt Katona et al, March 2019) [5].
3. Google Search Results: Higher Google search volumes for Russell 3000 stocks predict higher stock prices in the next 2 weeks and an eventual price reversal within the year (Zhi Da et al, 2011) [8].
4. Sentiment Analyzer with Twitter: With 87% accuracy, the Bollen study predicted the daily
up/down movement of the Dow Jones index based on Twitter sentiment, which captures the mood of the market (Johan Bollen et al, 2011) [9]. An extension of this was performed in [10], where bullish and bearish sentiments were extracted from tweets over a 3-day observation period and a decision tree classifier was able to predict up/down movements correctly nearly 80% of the time.

IV. EFFICACY AND RISKS OF UNSTRUCTURED DATA SETS

To be effective in predicting price movements or earnings, unstructured data is often combined with structured (fundamental performance) data through regression models. A typical model attempts to forecast variation in a structured variable using other structured variables, controlling for various effects. The regression is then performed again with the addition of an index created to quantify a qualitative attribute. The success of the new addition is reflected in 1) an increase in the explained variation or 2) a high t-statistic for the unstructured variable index. However, there is usually a high risk of multicollinearity among the independent variables, as qualitative factors, especially sentiment, may arise from changes in the structured variables themselves.

The real benefit of unstructured variables lies in their real-time currency, which also makes them relatively less popular for developing forecasts with a long-term investment view. But these models are increasingly powerful for applications such as short-term trend forecasting or ‘nowcasting’, where conventional data is released into the public domain with lags. These lags typically range from 3 to 30 days. Thus, for active investment management, this provides a window of up to 3-4 weeks before the rest of the market reacts to traditionally published data.

Asset managers and hedge funds invest significant resources to procure alternative datasets. On average these datasets cost upwards of $100K, and they come with significant risks. These risks concern i) insider trading laws, ii) privacy and iii) copyright laws. Let us look at some of these now.

○ Material Non-Public Information Risk
This risk has led to impending SEC/regulatory crackdowns. There are two dimensions to the prohibition of such information: Material (the information must have potential for alpha generation) and Non-Public (which is not the same as exclusive). Public data refers to information that is non-confidential. In some cases, confidential information may be obtained through legal means, even ones as simple as direct observation.

For example, in studying the availability of oil at a site owned by a mining company, it would be considered encroachment if the investment firm planted sensors to study the level of the oil rigs. However, if an employee drives around the site to observe the machines in action, he may obtain crucial information that acts as a substitute, without legal consequences. The information is considered public because it may be observed by anyone.

In certain cases, investment management firms find ways to estimate confidential variables such as sales without requiring the consent of the firm in question. In the case of non-public information, firms must ask themselves and their data vendors whether the data is a consequence of a violation of a fiduciary duty. For example, investment bankers are under legally binding constraints not to disclose information about their clients, even to trading departments in the same investment bank.

However, it is observed that in a significant number of cases, firms have willingly granted other companies they do business with the right to use and sell their data.

○ Model Risk
The data may not lead to the discovery of a stable relationship between any variables, and so may not give rise to a model that generates any trading signals. Models often suffer from over-fitting errors or biases and do not perform well beyond specific time periods. There is also the validity risk of a model generating wrong trading signals owing to noise in the measurement of variables and the influence of other variables. The quality and completeness of the data also pose challenges. Unlike audited and certified fundamental data, much alternative data is raw, secured from unknown sources, and there is no
guarantee it is not synthetic.
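
The two-regression check described at the start of this section, together with the out-of-sample concern raised under Model Risk, can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are synthetic, the names `structured`, `index` and `holdout_mae` are hypothetical, and plain Python is used so that the sketch is self-contained (in practice a statistics library would normally be used).

```python
import random
from math import sqrt

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def ols(rows, y):
    """OLS with an intercept; returns (coefficients, t-statistics, R squared)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, x)) for x, yi in zip(X, y)]
    rss = sum(e * e for e in resid)
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = rss / (len(y) - k)                      # unbiased residual variance
    t_stats = [b / sqrt(sigma2 * inv[i][i]) for i, b in enumerate(beta)]
    return beta, t_stats, 1.0 - rss / tss

def holdout_mae(rows, y, n_train):
    """Fit on the first n_train observations; return mean absolute error on the rest."""
    beta, _, _ = ols(rows[:n_train], y[:n_train])
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows[n_train:]]
    return sum(abs(p - yi) for p, yi in zip(preds, y[n_train:])) / len(preds)

# Synthetic data (invented for illustration): two structured regressors and a
# sentiment-style index that is partly driven by the first structured regressor.
rng = random.Random(0)
n = 300
structured = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
index = [0.5 * s[0] + rng.gauss(0, 1) for s in structured]
y = [1.0 + 0.8 * s[0] - 0.4 * s[1] + 0.8 * q + rng.gauss(0, 0.5)
     for s, q in zip(structured, index)]
augmented = [s + [q] for s, q in zip(structured, index)]

_, _, r2_base = ols(structured, y)                   # regression 1: structured only
_, t_aug, r2_aug = ols(augmented, y)                 # regression 2: index added
mae_base = holdout_mae(structured, y, n_train=200)
mae_aug = holdout_mae(augmented, y, n_train=200)

print(f"R^2: {r2_base:.3f} -> {r2_aug:.3f}; t-statistic of index: {t_aug[-1]:.1f}")
print(f"hold-out MAE: {mae_base:.3f} -> {mae_aug:.3f}")
```

If the index added nothing beyond the structured variables, as the multicollinearity caveat warns can happen, R squared would barely move and the t-statistic would be small. A gain that shows up in-sample but not in the hold-out error would point to over-fitting.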

○ Privacy Risk
The data may endanger the right to privacy of individuals if managed recklessly by unregulated data vendors. This is the case with credit rating or transactions data. Under no circumstances should the subjects of the data be personally identifiable. However, this is not the major concern, as investment management firms are primarily interested in aggregates and may discard an individual’s behavior as an inconsequential data point among millions.

Despite the high processing costs and attendant risks, asset managers and hedge funds report greater scope for alpha from leveraging alternative data in their analysis and modeling.

V. PREDICTING FUNDAMENTAL AND ECONOMIC VARIABLES

The following use cases illustrate the efficacy of unstructured data in enhancing the forecast accuracy of fundamental and economic variables.

○ Predicting Retail Sales using traffic imagery
The year-on-year traffic growth rate in the parking lots of major retail stores is used to predict whether actual sales will fall short of or surpass analysts’ earnings and sales expectations. Satellite imagery is processed with image processing algorithms, and signals are generated from the traffic density patterns by comparison with the corresponding patterns in the previous year. The signal is controlled for the presence of any discounts or special offers that might externally influence store traffic.

This provides a more current leading indicator of sales volume. A comparison to the actual stock performance yields a curve (figure not reproduced here) on which Bollinger bands, similar to the ones generated by technical analysts, are drawn [5].

○ International Trade data using satellite imagery of ships
The Chinese trade balance was forecast using 25,000 time series as TNI (Trade Nowcasting Indicators). The predicted values were plotted against actual levels. The index proved to be a highly correlated leading indicator for forecasting international trade activity, as it could also accurately predict the level of imports and exports. The TNI forecast was often a better predictor of China’s export and import trade balances [11].

○ Using search volumes of unemployment benefits to predict unemployment rates
Google Trends is a publicly available online service which tells us how often a particular search query was entered relative to the entire search volume. The steps in generating an unemployment index were as follows (Citi Research) [11]:

• Generate relevant search terms
• Source the search volume for each term dating back to 2004
• Curate the data and adjust for outliers and seasonality
• Rank the search terms by their predictive scores
• Build the final index from a selected basket of terms; it measures the co-movement of search activity with a particular economic indicator

The unemployment indicator (based on a 12-month moving average) was plotted against the conventional measure from the US Bureau of Labor Statistics. The correlation was 0.88. The indicator was considered a leading indicator because it showed a higher correlation with subsequent data (0.51) than with recent historical data (0.46).

VI. MODELING USING ALTERNATIVE DATA

In this section we show a simple approach to developing a model using an unstructured data set, similar to the use case above involving Google Trends, for the Indian market. We use insurance as the industry of choice because of its high online sales
conversion and the fact that consumers tend to compare policies online before consulting brokers. Also, insurance tends to be a moderately seasonal product: travel insurance sales rise in months with summer and winter breaks, and auto and property insurance sales rise during festival time, which features high discounts on big purchases. Hence our hypothesis is that Google Search Trends could be a reliable predictor of insurance premium sales.

A. Data
Data on insurance premiums earned by the top public and private insurance vendors is published monthly by the Insurance Regulatory and Development Authority of India. The data is publicly available for both Life and General insurance firms dating back to 2004. We will be primarily interested in sales data for the companies.

The search volume data is taken from Google Trends. To obtain insights from Google Trends data, we prepare a list of terms that reflect consumer attention towards the need for insurance products. The list contains terms such as ‘insurance’, ‘premiums’, etc. Google Trends provides a monthly measure of search interest in the assigned search term over the past year. The search results may be restricted to a given geographic region and to specific time periods to look at seasonality effects.

B. Model

The following variables are the structured independent variables. ‘t’ denotes the current month for which we forecast the premiums and ‘i’ denotes the ith firm for which we forecast.

i) PREM_{t-1,i}
This is used to maintain continuity in the magnitude of insurance premiums and to capture any trends emerging from sales momentum.
ii) PREM_{t-12,i}
This term is used to capture any seasonal variation (owing to festivals, vacations, discounts). Because of the presence of this term, we do not seasonally adjust the dependent variable, which is the cumulative premium in the current month.

For the unstructured variable, we will use a LASSO regression to eliminate the search terms which least explain the variation in premiums in the absence of any structured variable terms. The terms which remain in the model are selected and the rest are eliminated.

We will run two regressions, the first including only the structured variables and the second adding the selected search volume variables. The metrics for success of the model are the following:

1) A significant reduction in Mean Absolute Error (accompanied by an increase in R squared)
2) High t statistics for the search volume variables.

The market share is taken as a proxy for the share of search conversions that contributes to the ith firm’s premiums.

$$\mathrm{PREM}_{t,i} = a\,\mathrm{PREM}_{t-1,i} + b\,\mathrm{PREM}_{t-12,i} + c\,(\text{SEARCH MAGNITUDE}) + d$$

We are in the process of acquiring the necessary data for back testing and validating the algorithm and our hypothesis. We will publish our findings and learnings in a subsequent paper.

VII. PRIVACY AND INTELLECTUAL PROPERTY

It is ironic that there has been a rise in the licensing and regulation of datasets despite the effort towards open data sourcing. Data which is by itself factual is not subject to any copyright or legal protection. However, if there is a significant human or financial investment in the modification, explanation, visualization or refinement of any data point, a license may be required to reflect the effort of the modifier. This was exemplified in the legal dispute between LinkedIn and hiQ over hiQ’s web scraping of user data from LinkedIn’s pages to develop and sell analytics. The ruling in this particular case favored hiQ, allowing bots to access any information that is not secured behind a password-protected wall. However, the ruling has subsequently been challenged, and several other companies have filed similar lawsuits against the use of technology to get access to data from public sites.

Intellectual property rights are another nebulous topic when it comes to alternative data. Does someone “own” the tweets they post? It turns out that Twitter users implicitly have conceded all rights to their posts to Twitter and anyone else who
uses Twitter! Facebook operates on similar principles: once someone posts on Facebook they concede sharing rights to Facebook and its users. However, secondary data can be protected on the basis of originality even if the underlying tweets or posts are not.

The regulation and compliance framework is still largely undefined for alternative data such as satellite imagery or IoT device data. The categorization of big data as public information is not completely justified, owing to the high procurement cost of these datasets, which makes them affordable only to large hedge funds with high assets under management, notwithstanding the risks that the funds expose themselves to in acquiring these data sets. However, we anticipate that as open data sources become more specialized and data vendors compete to sell credible data, putting pressure on data prices, the use of alternative data will become much more mainstream among average investors and money managers. This will lead to a subsequent increase in market efficiency and in awareness of as yet unclassified risks.

VIII. CONCLUSIONS

In this paper we examined the rise of unstructured data within asset management firms as a means to generate ALPHA. We looked at the types of data that these firms have integrated into their analytical models to predict the performance of markets, industries or specific companies. We identified some of the key risks that the use of unstructured data entails, namely cost, quality, contextualization and systemic integration with existing processes. We documented several use cases that generated successful trading signals with acceptable lead times. We also summarized the legal and privacy challenges that still need to be addressed before unstructured data becomes mainstream for average investors, improving market efficiency.

Acknowledgment
We thank our mentor Mr Siva Visveswaran (CTO & Co-Founder of topXight Labs LLC) for directing and focusing this endeavor.

REFERENCES

[1] Bob Pisani, “Active fund managers trail the S&P 500 for the ninth year in a row in triumph for indexing”, Available: https://www.cnbc.com/2019/03/15/active-fund-managers-trail-the-sp-500-for-the-ninth-year-in-a-row-in-triumph-for-indexing.html [Accessed August 1st, 2019]

[2] Dawn Lim, “Passive Investing Resumes its March”, Available: https://www.wsj.com/articles/the-rise-of-passive-investing-marches-on-11563442202 [Accessed August 1st, 2019]

[3] Deloitte, “Alternative Data for Investment Decisions”, Available: https://www2.deloitte.com/content/dam/Deloitte/us/Documents/financial-services/us-fsi-dcfs-alternative-data-for-investment-decisions.pdf [Accessed August 15th, 2019]

[4] Narges Tabari et al, 2018, “Causality Analysis of Twitter Sentiments and Stock Market Returns”, First Workshop on Economics and Natural Language Processing, pages 11-19, Association for Computational Linguistics. Available: https://www.aclweb.org/anthology/W18-3102 [Accessed August 1st, 2019]

[5] Zsolt Katona et al, 2019, “On the Capital Market Consequences of Alternative Data: Evidence from Outer Space”, 9th Miami Behavioral Finance Conference 2018, Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3222741 [Accessed June 22nd, 2019]

[6] Alternativedata.org, “Alternative Data Usage Growth”, Available: https://alternativedata.org/alternative-data/ [Accessed August 1st, 2019]

[7] Eagle Alpha, “The Predictive power of web scraped product data for Institutional investors: A GoPro case study”, Available: https://blog.scrapinghub.com/gopro-study [Accessed June 22nd, 2019]

[8] Zhi Da et al, 2011, “In Search of Attention”, Journal of Finance, Vol. LXVI, No. 5, October 2011, pages 1461-1499, Available: https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1540-6261.2011.01679.x [Accessed June 22nd, 2019]

[9] Johan Bollen et al, 2011, “Twitter mood predicts the stock market”, Available: https://arxiv.org/pdf/1010.3003.pdf [Accessed June 22nd, 2019]

[10] Tien Thanh Vu et al, 2012, “An experiment in integrating sentiment features for Tech stock prediction in Twitter”, Available: https://pdfs.semanticscholar.org/fd9e/91daecb0b19cbc7e10216d4f26cba7492458.pdf

[11] Citi Research, “Searching for ALPHA: Big Data”, Available: https://s3-eu-west-1.amazonaws.com/ea-documents/papers/EA+section+of+2017+Citi+White+pdf.pdf [Accessed June 2019]

Other References:

Carroll, Michael W. 2015, “Sharing Research Data and Intellectual Property Law: A Primer”. Available: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002235. An introduction to the various kinds of property rights that can be associated with research data.

Open Licenses. Project Open Data. Available: https://project-open-data.cio.gov/open-licenses/. The US Federal Government guide to open licenses and dedications.

Cohen, Dan. 2013. http://www.dancohen.org/2013/11/26/cc0-by/. A call for using CC0 with data, tempered by an ethical obligation to attribute.

Kratz, John. 2013: Data Citation Developments. http://datapub.cdlib.org/2013/10/11/data-citation-developments/. An update on efforts to standardize data attribution requirements.

Korn et al, 2011: Licensing Open Data: A Practical Guide. http://discovery.ac.uk/files/pdf/Licensing_Open_Data_A_Practical_Guide.pdf. Another guide written with UK law in mind, but with a helpful comparison of CC and ODC licensing options.
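
The screening step proposed in Section VI.B (a LASSO regression of premiums on the search terms alone) is sketched below as a hedged illustration. Everything in it is an assumption made for the example: the series are synthetic, the terms ‘weather’ and ‘cricket’ are hypothetical noise terms added alongside the ‘insurance’ and ‘premiums’ terms mentioned in Section VI.A, the penalty `lam=0.2` is arbitrary, and the coordinate-descent routine is a plain-Python stand-in for a library implementation. It has not been run on IRDAI or Google Trends data.

```python
import random

def soft_threshold(value, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(columns, y, lam, sweeps=100):
    """Coordinate-descent LASSO on centred data; columns is a list of feature columns."""
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residual when every coefficient is zero
    cols = []
    for col in columns:                      # centre each feature so no intercept is needed
        m = sum(col) / n
        cols.append([v - m for v in col])
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            z = sum(v * v for v in col) / n
            # covariance of feature j with the residual that excludes its own contribution
            rho = sum(v * (r + beta[j] * v) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            resid = [r - (new - beta[j]) * v for v, r in zip(col, resid)]
            beta[j] = new
    return beta

# Synthetic monthly series (invented): only the first two terms drive premiums.
rng = random.Random(1)
months = 180
terms = ["insurance", "premiums", "weather", "cricket"]
search = {t: [rng.gauss(0, 1) for _ in range(months)] for t in terms}
premium = [50.0 + 1.0 * a + 0.7 * b + rng.gauss(0, 0.5)
           for a, b in zip(search["insurance"], search["premiums"])]

beta = lasso([search[t] for t in terms], premium, lam=0.2)
selected = [t for t, b in zip(terms, beta) if b != 0.0]   # terms that survive the penalty
print(dict(zip(terms, [round(b, 2) for b in beta])), selected)
```

In practice the penalty would be chosen by cross-validation rather than fixed by hand, and the surviving terms would then feed the search magnitude term of the second regression.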
