In Certain Applications Involving Big Data, Data Sets Get So Large and Complex That It Becomes Difficult To

ABSTRACT
In certain applications involving big data, data sets get so large and complex that they become difficult to
analyze using traditional data processing applications.
To overcome these challenges, we can use Data Mining to extract useful information from big data into an
understandable structure. We can also use algorithms that learn from this data and automatically predict
further trends. This branch of Artificial Intelligence is called Machine Learning, and Artificial Neural
Networks are the approach we use to implement it.
The stock market is a platform where an enormous amount of data exists and constantly needs to be scrutinized
for business opportunities. Therefore, we apply these methods to simulate a brokerage system and analyze the
stock market, while at the same time learning the fundamentals of investment without risking real money.
Chapter-1
1. Introduction
Stock Market Analysis and Prediction is a project on technical analysis, visualization, and prediction using
data provided by Google Finance. We looked at data from the stock market, particularly some giant technology
stocks and others. We used pandas to get stock information, visualized different aspects of it, and finally
looked at a few ways of analyzing the risk of a stock based on its previous performance history. We also
predicted future stock prices for the National Stock Exchange (NSE) oil market.
1.1 Objective
The purpose of this project is to comparatively analyze the effectiveness of prediction algorithms on stock
market data and to gain general insight into this data through visualization, in order to predict future stock
behavior and the value at risk for each stock. The project encompasses the concepts of Data Mining and
Statistics, and makes heavy use of NumPy, Pandas, and data visualization libraries.
We also examine a number of different forecasting techniques that predict future stock returns based on past
returns and numerical news indicators, with the aim of constructing a portfolio of multiple stocks to diversify
risk. We do this by applying supervised learning methods to stock price forecasting, interpreting the
seemingly chaotic market data.
Forecasting is the process of predicting future values based on historical data and analyzing the trend of
current data. Computers are now powerful enough to process large amounts of data. By running simulations of
future states based on present states, we can foresee the trend of the stock market.
1.2 Scope of the Project
In this project, a selection of stock data from the Standard & Poor’s 500 (S&P 500) is used for trend
prediction. The application can retrieve the current market scenario at any given point in time and allows
a user to trade virtual money using real-time data. By analyzing historical data as well as the user's
portfolio, it guides the user when buying stocks by predicting future trends in the stock market on a
day-to-day basis. Currently, it can be successfully hosted on a web server and serve as a virtual stock market
trading platform.
1.3 Details of Hardware & Software used
Language: Python, using machine learning
Software: Anaconda (Jupyter Notebook)
Operating system: Windows 10
Hardware: laptop
Power supply
1.4 Feasibility Study
This application can retrieve the current market scenario at any given point in time and allows a user
to trade virtual money using real-time data. By analyzing historical data as well as the user's portfolio, it
guides the user when buying stocks by predicting future trends in the stock market on a day-to-day basis.
1.5 Methodology Adopted/Literature Survey
Takashi Kimoto and Kazuo Asakawa (Computer-based Systems Laboratory, Fujitsu Laboratories Ltd., Kawasaki) and
Morio Yoda and Masakazu Takeoka (Investment Technology & Research Division, The Nikko Securities Co., Ltd.,
Japan) proposed a buying and selling timing prediction system for stocks on the Tokyo Stock Exchange, together
with an analysis of its internal representation. The system is based on modular neural networks. They developed
a number of learning algorithms and prediction methods for the TOPIX (Tokyo Stock Exchange Prices Indexes)
prediction system. The prediction system achieved accurate predictions, and the simulation of stock trading
showed an excellent profit. The prediction system was developed by Fujitsu and Nikko Securities.
Ramon Lawrence of the Department of Computer Science, University of Manitoba, wrote a survey on the
application of neural networks to forecasting stock market prices. With their ability to discover patterns in
nonlinear and chaotic systems, neural networks offer the potential to predict market directions more accurately
than current techniques. Common market analysis techniques such as technical analysis, fundamental analysis,
and regression are discussed and compared with neural network performance. The Efficient Market Hypothesis
(EMH) is also presented and contrasted with chaos theory and neural networks. The paper argues against the EMH
on the basis of previous neural network work. Finally, future directions for applying neural networks to the
financial markets are discussed.
Xue Zhang, Hauke Fuehres, and Peter A. Gloor, from the National University of Defense Technology, Changsha,
Hunan, China and the MIT Center for Collective Intelligence, Cambridge, MA, USA, describe early work on
predicting stock market indicators such as the Dow Jones, NASDAQ, and S&P 500 by analyzing Twitter posts. They
collected Twitter feeds for six months and obtained a randomized subsample of about one hundredth of the full
volume of all tweets. They measured collective hope and fear on each day and analyzed the correlation between
these indices and the stock market indicators. They found that the percentage of emotional tweets was
significantly negatively correlated with the Dow Jones, NASDAQ, and S&P 500, but showed a significant positive
correlation with the VIX. It therefore seems that checking Twitter for emotional outbursts of any kind gives a
predictor of how the stock market will do the next day.
Chapter-2
2. Related work
2.1 Machine learning algorithm in Quantopian
Quantopian is a public, open website where individuals and professionals can share their programs and exchange
ideas about machine learning in the financial sector. The website offers many valuable resources, including
various machine learning algorithms and models for predicting stock prices. Most of these algorithms and
models are very complex and beyond our current level of understanding. However, the use of machine learning
concepts in finance gave us insight and introduced us to many useful Python modules that are very helpful to
our project.
2.2 Financial Programming using Python
Our group is completely new to Python and data science. pythonprogramming.net is an excellent resource
that helped us understand Python syntax and taught us to use different Python modules to manipulate data
and make predictions. The website also hosts some similar projects that helped us understand the concepts
much more quickly.
Chapter-3
3. Problem definition
3.1 Scope of Data
Quandl is a source of financial, economic, and alternative datasets serving investment professionals.
Quandl’s platform is used by over 250,000 people, including analysts from the world’s top hedge funds, asset
managers, and investment banks.
3.2 Definition of Prediction
Our program aims to identify the trend of the price of the target stock. Prediction here refers to the general
trend of the specific stock price.
3.3 Evaluation of the Accuracy of the Prediction
The accuracy of the system is measured as the percentage of predictions that the system determined
correctly. For instance, if the system forecasts an upward trend and the index does go up, the
prediction is counted as correct. If the index goes down or remains stable after an uptrend forecast,
the prediction is counted as wrong.
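The rule above can be sketched in a few lines of Python. The function name directional_accuracy, the "up"/"down"
labels, and the sample prices are illustrative assumptions, not part of the project code. The text only
specifies the uptrend case, so the symmetric treatment of a downtrend forecast is also an assumption.

```python
def directional_accuracy(predicted_trends, start_prices, end_prices):
    """Return the percentage of trend predictions that turned out correct.

    predicted_trends: list of "up" or "down" labels (hypothetical encoding).
    start_prices / end_prices: actual prices at the start and end of each
    prediction window.
    """
    if not predicted_trends:
        raise ValueError("at least one prediction is required")
    correct = 0
    for trend, start, end in zip(predicted_trends, start_prices, end_prices):
        if trend == "up" and end > start:
            correct += 1
        elif trend == "down" and end < start:
            # Assumption: a downtrend forecast is treated symmetrically.
            correct += 1
        # A flat price counts as wrong for either forecast.
    return 100.0 * correct / len(predicted_trends)


# Made-up example: four forecasts, two of them correct.
accuracy = directional_accuracy(
    ["up", "up", "down", "up"],
    [100.0, 100.0, 100.0, 100.0],
    [105.0, 100.0, 97.0, 98.0],
)
print(accuracy)
```

The second forecast is counted as wrong because the price stayed flat after an uptrend forecast.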
3.4 Recommendation for the user
By combining the accuracy and the prediction, a recommendation can be given to the user that informs them of
the trend of the target stock with a known accuracy.
Chapter-4
4. Challenges
4.1 Variable representation
The variables in machine learning equations cannot easily be applied to the stock market, because we cannot
precisely represent their utilities and values. For example, concepts in stock trading such as buy, sell, and
hold are extremely difficult to implement, even though the actions and states of the successor function can be
defined:
Actions: buy, sell, or hold
States: current stock price
However, it is not feasible to define the reward function. For example, you cannot define the reward of the
action “hold”. If the stock price drops after you buy it, you cannot define it as a loss because you are still
holding the stock. The stock price may still rebound, and the future value of the stock is unknown.
Defining the reward as a loss would affect data consistency, because the cumulative loss would be larger
than the actual loss once you sell the stock. Therefore, the reward of “hold” cannot be defined as a loss.
However, it is also not suitable to define it as +0. Because the stock value is actually decreasing,
attributing all of the loss to the corresponding “sell” action would greatly affect data integrity, and it is
hard for the system to trace back the decline or the actual intermediate state values of the data.
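The double-counting problem can be illustrated with a small sketch. The reward scheme here is hypothetical:
each "hold" step is assumed to be rewarded with the paper loss at that moment (current price minus buy price).
The prices and the function name cumulative_hold_loss are made up for illustration.

```python
def cumulative_hold_loss(buy_price, prices_while_holding):
    """Sum the unrealized loss recorded at every 'hold' step.

    Hypothetical reward scheme: each hold step is rewarded with
    (current price - buy price), i.e. the paper loss at that moment.
    """
    return sum(price - buy_price for price in prices_while_holding)


buy_price = 100.0
prices_while_holding = [95.0, 90.0]   # made-up prices on two hold days
sell_price = 90.0

hold_rewards = cumulative_hold_loss(buy_price, prices_while_holding)
realized = sell_price - buy_price      # actual loss when the stock is sold

print(hold_rewards)             # sum of per-step paper losses
print(realized)                 # realized loss on the sale
print(hold_rewards + realized)  # total reward if both are counted
```

Under this scheme the accumulated reward is larger in magnitude than the loss actually realized on the sale,
which is the inconsistency described above.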
4.2 Quality of data
Our predictive model is evaluated on historical financial stock data from the NSE market (oil) over the
training period of 30 September 2009 to 07 July 2018. The news data is collected from the financial
web sites http://www.quandl.com, http://reuters.com and www.moneycontrol.com. The news data is
collected once a day. The stock quotes corresponding to each trading day were downloaded from
http://finance.yahoo.com.
Chapter-5
5. Infrastructure
5.1 Python
This program is written in Python, one of the most widely used languages in Machine Learning.
Python is a high-level, interpreted, interactive, and object-oriented scripting language. Python is designed
to be highly readable. It frequently uses English keywords where other languages use punctuation, and it has
fewer syntactical constructions than other languages.
5.2 Python’s modules
In this project, various Python modules are used to facilitate predictions, regression analysis, graph
plotting, data manipulation, and machine learning. These include sklearn, pandas, pandas-datareader, and
matplotlib.
A module allows you to logically organize your Python code. Grouping related code into a module makes the
code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can
bind and reference.
The Python code for a module named aname normally resides in a file named aname.py. Here is an example
of a simple module, support.py:
def print_func(par):
    print("Hello : ", par)
    return
5.3 Packages
A package is a hierarchical file directory structure that defines a single Python application environment
consisting of modules, subpackages, sub-subpackages, and so on.
Consider a file Pots.py in the Phone directory. This file contains the following source code:
def Pots():
    print("I'm Pots Phone")
Chapter-6
6. Solution
In general, this project uses linear regression analysis to predict the trend of the target stock by
obtaining the slope of the linear regression line. We also provide the predicted price of the stock at the
corresponding point in time.
After obtaining the trend, we evaluate the accuracy of the prediction using Q-learning. The Q-value
obtained reflects the accuracy of the model, with a heavier weight on the present state.
Lastly, we combine the prediction and the Q-value using a simple weighted sum to give a recommendation to
the user.
6.1 Default Setting and Assumption
In the program, we predict the trend of the target company using the price data of the previous month.
The actual number of trading days in the previous month will vary because the stock exchange is closed on
certain days. We assume there are 30 days in a month for simplicity.
By default, the program predicts the stock price on the 7th day after the date entered by the user. It is
assumed that this 7th day is a working day of the stock exchange, so that the actual stock price on that
day is available.
6.2 Regression Analysis and Visualization
With the aid of the sklearn module and matplotlib, we can perform various regression analyses on the data and
visualize them on a graph. In our case, we visualize three regression models: the Radial Basis Function
model, the linear model, and the polynomial model. They are plotted on a graph to give the user insight into
the analysis. See fig. 1 for an example (figure in section 8.1).
6.3 Linear Regression and Prediction
For simplicity, we use a linear regression model for our prediction, because it is easy to obtain the slope of
the regression line, which indicates the trend of the stock price. We use about 30 days of data to predict the
trend of the upcoming week and output the predicted stock price on the 7th day after the date entered by the
user. The prediction follows the assumptions mentioned above.
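A minimal sketch of this step is shown here. The project itself uses sklearn; this sketch instead uses the
closed-form least-squares formulas in plain Python so that it has no dependencies. The function names, the
day-index encoding of dates, the choice of the last data point as the user's input date, and the sample prices
are all assumptions for illustration.

```python
def fit_line(ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    x is the day index 0, 1, ..., len(ys) - 1.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two prices to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_day_ahead(ys, days_ahead=7):
    """Return (trend, predicted price) days_ahead after the last price."""
    slope, intercept = fit_line(ys)
    target_x = len(ys) - 1 + days_ahead
    trend = "up" if slope > 0 else "down" if slope < 0 else "flat"
    return trend, slope * target_x + intercept


# Made-up 30-day price history that rises by 0.5 per day from 100.
prices = [100.0 + 0.5 * day for day in range(30)]
trend, predicted = predict_day_ahead(prices)
print(trend, round(predicted, 2))
```

The sign of the slope gives the trend, and evaluating the fitted line seven days past the last observation
gives the predicted price.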
6.4 Q-learning
Q-learning is not applied directly to the regression function. Instead, it analyzes the quality of the
function by considering whether the prediction result is reliable. Reliability depends on the coefficient
(slope of the linear regression line) of the regression.
The procedure and equations for the Q-learning step are as follows.
1. Obtain the predicted slope of the linear regression and the actual trend of the stock price on a large scale.
a) The actual trend is defined as increasing if the actual price of the stock on the 7th day since the
beginning is greater than the actual price of the stock at the beginning.
2. Define the action, states, and transition function, and assign the future reward, discount, instantaneous
reward, and alpha (learning rate) values.
a) There is only one action, which is to use the linear regression model.
b) There are two states: prediction is correct (s0) and prediction is wrong (s1).
s0: the actual trend is increasing and the coefficient (slope of the linear regression line) is positive, or
the actual trend is decreasing and the coefficient is negative.
s1: any case other than s0.
c) Future reward (V(s′)) is 1 if reaching s0 and -1 if reaching s1.
d) Instantaneous reward/transition reward (R(s, a, s′)) is always 0.
e) Alpha (α) is 0.01.
f) Discount (γ) is 1.
g) Transition function: unknown.
3. Sample-based Q-value iteration
a) From the lecture notes, the running average is:
\[ Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, [\text{sample}] \]
where
\[ \text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a'), \qquad V(s') = \max_{a'} Q(s', a') \]
Thus,
\[ Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( R(s, a, s') + \gamma\, V(s') \right) \]
b) Substituting the values defined above (α = 0.01, γ = 1, R(s, a, s′) = 0) gives the running average
for our project:
\[ Q(s, a) \leftarrow 0.99\, Q(s, a) + 0.01\, V(s') \]
\[ V(s') = +1 \text{ if reaching } s_0, \quad -1 \text{ if reaching } s_1 \]
4. Obtain the Q-value using the resulting running average.
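The running average can be sketched as follows. Because there is only one action, the sketch tracks a single
scalar Q-value. The function names, the starting value of 0, and the sample outcome sequence are assumptions
for illustration. The learning rate, discount, and rewards are the ones stated in the procedure.

```python
ALPHA = 0.01   # learning rate from the text
GAMMA = 1.0    # discount from the text


def update_q(q, prediction_correct, alpha=ALPHA, gamma=GAMMA):
    """Apply one running-average update for the single action.

    The instantaneous reward is always 0, and V(s') is +1 when the
    prediction was correct (s0) and -1 when it was wrong (s1).
    """
    reward = 0.0
    future_value = 1.0 if prediction_correct else -1.0
    sample = reward + gamma * future_value
    return (1 - alpha) * q + alpha * sample


def q_value(outcomes, initial_q=0.0):
    """Fold a sequence of True/False outcomes into a single Q-value."""
    q = initial_q   # assumption: Q starts at 0
    for correct in outcomes:
        q = update_q(q, correct)
    return q


# Made-up episode history: three correct predictions, then one wrong.
q = q_value([True, True, True, False])
print(round(q, 6))
```

Each update moves Q by 1% of the distance toward +1 or -1, so recent outcomes carry slightly more weight than
older ones and Q always stays between -1 and 1.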
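6.5 Combining Q-value and prediction to give recommendation
Define the accuracy (a) as the number of correct predictions divided by the total number of episodes. The
recommendation (r) is calculated as:
\[ r = a \cdot \text{prediction} - (1 - a) \cdot Q(s, a) \]
Here the a outside Q is the accuracy, while the a inside Q(s, a) denotes the action.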
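As a worked example of the weighted sum, the sketch below evaluates the formula for one set of made-up inputs.
The text does not say how the prediction is encoded as a number, so encoding an upward forecast as +1 is an
assumption, as are the function name and the sample values.

```python
def recommendation(accuracy, prediction, q):
    """Weighted sum from the text: r = a * prediction - (1 - a) * Q."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return accuracy * prediction - (1.0 - accuracy) * q


# Made-up inputs: 60 correct predictions out of 100 episodes, an upward
# forecast encoded as +1 (an assumed encoding), and a Q-value of 0.2.
accuracy = 60 / 100
r = recommendation(accuracy, prediction=1.0, q=0.2)
print(round(r, 2))
```

With these inputs the result is 0.6 * 1 - 0.4 * 0.2 = 0.52.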
Chapter-7
7. Limitations
7.1 Limitation of quantitative analysis
The future values of stock prices depend not only on historical data but also on external factors such as
industry performance, investor sentiment, and economic conditions. Because of the uncertainty in the stock
market, there is no universal mathematical principle for predicting the future trend accurately. Quantitative
analysis can only be applied to problems that follow computable mathematical principles. Therefore, decisions
based only on quantitative analysis can lead to severe investment losses. Quantitative analysis only provides
insight from a mathematical perspective.
7.2 Limitation of Linear Regression
The accuracy of the prediction by Linear Regression is not high enough to make a good decision on stock
trading. Linear Regression is limited to linear relationships. The algorithm assumes that the system follows a
straight line. In stock trading, however, the value could rise, drop, or remain constant, and the data values
are scattered and fluctuating. Apart from that, Linear Regression is not a complete description of the
relationships among variables. It only models the relationship between the mean of the dependent variable and
the independent variable, which does not fit the situation we encountered in the stock market. Hence, the
quality of the prediction is limited by this constraint.
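The limitation can be made concrete by comparing how well a straight line fits two made-up price series, one
that rises steadily and one that zigzags. The sketch computes the coefficient of determination (R squared) of
a least-squares line in plain Python. The function name and both series are illustrative assumptions.

```python
def r_squared(ys):
    """Fit a least-squares line over day indices and return (slope, R^2)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Made-up series: a steady rise versus a price that zigzags around 100.
steady = [100.0 + day for day in range(10)]
zigzag = [100.0 + (5.0 if day % 2 == 0 else -5.0) for day in range(10)]

print(r_squared(steady))
print(r_squared(zigzag))
```

The line explains the steady series almost perfectly but explains very little of the zigzag series, so the
slope alone says little about where a fluctuating price will be a week later.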
Chapter-8
8. Result and Analysis
The system was evaluated on stocks from India’s Bombay Stock Exchange and the NSE. Given a day’s open
index, high, low, volume, and adjacent values, along with the stock news textual data, our forecaster
forecasts the closing index value for a particular trading day.
8.1 Visualization of Regression Model:
In statistical modelling, regression analysis is a set of statistical processes for estimating the relationships
among variables. It includes many techniques for modelling and analysing several variables, when the focus is
on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More
specifically, regression analysis helps one understand how the typical value of the dependent variable (or
'criterion variable') changes when any one of the independent variables is varied, while the other independent
variables are held fixed.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with
the field of machine learning. Regression analysis is also used to understand which of the independent
variables are related to the dependent variable, and to explore the forms of these relationships. In restricted
circumstances, regression analysis can be used to infer causal relationships between the independent and
dependent variables. However, this can lead to illusions or false relationships, so caution is advisable.
8.2 Prediction
Chapter-9
9. Conclusion
In this project, a Q-learning algorithm was implemented to give the user a recommendation on the
dependability of the result from Linear Regression. The user can refer to the recommendation rating and then
make decisions on stock trading. Linear Regression is a statistical method that has been criticized for its
accuracy, and relying only on its result does not lead to good decisions on stock trading. Therefore, we have
introduced a novel way of using machine learning to rate how far the Linear Regression result can be trusted.
By combining Linear Regression with Q-learning, we can give the user a more accurate indication of whether the
stock price will follow the trend predicted by Linear Regression.
9.1 Future Work
• Refining key phrase extraction is expected to produce better results. Enhancements to the preprocessor
unit of this system should help improve predictive accuracy in the stock market.
• Incorporating Twitter feeds, message boards, RSS feeds, and news.
• Considering internal factors of the company, such as sales and assets.