
BitCoin price-sentiment analysis

DATA MINING PROJECT REPORT






Professor: Hadikadi Amer
Students: Asmir Avdicevic, Anis Borcak, Adna Kolakovic, Mirza Masic



June, 2014
Sarajevo
BitCoin price-sentiment analysis AA, AB, AK, MM


Contents
1. Project Definition
2. Data location and collection
3. Data preparation, pre-processing, integration and exploration
List of attributes
4. Data Mining and Evaluation
4.1 Association Rules
Rule 1
Rule 2
Rule 3
Rule 4
5. Result Interpretation





1. Project Definition
Bitcoin is a peer-to-peer version of electronic cash that allows online payments to be sent
directly from one party to another without going through a financial institution. During the
last quarter of 2013 the price of bitcoin grew very rapidly and reached a record of $1,242 per
coin on November 29. For comparison, on the same day the spot price of gold was $1,240 per
ounce. Currently there are more than 12 million bitcoins in circulation, and the rate at which
new bitcoins are created is halved roughly every four years until a maximum of 21 million coins
is reached. After the record in November, the price plunged to around $600 and then stagnated
around that level with slight ups and downs. Today the price of bitcoin is $617, after slight
growth in May 2014. Because the price of a virtual currency surpassed the price of gold at one
point, we want to analyze whether there is a correlation between Twitter posts, called tweets,
and the price of bitcoin. If there is, it could be a good starting point for predicting future
plunges or jumps in the bitcoin price.

2. Data location and collection
The data source we chose is Twitter. Twitter is a social platform that allows users to post
short texts called tweets. For data mining purposes we can access 1% of all tweets and can
select tweets containing certain keywords. The keyword we used is #bitcoin. We collected data
for the period from 24 to 31 May. Because this is essentially real-time data in raw format,
each tweet arrives with a large amount of junk that must be discarded to get the data we need.
After that, the raw data must be adjusted so that it can be inserted into tables that can later
be used for analysis.
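
The discarding step described above can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes each collected tweet is stored as one JSON object per line, the field names (`user`, `followers_count`, `favorite_count` and so on) are assumptions based on the Twitter API v1.1 tweet object, and `extract_record` is a hypothetical helper.

```python
import json

def extract_record(raw_line):
    """Keep only the fields needed for analysis from one raw tweet (JSON text).

    Returns None for lines that are not usable tweets (e.g. delete notices).
    Field names are assumed, not taken from the report.
    """
    try:
        tweet = json.loads(raw_line)
    except ValueError:
        return None
    if not isinstance(tweet, dict) or "text" not in tweet or "user" not in tweet:
        return None
    user = tweet["user"]
    # Everything not listed here (profile pictures, backgrounds, ...) is dropped.
    return {
        "user_id": user.get("id"),
        "verified_user": user.get("verified", False),
        "followers_count": user.get("followers_count", 0),
        "tweet_favourite_count": tweet.get("favorite_count", 0),
        "tweet_retweeted": tweet.get("retweeted", False),
        "tweet_retweet_count": tweet.get("retweet_count", 0),
        "timestamp": tweet.get("created_at"),
        "lang": tweet.get("lang"),
        "text": tweet["text"],
    }
```

Applying such a function line by line keeps memory use constant, which matters for an input file of several gigabytes.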


3. Data preparation, pre-processing, integration and exploration
Before preprocessing and discarding unnecessary data we had a 13.7 GB file. To ease the data
mining process we first preprocessed the data. During preprocessing we discarded all irrelevant
attributes such as profile pictures, backgrounds, etc. The resulting file was 1.76 GB (7.2
million records), which we considered sufficient given that one tweet with all relevant
attributes takes approximately 256 bytes.

The next step was to remove all non-English tweets by language filtering and to remove spam: we
took the most frequent words and used a naive Bayes classifier to decide which tweets are spam.
To speed things up, we handled missing data within the spam filter. Because the timestamp is
required for our mining process and is very hard to fill in when missing, we discarded all
tweets without a timestamp. To reduce spam further, we discarded all records with a word count
lower than 3 and records whose tweet contains non-ASCII characters, because we cannot analyze
those with confidence. After these filtering steps we had around 1.1 million records, or 252 MB.

With the filtered data we began sentiment analysis. Sentiment analysis is done with a list of
words rated for valence, arousal and dominance. It yields a three-dimensional map of tweet
moods, but it also removes records that cannot be analyzed. After sentiment analysis we were
left with 80 MB of data, or 335,000 individual records, each with new derived attributes: mood,
mean valence, mean arousal, mean dominance and intensity of the mood. The attribute mood has 20
different values, and each of them can occur with different arousal, valence, dominance and
intensity.

After preparing the Twitter data, we took historic bitcoin price data with time information
from one of the largest bitcoin exchange websites and matched each tweet with the corresponding
bitcoin price using the relevant timestamp. Finally, we adjusted the data set for WEKA by
converting it to CSV format, which WEKA can read.
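
The rule-based filters and the averaging of word ratings described above could look roughly like this. It is a sketch under stated assumptions: `is_usable` and `mean_vad` are hypothetical names, the lexicon is assumed to map a lowercase word to a (valence, arousal, dominance) triple, and the two entries in `TOY_LEXICON` are made-up values used only to make the example runnable.

```python
def is_usable(text, timestamp):
    """Apply the described filters: timestamp present, at least 3 words, ASCII only."""
    if not timestamp:
        return False
    if len(text.split()) < 3:
        return False
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

def mean_vad(text, lexicon):
    """Average valence, arousal and dominance over the words found in the lexicon.

    Returns None when no word matches, mirroring the removal of tweets
    that cannot be analyzed.
    """
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not scores:
        return None
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Illustrative values only, not taken from any published word list.
TOY_LEXICON = {"happy": (8.0, 6.0, 7.0), "crash": (2.0, 6.0, 3.0)}
```

How the 20 mood labels and the intensity are derived from the three means is not specified in this report, so that step is left out of the sketch.
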
List of attributes:

Attribute 1: USERID - ID of the Twitter user
Attribute 2: VERIFIED_USER - True or false, whether the user is a verified Twitter user
Attribute 3: FOLLOWERS_COUNT - Numerical value, the number of followers of the user
Attribute 4: TWEET_FAVOURITE_COUNT - Numerical value, how many times the tweet was favorited
Attribute 5: TWEET_RETWEETED - True or false, whether the tweet has been retweeted
Attribute 6: TWEET_RETWEET_COUNT - Number of retweets of the tweet
Attribute 7: TIMESTAMP - Time of tweet creation
Attribute 8: MEAN_VALENCE - Indicates whether the tweet sentiment is good or bad
Attribute 9: MEAN_AROUSAL - Amount of excitement and involvement in the tweet
Attribute 10: MEAN_DOMINANCE - Level of assertiveness
Attribute 11: MOOD - Mood expressed by the tweet, 20 moods in total
Attribute 12: INTENSITY - Intensity of the mood
Attribute 13: BTC_PRICE - Numerical value of the bitcoin price at the time of tweet creation
Attribute 14: BTC_VOL - Average number of bitcoins sold in that second
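
The BTC_PRICE and BTC_VOL attributes come from matching each tweet to the price series by timestamp. The report does not state the exact matching rule, so the sketch below assumes one plausible choice: take the most recent price entry at or before the tweet time. The function name `price_at` and the sample rows are hypothetical.

```python
import bisect

def price_at(tweet_ts, price_times, price_rows):
    """Return the latest price row at or before tweet_ts, or None if none exists.

    price_times must be sorted ascending and aligned with price_rows.
    """
    i = bisect.bisect_right(price_times, tweet_ts)
    if i == 0:
        return None
    return price_rows[i - 1]

# Hypothetical sample: (unix time, BTC_PRICE, BTC_VOL)
rows = [(100, 550.0, 1.5), (160, 552.5, 0.7), (220, 551.0, 2.0)]
times = [r[0] for r in rows]
```

A binary search keeps the lookup logarithmic in the number of price entries, which is convenient when several hundred thousand tweets have to be matched.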



4. Data Mining and Evaluation

To acquire and preprocess all the data we used the scripting language Python 2.7. Afterwards
we used WEKA 3.6.10 to analyze the final data set and extract knowledge from it, using the
Apriori algorithm to generate association rules. We picked rules based on their lift, because
confidence alone was not enough to produce meaningful results. We also filtered the rules by
what they imply, mostly the bitcoin price, since we wanted to see a correlation between it and
Twitter.
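
Why confidence alone can mislead is easiest to see on a toy example. The sketch below computes confidence and lift for a rule using their standard definitions; the function name `rule_metrics` and the toy records are illustrative and are not taken from the project data.

```python
def rule_metrics(transactions, premise, consequent):
    """Confidence and lift for the rule premise ==> consequent.

    transactions: list of sets of items; premise and consequent: sets of items.
    Assumes premise and consequent each occur at least once.
    """
    n = len(transactions)
    n_premise = sum(1 for t in transactions if premise <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (premise | consequent) <= t)
    confidence = n_both / n_premise
    lift = confidence / (n_consequent / n)
    return confidence, lift

# Toy data: "happy" appears in 9 of 10 records, so any rule that predicts it
# gets high confidence even when the premise carries no positive information.
toy = [{"high_price", "happy"}] * 4 + [{"low_price", "happy"}] * 5 + [{"high_price", "sad"}]
conf, lift = rule_metrics(toy, {"high_price"}, {"happy"})
```

Here the confidence is 0.8, which looks strong, but the lift is below 1, meaning that "happy" is actually less frequent among the high-price records than in the data as a whole.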

4.1 Association Rules

Rule 1:
INTENSITY='(4.123189-4.567537]' BTC_PRICE='(548.820107-557.316128]' 68291 ==> MOOD=happy 67397 conf:(0.99)
If the intensity of the mood is high and the bitcoin price is high compared to the recent past,
the mood of the tweet is happy.
Rule 2:
MOOD=happy BTC_PRICE='(557.316128-565.812149]' 73378 ==> INTENSITY='(4.123189-4.567537]' 69220 conf:(0.94)
If the user's mood is happy and the price of bitcoin is high, the intensity of the emotion
expressed in the tweet is high.

Rule 3:
MOOD=happy INTENSITY='(4.123189-4.567537]' 181989 ==> BTC_PRICE='(548.820107-557.316128]' 67397 conf:(0.37) < lift:(1.13)> lev:(0.02) [7954] conv:(1.07)
If the user is happy and the intensity of the tweet is high, we can expect a tendency toward a
slightly higher than average bitcoin price.
Rule 4:
MOOD=happy INTENSITY='(4.123189-4.567537]' 181989 ==> BTC_PRICE='(557.316128-565.812149]' 69220 conf:(0.38) < lift:(1.12)> lev:(0.02) [7685] conv:(1.07)
If the user is happy and the intensity of the tweet is high, we can expect a tendency toward a
much higher than average bitcoin price.
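
The metrics printed with Rules 3 and 4 can be cross-checked from the counts in the rule text. The sketch below assumes that the bracketed number after lev is the number of records covered beyond what independence of premise and consequent would predict, and that conviction is (1 - P(consequent)) / (1 - confidence). Both are assumptions about how the WEKA output should be read, and `weka_metrics` is a hypothetical helper.

```python
def weka_metrics(premise_count, joint_count, extra_count):
    """Recompute confidence, lift and conviction from the counts printed in a rule.

    extra_count is the bracketed number after lev, read here as
    joint_count minus the count expected under independence (an assumption).
    """
    conf = joint_count / premise_count
    expected = joint_count - extra_count      # premise_count * P(consequent)
    p_consequent = expected / premise_count
    lift = conf / p_consequent
    conviction = (1 - p_consequent) / (1 - conf)
    return conf, lift, conviction

rule3 = weka_metrics(181989, 67397, 7954)
rule4 = weka_metrics(181989, 69220, 7685)
```

Rounded to two decimals, both tuples agree with the conf, lift and conv values printed in the rules, which suggests the reading above is consistent with the reported output.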










5. Result Interpretation

From the generated association rules we can see that the mood of a user's tweet, its intensity
and the bitcoin price are correlated. Rules 1 and 2 show that in this period the overall
sentiment toward bitcoin was very positive and correlates with a tendency toward slight price
increases. Rules 3 and 4 confirm that a positive user attitude toward the topic correlates with
slightly to highly increased bitcoin prices. All of this can be confirmed by checking recent
online resources (May 2014), which show that in the last two weeks bitcoin recovered above
USD 500 for the first time since last year's crash, breaking a psychological barrier that had
existed for months and resulting in a lot of positive news that further fueled the price beyond
USD 565. Our results show similar findings.
