
Big Data Elective
26.11.2013
- Team Amazon -

Simon Fakir, Thilo Koch, Dragan Mileski, Robert Weindl

The Amazon Dataset

5 838 041

Review dates: May 1996 - May 2005

member id | product id | date | number of helpful feedbacks | number of feedbacks | rating | title | body

1 194 723

Product Information

product id | product name | product type | brand | sales price | list price | short product description
List of Product Category Paths

2 235 953

Short Member Information

member id | member name | number of reviews


In my own words

2 235 953

Detailed Member Information

username | member rank | birthday | location | name | member id | state

More than 8 GB Data


CDTM

Amazon reviews: Basic statistics


Most reviews are 5 star reviews

Ratio of stars of all reviews:
5 stars: 57.5%
4 stars: 20.1%
3 stars: 8.7%
2 stars: 5.4%
1 star: 8.28%

Average rating: 4.132
Standard deviation: 1.268

Database: 5,838,041 reviews
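
The average rating and standard deviation can be cross-checked from the star ratios alone. The following is a minimal sketch, not the authors' Weka or MySQL workflow; the percentages are taken from the slide and are normalized by their sum because they are rounded.

```python
import math

# Star ratios as shown on the slide (percent of all reviews).
star_ratio = {5: 57.5, 4: 20.1, 3: 8.7, 2: 5.4, 1: 8.28}

# The rounded percentages add up to slightly less than 100, so normalize.
total = sum(star_ratio.values())

# Weighted mean and (population) standard deviation of the star rating.
mean = sum(stars * pct for stars, pct in star_ratio.items()) / total
variance = sum(pct * (stars - mean) ** 2 for stars, pct in star_ratio.items()) / total
std_dev = math.sqrt(variance)

print(f"average rating: {mean:.3f}")
print(f"standard deviation: {std_dev:.3f}")
```

Both values land close to the figures reported on the slide; small differences are expected because the slide percentages are rounded.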


Amazon reviews: Basic statistics


1.9 million locations of reviews couldn't be retrieved

Northern Hemisphere:
Reviews submitted: 3,822,715
Average rating: 4.144
Standard deviation: 1.257

Southern Hemisphere:
Reviews submitted: 109,363
Average rating: 4.152
Standard deviation: 1.196

[Chart: ratio of stars (5 stars to 1 star), Northern Hemisphere vs. Southern Hemisphere]

Database: 3,932,078 reviews
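
The hemisphere figures can be checked against the full database with a few lines of arithmetic. This is a small sketch using only the numbers from the slides; the variable names are illustrative.

```python
# Figures from the slides.
north = {"reviews": 3_822_715, "avg": 4.144}
south = {"reviews": 109_363, "avg": 4.152}
all_reviews = 5_838_041

# Reviews with a known location, and reviews whose location is missing.
located = north["reviews"] + south["reviews"]
missing = all_reviews - located

# Average rating over both hemispheres, weighted by review count.
pooled_avg = (
    north["reviews"] * north["avg"] + south["reviews"] * south["avg"]
) / located

# Share of located reviews that come from the Southern Hemisphere.
south_share = south["reviews"] / located

print(located, missing, round(pooled_avg, 3), f"{south_share:.1%}")
```

Because the Southern Hemisphere contributes only a small share of the located reviews, the pooled average stays very close to the northern value.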

Reviews monthly submitted (not location dependent)

Data retrieved from Weka, put into graphs with Excel

[Chart: reviews submitted from 1996-2005, y-axis from 0 to 600,000 reviews]

Database: 5,838,041 reviews


5 star reviews monthly submitted

Data retrieved from MySQL, put into graphs with Excel

[Chart: 5 star ratio of all reviews per month, y-axis from 51.0% to 59.0%; series: Northern Hemisphere, Southern Hemisphere, weighted averages]

Database: 3,932,078 reviews
3,822,715 (northern)
109,363 (southern)
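
A monthly star ratio per hemisphere could be computed along the following lines. This is a sketch under assumptions: the `(month, hemisphere, rating)` tuple layout is hypothetical, and "weighted average" is interpreted here as the two hemisphere ratios weighted by their review counts, which equals the ratio over the pooled reviews.

```python
from collections import defaultdict


def monthly_star_ratio(reviews, star):
    """Return {month: {"north": ratio, "south": ratio, "weighted": ratio}}.

    reviews: iterable of (month, hemisphere, rating) tuples (hypothetical layout).
    star: the rating whose share of all reviews is computed, e.g. 5 or 1.
    """
    # month -> hemisphere -> [reviews with `star`, all reviews]
    counts = defaultdict(lambda: {"north": [0, 0], "south": [0, 0]})
    for month, hemisphere, rating in reviews:
        bucket = counts[month][hemisphere]
        bucket[1] += 1
        if rating == star:
            bucket[0] += 1

    result = {}
    for month, by_hemisphere in sorted(counts.items()):
        row = {}
        hits = total = 0
        for hemisphere, (h, t) in by_hemisphere.items():
            row[hemisphere] = h / t if t else None  # None: no reviews that month
            hits += h
            total += t
        # Weighting each hemisphere's ratio by its review count is the same
        # as dividing the pooled hits by the pooled total.
        row["weighted"] = hits / total
        result[month] = row
    return result


# Made-up sample rows, only to show the call.
sample = [
    (1, "north", 5), (1, "north", 4), (1, "north", 5), (1, "south", 1),
    (2, "north", 5), (2, "south", 5),
]
ratios = monthly_star_ratio(sample, star=5)
```

Calling the same function with `star=1` gives the corresponding 1 star ratios.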

1 star reviews monthly submitted

Data retrieved from MySQL, put into graphs with Excel

[Chart: 1 star ratio of all reviews per month, y-axis from 5.0% to 9.0%; series: Northern Hemisphere, Southern Hemisphere, weighted averages]

Database: 3,932,078 reviews
3,822,715 (northern)
109,363 (southern)

Machine Learning

Machine Learning
The Machine Learning Learning Process

Data Mining & Parsing
Development & Execution
Success Feelings

Machine Learning
What to do with a lot of data?

Generate More Data!


Machine Learning
Important information is missing, isn't it?

2 235 953

Short Member Information

member id | member name | number of reviews


In my own words

2 235 953

Detailed Member Information

username | member rank | birthday | location | name | member id | state

What Gender Do Our Users Have?



Predicting the Member's Gender

How to predict a gender?

Training: Gather Training Data (Male Names, Female Names) -> Features extractor -> Gender Features -> Training & Validation Set -> Classifier
Prediction: Name -> Features extractor -> Classifier -> Male / Female

Predicting the Member's Gender

Gather Training Data - US Social Security Administration

Get a free list of birth names with their gender, from 1880 to 2012

http://www.socialsecurity.gov/oact/babynames/names.zip

Predicting the Member's Gender

Gather Training Data - A Look in the Files
Sophia,F,22158
Emma,F,20791
Isabella,F,18931
Olivia,F,17147
Ava,F,15418
Emily,F,13550
Abigail,F,12583
Mia,F,11940
Madison,F,11319
Elizabeth,F,9596
Chloe,F,9595
Ella,F,9115
Avery,F,8272
Addison,F,8122
Aubrey,F,8006
Lily,F,7889
Natalie,F,7852
Sofia,F,7767
Charlotte,F,7418
Zoey,F,7411
Grace,F,7304
Hannah,F,7202
Amelia,F,7191
Harper,F,7154

2012


Jacob,M,18899
Mason,M,18856
Ethan,M,17547
Noah,M,17201
William,M,16726
Liam,M,16687
Jayden,M,16013
Michael,M,15996
Alexander,M,15105
Aiden,M,14779
Daniel,M,14143
Matthew,M,13834
Elijah,M,13719
James,M,13271
Anthony,M,13105
Benjamin,M,12695
Joshua,M,12522
Andrew,M,12501
David,M,12422
Joseph,M,12404
Logan,M,12390
Jackson,M,12388
Christopher,M,11777
Gabriel,M,11442

Predicting the Member's Gender

Extract files and create name dictionary
# name -> [male count, female count]
namesDictionary = {
    'KARMELA': [0, 85],
    'KOLLEEN': [0, 1016],
    'EMMAROSE': [0, 328],
    'YUSRA': [0, 811],
    'OPAL': [718, 67964],
    'TAMERON': [112, 53],
    'JAHQUELL': [10, 0],
    'GERTURDE': [0, 127],
    'JORENE': [0, 515],
    'LASH': [78, 0],
    'GAYLON': [3472, 218],
    # ...
}
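
A dictionary of this shape can be built by summing the per-year counts for each name. The sketch below assumes the `Name,Sex,Count` line format shown on the previous slide; the function name `build_names_dictionary` is illustrative and not taken from the original code.

```python
from collections import defaultdict


def build_names_dictionary(lines):
    """Aggregate 'Name,Sex,Count' lines into {NAME: [male_count, female_count]}."""
    names = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sex, count = line.split(",")
        index = 0 if sex == "M" else 1  # position 0: male, position 1: female
        names[name.upper()][index] += int(count)
    return dict(names)


# Three lines taken from the 2012 excerpt shown earlier.
sample_lines = ["Sophia,F,22158", "Emma,F,20791", "Jacob,M,18899"]
namesDictionary = build_names_dictionary(sample_lines)
print(namesDictionary)
```

To cover all years, the same function could be fed the lines of every file in the extracted archive, for example by chaining the open files together.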


Predicting the Member's Gender

Create lists of female and male names.

Female: Anna, Cornelia, Katrin, Svenja
Male: Dragan, Simon, Thilo, Robert

Predicting the Member's Gender


Extract Features, combine both lists and create training data
Define Features: Last letter | Last two letters | Is last letter a vowel
[
    ({'last_is_vowel': True, 'last_two': 'EY', 'last_letter': 'Y'}, 'M'),
    ({'last_is_vowel': False, 'last_two': 'IN', 'last_letter': 'N'}, 'M'),
    ({'last_is_vowel': False, 'last_two': 'LL', 'last_letter': 'L'}, 'M'),
    ({'last_is_vowel': False, 'last_two': 'LL', 'last_letter': 'L'}, 'W'),
    ({'last_is_vowel': False, 'last_two': 'LL', 'last_letter': 'L'}, 'W'),
    ({'last_is_vowel': False, 'last_two': 'LL', 'last_letter': 'L'}, 'W'),
    # ...
]
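
The three features named above could be extracted with a small helper like the one below. It is a sketch, not the original extractor: the function names are illustrative, and treating 'Y' as a vowel is an assumption inferred from the first sample row, where a name ending in 'Y' has `last_is_vowel` set to True.

```python
# Assumption: 'Y' counts as a vowel, as the first sample row suggests.
VOWELS = set("AEIOUY")


def gender_features(name):
    """Return the three features for one (non-empty) name."""
    name = name.strip().upper()
    return {
        "last_letter": name[-1],
        "last_two": name[-2:],
        "last_is_vowel": name[-1] in VOWELS,
    }


def build_training_data(male_names, female_names):
    """Combine both name lists into (features, label) pairs."""
    labeled = [(gender_features(n), "M") for n in male_names]
    labeled += [(gender_features(n), "W") for n in female_names]
    return labeled


training_data = build_training_data(["Dragan", "Simon"], ["Anna", "Katrin"])
print(training_data[0])
```

The labels 'M' and 'W' follow the sample rows on the slide; an empty name would raise an `IndexError`, so such entries should be filtered out beforehand.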


Predicting the Member's Gender

Train and validate the Naive Bayes Classifier
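
The training code itself is not preserved in the slides. As an illustration only, the sketch below implements a small Naive Bayes classifier with add-one smoothing in plain Python and checks it against a held-out validation set; it is not necessarily the library or the exact method the team used, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

VOWELS = set("AEIOUY")  # assumption: 'Y' is treated as a vowel


def gender_features(name):
    name = name.strip().upper()
    return {"last_letter": name[-1], "last_two": name[-2:],
            "last_is_vowel": name[-1] in VOWELS}


def train_naive_bayes(labeled_features):
    """Count labels and feature values; returns the model as a tuple."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> values seen in training
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return label_counts, value_counts, feature_values


def classify(model, features):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            vocab = len(feature_values[name]) + 1  # +1 slot for unseen values
            score += math.log((seen + 1) / (count + vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_features):
    correct = sum(classify(model, f) == label for f, label in labeled_features)
    return correct / len(labeled_features)


# Tiny split built from the example names; real training would use the
# full name dictionary.
train = [(gender_features(n), "M") for n in ["Dragan", "Simon", "Thilo"]]
train += [(gender_features(n), "W") for n in ["Anna", "Cornelia", "Svenja"]]
validation = [(gender_features("Robert"), "M"), (gender_features("Katrin"), "W")]

model = train_naive_bayes(train)
print("validation accuracy:", accuracy(model, validation))
```

With only six training names the validation result says little; the point of the split is that accuracy is measured on names the classifier has not seen during training.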


Predicting the Member's Gender

Use the Naive Bayes Classifier


Predicting the Member's Gender

Live Demonstration

         #            Reviews
         665 825      1 278 629
         1 626 779    3 438 134
Ratio    0.41         0.37

Machine Learning
What else have we achieved?

Determine the mood of a member by analysing his review

Propose a rating based on the written text

Hacking Big Data with Java 7

MONGODB, JAVA, GOOGLE MAPS API, JFREECHART

Connecting to MongoDB from Java

What now?!?!?!?

Maybe:
Find and visualize 10 best rated products
Find and visualize 10 worst rated products
Find and visualize 10 best product brands on Amazon
Find and visualize haters on the world map
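
The ranking tasks in the list above boil down to averaging ratings per key and sorting. The team worked in Java with MongoDB; the sketch below uses Python only to stay consistent with the other examples, and the function name, the `min_reviews` filter and the sample rows are all illustrative assumptions.

```python
from collections import defaultdict


def rank_products(reviews, n=10, min_reviews=1, best=True):
    """Return up to n (product_id, average_rating, review_count) tuples.

    reviews: iterable of (product_id, rating) pairs.
    min_reviews: ignore products with fewer reviews (hypothetical filter).
    best: True for the best rated products, False for the worst rated ones.
    """
    totals = defaultdict(lambda: [0, 0])  # product id -> [rating sum, count]
    for product_id, rating in reviews:
        totals[product_id][0] += rating
        totals[product_id][1] += 1

    averages = [
        (product_id, rating_sum / count, count)
        for product_id, (rating_sum, count) in totals.items()
        if count >= min_reviews
    ]
    averages.sort(key=lambda row: row[1], reverse=best)
    return averages[:n]


# Made-up sample rows, only to show the call.
sample = [("A", 5), ("A", 4), ("B", 1), ("B", 2), ("C", 5)]
best_two = rank_products(sample, n=2, min_reviews=2)
worst_one = rank_products(sample, n=1, best=False)
```

Ranking brands instead of products works the same way once each review is keyed by the brand of its product. A minimum review count keeps products with a single 5 star review from dominating the list.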

Challenge accepted ;)

Find and visualize 10 best reviewed products (1)

Find and visualize 10 best reviewed products (2)

Find and visualize 10 worst reviewed products (1)

Find and visualize 10 worst reviewed products (2)

Find and visualize 10 best rated brands (1)

Find and visualize 10 best rated brands (2)


Find and visualize 10 best rated brands (2) (First 5K Users)

Find and visualize the Haters on a world map (1)

Lack of cool and easy to use libraries
Some of them have limitations (e.g. 3,000 points only)
Data set contains locations as address strings
Some of them are corrupted
Conversion to latitude and longitude needed

Find and visualize the Haters on a world map (2)

Find and visualize the Haters on a world map (3)

// Transform the addresses into geo coordinates

Find and visualize the Haters on a world map (4)

Find and visualize the Haters on a world map (5)

Find and visualize the Haters on a world map (6)

Find and visualize the Haters on a world map (7)

Find and visualize the Haters on a world map (8) Silicon Valley Startups?! :)

User Insights

Reviews Per User


How to predict a gender?

Over 10,000 reviews?

Detail View on User


Review Timeline of One User

Over 100 reviews per day?


Probably a paid author
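
Members with unusually busy days can be found by counting reviews per member and date. This is a minimal sketch; the `(member_id, date)` layout, the function name and the sample values are assumptions made for illustration.

```python
from collections import Counter


def busiest_days(reviews, threshold=100):
    """Return (member_id, date, count) for days with more than `threshold` reviews.

    reviews: iterable of (member_id, date) pairs, one per review.
    """
    per_day = Counter((member_id, date) for member_id, date in reviews)
    flagged = [(m, d, c) for (m, d), c in per_day.items() if c > threshold]
    flagged.sort(key=lambda row: row[2], reverse=True)  # busiest first
    return flagged


# Made-up sample: one member with 120 reviews on a single day.
sample = [("m1", "2004-03-01")] * 120 + [("m2", "2004-03-01")] * 3
suspects = busiest_days(sample)
print(suspects)
```

A high count alone does not prove paid authorship; it only marks accounts worth a closer look.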


Today in your horoscope:

You will have a great day

And give nice Amazon feedback

Sorry Guys, No Relation between Day of Birth and Ratings

[Chart: average given stars and average helpful articles by day of birth]
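
One way to test for such a relation is to average the ratings per day of birth and compute a correlation coefficient. The sketch below is an assumption about how this could be done, not the team's actual analysis; the `(day_of_birth, rating)` layout and the sample values are made up.

```python
import math
from collections import defaultdict


def mean_by_birth_day(rows):
    """rows: iterable of (day_of_birth, rating). Returns {day: mean rating}."""
    sums = defaultdict(lambda: [0.0, 0])  # day -> [rating sum, count]
    for day, rating in rows:
        sums[day][0] += rating
        sums[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(sums.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Made-up sample rows, only to show the call.
sample = [(1, 5), (1, 3), (2, 4), (2, 5), (3, 3), (3, 5)]
means = mean_by_birth_day(sample)
r = pearson(list(means.keys()), list(means.values()))
```

A coefficient near zero, as in this toy sample, would match the slide's conclusion that there is no linear relation.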

Importing Geo Data

After 3,000 requests the Google API rejects further requests.

REQUEST COORDINATES

2.8m users
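
With a request limit this low compared to the number of users, it helps to deduplicate location strings, cache results and stop cleanly once the limit is reached. The sketch below shows one possible approach; `geocode` is a hypothetical callable standing in for the real API client, and the quota of 3,000 is taken from the slide.

```python
def geocode_in_batches(addresses, geocode, quota=3000, cache=None):
    """Resolve address strings to coordinates until `quota` requests were sent.

    geocode: hypothetical callable, address -> (lat, lng) or None.
    cache: results of earlier runs, so repeated runs do not spend quota twice.
    Returns (cache, pending), where pending lists the unresolved addresses.
    """
    cache = {} if cache is None else cache
    requests_sent = 0
    pending = {}  # dict used as an ordered set
    for raw in addresses:
        address = " ".join(raw.split()).lower()  # normalize whitespace and case
        if not address or address in cache:
            continue  # empty or already resolved: no request needed
        if requests_sent >= quota:
            pending[address] = None  # keep for the next run
            continue
        cache[address] = geocode(address)  # None marks an unresolvable address
        requests_sent += 1
    return cache, list(pending)


def fake_geocode(address):
    """Stand-in for the real geocoding call, with one known sample location."""
    known = {"munich, germany": (48.14, 11.58)}
    return known.get(address)


cache, pending = geocode_in_batches(
    ["Munich, Germany", "munich,  germany", "???", "Seattle, WA"],
    fake_geocode,
    quota=2,
)
```

Passing the returned `cache` and `pending` into the next run continues where the previous one stopped.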


(Map.mov)


N = approx 1.4m


Used Software

ANALYTICAL TOOLS

RAW PROGRAMMING LANG.

DATABASE

5 star reviews monthly submitted (not location dependent)

Data retrieved from MySQL, put into graphs with Excel

[Chart: 5 star ratio of all reviews, y-axis from 55.0% to 58.5%; series: reviews with 5 stars of all reviews per month, weighted average of 5 star reviews per month]

Database: 5,838,041 reviews

1 star reviews monthly submitted (not location dependent)

Data retrieved from Weka, put into graphs with Excel

[Chart: 1 star ratio of all reviews, y-axis from 6.5% to 10.0%; series: reviews with 1 star of all reviews per month, weighted average of 1 star reviews per month]

Database: 5,838,041 reviews
