
Big Data Insight

WEEK 3 BIG DATA AND DATA ANALYTICS


Outline
o Finding Pattern / Insight from Data
o Data Transformation
o Data Visualization
o Looking for Correlation
o The Future (According to Data)
o Data Collection Workflow
Pattern
A pattern is a discernible regularity in the world or in a manmade design. As such, the elements of
a pattern repeat in a predictable manner. (Wikipedia)
Why Pattern Recognition is important:
1. The ability to recognize and create patterns helps us make predictions based on our observations
2. Patterns allow us to see relationships and develop generalizations
3. Pattern recognition allows someone to identify such patterns when they first appear
4. Patterns provide a sense of order in what might otherwise appear chaotic
5. Patterns allow someone to make educated guesses (hypotheses in science)
6. Understanding patterns aids in developing mental skills
7. Patterns can provide a clear understanding of mathematical relationships
8. Understanding patterns provides a clear basis for problem-solving skills (e.g. algebra)
9. Knowledge of patterns can be transferred to many scientific fields, where it proves very helpful
10. Patterns provide clear insight into the natural world
You don’t have to start with an outcome in mind when logging
personal data. If you know how to ask the right questions of raw
data, you may find patterns you didn’t expect. The beauty of having
lots of different types of data about yourself is that you may uncover
unexpected correlations. One of my favorite examples of an
unexpected correlation in personal data is Jewel Loree’s musical
cycles. She got her mood report from last.fm, which visualizes the
patterns in the music she listens to. She was interested in the
pattern of peaks in sad, low-energy music.
FINDING PATTERN / INSIGHT

Making a Bayesian Model to Infer Uber Rider Destinations


FINDING DATA INSIGHT / PATTERN
• It has never been easy; it is almost always hard
• It often needs help from machines / software
• It takes imagination and creativity
DATA TRANSFORMATION
Definition: Data transformation is the process of converting data or information from one
format to another, usually from the format of a source system into the required format of a
new destination system.

Objective: Simplify the problem so that it is easier to solve, for example by representing a
social network as a graph and its adjacency matrix.
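
As a minimal sketch of this kind of transformation, the snippet below converts a list of friendships into an adjacency matrix. The names and the helper `to_adjacency_matrix` are made up for illustration and the friendships are assumed to be undirected; once the data is in matrix form, a question such as "who has the most connections?" becomes a simple row sum.

```python
# Hypothetical friendship edges; the names are illustrative, not from the slides.
edges = [("Ana", "Budi"), ("Ana", "Citra"), ("Budi", "Citra"), ("Citra", "Dewi")]


def to_adjacency_matrix(edges):
    """Convert an undirected edge list into (sorted node list, 0/1 matrix)."""
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the entry
    return nodes, matrix


nodes, matrix = to_adjacency_matrix(edges)

# The degree (number of connections) of each person is the sum of their row.
degrees = {name: sum(row) for name, row in zip(nodes, matrix)}
print(degrees)
```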

Reasons: Convenience, Reducing Skewness, Equal Spreads, Linear Relationship, Additive Relationship
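
To illustrate one of these reasons, reducing skewness, here is a small sketch that applies a base-10 logarithm to a right-skewed sample. The income figures are invented, and `skewness` is a hand-written helper (population skewness), not a library function; the log transform is only one of several possible choices.

```python
import math


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed data: most values are small, one is very large.
incomes = [12, 15, 18, 20, 22, 25, 30, 45, 90, 400]
log_incomes = [math.log10(v) for v in incomes]

print("skewness before:", round(skewness(incomes), 2))
print("skewness after: ", round(skewness(log_incomes), 2))
```

The transformed values are still right-skewed, but less so, because the logarithm compresses large values more than small ones.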
DATA VISUALIZATION
Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual
communication. It involves the creation and study of the visual representation of data, meaning
"information that has been abstracted in some schematic form, including attributes or variables for the units
of information". (Wikipedia)

A primary goal of data visualization is to communicate information clearly and efficiently via statistical
graphics, plots and information graphics.

Data visualization is both an art and a science. It is viewed as a branch of descriptive statistics by some,
but also as a grounded theory development tool by others.
DATA VISUALIZATION MATTERS
• Identify areas that need attention or improvement
• Clarify which factors influence other factors, for example in customer behavior
• Help you understand your problem and solution (example: which products to place
where)
• Predict future changes (example: sales volumes)

A good data visualization is part of your storytelling.

People love stories, especially ones with pictures: data-driven stories.
DATA VISUALIZATION
• http://www.datanalytics.ch/tag/data-visualization/
• Google Images search for "Data Visualization"

VIDEO
• The Beauty of Data Visualization - https://www.youtube.com/watch?v=5Zg-C8AAIGg
• The Art of Data Visualization - https://www.youtube.com/watch?v=AdSZJzb-aX8
Looking for Correlation
• You have two sampled variables (X and Y). Does the value of one variable depend on the other, or are the
variables unrelated?
• A correlation coefficient measures how strongly the two variables vary together; a significance test then estimates the probability that a correlation this strong would appear by chance between unrelated variables.
• Correlations can be strong or weak.
• Strong correlations are very useful for identifying candidate root causes and the most important variables, although correlation alone does not prove causation
• Weak correlations leave considerable ambiguity
• If two variables are strongly correlated, one can be approximately predicted from the other; only a perfect correlation means one is an exact (linear) function of the other.
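
A minimal sketch of measuring the strength of a correlation, assuming the Pearson coefficient is the measure in question (the slides do not name one). The study-hours data and the helper `pearson_r` are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5, 6]
scores_exact = [2 * h + 1 for h in hours]   # exact linear function of hours
scores_noisy = [52, 49, 61, 58, 70, 66]     # made-up scores with noise

print(pearson_r(hours, scores_exact))  # perfect correlation
print(pearson_r(hours, scores_noisy))  # strong, but below 1
```

Only the first pair is an exact function of `hours`; the second pair is strongly correlated, yet the scores cannot be computed exactly from the hours.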
Example: Correlation Matrix
The Future (According to Data)

The forecasts are shown as a blue line, with the 80% prediction intervals as a gray shaded
area, and the 95% prediction intervals as a light gray shaded area.
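
How such intervals arise can be sketched with the simplest forecasting method, the naive forecast (every future value equals the last observation). This is not necessarily the method behind the plotted forecast; the sketch assumes normally distributed, uncorrelated residuals, uses the multipliers 1.28 and 1.96 for 80% and 95%, and the series is invented.

```python
import math


def naive_forecast(series, horizon):
    """Naive forecast with approximate 80% and 95% prediction intervals."""
    last = series[-1]
    # One-step residuals of the naive method are the successive differences.
    residuals = [b - a for a, b in zip(series, series[1:])]
    sigma = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
    rows = []
    for h in range(1, horizon + 1):
        spread = sigma * math.sqrt(h)  # uncertainty grows with the horizon
        rows.append({
            "h": h,
            "forecast": last,
            "pi80": (last - 1.28 * spread, last + 1.28 * spread),
            "pi95": (last - 1.96 * spread, last + 1.96 * spread),
        })
    return rows


for row in naive_forecast([10, 12, 11, 13, 14], horizon=3):
    print(row)
```

The 95% interval is always wider than the 80% interval, and both widen as the forecast reaches further into the future, which is why the shaded areas in such plots fan out.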
Predict Network Growth
NoSQL for Big Data
A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism
for storage and retrieval of data that is modeled by means other than the tabular relations used
in relational databases.

NoSQL databases hold and help manage the vast reservoirs of structured and unstructured data that make it
possible to mine for insight with Big Data.

Some NoSQL database types, by classification:

• Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
• Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase, Cosmos DB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
• Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB, SDBM/Flat File dbm
• Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso
• Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB, InfinityDB, MarkLogic, OrientDB

For simplicity, we focus on graph databases.


NEO4J (Graph Database)
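
To give a feel for the graph data model, here is a tiny in-memory sketch in plain Python. It is not the Neo4j API or its query language; the class `TinyGraph`, the `FRIEND` relationship, and the people are all hypothetical. It only illustrates the idea of nodes with properties connected by named, directed relationships, and a query that follows those relationships.

```python
from collections import defaultdict


class TinyGraph:
    """Toy property graph: nodes with properties, directed labeled edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # node id -> [(relationship, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbours(self, node_id, relationship):
        return [t for rel, t in self.edges[node_id] if rel == relationship]

    def friends_of_friends(self, node_id):
        """People two FRIEND hops away who are not already direct friends."""
        direct = set(self.neighbours(node_id, "FRIEND"))
        result = set()
        for friend in direct:
            result.update(self.neighbours(friend, "FRIEND"))
        return result - direct - {node_id}


g = TinyGraph()
for person in ("ana", "budi", "citra", "dewi"):
    g.add_node(person, name=person.capitalize())
g.add_edge("ana", "FRIEND", "budi")
g.add_edge("ana", "FRIEND", "dewi")
g.add_edge("budi", "FRIEND", "citra")
g.add_edge("budi", "FRIEND", "ana")

print(g.friends_of_friends("ana"))
```

In a relational design the same question would typically need a self-join on a friendship table; in a graph model it is a walk along edges.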
Assignment (in Class)
o Find a Case Study of a Big Data Implementation / Application for
Business or another field
o State the objective, problems, solution idea (Week 1)
o State the methodology used (explain) (Week 2)
o State the model, measurement, accuracy (Week 3)
Lab. Activities
• Introduction to R
• Module / Package Installation
• Import / Export Data

Next weeks' activities:
• Data gathering from Twitter
• Crawling data from websites
• Classification and Clustering
• Association
• Sentiment Analysis with Text Mining
ADDITIONAL SLIDES
Data Collection Workflow
