Purpose
1. Discuss the evolution of Data Platforms and why Hadoop was created
2. Discuss the purpose, functionality, and value of Hadoop
3. Describe the various Hadoop components
4. Discuss some of the most common use cases for Hadoop
Agenda
1. What is Hadoop
2. The Evolution of Data Platforms
3. How Hadoop Is Being Used Today
4. Resources and Key Takeaways
[Diagram: HDFS and MapReduce. Market pressures – greater cost pressures, new deployment models & languages, increasing customer expectations – meet a core problem: the cost of traditional platforms is too high to store and process this new data.]
1. Hadoop reduces the cost of storing & processing data to the point that
keeping all data indefinitely is suddenly a very real possibility – AND –
that cost is halving every 18 months
2. MapReduce makes developing & executing massively parallel data
processing tasks trivial compared to historical alternatives (e.g. HPC /
Grid) – see the word-count sketch below
3. The Schema on Read paradigm shifts typical data-preparation complexity
to the analysis phase rather than the acquisition phase – see the parsing
sketch below
4. For the modern CIO, “BIG DATA = HADOOP” – don’t underestimate
the irrational exuberance of the market
The cost and effort to consume and extract value from data have been
fundamentally changed
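
To make point 2 concrete, here is a minimal sketch of the classic Hadoop word-count
job in Java. The developer writes only the map and reduce functions; the framework
transparently handles input splitting, task scheduling across the cluster, the shuffle
and sort between phases, and retries of failed tasks – machinery an HPC/Grid developer
would historically have built by hand. Class names, paths, and the job name are
illustrative.

  // A word-count job: the map and reduce methods below are the entire
  // user-written program; Hadoop supplies the parallelism.
  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map phase: runs in parallel over HDFS blocks, emitting (word, 1).
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce phase: the framework has already shuffled and sorted all
    // counts for each word into a single reduce call; just sum them.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Submitted with something like hadoop jar wordcount.jar WordCount /data/in /data/out
(paths hypothetical), the same two functions scale from one node to thousands with no
code changes.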
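
Point 3 can be illustrated with a hedged, non-MapReduce fragment: under Schema on
Read, raw records land on HDFS as plain bytes with no declared schema, and each
analysis imposes its own structure only when it reads them. The click-stream field
layout below is purely hypothetical.

  // Schema on Read: the bytes on HDFS carry no schema; this class is just
  // one analysis's interpretation, applied when the data is read.
  import java.util.Optional;

  public class ClickRecord {
    public final String userId;
    public final String url;
    public final long timestampMillis;

    private ClickRecord(String userId, String url, long timestampMillis) {
      this.userId = userId;
      this.url = url;
      this.timestampMillis = timestampMillis;
    }

    // Parse a raw tab-separated line at analysis time. A different job can
    // read the very same file with a completely different schema.
    public static Optional<ClickRecord> parse(String rawLine) {
      String[] f = rawLine.split("\t");
      if (f.length < 3) {
        return Optional.empty(); // malformed rows skipped at read time, not load time
      }
      try {
        return Optional.of(new ClickRecord(f[0], f[1], Long.parseLong(f[2])));
      } catch (NumberFormatException e) {
        return Optional.empty();
      }
    }
  }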
Organizations that are evaluating Hadoop are typically also looking at other NoSQL
databases such as Cassandra, MongoDB, and CouchDB, and at hosted offerings such as
Amazon EMR. They might also be evaluating scale-out file systems to use for storage,
such as Isilon or GlusterFS. These systems mirror Hadoop’s scale-out architecture and
are also capable of handling the volume and unstructured nature of the data that could
be stored in Hadoop.
• Challenges
– Existing EDW used for low-value, resource-consuming ETL
processing
– Planned growth will far exceed compute capacity
– Hard to do analytics, or even basic reporting, on the EDW system
• Objectives
– Reduce EDW Total Cost of Ownership
– Retain data longer to enable analytics and accelerate time to
market
– Migrate ETL off the EDW to free up compute resources (a rough
offload sketch follows this list)
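
As a rough sketch of that last objective (assumptions: comma-separated source extracts
with a five-column layout, and simple trim/upper-case standardization rules): the
row-level cleansing that used to consume EDW cycles can run as a map-only Hadoop job,
so only clean, conformed rows ever reach the warehouse load.

  // Hypothetical map-only cleansing job: validate and standardize raw
  // extracts on the Hadoop cluster instead of inside the EDW.
  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class EtlOffload {

    public static class CleanseMapper
        extends Mapper<Object, Text, NullWritable, Text> {
      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        // Assumed rule: reject rows missing the five-column layout or a key.
        if (fields.length != 5 || fields[0].isEmpty()) {
          context.getCounter("etl", "rejected").increment(1);
          return;
        }
        // Assumed standardization: trim and upper-case a code column.
        fields[3] = fields[3].trim().toUpperCase();
        context.write(NullWritable.get(), new Text(String.join(",", fields)));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "edw etl offload");
      job.setJarByClass(EtlOffload.class);
      job.setMapperClass(CleanseMapper.class);
      job.setNumReduceTasks(0); // map-only: row-level cleansing needs no shuffle
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }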
Risk Modeling
– Bank had customer data across multiple lines of business and needed to
develop a better risk picture of its customers.
▪ e.g. – if direct deposits stop coming into a checking account, the customer has likely
lost his/her job, which impacts creditworthiness for other products (credit card, mortgage, etc.)
– Data existed in silos across multiple LOBs and acquired banks’ systems (see the
join sketch below)
– Data size approached 1 petabyte
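
A minimal sketch of how those silos might be brought together on Hadoop: a reduce-side
join re-keys extracts from two lines of business by customer ID so that one reduce call
sees a customer’s whole picture. The file layouts and the “card exposure with no recent
deposits” rule are hypothetical.

  // Reduce-side join across two siloed LOB extracts, keyed by customer ID.
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class RiskJoin {

    // Assumed layout: custId,amount,date
    public static class DepositMapper extends Mapper<Object, Text, Text, Text> {
      @Override
      public void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 3) return;
        ctx.write(new Text(f[0]), new Text("DEPOSIT," + f[1] + "," + f[2]));
      }
    }

    // Assumed layout: custId,cardAcct,balance
    public static class CardMapper extends Mapper<Object, Text, Text, Text> {
      @Override
      public void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 3) return;
        ctx.write(new Text(f[0]), new Text("CARD," + f[1] + "," + f[2]));
      }
    }

    // The reducer sees every record for one customer across both silos;
    // flag card customers with no deposit activity in the extract window.
    public static class RiskReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      public void reduce(Text custId, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        boolean hasDeposits = false;
        List<String> cards = new ArrayList<>();
        for (Text v : values) {
          String s = v.toString();
          if (s.startsWith("DEPOSIT,")) hasDeposits = true;
          else cards.add(s);
        }
        if (!hasDeposits && !cards.isEmpty()) {
          ctx.write(custId, new Text("REVIEW: card exposure, no recent deposits"));
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "risk join");
      job.setJarByClass(RiskJoin.class);
      MultipleInputs.addInputPath(job, new Path(args[0]),
          TextInputFormat.class, DepositMapper.class);
      MultipleInputs.addInputPath(job, new Path(args[1]),
          TextInputFormat.class, CardMapper.class);
      job.setReducerClass(RiskReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileOutputFormat.setOutputPath(job, new Path(args[2]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }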
Why Hadoop
– Social media/web data is unstructured
– Amount of data is immense
– New data sources arise weekly
Webcast
– Hadoop Spotlight Webinar
– Video: HAWQ and Pivotal HD
– Internal Webcast
Product Management
– SK Krishnamurthy - SK.Krishnamurthy@emc.com
Thank You