
Big Data Pipelines

Module 1
Agenda
✓Data Pipelines
✓Data Pipeline Properties
✓Types of Data
✓Evolution of Data Pipelines
✓Deployment of Data Pipelines
✓Analytical platform for IoT landscape
✓Building Big Data Pipelines
✓Benefits of Big Data Pipelines
Data Pipelines
• Building data pipelines is a core component of data science at a startup.
• A pipeline collects data and processes it.
• Typically, the destination for a data pipeline is a data lake, such as Hadoop or
Parquet files on S3, or a relational database, such as Redshift.
• A data pipeline views all data as streaming data and allows for flexible
schemas.
• The data pipeline does not require the ultimate destination to be a data
warehouse.
• Pipelines are commonplace for everything related to data, whether ingesting,
storing, or analyzing it (a minimal sketch follows this list).
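A minimal sketch of the ingest-process-store idea, assuming pandas (with a Parquet engine such as pyarrow) is installed; the file paths and field names are illustrative only.

# Minimal ingest-process-store sketch: read newline-delimited JSON events,
# put them in a DataFrame, and land them as Parquet (a common data-lake format).
import json
import pandas as pd

def run_pipeline(source_path: str, lake_path: str) -> None:
    # Ingest: one JSON event per line, as a tracking collector might emit.
    with open(source_path) as f:
        events = [json.loads(line) for line in f]

    # Process: flexible schema -- keep whatever fields the events carry.
    df = pd.DataFrame(events)

    # Store: Parquet files on local disk, HDFS, or S3 are a typical destination.
    df.to_parquet(lake_path, index=False)

run_pipeline("events.jsonl", "events.parquet")  # hypothetical file names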
Components of Big Data Pipelines

• Compute
• Storage
• Messaging


Compute
•Compute is how your data gets processed (see the Spark sketch after this list)
–Hadoop MapReduce
–Apache Spark
–Apache Flink
–Apache Storm
–Apache Heron
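As a concrete example of the compute layer, here is a minimal batch job sketched with Apache Spark (one of the engines listed above); it assumes a working PySpark installation, and the bucket and column names are made up.

# Batch compute sketch: aggregate raw events with Apache Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compute-example").getOrCreate()

# Read raw JSON events from the storage layer (hypothetical bucket).
events = spark.read.json("s3a://my-bucket/raw/events/")

# Transform: count events per user -- the kind of work the compute layer does.
counts = events.groupBy("user_id").count()

# Write the result back to storage as Parquet.
counts.write.mode("overwrite").parquet("s3a://my-bucket/processed/event_counts/")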
Storage Component
•HDFS
•S3 or other cloud filesystems
•Local Storage
•NoSQL databases
Messaging Component
•Apache Kafka
•Apache Pulsar
•RabbitMQ
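A small sketch of the messaging component using Apache Kafka and the kafka-python client; the broker address, topic name, and event fields are assumptions for illustration.

# Messaging sketch: publish tracking events to a Kafka topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # events as JSON
)

# Each event is one message; downstream consumers (Spark, Flink, etc.)
# read the topic and feed the rest of the pipeline.
producer.send("tracking-events", {"user_id": 42, "event": "page_view"})
producer.flush()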
Deployment of Data Pipelines
•Who owns the data pipeline?
•Which teams will be consuming data?
•Who will QA the pipeline?
Types of Data
• Raw data: tracking data, JSON
• Processed data: decoded, schema applied
• Cooked data: aggregated, e.g. # of sessions (a small sketch of the three
stages follows)
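A toy illustration of the three stages, using made-up tracking events:

# Raw -> processed -> cooked, with fabricated example events.
import json
from collections import Counter

raw = [                                             # raw: JSON tracking data
    '{"user_id": 1, "event": "session_start"}',
    '{"user_id": 1, "event": "session_start"}',
    '{"user_id": 2, "event": "session_start"}',
]

# Processed: decode the JSON and apply a schema (keep only known fields).
processed = [{"user_id": json.loads(r)["user_id"]} for r in raw]

# Cooked: aggregate, e.g. number of sessions per user.
sessions_per_user = Counter(rec["user_id"] for rec in processed)
print(sessions_per_user)   # Counter({1: 2, 2: 1})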


Evolution of Data Pipelines
• Flat File Era
• Database Era
• Data Lake Era
Flat File Era
• A flat file database stores data in plain text format. Unlike a relational
database, it is a single table with one record per line.
• Flat files are widely used in data warehousing projects to import data.
• Flat files are text documents in which values are (usually) separated by
commas or tabs, as in the sketch below.
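For illustration, reading a comma-separated flat file with Python's standard csv module; the file name and column names are hypothetical.

# Reading a flat file: plain text, one record per line, comma-separated.
import csv

with open("customers.csv", newline="") as f:
    for row in csv.DictReader(f):          # the header row supplies field names
        print(row["name"], row["city"])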
Database Era
• In a relational database, data are stored in tables.
• A database table can hold the same data as the flat file (see the sketch
after this list).
• Examples:
–Oracle
–Microsoft SQL
–MySQL
–IBM
–Microsoft Access
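A minimal sketch of the same records held in a relational table, using sqlite3 (bundled with Python) as a stand-in for the engines listed above; table and column names are illustrative.

# Relational storage sketch: the flat-file records as rows in a table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Alice", "Rome"), ("Bob", "Milan")],
)
for name, city in conn.execute("SELECT name, city FROM customers"):
    print(name, city)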
Data Lake Era
• The data lake is one of the more debated concepts to appear in the era of
big data.
• The idea of the data lake originated in industry rather than academia.
• Because the data lake is a newly conceived idea with far-reaching
implications, its adoption brings many challenges.
Data Pipeline Properties
• Low event latency
• Scalability
• Interactive querying
• Versioning
• Monitoring
• Testing
Data Warehouse vs. Data Lake
Data Pipeline Solutions
• Batch
• Real-time
• Cloud native
• Open source
IoT Data Pipeline Layers
• Data Ingestion Layer
• Data Collection Layer
• Data Processing Layer
• Data Storage Layer
• Data Query Layer
• Data Visualization Layer (a sketch of how the layers chain together follows)
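A toy sketch of how the layers can hand data to one another; all function names, data, and the in-memory "database" are invented for illustration.

# Layered IoT pipeline sketch: each layer is a function feeding the next.
import json

def ingest():               # Data Ingestion: receive raw readings from devices
    return ['{"sensor": "t1", "temp": 21.5}', '{"sensor": "t1", "temp": 22.0}']

def collect(messages):      # Data Collection: decode and buffer the readings
    return [json.loads(m) for m in messages]

def process(readings):      # Data Processing: aggregate per sensor
    temps = [r["temp"] for r in readings]
    return {"sensor": "t1", "avg_temp": sum(temps) / len(temps)}

def store(result, db):      # Data Storage: persist for the query layer
    db.append(result)

db = []                     # stand-in for MongoDB / HDFS in the storage layer
store(process(collect(ingest())), db)
print(db)                   # the query and visualization layers read from here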


Technology Stack
• Hadoop Distributed File System (HDFS)
• Spark Streaming
• Spark MLlib
• Kafka
• MongoDB
• Visualization tools such as Tableau, QlikView, D3.js, etc. (a sketch of how
the stack fits together follows)
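A sketch of how parts of this stack connect: Spark Structured Streaming reads sensor readings from Kafka and lands them on HDFS. It assumes the spark-sql-kafka connector is on the classpath; broker, topic, and paths are illustrative.

# Streaming sketch: Kafka -> Spark Structured Streaming -> HDFS (Parquet).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-stream").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")    # assumed broker
    .option("subscribe", "sensor-readings")                 # assumed topic
    .load()
)

# Kafka delivers the value as bytes; cast it to a string for downstream use.
values = stream.selectExpr("CAST(value AS STRING) AS json_value")

query = (
    values.writeStream.format("parquet")
    .option("path", "hdfs:///iot/readings/")
    .option("checkpointLocation", "hdfs:///iot/checkpoints/")
    .start()
)
query.awaitTermination()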
Building Big Data Pipelines
Benefits of Big Data Pipelines
• Big data pipelines help in designing a better event framework.
• Data persistence is maintained.
• Scaling is easier at the coding end.
• Workflow management is simpler, as the pipeline is automated and built to
scale.
• They provide a serialization framework.
• Data pipelines also have some disadvantages, but these are not a major
concern and have alternative ways to manage them:
• Economics can affect performance, as data pipelines are best suited to large
data sets only.
• The job-processing units (i.e. the cloud infrastructure) must be maintained.
• Critical data kept in the cloud loses some privacy.
Thank you
