
Edureka

Commodity hardware

Mahout:-
Hive :-
Job Tracker - Task Tracker
Cloudera, Hortonworks
Master-Slave

Within IBM InfoSphere DataStage, the user modifies a configuration file to define
multiple processing nodes. These nodes work concurrently to complete each job
quickly and efficiently. ... Parallel processing environments are categorized as
symmetric multiprocessing (SMP) or massively parallel processing (MPP) systems.

===============
Data lake systems tend to employ extract, load and transform (ELT) methods for
collecting and integrating data, instead of the extract, transform and load (ETL)
approaches typically used in data warehouses. Data can be extracted and processed
outside of HDFS using MapReduce, Spark and other data processing frameworks.

Hadoop data lakes have come to hold both raw and curated data.
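A minimal sketch of the ELT pattern in Spark's Java API: raw files already landed in the lake are transformed in place and written to a curated zone (the paths and column names below are made up for illustration).

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RawToCurated {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("elt-raw-to-curated")
                .getOrCreate();

            // Extract/Load already happened: raw JSON files were landed in the lake as-is.
            Dataset<Row> raw = spark.read().json("hdfs:///datalake/raw/events/");

            // Transform inside the lake: drop malformed rows, keep only the needed columns.
            Dataset<Row> curated = raw
                .filter("event_id IS NOT NULL")
                .select("event_id", "event_type", "event_ts");

            // Write the curated zone in a columnar format (Parquet).
            curated.write().mode("overwrite").parquet("hdfs:///datalake/curated/events/");

            spark.stop();
        }
    }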

https://www.ibm.com/blogs/insights-on-business/sap-consulting/enterprise-analytics-reference-architecture/

http://www.ibmbigdatahub.com/blog/ingesting-data-data-value-chain
http://www.datavirtualizationblog.com/logical-architectures-big-data-analytics/
https://www.xenonstack.com/blog/data-engineering/ingestion-processing-data-for-big-data-iot-solutions

Prescriptive vs Predictive
ETL tools like Talend/Pentaho?
Data storage format ( like Parquet/Avro)
Warehouse:- Enterprise data, structured data
Data Lake:- Ingests data of any shape and any type; not only structured data
Strategic reporting: done in the warehouse
==================================Types of NoSQL DB=====
Key Value Store
Columnar Store
Graph Database
Document Database
=======================
Time Series DB: for IoT and industrial data
FastLoad vs MultiLoad - Teradata Community

FastLoad has two phases: an acquisition phase and an application phase. MLoad has 5
phases (note: there is no acquisition phase for an MLoad delete task).

MLoad phases: Preliminary, DML Transaction, Acquisition, Application, Cleanup.

As the name suggests, FastLoad is optimized for speed.


MLoad:

LOGTABLE --- Identifies the table to be used to checkpoint information required for
safe, automatic restart of Teradata MultiLoad when the client or Teradata Database
system fails.
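The command-line utilities above also have a JDBC counterpart; a rough sketch of a batched insert over a FastLoad-enabled JDBC connection might look like the following (host, credentials, table and column names are invented, and it assumes the Teradata JDBC driver is on the classpath).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class JdbcFastLoadSketch {
        public static void main(String[] args) throws Exception {
            // TYPE=FASTLOAD asks the driver to use its FastLoad protocol for bulk inserts.
            String url = "jdbc:teradata://tdhost/DATABASE=sales,TYPE=FASTLOAD";
            try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpass")) {
                conn.setAutoCommit(false); // rows are applied when the transaction commits
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO stg_orders (order_id, amount) VALUES (?, ?)")) {
                    for (int i = 1; i <= 10_000; i++) {
                        ps.setInt(1, i);
                        ps.setDouble(2, i * 1.5);
                        ps.addBatch(); // rows are buffered client-side and sent in bulk
                    }
                    ps.executeBatch();
                }
                conn.commit();
            }
        }
    }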
===============
Teradata WebSphere MQ Access Module
Teradata load/unload
Checkpoint on data
==================TPump for real-time data
TPT (Teradata Parallel Transporter)
===================
Cloud providers offer a Schema Conversion Tool (SCT) -- source and target
Informatica load script
========================
Data ingestion in the cloud:-
Spark, Storm, Amazon Kinesis; Lambda for small, event-based chunks of data.
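A minimal sketch of the Lambda side of such event-based ingestion, using the AWS Lambda Java events library (the stream, handler name and downstream write are placeholders).

    import java.nio.charset.StandardCharsets;

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.KinesisEvent;

    public class KinesisIngestHandler implements RequestHandler<KinesisEvent, Void> {

        @Override
        public Void handleRequest(KinesisEvent event, Context context) {
            // Each invocation receives a small batch of records from the Kinesis stream.
            for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
                String payload = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData())
                    .toString();
                context.getLogger().log("Ingested event: " + payload);
                // ... write the event to the landing zone / downstream store here
            }
            return null;
        }
    }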
====================
Hive:- Apache Hive is the SQL-on-Hadoop technology.
Parquet file format: columnar
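A minimal sketch of running a HiveQL query over JDBC (HiveServer2 host, credentials and the table name are hypothetical).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT event_type, COUNT(*) FROM events GROUP BY event_type")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }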
=================
A data lake is one of the popular approaches to building the next-generation enterprise
analytics platform.
Informatica PowerCenter
These areas include:

B2B exchange
Data governance
Data migration
Data warehousing
Data replication and synchronization
Integration Competency Centers (ICC)
Master Data Management (MDM)
Service-oriented architectures (SOA) and more.
Informatica PowerCenter is an enterprise data integration platform working as a
unit.

Real time ingestion tool


=======================Parquet file format (columnar)================
===========================
Compression for Parquet Files: for most CDH components, Parquet data files are not
compressed by default.


Using Parquet Files in HBase
Using Parquet Tables in Hive
Using Parquet Tables in Impala
Using Parquet Files in MapReduce
Using Parquet Files in Pig
Using Parquet Files in Spark
Parquet File Interoperability
Parquet File Structure
Examples of Java Programs to Read and Write Parquet Files
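As a rough illustration of the last item, a small program using the parquet-avro API might look like this (the schema, field names and output path are invented).

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteSketch {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("id")
                .requiredLong("ts")
                .endRecord();

            // Snappy compression is requested explicitly here (see the note above on defaults).
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("/tmp/events.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", "e-1");
                rec.put("ts", System.currentTimeMillis());
                writer.write(rec);
            }
        }
    }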

Denormalized:- Reporting layer, best suited for read-only access


Normalized:-
Default Join:- Inner Join

NoSQL: all are distributed DBs


1. Key Value
2. Columnar
3. Graph
4. Document
Cassandra™: a scalable multi-master database with no single point of failure (wide-column
store built on a partitioned key-value model).
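A minimal sketch of talking to Cassandra from Java with the DataStax driver; it assumes a node reachable through the driver's default configuration.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    public class CassandraReadSketch {
        public static void main(String[] args) {
            // Uses the driver's default contact point / datacenter configuration.
            try (CqlSession session = CqlSession.builder().build()) {
                Row row = session.execute("SELECT release_version FROM system.local").one();
                System.out.println("Cassandra version: " + row.getString("release_version"));
            }
        }
    }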

ACID vs CAP (Consistency, Availability, Partition tolerance)


NoSQL
As the term says, NoSQL means a non-relational or non-SQL database; examples include
HBase, Cassandra, MongoDB, Riak and CouchDB. It is not based on table formats, which is
why we don't use SQL for data access. A traditional relational database deals with
structured data organized in rows and columns, whereas NoSQL deals with unstructured,
unpredictable kinds of data according to the system requirement.

Algorithms for Machine Learning


Other more well-known libraries that exist in this space which can be “easily”
leveraged are Apache Mahout, Spark MLlib, FlinkML, Apache SAMOA, H2O, and
TensorFlow. Not all of these are interoperable, but Mahout can run on Spark and
Flink, SAMOA runs on Flink, and H2O runs on Spark. Cross-platform compatibility is
becoming an important topic in this space, so look for more cross-pollination.

Apache Spark is a popular choice for streaming applications;


MapR Streams with the Kafka 0.9 API has shown that it can handle 18 million events
per second on a five-node cluster with each message being 200 bytes in size.
Leveraging this capability in a scalable platform with decoupled communications is
amazing. The cost to scale this platform is very low.
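A minimal Spark Structured Streaming sketch reading such a Kafka topic (broker address and topic name are made up; the console sink is only for demonstration).

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class KafkaStreamSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                .appName("streaming-ingest")
                .getOrCreate();

            // Read the Kafka topic as an unbounded table of key/value records.
            Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(value AS STRING) AS payload");

            // Append each micro-batch to the console; a real job would write to a durable sink.
            StreamingQuery query = events.writeStream()
                .format("console")
                .outputMode("append")
                .start();

            query.awaitTermination();
        }
    }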
=================
Elasticsearch

S3
Cassandra
==================

mysql - Database sharding Vs partitioning - Stack Overflow
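Sharding spreads one logical table across independent database instances, while partitioning splits it inside a single instance. A toy sketch of hash-based shard routing (the shard count and key format are invented).

    public class ShardRouter {
        private static final int NUM_SHARDS = 4; // hypothetical number of MySQL shards

        static int shardFor(String customerId) {
            // floorMod keeps the result non-negative even for negative hash codes
            return Math.floorMod(customerId.hashCode(), NUM_SHARDS);
        }

        public static void main(String[] args) {
            for (String id : new String[] {"CUST-1029", "CUST-1030", "CUST-1031"}) {
                System.out.println(id + " -> shard " + shardFor(id));
            }
        }
    }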

Popular ETL Tools:

DataStage - IBM Product


SSIS - Microsoft product (SSAS - for creating cubes (facts/dimensions) for the
reporting layer; SSRS - reporting tool)
ODI - Oracle (GoldenGate - real-time data push)
Informatica - Informatica
Ab Initio
Open Source ETL:
Pentaho
Talend
========================
Cloud Migration
AWS Cloud
======================
Oracle vs DB2
Primary key and primary index

ELT vs ETL
LookUp optimization
Strategic Reporting
Facts and Dimension
In-memory: the cached data is huge

In-memory database for mission-critical OLTP applications


An enterprise data lake provides a unified view of all the data required to cater to an
organization's analytical reporting needs.
Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
Importing and exporting data into HDFS and Hive using Sqoop.

=================
Teradata:- Maximum parallelism, much more than others. High-intensity queries are
supported via the primary index.
===========================
Pages --> Blocks --> Segments --> Extents
=========================MAPReduce============

The Context class provides various mechanisms to communicate with the Hadoop framework.
=======================
Hadoop has its own types in order to handle objects the Hadoop way. For example, Hadoop
uses Text instead of Java's String. The Text class in Hadoop is similar to a Java String;
however, Text implements interfaces like Comparable, Writable and WritableComparable.
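A small mapper sketch illustrating both points: Text is used where plain Java would use String, and the Context object is how the mapper emits output back to the framework (the tokenizing logic is only illustrative).

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Text behaves much like java.lang.String but is Writable and WritableComparable.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // Context carries output back to the framework
                }
            }
        }
    }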
===========================
The Writable and WritableComparable interfaces
If you browse the Hadoop API for the org.apache.hadoop.io package, you'll
see some familiar classes such as Text and IntWritable along with others with
the Writable suffix.
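A minimal custom key type implementing WritableComparable might look like this (the class and field names are made up for illustration).

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class EventIdWritable implements WritableComparable<EventIdWritable> {

        private long eventId;

        public EventIdWritable() { }          // Hadoop needs a no-arg constructor

        public EventIdWritable(long eventId) {
            this.eventId = eventId;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(eventId);           // serialize in Hadoop's binary format
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            eventId = in.readLong();          // deserialize fields in the same order
        }

        @Override
        public int compareTo(EventIdWritable other) {
            return Long.compare(eventId, other.eventId); // defines the sort order in the shuffle
        }
    }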

==================
Primitive wrapper classes
These classes are conceptually similar to the primitive wrapper classes, such as Integer
and Long found in java.lang. They hold a single primitive value that can be set either at
construction or via a setter method.
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable – a variable length integer type
VLongWritable – a variable length long type
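A tiny sketch of the two ways of setting the wrapped value (at construction or via the setter).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.VLongWritable;

    public class WrapperDemo {
        public static void main(String[] args) {
            IntWritable count = new IntWritable(42);  // value set at construction
            count.set(43);                            // or via the setter
            System.out.println(count.get());          // prints 43

            VLongWritable big = new VLongWritable(1_000_000L); // variable-length encoding
            System.out.println(big.get());
        }
    }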
==================
https://www.hakkalabs.co/articles/cassandra-data-modeling-guide
