
CarbonData: A New Hadoop File Format for Faster Data Analysis

HUAWEI TECHNOLOGIES CO., LTD.

Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan

Use case: Sequential Scan

- Full table scan
- Big scan & fast batch processing
- Only fetches a few columns of the table

[Diagram: a table of columns C1-C7 and rows R1-R9, with a few full columns highlighted to show the scan pattern]

Common usage scenarios:

- ETL jobs
- Log analysis

Use case: OLAP-Style Query

- Multi-dimensional data analysis
- Involves aggregation / join
- Roll-up, drill-down, slicing and dicing
- Low-latency ad-hoc query

[Diagram: a table of columns C1-C7 and rows R1-R9, with a subset of columns and rows highlighted to show the access pattern]

Common usage scenarios:

- Dashboard reporting
- Fraud & ad-hoc analysis

Use case: Random Access

- Predicate filtering on a range of columns
- Lookup by full row key or a range of keys
- Narrow scan, but might fetch all columns
- Requires second/sub-second low latency

[Diagram: a table of columns C1-C7 and rows R1-R9, with a few full rows highlighted to show the access pattern]

Common usage scenarios:

- Operational queries
- User profiling

Motivation

CarbonData: a single file format that suits different types of data access:

- Sequential access (big scan)
- OLAP-style query (multi-dimensional analysis)
- Random access (narrow scan)

Design Goals

- Low latency for various types of data access patterns
- Allow fast query on fast data
- Ensure space efficiency
- A general format available across the Hadoop ecosystem

CarbonData:

- Read-optimized columnar storage
- Leverages multi-level indexes for low latency
- Supports column groups to gain the benefits of row-based storage
- Enables dictionary encoding for deferred decoding during aggregation
- Optimized streaming-ingestion support
- Broader integration across the Hadoop ecosystem

Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan

CarbonData File Structure

- Blocklet: a set of rows in columnar format
  - Default blocklet size: ~120K rows
  - Balances efficient scan and compression
- Column chunk: data for one column (or column group) in a blocklet
  - Multiple columns may form a column group, stored row-based
  - Column data is stored as a sorted index
- Footer: metadata information
  - File-level metadata & statistics
  - Schema
  - Blocklet index & blocklet-level metadata

[Diagram: Carbon file layout — Blocklet 1..N, each containing column chunks (Col1, Col2, ...) and column-group chunks (ColGroup1, ColGroup2, ...), followed by the footer]

Format

Carbon data file layout:

- Blocklet 1..N
  - Column 1 Chunk, Column 2 Chunk, ...
  - ColumnGroup 1 Chunk, ColumnGroup 2 Chunk, ...
- File Footer
  - Blocklet Info (Blocklet 1..N): per-chunk info — compression scheme, column format, column ID list, column chunk length, column chunk offset
  - Blocklet Index (Blocklet 1..N index nodes): min-max index (min, max) and multi-dimensional index (startKey, endKey)
  - File Metadata: version, number of rows, segment info
  - Schema: schema for each column
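As a reading aid, the footer layout above can be modeled with a few Scala case classes. This is an illustrative in-memory model only, not the actual Thrift schema — field names and the byte representation of keys are assumptions:

// Illustrative model of the Carbon file footer (not the Thrift IDL).
final case class ChunkInfo(
  compressionScheme: String, // e.g. "SNAPPY"
  columnFormat: String,
  columnIds: Seq[Int],       // columns covered by this chunk (one, or a group)
  chunkLength: Long,
  chunkOffset: Long)

final case class BlockletIndexNode(
  min: Seq[Array[Byte]],     // min-max index: per-column minimum
  max: Seq[Array[Byte]],     // min-max index: per-column maximum
  startKey: Array[Byte],     // multi-dimensional index: first MDK in the blocklet
  endKey: Array[Byte])       // multi-dimensional index: last MDK in the blocklet

final case class FileFooter(
  version: Int,
  numRows: Long,
  segmentInfo: String,
  columnSchemas: Seq[String],        // schema for each column
  blockletInfo: Seq[Seq[ChunkInfo]], // per blocklet, per chunk
  blockletIndex: Seq[BlockletIndexNode])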

Blocklet

- Data is sorted along the MDK (multi-dimensional key)
- Data is stored as an index in columnar format

Example source table:

Years | Quarters | Months | Territory | Country   | Quantity | Sales
2003  | QTR1     | Jan    | EMEA      | Germany   | 142      | 11,432
2003  | QTR1     | Jan    | APAC      | China     | 541      | 54,702
2003  | QTR1     | Jan    | EMEA      | Spain     | 443      | 44,622
2003  | QTR1     | Feb    | EMEA      | Denmark   | 545      | 58,871
2003  | QTR1     | Feb    | EMEA      | Italy     | 675      | 56,181
2003  | QTR1     | Mar    | APAC      | India     | 52       | 9,749
2003  | QTR1     | Mar    | EMEA      | UK        | 570      | 51,018
2003  | QTR1     | Mar    | Japan     | Japan     | 561      | 55,245
2003  | QTR2     | Apr    | APAC      | Australia | 525      | 50,398
2003  | QTR2     | Apr    | EMEA      | Germany   | 144      | 11,532

Encoding (each dimension value is replaced by a dictionary surrogate key; the measures Quantity and Sales ride along unchanged):

[1,1,1,1,1] : [142,11432]
[1,1,1,3,2] : [541,54702]
[1,1,1,1,3] : [443,44622]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,3,6] : [52,9749]
[1,1,3,1,7] : [570,51018]
...

Sort (MDK index — rows ordered lexicographically by their key tuples):

[1,1,1,1,1] : [142,11432]
[1,1,1,1,3] : [443,44622]
[1,1,1,3,2] : [541,54702]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
...

[Diagram: blocklet logical view — the sorted MDK tuples laid out column-wise as C1-C7]
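The two steps above (encode, then sort) can be shown with a minimal, self-contained Scala sketch. It is illustrative only — plain collections rather than CarbonData's writer, and surrogate keys are assigned in first-seen order, so they need not match the deck's exact key values:

// Illustrative sketch: dictionary-encode dimension columns, then sort by MDK.
object MdkSketch {
  // (Year, Quarter, Month, Territory, Country, Quantity, Sales)
  val rows = Seq(
    ("2003", "QTR1", "Jan", "EMEA", "Germany", 142, 11432),
    ("2003", "QTR1", "Jan", "APAC", "China",   541, 54702),
    ("2003", "QTR1", "Jan", "EMEA", "Spain",   443, 44622))

  // Lexicographic comparison of two key tuples of equal length.
  def mdkLess(a: Seq[Int], b: Seq[Int]): Boolean =
    a.zip(b).collectFirst { case (x, y) if x != y => x < y }.getOrElse(false)

  def main(args: Array[String]): Unit = {
    val dims = rows.map(r => Seq(r._1, r._2, r._3, r._4, r._5))
    // One dictionary per dimension column: value -> surrogate key starting at 1.
    val dicts = (0 until 5).map { i =>
      dims.map(_(i)).distinct.zipWithIndex.map { case (v, k) => v -> (k + 1) }.toMap
    }
    // Encode each row's dimensions into an MDK tuple; measures are untouched.
    val encoded = dims.zip(rows).map { case (d, r) =>
      (d.zip(dicts).map { case (v, dict) => dict(v) }, Seq(r._6, r._7))
    }
    // Sort by MDK, as the blocklet does before the columnar layout is written.
    encoded.sortWith((l, r) => mdkLess(l._1, r._1)).foreach { case (k, m) =>
      println(s"[${k.mkString(",")}] : [${m.mkString(",")}]")
    }
  }
}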

File-Level Blocklet Index

- The file footer stores a blocklet index: for each blocklet, its start key, end key, and per-column min/max statistics (C1(Min, Max) ... C7(Min, Max))
- At query time, an in-memory file-level MDK index tree is built from these entries and used for filtering
- This is a major optimization for efficient scans: blocklets whose key range or min/max statistics cannot satisfy a predicate are skipped without being read

[Diagram: an index tree over four blocklets — root covering (Start Key1, End Key4), intermediate nodes covering (Start Key1, End Key2) and (Start Key3, End Key4), and leaf entries holding each blocklet's start/end key plus its C1-C7 (Min, Max) pairs]
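A hedged Scala sketch of the pruning step (the type and function names are illustrative, not CarbonData's reader API):

// Illustrative sketch of min/max pruning over footer index entries.
final case class BlockletIndexEntry(
  startKey: Seq[Int],                  // first MDK in the blocklet
  endKey:   Seq[Int],                  // last MDK in the blocklet
  minMax:   Map[String, (Long, Long)]) // per-column (min, max) statistics

// Keep only blocklets whose (min, max) range can contain the predicate value;
// every other blocklet is skipped without touching the data on disk.
def prune(index: Seq[BlockletIndexEntry], column: String, value: Long): Seq[BlockletIndexEntry] =
  index.filter { entry =>
    entry.minMax.get(column).forall { case (min, max) => min <= value && value <= max }
  }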

Column Chunk Inverted Index

- Optionally stores column data as an inverted index within the column chunk: values are sorted inside the chunk, and each entry keeps its original row id ([value | rowId])
- Suitable for low-cardinality columns
- Gives better compression & fast predicate filtering
- Sorted chunks are then run-length encoded and compressed

[Diagram: blocklet rows rewritten as a columnar store of dimension and measure blocks. For example, Dim1 block "1(1-10)" records that value 1 covers rows 1-10, and Dim3 block reads "1(1-3) 2(4-5) 3(6-8) 4(9-10)". In the physical view each dimension chunk stores a data array (d) plus an inverted row map (r); measure chunks hold the raw value pairs, e.g. [142]:[11432]]
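A compact Scala sketch of the idea (illustrative, not the on-disk encoding): build the [value | rowId] pairs for one chunk, then run-length encode the sorted values:

// Illustrative sketch: inverted index for one column chunk, plus RLE.
def invert(chunk: Seq[Int]): Seq[(Int, Int)] =
  // Pair each value with its 1-based row id, then sort by value.
  chunk.zipWithIndex.map { case (v, row) => (v, row + 1) }.sortBy(_._1)

def rle(sorted: Seq[Int]): Seq[(Int, Int)] = {
  // Collapse runs of equal values into (value, runLength) pairs.
  val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
  for (v <- sorted) runs.lastOption match {
    case Some((value, n)) if value == v => runs(runs.length - 1) = (value, n + 1)
    case _                              => runs += ((v, 1))
  }
  runs.toSeq
}

val chunk  = Seq(1, 1, 2, 1, 3, 3, 2, 1)  // a low-cardinality dimension chunk
val index  = invert(chunk)                 // (1,1) (1,2) (1,4) (1,8) (2,3) (2,7) (3,5) (3,6)
val packed = rle(index.map(_._1))          // (1,4) (2,2) (3,2)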

Column Group

- Allows multiple columns to form a column group
- A column group is stored as a single column chunk in row-based format
- Suitable for a set of columns that are frequently fetched together
- Saves the stitching cost of reconstructing rows

[Diagram: Blocklet 1 with separate chunks for Col1, Col2, Col3, and Col6, while Col4 and Col5 are stored together row-wise in a single column-group chunk]
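The stitching cost is easiest to see in code. A minimal Scala sketch, using illustrative values taken from the diagram:

// Columns stored separately must be re-zipped to rebuild each row...
val col4 = Seq(38, 29, 52)
val col5 = Seq(15.2, 18.5, 22.8)
val stitched = col4.zip(col5)                          // per-row reconstruction work

// ...while a row-based column-group chunk already holds them side by side.
val group45 = Seq((38, 15.2), (29, 18.5), (52, 22.8)) // rows read back directly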

Nested Data Type Representation

Arrays: represented as a composite of two columns

- One column for the start_index & length of each array
- One column for the element values

Name | Array<Ph_Number>
John | [192,191]
Sam  | [121,345,333]
Bob  | [198,787]

is stored as:

Name | Array[start,len] | Ph_Number
John | 0,2              | 192
Sam  | 2,3              | 191
Bob  | 5,2              | 121
     |                  | 345
     |                  | 333
     |                  | 198
     |                  | 787

Structs: represented as a composite of a finite number of columns

- Each struct element is a separate column

Name | Info Struct<age,gender>
John | [31,M]
Sam  | [45,F]
Bob  | [16,M]

is stored as:

Name | Info.age | Info.gender
John | 31       | M
Sam  | 45       | F
Bob  | 16       | M
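The array layout above can be computed with a short Scala sketch that mirrors the Ph_Number example:

// Illustrative sketch: flatten an Array column into (start, len) + elements.
val rows = Seq(
  ("John", Seq(192, 191)),
  ("Sam",  Seq(121, 345, 333)),
  ("Bob",  Seq(198, 787)))

// Running offsets into the flattened element column.
val offsets  = rows.map(_._2.length).scanLeft(0)(_ + _)
val startLen = rows.zip(offsets).map { case ((name, arr), off) => (name, off, arr.length) }
val elements = rows.flatMap(_._2)

// startLen: (John,0,2) (Sam,2,3) (Bob,5,2)
// elements: 192 191 121 345 333 198 787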

Encoding & Compression

- Efficient encoding schemes supported: DELTA, RLE, BIT_PACKED, CUSTOM
- Dictionary encoding:
  - Medium-to-high cardinality: file-level dictionary
  - Very low cardinality: table-level global dictionary
- Compression scheme: Snappy

Big wins:

- Speeds up aggregation
- Reduces run-time memory footprint
- Enables deferred decoding
- Enables fast distinct count
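Deferred decoding is the trick behind several of these wins: aggregation runs on the small dictionary codes, and values are decoded only once per group at the very end. A minimal Scala sketch with an illustrative dictionary:

// Illustrative sketch of deferred decoding during aggregation.
val dict    = Map(1 -> "Germany", 2 -> "China", 3 -> "Spain")  // code -> value
val encoded = Seq((1, 142L), (2, 541L), (1, 144L), (3, 443L))  // (countryCode, qty)

// Group and sum on integer codes instead of strings.
val byCode = encoded.groupBy(_._1).map { case (code, rs) => code -> rs.map(_._2).sum }

// Decode once per group, after the aggregation is done.
val result = byCode.map { case (code, total) => dict(code) -> total }
// result: Germany -> 286, China -> 541, Spain -> 443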

Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan


CarbonData Modules

- Carbon-Spark Integration: integration of Carbon with Spark, including query optimization
- Carbon-Hadoop: provides the Hadoop Input/Output Format interface
- Carbon-core: core reader/writer component of the format implementation for reading and writing Carbon data
- Carbon-format: language-agnostic format specification (Thrift definition)

Spark Integration

Query a CarbonData table through the DataFrame API or a Spark SQL statement:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  STORED BY 'org.carbondata.hive.CarbonHandler'
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement];

Schema evolution of a Carbon table is supported via ALTER TABLE:

- Add, delete, or rename a column
- Schema update only; data stored on disk is untouched
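The slide mentions the DataFrame API but shows only DDL, so here is a hedged sketch of querying a Carbon table from Spark (Spark 1.x-era API, matching the df.write example later in this deck; the table and column names are illustrative):

// Hedged sketch: reading and querying a Carbon table from Spark.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QueryCarbonTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("carbon-query"))
    val sqlContext = new SQLContext(sc)

    // DataFrame API: read through the Carbon source used elsewhere in this deck.
    val df = sqlContext.read
      .format("org.apache.spark.CarbonSource")
      .options(Map("dbName" -> "db1", "tableName" -> "tbl1"))
      .load()
    df.groupBy("country").count().show()

    // Equivalent Spark SQL, assuming tbl1 is registered in the metastore.
    sqlContext.sql("SELECT country, COUNT(*) FROM tbl1 GROUP BY country").show()
  }
}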

Spark Integration: Query Optimization

- Table-level MDK tree index
- Vectorized record reading
- Predicate push-down leveraging the multi-level index
- Column pruning
- Deferred decoding for aggregation

[Diagram: a table consists of blocks; each block holds blocklets plus a footer with index; within a blocklet, columns C1-C9 are stored with inverted indexes]

Data Ingestion

Bulk data ingestion:

- CSV file conversion
- MDK clustering level: load level vs. node level
- Save a Spark DataFrame as a Carbon data file

LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE]
  INTO TABLE tablename
  OPTIONS(property_name=property_value, ...);

INSERT INTO TABLE tablename AS select_statement1 FROM table1;

df.write
  .format("org.apache.spark.CarbonSource")
  .options(Map("dbName" -> "db1", "tableName" -> "tbl1"))
  .mode(SaveMode.Overwrite)
  .save("/path")

Data Compaction

- Data compaction is used to merge small files
- Re-clusters data across loads
- Two types of compaction:
  - Minor compaction: compacts adjacent files into a single big file (~HDFS block size)
  - Major compaction: reorganizes adjacent loads to achieve better clustering along the MDK index
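Later CarbonData releases trigger compaction through an ALTER TABLE statement; a hedged sketch, since the exact syntax for the version in this deck may differ:

// Assumed syntax, issued through the same SQLContext as the other examples.
sqlContext.sql("ALTER TABLE tbl1 COMPACT 'MINOR'")  // merge adjacent small files
sqlContext.sql("ALTER TABLE tbl1 COMPACT 'MAJOR'")  // re-cluster loads along the MDK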

Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan


Performance Comparison

Carbon vs. popular columnar stores (13 benchmark queries, SQL1-SQL13, on 2 TB of data):

- High-throughput/full-scan queries: 1.4x to 6x faster
- OLAP/interactive queries: 20x to 33x faster
- Random-access queries: 26x to 688x faster

[Chart: response time in seconds per benchmark query, popular columnar stores vs. Carbon; readings range from 111.86 s down to 0.16 s across SQL1-SQL13]

Performance Comparison: Observations

- High-throughput/full-scan queries: 1.4 to 6 times faster
  - Deferred decoding enables faster on-the-fly aggregation
- OLAP/interactive queries: 20 to 33 times faster
  - MDK, min-max, and inverted indexes enable block pruning
  - Deferred decoding enables faster on-the-fly aggregation
- Random-access queries: 26 to 688 times faster
  - The inverted index enables faster row reconstruction
  - Column groups eliminate implicit joins for row reconstruction

Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan


Live Demo

High-throughput/full-scan query:

SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY)
FROM oscon_demo
GROUP BY PROD_BRAND_NAME;

OLAP/interactive query:

SELECT PROD_COLOR, SUM(STR_ORD_QTY)
FROM oscon_demo
WHERE CUST_COUNTRY = 'New Zealand'
  AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X'
GROUP BY PROD_COLOR;

Random-access query:

SELECT *
FROM oscon_demo
WHERE CUST_PRFRD_FLG = 'Y'
  AND PROD_BRAND_NAME = 'Huawei'
  AND PROD_COLOR = 'BLACK'
  AND CUST_LAST_RVW_DATE = '2015-12-11 00:00:00'
  AND CUST_COUNTRY = 'New Zealand'
  AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X';

Demo Environment:

- Number of nodes: 5 VMs (AWS r3.4xlarge)
- vCPU: 80 (16/node)
- Memory: 500 GiB (100 GiB/node)
- #Columns: 300
- Data size: 600 GB
- #Records: 300M

Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Framework Integrated with CarbonData
Performance
Demo
Future Plan


Future Plan

Upgrade to Spark 2.0

Add append support

Support pre-aggregated table

Enable offline IUD (insert/update/delete) support

Broader Integration across Hadoop-ecosystem


Community

CarbonData is open-sourced and will become an Apache Incubator project

Contributions are welcome on our GitHub:

https://github.com/HuaweiBigData/carbondata

Main Contributors:

Jihong MA, Vimal, Raghu, Ramana, Ravindra, Vishal, Aniket, Liang Chenliang, Jacky Likun,
Jarry Qiuheng, David Caiqiang, Eason Linyixin, Ashok, Sujith, Manish, Manohar, Shahid,
Ravikiran, Naresh, Krishna, Babu, Ayush, Santosh, Zhangshunyu, Liujunjie, Zhujing
(Huawei)

Jean-Baptiste Onofre (Talend, ASF member), Henry Saputra (eBay, ASF member),
Uma Maheswara Rao G (Intel, Hadoop PMC)


Thank you
www.huawei.com

Copyright © 2014 Huawei Technologies Co., Ltd. All Rights Reserved.


The information in this document may contain predictive statements including, without limitation,
statements regarding the future financial and operating results, future product portfolio, new technology,
etc. There are a number of factors that could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements. Therefore, such information is provided for
reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the
information at any time without notice.
