Advanced Data Modeling
with Apache Cassandra

Patrick McFadin 
Chief Evangelist for Apache Cassandra
@PatrickMcFadin
©2013 DataStax Confidential. Do not distribute without consent.

Cassandra Modeling
Application
Models
Data
Think Before You Model
Or how to keep doing what you’re already doing
Some of the Entities and Relationships in KillrVideo
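[Entity-relationship diagram]
• User: id, firstname, lastname, email, password
• Video: id, name, description, location, preview_image, tags
• Comment: id, comment
• User adds Video (1:n, with timestamp)
• User rates Video (n:m, with rating)
• User posts Comment (1:n, with timestamp)
• Video features Comment (1:n)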
Modeling Queries
• What are your application’s workflows?
• How will I access the data?
• Knowing your queries in advance is NOT optional
• Different from RDBMS because I can’t just JOIN or create new indexes to support new queries
Some Application Workflows in KillrVideo
• User logs into site
• Search for a video by tag
• Show latest videos added to the site
• Show video and its details
• Show comments for a video
• Show ratings for a video
• Show basic information about user
• Show videos added by a user
• Show comments posted by a user
Some Queries in KillrVideo to Support Workflows
Users
• User logs into site: Find user by email address
• Show basic information about user: Find user by id
Comments
• Show comments for a video: Find comments by video (latest first)
• Show comments posted by a user: Find comments by user (latest first)
Ratings
• Show ratings for a video: Find ratings by video
Some Queries in KillrVideo to Support Workflows
Videos
• Search for a video by tag: Find video by tag
• Show latest videos added to the site: Find videos by date (latest first)
• Show video and its details: Find video by id
• Show videos added by a user: Find videos by user (latest first)
Data Modeling Refresher
• Cassandra limits us to queries that can scale across many nodes

– Include value for Partition Key and optionally, Clustering Column(s)
• We know our queries, so we build tables to answer them
• Denormalize at write time to do as few reads as possible
• Many times we end up with a “table per query”

– Similar to materialized views from the RDBMS world
Users – The Cassandra Way
User logs into site: Find user by email address
Show basic information about user: Find user by id
CREATE TABLE user_credentials (
    email text,
    password text,
    userid uuid,
    PRIMARY KEY (email)
);
CREATE TABLE users (
    userid uuid,
    firstname text,
    lastname text,
    email text,
    created_date timestamp,
    PRIMARY KEY (userid)
);
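The "table per query" idea behind these two tables can be sketched in plain Python. This is only an illustration, assuming two in-memory dicts standing in for user_credentials and users; the function names create_user and find_user_by_email are hypothetical and not part of the talk.

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-ins for the two tables, each keyed by its partition key.
credentials_by_email = {}  # user_credentials: PRIMARY KEY (email)
users_by_id = {}           # users: PRIMARY KEY (userid)

def create_user(email, password_hash, firstname, lastname):
    """Denormalize at write time: one logical user, two query tables."""
    userid = uuid.uuid4()
    users_by_id[userid] = {
        "userid": userid,
        "firstname": firstname,
        "lastname": lastname,
        "email": email,
        "created_date": datetime.now(timezone.utc),
    }
    credentials_by_email[email] = {
        "email": email,
        "password": password_hash,
        "userid": userid,
    }
    return userid

def find_user_by_email(email):
    """Login workflow: look up by email, then fetch the profile by id."""
    cred = credentials_by_email.get(email)
    if cred is None:
        return None
    return users_by_id[cred["userid"]]
```

Each read is a lookup by key, which is the access pattern the tables were built to serve.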
Why not indexes?
[Figure: the application has to find the index on a ring of nodes with tokens 10 through 80]
Views or indexes?
Show video and its details: Find video by id
Show videos added by a user: Find videos by user (latest first)
CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name text,
    description text,
    location text,
    location_type int,
    preview_image_location text,
    tags set<text>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);
CREATE TABLE user_videos (
    userid uuid,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
Denormalized data
Videos Everywhere!
Search for a video by tag: Find video by tag
Show latest videos added to the site: Find videos by date (latest first)
Considerations When Duplicating Data
• Can the data change?
• How likely is it to change or how frequently will it change?
• Do I have all the information I need to update duplicates and maintain consistency?
Single Nodes Have Limits Too
Show latest videos added to the site: Find videos by date (latest first)
• Latest videos are bucketed by day
• Means all reads/writes to latest videos are going to the same partition (and thus the same nodes)
• Could create a hotspot
CREATE TABLE latest_videos (
    yyyymmdd text,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (yyyymmdd, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
Single Nodes Have Limits Too
Show latest videos added to the site: Find videos by date (latest first)
• Mitigate by adding data to the Partition Key to spread load
• Data that’s already naturally a part of the domain
– Latest videos by category?
• Arbitrary data, like a bucket number
– Round robin at the app level
CREATE TABLE latest_videos (
    yyyymmdd text,
    bucket_number int,
    added_date timestamp,
    videoid uuid,
    name text,
    PRIMARY KEY ((yyyymmdd, bucket_number), added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
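A rough sketch of the round-robin bucketing described here, in Python. NUM_BUCKETS and all function names are assumptions made for illustration; the talk does not prescribe a bucket count. Writes rotate across buckets, and a reader has to query every bucket for the day and merge the results.

```python
from itertools import count

NUM_BUCKETS = 4  # hypothetical; size this to the write load

_write_counter = count()

def next_partition_key(yyyymmdd):
    """Pick the (yyyymmdd, bucket_number) partition for the next write."""
    return (yyyymmdd, next(_write_counter) % NUM_BUCKETS)

def partitions_to_read(yyyymmdd):
    """A reader must fan out to every bucket for the day."""
    return [(yyyymmdd, bucket) for bucket in range(NUM_BUCKETS)]

def merge_latest(rows_per_bucket, limit):
    """Merge rows from all buckets, newest added_date first."""
    merged = [row for rows in rows_per_bucket for row in rows]
    merged.sort(key=lambda row: row["added_date"], reverse=True)
    return merged[:limit]
```

The trade-off is that spreading writes over more partitions means more partitions to read and merge for the "latest videos" query.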
Hot spot
[Figure: 1000 Node Cluster, partition key yyyymmdd]

Hot spot
[Figure: 1000 Node Cluster, partition key (yyyymmdd, bucket_number)]

Use Case Examples
Top User Scores
Game API
Daily Top 10 Users

 handle          | score
-----------------+-------
 subsonic        | 66.2
 neo             | 55.2
 bennybaru       | 49.2
 tigger          | 46.2
 velvetfog       | 45.2
 flashberg       | 43.6
 jbellis         | 43.4
 cafruitbat      | 43.2
 groovemerchant  | 41.2
 rustyrazorblade | 39.2
[Diagram label: Nightly Spark Jobs]
User Score Table
• After each game, score is stored
• Partition is user + game
• Record timestamp is reversed (last score first)
CREATE TABLE userScores (

userId uuid,
handle text static,
gameId uuid,
score_timestamp timestamp,
score double,
PRIMARY KEY ((userId, gameId), score_timestamp)
) WITH CLUSTERING ORDER BY (score_timestamp DESC);
Top Ten User Scores
• Written by Spark job
• Default TTL = 3 days
• Using Date Tiered Compaction Strategy
CREATE TABLE TopTen (

gameId uuid,
process_timestamp timestamp,
score double,
userId uuid,
handle text,
PRIMARY KEY (gameId, process_timestamp, score)
) WITH CLUSTERING ORDER BY (process_timestamp DESC, score DESC)
AND default_time_to_live = 259200
AND compaction = {'class': 'DateTieredCompactionStrategy', 'enabled': 'true'};
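The TTL value in the table definition is expressed in seconds. A small Python check of the arithmetic, with a hypothetical helper expires_at added only to show when a row written at a given time would age out:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=3)
ttl_seconds = int(TTL.total_seconds())  # the value used for default_time_to_live

def expires_at(written_at):
    """Time after which a row written at `written_at` is no longer returned."""
    return written_at + TTL
```

Because every row in the table shares the same TTL, rows written together expire together.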
DTCS
• Built for time series
• SSTable windows of time ranges
• Compaction grouped by time
• Best for data that shares the same TTL (default TTL)
• Entire SSTables can be dropped
Queries, Yo
SELECT gameId, process_timestamp, score, handle, userId

FROM topten
WHERE gameid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
AND process_timestamp = '2014-12-31 13:42:40';
gameid | process_timestamp | score | handle | userid

--------------------------------------+--------------------------+-------+-----------------+--------------------------------------
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 66.2 | subsonic | 99051fe9-6a9c-46c2-b949-38ef78858d07
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 55.2 | neo | 99051fe9-6a9c-46c2-b949-38ef78858d11
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 49.2 | bennybaru | 99051fe9-6a9c-46c2-b949-38ef78858d06
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 46.2 | tigger | 99051fe9-6a9c-46c2-b949-38ef78858d05
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 45.2 | velvetfog | 99051fe9-6a9c-46c2-b949-38ef78858d04
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.6 | flashberg | 99051fe9-6a9c-46c2-b949-38ef78858d10
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.4 | jbellis | 99051fe9-6a9c-46c2-b949-38ef78858d09
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.2 | cafruitbat | 99051fe9-6a9c-46c2-b949-38ef78858d02
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 41.2 | groovemerchant | 99051fe9-6a9c-46c2-b949-38ef78858d03
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 39.2 | rustyrazorblade | 99051fe9-6a9c-46c2-b949-38ef78858d01
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 20.2 | driftx | 99051fe9-6a9c-46c2-b949-38ef78858d08
File Storage Use Case
Upload API
It’s all about the model
• Use case
– User creates an account
– User uploads image
– Image is distributed worldwide
– User can check access patterns
• Start with our queries
– All data for an image
– All images over time
– Specific images over a range
– Access times of each image
user Table
• Our standard POJO
• emails are dynamic
CREATE TABLE user (
username text,
firstname text,
lastname text,
emails list<text>,
PRIMARY KEY (username)
);
INSERT INTO user (username, firstname, lastname, emails)
VALUES ('pmcfadin', 'Patrick', 'McFadin', ['patrick@datastax.com', 'patrick.mcfadin@datastax.com'])
IF NOT EXISTS;
image Table
• Basic POJO for an image
• list of tags for potential search
• username is from user table
CREATE TABLE image (

image_id uuid, //Proxy image ID
username text,
created_at timestamp,
image_name text,
image_description text,
tags list<text>, // ? search in Solr ?
images map<text, uuid> , // orig, thumbnail, medium
PRIMARY KEY (image_id)
);
images_timeseries Table
• Time ordered list of images
• Reversed - Last image first
• Map stores versions
CREATE TABLE images_timeseries (

username text,
bucket int, //yyyymm
sequence timestamp,
image_id uuid,
image_name text,
image_description text,
images map<text, uuid>, // orig, thumbnail, medium
PRIMARY KEY ((username, bucket), sequence)
) WITH CLUSTERING ORDER BY (sequence DESC); // reverse clustering on sequence
bucket_index Table
• List of buckets for a user
• Bucket order is reversed
• High reads, no updates. Use LeveledCompaction
CREATE TABLE bucket_index (

username text,
bucket int,
PRIMARY KEY( username, bucket)
) WITH CLUSTERING ORDER BY (bucket DESC)
AND compaction = {'class': 'LeveledCompactionStrategy'}; // LCS + reverse clustering
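The yyyymm bucket used by images_timeseries and bucket_index has to be computed by the application. A minimal Python sketch, assuming the bucket is derived from the image's timestamp; the function names month_bucket and buckets_between are hypothetical.

```python
from datetime import datetime

def month_bucket(ts):
    """Return the yyyymm bucket as an int, e.g. 201412 for December 2014."""
    return ts.year * 100 + ts.month

def buckets_between(start, end):
    """List the yyyymm buckets covering start..end, newest first."""
    buckets = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        buckets.append(year * 100 + month)
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return buckets
```

A range query over images would then read one (username, bucket) partition per entry returned by buckets_between, in that order.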
blob Table
• Main pointer to chunks
• Count and checksum for error detection
• Metadata stored alongside as an optimization
CREATE TABLE blob (

object_id uuid, // unique identifier
chunk_count int, // total number of chunks
size int, // total byte size
chunk_size int, // maximum size of the chunks.
checksum text, // optional checksum, this could be stored
// for each blob but only checked on a certain
// percentage of reads
attributes text, // optional text blob for additional json
// encoded attributes
PRIMARY KEY (object_id)
);
blob_chunk Table
• Main data storage table
• Size of blob is up to the client
• Return size for error detection
• Run in parallel!
CREATE TABLE blob_chunk (

object_id uuid, // same as blob.object_id above
chunk_id int, // order for this chunk in the blob
chunk_size int, // size of this chunk, the last chunk
// may be of a different size.
data blob, // the data for this blob chunk
PRIMARY KEY ((object_id, chunk_id))
);
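How a client might split an object into rows for blob and blob_chunk, and verify it on read, can be sketched in Python. This is an assumption-laden illustration: the function names are hypothetical, and SHA-256 is chosen here only as an example checksum since the talk does not name one.

```python
import hashlib

def split_into_chunks(data, chunk_size):
    """Return (blob_row, chunk_rows) shaped like the blob and blob_chunk tables."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    blob_row = {
        "chunk_count": len(pieces),
        "size": len(data),
        "chunk_size": chunk_size,
        "checksum": hashlib.sha256(data).hexdigest(),
    }
    chunk_rows = [
        {"chunk_id": i, "chunk_size": len(piece), "data": piece}
        for i, piece in enumerate(pieces)
    ]
    return blob_row, chunk_rows

def reassemble(blob_row, chunk_rows):
    """Rebuild the object from chunks fetched in any order, checking count, size and checksum."""
    ordered = sorted(chunk_rows, key=lambda row: row["chunk_id"])
    if len(ordered) != blob_row["chunk_count"]:
        raise ValueError("missing chunks")
    data = b"".join(row["data"] for row in ordered)
    if len(data) != blob_row["size"]:
        raise ValueError("size mismatch")
    if hashlib.sha256(data).hexdigest() != blob_row["checksum"]:
        raise ValueError("checksum mismatch")
    return data
```

Since each chunk lives in its own partition, the chunk reads can be issued in parallel and sorted by chunk_id afterwards.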
access_log Table
• Classic time series table
• Inserts at CL.ONE
• Read at CL.ONE
CREATE TABLE access_log (

object_id uuid,
access_date text, // YYYYMMDD portion of access timestamp
access_time timestamp, // Access time to the ms
ip_address inet, // x.x.x.x inet address
PRIMARY KEY ((object_id, access_date), access_time, ip_address)
);
Lightweight Transactions
Regular Update
UPDATE videos // table name
SET name = 'The data model is dead. Long live the data model.' // fields to update: not in primary key
WHERE videoid = 06049cbb-dfed-421f-b889-5f649a0de1ed; // primary key
The race is on
T0, Process 1:
SELECT firstName, lastName
FROM users
WHERE username = 'pmcfadin';
(0 rows)
T1, Process 2:
SELECT firstName, lastName
FROM users
WHERE username = 'pmcfadin';
(0 rows)
Got nothing! Good to go!
T2, Process 1:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin','Patrick','McFadin',
    ['patrick@datastax.com'],
    'ba27e03fd95e507daf2937c937d499ab',
    '2011-06-20 13:50:00');
T3, Process 2:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin','Paul','McFadin',
    ['paul@oracle.com'],
    'ea24e13ad95a209ded8912e937d499de',
    '2011-06-20 13:51:00');
This one wins
Lightweight Transactions
INSERT INTO videos (videoid, name, userid, description, location, location_type, preview_thumbnails, tags, added_date)
VALUES (06049cbb-dfed-421f-b889-5f649a0de1ed, 'The data model is dead. Long live the data model.',
    9761d3d7-7fbd-4269-9988-6cfd4e188678,
    'First in a three part series for Cassandra Data Modeling', 'http://www.youtube.com/watch?v=px6U2n74q3g', 1,
    {'YouTube':'http://www.youtube.com/watch?v=px6U2n74q3g'}, {'cassandra','data model','relational','instruction'},
    '2013-05-02 12:30:29')
IF NOT EXISTS;
Don’t overwrite!
UPDATE videos 
SET name = 'The data model is dead. Long live the data model.' 
WHERE videoid = 06049cbb-dfed-421f-b889-5f649a0de1ed
IF userid = 9761d3d7-7fbd-4269-9988-6cfd4e188678;
Don’t overwrite!
Solution LWT
Process 1
T0:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin','Patrick','McFadin',
    ['patrick@datastax.com'],
    'ba27e03fd95e507daf2937c937d499ab',
    '2011-06-20 13:50:00')
IF NOT EXISTS;
T1:
 [applied]
-----------
 True
• Check performed for record
• Paxos ensures exclusive access
• applied = true: Success
Solution LWT
Process 2
T2:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin','Paul','McFadin',
    ['paul@oracle.com'],
    'ea24e13ad95a209ded8912e937d499de',
    '2011-06-20 13:51:00')
IF NOT EXISTS;
T3:
 [applied] | username | created_date             | firstname | lastname
-----------+----------+--------------------------+-----------+----------
 False     | pmcfadin | 2011-06-20 13:50:00-0700 | Patrick   | McFadin
• applied = false: Rejected
• No record stomping!
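The semantics of IF NOT EXISTS can be mimicked in a few lines of Python. This is only an analogy under stated assumptions: a local lock stands in for the Paxos round, and the class UsersTable and its method are hypothetical names, not a driver API.

```python
import threading

class UsersTable:
    """In-memory stand-in for a table that supports INSERT ... IF NOT EXISTS."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for the Paxos round

    def insert_if_not_exists(self, username, row):
        """Return (applied, existing_row), like the [applied] result column."""
        with self._lock:
            existing = self._rows.get(username)
            if existing is not None:
                return False, existing  # rejected: the first writer's row is kept
            self._rows[username] = row
            return True, None
```

The check and the write happen as one step, so the second process sees applied = False and the existing row instead of silently overwriting it.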
No-op. Don’t throw error
CREATE TABLE IF NOT EXISTS videos_by_tag ( 

tag text, 
videoid uuid, 
name text, 
tagged_date timestamp, 
PRIMARY KEY (tag, videoid) 
);
User Defined Types
User Defined Types
• Complex data in one place
• No multi-gets (multi-partitions)
• Nesting!
CREATE TYPE address (
    street text,
    city text,
    zip_code int,
    country text,
    cross_streets set<text>
);
Before
CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);
CREATE TABLE video_metadata (
    video_id uuid PRIMARY KEY,
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);
SELECT *
FROM videos
WHERE videoId = 2;
SELECT *
FROM video_metadata
WHERE video_id = 2;
In-application join
Title: Introduction to Apache Cassandra
Description: A one hour talk on everything you need to know about a totally amazing database.
Playback rate: 480 720
After
• Now video_metadata is embedded in videos
CREATE TYPE video_metadata (
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);
CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    metadata set <frozen<video_metadata>>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);
Wait! Frozen??
Do you want to build a schema?
Do you want to store some JSON?
• Staying out of technical debt
• 3.0 UDTs will not have to be frozen (non-frozen UDT support arrived in Cassandra 3.6)
• Applicable to User Defined Types and Tuples
Let’s store some JSON
{
    "productId": 2,
    "name": "Kitchen Table",
    "price": 249.99,
    "description": "Rectangular table with oak finish",
    "dimensions": {
        "units": "inches",
        "length": 50.0,
        "width": 66.0,
        "height": 32
    },
    "categories": {
        "Home Furnishings": {
            "catalogPage": 45,
            "url": "/home/furnishings"
        },
        "Kitchen Furnishings": {
            "catalogPage": 108,
            "url": "/kitchen/furnishings"
        }
    }
}
CREATE TYPE dimensions (
    units text,
    length float,
    width float,
    height float
);
CREATE TYPE category (
    catalogPage int,
    url text
);
CREATE TABLE product (
    productId int,
    name text,
    price float,
    description text,
    dimensions frozen <dimensions>,
    categories map <text, frozen <category>>,
    PRIMARY KEY (productId)
);
INSERT INTO product (productId, name, price, description, dimensions, categories)
VALUES (2, 'Kitchen Table', 249.99, 'Rectangular table with oak finish',
    {   // dimensions frozen <dimensions>
        units: 'inches',
        length: 50.0,
        width: 66.0,
        height: 32
    },
    {   // categories map <text, frozen <category>>
        'Home Furnishings': {
            catalogPage: 45,
            url: '/home/furnishings'
        },
        'Kitchen Furnishings': {
            catalogPage: 108,
            url: '/kitchen/furnishings'
        }
    }
);
Retrieving fields
Aggregates
• Built-in: avg, min, max, count(<column name>)
• Runs on server
• Always use with partition key
* As of Cassandra 2.2

Materialized Views
• New as of 3.0
• Auto-denormalize your tables
• Not for everything
CREATE TABLE user (
    id int PRIMARY KEY,
    login text,
    firstname text,
    lastname text,
    country text,
    gender int
);
CREATE MATERIALIZED VIEW user_by_country
AS SELECT * // denormalize ALL columns
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (country, id);
Materialized Views
INSERT INTO user(id,login,firstname,lastname,country) VALUES(1, 'jdoe', 'John', 'DOE', 'US');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(2, 'hsue', 'Helen', 'SUE', 'US');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(3, 'rsmith', 'Richard', 'SMITH', 'UK');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(4, 'doanduyhai', 'DuyHai', 'DOAN', 'FR');
SELECT * FROM user_by_country;
country | id | firstname | lastname | login

---------+----+-----------+----------+------------
FR | 4 | DuyHai | DOAN | doanduyhai
US | 1 | John | DOE | jdoe
US | 2 | Helen | SUE | hsue
UK | 3 | Richard | SMITH | rsmith
SELECT * FROM user_by_country WHERE country='US';
country | id | firstname | lastname | login

---------+----+-----------+----------+-------
US | 1 | John | DOE | jdoe
US | 2 | Helen | SUE | hsue
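What the view does with the base rows can be approximated in Python: regroup them under the new primary key (country, id) and skip rows the WHERE clause excludes. This is a sketch over in-memory dicts, with build_user_by_country as a hypothetical name; the last sample row is invented to show the NOT NULL filter and does not appear in the slides.

```python
from collections import defaultdict

def build_user_by_country(users):
    """Regroup base rows by (country, id), like the user_by_country view."""
    view = defaultdict(dict)
    for row in users:
        # Mirrors: WHERE country IS NOT NULL AND id IS NOT NULL
        if row.get("country") is None or row.get("id") is None:
            continue
        view[row["country"]][row["id"]] = row
    return view

users = [
    {"id": 1, "login": "jdoe", "country": "US"},
    {"id": 2, "login": "hsue", "country": "US"},
    {"id": 3, "login": "rsmith", "country": "UK"},
    {"id": 5, "login": "nocountry", "country": None},  # hypothetical row
]
view = build_user_by_country(users)
```

Looking up view["US"] then corresponds to the single-partition query WHERE country='US'.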
Thank you!
Bring the questions
Follow me on twitter
@PatrickMcFadin
