
Advanced Data Modeling

with Apache Cassandra


Patrick McFadin

Chief Evangelist for Apache Cassandra
@PatrickMcFadin

©2013 DataStax Confidential. Do not distribute without consent.


Cassandra Modeling
Application

Models

Data
Think Before You Model
Or how to keep doing what you’re already doing

Some of the Entities and Relationships in KillrVideo

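[Figure: entity-relationship diagram]
• User: id, firstname, lastname, email, password
• Video: id, name, description, location, preview_image, tags
• Comment: id, comment
• User adds Video (1:n), with a timestamp
• User rates Video (n:m), with a rating
• User posts Comment (1:n), with a timestamp
• Video features Comment (1:n)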
Modeling Queries

• What are your application’s workflows?

• How will I access the data?

• Knowing your queries in advance is NOT optional

• Different from RDBMS because I can’t just JOIN or create new indexes to support new queries

Some Application Workflows in KillrVideo
• User logs into site
• Search for a video by tag
• Show latest videos added to the site
• Show video and its details
• Show comments for a video
• Show ratings for a video
• Show basic information about user
• Show videos added by a user
• Show comments posted by a user

Some Queries in KillrVideo to Support Workflows
Users
• User logs into site → Find user by email address
• Show basic information about user → Find user by id

Comments
• Show comments for a video → Find comments by video (latest first)
• Show comments posted by a user → Find comments by user (latest first)

Ratings
• Show ratings for a video → Find ratings by video

Some Queries in KillrVideo to Support Workflows

Videos
• Search for a video by tag → Find video by tag
• Show latest videos added to the site → Find videos by date (latest first)
• Show video and its details → Find video by id
• Show videos added by a user → Find videos by user (latest first)

Data Modeling Refresher

• Cassandra limits us to queries that can scale across many nodes


– Include value for Partition Key and optionally, Clustering Column(s)

• We know our queries, so we build tables to answer them

• Denormalize at write time to do as few reads as possible

• Many times we end up with a “table per query”


– Similar to materialized views from the RDBMS world

Users – The Cassandra Way

• User logs into site → Find user by email address
• Show basic information about user → Find user by id

CREATE TABLE user_credentials (
    email text,
    password text,
    userid uuid,
    PRIMARY KEY (email)
);

CREATE TABLE users (
    userid uuid,
    firstname text,
    lastname text,
    email text,
    created_date timestamp,
    PRIMARY KEY (userid)
);

Why not indexes?
[Figure: "Find the index": an application querying a ring of nodes labeled 10 through 80]

Views or indexes?
• Show video and its details → Find video by id
• Show videos added by a user → Find videos by user (latest first)

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name text,
    description text,
    location text,
    location_type int,
    preview_image_location text,
    tags set<text>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

CREATE TABLE user_videos (
    userid uuid,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

Denormalized data: user_videos repeats name and preview_image_location from videos

Videos Everywhere!

• Search for a video by tag → Find video by tag
• Show latest videos added to the site → Find videos by date (latest first)

Considerations When Duplicating Data


• Can the data change?
• How likely is it to change or how frequently will it change?
• Do I have all the information I need to update duplicates and maintain consistency?

Single Nodes Have Limits Too
• Show latest videos added to the site → Find videos by date (latest first)
• Latest videos are bucketed by day
• Means all reads/writes to latest videos are going to the same partition (and thus the same nodes)
• Could create a hotspot

CREATE TABLE latest_videos (
    yyyymmdd text,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (yyyymmdd, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

Single Nodes Have Limits Too
• Show latest videos added to the site → Find videos by date (latest first)
• Mitigate by adding data to the Partition Key to spread load
• Data that’s already naturally a part of the domain
– Latest videos by category?
• Arbitrary data, like a bucket number
– Round robin at the app level

CREATE TABLE latest_videos (
    yyyymmdd text,
    bucket_number int,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY ((yyyymmdd, bucket_number), added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

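The "round robin at the app level" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket count of 4 and the helper names `make_bucket_picker`, `partition_keys_for_day`, and `merge_latest` are hypothetical and do not come from the slides. Writers pick the next bucket number for each insert; readers have to query every bucket for the day and merge the results themselves.

```python
from itertools import count

NUM_BUCKETS = 4  # assumed bucket count; the slides do not specify one


def make_bucket_picker(num_buckets=NUM_BUCKETS):
    """Return a function that hands out bucket numbers in round-robin order."""
    counter = count()

    def next_bucket():
        return next(counter) % num_buckets

    return next_bucket


def partition_keys_for_day(yyyymmdd, num_buckets=NUM_BUCKETS):
    """All (yyyymmdd, bucket_number) partitions a reader must query for one day."""
    return [(yyyymmdd, bucket) for bucket in range(num_buckets)]


def merge_latest(rows_per_bucket, limit=10):
    """Merge per-bucket result rows (dicts with 'added_date') latest first."""
    merged = [row for rows in rows_per_bucket for row in rows]
    merged.sort(key=lambda row: row["added_date"], reverse=True)
    return merged[:limit]
```

The trade-off is that a single read of "latest videos" becomes one query per bucket plus a client-side merge, so the bucket count should stay small.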
Hot spot
[Figure: 1000-node cluster, partition key yyyymmdd]

Hot spot
[Figure: 1000-node cluster, partition key (yyyymmdd, bucket_number)]

Use Case Examples
Top User Scores

[Figure: Game API, Nightly Spark Jobs, Daily Top 10 Users]

Daily Top 10 Users

 handle          | score
-----------------+-------
 subsonic        | 66.2
 neo             | 55.2
 bennybaru       | 49.2
 tigger          | 46.2
 velvetfog       | 45.2
 flashberg       | 43.6
 jbellis         | 43.4
 cafruitbat      | 43.2
 groovemerchant  | 41.2
 rustyrazorblade | 39.2
User Score Table
• After each game, score is stored
• Partition is user + game
• Record timestamp is reversed (last score first)

CREATE TABLE userScores (
userId uuid,
handle text static,
gameId uuid,
score_timestamp timestamp,
score double,
PRIMARY KEY ((userId, gameId), score_timestamp)
) WITH CLUSTERING ORDER BY (score_timestamp DESC);
Top Ten User Scores
• Written by Spark job
• Default TTL = 3 days
• Using Date Tiered Compaction Strategy

CREATE TABLE TopTen (
    gameId uuid,
    process_timestamp timestamp,
    score double,
    userId uuid,
    handle text,
    PRIMARY KEY (gameId, process_timestamp, score)
) WITH CLUSTERING ORDER BY (process_timestamp DESC, score DESC)
AND default_time_to_live = 259200 // 3 days, in seconds
AND COMPACTION = {'class': 'DateTieredCompactionStrategy', 'enabled': 'TRUE'};
DTCS
• Built for time series
• SSTable windows of time ranges
• Compaction grouped by time
• Best for data that shares the same TTL (default TTL)
• Entire SSTables can be dropped
Queries, Yo

SELECT gameId, process_timestamp, score, handle, userId
FROM topten
WHERE gameid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
AND process_timestamp = '2014-12-31 13:42:40';

gameid                               | process_timestamp        | score | handle          | userid
-------------------------------------+--------------------------+-------+-----------------+-------------------------------------
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 66.2  | subsonic        | 99051fe9-6a9c-46c2-b949-38ef78858d07
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 55.2  | neo             | 99051fe9-6a9c-46c2-b949-38ef78858d11
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 49.2  | bennybaru       | 99051fe9-6a9c-46c2-b949-38ef78858d06
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 46.2  | tigger          | 99051fe9-6a9c-46c2-b949-38ef78858d05
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 45.2  | velvetfog       | 99051fe9-6a9c-46c2-b949-38ef78858d04
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.6  | flashberg       | 99051fe9-6a9c-46c2-b949-38ef78858d10
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.4  | jbellis         | 99051fe9-6a9c-46c2-b949-38ef78858d09
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 43.2  | cafruitbat      | 99051fe9-6a9c-46c2-b949-38ef78858d02
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 41.2  | groovemerchant  | 99051fe9-6a9c-46c2-b949-38ef78858d03
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 39.2  | rustyrazorblade | 99051fe9-6a9c-46c2-b949-38ef78858d01
99051fe9-6a9c-46c2-b949-38ef78858dd0 | 2014-12-31 13:42:40-0800 | 20.2  | driftx          | 99051fe9-6a9c-46c2-b949-38ef78858d08
File Storage Use Case

Upload API
It’s all about the model
• Use case
– User creates an account
– User uploads image
– Image is distributed worldwide
– User can check access patterns
• Start with our queries
– All data for an image
– All images over time
– Specific images over a range
– Access times of each image
user Table
• Our standard POJO
• emails are dynamic
CREATE TABLE user (
username text,
firstname text,
lastname text,
emails list<text>,
PRIMARY KEY (username)
);

INSERT INTO user (username, firstname, lastname, emails)
VALUES ('pmcfadin', 'Patrick', 'McFadin', ['patrick@datastax.com', 'patrick.mcfadin@datastax.com'])
IF NOT EXISTS;
image Table
• Basic POJO for an image
• list of tags for potential search
• username is from user table

CREATE TABLE image (
image_id uuid, //Proxy image ID
username text,
created_at timestamp,
image_name text,
image_description text,
tags list<text>, // ? search in Solr ?
images map<text, uuid> , // orig, thumbnail, medium
PRIMARY KEY (image_id)
);
images_timeseries Table
• Time ordered list of images
• Reversed - Last image first
• Map stores versions

CREATE TABLE images_timeseries (
username text,
bucket int, //yyyymm
sequence timestamp,
image_id uuid,
image_name text,
image_description text,
images map<text, uuid>, // orig, thumbnail, medium
PRIMARY KEY ((username, bucket), sequence)
) WITH CLUSTERING ORDER BY (sequence DESC); // reverse clustering on sequence
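To make the yyyymm bucket concrete, here is a small Python sketch of how an application might derive the `bucket` value for a write and enumerate the buckets to read for a time range. The function names are hypothetical, and the sketch assumes the application passes timestamps already normalized to one time zone.

```python
from datetime import datetime


def month_bucket(ts):
    """Return the yyyymm bucket (as an int) for a timestamp."""
    return ts.year * 100 + ts.month


def buckets_between(start, end):
    """List the yyyymm buckets covering start..end, newest first."""
    buckets = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        buckets.append(year * 100 + month)
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return buckets
```

Each bucket returned by `buckets_between` corresponds to one `(username, bucket)` partition, so a range read issues one query per month in the range.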
bucket_index Table
• List of buckets for a user
• Bucket order is reversed
• High reads, no updates. Use LeveledCompaction

CREATE TABLE bucket_index (
    username text,
    bucket int,
    PRIMARY KEY (username, bucket)
) WITH CLUSTERING ORDER BY (bucket DESC) // reverse clustering
AND compaction = {'class': 'LeveledCompactionStrategy'}; // LCS
blob Table
• Main pointer to chunks
• count and checksum for error detection
• Metadata stored alongside as an optimization

CREATE TABLE blob (
object_id uuid, // unique identifier
chunk_count int, // total number of chunks
size int, // total byte size
chunk_size int, // maximum size of the chunks.
checksum text, // optional checksum, this could be stored
// for each blob but only checked on a certain
// percentage of reads
attributes text, // optional text blob for additional json
// encoded attributes
PRIMARY KEY (object_id)
);
blob_chunk Table
• Main data storage table
• Size of blob is up to the client
• Return size for error detection
• Run in parallel!

CREATE TABLE blob_chunk (
object_id uuid, // same as blob.object_id above
chunk_id int, // order for this chunk in the blob
chunk_size int, // size of this chunk, the last chunk
// may be of a different size.
data blob, // the data for this blob chunk
PRIMARY KEY ((object_id, chunk_id))
);
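The client-side half of this design (split, store, reassemble, verify) might look roughly like the Python sketch below. It is an assumption-laden illustration, not the presenter's implementation: the function names are hypothetical, SHA-256 is an arbitrary choice for the optional checksum, and the dicts merely mirror the columns of the blob and blob_chunk tables rather than talking to a cluster.

```python
import hashlib


def split_blob(data, chunk_size):
    """Split bytes into blob_chunk-shaped rows plus one blob-shaped metadata row."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [data[off:off + chunk_size] for off in range(0, len(data), chunk_size)]
    chunks = [
        {"chunk_id": i, "chunk_size": len(piece), "data": piece}
        for i, piece in enumerate(pieces)
    ]
    meta = {
        "chunk_count": len(chunks),
        "size": len(data),
        "chunk_size": chunk_size,  # maximum chunk size; the last chunk may be smaller
        "checksum": hashlib.sha256(data).hexdigest(),  # hash choice is an assumption
    }
    return meta, chunks


def join_blob(meta, chunks):
    """Reassemble chunks fetched in any order and verify count, size, checksum."""
    ordered = sorted(chunks, key=lambda c: c["chunk_id"])
    if len(ordered) != meta["chunk_count"]:
        raise ValueError("missing or extra chunks")
    data = b"".join(c["data"] for c in ordered)
    if len(data) != meta["size"]:
        raise ValueError("size mismatch")
    if hashlib.sha256(data).hexdigest() != meta["checksum"]:
        raise ValueError("checksum mismatch")
    return data
```

Because every chunk lives in its own partition, the chunk reads are independent and can be issued in parallel; `join_blob` accepts them in whatever order they arrive.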
access_log Table
• Classic time series table
• Inserts at CL.ONE
• Read at CL.ONE

CREATE TABLE access_log (
object_id uuid,
access_date text, // YYYYMMDD portion of access timestamp
access_time timestamp, // Access time to the ms
ip_address inet, // x.x.x.x inet address
PRIMARY KEY ((object_id, access_date), access_time, ip_address)
);
Light Weight Transactions
Regular Update

UPDATE videos                                                   // table name
SET name = 'The data model is dead. Long live the data model.'  // fields to update: not in primary key
WHERE videoid = 06049cbb-dfed-421f-b889-5f649a0de1ed;           // primary key

The race is on
T0, Process 1:
SELECT firstName, lastName
FROM users
WHERE username = 'pmcfadin';

(0 rows)

T1, Process 2:
SELECT firstName, lastName
FROM users
WHERE username = 'pmcfadin';

(0 rows)

T2, Process 1: Got nothing! Good to go!
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin', 'Patrick', 'McFadin', ['patrick@datastax.com'],
'ba27e03fd95e507daf2937c937d499ab', '2011-06-20 13:50:00');

T3, Process 2: This one wins
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin', 'Paul', 'McFadin', ['paul@oracle.com'],
'ea24e13ad95a209ded8912e937d499de', '2011-06-20 13:51:00');

Lightweight Transactions

INSERT INTO videos (videoid, name, userid, description, location, location_type, preview_thumbnails, tags, added_date)
VALUES (06049cbb-dfed-421f-b889-5f649a0de1ed, 'The data model is dead. Long live the data model.',
9761d3d7-7fbd-4269-9988-6cfd4e188678,
'First in a three part series for Cassandra Data Modeling', 'http://www.youtube.com/watch?v=px6U2n74q3g', 1,
{'YouTube':'http://www.youtube.com/watch?v=px6U2n74q3g'}, {'cassandra','data model','relational','instruction'},
'2013-05-02 12:30:29')
IF NOT EXISTS; // Don’t overwrite!

Lightweight Transactions

UPDATE videos
SET name = 'The data model is dead. Long live the data model.'
WHERE videoid = 06049cbb-dfed-421f-b889-5f649a0de1ed
IF userid = 9761d3d7-7fbd-4269-9988-6cfd4e188678; // Don’t overwrite!

Solution LWT
Process 1

T0:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin', 'Patrick', 'McFadin', ['patrick@datastax.com'],
'ba27e03fd95e507daf2937c937d499ab', '2011-06-20 13:50:00')
IF NOT EXISTS;

T1:
 [applied]
-----------
 True

• Check performed for record
• Paxos ensures exclusive access
• applied = true: Success

Solution LWT
Process 2

T2:
INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('pmcfadin', 'Paul', 'McFadin', ['paul@oracle.com'],
'ea24e13ad95a209ded8912e937d499de', '2011-06-20 13:51:00')
IF NOT EXISTS;

T3:
 [applied] | username | created_date             | firstname | lastname
-----------+----------+--------------------------+-----------+----------
 False     | pmcfadin | 2011-06-20 13:50:00-0700 | Patrick   | McFadin

• applied = false: Rejected
• No record stomping!
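The difference between the plain INSERT race and the IF NOT EXISTS version can be imitated in memory. The Python sketch below is a toy model only: the class `UsersTable` is hypothetical, and a single lock stands in for the exclusive access that Paxos provides across replicas, which is not how Cassandra implements it. What it does show is the contract seen by the client: an applied flag, plus the existing row when the write is rejected.

```python
import threading


class UsersTable:
    """In-memory stand-in for a users table keyed by username."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stand-in for Paxos; an assumption of this toy

    def insert(self, username, **columns):
        """Plain INSERT: the last write silently wins."""
        with self._lock:
            self._rows[username] = {**columns, "username": username}

    def insert_if_not_exists(self, username, **columns):
        """Return (applied, existing_row), mirroring the [applied] result column."""
        with self._lock:
            existing = self._rows.get(username)
            if existing is not None:
                return False, dict(existing)
            self._rows[username] = {**columns, "username": username}
            return True, None

    def get(self, username):
        with self._lock:
            row = self._rows.get(username)
            return dict(row) if row is not None else None
```

With `insert_if_not_exists`, the check and the write happen as one step, so the second process learns that the username is taken instead of overwriting the first record.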
Lightweight Transactions

// No-op if the table already exists. Doesn’t throw an error
CREATE TABLE IF NOT EXISTS videos_by_tag (
    tag text,
    videoid uuid,
    added_date timestamp,
    name text,
    preview_image_location text,
    tagged_date timestamp,
    PRIMARY KEY (tag, videoid)
);
User Defined Types

• Complex data in one place
• No multi-gets (multi-partitions)
• Nesting!

CREATE TYPE address (
    street text,
    city text,
    zip_code int,
    country text,
    cross_streets set<text>
);
Before
CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

CREATE TABLE video_metadata (
    video_id uuid PRIMARY KEY,
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);

In-application join:

SELECT *
FROM videos
WHERE videoId = 2; // 2 stands in for the video's uuid

SELECT *
FROM video_metadata
WHERE video_id = 2;

[Figure: the joined result. Title: Introduction to Apache Cassandra. Description: A one hour talk on everything you need to know about a totally amazing database. Playback rate: 480, 720]
After
• Now video_metadata is embedded in videos

CREATE TYPE video_metadata (
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    metadata set <frozen<video_metadata>>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);
Wait! Frozen??
• Staying out of technical debt
• 3.0 UDTs will not have to be frozen
• Applicable to User Defined Types and Tuples

Do you want to build a schema?
Do you want to store some JSON?
Let’s store some JSON
{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": {
    "units": "inches",
    "length": 50.0,
    "width": 66.0,
    "height": 32
  },
  "categories": [
    {
      "category": "Home Furnishings",
      "catalogPage": 45,
      "url": "/home/furnishings"
    },
    {
      "category": "Kitchen Furnishings",
      "catalogPage": 108,
      "url": "/kitchen/furnishings"
    }
  ]
}
Let’s store some JSON

CREATE TYPE dimensions (
    units text,
    length float,
    width float,
    height float
);

CREATE TYPE category (
    catalogPage int,
    url text
);

CREATE TABLE product (
    productId int,
    name text,
    price float,
    description text,
    dimensions frozen <dimensions>,
    categories map <text, frozen <category>>,
    PRIMARY KEY (productId)
);
Let’s store some JSON
INSERT INTO product (productId, name, price, description, dimensions, categories)
VALUES (2, 'Kitchen Table', 249.99, 'Rectangular table with oak finish',
    // dimensions frozen <dimensions>
    {
        units: 'inches',
        length: 50.0,
        width: 66.0,
        height: 32
    },
    // categories map <text, frozen <category>>
    {
        'Home Furnishings': {
            catalogPage: 45,
            url: '/home/furnishings'
        },
        'Kitchen Furnishings': {
            catalogPage: 108,
            url: '/kitchen/furnishings'
        }
    }
);
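On the application side, the JSON document has to be reshaped to match those column types before it is bound to the INSERT. The Python sketch below shows one way that mapping could look. It assumes the categories arrive as a list of objects that each carry a "category" name, which is an interpretation of the slide's JSON, and `to_product_row` is a hypothetical helper rather than part of any driver.

```python
import json

PRODUCT_JSON = """
{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": {"units": "inches", "length": 50.0, "width": 66.0, "height": 32},
  "categories": [
    {"category": "Home Furnishings", "catalogPage": 45, "url": "/home/furnishings"},
    {"category": "Kitchen Furnishings", "catalogPage": 108, "url": "/kitchen/furnishings"}
  ]
}
"""


def to_product_row(doc):
    """Reshape a parsed JSON product into values matching the product table."""
    dims = doc["dimensions"]
    return {
        "productId": int(doc["productId"]),
        "name": doc["name"],
        "price": float(doc["price"]),
        "description": doc["description"],
        # Shape of the dimensions UDT: one text field and three floats.
        "dimensions": {
            "units": dims["units"],
            "length": float(dims["length"]),
            "width": float(dims["width"]),
            "height": float(dims["height"]),
        },
        # Shape of map<text, frozen<category>>: category name -> category UDT.
        "categories": {
            c["category"]: {"catalogPage": int(c["catalogPage"]), "url": c["url"]}
            for c in doc["categories"]
        },
    }


row = to_product_row(json.loads(PRODUCT_JSON))
```

Keying the map by category name means the name is stored once as the map key instead of being repeated inside each category value.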
Retrieving fields
Aggregates

• Built-in: avg, min, max, count(<column name>)
• Runs on server
• Always use with partition key

*As of Cassandra 2.2


Materialized Views
• New as of 3.0
• Auto-denormalize your tables
• Not for everything

CREATE TABLE user(
    id int PRIMARY KEY,
    login text,
    firstname text,
    lastname text,
    country text,
    gender int
);

CREATE MATERIALIZED VIEW user_by_country
AS SELECT * //denormalize ALL columns
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);
Materialized Views
INSERT INTO user(id,login,firstname,lastname,country) VALUES(1, 'jdoe', 'John', 'DOE', 'US');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(2, 'hsue', 'Helen', 'SUE', 'US');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(3, 'rsmith', 'Richard', 'SMITH', 'UK');
INSERT INTO user(id,login,firstname,lastname,country) VALUES(4, 'doanduyhai', 'DuyHai', 'DOAN', 'FR');

SELECT * FROM user_by_country;

 country | id | firstname | lastname | login
---------+----+-----------+----------+------------
 FR      | 4  | DuyHai    | DOAN     | doanduyhai
 US      | 1  | John      | DOE      | jdoe
 US      | 2  | Helen     | SUE      | hsue
 UK      | 3  | Richard   | SMITH    | rsmith

SELECT * FROM user_by_country WHERE country='US';

 country | id | firstname | lastname | login
---------+----+-----------+----------+-------
 US      | 1  | John      | DOE      | jdoe
 US      | 2  | Helen     | SUE      | hsue
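What "auto-denormalize" saves the application from doing can be seen in a toy model. The Python sketch below keeps a base table and a view keyed by (country, id) in step on every write, which is the bookkeeping the server takes over with a materialized view. The class and method names are hypothetical, and this in-memory version says nothing about how Cassandra actually propagates view updates.

```python
class UserWithCountryView:
    """Toy base table plus a view keyed by (country, id), kept in sync on write."""

    def __init__(self):
        self.user = {}             # id -> row
        self.user_by_country = {}  # (country, id) -> row

    def upsert(self, user_id, **columns):
        old = self.user.get(user_id, {})
        new = {**old, **columns, "id": user_id}
        # Remove the stale view row if the user was already listed under a country.
        if old.get("country") is not None:
            self.user_by_country.pop((old["country"], user_id), None)
        # Mirror "WHERE country IS NOT NULL": rows without a country stay out.
        if new.get("country") is not None:
            self.user_by_country[(new["country"], user_id)] = new
        self.user[user_id] = new

    def select_by_country(self, country):
        """Rows for one country, ordered by id like the view's clustering column."""
        return [
            row
            for (c, _), row in sorted(self.user_by_country.items())
            if c == country
        ]
```

Changing a user's country moves the row from one view partition to another, which is the delete-then-insert step that is easy to forget when maintaining a denormalized table by hand.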
Thank you!
Bring the questions

Follow me on twitter
@PatrickMcFadin
