
Riak Handbook

Mathias Meyer

Revision 1.1

Table of Contents
Introduction ................................................................................................... 8
Thank You ............................................................................................. 8
How to read the book............................................................................. 9
Feedback ................................................................................................. 9
Code........................................................................................................ 9
Changelog .............................................................................................. 9
CAP Theorem .............................................................................................. 11
The CAP Theorem is Not Absolute ........................................................ 12
Fine-Tuning CAP with Quorums .......................................................... 13
N, R, W, Quorums, Oh My!.................................................................... 13
How Quorums Affect CAP ..................................................................... 14
A Word of CAP Wisdom......................................................................... 15
Further Reading ....................................................................................... 15
Eventual Consistency................................................................................... 15
Consistency in Quorum-Based Systems ................................................. 16
Consistent Hashing ...................................................................................... 16
Sharding and Rehashing........................................................................... 16
A Better Way............................................................................................ 17
Enter Consistent Hashing ........................................................................ 17
Looking up an Object .............................................................................. 19
Problems with Consistent Hashing ......................................................... 20
Dealing with Overload and Data Loss ..................................................... 21
Amazon's Dynamo....................................................................................... 22
Basics......................................................................................................... 22
Virtual Nodes ........................................................................................... 22
Master-less Cluster ................................................................................... 23
Quorum-based Replication ..................................................................... 24
Read Repair and Hinted Handoff ............................................................ 24
Conflict Resolution using Vector Clocks ................................................ 24
Conclusion ............................................................................................... 26
What is Riak?................................................................................................ 27
Riak: Dynamo, And Then Some ................................................................. 27
Installation .................................................................................................... 28
Installing Riak using Binary Packages ..................................................... 28
Talking to Riak............................................................................................. 29
Buckets ..................................................................................................... 29
Fetching Objects ...................................................................................... 29
Creating Objects ...................................................................... 30
Object Metadata ....................................................................... 31
Custom Metadata ..................................................................... 32
Linking Objects........................................................................................ 33
Walking Links.......................................................................................... 34
Walking Nested Links ............................................................................. 35
The Anatomy of a Bucket ........................................................................ 36
List All Of The Keys................................................................................. 37
How Do I Delete All Keys in a Bucket?............................................... 38
How Do I Get the Number of All Keys in a Bucket? .......................... 39
Querying Data ............................................................................................. 39
MapReduce............................................................................................... 40
MapReduce Basics .................................................................................... 41
Mapping Tweet Attributes ...................................................................... 41
Using Reduce to Count Tweets .............................................................. 42
Re-reducing for Great Good ................................................................... 43
Counting all Tweets................................................................................. 44
Chaining Reduce Phases .......................................................................... 44
Parameterizing MapReduce Queries....................................................... 46
Chaining Map Phases ............................................................................... 48
MapReduce in a Riak Cluster................................................................... 48
Efficiency of Buckets as Inputs................................................................. 50
Key Filters................................................................................................. 51
Using Riak's Built-in MapReduce Functions.......................................... 53
Intermission: Riak's Configuration Files ................................................. 54
Errors Running JavaScript MapReduce................................................... 55
Deploying Custom JavaScript Functions ................................................ 56
Using Erlang for MapReduce .................................................................. 57
Writing Custom Erlang MapReduce Functions ................................. 58
On Full-Bucket MapReduce and Key-Filters Performance ................... 61
Querying Data, For Real.............................................................................. 61
Riak Search ............................................................................................... 62
Enabling Riak Search ........................................................................... 62
Indexing Data ....................................................................................... 62
Indexing from the Command-Line................................................. 63
The Anatomy of a Riak Search Document.......................................... 63
Querying from the Command-Line ................................................... 64
Other Command-Line Features ...................................................... 64
The Riak Search Document Schema ....................................................... 64
Analyzers .............................................................................................. 65
Writing Custom Analyzers .................................................. 66
Other Schema Options..................................................... 69
An Example Schema............................................................. 70
Setting the Schema ............................................................................... 72
Indexing Data from Riak ......................................................................... 72
Using the Solr Interface............................................................................ 74
Paginating Search Results .................................................................... 75
Sorting Search Results .......................................................................... 76
Search Operators .................................................................................. 76
Summary of Solr API Search Options.................................................. 79
Summary of the Solr Query Operators ................................................ 80
Indexing Documents using the Solr API ............................................. 81
Deleting Documents using the Solr API ............................................. 82
Using Riak's MapReduce with Riak Search ........................................ 83
The Overhead of Indexing................................................................... 83
Riak Secondary Indexes ........................................................................... 84
Indexing Data with 2i........................................................................... 84
Querying Data with 2i ......................................................................... 86
Using Riak 2i with MapReduce ........................................................... 87
Storing Multiple Index Values ............................................................. 87
Managing Object Associations: Links vs. 2i ........................................ 88
How Does Riak 2i Compare to Riak Search? ...................................... 89
Riak Search vs. Riak 2i vs. MapReduce................................................ 90
How Do I Index Data Already in Riak?................................................... 91
Using Pre- and Post-Commit Hooks ...................................................... 92
Validating Data..................................................................................... 92
Enabling Pre-Commit Hooks ............................................................. 93
Pre-Commit Hooks in Erlang ............................................................. 94
Modifying Data in Pre-Commit Hooks.............................................. 95
Accessing Riak Objects in Commit Hooks ......................................... 97
Enabling Post-Commit Hooks .......................................................... 100
Deploying Custom Erlang Functions................................................ 100
Updating External Sources in Post-Commit Hooks ......................... 102
Riak in its Setting........................................................................................ 102
Building a Cluster................................................................................... 102
Adding a Node to a Riak Cluster ....................................................... 103
Configuring a Riak Node .............................................................. 103
Joining a Cluster ............................................................................. 104
Anatomy of a Riak Node.................................................................... 104
What Happens When a Node Joins a Cluster ................................... 105
Leaving a Cluster................................................................ 105
Eventually Consistent Riak .................................................... 106
Handling Consistency........................................................ 106
Writing with a Non-Default Quorum .......................................... 106
Durable Writes ............................................................................... 107
Primary Writes ............................................................................... 108
Tuning Default-Replication and Quorum Per Bucket................. 108
Choosing the Right N Value ......................................................... 110
Reading with a Non-Default Quorum.......................................... 110
Read-Repair.................................................................................... 111
Modeling Data for Eventual Consistency ................................................. 111
Choosing the Right Data Structures ...................................................... 112
Conflicts in Riak ................................................................................. 115
Siblings............................................................................................ 116
Reconciling Conflicts......................................................................... 117
Modeling Counters and Other Data Structures ................................ 118
Problems with Timestamps for Conflict Resolution ..................... 119
Strategies for Reconciling Conflicts .................................................. 123
Reads Before Writes ....................................................................... 124
Merging Strategies ......................................................................... 124
Sibling Explosion................................................................................ 124
Building a Timeline with Riak .......................................................... 125
Multi-User Timelines..................................................................... 128
Avoiding Infinite Growth.................................................................. 129
Intermission: How to Fetch Multiple Objects in one Request.......... 129
Intermission: Paginating Using MapReduce .................................... 130
Handling Failure .................................................................................... 131
Operating Riak....................................................................................... 132
Choosing a Ring Size ......................................................................... 132
Protocol Buffers vs. HTTP ................................................................ 133
Storage Backends................................................................................ 133
Innostore......................................................................................... 134
Bitcask............................................................................................. 134
LevelDB.......................................................................................... 135
Load-Balancing Riak ......................................................................... 136
Placing Riak Nodes across a Network ............................................... 138
Monitoring Riak................................................................................. 140
Request Times ................................................................................ 141
Number of Requests ....................................................................... 142
Read Repairs, Object Size, Siblings................................................ 143
Monitoring 2i ................................................................. 144
Miscellany ....................................................................... 144
Monitoring Reference.................................................... 144
Managing a Riak Cluster with Riak Control..................................... 147
Enabling Riak Control ................................................................... 147
Intermission: Generating an SSL Certificate ................................. 148
Riak Control Cluster Overview..................................................... 149
Managing Nodes with Riak Control ............................................. 150
Managing the Ring with Riak Control ......................................... 151
To Be Continued............................................................................ 152
When To Riak? .......................................................................................... 152
Riak Use Cases in Detail......................................................................... 153
Using Riak for File Storage ................................................................ 153
File Storage Access Patterns ........................................................... 154
Object Size...................................................................................... 154
Storing Large Files in Riak ............................................................. 155
Riak Cloud Storage ........................................................................ 155
Using Riak to Store Logs.................................................................... 156
Modeling Log Records................................................................... 157
Logging Access Patterns ................................................................ 157
Indexing Log Data for Efficient Access ......................................... 158
Secondary Index Ranges as Key Filter Replacement ..................... 159
Searching Logs ............................................................................... 160
Riak for Log Storage in the Wild ................................................... 161
Deleting Historical Data ................................................................ 161
What about Analytics? ................................................................... 162
Session Storage ................................................................................... 162
Modeling Session Data ................................................................... 163
Session Storage Access Patterns...................................................... 164
Bringing Session Data Closer to Users .......................................... 164
URL Shortener ................................................................................... 164
URL Shortening Access Patterns ................................................... 165
Modeling Data................................................................................ 165
Riak URL Shortening in the Wild ................................................. 165
Where to go from here............................................................................... 165

Introduction
I first heard about Riak in September 2009, right after it was unveiled to the
public, at one of the early events around NoSQL in Berlin. I tip my hat to
Martin Scholl for introducing the attendees (myself included) to this new
database. It's distributed, written in Erlang, supports JSON, and MapReduce.
That's all we needed to know.
Riak fascinated me right from the beginning. Its roots in Amazon's Dynamo
and its distributed nature were intriguing. It has been fun to watch it develop
since then; it's been more than two years now.
Over that time, Riak went from a simple key-value store you can use to
reliably store sessions to a full-blown database with lots of bells and whistles.
I was more and more intrigued, and started playing with it more, diving into
its feature set and into Dynamo too.
Add to that the friendly Basho folks, makers of Riak, whom I had the great
pleasure of meeting a few times and even working with.
But something was missing. Every database should have a book dedicated to
it. I never thought that it would even be possible to write a whole book about
Riak, let alone that I would be the one to write it, yet here we are.
What you're looking at is my collective brain dump on all things Riak,
covering everything from basic usage, by way of MapReduce, full-text
search and indexing data, to advanced topics like modeling data to fit in well
with Riak's eventually consistent distribution model.
So here we are, I hope you'll enjoy what you're about to read as much as I
enjoyed writing it.
This is a one-man operation, please respect the time and effort that went into
this book. If you came by a free copy and find it useful, please buy the book.

Thank You
This book wouldn't be here, on your screen, without the help and support of
quite a few people. To be honest, I was surprised how much work goes into
a book, and how many people are more than willing to help you finish it. For
that I am incredibly grateful.
First and foremost I want to thank my wife Jördis, who not only was very
supportive, but also helped a great deal by doing all the design work in and
around the book, the cover, the illustrations, and the website. She gave me
that extra push when I needed it. My daughter Mari was supportive in her
very own way, probably without realizing it, but supportive nonetheless.
She was great to have around when writing this book.
Thank you so very much to everyone who reviewed the initial and advanced
versions of the book, devoting their valuable time to giving invaluable
feedback. You never realize until later how many typos you end up creating.
Thank you for your great feedback, for tirelessly answering my questions,
and for all the support you guys gave me: Florian Ebeling, Eric Lindvall, Till
Klampäckel, Steve Vinoski, Russell Brown, Sean Cribbs, Reid Draper, Ryan
Zezeski, John Vincent, Rick Olson, Corey Donohoe, Mark Philips, Ralph
von der Heyden, Patrick Hüsler, Robin Mehner, Stefan Schmidt, Kelly
McLaughlin, Brian Shumate, Jeremiah Peschka, Marc Heiligers. I bow to
you!

How to read the book


Start at the front, read the book all the way to the back.

Feedback
If you think you found a typo, have some suggestions to make for things
you think are missing and whatnot, or generally would like to say hi, send
an email to feedback@riakhandbook.com. Be sure to include the revision
you're referring to, it's printed on the second page.

Code
This book includes a lot of code, but only in small chunks, easy to grasp.
There are only two listings in the entire book that stretch close to a page.
Most of the code doesn't build on top of each other but tries to stand alone,
though there's the occasional assumption that some piece of code has been
run at some point. What was worth breaking out into small programs or
what would require tedious copy and paste has been moved into a code
repository that accompanies this book. You can find it on GitHub.

Changelog
Version 1.1
Added a section on load balancing
Added a section on network placement of Riak nodes
Added a section on monitoring

Added a section on storing multiple index values and using 2i to manage object relationships
Fixed code examples in the ePub and Kindle versions
Added a section on Riak Control
Added a section on pre- and post-commit hooks
Added a section on deploying custom Erlang code
Added a section describing an issue that may come up when running
JavaScript MapReduce requests
Added a section on Riak use cases explained in detail. Includes file
storage, log storage, session storage, and URL shortening.
Added a section explaining primary writes
The book is now included as a man page for easy reading and searching
on the command line.

CAP Theorem
CAP is an abbreviation for consistency, availability, and partition tolerance.
The basic idea is that in a distributed system, you can have only two of these
properties, but not all three at once. Let's look at what each property means.
Consistency
Data access in a distributed database is considered to be consistent when
an update written on one node is immediately available on another node.
Traditional ways to achieve this in relational database systems are
distributed transactions. A write operation is only successful when it's
written to a master and at least one slave, or even all nodes in the system.
Every subsequent read on any node will always return the data written by
the update on all nodes.
Availability
The system guarantees availability for requests even though one or more
nodes are down. For any database with just one node, this is impossible
to achieve. Even when you add slaves to one master database, there's still
the risk of unavailability when the master goes down. The system can still
return data for reads, but can't accept writes until the master comes back
up. To achieve availability data in a cluster must be replicated to a number
of nodes, and every node must be ready to claim master status at any time,
with the cluster automatically rebalancing the data set.
Partition Tolerance
Nodes can be physically separated from each other at any given point and
for any length of time. The time they're not able to reach each other,
due to routing problems, network interface troubles, or firewall issues, is
called a network partition. During the partition, all nodes should still be
able to serve both read and write requests. Ideally the system automatically
reconciles updates as soon as every node can reach every other node again.
Given features like distributed transactions it's easy to describe consistency
as the prime property of relational databases. Think about it though, in a
master-slave setup data is usually replicated down to slaves in a lazy manner.
Unless your database supports it (like the semi-synchronous replication in
MySQL 5.5) and you enable it explicitly, there's no guarantee that a write
to the master will be immediately visible on a slave. It can take crucial
milliseconds for the data to show up, and your application needs to be able
to handle that. Unless of course, you've chosen to ignore the potential
inconsistency, which is fair enough, I'm certainly guilty of having done that
myself in the past.
While Brewer's original description of CAP was more of a conjecture, by
now it's accepted and proven that a distributed database system can only
allow for two of the three properties. For example, it's considered impossible
for a database system to offer both full consistency and 100% availability
at the same time, there will always be trade-offs involved. That is, until
someone finds the universal cure against network partitions, network
latency, and all the other problems computers and networks face.

The CAP Theorem is Not Absolute


While consistency and availability certainly aren't particularly friendly with
each other, they should be considered tuning knobs instead of binary
switches. You can have some of one and some of the other. This approach
has been adopted by quorum-based, distributed databases.
A quorum is the minimum number of parties that need to be successfully
involved in an operation for it to be considered successful as a whole. In
real life it can be compared to votes to make decisions in a democracy,
only applied to distributed systems. By distributed systems I'm referring to
systems that use more than one computer, a node, to get a job done. A job
can be many things, but in our case we're dealing with storing a piece of data.
Every node in a cluster gets a vote, and the number of required votes can
be specified for the system as a whole, and for every operation separately. If
the latter isn't specified, a sensible default is chosen based on a configured
consensus, a path that oftentimes is not successfully applied to a democracy.
In the world of quorum database systems, every piece of data is replicated to a
number of nodes in a cluster. This number is specified using a value called N.
It represents a default for the whole cluster, and can be tuned for every read
and write operation.
Consider a cluster with five nodes and an N value of 3. The N value is the
number of replicas, and you can tune every operation with a quorum, which
determines the number of nodes that are required for that operation to be
successful.

Fine-Tuning CAP with Quorums


When you write (that is, update or create) a piece of data, you can specify a
value W. With W you can specify how many replicas the write must go to
for it to be considered successful. That number can't be higher than N.
The W value is a tuning knob for consistency and availability. The higher
you pick your W, the more consistent the written data will be across all
replicas, but an operation may fail because some nodes are currently
unreachable or down due to maintenance.
Lowering the W value will affect consistency of the data set, as a subsequent
read on a different replica is not guaranteed to return the updated data.
Choosing a higher W also affects speed. The more nodes need to be involved
for a single write the more network latency is involved. A lower W involves
fewer nodes, so it will take some time for a write to propagate to the replicas
not affected by the quorum. Operations can be parallelized for speed but an
operation is still only as fast as its slowest link. Subsequent reads on other
replicas may return an outdated value. When I say time, I'm talking
milliseconds, but in an application with quickly-changing data, that may still
be a factor.
For reads, the value is called R. The first R nodes to return the requested value
make up the read quorum. The higher the R value, the more nodes need to
return the same value for the read to be considered successful.
Again, choosing a higher value for R affects performance, but offers a
stronger consistency level. Even with a low W value, a high R value can
force the cluster to reconcile outdated pieces of data so that they're
consistent. That way, there are no situations where a read will return
outdated information. It's a trade-off between low write consistency and
high read consistency. Choosing a lower R makes a read less prone to
availability issues and lowers read latency. The optimum for consistency lies
in choosing R and W in a way that R + W > N. That way, data will always
be kept consistent.
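To make the trade-off concrete, here's a tiny sketch in plain JavaScript. It has nothing to do with Riak's actual implementation, and the function name is made up for illustration; it simply checks whether a given combination of N, R, and W yields overlapping read and write quorums.

// Illustrative only: checks whether read and write quorums overlap.
// R + W > N means at least one replica takes part in both quorums,
// so a successful read is guaranteed to see the latest successful write.
function quorumsOverlap(n, r, w) {
  return r + w > n
}

console.log(quorumsOverlap(3, 2, 2)) // true: overlapping quorums
console.log(quorumsOverlap(3, 1, 1)) // false: reads may return stale data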

N, R, W, Quorums, Oh My!
In the real world, it will depend on your particular use case which N, W,
and R values you're going to pick. Need high insert and update speed? Pick a
low W and maybe a higher R value. Care about consistent reads and a bit less
about increased read latency? Pick a high R. If speed is all you're after in reads
and writes, but you still want to have data replicated for availability, pick a
low W and R value, but an N of 3 or higher. Apart from the N value, the
other quorums are not written in stone, they can be tuned for every read and
write operation separately.
In a paper that takes a more detailed look at Brewer's conjecture, Gilbert and
Lynch quite fittingly state that in the real world, most systems have settled on
getting "most of the data, most of the time." You will see how this works out
in the practical part of this book.

How Quorums Affect CAP


As you can see, a quorum offers a way to fine-tune both availability and
consistency. You can pick a level that is pretty loose on both ends, making
the whole cluster less prone to availability issues. Lower values also tune
consistency to a level where the application and the user are more likely to
be affected by outdated replicas. You can increase both to the same level, or
use a low W and a high R for high speed and high consistency, but you'll be
subject to higher read latency.
Quorums allow fine-tuning partition tolerance against consistency and
availability. With a higher quorum, you increase consistency but sacrifice
availability, as more nodes are required to participate in an operation. If
one replica required to win the quorum is not available, the operation fails.
A more fitting name for this is yield, the percentage of requests answered
successfully, coined by Brewer in a follow-up paper on CAP.
With a lower quorum, you increase availability but lower your consistency
expectations. You accept that a response may not include all the data, that
your harvest varies. Harvest measures the completeness of a response by
looking at the percentage of data included.
Both scenarios have different trade-offs, but both are means to fine-tune
partition tolerance. The lower expectations an application has on yield or
harvest, the more resilient it is to network partitions, and the lower the
expectations towards consistency during normal operations. Which
combination you pick depends on your use case, there is no one true
combination of values.
Tuning both up to 100% means a distributed system is not tolerant to
partitions, as they'd result in either a decreased yield or decreased harvest, or
maybe even a combination of both. As Coda Hale put it: "You can't sacrifice
partition tolerance."

A Word of CAP Wisdom


While CAP is something I think you should be aware of, it's not worth
wasting time fighting over which database falls into which category. What
matters is how every database works in reality, how it works for your use
cases, and what your requirements are. Collect assumptions and
requirements, and compare them to what a database you're interested has
to offer. It's simple like that. What particular attributes of CAP it chose in
theory is less important than that.

Further Reading
To dive deeper into the ideas behind CAP, read Seth Gilbert's and Nancy
Lynch's dissection of Brewer's original conjecture. They're doing a great
job of proving the correctness of CAP, all the while investigating alternative
models, trying to find a sweet spot for all three properties along the way.
Julian Browne wrote a more illustrated explanation on CAP, going as far as
comparing the coinage of CAP to the creation of punk rock, something I can
certainly get on board with. Coda Hale recently wrote an update on CAP,
which is a lot less formal and aims towards practical applicability, a highly
recommended read. And last but not least, you can peek at Brewer's original
slides too.
Daniel Abadi brings up some interesting points regarding CAP, arguing
that CAP should consider latency as well. Eric Brewer and Armando Fox
followed up the CAP discussion with a paper on harvest and yield, which is
also worth your while, as it argues for a need of a weaker version of CAP.
One that focuses on dialing down one property while increasing another
instead of considering them binary switches.

Eventual Consistency
In the last chapter we already talked about updates that are not immediately
propagated to all replicas in a cluster. That can have lots of reasons, one being
the chosen R or W value, while others may involve network partitions,
making parts of the cluster unreachable or increasing latency. In other
scenarios, you may have a database running on your laptop, which
constantly synchronizes data with another node on a remote server. Or you
have a master-slave setup for a MySQL or PostgreSQL database, where all
writes go to a master, and subsequent reads only go to the slave. In this
scenario the master will first accept the write and then populates it to a
number of slaves, which takes time. We're usually talking about a couple of
milliseconds, but as you never know what happens, it could end up being
hours. Sound familiar? It's what DNS does, a system you deal with almost
every day.

Consistency in Quorum-Based Systems


In a truly distributed environment, and when writes involve quorums, you
can tune how many nodes need to have successfully accepted a write so that
the operation as a whole is a success. If you choose a W value less than the
number of replicas, the remaining replicas that were not involved in the
write will receive the data eventually. Again, we're talking milliseconds in
common cases, but it can be a noticeable lag, and your application should be
ready to deal with cases like that.
In every scenario, the common thing is that a write will reach all the relevant
nodes eventually, so that all nodes have the same data. It will take some
time, but eventually the data in the whole cluster will be consistent for this
particular piece of data, even after network partitions. Hence the name
eventual consistency. Once again it's not really a specific feature of NoSQL
databases, every time you have a setup involving masters and slaves, eventual
consistency will strike with furious anger.
The term was originally coined by Werner Vogels, Amazon's CTO, in
2007. The paper he wrote about it is well worth reading. Being the biggest
e-commerce site out there, Amazon had a big influence on a whole slew of
databases.

Consistent Hashing
The invention of consistent hashing is one of these things that only happen
once a century. At least that's how Andy Gross from Basho Technologies
likes to think about it. When you deal with a distributed database
environment and have to deal with an elastic cluster, where nodes come and
go, I'm pretty sure you'll agree with him. But before we delve into detail, let's
have a look at how data distribution is usually done in a cluster of databases
or cache farms.

Sharding and Rehashing


Relational databases, or even just the caches you put in between your
application and your database, don't really have a way to rebalance a cluster
automatically as nodes come and go. In traditional setups you either had a
collection of masters synchronizing data with each other, with you sitting on
the other end, hoping that they never get out of sync (which they will). Or
you started sharding your data.
In a sharded setup, you split up your dataset using a predefined key. The
simplest version of that could be to simply use the primary key of any table as
your shard key. Using modulo math you calculate the modulo of the key and
the number of shards (i.e. nodes) in the cluster. So the key of 103 in a cluster
of 5 nodes would go to the fourth node, as 103 % 5 = 3. This is the simplest
way of sharding.
To get a bit more fancy, add a hash function, which is applied to the shard
key. Like before, calculate the modulo of the result and the number of
servers. The problems start when you want to add a new node. Almost all of
the data needs to be moved to another server, because the modulo needs to
be recalculated for every record, and the result is very likely to be different,
in fact, it's about N / (N + 1) likely to be different, with N being the number
of nodes currently being in the cluster. For going from three to four nodes
that's 75% of data affected, from four to five nodes it's 80%. The result gets
worse as you add more nodes.
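If you want to see those numbers for yourself, here's a short throwaway script, plain Node.js with nothing Riak-specific and a made-up function name, that counts how many keys end up on a different node after growing a modulo-sharded cluster by one.

// Illustrative only: fraction of keys that map to a different node when a
// modulo-sharded cluster grows from `nodes` to `nodes + 1`.
function movedFraction(totalKeys, nodes) {
  var moved = 0
  for (var key = 0; key < totalKeys; key++) {
    if (key % nodes !== key % (nodes + 1)) moved++
  }
  return moved / totalKeys
}

console.log(movedFraction(100000, 3)) // roughly 0.75, three quarters of the keys move
console.log(movedFraction(100000, 4)) // roughly 0.8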
Not only is that a very expensive operation, it also defeats the purpose of
adding new nodes, because for a while your cluster will be mostly busy
shuffling data around, when it should really deliver that data to your
customers.

A Better Way
As you will surely agree, this doesn't pan out too well in a production system.
It works, but it's not great.
In the late nineties Akamai needed a way to increase and decrease caching
capacity on demand without having to go through a full rebalancing process
every time. Sounds like the scenario I just described, doesn't it? They needed
it for caches, but it's easily applicable to databases too. The result is called
consistent hashing, and it's a technique that's so beautifully simple yet
incredibly efficient in avoiding moving unnecessary amounts of data
around, it blows my mind every time anew.

Enter Consistent Hashing


To understand consistent hashing, stop thinking of your data's keys as an
infinite stream of integers. Consistent hashing's basic idea is to turn that
stream into a ring that starts with 0 and ends with a number like 2^64, leaving
room for plenty of keys in between. No really, that's a lot. Of course the
actual ring size depends on the hash function you're using. To use SHA1 for
example, the ring must have a size of 2^160. The keys are ordered counterclockwise, starting at 0, ending at 2^160 and then folding over again.

The Ring.

Consistent hashing, as the name suggests, uses a hash function to determine
where an object belongs on the ring with a given key. Other than with the
modulus approach, the key is simply mapped onto the ring using its integer
representation.

Mapping a key to the ring using a hash function.

When a node joins the cluster, it picks a random key on the ring. The node
will then be responsible for handling all data between this and the next key
chosen by a different node. If there's only one node, it will be responsible for
all the keys in the ring.

One node responsible for the entire ring.

Add another node, and it will once again pick a random key on the ring. All
it needs to do now is fetch the data between this key and the one picked by
the first node.
The ring is therefore sliced into what is generally called partitions. If a pizza
slice is a nicer image to you, it works as well. The difference is though, that
with a pizza everyone loves to have the biggest slice, while in a database
environment having that slice could kill you.
Now add a third node and it needs to transfer even less data because the
partitions created by the randomly picked keys on the ring get smaller and
smaller as you add more nodes. See where this is going? Suddenly we're
shuffling around much less data than with traditional sharding. Sure, we're
still shuffling, but somehow data has to be moved around, there's no
avoiding that part. You can only try and reduce the time and effort needed
to shuffle it.

Looking up an Object
When a client goes to fetch an object stored in the cluster, it needs to be
aware of the cluster structure and the partitions created in it. It uses the same
hash function as the cluster to choose the correct partition and therefore the
correct physical node the object resides on.
To do that, it hashes the key and then walks clockwise until it finds a key
that's mapped to a node, which will be the key the node randomly picked
when it joined the cluster. Say, your key hashes to the value 1234, and you
have two nodes in the cluster, one claiming the key space from 0 to 1023, the
other claiming the space from 1024 to 2048. Yes, that's indeed a rather small
key space, but much better suited to illustrate the example.
To find the node responsible for the data, you go clockwise from 1234 to
1024, the next lowest key picked by a node in the cluster, the second node in
our example.
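Here's a bare-bones sketch of that lookup in Node.js. It's purely illustrative, not Riak's actual ring code (which is Erlang and far more involved): it uses SHA1 from Node's crypto module and follows the convention from the example above, where the responsible node is the one that picked the next lowest key on the ring.

var crypto = require('crypto')

// Map a key to a position on a ring of the given size.
function ringPosition(key, ringSize) {
  var digest = crypto.createHash('sha1').update(String(key)).digest('hex')
  return parseInt(digest.slice(0, 8), 16) % ringSize
}

// The responsible node is the one that picked the next lowest key on the
// ring, wrapping around when we pass zero.
function responsibleNode(position, nodeKeys) {
  var sorted = nodeKeys.slice().sort(function(a, b) { return b - a })
  for (var i = 0; i < sorted.length; i++) {
    if (sorted[i] <= position) return sorted[i]
  }
  return sorted[0]
}

// Two nodes picked the keys 0 and 1024 on a ring of size 2048.
console.log(responsibleNode(1234, [0, 1024])) // 1024
console.log(responsibleNode(ringPosition('41399579391950848', 2048), [0, 1024]))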

Problems with Consistent Hashing


Even though consistent hashing itself is rather ingeniously simple, the
randomness of it can cause problems if applied as the only technique,
especially in smaller clusters.
As each node picks a random key, there's no guarantee how close together or
far apart the nodes really are on the ring. One may end up with only a million
keys, while the other has to carry the weight of all the remaining keys. That
can turn into a problem with load, one node gets swamped with requests
for the majority of keys, while the other idles around desperately waiting for
client requests to serve.

Two nodes in a ring, one with fewer keys than the other.


Also, when a node goes down, due to hardware failure, a network partition,
who knows what's going to happen in production, there is still the question
of what happens to the data that it was responsible for. The solution once
again is rather simple.

Dealing with Overload and Data Loss


Every node that joins the cluster not only grabs its own slice of the ring, it
also becomes responsible for a number of slices from other nodes, it turns
into a replica of their data. It now not only serves requests for its own data, it
can also serve clients asking for data originally claimed by other nodes in the
cluster.
This simple concept is called a virtual node and solves two problems at once.
It helps to spread request load evenly across the cluster, as more nodes are able
to serve any given request, increasing capacity as you add more nodes. It also
helps to reduce the risk of losing data by replicating it throughout the cluster.
Some databases and commercial concepts take consistent hashing even
further to reduce the potential of overloading and uneven spread of data, an
idea first adopted (as far as I know of) by Amazon's Dynamo database. We'll
look into the details in the next chapter.

Amazon's Dynamo
One of the more influential products and papers in the field has been
Amazon's Dynamo, responsible for, among other things, storing your
shopping cart. It takes concepts like eventual consistency, consistent
hashing, and the CAP theorem, and slaps a couple of niceties on top. The
result is a distributed, fault-tolerant, and highly available data store.

Basics
Dynamo is meant to be easily scalable in a linear fashion by adding and
removing nodes, to be fully fault-tolerant, highly available and redundant.
The goal was for it to survive network partitions and be easily replaceable
even across data centers.
All these requirements stemmed from actual business requirements, so either
way, it pays off to read the full paper to see how certain features relate to
production use cases Amazon has.
Dynamo is an accumulation of techniques and technologies, thrown
together to offer just what Amazon wanted for some of their business use
cases. Let's go through the most important ones, most notably virtual nodes,
replication, read repairs, and conflict resolution using vector clocks.

Virtual Nodes
Dynamo takes the idea of consistent hashing and adds virtual nodes to the
mix. We already came across them as a solution to spread load in a cluster
using consistent hashing. Dynamo takes it a step further. When a cluster is
defined, it splits up the ring into equally sized partitions. It's like an evenly
sliced pizza, and the slice size never changes.

A hash ring with equally sized partitions.

The advantage of choosing a partitioning scheme like that is that the ring
setup is known and constant throughout the cluster's life. Whenever a node
joins, it doesn't need to pick a random key, it picks random partitions instead,
therefore avoiding the risk of having partitions that are either too small
or too large for a single node.
Say you have a cluster with 3 nodes and 32 partitions, every node will hold
either 10 or 11 partitions. When you bring a fourth node into the ring, you
will end up with 8 partitions on each node. A partition is hosted by a virtual
node, which is only responsible for that particular slice of the data. As the
cluster grows and shrinks the virtual node may or may not move to other
physical nodes.

Master-less Cluster
No node in a Dynamo cluster is special. Every client can request data from
any node and write data to any node. Every node in the cluster has
knowledge of the partitioning scheme, that is which node in the cluster is
responsible for which partitions.
Whenever a client requests data from a node, that node becomes the
coordinator node, even if it's not holding the requested piece of data. When
the data is stored in a partition on a different node, the coordinator node
simply forwards the requests to the relevant node and returns its response to
the client.

This has the added benefit that clients don't need to know about the way
data is partitioned. They don't need to keep track of a table with partitions
and their respective nodes. They simply ask any node for the data they're
interested in.

Quorum-based Replication
As explained above in the section on consistent hashing, partitioning makes
replicating data quite easy. A physical node may not only hold the data in the
partitions it picked, it will hold a total of up to P / PN * RE partitions, where
P is the number of partitions in the ring, PN the number of physical nodes,
and RE is the number of replicas configured for the cluster.
So if every piece of data is replicated three times across the cluster, a single
physical node in a cluster of four may hold up to 48 virtual nodes, given a ring
of 64 partitions.
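In case you want to double-check that arithmetic, here's the P / PN * RE formula from above as a one-liner; the function name is made up, the math is straight from the text.

// Upper bound of partitions (virtual nodes) a single physical node may host:
// partitions / physical nodes * replicas.
function maxPartitionsPerNode(partitions, physicalNodes, replicas) {
  return Math.ceil(partitions / physicalNodes) * replicas
}

console.log(maxPartitionsPerNode(64, 4, 3)) // 48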
The quorum is the consistency-availability tuning knob in a Dynamo
cluster. Amazon leaves it up to a specific engineering team's preference how
to deal with read and write consistency in their particular setup. As I
mentioned already, it's a setting that's different for every use case.

Read Repair and Hinted Handoff


Read repair is a way to ensure consistency of data between replicas. It's a
passive process that kicks in during a read operation to ensure all replicas
have an up-to-date view of the data.
Hinted handoff is an active process, used to transfer data that has been
collected by other nodes while one or more nodes were down. While the
node is down, others can accept writes for it to ensure availability. When the
node comes back up, the others that collected data send hints to it that they
currently have data that's not theirs to keep.

Conflict Resolution using Vector Clocks


Before you're nodding off with all the theoretical things we're going
through here, let me just finish this part on Dynamo with the way it handles
conflicts. In a distributed database system, a situation can easily arise where
two clients update the same piece of data through two different nodes in the
cluster.
A vector clock is a pair of a server identifier and a version, an initial pair
being assigned to a piece of data the moment it is created. Whenever a client
updates an object it provides the vector clock it's referring to. Let's have a
look at an example.

A simplified view of a vector clock.

When the object is updated the coordinating node adds a new pair with
server identifier and version, so an object's vector clock can grow
significantly over time when it's updated frequently. As long as the path
through the pairs is the same, an update is considered to be a descendant of
the previous one. All of Bob's updates descend from one another.
The fun starts when two different clients update the same objects. Each client
adds a new identifier to the list of pairs, and now there are two different lists
of pairs from each node. We've run into a conflict. We now have two vector
clocks that aren't descendants of each other. Like the conflicts created by
Alice and then Carol in the picture above.
Dynamo doesn't really bother with the conflict, it can simply store both
versions and let the next reading client know that there are multiple versions
that need to be reconciled. Vector clocks can be pretty mind-bending, but
they're actually quite simple. There are two great summaries on the Basho
blog, and Kresten Krab Thorup wrote another one, where he refers to them
as version vectors instead, which actually makes a lot of sense and, I'm sure,
will help you understand vector clocks better.
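To make the descendant relationship a little more tangible, here's a toy model in JavaScript. It is not Riak's internal representation (Riak hands you vector clocks as opaque, encoded values), just the bookkeeping idea, with actors and counters as plain object keys.

// Toy vector clocks as {actor: counter} objects, for illustration only.
// a descends from b if a has seen at least everything b has seen.
function descends(a, b) {
  return Object.keys(b).every(function(actor) {
    return (a[actor] || 0) >= b[actor]
  })
}

function conflicting(a, b) {
  return !descends(a, b) && !descends(b, a)
}

var bobsFirst  = {bob: 1}
var bobsSecond = {bob: 2}           // descends from Bob's first update
var alices     = {bob: 2, alice: 1} // based on Bob's latest version
var carols     = {bob: 2, carol: 1} // made without seeing Alice's update

console.log(descends(bobsSecond, bobsFirst)) // true
console.log(conflicting(alices, carols))     // true: siblings to reconcile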
The basic idea of vector clocks goes way back into the seventies, when Leslie
Lamport wrote a paper on using time and version increments as a means to
restore order in a distributed system. That was in 1978, think about that for
a minute. But it wasn't until 1988 that the idea of vector clocks that include
both time and a secondary means of deriving ordering was published, in a
paper by Colin J. Fidge.

Vector clocks are confusing, no doubt, and you hardly have to deal with their
inner workings. They're just a means for a database to discover conflicting
updates.

Conclusion
Dynamo throws quite a punch, don't you agree? It's a great collection of
different algorithms and technologies, brought together to solve real life
problems. Even though it's a lot to take in, you'll find that it influenced
a good bunch of databases in the NoSQL field and is referenced or cited
equally often.
There have been several open source implementations, namely Dynomite
(abandoned these days due to copyright issues, but the first open source
Dynamo clone), Project Voldemort, and Riak. Cassandra also drew some
inspiration from it.

What is Riak?
Riak does one thing, and one thing really well: it ensures data availability
in the face of system or network failure: as long as it has even the slightest
chance of serving a piece of data available to it, it will, even if parts of the
whole dataset are missing temporarily.
At the very core, Riak is an implementation of Amazon's Dynamo, made by
the smart folks from Basho. The basic way to store data is by specifying a
key and a value for it. Simple as that. A Riak cluster can scale in a linear and
predictable fashion, because adding more nodes increases capacity thanks to
consistent hashing and replication. Throw on top the whole shebang of fault
tolerance, no special nodes, and boom, there's Riak.
A value stored with a key can be anything, Riak is pretty agnostic, but
you're well advised to provide a proper content type for what you're storing.
To no-one's surprise, for any reasonably structured data, using JSON is
recommended.

Riak: Dynamo, And Then Some


There's more to Riak than meets the eye though. Over time, the folks at
Basho added some neat features on top. One of the first things they added
was the ability to have links between objects stored in Riak, to have a simpler
way to navigate an association graph without having to know all the keys
involved.
Another noteworthy feature is MapReduce, which has traditionally been the
preferred way to query data in Riak, based for example, on the attributes of
an object. Riak utilizes JavaScript, though if you're feeling adventurous you
can also use Erlang to write MapReduce functions. As a means of indexing
and querying data, Riak offers full-text search and secondary indexes.
There are two ways I'm referring to Riak. Usually when I say Riak, I'm
talking about the system as a whole. But when I mention Riak KV, I'm
talking about Riak the key-value store (the original Riak if you will). Riak's
feature set has grown beyond just storing keys and values. We're looking
at the basic feature set of Riak KV first, and then we'll look at things that
were added over time, such as MapReduce, full-text search, and secondary
indexes.

Installation
While you can use Homebrew and a simple brew install riak to install
Riak, you can also use one of the binary packages provided by Basho. Riak
requires Erlang R14B03 or newer, but using the binary packages or
Homebrew, that's already taken care of for you. As of this writing, 1.1.2
is the most recent version, and we'll stick to its feature set. Be aware that
Riak doesn't run on Windows, so you'll need some flavor of Unix to make it
through this book.
When properly installed and started using riak start, it should be up
and running on port 8098, and you should be able to run the following
command and get a response from Riak.
$ curl localhost:8098/riak
{}

While you're at it, install Node.js as well. We'll talk to Riak using Node.js
and the riak-js library, a nice and clean asynchronous library for Riak, while
we peek under the covers to figure out exactly what's going on.
Running npm install http://nosql-handbook.s3.amazonaws.com/pkg/
riak-js-7d3b8bbf.tar.gz installs the latest version of riak-js (we're using
the custom version as it includes some important fixes). After you're done,
you should be able to start a Node shell by running the command node and
executing the line below without causing any errors.
var riak = require('riak-js').getClient()

As we work our way through its feature set we'll store tweets in Riak. First
we'll just use the tweet's identifier to reference tweets, then we'll dig deeper
and store tweets per user, making them searchable along the way.

Installing Riak using Binary Packages


Riak is known to be easy to handle from an operational perspective. That
includes the installation process too. Basho provides a bunch of binary
packages for common systems like Debian, Ubuntu, RedHat, and Solaris.
All of them neatly include the Erlang distribution required to run Riak, so
you don't have to install anything other than the package itself. That saves
you the trouble of dealing with Linux distributions that come with outdated
versions of Erlang. Which is most of them, really.

So when you're on Ubuntu or Debian, simply download the .deb file and
install it using dpkg.
$ wget downloads.basho.com/riak/riak-1.1.2/riak_1.1.2-1_amd64.deb
$ dpkg -i riak_1.1.2-1_amd64.deb

Now you can start Riak using the provided init script.
$ sudo /etc/init.d/riak start

The procedures are pretty similar, no matter if you're on Ubuntu, Debian,
RedHat, or Solaris. The beauty of this holistic approach to packaging Riak is
that it's easy to automate.

Talking to Riak
The easiest way to become friends with Riak is to use its HTTP interface.
Later, in production, you're more likely to turn to the Protocol Buffers
interface for better performance and throughput, but HTTP is just a nice and
visual way to explore the things you can do with Riak.
Riak's HTTP implementation is as RESTful as it gets. Important details
(links, vector clocks, modification times, ETags, etc.) are nicely exposed
through proper HTTP headers, and Riak utilizes multi-part responses where
applicable.

Buckets
Beyond keys and values, Riak divides data into buckets. A bucket is
nothing more than a way to logically separate physical data, so, for example,
all user objects can go into a bucket named users. A bucket is also a way to
set different properties, for things like replication, for different types of data.
This allows you to have stricter rules for objects that matter more in terms
of consistency and replication, and more relaxed rules for data where a lack
of immediate replication is acceptable, such as sessions.
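Bucket properties can be tweaked through the HTTP interface by updating the
bucket itself. Here's a minimal sketch of what that looks like, with users as an
example bucket and n_val (the number of replicas) as the property being set:
$ curl -X PUT localhost:8098/riak/users \
  -H 'Content-Type: application/json' \
  -d '{"props": {"n_val": 5}}'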

Fetching Objects
Now that we got that out of the way, let's talk to our database. That's why
I love using HTTP to get to know it better; it's such a nice and human-readable
format, with no special libraries required. We'll start off with the basics,
using both a client library and curl, so you'll see what's going on under the
covers.

When you're installing and starting Riak, it installs a bunch of URL handlers,
one of them being /riak, which we'll play with for the next couple of
sections. Again, the client libraries are hiding that from us, but when you're
playing on your own, using curl, my favorite browser, it's good to know.
If you haven't done so already, fire up the Node shell, and let's start with
some basics. After this example I'm assuming the riak library is loaded in the
Node.js console and points to the riak-js library.
var riak = require('riak-js').getClient()
riak.get('tweets', '41399579391950848')

We're looking for a tweet with, granted, a rather odd-looking key, but it's
a real tweet, and the key conforms to Twitter's new scheme for tweet
identifiers, so there you have it.
What riak-js does behind the curtains is send a GET request to the URL
/riak/tweets/41399579391950848. Riak, being a good HTTP sport,
returns a status code of 404. You can try this yourself using curl.
$ curl localhost:8098/riak/tweets/41399579391950848

As you'll see it doesn't return anything yet, so let's create the object in Riak.

Creating Objects
To create or update an object using riak-js, we simply use the save()
function and specify the object to save.
riak.save('tweets', '41399579391950848', {
user: "roidrage",
tweet:
"Using @riakjs for the examples in the Riak chapter!",
tweeted_at: new Date(2011, 1, 26, 8, 0)
})

Under the covers, riak-js sends a PUT request to the URL /riak/tweets/
41399579391950848, with the object we specified as the body. It also
automatically uses application/json as the content type and serializes the
object to a JSON string, as this is clearly what we're trying to store in Riak.
Here's how you'd do that using curl.
curl -X PUT localhost:8098/riak/tweets/41399579391950848 \
-H 'Content-Type: application/json' -d @-

{"user":"roidrage",
"tweet":"Using @riakjs for the examples in the Riak chapter!",
"tweeted_at":"Mon Dec 05 2011 17:31:40 GMT+0100 (CET)"}

Phew, this looks a tiny bit more confusing. We're telling curl to PUT to the
specified URL, to add a header for the content type, and to read the request
body from stdin (that's the odd-looking parameter -d @-). Type Ctrl-D after
you're done with the body to send the request.
Riak will automatically create the bucket and use the key specified in the
URL the PUT was sent to. Sending subsequent PUT requests to the same
URL won't recreate the object; they'll update it instead. Note that you can't
update single attributes of a JSON document in Riak. You always need to
specify the full object when writing to it.

Object Metadata
Every object in Riak has a default set of metadata associated with it. Examples
are the vector clock, links, date of last modification, and so on. Riak also
allows you to specify your own metadata, which will be stored with the
object. When HTTP is used, they'll be specified and returned as a set of
HTTP headers.
To fetch the metadata in JavaScript, you can add a third parameter to the call
to get(): a function to evaluate errors, the fetched object, and the metadata
for that object. By default, riak-js dumps errors and the object to the console.
Let's peek into the metadata and look at what we're getting.
riak.get('tweets', '41399579391950848',
function(error, object, meta) {
console.log(meta);
})

The result will look something like the output below.


{ usermeta: {},
debug: false,
api: 'http',
encodeUri: false,
host: 'localhost',
clientId: 'riak-js',
accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
binary: false,
raw: 'riak',
connection: 'close',
responseEncoding: 'utf8',
contentEncoding: 'utf8',
links: [],
port: 8098,
bucket: 'tweets',
key: '41399579391950848',
headers: {
Accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
Host: 'localhost', Connection: 'close' },
contentType: 'application/json',
vclock: 'a85hYGBgzGDKBVIcypz/fvptYKvIYEpkymNl4NxndYIvCwA=',
lastMod: 'Fri, 18 Nov 2011 11:31:21 GMT',
contentRange: undefined,
acceptRanges: undefined,
statusCode: 200,
etag: '68Ze86EpWbh8dbAcpMBpZ0' }

The vector clock is indeed a biggie, and as you update an object, you'll see
it grow even more. Try updating our tweet a few times, just for fun and
giggles.
for (var i = 0; i < 5; i++) {
riak.get('tweets', '41399579391950848',
function(error, object, meta) {
riak.save('tweets', '41399579391950848', object);
})
}

Now if you dump the object's metadata on the console one more time, you'll
see that it has grown a good amount with just five updates.
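If you're curious just how much it grew, you can log the vector clock by itself;
a quick sketch, reusing the get() callback from above:
riak.get('tweets', '41399579391950848',
  function(error, object, meta) {
    // Print just the length of the encoded vector clock string.
    console.log(meta.vclock.length);
  })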

Custom Metadata
You can specify a set of custom metadata yourself. riak-js makes that process
fairly easy: simply specify a fourth parameter when calling save(). Let's
attach some location information to the tweet.
var tweet = {user: 'roidrage',
tweet: 'Using riakjs for the examples in the Riak chapter!',
tweeted_at: new Date (2011, 1, 26, 8, 0)}
riak.save('tweets', '41399579391950848', tweet,
{latitude: '52.523324', longitude: '13.41156'})

When done via HTTP, you simply specify additional headers in the form
of X-Riak-Meta-Y, where Y is the name of the metadata you'd like to be
stored with the object. So in the example above, the headers would be
X-Riak-Meta-Latitude and X-Riak-Meta-Longitude. If you don't believe me,
we can ask our good friend curl for verification.

$ curl -v localhost:8098/riak/tweets/41399579391950848
...snip...
< X-Riak-Meta-Longitude: 13.41156
< X-Riak-Meta-Latitude: 52.523324
...snap...

Note that, just like with the object itself, you always need to specify the full
set of metadata when updating an object, as it's always written anew. Which
makes using riak-js all the better, because the meta object you get from the
callback when fetching an object lends itself nicely to being reused when saving
the object again later.
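As a quick sketch of that pattern, here's a fetch-then-save cycle that hands the
fetched meta straight back to save(), so the custom metadata comes along for
the ride (the tweaked tweet text is just for illustration):
riak.get('tweets', '41399579391950848',
  function(error, object, meta) {
    object.tweet = object.tweet + '!';
    // Pass the fetched meta back in so the custom headers are preserved.
    riak.save('tweets', '41399579391950848', object, meta);
  })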

Linking Objects
Linking objects is one of the neat additions of Riak over and above Dynamo.
You can create logical trees or even graphs of objects. If you fancy
object-oriented programming, this can be used as the equivalent of object
associations.
By default, every object has only one link: a reference to its bucket. When
using HTTP, links are expressed using the syntax specified in the HTTP
RFC. A link can be tagged to give the connection context. Riak doesn't
enforce any referential integrity on links though; it's up to your application
to catch and handle nonexistent ends of links.
In our tweets example however, one thing we could nicely express with links
is a tweet reply. Say frank06, author of riak-js, responded to my tweet, saying
something like "@roidrage Dude, totally awesome!" We'd like to store the
reference to the original tweet as a link for future reference. We could of
course simply store the original tweet's identifier, but where's the fun in that?
To store a link, riak-js allows us to specify them as a list of JavaScript hashes
(some call them objects, but I like to mix it up).
var reply = {
user: 'frank06',
tweet: '@roidrage Dude, totally awesome!',
tweeted_at: new Date (2011, 1, 26, 8, 0)};
riak.save('tweets', '41399579391950849', reply,
{links: [{tag: 'in_reply_to',
key: '41399579391950848',
bucket: 'tweets'}]})

A link is a simple set consisting of a tag, a key, and a bucket. The tag in
this case identifies this tweet as a reply to the one we had before; we're
using the tag in_reply_to to mark it as such. This way we can store entire
conversations as a combination of links and key-value, walking the path up
to the root tweet at any point.
Now when you fetch the new object via HTTP, you'll notice that the header
for links has grown and contains the link we just defined.
$ curl -v localhost:8098/riak/tweets/41399579391950849
...
Link: </riak/tweets/41399579391950848>; riaktag="in_reply_to",
</riak/tweets>; rel="up"
...

You can fetch them with riak-js too, using the metadata object, which will
give you a nice array of objects containing bucket, tag and key.
riak.get('tweets', '41399579391950849',
function(error, object, meta) {
console.log(meta.links)
})

An object can have an arbitrary number of links attached to it, but there are
some practical boundaries. It's not recommended to have more than 10000 links
on a single object. Consider, for example, that all the links are sent through
the HTTP API, which can make some HTTP clients choke, because the single
header for links grows much larger than they expect. The number of links on
an object also adds to its total size, making an object with thousands of links
more and more expensive to fetch and send over the network.

Walking Links
So now that we have links in place, how do we walk them, how can we
follow the graph created by links? Riak's HTTP API offers a simple way to
fetch linked objects through an arbitrary number of links. When you request
a single object, you attach one or more additional parameters to the URL,
specifying the target bucket, the tag and whether you would like the linked
object to be included in the response.

riak-js doesn't have support to walk links from objects in this way yet, so
we'll look at the URLs instead. Play along to see what the results look like.
Let's have a look at an example.
$ curl .../riak/tweets/41399579391950849/tweets,in_reply_to,_/

There are three parameters involved in this link phase.
tweets tells Riak that we only want to follow links pointing to the bucket
tweets.
in_reply_to specifies the link tag we're interested in.
The last parameter (_ in this example) tells Riak whether or not you want
the object pointed to by this link returned in the response. It defaults to 0,
meaning false, but setting it to 1 gives you a more complete response as
you walk deeper nested links.

When you run the command above you'll receive a multi-part response from
Riak which is not exactly pretty to look at. The response includes all the
objects that are linked to from this tweet.
Given the nature of a Twitter conversation it will usually be just one, but you
could also include links to the people mentioned in this tweet, giving them a
different tag and giving the whole tweet even more link data to work with.
If you have multiple tags you're interested in, or don't specifically care about
the target bucket, you can replace either with _, and Riak will follow links to
any bucket or with any tag respectively. The following query will simply
return all linked objects.
$ curl localhost:8098/riak/tweets/41399579391950849/_,_,_/

Walking Nested Links


You aren't limited to walking just one level of links, you can walk around
the resulting graph of objects at any depth. Just add more link specifications
to the URL. Before we try it out, let's throw in another tweet, that's a reply
to the reply, so we have a conversation chain of three tweets. We'll do this in
our Node console.
var reply = {
user: 'roidrage',
tweet: "@frank06 Thanks for all the work you've put into it!",
tweeted_at: new Date(2011, 1, 26, 10, 0)};
riak.save('tweets', '41399579391950850', reply, {links:
[{tag: 'in_reply_to', key: '41399579391950849', bucket: 'tweets'}]
})

$ curl localhost:8098/riak/tweets/41399579391950850/_,_,_/_,_,_/

This query will walk two levels of links, so given a conversation with one
reply to another reply to the original tweet, you can get the original tweet
from the second reply. Mind-bending in a way, but pretty neat, because with
this query you'll also receive all the objects in between with the response, not
just the original tweet, but all replies too.

The Anatomy of a Bucket


There is great confusion around what a bucket in Riak actually is, and what
you can and cannot do with it. A bucket is just a name in Riak, that's it.
It's a name that allows you to set some configuration properties like data
distribution, quorum and such, but it's just a name.
A bucket is not a physical entity. Whenever you reference a bucket and key
in Riak to fetch a value, the two are treated as one. To look up data,
Riak always uses both the bucket and the key; only together do they make
up the complete key, which is also what gets hashed to find the node
responsible for the data. The lookup in Riak is always hash(bucket +
key) and never bucket/hash(key).
A bucket is nothing like a table in a relational database. A table is a physical
entity that's stored in a different location than other tables. So when you
think of a bucket, don't think of it as a table or anything else that relates to a
physical separation of data. It's just a namespace, nothing more. And yes, the
name "bucket" is rather unfortunate in that regard, as it suggests a physical
separation of data in the first place.
All this has a couple of implications, most of them easily thwarting
expectations people coming to Riak usually have.
You can't just get all the keys of objects stored in a particular bucket. To
do that, Riak has to go through its entire dataset, filtering out the ones that
match the bucket.
You can't just delete all data in a bucket, as there is no physical distinction
between a bucket and a key. If you need to keep track of the data, you need
to keep additional indexes on it, or you can list keys, though the latter is
not exactly recommended either.

You can set specific properties on a per-bucket basis, such as the number
of replicas, quorum and other niceties, which override the defaults for all
buckets in the cluster. The configuration for every bucket created over the
lifetime of a cluster is part of the whole ring configuration that all nodes in
a Riak cluster share.

List All Of The Keys


Now that we got that out of the way, you're bound to ask: "but how do I get
all of my keys out of Riak?" Or: "how can I count all the keys in my Riak?"
Before we dive into that, let me reply with this: "don't try this at home, or
rather, don't use this in production, or at least keep its use to the necessary
minimum."
For fetching all keys, even of a single bucket, the whole Riak cluster has to
go through its entire key set, either reading it from disk or from memory,
but through the whole set nonetheless, finding the ones belonging to that
particular bucket by looking at the full key. Depending on the number of
keys in your cluster, this can take time. Going through millions of keys is
not a feat done in one second, and it puts a bit of load on your cluster too.
Performance also depends on the storage backend chosen, as some keep all
the keys in memory, while others have to load them from disk.
Now that we got the caveats out of the way, the way to fetch all keys
in a bucket is to request the bucket with an additional query parameter
keys=true. That will cause the whole cluster to load the keys and return
them in one go. riak-js has a keys() method:
riak.keys('tweets')

A word of warning though: this will choke Node.js when there are a lot
of objects in the bucket. This is because listing all keys generates a really long
header with links to all the objects in the bucket. You'll probably want to use
the streaming version of listing keys as shown further down.
The same as a plain old HTTP request using curl:
$ curl 'localhost:8098/riak/tweets?keys=true'

This will return pretty quickly if you have only a couple of objects stored in
Riak; several tens of thousands are not a big problem either, but what you
probably want to do instead is to stream the keys as they're read on each node
in the cluster. You won't get all keys in one response, but the Riak node
coordinating the request will send the keys to the clients as they are sent by
all the other nodes. To do that set the parameter keys to the value stream.
$ curl 'localhost:8098/riak/tweets?keys=stream'

With curl, it will keep dumping keys on your console as long as the
connection is kept open. In riak-js, due to its asynchronous nature, things
need some more care. The call returns an EventEmitter, a Node.js-specific
type that triggers events as it receives data. We'll do the simplest thing
possible and dump the keys onto the console.
riak.keys('tweets', {keys: 'stream'}).
on('keys', console.log).start()

If you really must list keys, you want to use the streaming version. riak-js
uses the streaming mechanism to give you a means of counting all objects in
a bucket by way of a count('tweets') function.
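For completeness, counting looks like this in the Node console; treat the exact
call as a sketch and check your riak-js version for how the result is delivered:
riak.count('tweets')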
In general, if you find yourself wanting to list keys in a bucket a lot, it's very
likely you actually want to use something like a full-text search or secondary
indexes. Thankfully, Riak comes with both. When you do list keys, keep a
good eye on the load in your cluster. With tens of millions of keys, the load
will increase for sure, and the request may eventually even time out. So you
need to do your homework; tens of millions of keys are a lot to gather and
collect over a network.

How Do I Delete All Keys in a Bucket?


As you probably realize by now, this is no easy feat. As bucket and key are
one and the same, the only way to delete all data in a bucket is to list all the
keys in that bucket, or to keep a secondary index of the keys, using some
secondary data store. Redis has been used for this in the past, for example.
You can also keep a list of keys as a separate Riak object, or use some of Riak's
built-in query features. As they are quite comprehensive, I'll give them the
attention they deserve in the next section.
The approach of using key listings to delete data has certainly been used
in the past, but again involves loading all keys in a bucket. If you use it
cautiously with streaming key listings, it might work well enough.
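As a sketch of that cautious approach, and assuming your riak-js version offers
a remove() function for deletes and delivers batches of keys with each keys
event (both assumptions here, so check your client), it could look roughly like
this:
riak.keys('tweets', {keys: 'stream'}).
  on('keys', function(keys) {
    // Delete each key as the batches stream in.
    keys.forEach(function(key) {
      riak.remove('tweets', key);
    });
  }).start()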
There's one thing to be aware of when deleting based on listing keys. You
may see ghost keys showing up when listing keys immediately after deleting
objects. The list of keys is only an indication; it may not always be 100%
accurate when it comes to the objects stored with the keys.

How Do I Get the Number of All Keys in a Bucket?


The short version: see above. There is no built-in way of getting the exact
number of keys in a bucket. Atomically incrementing an integer value is
a feat that's not easy to achieve in a distributed system as it requires
coordination. That won't help you right now, as the exact number is what
you're after.
The longer version involves either building indices, using Riak Search or
Riak Secondary Indexes (which we'll get to soon enough). You could use
a range large enough, maybe by utilizing the object's key (assuming there's
some sort of range there), and then feed the data into a reduce phase,
avoiding loading the actual objects from Riak, counting the objects as you
go. The downside of this approach is that it may not catch all keys, that data
needs to be fully indexed, and that you need to use Erlang for a MapReduce
query. The latter is simple enough, especially for this particular use case, and
we'll look at the details in the MapReduce section.
You can stream the keys in a bucket to the client and keep counting the
result, but it certainly won't give you an ad-hoc view if you're storing tens
of millions of objects, as it will take time.
Or you keep track of the number of objects through some external means,
for example using counters in Redis. If you need statistics on the number of
objects, you should keep separate statistics around. You could feed them into
a monitoring tool like Graphite or Munin, use a number of Redis instances
to keep track of them, or something entirely different. You could even use
built-in mechanisms, namely post-commit hooks, to update your counters
when data was updated or deleted. If ad-hoc numbers are what you need, this
is a good way to get them. Otherwise you'll pay with decreased performance
numbers, as your cluster is busy combing through the keys.
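To illustrate the external-counter idea, here's a minimal sketch using the
node_redis client; the tweets:count key and the storeTweet() wrapper are made
up for illustration, not anything Riak provides:
var redis = require('redis').createClient();

function storeTweet(id, tweet) {
  // Store the tweet in Riak and bump the external counter.
  riak.save('tweets', id, tweet);
  redis.incr('tweets:count');
}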
The bottom line is, you need to think about these things upfront, before
putting Riak in production. Retrofitting solutions gets harder and harder the
more data you store in Riak.

Querying Data
Now that we got the basics out of the way, let's look at how you can get
data out of Riak. We already covered how you can get an object out of Riak,
simply by using its key. The problem with that approach is that you have to
know the key. That's somewhat the dilemma of using a key-value store.
There are some inherent problems involved when wanting to run a query
across the entire data set stored in a Riak cluster, especially when you're
dealing with millions of objects.
Because Justin Bieber is so wildly popular, and because we need some data
to play with, I whipped up a script to use Twitter's streaming API to fetch
all the tweets mentioning him. You can change the search term to anything
you want, but trust me, with Bieber in it, you'll end up having thousands of
tweets in your Riak database in no time.
The script requires your Twitter username and password to be set as
environment variables TWITTER_USER and TWITTER_PASSWORD respectively.
Now you can just run node 08-riak/twitter-riak.js and watch as pure
awesomeness is streaming into your database. Leave it running for an hour
or so, believe me, it's totally worth it.
If you can't wait, five minutes will do. You'll still have at least a hundred
tweets as a result. The script will also store replies as proper links, so the
longer it runs the more likely you'll end up at least having some discussions
in there.

MapReduce
Assuming you have a whole bunch of tweets in your local Riak, the easiest
way to sift through them is by using MapReduce. Riak's MapReduce
implementation supports both JavaScript and Erlang, with JavaScript being
more suitable for ad-hoc style queries, whereas Erlang code needs to be
known to all physical nodes in the cluster before you can use it, but comes
with some performance benefits.
Speaking of Riak's MapReduce as a means to query data is actually a bit of a
lie, as it's rather a way to analyze and aggregate data. There are some caveats
involved, especially when you're trying to run an analysis on all the data in
your cluster, but we'll look at them in a minute.
A word of warning up-front: there is currently a bug in Riak that might
come up when you have stored several thousand tweets, and you're running
a JavaScript MapReduce request on them. Should you run into an error
running the examples below, there is a section dedicated to the issue and
workarounds.

MapReduce Basics
A MapReduce query consists of an arbitrary number of phases, each feeding
data into the next. The first part is usually specifying an input, which can be
an entire bucket or a number of keys. You can choose to walk links from
the objects returned from that phase too, and use the results as the basis for a
MapReduce request.
Following that can be any number of map phases, which will usually do any
kind of transformation of the data fed into them from buckets, link walks or a
previous map phase. A map phase will usually fetch attributes of interest and
transform them into a format that is either interesting to the user, or that will
be used and aggregated by a following reduce phase.
It can also transform these attributes into something else, like fetching only the
year and month from a stored date/time attribute. A map phase is called for
every object returned by the previous phase, and is expected to return a list of
items, even if it contains only one. If a map phase is supposed to be chained
with a subsequent map phase, it's expected to return a list of bucket and key
pairs.
Finally, any number of reduce phases can aggregate the data handed to them
by the map phases in any way, sort the results, group by an attribute, or
calculate maximum and minimum values.

Mapping Tweet Attributes


Now it's time to sprinkle some MapReduce on our tweet collection. Let's
start by running a simple map function. A MapReduce request sent to Riak
using the HTTP API is nothing more than a JSON document specifying
the inputs and the phases to be executed. For JavaScript functions, you can
simply include their stringified source in the document, which makes it a
bit tedious to work with. But as you'll see in a moment, riak-js handles this
much more JavaScript-like.
Let's build a map function first. Say, we're interested in tweets that contain
the word "love", because let's be honest, everyone loves Justin Bieber.
Riak.mapValuesJson(), used in the code snippet below, is a built-in
function that extracts the values of an object and parses them from JSON
into JavaScript objects, returned as a list (hence the [0]).
var loveTweets = function(value) {
try {
var doc = Riak.mapValuesJson(value)[0];
if (doc.tweet.match(/love/i)) {
return [doc];
} else {
return [];
}
} catch (error) {
return [];
}
}

Before we look at the raw JSON that's sent to Riak, let's run this in the
Node console, feeding it all the tweets in the tweets bucket.
riak.add('tweets').map(loveTweets).run()

Imagine a long list of tweets mentioning Justin Bieber scrolling by, or try it
out yourself. The number of tweets you'll get will vary from day to day, but
given that so many people are in love with Justin, I don't have the slightest
doubt that you'll see a result here.
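As for the raw JSON, a request like this boils down to a POST against Riak's
/mapred resource; the exact payload riak-js builds may differ in details, but the
shape is a JSON document listing the inputs and the phases, with the function's
source embedded as a string (shortened here):
$ curl localhost:8098/mapred -H 'Content-Type: application/json' -d @-
{"inputs": "tweets",
 "query": [{"map": {"language": "javascript",
                    "source": "function(value) { /* loveTweets body */ }",
                    "keep": true}}]}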

Using Reduce to Count Tweets


What if we want to count the tweets using the output we got from the map
function above? Why, we write a reduce function of course.
Reduce functions will usually get a list of values from the map function, not
just one value. So to aggregate the data in that list, you iterate over it and
well, reduce it. Thankfully JavaScript has got us covered here. Let's whip out
the code real quick.
var countTweets = function(values) {
return [values.reduce(function(total, value) {
return total + 1;
}, 0)];
}

Looks simple enough, right? We iterate over the list of values using
JavaScript's built-in reduce function and keep a counter for all the results fed
to the function from the map phase.
Now we can run this in our console.
riak.add('tweets').map(loveTweets).
reduce(countTweets).run()
// Output: [ 8 ]

The result is weird: the number is a lot smaller than expected when you
compare it to the list of actual tweets containing "love". There's a reason for
this, and it's generally referred to as re-reduce. We can fix it, no problem,
but let's look at what it actually is first.

Re-reducing for Great Good


It's not unlikely that a map function will return a pretty large number of
results. For efficiency reasons, Riak's MapReduce doesn't feed all results into
the reduce functions immediately, but instead splits them up into chunks. Say the
list of tweets returned by the map function is split into chunks of 100. Each
chunk is fed into the reduce function as an array, then the results are collected
into a new array, which again is fed into the same reduce function.
This may or may not happen, depending on how large the initial combined
results from the reduce functions are. But in general your reduce function
should be prepared to receive two different inputs, unless it returns the same
kind of result as the map function.
This can be the cause of great confusion, because it means your reduce
function needs to be somewhat aware of its own output and the output of the
map function, and be able to differentiate both to calculate a correct result.
Now, let's make the above reduce function safe for re-reducing. All we really
need to do to make it work is make it aware that values can be either objects
or numbers. When it's a number, add it to the total; when it's an object, just
add 1.
var countTweets = function(values) {
return [values.reduce(function(total, value) {
if (isNaN(parseInt(value))) {
return total + 1;
} else {
return total + value;
}
}, 0)];
}

Much more like it. When you rerun the query, you'll now get a more
reasonable number of "love" tweets as a result. I'm sure you'll agree that the
number is particularly crazy in comparison to the total number of tweets.

Counting all Tweets


We'll build a map function that returns 1 for every document, and then we'll
calculate the total of that in the reduce function by adding up all the numbers
returned by the map phase.
The map function is, not surprisingly, simple enough.
var onePerTweet = function(value) {
return [1];
}

The reduce function simply sums up all these values. Since the return value
from our map function is already a number, we don't have to do any type
checking for re-reducing, as the values will always be numbers in both cases.
var sum = function(values) {
return [values.reduce(function(total, value) {
return total + value;
}, 0)];
}
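Putting the two together is the same dance as before: feed in the bucket, map,
then reduce.
riak.add('tweets').map(onePerTweet).reduce(sum).run()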

Chaining Reduce Phases


You can run an arbitrary number of phases in one go. You could reduce a
list of results from a previous reduce function even further. Why don't we do
that right away? Let's determine the particular hour of the day from a tweet
and then group the results on that hour, so we can get an overview of the
busiest Bieber hours of the day.
The map function parses the date and time, which we conveniently stored in
the attribute tweeted_at, and returns an object with the hour as key and 1 as
value.
var hourOfDay = function(value) {
var doc = Riak.mapValuesJson(value)[0];
var hour = {};
hour["" + new Date(doc.tweeted_at).getUTCHours()] = 1
return [hour];
}

The reduce function then aggregates the values based on the key, and stores
the value again with the same key, so it's immune to both results from the
map phase and to re-reducing its own data. The resulting method is
something that can easily be reused, because grouping data is a pretty
common pattern in aggregation. Just think of the last time you ran a GROUP
BY on a relational database.
The reduce phase iterates over all values and all attributes in the value, adding
up their values, and then stores it again in the result with the same key.
var groupByValues = function(values) {
var result = {};
for (var value in values) {
for (var key in values[value]) {
if (key in result) {
result[key] += values[value][key];
} else {
result[key] = values[value][key];
}
}
}
return [result];
}

The function creates a new object and adds up the data from the values
handed into it, creating new attributes where necessary. Now let's run a
MapReduce query using the map function above and this reduce function.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).run()

The result of this MapReduce query depends on how long you left the
Twitter search running, but there should be at least one result for the hour
you ran it in. If you let it run for more than an hour, you'll see the hours
adding up and the number of tweets too.
Now, the real purpose of this was to explain chaining phases, right? So let's
add something that will extract the top five busiest Bieber hours in the day.
It'll be a tough call, since he's such a worldwide phenomenon, but we'll try
our best. In case you're wondering by now if I'm a big fan of his, I'm really
not.
var topFiveHours = function(values) {
return values.map(function(value) {
for (var k in value) {
this.push([k, value[k]])
}
return this.sort(function(left, right) {
if (left[1] > right[1]) {
return -1;
} else if (left[1] < right[1]) {
return 1;
}
return 0;
}).slice(0, 5);
}, []);
}

Let's walk through it step by step, because this function is actually doing two
things. First it transforms the objects returned by the previous phase into an
array of arrays, because it's easier to iterate for the sorting that happens next.
Note that so far we're pretty oblivious to whether the input is coming from a
map or a reduce function. You could argue that the transformation into an
array of arrays should probably be done in a map function, and I challenge
you to fix that after we're done here.
Anyhoo, after we're done transforming a single value, which is still of the
form {'17': 1234}, into a list of array tuples like ['17', 1234], we're
sorting the resulting array by the number of tweets, which is its second
element.
The result is then sliced to get the top five elements from the list. Now let's
chain us some reduce functions for great good.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).
reduce(topFiveHours).run();

You should see a nice list of sorted hours and the number of tweets as a result.

Parameterizing MapReduce Queries


Let's say we want to be able to fetch the top three, five, and ten without
changing the code of the function for every request. No problem, because
Riak's MapReduce lets us hand in additional arguments on every
MapReduce request.
Both map and reduce functions can accept arguments, but for now we'll
focus on the reduce function. We can simply modify it to accept a second
argument which we can then specify when making the initial MapReduce
request. You can specify additional arguments for every phase separately,
which works out well for us in this case, because the number of top hours
we're interested in is only of concern for this particular reduce phase.

var topHours = function(values, top) {
return values.map(function(value) {
for (var k in value) {
this.push([k, value[k]])
}
return this.sort(function(left, right) {
if (left[1] > right[1]) {
return -1;
} else if (left[1] < right[1]) {
return 1;
}
return 0;
}).slice(0, top);
}, []);
}

The only modification we made was to add a second parameter top to the
function and to use it when calling slice() instead of a fixed number.
Let's run it real quick to verify that it actually works. Notice the second
parameter we added to the second reduce phase.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).
reduce(topHours, 1).run();

Similarly, we can adapt the map function we used to find tweets containing
love to accept an argument, so we can use it to search for arbitrary terms.
var searchTweets = function(value, keyData, arg) {
var doc = Riak.mapValuesJson(value)[0];
if (doc.tweet.match(arg)) {
return [doc];
} else {
return [];
}
}

Now we can specify the word to search for like so.


riak.add('tweets').map(searchTweets, 'love').
reduce(countTweets).run()

In case you're wondering about the second parameter for the map function,
the one called keyData, it contains the data that was used to fetch the inputs
for this MapReduce request, in this case the tweets bucket.

Chaining Map Phases


The story is a bit different with map phases in Riak. Chaining them requires
a bit more attention, because only the last map function in a list of phases is
allowed to return an arbitrary result; all the others need to return a list of
bucket/key pairs.
To understand why this is the case, we need to look at how work is spread
out across a Riak cluster for a single MapReduce query.

MapReduce in a Riak Cluster


When you send a MapReduce request to Riak, a couple of things happen
across the cluster. The node accepting the request is called the coordinating
node, and it's responsible for a bunch of things. But first things first.
The simple rule of how a MapReduce request is spread throughout the
cluster is this:
All map requests are run on the nodes that have the relevant data. If your
inputs reference two objects on two different nodes, the map request is
sent to both nodes. That way, processing of data is done close to its
location in the cluster, reducing network overhead.
All reduce requests are run on the coordinating node, the node that
initiated the MapReduce run on behalf of the client. The coordinating
node collects all the results from the nodes that ran the map phases, and
feeds them into the reduce phases, returning the response to the client
when done.
The coordinating node has a timeout kicking in after a configurable amount
of time. Depending on the amount of data you're sifting through with a
MapReduce request and the load on the cluster, you may need to play with
the timeout to find the right value. The timeout can be specified on
a per-request basis too; the following call will use a rather low timeout of 10
seconds.
riak.add('tweets').map(...).run({timeout: 10000})

The caveat that affects chained map phases is their data locality. A map
function that's part of a chain can't just return any data; it has to return a list of
bucket/key pairs unless it's the last phase in a particular MapReduce request,
the same type of list that's fed into the initial map phase as input.

If the coordinating node detects that a map phase is followed by another, it
uses the results to determine data locality once again and to send the next
map request to the node that's responsible for the data.
How could we put this to good use with our collection of tweets? We could
have one map phase that gets only the tweets containing the word love,
followed by another that fetches tweets containing the word hate. We can
almost reuse the original function without any changes, but we have to take
care of the fact that the function can be an intermediate as well as the last phase.
We can use the arguments passed to each function for just that. So instead
of just passing in the word to look for we'll pass in a JavaScript object
containing the word and a flag to signal whether it's the last in the row of
phases or not.
var searchTweets = function(value, keyData, arg) {
var doc = Riak.mapValuesJson(value)[0];
if (doc.tweet.match(arg.keyword)) {
if (arg.last) {
return [doc];
} else {
return [[value.bucket, value.key]];
}
} else {
return [];
}
}

There are two modifications to note. First, we're using the arg parameter as
a JavaScript object to fetch the keyword using the key with the same name.
Second, arg can now have a boolean attribute last. When it's true the
function returns the object's value, if not it returns its bucket and key for the
next map phase.
And just like that, our map function is easily chainable. Let's chain us some
map phases!
riak.add('tweets').
map(searchTweets, {keyword: 'love', last: false}).
map(searchTweets, {keyword: 'hate', last: true}).run();

As you can see, I changed the argument to be a proper JavaScript object
with two attributes. And that was pretty much all there is to it. With just a
few simple changes we made our function aware of multiple map phases and
could use it that way without a problem.

The results of this will once again depend on the number of tweets you have
in your Riak bucket. If you don't get anything, exchange the word "hate"
with just bieber; that should do the trick.
When would you use this kind of MapReduce magic? The above example
shows you one use case, sifting through data in multiple steps, narrowing
down the final result set as you go through the phases.
In a key-value store, keys should preferably be in a well-known format,
something that can be derived from something like a session identifier, a user
name, an email address, a URL, or some other attribute you have easy access
to. Given this preference, you could use chained map phases to sort of walk
from one kind of object to another.
Say you have a user object with an email address as an attribute. The user
is identified by his login name. You hand in that object to a map phase,
extracting the email address. You can use it then to generate a new key to
fetch more details about the email address, or to find emails sent to that email
address, given that you stored them in a way you can reconstruct using the
email address, like a single key that contains an object with all the emails sent.
This turns out to be very similar to link walking, which would certainly be
a preferable way, but it's nice to have options, right?
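To make that a bit more concrete, here's a sketch under a few assumptions: a
users bucket keyed by login name with an email attribute, and a hypothetical
emails bucket keyed by email address. The first map phase only emits the
bucket/key pair for the next phase to fetch.
var extractEmailKey = function(value) {
  var doc = Riak.mapValuesJson(value)[0];
  // Emit a bucket/key pair so the next map phase can fetch that object.
  return [['emails', doc.email]];
}

riak.add([['users', 'roidrage']]).
  map(extractEmailKey).
  map('Riak.mapValuesJson').run()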

Efficiency of Buckets as Inputs


There's a caveat when running a MapReduce request with just a bucket
as input. Gathering all the data from the whole cluster is a very expensive
operation, and not really recommended. It requires every node to walk its
entire key space and load all data in the bucket from disk. It's a recommended
practice to specify a restricted set of inputs on a production system instead of
using a bucket. Simply specify an array of bucket/key pairs instead of just the
bucket name.
riak.add([['tweets', '41399579391950849'],
['tweets', '41399579391950848'], ...])

I'm sure you'll agree that this is pretty cumbersome, though it makes sense
in some scenarios, when your sole interest is to fetch more than one object
in one request to save round trips. Say we want to fetch the above tweets
without even running a reduce phase; we can do that with MapReduce, a
pretty useful technique.

riak.add([['tweets', '41399579391950849'],
['tweets', '41399579391950848']])
.map('Riak.mapValues').run()

This will return just the values of the specific keys. The Riak.mapValues()
function extracts the value from the Riak object handed to it. Be aware
that this won't include the usual metadata you'll get when fetching a single
object.
Back on our original track, what kind of data will a MapReduce request
usually run on? We run some sort of analysis or query on a set of data
identified by keys or ranges of keys. We could use the keys' names as a way
to restrict the input to the map phase properly, to reduce the input to the data
we're really interested in.

Key Filters
Enter key filters, a way to reduce the data set fed into the initial map phase
using the key schema. Key filters can be used to fetch only a certain range of
keys that match a regular expression or fall into a certain range of integers or
strings. The keys can be transformed by splitting them up based on a token,
or by converting a string key to an integer.
Key filters are specified together with a bucket to restrict the initial set of
keys that needs to be sifted through. There can be one or more preprocessing
filters followed by one or more matching filters. The preprocessing
is optional, and is only necessary if you need to transform the keys into
something else before matching them, for example to convert strings to
integers.
Let's say we want to fetch all tweets whose keys start with 41399, which
should just return the tweets we created manually above.
riak.add({bucket: 'tweets', key_filters: [["matches", "^41399"]]})

Key filters are specified using a list which in turn contains one or more
lists containing filter names and parameters, and it can look a bit confusing
the more filters you add. Let's build another filter, this time combining a
preprocessor and a match.
Say we want do the same based on a numeric range, so we'll need to convert
the keys to an integer first.

riak.add({bucket: 'tweets',
key_filters: [["string_to_int"],
["less_than", 41399579391950849]]}).
map('Riak.mapValues').run()

I should add that the number we're basing the range on is too large to be
represented exactly as a JavaScript number, so you may see some oddness:
Node.js silently converts it to 41399579391950850, so both tweets fall in that
range. It's an oddity with
the Twitter API that came up when they changed their way of generating
tweet identifiers.
You can combine any number of transformation filters: you first tokenize
a string, and then convert the string to an integer. To combine a number of
matching filters, you can throw in a logical operator. So to ask for a specific
range with less_than and greater_than, you can combine them with and.
riak.add({bucket: 'tweets',
key_filters:
[["string_to_int"],
["and", [["less_than", 41399579391950850]],
[["greater_than", 41399579391950840]]]]}).
map('Riak.mapValues').run()

A logical operator accepts an arbitrary list of filters, so you can have separate
chains of transformation and matching in all parts of the operation. Let's
make this even more fun.
riak.add(
{bucket: 'tweets',
key_filters:
[["and", [["string_to_int"],
["less_than", 41399579391950850]],
[["string_to_int"],
["greater_than", 41399579391950840]]]]}).
map('Riak.mapValues').run()

You can combine operators at any level, though it gets kind of messy at
some point because of all the brackets involved. The next example looks for tweets with
identifiers less than 41399579391950850 that don't match the string 849.
Just like the top level list of key filters, every logical operator accepts a
number of lists of filters or once again, logical operators.
riak.add({bucket: 'tweets',
key_filters:
[["and", [["string_to_int"],

Riak Handbook | 52

Using Riak's Built-in MapReduce Functions

["less_than", 41399579391950850]],
[["not", [["matches", "849"]]]]]]
}).map('Riak.mapValues').run()

Key filters are a nifty little tool, but don't come without disadvantages, as
they still put a considerable load on the Riak cluster. Matching keys still
requires loading and checking all of them.

Using Riak's Built-in MapReduce Functions


Some of these operations are pretty common, and are likely to be reusable
with lots of map and reduce functions. We could either install our own
functions or use Riak's built-ins. We already did the latter by using
Riak.mapValuesJson.
Mostly it's a matter of speed when resorting to using the built-in functions.
When Riak starts up, it also starts processes loading the SpiderMonkey VM,
which is responsible for running the JavaScript code and in turn loads the
built-in JavaScript functions when it boots. That, in the end, is a lot cheaper
than always parsing and evaluating ad-hoc JavaScript like in all the examples
above.
So on the one hand it makes a lot of sense to use the built-in functions as
much as you can, but it also makes sense to pre-load your own, custom
JavaScript, once it reaches a stable state, on all your Riak instances.
The set of built-in functions covers some basic ground. The good news is
they're even re-usable inside your own functions, as they're sharing the same
global namespace. You can go through the JavaScript file that contains the
built-ins to get a good grasp of what's available to you.
We already used a bunch of them, most notably Riak.mapValues or
Riak.mapValuesJson. To use them directly in a MapReduce query, without
specifying your own code, simply reference them as a string; Riak will look
up the function object and run it instead of a custom function you'd usually
provide.
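For example, to run a built-in map function by name, without shipping any
code of your own:
riak.add('tweets').map('Riak.mapValuesJson').run()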
What's more important than the built-ins even is that you can distribute
your own JavaScript functions. Using ad-hoc functions is nice to get going
with MapReduce, but distributing them across your Riak cluster is in the
long run more efficient, as Riak's JavaScript engine doesn't have to parse the
code every time, but only once, during start-up.

Intermission: Riak's Configuration Files


Before we go any further, it's worth taking a minute to look at Riak's
configuration files. They'll be important in the following section.
Riak has two relevant configuration files. Depending on your mode of
installation they'll either be in /etc/riak (Ubuntu, Debian, RedHat), /opt/
riak/etc (Solaris), /usr/local/etc/riak (Homebrew on Mac OS X), or in
a subdirectory etc when you downloaded and untar'd a binary package.
The first file of interest is app.config. It contains the configuration for Riak
and all its components, allowing you to configure things like TCP ports, IP
addresses to bind to, data directories, enable and disable components, and so
on. Here's a small snippet of the Riak Core section, which configures some
basic network settings.
%% Riak Core config
{riak_core, [
{ring_state_dir, "./data/ring"},
%% http is a list of IP addresses and TCP ports that the Riak
%% HTTP interface will bind.
{http, [ {"127.0.0.1", 8098 } ]},
%% https is a list of IP addresses and TCP ports that the Riak
%% HTTPS interface will bind.
%{https, [{ "127.0.0.1", 8098 }]},
%% Default cert and key locations for https can be overridden
%% with the ssl config variable, for example:
%{ssl, [
%         {certfile, "./etc/cert.pem"},
%         {keyfile, "./etc/key.pem"}
%        ]},
%% riak_handoff_port is the TCP port that Riak uses for
%% intra-cluster data handoff.
{handoff_port, 8099 },
]},

app.config contains a whole lot of other configuration settings too, and

we'll revisit this file whenever we need to change or enable something over
the course of the book.

The other relevant file is vm.args. It contains flags that are handed over to the
Erlang process when Riak is started, such as the node's name, the maximum
number of Erlang-internal processes, and directories where Riak should look
for additional, user-provided code to load. Here's an excerpt from the default
vm.args file.
## Name of the riak node
-name riak@127.0.0.1
## Cookie for distributed erlang. All nodes in the same
## cluster should use the same cookie or they will not
## be able to communicate.
-setcookie riak
## Heartbeat management; auto-restarts VM if it dies
## or becomes unresponsive
## (Disabled by default..use with caution!)
##-heart
## Enable kernel poll and a few async threads
+K true
+A 64
## Treat error_logger warnings as warnings
+W w
## Increase number of concurrent ports/sockets
-env ERL_MAX_PORTS 4096
## Tweak GC to run more often
-env ERL_FULLSWEEP_AFTER 0
## Set the location of crash dumps
-env ERL_CRASH_DUMP log/erl_crash.dump

We'll revisit the vm.args too when something relevant needs to be changed.

Errors Running JavaScript MapReduce


With Riak 1.0, a new system to run MapReduce requests in a Riak cluster
was introduced. It's called Riak Pipe, and it may bring up occasional errors
when running JavaScript MapReduce code in certain scenarios. They'll
come up at irregular intervals, not necessarily with every run.
The one most commonly seen is preflist_exhausted. You'll see this error
instead of Riak returning a result. Below is an example of the error. I
removed several bits of Erlang for brevity, but the gist should be the same in
all cases.

{ [Error: HTTP error 500:
{"phase":0,"error":"[preflist_exhausted]",
"input": "{ok,{r_object,<<\"tweets\">>,<<\"11111\">>",
"type":"forward_preflist",
"stack":"[]"}] statusCode: 500 }

There is currently an open ticket for this problem, which you can track to be
notified of fixes or further workarounds and findings. The bug still exists in
the current version of Riak, which, at the time of writing, is 1.1.2.
The problem is that there are not enough JavaScript processes around to
handle the current request. The simplest workaround is to increase the
number of processes. To do that, we go back to the app.config file, which
you've just been introduced to.
In the section for riak_kv, there are two settings you need to change; both
specify the number of JavaScript processes available to Riak Pipe. The default
is shown below.
{map_js_vm_count, 8 },
{reduce_js_vm_count, 6 },

To reduce the likelihood of the error popping up, increase both numbers to
24 and 18 respectively, tripling the number of processes available. Note that
this will also increase the amount of memory required to run Riak, as every
JavaScript process has a default of 8MB allocated to it.
{map_js_vm_count, 24 },
{reduce_js_vm_count, 18 },

Restart Riak using the riak restart command, and retry running the query
that caused the issues.

Deploying Custom JavaScript Functions


If you want to deploy custom JavaScript code for easier reuse and to avoid
parsing and evaluating code for every MapReduce request, you have to
install it on every Riak node. Then you tell Riak where to find it and
restart Riak.
Ideally you'd put the code in version control and check it out into a
directory on every server running Riak. The part of checking out the code
should be straightforward enough for you to figure out. Ideally, you have
automated it to oblivion using a tool like Chef or Puppet, et al.

Once you've figured out a proper location, you just need to tell Riak about
it. In the app.config file there's a setting named js_source_dir, which is
commented out by default. If you change the line to something like shown
below, Riak will load the JavaScript files in that directory on startup.
{js_source_dir, "/etc/riak/js_source"},

Be sure to restart Riak after changing this setting.
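As a sketch of what such a file could contain, assuming Riak resolves
pre-loaded functions by name the same way it does the built-ins (the MyRiak
namespace and the file path are made up for illustration):
// /etc/riak/js_source/my_riak.js
var MyRiak = {
  searchTweets: function(value, keyData, arg) {
    var doc = Riak.mapValuesJson(value)[0];
    return doc.tweet.match(arg) ? [doc] : [];
  }
};

riak.add('tweets').map('MyRiak.searchTweets', 'love').run()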

Using Erlang for MapReduce


Now that we're done with the easy part, let's look at some simple examples of
how to use Erlang functions in Riak's MapReduce. Even if you're not going
to write your own Erlang code to run as a map or reduce phase, just using
the built-in functions may give you a speed advantage. Keeping a JavaScript
VM around requires more resources, and data needs to be serialized and
deserialized to JSON before it can be accessed in JavaScript.
Using Erlang instead of JavaScript doesn't require any serialization, and it
runs in the same context as Riak itself, so it doesn't need to call out to external
libraries to map values to JSON or to aggregate simple attributes.
The simplest way to use Erlang in MapReduce is to not write any Erlang at
all. I'm sure you'll agree that this is rather good news. I've got nothing against
Erlang myself, but it looks, let's say, a bit strange to newcomers. The problem,
or rather the challenge, with Erlang MapReduce is that the end result needs
to be serializable into JSON, so make sure you return strings or numbers, and
not some Erlang binary terms.
Instead of a JavaScript function you specify an Erlang module and a function
in that module. Just like with JavaScript, Riak has a bunch of built-in
functions available for you by way of the module riak_kv_mapreduce. To
fetch an object's value you can use the function map_object_value. Let's
throw it at riak-js and see what happens.
riak.add('tweets').map({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'map_object_value'}).run()

To add some reduce goodness, let's add a function that counts all the data
returned by the above map phase, effectively giving us a total number of
objects in the bucket.

riak.add('tweets').
map({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'map_object_value'}).
reduce({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'reduce_count_inputs'}).
run()

Now, the really nice part about this is that we actually don't need to run the
map phase just to count the objects. As we're querying the whole bucket and
just need to count the bucket-key pairs fed into the MapReduce request, we
might as well just use the reduce function all by itself.
riak.add('tweets').
reduce({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'reduce_count_inputs'}).
run()

This has the added benefit that the data doesn't need to be loaded. We're
running a query on the data without actually loading the data, just based on
the keys. That's a pretty neat tool right there, pretty useful when you just
want to count the number of objects in a bucket.
Riak comes with a bunch of built-in functions for Erlang too, properly
documented and well worth looking into. As a general rule, you'll benefit a
lot from using Erlang to write your own MapReduce code, simply because
it doesn't require all the overhead needed for JavaScript, like serializing and
calling out to external libraries (the SpiderMonkey JavaScript VM in this
case).
That said, there's a learning curve, but in my experience it's not too steep.
The Riak code itself is quite readable and nicely documented too, so you can
get a good idea of what's going on in the system, and what you can do with
Erlang and the data stored in Riak.

Writing Custom Erlang MapReduce Functions


Before we move on, a quick example of how you would write custom
MapReduce functions in Erlang. To do that, we'll work with the Riak
console. You can bring it up with the command riak attach, assuming
you have a running Riak instance. Hit return when you see the first output
message. You can hit Ctrl-D anytime to quit the console.


$ riak attach
Attaching to /tmp/usr/local/riak-1.1.2/erlang.pipe.1 (^D to exit)
(riak@127.0.0.1)1>

The following steps are not just useful to run MapReduce queries using
Erlang, they're helpful to give you an idea of how you can work with Riak
more closely, if you need to debug something.
First of all, we'll fetch a local client. This talks directly to Riak without going
through HTTP or Protocol Buffers, instead using plain old Erlang function
calls and message passing. You can type the following lines into the console;
I'll omit the console sugar to focus on the Erlang code. Don't forget to end
every statement with a period.
{ok, C} = riak:local_client().

This fetches a local client, and assigns it, through the magic of pattern
matching, to the variable C, which now holds our client object. Using that,
you can fetch data from Riak, run search queries, or execute MapReduce
jobs. Here's how you fetch an object.
C:get(<<"tweets">>, <<"41399579391950848">>).

The resulting output is the Erlang object as it is stored in Riak, something
that might look a bit unusual if you're new to Erlang.
Let's stick with a simple map function for now, one that extracts the object's
value, parses the JSON, and returns the tweet's text. An Erlang map function
takes three arguments, just like its JavaScript counterpart. We're ditching the
key data and phase arguments, as we're just interested in the actual object.
Here's the full code of the function.
ExtractTweet = fun(RObject, _, _) ->
{struct, Obj} = mochijson2:decode(
riak_object:get_value(RObject)),
[proplists:get_value(<<"tweet">>, Obj)]
end.

The first line defines an anonymous function, assigning it to the variable
ExtractTweet, which we can later use to reference the function. The second
line extracts the value from the object stored in Riak, just returning the plain
data as a string. It's immediately decoded using mochijson2, a JSON library
Riak also uses internally to handle JSON data, assigning the result to the
variable Obj. We're using pattern matching to get rid of the struct term.
The third line extracts a value from the leftover property list (a list of
key-value pairs), namely the value of tweet, which is the tweet's body,
returning it in a list. Erlang map functions are expected to return lists too.
That's it. No magic, and pretty straightforward too. To run it, the local Riak
client offers a mapred function. Here's how to run our function on a single
tweet.
C:mapred([{<<"tweets">>, <<"41399579391950848">>}],
[{map, {qfun, ExtractTweet}, none, true}]).

The first line specifies the input for the MapReduce request, a list of tuples
with bucket and key. The second line specifies a map phase, hence the map
term at the beginning. Phases are also specified as a list of tuples, though our
example contains only one. Using the qfun term, we specify that an
anonymous function is to be used, that no arguments are fed into the map
function (none), and finally that the data from this phase is returned to the
client (true).
The resulting output should include a list of just one tweet body. There, you
just wrote yourself some Erlang. You can take this a lot further: you could even
write back to Riak from an Erlang map or reduce function, which you can't
do in JavaScript. A nice touch for storing intermediate results back in Riak.
You can also run built-in Erlang map and reduce functions this way, and
even kick off JavaScript jobs. Here's an example for a full MapReduce with
two built-in functions, running the same functions as the JavaScript code in
the previous section.
C:mapred([{<<"tweets">>, <<"41399579391950848">>}],
[{map, {modfun, riak_kv_mapreduce, map_object_value},
none, false},
{reduce, {modfun, riak_kv_mapreduce, reduce_count_inputs},
none, true}]
).

Instead of qfun the code specifies modfun, with a module and function name
following, telling Riak to run this function on the inputs.
If you're working with Riak in production, it's well worth familiarizing
yourself with the things you can do from the Erlang console. It comes in
handy every now and then.

On Full-Bucket MapReduce and Key-Filters Performance

Before we move on to investigate more options to query data in Riak, a word
on the general performance implications of using MapReduce and key filters
on the whole data set.
The simple version is that running a MapReduce query on all objects in a
bucket requires Riak to go through all the keys stored in a cluster. See the
section on the anatomy of a Riak bucket for a deeper explanation of why that
is. The same is true for key filters. Both actually work very much alike.
For a full bucket MapReduce query Riak needs to go through its entire set
of keys to find the ones belonging to that particular bucket, tweets in our
example. For key filters, Riak also goes through the entire set, matching not
only the bucket but also the conditions you specified to the key name.
This process works reasonably well when you only have a small-ish number
of keys, maybe up to 100,000 objects. Be aware that there's no fixed number
that applies here. Given a fast set of disks it could still be reasonably fast to run
a full bucket MapReduce job, but it won't make your cluster very happy.
Riak thrives when working on a specific data set, for example MapReduce
queries on a specified set of keys. From my experience I'd avoid key filters
altogether. They look nice in theory, but in practice they're usually a sign
that you want to query your data using some other means.
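To make "working on a specific data set" concrete, here's a quick sketch of a
MapReduce request over a fixed, known set of keys with riak-js; the keys are
made up, and passing a list of bucket-key pairs to add() is an assumption about
your riak-js version.

// feed a known list of bucket-key pairs into MapReduce instead of a full bucket
riak.add([['tweets', '41399579391950848'], ['tweets', '41399579391950849']]).
  map('Riak.mapValuesJson').
  run()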
Luckily, Riak has got you covered with not just one, but two ways.

Querying Data, For Real


To go beyond ad-hoc style queries using MapReduce, Riak offers two
different ways to index and query data. The index is either built by Riak
itself, or based on data provided by the application. One of them is Riak
Search, the other is Riak Secondary Indexes.
The problem both try to solve is querying data based on some more or less
predefined set of attributes, like attribute-value pairs of a JSON document, or
all words in a simple text document. The difference between the two is mostly
in how that is actually achieved.
The name Riak Search suggests that we're dealing with a full-text search
system. You throw a set of data at it and Riak Search will, depending on the
content type, try to index the document to the best of its knowledge and as
much of it as possible. The process is pretty transparent to you, as Riak Search
can automatically index data you store in Riak KV.
Riak Secondary Indexes (2i), on the other hand, relies on you to specify the
data you want it to index. Your application code determines what attributes
Riak 2i should store alongside the object data. It doesn't even have to be data
that's included in the object itself, you could, wait for it, use it to store
and index the object's key. Boom, there's your key filter replacement.
We'll go into much more detail on how both are different, and when each
makes sense to use, but first, let's take a look at Riak Search and Riak 2i.

Riak Search
Riak Search is to Riak what Sphinx is to MySQL: a full-text search add-on
that can use your main data store as the source for building an inverted index
of your data, tokenizing strings in a way that allows you to search for only
parts of a string. It also was one of the first truly distributed full-text search
engines, right up there with ElasticSearch. It scales up and down just like the
rest of the Riak ecosystem does.
Riak Search was heavily inspired by Apache Lucene, which is close enough
to being a de facto standard for full-text search. The similarities focus mostly
on the interface, though, manifesting themselves in the Solr-like HTTP
interface and the Lucene-style query syntax; Riak Search doesn't support
the full Lucene query set yet.

Enabling Riak Search


Riak Search is included in Riak, but disabled by default. To enable it, open
app.config and look for a line in the section for Riak Search Config, which
handily only has one option {enabled, false}. Change that to {enabled,
true}, and you're almost there. Now stop and start Riak using riak stop
and riak start, and you should be good to go.

Indexing Data
Just enabling it only gives us the option to use Riak Search; it doesn't index
anything yet. But we now have three options to do so:

- From the command line, using search-cmd
- Indexing objects stored in Riak through a pre-commit hook, on a per-bucket basis
- Indexing data directly using a Solr-like HTTP interface


There's a secret, fourth way to index data, using the Erlang API, which
search-cmd uses, but we'll ignore that for now.

Indexing from the Command-Line


To get started with Riak Search, the simplest way is to use the search-cmd
tool to index a directory of text files or something similar. It's straightforward
enough, so let's give it a go. The example code library for this book comes
with a number of text files we can throw at Riak Search to see what happens.
The sample texts are not related to Justin Bieber, but we'll get back to him
soon, don't worry.
To index data, use search-cmd index ipsums <path>, where ipsums is the
name of our index in Riak Search and <path> is the path where search-cmd
looks for files to index.
Assuming you're in the directory containing the example code for this
chapter, try running the following command.
$ search-cmd index ipsums index-data

You should see some nice output telling you how many documents were
indexed and how fast. Boom, we have data. Using the command line is not
necessarily the best way to index data, but it's the easiest way to get you
started with Riak Search quickly.

The Anatomy of a Riak Search Document


When indexing data in Riak Search, you need three things: an index name,
a document identifier, and the document itself. Consider the index name
and document identifier to be similar to the bucket/key combination used to
identify objects stored in Riak. When using search-cmd to index data it will
assume the filename to be the key.
Riak Search is not entirely agnostic to the data you store in it, but it's
certainly capable of indexing common data types based on the content type,
at least when data is indexed directly from Riak KV. Indexing through the
command-line assumes that you want to index text and not a nested
structure like JSON, so you don't get much control over text stored in
separate attributes, everything will just be indexed as one string.
If you go through the means of indexing data from Riak KV or using the
Solr interface, you get a lot more control over what's indexed and how, and
your documents can have multiple attributes with different types, numbers
or strings, whereas using search-cmd will dump everything from a file into
one field, which makes up the whole document.
Every document has a bunch of fields and a default field. This comes from
Lucene, where the default field is the one that you don't have to explicitly
specify when querying data. If there's a string in your query that isn't
prefixed with a field name, Riak Search (and Lucene) assume you're
searching the default field.
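Jumping ahead a little to the tweets index and its schema (where tweet will be
the default field), that means the following two queries find the same
documents.

// 'love' hits the default field, so this is equivalent to 'tweet:love'
riak.search('tweets', 'love')
riak.search('tweets', 'tweet:love')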

Querying from the Command-Line


search-cmd can also be used to run queries from the command line. Though
you'll rarely use this in production, it is a nice tool to quickly validate
indexing or custom schemas. Let's search for "ipsum".
$ search-cmd search ipsums 'ipsum'

What you get is a list of documents that matched the query. It should be
three, which is hardly surprising because all of them are ipsum texts, but there
you go.
This doesn't return the documents themselves, it only gives you references
along with scores and positions. Use search-cmd search-doc if you're
interested in all indexed fields and their respective values. It's a nice tool for
simple debugging purposes.

Other Command-Line Features


The search-cmd tool also allows you to specify and show the schema for an
index, to get the execution plan for a particular query, or install a pre-commit
hook for a Riak bucket to automatically index data stored in Riak KV.
Riak Search uses a default index schema if you don't specify one yourself.
The default is pretty much inspired by what Solr uses as a default, so there are
no surprises. Schemas are defined in Erlang, but only using very simple data
structures.

The Riak Search Document Schema


Let's get the boring part out of the way before we dive into all the other
usages of Riak Search: let's look at how a schema is specified, what the default
schema looks like, and what kind of string analyzers you can use. Here's an
example from the default schema: a dynamic field that assumes that every
field whose name ends with _text is a string.


{dynamic_field, [
{name, "*_text"},
{type, string},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}

The default schema specifies a bunch of these dynamic fields. Three different
field types are supported: integer, string, and date.
Every field has an analyzer attached to it. You can specify your own, but
there's a bunch of built-in analyzers that should be more than enough to get
you started.
If you have a fixed schema for documents, and you don't need to rely on
name patterns to determine the type but instead know it upfront, you can
declare a field instead of a dynamic_field, simply specifying the full name.

Analyzers
What does an analyzer do? Its main purpose is to look at a particular field,
tear apart the data in it, and tokenize it for efficient storage and lookup in the
search index.
The standard analyzer assumes a string is in the English language, tokenizes
it based on whitespace, removes tokens (words) shorter than 3 characters,
and removes a bunch of stop words like "a", "if", "the", "this", and the like.
The standard analyzer is (by default) used for fields whose names end with
_txt or _text.
The whitespace analyzer (whitespace_analyzer_factory) simply tokenizes
based on whitespace, leaving all words in the string intact, so no stop words
are removed, neither are words shorter than three characters.
Numbers are analyzed using the integer analyzer
(integer_analyzer_factory). What it does is pad numbers with zeros to
a 10-character string. This is relevant because you have to pad numbers in
queries too. So the integer 1 is padded to 0000000001.
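For illustration, here's what a query against an integer field padded to the
default of 10 characters could look like; the field name is made up.

// the query value has to be padded the same way the index stores it
riak.search('tweets', 'retweet_count:0000000042')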
When using something like a date, which needs to be considered as a whole
(with both date and time) for sorting, the no-operation analyzer
(noop_analyzer_factory) is a good choice. It doesn't do anything, big
surprise, and leaves the string unchanged.


Why is it important for dates? Since everything is stored as strings and sorted
lexicographically, dates need to be treated as strings as well. Riak doesn't
really care about the format you're using, as long as it implies a sorting order
that is the same as the ordering of days and time, so that November 3rd 2011,
12:00 am comes before November 4th 2011, 1 pm.
This means that the American system for dates and time is of no real use, and
you're better off using the ISO 8601 format to specify both, so that the above
examples would be "2011-11-03T00:00:00" and "2011-11-04T13:00:00"
respectively.
The no-op analyzer is mostly useful for strings you expect to only contain
one word, that should be indexed as a whole. You can still do queries for
e.g. all documents that have the date 2011-11-04 or even 2011-11, thanks to
lexicographical ordering.

Writing Custom Analyzers


Feel free to skip this part if you don't feel like diving into some Erlang right
now, the main purpose is to show how easy it is to get started writing your
own analyzers.
As a fun (pun intended) exercise, we'll build an analyzer for the German
language. You can use the default analyzers as a blueprint and build your
own code from there.
We're building on the standard analyzer, which was made for the English
language. Since German and English, at least in grammatical terms, aren't
too far apart, most of the work we have to do is defining different stop words.
The German language has a lot of those, so we're keeping the list short for
brevity. I separated the module into several code snippets to keep things short
enough to follow along.
-module(german_analyzer).
-export([
german_analyzer_factory/2
]).
-define(UPPERCHAR(C), (C >= $A andalso C =< $Z)).
-define(LOWERCHAR(C), (C >= $a andalso C =< $z)).
-define(NUMBER(C),
(C >= $0 andalso C =< $9)).
-define(WHITESPACE(C), ((C == $\s) orelse (C == $\n)
orelse (C == $\t) orelse (C == $\f)
orelse (C == $\r) orelse (C == $\v))).


german_analyzer_factory(Text, [MinLengthArg]) ->
    MinLength = list_to_integer(MinLengthArg),
    {ok, german(Text, MinLength, [], [])};
german_analyzer_factory(Text, _Other) ->
    {ok, german(Text, 3, [], [])}.

The whole code is neatly wrapped into a german_analyzer module, which
can be compiled and handed over to Riak Search for use as an analyzer. The
header defines a couple of macros to determine if a character is in uppercase,
lowercase, a number, or whitespace. The $ operator is Erlang's shorthand for
a single character.
The first part is the entry point for Riak Search, the
german_analyzer_factory, which takes a text and a minimum length for
words as parameters. The analyzer is not necessarily expected to respect the
minimum length, as it depends on language and purpose really.
german(<<H, T/binary>>, MinLength, Acc, ResultAcc)
when ?UPPERCHAR(H) ->
H1 = H + ($a - $A),
german(T, MinLength, [H1|Acc], ResultAcc);
german(<<H, T/binary>>, MinLength, Acc, ResultAcc)
when ?LOWERCHAR(H) orelse ?NUMBER(H) ->
german(T, MinLength, [H|Acc], ResultAcc);
german(<<$.,H,T/binary>>, MinLength, Acc, ResultAcc)
when ?UPPERCHAR(H) ->
H1 = H + ($a - $A),
german(T, MinLength, [H1,$.|Acc], ResultAcc);
german(<<$.,H,T/binary>>, MinLength, Acc, ResultAcc)
when ?LOWERCHAR(H) orelse ?NUMBER(H) ->
german(T, MinLength, [H,$.|Acc], ResultAcc);
german(<<_,T/binary>>, MinLength, Acc, ResultAcc) ->
german_termify(T, MinLength, Acc, ResultAcc);
german(<<>>, MinLength, Acc, ResultAcc) ->
german_termify(<<>>, MinLength, Acc, ResultAcc).

The second part is a collection of functions that downcases words, using
some macros to detect characters, and then calls into the german_termify
line of functions to determine if the string, which is always represented as a
binary string, is as long as or longer than the minimum length and is not a
stop word.
german_termify(<<>>, _MinLength, [], ResultAcc) ->
lists:reverse(ResultAcc);
german_termify(T, MinLength, [], ResultAcc) ->
german(T, MinLength, [], ResultAcc);


german_termify(T, MinLength, Acc, ResultAcc)
  when length(Acc) < MinLength ->
    %% mimic org.apache.lucene.analysis.LengthFilter,
    %% which does not increment the position index
    german(T, MinLength, [], ResultAcc);
german_termify(T, MinLength, Acc, ResultAcc) ->
Term = lists:reverse(Acc),
case is_stopword(Term) of
false ->
TermBinary = list_to_binary(Term),
NewResultAcc = [TermBinary|ResultAcc];
true ->
NewResultAcc = [skip|ResultAcc]
end,
german(T, MinLength, [], NewResultAcc).

Last but not least, the part that determines if a term is a stop word based
on ordered lists of words. It's based on a simple and not fully complete set
of German words you usually don't want to have indexed, but you get the
idea. Turns out, there are a lot more stop words in German than there are in
English.
is_stopword(Term) when length(Term) == 2 ->
ordsets:is_element(Term,
["an", "ab", "da", "er", "es", "im", "in", "ja", "wo", "zu"]);
is_stopword(Term) when length(Term) == 3 ->
ordsets:is_element(Term,
["das", "dem", "den", "der", "die", "fr", "sie", "uns",
"was", "wie"]);
is_stopword(Term) when length(Term) == 4 ->
ordsets:is_element(Term,
["aber", "auch", "dein", "euer", "eure", "mein", "wann"]);
is_stopword(Term) when length(Term) == 5 ->
ordsets:is_element(Term,
["sowie", "warum", "wieso", "woher", "wohin"]);
is_stopword(Term) when length(Term) == 6 ->
ordsets:is_element(Term,
["machen", "sollen", "soweit", "werden"]);
is_stopword(Term) when length(Term) == 7 ->
ordsets:is_element(Term,
["dadurch", "deshalb", "nachdem", "weitere", "weshalb"]);
is_stopword(_Term) ->
false.

If you feel like it, you can play with the code in the Erlang shell to see what
it does. The code is part of the examples repository. Given you have Erlang
installed, bring up the shell in the same directory as the .erl file using the erl
command. Compile the source file like so:
c(german_analyzer).

Compiling it results in a .beam file which you can dump into
<riak_directory>/lib/riak_search-1.0.1/ebin. Note that this is not the
perfect solution to deploy custom Erlang code into a Riak cluster, but it'll do
for now. After restarting the Riak node you can start using the analyzer in
your schema.
How do you differentiate between different languages in different fields in
an object indexed by Riak Search? Putting automatic language detection
aside, if you know what language an input is in, denote the language using
the field name by adding a suffix like _de or _en. Here's an example using our
new analyzer.
{dynamic_field, [
{name, "*_de"},
{type, string},
{analyzer_factory, {erlang, german_analyzer,
german_analyzer_factory}}
]}
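
Assuming such a dynamic field is part of a bucket's schema and the search
pre-commit hook is installed for that bucket, indexing and querying German
text could then look something like this with riak-js; bucket, key, and content
are made up.

// any attribute ending in _de is run through the German analyzer
riak.save('articles', 'riak-artikel', {title_de: 'Verteilte Datenbanken mit Riak'})
riak.search('articles', 'title_de:datenbanken')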

Other Schema Options


If you're setting a manual schema, the options for analyzer, name, and type
should get you pretty far, but they're not all the options you have available.

padding is useful when you're expecting numbers that are longer or shorter
than the default of 10 characters, for example a long value, or just a
short value (for enumerations). Here's an example of a field padded to fit
a long value, using a padding of 19. The maximum signed long value is
9223372036854775807, so that should fit nicely.
{field, [
{name, "long_number"},
{type, integer},
{padding, 19},
{analyzer_factory,
{erlang, text_analyzers, integer_analyzer_factory}}
]}

You can mark a field as required using the required option. By default, all
fields defined in a schema are optional.

{field, [
{name, "email"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}

Using skip, you can tell Riak Search not to index a particular field. It's still
stored, but simply not available for search. Use this when you want to keep
your index as small as necessary, only indexing the fields you need to have
indexed.
{field, [
{name, "metadata"},
{type, string},
{skip, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}

You can specify a list of aliases for a field using the aliases option, which
means nothing more than that the field is stored multiple times in the index,
with every alias in the list as field name. That way you can reference the field
using its different names in queries.
{field, [
{name, "user"},
{type, string},
{aliases, ["username"]},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}
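
Assuming the alias above were part of the tweets schema, both of the
following queries would hit the same underlying field.

riak.search('tweets', 'user:roidrage')
riak.search('tweets', 'username:roidrage')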

An Example Schema
Going back to our Justin Bieber tweets, it's only fair to start indexing them,
and to set a schema that matches our interests for search. We index the
tweet's identifier, the user name, the time, and the text. This schema will be
our blueprint for the following excursions into indexing data directly from
Riak KV and using the Solr interface.
Every schema starts with a header that defines a couple of things, most
notably the default field for queries without an explicit field prefix.


{
schema,
[
{version, "1.1"},
{n_val, 3},
{default_field, "tweet"},
{default_op, "or"},
{analyzer_factory,
{erlang, text_analyzers, whitespace_analyzer_factory}}
],
[
%% IDs coming from Twitter's API are 64 bit integer values
%% Padding is 19 to accommodate that
%% Keeping the field name id_str from the Twitter API
{field, [
{name, "id_str"},
{type, integer},
{required, true},
{padding, 19},
{analyzer_factory,
{erlang, text_analyzers, integer_analyzer_factory}}
]},
{field, [
{name, "tweet"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]},
{field, [
{name, "tweeted_at"},
{type, date},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, noop_analyzer_factory}}
]},
%% Username doesn't need to be analyzed
%% it's always one word
{field, [
{name, "user"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, noop_analyzer_factory}}
]},
%% Skip everything else
{dynamic_field, [
{name, "*"},
{skip, true}
]}
]
}.

The schema defines four fields (neatly using different analyzers, which suits
our purpose of having a good example set quite nicely), and skips everything
that we're not interested in by declaring a dynamic field that matches
everything else.
Fields in a document are matched in the order they appear in the schema.
In the above example Riak Search will first go through the fixed fields and try
to find an exact match. Then it tries to match all the dynamic fields in the
order they appear in the list; the first match wins. So be sure to have catch-all
definitions, like a field that skips everything that doesn't match, at the bottom
of the schema.

Setting the Schema


Setting the schema is simple, now that we have it. For your convenience,
it's included in the sample code, in a file called tweets.erl. Once again we
resort to using search-cmd to get the job done.
$ search-cmd set-schema tweets 08-riak/tweets.erl

Note that an index doesn't have to exist yet for you to set the schema, much
like a bucket in Riak doesn't need to have any data in it before specifying a
configuration for it.
We now have a schema in place. To confirm, run the following command
to show the schema for the tweets index. It should show you the schema we
just set.
$ search-cmd show-schema tweets

Indexing Data from Riak


Riak Search comes with a pre-commit hook that runs whenever you update
data in Riak. That hook can be enabled for any bucket, with a bucket
corresponding to an index in Riak Search, meaning that both will have the
same name, and the index can still be queried independently of Riak KV
itself.
Once again, the search-cmd tool is at our service, it's the simplest way to
enable indexing for a bucket. When called with the command install and
a bucket name, it installs the pre-commit hook for us. The following
command installs a pre-commit hook for our tweets bucket.
$ search-cmd install tweets

Note that this doesn't retroactively index all data in the bucket. It only
indexes data stored from that moment on.
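If you do need older objects in the index, one straightforward (if not exactly
cheap) way is to read and re-save them so the hook picks them up. Here's a
minimal sketch with a made-up list of keys; passing the returned meta back
into save is an assumption about your riak-js version.

// re-save known objects so the freshly installed pre-commit hook indexes them
['41399579391950848', '41399579391950849'].forEach(function(key) {
  riak.get('tweets', key, function(error, doc, meta) {
    if (!error) riak.save('tweets', key, doc, meta);
  });
});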
Remember that indexing through the command-line assumed all files to
be text? Turns out the commit hook is much smarter than that. It uses the
content type for a Riak object to determine how to deserialize it. If you're
storing JSON, it's got you covered, same for XML. If you're using some
other serialization format that's not supported by default, you can specify
your own Erlang code to deserialize it into something Riak Search can index.
All the tweets we've stored so far are unfortunately not part of the
index yet. But assuming you've installed the schema for the tweets index and
the pre-commit hook for the bucket, we're good to go on writing more data.
The hook will take care of updating data when you update existing objects
and of deleting data from the index when you delete an object from Riak KV.
There's one thing to be aware of before we continue. The current
implementation of the Twitter indexer doesn't use a date format that's
suitable for sorting, it uses the string returned from the Twitter API, which
is a string of the form "Thu Nov 03 22:27:30 +0000 2011".
Not a big deal if we're not sorting based on the date, but let's look at an
implementation that creates the proper format string (following ISO rules,
of course) and also adds the tweet's identifier to the document before storing
it. I left out the surrounding code for brevity. Thankfully JavaScript's
implementation of Date comes with a handy method for this purpose.
twitter.addListener('tweet', function(tweet) {
var createdAt = new Date(tweet.created_at).toISOString();
var key = tweet.id_str;
var tweetObject = {
user: tweet.user.screen_name,
tweet: tweet.text,
tweeted_at: createdAt,
id_str: key
}
var links = [];
if (tweet.in_reply_to_status_id_str != null) {
links.push({
tag: 'in_reply_to',
bucket: 'tweets',
key: tweet.in_reply_to_status_id_str
});
}
riak.save('tweets', key, tweetObject, {links: links},
function(error) {
if (error != null)
console.log(error);
});
})

Re-run the Twitter stream for a little bit so we get data that we can run
queries on.
Now that we have some data in our search index, how do we get it out
again? Or rather, how can we query the search index to find documents
we're interested in?

Using the Solr Interface


If we just want to search for a specific string and get a bunch of matching
documents back, including all the data in the document, using the Solr
interface is the easiest way to get started. It's an HTTP interface inspired by
what Solr's API offers, so you can even point Solr client libraries at Riak's
endpoint.
To get started, riak-js has all the tools we need, and we'll look at the
underlying HTTP calls to show how easy it is to just use any HTTP library
to talk to Riak's search index. Not surprisingly, it takes an index and a query
string as arguments, and an optional callback to evaluate the results. The
simplest version to query the index is shown below.
riak.search('tweets', 'tweet:hate')

Running that, and given you have some fresh results in your database, you
should see some data running across your screen. It's not entirely helpful, so
let's look at what kind of data the Solr API returns. It's capable of handling
both XML and JSON, the latter being more interesting for us right now.
To figure out the exact data, we'll use curl to fetch it. The Solr HTTP
interface is mounted to /solr, so the URL http://localhost:8098/solr/
is our entry point, just add an index and an action. The full URL we can use
to query the tweets index is http://localhost:8098/solr/tweets/select,
and here's how you can query it using curl:

$ curl 'localhost:8098/solr/tweets/select?q=tweet:hate'

As I'm sure you'll notice, we just received a result as XML, a sensible default
given that Solr returns XML too. It also lets you specify other formats using
the wt parameter. Riak Search supports XML and JSON, so you can use both
formats as parameters in lowercase. Here's the JSON equivalent of the above
query.
$ curl 'localhost:8098/solr/tweets/select?q=tweet:hate&wt=json'

By the way, when using riak-js, it already sets the result format to JSON for
you, because it's easier to parse directly into JavaScript objects.
There's one thing you should remember, especially when you've used Solr in
the past. The Solr layer Riak Search offers is merely for API compatibility. It
allows you to use a Solr client with Riak Search, at least for the feature set it
supports. Under the hood, there's nothing resembling Solr or Lucene except
for the query syntax. Riak Search and Solr are still two semantically different
things; one doesn't work like the other.

Paginating Search Results


By default, Riak's Solr API returns only 10 results, but you can use the
parameter rows to extend the result set, combined with start to specify an
offset. To fetch 100 results, the query turns into this:
$ curl 'localhost:8098/solr/tweets/select?q=tweet:love&rows=100'

You can paginate data with an offset, to fetch 20 rows but skipping the first
20, use something like this:
$ curl \
'localhost:8098/solr/tweets/select?q=tweet:love&rows=20&start=20'

The disadvantage of using rows and start is that Riak Search will still
accumulate all the data first and then apply the parameters, a known
problem.
Using riak-js, you simply specify all these options in a parameters hash:
riak.search("tweets", "tweet:love", {start: 20, rows: 20})


You'll want to always specify a maximum for the number of rows returned,
because riak-js sets it to 10000.

Sorting Search Results


By default, Riak Search sorts the results based on the score, an indicative
number telling you how relevant a search result is to the query.
If you don't want to rely on the default sorting, you can sort on any field
that's indexed. This implies that you've chosen a value whose lexicographical
order matches its logical, business-level order. Think back to the date
discussion we had earlier. You need to ensure that a value that logically sorts
before another value also sorts before it lexicographically. It's the magic of
full-text search we're dealing with here. It's not black magic, lucky for us, and
we already looked at an in-depth example with dates.
Let's ditch the default order and sort the tweets by date.
riak.search('tweets', 'tweet:love', {sort: 'tweeted_at'})

If we simply want to use the key for sorting, which, in our case, should be
almost equivalent to sorting by the time a tweet was created, we can use the
handy presort option. The advantage of using presort is that it's applied
before limit and offset are applied, which is not the case with sort, a
known issue with Riak Search.
riak.search('tweets', 'tweet:love', {presort: 'key'})

Without any sort parameter, Riak Search sorts results in descending order, so
highest score and therefore, the best matches, come first. When you specify
a sort field, the order changes to ascending, but can be changed by explicitly
appending asc or desc to the field name.
riak.search('tweets', 'tweet:love', {sort: 'tweeted_at desc'})

Search Operators
A search query can be of arbitrary complexity. So far we've only looked at
queries looking for a simple word. Of course you can search for arbitrary
strings and phrases as well. Like any good full-text search, Riak Search
doesn't just keep track of simple words, but of their occurrence in phrases as
well. Simply surround the string in quotes; double quotes are as valid as
single quotes. This query searches for the string "justin bieber".
riak.search('tweets', 'tweet:"justin bieber"')

If you don't specify explicit quotes, and your query string is "tweet:justin
bieber", then Riak Search looks for documents that contain "justin" in the
field tweet or tweets that contain the string "bieber" in the default field.
Everything that's not explicitly prefixed with a field name and isn't
surrounded with quotes assumes the default field defined in the schema or
the optional query parameter df.
riak.search('tweets', 'tweet:love tweet:hate')

If you specify more than one field, Riak Search uses the operator OR to put
together the query. You can override that by either specifying operators
explicitly or by specifying a default search operator. Here's a query with
explicit operators, searching for tweets that contain both love and hate.
riak.search('tweets', 'tweet:love AND tweet:hate')

Note that search operators are case sensitive, meaning AND is an operator,
whereas "and" is a word to query for. Using the setting default_op in the
schema definition, you can tell Riak to always assume AND instead of OR.
You can use the NOT operator to negate parts of the query, use it to search for
tweets that contain "love" but not "hate".
riak.search('tweets', 'tweet:love AND NOT tweet:hate')

If you've used Solr and/or Lucene before, the rules for operators should be no
surprise. If you don't feel like littering your query statements with a pile of
AND, OR, and NOT, you can use + or - instead, though both imply that AND is the
search operator. The previous query would look like so.
riak.search('tweets', '+tweet:love -tweet:hate')

You can boost the relevance of certain terms by boosting their score,
influencing their appearance in the result set. Explaining the full magic of
term relevance is beyond the scope of this book: read the Solr wiki page for
all the gory details. Let's search for tweets containing "love" or "hate" and
give more relevance to hate.


riak.search('tweets', 'tweet:love OR tweet:hate^1')

You can group a bunch of operators together by using parentheses, which is
useful for giving a group of OR'd queries higher precedence than they would
normally have. As usual, AND is given higher precedence than OR.
riak.search('tweets',
'(tweet:love OR tweet:hate) AND user:roidrage')

Riak Search also supports proximity search, which helps finding documents
based on how close words in a phrase are to each other, using the tilde
operator. To look for things where the words "grandma loves" are no more
than four words apart, that is there are no more than three other words
between them, you can use the following.
riak.search('tweets', 'tweet:"grandma loves"~4')

Another useful feature is being able to search for ranges of matches. As all
data is stored as strings, and is sorted lexicographically in Riak Search (and
most if not all full-text searches, for that matter), you can even use our date
formatting to ask for tweets that occurred between the 1st and the 30th
November 2011. This kind of search is based on the fact that a string like
"20111101" is considered of lower order than "20111101T00:00:00", so it
matches anything that starts with the string "20111101".
Riak Search supports both exclusive and inclusive ranges, using brackets and
curly braces respectively. To search for all tweets of November, an inclusive
search is the way to go.
riak.search('tweets', 'tweeted_at:[2011-11-01 TO 2011-12-01]')

Even though something like this is possible, it's a good idea to be as specific
as possible with the lower and upper bounds, by specifying
"2011-11-01T00:00:00" as a lower bound and, to be even more specific,
"2011-11-30T23:59:59.999Z" as an upper bound. This is due to
lexicographical ordering, where "2011-12-01" has lower sorting priority
than "2011-12-01T00:00:00". Should we be interested in tweets from
December 1st too, we'd either have to specify the following day, or just
specify the full date string. As a general rule, if you can, be as specific as
possible, and make sure both bounds are equally specific. If the lower bound
is too lax you'll get results you may not expect.

Riak Handbook | 78

Using the Solr Interface

If you're not interested in the data matching the lower and upper bounds,
you can use curly braces instead.
riak.search('tweets',
'tweeted_at:{2011-11-01T00:00:00.000Z TO 2011-12-01T00:00:00.000Z}')

To search for partial matches, i.e. words that start with a specific term, the
* operator is here for you. It matches any word that begins with the term
provided and ends with any number of characters. It needs a minimum of
three characters though, so the first example below won't work, but the
second will. Note that you can't specify the operator at the beginning of the
search term, only at the end.
riak.search('tweets', 'tweet:j*');
riak.search('tweets', 'tweet:jus*');

Under the covers, the wildcard is just an inclusive range query, using the
lowest and highest boundary possible, speaking in lexicographical terms. So
the query above (the working one) is internally turned into something like
this.
riak.search('tweets', 'tweet:[jus TO jusÿ]')

That last character is just an odd looking representation of the ASCII
character with code 255. You can also use a wildcard for just one character
by using the ? operator. This example matches the words "hats" and "hate".
riak.search('tweets', 'tweet:hat?')

Just like the * operator, the ? wildcard operates as a range query under the
cover, just a tiny bit more specific, adding a zero byte to the lower boundary.

Summary of Solr API Search Options


For your convenience, here's a handy table summarizing all the parameters
you can use when searching through the Solr interface. The only required
parameter is q, the query string. If it's missing, Riak will return an HTTP 400
error.
q (default: empty)
    The query. Must be a valid Lucene style query. See below for valid
    operators.

wt (default: xml)
    Determines the output format. Valid values are xml and json.

sort (default: none)
    Determines the field to sort the result on. The field name must match a
    field in the schema, including dynamic fields.

presort (default: score)
    Determines the default sort order when no explicit field to sort on was
    specified. Can sort on either score or key, defaults to score.

start (default: 0)
    A numeric offset into the result set to skip a number of documents.

rows (default: 10)
    Maximum number of rows to be returned by the query. The maximum
    value is only limited by a configuration parameter.

df (default: none)
    Sets a default field name for this query. Overrides the schema setting,
    allowing you to write a query without an explicit field reference for that
    field.

q.op (default: or)
    Sets the operator for the query. Valid values are "or" and "and".

filter (default: none)
    Sets a filter based on inline fields.

index (default: none)
    If you don't specify the index name as part of the URL, you can set it as
    a parameter.

fl (default: none)
    A list of comma-separated fields to fetch from the documents instead of
    returning the entire document. If you specify only id, the Solr API won't
    fetch the documents at all, a nice touch for lower latency in cases where
    you don't need the document right away, e.g. for linking to it. Can't be
    combined with the sort option.

There you go, all you need for a handy reference. While we're at it, let's tear
apart the features of the query language, obviously something that must end
in another handy reference table.

Summary of the Solr Query Operators


Here's a summary of all the available operators for queries as outlined in the
previous sections.


AND
    Conjunction of two field queries. Both must match for the query to yield
    a result. AND has higher precedence than OR. Example: "tweet:justin AND
    tweet:love"

OR
    Disjunction of two field queries. At least one of the fields must match for
    the query to be successful. Example: "tweet:hate OR tweet:love"

NOT
    Negates a field query. The negated part should not match. Can be
    combined with other field queries, but also used on its own to search for
    documents that don't contain a specific string. Example: "tweet:hate AND
    NOT tweet:love"

*
    Wildcard search. Matches all strings of any length that start with a given
    string. The prefix string must be at least three characters long. Example:
    "tweet:bieb*"

?
    Wildcard search. Matches exactly one character. Example: "tweet:love?"

()
    Groups several queries logically to give higher precedence to OR queries.
    Example: "(tweet:bieber OR tweet:justin) AND tweet:hate"

[ TO ]
    Inclusive range search, following lexicographical ordering. Includes words
    matching the upper and lower bounds and anything in between. The "TO"
    operator is case sensitive. Example: "tweeted_at:[20111101 TO 20111130]"

{ TO }
    Exclusive range search. Includes only words between the lower and upper
    bounds. Example: "tweeted_at:{20111101T00:00:00 TO 20111201T00:00:00}"

~<int>
    Proximity search, works only on phrases, not single search terms. Searches
    for documents where the words in the search string are at most <int> words
    apart. Useful for somewhat fuzzy searching. Example: "tweet:'love bieber'~2"

^<float>
    Boosts a search term, giving a specific term a higher or lower relevance
    than others, giving the results matching that term a higher or lower score,
    and a higher or lower rank in the search result. Defaults to 0.5. Example:
    "tweet:love^1 OR tweet:hate^0.2"

Indexing Documents using the Solr API


Finally, we've arrived at the third way of indexing documents, by using the
Solr API directly. The main use case for this is when you want to use Riak
Search solely as a full-text search engine, and you're not interested in using
Riak as a database, because your data lives elsewhere, in a MySQL database
for example.
Just like querying, the interface for indexing is compatible with Solr. Unlike
querying, indexing only supports XML. Bummer, but a client library should
hide that fact from you anyway, allowing you to pass in something like hashes
and have them converted to XML automatically. Speaking of client libraries,
riak-js unfortunately doesn't currently support this API, but a
decent Solr client should do instead.
To add or update a document, you use the endpoint /solr/INDEX/update,
and POST to it. Both updating and adding are considered to be the same
thing, Riak Search doesn't keep track of conflicts or versions, the most recent
write wins.
Here's the simplest indexing that could possibly work using curl:
$ curl -X POST -d @- localhost:8098/solr/test-index/update
<add>
<doc>
<field name="id">1</field>
<field name="name">My god, it's full of XML!</field>
</doc>
</add>

Note that the XML shown here is actually entered on the command-line,
the parameter -d @- tells curl to read the post data from stdin, so when
done typing (or copy-pasting), type Ctrl-D to send end-of-file. You can
specify a number of documents at once, simply add more <doc> sections
inside the <add> tags. Be aware that this is not the same as a bulk import
though, a feature that Riak Search doesn't support unfortunately; it's merely
a convenient way to throw multiple documents at the index at the same time.
Every document is still indexed and committed separately.
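
Since riak-js doesn't wrap this endpoint, a plain HTTP client will do. Here's a
minimal sketch using Node's built-in http module; the index name, document
id, and content type header are assumptions for this example.

var http = require('http');

var body = '<add><doc>' +
  '<field name="id">1</field>' +
  '<field name="name">Indexed straight from Node</field>' +
  '</doc></add>';

var request = http.request({
  host: 'localhost',
  port: 8098,
  path: '/solr/test-index/update',
  method: 'POST',
  headers: {
    'Content-Type': 'text/xml',
    'Content-Length': Buffer.byteLength(body)
  }
}, function(response) {
  // a 200 response means the document was accepted for indexing
  console.log('Indexed, status ' + response.statusCode);
});

request.write(body);
request.end();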

Deleting Documents using the Solr API


Removing data from Riak Search, again assuming you're not indexing data
directly from Riak KV, is similar, and it even uses the same endpoint. You
can specify documents either using a list of identifiers, or by specifying a
query. You can even mix and match both ways in a single request.
$ curl -X POST -d @- localhost:8098/solr/test-index/update
<delete>
<id>1</id>
<query>name:"My god"</query>
</delete>

Using Riak's MapReduce with Riak Search


Using Riak Search to feed data into a MapReduce query makes our life much
easier, and even simplifies some of the things we've been doing to analyze
tweets, allowing us to avoid expensive full bucket queries or key filters.
As a general rule, Riak Search always returns bucket-key pairs. All
implementations fetch the pairs from the index first, and then fetch the
objects from Riak. This turns out to be a handy advantage for MapReduce,
because the input for a map phase is supposed to be a bucket-key pair.
The simplest way to use Riak Search with MapReduce is to fetch the objects
matching a particular query, allowing you to use Riak Search without
having to resort to the Solr API.
The only difference is that instead of specifying a bucket or a list of key-value
pairs, we now specify a search query as input. riak-js offers a handy method
to do that called addSearch(). Let's look at an example that fetches all tweets
having the word "hate" in them, assuming you still have a Node.js console
open somewhere.
riak.addSearch("tweets", "tweet:hate").
map('Riak.mapValuesJson').run()

Instead of specifying a bucket or a list of bucket-key pairs, we simply specify
the index and a valid query string. Turns out this simple query already makes
our initial MapReduce examples redundant, and it's much faster too.
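
Combining this with the Erlang built-ins from earlier gives us a sketch that
counts matching tweets without ever loading the objects themselves.

riak.addSearch('tweets', 'tweet:hate').
  reduce({language: 'erlang',
          module: 'riak_kv_mapreduce',
          function: 'reduce_count_inputs'}).
  run()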

The Overhead of Indexing


There's no doubt that building a full-text index is computationally
expensive. Before you throw Riak Search into production, expecting the
same speed as just storing data in Riak KV, run some benchmarks. Indexing
data has some overhead and also involves lots of disk I/O, so there should be
no surprise that utilizing Riak Search as part of storing data in Riak is slower
than just storing the data without any indexing.


Riak Secondary Indexes


Secondary indexes (2i for short) are a very recent addition to Riak. They
offer a much simpler way to do reverse lookups on data stored in Riak than
Riak Search. Instead of having Riak analyze and tokenize data in documents
(JSON, XML, text), 2i relies on the application to provide the indexing data
as key-value pairs when storing data. It doesn't do any tokenization either,
and doesn't allow things like partial matches the way full-text search does, so
it's much less computationally expensive.
Computation is instead pushed into the application layer, which basically
tags objects stored in Riak with the data it wants to run queries on. Riak 2i
can index integers and binary values like strings, and everything is stored
in lowercase. So instead of indexing an object you add some metadata to
it which Riak can then use to do index lookups. Like Riak Search, 2i only
returns keys for you to use. It doesn't do any object lookups, but you can feed
the results into MapReduce to do so in the same step.
Querying is kept rather simple too, allowing only full matches and ranges.
Don't worry, the folks at Basho are continuously improving on that front,
but it's a good start for a more lightweight alternative to Riak Search.
To use Riak 2i, you have to enable the LevelDB storage backend. You can
find details on how to do that in the section on storage backends.

Indexing Data with 2i


Index data is stored alongside the object it's associated with, much like
the metadata (links and such) mentioned earlier. Just like metadata, you
provide indexing data as additional headers, prefixed with X-Riak-Index.
Note that the header names are case-insensitive, so any case of X-Riak-Index
will do.
We'll start by indexing a tweet's username, working off the initial indexing
examples in this chapter, in the end adapting our Twitter indexer to use
secondary indexes.
Indexes need to be typed, and this is done by adding a suffix to the name
identifying the type; currently _bin and _int are supported. Indexing a
string field means it's binary for Riak 2i, so both the username and the date
get a _bin field name suffix. They end up being sent to Riak as
X-Riak-Index-username_bin and X-Riak-Index-tweeted_at_bin respectively.


riak-js comes with some preliminary support for 2i, but it's more than good
enough for our purposes. Secondary indexes are just really simple to build
and use. Just add a new index attribute to the metadata.
tweet = {
user: 'roidrage',
tweet: 'Using @riakjs for the examples in the Riak chapter!',
tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}
riak.save('tweets', '41399579391950848', tweet, {index: {
username: 'roidrage',
tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}})

The only change we've done is to add some metadata for indexes. riak-js will
automatically resolve the field names to have the proper datatype suffixes, so
the code looks a bit cleaner than the underlying HTTP request, which we'll
look at anyway.
$ curl -X PUT localhost:8098/riak/tweets/41399579391950848 \
  -H 'Content-Type: application/json' \
  -H 'X-Riak-Index-username_bin: roidrage' \
  -H 'X-Riak-Index-tweeted_at_bin: 2011-02-26T08:00:00.000Z' \
  -d '{
    "username":"roidrage",
    "tweet":"Using @riakjs for the examples in the Riak chapter!",
    "tweeted_at":"2011-02-26T08:00:00.000Z"
  }'

There are special field names at your disposal too, namely the field $key,
which automatically indexes the key of the Riak object. Saves you the
trouble of specifying it twice. Riak automatically indexes the key as a binary
field for your convenience, so be sure to avoid using the field $key elsewhere.
It's also worth mentioning that the $key index is always at your disposal,
whether you index other things for objects or not. That gives you a nice
advantage over key filters when you query Riak for ranges of keys.
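For example, a range query against the built-in $key index over a made-up
range of tweet IDs could look like this; note the escaped dollar sign.

$ curl localhost:8098/buckets/tweets/index/\$key/41399579391950840/41399579391950850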
That's pretty much all you need to know to start indexing data. There's no
precondition, just go for it. It really is the simplest way to get started building
a query system around data stored in Riak.


Querying Data with 2i


To query data, we'll have to switch our thinking to a new URL scheme,
because the traditional way of fetching data from Riak with /riak/
<bucket>/<key> had some problems with secondary index support, mainly
due to link walking.
To submit a query to Riak 2i, you use a different URL scheme instead.
/buckets/<bucket>/index/<fieldname>/<query>

Where <fieldname> is the indexed field (including _bin or _int suffix), and
<query> is the value you're searching the index for. So to search for all tweets
with the username roidrage the URL looks like this.
/buckets/tweets/index/username_bin/roidrage

The result is a JSON list of keys. Let's throw it at curl and see what comes
back.
$ curl localhost:8098/buckets/tweets/index/username_bin/roidrage
{"keys":["41399579391950848"]}

Searching 2i with riak-js is a lot easier on the eyes.


riak.query('tweets', {username: 'roidrage'})

You can ask for ranges of values too, just add another URL component as the
upper bound. Here's how you can fetch keys using a specific date range.
$ curl localhost:8098/buckets/tweets/index/tweeted_at_bin/ \
2011-02-26T00:00:00.000Z/2011-02-26T23:59:59.000Z
{"keys":["41399579391950848"]}

Ranges are always inclusive, so they include any matches of the upper and
lower bounds. To query for a range with riak-js, simply specify an array of
two values instead of a single value.
riak.query('tweets', {tweeted_at: [
"2011-02-26T00:00:00.000Z", "2011-02-26T23:59:59.000Z"
]})

And that's about it. 2i is pretty simple, especially compared to Riak Search,
which no doubt is much more powerful, but sometimes a simple lookup or
a simple range query is all you need. It's worth mentioning that you can
only query one index with a single query, there's currently no way to do
compound queries across multiple indexes.

Using Riak 2i with MapReduce


The fun wouldn't be complete without being able to feed query results into
MapReduce for further analysis, or simply to fetch the values in one go. The
simplest MapReduce query just maps the values to JSON and returns the
results.
riak.add({bucket: 'tweets',
index: 'username_bin',
key: 'roidrage'}).
map('Riak.mapValuesJson').run()

A 2i query is just another type of input for a MapReduce request, specifying
bucket, index field, and value. You can do anything here that you could with
normal MapReduce, so it's easy to re-use any code from the Riak Search
section.

Storing Multiple Index Values


Other than doing data look-ups based on simple keys, Riak's secondary
indexes have another neat feature: indexing fields as sets of data. You can
specify a comma-separated list of values, and Riak 2i will make sure all of
them are stored in the index separately. Because the index only stores unique
values per object, all values in the list of values are automatically unique as
well, hence we're talking about a set.
To store a list of values with an object, you specify an array for the
corresponding index. Think of tags as one simple example, but to make
it more interesting than that, the example below indexes the usernames of
all users mentioned in the tweet. It ignores the fact that Twitter users can
change their name at any time.
tweet = {
user: 'roidrage',
tweet: 'Using @riakjs for the examples in the Riak chapter!'+
' /cc @frank06',
tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}
riak.save('tweets', '41399579391950848', tweet, {index: {
  mentions: ['riakjs', 'frank06'],
  tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}})

This gives us a nice way to express relationships between objects in Riak.


But wait, wasn't this supposed to be what links are for? Indeed, but using
secondary indexes to do look-ups in this kind of scenario turns out to be
slightly more efficient.

Managing Object Associations: Links vs. 2i


So why would you consider secondary indexes to manage associations
instead of using links? Let's start with a simple scenario first: a user keeps
links to all of her tweets, a one-to-many relationship. The foreign key is the
username and is stored with every tweet.
With links, there is some overhead. Not only does Riak have to store the ID
referenced, it stores the entire link path and the tag. When using a secondary
index, it only has to store the ID, relying on the application to resolve the
reference context, for instance based on the index name. This may not sound
like a big deal, but it adds up when you store millions of objects.
The real efficiency improvement though, lies in fetching and writing
references. When you want to fetch all tweets for a specific user based on
links, you have to first fetch the user's object and then walk the relevant links.
As the links are not indexed, Riak has to match them based on the patterns
you provide in the link walking query. With an index, Riak doesn't have to
do the matching itself, the index already does a much more efficient job.
On the other hand, when you add a new tweet for a user, you always have to
add a new link to the user object to be able to reference all his tweets. With
secondary indexes, you don't have to touch the user at all, you just store a
reference to it in the index for that particular tweet. Instead of updating two
objects, you only write one. This effectively cuts the number of writes in
half, while also minimizing the risk of having to handle conflicts on the user
object.
The end result is the same, as you can still look up data both ways. You can
find the user based on the tweet, either by looking at the indexed field or by
adding the metadata to the tweet object. To get all tweets for a user, you look
at the user index for tweets with the username or ID you're interested in. To
only fetch a number of a user's tweets, you feed it into MapReduce.
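
As a concrete example, finding every tweet that mentions a particular user is a single query against the mentions index from the previous section (again assuming the callback receives the matching keys, as the HTTP API returns them).
// All tweets mentioning frank06, via the mentions index.
riak.query('tweets', {mentions: 'frank06'}, function(error, keys) {
  console.log(keys); // ["41399579391950848"]
});
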
The fact that secondary indexes ensure uniqueness is hard to overstate, it
allows for some handy things. Consider a tweet that mentions the same user
multiple times, a many-to-many relationship. The application doesn't have
to care about making sure the list contains unique values, the index will do
that for you. You can just append new values to the list without having to
look at it first to check for duplicates.
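
For example, re-saving the tweet with a duplicated mention requires no checks on our side; Riak stores the value only once in the index (a sketch reusing the tweet object from above).
// 'frank06' appears twice in the list, but the index keeps it only once.
riak.save('tweets', '41399579391950848', tweet, {index: {
  mentions: ['riakjs', 'frank06', 'frank06'],
  tweeted_at: tweet.tweeted_at
}})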

How Does Riak 2i Compare to Riak Search?


There's some intersection between what Riak Search and Riak 2i have to
offer. But where are the main differences?
First of all, 2i pushes indexing into the application whereas Riak Search does
the hard work for you. 2i is therefore much better suited for simple data on
which you do only simple key lookups.
Riak Search is a lot more flexible when it comes to analyzing any kind of
data, given the right type of analyzer. But there's nothing that'd keep you
from doing all the analysis in your application and building bigger indexes
using Riak 2i. You can specify a whole lot of HTTP headers for a single
object.
One more important difference between the two approaches is the way they
partition data. Riak 2i stores indexes in the same data files as the objects,
whereas Riak Search uses a separate indexing backend called merge_index.
Riak 2i does what's called document partitioning. The index data for a single
object is subject to the same availability concerns as the object it belongs to.
Riak Search partitions index data based on terms, using a technique called
term partitioning. When a field is tokenized, it uses the resulting terms to
spread out the data across a cluster. That means the index for a single
document may be spread out across several physical nodes. But this way Riak
Search ensures that the index is pretty evenly spread across the cluster.
This boils down to a matter of trade-offs. When querying a 2i index, Riak
needs to query multiple virtual nodes in the cluster to get a meaningful result.
The minimum required set of virtual nodes that need to be available for this
are called a covering set. This corresponds roughly to one third of all virtual
nodes in the cluster, in a way that every partition is available to the query
system on at least one of its replicas.
In Riak Search, on the other hand, for a single term lookup it only needs to
contact one virtual node, the one responsible for that term. The downside
being that when you have lots of documents which include the same term,
the partition responsible for it becomes a hotspot, having an uneven amount
of responsibility compared to partitions carrying less common terms.


Another consequence is that when searching for multiple terms, Riak Search
has to fetch the results for all terms in the query first, querying all nodes
relevant to cover all requested terms, and then merge the results together.
Riak 2i adheres to the same replication settings as storing data in Riak does,
meaning an object that's replicated three times has a 2i index that's also
replicated three times. Riak Search does replication too, but all operations for
reads and writes don't use any quorum at all. Riak Search is still fault-tolerant
and recovers from failure, don't worry, but this is necessary to have clients
not block waiting for the indexing to finish. Writing to Riak Search is cheap
from the client's perspective, but still expensive to do on the server.
We'll look at when to use which approach in the next section, throwing in
MapReduce for good measure.

Riak Search vs. Riak 2i vs. MapReduce


Riak has a lot of ways to query and analyze your data, but when is it
appropriate to use them? There's no easy answer, but you can get a feel for
the implementations and their differences pretty quickly.
Riak Search is an obvious candidate if you need to analyze larger amounts
of text that go beyond simple values like names; our tweet bodies are a
good example. Even though they're only 140 characters in size, they're much
too varied to be useful for a simple index. Plus, different languages usually
require different ways to analyze the text.
Search is a clear winner for this kind of data, where the structure of text is not
fully known upfront, and its complexity goes beyond a simple index lookup.
If you need to query texts in some fuzzy manner, allowing you to only search
for partial terms and influence the weight of the data using boosts, it's also
clear that Riak Search is the way to go. Site searches come to mind.
Riak Secondary Indexes lend themselves as an alternative to traditional
indexing in relational databases. Your application has specific lookup
patterns like querying by email address, looking up user names, or fetching
ranges of objects based on a date range. If that's the case, secondary indexes
are for you. Certainly, they add a bit more complexity to the application by
pushing indexing to the developer, but that's a trade-off worth making if
simple yet scalable indexing and querying is what you need.
MapReduce is a tool for analyzing a specific set of data further, by
transforming, grouping, counting, or aggregating data, storing results back
into Riak, and in general everything that requires more dynamic approaches
for looking at the data at hand.
A specific dataset is an important thing to keep in mind here. MapReduce in
Riak is not exactly the same as Hadoop, where you can practically analyze
an infinite amount of data. In Riak, data is pulled out of the system to be
analyzed. That alone sets the boundaries of what's possible. The more
focused you keep the set of data you feed into Riak's MapReduce, the better
the two of you will get along.
For more ad-hoc style analysis you can easily utilize Riak Search and Riak
2i to narrow down the dataset for MapReduce, and you should prefer that
approach any time over loading entire buckets of data.

How Do I Index Data Already in Riak?


A question that comes up regularly is how to index data that's already stored
in Riak. You already have several millions objects in Riak, how do you build
indexes on that data?
There are at least two ways. One is to stream all the keys of objects relevant
for indexes and index them in one large batch. This will put considerable
load on the cluster for a while until all data is indexed. It's certainly one way,
but it's not ideal.
The other is to gradually update data as it is read. Next time a piece of data
is accessed, and it's not indexed yet, you kick off the indexing process. That
could work either directly, by way of the code that reads the data, or through
some external means, using a message queue doing background indexing.
This path has the downside that it takes time until all data is indexed. Plus, data
that's never read will never be indexed. If you're interested in only data that's
active from a user's perspective, that's not a big problem. But if you want
to look at all the data stored in Riak, you'll have to combine it with the first
approach to make sure everything is eventually indexed.
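
A minimal sketch of the second approach with riak-js could look like the snippet below. How you detect that an object still lacks its index entries depends on your application, so that check is left out; treat this as an outline under those assumptions rather than finished code.
// Re-save an object with index data the next time it's read.
riak.get('tweets', '41399579391950848', function(error, tweet, meta) {
  if (error) return;
  riak.save('tweets', '41399579391950848', tweet, {index: {
    username: tweet.user,
    tweeted_at: tweet.tweeted_at
  }});
});
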
In general, and I can't stress this enough, you need to think about access
patterns up-front. Riak is a key-value store for sure, but the indexing tools
are available. They come with trade-offs, but they give you great flexibility
to look at your data in ways beyond simple key-value access. Access patterns
should drive data modelling as much as possible.


Using Pre- and Post-Commit Hooks


There are scenarios where you want to run something before or after writing
data to Riak. It could be as simple as validating data written, for instance to
check if the JSON conforms to a well-known schema, or if it's written in the
expected serialization format. If the validation fails the code can fail the write,
returning an error to the client. The code could also modify the object before
it's written, for example to add timestamps or to add audit information.
Another use case we already came across with Riak Search is to update data
in a secondary data source with the data just written. Riak Search updates its
search index before the data is written to Riak, failing the write if indexing
caused an error. That way your application knows right away if there are
problems with either the data or your Riak setup.
This feature is called a pre-commit hook, and it's run before Riak sends the
object out to the replicas, allowing the hook to control if the write should
succeed or if it should fail. A pre-commit hook can be written in JavaScript
or Erlang.
A post-commit hook, on the other hand, is run after the write operation is
done, and the node that coordinates the request has already sent the reply to
the client. What you do in a post-commit hook has no effect on the request as
a whole. You can modify the data but you'd have to explicitly write it back to
Riak again. Post-commit hooks can be used to update external data sources,
for example a search index, to trigger notifications in a messaging system, or
to add metrics about the data to a system like Graphite, Ganglia, or Munin.
Post-commit hooks can only be written in Erlang.
Let's walk through some examples.

Validating Data
The simplest thing that could possibly work is a JavaScript function that
checks if the data written is valid JSON. To validate, the function tries to
parse the object from JSON into a JavaScript structure. Should parsing the
object fail, the function returns a hash with the key fail and a message to the
client. Alternatively, the function could just return the string "fail" to fail
the write.
If parsing succeeds, it returns the unmodified object. To make the code
easier to deploy later, it's wrapped into a Precommit namespace and assigned
to a function variable validateJson, so we can call the method as
Precommit.validateJson(object).


var Precommit = {
validateJson: function(object) {
var value = object.values[0].data;
try {
JSON.parse(value);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
}
}

There is a problem with this code. Pre-commit hooks are not just called
for writes and updates, they're also called for delete operations. When a
client deletes an object, the pre-commit hook will waste precious time trying
to decode the object. Riak sets the header X-Riak-Deleted on the object's
metadata when it's being deleted.
To work around this particular case, we'll extend the code to exit early and
return the object when the header is set.
Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
JSON.parse(value.data);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error}
}
}
}

Unlike MapReduce, JavaScript code for pre-commit hooks needs to be
deployed on the Riak nodes. Further down below you'll find a section
dedicated to deploying custom JavaScript.

Enabling Pre-Commit Hooks


Given you've deployed the code, we can now tell Riak to use the pre-commit function for a bucket. Like so many other settings in Riak, commit
hooks are defined in the bucket properties. To enable the above function for
the tweets bucket, we can run the following command.
$ curl -X PUT localhost:8098/riak/tweets \
-H 'Content-Type: application/json' -d @-
{
"props": {
"precommit": [{"name": "Precommit.validateJson"}]
}
}

You can specify a list of functions, they'll be chained together for every
write. For JavaScript functions, a simple hash with a key name and the name
of the function is all you need.
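
Using riak-js instead of curl, enabling a chain of two hooks could look like the snippet below; Precommit.addTimestamps is a hypothetical second function, only there to show the list format.
riak.updateProps('tweets', {precommit: [
  {name: 'Precommit.validateJson'},
  {name: 'Precommit.addTimestamps'} // hypothetical second hook
]})
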
To validate that it works, let's throw some invalid JSON at Riak.
$ curl -v -X PUT localhost:8098/riak/tweets/123 \
-H 'Content-Type: application/json' \
-d 'abc'
< HTTP/1.1 403 Forbidden
Parsing the object failed: SyntaxError: JSON.parse

Riak returns a HTTP status code 403 and the error message that's generated
when parsing the data failed. To make sure valid JSON is still accepted, let's
test the positive case too.
$ curl -v -X PUT localhost:8098/riak/tweets/123 \
-H 'Content-Type: application/json' \
-d '{"id": "abc"}'
< HTTP/1.1 204 No Content

Works as expected. The code to validate your data can be of arbitrary
complexity. But the same caveats as for MapReduce with JavaScript apply:
you won't get the best possible performance out of it, as data will be serialized
into JSON before Riak calls out to the external JavaScript engine.
Let's rewrite the JSON validation in Erlang.

Pre-Commit Hooks in Erlang


The preconditions for a pre-commit hook written in Erlang are the same
as for JavaScript: it must be a function that accepts a single argument. If it
returns the same object, the write succeeds. If it returns a tuple consisting of
the term fail and a string message, the write fails. Below is a rewrite of the
JavaScript JSON validator in Erlang.
-module(commit_hooks).
-export([validate_json/1]).
validate_json(Object) ->
try
mochijson2:decode(riak_object:get_value(Object)),
Object
catch
throw:invalid_utf8 ->
{fail, "Parsing the object failed: Illegal UTF-8 character"};
error:Error ->
{fail, "Parsing the object failed: " ++
binary_to_list(list_to_binary(
io_lib:format("~p", [Error])))}
end.

The code uses mochijson2 to decode the object into an Erlang structure.
mochijson2 can be a bit more specific as to why parsing failed, in particular
when it finds invalid UTF-8 characters not allowed in JSON.
You can compile this code in the Erlang console, much like the code in the
section on writing a custom analyzer for Riak Search.
To set an Erlang function as a pre-commit hook, the format of the bucket
property is a bit different. Instead of a name key, a mod and a fun key
must be specified.
$ curl -X PUT localhost:8098/riak/tweets \
-H 'Content-Type: application/json' -d @-
{
"props": {
"precommit": [{"mod": "commit_hooks", "fun": "validate_json"}]
}
}

Modifying Data in Pre-Commit Hooks


We already have a deserialized view of the data that are about to be written.
Why not put that to good use and add some metadata to it, the current time
would be handy to have to see when an object was last updated. On the
other hand, that information is already added by Riak. Like any good HTTP
citizen, you can already get the time the object was last modified from the
Last-Modified header.


Let's do something more interesting instead, for instance add automatic
indexes for Riak 2i. Since we have access to all the metadata about an object
and can update it too, we can easily run a loop over all attributes and add
indexes for all of them. We don't even have to add HTTP headers, the
metadata contains a hash with all the indexes based on their name and values.
Below is a rewrite of the validation function that also extracts attributes into
secondary indexes. I introduced a method extractIndexes that dumps all
key-value pairs in the JSON document into a hash structure of indexes and
values. For simplicity it only uses binary index types, I'll leave adding support
for numeric indexes to you.
The result is {"id_bin": "abc"} for our example object from the previous
section. It's assigned to the index key in the object's metadata and returned
with the modified object.
var Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
var data = JSON.parse(value.data);
object.values[0]['metadata']['index'] =
this.extractIndexes(data);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
},
extractIndexes: function(data) {
var indexes = {};
for (var key in data) {
var name = key + '_bin';
indexes[name] = data[key];
}
return indexes;
}
}

Don't forget to restart Riak after you've updated the script file.
If you want to modify the object's data, you deserialize it first, for instance
from JSON into a JavaScript object, modify it as needed, and then write
it back to the object. For the sake of completeness, let's add a version that
inserts the current time to show, in the data, when the object was last
updated.
var Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
var data = JSON.parse(value.data);
data['updated_at'] = new Date().toString();
object.values[0].data = JSON.stringify(data)
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
}
}

Accessing Riak Objects in Commit Hooks


Sometimes you may want to read data in a pre-commit hook, for instance
to figure out what has changed or to keep an audit log to track changes over
time, maybe in a secondary data structure stored in a different bucket. Before
we dive into details, some thoughts on this.
Reading and writing data increases latency of the main request. Every read
or write you do adds another request and more latency to the client's write
request. If you read and then write another object, you effectively triple the
amount of time required to finish the request. As a first rule, it's wise to
avoid doing requests in a pre-commit hook. Consider doing them in a post-commit hook instead, as that won't block the main request.
Putting that bit aside, you can only interact with Riak from Erlang code.
You can't request objects from within JavaScript, as there is only one entry
and one exit point when Riak calls out to the SpiderMonkey JavaScript
sandbox.
With both caveats in mind, we can write some Erlang code that stores all
changes to a secondary Riak object, which serves as an audit trail. The code
uses the same local Riak client as we've already used to write and run
MapReduce functions on the Riak console.


-module(commit_hooks).
-export([audit_trail/1]).
audit_trail(Object) ->
Key = riak_object:key(Object),
Bucket = <<"audit_trail">>,
{ok, Client} = riak:local_client(),
AuditTrail = case Client:get(Bucket, Key) of
{error, notfound} -> [];
{ok, AuditObject} ->
{struct, [{<<"trail">>, Data}]} =
mochijson2:decode(riak_object:get_value(AuditObject)),
Data
end,
Entry = [{struct, [{get_timestamp(),
mochijson2:decode(
riak_object:get_value(Object))}]}],
UpdatedAudit = lists:append(AuditTrail, Entry),
Json = list_to_binary(mochijson2:encode({struct,
[{<<"trail">>, UpdatedAudit}]})),
AuditObject2 =
riak_object:new(Bucket, Key, Json, "application/json"),
Client:put(AuditObject2),
Object.
get_timestamp() ->
{Mega,Sec,Micro} = erlang:now(),
list_to_binary(integer_to_list(
(Mega*1000000+Sec)*1000000+Micro)).

Audit data is stored in the audit_trail bucket, for simplicity's sake. The
code could easily be adapted to use a separate bucket for audit data based on
the bucket of the audited data.
After fetching the local client to talk to Riak, the function tries to fetch the
audit object from Riak. If the data can't be found, which branches into the
clause {error, notfound}, the case statement returns an empty list. If it was
found, it uses the mochijson2 module to decode the value, a serialized JSON
object and extracts the list, stored in the trail attribute. This particular code
uses the magic of pattern matching to extract the list from the data structure
returned by mochijson2:decode().
Given an audited Riak object {"id": "abc"}, here's an example of the JSON
data structure stored in Riak for the audit trail.


{"trail": [{"1335106786461071": {"id": "abc"}}]}

It's a simple list with entries indexed by a microseconds timestamp, and it
always stores the full object. An example of the resulting data structure when
decoded by mochijson2 into an Erlang tuple is shown below.
{struct, [{<<"trail">>,
[{<<"1335106786461071">>,{struct,[{<<"id">>,<<"abc">>}]}}]
}]}

The corresponding code in the function then appends the current entry to
the trail, converts the data structure back to JSON, and writes the object back
to Riak.
Note that the code doesn't bother with handling siblings. The ideal version
would respect them and merge the data structures together. The way the
data is laid out here is already fully suitable to just take two different versions
of the list and merge them together. Below you'll find an entire section
dedicated to designing data structures for Riak.
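
To make that idea a bit more tangible, here's a client-side sketch in JavaScript that merges two diverged trail lists: concatenate them, drop duplicate timestamps, and sort chronologically. It's not part of the Erlang hook above, just an illustration of why this layout merges nicely.
// Merge two audit trails: each entry is {"<timestamp>": <object or null>}.
function mergeTrails(trailA, trailB) {
  var seen = {};
  return trailA.concat(trailB).filter(function(entry) {
    var timestamp = Object.keys(entry)[0];
    if (seen[timestamp]) return false;
    seen[timestamp] = true;
    return true;
  }).sort(function(a, b) {
    return Object.keys(a)[0] < Object.keys(b)[0] ? -1 : 1;
  });
}
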
While this code is suitable both as a pre- and post-commit hook, I'd
recommend only using it as a post-commit hook for the reasons outlined
above. If auditing is a strict requirement though, and the added latency is
acceptable, the function audit_trail returns the updated object when done,
as required for a pre-commit function.
One little addition we can make is handling deleted objects. When an object
gets deleted, we won't add the data structure, but assign null to the
timestamp. That way it's easy to add more entries should the object be recreated later on.
audit_trail(Object) ->
Key = riak_object:key(Object),
Bucket = <<"audit_trail">>,
Metadata = riak_object:get_metadata(Object),
Deleted = dict:is_key(<<"X-Riak-Deleted">>, Metadata),
{ok, Client} = riak:local_client(),
AuditTrail = case Client:get(Bucket, Key) of
{error, notfound} -> [];
{ok, AuditObject} ->
{struct, [{<<"trail">>, Data}]} =
mochijson2:decode(riak_object:get_value(AuditObject)),
Data
end,
Entry = case Deleted of
false -> [{struct, [{get_timestamp(),
mochijson2:decode(
riak_object:get_value(Object))}]}];
true -> [{struct, [{get_timestamp(), null}]}]
end,
UpdatedAudit = lists:append(AuditTrail, Entry),
Json = list_to_binary(mochijson2:encode(
{struct, [{<<"trail">>, UpdatedAudit}]})),
AuditObject2 =
riak_object:new(Bucket, Key, Json, "application/json"),
Client:put(AuditObject2),
Object.

The updated version checks if the X-Riak-Deleted header is set in the
object's metadata. If it is, it sets the new audit trail entry to null instead of
using the object's deserialized contents.
While the above example is aimed at commit hooks, the part that accesses
and modifies data serves as a good example that you could also use in Erlang
MapReduce functions.

Enabling Post-Commit Hooks


To enable a post-commit hook, the procedure is the same as with pre-commit hooks, just the key in the bucket properties is different.
$ curl -X PUT localhost:8098/riak/tweets \
-H 'Content-Type: application/json' -d @-
{
"props": {
"postcommit": [{"mod": "commit_hooks", "fun": "audit_trail"}]
}
}

Deploying Custom Erlang Functions


Now that we've written a bit of custom Erlang code, it's about time we'd
look into deploying it on Riak nodes. I'm sure you'll agree that opening the
Riak console and manually recompiling the Erlang source files is not a great
option.
The first part of automating this is to find an easy way of compiling the
source files. Assuming your code is checked into a source control system and
is checked out on every machine, it's easy to compile it. Erlang comes with a
built-in mechanism to compile all .erl files into their corresponding .beam
files. The resulting .beam files are bytecode ready to be loaded by the Erlang
VM.
You can try this out yourself in the example code repository. It contains a
folder commit-hooks, which in turn contains some Erlang source files. In
that directory, type erl -make. This compiles all files that haven't been
compiled yet and recompiles source files that are newer than their bytecode
counterparts. If you don't have Erlang installed separately on your nodes,
you can use the version shipped with Riak. Shown below is a sequence of
commands using the Erlang installed by Riak on a Ubuntu system to compile
the example code.
$ cd nosql-handbook-examples/08-riak/commit-hooks
$ /usr/lib/riak/erts-5.8.5/bin/erl -make

The second part is to make Riak aware of the compiled files. You can
configure additional paths to load into the Erlang VM in app.config. In the
section for riak_kv, add the following lines before the end of the section.
This example assumes the code is checked out and compiled in the home
directory of a user deploy.
,{add_paths, [
"/home/deploy/nosql-handbook-examples/08-riak/commit-hooks"
]},

You can specify any number of directories, just add more to the list, as shown
below.
,{add_paths, [
"/home/deploy/nosql-handbook-examples/08-riak/commit-hooks",
"/home/deploy/analyzers"
]},

There's one neat thing that Riak allows you to do. When you have specified
a number of directories in your app.config, and the code changes and gets
recompiled, you don't even need to restart Riak to reload updated Erlang
code. There's a handy command that makes Riak reload all Erlang .beam files
in directories specified in the add_paths section: riak-admin erl_reload.
You do however need to restart Riak every time you change the app.config
file.


It only took four simple steps to make Riak aware of our custom Erlang code.
All of them are easy to automate so that you can update and recompile the
source files when necessary.

1. Check out the Erlang source code
2. Compile the source code into BEAM files
3. Update app.config to point Erlang to the code directories
4. Reload the BEAM files with riak-admin erl_reload (Riak v1.1 or newer), or restart Riak (prior to v1.1)

Updating External Sources in Post-Commit Hooks


Another example of using a post-commit hook is to update an external data
source. This could be a search index, for instance based on Solr or
ElasticSearch, or a graphing system like Graphite or Librato Metrics. The
latter is useful to track more specific metrics on a per-bucket basis, to track
updates to specific types of objects.
Another thing you could do is trigger messages in a broker like RabbitMQ,
for instance to update a search index using a separate set of workers, fully
asynchronous. For inspiration, have a look at Jon Brisbin's Riak post-commit
hooks for RabbitMQ.

Riak in its Setting


We've looked at a lot of features Riak has to offer, but its real strength is
running in a cluster, happily serving your data no matter what. In this section
we'll look at how you can build your own cluster, and how you deal with
consistency, conflicts, and most importantly, failure.

Building a Cluster
So far we've only worked with a single node, a Riak instance running on
your local computer. While it's a credit to Riak itself that it's so easy to get
started with, it really shines in a clustered environment. It's easy to increase
your database's capacity just by adding more nodes. Adding more nodes
doesn't require any manual intervention on your end, all the reshuffling
of data is done automatically, thanks to the magic of consistent hashing
and partitioning. Everything we went through so far is no less valid with
multiple nodes than it is with just one, only with the added fun of a
networked environment.


Adding a Node to a Riak Cluster


While it is certainly possible to use the binary packages you downloaded to
add more nodes, it's easier to follow along if you use EC2 or Rackspace to
create yourself some servers. They don't need to be big for our purposes.
Adding a new node involves four simple steps:
1. Install Riak
2. Update the configuration with the right hostname
3. Start Riak
4. Tell the Riak node to join a cluster

We've been through the process of installing and starting Riak, so we'll go
through the other two steps involved.

Configuring a Riak Node


Now is the time to open the other configuration file, vm.args, as it specifies
the node's name. Erlang's way of communication relies on unique host
names combined with process names. By default, this is set to
riak@127.0.0.1, which is not exactly tuned for a clustered environment.
You'll want to just keep the name riak, but it can certainly be changed. That
usually makes more sense when you have more than one Riak cluster in your
environment.
The relevant line is right at the beginning:
# Name of the riak node
-name riak@127.0.0.1

Depending on your local networking scheme, change it to look something
like shown below. It doesn't have to be an IP address, a full hostname works
just the same.
-name riak@192.168.2.25

And that's all you need to change to start adding more nodes. Rinse and
repeat for every new node. Give every node a proper host name, and start
them. In a production environment, make sure these steps are fully
automated. The beauty of Riak is that adding a node to an existing cluster
involves no work that needs manual intervention. So do yourself and your
infrastructure a favor, use something like Chef or Puppet to do the work for
you.


Assuming you did that, every node is still on its own. It keeps its own ring
configuration until you tell it to join another cluster (which only needs to be
one other node).

Joining a Cluster
The final step is to join the node with an existing cluster. For that, you
use the command riak-admin join, specifying the full Erlang node name.
Assuming you have another node already running on IP 192.168.2.24, here's
how to tell the new node to join its ring.
$ riak-admin join riak@192.168.2.24

What happens now is neither mystery nor a secret, so let's have a look at a
what defines a Riak node in a cluster, and what happens when a node joins a
cluster.

Anatomy of a Riak Node


A Riak node is more than just a single part of a cluster. In fact, a Riak node is
an entire cluster, acting to the outside world the same way as a cluster with
multiple nodes does.
If you remember consistent hashing and partitions, that's a) a good thing
and b) the basis for all this. When you start a Riak node it creates a number
of partitions, based on the ring_creation_size option, which defaults to
64. Every partition is handled by a separate process in Erlang. Processes are
lightweight entities that can act and be supervised independently of other
processes. So you end up with 64 processes, every one of them serving data
for a different partition in the ring.
For every replica (remember the N value) another process is created. So
with three replicas and 64 partitions, you end up with 192 processes solely
responsible for serving writes and reads for a replica of a partition. All this
magic is done by a library called Riak Core, which was extracted from Riak.
Diving into it is beyond the scope of this book, but what we're going
through here should get you a basic idea of what's happening.
There are a lot of other processes flying around our Erlang process, but
we'll focus on the 192 serving data. When you request data from Riak's web
API, Riak Core finds out which node is responsible for the requested data
by running a hash function on the bucket-key combination and figuring
out the processes. We'll refer to them as vnodes (as in virtual nodes) from
now on, and they're responsible for the data. Riak Core keeps a so-called
preference list of partitions and the vnodes assigned to them, so that lookup
is easy enough. The vnodes can be local or remote; as far as Erlang is
concerned, the means of communication are the same.
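
To make the idea of mapping a bucket/key pair onto the ring a little more concrete, here's a rough sketch in JavaScript. It mirrors the principle (hash the bucket and key, divide the hash space into equally sized segments) rather than Riak Core's actual implementation, so the details are assumptions.
// Which of 64 partitions would this key land on? Illustrative only.
var crypto = require('crypto');

function partitionFor(bucket, key, ringSize) {
  var hash = crypto.createHash('sha1').update(bucket + '/' + key).digest('hex');
  var position = parseInt(hash.slice(0, 8), 16); // top 32 bits are plenty here
  var segmentSize = Math.pow(2, 32) / ringSize;
  return Math.floor(position / segmentSize);
}

console.log(partitionFor('tweets', '41399579391950848', 64));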

What Happens When a Node Joins a Cluster


When a new node joins the cluster, it fetches the existing preference list
from the node it's told to join. It picks a random number of partitions, until
the number of partitions per node is close to or equal across all nodes in the
cluster. It also makes sure that it doesn't have multiple replicas of the same
partition to ensure that replication actually ensures physical separation of
replicas.
The result is an updated preference list which it then sends to a random node
in the cluster. As the vnodes regularly check if they're on the right physical
node, the ones picked by the new node will eventually realize they're not
responsible for a particular partition anymore. They'll start a process called
hinted handoff, which reshuffles data in the cluster so that proper ownership
of data is ensured.
Depending on the amount of data in your cluster, this process takes time and
is mostly limited by physical resources like network and bandwidth capacity.
After all, it's not an easy feat to shuffle around dozens of gigabytes of data.
The good news is, there's no human intervention involved, Riak is built with
operational ease in mind. As a general rule of thumb, you don't want to add
or remove nodes to or from a cluster that's already under heavy load, because
it puts even more load on it. But thanks to some recent changes in Riak, it's
not an impossible thing to do.
When that part is done, the node is ready to serve data.

Leaving a Cluster
Pretty much the same happens when a node leaves a cluster, but in reverse.
There are two ways to remove a node. One is run on the node that's to be
removed, the other can be run from any node in the cluster. The latter is
useful in case a node experiences a hardware failure and needs to be taken
into servicing.
To have a node leave the ring, simply run the command riak-admin leave
on that node. That makes the node drop ownership of all the partitions it
held, again making sure that the preference list is evenly spread out across the
cluster. After it transmitted the new preference list and handed off data to the
new owners, it shuts down and is ready to be decommissioned.
If you're on a different node, you can run riak-admin force-remove
riak@192.168.2.24, where you specify the complete Erlang node name as
configured in vm.args. Be aware that you lose all data replicas on that node,
and there is no handoff of data, which is no surprise as you consider the target
node to be unrecoverable, which is why you resorted to using this rather
drastic measure in the first place.

Eventually Consistent Riak


Now that we got a lot of stuff out of the way, it's time to look at Riak for what
it really is, an eventually consistent database with all the bells and whistles to
handle and tune consistency. In the following sections we'll look at what it
means to work with Riak in a clustered environment.

Handling Consistency
So far we haven't talked about handling consistency in Riak at all. All writes
that dumped tweets into Riak used the default quorum of 3. That means data
written is replicated to exactly three nodes in the cluster during the write
operation, a sensible enough default for most cases.
You can tune the quorum for data in a bucket in general, and you can set a
quorum for every read or write request, the magical R and W values. The
quorum is an additional parameter for each request, and all client libraries
allow you set it, so let's look at how to do it with riak-js.

Writing with a Non-Default Quorum


Writing with a non-default quorum means requiring less than the default
number of replicas to successfully accept a write before the entire request is
returned as a success to the client. It's important to realize that, internally,
Riak will still send the write request to all replicas, but doesn't wait for all of
them to return before finishing the request. You're trading off lower latency
during the write process for the possibility of data not being fully replicated
yet to all nodes before the next client is reading it. If we want to only write to
two nodes, we can specify a parameter w. Let's play with our original tweet,
specifying a lower quorum. In a mass archival scenario like this it's acceptable
to trade off consistency for lower latency.


var tweet = {
'user': 'roidrage',
'tweet': 'Using @riakjs for the examples in the Riak chapter!',
'tweeted_at': new Date(2011, 1, 26, 8, 0).toISOString()
};
riak.save('tweets', '41399579391950848', tweet, {w: 1})

Note that I adapted the object to use the ISO style date format, so this is now
a valid update of an existing object, only with a consistency setting lower
than the default. All this is pretty straightforward, but it gets much more
interesting when data is read.

Durable Writes
Riak supports more than one write consistency setting. Both W and N
specify a quorum that only requires the replica nodes to accept the data,
but not necessarily to write the data to their storage backends, which is what
would make the write truly durable.
To work around this, Riak supports a durable write setting, which defaults
to the same value as N. A durable write requires the storage backends, which
write the data to disk, to acknowledge that they in fact have done so. This
setting is a trade-off between latency and durability. The setting can be
lower than the W value, if you can live with the fact that only a DW number
of nodes has physically written the data to disk before the request returns to
the client.
Just like W, durable write is both a setting per bucket and per write request.
There's no equivalent for read requests, read requests will have to go through
the storage backend to get the data anyway.
Again, trading a slightly lower durability setting for reduced latency works to our
advantage in the tweets scenario. We're more interested in keeping latency
low than the data being fully consistent immediately. To specify a different
setting for durable writes, use the parameter dw. We're lowering our
durability expectations to just one replica that accepted the write in a durable
fashion.
riak.save('tweets', '41399579391950848', tweet, {dw: 1})

Note that durability is still a different beast for every storage backend
supported by Riak. Depending on which one you use, and how it's
configured, the process of writing the data to disk might still be delayed, for
example to reduce disk I/O to a burst every 100 ms.

Primary Writes
As if those weren't enough quorums, Riak supports a third one called a
primary write quorum. It specifies the number of primary replicas that need
to accept a write for it to be successful.
Remember that Riak keeps a preference list of the whole cluster around.
When it picks the replicas based on consistent hashing, it defaults to using
the first N in the list, with N being the default quorum. The first N nodes are
called the primary replicas for a particular key.
If one of the primary nodes is not available, the coordinating node sends the
request to a secondary node. It picks the next node in the list and uses it as
a temporary store until the primary node becomes available again. This is
called a sloppy quorum, because Riak doesn't actually fail when you write
with a quorum of 3 and only two of the primary replicas are available. It
temporarily stores the data on a secondary node instead and doesn't actually
enforce the quorum based on primary replicas.
With a primary write you can force a write to go to primary nodes only or
otherwise fail. You may argue this forfeits the entire purpose of Riak. Write
availability is, after all, what it's all about. But there are scenarios where the
stronger consistency guarantees of a primary write are preferable.
In use cases where applications have to read their own writes, a successful
primary write followed by a primary read guarantees to return the data
you've just written. With a sloppy quorum, consider your write went to two
primaries, one of them partitions right after the write. A subsequent read
with R = 1 goes to a secondary node which doesn't know anything about the
data, so it doesn't return anything. A primary write followed by a primary
read would guarantee this doesn't happen, at least as long as PW + PR > N
holds.
To make a primary write, you specify the pw flag; primary reads use the pr flag analogously.
riak.save('tweets', '41399579391950848', tweet, {pw: 3})
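
Assuming riak-js passes the option through to the HTTP API like the other quorum settings (an assumption on my part), a primary read with a quorum of two looks like this.
riak.get('tweets', '41399579391950848', {pr: 2})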

Tuning Default-Replication and Quorum Per Bucket


Every bucket can have a different setting for the number of replicas in the
cluster. When you do a write or a read without a specific setting, Riak will
resort to the bucket configuration to determine the number of replicas.
Finding out the default is easy, you just curl the bucket's URL to get the
properties currently set for it.

$ curl localhost:8098/riak/tweets
{
"props": {
"name": "tweets",
"w": "quorum",
"notfound_ok": true,
"young_vclock": 20,
"pr": 0,
"postcommit": [],
"rw": "quorum",
"chash_keyfun": {
"mod": "riak_core_util",
"fun": "chash_std_keyfun"
},
"big_vclock": 50,
"precommit": [{}],
"last_write_wins": false,
"small_vclock": 10,
"r": "quorum",
"pw": 0,
"old_vclock": 86400,
"n_val": 3,
"linkfun": {
"mod": "riak_kv_wm_link_walker",
"fun": "mapreduce_linkfun"
},
"dw": "quorum",
"allow_mult": true,
"basic_quorum": false,
"search": true
}
}

Amidst all the properties are the relevant options n_val, r, rw, dw, and w.
Updating the configuration is just as easy, send an updated JSON that
contains the new values you'd like the bucket to have. To set a different
n_val, which corresponds to the N value for replication, send a PUT to
the bucket's URL with JSON containing just the new value. It doesn't have
to contain any other value, you can update single values in the bucket
properties, so you don't have to specify all properties every time. You do
need to specify the content type for JSON though, and numbers need to be
JSON numbers too, otherwise you may end up seeing multiple values for the
same numeric property.


$ curl -X PUT localhost:8098/riak/tweets -d @- \
-H "Content-Type: application/json"
{"props": {"n_val": 3}}

Alternatively you can set the values to "quorum", "all", "one", or any
integer which is lower than n_val, as it makes no sense to expect responses
from more replicas than you actually have.
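
The same update can be done with riak-js, which we've been using throughout; here's a sketch that bumps the default write and durable-write quorums while keeping three replicas.
riak.updateProps('tweets', {n_val: 3, w: 2, dw: 1})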

Choosing the Right N Value


While the default N value is reasonable for a lot of cases, there are situations
where you need more or less durability of your data. Using Riak as a session
store for a web application is one scenario, where a replication factor of 2
would still be reasonable enough. The tradeoff is in the value of the data
stored in Riak. The more valuable the data is the more you should treat it as
such and choose a higher N value.
Truth be told though, there aren't many known instances, if any, where a
user deliberately chose an N value higher than 4.

Reading with a Non-Default Quorum


The quorum gets much more interesting when reading, especially in cases
where you choose a write quorum lower than N or when a node fails, or
in cases where the data is not consistent across all replicas. You specify the
quorum for read requests in the same way as you do for writes, using the
parameter r.
riak.get('tweets', '41399579391950848', {r: 1})

Just like the W value, using a different R is a tuning knob for consistency vs.
latency. A read request will always send reads to all replicas of a piece of data,
but will return to the client as soon as a sufficient number returned the value.
The magic formula is now in full effect. You can have your consistency cake
and eat it too, when both your W and your R values add up to be higher
than N. Given an N value of 3, and W and R having a value of 2 each, you'll
always get consistent data. If you happen to read, and this is pretty much out
of your control, from the same nodes you just wrote to, you're bound to end
up with the same data. If you happen to read from the one node that may
not have gotten the data yet, latency be damned, a process called read repair
ensures that data is consistent across all nodes once again, and the powers are
at peace again.
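
To make that concrete, here's a sketch with riak-js: write with W = 2, then read with R = 2, so the two quorums overlap given the default N of 3.
riak.save('tweets', '41399579391950848', tweet, {w: 2}, function(error) {
  if (error) return console.log(error);
  // With W + R > N (2 + 2 > 3), this read is guaranteed to see the write.
  riak.get('tweets', '41399579391950848', {r: 2}, function(error, obj) {
    console.log(obj);
  });
});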


Read-Repair
Read repair is a process that makes sure that all nodes have updated data. It
always kicks off after a read request has returned to the client.
Read repair does two things, the first is making sure that data is consistent
across all replicas of a piece of data, the second is ensuring that conflicts are
handled. Conflicts occur when two or more clients updated the same data,
but without being aware of each other's changes. This process involves a lot
of things, so it's well worth devoting an entire section to it.

Modeling Data for Eventual Consistency


Eventual consistency is what drives Riak's distribution and concurrency
model. It's assumed that, when conflicts arise, they'll be sorted out just as
much as updates will eventually be propagated to all replicas. But other than
replication, resolving conflicts is left to the user, and is sometimes cause for
confusion. To resolve conflicts, we need to look at how to model our data so
that reconciling concurrent updates to the same data is easy.
Handling conflicts is something that's ignored by default for every bucket.
If multiple clients write to the same key independent of each other, the last
write to the key wins. It overwrites the existing data.
So far we haven't bothered with conflicts, simply because tweets never get
updated. When you look at the stream coming from Twitter, only new
tweets show up. They may get deleted, but never changed.
Unless told otherwise, Riak uses vector clocks to determine which was the
latest write, and that one is the winner. You can make it even simpler by
letting the last write win, as determined by looking at the timestamp. All
these settings can be configured per bucket. When you look at a bucket's
properties, it has two settings relevant to this: last_write_wins and
allow_mult, both defaulting to false.
last_write_wins determines the winner of two or more conflicting writes by
looking at their respective timestamps and picking the newest update. So if
you want fire-and-forget writes, set last_write_wins to true.
allow_mult, on the other hand, has pretty big implications on both data
and application. With allow_mult set to true, whenever there are conflicts,

Riak keeps a copy of all the conflicting objects around. You don't even
have to work with a clustered Riak to get a taste of them, just two clients
independently updating the same data do the trick.


While we're going to look into how conflicts occur and how they can be
resolved, it's just as important to look at how you can design your data to be
able to resolve conflicting writes later. Data structures are just as important
when it comes to handling conflicts as resolving the conflicts themselves.
In fact, I would say they're even more important. Picking a winner might
be easy, but figuring out how to merge two diverged pieces of data back
together requires some thinking up-front.

Choosing the Right Data Structures


Let's assume for a minute that we want to update a tweet, changing one
attribute, while another process is updating a different one. Merging that
back together is pretty easy, you just merge the two attributes together.
But what if they changed the same attribute? How can you figure that out
without having access to the original version both are based on? Riak will
only keep the conflicts, but not the originals around.
So instead you need a way to track single changes, in a way that allows you
to restore all the actions on a piece of data later. It brings us back to Leslie
Lamport's paper on using versions to track the ordering of events, where he
uses clocks to determine their order in a distributed system.
Instead of single, independent writes, you start looking at writes as a series
of events, changing data over time. Using this way of thinking, our two
independent updates on the tweet instead turn into two events related to
the same data, and we start tracking these events alongside the original data.
Using this approach, we still have the original version and can always restore
the newest version by going through the list of changes and applying them.
This effectively turns the tracked changes into a time series. The question of
who logically wins a conflict is now pushed into the data itself and therefore,
the application.
There are several approaches on how to do that. You can store metadata for
changes in a separate Riak object, you can keep a list of changes together
with the original, or you can turn your objects into a simple list, where every
entry represents a new version of the object.
Which one you choose depends on your use case. If you have single objects,
like a tweet, you can keep a separate list of changes, either alongside the
tweet or in a separate object. Our tweet object turns into something like this,
notice the new changes attribute.


var tweet = {
username: 'roidrage',
tweet: 'A really long and uninteresting tweet.',
tweeted_at: '2011-11-23T14:29:51.650Z',
changes: []
}

In changes we can now keep tracking events. Whether those are partial or
complete objects, that's up to you, but partial changes certainly are easier on
storage. Let's assume we added an attribute for a location, here's how that
change could be reflected in the data structure.
var conflict1 = {
username: 'roidrage',
tweet: 'A really long and uninteresting tweet.',
tweeted_at: '2011-11-23T14:29:51.650Z',
location: 'Berlin, Germany',
changes: [{
attribute: 'location',
value: 'Berlin, Germany',
timestamp: '2011-11-23T14:30:21.350Z'
}]
}

We store the updated copy and record the change. Assume a second client
came along in the meantime, adding another new attribute to the original
object, maybe a biography for the user, resulting in an object as shown
below.
var conflict2 = {
username: 'roidrage',
tweet: 'A really long and uninteresting tweet.',
tweeted_at: '2011-11-23T14:29:51.650Z',
bio: 'FRESH POTS!!!!',
changes: [{
attribute: 'bio',
value: 'FRESH POTS!!!!',
timestamp: '2011-11-23T14:30:22.213Z'
}]
}

Now we have two data structures, both include a list of changes. That's
pretty neat, all we have to do now is take both changelogs, merge them
together, order them by time, and apply the changes. When our code applied
all the changes, it can discard parts of the changelog to keep the list from
growing indefinitely. But you need to make sure that the list is capable
of keeping enough information to restore the state even when a lot of
concurrent updates occur or when nodes are partitioned for a longer period
of time.
How many entries you keep in the list depends on the number of updates
in your system, there's no one size that fits all. If you have the potential for
several clients updating the same object at once (or even multiple times) over
a short period of time, your list should reflect that by keeping a longer history
of changes.
This very simple approach solves our problem pretty neatly, but comes with
some trade-offs, mainly duplication of data. It's an acceptable trade-off
though, and if we make sure the list of changes is capped, the gain is worth
it.
If we use both objects to resolve this conflict, all we need is code to
concatenate the two lists of changes, sort them by time and apply the changes
to the object, eventually restoring a state that represents all recent changes.
First thing we need to do is sort the list by the timestamp of each change.
var changes = conflict1.changes.concat(conflict2.changes);
changes = changes.sort(function(change1, change2) {
// sort ascending by timestamp, so the newest changes are applied last
if (change1.timestamp < change2.timestamp) {
return -1;
} else if (change1.timestamp > change2.timestamp) {
return 1;
} else {
return 0;
}
});

Even though we're using strings as timestamps, the code relies on
lexicographical ordering to sort the entries. Not perfect, but good enough.
You could use numeric timestamps instead. The sort order is ascending, to
have the newest changes last. Next up we apply every change to the object,
building on the data in conflict1.
var tweet = conflict1;
for (var i in changes) {
tweet[changes[i].attribute] =
changes[i].value;
}


This is pretty simplistic, but it works for our purposes. We can now store the
resulting object back in Riak, capping the collection to 10 elements before
we do so.
tweet.changes = changes.slice(-10); // keep only the 10 most recent changes
riak.save('tweets', '1', tweet);

Now, the fun part starts when you throw all this at a more meaningful
example, like building a timeline for every user we get tweets from. Before
we get to that though, let's look at what actually happens when conflicts in
Riak arise.

Conflicts in Riak
To enable support for proper tracking of changes and conflicts, we need to
enable the setting for the bucket first. The relevant setting is allow_mult,
and can be enabled using riak-js. Note that the new value needs to be a
proper boolean value, it's true not "true".
riak.updateProps('tweets', {allow_mult: true})

Now we can work off the examples above and create a conflict. I'm going
to keep it sequential for now, because the code would look slightly tangled
when looked at as a whole. The first write saves the initial tweet.
riak.save('tweets', '1', tweet);

We don't even need two consoles to create a conflict, running an update
based on the same version will do, but for the sake of making it more
obvious, it certainly helps.
Remember that every object has a vector clock associated with it. If we run
two updates based off the same vector clock, that's a conflict, no matter how
many clients and Riak nodes are involved. Here's what happens in the first
console.
// console 1
var tweet, meta;
riak.get('tweets', '1', function(error, obj, m) {
tweet = obj;
meta = m;
});
tweet.changes.push({
attribute: 'location',
value: 'Berlin, Germany',
timestamp: new Date().toISOString()
});
tweet.location = 'Berlin, Germany';
riak.save('tweets', '1', tweet, meta);

I'm assuming some time passes between fetching the object and saving it
back to Riak, time you can use to fetch the objects in both consoles, making
sure they both work off the same version, hence the serialized nature of the
code. If you want to be sure, log the vector clock to the console. It's available
in the metadata object that the code uses to store the object back to Riak.
The second client does pretty much the same, but adds a different attribute.
In your application's code, you're likely to have some functions
encapsulating tracking the single changes instead of applying them manually
all the time.
// console 2
var tweet, meta;
riak.get('tweets', '1', function(error, obj, m) {
  tweet = obj;
  meta = m;
});
tweet.changes.push({
  attribute: 'bio',
  value: 'FRESH POTS!!!!',
  timestamp: new Date().toISOString()
});
tweet.bio = 'FRESH POTS!!!!';
riak.save('tweets', '1', tweet, meta);

Siblings
What you have now in Riak are two versions of the objects. Before we look
at code, we can verify this using curl.
$ curl -v localhost:8098/riak/tweets/1
...
< HTTP/1.1 300 Multiple Choices
...
Siblings:
1nQGjhdebFwM5uBs9uNGz9
3sThgDXSJ78UdNPzkhg8Om

This may look like it's all output from curl, but it's Riak telling us that this
object has siblings. Siblings are two objects that are related to each other only
by their ancestor, the original tweet, identified by the vector clock we used
when saving the objects in both consoles. Every sibling is a full copy of the
object. If we continue updating based on the original vector clock we'll keep
creating more.
Riak creates siblings when two clients write different data based on the
same vector clock, or when a client provides a vector clock that's generally
different from the current line of vector clocks, so it has no ancestors in the
line of versions.
When requested via HTTP, Riak returns a 300 status code, which stands
for multiple choices. We can now use the vtags, which is what those odd
looking hashes are called, to fetch the sibling we're interested in by adding
them as a query parameter.
$ curl 'localhost:8098/riak/tweets/1?vtag=1nQGjhdebFwM5uBs9uNGz9'
{"username":"roidrage",
"tweet":"A really long an uninteresting tweet.",
"tweeted_at":"2011-11-23T14:29:51.650Z",
"changes":[{"attribute":"bio",
"value":"FRESH POTS!!!!",
"timestamp":"2011-11-24T20:26:24.838Z"}],
"bio":"FRESH POTS!!!!"}

That looks like one of our changes. Most clients don't go out and fetch all siblings one by one when they get a 300 status code; instead, they specify multipart/mixed in the Accept request header. The response has a Content-Type of multipart/mixed as well, so the clients know what to do with it. Here's the curl version of that request, output omitted because it's not pretty.
$ curl localhost:8098/riak/tweets/1 -H "Accept: multipart/mixed"

Now that we know how to access them, all we need to do is apply the code
above and reconcile the changes into one object again.

Reconciling Conflicts
Let's put this code together into something that can resolve the conflicts.
The next time an object is fetched the code checks if there are siblings and
reconciles them before returning to the client or doing whatever comes next.

When riak-js detects siblings, the object you get is an array, including all the
siblings and their metadata.
function reconcileConflicts(objects) {
  var tweet = objects[0].data;
  var changes = [];
  for (var i in objects) {
    changes = changes.concat(objects[i].data.changes);
  }
  changes = changes.sort(function(change1, change2) {
    if (change1.timestamp < change2.timestamp) {
      return -1;
    } else if (change1.timestamp > change2.timestamp) {
      return 1;
    } else {
      return 0;
    }
  });
  for (var i in changes) {
    tweet[changes[i].attribute] = changes[i].value;
  }
  return tweet;
}

riak.get('tweets', '1', function(error, obj, meta) {
  if (meta.statusCode == 300) {
    obj = reconcileConflicts(obj);
  }
});

The implementation still has some flaws. It blindly applies all writes as if they were replacing the original value. For arrays or deeper hash structures, that is not a great way of resolving conflicts; you could instead merge them together, including all new additions, rather than overwriting them.
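To illustrate, here's a rough sketch of a merge step that unions array values instead of replacing them, falling back to plain overwrites for everything else. The applyChange function is hypothetical, not something we built before.

// A sketch: union arrays instead of overwriting them, overwrite everything else.
function applyChange(tweet, change) {
  var current = tweet[change.attribute];
  if (Array.isArray(current) && Array.isArray(change.value)) {
    change.value.forEach(function(item) {
      if (current.indexOf(item) === -1) {
        current.push(item);
      }
    });
  } else {
    tweet[change.attribute] = change.value;
  }
}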

Modeling Counters and Other Data Structures


Based on the basic examples to track changes to objects over time we can
take things a step further and start modeling changes to more complex data
structures like arrays, hashes, or counters.
We've arrived at the question that comes up on the Riak mailing list and IRC
channel at least once a month: how can I increment or decrement a counter
in a JSON document stored in Riak?

The simple answer is: you can't. If you come to Riak with the same
expectations as for MongoDB and Redis, you will be disappointed. The
more complex answer is: of course you can, but it depends. Something as
simple as atomic counters gets complicated in a distributed, eventually
consistent environment like Riak. Systems like Redis or MongoDB are
different as they're both not eventually consistent, at least not in the same
way as Riak is.
To adapt our changes list for more complex operations, there's not a lot we
have to do. Simply adding a new attribute to store an operation does the
trick. Here's a version that tracks increments, the number of replies to this
tweet.
var tweet = {
  username: 'roidrage',
  tweet: 'A really long and uninteresting tweet.',
  tweeted_at: '2011-11-23T14:29:51.650Z',
  replies: 1,
  changes: [{
    attribute: 'replies',
    value: 1,
    operation: 'incr',
    timestamp: '2011-11-23T14:30:21.350Z'
  }]
}

Now, the confusing part starts when we think about multiple concurrent changes incrementing the same value. If we want to keep the counter as precise as possible, applying the same increment twice would skew the results. In general, you should get used to the idea that a counter may not be exactly precise all the time.
To merge the changes together, we can still use the common changes list.
This time, we have to remove all the duplicates that are common to all lists.
We also need to ignore the latest increment of the object that's used as the
basis for reconciling. The remaining operations can be executed on the value
in that object, bringing the counter up to par with all the changes.

Problems with Timestamps for Conflict Resolution


There is one problem with the timestamp approach. It didn't matter as much when we simply tracked attribute changes instead of entire operations. What if two clients set the same timestamp? Timestamps are tracked at the millisecond level, so it's not very likely, but it's not impossible either. It may just be that the clocks on two different application servers, which we consider clients in this scenario, are a few milliseconds apart and happen to strike the same millisecond when they do a write.
The deduplication falls apart when the code considers them to be duplicates; we're likely to lose single increments. The problem is that time alone only gives a partial order, not a logical order in terms of what the application expects. Maybe this sounds familiar: what we need here is exactly what vector clocks do.
To improve the scenario, every client gets a client identifier. That could be based on the current thread, process, or host, or be some random hash identifying a specific thread, process, or the client library object used by it. Instead of just storing the timestamp, the list of changes now also keeps the client identifier. Every client is trusted to follow this protocol and always use its own unique identifier.
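As a small sketch, a change entry could now look like the following. The client value is made up for this example, and how you derive the identifier (hostname, process id, a random hash) is up to you; the client field name is the one the deduplication code below compares.

// A change entry carrying a client identifier next to the timestamp.
tweet.changes.push({
  attribute: 'replies',
  value: 1,
  operation: 'incr',
  timestamp: new Date().toISOString(),
  client: 'app-server-1:4711'
});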
Now we have more data available to do a more reliable deduplication. Instead of dropping the changes that have the same timestamp, we drop the ones that have the same timestamp and client identifier. Note that it doesn't have to be a timestamp; it might as well be a version number that each client increments, which is closer to vector clocks. Anything that is gradually incremented can be considered a clock.
Vector clocks and version vectors rely on clocks (or versions) to be
incremented. Together with the client identifier the effect is the same, given
that each client always correctly increments its own counter. I leave it as an
exercise to you to build a version that uses Lamport's incremental versions
combined with client identifiers.
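As a starting point for that exercise, here's a minimal sketch of a Lamport-style clock per client. It only keeps the counter in memory, so a real implementation would have to persist it somewhere, and the function and field names are my own, not part of the earlier examples.

// A counter that only ever moves forward: increment for local changes,
// fast-forward when we see a higher version from someone else.
function createClock(clientId) {
  var counter = 0;
  return {
    tick: function() {
      counter += 1;
      return {client: clientId, version: counter};
    },
    observe: function(change) {
      if (change.version > counter) {
        counter = change.version;
      }
    }
  };
}

var clock = createClock('app-server-1');
var stamp = clock.tick();
tweet.changes.push({
  attribute: 'replies',
  value: 1,
  operation: 'incr',
  client: stamp.client,
  version: stamp.version
});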
Let's have a look at a version that uses client identifiers and timestamps.
Once again, the code first merges, then deduplicates, and drops changes of
the base object picked. Which one is picked shouldn't matter, the algorithm
just ignores all the changes already made in its path of changes. That way a
history of just the changes that are not common to all objects and not part of
the base object emerges, and that list can be applied to the base object.
Here's the part that sorts and deduplicates.
function sortChanges(changes) {
  return changes.sort(function(change1, change2) {
    if (change1.timestamp < change2.timestamp) {
      return 1;
    } else if (change2.timestamp < change1.timestamp) {
      return -1;
    } else {
      return 0;
    }
  });
}

function filterDuplicates(base, current, changes, acc) {
  var exists = base.filter(function(change) {
    if (change.timestamp === current.timestamp &&
        change.client === current.client) {
      return true;
    } else {
      return false;
    }
  });
  if (exists.length == 0) {
    acc.push(current);
  }
  return acc;
}

function dropDuplicates(changes) {
  return changes.reduce(function(acc, current) {
    return filterDuplicates(acc, current, changes, acc);
  }, []);
}

function dropBaseChanges(base, changes) {
  return changes.reduce(function(acc, current) {
    return filterDuplicates(base.changes, current, changes, acc);
  }, []);
}

Both dropDuplicates and dropBaseChanges are very similar, just checking against different lists of changes and discarding items that are already on them. dropDuplicates deduplicates the combined list of changes against itself, while dropBaseChanges discards changes that are already part of the base object. The code first removes all the duplicates, creating a list of all unique changes, and then discards the ones already in the base object.
The last bit of code goes through the remaining list of changes and applies
them to the base object. This example just does increments, but it could be
adapted to support different data structures, such as sets, arrays, and other
things.

function applyChanges(base, changes) {
  changes.forEach(function(change) {
    if (change.operation == 'incr') {
      base[change.attribute] += change.value;
    }
    base.changes.push(change);
  });
  return base;
}

The final piece of code is a chain of calls to the above functions, picking a base object beforehand, and then having at it. changes is assumed to be a list of all changes in all objects. This code works with any number of siblings; you just need to concatenate all of their changes into one list.
result = applyChanges(base,
  dropBaseChanges(base,
    dropDuplicates(
      sortChanges(changes))));
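For completeness, here's a sketch of how that list could be assembled from riak-js siblings before running the chain, assuming, as in the earlier examples, that every sibling carries its document in a data attribute; the function name is my own.

// Collect every sibling's changes into one list, then reconcile,
// using the first sibling's document as the base object.
function reconcileCounter(siblings) {
  var base = siblings[0].data;
  var changes = [];
  for (var i in siblings) {
    changes = changes.concat(siblings[i].data.changes);
  }
  return applyChanges(base,
    dropBaseChanges(base,
      dropDuplicates(
        sortChanges(changes))));
}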

Congratulations, you just built your own distributed system on top of Riak!
Riak itself uses vector clocks to keep track of conflicting changes, but on a
level that's not suitable for detecting logical changes to the actual data. You
just know there were changes, just not what has been changed. To improve
that on the application level, our model uses something similar to vector
clocks to track and resolve conflicts on the data itself.
Now that the initial enthusiasm has passed, this implementation still has flaws. When the lists of changes on two different nodes diverge too far, maybe because one of the nodes is partitioned from the others but still receives updates independently, duplicate increments can occur. Assume one node
contains a couple of older entries in the list of changes, but another already
has received more new increments, so that the older entries have been
dropped. When those two lists are merged together, the older entries are
applied again, leading to duplicate increments. You can improve this by
tracking the last timestamp or version in the document itself, but
unfortunately there is no guarantee for 100% consistent counter values.
Thanks to Reid Draper and Russell Brown for pointing this out.
Both are working on similar data structures and libraries on top of Riak, one result of which is knockbox, a Clojure implementation of convergent and commutative replicated data types. It's based on a paper outlining the topic and on statebox, an Erlang implementation of similar data structures. Both focus on things like sets and registers, but not counters.
Data structures like this are pretty neat. They're an often-requested feature for Riak, and they're not impossible to build, just not in the simple ways other databases offer. It's a matter of trade-offs, and it bears repeating more than once: you buy into fault-tolerance and availability by loosening the consistency requirements of your data.
Be aware that the data structures may not lend themselves that well when
you have thousands of updates to a single object over short periods of time.
When reads and reconciliations can't keep up with the writes, it leads to
something called sibling explosion.

Strategies for Reconciling Conflicts


Reconciling multiple updates brings up several questions that are worth
diving into. The most important one is: when do you reconcile conflicts?
Most people resolve conflicts on the next read.
It's pointless to look for conflicts immediately after you wrote the data, as
they may not pop up until milliseconds later. It's an eventually consistent
system, after all. You also may keep blocking the client while it's waiting for
the write to finish.
Instead reconcile conflicts on the next read, because whether or not they
exist is not that relevant to the system unless they're read again. There are
consequences of letting them grow indefinitely, but we'll look into that part
in the next section. In the previous examples the code already followed this
principle. When Riak returns siblings with a request, the client first merges
the conflicts together into a single object and then returns the merged data.
The application can then decide to write the data directly to Riak, or to put
that work in a message queue to keep latency low. But what happens when
a new conflict arises while you store back the data? If you care for a scenario
like this, meaning that your users always need to have the very latest data
(which I doubt they do), you read the data again, finally returning the result
you just stored. But if there's another conflict, you need to keep reconciling.
That can turn into a vicious circle when conflicts are created faster than they
are read and reconciled. It's not what you want, most of the time. Instead,
reconcile again on the next read or write. It is a rather simplistic approach, but fits in nicely with Riak's eventually consistent model.

Reads Before Writes


Reads don't just need to happen when the application presents the data.
Ideally the application reads the data before writing it. That way, it can
detect and reconcile existing conflicts before adding new data, and it needs
the original vector clock for the new write.
This is important: if clients don't reference a valid vector clock, they keep creating new siblings. Most client libraries will automatically keep track of the vector clocks for you, but only if you fetch data before writing it.
If your code doesn't check for siblings on a read followed by a write using
the same vector clock, the write will nix all existing siblings, losing all the
changes made. Riak assumes you took care of all the conflicts.
Make sure you don't blindly write data unless that is what you want. If it
is, you probably don't need siblings for that particular data, so make sure
allow_mult is disabled.
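To make the read-before-write pattern concrete, here's a small sketch of a helper that wraps it, reusing the reconcileConflicts function from earlier; checking meta.statusCode for 300 mirrors how the previous examples detected siblings.

// Read, reconcile siblings if there are any, apply the update,
// then write back with the metadata (and vector clock) from the read.
// (Tracking the change in tweet.changes is left out to keep the sketch short.)
function update(bucket, key, applyUpdate) {
  riak.get(bucket, key, function(error, obj, meta) {
    if (meta.statusCode == 300) {
      obj = reconcileConflicts(obj);
    }
    applyUpdate(obj);
    riak.save(bucket, key, obj, meta);
  });
}

update('tweets', '1', function(tweet) {
  tweet.location = 'Berlin, Germany';
});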

Merging Strategies
No matter when you reconcile, there's always the question of picking a
proper strategy on how to merge diverged data structures, and how to pick
the right object to use as a base for reconciliation.
It depends on your application if you care about merging arrays, trying to
pick a winner when two clients updated the same string attribute, and so on.
But it all boils down to the timestamp or clock. If that's good enough for you,
start from the oldest change, pick the appropriate object as the basis for the
merge, and work your way up to the newest.
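How you pick the base object is up to you. As one possible sketch, you could pick the sibling that has seen the most recent change, again assuming each riak-js sibling carries its document in a data attribute; both function names are mine.

// Pick the sibling whose newest change has the highest timestamp.
function newestTimestamp(doc) {
  return doc.changes.reduce(function(max, change) {
    return change.timestamp > max ? change.timestamp : max;
  }, '');
}

function pickBase(siblings) {
  return siblings.reduce(function(best, current) {
    return newestTimestamp(current.data) > newestTimestamp(best.data) ?
      current : best;
  });
}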

Sibling Explosion
Sibling explosion happens when either too many concurrent writes update
the same objects, or when clients don't use vector clocks properly when
writing data.
When too many clients write data, and reconciliation can't keep up with it,
you end up creating more siblings than your application can handle. This
can be circumvented by always reading and reconciling before you write. It
doesn't avoid the risk of exploding siblings, but it makes sure you don't write
more than you reconcile, because a write will always follow a merge.
On the other hand, when clients carelessly update data without specifying
a vector clock, Riak can't determine any ancestry, because all vector clocks have a different lineage. You'll just keep creating more and more siblings if
they're not reconciled.
Sibling explosion can have the consequence of increased read latency. For
every read to the object, Riak has to load more and more data to fetch all the
siblings. One piece of data only 10 KB in size is no big deal, but a hundred of them suddenly turn the whole object into 1 MB of data. If all you do is write
smaller pieces of data, increased read latency is a good (though only one)
indicator that your code creates too many siblings.

Building a Timeline with Riak


You may remember that we wanted to build a timeline for the tweets we're collecting. Now that we went through the details of modeling data structures for Riak, we have all the information we need to get started. The timeline is nothing more than a list of changes. Every tweet in it is atomic, there are no duplicates, and duplicate writes of the same tweet are easy to filter out thanks to their identifier.
This idea has been made popular by Yammer. They built a notification
service on top of Riak that follows a similar way of modeling the data. In fact,
they led the way of how to build time series data structures on top of Riak.
Hat tip to you once again, Coda Hale. You should make sure to watch their
talk at a Riak meetup.
A timeline keeps a list of unique items for a single user, sorted by time. For
our purposes, it represents a Twitter user's timeline, based on the tweets that
matched the search. The timeline is stored in a Riak object per user and keeps
the whole timeline as a list, referencing the tweets. Here's an example.
{
"entries": [
"1231458592827",
"1203121288821",
"1192111486023",
"1171436045885"
]
}

The simplest timeline that could possibly work. To make it more efficient
we could make it include the entire tweet. Here's a slightly more complex
version.

{
"entries": [{
"id": "1231458592827",
"username": "roidrage",
"tweet": "Writing is hard."
}, {
"id": "1203121288821",
"username": "roidrage",
"tweet": "Finishing up those last chapters."
}, {
"id": "1192111486023",
"username": "roidrage",
"tweet": "Only two more chapters to go."
}, {
"id": "1171436045885",
"username": "roidrage",
"tweet": "Almost done with the part on Riak."
}]
}

You can keep adding attributes as you see fit, but it pays to keep the data in the timeline simple. Assuming JSON is the serialization format of choice, every new tweet added to the list adds around 300 to 400 bytes. With 100 tweets, the Riak object is about 40 KB in size; with 500, it already clocks in at 200 KB. That's not a massive size, but if it keeps growing indefinitely, the Riak object gets bigger and bigger.
Both ways of modeling the timeline share the same advantage. You can
assume that the id attribute is already respecting time, as that's what
Twitter's Snowflake tool does. Snowflake generates unique, incrementing
numbers to identify tweets. One part of the generated number is derived
from a timestamp. Ordering the entries by that attribute will ensure that
they're sorted by time.
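For the richer variant, where entries are objects instead of plain ids, a sort by that attribute could look like the following sketch. It compares the ids as strings, longer ids first, which avoids precision issues with 64-bit numbers in JavaScript.

// Sort timeline entries newest first by their Snowflake-style id.
// Longer ids are newer; equal-length ids compare lexicographically.
function sortEntries(entries) {
  return entries.sort(function(a, b) {
    if (a.id.length !== b.id.length) {
      return b.id.length - a.id.length;
    }
    return a.id < b.id ? 1 : (a.id > b.id ? -1 : 0);
  });
}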
Here's the code to handle the timeline, first the part that adds new entries,
prepending them to an existing list of entries.
var tweet = {
  id: '41399579391950848',
  user: 'roidrage',
  tweet: 'Using riakjs for the examples in the Riak chapter!',
  tweeted_at: new Date(2011, 1, 26, 8, 0)
};
riak.get("timelines", "roidrage", function(e, timeline, meta) {
  if (e && e.notFound) {
    timeline = {entries: []};
  }
  timeline.entries.unshift(tweet.id);
  riak.save("timelines", "roidrage", timeline, meta);
});

If no timeline exists, we create a new one and then add the tweet to the
beginning of the list. Next up, we'll add the code that reconciles two
diverged timelines.
function reconcile(objects) {
  var changes = [];
  for (var i in objects) {
    changes = changes.concat(objects[i].data.entries);
  }
  changes = changes.reduce(function(acc, current) {
    if (acc.indexOf(current) == -1) {
      acc.push(current);
    }
    return acc;
  }, []);
  return changes.sort().reverse();
}

First, all the entries are collected in one list. The list is then deduplicated, leaving only unique items in it. Lastly, it's sorted and reversed, so that the items are in descending order, with the newest tweets first.
All that's left to do is update the code saving timeline objects to reconcile
potential siblings before storing it back.
riak.get("timelines", "roidrage",
function(e, timeline, meta) {
if (e && e.notFound) {
timeline = {entries: []};
} else if (meta.statusCode == 300) {
var entries = reconcileConflicts(timeline);
timeline = timeline[0];
timeline.entries = entries;
}
timeline.entries.unshift(tweet.id);
if (!meta.vclock) {
meta = {}
}
riak.save("timelines", "roidrage", timeline, meta)

Riak Handbook | 127

Choosing the Right Data Structures

}
});

A full example that integrates building a timeline with the Twitter search
stream is included in the sample code accompanying the book.
I've been asked whether, and why, this is a valid approach to take with Riak. At first it might sound weird to store a timeline in a single object in Riak, having to go through the trouble of merging siblings together like this. But fetching a single object is cheap. Even when paginating only 100 out of 1000 items in the list, it may still be cheaper than fetching every single item from Riak and piecing the timeline together on the client side. Fetching multiple items involves much more disk I/O and increased network traffic.

Multi-User Timelines
What we did so far only took care of a single user's timeline, containing all their tweets. The traditional model, and the one popularized by Yammer, is based on the idea that a user follows a number of other users. The user's timeline is built based on all the activities of the users they're following.
The result won't be that much different from the example above. Instead of storing a single user's items, you store references to other users' items in the timeline for the user. That means duplicating data, a lot of duplication depending on the number of users they're following. The timelines of ten people following the same person end up containing either references to the same objects or containing the same data, depending on how much data you want to store for efficiency. But again, you trade off the benefits of Riak for a denormalized data structure. At a larger scale, it's a common trade-off.
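As a rough sketch of that fan-out, assuming a followers bucket that keeps one object per user with a usernames list (an assumption for this example, not something we built earlier):

// Prepend a tweet's id to the timeline of everyone following its author.
function fanOut(tweet) {
  riak.get('followers', tweet.user, function(error, followers, meta) {
    if (error) return;
    followers.usernames.forEach(function(username) {
      riak.get('timelines', username, function(e, timeline, m) {
        if (e && e.notFound) {
          timeline = {entries: []};
        }
        timeline.entries.unshift(tweet.id);
        riak.save('timelines', username, timeline, m);
      });
    });
  });
}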
If your timeline contains different kinds of activities such as likes, comments,
picture uploads, and so on, your entries will have some more detail. The data
structure could look something like this.
{
"id": "1212348454",
"timestamp": new Date(2011, 1, 26, 8, 0).toISOString(),
"entry": {
"type": "like",
"id": 334,
"owner": "roidrage"
}
}

As you can see, it includes a lot more data. It doesn't have to, but it allows you
to do roll-ups. Roll-ups are a way of presenting data in the feed in a denser way. Instead of saying that three people like the same post as three different
items, you can just say it all in one item. You wouldn't even have to fetch
external data (like user objects) if all the data is included in the feed.
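As an illustration, a simple roll-up could group entries by type and target, so three likes of the same post collapse into a single item; a sketch, with the function name being my own:

// Group timeline entries by type and target id, collecting the owners.
function rollUp(entries) {
  var grouped = {};
  entries.forEach(function(item) {
    var key = item.entry.type + ':' + item.entry.id;
    if (!grouped[key]) {
      grouped[key] = {type: item.entry.type, id: item.entry.id, owners: []};
    }
    grouped[key].owners.push(item.entry.owner);
  });
  return Object.keys(grouped).map(function(key) {
    return grouped[key];
  });
}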
While Yammer's original application for timelines in Riak was internal,
there are several open sourced implementations available, one for Python,
and one for Ruby.

Avoiding Infinite Growth


Letting an object in Riak grow indefinitely comes with a cost. As objects
grow it takes more effort to read them from disk, to transfer them across a
network, to serialize and deserialize them. It all adds up to increased latency
for the user.
To keep a timeline from growing infinitely, cap the number of entries in
it. Don't store more than say 1000 entries. That keeps the timeline from
exploding. It might not even get that far, because not all users tweet things
containing the same words more than 1000 times in a lifetime, but it's
something to keep an eye on, especially when you build multi-user timelines
like Twitter itself.
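Capping is a one-line addition before the timeline is saved; since new entries are prepended, slicing keeps the newest ones. A sketch, building on the earlier save code:

// Keep only the newest 1000 entries before writing the timeline back.
timeline.entries = timeline.entries.slice(0, 1000);
riak.save('timelines', 'roidrage', timeline, meta);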
If you don't want to cap the size of the timeline as a whole, there are ways to work around it. Splitting up the timeline into multiple objects is one, but it has the drawback that now you need to keep track of all the timeline objects to piece them back together or just to browse through the timeline. Alternatively you can build an index per user using Riak 2i or Search on all the tweets. That has its own kind of overhead and implications, but it's a viable option, and it's something we already implemented in the respective sections on both.

Intermission: How to Fetch Multiple Objects in One Request

The timeline examples bring up one problem. How do you fetch the items referenced in the timelines? Or how do you fetch multiple tweets with just one request? For our tweets example in particular, there's a solution for that: MapReduce. For fetching multiple objects in one request there's also a solution: MapReduce. Let's start with the timeline.
All we need to do is feed the timeline into a MapReduce job, extract the keys in a first map phase, and then extract the JSON for each object in a second map phase. For the second phase we can reuse the built-in functions, so the focus is on the code to extract the keys for the entries.
riak.add([['timelines', 'roidrage']]).
  map(function(value) {
    var doc = Riak.mapValuesJson(value)[0];
    var entries = [];
    for (var i in doc.entries) {
      entries.push(['tweets', doc.entries[i]]);
    }
    return entries;
  }).
  map('Riak.mapValuesJson').run()

This solves the problem of fetching everything in one go. You can do a
number of transformations on the data, but the purpose of the example is
to show the simplest thing that could possibly work to fetch all items in the
timeline in one request.
To fetch multiple objects, when you know a number of keys and you want
to avoid the added overhead of multiple roundtrips to Riak, the solution is
even simpler. Instead of going through a list of entries, you specify the keys
as inputs to the map phase.
riak.add([["tweets", "1231458592827"],
["tweets", "1203121288821"],
["tweets", "1192111486023"],
["tweets", "1171436045885"]]).
map("Riak.mapValuesJson").run()

Intermission: Paginating Using MapReduce


How do you paginate through the items in the timeline? First up, you could fetch the full timeline object from Riak and then fetch the referenced objects one by one, only the ones for that particular page.
Instead, you can just hand the page identifier (the id of a tweet, a timestamp,
or something else that seems viable for your application) into a MapReduce
job and only fetch the next batch of objects from that identifier. So the
argument for the MapReduce job needs to include the key to start at and
the number of objects to fetch. We'll wrap it in a JSON object with two
attributes, limit and offset.
riak.add([['timelines', 'roidrage']]).map(
  function(value, keyData, arg) {
    var doc = Riak.mapValuesJson(value)[0];
    if (!doc.entries) {
      doc.entries = [];
    }
    return doc.entries.
      slice(arg.offset - 1, arg.limit).
      map(function(e) {
        return ["tweets", e];
      })
  }, {offset: 1, limit: 2}).
  map('Riak.mapValuesJson').run();

The code uses JavaScript's slice() method to cut out the entries we're
interested in, returning them as a list. With a simple map() the list is
converted into a list of bucket-key pairs, and then it's fed into another map
phase to extract the values of these pairs. There's your pagination. Comes in
handy when using Riak Search too, especially when you prefer the Protocol
Buffers API over HTTP, as the only way to query the index is by way of
MapReduce.

Handling Failure
What we haven't done yet is talk about what exactly happens when a
(physical) Riak node fails or becomes unavailable. There are heaps of reasons why this could happen: a temporary network partition, where nodes can't talk to each other over the network for an undefined amount of time, or hardware failures due to disk, memory, or other problems.
If you're running on something like EC2, a whole data center might be unavailable too, but let's talk about the more common case: a node that's become unavailable, i.e. is unreachable from the other nodes in the cluster.
John Allspaw (of Etsy and Flickr ops fame) had a nice analogy for Riak's will
to bend but not break. You can shoot a bullet in one of Riak's servers, and it
will still continue serving data. You can keep shooting servers until you've
reached the last one, and Riak will still try its best to keep serving the data
that's still available, not breaking because parts of it are missing.
The beauty of all this is that we're not talking theory here. It's easy enough
with Riak to emulate something like a node failure. Simply stop the Riak
process after you joined it into the cluster and handoff has finished. Given
you have a three node cluster, you can assume that every piece of data is
replicated on every physical node. So shutting off one of them should give us
enough room to start working through a failure scenario.

Operating Riak
This section is all about the operations part of Riak. It starts out by looking
at some basic settings that have a longer lasting impact on your cluster, then
looks at the available protocols, and finishes off with a detailed look at the
available storage backends in Riak, the part that makes your data durable.

Choosing a Ring Size


The ring size in Riak determines the number of partitions created. It defaults
to 64 partitions sharing the entire key space, which is enough to get you up
to about six physical nodes. When you add more than six nodes, all nodes will be underutilized. It is recommended to have a minimum of ten partitions on each server.
The ring size needs to be decided on upfront, so while 64 works well for
playing around, you need to consider and test how large your initial cluster is
going to be, and to how many nodes it's likely to grow. Trying to change it
later will cause trouble because it shifts the partition ranges, which are stored
in the file system based on their start key. When these are shifted around, the
files will be invalid.
The number of partitions in the cluster must be a power of 2, so the next steps
after 64 are 128, 256, 512, and 1024. With Riak 1.0, it's not recommended to
go beyond 1024 because of the underlying network protocol. There would
be too much traffic between the nodes, possibly causing overload. If your
cluster is going to have between 6 to 12 nodes, a ring size of 128 or 256 will
suffice. Using 1024 would take you up to 100 nodes.
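To put the rule of thumb into code: take the largest number of nodes you expect, multiply by the recommended ten partitions per node, and round up to the next power of two. A quick sketch:

// Smallest power-of-two ring size giving every node at least minPerNode partitions.
function ringSize(maxNodes, minPerNode) {
  var size = 64; // Riak's default
  while (size < maxNodes * minPerNode) {
    size *= 2;
  }
  return size;
}

ringSize(12, 10);  // => 128
ringSize(100, 10); // => 1024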
To change the initial ring size to 128, edit the app.config file and add the following line to the section for Riak Core, which you'll find right at the beginning of the file.
{ring_creation_size, 128},

If you need to change this setting after you already started playing with a cluster, you need to stop the Riak processes and throw away your data files (an unrecoverable step, mind you) and the ring definition. If you installed a binary package for Ubuntu, these reside in /usr/lib(64)/riak/data/leveldb and /usr/lib(64)/riak/data/ring.
Make sure you think about the ring size upfront. The default will only take
you so far.

Protocol Buffers vs. HTTP


I didn't go into much detail on how Protocol Buffers and HTTP differ yet.
HTTP is the simplest way to get started with Riak. You don't even need a
client library, anything that speaks HTTP will do.
While HTTP is pretty fast, there's a noticeable difference in throughput between Protocol Buffers and HTTP. Protocol Buffers is a binary interface, based on a predefined interface definition. Both sides know that definition, which saves client and server the trouble of having to send both field names and values (think HTTP headers) in every request; normally just the values suffice. That means less data is transferred across the wire, making the entire
request a bit faster.
Plus, Protocol Buffers use persistent network connections, whereas HTTP
can end up creating new connections for every request. Connection setup
adds considerable overhead and therefore, valuable milliseconds, which you
might be able to shave off by using Protocol Buffers.
The problem with the Protocol Buffers interface, however, is that it doesn't
support the full feature set of the HTTP API. There are a bunch of bucket
properties it doesn't allow you to set, and you can't use things like the Solr
API endpoint and secondary indexes directly. You can only utilize both by
feeding search or query as inputs to a MapReduce request.
Start out with the HTTP interface, but keep in mind that using the above
features may block the path to using Protocol Buffers. Some, but not all,
client libraries hide this from you, using HTTP when you request a search,
and Protocol Buffers for all normal requests.

Storage Backends
Storage backends in Riak are pluggable, and it comes with several built right
in. They're the ones responsible for efficiently storing your data on disk, so
choose wisely.
All backends are configured in Riak's app.config file. Using each of them
requires changing just one line, namely the option storage_backend in the
section riak_kv. The value must point to a valid Erlang module that
implements the storage API. The possible values are shown in the respective
sections below.
Make sure to read through the documentation for each of the backends; there are links to the wiki page for every backend. Below is an outline of their strengths and weaknesses, with some implementation detail sprinkled on top.
Which one you choose depends on your specific use case. You're well
advised to do benchmarking with the ones you deem appropriate while
keeping a good eye on system load and I/O.

Innostore
Riak's first storage backend is Innostore, a library built on top of InnoDB,
which you may have come across working with MySQL, where it's now
the default engine. With Riak though, it eventually turned out that InnoDB is not able to keep up with a large volume of writes, owing to its tree structure on
disk. The B-tree has certain advantages, because it keeps data sorted by key.
As long as you write data with sequential keys, writes will be fast, as they're
always appended to the end of the tree.
But it has the disadvantage of being a mutable data structure. As new data is
added, leaves and nodes in the tree are moved around, rearranging the data
on disk.
It's still in active use today, but requires installing a separate package, which
is available from the downloads site. Look for version 1.0.4, which should be
out by the time this book finds its way into your hands.
Innostore has lots of tuning knobs and can cache a pretty good amount of
data in memory, definitely an advantage over other storage backends. But in
general, when you expect a lot of writes, and I mean thousands per second,
you might want to look into the others. It has proven to be both stable and
unstable in the last two years. It can run pretty smoothly, but when some of
its transaction log files get corrupted, all bets are off.
To enable Innostore, install the package and change the storage backend to
riak_kv_innostore_backend. The wiki page has a good overview on all the
available configuration settings.

Bitcask
The answer to the problems Innostore had in production was Bitcask, a
storage system that's specific to Riak, and currently the default storage
backend. It writes data in a way that never mutates existing data. When you write data to Riak, Bitcask appends the entry to an existing file. One disk seek, that's all. Needless to say, this gives pretty good performance. On the other hand, Bitcask doesn't cache anything, so reads will always go to disk, utilizing the file system cache if possible.
Bitcask keeps a list of all keys, with pointers to their data on disk, in memory, so there's a limit to how many keys you can store per node. However, it has the benefit that reads only require one lookup in the memory structure and one disk seek. With this simple measure, Bitcask keeps random I/O to a minimum: writing only appends to existing files and is therefore sequential I/O, while reads go directly to the data they're after with a single seek.
Bitcask doesn't write to the same file indefinitely. Eventually a file's size
has reached a threshold and will be closed, never to be opened for writing
again. When it hits other thresholds, Bitcask will start compacting old files,
merging data into only a few files along the way. So there is operational
overhead here for the compaction phase (as this process is called), and you
should take good care that this phase doesn't happen during the time of
highest load on your application. The wiki page on Bitcask is a good starting
place for all the tuning knobs it offers.
When using Bitcask, make sure your servers have enough memory available
to allow storing enough keys and for some file system caching on top.
There's a simple capacity planning tool available on Basho's wiki to help you
calculate the memory requirements for Bitcask.
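As a very rough back-of-the-envelope sketch, the estimate boils down to the number of keys a node holds times the per-key cost; the overhead constant below is an assumption of mine, so use Basho's planning tool for real numbers.

// Rough estimate of Bitcask's in-memory key directory size per node.
// overheadPerKey is an assumed constant, not an official figure.
function bitcaskMemoryBytes(totalKeys, avgKeyBytes, nodes, nVal, overheadPerKey) {
  var keysPerNode = totalKeys * nVal / nodes;
  return keysPerNode * (avgKeyBytes + overheadPerKey);
}

// 100 million keys, 25-byte keys, 5 nodes, n_val of 3, 40 bytes overhead:
bitcaskMemoryBytes(100000000, 25, 5, 3, 40); // => roughly 3.9 GB per node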
Use Bitcask when you want to reduce random disk I/O at all cost, as data in Bitcask is not ordered but written sequentially. It's definitely the most efficient of the bunch while still offering high durability. Also make sure you read the original paper outlining the ideas behind it.
To use Bitcask, change the storage backend setting to riak_kv_bitcask_backend, and configure Bitcask to suit your needs as per the wiki page.

LevelDB
LevelDB is the newest contender in the field. Originally built at Google, it's
based on ideas implemented in BigTable. LevelDB is a sorted string table
combined with a memory table. The former is used to efficiently store data
on disk, sorted by key, the memory table is an in-memory cache.
Data is first written to the memory table and to a transaction log, which is fast
as it's a sequential write. Once enough data has accumulated in memory, it's
flushed to a sorted string table on disk, where data is also compressed, saving
up disk space compared to Bitcask and Innostore.

Just like Bitcask, LevelDB never modifies existing data, but always writes
new files, eventually compacting them into new files when they've reached
a size threshold. Unlike Bitcask, LevelDB keeps data sorted using a log-structured merge tree. That means reading can require more than one disk seek, but has the advantage of the data being sorted, hence its great applicability for Riak 2i.
As LevelDB doesn't keep any keys in memory, there's heaps of room to use it
for caching instead, giving you a nice performance boost and reducing disk
access significantly.
Compaction happens continuously, as LevelDB keeps data files pretty small.
When you write a lot of data, a lot of compaction might be going on in the
background. This doesn't directly affect latency, but as there's increased load
on the servers, it indirectly affects read and write performance. Needless to
say, fast disks (hint: use SSDs) will give you the best results.
Compared to Innostore, LevelDB is much faster thanks to its data structures,
and more reliable too. Riak uses a library called eLevelDB, an Erlang library
built on top of LevelDB.
The wiki page on LevelDB has exhaustive detail on its innards and the
configuration settings. Also be sure to go through this post on LevelDB over
at High Scalability, full of links, benchmarks, and little details.
To enable LevelDB, set the storage backend setting to riak_kv_eleveldb_backend.

Load-Balancing Riak
When you have a cluster up and running, how do you get your application
to spread load evenly across all the nodes in the cluster? There's a certain
mismatch between the idea that you can elastically grow and shrink capacity
in your Riak cluster while all the clients (your application) have to know
about all nodes in the cluster and have to be updated as you increase and
decrease capacity.
One possible answer is to keep a configuration file listing all Riak nodes on every application server, always updating it as new nodes are added or as nodes go away. Every client can then load-balance on its own and spread out requests to all Riak nodes.
The disadvantage of this model is that clients have to take care of nodes that time out and nodes that are temporarily unavailable, implementing their own timeout and retry mechanism. That adds another level of complexity to a Riak client's implementation or your application code, though some Riak clients, such as the one for Ruby, implement their own load-balancing, hiding this extra bit of complexity from the application.
A much better option is to put a load balancer in front of the Riak nodes.
Clients only talk to the load balancer and not Riak directly. The load
balancer takes care of handling timeouts or nodes that are unavailable. So
the whole process of talking to Riak and distributing request load evenly is
opaque to the clients.
There is an abundance of options for load balancing tools available. The most
popular option in the open source world is HAProxy. As an alternative, you
can look at Pound, though its focus is on load-balancing HTTP(S) requests, which excludes Protocol Buffers. HAProxy, on the other hand, can load-balance both HTTP(S) and any other kind of TCP traffic, making it a great option for using both Protocol Buffers and HTTP.
Here's a simple example, without all the standard options, to set up HAProxy
for both HTTP and Protocol Buffers. HAProxy listens on ports 8198 for
HTTP and 8187 for Protocol Buffers, switching load balancing to TCP
for the latter. The default in most HAProxy configurations is to use HTTP
mode. You also have to use TCP mode for SSL connections, as HAProxy
itself doesn't terminate those but hands them through to the servers.
listen riak :8198
    server riak-1 192.168.2.5:8098 weight 1 maxconn 4096
    server riak-2 192.168.2.6:8098 weight 1 maxconn 4096
    server riak-3 192.168.2.7:8098 weight 1 maxconn 4096

listen riak_pbc :8187
    mode tcp
    server riak-1 192.168.2.5:8087 weight 1 maxconn 4096
    server riak-2 192.168.2.6:8087 weight 1 maxconn 4096
    server riak-3 192.168.2.7:8087 weight 1 maxconn 4096

So given we've just put a load balancer in front of our Riak cluster, there's a problem. We just set up a single load balancer instance to handle traffic for all of the clients and all the Riak nodes. What if it suddenly becomes unavailable? The load balancer is now a single point of failure in a distributed setup.
There are several options to solve this problem. First, you could set up a cluster of load balancers, or set up a load balancer on every Riak node. The latter scenario might sound odd, but if every Riak node has a load balancer running on it that's configured to talk to all nodes in the cluster, the load will still spread evenly. But then again, clients need to know about all the load balancing nodes in the cluster.
The alternative, and a preferred way of setting this up, is to have a load
balancer running on every application server. This has the advantage that
clients don't need to know about any external Riak node, they just talk to
their own host, which always has the IP address 127.0.0.1. As new Riak
nodes come and go, you only need to update the load balancers on all
application servers instead of writing a new configuration for the
application.
This work is ideally fully automated, so it's transparent to the client and removes the need for complex load-balancing logic in your code. Load balancers like HAProxy are capable of re-reading the configuration without dropping existing connections, so updating them has little to no impact on
your application.
If you're running a Riak cluster on EC2, another alternative to look at is
Amazon's Elastic Load Balancing service, which takes care of being
redundant and highly available for you. You can add and remove Riak nodes
through an API or through a set of command line tools, which is, again, easy
to automate.
With Pound and Elastic Load Balancing, you can use both as an SSL
endpoint, encrypting communication only between your application and
the load balancer.

Placing Riak Nodes across a Network


Entire racks of servers can fail because of a power outage, and entire data
centers can be temporarily cut off from the network. Placing Riak nodes in
different racks or even data centers could, in theory, help minimize the risk
of failure.
But I'm afraid it's not that easy with Riak. When you send a request, any
request, to a Riak node, it usually involves more than just that node. It either
forwards the request to the node where the vnode currently resides or it sends
out requests to all replicas of the data.
Either way, there is network traffic and latency involved. The closer your Riak nodes are grouped together in a data center, the lower the overall latency for requests will be. When you start placing the nodes of a Riak cluster in different racks or data centers, you increase latency for an unpredictable number of requests.
Consider a scenario where one replica is located in a different data center
and you always send requests with the default quorum. One read or write
would always go to the node in the other data center. Assuming a minimum
network latency of 40 ms between the two data centers, requests involving
that particular node will be no faster than 40 ms in total. The single node
in the other data center turns into the weakest link in your cluster, slowing
down all the other nodes.
Placing nodes in different racks is acceptable when you're aware of the added
latency, though it may be negligible depending on the network
environment. But there is no guarantee that replicas will end up on at least
two different racks, as Riak doesn't have any rack-awareness. Placing nodes
in other data centers is very impractical with Riak, considering the
unpredictable latency.
What you can do instead, especially considering data center failure scenarios,
is put separate Riak clusters in different data centers and have them
synchronize in one way or both ways, even multiple ways if you run Riak
in more than two data centers. This feature is part of the commercial Riak
offering, which is called Riak Enterprise.
It allows you to hook multiple Riak clusters together to synchronize data
across all of them, either for having clusters geographically closer to users or
to be prepared for data center failures.
The replication follows the same rules as the replication of data in a cluster.
Conflicts are detected using vector clocks and can be resolved in the same
way as shown in the section on conflict resolution. So clients can write to
a second Riak cluster, and your application would handle conflicting writes
just like it would with just a single Riak cluster.
If you run Riak on EC2, the same network placement rules apply, but with
slightly different terminology. EC2 offers multiple, geographically
distributed regions, and within these regions, you can choose an availability
zone. A region is only a loosely coupled collection of data centers, which
in turn each represent an availability zone. To deploy Riak on EC2, you'll
want to keep all the nodes in your Riak cluster within the same availability
zone to keep latency low. Latency between different availability zones isn't
too high, as they're geographically close together, but it's a noticeable latency nonetheless. Plus, in Amazon's multi-tenant network environment, unexpected latency spikes are the norm, not the exception.

Monitoring Riak
Riak has the reputation of being operations-friendly, and monitoring is no
exception. Every node has an easy way to access performance and health data
so you can feed it into your favorite metrics collection and alerting system.
Every Riak node offers an HTTP endpoint to fetch current statistics. If you
point curl at localhost:8098/stats, you'll see a slew of JSON flying by that
gives you everything you need to keep an eye on your Riak cluster's health.
Below is a shortened example, there is a lot more data in the output than
shown here. We'll go through the relevant metrics and their meanings in the
following sections.
$ curl localhost:8098/stats | prettify_json
{
  "vnode_index_writes_total": 0,
  "vnode_index_writes_postings": 0,
  "memory_binary": 76688,
  "node_get_fsm_time_99": 0,
  "node_puts_total": 0,
  "riak_pipe_vnodeq_total": 0,
  "converge_delay_max": -1,
  "nodename": "riak@127.0.0.1",
  "sasl_version": "2.1.10",
  "node_get_fsm_objsize_median": 0,
  "ssl_version": "4.1.6",
  "luke_version": "0.2.5",
  "memory_processes_used": 7034320,
  "cpu_avg15": 166,
  "precommit_fail": 0,
  "sys_system_architecture": "i386-apple-darwin11.2.0",
  "vnode_index_deletes_postings": 0,
  "rebalance_delay_max": -1,
  "read_repairs": 0,
  "riak_pipe_version": "1.1.1",
  "rings_reconciled": 0,
  "sys_heap_type": "private",
  "node_get_fsm_siblings_100": 0,
  "stdlib_version": "1.17.5",
  "os_mon_version": "2.2.7",
  "riak_pipe_vnodeq_max": 0,
  "sys_smp_support": true,
  "ring_members": [
    "riak@127.0.0.1"
  ]
}

Note that you can get all statistics on a Riak node by using riak-admin
status. But the result is an Erlang data structure that's dumped to the
console. In comparison the JSON data returned by the HTTP endpoint is
much nicer to parse for a metrics library.

Request Times
The most important thing you should monitor in Riak are request times.
They tell you how much time reads or writes take when sent to this node.
Riak gives you percentile values and not just averages. You get a 95th, a
99th, and a 100th percentile to find out if you got any slow edge cases in
your cluster. For more general values you get mean and median values, mean
representing the average request time, and median being the exact middle
value of all requests.
A percentile is calculated by taking a list of all request times, sorting it in ascending order, and picking the value below which N percent of the requests fall, with N being the percentile number, for instance 99. While averages mingle exceptional request times in with the rest, percentiles let you look at the slow cases in isolation.
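To make the definition concrete, here's a small sketch of the calculation over a list of request times; Riak's exact method may differ in details, this is just the general idea.

// Return the value at or below which p percent of the samples fall.
function percentile(samples, p) {
  var sorted = samples.slice().sort(function(a, b) { return a - b; });
  var index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

var times = [1, 2, 2, 3, 3, 4, 5, 9, 12, 250];
percentile(times, 50); // => 3, the median
percentile(times, 95); // => 250, the one slow outlier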
The relevant statistics start with node_get_fsm_time and node_put_fsm_time. For percentiles, the values are node_get_fsm_time_95, node_get_fsm_time_99, node_get_fsm_time_100, node_put_fsm_time_95, node_put_fsm_time_99, and node_put_fsm_time_100. To add averages, you can fetch node_get_fsm_time_mean, node_get_fsm_time_median, and their corresponding counterparts for put. All numbers are represented as collective values for the last 60 seconds, so they need to be considered gauges in your graphing system, while the totals are counters.
Monitoring these values will give you a good clue when there's something
wrong in your Riak cluster or on just a single node even. Below is an
example of how the resulting graphs may look. The graph is courtesy of
Librato Metrics. All values are microseconds, so if it makes more sense, you
can convert them to milliseconds before storing them in your graphing tool.

Graph of Get FSM times.

Why track three different percentiles and two averages? Having three
percentiles enables you to find out just how many requests are affected by temporary performance degradations, more so than averages do. If your 95th
percentile doesn't show a huge difference, but your 99th or only your 100th
does, you get an idea of how many requests, and therefore, how many users
are affected by degraded performance and increased request latency.
If you're tracking any metrics about your Riak cluster, these are the most
important ones. Increased request latency is the first indicator something is
wrong.
Don't forget to track similar metrics inside your application, measuring the
request times on both ends. That way you can figure out if there's a problem
in the transport layer or in the serialization, should Riak not be the culprit.
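As a minimal sketch of collecting these numbers, here's a poller using Node's built-in http module that fetches the stats endpoint and converts the get percentiles to milliseconds; where you ship the numbers is up to your metrics system.

var http = require('http');

// Fetch /stats and log the get-request percentiles, converted to milliseconds.
function pollStats(host, port) {
  http.get({host: host, port: port, path: '/stats'}, function(response) {
    var body = '';
    response.on('data', function(chunk) { body += chunk; });
    response.on('end', function() {
      var stats = JSON.parse(body);
      ['node_get_fsm_time_95', 'node_get_fsm_time_99',
       'node_get_fsm_time_100'].forEach(function(name) {
        console.log(name, stats[name] / 1000, 'ms');
      });
    });
  });
}

setInterval(function() { pollStats('127.0.0.1', 8098); }, 60000);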

Number of Requests
Every node keeps track of all the requests it coordinated and all requests that
went to the vnodes residing on that node. A Riak node forwards requests
to the relevant nodes instead of coordinating the request itself, should it not
have a vnode responsible for the requested key.
Riak keeps total counts and one-minute stats for both types of values. Which
of them you use to collect metrics is up to you, but for general tracking the
one-minute values are a good start. The totals will start at zero again when
you restart a Riak node.

The relevant values are node_gets, node_puts, vnode_gets, and vnode_puts. To track the totals, add _total to each of the previous names. Tracking these metrics will give you a good idea of your cluster's throughput and capacity, allowing you to make predictions for future growth and capacity requirements.

Read Repairs, Object Size, Siblings


So far the metrics gave more of an outside view of the requests an application
throws at Riak. However, there are metrics that allow you to peel off another
layer and look at things that happen while an object is read.
That includes the object's size, the number of siblings, and the number of
read repairs. Both object size and siblings are available as percentile metrics,
just like request times. They give you yet more clues as to where increased
latency might come from. If you have a number of objects that are larger
than the rest, it might also be that they have a larger number of siblings than
others. We've talked about sibling explosion before, this is how you can keep
an eye on it.
By tracking the number of siblings in addition to the request percentiles, you
get a good idea if increased request latency is caused by sibling explosion
or, if you track object size too, by large objects. Look for the values node_get_fsm_siblings_95, node_get_fsm_objsize_95, et al. Correlating both lets you monitor whether there's a high number of unresolved write conflicts in your application.
The number of read repairs is a metric that should usually have pretty low
values. It should only show a decent amount of activity when a temporarily
failed node rejoined the cluster and has outdated data to reconcile for a
certain amount of time.
Note that object size and siblings are only collected for objects that are read.
They're not a means to get general statistics about object size and number of
siblings of the data you stored in Riak.
The metrics so far are ones that you definitely want to track, no matter what.
They give you a good idea of what's going on inside your Riak cluster, allow you to keep an eye on capacity, and let you make informed decisions about potential latency issues or future growth of your infrastructure.

Monitoring 2i
Riak's secondary indexes (2i) track statistics for reads, writes, and deletes. Riak also keeps metrics for all index writes and deletes that happen during a single update of an object. All of them are available as total counters and one-minute values.
To track all requests that involve reading from or updating secondary indexes, you can collect vnode_index_reads, vnode_index_writes, and vnode_index_deletes for one-minute values. Add _total for full counters.
These metrics give you insight into all requests that involved reading from, writing to, or deleting from an index. To find out how many deletes and updates were done in total, use the metrics vnode_index_writes_postings and vnode_index_deletes_postings. Once again, these are updates in the last 60 seconds; add _total to get the number of all updates since Riak was started. These metrics give you more insight into writes and deletes that were done in a single request. For example, when one write updates three indexes, vnode_index_writes_postings will be incremented by three.
Ideally both types of metrics, those for requests involving 2i and those for all index updates, should grow in relation to one another. Keeping an eye on both lets you spot unusual index activity.
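One simple way to keep an eye on that relation is to track the ratio of postings to index writes. Here's a small sketch, again based on a parsed stats document; a sudden jump in that ratio means single writes started touching a lot more indexes than usual:

// Average number of index postings per 2i write in the last minute.
function postingsPerIndexWrite(stats) {
  if (stats.vnode_index_writes === 0) return 0;
  return stats.vnode_index_writes_postings / stats.vnode_index_writes;
}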

Miscellany
If you're using Protocol Buffers, you'll want to track the relevant metrics for
it. Riak keeps statistics of the current number of open connections and the
total connections received.
For the number of open connections, track pbc_active. For the total
number of connections, track pbc_connects for one-minute statistics and
pbc_connects_total for the total number of connections this Riak node has
received during its lifetime.
Aside from metrics related to internal services, Riak also keeps metrics about
the Erlang process it's running in. This makes it easy to track how much
memory the whole Riak process is consuming. You can track memory_total
to keep an eye on Riak's memory consumption.

Monitoring Reference
Now that we went through all of them in gory detail, here's a handy reference table giving a short overview of the relevant metrics Riak makes available to you. All statistics are per node, so be sure to track them across all nodes.
node_gets: Number of reads received by this node in the last minute.
node_puts: Number of writes and deletes received by this node in the last minute.
node_gets_total: Total number of reads received by this node since startup.
node_puts_total: Total number of writes and deletes received by this node since startup.
node_get_fsm_time: Internal response time for read requests coordinated by this node. Available as 95th, 99th, 100th percentile, mean, and median. Add _95, _99, _100, _mean, and _median, respectively.
node_get_fsm_objsize: Object size measured on read requests, representing the total size of all siblings. Available as 95th, 99th, 100th percentiles, mean, and median.
node_get_fsm_siblings: Number of siblings per object. Available as 95th, 99th, 100th percentiles, mean, and median.
node_put_fsm_time: Internal response time for write and delete requests coordinated by this node. Available as 95th, 99th, 100th percentiles, mean, and median.
vnode_gets: Number of reads received by vnodes on this node in the last minute.
vnode_puts: Number of writes received by vnodes on this node in the last minute.
vnode_gets_total: Total number of reads handled by vnodes on this node since startup.
vnode_puts_total: Total number of writes handled by vnodes on this node since startup.
read_repairs: Number of read repairs handled by this node.
vnode_index_deletes: Secondary index deletes handled by vnodes on this node in the last minute.
vnode_index_writes: Secondary index updates handled by vnodes on this node in the last minute.
vnode_index_reads: Secondary index queries handled by vnodes on this node in the last minute.
vnode_index_deletes_postings: Unique secondary index deletes handled by vnodes on this node in the last minute.
vnode_index_writes_postings: Unique secondary index writes handled by vnodes on this node in the last minute.
vnode_index_deletes_postings_total: Total number of unique secondary index deletes handled by vnodes on this node since startup.
vnode_index_writes_postings_total: Total number of unique index writes handled by vnodes on this node since startup.
pbc_active: Number of currently active Protocol Buffers connections.
pbc_connects: Number of Protocol Buffers connections received in the last minute.
pbc_connects_total: Total number of Protocol Buffers connections received since startup.
memory_total: Total amount of memory consumed by the Riak process.

Managing a Riak Cluster with Riak Control


With the latest release 1.1, Riak added a new tool that enables users to
manage and monitor their cluster in a visual way. It's called Riak Control,
and while it's still a work in progress, it's worth a look.
Riak Control is part of Riak's standard distribution, but it's disabled by
default. Before we have a look at it, let's enable it real quick.

Enabling Riak Control


For security reasons, it is recommended to run Riak Control over HTTPS
and not just plain old HTTP, so we'll enable that first. Open the app.config
file, and right at the beginning, in the riak_core section, you'll find the
following line.
%{https, [{ "127.0.0.1", 8098 }]},

Remove the % and change the port so that the line looks as shown below. This changes the port for SSL connections to 8069 so that it doesn't interfere with the normal HTTP endpoint, which runs on port 8098.
{https, [{ "127.0.0.1", 8069 }]},

Right below, there's a section that helps you generate SSL certificates. Depending on how it was installed, Riak may come with pre-generated, self-signed certificates. The Debian packages, for example, do not include any, but the Homebrew installation for the Mac does.
Below you'll find instructions for generating your own self-signed certificate. But before you do that, uncomment the relevant lines, so that they look as shown below. The lines below assume your Riak configuration is in /etc/riak, so the certificates will be in the same folder.
{ssl, [
  {certfile, "/etc/riak/cert.pem"},
  {keyfile, "/etc/riak/key.pem"}
]},

The final bit is to enable Riak Control. Right at the bottom of app.config you'll find a section named riak_control. The first setting in that section enables or disables Riak Control. Change it to true as shown below.
%% Set to false to disable the admin panel.
{enabled, true},

The last missing piece is to specify a list of users and passwords for authentication. As you can modify cluster settings using Riak Control, it's best
to protect it accordingly. The default user is admin with password pass. If
you're coming from the world of Oracle 8, you'll appreciate a default user
with a default password (remember scott?). But you should change the list
to instead include real users with secure passwords. An example is shown
below.
%% If auth is set to 'userlist' then this is the
%% list of usernames and passwords for access to the
%% admin panel.
{userlist, [{"scott", "small-hills-384398"},
            {"system", "pointy-hair-343437"}
           ]},

When all that's done, restart Riak, and point your browser to the host name
of one of your Riak nodes. Use port 8069 and the /admin endpoint, for
example https://192.168.2.1:8069/admin.

Intermission: Generating an SSL Certificate


The following instructions allow you to generate a self-signed certificate for
use with Riak in general and with Riak Control in particular. Note that these
should not be used for a production run, because they're not signed by a
proper authority. They're for testing purposes only.
First, we'll generate an RSA key.
$ openssl genrsa -out /etc/riak/key.pem 1024

Then we'll generate a certificate signing request.

$ openssl req -new -key /etc/riak/key.pem -out /etc/riak/server.csr \
    -batch

The last part is generating a certificate for the key. We'll need this and the
key to enable SSL in Riak.
$ openssl x509 -req -days 365 -in /etc/riak/server.csr \
-signkey /etc/riak/key.pem -out /etc/riak/cert.pem

Once you're done, restart Riak, and you're good to go.

Riak Control Cluster Overview


When you open the user interface, you're greeted with a snapshot showing
you if your cluster is healthy and giving you clues as to why it might not be.
A happy cluster is all green, but since a picture says more than a thousand
words, here you go.

Riak Control's snapshot view, all green.

If something is wrong on one of the nodes in your cluster, the snapshot view
will be the first to tell you. For example, if one of your nodes has become
unavailable, or if one of the nodes is short on memory, you'll be greeted with
lots of red.

Riak Control's snapshot view, reporting a problem.

Be aware that this view is not a full replacement for monitoring: if one of your Riak nodes goes down, an alert should still pop up somewhere else.

Managing Nodes with Riak Control


In the cluster view, you can see all the nodes that are currently part of your
cluster, their status, current memory consumption, and how many partitions
they own.

Riak Control's cluster view.

The cluster view allows you to run several basic management commands, for
example to stop a node, or to have it leave the cluster. All that is hidden in the
"Actions" menu, which is right next to every node's name.

Managing the Ring with Riak Control


In the ring view, you can see the partitions in your cluster, who owns them,
and what features they have enabled. This is a read-only view, but it also
shows you if one of the partitions has problems with particular features, for
example if the virtual node serving Riak Search is down for some reason.

Riak Control's ring view.

To Be Continued...
Riak Control is a work in progress, and this is just the beginning. Eventually
it's supposed to grow into a tool that allows you to view live graphs of your
nodes, to browse objects stored in Riak, and to fire off MapReduce queries
against your data. Keep an eye on future Riak releases!
It's worth noting that everything you can do with Riak Control can also be
achieved by way of the riak-admin command. The wiki page has a great
outline on the things you can do with it.

When To Riak?
Riak can do a lot, a whole lot. Yet, everything it does stays true to the spirit of
Riak as it was originally developed. All components are built to ensure data
is still available in the face of failure, and all of them scale up and down as you
add and remove nodes from a cluster.
Even though Riak follows a simple key-value model, it's been proven to
work nicely in environments where time-ordered data is desired. It involves
work on the application's side, but for that tradeoff you get fault-tolerance,
availability, and operational simplicity.
High availability usually has the highest importance for people coming to Riak. Add content agnosticism, and you get a highly scalable and fault-tolerant store for any kind of data, from addresses, images, and simple JSON data structures to more complex timelines.
Riak has been used as a scalable and persistent session store as much as it has been used to archive an abundance of data. Among the latter are text messages, metadata for data mined from the web, and address data. Anything that you need to store in abundance, and that can be identified by some other means (a username, a URL, a session identifier), is where Riak really shines, and where it's commonly used.
The example of tweets isn't that far-fetched, and not just about the fun
of analyzing what people say about Justin Bieber. The Twitter streaming
search, and their firehose too, adds up to large amounts of data over time.
Leave that Twitter search running for a couple of days, and you'll find
hundreds of thousands of tweets have accumulated.

It's hard to give a specific recommendation on which of the use cases you'll come across will be the right ones for Riak. It usually starts with the fact that your
existing database isn't up to the task anymore, maybe because of the amount
of data, because it's become a single point of failure, or because it lacks simple
ways to scale (simple compared to Riak anyway).
Riak goes nicely with the fact that, once you start growing, you slowly but surely loosen up consistency constraints and simplify your data model and data access. Those are key areas for Riak. It may sound pretty hand-wavy, but
I wouldn't say that Riak is a database you run to right from the start. You
come to it only, maybe mostly, when you have a good idea of growth and
access patterns for your data, when it's foreseeable that data will out-grow a
relational database at some point, or when it's just easier (from an operational
perspective) to use Riak instead, saving you the trouble of migrating later.
Until not long ago, access through bucket and key names was the only
means of accessing data in Riak. Thanks to Riak Search and Riak 2i it's
gotten much easier to build different views on your data, taking Riak closer
to what you're used to and, more importantly, making it much more useful
in general.

Riak Use Cases in Detail


In one way or the other we came across several use cases for Riak already.
Think of the timelines, storing lots of smaller data (tweets), and managing
data structures for use cases where you want to ensure that your data can
handle Riak's eventual consistency model.
In this section I want to outline some use cases in more detail. The outlined
scenarios are not absolute; they're examples that you can build on or that you can evolve into the specifics of your applications.

Using Riak for File Storage


Riak's key-value storage and content agnosticism make it a great fit for storing files, giving you redundant and scalable storage for assets of any kind. Most web applications that involve user-generated content have this need in one way or the other, be it user avatar images, CSS or JavaScript files, or a scalable means of handling file uploads and downloads without having to resort to a network file system.

Thanks to HTTP, you can even just put an HTTP proxy like nginx or a load
balancer like HAProxy in front of Riak and serve files directly. Also thanks to
HTTP, you can use any HTTP client to store files in Riak.

File Storage Access Patterns


Static assets should be served as quickly as possible, keeping latency in the
cluster and to the client at a bare minimum. To make that happen, you want
to make sure your read quorum is as low as possible, not involving more than
one replica, effectively making R = 1.
You still get to reap the benefits of Riak's replication and eventual consistency. But instead of relying on reads, you ensure that your writes involve enough replicas for all of them to have the most recent data. So you use the N value as your write quorum, ensuring full consistency, but trading off latency on writes.
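As a sketch with riak-js, assuming the default N of three and that your client passes quorum options through per request (riak-js accepts them in the options object); the bucket, key, and imageBuffer variable are made up for the example:

// Write to all three replicas so any single replica can serve a consistent read...
riak.save('assets', 'logo.png', imageBuffer, {contentType: 'image/png', w: 3},
  function(err) {
    if (err) console.error('write failed', err);
  });

// ...and read from just one replica to keep latency down.
riak.get('assets', 'logo.png', {r: 1}, function(err, file) {
  // serve the file
});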

Object Size
This usage scenario assumes your files don't exceed a size of 1 MB. Beyond
that, you won't gain as much from using Riak anymore. It shines with data in
the range of dozens to hundreds of kilobytes. There are known cases where
users stored data larger than that in Riak. But here's why you want to avoid
that and keep it small.
Object size affects a lot of things in Riak, but most notably it affects latency.
A request in Riak can involve several nodes. One physical node serves as
a proxy, delivering the request to the relevant virtual nodes in the cluster.
That means there's always network traffic between the Riak nodes involved,
network traffic that you want to keep as efficient as possible. Transferring
larger chunks of data increases network traffic and therefore latency, making
requests slower for the clients.
Throw disk I/O into the mix, and the larger your data is, the more latency
you get. You can try to keep it low by involving just one replica in read
requests, but it still adds up.
How far you can go with object size depends on the infrastructure and the
application. With a fast enough network and SSDs to store Riak's data on,
it will be more acceptable to have larger objects than when running on
Amazon EC2 with network-backed storage.

Storing Large Files in Riak


That leaves the question: how can you store files larger than a few hundred
kilobytes in Riak? Until recently, the answer was Luwak, a large-file store
built into Riak. It splits up a large file into chunks, distributing them across
the nodes in a Riak cluster. If the name reminds you of a certain kind of
coffee bean, you're right on the money.
However, Luwak reached the end of its life earlier this year. It turned out not
to be able to provide the reliability and availability that people expect when
using Riak. So it was replaced with Riak CS, short for Riak Cloud Storage.

Riak Cloud Storage


Riak CS was built as a more reliable solution for storing large chunks of data in Riak. It was also built to have an API compatible with Amazon S3, which is a hosted, multi-tenant key-value store. Its release somewhat coincided with Luwak nearing its end of life, but the two events are not related. It builds on top of a running Riak cluster, adding services that allow it to run in a multi-tenant mode, and to ensure that buckets are unique across all tenants and that clients can only see their own data, just like S3 does. The REST API uses an authentication and access control system that is 100% compatible with how clients are authenticated by Amazon Web Services, including S3.
The API compatibility makes it possible to use any existing client library for
S3. Point it to a hostname that maps to your Riak CS cluster, and continue
working just like you would with S3.
When Riak CS is set up, and you have a hostname configured for it, for
instance s3.riakhandbook.com, you can use a library like Knox for Node.js
to talk to it, just like you would with S3.
var s3 = require('knox').createClient({
  endpoint: 's3.riakhandbook.com',
  key: 'access-key',
  secret: 'secret-key',
  bucket: 'stylesheets'
});

// end() actually sends the request.
s3.get('/application.css').on('response', function(response) {
  response.on('data', function(chunk) {
    console.log(chunk.toString());
  });
}).end();

The example above creates a client with the custom endpoint. If your Riak CS service listens on a custom port, you can specify that too. Note that as of Knox 0.0.9, a custom port is not supported yet; you'll need to install the current master. If you're using a different S3 library for Riak CS, make sure it supports setting the port if you're not using an HTTP proxy or load balancer on port 80.
The example then requests a file called application.css in the bucket stylesheets. That's it. The beauty of the code above is that you could easily leave out the custom endpoint, switching between S3 and your own Riak CS cluster as you see fit.
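Uploads work the same way. Here's a sketch using knox's putFile helper; the local path and content type are assumptions for the example:

s3.putFile('./public/application.css', '/application.css',
  {'Content-Type': 'text/css'}, function(err, res) {
    if (!err) console.log('uploaded, status ' + res.statusCode);
  });
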
While the recommended maximum object size for Riak is around 1 MB,
Riak CS can store objects up to 5 GB in size. Also, you can get accounting
data for every tenant in the system, allowing you to account for or bill
network traffic and storage.

Using Riak to Store Logs


Centralized logging becomes a common problem for many apps as soon as you move beyond just one or two servers. Especially in applications that produce
a lot of log output, it's important that the aggregation tool be scalable to
handle the load.
Log data has some properties that we're going to take into account:

- Data is stored and accessed ordered by time
- Newer data is more relevant than older data
- Data older than X days can be purged
- Writes can be bursty, for instance when error rates and log output suddenly increase
- Data is accessed by time frame; in most cases, only a range of entries is of interest
That last point makes the use case of storing logs interesting, because it
requires a storage model that allows fetching records based on a range of
timestamps. You specify a lower and an upper bounds, for example to fetch
all entries from 8 o'clock to 9 o'clock. As Riak is a key-value store, it doesn't
appear to be a great fit for this kind of use case. Due to LevelDB's nature
though, it can be an acceptable data store for this particular job.

Modeling Log Records


In the world of traditional logging, one line of log output generates a single record. It used to be a line of plain text with some metadata like timestamp, facility, and log level added, whereas more modern log systems use data structures like JSON or custom log formats to accommodate more flexible log structures and to get away from traditional, mostly text-based logging.
Either way, even with syslog output, it makes sense to deconstruct every
message into a JSON structure and store every line as a separate record in
Riak. Here's an example structure:
{
  "time": "2012-04-30T15:02:17.273Z",
  "log_level": "info",
  "facility": "kernel",
  "message": "--MARK--"
}

It's debatable how efficient this data structure will be when stored on disk
thousands of times, but it'll serve us well for an example. Most log formats,
like syslog or IEEE 1545-1999, can be decomposed into a data structure like
this.
Why not store the lines of text directly? Pre-analyzing allows you to run
efficient searches on it, utilizing Riak Search to index the data structure for
you. It's also easier to analyze the data using MapReduce.
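As a rough sketch of that deconstruction step, here's how a classic syslog line could be turned into such a structure; the regular expression is deliberately simplified and only meant to illustrate the idea:

// Turns "Apr 30 15:02:17 riak1 kernel: --MARK--" into a structured record.
function parseSyslogLine(line) {
  var match = line.match(/^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) (\w+): (.*)$/);
  if (!match) return null;
  return {
    time: match[1],     // classic syslog timestamps carry no year, so keep the raw string
    host: match[2],
    facility: match[3],
    message: match[4]
  };
}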

Logging Access Patterns


There are two scenarios where good speed is required for this use case: writes
and running full-text searches on the log data, oftentimes narrowed down to
a specific time frame.
To keep writes efficient, using the lowest possible quorum will do the job. It's much better to keep latency low to accommodate those bursty moments when the system is flooded with an unusual amount of log data.
How important consistency is in a logging scenario depends on the
requirements. To keep read latency low it can be acceptable to prefer a
smaller read quorum, but that may have the consequence of some log lines
not showing up immediately. Since with logs you want to see as much as possible, but not necessarily the millisecond entries come in, it can again be acceptable to keep both read and write latency low.
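In riak-js terms, that could look like the following sketch, assuming the client passes the w option through as a per-request quorum:

// Acknowledge the write as soon as a single replica has accepted it.
riak.save('logs', 'some-key', {time: new Date().getTime(), message: '--MARK--'},
  {w: 1}, function(err) {
    if (err) console.error('failed to store log entry', err);
  });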

Writing data in a centralized logging scenario is the simple part; getting the data into a format that allows full-text search and accessing data ordered by
time are different stories. Let's look at simple access by ranges first.

Indexing Log Data for Efficient Access


Log data consists of several simple parts and one or more complex parts, the
log message and any kind of free-form data added to it. The simple data
includes things like the time, log facility or log level.
Mapping this part is straight-forward, thanks to LevelDB's sorted storage
and Riak's Secondary Indexes. If you structure the key based on the current
time, LevelDB ensures data is inserted and sorted by time.
Depending on the amount of data, just using the timestamp won't do, as
there may be collisions. To avoid them while keeping the ordering intact, you can add a random element (also called jitter), the first part of a SHA1 hash, or a short UUID. Note that no
matter which key scheme you choose, there's always a chance of collision.
There are alternatives like time-based UUIDs, Twitter's Snowflake or
Boundary's Flake to ensure unique IDs across a cluster of nodes.
Ensuring that the first part of a key is ordered allows you to fetch a batch of records based on just an upper and a lower bound, each derived from a timestamp. The unique component avoids collisions when a lot of records are stored at the same time. The higher the resolution of the timestamp, the lower the chance of collisions, but you never know.
There's another reason why ordered keys are useful: efficient writes. In a
sorted storage like LevelDB or InnoDB, adding keys that are at the
lexicographical end of all records stored in a particular partition reduces the
amount of reshuffling needed when data is purged (in LevelDB), or when
rewriting the B-tree (InnoDB).
As Node.js doesn't give us access to microseconds, we'll make do with just
milliseconds for now, adding a random integer to the end. If you're
developing for Node.js and require microseconds, there's a library to make
them available to you, making it as simple as calling microseconds.now().
// Millisecond timestamp plus a random suffix to keep keys ordered
// while avoiding collisions.
var key = new Date().getTime().toString() +
  parseInt(Math.random() * 10000);

var logEntry = {
  "time": new Date().getTime(),
  "entry": "--MARK--",
  "log_level": "info",
  "facility": "kernel",
  "host": "riak1.production.com"
};

riak.save('logs', key, logEntry);

Keys are not the only part that can be used to access records by time. You
could also just use random UUIDs for them and build secondary indexes on
the timestamp itself. You could even leave generating random IDs up to Riak
by not specifying a key and using POST to create the record. With riak-js,
you can set the key to null to do that; it automatically uses POST in that case.
We'll create an additional index on the timestamp while we're at it.
var indexes = {time: logEntry['time']};
riak.save('logs', null, logEntry, {index: indexes})

Why not just leave the key generation up to Riak and use secondary indexes
to fetch ranges? As mentioned before, there are efficiency gains with ordered
keys when writing data. There are trade-offs involved with both ways. If
you don't have a lot of log data generated at any given point in time, just
using Riak's random IDs can be an acceptable trade-off. If efficiency on
inserts is an issue, ordered keys are worth looking into.
Either way, accessing data based on a time range is straightforward: you derive an upper and a lower bound from the time frame you're interested in and do an index query based on the resulting range. The example below fetches all indexed records that were created on May 21st, 2012 between 12 and 1 pm.
var lower = new Date(2012, 4, 21, 12, 0).getTime();
var upper = new Date(2012, 4, 21, 13, 0).getTime();
riak.query('logs', {time: [lower, upper]});

There is a problem with this simplistic approach though: it only allows accessing entire batches of records. There is no way to filter records further except for using MapReduce, as Riak 2i doesn't allow querying multiple indexes at once.

Secondary Index Ranges as Key Filter Replacement


The example for using timestamps and unique identifiers to generate keys
and then fetch ranges on them makes secondary indexes the best replacement
for using key filters. Instead of making your keys fit into a schema that you
can then use in key filters, you generate appropriate indexes instead.

If the key structure itself is not enough to fetch a subset of keys, you can add
more indexes to represent these access patterns. Instead of applying filters after the fact by using key filters, you create the indexes in a way that allows you to do more efficient matching ad hoc.
Let's go back to an example from the section on key filters and see what
it looks like when implemented as a secondary index. Here's the key filter
version.
riak.add({bucket: 'tweets',
          key_filters: [["string_to_int"],
                        ["less_than", 41399579391950849]]}).
  map('Riak.mapValues').run()

To turn this into an index that's suitable as a replacement, there's nothing you need to do. You get a free index on the key already, but it's a binary index by default. So to make it useful for numeric range queries you can create a separate index for the numeric value.
riak.save('tweets', '41399579391950848', tweet, {index: {
id: 41399579391950848,
}})

Now you can run an index query on the data that yields the same results as
the key filter example above.
riak.query('tweets', {id: [0, 41399579391950849]})

If your key filters do transformations on the key before applying the matchers, you need to take that into account when creating the indexes. For any combination of transformations the key filters do, you need to create a separate secondary index. What you can't do with secondary indexes is apply regular expressions to the data. An index is not that flexible, unfortunately.
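For example, suppose a hypothetical key scheme like 'roidrage-41399579391950848', where a key filter would tokenize the key on the dash and match on the user name. With 2i you store the interesting pieces as their own indexes when writing the object (reusing the tweet object from the earlier examples) and query them directly:

riak.save('tweets', 'roidrage-41399579391950848', tweet, {index: {
  username: 'roidrage',
  id: 41399579391950848
}});

// All tweets by that user, no key transformation required.
riak.query('tweets', {username: 'roidrage'});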

Searching Logs
Having proper indexes in place is only one part of the story. The more
interesting bits of logs are not in their metadata but in the log lines themselves.
You can use Riak Search to create an additional full-text search on the log
messages. Or it can stand alone, without using secondary indexes at all, using
Riak Search's sorting features to fetch data ordered by time.
The advantage of using Riak Search is that we can run queries on more than
one field in a log entry. This makes searching for a specific host, facility, or log entry possible, while still allowing you to order matching entries by the
time of occurrence.
The one thing you need to take care of is to install a custom schema for Riak
Search, should your data structure not fit in with the default schema installed
by Riak Search. See the section on custom schemas for an example.
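Here's a sketch of such a query, using the Solr-compatible HTTP interface that Riak Search exposes; the index name, field names, and sort parameter assume the log schema outlined above:

var http = require('http');

var params = 'wt=json&q=' + encodeURIComponent('facility:kernel') +
  '&sort=' + encodeURIComponent('time desc');

http.get({host: '127.0.0.1', port: 8098, path: '/solr/logs/select?' + params},
  function(res) {
    var body = '';
    res.on('data', function(chunk) { body += chunk; });
    res.on('end', function() {
      console.log(JSON.parse(body).response.docs);
    });
  });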

Riak for Log Storage in the Wild


Using Riak as a storage backend for a syslog service was (still is) a personal
curiosity of mine. For a showcase on using Riak with Node.js, we built
a little application called Riaktant, that combines a syslog server with a
frontend to search and filter logs, even run MapReduce queries on the search
results.
The Riaktant code base is slightly outdated, but it should still give you a
good idea of how you can use Riak and Riak Search to implement a logging
service. For a more recent example, check out Brightbox's riak-syslog, a
Ruby implementation of a log service with a corresponding command line
tool to search through the log data.
Both projects include a schema for Riak Search for inspiration.
To make use of Riak's full potential, you could add a post-commit hook that pushes new log entries into a queue and streams them to the browser or other connected clients in near real-time. The post-commit hooks pushing
messages to RabbitMQ mentioned above are a great place to start with that.

Deleting Historical Data


One property of log data poses a bit of a problem for Riak: efficiently
deleting historical data. Logs older than X days or weeks tend to not be of
much interest anymore. It makes sense to purge them after a while to not
weigh down the system with outdated data.
The most efficient way to do that would be to use a time range, just like
when fetching records, and tell Riak to delete them all in one go. This is a
feature that's missing in Riak. The only way to delete a range of records is to
fetch the keys and delete them one by one.
To keep this an efficient operation, you could save the batch deletes for times
of reduced load on the system.
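A sketch of such a purge, building on the time index from the logging examples; it assumes riak-js hands the matching keys to the callback and exposes a remove function:

// Delete everything logged during April 2012, one key at a time.
var lower = new Date(2012, 3, 1).getTime();
var upper = new Date(2012, 4, 1).getTime();

riak.query('logs', {time: [lower, upper]}, function(err, keys) {
  if (err) return console.error(err);
  keys.forEach(function(key) {
    riak.remove('logs', key);
  });
});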

What about Analytics?


Being able to sift through aggregated logs from lots of servers is a great
improvement over using grep over SSH on multiple machines. But what
about the part where you used to use tools like sort and awk to extract
statistics about the log entries?
That part is about analytics, about being able to aggregate data into
numerical values to, for instance, see log volume per hour, per minute, or to
render graphs for specific log messages over time.
In full-text search systems you can, at least to a certain extent, use facets to
get simple aggregations on time, on log levels, or on the host name. As Riak
Search doesn't have support for facets, the only way to achieve that without
any extra work is to use MapReduce. Riaktant includes several examples that sort and group log data by hour of the day and by hostname, aggregating it into hourly numbers or extracting the top five busiest hours of the day. Using MapReduce is not very efficient though, especially when you
want to run a lot of analytical queries on the data or graph it in real-time.
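To make the idea more concrete, here's a sketch of a MapReduce query counting log entries per hour of the day. It assumes riak-js serializes the JavaScript functions for you and that every record carries the numeric time attribute from the earlier examples; running it over a whole bucket is exactly the kind of heavyweight query this paragraph warns about:

riak.add('logs').
  map(function(value) {
    var doc = Riak.mapValuesJson(value)[0];
    var counts = {};
    counts[new Date(doc.time).getHours()] = 1;
    return [counts];
  }).
  reduce(function(values) {
    var acc = {};
    values.forEach(function(counts) {
      for (var hour in counts) {
        acc[hour] = (acc[hour] || 0) + counts[hour];
      }
    });
    return [acc];
  }).
  run(function(err, results) {
    console.log(results);
  });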

Session Storage
A site serving lots of users has to keep a lot of user sessions around. Amazon
is a prime example, and it's the company that brought us Dynamo in the first
place.
The more traditional way is to store sessions in a filesystem, either local or
shared, in a database, or in an in-memory store like Memcached or Redis.
Problems start when the infrastructure needs to scale beyond a single instance, to scale up and down on demand, or to keep sessions persistent. You can achieve these goals by using Redis' persistent
mode and adding a consistent hashing implementation to go distributed, by
using an in-process database like BerkeleyDB, or by using Riak.
Riak fulfills several requirements for session storage:

- Persistent storage for durability
- Replication for fault-tolerance
- Expiring session data
That last point is not a must-have requirement. Stores like Amazon keep
sessions, in particular the shopping carts, around as long as possible, to
maximize profits even in the longer term.

Modeling Session Data


Modeling data for session storage doesn't require special treatment. In most
cases, session data is serialized into a type of blob, for instance as JSON,
YAML, or marshalled using a binary format like Protocol Buffers or
programming language built-ins.
However, things get slightly more complex when you look at a shopping
cart, a classic session storage example. Consider a customer, in quick
succession, putting two items in the shopping cart. Or consider two nodes in the cluster being partitioned, with the customer's two updates going to different nodes. Both scenarios can cause conflicting updates. When it comes to
reconciling the conflicts, you can't just pick one as a winner, or you'll lose
a sale. The same is true when you let any one of the writes win without
handling conflicts at all.
Your data structures need to reflect the possibility of your customers causing
conflicts in your data. To make sure you can reconcile them, for instance
when the customer checks out, your data structure should treat every
addition and removal to the shopping cart as separate operations.
If you remember the section on modeling data for eventual consistency, this
is another scenario where using a timestamp and a unique identifier (such
as the product's identifier in your database) is handy to make sure you can
restore all items in the shopping cart.
Here's an example of a JSON data structure for a shopping cart that keeps
track of every single addition, allowing multiple additions of the same
product. It uses unique identifiers for every operation and a timestamp to
restore the ordering of events. The data structure is a hash map indexed by
the unique identifier, storing the action, the item's ISBN code, and the time.
{
  "a93d40ce-a757-11e1-9178-1093e90b5d80": {
    "add": "978-0978739218",
    "time": 1337001337
  },
  "56707cee-a757-11e1-8e1b-1093e90b5d80": {
    "add": "978-0321200686",
    "time": 1337001388
  }
}

Upon checkout, the application can enforce strong consistency when reading the shopping cart data to make sure all replicas converge and reconcile conflicts if necessary.
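Because every operation carries its own unique identifier, reconciling conflicting copies of the cart boils down to merging the hash maps. Here's a minimal sketch of such a merge, independent of how your client library hands you the siblings; with the merged map in hand, the timestamps let you replay the operations in order:

// Merge any number of conflicting cart versions into one.
function mergeCarts(siblings) {
  var merged = {};
  siblings.forEach(function(cart) {
    for (var operationId in cart) {
      merged[operationId] = cart[operationId];
    }
  });
  return merged;
}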

Session Storage Access Patterns


The access patterns are based on the requirements for session data. If you
require low read latency, which is a common requirement for session
storage, you keep the read quorum as low as acceptable. With session data it
can be more acceptable to serve inconsistent data for a fraction of the user's
time on the site than trying to handle failed nodes.
The same is true for writes in this scenario. They're also a trade-off between
ensuring the session data is written and being able to handle node
unavailability.
A legitimate trade-off is to make sure both reads and writes go to at least two
nodes, given data is replicated to three nodes. You can even lower the read
quorum to just one node to keep latency low, but be aware that it might
bring up situations where users get an inconsistent view of their session data.

Bringing Session Data Closer to Users


One particular requirement for user sessions on larger sites is being able to bring the sessions closer to the user. That involves replicating data to several
different locations across the globe. Another reason why you might want to
do that is being able to survive outages of entire data centers.
That feat involves running multiple Riak clusters in multiple data centers, for
example in different EC2 regions, and replicating data between all of them.
This is only possible with Riak Enterprise, which you've read about in a
previous section.

URL Shortener
No database at any reasonable scale can avoid being used for shortening
URLs. Why is this even an interesting use case? Shortening a URL involves
several smaller steps:

- Generate a short, unique identifier for a URL
- Save the mapping of the unique identifier to the URL
- Look up the URL based on the unique identifier
- Redirect clients to the URL
- Bonus: track statistics about clicks

URL Shortening Access Patterns


In common scenarios, a URL shortener has a lot more reads than writes, and
it stores only small pieces of data. That's where this use case gets interesting.
It's another scenario where it makes sense to ensure consistency on writes
instead of on reads, because you want to make sure that read latency is low
and redirects to the URL can happen as fast as possible.
To make this happen, use the full quorum on writes and only involve a small
number of replicas on reads.

Modeling Data
To store a URL the data structure doesn't have to be very complex. A simple
JSON hash will do the job, though storing the URL as simple plain text
also works well. Using plain text saves you the extra work of deserializing
the data structure, allowing you to just fetch the URL and send the client a
redirect.
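Here's a minimal sketch of the plain-text variant; the short-id scheme, bucket name, and quorum choices are assumptions for the example, not a prescription:

var crypto = require('crypto');

function shorten(url, callback) {
  // Hypothetical scheme: the first eight hex characters of a SHA1 over the URL.
  var id = crypto.createHash('sha1').update(url).digest('hex').slice(0, 8);
  riak.save('urls', id, url, {contentType: 'text/plain', w: 3}, function(err) {
    callback(err, id);
  });
}

function expand(id, callback) {
  // Read from a single replica to keep redirects fast.
  riak.get('urls', id, {r: 1}, callback);
}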

Riak URL Shortening in the Wild


GitHub's Guillotine is the URL shortener serving git.io, whose main
purpose is to shorten GitHub URLs for notifications posted to external
services, for instance about new commits pushed to a repository. It comes with several storage backends, one of which is Riak. It stores the URLs as simple plain text.

Where to go from here


Now that you pretty much know all you need to know about Riak, it's time
to play. If there are things you always wondered whether they were possible with Riak, you should try them out. Think about data structures: how they can be resolved when conflicts arise, whether you even need to worry about conflicts, and so on.
Basho has collected a great deal of documentation around Riak. The whole
development process around Riak is open on GitHub; you can follow along with new features and the discussions around them in their respective pull requests.
Basho also provides more client libraries for Riak, for Java, Erlang, PHP,
Python, and Ruby. There are a whole bunch of libraries built by the Riak
community, just like riak-js. You'll find that most of them work in a similar fashion; if you know one, it's easy to understand the others.

Made in Berlin, Germany


© 2011-2012 Mathias Meyer, http://paperplanes.de
