Mathias Meyer
Revision 1.1
Table of Contents
Introduction ................................................................................................... 8
Thank You ............................................................................................. 8
How to read the book............................................................................. 9
Feedback ................................................................................................. 9
Code........................................................................................................ 9
Changelog .............................................................................................. 9
CAP Theorem .............................................................................................. 11
The CAP Theorem is Not Absolute ........................................................ 12
Fine-Tuning CAP with Quorums .......................................................... 13
N, R, W, Quorums, Oh My!.................................................................... 13
How Quorums Affect CAP ..................................................................... 14
A Word of CAP Wisdom......................................................................... 15
Further Reading ....................................................................................... 15
Eventual Consistency................................................................................... 15
Consistency in Quorum-Based Systems ................................................. 16
Consistent Hashing ...................................................................................... 16
Sharding and Rehashing........................................................................... 16
A Better Way............................................................................................ 17
Enter Consistent Hashing ........................................................................ 17
Looking up an Object .............................................................................. 19
Problems with Consistent Hashing ......................................................... 20
Dealing with Overload and Data Loss ..................................................... 21
Amazon's Dynamo....................................................................................... 22
Basics......................................................................................................... 22
Virtual Nodes ........................................................................................... 22
Master-less Cluster ................................................................................... 23
Quorum-based Replication ..................................................................... 24
Read Repair and Hinted Handoff ............................................................ 24
Conflict Resolution using Vector Clocks ................................................ 24
Conclusion ............................................................................................... 26
What is Riak?................................................................................................ 27
Riak: Dynamo, And Then Some ................................................................. 27
Installation .................................................................................................... 28
Installing Riak using Binary Packages ..................................................... 28
Talking to Riak............................................................................................. 29
Buckets ..................................................................................................... 29
Fetching Objects ...................................................................................... 29
Creating Objects ...................................................................................... 30
Object Metadata ....................................................................................... 31
Custom Metadata ..................................................................................... 32
Linking Objects........................................................................................ 33
Walking Links.......................................................................................... 34
Walking Nested Links ............................................................................. 35
The Anatomy of a Bucket ........................................................................ 36
List All Of The Keys................................................................................. 37
How Do I Delete All Keys in a Bucket?............................................... 38
How Do I Get the Number of All Keys in a Bucket? .......................... 39
Querying Data ............................................................................................. 39
MapReduce............................................................................................... 40
MapReduce Basics .................................................................................... 41
Mapping Tweet Attributes ...................................................................... 41
Using Reduce to Count Tweets .............................................................. 42
Re-reducing for Great Good ................................................................... 43
Counting all Tweets................................................................................. 44
Chaining Reduce Phases .......................................................................... 44
Parameterizing MapReduce Queries....................................................... 46
Chaining Map Phases ............................................................................... 48
MapReduce in a Riak Cluster................................................................... 48
Efficiency of Buckets as Inputs................................................................. 50
Key Filters................................................................................................. 51
Using Riak's Built-in MapReduce Functions.......................................... 53
Intermission: Riak's Configuration Files ................................................. 54
Errors Running JavaScript MapReduce................................................... 55
Deploying Custom JavaScript Functions ................................................ 56
Using Erlang for MapReduce .................................................................. 57
Writing Custom Erlang MapReduce Functions ................................. 58
On Full-Bucket MapReduce and Key-Filters Performance ................... 61
Querying Data, For Real.............................................................................. 61
Riak Search ............................................................................................... 62
Enabling Riak Search ........................................................................... 62
Indexing Data ....................................................................................... 62
Indexing from the Command-Line................................................. 63
The Anatomy of a Riak Search Document.......................................... 63
Querying from the Command-Line ................................................... 64
Other Command-Line Features ...................................................... 64
The Riak Search Document Schema ....................................................... 64
Analyzers .............................................................................................. 65
Writing Custom Analyzers .................................................................. 66
Other Schema Options..................................................................... 69
An Example Schema............................................................................. 70
Setting the Schema ............................................................................... 72
Indexing Data from Riak ......................................................................... 72
Using the Solr Interface............................................................................ 74
Paginating Search Results .................................................................... 75
Sorting Search Results .......................................................................... 76
Search Operators .................................................................................. 76
Summary of Solr API Search Options.................................................. 79
Summary of the Solr Query Operators ................................................ 80
Indexing Documents using the Solr API ............................................. 81
Deleting Documents using the Solr API ............................................. 82
Using Riak's MapReduce with Riak Search ........................................ 83
The Overhead of Indexing................................................................... 83
Riak Secondary Indexes ........................................................................... 84
Indexing Data with 2i........................................................................... 84
Querying Data with 2i ......................................................................... 86
Using Riak 2i with MapReduce ........................................................... 87
Storing Multiple Index Values ............................................................. 87
Managing Object Associations: Links vs. 2i ........................................ 88
How Does Riak 2i Compare to Riak Search? ...................................... 89
Riak Search vs. Riak 2i vs. MapReduce................................................ 90
How Do I Index Data Already in Riak?................................................... 91
Using Pre- and Post-Commit Hooks ...................................................... 92
Validating Data..................................................................................... 92
Enabling Pre-Commit Hooks ............................................................. 93
Pre-Commit Hooks in Erlang ............................................................. 94
Modifying Data in Pre-Commit Hooks.............................................. 95
Accessing Riak Objects in Commit Hooks ......................................... 97
Enabling Post-Commit Hooks .......................................................... 100
Deploying Custom Erlang Functions................................................ 100
Updating External Sources in Post-Commit Hooks ......................... 102
Riak in its Setting........................................................................................ 102
Building a Cluster................................................................................... 102
Adding a Node to a Riak Cluster ....................................................... 103
Configuring a Riak Node .............................................................. 103
Joining a Cluster ............................................................................. 104
Anatomy of a Riak Node.................................................................... 104
What Happens When a Node Joins a Cluster ................................... 105
Leaving a Cluster................................................................................ 105
Eventually Consistent Riak .................................................................... 106
Handling Consistency........................................................................ 106
Writing with a Non-Default Quorum .......................................... 106
Durable Writes ............................................................................... 107
Primary Writes ............................................................................... 108
Tuning Default-Replication and Quorum Per Bucket................. 108
Choosing the Right N Value ......................................................... 110
Reading with a Non-Default Quorum.......................................... 110
Read-Repair.................................................................................... 111
Modeling Data for Eventual Consistency ................................................. 111
Choosing the Right Data Structures ...................................................... 112
Conflicts in Riak ................................................................................. 115
Siblings............................................................................................ 116
Reconciling Conflicts......................................................................... 117
Modeling Counters and Other Data Structures ................................ 118
Problems with Timestamps for Conflict Resolution ..................... 119
Strategies for Reconciling Conflicts .................................................. 123
Reads Before Writes ....................................................................... 124
Merging Strategies ......................................................................... 124
Sibling Explosion................................................................................ 124
Building a Timeline with Riak .......................................................... 125
Multi-User Timelines..................................................................... 128
Avoiding Infinite Growth.................................................................. 129
Intermission: How to Fetch Multiple Objects in one Request.......... 129
Intermission: Paginating Using MapReduce .................................... 130
Handling Failure .................................................................................... 131
Operating Riak....................................................................................... 132
Choosing a Ring Size ......................................................................... 132
Protocol Buffers vs. HTTP ................................................................ 133
Storage Backends................................................................................ 133
Innostore......................................................................................... 134
Bitcask............................................................................................. 134
LevelDB.......................................................................................... 135
Load-Balancing Riak ......................................................................... 136
Placing Riak Nodes across a Network ............................................... 138
Monitoring Riak................................................................................. 140
Request Times ................................................................................ 141
Number of Requests ....................................................................... 142
Read Repairs, Object Size, Siblings................................................ 143
Monitoring 2i ................................................................................. 144
Miscellany ....................................................................................... 144
Monitoring Reference.................................................................... 144
Managing a Riak Cluster with Riak Control..................................... 147
Enabling Riak Control ................................................................... 147
Intermission: Generating an SSL Certificate ................................. 148
Riak Control Cluster Overview..................................................... 149
Managing Nodes with Riak Control ............................................. 150
Managing the Ring with Riak Control ......................................... 151
To Be Continued............................................................................ 152
When To Riak? .......................................................................................... 152
Riak Use Cases in Detail......................................................................... 153
Using Riak for File Storage ................................................................ 153
File Storage Access Patterns ........................................................... 154
Object Size...................................................................................... 154
Storing Large Files in Riak ............................................................. 155
Riak Cloud Storage ........................................................................ 155
Using Riak to Store Logs.................................................................... 156
Modeling Log Records................................................................... 157
Logging Access Patterns ................................................................ 157
Indexing Log Data for Efficient Access ......................................... 158
Secondary Index Ranges as Key Filter Replacement ..................... 159
Searching Logs ............................................................................... 160
Riak for Log Storage in the Wild ................................................... 161
Deleting Historical Data ................................................................ 161
What about Analytics? ................................................................... 162
Session Storage ................................................................................... 162
Modeling Session Data ................................................................... 163
Session Storage Access Patterns...................................................... 164
Bringing Session Data Closer to Users .......................................... 164
URL Shortener ................................................................................... 164
URL Shortening Access Patterns ................................................... 165
Modeling Data................................................................................ 165
Riak URL Shortening in the Wild ................................................. 165
Where to go from here............................................................................... 165
Introduction
I first heard about Riak in September 2009, right after it was unveiled to the
public, at one of the early events around NoSQL in Berlin. I tip my hat to
Martin Scholl for introducing the attendees (myself included) to this new
database. It's distributed, written in Erlang, and supports JSON and MapReduce.
That's all we needed to know.
Riak fascinated me right from the beginning. Its roots in Amazon's Dynamo
and its distributed nature were intriguing. It's been fun to watch it develop
in the more than two years since.
Over that time, Riak went from a simple key-value store you can use to
reliably store sessions to a full-blown database with lots of bells and whistles.
I was more and more intrigued, and started playing with it more, diving into
its feature set and into Dynamo too.
Add to that the friendly Basho folks, makers of Riak, whom I had the great
pleasure of meeting a few times and even working with.
But something was missing. Every database should have a book dedicated to
it. I never thought that it would even be possible to write a whole book about
Riak, let alone that I would be the one to write it, yet here we are.
What you're looking at is my collective brain dump on all things Riak,
covering everything from basic usage, by way of MapReduce, full-text
search and indexing data, to advanced topics like modeling data to fit in well
with Riak's eventually consistent distribution model.
So here we are. I hope you'll enjoy what you're about to read as much as I
enjoyed writing it.
This is a one-man operation, so please respect the time and effort that went into
this book. If you came by a free copy and find it useful, please buy the book.
Thank You
This book wouldn't be here, on your screen, without the help and support of
quite a few people. To be honest, I was surprised how much work goes into
a book, and how many people are more than willing to help you finish it. For
that I am incredibly grateful.
First and foremost I want to thank my wife Jördis, who not only was very
supportive, but also helped a great deal by doing all the design work in and
around the book, the cover, the illustrations, and the website. She gave me
Riak Handbook | 8
that extra push when I needed it. My daughter Mari was supportive in her
very own way, probably without realizing it, but supportive nonetheless.
She was great to have around when writing this book.
Thank you so very much to everyone who reviewed the initial and advanced
versions of the book, devoting their valuable time to giving invaluable
feedback. You never realize until later how many typos you end up creating.
Thank you for your great feedback, for tirelessly answering my questions,
and for all the support you guys gave me: Florian Ebeling, Eric Lindvall, Till
Klampäckel, Steve Vinoski, Russell Brown, Sean Cribbs, Reid Draper, Ryan
Zezeski, John Vincent, Rick Olson, Corey Donohoe, Mark Philips, Ralph
von der Heyden, Patrick Hüsler, Robin Mehner, Stefan Schmidt, Kelly
McLaughlin, Brian Shumate, Jeremiah Peschka, Marc Heiligers. I bow to
you!
Feedback
If you think you've found a typo, have suggestions for things you think
are missing, or generally would like to say hi, send an email to
feedback@riakhandbook.com. Be sure to include the revision you're
referring to; it's printed on the second page.
Code
This book includes a lot of code, but only in small, easy-to-grasp chunks.
There are only two listings in the entire book that stretch close to a page.
Most of the code samples don't build on each other but try to stand alone,
though there's the occasional assumption that some piece of code has been
run at some point. What was worth breaking out into small programs or
what would require tedious copy and paste has been moved into a code
repository that accompanies this book. You can find it on GitHub.
Changelog
Version 1.1
• Added a section on load balancing
• Added a section on network placement of Riak nodes
• Added a section on monitoring
CAP Theorem
CAP is an abbreviation for consistency, availability, and partition tolerance.
The basic idea is that in a distributed system, you can have only two of these
properties, but not all three at once. Let's look at what each property means.
• Consistency
Data access in a distributed database is considered to be consistent when
an update written on one node is immediately available on another node.
Traditional ways to achieve this in relational database systems are
distributed transactions. A write operation is only successful when it's
written to a master and at least one slave, or even all nodes in the system.
Every subsequent read on any node will then return the data written by
the update.
• Availability
The system guarantees availability for requests even when one or more
nodes are down. For any database with just one node, this is impossible
to achieve. Even when you add slaves to one master database, there's still
the risk of unavailability when the master goes down. The system can still
return data for reads, but can't accept writes until the master comes back
up. To achieve availability, data in a cluster must be replicated to a number
of nodes, and every node must be ready to claim master status at any time,
with the cluster automatically rebalancing the data set.
• Partition Tolerance
Nodes can be physically separated from each other at any given point and
for any length of time. The time they're not able to reach each other,
due to routing problems, network interface troubles, or firewall issues, is
called a network partition. During the partition, all nodes should still be
able to serve both read and write requests. Ideally the system automatically
reconciles updates as soon as every node can reach every other node again.
Given features like distributed transactions, it's easy to describe consistency
as the prime property of relational databases. Think about it, though: in a
master-slave setup, data is usually replicated down to slaves in a lazy manner.
Unless your database supports it (like the semi-synchronous replication in
MySQL 5.5) and you enable it explicitly, there's no guarantee that a write
to the master will be immediately visible on a slave. It can take crucial
milliseconds for the data to show up, and your application needs to be able
to handle that. Unless of course, you've chosen to ignore the potential
inconsistency, which is fair enough; I'm certainly guilty of having done that
myself in the past.
The CAP Theorem is Not Absolute
While Brewer's original description of CAP was more of a conjecture, by
now it's accepted and proven that a distributed database system can only
allow for two of the three properties. For example, it's considered impossible
for a database system to offer both full consistency and 100% availability
at the same time; there will always be trade-offs involved. That is, until
someone finds the universal cure against network partitions, network
latency, and all the other problems computers and networks face.
Fine-Tuning CAP with Quorums
N, R, W, Quorums, Oh My!
In the real world, it will depend on your particular use case which N, W,
and R values you're going to pick. Need high insert and update speed? Pick a
low W and maybe a higher R value. Care about consistent reads and a bit less
about increased read latency? Pick a high R. If speed is all you're after in reads
and writes, but you still want to have data replicated for availability, pick a
low W and R value, but an N of 3 or higher. Apart from the N value, the
other quorums are not written in stone; they can be tuned for every read and
write operation separately.
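As a rule of thumb for these trade-offs: a read and a write are guaranteed to overlap on at least one replica whenever R + W > N. Here's a minimal illustration of that check (plain Python, purely illustrative, not Riak code):

```python
def overlap_guaranteed(n, r, w):
    # With N replicas, a write acknowledged by W nodes and a read served
    # by R nodes must share at least one node whenever R + W > N.
    return r + w > n

# N=3 with quorum reads and writes (R=2, W=2): overlap guaranteed
print(overlap_guaranteed(3, 2, 2))  # True
# N=3 tuned for speed (R=1, W=1): stale reads are possible
print(overlap_guaranteed(3, 1, 1))  # False
```

When the condition holds, at least one of the R nodes you read from has seen the latest write; when it doesn't, you trade consistency for lower latency.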
In a paper that takes a more detailed look at Brewer's conjecture, Gilbert and
Lynch quite fittingly state that in the real world, most systems have settled on
getting "most of the data, most of the time." You will see how this works out
in the practical part of this book.
A Word of CAP Wisdom
Further Reading
To dive deeper into the ideas behind CAP, read Seth Gilbert and Nancy
Lynch's dissection of Brewer's original conjecture. They do a great
job of proving the correctness of CAP, all the while investigating alternative
models, trying to find a sweet spot for all three properties along the way.
Julian Browne wrote a more illustrated explanation on CAP, going as far as
comparing the coinage of CAP to the creation of punk rock, something I can
certainly get on board with. Coda Hale recently wrote an update on CAP,
which is a lot less formal and aims towards practical applicability, a highly
recommended read. And last but not least, you can peek at Brewer's original
slides too.
Daniel Abadi brings up some interesting points regarding CAP, arguing
that CAP should consider latency as well. Eric Brewer and Armando Fox
followed up the CAP discussion with a paper on harvest and yield, which is
also worth your while, as it argues for a weaker version of CAP, one
that focuses on dialing down one property while increasing another
instead of treating them as binary switches.
Eventual Consistency
In the last chapter we already talked about updates that are not immediately
propagated to all replicas in a cluster. That can have many reasons, one being
the chosen R or W value, while others may involve network partitions,
making parts of the cluster unreachable or increasing latency. In other
scenarios, you may have a database running on your laptop, which
constantly synchronizes data with another node on a remote server. Or you
have a master-slave setup for a MySQL or PostgreSQL database, where all
writes go to a master, and subsequent reads only go to the slave. In this
scenario the master will first accept the write and then propagate it to a
number of slaves, which takes time. We're usually talking about a couple of
milliseconds, but as you never know what happens, it could end up being
hours. Sound familiar? It's what DNS does, a system you deal with almost
every day.
Consistent Hashing
The invention of consistent hashing is one of those things that only happen
once a century. At least that's how Andy Gross from Basho Technologies
likes to think about it. When you work with a distributed database
environment and have to deal with an elastic cluster, where nodes come and
go, I'm pretty sure you'll agree with him. But before we delve into detail, let's
have a look at how data distribution is usually done in a cluster of databases
or cache farms.
the other end, hoping that they never get out of sync (which they will). Or
you started sharding your data.
In a sharded setup, you split up your dataset using a predefined key. The
simplest version of that could be to simply use the primary key of any table as
your shard key. Using modulo math, you calculate the key modulo the
number of shards (i.e. nodes) in the cluster. So the key 103 in a cluster
of 5 nodes would go to the fourth node, as 103 % 5 = 3. This is the simplest
way of sharding.
To get a bit more fancy, add a hash function, which is applied to the shard
key. Like before, calculate the modulo of the result and the number of
servers. The problems start when you want to add a new node. Almost all of
the data needs to be moved to another server, because the modulo needs to
be recalculated for every record, and the result is very likely to be different;
in fact, about N / (N + 1) of the keys end up on a different node, with N being
the number of nodes currently in the cluster. For going from three to four nodes
that's 75% of data affected, from four to five nodes it's 80%. The result gets
worse as you add more nodes.
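The churn is easy to see with a quick simulation. This is an illustrative sketch in Python, not code from the book's repository; any reasonably uniform hash function gives the same result:

```python
import hashlib

def shard(key, num_nodes):
    # Hash the key, then take the result modulo the node count,
    # as described above.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = range(10000)
# Count the keys whose node assignment changes going from 4 to 5 nodes.
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(moved / len(keys))  # close to 0.8: about 80% of keys move
```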
Not only is that a very expensive operation, it also defeats the purpose of
adding new nodes, because for a while your cluster will be mostly busy
shuffling data around, when it should really deliver that data to your
customers.
A Better Way
As you will surely agree, this doesn't pan out too well in a production system.
It works, but it's not great.
In the late nineties Akamai needed a way to increase and decrease caching
capacity on demand without having to go through a full rebalancing process
every time. Sounds like the scenario I just described, doesn't it? They needed
it for caches, but it's easily applicable to databases too. The result is called
consistent hashing, and it's a technique that's so beautifully simple, yet so
incredibly efficient at avoiding moving unnecessary amounts of data
around, that it blows my mind anew every time.
Enter Consistent Hashing
room for plenty of keys in between. No really, that's a lot. Of course the
actual ring size depends on the hash function you're using. To use SHA-1,
for example, the ring must have a size of 2^160. The keys are ordered
counter-clockwise, starting at 0, ending at 2^160 and then wrapping around
again.
The Ring.
When a node joins the cluster, it picks a random key on the ring. The node
will then be responsible for handling all data between this and the next key
chosen by a different node. If there's only one node, it will be responsible for
all the keys in the ring.
Add another node, and it will once again pick a random key on the ring. All
it needs to do now is fetch the data between this key and the one picked by
the first node.
The ring is therefore sliced into what is generally called partitions. If a pizza
slice is a nicer image to you, that works as well. The difference, though, is
that with a pizza everyone loves to have the biggest slice, while in a
database environment having that slice could kill you.
Now add a third node and it needs to transfer even less data because the
partitions created by the randomly picked keys on the ring get smaller and
smaller as you add more nodes. See where this is going? Suddenly we're
shuffling around much less data than with traditional sharding. Sure, we're
still shuffling, but somehow data has to be moved around, there's no
avoiding that part. You can only try and reduce the time and effort needed
to shuffle it.
Looking up an Object
When a client goes to fetch an object stored in the cluster, it needs to be
aware of the cluster structure and the partitions created in it. It uses the same
hash function as the cluster to choose the correct partition and therefore the
correct physical node the object resides on.
To do that, it hashes the key and then walks clockwise until it finds a key
that's mapped to a node, which will be the key the node randomly picked
when it joined the cluster. Say, your key hashes to the value 1234, and you
have two nodes in the cluster, one claiming the key space from 0 to 1023, the
other claiming the space from 1024 to 2048. Yes, that's indeed a rather small
key space, but much better suited to illustrate the example.
To find the node responsible for the data, you go clockwise from 1234 to
1024, the next lowest key picked by a node in the cluster, the second node in
our example.
Two nodes in a ring, one with fewer keys than the other.
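The clockwise walk can be sketched in a few lines. Following the book's convention, each node claims the key space starting at the key it picked; the node keys and the tiny key space below are illustrative, matching the example in the text.

```javascript
// Find the node responsible for a hashed key: the node whose picked
// key is the largest one not exceeding the hash, wrapping around to
// the highest node key if the hash is smaller than all of them.
// Sketch for illustration only.
function responsibleNode(hash, nodeKeys) {
  var sorted = nodeKeys.slice().sort(function (a, b) { return a - b; });
  var owner = sorted[sorted.length - 1]; // wrap-around default
  for (var i = 0; i < sorted.length; i++) {
    if (sorted[i] <= hash) owner = sorted[i];
    else break;
  }
  return owner;
}

// Two nodes: one claiming 0-1023, the other 1024 and up.
console.log(responsibleNode(1234, [0, 1024])); // 1024
console.log(responsibleNode(500, [0, 1024]));  // 0
```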
Dealing with Overload and Data Loss
Also, when a node goes down (due to hardware failure, a network partition,
or whatever else happens in production), there is still the question of what
happens to the data it was responsible for. The solution once again is rather
simple.
Amazon's Dynamo
One of the more influential products and papers in the field has been
Amazon's Dynamo, responsible for, among other things, storing your
shopping cart. It takes concepts like eventual consistency, consistent
hashing, and the CAP theorem, and slaps a couple of niceties on top. The
result is a distributed, fault-tolerant, and highly available data store.
Basics
Dynamo is meant to be easily scalable in a linear fashion by adding and
removing nodes, to be fully fault-tolerant, highly available and redundant.
The goal was for it to survive network partitions and be easily replaceable
even across data centers.
All these requirements stemmed from actual business needs, so it pays off
to read the full paper to see how certain features relate to Amazon's
production use cases.
Dynamo is an accumulation of techniques and technologies, thrown
together to offer just what Amazon wanted for some of their business use
cases. Let's go through the most important ones, most notably virtual nodes,
replication, read repairs, and conflict resolution using vector clocks.
Virtual Nodes
Dynamo takes the idea of consistent hashing and adds virtual nodes to the
mix. We already came across them as a solution to spread load in a cluster
using consistent hashing. Dynamo takes it a step further. When a cluster is
defined, it splits up the ring into equally sized partitions. It's like an evenly
sliced pizza, and the slice size never changes.
The advantage of choosing a partitioning scheme like that is that the ring
setup is known and constant throughout the cluster's life. Whenever a node
joins, it doesn't need to pick a random key, it picks random partitions instead,
therefore avoiding the risk of having partitions that are either too small
or too large for a single node.
Say you have a cluster with 3 nodes and 32 partitions, every node will hold
either 10 or 11 partitions. When you bring a fourth node into the ring, you
will end up with 8 partitions on each node. A partition is hosted by a virtual
node, which is only responsible for that particular slice of the data. As the
cluster grows and shrinks the virtual node may or may not move to other
physical nodes.
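The arithmetic from the paragraph above can be sketched like this. Round-robin assignment is an illustrative simplification; Riak's actual claim algorithm is more involved.

```javascript
// Distribute a fixed number of ring partitions across physical nodes,
// round-robin. Simplified sketch; not Riak's actual claim logic.
function partitionsPerNode(numPartitions, numNodes) {
  var counts = new Array(numNodes).fill(0);
  for (var p = 0; p < numPartitions; p++) {
    counts[p % numNodes]++;
  }
  return counts;
}

console.log(partitionsPerNode(32, 3)); // [ 11, 11, 10 ]
console.log(partitionsPerNode(32, 4)); // [ 8, 8, 8, 8 ]
```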
Master-less Cluster
No node in a Dynamo cluster is special. Every client can request data from
any node and write data to any node. Every node in the cluster has
knowledge of the partitioning scheme, that is which node in the cluster is
responsible for which partitions.
Whenever a client requests data from a node, that node becomes the
coordinator node, even if it's not holding the requested piece of data. When
the data is stored in a partition on a different node, the coordinator node
simply forwards the requests to the relevant node and returns its response to
the client.
This has the added benefit that clients don't need to know about the way
data is partitioned. They don't need to keep track of a table with partitions
and their respective nodes. They simply ask any node for the data they're
interested in.
Quorum-based Replication
As explained above in the section on consistent hashing, partitioning makes
replicating data quite easy. A physical node may not only hold the data in the
partitions it picked, it will hold a total of up to P / PN * RE partitions, where
P is the number of partitions in the ring, PN the number of physical nodes,
and RE is the number of replicas configured for the cluster.
So if every piece of data is replicated three times across the cluster, a single
physical node in a cluster of four may hold up to 48 virtual nodes, given that
the ring contains 64 partitions.
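As a quick sanity check of the P / PN * RE figure (the function name is mine, for illustration):

```javascript
// Upper bound on the number of virtual nodes a single physical node
// may hold: P partitions, PN physical nodes, RE replicas.
function maxVnodesPerNode(P, PN, RE) {
  return (P / PN) * RE;
}

// The example from the text: 64 partitions, 4 nodes, 3 replicas.
console.log(maxVnodesPerNode(64, 4, 3)); // 48
```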
The quorum is the consistency-availability tuning knob in a Dynamo
cluster. Amazon leaves it up to a specific engineering team's preference how
to deal with read and write consistency in their particular setup. As I
mentioned already, it's a setting that's different for every use case.
Conflict Resolution using Vector Clocks
When a client updates an object, it provides the vector clock it's referring
to. Let's have a look at an example.
When the object is updated the coordinating node adds a new pair with
server identifier and version, so an object's vector clock can grow
significantly over time when it's updated frequently. As long as the path
through the pairs is the same, an update is considered to be a descendant of
the previous one. All of Bob's updates descend from one another.
The fun starts when two different clients update the same object. Each client
adds a new identifier to the list of pairs, and now there are two different lists
of pairs from each node. We've run into a conflict. We now have two vector
clocks that aren't descendants of each other. Like the conflicts created by
Alice and then Carol in the picture above.
Dynamo doesn't really bother with the conflict, it can simply store both
versions and let the next reading client know that there are multiple versions
that need to be reconciled. Vector clocks can be pretty mind-bending, but
they're actually quite simple. There are two great summaries on the Basho
blog, and Kresten Krab Thorup wrote another one, where he refers to them
as version vectors instead, which actually makes a lot of sense and, I'm sure,
will help you understand vector clocks better.
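The descends-or-conflicts logic can be sketched with plain objects. The representation below ({nodeId: counter} maps) and all names are illustrative, not Riak's on-the-wire format.

```javascript
// A minimal vector clock comparison, the way the text describes it:
// clock A descends from clock B when A has seen at least everything
// B has seen. If neither descends from the other, the updates conflict.
// Illustrative sketch; node names are made up.
function descends(a, b) {
  return Object.keys(b).every(function (node) {
    return (a[node] || 0) >= b[node];
  });
}

function conflicting(a, b) {
  return !descends(a, b) && !descends(b, a);
}

// Bob updates twice through the same node: a straight line of descent.
console.log(descends({ node1: 2 }, { node1: 1 })); // true
// Alice and Carol update concurrently via different nodes: conflict.
console.log(conflicting({ node1: 1, node2: 1 },
                        { node1: 1, node3: 1 })); // true
```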
The basic idea of vector clocks goes way back to the seventies, when Leslie
Lamport wrote a paper on using time and version increments as a means to
restore order in a distributed system. That was in 1978, think about that for
a minute. But it wasn't until 1988 that the idea of vector clocks that include
both time and a secondary means of deriving ordering was published, in a
paper by Colin J. Fidge.
Vector clocks are confusing, no doubt, and you hardly have to deal with their
inner workings. They're just a means for a database to discover conflicting
updates.
Conclusion
Dynamo throws quite a punch, don't you agree? It's a great collection of
different algorithms and technologies, brought together to solve real life
problems. Even though it's a lot to take in, you'll find that it influenced
a good bunch of databases in the NoSQL field and is referenced or cited
equally often.
There have been several open source implementations, namely Dynomite
(abandoned these days due to copyright issues, but the first open source
Dynamo clone), Project Voldemort, and Riak. Cassandra also drew some
inspiration from it.
What is Riak?
Riak does one thing, and one thing really well: it ensures data availability
in the face of system or network failure, serving whatever data is still
available to it, even though parts of the whole dataset might be missing
temporarily.
At the very core, Riak is an implementation of Amazon's Dynamo, made by
the smart folks from Basho. The basic way to store data is by specifying a
key and a value for it. Simple as that. A Riak cluster can scale in a linear and
predictable fashion, because adding more nodes increases capacity thanks to
consistent hashing and replication. Throw on top the whole shebang of fault
tolerance, no special nodes, and boom, there's Riak.
A value stored with a key can be anything, Riak is pretty agnostic, but
you're well advised to provide a proper content type for what you're storing.
To no-one's surprise, for any reasonably structured data, using JSON is
recommended.
Installation
While you can use Homebrew and a simple brew install riak to install
Riak, you can also use one of the binary packages provided by Basho. Riak
requires Erlang R14B03 or newer, but using the binary packages or
Homebrew, that's already taken care of for you. As of this writing, 1.1.2
is the most recent version, and we'll stick to its feature set. Be aware that
Riak doesn't run on Windows, so you'll need some flavor of Unix to make it
through this book.
When properly installed and started using riak start, it should be up
and running on port 8098, and you should be able to run the following
command and get a response from Riak.
$ curl localhost:8098/riak
{}
While you're at it, install Node.js as well. We'll talk to Riak using Node.js
and the riak-js library, a nice and clean asynchronous library for Riak, while
we peek under the covers to figure out exactly what's going on.
Running npm install http://nosql-handbook.s3.amazonaws.com/pkg/
riak-js-7d3b8bbf.tar.gz installs the latest version of riak-js (we're using
the custom version as it includes some important fixes). After you're done,
you should be able to start a Node shell by running the command node and
executing the line below without causing any errors.
As we work our way through its feature set we'll store tweets in Riak. First
we'll just use the tweet's identifier to reference tweets, then we'll dig deeper
and store tweets per user, making them searchable along the way.
So when you're on Ubuntu or Debian, simply download the .deb file and
install it using dpkg.
$ wget downloads.basho.com/riak/riak-1.1.2/riak_1.1.2-1_amd64.deb
$ dpkg -i riak_1.1.2-1_amd64.deb
Now you can start Riak using the provided init script.
Talking to Riak
The easiest way to become friends with Riak is to use its HTTP interface.
Later, in production, you're more likely to turn to the Protocol Buffers
interface for better performance and throughput, but HTTP is just a nice and
visual way to explore the things you can do with Riak.
Riak's HTTP implementation is as RESTful as it gets. Important details
(links, vector clocks, modification times, ETags, etc.) are nicely exposed
through proper HTTP headers, and Riak utilizes multi-part responses where
applicable.
Buckets
Other than a key and a value, Riak divides data into buckets. A bucket is
nothing more than a way to logically separate physical data, so for example,
all user objects can go into a bucket named users. A bucket is also a way to
set different properties for things like replication for different types of data.
This allows you to have stricter rules for objects that are of more importance
in terms of consistency and replication than data for which a lack of
immediate replication is acceptable, such as sessions.
Fetching Objects
Now that we got that out of the way, let's talk to our database. That's why
I love using HTTP to get to know it better; it's such a nice and human-
readable format, with no special libraries required. We'll start off with the
basics using both a client library and curl, so you'll see what's going on
under the covers.
When you're installing and starting Riak, it installs a bunch of URL handlers,
one of them being /riak, which we'll play with for the next couple of
sections. Again, the client libraries are hiding that from us, but when you're
playing on your own, using curl, my favorite browser, it's good to know.
If you haven't done so already, fire up the Node shell, and let's start with
some basics. After this example I'm assuming the riak library is loaded in the
Node.js console and points to the riak-js library.
We're looking for a tweet with, granted, a rather odd-looking key, but it's
a real tweet, and the key conforms to Twitter's new scheme for tweet
identifiers, so there you have it.
What riak-js does behind the curtains is send a GET request to the URL
/riak/tweets/41399579391950848. Riak, being a good HTTP sport,
returns a status code of 404. You can try this yourself using curl.
$ curl localhost:8098/riak/tweets/41399579391950848
As you'll see it doesn't return anything yet, so let's create the object in Riak.
Creating Objects
To create or update an object using riak-js, we'll simply use the function
save() and specify the object to save.
riak.save('tweets', '41399579391950848', {
user: "roidrage",
tweet:
"Using @riakjs for the examples in the Riak chapter!",
tweeted_at: new Date(2011, 1, 26, 8, 0)
})
Under the covers, riak-js sends a PUT request to the URL /riak/tweets/
41399579391950848, with the object we specified as the body. It also
automatically uses application/json as the content type and serializes the
object to a JSON string, as this is clearly what we're trying to store in Riak.
Here's how you'd do that using curl.
$ curl -X PUT localhost:8098/riak/tweets/41399579391950848 \
  -H "Content-Type: application/json" -d @-
{"user":"roidrage",
"tweet":"Using @riakjs for the examples in the Riak chapter!",
"tweeted_at":"Mon Dec 05 2011 17:31:40 GMT+0100 (CET)"}
Phew, this looks a tiny bit more confusing. We're telling curl to PUT to the
specified URL, to add a header for the content type, and to read the request
body from stdin (that's the odd-looking parameter -d @-). Type Ctrl-D after
you're done with the body to send the request.
Riak will automatically create the bucket and use the key specified in the
URL the PUT was sent to. Sending subsequent PUT requests to the same
URL won't recreate the object, they'll update it instead. Note that you can't
update single attributes of a JSON document in Riak. You always need to
specify the full object when writing to it.
Object Metadata
Every object in Riak has a default set of metadata associated with it. Examples
are the vector clock, links, date of last modification, and so on. Riak also
allows you to specify your own metadata, which will be stored with the
object. When HTTP is used, they'll be specified and returned as a set of
HTTP headers.
To fetch the metadata in JavaScript, you can add a third parameter to the call
to get(): a function to evaluate errors, the fetched object, and the metadata
for that object. By default, riak-js dumps errors and the object to the console.
Let's peek into the metadata and look at what we're getting.
riak.get('tweets', '41399579391950848',
function(error, object, meta) {
console.log(meta);
})
{ usermeta: {},
debug: false,
api: 'http',
encodeUri: false,
host: 'localhost',
clientId: 'riak-js',
accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
binary: false,
raw: 'riak',
connection: 'close',
responseEncoding: 'utf8',
contentEncoding: 'utf8',
links: [],
port: 8098,
bucket: 'tweets',
key: '41399579391950848',
headers: {
Accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
Host: 'localhost', Connection: 'close' },
contentType: 'application/json',
vclock: 'a85hYGBgzGDKBVIcypz/fvptYKvIYEpkymNl4NxndYIvCwA=',
lastMod: 'Fri, 18 Nov 2011 11:31:21 GMT',
contentRange: undefined,
acceptRanges: undefined,
statusCode: 200,
etag: '68Ze86EpWbh8dbAcpMBpZ0' }
The vector clock is indeed a biggie, and as you update an object, you'll see
it grow even more. Try updating our tweet a few times, just for fun and
giggles.
Now if you dump the object's metadata on the console one more time, you'll
see that it has grown a good amount with just five updates.
Custom Metadata
You can specify a set of custom metadata yourself. riak-js makes that process
fairly easy: simply specify a fourth parameter when calling save(). Let's
attach some location information to the tweet.
When done via HTTP, you simply specify additional headers in the form
of X-Riak-Meta-Y, where Y is the name of the metadata you'd like to be
stored with the object. So in the example above, the headers would be
X-Riak-Meta-Longitude and X-Riak-Meta-Latitude.
$ curl -v localhost:8098/riak/tweets/41399579391950848
...snip...
< X-Riak-Meta-Longitude: 13.41156
< X-Riak-Meta-Latitude: 52.523324
...snap...
Note that, just like with the object itself, you always need to specify the full
set of metadata when updating an object, as it's always written anew. That
makes using riak-js all the better, because the meta object you get from the
callback when fetching an object lends itself nicely to being reused when
saving the object again later.
Linking Objects
Linking objects is one of the neat additions of Riak over and above Dynamo.
You can create logical trees or even graphs of objects. If you fancy object-
oriented programming, this can be used as the equivalent of object
associations.
By default, every object has only one link: a reference to its bucket. When
using HTTP, links are expressed using the syntax specified in the HTTP
RFC. A link can be tagged to give the connection context. Riak doesn't
enforce any referential integrity on links though; it's up to your application
to catch and handle nonexistent ends of links.
In our tweets example however, one thing we could nicely express with links
is a tweet reply. Say frank06, author of riak-js, responded to my tweet, saying
something like "@roidrage Dude, totally awesome!" We'd like to store the
reference to the original tweet as a link for future reference. We could of
course simply store the original tweet's identifier, but where's the fun in that?
To store a link, riak-js allows us to specify them as a list of JavaScript hashes
(some call them objects, but I like to mix it up).
var reply = {
user: 'frank06',
tweet: '@roidrage Dude, totally awesome!',
tweeted_at: new Date (2011, 1, 26, 8, 0)};
riak.save('tweets', '41399579391950849', reply,
  {links: [{tag: 'in_reply_to',
    key: '41399579391950848',
    bucket: 'tweets'}]})
A link is a simple set consisting of a tag, a key and a bucket. The tag in
this case identifies this tweet as a reply to the one we had before, we're
using the tag in_reply_to to mark it as such. This way we can store entire
conversations as a combination of links and key-value, walking the path up
to the root tweet at any point.
Now when you fetch the new object via HTTP, you'll notice that the header
for links has grown and contains the link we just defined.
$ curl -v localhost:8098/riak/tweets/41399579391950849
...
Link: </riak/tweets/41399579391950848>; riaktag="in_reply_to",
</riak/tweets>; rel="up"
...
You can fetch them with riak-js too, using the metadata object, which will
give you a nice array of objects containing bucket, tag and key.
riak.get('tweets', '41399579391950849',
function(error, object, meta) {
console.log(meta.links)
})
An object can have an arbitrary number of links attached to it, but there are
some boundaries. It's not recommended to have more than 10000 links on
a single object. Consider for example that all the links are sent through the
HTTP API, which makes a couple of HTTP clients explode, because the
single header for links is much larger than expected. The number of links on
an object also adds to its total size, making an object with thousands of links
more and more expensive to fetch and send over the network.
Walking Links
So now that we have links in place, how do we walk them, how can we
follow the graph created by links? Riak's HTTP API offers a simple way to
fetch linked objects through an arbitrary number of links. When you request
a single object, you attach one or more additional parameters to the URL,
specifying the target bucket, the tag and whether you would like the linked
object to be included in the response.
Walking Nested Links
riak-js doesn't have support to walk links from objects in this way yet, so
we'll look at the URLs instead. Play along to see what the results look like.
Let's have a look at an example.
$ curl .../riak/tweets/41399579391950849/tweets,in_reply_to,_/
$ curl localhost:8098/riak/tweets/41399579391950849/_,_,_/
var reply = {
user: 'roidrage',
tweet: "@frank06 Thanks for all the work you've put into it!",
tweeted_at: new Date(2011, 1, 26, 10, 0)};
riak.save('tweets', '41399579391950850', reply,
  {links: [{tag: 'in_reply_to',
    key: '41399579391950849',
    bucket: 'tweets'}]})
$ curl localhost:8098/riak/tweets/41399579391950850/_,_,_/_,_,_/
This query will walk two levels of links, so given a conversation with one
reply to another reply to the original tweet, you can get the original tweet
from the second reply. Mind-bending in a way, but pretty neat, because with
this query you'll also receive all the objects in between with the response, not
just the original tweet, but all replies too.
The Anatomy of a Bucket
• You can set specific properties on a per-bucket basis, such as the number
of replicas, quorum and other niceties, which override the defaults for all
buckets in the cluster. The configuration for every bucket created over the
lifetime of a cluster is part of the whole ring configuration that all nodes in
a Riak cluster share.
List All Of The Keys
riak.keys('tweets')
A word of warning though, this will choke with Node.js when there are a lot
of objects in the bucket. This is because listing all keys generates a really long
header with links to all the objects in the bucket. You'll probably want to use
the streaming version of listing keys as shown further down.
The same as a plain old HTTP request using curl:
$ curl 'localhost:8098/riak/tweets?keys=true'
This will return pretty quickly if you have only a couple of objects stored in
Riak; several tens of thousands are not a big problem either. But what you
probably want to do instead is stream the keys as they're read on each node
in the cluster. You won't get all keys in one response, but the Riak node
coordinating the request will send the keys to the client as they arrive from
all the other nodes. To do that, set the parameter keys to the value stream.
$ curl 'localhost:8098/riak/tweets?keys=stream'
With curl, it will keep dumping keys on your console as long as the
connection is kept open. In riak-js, due to its asynchronous nature, things
need some more care. It takes an EventEmitter object, a Node.js specific
type that triggers events when it receives data. We'll do the simplest thing
possible and dump the keys onto the console.
If you really must list keys, you want to use the streaming version. riak-js
uses the streaming mechanism to give you a means of counting all objects in
a bucket by way of a count('tweets') function.
In general, if you find yourself wanting to list keys in a bucket a lot, it's very
likely you actually want to use something like a full-text search or secondary
indexes. Thankfully, Riak comes with both. When you do list keys, keep a
good eye on the load in your cluster. With tens of millions of keys, the load
will increase for sure, and the request may eventually even time out. So do
your homework: tens of millions of keys are a lot to gather and collect over
a network.
The list of keys is always an indication; it may not always be 100%
accurate when it comes to the objects stored with the keys.
Querying Data
Now that we got the basics out of the way, let's look at how you can get
data out of Riak. We already covered how you can get an object out of Riak,
simply by using its key. The problem with that approach is that you have to
know the key. That's the dilemma of using a key-value store.
There are some inherent problems involved when wanting to run a query
across the entire data set stored in a Riak cluster, especially when you're
dealing with millions of objects.
Because Justin Bieber is so wildly popular, and because we need some data
to play with, I whipped up a script to use Twitter's streaming API to fetch
all the tweets mentioning him. You can change the search term to anything
you want, but trust me, with Bieber in it, you'll end up having thousands of
tweets in your Riak database in no time.
The script requires your Twitter username and password to be set as
environment variables TWITTER_USER and TWITTER_PASSWORD respectively.
Now you can just run node 08-riak/twitter-riak.js and watch as pure
awesomeness is streaming into your database. Leave it running for an hour
or so, believe me, it's totally worth it.
If you can't wait, five minutes will do. You'll still have at least a hundred
tweets as a result. The script will also store replies as proper links, so the
longer it runs the more likely you'll end up at least having some discussions
in there.
MapReduce
Assuming you have a whole bunch of tweets in your local Riak, the easiest
way to sift through them is by using MapReduce. Riak's MapReduce
implementation supports both JavaScript and Erlang. JavaScript is more
suitable for ad hoc style queries, whereas Erlang code needs to be known to
all physical nodes in the cluster before you can use it, but comes with some
performance benefits.
Speaking of Riak's MapReduce as a means to query data is actually a bit of a
lie, as it's rather a way to analyze and aggregate data. There are some caveats
involved, especially when you're trying to run an analysis on all the data in
your cluster, but we'll look at them in a minute.
A word of warning up-front: there is currently a bug in Riak that might
come up when you have stored several thousand tweets, and you're running
a JavaScript MapReduce request on them. Should you run into an error
running the examples below, there is a section dedicated to the issue and
workarounds.
MapReduce Basics
A MapReduce query consists of an arbitrary number of phases, each feeding
data into the next. The first part is usually specifying an input, which can be
an entire bucket or a number of keys. You can choose to walk links from
the objects returned from that phase too, and use the results as the basis for a
MapReduce request.
Following that can be any number of map phases, which will usually do any
kind of transformation of the data fed into them from buckets, link walks or a
previous map phase. A map phase will usually fetch attributes of interest and
transform them into a format that is either interesting to the user, or that will
be used and aggregated by a following reduce phase.
It can also transform these attributes into something else, like only fetch the
year and month from a stored date/time attribute. A map phase is called for
every object returned by the previous phase, and is expected to return a list of
items, even if it contains only one. If a map phase is supposed to be chained
with a subsequent map phase, it's expected to return a list of bucket and key
pairs.
Finally, any number of reduce phases can aggregate the data handed to them
by the map phases in any way, sort the results, group by an attribute, or
calculate maximum and minimum values.
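The flow of phases can be sketched locally without a Riak cluster. Everything below (the runner, function names, the sample tweets) is illustrative; nothing here talks to Riak, it only mimics how phases feed into each other.

```javascript
// How phases feed into each other, sketched locally: every map phase
// runs once per object and must return a list; every reduce phase
// receives the concatenated lists. Purely illustrative.
function runPhases(objects, mapFns, reduceFns) {
  var values = objects;
  mapFns.forEach(function (map) {
    // concat flattens the per-object lists back into one list.
    values = values.reduce(function (acc, obj) {
      return acc.concat(map(obj));
    }, []);
  });
  reduceFns.forEach(function (reduce) {
    values = reduce(values);
  });
  return values;
}

var tweets = [{ tweet: 'love it' }, { tweet: 'meh' }, { tweet: 'LOVE' }];
var result = runPhases(
  tweets,
  [function (doc) { return doc.tweet.match(/love/i) ? [1] : []; }],
  [function (values) {
    return [values.reduce(function (a, b) { return a + b; }, 0)];
  }]
);
console.log(result); // [ 2 ]
```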
Using Reduce to Count Tweets
var loveTweets = function(value) {
  try {
    var doc = Riak.mapValuesJson(value)[0];
    if (doc.tweet.match(/love/i)) {
      return [doc];
    } else {
      return [];
    }
  } catch (error) {
    return [];
  }
}
Before we look at the raw JSON that's sent to Riak, let's run this in the
Node console, feeding it all the tweets in the tweets bucket.
riak.add('tweets').map(loveTweets).run()
Imagine a long list of tweets mentioning Justin Bieber scrolling by, or try it
out yourself. The number of tweets you'll get will vary from day to day, but
given that so many people are in love with Justin, I don't have the slightest
doubt that you'll see a result here.
Looks simple enough, right? We iterate over the list of values using
JavaScript's built-in reduce function and keep a counter for all the results fed
to the function from the map phase.
Now we can run this in our console.
riak.add('tweets').map(loveTweets).
reduce(countTweets).run()
// Output: [ 8 ]
Re-reducing for Great Good
The result is weird: the number is a lot smaller than expected when you
compare it to the list of actual tweets containing "love". There's a reason for
this, and it's generally referred to as re-reduce. We can fix this no problem,
but let's look at what it actually is first.
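Here's a self-contained sketch of the effect. The batches and function bodies are made up for illustration; the naive counter mirrors the shape of the problem, not the book's exact code. Riak may run a reduce function several times, feeding earlier reduce output back in, and a reduce that just counts its inputs then counts counts instead of tweets.

```javascript
// Why the naive count comes out too small under re-reduce.
// Illustrative sketch with made-up partial batches.
function naiveCount(values) { return [values.length]; }

function sumCount(values) {
  // Immune to re-reduce: a sum of sums is still the right total.
  return [values.reduce(function (a, b) { return a + b; }, 0)];
}

// Simulate two partial batches of 5 matches each, then a re-reduce.
var batchA = [1, 1, 1, 1, 1];
var batchB = [1, 1, 1, 1, 1];

// Naive: each batch collapses to its length, and the re-reduce then
// counts the two partial results: [ 2 ] instead of [ 10 ].
console.log(naiveCount(naiveCount(batchA).concat(naiveCount(batchB))));
// Sum-based: 5 + 5, the right answer: [ 10 ].
console.log(sumCount(sumCount(batchA).concat(sumCount(batchB))));
```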
Much more like it. When you rerun the query, you'll now get a reasonable
number of "love" tweets as a result, and you'll agree that the number is
pretty crazy in comparison to the total number of tweets.
Counting all Tweets
The reduce function simply sums up all these values. Since the return value
from our map function is already a number, we don't have to do any type
checking for re-reducing, as the values will always be numbers in both cases.
The reduce function then aggregates the values based on the key, and stores
the value again with the same key, so it's immune to both results from the
map phase and to re-reducing its own data. The resulting method is
something that can easily be reused, because grouping data is a pretty
common pattern in aggregation. Just think of the last time you ran a GROUP
BY on a relational database.
The reduce phase iterates over all values and all attributes in the value, adding
up their values, and then stores it again in the result with the same key.
The function creates a new object and adds up the data from the values
handed into it, creating new attributes where necessary. Now let's run a map
reduce using the map function above and this reduce function.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).run()
The result of this MapReduce query depends on how long you left the
Twitter search running, but there should be at least one result for the hour
you ran it in. If you let it run for more than an hour, you'll see the hours
adding up and the number of tweets too.
Now, the real purpose of this was to explain chaining phases, right? So let's
add something that will extract the top five busiest Bieber hours in the day.
It'll be a tough call, since he's such a worldwide phenomenon, but we'll try
our best. In case you're wondering by now if I'm a big fan of his, I'm really
not.
Let's walk through it step by step, because this function is actually doing two
things. First it transforms the objects returned by the previous phase into an
array of arrays, because it's easier to iterate for the sorting that happens next.
Note that so far we're pretty oblivious whether the input is coming from a
map or a reduce function. You could argue that the transformation into an
array of arrays should probably be done in a map function, and I challenge
you to fix that after we're done here.
Anyhoo, after we're done transforming a single value, which is still of the
form {'17': 1234}, into a list of array tuples like ['17', 1234], we're
sorting the resulting array by the number of tweets, which is its second
element.
The result is then sliced to get the top five elements from the list. Now let's
chain us some reduce functions for great good.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).
reduce(topFiveHours).run();
You should see a nice list of sorted hours and the number of tweets as a result.
Parameterizing MapReduce Queries
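The parameterized code this section discusses is missing from the excerpt; a sketch of topHours, with the result count handed in through the phase argument (the name comes from the query below, the implementation is assumed):

```javascript
// Sketch: same as topFiveHours, but the number of results to keep
// is passed in as the phase argument instead of being fixed.
var topHours = function(values, top) {
  var tuples = [];
  values.forEach(function(value) {
    for (var hour in value) {
      tuples.push([hour, value[hour]]);
    }
  });
  tuples.sort(function(a, b) {
    return b[1] - a[1];
  });
  return tuples.slice(0, top);
};
```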
The only modification we made was to add a second parameter top to the
function, using it when calling slice() instead of a fixed number.
Let's run it real quick to verify that it actually works. Notice the second
parameter we added to the second reduce phase.
riak.add('tweets').map(hourOfDay).
reduce(groupByValues).
reduce(topHours, 1).run();
Similarly, we can adapt the map function we used to find tweets containing
“love” to accept an argument, so we can use it to search for arbitrary terms.
riak.add('tweets').map(searchTweets, 'love').
reduce(countTweets).run()
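The adapted searchTweets itself isn't shown in this excerpt; a sketch of what it could look like (a Riak object's JSON payload lives in values[0].data, and the third argument, arg, carries the search term; the details are assumptions):

```javascript
// Sketch of a parameterized map function: emits the tweet's text
// when it contains the term handed in through arg.
var searchTweets = function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  if (doc.text.match(arg)) {
    return [doc.text];
  }
  return [];
};
```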
In case you're wondering about the second parameter for the map function,
the one called keyData, it contains the data that was used to fetch the inputs
for this MapReduce request, in this case the tweets bucket.
Chaining Map Phases
riak.add('tweets').map(...).run({timeout: 10000})
The caveat that affects chained map phases is their data locality. A map
function that's part of a chain can't just return arbitrary data: unless it's the
last phase in a particular MapReduce request, it has to return a list of
bucket/key pairs, the same type of list that's fed into the initial map phase as
input.
There are two modifications to note. First, we're using the arg parameter as
a JavaScript object to fetch the keyword using the key with the same name.
Second, arg can now have a boolean attribute last. When it's true, the
function returns the object's value; if not, it returns its bucket and key for the
next map phase.
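Put together, the chainable version could look like this sketch (the bucket and key attributes on the Riak object are what the next phase's input list expects; the implementation is an assumption based on the two modifications just described):

```javascript
// Sketch of the chainable map function: arg is an object carrying a
// keyword and a last flag. Intermediate phases return bucket/key
// pairs, the last phase returns the tweet's text.
var searchTweets = function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  if (!doc.text.match(arg.keyword)) {
    return [];
  }
  if (arg.last) {
    return [doc.text];
  }
  return [[value.bucket, value.key]];
};
```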
And just like that, our map function is easily chainable. Let's chain us some
map phases!
riak.add('tweets').
map(searchTweets, {keyword: 'love', last: false}).
map(searchTweets, {keyword: 'hate', last: true}).run();
The results of this will once again depend on the number of tweets you have
in your Riak bucket. If you don't get anything, exchange the word "hate"
with just "bieber"; that should do the trick.
When would you use this kind of MapReduce magic? The above example
shows you one use case, sifting through data in multiple steps, narrowing
down the final result set as you go through the phases.
In a key-value store, keys should ideally follow a well-known format,
something that can be derived from a session identifier, a user name, an
email address, a URL, or some other attribute you have easy access to. Given
this preference, you could use chained map phases to sort of walk from one
kind of object to another.
Say you have a user object with an email address as an attribute. The user
is identified by his login name. You hand in that object to a map phase,
extracting the email address. You can use it then to generate a new key to
fetch more details about the email address, or to find emails sent to that email
address, given that you stored them in a way you can reconstruct using the
email address, like a single key that contains an object with all the emails sent.
This turns out to be very similar to link walking, which would certainly be
a preferable way, but it's nice to have options, right?
Efficiency of Buckets as Inputs
riak.add([['tweets', '41399579391950849'],
['tweets', '41399579391950848'], ...])
I'm sure you'll agree that this is pretty cumbersome, though it makes sense
in some scenarios, when your sole interest is to fetch more than one object
in one request to save round trips. Say we want to fetch the above tweets
without even running a reduce phase: we can do that with MapReduce, a
pretty useful technique.
riak.add([['tweets', '41399579391950849'],
['tweets', '41399579391950848']])
.map('Riak.mapValues').run()
This will return just the values of the specific keys. The Riak.mapValues()
function extracts the value from the Riak object handed to it. Be aware
that this won't include the usual metadata you'll get when fetching a single
object.
Back on our original track, what kind of data will a MapReduce request
usually run on? We run some sort of analysis or query on a set of data
identified by keys or ranges of keys. We could use the keys' names to restrict
the input to the map phase, reducing it to the data we're really interested in.
Key Filters
Enter key filters, a way to reduce the data set fed into the initial map phase
using the key schema. Key filters can be used to fetch only a certain range of
keys that match a regular expression or fall into a certain range of integers or
strings. The keys can be transformed by splitting them up based on a token,
or by converting a string key to an integer.
Key filters are specified together with a bucket to restrict the initial set of
keys that need to be sifted through. There can be one or more preprocessing
filters, followed by a list of one or more matching filters. The preprocessing
is optional, and is only necessary if you need to transform the keys into
something else before matching them, for example to convert strings to
integers.
Let's say we want to fetch all tweets whose keys start with 41399, which
should just return the tweets we created manually above.
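The query itself got lost at the page break; a sketch using Riak's starts_with key filter (the filter name is one of Riak's built-in key filters, the riak-js call is the same pattern as the earlier examples):

```javascript
// Sketch: a key filter list matching all keys that start with "41399".
var filters = [["starts_with", "41399"]];
// Used with riak-js, it restricts the initial inputs of the query:
// riak.add({bucket: 'tweets', key_filters: filters}).
//   map('Riak.mapValues').run()
```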
Key filters are specified using a list which in turn contains one or more
lists containing filter names and parameters, and it can look a bit confusing
the more filters you add. Let's build another filter, this time combining a
preprocessor and a match.
Say we want to do the same based on a numeric range, so we'll need to
convert the keys to integers first.
riak.add({bucket: 'tweets',
key_filters: [["string_to_int"],
["less_than", 41399579391950849]]}).
map('Riak.mapValues').run()
I should add that the number we're basing the range on is too large to be
represented exactly as a JavaScript integer, so you may see some oddness:
Node.js silently rounds it to 41399579391950850, so both tweets fall into
that range. It's an oddity of the Twitter API that came up when they
changed their way of generating tweet identifiers.
You can combine any number of transformation filters, for example first
tokenizing a string and then converting the result to an integer. To combine
a number of matching filters, you can throw in a logical operator. So to ask
for a specific range with less_than and greater_than, you can combine
them with and.
riak.add({bucket: 'tweets',
key_filters:
[["string_to_int"],
["and", [["less_than", 41399579391950850]],
[["greater_than", 41399579391950840]]]]}).
map('Riak.mapValues').run()
A logical operator accepts an arbitrary list of filters, so you can have separate
chains of transformation and matching in all parts of the operation. Let's
make this even more fun.
riak.add(
{bucket: 'tweets',
key_filters:
[["and", [["string_to_int"],
["less_than", 41399579391950850]],
[["string_to_int"],
["greater_than", 41399579391950840]]]]}).
map('Riak.mapValues').run()
You can combine operators at any level, though it gets kind of messy at
some point because of all the brackets involved. The next example looks for
tweets with identifiers less than 41399579391950850 that don't match the
string "849". Just like the top-level list of key filters, every logical operator
accepts a number of lists of filters or, once again, logical operators.
riak.add({bucket: 'tweets',
  key_filters:
    [["and", [["string_to_int"],
      ["less_than", 41399579391950850]],
      [["not", [["matches", "849"]]]]]]
}).map('Riak.mapValues').run()
Key filters are a nifty little tool, but don't come without disadvantages, as
they still put a considerable load on the Riak cluster. Matching keys still
requires loading and checking all of them.
Using Riak's Built-in MapReduce Functions
Using the built-in functions is mostly a matter of speed. When Riak starts
up, it also starts processes running the SpiderMonkey VM, which is
responsible for executing the JavaScript code, and these load the built-in
JavaScript functions at boot. That, in the end, is a lot cheaper than parsing
and evaluating ad-hoc JavaScript every time, as in all the examples above.
So on the one hand it makes a lot of sense to use the built-in functions as
much as you can, but it also makes sense to pre-load your own, custom
JavaScript, once it reaches a stable state, on all your Riak instances.
The set of built-in functions covers some basic ground. The good news is
they're even re-usable inside your own functions, as they're sharing the same
global namespace. You can go through the JavaScript file that contains the
built-ins to get a good grasp of what's available to you.
We already used a bunch of them, most notably Riak.mapValues or
Riak.mapValuesJson. To use them directly in a MapReduce query, without
specifying your own code, simply reference them as a string; Riak will look
up the function object and run it instead of a custom function you'd usually
provide.
What's more important than the built-ins even is that you can distribute
your own JavaScript functions. Using ad-hoc functions is nice to get going
with MapReduce, but distributing them across your Riak cluster is in the
long run more efficient, as Riak's JavaScript engine doesn't have to parse the
code every time, but only once, during start-up.
Intermission: Riak's Configuration Files
for additional, user-provided code to load. We'll revisit the vm.args file
when something relevant needs to be changed.
Errors Running JavaScript MapReduce
There is currently an open ticket for this problem, which you can track to be
notified of fixes or further workarounds and findings. The bug still exists in
the current version of Riak, which, at the time of writing, is 1.1.2.
The problem is that there are not enough JavaScript processes around to
handle the current request. The simplest workaround is to increase the
number of processes. To do that, we go back to the app.config file, which
you've just been introduced to.
In the section for riak_kv, there are two settings you need to change, both
specify the number of JavaScript processes available to Riak Pipe. The default
is shown below.
{map_js_vm_count, 8 },
{reduce_js_vm_count, 6 },
To reduce the likelihood of the error popping up, increase both numbers to
24 and 18 respectively, tripling the number of processes available. Note that
this will also increase the amount of memory required to run Riak, as every
JavaScript process has a default of 8MB allocated to it.
{map_js_vm_count, 24 },
{reduce_js_vm_count, 18 },
Restart Riak using the riak restart command, and retry running the query
that caused the issues.
Deploying Custom JavaScript Functions
Once you've figured out a proper location, you just need to tell Riak about
it. In the app.config file there's a setting named js_source_dir, which is
commented out by default. If you change the line to something like shown
below, Riak will load the JavaScript files in that directory on startup.
{js_source_dir, "/etc/riak/js_source"},
Using Erlang for MapReduce
riak.add('tweets').map({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'map_object_value'}).run()
To add some reduce goodness, let's add a function that counts all the data
returned by the above map phase, effectively giving us a total number of
objects in the bucket.
riak.add('tweets').
map({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'map_object_value'}).
reduce({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'reduce_count_inputs'}).
run()
Now, the really nice part about this is that we actually don't need to run the
map phase just to count the objects. As we're querying the whole bucket and
just need to count the bucket-key pairs fed into the MapReduce request, we
might as well just use the reduce function all by itself.
riak.add('tweets').
reduce({language: 'erlang',
module: 'riak_kv_mapreduce',
function: 'reduce_count_inputs'}).
run()
This has the added benefit that the data doesn't need to be loaded. We're
running a query on the data without actually loading the data, just based on
the keys. That's a pretty neat tool right there, pretty useful when you just
want to count the number of objects in a bucket.
Riak comes with a bunch of built-in functions for Erlang too, properly
documented and well worth looking into. As a general rule, you'll benefit a
lot from using Erlang to write your own MapReduce code, simply because
it doesn't require all the overhead needed for JavaScript, like serializing and
calling out to external libraries (the SpiderMonkey JavaScript VM in this
case).
That said, there's a learning curve, but in my experience it's not too steep.
The Riak code itself is quite readable and nicely documented too, so you can
get a good idea of what's going on in the system, and what you can do with
Erlang and the data stored in Riak.
$ riak attach
Attaching to /tmp/usr/local/riak-1.1.2/erlang.pipe.1 (^D to exit)
(riak@127.0.0.1)1>
The following steps are not just useful to run MapReduce queries using
Erlang, they're helpful to give you an idea of how you can work with Riak
more closely, if you need to debug something.
First of all, we'll fetch a local client. This talks directly to Riak without going
through HTTP or Protocol Buffers, instead using plain old Erlang function
calls and message passing. You can type the following lines into the console,
I'll omit the console sugar to focus on the Erlang code. Don't forget to end
every statement with a period.
{ok, C} = riak:local_client().
This fetches a local client, and assigns it, through the magic of pattern
matching, to the variable C, which now holds our client object. Using that,
you can fetch data from Riak, run search queries, or execute MapReduce
jobs. Here's how you fetch an object.
C:get(<<"tweets">>, <<"41399579391950848">>).
The first line decodes the JSON data using mochijson2, which Riak also
uses internally to handle JSON data, assigning the result to the variable
Obj. We're using pattern matching to get rid of the struct term. The third
line extracts a value from the leftover property list (a list of key-value
pairs), namely the value of tweet, which is the tweet's body, returning it in
a list. Erlang map functions are expected to return lists too.
That's it. No magic, and pretty straightforward too. To run it, the local Riak
client offers a mapred function. Here's how to run our function on a single
tweet.
C:mapred([{<<"tweets">>, <<"41399579391950848">>}],
[{map, {qfun, ExtractTweet}, none, true}]).
The first line specifies the input for the MapReduce request, a list of tuples
with bucket and key. The second line specifies a map phase, hence the map
term at the beginning. Phases are also specified in a list of tuples, but our
example contains only one. Using the qfun term, we're specifying that an
anonymous function is to be used, don't feed any arguments into the map
function (none), and finally tell it to return the data from this phase to the
client (true).
The resulting output should include a list of just one tweet body. There, you
just wrote you some Erlang. You can take this a lot further: you could even
write back to Riak from an Erlang map or reduce function, which you can't
do in JavaScript, a nice touch for storing intermediate results back into Riak.
You can also run built-in Erlang map and reduce functions this way, and
even kick off JavaScript jobs. Here's an example for a full MapReduce with
two built-in functions, running the same functions as the JavaScript code in
the previous section.
C:mapred([{<<"tweets">>, <<"41399579391950848">>}],
[{map, {modfun, riak_kv_mapreduce, map_object_value},
none, false},
{reduce, {modfun, riak_kv_mapreduce, reduce_count_inputs},
none, true}]
).
Instead of qfun the code specifies modfun, with a module and function name
following, telling Riak to run this function on the inputs.
If you're working with Riak in production, it's well worth familiarizing
yourself with the things you can do from the Erlang console. It comes in
handy every now and then.
On Full-Bucket MapReduce and Key-Filters Performance
Riak Search
Riak Search is to Riak what Sphinx is to MySQL: a full-text search add-on
that can use your main data store as the source for building an inverted index
of your data, tokenizing strings in a way that allows you to search for only
parts of a string. It also was one of the first truly distributed full-text search
engines, right up there with ElasticSearch. It scales up and down just like the
rest of the Riak ecosystem does.
Riak Search was heavily inspired by Apache Lucene, which is close enough
to being a de facto standard for full-text search. The similarities focus mostly
on the interface, though, manifesting themselves in the Solr-like HTTP
interface and the Lucene-style query syntax, but Riak Search doesn't support
the full Lucene query set yet.
Indexing Data
Just enabling it only gives us the option to use Riak Search, it doesn't index
anything yet. But we do have three options to do so now:
• From the command line, using search-cmd
• Indexing objects stored in Riak through a pre-commit hook on a per-bucket basis
• Indexing data directly, using a Solr-like HTTP interface
There's a secret, fourth way to index data, using the Erlang API, which
search-cmd uses, but we'll ignore that for now.
You should see some nice output telling you how many documents were
indexed and how fast. Boom, we have data. Using the command line is not
necessarily the best way to index data, but it's the easiest way to get started
with Riak Search quickly.
The Riak Search Document Schema
or strings, whereas using search-cmd will dump everything from a file into
one field, which makes up the whole document.
Every document has a bunch of fields and a default field. This comes from
Lucene, where the default field is the one that you don't have to explicitly
specify when querying data. If there's a string in your query that isn't
prefixed with a field name, Riak Search (and Lucene) assume you're
searching the default field.
What you get is a list of documents that matched the query. It should be
three, which is not surprising because all of them are ipsum texts, but there
you go.
This doesn't return the documents themselves, it only gives you references
along with scores and positions. Use search-cmd search-doc if you're
interested in all indexed fields and their respective values. It's a nice tool for
simple debugging purposes.
{dynamic_field, [
{name, "*_text"},
{type, string},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}
The default schema specifies a bunch of these dynamic fields. Three different
field types are supported: integer, string, and date.
Every field has an analyzer attached to it. You can specify your own, but
there's a bunch of built-in analyzers that should be more than enough to get
you started.
If you have a fixed schema for documents, and you don't need to rely on
using name patterns to determine the type, but instead know it upfront, you
can declare a field instead of dynamic_field, simply specifying a full name
instead.
Analyzers
What does an analyzer do? Its main purpose is to look at a particular field,
tear apart the data in it, and tokenize it for efficient storage and lookup in the
search index.
The standard analyzer assumes a string is in the English language, tokenizes
it based on whitespace, removes tokens (words) shorter than 3 characters,
and removes a bunch of stop words like "a", "if", "the", "this", and the like.
The standard analyzer is (by default) used for fields whose names end with
_txt or _text.
Why is it important for dates? Since everything is stored as strings and sorted
lexicographically, dates need to be treated as strings as well. Riak doesn't
really care about the format you're using, as long as it implies a sorting order
that is the same as the ordering of days and time, so that November 3rd 2011,
12:00 am comes before November 4th 2011, 1 pm.
This means that the American system for dates and time is of no real use, and
you're better off using the ISO 8601 format to specify both, so that the above
examples would be "2011-11-03T00:00:00" and "2011-11-04T13:00:00"
respectively.
The no-op analyzer is mostly useful for strings you expect to only contain
one word, that should be indexed as a whole. You can still do queries for
e.g. all documents that have the date 2011-11-04 or even 2011-11, thanks to
lexicographical ordering.
-module(german_analyzer).
-export([
german_analyzer_factory/2
]).
Last but not least, the part that determines if a term is a stop word based
on ordered lists of words. It's based on a simple and not fully complete set
of German words you usually don't want to have indexed, but you get the
idea. Turns out, there are a lot more stop words in German than there are in
English.
If you feel like it, you can play with the code in the Erlang shell to see what
it does. The code is part of the examples repository. Given you have Erlang
installed, bring up the shell in the same directory as the .erl file using the erl
command. Compile the source file like so:
c(german_analyzer).
{dynamic_field, [
{name, "*_de"},
{type, string},
{analyzer_factory, {erlang, german_analyzer,
german_analyzer_factory}}
]}
{field, [
{name, "long_number"},
{type, integer},
{padding, 19},
{analyzer_factory,
{erlang, text_analyzers, integer_analyzer_factory}}
]}
You can mark a field as required using the required option. By default, all
fields defined in a schema are optional.
{field, [
{name, "email"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}
Using skip, you can tell Riak Search not to index a particular field. It's still
stored, but simply not available for search. Use this when you want to keep
your index as small as necessary, only indexing the fields you need to have
indexed.
{field, [
{name, "metadata"},
{type, string},
{skip, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}
You can specify a list of aliases for a field using the aliases option, which
means nothing more than that the field is stored multiple times in the index,
with every alias in the list as field name. That way you can reference the field
using its different names in queries.
{field, [
{name, "user"},
{type, string},
{aliases, ["username"]},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]}
An Example Schema
Going back to our Justin Bieber tweets, it's only fair to start indexing them,
and to set a schema that matches our interests for search. We index the
tweet's identifier, the user name, the time, and the text. This schema will be
our blueprint for the following excursions into indexing data directly from
Riak KV and using the Solr interface.
Every schema starts with a header that defines a couple of things, most
notably the default field for queries without an explicit field prefix.
{
schema,
[
{version, "1.1"},
{n_val, 3},
{default_field, "tweet"},
{default_op, "or"},
{analyzer_factory,
{erlang, text_analyzers, whitespace_analyzer_factory}}
],
[
%% IDs coming from Twitter's API are 64 bit integer values
%% Padding is 19 to accommodate that
%% Keeping the field name id_str from the Twitter API
{field, [
{name, "id_str"},
{type, integer},
{required, true},
{padding, 19},
{analyzer_factory,
{erlang, text_analyzers, integer_analyzer_factory}}
]},
{field, [
{name, "tweet"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, standard_analyzer_factory}}
]},
{field, [
{name, "tweeted_at"},
{type, date},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, noop_analyzer_factory}}
]},
%% Username doesn't need to be analyzed
%% it's always one word
{field, [
{name, "user"},
{type, string},
{required, true},
{analyzer_factory,
{erlang, text_analyzers, noop_analyzer_factory}}
]},
%% Skip everything else
{dynamic_field, [
{name, "*"},
{skip, true}
]}
]
}.
The schema defines four fields (neatly using different analyzers, which suits
our purpose of having a good example set quite nicely), and skips everything
that we're not interested in by declaring a dynamic field that matches
everything else.
Fields in a document are matched in the order they appear in the schema.
In the above example Riak Search will first go through the fixed fields and
try to find an exact match. Then it tries to match all the dynamic fields in
the order they appear in the list; the first match wins. So be sure to put
catch-all definitions, like a field that skips everything that doesn't match, at
the bottom of the schema.
Note that an index doesn't have to exist yet for you to set the schema, much
like a bucket in Riak doesn't need to have any data in it before specifying a
configuration for it.
We now have a schema in place. To confirm, run the following command
to show the schema for the tweets index. It should show you the schema we
just set.
Indexing Data from Riak
a bucket name, it installs the pre-commit hook for us, in this case for our
tweets bucket.
Note that this doesn't retroactively index all data in the bucket. It only
indexes data stored from that moment on.
Remember that indexing through the command-line assumed all files to
be text? Turns out the commit hook is much smarter than that. It uses the
content type for a Riak object to determine how to deserialize it. If you're
storing JSON, it's got you covered, same for XML. If you're using some
other serialization format that's not supported by default, you can specify
your own Erlang code to deserialize it into something Riak Search can index.
All the tweets we've already indexed so far are unfortunately not part of the
index yet. But assuming you've installed the schema for the tweets index and
the pre-commit hook for the bucket, we're good to go on writing more data.
The hook will take care of updating data when you update existing objects
and deleting data from the index when you delete an object from Riak KV.
There's one thing to be aware of before we continue. The current
implementation of the Twitter indexer doesn't use a date format that's
suitable for sorting; it uses the string returned from the Twitter API, which
is of the form "Thu Nov 03 22:27:30 +0000 2011".
Not a big deal if we're not sorting based on the date, but let's look at an
implementation that creates the proper format string (following ISO rules,
of course) and also adds the tweet's identifier to the document before storing
it. I left out the surrounding code for brevity. Thankfully JavaScript's
implementation of Date comes with a handy method for this purpose.
twitter.addListener('tweet', function(tweet) {
var createdAt = new Date(tweet.created_at).toISOString();
var key = tweet.id_str;
var tweetObject = {
user: tweet.user.screen_name,
tweet: tweet.text,
tweeted_at: createdAt,
id_str: key
}
var links = [];
if (tweet.in_reply_to_status_id_str != null) {
links.push({
tag: 'in_reply_to',
bucket: 'tweets',
key: tweet.in_reply_to_status_id_str
});
}
riak.save('tweets', key, tweetObject, {links: links},
function(error) {
if (error != null)
console.log(error);
});
})
Re-run the Twitter stream for a little bit so we get data that we can run
queries on.
Using the Solr Interface
Now that we have some data in our search index, how do we get it out
again? Or rather, how can we query the search index to find documents
we're interested in?
riak.search('tweets', 'tweet:hate')
Running that, and given you have some fresh results in your database, you
should see some data running across your screen. It's not entirely helpful, so
let's look at what kind of data the Solr API returns. It's capable of handling
both XML and JSON, the latter being more interesting for us right now.
To figure out the exact data, we'll use curl to fetch it. The Solr HTTP
interface is mounted to /solr, so the URL http://localhost:8098/solr/
is our entry point, just add an index and an action. The full URL we can use
to query the tweets index is http://localhost:8098/solr/tweets/select,
and here's how you can query it using curl:
$ curl 'localhost:8098/solr/tweets/select?q=tweet:hate'
As I'm sure you'll notice, we just received a result as XML, a sensible default
given that Solr returns XML too. It also lets you specify other formats using
the wt parameter. Riak Search supports XML and JSON, so you can use both
formats as parameters in lowercase. Here's the JSON equivalent of the above
query.
$ curl 'localhost:8098/solr/tweets/select?q=tweet:hate&wt=json'
By the way, when using riak-js, it already sets the result format to JSON for
you, because it's easier to parse directly into JavaScript objects.
There's one thing you should remember, especially when you've used Solr in
the past. The Solr layer Riak Search offers is merely for API compatibility. It
allows you to use a Solr client with Riak Search, at least for the feature set it
supports. Under the hood, there's nothing resembling Solr or Lucene except
for the query syntax. Riak Search and Solr are still two semantically different
things, one doesn't work like the other.
$ curl 'localhost:8098/solr/tweets/select?q=tweet:love&rows=100'
You can paginate data with an offset; to fetch 20 rows while skipping the
first 20, use something like this:
$ curl \
'localhost:8098/solr/tweets/select?q=tweet:love&rows=20&start=20'
The disadvantage of using rows and start is that Riak Search will still
accumulate all the data first and then apply the parameters, a known
problem.
Using riak-js, you simply specify all these options in a parameters hash.
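The example got lost at the page break; a sketch of what it could look like (assuming riak-js passes rows and start straight through to the Solr interface as query parameters):

```javascript
// Hypothetical options hash mirroring the Solr query parameters.
var options = {rows: 20, start: 20};
// riak.search('tweets', 'tweet:love', options)
```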
You'll want to always specify a maximum for the number of rows returned,
because riak-js sets it to 10000.
If we simply want to use the key for sorting, which, in our case, should be
almost equivalent to sorting by the time a tweet was created, we can use the
handy presort option. The advantage of using presort is that it's applied
before limit and offset are applied, which is not the case with sort, a
known issue with Riak Search.
Without any sort parameter, Riak Search sorts results in descending order,
so the highest scores, and therefore the best matches, come first. When you
specify a sort field, the order changes to ascending, but it can be changed by
explicitly appending asc or desc to the field name.
Search Operators
A search query can be of arbitrary complexity. So far we've only looked at
queries for a single word. Of course you can search for arbitrary strings and
phrases as well. Like any good full-text search, Riak Search doesn't just keep
track of single words, but of their occurrence in phrases as well. Simply
surround the string in quotes; double quotes are as valid as single quotes.
This query searches for the string "justin bieber".
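For example, a reconstruction of that query; the URL encoding only matters when you hit the Solr HTTP interface directly:

```javascript
// The phrase query; single and double quotes both delimit phrases.
const phrase = 'tweet:"justin bieber"';

// When hitting the Solr HTTP interface directly, the quotes and the space
// must be URL-encoded:
const url = 'localhost:8098/solr/tweets/select?q=' +
  encodeURIComponent(phrase);
console.log(url);
```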
If you don't specify explicit quotes, and your query string is "tweet:justin
bieber", then Riak Search looks for documents that contain "justin" in the
field tweet or tweets that contain the string "bieber" in the default field.
Everything that's not explicitly prefixed with a field name and isn't
surrounded by quotes is matched against the default field, defined in the
schema or via the optional query parameter df.
If you specify more than one field query, Riak Search uses the operator OR to
put together the query. You can override that by either specifying operators
explicitly or by setting a default search operator. Here's a query with
explicit operators, searching for tweets that contain both love and hate.
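That query string, spelled out (a reconstruction; with riak-js you'd pass it to riak.search as usual):

```javascript
// Reconstructed query string: both terms must match.
const query = 'tweet:love AND tweet:hate';
console.log(query);
```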
Note that search operators are case sensitive, meaning AND is an operator,
whereas "and" is a word to query for. Using the setting default_op in the
schema definition, you can tell Riak to always assume AND instead of OR.
You can use the NOT operator to negate parts of the query; use it to search for
tweets that contain "love" but not "hate".
If you've used Solr or Lucene before, the rules for operators should be no
surprise. If you don't feel like littering your query statements with a pile of
AND, OR, and NOT, you can use + or - instead, though both imply that AND is the
search operator. The previous query would then look like this.
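Both spellings side by side, reconstructed:

```javascript
// The NOT query and its +/- shorthand; both match tweets containing
// "love" but not "hate".
const withNot = 'tweet:love AND NOT tweet:hate';
const shorthand = '+tweet:love -tweet:hate';
console.log(withNot + '  <=>  ' + shorthand);
```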
You can boost the relevance of certain terms by raising their score,
influencing their position in the result set. Explaining the full magic of
term relevance is beyond the scope of this book: read the Solr wiki page for
all the gory details. Let's search for tweets containing "love" or "hate" and
give more relevance to hate.
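For instance, something like this; the boost factor of 2.0 is our own choice:

```javascript
// Boost "hate" above the default relevance with the ^ operator.
const boosted = 'tweet:love OR tweet:hate^2.0';
console.log(boosted);
```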
riak.search('tweets',
'(tweet:love OR tweet:hate) AND user:roidrage')
Riak Search also supports proximity search, which helps find documents
based on how close the words in a phrase are to each other, using the tilde
operator. To look for documents where the words "grandma" and "loves" are
no more than four words apart, that is, with no more than three other words
between them, you can use the following.
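A reconstruction of that query:

```javascript
// "grandma" and "loves" at most four words apart.
const proximity = 'tweet:"grandma loves"~4';
console.log(proximity);
```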
Another useful feature is the ability to search for ranges of matches. As all
data is stored as strings, and is sorted lexicographically in Riak Search (and
most if not all full-text searches, for that matter), you can even use our date
formatting to ask for tweets that occurred between the 1st and the 30th of
November 2011. This kind of search relies on the fact that a string like
"20111101" is considered of lower order than "20111101T00:00:00", so it
matches anything that starts with the string "20111101".
Riak Search supports both inclusive and exclusive ranges, using brackets and
curly braces respectively. To search for all tweets of November, an inclusive
search is the way to go.
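Such an inclusive query might look like this, with deliberately lax bounds:

```javascript
// Inclusive range over all of November 2011.
const november = 'tweeted_at:[2011-11-01 TO 2011-11-30]';
console.log(november);
```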
Even though something like this is possible, it's a good idea to be as specific
as possible with the lower and upper bounds, by specifying
"2011-11-01T00:00:00" as a lower bound and, to be even more specific,
"2011-11-30T23:59:59.999Z" as an upper bound. This is due to
lexicographical ordering, where "2011-12-01" has lower sorting priority
than "2011-12-01T00:00:00". Should we be interested in tweets from
December 1st too, we'd either have to specify the following day, or just
specify the full date string. As a general rule, if you can, be as specific as
possible, and make sure both bounds are equally specific. If the lower bound
is too lax you'll get results you may not expect.
If you're not interested in the data matching the lower and upper bounds,
you can use curly braces instead.
riak.search('tweets',
'tweeted_at:{2011-11-01T00:00:00.000Z TO 2011-12-01T00:00:00.000Z}')
To search for partial matches, i.e. words that start with a specific term, the
* operator is here for you. It matches any word that begins with the term
provided and ends with any number of characters. It needs a minimum of
three characters though, so the first example below won't work, but the
second will. Note that you can't specify the operator at the beginning of the
search term, only at the end.
riak.search('tweets', 'tweet:j*');
riak.search('tweets', 'tweet:jus*');
Under the covers, the wildcard is just an inclusive range query, using the
lowest and highest boundary possible, speaking in lexicographical terms. So
the query above (the working one) is internally turned into something like
this.
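We can illustrate the idea in plain JavaScript; the '\uffff' upper boundary is our stand-in for whatever highest character Riak Search actually uses internally:

```javascript
// A trailing wildcard like "jus*" behaves like an inclusive range from the
// prefix itself up to the prefix followed by the highest possible
// character, lexicographically speaking.
const prefix = 'jus';
const lower = prefix;
const upper = prefix + '\uffff';
const matches = function(word) { return lower <= word && word <= upper; };
console.log(matches('justin'), matches('juice')); // → true false
```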
Just like the * operator, the ? wildcard matches exactly one character and
operates as a range query under the covers, just a tiny bit more specific,
adding a zero byte to the lower boundary.
riak.search('tweets', 'tweet:hat?')
There you go, all you need for a handy reference. While we're at it, let's tear
apart the features of the query language, obviously something that must end
in another handy reference table.
Operator Explanation
AND Conjunction of two field queries. Both must match for the
query to yield a result. AND has higher precedence than OR.
Example: "tweet:justin AND tweet:love"
OR Disjunction of two field queries. At least one of the fields must
match for the query to be successful. Example: "tweet:hate OR
tweet:love"
NOT Negate a field query. Should not match the negated part. Can
be combined with other field queries, but also used on its own
to search for documents that don't contain a specific string.
Example: "tweet:hate AND NOT tweet:love"
* Wildcard search. Matches all strings of any length that start with
a given string. Prefixing string must be at least three characters
long. Example: "tweet:bieb*"
? Wildcard search. Matches exactly one character. Example:
"tweet:love?"
() Group several queries logically to give higher precedence to
OR queries. Example: "(tweet:bieber OR tweet:justin) AND
tweet:hate"
[ TO ] Inclusive range search, following lexicographical ordering.
Includes words matching the upper and lower bounds and
anything between. "TO" operator is case sensitive. Example:
"tweeted_at:[20111101 TO 20111130]"
{ TO } Exclusive range search. Includes only words in between lower
and upper bounds. Example: "tweeted_at:{20111101T00:00:00
TO 20111201T00:00:00}"
~<int> Proximity search, works only on phrases, not single search
terms. Searches for documents where the words in the search
string are at most <int> words apart. Useful for somewhat fuzzy
searching. Example: "tweet:'love bieber'~2"
^<float> Boost a search term, giving a specific term a higher or lower
relevance than others, giving the results matching that term a
higher or lower score, and a higher or lower rank in the search
result. Defaults to 0.5. Example: "tweet:love^1 OR
tweet:hate^0.2"
You can also use Riak Search solely as a full-text search engine if you're not
interested in using Riak as a database, because your data lives elsewhere, in a
MySQL database for example.
Just like querying, the interface for indexing is compatible with Solr. Unlike
querying, indexing only supports XML. Bummer, but a client library should
hide that fact from you anyway, allowing you to pass in something like hashes
and converting them to XML automatically. Speaking of client libraries,
riak-js unfortunately doesn't currently support this API, but a
decent Solr client should do instead.
To add or update a document, you POST to the endpoint /solr/INDEX/update.
Updating and adding are considered the same thing; Riak Search doesn't
keep track of conflicts or versions, so the most recent write wins.
Here's the simplest indexing that could possibly work using curl:
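A minimal request body for `curl -X POST localhost:8098/solr/tweets/update -H 'Content-Type: text/xml' -d @-`, assuming Solr-style <add>/<doc> markup matching the delete example further down; the field names here are our own:

```xml
<add>
  <doc>
    <id>1</id>
    <user>roidrage</user>
    <tweet>Indexed through the Solr interface</tweet>
  </doc>
</add>
```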
Note that the XML shown here is actually entered on the command-line,
the parameter -d @- tells curl to read the post data from stdin, so when
done typing (or copy-pasting), type Ctrl-D to send end-of-file. You can
specify a number of documents at once, simply add more <doc> sections
inside the <add> tags. Be aware that this is not the same as a bulk import
though, a feature that Riak Search doesn't support unfortunately; it's merely
a convenient way to throw multiple documents at the index at the same time.
Every document is still indexed and committed separately.
<delete>
<id>1</id>
<query>name:"My god"</query>
</delete>
riak.addSearch("tweets", "tweet:hate").
map('Riak.mapValuesJson').run()
Riak Secondary Indexes
riak-js comes with only preliminary support for 2i, but it's more than good
enough for our purposes. Secondary indexes are really simple to build
and use: just add a new index attribute to the metadata.
tweet = {
user: 'roidrage',
tweet: 'Using @riakjs for the examples in the Riak chapter!',
tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}
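Here's a self-contained sketch of the save call with index metadata; the client is stubbed out so the example runs standalone, and the index field names are our own choice:

```javascript
// The tweet from above, repeated so this sketch is self-contained.
const tweet = {
  user: 'roidrage',
  tweet: 'Using @riakjs for the examples in the Riak chapter!',
  tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
};

// Stub standing in for riak-js' save(bucket, key, document, meta); it just
// records what would be sent.
const riak = {
  saved: null,
  save(bucket, key, doc, meta) {
    this.saved = { bucket: bucket, key: key, doc: doc, meta: meta };
  }
};

// The index hash in the meta argument is the only addition; riak-js
// resolves the names to their suffixed (_bin/_int) form for us.
riak.save('tweets', '41399579391950848', tweet, {
  index: { username: tweet.user, tweeted_at: tweet.tweeted_at }
});
console.log(riak.saved.meta.index);
```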
The only change we've made is adding some metadata for indexes. riak-js
automatically resolves the field names to have the proper datatype suffixes, so
the code looks a bit cleaner than the underlying HTTP request, which we'll
look at anyway.
There are special field names at your disposal too, namely the field $key,
which automatically indexes the key of the Riak object. Saves you the
trouble of specifying it twice. Riak automatically indexes the key as a binary
field for your convenience, so be sure to avoid using the field $key elsewhere.
It's also worth mentioning that the $key index is always at your disposal,
whether you index other things for objects or not. That gives you a nice
advantage over key filters when you query Riak for ranges of keys.
That's pretty much all you need to know to start indexing data. There's no
precondition, just go for it. It really is the simplest way to get started building
a query system around data stored in Riak.
Querying an index happens through a URL of the following form.
/buckets/<bucket>/index/<fieldname>/<query>
Where <fieldname> is the indexed field (including _bin or _int suffix), and
<query> is the value you're searching the index for. So to search for all tweets
with the username roidrage the URL looks like this.
/buckets/tweets/index/username_bin/roidrage
The result is a JSON list of keys. Let's throw it at curl and see what comes
back.
$ curl localhost:8098/buckets/tweets/index/username_bin/roidrage
{"keys":["41399579391950848"]}
You can ask for ranges of values too, just add another URL component as the
upper bound. Here's how you can fetch keys using a specific date range.
$ curl localhost:8098/buckets/tweets/index/tweeted_at_bin/ \
2011-02-26T00:00:00.000Z/2011-02-26T23:59:59.000Z
{"keys":["41399579391950848"]}
Ranges are always inclusive, so they include any matches of the upper and
lower bounds. To query for a range with riak-js, simply specify an array of
two values instead of a single value.
riak.query('tweets', {tweeted_at: [
"2011-02-26T00:00:00.000Z", "2011-02-26T23:59:59.000Z"
]})
And that's about it. 2i is pretty simple, especially compared to Riak Search,
which is no doubt much more powerful, but sometimes a simple lookup or
a simple range query is all you need. It's worth mentioning that you can
only query one index with a single query; there's currently no way to do
compound queries across multiple indexes.
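Since compound queries aren't supported, a common workaround is to run two index queries and intersect their key lists client-side; a minimal sketch:

```javascript
// Key lists as returned by two separate 2i queries (sample data).
const byUser = ['41399579391950848', '41399579391950999'];
const byDate = ['41399579391950848'];

// Intersect client-side; fine for small result sets, costly for large ones.
const both = byUser.filter(function(key) {
  return byDate.indexOf(key) !== -1;
});
console.log(both); // → ['41399579391950848']
```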
riak.add({bucket: 'tweets',
index: 'username_bin',
key: 'roidrage'}).
map('Riak.mapValuesJson').run()
tweet = {
user: 'roidrage',
tweet: 'Using @riakjs for the examples in the Riak chapter!'+
' /cc @frank06',
tweeted_at: new Date(2011, 1, 26, 8, 0).toISOString()
}
Another consequence is that when searching for multiple terms, Riak Search
has to fetch the results for all terms in the query first, querying all nodes
relevant to cover all requested terms, and then merge the results together.
Riak 2i adheres to the same replication settings as storing data in Riak does,
meaning an object that's replicated three times has a 2i index that's also
replicated three times. Riak Search does replication too, but its read and
write operations don't use any quorum at all. Riak Search is still fault-tolerant
and recovers from failures, don't worry, but this is necessary so clients don't
block waiting for the indexing to finish. Writing to Riak Search is cheap
from the client's perspective, but still expensive to do on the server.
We'll look at when to use which approach in the next section, throwing in
MapReduce for good measure.
How Do I Index Data Already in Riak?
into Riak, and in general everything that requires more dynamic approaches
for looking at the data at hand.
A specific dataset is an important thing to keep in mind here. MapReduce in
Riak is not exactly the same as Hadoop, where you can practically analyze
an infinite amount of data. In Riak, data is pulled out of the system to be
analyzed. That alone sets the boundaries of what's possible. The more
focused you keep the set of data you feed into Riak's MapReduce, the better
the two of you will get along.
For more ad-hoc style analysis you can easily utilize Riak Search and Riak
2i to narrow down the dataset for MapReduce, and you should prefer that
approach any time over loading entire buckets of data.
Using Pre- and Post-Commit Hooks
Validating Data
The simplest thing that could possibly work is a JavaScript function that
checks if the data written is valid JSON. To validate, the function tries to
parse the object from JSON into a JavaScript structure. Should parsing the
object fail, the function returns a hash with the key fail and a message to the
client. Alternatively, the function could just return the string "fail" to fail
the write.
If parsing succeeds, it returns the unmodified object. To make the code
easier to deploy later, it's wrapped into a Precommit namespace and assigned
to a function variable validateJson, so we can call the method as
Precommit.validateJson(object).
var Precommit = {
validateJson: function(object) {
var value = object.values[0].data;
try {
JSON.parse(value);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
}
}
There is a problem with this code. Pre-commit hooks are not just called
for writes and updates, they're also called for delete operations. When a
client deletes an object, the pre-commit hook will waste precious time trying
to decode it. Riak sets the header X-Riak-Deleted in the object's
metadata when it's being deleted.
To work around this particular case, we'll extend the code to exit early and
return the object when the header is set.
Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
JSON.parse(value.data);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error}
}
}
}
Commit hooks are defined in the bucket properties. To enable the above
function for the tweets bucket, we update those properties.
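As a sketch, the bucket properties update is a PUT of JSON like the following to /buckets/tweets/props with Content-Type: application/json; the exact endpoint spelling varies between Riak versions:

```json
{"props": {"precommit": [{"name": "Precommit.validateJson"}]}}
```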
You can specify a list of functions, they'll be chained together for every
write. For JavaScript functions, a simple hash with a key name and the name
of the function is all you need.
To validate that it works, let's throw some invalid JSON at Riak.
Riak returns an HTTP status code of 403 and the error message generated
when parsing the data failed. To make sure valid JSON is still accepted, let's
test the positive case too.
If the Erlang function returns a tuple with the term fail and a string message,
the write fails. Below is a rewrite of the JavaScript JSON validator in Erlang.
-module(commit_hooks).
-export([validate_json/1]).
validate_json(Object) ->
try
mochijson2:decode(riak_object:get_value(Object)),
Object
catch
throw:invalid_utf8 ->
{fail, "Parsing the object failed: Illegal UTF-8 character"};
error:Error ->
{fail, "Parsing the object failed: " ++
binary_to_list(list_to_binary(
io_lib:format("~p", [Error])))}
end.
The code uses mochijson2 to decode the object into an Erlang structure.
mochijson2 can be a bit more specific as to why parsing failed, in particular
when it finds invalid UTF-8 characters not allowed in JSON.
You can compile this code in the Erlang console, much like the code in the
section on writing a custom analyzer for Riak Search.
To set an Erlang function as a pre-commit hook, the format of the bucket
property is a bit different. Instead of a name key, a mod and a fun key
must be specified.
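A sketch of the corresponding bucket properties, assuming the keys are spelled mod and fun:

```json
{"props": {"precommit": [{"mod": "commit_hooks", "fun": "validate_json"}]}}
```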
var Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
var data = JSON.parse(value.data);
object.values[0]['metadata']['index'] =
this.extractIndexes(data);
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
},
extractIndexes: function(data) {
var indexes = {};
for (var key in data) {
var name = key + '_bin';
indexes[name] = data[key];
}
return indexes;
}
}
Don't forget to restart Riak after you've updated the script file.
If you want to modify the object's data, you deserialize it first, for instance
from JSON into a JavaScript object, modify it as needed, and then write
it back to the object. For the sake of completeness, let's add a version that
inserts the current time to show, in the data, when the object was last
updated.
var Precommit = {
validateJson: function(object) {
var value = object.values[0];
if (value['metadata']['X-Riak-Deleted']) {
return object;
}
try {
var data = JSON.parse(value.data);
data['updated_at'] = new Date().toString();
object.values[0].data = JSON.stringify(data)
return object;
} catch(error) {
return {"fail": "Parsing the object failed: " + error};
}
}
}
-module(commit_hooks).
-export([audit_trail/1]).
audit_trail(Object) ->
Key = riak_object:key(Object),
Bucket = <<"audit_trail">>,
{ok, Client} = riak:local_client(),
%% (Json is built from the existing audit trail, as described below)
AuditObject2 =
riak_object:new(Bucket, Key, Json, "application/json"),
Client:put(AuditObject2),
Object.
get_timestamp() ->
{Mega,Sec,Micro} = erlang:now(),
list_to_binary(integer_to_list(
(Mega*1000000+Sec)*1000000+Micro)).
Audit data is stored in the audit_trail bucket, for simplicity's sake. The
code could easily be adapted to use a separate bucket for audit data based on
the bucket of the audited data.
After fetching the local client to talk to Riak, the function tries to fetch the
audit object from Riak. If the data can't be found, which branches into the
clause {error, notfound}, the case statement returns an empty list. If it was
found, it uses the mochijson2 module to decode the value, a serialized JSON
object, and extracts the list stored in the trail attribute. This particular code
uses the magic of pattern matching to extract the list from the data structure
returned by mochijson2:decode().
Given an audited Riak object {"id": "abc"}, here's an example of the JSON
data structure stored in Riak for the audit trail.
{struct, [{<<"trail">>,
[{<<"1335106786461071">>,{struct,[{<<"id">>,<<"abc">>}]}}]
}]}
The corresponding code in the function then appends the current entry to
the trail, converts the data structure back to JSON, and writes the object back
to Riak.
Note that the code doesn't bother with handling siblings. The ideal version
would respect them and merge the data structures together. The way the
data is laid out here is already fully suitable to just take two different versions
of the list and merge them together. Below you'll find an entire section
dedicated to designing data structures for Riak.
While this code is suitable both as a pre- and post-commit hook, I'd
recommend only using it as a post-commit hook for the reasons outlined
above. If auditing is a strict requirement though, and the added latency is
acceptable, the function audit_trail returns the updated object when done,
as required for a pre-commit function.
One little addition we can make is handling deleted objects. When an object
gets deleted, we won't add the data structure, but assign null to the
timestamp. That way it's easy to add more entries should the object be re-
created later on.
audit_trail(Object) ->
Key = riak_object:key(Object),
Bucket = <<"audit_trail">>,
Metadata = riak_object:get_metadata(Object),
Deleted = dict:is_key(<<"X-Riak-Deleted">>, Metadata),
{ok, Client} = riak:local_client(),
%% (UpdatedAudit is built from the existing trail; when Deleted is true,
%% null is stored as the entry's value, as described above)
Json = list_to_binary(mochijson2:encode(
{struct, [{<<"trail">>, UpdatedAudit}]})),
AuditObject2 =
riak_object:new(Bucket, Key, Json, "application/json"),
Client:put(AuditObject2),
Object.
Once the code is checked out on every machine, it's easy to compile. Erlang
comes with a built-in mechanism to compile all .erl files into their
corresponding .beam files. The resulting .beam files are bytecode ready to be
loaded by the Erlang VM.
You can try this out yourself in the example code repository. It contains a
folder commit-hooks, which in turn contains some Erlang source files. In
that directory, type erl -make. This compiles all files that haven't been
compiled yet and recompiles source files that are newer than their bytecode
counterparts. If you don't have Erlang installed separately on your nodes,
you can use the version shipped with Riak. Shown below is a sequence of
commands using the Erlang installed by Riak on an Ubuntu system to compile
the example code.
$ cd nosql-handbook-examples/08-riak/commit-hooks
$ /usr/lib/riak/erts-5.8.5/bin/erl -make
The second part is to make Riak aware of the compiled files. You can
configure additional paths to load into the Erlang VM in app.config. In the
section for riak_kv, add the following lines before the end of the section.
This example assumes the code is checked out and compiled in the home
directory of a user deploy.
,{add_paths, [
"/home/deploy/nosql-handbook-examples/08-riak/commit-hooks"
]},
You can specify any number of directories, just add more to the list, as shown
below.
,{add_paths, [
"/home/deploy/nosql-handbook-examples/08-riak/commit-hooks",
"/home/deploy/analyzers"
]},
There's one neat thing that Riak allows you to do. When you have specified
a number of directories in your app.config, and the code changes and gets
recompiled, you don't even need to restart Riak to reload updated Erlang
code. There's a handy command that makes Riak reload all Erlang .beam files
in directories specified in the add_paths section: riak-admin erl_reload.
You do however need to restart Riak every time you change the app.config
file.
It only took four simple steps to make Riak aware of our custom Erlang code.
All of them are easy to automate so that you can update and recompile the
source files when necessary.
• Check out the Erlang source code
• Compile the source code into BEAM files
• Update app.config to point Erlang to the code directories
• Reload the BEAM files with riak-admin erl_reload (Riak v1.1 or
newer) or
• Restart Riak (prior to v1.1)
Building a Cluster
So far we've only worked with a single node, a Riak instance running on
your local computer. While it's a credit to Riak that it's so easy to get
started with, it really shines in a clustered environment. It's easy to increase
your database's capacity just by adding more nodes, and doing so
doesn't require any manual intervention on your end; all the reshuffling
of data is done automatically, thanks to the magic of consistent hashing
and partitioning. Everything we went through so far is no less valid with
multiple nodes than it is with just one, only with the added fun of a
networked environment.
-name riak@192.168.2.25
And that's all you need to change in vm.args to start adding more nodes.
Rinse and repeat for every new node. Give every node a proper host name, and start
them. In a production environment, make sure these steps are fully
automated. The beauty of Riak is that adding a node to an existing cluster
involves no work that needs manual intervention. So do yourself and your
infrastructure a favor, use something like Chef or Puppet to do the work for
you.
Assuming you did that, every node is still on its own. It keeps its own ring
configuration until you tell it to join another cluster (which only needs to be
one other node).
Joining a Cluster
The final step is to join the node with an existing cluster. For that, you
use the command riak-admin join, specifying the full Erlang node name.
Assuming you have another node already running on IP 192.168.2.24, here's
how to tell the new node to join its ring.
$ riak-admin join riak@192.168.2.24
What happens now is neither a mystery nor a secret, so let's have a look at
what defines a Riak node in a cluster, and what happens when a node joins
one.
now on, and they're responsible for the data. Riak Core keeps a so-called
preference list of partitions and the vnodes assigned to them, so that lookup
is easy enough. The vnodes can be local or remote; as far as Erlang is
concerned, the means of communication are the same.
Leaving a Cluster
Pretty much the same happens when a node leaves a cluster, just in reverse.
There are two ways to remove a node. One is run on the node that's to be
removed, the other can be run from any node in the cluster. The latter is
useful in case a node experiences a hardware failure and needs to be taken
in for servicing.
To have a node leave the ring, simply run the command riak-admin leave
on that node. That makes the node drop ownership of all the partitions it
held, again making sure that the preference list is evenly spread out across the
cluster. After it has transmitted the new preference list and handed off its data
to the new owners, it shuts down and is ready to be decommissioned.
If you're on a different node, you can run riak-admin force-remove
riak@192.168.2.24, specifying the complete Erlang node name as
configured in vm.args. Be aware that you lose all data replicas on that node,
and there is no handoff of data. That's no surprise: you consider the target
node unrecoverable, which is why you resorted to this rather drastic measure
in the first place.
Handling Consistency
So far we haven't talked about handling consistency in Riak at all. All writes
that dumped tweets into Riak used the default replication factor of 3. That
means data written is replicated to three nodes in the cluster during the write
operation, a sensible enough default for most cases.
You can tune the quorum for data in a bucket in general, and you can set a
quorum for every read or write request, the magical R and W values. The
quorum is an additional parameter for each request, and all client libraries
allow you to set it, so let's look at how to do it with riak-js.
var tweet = {
'user': 'roidrage',
'tweet': 'Using @riakjs for the examples in the Riak chapter!',
'tweeted_at': new Date(2011, 1, 26, 8, 0).toISOString()
};
riak.save('tweets', '41399579391950848', tweet, {w: 1})
Note that I adapted the object to use the ISO style date format, so this is now
a valid update of an existing object, only with a consistency setting lower
than the default. All this is pretty straightforward, but it gets much more
interesting when data is read.
Durable Writes
Riak supports more than one write consistency setting. Both W and N
specify a quorum that only requires the replica nodes to accept the data,
but not necessarily to write it to their storage backends, which is what
would make the write truly durable.
To work around this, Riak supports a durable write setting, which defaults
to the same value as N. A durable write requires the storage backends, which
write the data to disk, to acknowledge that they have in fact done so. This
setting is a trade-off between latency and durability. The setting can be
lower than the W value, if you can live with the fact that only a DW number
of nodes has physically written the data to disk before the request returns to
the client.
Just like W, durable write is both a setting per bucket and per write request.
There's no equivalent for read requests, read requests will have to go through
the storage backend to get the data anyway.
Again, a slightly lower durability setting in exchange for reduced latency
works to our advantage in the tweets scenario. We're more interested in
keeping latency low than in the data being fully consistent immediately. To
specify a different setting for durable writes, use the parameter dw. Here we
lower our durability expectations to just one replica that accepted the write
in a durable fashion.
riak.save('tweets', '41399579391950848', tweet, {dw: 1})
Note that durability is still a different beast for every storage backend
supported by Riak. Depending on which one you use, and how it's
configured, the process of writing the data to disk might still be delayed, for
example to reduce disk I/O to a burst every 100 ms.
Primary Writes
As if those weren't enough quorums, Riak supports a third one called a
primary write quorum. It specifies the number of primary replicas that need
to accept a write for it to be successful.
Remember that Riak keeps a preference list of the whole cluster around.
When it picks the replicas based on consistent hashing, it defaults to using
the first N in the list, with N being the default quorum. The first N nodes are
called the primary replicas for a particular key.
If one of the primary nodes is not available, the coordinating node sends the
request to a secondary node. It picks the next node in the preference list and
uses it as a temporary store until the primary node becomes available again.
This is called a sloppy quorum, because Riak doesn't actually fail when you
write with a quorum of 3 and only two of the primary replicas are available.
It temporarily stores the data on a secondary node instead, and doesn't
actually enforce the quorum based on primary replicas.
With a primary write you can force a write to go to primary nodes only or
otherwise fail. You may argue this forfeits the entire purpose of Riak. Write
availability is, after all, what it's all about. But there are scenarios where the
stronger consistency guarantees of a primary write are preferable.
In use cases where applications have to read their own writes, a successful
primary write followed by a primary read is guaranteed to return the data
you've just written. With a sloppy quorum, suppose your write went to two
primaries, and one of them gets partitioned away right after the write. A
subsequent read with R = 1 may go to a secondary node which doesn't know
anything about the data, so it doesn't return anything. A primary write
followed by a primary read guarantees this doesn't happen, at least as long as
PW + PR > N holds.
To make a primary write or read operation, you specify the pw or pr flag
respectively.
$ curl localhost:8098/riak/tweets
{
"props": {
"name": "tweets",
"w": "quorum",
"notfound_ok": true,
"young_vclock": 20,
"pr": 0,
"postcommit": [],
"rw": "quorum",
"chash_keyfun": {
"mod": "riak_core_util",
"fun": "chash_std_keyfun"
},
"big_vclock": 50,
"precommit": [{}],
"last_write_wins": false,
"small_vclock": 10,
"r": "quorum",
"pw": 0,
"old_vclock": 86400,
"n_val": 3,
"linkfun": {
"mod": "riak_kv_wm_link_walker",
"fun": "mapreduce_linkfun"
},
"dw": "quorum",
"allow_mult": true,
"basic_quorum": false,
"search": true
}
}
Amidst all the properties are the relevant options n_val, r, rw, dw, and w.
Updating the configuration is just as easy: send updated JSON containing the
new values you'd like the bucket to have. To set a different n_val, which
corresponds to the N value for replication, send a PUT to the bucket's URL
with JSON containing just the new value. It doesn't have to contain any other
values; you can update single values in the bucket properties, so you don't
have to specify all properties every time. You do need to specify the content
type for JSON though, and numbers need to be JSON numbers too, otherwise
you may end up seeing multiple values for the same numeric property.
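As a sketch, here's how such an update could look with curl, setting only a new n_val for the tweets bucket from the earlier example and leaving all other properties untouched:

```shell
# Update only n_val; the Content-Type header is required,
# and 5 is a JSON number, not a string.
$ curl -X PUT -H "Content-Type: application/json" \
  -d '{"props": {"n_val": 5}}' \
  localhost:8098/riak/tweets
```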
Alternatively you can set the values to "quorum", "all", "one", or any
integer which is lower than n_val, as it makes no sense to expect responses
from more replicas than you actually have.
Just like the W value, using a different R is a tuning knob for consistency vs.
latency. A read request will always send reads to all replicas of a piece of data,
but will return to the client as soon as a sufficient number returned the value.
The magic formula is now in full effect. You can have your consistency cake
and eat it too, when both your W and your R values add up to be higher
than N. Given an N value of 3, and W and R having a value of 2 each, you'll
always get consistent data. If you happen to read, and this is pretty much out
of your control, from the same nodes you just wrote to, you're bound to end
up with the same data. If you happen to read from the one node that may
not have gotten the data yet, latency be damned, a process called read repair
ensures that data is consistent across all nodes once again, and the powers are
at peace again.
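The overlap rule fits in one line of code. This little helper isn't part of any Riak client, it just illustrates the arithmetic:

```javascript
// True when read and write quorums overlap, meaning every read
// is guaranteed to reach at least one replica with the latest write.
function quorumsOverlap(n, r, w) {
  return r + w > n;
}

console.log(quorumsOverlap(3, 2, 2)); // true, reads see the latest write
console.log(quorumsOverlap(3, 1, 1)); // false, reads may return stale data
```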
Read-Repair
Read repair is a process that makes sure that all nodes have updated data. It
always kicks off after a read request has returned to the client.
Read repair does two things: first, it makes sure that data is consistent
across all replicas of a piece of data; second, it ensures that conflicts are
handled. Conflicts occur when two or more clients update the same data
without being aware of each other's changes. This process involves a lot
of things, so it's well worth devoting an entire section to it.
While we're going to look into how conflicts occur and how they can be
resolved, it's just as important to look at how you can design your data to be
able to resolve conflicting writes later. Data structures are just as important
when it comes to handling conflicts as resolving the conflicts themselves.
In fact, I would say they're even more important. Picking a winner might
be easy, but figuring out how to merge two diverged pieces of data back
together requires some thinking up-front.
var tweet = {
  username: 'roidrage',
  tweet: 'A really long and uninteresting tweet.',
  tweeted_at: '2011-11-23T14:29:51.650Z',
  changes: []
}
In changes we can now keep tracking events. Whether those are partial or
complete objects is up to you, but partial changes certainly are easier on
storage. Let's assume we added an attribute for a location; here's how that
change could be reflected in the data structure.
var conflict1 = {
  username: 'roidrage',
  tweet: 'A really long and uninteresting tweet.',
  tweeted_at: '2011-11-23T14:29:51.650Z',
  location: 'Berlin, Germany',
  changes: [{
    attribute: 'location',
    value: 'Berlin, Germany',
    timestamp: '2011-11-23T14:30:21.350Z'
  }]
}
We store the updated copy and record the change. Assume a second client
came along in the meantime, adding another new attribute to the original
object, maybe a biography for the user, resulting in an object as shown
below.
var conflict2 = {
  username: 'roidrage',
  tweet: 'A really long and uninteresting tweet.',
  tweeted_at: '2011-11-23T14:29:51.650Z',
  bio: 'FRESH POTS!!!!',
  changes: [{
    attribute: 'bio',
    value: 'FRESH POTS!!!!',
    timestamp: '2011-11-23T14:30:22.213Z'
  }]
}
Now we have two data structures, both including a list of changes. That's
pretty neat: all we have to do now is take both changelogs, merge them
together, order them by time, and apply the changes. When our code has
applied all the changes, it can discard parts of the changelog to keep the list
from growing indefinitely, but you need to make sure that the list keeps
enough history to cover the time between reconciliations.
This is pretty simplistic, but it works for our purposes. We can now store the
resulting object back in Riak, capping the collection to 10 elements before
we do so.
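Capping the changelog is a one-liner with slice(); capChanges is a made-up helper name for this sketch:

```javascript
// Keep only the maxLength newest entries of a changes list,
// assuming the list is sorted newest-first.
function capChanges(changes, maxLength) {
  return changes.slice(0, maxLength);
}

var changes = [];
for (var i = 0; i < 15; i++) {
  changes.push({attribute: 'replies', value: i});
}
console.log(capChanges(changes, 10).length); // 10
```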
Now, the fun part starts when you throw all this at a more meaningful
example, like building a timeline for every user we get tweets from. Before
we get to that though, let's look at what actually happens when conflicts in
Riak arise.
Conflicts in Riak
To enable support for proper tracking of changes and conflicts, we first need
to turn on the relevant bucket setting, allow_mult, which can be set using
riak-js. Note that the new value needs to be a proper boolean, true rather
than the string "true".
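Assuming riak-js is set up like in the other examples, the property can be changed with riak-js' saveBucket function:

```javascript
var riak = require('riak-js').getClient();

// allow_mult needs to be a real boolean, not the string "true"
riak.saveBucket('tweets', {allow_mult: true}, function(error) {
  if (error) console.log(error);
});
```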
Now we can work off the examples above and create a conflict. I'm going
to keep it sequential for now, because the code would look slightly tangled
when looked at as a whole. The first write saves the initial tweet.
// console 1
var tweet, meta;
riak.get('tweets', '1', function(error, obj, m) {
  tweet = obj;
  meta = m;
});

tweet.changes.push({
  attribute: 'location',
  value: 'Berlin, Germany',
  timestamp: new Date().toISOString()
});
tweet.location = 'Berlin, Germany';
riak.save('tweets', '1', tweet, meta);
I'm assuming some time passes between fetching the object and saving it
back to Riak, time during which you fetch the objects in both consoles,
making sure they both work off the same version, hence the serialized nature
of the code. If you want to be sure, log the vector clock to the console. It's
available in the metadata object that the code uses to store the object back to
Riak.
The second client does pretty much the same, but adds a different attribute.
In your application's code, you're likely to have some functions
encapsulating tracking the single changes instead of applying them manually
all the time.
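A helper like that only needs to set the attribute and record the change in one go. This sketch is an assumption about how such a function could look, not part of riak-js:

```javascript
// Set an attribute on a tweet and append the change to its changelog.
function trackChange(tweet, attribute, value) {
  tweet[attribute] = value;
  tweet.changes.push({
    attribute: attribute,
    value: value,
    timestamp: new Date().toISOString()
  });
  return tweet;
}

var tweet = {username: 'roidrage', changes: []};
trackChange(tweet, 'bio', 'FRESH POTS!!!!');
console.log(tweet.changes.length); // 1
```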
// console 2
var tweet, meta;
riak.get('tweets', '1', function(error, obj, m) {
  tweet = obj;
  meta = m;
});

tweet.changes.push({
  attribute: 'bio',
  value: 'FRESH POTS!!!!',
  timestamp: new Date().toISOString()
});
tweet.bio = 'FRESH POTS!!!!';
riak.save('tweets', '1', tweet, meta);
Siblings
What you have now in Riak are two versions of the object. Before we look
at code, we can verify this using curl.
$ curl -v localhost:8098/riak/tweets/1
...
< HTTP/1.1 300 Multiple Choices
...
Siblings:
1nQGjhdebFwM5uBs9uNGz9
3sThgDXSJ78UdNPzkhg8Om
This may look like it's all output from curl, but it's Riak telling us that this
object has siblings. Siblings are two objects that are related to each other only
by their ancestor, the original tweet, identified by the vector clock we used
when saving the objects in both consoles. Every sibling is a full copy of the
object. If we continue updating based on the original vector clock we'll keep
creating more.
Riak creates siblings when two clients write different data based on the
same vector clock, or when a client provides a vector clock that's generally
different from the current line of vector clocks, so it has no ancestors in the
line of versions.
When requested via HTTP, Riak returns a 300 status code, which stands
for multiple choices. We can now use the vtags, which is what those odd
looking hashes are called, to fetch the sibling we're interested in by adding
them as a query parameter.
$ curl 'localhost:8098/riak/tweets/1?vtag=1nQGjhdebFwM5uBs9uNGz9'
{"username":"roidrage",
"tweet":"A really long an uninteresting tweet.",
"tweeted_at":"2011-11-23T14:29:51.650Z",
"changes":[{"attribute":"bio",
"value":"FRESH POTS!!!!",
"timestamp":"2011-11-24T20:26:24.838Z"}],
"bio":"FRESH POTS!!!!"}
That looks like one of our changes. Most clients don't go out and fetch
all siblings one by one when they get a 300 status code; instead they set
the Accept header to multipart/mixed. The response has a Content-Type of
multipart/mixed as well, so the clients know what to do with it. Here's the
curl version of that request, output omitted because it's not pretty.
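The request just needs the Accept header set accordingly; assuming the same bucket and key as above:

```shell
# Ask Riak to return all siblings in one multipart/mixed response
$ curl -H "Accept: multipart/mixed" localhost:8098/riak/tweets/1
```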
Now that we know how to access them, all we need to do is apply the code
above and reconcile the changes into one object again.
Reconciling Conflicts
Let's put this code together into something that can resolve the conflicts.
The next time an object is fetched the code checks if there are siblings and
reconciles them before returning to the client or doing whatever comes next.
When riak-js detects siblings, the object you get is an array, including all the
siblings and their metadata.
function reconcileConflicts(objects) {
  var tweet = objects[0].data;
  var changes = [];
  for (var i in objects) {
    changes = changes.concat(objects[i].data.changes);
  }
  tweet.changes = changes;
  return tweet;
}
The implementation still has some flaws. It blindly applies all writes as if they
were replacing the original value. For arrays or deeper hash structures, that
is not a great way of resolving, as you could instead merge them together,
including all new additions, instead of overwriting them.
Can you have atomic operations, like counters, in Riak? The simple answer is: you can't. If you come to Riak with the same
expectations as for MongoDB and Redis, you will be disappointed. The
more complex answer is: of course you can, but it depends. Something as
simple as atomic counters gets complicated in a distributed, eventually
consistent environment like Riak. Systems like Redis or MongoDB are
different as they're both not eventually consistent, at least not in the same
way as Riak is.
To adapt our changes list for more complex operations, there's not a lot we
have to do. Simply adding a new attribute to store an operation does the
trick. Here's a version that tracks increments, the number of replies to this
tweet.
var tweet = {
  username: 'roidrage',
  tweet: 'A really long and uninteresting tweet.',
  tweeted_at: '2011-11-23T14:29:51.650Z',
  replies: 1,
  changes: [{
    attribute: 'replies',
    value: 1,
    operation: 'incr',
    timestamp: '2011-11-23T14:30:21.350Z'
  }]
}
Now, the confusing part starts when we think about multiple concurrent
changes incrementing the same value. To keep the counter as precise as
possible, we have to avoid applying the same increment twice, as that would
skew the result. In general, you should get used to the idea that a counter
may not be exactly precise all the time.
To merge the changes together, we can still use the common changes list.
This time, we have to remove all the duplicates that are common to all lists.
We also need to ignore the latest increment of the object that's used as the
basis for reconciling. The remaining operations can be executed on the value
in that object, bringing the counter up to par with all the changes.
function sortChanges(changes) {
  return changes.sort(function(change1, change2) {
    if (change1.timestamp < change2.timestamp) {
      return 1;
    } else if (change2.timestamp < change1.timestamp) {
      return -1;
    } else {
      return 0;
    }
  });
}
function filterDuplicates(acc, current) {
  var exists = acc.filter(function(change) {
    return change.attribute == current.attribute &&
      change.timestamp == current.timestamp;
  });
  if (exists.length == 0) {
    acc.push(current);
  }
  return acc;
}

function dropDuplicates(changes) {
  return changes.reduce(function(acc, current) {
    return filterDuplicates(acc, current);
  }, []);
}
The final piece of code is a chain of calls to the above functions, picking a
base object beforehand, and then having at it. changes is assumed to be a list
of all changes in all objects. This code works with any number of siblings,
you just need to concatenate all of their changes in one list.
result = applyChanges(base,
dropBaseChanges(base,
dropDuplicates(
sortChanges(changes))));
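dropBaseChanges and applyChanges aren't shown above; here's one way they could be sketched, assuming changes carry the attribute, value, operation, and timestamp fields from the examples. Treat both as illustrations, not the book's exact implementation:

```javascript
// Drop changes the base object has already applied, comparing
// by attribute and timestamp against base.changes.
function dropBaseChanges(base, changes) {
  return changes.filter(function(change) {
    return !base.changes.some(function(known) {
      return known.attribute === change.attribute &&
        known.timestamp === change.timestamp;
    });
  });
}

// Execute the remaining operations on the base object. Only the
// 'incr' operation and plain assignment are handled here.
function applyChanges(base, changes) {
  changes.forEach(function(change) {
    if (change.operation === 'incr') {
      base[change.attribute] = (base[change.attribute] || 0) + change.value;
    } else {
      base[change.attribute] = change.value;
    }
  });
  return base;
}

var base = {
  replies: 1,
  changes: [{attribute: 'replies', value: 1, operation: 'incr',
             timestamp: '2011-11-23T14:30:21.350Z'}]
};
var merged = [
  {attribute: 'replies', value: 1, operation: 'incr',
   timestamp: '2011-11-23T14:30:21.350Z'},
  {attribute: 'replies', value: 1, operation: 'incr',
   timestamp: '2011-11-23T14:31:02.120Z'}
];
applyChanges(base, dropBaseChanges(base, merged));
console.log(base.replies); // 2, only the unseen increment was applied
```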
Congratulations, you just built your own distributed system on top of Riak!
Riak itself uses vector clocks to keep track of conflicting changes, but on a
level that's not suitable for detecting logical changes to the actual data. You
just know there were changes, just not what has been changed. To improve
that on the application level, our model uses something similar to vector
clocks to track and resolve conflicts on the data itself.
Now that the initial enthusiasm has passed, this implementation still has
flaws. When the lists of changes on two different nodes diverge too far,
maybe one of the nodes is partitioned from the others but still receives
updates independently, duplicate increments can occur. Assume one node
contains a couple of older entries in the list of changes, but another already
has received more new increments, so that the older entries have been
dropped. When those two lists are merged together, the older entries are
applied again, leading to duplicate increments. You can improve this by
tracking the last timestamp or version in the document itself, but
unfortunately there is no guarantee for 100% consistent counter values.
Thanks to Reid Draper and Russell Brown for pointing this out.
Both are working on similar data structures and libraries on top of Riak, one
result of which is knockbox, a Clojure implementation of convergent and
commutative replicated data types. It's based on a paper outlining the topic.
Merging Strategies
No matter when you reconcile, there's always the question of picking a
proper strategy on how to merge diverged data structures, and how to pick
the right object to use as a base for reconciliation.
It depends on your application if you care about merging arrays, trying to
pick a winner when two clients updated the same string attribute, and so on.
But it all boils down to the timestamp or clock. If that's good enough for you,
start from the oldest change, pick the appropriate object as the basis for the
merge, and work your way up to the newest.
Sibling Explosion
Sibling explosion happens when either too many concurrent writes update
the same objects, or when clients don't use vector clocks properly when
writing data.
When too many clients write data, and reconciliation can't keep up with it,
you end up creating more siblings than your application can handle. This
can be circumvented by always reading and reconciling before you write. It
doesn't avoid the risk of exploding siblings, but it makes sure you don't write
more than you reconcile, because a write will always follow a merge.
On the other hand, when clients carelessly update data without specifying
a vector clock, Riak can't determine any ancestry, because all vector clocks
have a different lineage. You'll just keep creating more and more siblings if
they're not reconciled.
Sibling explosion can have the consequence of increased read latency. For
every read to the object, Riak has to load more and more data to fetch all the
siblings. One piece of data only 10 KB in size is no big deal, but a hundred
siblings of that size suddenly turn the whole object into 1 MB of data. If all you do is write
smaller pieces of data, increased read latency is a good (though only one)
indicator that your code creates too many siblings.
Let's model a user's timeline as a single Riak object containing a plain list
of tweet ids.
{
  "entries": [
    "1231458592827",
    "1203121288821",
    "1192111486023",
    "1171436045885"
  ]
}
The simplest timeline that could possibly work. To make it more efficient
we could make it include the entire tweet. Here's a slightly more complex
version.
{
  "entries": [{
    "id": "1231458592827",
    "username": "roidrage",
    "tweet": "Writing is hard."
  }, {
    "id": "1203121288821",
    "username": "roidrage",
    "tweet": "Finishing up those last chapters."
  }, {
    "id": "1192111486023",
    "username": "roidrage",
    "tweet": "Only two more chapters to go."
  }, {
    "id": "1171436045885",
    "username": "roidrage",
    "tweet": "Almost done with the part on Riak."
  }]
}
You can keep adding attributes as you see fit, but it pays to keep the data
in the timeline simple. Assuming JSON is the serialization format of choice,
every new tweet added to the list adds up to 300 or 400 bytes. With 100
tweets, the Riak object is about 40 KB in size, with 500, it already clocks in
at 200 KB. That's not a massive size, but if it keeps growing indefinitely, the
Riak object grows bigger and bigger.
Both ways of modeling the timeline share the same advantage. You can
assume that the id attribute is already respecting time, as that's what
Twitter's Snowflake tool does. Snowflake generates unique, incrementing
numbers to identify tweets. One part of the generated number is derived
from a timestamp. Ordering the entries by that attribute will ensure that
they're sorted by time.
Here's the code to handle the timeline, first the part that adds new entries,
prepending them to an existing list of entries.
var tweet = {
  id: '41399579391950848',
  user: 'roidrage',
  tweet: 'Using riakjs for the examples in the Riak chapter!',
  tweeted_at: new Date(2011, 1, 26, 8, 0)
};
riak.get("timelines", "roidrage",
function(e, timeline, meta) {
if (e && e.notFound) {
timeline = {entries: []};
}
timeline.entries.unshift(tweet.id);
riak.save("timelines", "roidrage", timeline, meta);
}
});
If no timeline exists, we create a new one and then add the tweet to the
beginning of the list. Next up, we'll add the code that reconciles two
diverged timelines.
function reconcile(objects) {
  var changes = [];
  for (var i in objects) {
    changes = changes.concat(objects[i].data.entries);
  }
  changes = changes.reduce(function(acc, current) {
    if (acc.indexOf(current) == -1) {
      acc.push(current);
    }
    return acc;
  }, []);
  return changes.sort().reverse();
}
First, all the changes are collected in one list. The list is then deduplicated,
leaving only unique items. Lastly, the list is sorted and reversed, so that
the items are in descending order, newest tweets first.
All that's left to do is update the code saving timeline objects to reconcile
potential siblings before storing it back.
riak.get("timelines", "roidrage",
function(e, timeline, meta) {
if (e && e.notFound) {
timeline = {entries: []};
} else if (meta.statusCode == 300) {
var entries = reconcileConflicts(timeline);
timeline = timeline[0];
timeline.entries = entries;
}
timeline.entries.unshift(tweet.id);
if (!meta.vclock) {
meta = {}
}
riak.save("timelines", "roidrage", timeline, meta)
}
});
A full example that integrates building a timeline with the Twitter search
stream is included in the sample code accompanying the book.
I've heard the question of if and why this is a valid approach to take with
Riak. At first it might sound weird to store a timeline in a single object
in Riak, having to go through the trouble of merging siblings together
like this. But fetching a single object is cheap. Even when paginating only
100 out of 1000 items in the list, it may still be cheaper than fetching every
single item from Riak and piecing the timeline together on the client side.
Fetching multiple items involves much more disk I/O and increased network
traffic.
Multi-User Timelines
What we did so far only took care of a single user's timeline, containing
all their tweets. The traditional model, and the one popularized by Yammer,
is based on the idea that a user follows a number of other users. The user's
timeline is built from all the activities of the users they're following.
The result won't be that much different from the example above. Instead
of storing a single user's items, you store references to other users' items in
the timeline for the user. That means duplicating data, a lot of duplication
depending on the number of users they're following. The timelines of ten
people following the same person end up containing either references to the
same objects or the same data, depending on how much data you
want to store for efficiency. But again, you trade off the benefits of Riak for
a denormalized data structure. At a larger scale, it's a common trade-off.
If your timeline contains different kinds of activities such as likes, comments,
picture uploads, and so on, your entries will have some more detail. The data
structure could look something like this.
{
"id": "1212348454",
"timestamp": new Date(2011, 1, 26, 8, 0).toISOString(),
"entry": {
"type": "like",
"id": 334,
"owner": "roidrage"
}
}
As you can see, it includes a lot more data. It doesn't have to, but it allows you
to do roll-ups. Roll-ups are a way of presenting data in the feed in a denser
way. Instead of saying that three people like the same post as three different
items, you can just say it all in one item. You wouldn't even have to fetch
external data (like user objects) if all the data is included in the feed.
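A roll-up can be as simple as grouping entries by the object they refer to. This sketch counts likes per post, with data shapes matching the entry structure above:

```javascript
// Group 'like' activities by the id of the liked object, collecting
// the owners, so "3 people like this" can be rendered as one item.
function rollUpLikes(entries) {
  var rollups = {};
  entries.forEach(function(item) {
    if (item.entry.type !== 'like') return;
    var id = item.entry.id;
    rollups[id] = rollups[id] || {id: id, likedBy: []};
    rollups[id].likedBy.push(item.entry.owner);
  });
  return rollups;
}

var entries = [
  {entry: {type: 'like', id: 334, owner: 'roidrage'}},
  {entry: {type: 'like', id: 334, owner: 'mathias'}},
  {entry: {type: 'like', id: 335, owner: 'frank'}}
];
console.log(rollUpLikes(entries)[334].likedBy.length); // 2
```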
While Yammer's original application for timelines in Riak was internal,
there are several open sourced implementations available, one for Python,
and one for Ruby.
To fetch all tweets in a timeline with a single request, we can use a
MapReduce job with two map phases. For the second phase we can reuse the
built-in functions, so the focus is on the code to extract the keys for the
entries.
riak.add([['timelines', 'roidrage']]).
  map(function(value) {
    var doc = Riak.mapValuesJson(value)[0];
    var entries = [];
    for (var i in doc.entries) {
      entries.push(['tweets', doc.entries[i]]);
    }
    return entries;
  }).
  map('Riak.mapValuesJson').run()
This solves the problem of fetching everything in one go. You can do a
number of transformations on the data, but the purpose of the example is
to show the simplest thing that could possibly work to fetch all items in the
timeline in one request.
To fetch multiple objects, when you know a number of keys and you want
to avoid the added overhead of multiple roundtrips to Riak, the solution is
even simpler. Instead of going through a list of entries, you specify the keys
as inputs to the map phase.
riak.add([["tweets", "1231458592827"],
["tweets", "1203121288821"],
["tweets", "1192111486023"],
["tweets", "1171436045885"]]).
map("Riak.mapValuesJson").run()
You can use the same approach to paginate through a timeline.
riak.add([['timelines', 'roidrage']]).map(
  function(value, keyData, arg) {
    var doc = Riak.mapValuesJson(value)[0];
    return doc.entries.slice(arg.start, arg.end).map(function(entry) {
      return ['tweets', entry];
    });
  }, {start: 0, end: 10}).
  map('Riak.mapValuesJson').run()
The code uses JavaScript's slice() method to cut out the entries we're
interested in, returning them as a list. With a simple map() the list is
converted into a list of bucket-key pairs, and then it's fed into another map
phase to extract the values of these pairs. There's your pagination. Comes in
handy when using Riak Search too, especially when you prefer the Protocol
Buffers API over HTTP, as the only way to query the index is by way of
MapReduce.
Handling Failure
What we haven't done yet is talk about what exactly happens when a
(physical) Riak node fails or becomes unavailable. There are heaps of reasons
why this could happen: a temporary network partition, where nodes can't
talk to each other over the network for an undefined amount of time, or
hardware failures due to disk, memory, or other problems.
If you're running on something like EC2, a whole data center might be
unavailable too, but let's talk about the more common case, a node that's
become unavailable, i.e. is unreachable from the other nodes in the cluster.
John Allspaw (of Etsy and Flickr ops fame) had a nice analogy for Riak's will
to bend but not break. You can shoot a bullet in one of Riak's servers, and it
will still continue serving data. You can keep shooting servers until you've
reached the last one, and Riak will still try its best to keep serving the data
that's still available, not breaking because parts of it are missing.
The beauty of all this is that we're not talking theory here. It's easy enough
with Riak to emulate something like a node failure. Simply stop the Riak
process after you joined it into the cluster and handoff has finished. Given
you have a three node cluster, you can assume that every piece of data is
replicated on every physical node. So shutting off one of them should give us
enough room to start working through a failure scenario.
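If you built a local dev cluster from the Riak source, stopping a node is a one-liner; the paths assume the standard dev setup:

```shell
# Stop the second node of a local dev cluster to simulate a failure
$ dev/dev2/bin/riak stop

# Check what the remaining nodes report about the cluster
$ dev/dev1/bin/riak-admin status
```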
Operating Riak
This section is all about the operations part of Riak. It starts out by looking
at some basic settings that have a longer lasting impact on your cluster, then
looks at the available protocols, and finishes off with a detailed look at the
available storage backends in Riak, the part that makes your data durable.
One setting with a lasting impact is the number of partitions in the ring,
configured through ring_creation_size in the riak_core section of
app.config.
{ring_creation_size, 128},
If you need to change this setting after you already started playing with
a cluster, you need to stop Riak processes and throw away your data files
(an unrecoverable step, mind you) and the ring definition. If you installed
a binary package for Ubuntu, these reside in /usr/lib(64)/riak/data/
leveldb and /usr/lib(64)/riak/data/ring.
Make sure you think about the ring size upfront. The default will only take
you so far.
Storage Backends
Storage backends in Riak are pluggable, and it comes with several built right
in. They're the ones responsible for efficiently storing your data on disk, so
choose wisely.
All backends are configured in Riak's app.config file. Using each of them
requires changing just one line, namely the option storage_backend in the
section riak_kv. The value must point to a valid Erlang module that
implements the storage API. The possible values are shown in the respective
sections below.
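As a sketch, the relevant line in app.config sits in the riak_kv section; here with the default backend, the rest of the section abbreviated:

```erlang
{riak_kv, [
  %% Module implementing the storage API
  {storage_backend, riak_kv_bitcask_backend}
]},
```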
Make sure to read through the documentation for each of the backends;
there are links to the wiki page for every backend. Below is an outline of
each backend and its trade-offs.
Innostore
Riak's first storage backend is Innostore, a library built on top of InnoDB,
which you may have come across working with MySQL, where it's now
the default engine. With Riak though, eventually it turned out that it's not
able to keep up with a large amount of writes, owing to its tree structure on
disk. The B-tree has certain advantages, because it keeps data sorted by key.
As long as you write data with sequential keys, writes will be fast, as they're
always appended to the end of the tree.
But it has the disadvantage of being a mutable data structure. As new data is
added, leaves and nodes in the tree are moved around, rearranging the data
on disk.
It's still in active use today, but requires installing a separate package, which
is available from the downloads site. Look for version 1.0.4, which should be
out by the time this book finds its way into your hands.
Innostore has lots of tuning knobs and can cache a pretty good amount of
data in memory, definitely an advantage over other storage backends. But in
general, when you expect a lot of writes, and I mean thousands per second,
you might want to look into the others. It has proven to be both stable and
unstable in the last two years. It can run pretty smoothly, but when some of
its transaction log files get corrupted, all bets are off.
To enable Innostore, install the package and change the storage backend to
riak_kv_innostore_backend. The wiki page has a good overview on all the
available configuration settings.
Bitcask
The answer to the problems Innostore had in production was Bitcask, a
storage system that's specific to Riak, and currently the default storage
backend. It writes data in a way that never mutates existing data. When you
write data to Riak, Bitcask appends the entry to an existing file. One disk
seek, that's all. Needless to say this gives pretty good performance. But, on
the other hand, Bitcask doesn't cache anything, so reads will always go to
disk, utilizing the file system cache if possible.
Bitcask keeps a list of all keys and pointers to their data on disk in memory,
so there's a limit to how many keys you can store. However, it has the benefit
that reads only require one lookup in the in-memory structure and one disk
seek.
With this simple measure, Bitcask avoids random I/O entirely. Writing only
appends to existing files, therefore does sequential I/O, and reads directly go
to the data they're after.
Bitcask doesn't write to the same file indefinitely. Eventually a file's size
reaches a threshold and it is closed, never to be opened for writing
again. When it hits other thresholds, Bitcask will start compacting old files,
merging data into only a few files along the way. So there is operational
overhead here for the compaction phase (as this process is called), and you
should take good care that this phase doesn't happen during the time of
highest load on your application. The wiki page on Bitcask is a good starting
place for all the tuning knobs it offers.
When using Bitcask, make sure your servers have enough memory available
to allow storing enough keys and for some file system caching on top.
There's a simple capacity planning tool available on Basho's wiki to help you
calculate the memory requirements for Bitcask.
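The back-of-the-envelope math behind that planning is simple to sketch. The per-key overhead below is an assumed constant for illustration; check Basho's capacity planning tool for the figure matching your Riak version:

```javascript
// Rough estimate of RAM needed for Bitcask's in-memory key directory.
// overheadPerKey is an assumption, not an official number.
function bitcaskMemoryBytes(numKeys, avgKeyBytes, overheadPerKey) {
  return numKeys * (avgKeyBytes + overheadPerKey);
}

// 100 million keys of 20 bytes, with 40 bytes assumed overhead each
var bytes = bitcaskMemoryBytes(100000000, 20, 40);
console.log((bytes / Math.pow(1024, 3)).toFixed(1) + ' GB'); // 5.6 GB
```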
Use Bitcask when you want to reduce random disk I/O at all cost, as data
in Bitcask is not ordered on disk but written sequentially. It's definitely the most efficient of
the bunch while still offering high durability. Also make sure you read the
original paper outlining the ideas behind it.
To use Bitcask, change the storage backend setting to
riak_kv_bitcask_backend, and configure Bitcask to suit your needs as per
the wiki page.
LevelDB
LevelDB is the newest contender in the field. Originally built at Google, it's
based on ideas implemented in BigTable. LevelDB is a sorted string table
combined with a memory table. The former is used to efficiently store data
on disk, sorted by key, the memory table is an in-memory cache.
Data is first written to the memory table and to a transaction log, which is fast
as it's a sequential write. Once enough data has accumulated in memory, it's
flushed to a sorted string table on disk, where data is also compressed, saving
up disk space compared to Bitcask and Innostore.
Just like Bitcask, LevelDB never modifies existing data, but always writes
new files, eventually compacting them into new files when they've reached
a size threshold. Unlike Bitcask, LevelDB keeps data sorted using a log-
structured merge tree. That means reading can require more than one disk
seek, but has the advantage of the data being sorted, hence its great
applicability for Riak 2i.
As LevelDB doesn't keep any keys in memory, there's heaps of room to use it
for caching instead, giving you a nice performance boost and reducing disk
access significantly.
Compaction happens continuously, as LevelDB keeps data files pretty small.
When you write a lot of data, a lot of compaction might be going on in the
background. This doesn't directly affect latency, but as there's increased load
on the servers, it indirectly affects read and write performance. Needless to
say, fast disks (hint: use SSDs) will give you the best results.
Compared to Innostore, LevelDB is much faster thanks to its data structures,
and more reliable too. Riak uses a library called eLevelDB, an Erlang library
built on top of LevelDB.
The wiki page on LevelDB has exhaustive detail on its innards and the
configuration settings. Also be sure to go through this post on LevelDB over
at High Scalability, full of links, benchmarks, and little details.
To enable LevelDB, set the storage backend setting to
riak_kv_eleveldb_backend.
Load-Balancing Riak
When you have a cluster up and running, how do you get your application
to spread load evenly across all the nodes in the cluster? There's a certain
mismatch: on the one hand you can elastically grow and shrink capacity
in your Riak cluster, on the other all the clients (your application) have to
know about all nodes in the cluster and have to be updated as you increase
and decrease capacity.
One possible answer is to just keep a configuration file listing all nodes on
every application server, always updating it as new nodes are added or as
nodes go away. Every client can then load-balance on its own and spread out
requests to all Riak nodes.
The disadvantage of this model is that clients have to take care of nodes that
timed out and nodes that are temporarily unavailable, implementing their
own timeout and retry mechanism. That adds another level of complexity
to every client, complexity a load balancer can take care of instead.
So given we've just put a load balancer in front of our Riak, there's a problem.
We just set up a single load balancer instance to handle traffic for all of the
clients and all the Riak nodes. What if it suddenly becomes unavailable? The
load balancer is now a single point of failure in a distributed setup.
There are several options to solve this problem. First, you could set up a
cluster of load balancers, or set up a load balancer on every Riak node. The
latter scenario might sound odd, but if every Riak node has a load balancer
running on it that's configured to talk to all nodes in the cluster, the load will
still evenly spread. But then again, clients need to know about all the load
balancing nodes in the cluster.
The alternative, and the preferred way of setting this up, is to have a load
balancer running on every application server. This has the advantage that
clients don't need to know about any external Riak node; they just talk to
their own host, which always has the IP address 127.0.0.1. As Riak
nodes come and go, you only need to update the load balancers on all
application servers instead of writing a new configuration for the
application.
Ideally this work is fully automated, so it's transparent to the client and
removes the need for complex load-balancing logic in your code. Load
balancers like HAProxy are capable of re-reading their configuration without
dropping existing connections, so updating them has little to no impact on
your application.
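As a sketch of what such a local setup could look like, here is a minimal HAProxy configuration along those lines. The node names and addresses are assumptions for illustration; it listens on 127.0.0.1 and balances across three hypothetical Riak nodes, using Riak's /ping endpoint as a health check:

```
listen riak 127.0.0.1:8098
    mode http
    balance roundrobin
    option httpchk GET /ping
    server riak1 10.0.0.1:8098 check
    server riak2 10.0.0.2:8098 check
    server riak3 10.0.0.3:8098 check
```

With this in place on every application server, the application talks to 127.0.0.1:8098 and never needs to know the cluster's topology.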
If you're running a Riak cluster on EC2, another alternative to look at is
Amazon's Elastic Load Balancing service, which takes care of being
redundant and highly available for you. You can add and remove Riak nodes
through an API or through a set of command line tools, which is, again, easy
to automate.
Load balancers like Pound and Amazon's Elastic Load Balancing can also act
as an SSL endpoint, encrypting the communication between your application
and the load balancer.
Monitoring Riak
Riak has the reputation of being operations-friendly, and monitoring is no
exception. Every node has an easy way to access performance and health data
so you can feed it into your favorite metrics collection and alerting system.
Every Riak node offers an HTTP endpoint to fetch current statistics. If you
point curl at localhost:8098/stats, you'll see a slew of JSON flying by that
gives you everything you need to keep an eye on your Riak cluster's health.
Below is a shortened example; there is a lot more data in the output than
shown here. We'll go through the relevant metrics and their meanings in the
following sections.
{
  ...
  "ring_members": [
    "riak@127.0.0.1"
  ]
}
Note that you can also get all statistics on a Riak node by using riak-admin
status, but the result is an Erlang data structure dumped to the console. In
comparison, the JSON data returned by the HTTP endpoint is much nicer to
parse for a metrics library.
Request Times
The most important things to monitor in Riak are request times. They tell
you how much time reads or writes take when sent to this node. Riak gives
you percentile values and not just averages: a 95th, a 99th, and a 100th
percentile to find out if you have any slow edge cases in your cluster. For
more general values you get mean and median, the mean representing the
average request time and the median being the exact middle value of all
requests.
A percentile is calculated by taking a list of all request times, sorting it in
ascending order, and picking the value below which N percent of all requests
fall, N being the percentile number, for instance 99. While averages mingle
exceptional request times in with the rest, high percentiles focus on the slow
cases, allowing you to look at them in isolation.
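To illustrate, here is a small sketch of how a percentile can be derived from a list of raw request times. This is just an illustration of the idea, not Riak's actual implementation:

```javascript
// Sketch of deriving the Nth percentile from raw request times.
function percentile(times, p) {
  var sorted = times.slice().sort(function(a, b) { return a - b; });
  // index of the value below which p percent of all requests fall
  var index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[index];
}

// nine fast requests and one slow outlier
var times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100];
console.log(percentile(times, 50));  // 5
console.log(percentile(times, 100)); // 100
```

Note how the median (the 50th percentile) hides the outlier entirely, while the 100th percentile is nothing but the outlier.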
The relevant statistics start with node_get_fsm_time and
node_put_fsm_time. For percentiles, there are node_get_fsm_time_95,
node_get_fsm_time_99, node_get_fsm_time_100, node_put_fsm_time_95,
node_put_fsm_time_99, and node_put_fsm_time_100. For averages, you can
fetch node_get_fsm_time_mean, node_get_fsm_time_median, and their
corresponding counterparts for put. All numbers are collective values for the
last 60 seconds, so they need to be treated as gauges in your graphing system,
while the totals are counters.
Monitoring these values will give you a good clue when there's something
wrong in your Riak cluster, or even on just a single node. Below is an
example of how the resulting graphs may look. The graph is courtesy of
Librato Metrics. All values are in microseconds, so if it makes more sense,
you can convert them to milliseconds before storing them in your graphing
tool.
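As a sketch of that conversion, assuming stats is the parsed JSON from the /stats endpoint (the values below are made up for illustration):

```javascript
// Hypothetical excerpt of a parsed /stats response; values are made up.
var stats = {
  node_get_fsm_time_95: 1387,  // microseconds
  node_get_fsm_time_99: 3221,
  node_get_fsm_time_100: 9003
};

// convert microseconds to milliseconds before feeding a graphing tool
function toMillis(micros) {
  return micros / 1000;
}

console.log(toMillis(stats.node_get_fsm_time_95)); // 1.387
```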
Why track three different percentiles and two averages? Having three
percentiles enables you to find out just how many requests are affected by
temporary performance degradations, more so than averages do. If your 95th
percentile doesn't show a huge difference, but your 99th or only your 100th
does, you get an idea of how many requests, and therefore how many users,
are affected by degraded performance and increased request latency.
If you're tracking any metrics about your Riak cluster, these are the most
important ones. Increased request latency is the first indicator something is
wrong.
Don't forget to track similar metrics inside your application, measuring the
request times on both ends. That way you can figure out if there's a problem
in the transport layer or in the serialization, should Riak not be the culprit.
Number of Requests
Every node keeps track of all the requests it coordinated and all requests that
went to the vnodes residing on that node. A Riak node forwards requests
to the relevant nodes instead of coordinating the request itself, should it not
have a vnode responsible for the requested key.
Riak keeps total counts and one-minute stats for both types of values. Which
of them you use to collect metrics is up to you, but for general tracking the
one-minute values are a good start. The totals will start at zero again when
you restart a Riak node.
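If you prefer the totals, a sketch of turning two successive readings of a counter, say node_gets_total, into a per-minute rate (a generic monitoring calculation, not something Riak does for you):

```javascript
// Derive a per-minute rate from two readings of a total counter,
// e.g. node_gets_total sampled at two points in time.
function ratePerMinute(previousTotal, currentTotal, intervalSeconds) {
  return ((currentTotal - previousTotal) / intervalSeconds) * 60;
}

// 600 reads over a 60 second scrape interval -> 600 reads per minute
console.log(ratePerMinute(1000, 1600, 60)); // 600
```

Remember that totals reset to zero on a node restart, so a negative difference means the node restarted between readings.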
Monitoring 2i
Riak's secondary indexes (2i) track statistics for reads, writes, and deletes.
Riak also keeps metrics for all index writes and deletes that happen during a
single update of an object. All of them are available as total counters and as
one-minute values.
To track all requests that involve reading and updating metrics for secondary
indexes, you can collect vnode_index_reads, vnode_index_writes, and
vnode_index_deletes for one-minute values. Add _total for full counters.
These metrics give you insight into all requests that involved reading from,
writing to, or deleting from an index. To find out how many deletes and
updates were done in total, use the metrics vnode_index_writes_postings
and vnode_index_deletes_postings. Once again, these cover the last
60 seconds; add _total to get the number of all updates since Riak was
started. These metrics give you more insight into writes and deletes done
within a single request. For example, when one write updates three indexes,
vnode_index_writes_postings is incremented by three.
Ideally both types of metrics, those for requests involving 2i and those for all
index updates, should grow in relation to one another. Keeping an eye on
both means you can spot unusual index activity.
Miscellany
If you're using Protocol Buffers, you'll want to track the relevant metrics for
it. Riak keeps statistics of the current number of open connections and the
total connections received.
For the number of open connections, track pbc_active. For the total
number of connections, track pbc_connects for one-minute statistics and
pbc_connects_total for the total number of connections this Riak node has
received during its lifetime.
Aside from metrics related to internal services, Riak also keeps metrics about
the Erlang process it's running in. This makes it easy to track how much
memory the whole Riak process is consuming. You can track memory_total
to keep an eye on Riak's memory consumption.
Monitoring Reference
Now that we've gone through all of them in gory detail, here's a handy
reference giving a short overview of the relevant metrics Riak makes
available to you. All statistics are per node, so be sure to track them across
all nodes.
node_gets
    Number of reads received by this node in the last minute.
node_puts
    Number of writes and deletes received by this node in the last minute.
node_gets_total
    Total number of reads received by this node since startup.
node_puts_total
    Total number of writes and deletes received by this node since startup.
node_get_fsm_time
    Internal response time for read requests coordinated by this node.
    Available as 95th, 99th, and 100th percentiles, mean, and median;
    add _95, _99, _100, _mean, and _median, respectively.
node_get_fsm_objsize
    Object size measured on read requests, representing the total size of
    all siblings. Available as 95th, 99th, and 100th percentiles, mean,
    and median.
node_get_fsm_siblings
    Number of siblings per object. Available as 95th, 99th, and 100th
    percentiles, mean, and median.
node_put_fsm_time
    Internal response time for write and delete requests coordinated by
    this node. Available as 95th, 99th, and 100th percentiles, mean, and
    median.
vnode_gets
    Number of reads received by vnodes on this node in the last minute.
vnode_puts
    Number of writes received by vnodes on this node in the last minute.
vnode_gets_total
    Total number of reads handled by vnodes on this node since startup.
vnode_puts_total
    Total number of writes handled by vnodes on this node since startup.
read_repairs
    Number of read repairs handled by this node.
vnode_index_deletes
    Secondary index deletes handled by vnodes on this node in the last
    minute.
vnode_index_writes
    Secondary index updates handled by vnodes on this node in the last
    minute.
vnode_index_reads
    Secondary index queries handled by vnodes on this node in the last
    minute.
vnode_index_deletes_postings
    Individual secondary index entries removed by vnodes on this node in
    the last minute.
vnode_index_writes_postings
    Individual secondary index entries written by vnodes on this node in
    the last minute.
vnode_index_deletes_postings_total
    Total number of individual secondary index entries removed by vnodes
    on this node since startup.
vnode_index_writes_postings_total
    Total number of individual secondary index entries written by vnodes
    on this node since startup.
pbc_active
    Number of currently active Protocol Buffers connections.
pbc_connects
    Number of Protocol Buffers connections received in the last minute.
pbc_connects_total
    Total number of Protocol Buffers connections received since startup.
memory_total
    Total amount of memory consumed by the Riak process.
Remove the % and change the port. This changes the port for SSL
connections to 8069 so that it doesn't interfere with the normal HTTP
endpoint, which runs on port 8098.
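Assuming the default app.config layout, the uncommented https line could then look something like this (the bind address may differ in your configuration):

```erlang
{https, [{ "127.0.0.1", 8069 }]},
```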
Right below, there's a section that helps you generate SSL certificates.
Depending on how it was installed, Riak may come with pre-generated,
self-signed certificates. The Debian packages, for example, do not include
them, but the Homebrew installation on the Mac does.
Below you'll find instructions for generating your own self-signed
certificate. But before you do that, uncomment the relevant lines so that
they look as shown below. The lines assume your Riak configuration is in
/etc/riak, so the certificates will live in the same folder.
{ssl, [
{certfile, "/etc/riak/cert.pem"},
{keyfile, "/etc/riak/key.pem"}
]},
The final bit is to enable Riak Control. Right at the bottom of app.config
you'll find a section named riak_control. The first relevant setting enables
or disables Riak Control, and it's the first one in the section. Change it to
true.
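The relevant snippet in the riak_control section would then look something like this (the surrounding settings are left unchanged):

```erlang
{riak_control, [
  %% set to true to enable the admin panel
  {enabled, true}
  %% remaining riak_control settings unchanged
]}
```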
The final missing piece is to specify a list of users and passwords for
authentication. As you can modify cluster settings using Riak Control, it's
best to protect it accordingly. The default user is admin with the password
pass. If you're coming from the world of Oracle 8, you'll appreciate a
default user with a default password (remember scott?), but you should
change the list to include real users with secure passwords.
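A sketch of such a user list in the riak_control section; the user name and password here are placeholders you'd replace with your own:

```erlang
{userlist, [
  {"operations", "a-much-better-password"}
]}
```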
When all that's done, restart Riak, and point your browser to the host name
of one of your Riak nodes. Use port 8069 and the /admin endpoint, for
example https://192.168.2.1:8069/admin.
The last part is generating a certificate for the key. We'll need this and the
key to enable SSL in Riak.
If something is wrong on one of the nodes in your cluster, the snapshot view
will be the first to tell you. For example, if one of your nodes has become
unavailable, or if one of the nodes is short on memory, you'll be greeted with
lots of red.
Be aware that this view is not a full replacement for monitoring. If one of
your Riak nodes goes down, an alert should pop up somewhere.
The cluster view allows you to run several basic management commands, for
example to stop a node, or to have it leave the cluster. All that is hidden in the
"Actions" menu, which is right next to every node's name.
To Be Continued...
Riak Control is a work in progress, and this is just the beginning. Eventually
it's supposed to grow into a tool that allows you to view live graphs of your
nodes, to browse objects stored in Riak, and to fire off MapReduce queries
against your data. Keep an eye on future Riak releases!
It's worth noting that everything you can do with Riak Control can also be
achieved by way of the riak-admin command. The wiki page has a great
outline on the things you can do with it.
When To Riak?
Riak can do a lot, a whole lot. Yet, everything it does stays true to the spirit of
Riak as it was originally developed. All components are built to ensure data
is still available in the face of failure, and all of them scale up and down as you
add and remove nodes from a cluster.
Even though Riak follows a simple key-value model, it has proven to work
nicely in environments where time-ordered data is desired. That involves
work on the application's side, but in return you get fault-tolerance,
availability, and operational simplicity.
High availability is usually what matters most to people coming to Riak.
Add content agnosticism, and you get a highly scalable and fault-tolerant
store for any kind of data, from addresses and images to simple JSON data
structures and more complex timelines.
Riak has been used as a scalable and persistent session store as much as it has
been used to archive an abundance of data, among the latter text messages,
metadata mined from the web, and address data. Anything you need to store
in abundance, and that can be identified by some other means, a username, a
URL, a session identifier, is where Riak really shines, and where it's
commonly used.
The example of tweets isn't that far-fetched, and not just for the fun
of analyzing what people say about Justin Bieber. The Twitter streaming
search, and the firehose too, add up to large amounts of data over time.
Leave that Twitter search running for a couple of days, and you'll find
hundreds of thousands of tweets have accumulated.
It's hard to give a specific recommendation on which of the use cases you'll
come across will be the right ones for Riak. It usually starts with the fact that
your existing database isn't up to the task anymore, maybe because of the
amount of data, because it has become a single point of failure, or because it
lacks simple ways to scale (simple compared to Riak, anyway).
Riak goes nicely with the fact that, once you start growing, you slowly
but surely loosen up consistency constraints and simplify your data model
and data access. Those are key areas for Riak. It may sound pretty hand-wavy,
but I wouldn't say that Riak is a database you run to right from the start. You
come to it only, maybe mostly, when you have a good idea of growth and
access patterns for your data, when it's foreseeable that data will outgrow a
relational database at some point, or when it's just easier (from an operational
perspective) to use Riak instead, saving you the trouble of migrating later.
Until not long ago, access through bucket and key names was the only
means of accessing data in Riak. Thanks to Riak Search and Riak 2i it's
gotten much easier to build different views on your data, taking Riak closer
to what you're used to and, more importantly, making it much more useful
in general.
Thanks to HTTP, you can even put an HTTP proxy like nginx or a load
balancer like HAProxy in front of Riak and serve files directly. Also thanks to
HTTP, you can use any HTTP client to store files in Riak.
Object Size
This usage scenario assumes your files don't exceed a size of 1 MB. Beyond
that, you won't gain as much from using Riak anymore. It shines with data in
the range of dozens to hundreds of kilobytes. There are known cases where
users stored data larger than that in Riak. But here's why you want to avoid
that and keep it small.
Object size affects a lot of things in Riak, but most notably it affects latency.
A request in Riak can involve several nodes. One physical node serves as
a proxy, delivering the request to the relevant virtual nodes in the cluster.
That means there's always network traffic between the Riak nodes involved,
network traffic that you want to keep as efficient as possible. Transferring
larger chunks of data increases network traffic and therefore latency, making
requests slower for the clients.
Throw disk I/O into the mix, and the larger your data is, the more latency
you get. You can try to keep it low by involving just one replica in read
requests, but it still adds up.
How far you can go with object size depends on the infrastructure and the
application. With a fast enough network and SSDs to store Riak's data on,
it will be more acceptable to have larger objects than when running on
Amazon EC2 with network-backed storage.
var s3 = require('knox').createClient({
  endpoint: 's3.riakhandbook.com',
  key: 'access-key',
  secret: 'secret-key',
  bucket: 'stylesheets'
});

// the request is only sent once end() is called
s3.get('/application.css').on('response', function(response) {
  response.on('data', function(chunk) {
    console.log(chunk.toString());
  });
}).end();
The example above creates a client with a custom endpoint. If your Riak
CS service listens on a custom port, you can specify that too. Note that as
of Knox 0.0.9 a custom port is not supported yet; you'll need to install the
current master. If you're using a different S3 library with Riak CS, make sure
it supports setting the port if you're not using an HTTP proxy or load
balancer on port 80.
The example then requests a file called application.css in the bucket
stylesheets. That's it. The beauty of the code above is that you could
easily leave out the custom endpoint, switching between S3 and your own
Riak CS cluster as you see fit.
While the recommended maximum object size for Riak is around 1 MB,
Riak CS can store objects up to 5 GB in size. Also, you can get accounting
data for every tenant in the system, allowing you to account for or bill
network traffic and storage.
{
"time": "2012-04-30T15:02:17.273Z",
"log_level": "info",
"facility": "kernel",
"message": "--MARK--"
}
It's debatable how efficient this data structure will be when stored on disk
thousands of times, but it'll serve us well for an example. Most log formats,
like syslog or IEEE 1545-1999, can be decomposed into a data structure like
this.
Why not store the lines of text directly? Pre-analyzing allows you to run
efficient searches on it, utilizing Riak Search to index the data structure for
you. It's also easier to analyze the data using MapReduce.
Writing data in a centralized logging scenario is the simple part; getting the
data into a format that allows full-text search, and accessing it ordered by
time, are different stories. Let's look at simple access by ranges first.
var logEntry = {
  // ...fields as shown above, plus the host the entry came from
  "host": "riak1.production.com"
};
// assumes key was derived from the entry's timestamp
riak.save('logs', key, logEntry);
Keys are not the only way to access records by time. You could also just use
random UUIDs for the keys and build secondary indexes on the timestamp
itself. You could even leave generating random IDs up to Riak by not
specifying a key and using POST to create the record. With riak-js, you can
set the key to null to do that; it automatically uses POST in that case. We'll
create an additional index on the timestamp while we're at it.
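A sketch of what that could look like with riak-js. Parsing the ISO timestamp into an integer makes range queries possible; the riak.save call assumes riak-js's index option and a running Riak node, so it's shown commented out:

```javascript
// a log entry as introduced earlier
var logEntry = {
  time: "2012-04-30T15:02:17.273Z",
  log_level: "info",
  facility: "kernel",
  message: "--MARK--"
};

// use the timestamp, parsed into milliseconds, as an integer index value
var timeIndex = Date.parse(logEntry.time);
console.log(timeIndex); // 1335798137273

// hypothetical riak-js call, assuming a running Riak node:
// riak.save('logs', null, logEntry, { index: { time: timeIndex } });
```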
Why not just leave the key generation up to Riak and use secondary indexes
to fetch ranges? As mentioned before, there are efficiency gains with ordered
keys when writing data. There are trade-offs involved with both ways. If
you don't have a lot of log data generated at any given point in time, just
using Riak's random IDs can be an acceptable trade-off. If efficiency on
inserts is an issue, ordered keys are worth looking into.
Either way, accessing data based on a time range is straightforward: you
derive upper and lower bounds from the time frame you're interested in and
do an index query based on the resulting range. The example below fetches
all indexed records that were created on May 5th 2012 between 12 and 1 pm.
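A sketch of such a range query, assuming the log entries were stored with an integer secondary index named time and using riak-js's index query support (the query call is commented out, as it needs a running Riak node):

```javascript
// bounds for May 5th 2012, 12:00 to 13:00 UTC, in milliseconds
var lower = Date.parse("2012-05-05T12:00:00.000Z");
var upper = Date.parse("2012-05-05T13:00:00.000Z");
console.log(lower, upper); // 1336219200000 1336222800000

// hypothetical riak-js range query on the time index:
// riak.query('logs', { time: [lower, upper] }, function(error, keys) {
//   console.log(keys);
// });
```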
If the key structure itself is not enough to fetch a subset of keys, you can add
more indexes to represent these access patterns. Instead of applying filters
after the fact by using key filters, you create the indexes in a way that allows
more efficient ad-hoc matching.
Let's go back to an example from the section on key filters and see what
it looks like when implemented as a secondary index. Here's the key filter
version.
riak.add({bucket: 'tweets',
key_filters: [["string_to_int"],
["less_than", 41399579391950849]]}).
map('Riak.mapValues').run()
Now you can run an index query on the data that yields the same results as
the key filter example above.
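Sketched with riak-js, and assuming the tweets were stored with an integer secondary index on the tweet id, the equivalent index query could look like this (the riak-js calls are commented out, as they need a running Riak node):

```javascript
// hypothetical: when saving a tweet, add an integer index on its id:
// riak.save('tweets', id, tweet, { index: { id: parseInt(id, 10) } });

// the key filter "less_than 41399579391950849" then becomes a
// range query on that index, from zero up to the given id:
var range = { id: [0, 41399579391950849] };

// riak.query('tweets', range, function(error, keys) {
//   keys.forEach(function(key) { console.log(key); });
// });
```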
Searching Logs
Having proper indexes in place is only one part of the story. The more
interesting bits of logs are not in their metadata but in the log lines
themselves. You can use Riak Search to add full-text search on the log
messages. It can also stand alone, without secondary indexes at all, using
Riak Search's sorting features to fetch data ordered by time.
The advantage of using Riak Search is that we can run queries on more than
one field in a log entry. This makes searching for a specific host, facility, or
log entry possible, while still allowing you to order matching entries by the
time of occurrence.
The one thing you need to take care of is to install a custom schema for Riak
Search, should your data structure not fit in with the default schema installed
by Riak Search. See the section on custom schemas for an example.
Session Storage
A site serving lots of users has to keep a lot of user sessions around. Amazon
is a prime example, and it's the company that brought us Dynamo in the first
place.
The more traditional approach is to store sessions in a filesystem, either local
or shared, in a database, or in an in-memory store like Memcached or Redis.
Problems start when the infrastructure needs to scale beyond a single
instance, to scale up and down on demand, or to persist sessions. You can
achieve these goals by using Redis' persistence and adding a consistent-
hashing implementation to go distributed, by using an in-process database
like BerkeleyDB, or by using Riak.
Riak fulfills several requirements for session storage:
• Persistent storage for durability
• Replication for fault-tolerance
• Expiring session data
That last point is not a must-have requirement. Stores like Amazon keep
sessions, in particular the shopping carts, around as long as possible, to
maximize profits even in the longer term.
{
"a93d40ce-a757-11e1-9178-1093e90b5d80": {
"add": "978-0978739218",
"time": 1337001337
},
"56707cee-a757-11e1-8e1b-1093e90b5d80": {
"add": "978-0321200686",
"time": 1337001388
}
}
URL Shortener
No database at any reasonable scale can avoid being used for shortening
URLs. Why is this even an interesting use case? Shortening a URL involves
several smaller steps:
• Generate a short, unique identifier for a URL
• Save mapping of unique identifier with the URL
• Look up the URL based on the unique identifier
• Redirect clients to the URL
• Bonus: track statistics about clicks
Modeling Data
To store a URL, the data structure doesn't have to be very complex. A simple
JSON hash will do the job, though storing the URL as plain text also works
well. Using plain text saves you the extra work of deserializing the data
structure, allowing you to just fetch the URL and send the client a redirect.
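To make the identifier step concrete, here's a small sketch of base62-encoding a numeric counter into a short identifier, a common approach for URL shorteners. The riak.save call at the end is a hypothetical riak-js usage and assumes a running Riak node:

```javascript
var ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// base62-encode a numeric counter into a short identifier
function shorten(n) {
  var result = '';
  do {
    result = ALPHABET[n % 62] + result;
    n = Math.floor(n / 62);
  } while (n > 0);
  return result;
}

console.log(shorten(125)); // "21"

// save the mapping under the short identifier, hypothetical riak-js call:
// riak.save('urls', shorten(counter), 'http://riakhandbook.com/',
//           { contentType: 'text/plain' });
```

Looking up the URL is then a plain key lookup on the short identifier, followed by a redirect to the stored URL.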