Introducing MongoDB - Document Database Overview

Introducing:
MongoDB
David J. C. Beach
Sunday, August 1, 2010

David Beach
Software Consultant (past 6 years)
Python since v1.4 (late 90’s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a “frameworks” guy
Organizer: Front Range Pythoneers

Outline
Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features

Part I:
Trends in Databases

Database Trends
WARNING: extreme
oversimplification
Past: “Relational” (RDBMS)
Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign keys
Data is controlled & queried via SQL

Trends:
Criticisms of RDBMS
Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages

Lots of disagreement over this
There are points & counterpoints from both sides
The debate is not over
Not here to deliver a verdict
POINT: This is why we see an explosion of new databases.
As with so many things in technology,
we’re seeing... FRAGMENTATION!

Trends:
Fragmentation
some examples of DB categories
Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)

categories are incomplete
some don’t fit neatly into categories

Where Mongo Fits
Mongo’s Tagline (taken from website)
“The Best Features of
Document Databases,
Key-Value Stores,
and RDBMSes.”

What is Mongo
Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0

Mongo
Advantages
(many of these taken straight from home page)
json-style documents (dynamic schemas)
flexible indexing (B-Tree)
replication and high-availability (HA)
automatic sharding support (v1.6)*
easy-to-use API
fast queries (auto-tuning planner)
fast insert & deletes (sometimes trade-offs)
* sharding support available as of v1.6 (late July 2010)
easy-to-use API

Mongo
Language Bindings
C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)

Mongo
Disadvantages
No Relational Model / SQL
  Can mimic with foreign IDs, but referential integrity not enforced.
No Explicit Transactions / ACID
  Operations can only be atomic within a single document. (Generally)
Limited Query API
  You can do a lot more with MapReduce and JavaScript!

When to use Mongo
My personal take on this...
Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations

Caveat: I’ve never used Mongo in Production!

Part II:
Mongo Basic Usage
BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)

Installing Mongo
Use a 64-bit OS (Linux, Mac, Windows)
Get Binaries: www.mongodb.org

32-bit available; not for production
MongoDB uses memory-mapped files.
A 32-bit build limits the database to about 2 GB!

Run “mongod” process

Installing PyMongo
Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools
(includes C extension for speed)
# python setup.py install
# python setup.py --no_ext install
(to compile without extension)

Mongo Anatomy
Mongo Server
Database
Collection
Document

Getting a Connection
Connection required for using Mongo
>>> import pymongo
>>> connection = pymongo.Connection("localhost")

Finding a Database
Databases = logically separate stores
Navigation using properties
Will create DB if not found
>>> db = connection.mydatabase

Using a Collection
Collection is analogous to Table
Contains documents
Will create collection if not found
>>> blog = db.blog

Inserting
collection.insert(document) => document_id
>>> entry1 = {"title": "Mongo Tutorial",
...           "body": "Here's a document to insert."}  # a document
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')

Inserting (contd.)
Documents must have ‘_id’ field
Automatically generated unless assigned
12-byte unique binary value
You can also assign your own ‘_id’; it can be any unique value.
>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'),
'body': "Here's a document to insert.",
'title': 'Mongo Tutorial'}
Mongo’s IDs are designed to be unique...
...even if hundreds of thousands of documents are generated per second, on
numerous clustered machines.
ID generated by driver. No waiting on DB.
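To make the "generated by driver" point a little more concrete, here is a small sketch that takes apart the ObjectId shown above. It assumes the 12-byte layout used by drivers of this era (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter). The helper name `split_object_id` is made up for this example and is not part of PyMongo. Written for Python 3.

```python
import binascii
import struct
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into its assumed legacy fields."""
    raw = binascii.unhexlify(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    # Assumed layout: timestamp (4) | machine (3) | pid (2) | counter (3)
    (seconds,) = struct.unpack(">I", raw[0:4])
    return {
        "created": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),
        "pid": raw[7:9].hex(),
        "counter": int.from_bytes(raw[9:12], "big"),
    }


fields = split_object_id("4c3a12eb1d41c82762000001")
print(fields["created"].isoformat())
print(fields["machine"], fields["pid"], fields["counter"])
```

Because every field can be filled in locally (clock, host, process, counter), a driver can mint an ID without asking the server, which is consistent with the "no waiting on DB" remark.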

Inserting (contd.)
Documents may have different properties
Properties may be atomic, lists, dictionaries
>>> entry2 = {"title": "Another Post",
...           "body": "Mongo is powerful",
...           "author": "David",
...           "tags": ["Mongo", "Power"]}  # another document
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')

Indexing
May create index on any field
If field is list => index associates all values
>>> blog.ensure_index("author")  # index by single value
>>> blog.ensure_index("tags")    # by multiple values
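A rough way to picture "index associates all values" for a list field: each element of the list gets its own index entry pointing back at the document. The sketch below models that with a plain dictionary. Mongo's real index is a B-Tree, so this is only a mental model, and `build_index` is a hypothetical helper, not a PyMongo call.

```python
from collections import defaultdict


def build_index(documents, field):
    """Map each indexed value to the positions of the documents containing it."""
    index = defaultdict(list)
    for position, doc in enumerate(documents):
        if field not in doc:
            continue
        value = doc[field]
        # A list contributes one index entry per element.
        keys = value if isinstance(value, list) else [value]
        for key in keys:
            index[key].append(position)
    return index


posts = [
    {"title": "Mongo Tutorial"},
    {"title": "Another Post", "author": "David", "tags": ["Mongo", "Power"]},
]
author_index = build_index(posts, "author")
tag_index = build_index(posts, "tags")
print(dict(author_index))
print(dict(tag_index))
```

Looking up either "Mongo" or "Power" in `tag_index` leads to the same post, which is the behaviour the slide describes for list fields.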

Bulk Insert
Let’s produce 100,000 fake posts
import random

bulk_entries = []
for i in range(100000):
    entry = {"title": "Bulk Entry #%i" % (i + 1),
             "body": "What Content!",
             "author": random.choice(["David", "Robot"]),
             "tags": ["bulk",
                      random.choice(["Red", "Blue", "Green"])]}
    bulk_entries.append(entry)

Bulk Insert (contd.)
collection.insert(list_of_documents)
Inserts 100,000 entries into blog
Returns in 2.11 seconds
>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]

Bulk Insert (contd.)
driver returns early; DB is still working
...unless you specify “safe=True”
>>> blog.remove() # clear everything
>>> blog.insert(bulk_entries, safe=True)
returns in 7.90 seconds (vs. 2.11 seconds)

Querying
collection.find_one(spec) => document
spec = document of query parameters
>>> blog.find_one({"title": "Bulk Entry #12253"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}
returned in 0.04s - extremely fast
No index created for “title”!
presumably, need more entries to effectively test index performance...

Querying
(Specs)
Multiple conditions on document => “AND”
Value for tags is an “ANY” match
>>> blog.find_one({"title": "Bulk Entry #12253",
...                "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}
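The "AND" and "ANY" rules can be sketched in a few lines of plain Python. This is only a model of the matching semantics described on the slide, not how the server evaluates queries, and `matches` is a hypothetical helper name.

```python
def matches(doc, spec):
    """Return True if doc satisfies every condition in spec (implicit AND)."""
    for field, wanted in spec.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # A scalar condition against a list field matches ANY element.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


post = {"title": "Bulk Entry #12253",
        "author": "Robot",
        "tags": ["bulk", "Green"]}

print(matches(post, {"title": "Bulk Entry #12253", "tags": "Green"}))
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Red"}))
```

The second call fails because both conditions must hold, even though the title matches.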

Querying
(Multiple)
collection.find(spec) => cursor
new items are fetched in bulk (behind the scenes)
>>> green_items = []
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)
- or -
>>> green_items = list(blog.find({"tags": "Green"}))
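One way to picture "fetched in bulk (behind the scenes)" is a generator that hands out documents one by one but only goes back to the store once per batch. The sketch below is a simplified model with an in-memory list standing in for the server. The names `batched_cursor` and `fetch_batch` and the batch size of 100 are assumptions for illustration, not PyMongo internals.

```python
def batched_cursor(fetch_batch, batch_size=100):
    """Yield documents one at a time while fetching them batch_size at a time."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)  # one simulated round trip
        if not batch:
            return
        for doc in batch:
            yield doc
        offset += len(batch)


stored = [{"n": i, "tags": ["bulk", "Green"]} for i in range(250)]
round_trips = []


def fetch_batch(offset, size):
    """Stand-in for the server: record the request and return one slice."""
    round_trips.append(offset)
    return stored[offset:offset + size]


green_items = list(batched_cursor(fetch_batch))
print(len(green_items), "documents in", len(round_trips), "round trips")
```

The caller iterates as if documents arrived individually, while the number of round trips stays small.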

Querying
(Counting)
Use the find() method + count()
Returns number of matches found
>>> blog.find({"tags": "Green"}).count()
16646

Updating
collection.update(spec, document)
updates single document matching spec
“multi=True” => updates all matching docs
>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)

Deleting
use remove(...)
it works like find(...)
>>> blog.remove({"author":"Robot"}, safe=True)
Example removed approximately 50% of records.
Took 2.48 seconds

Part III:
Advanced Features

Advanced Querying
Regular Expressions
{"tags": re.compile(r"^(Green|Blue)$")}
Nested Values {"foo.bar.x": 3}
$where Clause (JavaScript)
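A note on the regular expression: alternation binds more loosely than the anchors, so a pattern written as `^Green|Blue$` means "starts with Green, or ends with Blue". Grouping the alternatives restricts the match to the two exact tags. The sketch below checks this with Python's `re` module on some made-up tag values.

```python
import re

loose = re.compile(r"^Green|Blue$")     # parsed as (^Green) OR (Blue$)
strict = re.compile(r"^(Green|Blue)$")  # exactly "Green" or exactly "Blue"

tags = ["Green", "Blue", "Greenish", "Navy Blue", "Red"]  # sample values

loose_hits = [t for t in tags if loose.search(t)]
strict_hits = [t for t in tags if strict.search(t)]
print(loose_hits)
print(strict_hits)
```

With the bulk data in this talk the two patterns would select the same posts, since the only tags are "bulk", "Red", "Blue" and "Green"; the difference shows up as soon as tags like "Greenish" exist.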

Advanced Querying
$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch
>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
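To get a feel for how these operator documents are read, here is a toy evaluator for a handful of them ($lt, $lte, $gt, $gte, $in, $or) over plain dictionaries. It is an illustrative subset under simplified rules, not the server's query engine. `matches_query` and the `views` field are invented for the example.

```python
import operator

# Illustrative subset only; the real server supports many more operators.
COMPARATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$in": lambda value, allowed: value in allowed,
}


def matches_query(doc, spec):
    """Return True if doc satisfies spec; top-level conditions are ANDed."""
    for field, cond in spec.items():
        if field == "$or":
            # $or takes a list of sub-specs; at least one must match.
            if not any(matches_query(doc, sub) for sub in cond):
                return False
            continue
        if field not in doc:
            return False
        value = doc[field]
        candidates = value if isinstance(value, list) else [value]
        if isinstance(cond, dict):
            # Each operator must hold for at least one candidate value.
            for op, arg in cond.items():
                if not any(COMPARATORS[op](v, arg) for v in candidates):
                    return False
        elif cond not in candidates:
            return False
    return True


post = {"title": "Bulk Entry #7", "tags": ["bulk", "Green"], "views": 12}

print(matches_query(post, {"$or": [{"tags": "Green"}, {"tags": "Blue"}]}))
print(matches_query(post, {"views": {"$gt": 10, "$lte": 12}}))
```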

Advanced Querying
collection.find(...)
sort("name") - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL’s DISTINCT]
collection.group(...) - like SQL’s GROUP BY
>>> blog.find().limit(50) # find 50 articles
>>> blog.find().sort("title").limit(30) # 30 titles
>>> blog.find().distinct("author") # unique author names

won’t be showing detailed examples of all these...
there are good tutorials online for all of this
let’s move on to something even more interesting
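For readers who think in plain Python rather than SQL, the cursor methods map roughly onto sorting, slicing and sets. The sketch below shows those analogies on a tiny in-memory list of made-up posts. It illustrates the meaning of the operations only; the server does this work itself and can use indexes.

```python
posts = [
    {"title": "Bulk Entry #3", "author": "Robot"},
    {"title": "Bulk Entry #1", "author": "David"},
    {"title": "Bulk Entry #4", "author": "Robot"},
    {"title": "Bulk Entry #2", "author": "David"},
]

# sort("title"): order by one field
by_title = sorted(posts, key=lambda p: p["title"])

# skip(1).limit(2): drop the first result, keep the next two
page = by_title[1:1 + 2]

# distinct("author"): unique values of one field
authors = sorted({p["author"] for p in posts})

print([p["title"] for p in page])
print(authors)
```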

Map/Reduce
Most powerful querying mechanism
collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
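Before the Mongo specifics, it may help to see the shape of map/reduce in ordinary Python: a mapper emits (key, value) pairs per document, the pairs are grouped by key, and a reducer folds each group into one value. The sketch below is a single-process model with hypothetical helper names. In Mongo the mapper and reducer are JavaScript functions that run on the server.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group emitted pairs by key, reduce."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):  # mapper yields (key, value) pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(doc):
    """Emit one (tag, 1) pair for every tag on a post."""
    for tag in doc.get("tags", []):
        yield tag, 1


def count_reducer(key, values):
    """Add up the emitted counts for one tag."""
    return sum(values)


posts = [
    {"title": "Another Post", "tags": ["Mongo", "Power"]},
    {"title": "Bulk Entry #1", "tags": ["bulk", "Green"]},
    {"title": "Bulk Entry #2", "tags": ["bulk", "Red"]},
]
tag_counts = map_reduce(posts, tag_mapper, count_reducer)
print(tag_counts)
```

Since each document is mapped independently and each key is reduced independently, the work can be split across nodes, which is where the "distribute" bullet comes from.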

Map/Reduce
Visualized
[Figure: MapReduce logical data flow]
also see: Map/Reduce : A Visual Explanation
Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books, Chapter 2, page 20

mySQL

SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

MongoDB

db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() { emit(
    { d1: this.Dim1, d2: this.Dim2 },
    { msum: this.measure1, recs: 1, mmin: this.measure1,
      mmax: this.measure2 < 100 ? this.measure2 : 0 }
  );},
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);

Notes
1. Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
2. Measures must be manually aggregated.
3. Aggregates depending on record counts must wait until finalization.
4. Measures can use procedural logic.
5. Filters have an ORM/ActiveRecord-looking style.
6. Aggregate filtering must be applied to the result set, not in the map/reduce.
7. Ascending: 1; Descending: -1

Revision 4, Created 2010-03-06
Rick Osborne, rickosborne.org
http://rickosborne.org/download/SQL-to-MongoDB.pdf

Map/Reduce
Examples
This is me, playing with Map/Reduce

Health Clinic Example
Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times

Health Clinic Example
person = { "name": "Bob",
           "weighings": [
               {"date": date(2009, 1, 15), "weight": 165.0},
               {"date": date(2009, 2, 12), "weight": 163.2},
               ... ] }

Map/Reduce
Insert Script
import datetime
import random

all_people = []
for i in range(N):  # N = number of people to generate
    person = {'name': 'person%04i' % i}
    weighings = person['weighings'] = []
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(
                    days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({'date': date,
                          'weight': weight})
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)

Insert Data
Performance
LOG-LOG scale
Linear scaling
[Chart: Insert time]
1k: 3.14s
10k: 29.5s
100k: 292s
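The "linear scaling" claim can be checked from the three timings: on a log-log plot, linear scaling shows up as a slope of about 1. The sketch below does that arithmetic, assuming the timings pair with the sizes in the obvious order (3.14s for 1k, 29.5s for 10k, 292s for 100k).

```python
import math

# size -> seconds, read off the slide (pairing is an assumption)
timings = {1000: 3.14, 10000: 29.5, 100000: 292.0}

sizes = sorted(timings)
smallest, largest = sizes[0], sizes[-1]

# Slope of the line through the end points in log-log space
slope = (math.log10(timings[largest]) - math.log10(timings[smallest])) / (
    math.log10(largest) - math.log10(smallest))

# Cost per inserted record, in milliseconds
per_record_ms = {n: 1000.0 * timings[n] / n for n in sizes}

print("log-log slope: %.3f" % slope)
for n in sizes:
    print("%6d: %.2f ms per record" % (n, per_record_ms[n]))
```

A slope just under 1 and a per-record cost that stays near 3 ms across two orders of magnitude are both consistent with the slide's reading.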

Map/Reduce
Total Weight by Day
from pymongo.code import Code  # PyMongo 1.7; later releases use bson.code

map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")
reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")
result = people.map_reduce(map_fn, reduce_fn)
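One property worth knowing about reducers: Mongo may call the reduce function more than once for the same key, feeding earlier outputs back in as values. A reducer therefore has to give the same answer whether it sees all values at once or in stages. The sketch below checks that for a Python stand-in of the summing reducer, using randomly generated sample weights.

```python
import math
import random


def reduce_fn(key, values):
    """Python stand-in for the JavaScript reducer: sum the emitted weights."""
    return sum(values)


random.seed(0)  # fixed seed so the sample data is repeatable
weights = [random.normalvariate(150.0, 5.0) for _ in range(1000)]
day = "2009-01-01"

# All values in a single call
all_at_once = reduce_fn(day, weights)

# Two partial reductions, then a re-reduce over the partial results
partials = [reduce_fn(day, weights[:400]), reduce_fn(day, weights[400:])]
in_stages = reduce_fn(day, partials)

print(math.isclose(all_at_once, in_stages, rel_tol=1e-9))
```

A sum passes this check (up to floating-point rounding). An average computed directly in the reducer would not, which is one reason averages are usually built from a sum and a count and divided at the end.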

Map/Reduce
Total Weight by Day
>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{... u'value': 41685.341024046182}
{... u'value': 38232.326554504165}
... lots more ...

Total Weight by Day
Performance (MapReduce)
1k: 4.29s
10k: 38.8s
100k: 384s

Map/Reduce
Weight on Day
map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date",
                      target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")
reduce_fn = Code("""function (key, values) {
    return values[0];
};""")
result = people.map_reduce(map_fn, reduce_fn,
                           scope={"bsearch": bsearch})

Map/Reduce
bsearch() function
bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")
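To see what the JavaScript helper returns, here is a line-for-line Python port, checked against the standard library. For a list sorted by distinct dates, it returns the index of the last entry on or before the target, which is the same as `bisect.bisect_right(...) - 1`, and -1 when every entry is later. The third sample weighing is invented for the example, and the empty-list guard is an addition the JavaScript version does not have.

```python
import bisect
from datetime import date


def bsearch(array, prop, value):
    """Index of the last element whose array[i][prop] <= value, or -1."""
    if not array:
        return -1  # guard added for the port
    lo, hi = 0, len(array) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2
        midval = array[mid][prop]
        if value == midval:
            break
        elif value > midval:
            lo = mid + 1
        else:
            hi = mid - 1
    return mid - 1 if midval > value else mid


weighings = [
    {"date": date(2009, 1, 15), "weight": 165.0},
    {"date": date(2009, 2, 12), "weight": 163.2},
    {"date": date(2009, 3, 20), "weight": 164.1},  # invented sample value
]
dates = [w["date"] for w in weighings]

targets = [date(2009, 1, 1), date(2009, 2, 12),
           date(2009, 3, 1), date(2009, 12, 31)]
positions = [bsearch(weighings, "date", t) for t in targets]
expected = [bisect.bisect_right(dates, t) - 1 for t in targets]
print(positions)
print(expected)
```

A result of -1 means the person has no weighing on or before the target date, so callers should check for it before indexing.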

Weight on Day
Performance (MapReduce)
1k: 1.23s
10k: 10s
100k: 108s

Weight on Day
(Python Version)
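import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)
for person in people.find():
    dates = [w['date'] for w in person['weighings']]
    # bisect_right is the insertion point; the latest weighing on or
    # before target_date sits one slot earlier
    pos = bisect.bisect_right(dates, target_date) - 1
    if pos < 0:
        continue  # no weighing on or before target_date
    val = person['weighings'][pos]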

Map/Reduce
Performance
[Chart: MapReduce vs. Python]
        MapReduce   Python
1k      1.23s       0.37s
10k     10s         2.2s
100k    108s        26s
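To put numbers on the gap, the sketch below divides the MapReduce timings by the client-side Python timings at each size. The pairing of values to sizes is read off the chart and is an assumption (MapReduce: 1.23s, 10s, 108s; Python: 0.37s, 2.2s, 26s).

```python
# seconds per run, read off the chart (pairing is an assumption)
mapreduce_s = {"1k": 1.23, "10k": 10.0, "100k": 108.0}
python_s = {"1k": 0.37, "10k": 2.2, "100k": 26.0}

speedup = {size: mapreduce_s[size] / python_s[size] for size in mapreduce_s}

for size in ("1k", "10k", "100k"):
    print("%5s: Python loop is %.1fx faster" % (size, speedup[size]))
```

On this single-machine test the plain Python loop comes out roughly three to five times faster. Both curves still grow about linearly, so the ratio stays in the same range as the data grows.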

Summary

Resources
MongoDB: www.mongodb.org
PyMongo: api.mongodb.org/python
MongoDB: The Definitive Guide (O’Reilly)
10gen: www.10gen.com

END OF SLIDES

Chalkboard
is not Comic Sans
This is Chalkboard, not Comic Sans.
This isn’t Chalkboard, it’s Comic Sans.
does it matter, anyway?
