
Introducing:

MongoDB
David J. C. Beach

Sunday, August 1, 2010


David Beach
Software Consultant (past 6 years)

Python since v1.4 (late 90’s)

Design, Algorithms, Data Structures

Sometimes Database stuff

not a “frameworks” guy

Organizer: Front Range Pythoneers



Outline

Part I: Trends in Databases

Part II: Mongo Basic Usage

Part III: Advanced Features



Part I:
Trends in Databases



Database Trends

WARNING: extreme
oversimplification
Past: “Relational” (RDBMS)

Data stored in Tables, Rows, Columns

Relationships designated by Primary, Foreign keys

Data is controlled & queried via SQL



Trends:
Criticisms of RDBMS
Rigid data model

Hard to scale / distribute

Slow (transactions, disk seeks)

SQL not well standardized

Awkward for modern/dynamic languages

Lots of disagreement over this

There are points & counterpoints from both sides

The debate is not over

Not here to deliver a verdict

POINT: This is why we see an explosion of new databases.


Trends:
Fragmentation
As with so many things in technology, we're seeing... FRAGMENTATION!

some examples of DB categories

Relational with ORM (Hibernate, SQLAlchemy)

ODBMS / ORDBMS (push OO-concepts into database)

Key-Value Stores (MemcacheDB, Redis, Cassandra)

Graph (neo4j)

Document Oriented (Mongo, Couch, etc...)

categories are incomplete

some don't fit neatly into categories


Where Mongo Fits
Mongo’s Tagline (taken from website)

“The Best Features of

Document Databases,

Key-Value Stores,

and RDBMSes.”



What is Mongo
Document-Oriented Database

Produced by 10gen / Implemented in C++

Source Code Available

Runs on Linux, Mac, Windows, Solaris

Database: GNU AGPL v3.0 License

Drivers: Apache License v2.0



Mongo
Advantages
(many of these taken straight from home page)

json-style documents (dynamic schemas)

flexible indexing (B-Tree)

replication and high-availability (HA)

automatic sharding support (v1.6)*

fast queries (auto-tuning planner)

fast insert & deletes (sometimes trade-offs)

easy-to-use API

* sharding support available as of v1.6 (late July 2010)


Mongo
Language Bindings

C, C++, Java

Python, Ruby, Perl

PHP, JavaScript

(many more community supported ones)



Mongo
Disadvantages

No Relational Model / SQL
Can mimic with foreign IDs, but referential integrity not enforced.

No Explicit Transactions / ACID
Operations can only be atomic within a single document. (Generally)

Limited Query API
You can do a lot more with MapReduce and JavaScript!


When to use Mongo
My personal take on this...

Rich semistructured records (Documents)

Transaction isolation not essential

Humongous amounts of data

Need for extreme speed

You hate schema migrations


Caveat: I’ve never used Mongo in Production!



Part II:
Mongo Basic Usage
BRIEFLY cover:

- Download, Install, Configure


- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)



Installing Mongo

Use a 64-bit OS (Linux, Mac, Windows)

Get Binaries: www.mongodb.org


32-bit available; not for production

MongoDB uses memory-mapped files.

32-bit limits database to about 2 GB!


Run “mongod” process



Installing PyMongo

Download: http://pypi.python.org/pypi/pymongo/1.7

Build with setuptools

(includes C extension for speed)

# python setup.py install

# python setup.py --no_ext install

(to compile without extension)



Mongo Anatomy
Mongo Server

Database

Collection

Document



Getting a Connection

Connection required for using Mongo

>>> import pymongo

>>> connection = pymongo.Connection("localhost")


Finding a Database
Databases = logically separate stores

Navigation using properties

Will create DB if not found

>>> db = connection.mydatabase



Using a Collection
Collection is analogous to Table

Contains documents

Will create collection if not found

>>> blog = db.blog



Inserting

collection.insert(document) => document_id

>>> entry1 = {"title": "Mongo Tutorial",
...           "body": "Here's a document to insert."}  # document

>>> blog.insert(entry1)

ObjectId('4c3a12eb1d41c82762000001')


Inserting (contd.)
Documents must have '_id' field

Automatically generated unless assigned

12-byte unique binary value

>>> entry1

{'_id': ObjectId('4c3a12eb1d41c82762000001'),
 'body': "Here's a document to insert.",
 'title': 'Mongo Tutorial'}

ID generated by driver. No waiting on DB.

You can also assign your own '_id', can be any unique value.

Mongo's IDs are designed to be unique...
...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.
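
To make the "12-byte unique binary value" concrete, here is a minimal sketch of how a driver could build such an ID on the client, with no round trip to the database. It assumes the classic ObjectId layout (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter). The name `make_object_id` is hypothetical and this is not PyMongo's actual implementation. Written for Python 3.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Hypothetical helper, not part of PyMongo: builds a 12-byte ID locally.
# The counter starts at a random value and increases by one per ID.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))
# Three bytes derived from the host name stand in for the "machine" part.
_machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]


def make_object_id():
    """Return 12 bytes: timestamp(4) + machine(3) + pid(2) + counter(3)."""
    timestamp = struct.pack(">I", int(time.time()) & 0xFFFFFFFF)
    pid = struct.pack(">H", os.getpid() & 0xFFFF)
    counter = struct.pack(">I", next(_counter) & 0xFFFFFF)[1:]  # low 3 bytes
    return timestamp + _machine + pid + counter


first, second = make_object_id(), make_object_id()
print(first.hex(), second.hex())
```

Because the machine and process parts differ between clients and the counter differs between calls, two clients can generate IDs at the same moment without coordinating.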



Inserting (contd.)

Documents may have different properties

Properties may be atomic, lists, dictionaries

>>> entry2 = {"title": "Another Post",
...           "body": "Mongo is powerful",
...           "author": "David",
...           "tags": ["Mongo", "Power"]}  # another document

>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')


Indexing

May create index on any field

If field is list => index associates all values

>>> blog.ensure_index("author")  # index by single value

>>> blog.ensure_index("tags")  # by multiple values
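
The idea that an index on a list field "associates all values" can be sketched in plain Python. This toy version uses a dictionary as a stand-in for Mongo's B-Tree. The `build_index` helper and the sample `_id` values are made up for illustration.

```python
from collections import defaultdict

# Toy documents; the "_id" values are invented for this sketch.
docs = [
    {"_id": 1, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 2, "author": "Robot", "tags": ["bulk", "Green"]},
    {"_id": 3, "author": "David", "tags": ["bulk", "Mongo"]},
]


def build_index(documents, field):
    """Map each indexed value to the set of _ids whose field contains it."""
    index = defaultdict(set)
    for doc in documents:
        value = doc.get(field)
        # A list field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc["_id"])
    return index


author_index = build_index(docs, "author")  # single value per document
tags_index = build_index(docs, "tags")      # several values per document
print(sorted(author_index["David"]))
print(sorted(tags_index["Mongo"]))
```

A lookup by any single tag then finds every document whose list contains that tag, which is what makes queries like `{"tags": "Green"}` indexable.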



Bulk Insert

Let’s produce 100,000 fake posts

import random

bulk_entries = [ ]
for i in range(100000):
    entry = { "title": "Bulk Entry #%i" % (i+1),
              "body": "What Content!",
              "author": random.choice(["David", "Robot"]),
              "tags": ["bulk",
                       random.choice(["Red", "Blue", "Green"])]
            }
    bulk_entries.append(entry)


Bulk Insert (contd.)
collection.insert(list_of_documents)

Inserts 100,000 entries into blog

Returns in 2.11 seconds

>>> blog.insert(bulk_entries)

[ObjectId(...), ObjectId(...), ...]



Bulk Insert (contd.)
driver returns early; DB is still working

...unless you specify "safe=True"

with safe=True: returns in 7.90 seconds (vs. 2.11 seconds)

>>> blog.remove() # clear everything

>>> blog.insert(bulk_entries, safe=True)



Querying

collection.find_one(spec) => document

spec = document of query parameters

>>> blog.find_one({"title": "Bulk Entry #12253"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}

returned in 0.04s - extremely fast
No index created for "title"!
presumably, need more entries to effectively test index performance...


Querying
(Specs)
Multiple conditions on document => “AND”

Value for tags is an “ANY” match

>>> blog.find_one({"title": "Bulk Entry #12253",
...                "tags": "Green"})

{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #12253'}
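
The two matching rules on this slide (multiple conditions are ANDed, and a value matches a list field if ANY element equals it) can be sketched as a small in-memory function. The `matches` helper is hypothetical and only mimics these two rules. It is not how the server evaluates queries.

```python
def matches(doc, spec):
    """True if doc satisfies every key/value pair in spec (implicit AND)."""
    for field, wanted in spec.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            # "ANY" match: one list element equal to the value is enough.
            if wanted not in actual and wanted != actual:
                return False
        elif actual != wanted:
            return False
    return True


entry = {"title": "Bulk Entry #12253", "author": "Robot",
         "tags": ["bulk", "Green"]}

both_hold = matches(entry, {"title": "Bulk Entry #12253", "tags": "Green"})
one_fails = matches(entry, {"title": "Bulk Entry #12253", "tags": "Red"})
print(both_hold, one_fails)
```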



Querying
(Multiple)
collection.find(spec) => cursor

new items are fetched in bulk (behind the scenes)

>>> green_items = [ ]
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)

- or -
>>> green_items = list(blog.find({"tags": "Green"}))
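
A rough sketch of what "fetched in bulk (behind the scenes)" means: the caller iterates one item at a time while the cursor pulls results from the server a batch at a time. Everything here is a stand-in: `batched_cursor` and `fake_server_fetch` are hypothetical names, and the batch size of 3 is arbitrary.

```python
def batched_cursor(fetch_batch, batch_size=3):
    """Yield items one at a time while fetching them batch_size at a time."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # the "server" has nothing more to send
        for item in batch:
            yield item
        offset += len(batch)


# Stand-in for the server: seven results, and a log of each round trip.
results = list(range(7))
round_trips = []


def fake_server_fetch(offset, size):
    round_trips.append(offset)
    return results[offset:offset + size]


items = list(batched_cursor(fake_server_fetch))
print(items, len(round_trips))
```

Seven items arrive in far fewer round trips than seven, which is the point of batching.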



Querying
(Counting)
Use the find() method + count()

Returns number of matches found

>>> blog.find({"tags": "Green"}).count()

16646



Updating
collection.update(spec, document)

updates single document matching spec

“multi=True” => updates all matching docs

>>> item = blog.find_one({"title": "Bulk Entry #12253"})

>>> item["tags"].append("New")  # item is a dict, so index by key
>>> blog.update({"_id": item["_id"]}, item)


Deleting

use remove(...)

it works like find(...)

>>> blog.remove({"author":"Robot"}, safe=True)

Example removed approximately 50% of records.

Took 2.48 seconds



Part III:
Advanced Features



Advanced Querying

Regular Expressions

{"tags": re.compile(r"^(Green|Blue)$")}

Nested Values {“foo.bar.x” : 3}

$where Clause (JavaScript)



Advanced Querying
$lt, $gt, $lte, $gte, $ne

$in, $nin, $mod, $all, $size, $exists, $type

$or, $not

$elemMatch

>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
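
To show how operator documents such as `{"$gt": 5}` and `{"$or": [...]}` are read, here is a small in-memory matcher covering a subset of the operators listed ($lt, $lte, $gt, $gte, $in, $or). It is an illustrative sketch only. The helper names and the `votes` field are made up, and the real server's semantics are richer than this.

```python
import operator

# A subset of the operators from the slide, mapped to Python functions.
COMPARATORS = {
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
    "$in": lambda value, choices: value in choices,
}


def match_value(actual, condition):
    """Check one field against a plain value or an {"$op": arg} document."""
    if actual is None:
        candidates = []          # missing field: nothing can match
    elif isinstance(actual, list):
        candidates = actual      # list field: any element may match
    else:
        candidates = [actual]
    if isinstance(condition, dict):
        return all(
            any(COMPARATORS[op](c, arg) for c in candidates)
            for op, arg in condition.items()
        )
    return condition in candidates


def matches(doc, spec):
    """AND together all conditions; "$or" needs one sub-spec to match."""
    for field, condition in spec.items():
        if field == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif not match_value(doc.get(field), condition):
            return False
    return True


posts = [
    {"title": "A", "tags": ["bulk", "Green"], "votes": 3},
    {"title": "B", "tags": ["bulk", "Blue"], "votes": 12},
    {"title": "C", "tags": ["bulk", "Red"], "votes": 7},
]
either = [p["title"] for p in posts
          if matches(p, {"$or": [{"tags": "Green"}, {"tags": "Blue"}]})]
popular = [p["title"] for p in posts if matches(p, {"votes": {"$gt": 5}})]
print(either, popular)
```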



Advanced Querying

collection.find(...)

sort("name") - sorting

limit(...) & skip(...) [like LIMIT & OFFSET]

distinct(...) [like SQL's DISTINCT]

collection.group(...) - like SQL's GROUP BY

>>> blog.find().limit(50) # find 50 articles
>>> blog.find().sort("title").limit(30) # 30 titles
>>> blog.find().distinct("author") # unique author names

won't be showing detailed examples of all these...
there are good tutorials online for all of this
let's move on to something even more interesting
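
For readers who think in plain Python rather than SQL, the cursor modifiers above correspond roughly to sorting, slicing, and de-duplicating a list. This sketch runs on made-up in-memory data and only illustrates the semantics. It says nothing about how the server executes them.

```python
posts = [
    {"title": "Bulk Entry #3", "author": "Robot"},
    {"title": "Bulk Entry #1", "author": "David"},
    {"title": "Bulk Entry #2", "author": "Robot"},
    {"title": "Bulk Entry #4", "author": "David"},
]

# sort("title"): order by one field
by_title = sorted(posts, key=lambda p: p["title"])

# skip(1).limit(2): drop the first result, keep the next two
skip, limit = 1, 2
page = by_title[skip:skip + limit]

# distinct("author"): unique values of one field
authors = sorted({p["author"] for p in posts})

print([p["title"] for p in page])
print(authors)
```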



Map/Reduce
Most powerful querying
mechanism

collection.map_reduce(mapper, reducer)

ultimate in querying power

distribute across multiple nodes



Map/Reduce
Visualized
[Diagram: MapReduce logical data flow]

Diagram Credit:
Hadoop: The Definitive Guide
by Tom White; O'Reilly Books
Chapter 2, page 20

also see:
Map/Reduce : A Visual Explanation
MySQL:

SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

MongoDB:

db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() { emit(
    { d1: this.Dim1, d2: this.Dim2 },
    { msum: this.measure1, recs: 1, mmin: this.measure1,
      mmax: this.measure2 < 100 ? this.measure2 : 0 }
  );},
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);

Notes from the chart:
- Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
- Measures must be manually aggregated.
- Aggregates depending on record counts must wait until finalization.
- Measures can use procedural logic.
- Filters have an ORM/ActiveRecord-looking style.
- Ascending: 1; Descending: -1
- Aggregate filtering must be applied to the result set, not in the map/reduce.

Chart credit: Rick Osborne, rickosborne.org
http://rickosborne.org/download/SQL-to-MongoDB.pdf
Map/Reduce
Examples

This is me, playing with Map/Reduce



Health Clinic Example

Person registers with the Clinic

Weighs in on the scale

1 year => comes in 100 times



Health Clinic Example
person = { "name": "Bob",
           "weighings": [
               {"date": date(2009, 1, 15), "weight": 165.0},
               {"date": date(2009, 2, 12), "weight": 163.2},
               ... ] }


Map/Reduce
Insert Script
import datetime
import random

N = 1000  # number of people (assumed here; the benchmarks use 1k, 10k, 100k)
all_people = []

for i in range(N):
    person = { 'name': 'person%04i' % i }
    weighings = person['weighings'] = [ ]
    std_weight = random.uniform(100, 200)  # this person's typical weight
    for w in range(100):  # 100 weigh-ins per person
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(
                    days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({ 'date': date,
                           'weight': weight })
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)


Insert Data
Performance
[Chart: Insert time, LOG-LOG scale; linear scaling]

1k: 3.14s
10k: 29.5s
100k: 292s


Map/Reduce
Total Weight by Day
from pymongo.code import Code  # bson.code in newer PyMongo releases

map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")

reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")

result = people.map_reduce(map_fn, reduce_fn)
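
The data flow of this job (map emits key/value pairs, values are grouped by key, reduce collapses each group) can be mimicked in a few lines of plain Python. This is a sketch for intuition only. The `map_reduce` helper and the second person are made up, and the real server may call reduce several times on partial results, which a sum tolerates.

```python
import datetime
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Tiny in-memory stand-in for the map -> group by key -> reduce flow."""
    groups = defaultdict(list)
    for doc in documents:
        # Each yielded pair plays the role of one emit(key, value) call.
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_fn(person):
    for w in person["weighings"]:
        yield w["date"], w["weight"]


def reduce_fn(key, values):
    return sum(values)


people_sample = [
    {"name": "Bob", "weighings": [
        {"date": datetime.date(2009, 1, 15), "weight": 165.0},
        {"date": datetime.date(2009, 2, 12), "weight": 163.0}]},
    {"name": "Ann", "weighings": [   # made-up second person
        {"date": datetime.date(2009, 1, 15), "weight": 130.0}]},
]

totals = map_reduce(people_sample, map_fn, reduce_fn)
print(totals[datetime.date(2009, 1, 15)])
```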



Map/Reduce
Total Weight by Day
>>> for doc in result.find():
...     print doc

{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}

... lots more ...


Total Weight by Day
Performance
[Chart: MapReduce time]

1k: 4.29s
10k: 38.8s
100k: 384s


Map/Reduce
Weight on Day
map_fn = Code("""function () {
    // JavaScript months are 0-based: this is October 5, 2009
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date",
                      target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")

reduce_fn = Code("""function (key, values) {
    return values[0];
};""")

result = people.map_reduce(map_fn, reduce_fn,
                           scope={"bsearch": bsearch})


Map/Reduce
bsearch() function
bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")
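
For comparison, here is a Python sketch of the same idea: a binary search over a list sorted by `prop` that finds the last item at or before a given value. It is a rewrite for illustration, not a line-by-line port of the JavaScript. Returning -1 when nothing qualifies is a choice made here.

```python
import datetime


def bsearch(array, prop, value):
    """Index of the last item whose item[prop] <= value, or -1 if none."""
    lo, hi = 0, len(array) - 1
    result = -1
    while lo <= hi:
        mid = lo + (hi - lo) // 2
        if array[mid][prop] <= value:
            result = mid      # candidate found; look for a later one
            lo = mid + 1
        else:
            hi = mid - 1
    return result


# Sample weighings, sorted by date (the third entry is made up).
weighings = [
    {"date": datetime.date(2009, 1, 15), "weight": 165.0},
    {"date": datetime.date(2009, 2, 12), "weight": 163.2},
    {"date": datetime.date(2009, 3, 20), "weight": 161.0},
]

pos = bsearch(weighings, "date", datetime.date(2009, 3, 1))
print(weighings[pos]["weight"])
```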



Weight on Day
Performance
[Chart: MapReduce time]

1k: 1.23s
10k: 10s
100k: 108s


Weight on Day
(Python Version)

import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)

for person in people.find():
    dates = [ w['date'] for w in person['weighings'] ]
    # index of the last weighing on or before target_date
    pos = bisect.bisect_right(dates, target_date) - 1
    val = person['weighings'][pos] if pos >= 0 else None


Map/Reduce
Performance
[Chart: MapReduce vs. Python time]

        MapReduce   Python
1k      1.23s       0.37s
10k     10s         2.2s
100k    108s        26s


Summary



Resources

www.mongodb.org

PyMongo
api.mongodb.org/python

MongoDB
The Definitive Guide
O’Reilly
www.10gen.com



END OF SLIDES



Chalkboard
is not Comic Sans

This is Chalkboard, not Comic Sans.

This isn’t Chalkboard, it’s Comic Sans.

does it matter, anyway?
