Ranga Sarvabhouman @MongoDB

All application development is Schema Design
Success comes from Proper Data Structure

What is a Record?

Key Value
- One-dimensional storage
- Single value is a blob
- Query on key only
- No schema
- Value cannot be updated, only replaced
[Diagram: Key, Blob]

Relational
- Two-dimensional storage (tuples)
- Each field contains a single value
- Query on any field
- Very structured schema (table)
- In-place updates
- Normalization requires many tables, joins, and indexes, and results in poor data locality
[Diagram: Primary Key]

Document
- N-dimensional storage
- Each field can contain 0, 1, many, or embedded values
- Query on any field & level
- Flexible schema
- Inline updates *
* Embedding related data has optimal data locality, requires fewer indexes, has better performance
[Diagram: _id]

Core Concepts
- Traditional Schema Design: focus on data storage
- Document Schema Design: focus on data use

Another way to think about it
- What answers do I have?
- What questions do I have?
Three Building Blocks of Document Schema Design

1 - Flexibility
- Choices for schema design
- Each record can have different fields
- Field names consistent for programming
- Common structure can be enforced by application
- Easy to evolve as needed
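A minimal sketch of this flexibility, using plain Python dictionaries as stand-ins for documents. No MongoDB driver is involved, and the sample records, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names chosen for illustration.

```python
# Two records in the same collection may carry different fields.
contacts = [
    {"name": "Ada", "company": "Example Corp", "title": "Engineer"},
    {"name": "Grace", "email": "grace@example.com"},  # no company or title
]

# Assumption: the application, not the database, enforces the common structure.
REQUIRED_FIELDS = {"name"}


def missing_fields(doc):
    """Return the required field names that a document lacks."""
    return REQUIRED_FIELDS - set(doc)


# Documents that violate the application-level structure (none in this sample).
problems = [doc for doc in contacts if missing_fields(doc)]

# Evolving the schema is just adding a field to the records that need it.
contacts[1]["title"] = "Rear Admiral"
```

Keeping field names consistent (`name`, `title`) is what lets the same application code handle both records even though their shapes differ.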
2 - Arrays
Multiple Values per Field
- Each field can be:
  - Absent
  - Set to null
  - Set to a single value
  - Set to an array of many values
- Query for any matching value
- Can be indexed, and each value in the array is in the index
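The four field states and the "match any value" behaviour can be sketched in plain Python. This is an in-memory illustration under simplifying assumptions, not how the server implements queries or indexes; `field_values`, `find_by_value`, and `build_index` are hypothetical helpers.

```python
from collections import defaultdict

contacts = [
    {"_id": 1, "name": "Ada", "emails": ["ada@example.com", "ada@work.example"]},
    {"_id": 2, "name": "Grace", "emails": "grace@example.com"},  # single value
    {"_id": 3, "name": "Linus", "emails": None},                 # set to null
    {"_id": 4, "name": "Ken"},                                   # field absent
]


def field_values(doc, field):
    """Normalize a field to a list: absent or None -> [], scalar -> [value]."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_by_value(docs, field, wanted):
    """Match documents whose field equals the value or whose array contains it."""
    return [d for d in docs if wanted in field_values(d, field)]


def build_index(docs, field):
    """Index every array element separately: one index entry per value."""
    index = defaultdict(list)
    for d in docs:
        for v in field_values(d, field):
            index[v].append(d["_id"])
    return index


email_index = build_index(contacts, "emails")
```

Here `find_by_value(contacts, "emails", "ada@work.example")` finds the first contact even though the value is only one element of its array, and `email_index` holds one entry per individual address.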
3 - Embedded Documents
- An acceptable value is a document
- Nested documents provide structure
- Query any field at any level
- Can be indexed
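Querying "any field at any level" is commonly expressed with a dotted path such as `address.city`. The sketch below resolves such a path against nested Python dictionaries; `get_path` and the sample contact are hypothetical, and the code only illustrates the idea rather than reproducing the server's query engine.

```python
def get_path(doc, path):
    """Follow a dotted path such as "address.city" through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


contact = {
    "name": "Ada",
    "address": {
        "street": "1 Main St",
        "city": "Springfield",
        "geo": {"lat": 1.5, "lon": 2.5},  # a document nested two levels deep
    },
}

city = get_path(contact, "address.city")
latitude = get_path(contact, "address.geo.lat")
```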
What is an Entity?

An Entity
- Object in your model
- Associations with other entities

Referencing (Relational)    Embedding (Document)
has_one                     embeds_one
belongs_to                  embedded_in
has_many                    embeds_many
has_and_belongs_to_many

MongoDB has both referencing and embedding for universal coverage

Let's model something together
{
  name: ,
  url: ,
  title: ,
  company: ,
  email: ,
  address: {
    street: ,
    city: ,
    state: ,
    zip_code:
  },
  phone: ,
  fax:
}

Example
Let's Look at an Address Book
Address Book
- What questions do I have?
- What are my entities?
- What are my associations?
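Address Book Entity-Relationship

[Entity-relationship diagram: Contacts and its associated entities]
- Contacts: name, company, title
- Addresses (N to 1 contact): type, street, city, state, zip_code
- Phones (N to 1 contact): type, number
- Emails (N to 1 contact): type, address
- Thumbnails (1 to 1): mime_type, data
- Portraits (1 to 1): mime_type, data
- Groups (N to N): name
- Twitters (1 to 1): name, location, web, bio

Associating Entities

One to One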
[Entity-relationship diagram repeated]

One to One
Schema Design Choices
- contact holds twitter_id, referencing twitter (1 to 1)
- twitter holds contact_id, referencing contact (1 to 1)
- Redundant to track relationship on both sides
- Both references must be updated for consistency
- May save a fetch?
- Contact embeds twitter (1)

One to One
General Recommendation
- Full contact info all at once
  - Contact embeds twitter
- Parent-child relationship: "contains"
- No additional data duplication
- Can query or index on embedded field
  - e.g., twitter.name
- Exceptional cases
  - Reference portrait, which has very large data
[Diagram: Contact embeds twitter (1)]

One to Many
[Entity-relationship diagram repeated]

One to Many
Schema Design Choices
- contact holds phone_ids: [ ], referencing phones (1 to N)
- phone holds contact_id, referencing contact (1 to N)
- Redundant to track relationship on both sides
- Both references must be updated for consistency
- Not possible in relational DBs
- Save a fetch?
- Contact embeds phones (N)

One to Many
General Recommendation
- Full contact info all at once
  - Contact embeds multiple phones
- Parent-children relationship: "contains"
- No additional data duplication
- Can query or index on any field
  - e.g., { "phones.type": "mobile" }
- Exceptional cases
  - Scaling: maximum document size is 16MB
[Diagram: Contact embeds phones (N)]

Many to Many
[Entity-relationship diagram repeated]

Many to Many
Traditional Relational Association
Join table
- Contacts: name, company, title, phone
- Groups: name
- GroupContacts: group_id, contact_id
X (join table crossed out): use arrays instead

Many to Many
Schema Design Choices
- group holds contact_ids: [ ], referencing contacts (N to N)
- contact holds group_ids: [ ], referencing groups (N to N)
- Redundant to track relationship on both sides
- Both references must be updated for consistency
- group embeds contacts (N)
- contact embeds groups (N)
- Redundant to track relationship on both sides
- Duplicated data must be updated for consistency

Many to Many
General Recommendation
- Depends on use case
  1. Simple address book
     - Contact references groups
  2. Corporate email groups
     - Group embeds contacts for performance
- Exceptional cases
  - Scaling: maximum document size is 16MB
  - Scaling may affect performance and working set
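For the simple address book case, "contact references groups" can be sketched with an array of group ids on each contact. The dictionaries below stand in for two collections; the sample data and the helpers `groups_of` and `members_of` are hypothetical.

```python
# Stand-in for a groups collection, keyed by _id.
groups = {
    1: {"_id": 1, "name": "Family"},
    2: {"_id": 2, "name": "Work"},
}

# Each contact tracks its side of the many-to-many association in an array.
contacts = [
    {"_id": 10, "name": "Ada", "group_ids": [1, 2]},
    {"_id": 11, "name": "Grace", "group_ids": [2]},
]


def groups_of(contact):
    """Resolve a contact's group references to group names."""
    return [groups[group_id]["name"] for group_id in contact["group_ids"]]


def members_of(group_id):
    """Find the contacts whose group_ids array contains the given group."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]


# Because groups are referenced rather than embedded, a rename is one update
# and no duplicated copies have to be kept consistent.
groups[2]["name"] = "Colleagues"
```

The association is stored on one side only, so there is no second reference to keep in sync; the cost is an extra lookup when group details are needed.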
[Diagram: contact holds group_ids: [ ], referencing groups (N to N)]

Document model - holistic and efficient representation
[Diagram: the Contacts document with its embedded and referenced entities]
- Contacts: name, company, title
  - addresses: type, street, city, state, zip_code
  - phones: type, number
  - emails: type, address
  - thumbnail: mime_type, data
  - twitter: name, location, web, bio
- Portraits: mime_type, data
- Groups: name
Contact document example

{
  name : "Gary J. Murakami, Ph.D.",
  company : "MongoDB, Inc.",
  title : "Lead Engineer",
  twitter : {
    name : "Gary Murakami",
    location : "New Providence, NJ",
    web : "http://www.nobell.org"
  },
  portrait_id : 1,
  addresses : ,
  phones : ,
  emails :
}
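A sketch of how such a document might be used from application code, with a Python dictionary standing in for the stored document. The `phones` entries are hypothetical values added only for illustration (the slide leaves that array unfilled), and `phones_of_type` is a hypothetical helper.

```python
contact = {
    "name": "Gary J. Murakami, Ph.D.",
    "company": "MongoDB, Inc.",
    "title": "Lead Engineer",
    # One to one, embedded: read together with the contact.
    "twitter": {"name": "Gary Murakami", "location": "New Providence, NJ"},
    # One to one, referenced: the bulky portrait lives elsewhere.
    "portrait_id": 1,
    # One to many, embedded. Hypothetical sample values, not from the slide.
    "phones": [
        {"type": "mobile", "number": "555-0100"},
        {"type": "work", "number": "555-0101"},
    ],
}


def phones_of_type(doc, phone_type):
    """Return the numbers of the embedded phones with the given type."""
    return [p["number"] for p in doc.get("phones", []) if p.get("type") == phone_type]


mobile_numbers = phones_of_type(contact, "mobile")
twitter_name = contact["twitter"]["name"]  # no second fetch needed
```

A single read returns the contact together with its twitter profile and phones, which is the "full contact info all at once" case from the one-to-one and one-to-many recommendations.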
Working Set
To reduce the working set, consider
- Reference bulk data, e.g., portrait
- Reference less-used data instead of embedding
- Extract into referenced child document
  - Also for performance issues with large documents

General Recommendations

Legacy Migration
1. Copy existing schema & some data to MongoDB
2. Iterate schema design development
   Measure performance, find bottlenecks, and embed
   1. one to one associations first
   2. one to many associations next
   3. many to many associations
3. Migrate full dataset to new schema

New Software Application?
- Embed by default
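The "embed" step of a legacy migration can be sketched as folding child rows into their parent documents. The rows below are hypothetical stand-ins for data copied from a relational schema, and `embed_children` is an illustrative helper, not part of any MongoDB tooling.

```python
from collections import defaultdict

# Rows as they might look right after copying the relational schema as-is.
contact_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
phone_rows = [
    {"id": 10, "contact_id": 1, "type": "mobile", "number": "555-0100"},
    {"id": 11, "contact_id": 1, "type": "work", "number": "555-0101"},
    {"id": 12, "contact_id": 2, "type": "home", "number": "555-0102"},
]


def embed_children(parents, children, foreign_key, field):
    """Return parent documents with their matching child rows embedded as an array."""
    grouped = defaultdict(list)
    for child in children:
        # Drop the keys that existed only to express the join.
        embedded = {k: v for k, v in child.items() if k not in ("id", foreign_key)}
        grouped[child[foreign_key]].append(embedded)
    documents = []
    for parent in parents:
        doc = {"_id": parent["id"]}
        doc.update({k: v for k, v in parent.items() if k != "id"})
        doc[field] = grouped.get(parent["id"], [])
        documents.append(doc)
    return documents


contacts = embed_children(contact_rows, phone_rows, "contact_id", "phones")
```

Running the same helper once per association, in the order the list suggests, moves the schema toward the embedded form step by step while performance is measured in between.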
Embedding over Referencing
- Embedding is a bit like pre-joined data
  - BSON (Binary JSON) document ops are easy for the server
- Embed (90/10, per the following rule of thumb)
  - When the one or many objects are viewed in the context of their parent
  - For performance
  - For atomicity
- Reference
  - When you need more scaling
  - For easy consistency with many to many associations without duplicated data
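The reference side of this rule of thumb, together with the earlier working-set advice to reference bulk data such as a portrait, can be sketched as moving an embedded value into its own collection and leaving an id behind. `extract_to_reference`, the `portraits` dictionary, and the sample contact are hypothetical.

```python
def extract_to_reference(parent, field, child_collection, child_id):
    """Move parent[field] into child_collection, leaving '<field>_id' behind."""
    child = dict(parent.pop(field))
    child["_id"] = child_id
    child_collection[child_id] = child
    parent[field + "_id"] = child_id


portraits = {}  # stand-in for a separate portraits collection
contact = {
    "name": "Ada",
    # Bulk data that is rarely needed when listing contacts.
    "portrait": {"mime_type": "image/jpeg", "data": bytes(100_000)},
}

extract_to_reference(contact, "portrait", portraits, 1)

# The contact document is now small; the portrait costs a second lookup,
# paid only when it is actually displayed.
portrait = portraits[contact["portrait_id"]]
```

The trade-off matches the list above: the embedded form is read in one operation, while the referenced form keeps frequently read documents small at the price of an extra fetch.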
It's All About Your Application
- Programs + Databases = (Big) Data Applications
- Your schema is the impedance matcher
  - Design choices: normalize/denormalize, reference/embed
  - Melds programming with MongoDB for best of both
  - Flexible for development and change
- Programs × MongoDB = Great Big Data Applications