Contents
Introduction
What is NoSQL?
    Key-Value Stores
    Document Stores
    Wide Column Stores
    Graph Databases
        From Relational to Relationships
        Graphs and ORM
NoSQL Database Common Traits
    Shared Legacy: MapReduce, Hadoop, BigTable and HBase
    NoSQL Database Consistency
    Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs
    NoSQL Indexing
NoSQL options on the Windows Azure Platform
    Azure Table Storage
    SQL Azure XML Columns
    SQL Azure Federation
    OData
        What the Support Means
    Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive
    On-Premise Technologies
        SQL Server 2008/2008R2 Beyond Relational Features
        SQL Server Parallel Data Warehouse Edition
        Microsoft Research Dryad
NoSQL Upsides, Downsides
    Upsides
        Lightweight, low-friction
        Minimalist tool requirements
        Sharding & Replication
        Web Developer-Friendliness
        Cross-Platform, Cross-Device Operation
    Downsides
        Optimizations Have a Price
        Requirement to Query using a Procedural Language
        Necessity to Scale Manually
        Primitive Tooling
        Lack of ACID Transactional Capabilities in Some Products
Conclusion: Relational's Continued Indispensability in Line-of-Business
Introduction
Just at the time when the database market seemed to many to be almost completely mature, a group of non-relational data stores, collectively categorized as NoSQL databases, has attracted significant attention. These databases are often employed in public, massively scaled Web site scenarios, where traditional database features matter less, and fast fetching of relatively simple data sets matters most. Many of these databases employ parallelized query mechanisms and horizontal partitioning, and allow storage of heterogeneous, loosely-schematized data records.
With so much developer mindshare being focused on the Web these days, and with the constant thirst for performance amongst technologists, especially for large Web applications, it's no wonder that NoSQL databases are seen favorably and used by an enthusiastic population of developers. As Cloud computing grows, and given the proclivity of developers to conflate Web computing and scale with Cloud computing and elasticity, interest in NoSQL databases amongst cloud developers is equally unsurprising. Together, these streams of interest and visibility are significant; understandably, then, even users of traditional, relational databases are exploring the question of whether NoSQL technology is something they should use, too.
There's no free lunch, though. Although NoSQL databases do facilitate the performance and availability that public Web properties sometimes require, the cost can be great. Things that users of a Relational Database Management System (RDBMS) would take for granted, including some or all of transactional, atomic writes; indexing of non-key columns; query optimizers; and declarative, set-oriented query, are sacrificed in the NoSQL world. In certain scenarios, that sacrifice is justified and acceptable. But in many others, including line-of-business applications, that sacrifice is much less reasonable.
As with anything in the software world, when technologies enter the realm of phenomena, the prudent thing to do is deconstruct and demystify them, understand and enumerate their various capabilities, then judge if those capabilities merit the enthusiasm and justify a disruption. Specifically, in the realm of cloud computing with the Microsoft stack, i.e. Windows Azure and SQL Azure, important questions arise with respect to NoSQL, and need to be answered. What exactly is NoSQL, and what characterizes its various subcategories? Are individual facets of NoSQL database architectures available to Azure developers? Are they sufficient or will only a full-blown NoSQL technology fulfill most requirements? Where in the Azure stack do these NoSQL technologies sit? For the types of applications that .NET and SQL Server practitioners build, is NoSQL better than relational? Is it even as good?
These questions must be explored and answered before the larger question of NoSQL's (or relational's) overall efficacy can be judged. In this paper, we will define NoSQL, explore some of its history, review the various types of NoSQL databases, and understand their respective features. We will determine the commonalities between the various NoSQL subcategories and try to determine what basket of features seems to attract developers the most. We'll examine the scenarios where use of NoSQL makes the most sense. We'll distill the enumeration of NoSQL features down to the overall tradeoffs between NoSQL and relational databases.
We will also review the various components of the Azure stack that offer NoSQL technology, or capabilities that are comparable to those found in NoSQL databases. We will look at Windows Azure Storage, new and imminent features in SQL Azure, and even ways to deploy non-Microsoft NoSQL databases to the Azure cloud, to make them usable from .NET code that is also deployed there. By the end of this paper, readers should have a good understanding of what NoSQL is all about and whether individual NoSQL features, full-fledged NoSQL databases or continued use of relational technology will work best for them. Let's now define NoSQL by examining the general use cases that it serves. We'll also discuss the subcategories of NoSQL and take a more detailed look at each of them.
What is NoSQL?
There are scenarios in the software development world where data management is required, but what many of us might think of as a full-fledged database is not. Think of that application you wrote once that had a small amount of data to store, and did it using flat files, so you could avoid creating a database. Maybe you needed to store a few bits of information about the current user; maybe you needed to store application settings, or application state information, like window size and position; or perhaps you needed to store and retrieve actual content (be it raw text, images, or media) and the file system seemed to make more sense than a relational database as the repository.
Now imagine an application like that one you wrote, but which ran on the Web and needed to serve a vast array of users distributed across the globe, many of them concurrently. You would find that your database needs, while still technically modest in terms of query complexity, would almost certainly outstrip what you could do comfortably using the file system. You'd need a server, or even a globally distributed cluster of servers. The server or cluster would need to be highly scalable to meet the demands of a popular Web-based application, and very fast at performing these relatively simple, discrete store and fetch operations. You would need a database, but probably not the relational one you're used to.
The grouping of database engines collectively referred to as NoSQL is optimized for these workloads. Most of them sport distributed architectures as a core feature. Many of them are Apache or independent open source projects. NoSQL databases are good at what they do, primarily by dispensing with many of the tenets of relational database management. Many NoSQL databases trade off ACID (atomicity, consistency, isolation and durability) guarantees in favor of very high performance in the broad-scale, simple store-and-retrieve scenario.
And as we mentioned already, NoSQL databases, to varying degrees, even allow for the schema of data to differ from record to record. The CAP theorem says that databases may only excel at two of the following three attributes: consistency, availability and partition tolerance. Relational databases favor the first and last of those three properties; NoSQL databases favor the last two. In other words, NoSQL intentionally de-emphasizes the rules and functionality of consistency that many database administrators and developers think of as the very prerequisites of database management.
In his paper "Amazon's Dynamo" (Dynamo is the online retailer's foundational NoSQL database), Werner Vogels, Amazon.com's Chief Technology Officer, describes why such an approach is appropriate: "Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS." In other words, various systems on the Web, many of which are consumer-facing, don't have sophisticated database needs, but they nonetheless have a huge burden. They must carry out their simple needs very, very quickly.
NoSQL databases handle these workloads well, but they make serious concessions on otherwise mainstream database needs in order to do it. That is well-justified, but not always well-understood; in fact, there exist NoSQL practitioners who advocate the usage of NoSQL as a general database technology applicable to the mainstream of application database needs. Such advocacy has caused some relational database customers to have concerns that they should perhaps switch to NoSQL databases even for line-of-business (LOB) applications. Customers have these concerns despite the fact that most LOB apps require transactional guarantees, and are well-served by normalized design and formal schema.
This can be a controversial state of affairs and we hope to sort out that controversy. For now though, let's just say that NoSQL databases work well in certain scenarios, and that sketching out what those scenarios are, and what they are not, is an important goal of this paper. To help enumerate those scenarios, it's best that we discuss four subcategories that NoSQL databases tend to break down into. Enumerations of such subcategories tend to vary, but they usually include Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. Each NoSQL subcategory serves certain scenarios best.
To understand core NoSQL scenarios as best we can, let's explore the various NoSQL subcategories and the specific types of applications and workloads they support most ably.
Key-Value Stores
The Key-Value Store subcategory (summarized graphically in Figure 1) is perhaps the mother of all NoSQL database types. Most NoSQL databases feature key-value mechanisms, even if only behind the scenes. NoSQL databases that belong to the explicit Key-Value Store category use their namesake construct as the basic unit of storage. A key-value pair might consist of a key like "Phone Number" that is associated with a value like "(212) 555-1212". Key-Value Stores contain records whose entire content is made up of such pairs; the structure of one record can differ from the others in the same collection.
Figure 1: Key-Value Stores often use the nomenclature of tables and rows, but the latter simply contain collections of key-value pairs, which vary from row to row.
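The structure described above can be pictured with a short Python sketch. It is only an illustration built on an in-memory dictionary, not the API of any particular Key-Value Store; the table name, the row keys and the put and get helpers are assumptions made for the example.

```python
# Hypothetical in-memory "table": row key -> dict of key-value pairs.
# This mimics the shape of the data only; a real store would be
# distributed, fault tolerant and persistent.
customers = {}

def put(table, row_key, pairs):
    """Store (or merge) a set of key-value pairs under a row key."""
    table.setdefault(row_key, {}).update(pairs)

def get(table, row_key, key, default=None):
    """Fetch a single value by row key and key; no joins, no query language."""
    return table.get(row_key, {}).get(key, default)

# Two rows in the same table with different sets of keys.
put(customers, "cust-1", {"Phone Number": "(212) 555-1212", "Name": "Chris"})
put(customers, "cust-2", {"Name": "Pat", "Landing Page": "/home"})

phone = get(customers, "cust-1", "Phone Number")
missing = get(customers, "cust-2", "Phone Number")  # key absent in this row
```

Note that the second row simply lacks a "Phone Number" key; nothing had to be declared or altered to allow that.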
Werner Vogels, "Amazon's Dynamo": http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
If you do much programming, you'll recognize this construct right away. That's because collections, dictionaries and associative arrays in the programming world work on the same principle. Data caches work on the key-value principle as well. In fact, one prominent Key-Value Store, MemcacheDB, is API-compatible with the Memcached open source cache. The parallels between Key-Value Stores on the one hand, and collections, dictionaries, associative arrays and caches on the other, are more than academic; they are significant. They show that NoSQL databases work well in circumstances where data retrieval needs to be cache-like in speed and where the data which must be stored and retrieved consists of small, simple collections of attributes and values.
Applications where Key-Value Stores would work well include anything where lists (like product categories, individual product attributes, shopping cart contents and top n best-selling products) or individual values (like color schemes, a landing page URI, or a default account number) must be maintained. Values can consist of long text content, not just numeric and short string data. As such, content like comments, reviews, status messages or even private emails can be stored in a Key-Value Store. Most of this data is non-hierarchical, so the lack of relational logic or join constructs is acceptable.
Some of this key-value-appropriate data (though probably not the long text content) is akin to lookup data, or configuration and preference data, in smaller applications. For a desktop app, we could imagine this data might be stored in a configuration file or a small, offline database. We could also imagine that much of it might do well to be loaded in memory upon application startup. For a consumer-facing Web app, the data is similarly straightforward, but the storage technology itself must be more capable. The data must live in a repository that is distributed, fault tolerant, fast and highly available.
Beyond MemcacheDB and Dynamo lie other Key-Value Stores. Project Voldemort is an open source Key-Value Store that originated at LinkedIn; and Dynomite, Kai and Riak are open source derivatives of Dynamo (which is not open source, nor publicly available, even though its architecture has been disclosed through published papers). Before we go on to describe other NoSQL database types, we must reiterate that almost all of them, whether physically or conceptually, build upon Key-Value Store principles. Therefore you should expect their applications to be more specialized than, but not wholly distinct from, those of Key-Value Stores themselves.
Document Stores
Document Stores are NoSQL databases which treat what might be otherwise called records or rows as documents. As with Key-Value Stores, each record can have a structure widely differentiated from the others. Each document consists of a set of keys and values, which can be compared to a relational table's field names and values. The Document Store data structure is summarized in Figure 2.
Two leading Document Stores, CouchDB and MongoDB, each use JavaScript data types for the values stored in their documents. Because of this, their documents can be thought of as JavaScript objects and can, in fact, be written and read in JSON (JavaScript Object Notation) format. That doesn't mean Document Stores equate to Object Databases, but it does mean that Document Stores have an affinity with JavaScript programming and programmers. In fact, the native stored procedure/scripting language for both CouchDB and MongoDB is JavaScript itself.
Documents can also contain attachments, making document stores useful for content management. The fact that certain Document Stores feature versioning of their documents (i.e. old versions are retained and all versions are numbered) makes this all the more so. CouchDB and MongoDB have been used for an array of public-facing Web application types including blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage and even Twitter clients.
An important facet of Document Stores is that the documents themselves can be addressed by unique URLs. And given the HTTP and URL orientation, document databases are automatically REST-friendly, as their APIs bear out. In the case of CouchDB, the HTTP orientation is developed to the point where the database can function as its own Web application server. Here's how: so-called Show Functions in CouchDB (JavaScript functions that render HTML with the return statement) can be stored in special documents called design documents, and each function within is accessible via URL. This means that entire Web applications can be implemented in a document database. Users visit a URL, code runs on the server and content is returned via the HTTP response stream, just as it would be with classic ASP, node.js, ASP.NET Web Pages or PHP.
This HTTP and application orientation distinguishes Document Stores from Key-Value Stores, the latter of which are more general purpose in their implementation and application. That said, there are some NoSQL taxonomies which do not recognize the Document Store category and instead label its members as Key-Value Stores. As you will see, the remaining two NoSQL subcategories utilize key-value technology as well.
Figure 2: Document Stores contain JSON objects, referred to as documents, each of which has a schema-free set of properties and values. Values may contain attachments, point to other documents, or directly contain them.
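To make the document model concrete, here is a minimal Python sketch that stores JSON documents by ID in an in-memory dictionary. It does not use the CouchDB or MongoDB APIs; the document IDs, field names and the save, load and resolve helpers are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical in-memory document store: document ID -> JSON text.
documents = {}

def save(doc_id, doc):
    """Serialize a document (a dict of keys and values) to JSON and store it."""
    documents[doc_id] = json.dumps(doc)

def load(doc_id):
    """Read a document back as a Python dict."""
    return json.loads(documents[doc_id])

def resolve(doc, field):
    """Follow a field that points to another document by its ID."""
    return load(doc[field])

# Two documents with entirely different structures in the same store.
save("post-1", {
    "type": "blog_post",
    "title": "Hello",
    "tags": ["nosql", "azure"],
    "author": {"name": "Chris", "city": "Auckland"},  # directly contained document
})
save("event-7", {
    "type": "event_log",
    "level": "warning",
    "message": "disk nearly full",
    "related": "post-1",  # points to another document
})

author_city = load("post-1")["author"]["city"]
related_title = resolve(load("event-7"), "related")["title"]
```

The first document contains another document directly (the author), while the second merely points to one by ID, mirroring the two options mentioned in the Figure 2 caption.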
Wide Column Stores
between the table and column level lie various intermediate structures that vary by product. For example, Apache Cassandra (originated by Facebook) features Super Columns. Hypertable and Apache HBase feature Column Families, and Google's BigTable features Tablets. The hierarchical structure and some of the varying nomenclature of Wide Column Stores are summarized in Figure 3.
Although the schema within the intermediate structures can vary from row to row, tables and the intermediate structures themselves must be declared. Therefore, Wide Column Stores, while they tolerate schema variation at the leaf column level, are not completely schema-free. One could reasonably argue, in fact, that schema changes at the non-leaf level in Wide Column Stores are more disruptive than changes to table schemas in relational databases.
Wide Column Stores work well for a subset of requirements that Key-Value Stores accommodate, and many adopters of this category of NoSQL database cite the performance factors, over the structural ones, as reasons they chose it. But, clearly, Wide Column Stores are best for semi-structured data, rather than data whose structure is completely variable from row to row.
As an example, in a product catalog, we may have a collection of items, each of which has a size and a rating associated with it, and we may want to store these items together in a table. But certain items' sizes may be represented by height, width and depth, others by radius, and still others by weight. The rating may be a star rating on a 1-5 scale (e.g. for a book), or a collection of sub-ratings on various attributes (e.g. freshness, flavor, color, moistness). Accommodating a grouping of entities with high-level characteristics in common, but with differing context-specific attributes, is one area where Wide Column Stores do well.
In the relational world, traditionally, such context-specific attributes would each need to be stored in separate tables, with a foreign key in the main table to link them. Joins and application-level merging of the datasets might be necessary. But Wide Column Stores allow such differently nuanced data to commingle in the same tables and query result sets.
Figure 3: Wide Column Stores contain tables (indicated above as T); Cassandra calls them supercolumn families (shown as SCF). These contain a key and columns (C) which consist of name/value pairs. Columns are subdivided into column families (CF), which are known as super columns (SC) in Cassandra. Columns are schema-free, but higher-level objects must be declared.
Recent versions of major RDBMS products offer new features to accommodate such context-specific attributes without resorting to separate attribute tables. Such features in SQL Server and SQL Azure will be discussed later in this paper.
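The product catalog example can be sketched in Python as nested dictionaries. This is a simplified model, not the data model or API of Cassandra, HBase or BigTable; the family names, row keys, column names and the put_column helper are assumptions made for the illustration. The one rule it enforces is the one described above: column families must be declared, while the columns inside them are free to vary by row.

```python
# Column families must be declared up front (hypothetical names).
DECLARED_FAMILIES = {"size", "rating"}

# Hypothetical table: row key -> {column family -> {column name -> value}}.
catalog = {}

def put_column(row_key, family, column, value):
    """Write one column; the family must exist, the column name is free-form."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"undeclared column family: {family}")
    catalog.setdefault(row_key, {}).setdefault(family, {})[column] = value

# A book: size as height/width/depth, rating as a single star value.
put_column("book-1", "size", "height_cm", 23)
put_column("book-1", "size", "width_cm", 15)
put_column("book-1", "size", "depth_cm", 3)
put_column("book-1", "rating", "stars", 4)

# A ball: size as a radius only.
put_column("ball-1", "size", "radius_cm", 11)

# A cake: size as a weight, rating as several sub-ratings.
put_column("cake-1", "size", "weight_g", 900)
put_column("cake-1", "rating", "freshness", 5)
put_column("cake-1", "rating", "flavor", 4)
```

All three items live in the same table and can be returned in the same result set, even though no two of them share the same leaf columns.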
Graph Databases
Graph databases recognize entities in a business or other domain, and explicitly track the relationships between them. In the graph database world, these entities are called nodes and the relationships between them are called edges; all of these terms come from mathematical graph theory, as does this NoSQL database subcategory's name. An example of a graph database assertion (the fundamental atomic unit of data expression) might be:
Chris city Auckland
where Chris and Auckland are nodes and city is an edge.
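An assertion of this kind can be modeled in Python as a three-part tuple. The sketch below is an in-memory illustration only, not the API of any graph database; apart from the "Chris city Auckland" assertion taken from the text, the extra node (Pat), the extra edge (knows) and the helper functions are hypothetical.

```python
from collections import namedtuple

# One assertion: a source node, an edge, and a target node.
Assertion = namedtuple("Assertion", ["subject", "edge", "target"])

graph = [
    Assertion("Chris", "city", "Auckland"),  # the example from the text
    Assertion("Pat", "city", "Auckland"),    # hypothetical additional data
    Assertion("Chris", "knows", "Pat"),      # hypothetical additional data
]

def targets(subject, edge):
    """Follow an edge outward from a node."""
    return [a.target for a in graph if a.subject == subject and a.edge == edge]

def sources(edge, target):
    """Follow an edge backward to the nodes that point at a target."""
    return [a.subject for a in graph if a.edge == edge and a.target == target]

chris_city = targets("Chris", "city")
auckland_residents = sources("city", "Auckland")
```

Because the relationship is stored as data in its own right, a new kind of edge can be added without altering any schema.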
begs the question of whether object databases belong in the NoSQL camp or even of whether they are in fact synonymous with graph databases. There really are no rules or strict definitions to provide authoritative answers to these questions, but there are differences in intent between graph and object databases. Object databases typically are schema-based (even if the schema describes a class rather than a table) and are focused on entities and their properties. Graph databases are designed to accommodate slowly- or even rapidly-changing schemas and focus on relationships between entities more than the entities themselves. Popular graph databases include AllegroGraph, Neo4j and Twitter's FlockDB.
approach called map-reduce acknowledges and addresses this conundrum. Specifically, the process of distributing the query across multiple agents is the Map step, and the process of coalescing the results into a single result set is the Reduce step. Map-reduce is a general algorithm, and is prevalent in functional programming languages, including F#, which support the notion of map and reduce functions. MapReduce (without the hyphen) is the patented software framework from Google that the company applies in the realm of managing large datasets over clusters or other distributed topologies. Hadoop is the top-level Apache project which implements map-reduce as a generalized, highly parallel, divide-and-conquer batch job task manager.
Google MapReduce/BigTable and Apache Hadoop/HBase have their fingerprints all over most NoSQL databases. For example, Apache CouchDB, one of the document store databases already discussed, is, according to its Web site on apache.org, "queried and indexed in a MapReduce fashion." Some would argue that CouchDB's map and reduce steps differ conceptually from those in MapReduce itself. Nonetheless, the overarching map-reduce approach is the inspiration for the design of many NoSQL products.
As effective as these mechanisms can be, they also introduce extra work for the database developer. That's because instead of providing a declarative language over distributed storage that could then be implemented using map-reduce functionality under the covers, the architectures' designers focused primarily on the raw processing approach and never added a language abstraction. In the world of line-of-business applications, the declarative power of SQL provides productivity that most organizations count on. Map-reduce based systems, by and large, cannot provide that productivity.
A summary of the various NoSQL database subcategories, and the suitability of each to different scenarios and requirements, including map-reduce, is presented in table form in Figure 5.
Figure 5: This chart shows the applicability of different NoSQL database types to different needs or scenarios. Notice that wide column stores are more special-purposed than are the other NoSQL subcategories, which are applicable in a variety of scenarios.
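The Map and Reduce steps described in this section can be sketched with Python's built-in map function and functools.reduce. This is a single-process illustration of the general map-reduce idea, not Google's MapReduce framework or Hadoop; the partitions, the sample data and the function names are hypothetical, and a real system would run the map step on separate nodes in parallel.

```python
from functools import reduce

# Hypothetical data, already split across three partitions (shards).
# Each record is (product category, quantity sold).
partitions = [
    [("book", 3), ("toy", 1)],
    [("book", 2)],
    [("toy", 4), ("game", 5)],
]

def map_step(partition):
    """Run the query against one partition and return a partial result."""
    totals = {}
    for category, quantity in partition:
        totals[category] = totals.get(category, 0) + quantity
    return totals

def reduce_step(left, right):
    """Coalesce two partial results into one."""
    merged = dict(left)
    for category, quantity in right.items():
        merged[category] = merged.get(category, 0) + quantity
    return merged

# Map: distribute the same query to every partition.
partials = list(map(map_step, partitions))

# Reduce: fold the partial results into a single result set.
result = reduce(reduce_step, partials, {})
```

The equivalent SQL would be a single declarative GROUP BY query; here the developer writes both steps by hand, which is the extra work the text refers to.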
Store, essentially imposes a logical super column hierarchy over key-value pairs. Key-Value Stores underlie most other subcategories, either in terms of technique (such as how CouchDB's documents are actually key-value structures, in an overt fashion) or in implementation (such as how edges and nodes in a graph database can be stored as key-value pairs as well, but behind the scenes). Document Stores, Wide Column Stores, and Graph Databases are in some senses akin to domain-specific languages (DSLs) in the programming world. While most NoSQL databases utilize key-value constructs, distributed architectures and sharding, and allow for schema-free databases, the various NoSQL subcategories provide different data interfaces, each of which works best in a subset of scenarios.
NoSQL Indexing
Despite the DSL analogy above, the common key-value substrate of most NoSQL databases does not render the subcategory a mere trivial abstraction. The quite wide spectrum of indexing features in the various NoSQL databases makes this clear. Some NoSQL databases index on little else than the keys used for rows/entities/documents and/or partitions. Others go a bit beyond this. For example, CouchDB indexes documents only on their IDs and sequence (version) numbers, but it also creates indexes on views. The AllegroGraph Graph Database, meanwhile, indexes everything (id, subject, predicate, object and graph), automatically.
Some Key-Value and Wide Column Stores support so-called secondary indexes, a generic term for an index built on the value of a property/column that is not the key. But secondary indexes are relatively new features in some databases and still a bit immature. For example, Cassandra added secondary indexes in version 0.7, which was just released on January 9, 2011. These secondary indexes are essentially hash indexes only; support for bitmapped indexes, with which range criteria could be satisfied, is in the works for a future release.
In the absence of secondary index support, some developers implement them on their own. The common approach is to create a second table containing the values of the indexed column and their corresponding row keys from the main table. This requirement is somewhat emblematic of NoSQL databases in general: developers may need to implement on their own what could long be taken for granted in an RDBMS. Again, in some situations, the tradeoff is deemed reasonable given the performance and availability requirements, but the price should not be understated.
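The do-it-yourself approach described above (a second table mapping indexed values back to row keys) can be sketched in Python as follows. Both tables are plain in-memory dictionaries; the table names, the indexed column ("city") and the helper functions are hypothetical, and a real implementation would also have to cope with the two writes not being atomic.

```python
# Main table: row key -> row (a dict of columns).
users = {}

# Hand-built secondary index: value of the "city" column -> set of row keys.
users_by_city = {}

def put_user(row_key, row):
    """Write a row and keep the index table in step with it."""
    old = users.get(row_key)
    if old is not None and "city" in old:
        # Remove the stale index entry before the row is overwritten.
        users_by_city[old["city"]].discard(row_key)
    users[row_key] = row
    if "city" in row:
        users_by_city.setdefault(row["city"], set()).add(row_key)

def find_by_city(city):
    """Look up rows by a non-key column via the index table."""
    return [users[key] for key in sorted(users_by_city.get(city, ()))]

put_user("u1", {"name": "Chris", "city": "Auckland"})
put_user("u2", {"name": "Pat", "city": "Auckland"})

auckland_names = [row["name"] for row in find_by_city("Auckland")]
```

Every write now touches two tables, and the application code, rather than the database, is responsible for keeping them consistent.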
products and features) and which aspect of NoSQL technology each one implements. As you will see, elements of NoSQL computing can pop up in some unexpected places.
This analogy works best if we think of XML documents as a non-hierarchical storage mechanism. If we think of them as hierarchical (i.e. through the use of XML attributes or child elements) then an analogy with Wide Column Stores becomes more appropriate.
tables. Prior to XML in the database, the only way to accommodate changing schemas was to build out vertical tables, whose column values were stored as rows in attribute value tables (as key-value pairs, in fact). So if we consider one of the major value propositions of NoSQL, namely flexibility around changing schemas, we see that very scenario is the inspiration for the XML column feature in SQL Server (and now in SQL Azure). Using XML for NoSQL computing needs is not a kluge, but rather a sensible alignment of interests. It is important to note, however, that unlike on-premise editions of SQL Server, SQL Azure does not support indexes on XML columns. As long as your tables contain a scalar primary key column, you'll have the option of a key-based index, though you will lack the equivalent of a secondary index.
In this way, a Federation Key is similar to an Azure Table Storage Partition Key.
distribution simpler and it also provides the foundation for a full map-reduce-style fan out query capability that could appear in a future release.
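Until such a capability exists, a fan-out query can be approximated in application code: send the same query to every federation member, then merge the partial results. The sketch below assumes in-memory lists as stand-ins for federation members; `members`, `query_member` and `fan_out` are hypothetical names, not part of any SQL Azure API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical federation members, each holding one range of customer IDs.
members = [
    [{"customer_id": 1, "total": 40.0}, {"customer_id": 2, "total": 15.5}],
    [{"customer_id": 101, "total": 99.0}],
    [{"customer_id": 201, "total": 7.25}, {"customer_id": 202, "total": 60.0}],
]


def query_member(rows, min_total):
    """Run the same filter against a single member (the 'map' step)."""
    return [row for row in rows if row["total"] >= min_total]


def fan_out(members, min_total):
    """Query all members concurrently and merge the results (the 'reduce' step)."""
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        partials = pool.map(lambda rows: query_member(rows, min_total), members)
        merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda row: row["customer_id"])


result = fan_out(members, 40.0)
```

In a real deployment each call to `query_member` would open a connection to one federation member; the merge step is where cross-member sorting or aggregation has to be coded by hand.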
OData
OData is Microsoft's generalized XML data serialization format, based on the ATOM feed standard, together with a RESTful API used to query, create and update data in the repositories it wraps. OData debuted as the transmission format and API for data exposed by what is now called WCF Data Services (originally known as project Astoria, then as ADO.NET Data Services). Typically, Astoria services act as RESTful wrappers around Entity Framework data models. But with the generalization of the data format and REST implementation, OData is now used by Microsoft and others to expose a variety of data sources. On-premise Microsoft products and technologies that support OData interfaces include SQL Server Reporting Services in SQL Server 2008 R2, SharePoint 2010 lists and Dynamics CRM 2011. In the Azure world, both Azure Table Storage and SQL Azure support OData interfaces to their respective tables. Azure Storage does so natively, while SQL Azure exposes its OData interface via a pre-release tool (SQL Azure OData Service) that is available from SQL Azure Labs at the time of this writing. By logging into the tool and enabling OData access with a single checkbox (either for anonymous access or access by specific named users), the OData interface is made available immediately; there is no coding required to enable it. What's more, SQL Azure provides this RESTful interface while maintaining its conventional Tabular Data Stream (TDS) interface. As such, SQL Azure provides developer simplicity while retaining its native interface, and the performance necessary for heavy LOB workloads. Windows Azure Marketplace DataMarket leverages OData as its native format for publishing the free and subscription-based data feeds that comprise the service. This makes the OData format itself especially valuable, and arguably more so than more generic XML data serialization formats, as it is at once an API tool and a channel to commercial or public distribution of data.
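To make the RESTful side of OData concrete, the sketch below builds a query URL using the standard `$filter`, `$orderby` and `$top` system query options. The service root and the `Products` entity set are placeholders invented for illustration, and `odata_url` is a hypothetical helper, not part of any Microsoft library.

```python
from urllib.parse import quote


def odata_url(service_root, entity_set, filter_expr=None, order_by=None, top=None):
    """Compose an OData query URL from a few common system query options."""
    options = []
    if filter_expr:
        # Keep quotes and parentheses readable; percent-encode spaces and the rest.
        options.append("$filter=" + quote(filter_expr, safe="'()"))
    if order_by:
        options.append("$orderby=" + quote(order_by, safe=","))
    if top is not None:
        options.append("$top=" + str(int(top)))
    url = service_root.rstrip("/") + "/" + entity_set
    return url + ("?" + "&".join(options) if options else "")


# Placeholder service root; a real service would publish its own.
url = odata_url(
    "https://example.invalid/odata.svc/",
    "Products",
    filter_expr="Price gt 20 and Category eq 'Books'",
    order_by="Price desc",
    top=5,
)
```

Issuing an HTTP GET against such a URL returns the matching entities as a feed, which is why any HTTP-capable client, from a browser script to a BI tool, can consume OData without a database driver.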
Even in advance of such support, considering that map-reduce jobs must themselves be explicitly coded or scripted in many NoSQL databases, the notion of writing an Azure Federation fan-out query through code seems a reasonable task by comparison.
ability to consume and analyze OData-formatted feeds using Azure's RESTful services, Azure provides self-service BI to customers and not just APIs to developers. This presents a very clear business case that various NoSQL databases may be hard-pressed to counter.
Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive
If the desire or specific need is present to run a particular NoSQL database product, Worker and Virtual Machine Roles make it possible to accommodate this setup on Azure, provided the NoSQL product has a Windows Server-compatible version (and most do). The VM role allows customers to build their own machine image, upload it as a virtual hard drive (VHD) file to their Azure accounts, and then spin up instances of that image. Any properly licensed software can be installed in that machine image, including various free NoSQL products. Likewise, a Worker Role can accommodate such customization, but any products added to the baseline image must be xcopy-deployable or silently installed during the Worker Role's startup task or its code's RoleEntryPoint.OnStart method. There is one complication though: since Worker and even VM role instances may be recycled at any point, local hard drive storage within the instance may at any time revert back to its baseline image state. So unless the data in the instance is static and can itself be included with a VM Role image or placed on a Worker Role image in a scripted manner at startup, data storage becomes an issue. Luckily, the Windows Azure Drive offering provides a solution. Azure Drive allows a separate VHD file, hosted in Azure Blob Storage, to be mounted as a mapped drive, within the Worker/VM Role instance, through a simple .NET API. This means that a Worker/VM Role instance could have a NoSQL database product installed on it, configured to read and write data to a mapped drive, and as long as the drive were mounted before the NoSQL product initialized, all would be well. Scaling this to multiple Role instances gets tricky, since a given VHD can be used as a read/write volume by only one instance at a time, but there are ways to do it. Is this solution optimal? Probably not. 
But it is workable and still runs within the context of the Azure managed platform, from which you can avail yourself of the elasticity and other traits and features of the Azure fabric's management. For Microsoft customers who already have a substantial investment in SQL Server and/or .NET, this is no mere trivial benefit. And readers who find compelling the argument that NoSQL features and benefits can be had from existing Azure data products like Azure Storage, SQL Azure and their OData interfaces, will likely find the need to run dedicated NoSQL products an edge case. With that in mind, the Azure Worker Role/VM Role/Azure Drive option appears quite feasible.
On-Premise Technologies
Before we move on, three non-cloud technologies from Microsoft bear special mention, as they provide their own implementations of the non-tabular data, fan-out query and map-reduce job execution technology discussed in this paper.
Do not confuse Dryad's graphs with those of Graph Databases. Though the vocabulary is quite similar, the contexts are rather different.
Figure 6: These lists summarize the cloud and on-premise technologies from Microsoft which deliver genuine NoSQL technology (e.g. Azure Table Storage) and/or features that NoSQL databases offer and which resonate with NoSQL developers (like OData's HTTP/REST APIs). We also enumerate the option of running open source NoSQL database products in Azure compute instances, using Worker and VM Roles.
Upsides
Lightweight, low-friction
Probably the most touted attribute of NoSQL database systems is their ease of provisioning, deployment and integration into application code. Download, install, run a browser-based UI, create a new database, and away you go. Since the products are open source, the licensing worries are reduced. Since there are no schemas to declare with many NoSQL products, the database is ready as soon as you create it. And since many NoSQL APIs are HTTP- and REST-based, and, for a number of NoSQL databases, a multitude of client libraries for various programming environments are available, you can start coding quickly too.
Web Developer-Friendliness
Many Document Store databases use JavaScript Object Notation (JSON) as the internal storage format and JavaScript as an internal scripting language. Therefore, writing an AJAX application against a database in one of these products becomes much easier, as the objects in the application's JavaScript code can be directly written to, or read from, the database. This makes client-side (browser script-based) data access code quite feasible and simple.
Add to this the REST APIs used by most Document Store products, and the jQuery REST libraries available to Web developers, and it becomes clear that the suitability of NoSQL products to JavaScript/jQuery-based applications is high, with a reasonably low learning curve for many Web developers. For certain NoSQL products, especially Document Stores, it seems almost a core design principle that the databases function as an extension of JavaScript's implementation of object orientation. While it would probably be a stretch to call these NoSQL products object databases, that is a useful way to consider the intent with which they are built, with respect to JavaScript developers and their code.
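The absence of a mapping layer between application objects and stored documents can be shown in a few lines. The sketch below uses Python dictionaries as stand-ins for JavaScript objects and a hypothetical in-memory `DocumentStore` class; it is not the API of any particular Document Store.

```python
import json


class DocumentStore:
    """Toy document store that keeps each document as JSON text under an ID."""

    def __init__(self):
        self._docs = {}  # document ID -> JSON text

    def save(self, doc_id, obj):
        # The nested object is serialized as-is; no schema is declared anywhere.
        self._docs[doc_id] = json.dumps(obj)

    def load(self, doc_id):
        return json.loads(self._docs[doc_id])


store = DocumentStore()
note = {
    "title": "Groceries",
    "tags": ["home", "todo"],
    "items": [{"name": "milk", "qty": 2}],
}
store.save("note-1", note)
store.save("note-2", {"title": "Ideas", "pinned": True})  # different shape, same store
roundtrip = store.load("note-1")
```

The object read back has the same structure as the one written, nested lists included, which is the property that makes such stores feel like an extension of the application's own object model.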
Downsides
Having enumerated several facets of NoSQL databases that work out elegantly and advantageously, it's important to point out some of the NoSQL products' liabilities as well, especially with regard to productivity and suitability to line-of-business application development.
Recall that we had already drawn a parallel between Graph Databases and Object Databases. Here we do so for Document Stores. As before, the distinctions between NoSQL categories are not cut and dry.
8 At time of writing, CouchDB for Android is available as a developer alpha release.
9 The lost data is recoverable from database log files. But the restore operation can prove inefficient.
overhead of a query optimizer need not impose itself. But for applications where requirements may shift over time, capabilities are much more limited than with relational databases. This has some irony to it, given the importance of schema flexibility (and thus accommodation of changing requirements) in NoSQL databases overall.
Of course, that statement really comingles two separate senses of the word scalability. For many Web applications, scalability involves the elimination of latency in rather simple operations, such as pulling up an individual note, writing out a status message, bringing up account settings for a specific customer, and so forth. Another kind of scaling involves things such as efficient keyword searches over gigantic bodies of data, limiting the value of specific fields to a certain range or aggregating numeric field values over a large subset of data; this sense of scale is very important as well, and procedural traversals do not often enhance it. So perhaps it is unfair to say that, generally, procedural, row-wise data evaluation impairs scalability, since notions of scale differ between classes of applications. But this assertion must hold true in the converse as well, making it inaccurate to say, in a sweeping fashion, that NoSQL databases are more scalable or "Web scale" than relational databases. The reality is that different applications have different needs, different burdens and different points of stress (or failure). Scalability really is measured by the degree to which these needs are met, burdens lifted and stresses reduced as the volume of data and/or user activity grows, whether linearly or exponentially. The best database for the job is just that: the best database for the job at hand. For some applications, relational databases are not the optimal vehicle for storage and retrieval. For many others, NoSQL databases would be quite inappropriate. So the most important question in evaluating options amongst NoSQL databases, as well as evaluating the option of using them at all, hinges on the type of application being written, the type of queries that must be expected and handled with relative ease, and the regularity vs. variability of the data's structure.
That a certain type of database appears clumsy in certain situations does not by itself render that type of database inappropriate if that situation is merely an edge case.
SQL Server and SQL Azure provide this same data access option through cursors, but SQL Server developers use cursors very sparingly to avoid the downside.
provide geographically distributed points of presence may form the perfect approach for the problem space of these applications. The ad hoc, semi-federated nature of NoSQL clusters and replicas makes for low-friction provisioning and helps assure that growth spurts in service usage and membership are nondisruptive. That said, there is still work involved, both in terms of resource monitoring and provisioning, that must be done in order to meet these very demands. Meanwhile, a Platform as a Service cloud like Windows Azure, with a data platform like SQL Azure to match, facilitates a more automated approach to both the monitoring and provisioning which must be performed to make certain a site or application grows nondisruptively. New Windows Azure Web and Worker roles can be spun up through clicks in the Azure portal's management interface, and they can be deactivated just as easily. Replicas for SQL Azure databases are created implicitly and the cutover from one replica to another is implicit as well. As a result, elasticity is achieved more laboriously with hosted NoSQL database applications. The ramifications of this for NoSQL include extra effort and greater opportunity for error, which may have a very real and measurable economic impact in labor costs and/or opportunity costs, as well as greater risk exposure, to the companies building sites or providing services that use NoSQL databases.
Primitive Tooling
NoSQL databases are, in many cases, easier to get up and running than are relational databases. There's less up-front formality involved in terms of planning and design and, as a result, there's a shorter distance between concept and implementation. That's exactly the kind of agility that growing companies and their sites may need. There's also far less complexity in tooling around these databases: simple, self-explanatory browser-based management interfaces, straightforward REST programming interfaces and conceptually simple key-value paradigms abound. But tooling has its value, and that value tends to increase over time, when the imperative of raw implementation has passed and the need for smooth maintenance and troubleshooting becomes more pronounced (and economically impactful). The design, diagnostic and operational monitoring capabilities of SQL Server's tools are significant, and have evolved over the roughly 20-year existence of the product. These tools, including SQL Server Management Studio and its execution plan window, aid greatly in preventing problems, and in solving them quickly when they do arise. NoSQL databases' more minimalist tooling approach leads to more manual and time-consuming management and troubleshooting than is the case with SQL Azure (which is compatible with SQL Server's tools), and may also make the process more error-prone. The cost impact of this can be significant.
Some Web enterprises have large, dedicated technology staffs in place, who can handle this burden well. But many corporate business units, and even IT departments, are not in that position.
find it perfectly acceptable to discover the failure (by noticing the message never appears in a feed or stream) and re-post the message. Furthermore, the occurrence of transactions that span more than a single database operation may not be significant in certain apps. Note-taking applications must update notes one at a time; blog posting is a simple operation; social networks may need to register a new follower for a given user, and that's a discrete operation. Unlike a financial system which may need to execute a debit and credit as an atomic operation, many Web applications interact with data in a more granular, minimalist way. But for most corporate business applications, ACID guarantees are imperative. Debits and credits must execute in an all-or-nothing fashion; e-commerce orders cannot be lost, as customers will not be content to recreate them from scratch. So, once again, the context of an application/service/site in large part determines what defines standards of reliability and what determines whether certain advanced features of a database are overkill or absolute necessities.
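The all-or-nothing behavior of a debit and credit can be demonstrated with a short sketch. It uses Python's built-in `sqlite3` module purely as a convenient relational engine, not as a stand-in for any specific product discussed here; the `account` table and the `transfer` and `balances` functions are hypothetical. The credit is deliberately applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one atomic unit; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final = balances(conn)
```

After the rejected transfer the balances are exactly what they were after the first one: the credit to `bob` never becomes visible on its own. In a store without multi-operation transactions, the application itself would have to detect and compensate for the half-applied update.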
appropriately scaled and tuned, but the overarching point is that the relational scheme is best in these scenarios. To understand the line-of-business versus structured data distinction, it may be helpful to consider a hypothetical large, online bookseller. This reseller likely keeps its catalog data in a NoSQL database. It may do likewise with its Web content, reviews and perhaps even its reading lists. But in all likelihood, its customer billing system, its inventory and supply chain systems, its publisher online inquiry systems and its shipping application all use relational databases. We don't know this for a fact about any one bookseller, but the assumptions are nonetheless based on good rules of thumb for when and where each type of database is best utilized. The regular, consistent data scenario is the most common one in most corporate settings. Granted, for any number of outward-, consumer-facing Web applications, which are essentially content- and relationship-driven, NoSQL structured stores have a welcoming home. So you must ask yourself: do I have irregularly schematized data, such that I need to use a NoSQL, structured storage approach to storing and retrieving it? Try not to be led to a conclusion by fear (or even guilt) over the issue of inflexibility. Just because schema-less databases let you store irregular data doesn't mean you'll need that, and just because relational databases require you to go through steps that can be disruptive in order to modify a table's schema doesn't mean you're somehow foolhardy for going that route. Consider a household analogy: if, as you build a house, you run wiring in conduit, external to your walls, and surface-mount your fixtures, you'll always be able to upgrade your wiring, or repair a wiring segment gone bad. But if you know that the electrical, and maybe cable TV and computer network wiring to be installed will suit your purposes for the long term, then it makes perfect sense to run your wiring in-wall.
You can always open the walls again if need be, and if you're reasonably certain that you won't need to, then running the wiring internally is the right decision. It will look better to most people, make it easier to push furniture against the wall and will, arguably, be somewhat safer. In general, your home will have a more finished look to it. If one day your needs change and you need to open the walls again, that will not necessarily mean you made a bad decision. People should not let a relatively insignificant chance of disruption deter them from enjoying the advantages of something that is otherwise advantageous. By the same token, customers should not let the notion that their database schema may someday change force them into a decision of going with a non-relational, loosely-schematized database. As we have said, some applications by their nature manage data that is variable in structure, and NoSQL databases may work very well for those applications. But if your app uses highly structured data (and most line-of-business apps do), then why forgo the compatibility, data consistency, query optimization, maturity, broad support and professional talent pool that a major relational database offers? You should give that up only if the benefits of doing so outweigh the costs, and each such benefit should be evaluated on a sober survey of likelihoods and risks.
But what if the wires in your house are changing a lot? What if you've got an app that manages a lot of data that is ever-changing in structure and much of it functions as content on your Web site? Do you need Cassandra or MongoDB or Neo4j on a hosted Linux server? Probably not. Azure tools like Azure Table Storage, SQL Azure XML columns and OData may be viable options for your structured storage or key-value retrieval needs. And if not, then running xcopy-deployable or silently-installable NoSQL databases in Azure Worker Roles and Azure Drive, or running full-blown NoSQL installations using Azure VM Roles, may well work for you. Hopefully this paper has made the choices more clear and your evaluation a more straightforward and less loaded prospect. The Azure cloud provides for a spectrum of choice, rather than a single, compulsory methodology. This provides flexibility and protection in a cost-effective, elastic computing environment. And that's really what "Web scale" should be all about.