AdeelIbrahim (1174103)

Data Ware/Business Intelligence: Selecting The Right Tool for The Job
Syed Muhammad Adeel Ibrahim
SZABIST, M.S. Computing (SE), Karachi, Pakistan. Email: smadeelibrahim@yahoo.com
Abstract. The rivalry between SQL and NoSQL has been building over the past year, to the point where some people are predicting the end of the SQL era. In fact, the two camps are largely complementary, because they are designed to solve different problems. This paper pays particular attention to how each kind of system is used to solve different types of problems.
Keywords: HBase, Hadoop, MongoDB, MonetDB, Cassandra, CouchDB
1. Introduction
We first review the background of RDBMS (SQL) and non-RDBMS (NoSQL) systems so that readers understand the basic ideas behind both technologies.
1.1. RDBMS (SQL)
Data is one of the most important assets of a company, so it must be stored and maintained accurately and accessed quickly. A DBMS (Database Management System) is a set of programs used to store and manipulate data.
Manipulation of data includes the following: adding new data, for example adding the details of a new student; deleting unwanted data, for example deleting the details of students who have completed a course; and changing existing data, for example modifying the fee paid by a student. A DBMS provides functions such as data security, data integrity, data sharing, data concurrency, data independence and data recovery.
1.1.1. Features of RDBMS
The following are the main, general features offered by a DBMS. Each DBMS implements them in its own way, and a particular product may offer more features than those discussed here or enhance them. For instance, Oracle is increasingly being fine-tuned to be the database for Internet applications, which may not be true of other database management systems.
1.1.1.2. Support for large amount of data
Each DBMS is designed to support large amounts of data and provides special means to store and manipulate them. Companies are trying to store more and more data, and some of this data has to be online (available at all times). In most cases the amount of data that can be stored is constrained not by the DBMS but by the available hardware. For example, Oracle can store terabytes of data.
1.1.1.3. Data sharing, concurrency and locking
A DBMS also allows data to be shared by two or more users. The same data can be accessed by multiple users at the same time; this is data concurrency. However, problems arise when the same data is manipulated by multiple users at the same time. To avoid these problems, the DBMS locks data that is being manipulated so that two users cannot modify the same data at the same time. The locking mechanism is transparent and automatic: we neither have to tell the DBMS to lock data nor need to know how and when it does so. However, programmers who understand the intricacies of the locking mechanism used by their DBMS will write better programs.
1.1.1.4. Data Security
While a DBMS allows data to be shared, it also ensures that data is accessed only by authorized users. A DBMS provides the features needed to implement security at the enterprise level. By default, the data of one user cannot be accessed by other users unless the owner explicitly grants them permission.
1.1.1.5. Data Integrity
Maintaining the integrity of data is an important process. If data loses integrity, it becomes unusable garbage. A DBMS provides the means to implement rules that maintain the integrity of the data. Once we specify which rules are to be implemented, the DBMS makes sure they are always enforced. Three integrity rules, domain, entity and referential, are always supported by a DBMS.
1.2. Non-RDBMS (NoSQL)
Over the last couple of years, a new kind of data storage mechanism has emerged for storing data at large scale. These storage solutions differ quite significantly from the RDBMS model and are collectively known as NoSQL. Some of the key players include:
Google BigTable, HBase, Hypertable
Amazon Dynamo, Voldemort, Cassandra, Riak
Redis
CouchDB, MongoDB
These solutions have a number of characteristics in common:
They are key-value stores.
They run on a large number of commodity machines.
Data is partitioned and replicated among these machines.
They relax the data consistency requirement, because the CAP theorem shows that a distributed system cannot guarantee Consistency, Availability and Partition tolerance all at the same time.
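As an illustration of how data can be partitioned and replicated among commodity machines, the following is a minimal sketch of a consistent-hash ring. It is not taken from any of the systems named above; the class name HashRing, the machine names and the replica count are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (stable across runs)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each key is stored on `replicas` machines."""

    def __init__(self, machines, replicas=3):
        self.replicas = min(replicas, len(machines))
        # Sort machines by their position on the ring.
        self._ring = sorted((_hash(m), m) for m in machines)
        self._positions = [pos for pos, _ in self._ring]

    def machines_for(self, key: str):
        """Return the machines holding `key`: the owner and its successors."""
        start = bisect.bisect_right(self._positions, _hash(key))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replicas)
        ]


# Hypothetical five-machine cluster; every key lives on three machines.
ring = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"], replicas=3)
placement = ring.machines_for("user:1174103")
```

Because a key's position depends only on its hash, any client can compute where a key lives without consulting a central directory, and losing one machine still leaves two copies of each key it held.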
2. Types of NoSQL datastores

The following section describes the different types of NoSQL datastores.
2.1.1. Key Value stores
Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
Typical applications: Content caching
Strengths: Fast lookups
Weaknesses: Stored data has no schema
Example application: You are writing forum software with a home profile page that shows the user's statistics (messages posted, etc.) and their last ten messages. The page reads from a key based on the user's id and retrieves a JSON string that represents all the relevant information. A background process recalculates the information every 15 minutes and writes to the store independently.
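The forum example can be sketched as follows. A plain Python dictionary stands in for the key-value store, and the key format and function names are assumptions chosen for illustration, not the API of any product listed above.

```python
import json

store = {}  # stand-in for a key-value store: opaque string values by key


def profile_key(user_id: int) -> str:
    return f"profile:{user_id}"


def recalculate_profile(user_id, messages):
    """Background job: rebuild the cached profile and overwrite the key."""
    own = [m for m in messages if m["user_id"] == user_id]
    profile = {
        "user_id": user_id,
        "messages_posted": len(own),
        "last_messages": [m["text"] for m in own[-10:]],
    }
    store[profile_key(user_id)] = json.dumps(profile)


def render_profile(user_id):
    """Page request: one lookup by key, then decode the JSON string."""
    raw = store.get(profile_key(user_id))
    return json.loads(raw) if raw is not None else None


messages = [{"user_id": 7, "text": f"post {i}"} for i in range(12)]
messages.append({"user_id": 8, "text": "hello"})
recalculate_profile(7, messages)
page = render_profile(7)
```

The page request never computes anything: it performs a single lookup, which is where the fast-lookup strength comes from. The price is that the store knows nothing about the structure of the JSON string it holds.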
2.1.2. Document databases
Examples: CouchDB, MongoDB
Typical applications: Web applications
Strengths: Tolerant of incomplete data
Weaknesses: Query performance, no standard query syntax
Example application: You are creating software that builds profiles of refugee children with the aim of reuniting them with their families. The details you need to record for each child vary tremendously with the circumstances of the event, and they are built up piecemeal. For example, a young child may know their first name, and you can take a picture of them, but they may not know their parents' first names. Later, a local may claim to recognise the child and provide additional information that you definitely want to record, but until you can verify it you have to treat it sceptically.
2.1.3. Graph databases
Examples: Neo4j, InfoGrid, Infinite Graph
Typical applications: Social networking, Recommendations
Strengths: Graph algorithms, e.g. shortest path, connectedness, n-degree relationships, etc.
Weaknesses: Has to traverse the entire graph to achieve a definitive answer. Not easy to cluster.
Example application: Any application that requires social networking is best suited to a graph database. The same principles can be extended to any application where you need to understand what people are doing, buying or enjoying so that you can recommend further things for them to do, buy or like. Any time you need to answer a question along the lines of "What restaurants do the sisters of people who are over 40, enjoy skiing and have visited Kenya dislike?", a graph database will usually help.
2.1.4. XML databases
Examples: eXist, Oracle, MarkLogic
Typical applications: Publishing
Strengths: Mature search technologies, schema validation
Weaknesses: No real binary solution; easier to re-write documents than to update them
Example application: A publishing company uses bespoke XML formats to produce web, print and eBook versions of its articles. Editors need to quickly search either the text or semantic sections of the markup (e.g. articles whose summary contains "diabetes", where the author's institution is Liverpool University and Stephen was a revising editor at some point in the document history). They store the XML of finished articles in the XML database and wrap it in a readable-URL web service for the document production systems. Workflow metadata (which stage a manuscript is in) is held in a separate RDBMS. When system-wide changes are required, XQuery updates bulk-update all the documents to match the new format.
2.1.5. Distributed Peer Stores
Examples: Cassandra, HBase, Riak
Typical applications: Distributed file systems
Strengths: Fast lookups, good distributed storage of data
Weaknesses: Very low-level API
Example application: You have a news site where any piece of content (articles, comments, author profiles) can be voted on, with an optional comment supplied on the vote. You create one store per user and one store per piece of content, using a UUID as the key (generating one for each piece of content and each user). The user's store holds every vote they have ever made, while the content "bucket" contains a copy of every vote that has been made on that piece of content. Overnight you run a batch job to identify content that users have voted on, and for each user you generate a list of content that has high votes but which they have not yet voted on. You then push this list of recommended articles into the user's "bucket".
2.1.6. Object stores
Examples: Oracle Coherence, db4o, ObjectStore, GemStone, Polar
Typical applications: Finance systems
Strengths: Matches the OO development paradigm, low-latency ACID, mature technology
Weaknesses: Limited querying or batch-update options
Example application: A global trading company has a monoculture of development and wants trades done on desks in Japan and New York to pass through a risk checking process in London. An object representing the trade is pushed into the object store, and the risk checker listens for the appearance or modification of trade objects. When the object is replicated into the local European space, the risk checker reads the trade and assesses the risk. It then rewrites the object to indicate that the trade is approved and generates an actual trade fulfilment request. The trader's client listens for changes to objects that contain the trader's id and updates the local detail of the trade in the client, indicating to the trader that the trade has been approved. The trading system will consume the trade fulfilment and, when the trade elapses or is fulfilled, feeds the information back to the risk assessor.
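The listen-and-rewrite flow in the trading example can be sketched in a single process as follows. This is only a toy model under stated assumptions: the ObjectStore class, the RISK_LIMIT threshold and the trader id are hypothetical, and replication between sites is left out entirely.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Trade:
    trade_id: str
    trader_id: str
    amount: float
    status: str = "pending"


class ObjectStore:
    """Toy in-process store that notifies listeners whenever an object is put."""

    def __init__(self):
        self._objects = {}
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def put(self, trade):
        self._objects[trade.trade_id] = trade
        for callback in list(self._listeners):
            callback(trade)

    def get(self, trade_id):
        return self._objects[trade_id]


RISK_LIMIT = 1_000_000  # hypothetical threshold for the example

store = ObjectStore()
notifications = []


def risk_checker(trade):
    # Assess only pending trades, so rewriting the object with the
    # decision does not trigger a second assessment.
    if trade.status == "pending":
        decision = "approved" if trade.amount <= RISK_LIMIT else "rejected"
        store.put(replace(trade, status=decision))


def trader_client(trade):
    # The client cares only about decided trades for its own trader id.
    if trade.trader_id == "tokyo-17" and trade.status != "pending":
        notifications.append((trade.trade_id, trade.status))


store.subscribe(risk_checker)
store.subscribe(trader_client)
store.put(Trade("T1", "tokyo-17", 250_000.0))
store.put(Trade("T2", "tokyo-17", 5_000_000.0))
```

The application code works with ordinary objects throughout, which is the sense in which an object store matches the OO development paradigm.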
3. Column-Oriented DBMS

Examples: MonetDB, InfoBright, MS SQL Server 2012 (Denali CTP), Vertica
Typical applications: Banks, corporate sectors
Strengths:
Data compression: repeated column values can be represented by a single value, and different projections can store a column in the format in which it is most often used.
Improved bandwidth utilization: only the required columns are read from disk; no extra data or columns are read, as happens in a row-oriented database.
Improved code pipelining: CPU cycles are spent only on the required attributes.
Improved cache locality: the cache holds only the required data rather than the unnecessary data that a row-oriented database would load.
Weaknesses:
Increased disk seek time: reading multiple columns in parallel increases disk seek time.
Increased cost of inserts: small inserts take longer because, with data stored by column, multiple places need to be updated.
Increased tuple reconstruction costs: when interfacing with drivers such as JDBC, reconstructing a row from its columns takes time and offsets the advantages of the column-oriented layout.
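The trade-offs listed above can be seen in a small in-memory sketch. The sample accounts are invented for illustration, and run-length encoding is used here as one simple compression scheme; the products named above may use different ones.

```python
from itertools import groupby

rows = [
    ("ACC-1", "Karachi", 120.0),
    ("ACC-2", "Karachi", 75.5),
    ("ACC-3", "Karachi", 300.0),
    ("ACC-4", "Lahore", 42.0),
    ("ACC-5", "Lahore", 18.5),
]

# Column-oriented layout: one list per attribute instead of one tuple per row.
columns = {
    "account": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "balance": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def reconstruct_row(index):
    """Tuple reconstruction: touch every column to rebuild one row."""
    return tuple(columns[name][index] for name in ("account", "city", "balance"))


compressed_city = run_length_encode(columns["city"])
total_balance = sum(columns["balance"])  # reads a single column only
```

The aggregate reads one list and ignores the other two, while rebuilding a single row has to visit all three, which mirrors the bandwidth strength and the tuple reconstruction weakness.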
4. Specifications and Usage of Various data stores
4.1. Redis
Written in: C/C++
Main point: Blazing fast
License: BSD
Protocol: Telnet-like
Disk-backed in-memory database
Currently without disk-swap (VM and Diskstore were abandoned)
Master-slave replication
Simple values or hash tables by keys, but complex operations like ZREVRANGEBYSCORE
INCR & co (good for rate limiting or statistics)
Has sets (also union/diff/inter)
Has lists (also a queue; blocking pop)
Has hashes (objects of multiple fields)
Sorted sets (high score table, good for range queries)
Redis has transactions (!)
Values can be set to expire (as in a cache)
Pub/Sub lets one implement messaging (!)
Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).
For example: Stock prices. Analytics. Real-time data collection. Real-time communication.
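To show why an atomic counter combined with key expiry suits rate limiting, here is a minimal sketch. It does not talk to Redis: TinyStore is a hypothetical in-memory stand-in that only imitates the increment-and-expire behaviour described above, and the clock is passed in explicitly so the example is deterministic.

```python
class TinyStore:
    """Toy in-memory store with INCR-like counters and key expiry."""

    def __init__(self):
        self._values = {}
        self._expires_at = {}

    def _purge(self, key, now):
        # Drop the key if its time-to-live has elapsed.
        if key in self._expires_at and now >= self._expires_at[key]:
            self._values.pop(key, None)
            self._expires_at.pop(key, None)

    def incr(self, key, now):
        """Increment the counter at `key`, starting from 0 if it is missing."""
        self._purge(key, now)
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds, now):
        self._expires_at[key] = now + seconds


def allow_request(store, client, now, limit=3, window=60):
    """Allow at most `limit` requests per client in each `window` seconds."""
    key = f"rate:{client}"
    count = store.incr(key, now)
    if count == 1:
        store.expire(key, window, now)  # start the window on the first hit
    return count <= limit


store = TinyStore()
decisions = [allow_request(store, "10.0.0.1", now=t) for t in (0, 1, 2, 3, 61)]
```

The fourth request inside the window is refused, and the request at second 61 is allowed again because the counter has expired.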
4.2. MongoDB
Written in: C++
Main point: Retains some friendly properties of SQL (query, index)
License: AGPL (Drivers: Apache)
Protocol: Custom, binary (BSON)
Master/slave replication (auto failover with replica sets)
Sharding built-in
Queries are JavaScript expressions
Run arbitrary JavaScript functions server-side
Better update-in-place than CouchDB
Uses memory-mapped files for data storage
Performance over features
Journaling (with --journal) is best turned on
On 32-bit systems, limited to ~2.5 GB
An empty database takes up 192 MB
GridFS to store big data + metadata (not actually an FS)
Has geospatial indexing
Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
For example: For most things that you would do with MySQL or PostgreSQL, but where having predefined columns really holds you back.
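What "dynamic queries" without predefined columns means can be sketched in plain Python, reusing the refugee-profile scenario from the document database section. This is not MongoDB's query API: the find helper and the sample documents are assumptions made for the illustration.

```python
children = [
    {"first_name": "Amina", "photo": "amina.jpg"},
    {"first_name": "Bilal", "approx_age": 6,
     "claims": [{"by": "local shopkeeper", "verified": False}]},
    {"first_name": "Sara", "approx_age": 6, "mother_first_name": "Noor"},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion.

    A document that lacks a field simply does not match; it is not an error.
    """
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


six_year_olds = find(children, approx_age=6)
with_known_mother = find(children, approx_age=6, mother_first_name="Noor")
```

Each document carries only the fields that are known so far, and a query can be phrased over any combination of fields without a schema change.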
4.3. Neo4j
Written in: Java
Main point: Graph database - connected data
License: GPL, some features AGPL/commercial
Protocol: HTTP/REST (or embedding in Java)
Standalone, or embeddable into Java applications
Full ACID conformity (including durable data)
Both nodes and relationships can have metadata
Integrated pattern-matching-based query language ("Cypher")
Also the "Gremlin" graph traversal language can be used
Indexing of nodes and relationships
Nice self-contained web admin
Advanced path-finding with multiple algorithms
Indexing of keys and relationships
Optimized for reads
Has transactions (in the Java API)
Scriptable in Groovy
Online backup, advanced monitoring and High Availability is AGPL/commercial licensed
Best used: For graph-style, rich or complex, interconnected data. Neo4j is quite different from the others in this sense.
For example: Social relations, public transport links, road maps, network topologies.
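A shortest-path query over social relations, the kind of question graph databases are built for, can be sketched with a breadth-first search. The sketch uses a plain adjacency dictionary rather than Neo4j, Cypher or Gremlin, and the people and relationships are invented for the example.

```python
from collections import deque

# Undirected "knows" relationships stored as adjacency sets.
knows = {
    "Ali": {"Bushra", "Chen"},
    "Bushra": {"Ali", "Dana"},
    "Chen": {"Ali", "Dana"},
    "Dana": {"Bushra", "Chen", "Eve"},
    "Eve": {"Dana"},
    "Faiz": set(),
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):  # sorted for a stable result
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(knows, "Ali", "Eve")
```

In a relational schema the same question would need one self-join per hop, whereas the traversal here simply follows relationships until it reaches the goal.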
4.4. Riak
Written in: Erlang & C, some JavaScript
Main point: Fault tolerance
License: Apache
Protocol: HTTP/REST or custom binary
Tunable trade-offs for distribution and replication (N, R, W)
Pre- and post-commit hooks in JavaScript or Erlang, for validation and security
Map/reduce in JavaScript or Erlang
Links & link walking: use it as a graph database
Secondary indices: but only one at once
Large object support (Luwak)
Comes in "open source" and "enterprise" editions
Full-text search, indexing, querying with Riak Search server (beta)
In the process of migrating the storage backend from "Bitcask" to Google's "LevelDB"
Masterless multi-site replication and SNMP monitoring are commercially licensed
Best used: If you want something Cassandra-like (Dynamo-like), but there is no way you are going to deal with the bloat and complexity. If you need very good single-site scalability, availability and fault tolerance, but you are ready to pay for multi-site replication.
For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.
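The tunable (N, R, W) trade-off can be made concrete with a little arithmetic. The sketch assumes the usual Dynamo-style reading of the parameters: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. The function names and the three sample settings are hypothetical.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Replicas that may be down while reads and writes still reach quorum."""
    return {"reads": n - r, "writes": n - w}


settings = {
    "balanced": (3, 2, 2),
    "fast writes": (3, 1, 1),
    "read-optimised": (3, 1, 3),
}
summary = {name: quorums_overlap(*nrw) for name, nrw in settings.items()}
```

When R + W > N, a read is guaranteed to contact at least one replica that saw the latest acknowledged write. Lowering R or W buys latency and availability at the cost of that guarantee.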
4.5. CouchDB
Written in: Erlang
Main point: DB consistency, ease of use
License: Apache
Protocol: HTTP/REST
Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication (!)
MVCC - write operations do not block reads
Previous versions of documents are available
Crash-only (reliable) design
Needs compacting from time to time
Views: embedded map/reduce
Formatting views: lists & shows
Server-side document validation possible
Authentication possible
Real-time updates via _changes (!)
Attachment handling
Thus, CouchApps (standalone JS apps)
jQuery library included
Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
4.6. HBase
Written in: Java
Main point: Billions of rows X millions of columns
License: Apache
Protocol: HTTP/REST (also Thrift)
Modeled after Google's BigTable
Uses Hadoop's HDFS as storage
Map/reduce with Hadoop
Query predicate push down via server-side scan and get filters
Optimizations for real-time queries
A high-performance Thrift gateway
HTTP supports XML, Protobuf, and binary
Cascading, Hive, and Pig source and sink modules
JRuby-based (JIRB) shell
Rolling restart for configuration changes and minor upgrades
Random access performance is like MySQL
Best used: When you use the Hadoop/HDFS stack. When you need random, real-time read/write access to BigTable-like data.
For example: For data that is similar to a search engine's data.
4.7. Cassandra
Written in: Java
Main point: Best of BigTable and Dynamo
License: Apache
Protocol: Custom, binary (Thrift)
Tunable trade-offs for distribution and replication (N, R, W)
Querying by column, range of keys
BigTable-like features: columns, column families
Has secondary indices
Writes are much faster than reads (!)
Map/reduce possible with Apache Hadoop
I admit being a bit biased against it, because of the bloat and complexity it has, partly because of Java (configuration, seeing exceptions, etc.)
Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")
For example: Banking, financial industry (though not necessarily for financial transactions, but these industries are much bigger than that). Writes are faster than reads, so one natural niche is real-time data analysis.
4.8. Membase
Written in: Erlang & C
Main point: Memcache compatible, but with persistence and clustering
License: Apache 2.0
Protocol: memcached plus extensions
Very fast (200k+/sec) access of data by key
Persistence to disk
All nodes are identical (master-master replication)
Provides memcached-style in-memory caching buckets, too
Write de-duplication to reduce IO
Very nice cluster-management web GUI
Software upgrades without taking the DB offline
Connection proxy for connection pooling and multiplexing (Moxi)
Best used: Any application where low-latency data access, high concurrency support and high availability are requirements.
For example: Low-latency use cases like ad targeting, or highly concurrent web apps like online gaming (e.g. Zynga).
5. Conclusion
A SQL database implementation that uses NoSQL infrastructure is a good solution: a SQL database that is scalable, manageable, cloud-ready, highly available and built entirely on NoSQL infrastructure, yet still provides the advantages of a SQL database, such as interoperability and well-defined semantics. This hybrid would not be as fast as a pure NoSQL service, but it may be good enough for the 80% of the market that needs stronger scalability and organic cloud behavior.
Such a solution would also allow existing applications to be migrated easily into cloud environments, thus protecting the huge investments organizations have made in those applications. It is my opinion that a SQL database built on NoSQL foundations can provide the highest value to customers who wish to be both agile and efficient while they grow.
References
1. NoSQL Databases. http://nosql-database.org/
2. Robert Rees, NoSQL, no problem: An introduction to NoSQL databases. http://www.thoughtworks.com/articles/nosqlcomparison
3. Kristóf Kovács, Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
4. Avi Kapuya, SQL vs NoSQL in the Cloud: Which Database Should You Choose? http://cloud.dzone.com/news/sql-vs-nosqlcloud-which
5. Column-oriented DBMS. http://en.wikipedia.org/wiki/Columnoriented_DBMS
6. Srikanth, Introduction to RDBMS. http://srikanthtechnologies.com/books/orabook/ch1.pdf
7. Introduction to NoSQL. http://www.slideshare.net/darrellk2000/introduction-to-nosql
8. Ricky Ho, NoSQL Patterns. http://horicky.blogspot.com/2009/11/nosql-patterns.html
9. Christof Strauch, NoSQL Databases - NoSQL Introduction and Overview. http://highscalability.com/blog/2011/4/13/papernosql-databases-nosql-introduction-and-overview.html
