LAB WEEK 7
Submitted
Table of Contents
Scenario
Explanation SQL, NoSQL, and NewSQL
    Where RDBMS fall short?
    What NoSQL brings into the table?
    Data models in NoSQL.
        1. Key-Value (K-V) Stores
        2. Document Stores
        3. Column-Oriented Stores
        4. Graph Databases
Preview of Some NoSQL Solutions
    Redis
    CouchDB™
    MongoDB
Why MongoDB?
Lift Chart: Compare and contrast DT and SVM.
ROC charts: Compare and contrast DT and SVM.
Literature Sources
Scenario
Throughout this course we have examined many different ways to manage and analyze data. This week, our challenge is to explore NoSQL. Your
CEO has decided that using big data is the wave of the future and is especially interested in the topic of NoSQL and how it works. Before you can
explain NoSQL and NoSQL data management options to the CEO, it is necessary to do some further research. The CEO has indicated that his
advisors believe that MongoDB is a viable option for managing large amounts of data. Your task is to research NoSQL and discover how it can
benefit the organization. You are also curious about MongoDB; what it is and how it works. Please build a two-page report on your NoSQL
discussion and recommendation of MongoDB. Explain the pros and cons of working with MongoDB, and indicate why it might be the best option
for the CEO. The report should include the following.
Discussion of MongoDB
While doing your research, you encounter information on the following topics that you feel should also be included in your report to the CEO. An
additional two pages are needed to explain the following.
You will need to put your findings into a Word document and submit the file.
Explanation SQL, NoSQL, and NewSQL
“SQL” is used both as the name of a language and as a type of database. SQL the language is a structured query language designed for
managing data in relational database management systems (RDBMS). Relational database management systems are often called SQL databases
since they use the SQL language. For over 30 years, relational database technology, based on a model of organizing data in the form of
tables, columns, and rows, has been the gold standard. Since the mid-1980s, SQL has been unquestionably the standard for querying and
managing RDBMS data sets. SQL systems have proven themselves a good option for vertical expansion in databases, independently of their
size. However, in the era of cloud computing, with unprecedented data volumes, massive workloads for Web services, and the need to store
new types of data and to gather data for BI, the need for different database systems has arisen.
Another set of problems that relational databases struggle with is related to an exponentially increasing amount of data. The direct
consequence is the so-called big data problem. This problem arises when standard SQL query operations do not perform acceptably,
especially when transactions are involved. In these situations, developers found that the scheme of adding more servers raised more
issues with regard to configuration and maintainability, and ultimately proved costly without any improvement in performance
(Krisciunas, 2014).
demand stream ingestion, real-time processing, and data storage, while running on a distributed, cloud-native infrastructure.
Traditional messaging tool providers are trying hard to “bolt on” ACID transactions and to add statefulness to their streams, but
their inherent “data pipeline” architecture limits them from offering these must-have capabilities. The newer systems, by using
different data models, are not forced to do something they were not designed for. It is of the utmost importance to understand and
correctly use the data model when choosing the proper NoSQL solution (Laudon & Laudon, 2015).
As indicated in the enclosed charts, the main advantages of NoSQL include:
1. the ability to handle large volumes of structured, semi-structured, and unstructured data,
2. support for agile sprints, quick iteration, and frequent code pushes,
3. object-oriented programming that is easy to use and flexible,
4. an efficient, scale-out architecture instead of an expensive, monolithic architecture,
5. low cost: many NoSQL databases are open source, which means a relatively low-cost way of developing, implementing, and sharing
software.
These are some of the main reasons why fast-growing companies are quickly embracing “NoSQL” non-relational database technologies.
Non-relational database management systems use a more flexible data model and are designed for managing large data sets across
many distributed machines and for easily scaling up or down. They have been very useful for accelerating simple queries against large volumes
of structured and unstructured data, including Web, social media, graphics, and other forms of data that are difficult to analyze with traditional
SQL-based tools.
data sharing between application instances, such as a distributed cache, or to store user session data. In this data model, as in document
stores, transaction consistency is rarely needed, as most operations are by definition atomic.
2. Document Stores
A document store is a data model for storing semi-structured document object data and metadata. The JSON format is normally used to represent
such objects. Documents can be queried by their properties in a similar manner to relational databases but aren’t required to adhere to the
strict structure of a database table. Additionally, only parts of the object may be requested or updated. Document stores are used for aggregate
objects that have no shared complex data between them and to quickly search or filter by some object properties.
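The querying behaviour described above can be sketched with plain Python dictionaries standing in for JSON documents. This is only an illustration under stated assumptions: the `customers` collection, the field names, and the `find` and `update` helpers are hypothetical and do not correspond to any particular product's API.

```python
import copy

# Hypothetical in-memory "collection": a list of JSON-like documents.
# Documents are not required to share the same set of fields.
customers = [
    {"_id": 1, "name": "Ana", "city": "Rome", "phones": ["555-0100"]},
    {"_id": 2, "name": "Ben", "city": "Oslo", "interests": ["chess", "ski"]},
    {"_id": 3, "name": "Cy", "city": "Rome"},
]


def _matches(doc, criteria):
    """True when every property in criteria equals the document's value."""
    return all(doc.get(key) == value for key, value in criteria.items())


def find(collection, criteria, fields=None):
    """Return copies of matching documents.

    If fields is given, only those parts of each document are returned.
    """
    results = []
    for doc in collection:
        if _matches(doc, criteria):
            if fields is None:
                selected = doc
            else:
                selected = {k: doc[k] for k in fields if k in doc}
            results.append(copy.deepcopy(selected))
    return results


def update(collection, criteria, changes):
    """Update only the named fields of matching documents; return the count."""
    matched = 0
    for doc in collection:
        if _matches(doc, criteria):
            doc.update(changes)
            matched += 1
    return matched
```

For example, `find(customers, {"city": "Rome"}, fields=["name"])` requests only the `name` part of each matching document, and `update(customers, {"_id": 3}, {"phones": ["555-0199"]})` adds a field to one document without touching the others.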
3. Column-Oriented Stores
A more advanced K-V store data model is a column family. These are used for organizing data based on individual columns, where actual data is
used as a key to refer to whole data collections. It is similar to a relational database index; however, a column family may be an arbitrary
collection of columns. There are more complex aggregation structures, like super columns and super column families, that allow access to the
data by several keys. This approach is used for very large scalable databases to greatly reduce the time needed to search data. It is rarely
used outside of enterprise-level applications.
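One way to picture a column family is as a nested mapping from a row key to a family name to an arbitrary set of columns. The sketch below is a simplified illustration in plain Python, not the storage layout of any real column-oriented database. The row keys, family names, and the `put` and `get_family` helpers are assumptions made for the example.

```python
from collections import defaultdict

# row key -> column family -> {column name: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one value; rows may hold different columns within a family."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Return all columns stored for one row in one column family."""
    # .get() avoids creating empty entries for rows that do not exist.
    return dict(table.get(row_key, {}).get(family, {}))


put("user:1", "profile", "name", "Ana")
put("user:1", "profile", "city", "Rome")
put("user:1", "activity", "last_login", "2019-06-23")
put("user:2", "profile", "name", "Ben")  # this row has no "city" column
```

Here `get_family("user:1", "profile")` returns both profile columns in a single lookup, while `get_family("user:2", "profile")` returns only the `name` column.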
4. Graph Databases
As the name implies, this data model allows objects to link and be linked by several other objects thus constructing a graph structure. Links
usually have additional properties to describe the relation between objects. Graph databases map more directly to object-oriented programming
models and are faster for highly associative data sets and graph queries. Furthermore, they typically support ACID transaction properties in the
same way as most RDBMS.
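The idea of links that carry their own properties can be sketched as an adjacency list in plain Python. This is a toy model rather than a graph database: the node names, the `relation` property, and the `link` and `reachable` helpers are hypothetical.

```python
from collections import defaultdict, deque

# node -> list of (neighbor, properties describing the link)
edges = defaultdict(list)


def link(source, target, **properties):
    """Add a directed link carrying arbitrary descriptive properties."""
    edges[source].append((target, properties))


def reachable(start, relation):
    """Breadth-first traversal following only links of the given relation."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, props in edges.get(node, []):
            if props.get("relation") == relation and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


link("Ana", "Ben", relation="follows", since=2017)
link("Ben", "Cy", relation="follows", since=2018)
link("Ana", "Dee", relation="blocks")
```

A graph query such as `reachable("Ana", "follows")` walks the links directly instead of joining tables, which is the kind of highly associative access the paragraph above refers to.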
Redis
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings,
hashes, lists, sets, and sorted sets. Redis allows a user to set an expiration time for key-value pairs and requires all stored data to fit into the
server's RAM. Clearly, it is designed to be used as a distributed caching and session service. Data storage in RAM allows very fast read/write
operations. Furthermore, data is persisted to disk and, in the case of a server restart, can be restored back to RAM for quick access.
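The expiration behaviour described above can be illustrated with a small in-memory store written in plain Python. This is a toy sketch and not Redis itself: the `ExpiringStore` class and its methods are hypothetical. Real Redis exposes expiry through commands such as `EXPIRE` or the `EX` option of `SET`.

```python
import time


class ExpiringStore:
    """Toy in-memory key-value store with per-key expiration (not Redis)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}      # key -> (value, expiry time or None)

    def set(self, key, value, ttl_seconds=None):
        """Store a value, optionally expiring ttl_seconds from now."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        """Return the value, or default if the key is missing or expired."""
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired pair
            return default
        return value
```

A session entry written with `store.set("session:42", {"user": "ana"}, ttl_seconds=30)` is readable for 30 seconds and then behaves as if it had never been stored, which is the usual pattern for caches and user sessions.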
CouchDB™
Apache CouchDB is a document store mainly targeted at mobile devices, with offline mode support. It uses JSON for document storage and
REST for its API. Field values are restricted to standard JSON types. CouchDB provides ACID transaction semantics, meaning that it can handle a
high volume of concurrent readers and writers without conflict. CouchDB also guarantees eventual consistency so that it can provide both
availability and partition tolerance. Aggregation in CouchDB is done by using a specialized view model similar to a map-reduce system, and is
continuously updated and processed in parallel. CouchDB is a strong candidate for use on mobile devices and in client-side web browser applications.
MongoDB
MongoDB is a document store designed for high performance, high availability, and automatic scaling. Documents are saved in BSON
format (binary JSON), and field values, aside from the usual JSON types, can include other documents, arrays, and arrays of documents. Every
field can be indexed and queried. MongoDB supports a write lock, which blocks all other operations, including reads.
Also, MongoDB supports dynamic consistency, where each write operation can specify the guaranteed level of success for that operation. When
inserts, updates, and deletes have a weak write concern, write operations return quickly. In some failure cases, write operations issued with
weak write concerns may not persist. With stronger write concerns, clients wait after sending a write operation for MongoDB to confirm the
write operation. Additional notable features include:
Why MongoDB?
MongoDB is designed to meet the demands of modern apps with a technology foundation that enables the developer and the end user through:
The document model approach also simplifies query development and optimization. There’s no need to write complex code to manipulate text
and values into SQL and work with multiple tables. The figure below illustrates the difference between using the MongoDB query language and SQL
to insert a single user record, where users have multiple properties including name, all of their addresses, phone numbers, interests, and more.
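As a rough illustration of the comparison described above, the sketch below stores one hypothetical user record in two ways: as a single JSON-like document, and as rows spread over several relational tables. It uses only Python's standard library (`json` and `sqlite3`) and does not connect to MongoDB. The field names, table names, and sample values are assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical user record; all field names and values are illustrative.
user = {
    "name": "Ana Rossi",
    "addresses": [
        {"type": "home", "city": "Rome"},
        {"type": "work", "city": "Milan"},
    ],
    "phones": ["555-0100", "555-0101"],
    "interests": ["chess", "hiking"],
}

# Document approach: the whole record is stored as one unit.
# With a document database this would be a single insert of `user`.
document = json.dumps(user)

# Relational approach: the same record is split across several tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (user_id INTEGER, type TEXT, city TEXT);
    CREATE TABLE phones (user_id INTEGER, number TEXT);
    CREATE TABLE interests (user_id INTEGER, interest TEXT);
    """
)
cur = conn.execute("INSERT INTO users (name) VALUES (?)", (user["name"],))
user_id = cur.lastrowid
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?)",
    [(user_id, a["type"], a["city"]) for a in user["addresses"]],
)
conn.executemany(
    "INSERT INTO phones VALUES (?, ?)",
    [(user_id, p) for p in user["phones"]],
)
conn.executemany(
    "INSERT INTO interests VALUES (?, ?)",
    [(user_id, i) for i in user["interests"]],
)
conn.commit()


def count_rows(table):
    """Count rows in one of the fixed tables created above."""
    # Table names come from the fixed tuple below, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


rows_written = sum(
    count_rows(t) for t in ("users", "addresses", "phones", "interests")
)
```

The document side needs one write of one object, whereas the relational side needs four tables and seven rows for the same user, and reading the record back would require joining them again.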
Lift Chart: Compare and contrast DT and SVM.
The lift chart gives us a quick way of identifying the cases in the test (evaluation) sample according to their predicted probabilities of
success. Cases with large probabilities of success are listed first. Next to them we print the actual results; we assume that we know these as we
are evaluating what would have happened if we had used this rule. We see that the cases we predict as successes (cases with probabilities 0.5 or
larger) are in fact actual successes. The lift curve graphs the cumulative number of successes (after having sorted the cases according to their
predicted values in decreasing order) against the number of cases. The reference line expresses the performance of the naïve model.
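The construction of the lift curve described above can be sketched in a few lines of Python. The function name `lift_curve` and the sample data are hypothetical, and the reference line assumes that the naïve model selects cases at random, so that it finds successes in proportion to the number of cases examined.

```python
def lift_curve(probabilities, actuals):
    """Cumulative successes after sorting cases by predicted probability.

    Returns (model, baseline). model[k-1] is the number of actual successes
    among the k highest-scored cases. baseline[k-1] is the number a naive
    model picking cases at random would be expected to find in k cases.
    """
    ranked = sorted(
        zip(probabilities, actuals), key=lambda pair: pair[0], reverse=True
    )
    total_successes = sum(actuals)
    n = len(actuals)
    model, baseline, running = [], [], 0
    for k, (_, actual) in enumerate(ranked, start=1):
        running += actual
        model.append(running)
        baseline.append(total_successes * k / n)
    return model, baseline


# Hypothetical evaluation sample: predicted probability and actual outcome
# (1 = success, 0 = failure).
probs = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3]
actual = [1, 0, 1, 0, 0, 0, 1, 1]
model, baseline = lift_curve(probs, actual)
```

Plotting `model` and `baseline` against the number of cases gives the lift curve and its reference line. The further the model curve rises above the reference line in the early cases, the more useful the ranking is.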
Decision Trees and Random Forests are actually extremely good classifiers. While SVMs (Support Vector Machines) are seen as more
complex, that does not mean they will perform better. The paper "An Empirical Comparison of Supervised Learning Algorithms" by Caruana
and Niculescu-Mizil (2006) compared 10 different binary classifiers (SVM, Neural Networks, KNN, Logistic Regression, Naive Bayes, Random
Forests, Decision Trees, Bagged Decision Trees, Boosted Decision Trees, and Boosted Stumps) on eleven different data sets and compared the
results on 8 different performance metrics. They found that Boosted Decision Trees came in first, with Random Forests second and Bagged
Decision Trees third.
As Zaniewicz and Jaroszewicz (2013) put it, uplift modeling is a branch of machine learning which aims to predict the difference between
the class variable behavior in the treatment and control groups. Objects in the treatment group have been subject to some action, while objects
in the control group have not. By including the control group, it is possible to build a model which predicts the causal effect of the action
for a given individual. Their paper presents a variant of Support Vector Machines designed specifically for uplift modeling. The SVM
optimization task is reformulated to explicitly model the difference in class behavior between two datasets. The model predicts whether a
given object will have a positive, neutral, or negative response to a given action, and by tuning a parameter of the model the analyst is able
to influence the relative proportion of neutral predictions and thus the sensitivity of the model.
Traditional classification methods predict the conditional class probability distribution in a given dataset. Based on those predictions an
action is often taken on the classified individuals. This approach is, however, usually incorrect, especially in the case of marketing campaigns or
controlled medical trials. Standard classification methods are only able to model what happens after the action has been taken, not what
happens because of the action. The reason is that such models do not take into account what would have happened had the action not been
taken.
ROC charts: Compare and contrast DT and SVM.
The term ROC stands for Receiver Operating Characteristic. ROC curves were first employed in the study of discriminator systems for the
detection of radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor. The initial research was motivated by the
desire to determine how the US RADAR "receiver operators" had missed the Japanese aircraft.
ROC curves are frequently used to show graphically the trade-off between sensitivity and specificity in a binary classification problem.
In addition, the area under the ROC curve gives an idea of the benefit of using the model in question.
ROC curves are used in machine learning to choose the most appropriate model for predicting the data. The best cut-off has the highest true
positive rate together with the lowest false positive rate. The area under an ROC curve is a measure of the usefulness of a model in general:
the greater the area, the more adequately the model represents reality.
ROC curves are now also frequently used to show the connection between clinical sensitivity and specificity for every possible cut-off for a
test or a combination of tests. In addition, the area under the ROC curve gives an idea of the benefit of using the test(s) in question.
An ROC curve shows the relationship between model sensitivity and specificity for every possible cut-off. The ROC curve is a graph with:
The x-axis showing 1 – specificity (= false positive fraction = FP/(FP+TN))
The y-axis showing sensitivity (= true positive fraction = TP/(TP+FN))
Thus, every point on the ROC curve represents a chosen cut-off even though you cannot see this cut-off. What you can see is the true positive
fraction and the false positive fraction that you will get when you choose this cut-off.

                                           Process
                                           True Outcome                      False Outcome
Model   True Prediction for Outcome        True Positive Prediction (TP)     False Positive Prediction (FP)
        False Prediction for Outcome       False Negative Prediction (FN)    True Negative Prediction (TN)
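The two axis formulas above can be turned into a short Python sketch that computes one ROC point per cut-off and the area under the resulting curve. The function names and the sample scores are hypothetical. The sketch assumes that a case is predicted positive when its score is at or above the cut-off, and that the data contain at least one positive and one negative case.

```python
def roc_points(scores, labels):
    """Return (false positive fraction, true positive fraction) pairs.

    One point is produced per distinct cut-off, preceded by the origin.
    A case is predicted positive when its score is >= the cut-off.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        fn = positives - tp
        tn = negatives - fp
        # x = FP/(FP+TN) = 1 - specificity, y = TP/(TP+FN) = sensitivity
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical model scores and true outcomes (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

For this sample the curve runs from (0, 0) to (1, 1) and the area is 0.8125. An area of 0.5 would correspond to a model no better than chance, and 1.0 to a perfect ranking.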
An example of ROC curves comparing the classification performance of five machine learning methods, with an accompanying R script, is presented
by Heuristic Andrew (2009). It appears that SVM has the lowest false positive cut-offs.
Literature Sources
Caruana, R., Niculescu-Mizil, A. (2006): An Empirical Comparison of Supervised Learning Algorithms, Proceedings of the 23rd International
Conference on Machine Learning, Pittsburgh, PA.
Henschen, D. (2013): “MetLife Uses NoSQL for Customer Service Breakthrough.” Information Week.
Lander, J. P.: R for Everyone: Advanced Analytics and Graphics. [VitalSource Bookshelf]. Retrieved from
https://online.vitalsource.com/#/books/9781323582657/
Laudon, K. C., Laudon, J. P. (2015): Management Information Systems: Managing the Digital Firm, 15th Edition. Retrieved from
vbk://9781323187944
MongoDB (2018): A MongoDB White Paper: Top 5 Considerations When Evaluating NoSQL Databases,
https://webassets.mongodb.com/_com_assets/collateral/10gen_Top_5_NoSQL_Considerations.pdf
MongoDB (2018): A MongoDB White Paper: MongoDB Architecture Guide MongoDB 4.0.
Zaniewicz, L., Jaroszewicz, S. (2013): Support Vector Machines for Uplift Modeling, http://www.ipipan.waw.pl/~sj/pdf/upsvm.pdf