
220CT Data and Information Retrieval

Short Notes
By: Salman Fazal

Contents

Normalisation
Big Data
Map/Reduce
Hadoop
NoSQL
Graph DB
MongoDB
Cassandra DB
Data Mining
Extras
- Big Data & Hadoop
- Clusters & Consistency

Normalisation
The process by which we efficiently organize data to achieve the following goals:

1. Eliminate redundancy (the same data stored in more than one place)
2. Organize data efficiently
3. Reduce data anomalies

Anomalies - inconsistencies in the stored data. These can arise when inserting, updating or deleting.
Eg. when a particular record is stored in many locations, each copy needs to be updated individually.

Normal Forms (3 levels)

*in order to achieve one level of normal form, each previous level must be met.

Item Colors Price Tax <- TASK: CONVERT THIS TABLE INTO
T-shirt Red, Blue 12 0.60 THIRD NORMAL FORM
Polo Red, Yellow 12 0.60
T-shirt Red, Blue 12 0.60
Shirt Blue, Black 25 1.25

First Normal Form

- Each record is unique (no repeating data)
- Each cell is atomic (contains only a single value)
- No repeating groups (multiple columns do not store similar info, eg. Child1, Child2)

Item     Colors   Price  Tax
T-shirt  Red      12     0.60
T-shirt  Blue     12     0.60
Polo     Red      12     0.60
Polo     Yellow   12     0.60
Shirt    Blue     25     1.25
Shirt    Black    25     1.25

- Duplicate records removed
- Colour column now contains a single value

Second Normal Form

- All attributes (non-key columns) must depend on the whole key, not just part of it
- Attributes that depend on only part of the key must be moved to a separate table and related by a foreign key

Item     Colors
T-shirt  Red
T-shirt  Blue
Polo     Red
Polo     Yellow
Shirt    Blue
Shirt    Black

Item     Price  Tax
T-shirt  12     0.60
Polo     12     0.60
Shirt    25     1.25

Price and tax depend on the item but not on the colour, so they are moved to a different table (related by the Item key).


Third Normal Form

- Eliminate fields that do not depend on the primary key
- If a column depends on another non-key column rather than on the primary key, it must be moved to another table

Item     Price
T-shirt  12
Polo     12
Shirt    25

Price  Tax
12     0.60
25     1.25

In the above tables, tax is dependent on price and not on the item, so a new table is created (and Tax is removed from the Item table).
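A minimal sketch of the resulting 3NF schema using Python's built-in sqlite3 module (the table and column names simply mirror the example above; this is only an illustration, not part of the original exercise):

import sqlite3

# In-memory database just to illustrate the 3NF schema derived above
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tax depends only on price, so it lives in its own table
cur.execute("CREATE TABLE tax (price REAL PRIMARY KEY, tax REAL)")
# Price depends on the item
cur.execute("CREATE TABLE item (item TEXT PRIMARY KEY, price REAL REFERENCES tax(price))")
# Colour depends on the (item, colour) pair, one row per colour
cur.execute("CREATE TABLE item_colour (item TEXT REFERENCES item(item), colour TEXT, "
            "PRIMARY KEY (item, colour))")

cur.executemany("INSERT INTO tax VALUES (?, ?)", [(12, 0.60), (25, 1.25)])
cur.executemany("INSERT INTO item VALUES (?, ?)", [("T-shirt", 12), ("Polo", 12), ("Shirt", 25)])
cur.executemany("INSERT INTO item_colour VALUES (?, ?)",
                [("T-shirt", "Red"), ("T-shirt", "Blue"), ("Polo", "Red"),
                 ("Polo", "Yellow"), ("Shirt", "Blue"), ("Shirt", "Black")])

# Reconstruct the original view with joins
for row in cur.execute("SELECT i.item, c.colour, i.price, t.tax "
                       "FROM item i JOIN item_colour c ON i.item = c.item "
                       "JOIN tax t ON i.price = t.price"):
    print(row)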

Extra

Mission No  Equipment                 Qty  Item Weight  Total Weight
ISS-2237    Portable Water Dispenser  2    100KG        200KG
ISS-2237    Flexible Airduct          6    0.5KG        3KG
ISS-2237    Small Storage/Rack        4    2KG          8KG
ISS-3664    Biofilter                 6    0.2KG        1.2KG
ISS-2356    Small Storage/Rack        3    2KG          6KG

In this table, the last column is Total Weight, which is calculated from the other columns in the same row (Qty x Item Weight).

When normalising, we need to eliminate the Total Weight column. Although total weight depends on the weight and quantity, the column is computed and can easily be calculated outside of the database. Therefore, the column does not belong in the database and must be discarded.


Big Data
Definition:

1. Big data refers to data sets that grow so large that it is difficult to capture, manage, store
and analyse with typical database software tools.
2. Huge volume of data that cannot be stored and processed using the traditional approach
within the given time-frame.

Types of Data:

- Structured Data - often referred to as data in a structured relational database, mostly organized in tabular format. Eg: SQL database.
- Unstructured Data - everything else. Eg: emails, video, audio, social networking.

90% of the data is unstructured!

Characteristics of Big Data (4 Vs):

1. Volume (size): describes the amount of data generated by organizations or individuals (the need to process terabytes of data).
2. Velocity (speed): describes the rate at which the data is generated (the need to analyse it quickly).
3. Variety (types): deals with different types of data coming in from different sources - structured + unstructured data.
4. Value: locating information within the data. Having access to big data is of no good unless we can turn it into value.

Meaningful data analysis

Analysing Big Data allows for better, faster and more profitable decisions from a business point of view. This is done using data that was not accessible, not available and not usable before.

i. Identify what you're looking for
ii. Prepare the data
iii. Explore the data
iv. Apply algorithms
v. Analyse results
vi. Repeat

This leads to:
- Discovering a relationship between pieces of data that would not have been possible to identify otherwise
- Discovering how and when behaviours occur (eg. can help a business create a new sales model)

Big Data QoS

- Availability
  o System remains operational on failing nodes (clients can read & write)
  o Ensures business continuity
- Reliability
  o High accuracy
  o Low accuracy puts the organisation at risk
- Flexibility
  o System is able to meet changing business demand
- Scalability
  o System's ability to meet growth requirements
- Performance
  o How quickly and efficiently a system runs
- Security
  o Protection against unauthorised access

*Scalability, tolerance, flexibility and efficiency benefit the user too.


Map/Reduce
Simply a way to take a big task and divide it into discrete tasks that can be done in parallel.
Cost effective and easy to use

Functionalities:
1. Split data into smaller chunks
2. Map data according to mapping key
3. Reduce and merge all related data

Split - Map - Shuffle/Sort - Reduce
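A minimal word-count sketch of these stages in plain Python (a toy simulation of the split/map/shuffle/reduce flow, not the Hadoop API; the input strings are made up):

from collections import defaultdict
from itertools import chain

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Split: each document (or page) is one chunk
def map_chunk(chunk):
    # Map: emit a (word, 1) pair for every word in the chunk
    return [(word, 1) for word in chunk.split()]

mapped = list(chain.from_iterable(map_chunk(doc) for doc in documents))

# Shuffle/Sort: group all emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine the values for each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}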

Pros:
- Simplicity
- Fault-tolerant
- Scalability

Cons:
- Restricted
- Does not provide a solution for graphs

Hadoop
Big Data (recap) - high-volume, high-velocity and high-variety data that demands cost-effective information processing for enhanced insight and decision-making.

Hadoop - a framework for parallel processing of large datasets distributed across clusters of nodes (computers). An open-source software implementation of MapReduce.

Cluster - multiple machines linked together by a high-speed LAN.

It focuses on the following:

1. Performance - supports processing of huge data sets in parallel within clusters
2. Economics - lowers costs by using commodity computing hardware (high-performance, low-cost machines)
3. Linearly scalable - more nodes can do more work within the same time
4. Fault-tolerance - node failure does not cause computational failure, as data is replicated

*Hadoop QoS -> Scalable, Tolerant, Flexible & Efficient (In big data section)


- Hadoop consists of 2 components: HDFS (storing data) and MapReduce (processing data)

- Word Counting example (see image in mapreduce section):

Counting the number of times each word is used in every book in Coventry University Library.
We would do the following (a streaming-style sketch follows the list):
1. Partition the texts (pages) and put each on a separate computer or computing
element/instance (think cloud).
2. Each computing element takes care of its portion
3. The word count is then combined
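A hedged sketch of how the same word count is often written for Hadoop Streaming: two small Python scripts that read from stdin and write to stdout (the file names mapper.py and reducer.py are just conventions, not part of the notes):

# mapper.py - reads raw text from stdin, emits one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py - receives the mapper output already sorted by word and sums the counts
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The shuffle/sort step can be simulated locally with a pipe between the two scripts, e.g. cat books.txt | python mapper.py | sort | python reducer.py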

Hadoop - The design

- Data is distributed around the network
  o Every node can host data
  o Data is replicated to support fault-tolerance
- Computation is sent to the data, not vice versa
  o Code to be run is sent to the nodes
  o Results of computations are combined
- Basic architecture is master/worker
  o Master (JobNode) launches the application
  o Workers (WorkerNodes) perform the computation

The Architecture (components)

1. Name Node (master):
   i. Keeps track of where the data is within the cluster
   ii. Executes operations (like opening, closing, renaming a file)
   iii. One per cluster
2. Data/Slave Node (worker):
   i. Stores the data and communicates with other nodes
   ii. One per node
3. Job Tracker:
   i. Central manager; schedules the MapReduce tasks to run
4. Task Tracker:
   i. Accepts & runs map, reduce and shuffle tasks


How Hadoop works (HDFS and MapReduce):

1. The MapReduce library splits the input files into pieces (64-256MB); the master assigns the tasks.
   o Blocks are distributed across nodes
   o Each input split is processed by one mapper (locally)
   o Splitting depends on the file format
2. Mapping tasks
   o Read the contents of the input, then parse it into key-value pairs
   o Apply the map operation to each pair
   o The locations of the intermediate files are forwarded to the master, which then forwards them to the reduce workers
3. Reduce
   o Fetch the input locations sent by the master
   o Sort the input by key
   o For each key, apply the reduce operation to the values associated with that key
   o Write the result to an output file, then return the file location to the master

Summary: During the map process, the master node instructs worker nodes to process their local input data. Hadoop then performs a shuffle, where each worker node passes its results to the appropriate reducer node. The master node collects the results from all reducers and compiles the answer to the overall query.

HDFS basics
- Files are split into fixed-size blocks and stored on nodes.
- Data blocks are replicated for fault-tolerance (default is 3 copies).
- The client talks to the NameNode for metadata (info about the filesystem, i.e. which DataNodes manage which blocks), and talks with the DataNodes directly for reads and writes.

Hadoop and fault tolerance


The bigger the cluster, the greater the chance of hardware failure (i.e. disk crashes, overheating).
What happens if...
- A worker fails:
  o The worker is marked failed if the master gets no response from it when pinged
  o Tasks assigned to the failed worker are added back to the task list for re-assignment; at this time HDFS ensures the data is still replicated
- The master fails:
  o The master writes checkpoints showing its progress
  o If the master fails, a new master can start from the previous checkpoint, so the job is restarted from that point

Replication
- 3 copies (default) are created (objectives: load-balancing, fast access & fault tolerance). The first is written to the same node, the second to a different node within the same rack, and the third to a node in another rack (see the sketch below).
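A small illustrative sketch of that placement policy in Python (not Hadoop's actual code; the cluster layout and node names are made up):

import random

# Hypothetical cluster layout: rack name -> nodes in that rack
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node, writer_rack):
    """Mimic the policy described above: 1st copy on the writer's node,
    2nd on another node in the same rack, 3rd on a node in another rack."""
    first = writer_node
    second = random.choice([n for n in cluster[writer_rack] if n != writer_node])
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    third = random.choice(cluster[remote_rack])
    return [first, second, third]

print(place_replicas("node2", "rack1"))  # e.g. ['node2', 'node3', 'node5']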


NoSQL
Not Only SQL.

NoSQL databases are geared toward managing large sets of data which come in huge variety and
velocity, often in distributed systems or the cloud.

CAP Theorem

A distributed data store can provide only two of the following three guarantees at the same time: Consistency, Availability and Partition tolerance (each is defined in the MongoDB section below).

NoSQL Family

1. Graph-family - elements structured as nodes and edges. Eg. Neo4j Graph DB
2. Document-family - elements stored in document-like structures. Each document has its own data and its own unique key, which is used to retrieve it. Eg. MongoDB
3. Column-family - stores data tables as columns rather than rows (therefore can have a very large number of columns). Eg. CassandraDB


RDBMS VS NoSQL

RDBMS:
- Can store only structured data
- Structured query language (SQL)
- Performance decreases with large volumes of data (joins required)
- Expensive hardware required for scaling
- Offers powerful queries such as joins and group by
- ACID - Atomic, Consistent, Isolated, Durable

NoSQL:
- Works with all kinds of data
- No predefined schema
- Can support huge volumes of data without affecting its performance
- Horizontally scalable; uses cheap commodity hardware
- Has no functionality for joins, as data is denormalised
- CAP - Consistent, Available & Partition-Tolerant


Graph DB
A database that uses graph structures with nodes, edges (relationships) & properties to
store and represent information.

- A graph is a collection of nodes (things) and edges (relationships). Both of these have
properties (in key-value pairs).

ER Model -> Graph Model

Tables -> Nodes + Edges
Rows -> Nodes
Columns -> Key-value pairs (properties)
Joins -> Edges

Nodes - instances of objects (entities). Eg. Billy is an instance of a user, Toyota of a car.
Relationships - connections between nodes. Each must have a name and a direction. This adds structure to the graph.

Features:

1. Flexible - can easily adapt to changes/additions, i.e. relationships and properties can be expanded and nodes tailored without affecting existing queries.
2. Speed - as the volume increases, traversal time stays roughly constant, unlike an RDBMS where speed depends on the total amount of data stored (as several joins may be required).
3. Agility - can effectively and rapidly respond to changes.
4. Schemaless - unstructured (not a tabular-type format).

Traversal

Navigating a graph (from a specific node to other nodes) along relationship edges. Traversal is bidirectional - it can follow incoming or outgoing edges.

Eg. Find my friends of friends => start at my node, navigate to a friend, then find that friend's friends.

Traversal can be of two types (see the sketch after this list):

- Depth-first: follow the first path to its end, then return and take the second path, and so on.
- Breadth-first: visit all nodes at the first depth, then move to the second depth, and so on.
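A minimal sketch of the two traversal orders on a toy adjacency-list graph in plain Python (not a graph database API; the friendship data is invented):

from collections import deque

# Toy social graph: person -> list of friends
graph = {
    "me": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def depth_first(start):
    visited, stack, order = set(), [start], []
    while stack:
        node = stack.pop()             # take the most recently added node
        if node not in visited:
            visited.add(node)
            order.append(node)
            stack.extend(reversed(graph[node]))
    return order

def breadth_first(start):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()         # take the earliest added node
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

print(depth_first("me"))    # ['me', 'alice', 'carol', 'bob', 'dave']
print(breadth_first("me"))  # ['me', 'alice', 'bob', 'carol', 'dave']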

Cypher - query language for graph databases. A declarative language (you specify what you want rather than how to achieve it).

Commands are built from clauses that match patterns of nodes and relationships.

Eg. CREATE (kev:Person {name: 'Kevin', age: 45})
    CREATE (beer:Drink {name: 'Beer', alcoholic: true})


MongoDB
An open-source, non-relational, document-family database that provides high performance, high availability and horizontal scalability.

- MongoDB hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs.
- MongoDB stores data on many nodes, which contain replicas of the data. Therefore:
  o Consistency - all replicas contain the same data; the client always has the same view of the data no matter which node it reads from.
  o Availability - the system remains operational on failing nodes (clients can read & write).
  o Partition tolerance - the system functions even if there is a communication breakdown between nodes.

MongoDB Architecture

A MongoDB server can host a number of databases; a database holds a set of collections; a collection holds a set of documents; a document is a set of key-value pairs.

RDBMS      MongoDB
Database   Database
Table      Collection
Row        Document
Column     Field

MongoDB Features

- Document-based - documents are stored in a JSON-like format
- Querying - supports dynamic querying that's nearly as powerful as SQL (see the sketch after this list)
- Replication and availability - provides redundancy and increases data availability with multiple copies of data on different database servers
- Horizontal scalability - easy to scale out on commodity hardware
- Supports map/reduce functionality - i.e. in a situation where you would use GROUP BY in SQL, map/reduce is the right tool in MongoDB
- Schemaless - non-relational; does not follow a fixed structure like relational databases. Can store any number and variety of key-value pairs in a document
- Scalable - replication and sharding
  o Replication - duplicates data across multiple nodes
  o Sharding - splits data across multiple machines/shards
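A short sketch of the document model and dynamic querying using the pymongo driver (assumes a MongoDB server on localhost; the database, collection and field names are just examples):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local mongod
db = client["shop"]                                  # database
products = db["products"]                            # collection

# Documents are just key-value structures; they need not share a schema
products.insert_one({"item": "T-shirt", "colours": ["Red", "Blue"], "price": 12})
products.insert_one({"item": "Shirt", "colours": ["Blue", "Black"], "price": 25, "sleeve": "long"})

# Dynamic querying: find documents matching a condition
for doc in products.find({"price": {"$lt": 20}}):
    print(doc["item"], doc["price"])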


When to use MongoDB

- When you need scalability and high availability
- Real time: it can analyse data within the database, giving results straight away
- If your data size will increase a lot and you will need to scale (by sharding)
- If you don't need too many joins on tables
- Particularly useful for storing unstructured data
- When your application is supposed to handle high insert loads

Sharding

- Sharding is the process of storing data records across multiple machines, and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide acceptable read and write throughput. Sharding solves the problem with horizontal scaling: you add more machines to support data growth and the demands of read and write operations.
- Sharding reduces the number of operations each node handles. Each node processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally.

I.e. to insert data, the application only needs to access the machine/shard responsible for that record (illustrated in the sketch below).
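A toy illustration of that routing idea - hashing a record's key to pick the responsible shard (a simplified sketch, not MongoDB's actual range/hashed sharding; the shard names are made up):

import hashlib

shards = ["shard-a", "shard-b", "shard-c"]   # hypothetical machines

def shard_for(key: str) -> str:
    """Pick the shard responsible for a record by hashing its shard key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# The application only talks to the shard that owns the record
for user_id in ["user-1", "user-2", "user-3"]:
    print(user_id, "->", shard_for(user_id))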

Benefits

- Splits workload - work is distributed amongst machines. This increases performance, as each machine has a smaller working set.
- Scaling - vertical scaling is too costly; sharding lets you add more machines to your cluster. This makes it possible to increase capacity without any downtime.

Replication

Process of duplicating data across multiple nodes. Provides redundancy and increases data
availability.

Why replication?

- To keep the data safe
- High availability (24/7)
- Disaster recovery
- No downtime for maintenance


Cassandra DB
A distributed, highly-scalable, fault-tolerant columnar database.

Column-family Database

- A column family is very similar to an RDBMS table: it consists of rows and columns.
- Rows are uniquely indexed by an ID (rowkey), and each row can have different columns, each of which has a name, a value and a timestamp (of when the data was last added/updated).
- Organised into families of related columns, while relational databases are organised into tables.
- Empty columns: if a row does not contain a value for a certain column, instead of giving it a null value (as in an RDBMS), the column is simply missing from that row.
- Denormalised: joins in a relational model are flexible, storage-efficient and elegant, but can also be very slow at run time, and they perform very poorly in a distributed data model. Cassandra has no joins, so denormalisation can be the answer.

Example (rows in a column family):

ID  Host Name  Discovery Method  Orbital Period  Timestamp
1   11 Com     Radial Velocity   326.03          2016-02-12 11:32:00
2   2MASS      Imaging                           2016-11-12 18:05:09

(Row 2 simply has no Orbital Period column.)

Cassandra Architecture:

- The Cassandra cluster is pictured as a ring in which nodes communicate and exchange information with other nodes.

How do writes and reads operate?

- Cassandra has a masterless architecture, meaning that at any point the client can connect to any node; the node the client is connected to takes charge and forwards and replicates the data to the other appropriate nodes.
- When reading data, the client supplies a rowkey, and the node the client is connected to determines the latest-version replica using that rowkey.
- Peer-to-peer replication: no master, no slaves, no single point of failure!


Key Cassandra Features and Benefits:

1. Flexible schema - with Cassandra it isn't necessary to decide what fields your records will need beforehand; you can add/remove fields on the fly. For massive databases, this is an incredible efficiency boost.
2. Scalability - you can add more hardware (nodes) as the amount of data increases. This also increases performance, as more nodes can do more work within the same time.
3. Fault-tolerant - in NoSQL databases (specifically Cassandra), data is replicated to multiple nodes, so a node failure will not cause any downtime or computational failure.
   Replication - 3 copies of the same data are created on different nodes. If a node fails, its data is replicated again to another node. Other objectives of replication are load-balancing and fast access.
4. Flexible data storage - Cassandra can store all data types; these could be structured, semi-structured or unstructured.
5. Fast reads and writes - with linear scalability, Cassandra can perform extremely fast writes without affecting its read efficiency.
6. Query language - an SQL-like language (CQL) that makes moving from a relational database very easy (see the sketch below).
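A brief sketch using the DataStax Python driver (cassandra-driver), assuming a Cassandra node running locally; the keyspace, table and column names are illustrative:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # connect to any node in the ring
session = cluster.connect()

# CQL looks very much like SQL
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.planets (
        id int PRIMARY KEY,
        host_name text,
        discovery_method text,
        orbital_period double
    )
""")

# id is the rowkey; orbital_period may simply be absent for some rows
session.execute(
    "INSERT INTO demo.planets (id, host_name, discovery_method, orbital_period) "
    "VALUES (%s, %s, %s, %s)",
    (1, "11 Com", "Radial Velocity", 326.03),
)

for row in session.execute("SELECT host_name, discovery_method FROM demo.planets"):
    print(row.host_name, row.discovery_method)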
Extra (How Cassandra retrieves data): (part of the NASA Exoplanet dataset)

- In a traditional row-oriented database (first picture), data is retrieved row by row, reading from left to right. In the picture, although we only need data from two columns, we end up reading each entire row and then keeping just the required columns.
- Now, if we took the entire dataset with thousands of rows, getting all the data could take a while! This is where a columnar database can be very effective. Looking at the second picture, instead of reading every row, we just read the required column from top to bottom, along with the rowkey column; in this case we needed the first 16 rows.

This method is much more effective and gives much better performance when running large numbers of queries!


Data Mining
- Data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarising it into useful information.
- Simple terms: data mining refers to extracting knowledge from large amounts of data.
- The information can be used for any application purpose, such as to increase revenue, cut costs, make forecasts, etc.

Why do we need it?

- Too much data and too little information. There is a need to extract useful information from the data and to interpret it.
- Data mining helps discover relationships between two or more variables in your data. This can help create new opportunities (i.e. for businesses) by:
  o Predicting trends and behaviours
  o Discovering previously unknown or hidden patterns
- The tasks of data mining are twofold:
  o Predictive - using features to predict unknown or future values of the same or other features
  o Descriptive - finding interesting, human-interpretable patterns that describe the data

Data Warehousing is a process of combining data from multiple sources into one common
repository (dataset). Data Mining is a process of finding patterns in a given dataset.

Problems with data mining:

- Individual privacy - analyses routine behaviour and gathers a significant amount of information.
- Data integrity - inaccurate, conflicting or out-of-date data from different sources.
- Cost
- Efficiency & scalability - data mining algorithms must be able to work with masses of data.

Data Mining Process

Steps:

1. Understanding the problem and what we are trying to achieve.
2. Setting up a data source. Here we collect the historical data, then put it into a structured form (dataset) so it can be used in the next step.
3. In this step the model is built from the dataset and turned into a predictive model. The results are then tested and evaluated to get the best and most accurate results.
4. Here we apply the model and combine the feedback and findings on new incoming examples.


Data Mining Tasks/Methods

- Classification [predictive] - categorising; the process by which ideas and objects are recognised, differentiated and understood.
- Clustering [descriptive] - grouping the data into more than one group based on similarity.
  o For example, news can be clustered into different groups: entertainment, politics, national and world news.
- Association [descriptive] - identifies relationships between events that occur at one time.
- Sequencing [descriptive] - identifies relationships that exist over a period of time.
- Forecasting - the process of making predictions of the future based on past and present data and analysis of trends.
- Regression [predictive] - a statistical process for estimating the relationships among variables.
- Time series analysis - examines a value as it varies over time.

Data Mining can help in:

- fraud detection
- marketing campaigns
- detecting diseases
- scientific experiments
- weather prediction
- studying consumers

Build Model - Decision Tree (Classification)

A decision tree can be used as a model for sequential decision problems under uncertainty (a small example follows the pros/cons below).

Pros
- easy to interpret
- easy to construct
- can handle a large number of features
- very fast at testing time

Cons
- low predictive accuracy
- not possible to predict beyond the min and max limits of the training data
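A tiny classification sketch with scikit-learn's DecisionTreeClassifier (the toy weather data is made up purely for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [temperature_C, humidity_%] -> play tennis? (1 = yes)
X = [[30, 85], [27, 90], [21, 70], [18, 65], [25, 80], [15, 60]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

print(export_text(model, feature_names=["temperature", "humidity"]))
print(model.predict([[20, 68]]))   # classify a new, unseen example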


Build Model - SOM (Clustering)

Self-Organising Map. Train the map using examples from the data set. Used for clustering data without knowing the classes of the input data (a small example follows below).

Pros:
- No need to specify classes
- Can visualise data
- Can identify new relationships

Cons:
- Difficult to understand the decision
- Training gives a different map each time
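A short clustering sketch using the third-party minisom package, one common Python SOM implementation (grid size and data are arbitrary):

import numpy as np
from minisom import MiniSom

# Random 2-D points standing in for real feature vectors
rng = np.random.default_rng(0)
data = rng.random((100, 2))

# A 4x4 map of neurons, each with a 2-dimensional weight vector
som = MiniSom(4, 4, input_len=2, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, num_iteration=500)

# Each sample is assigned to its best-matching unit (its cluster on the map)
for sample in data[:5]:
    print(sample, "->", som.winner(sample))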

Tools for performing data mining

1. SAS Enterprise Miner - reorganises the data mining process to create highly accurate predictive and descriptive models.

   Benefits:
   o Supports the entire data mining process
   o Builds more models faster
   o Enhances the accuracy of predictions

2. WEKA - a collection of data mining tools (pre-processing data, classification, clustering, association).

Data pre-processing

- High-quality data mining needs data that is useful; to achieve this we need to perform some preprocessing on the data. This combines data cleaning, data integration and data transformation.
- Data quality issues can be expensive and time-consuming to overcome.

Why data quality?

- Cost saving, increased efficiency, reduction of risk/fraud, enables more informed decisions.

Measures for data quality:

- Accuracy: is the data accurate or not?
- Completeness: is the data complete, or unavailable?
- Consistency: have some entries been modified while others have not?
- Timeliness: is the data updated in a timely manner?
- Reliability: is the data trustworthy?
- Interpretability: can the data easily be understood?

Data cleaning - fill in missing values, smooth noisy data, correct incorrect values.

Data integration - combination of multiple data sources.

Data transformation - techniques to transform data (e.g. normalisation); see the sketch below.

Data reduction - techniques applied to obtain a reduced representation of the data that is much smaller in volume, yet very similar to the original data.
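A small cleaning/transformation sketch with pandas (the columns and values are made up; min-max scaling is just one possible normalisation):

import pandas as pd

# Toy dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 41, 33],
    "income": [28000, 52000, None, 39000],
})

# Data cleaning: fill in missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Data transformation: min-max normalisation to the range [0, 1]
normalised = (df - df.min()) / (df.max() - df.min())
print(normalised)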


EXTRAS
HADOOP AND BIG DATA

Why big data?

- data growth is HUGE
- all that data is valuable
- disk is cheap, BUT:
  - it won't fit on a single computer
  - so, it needs to be distributed across thousands of nodes
  - the good side is, distributed data = faster computation when run in parallel
    e.g. 1 HDD = 100MB/sec, 100 HDDs = 10GB/sec

Hadoop has 2 components:

- HDFS - allows storing huge amounts of data in a distributed manner
- MapReduce - allows processing the huge data in a parallel manner

HDFS

HDFS architecture
- files stored in blocks (64-256MB)
- provides reliability through replication

HDFS file storage
- NameNode (master) = stores all metadata (filenames, location of blocks on the DataNodes)
- DataNode (slave) = stores file contents as blocks. Blocks are replicated. Periodically sends reports of its existing blocks to the NameNode.
- Clients ask the NameNode for metadata, then talk directly with the DataNodes for reads and writes

Failures:
- DataNode - marked failed if no report/heartbeat is sent to the NameNode. The NameNode replicates the lost blocks to other nodes.


- NameNode - a new or backup master takes over. The NameNode keeps checkpoints, so the new master starts from the previous checkpoint.

Replication:
3 copies are created:
- first on the same node
- second on a different node within the same rack
- third on a node in another rack

MAPREDUCE
- 2 stages:
1. Map stage - split data into smaller chunks and map them into key/value pairs
2. Reduce stage - sort/shuffle by key, then output the combined results

MapReduce task scheduling
- JobTracker = schedules the tasks to run (on the slaves)
- TaskTracker = executes the tasks (from the master)
*task = map/reduce

Steps:
Input data, Split, Map, Shuffle, Reduce, Output results.

How Hadoop Works?

- Input Split
  o Input is split into blocks and distributed across the nodes (HDFS).
- Mapper
  o The JobTracker retrieves the input splits from HDFS.
  o The JobTracker initiates the mapper phase on available TaskTrackers.
  o Once the assigned TaskTrackers are done with mapping, they send their status to the JobTracker.
- Reduce
  o The JobTracker initiates the sort/shuffle phase on the mapper outputs.
  o Once completed, the JobTracker initiates the reduce operation on the results from the TaskTrackers.
  o The TaskTrackers send the output back to the JobTracker once the reduce is complete. The JobTracker then sends the output report back to the client.

CLUSTER DATABASES

Why run databases on clusters?

The traditional model runs on one big machine, so there is a single point of failure if the machine, storage or network goes down. It is also difficult to scale up, as you would need to buy a whole new machine (server); this is too costly and not flexible.
To resolve this, we use a cluster. A cluster combines several racks, each of which contains several machines/nodes. Flexibility is achieved as data is replicated, meaning we won't need separate backups because the data is always available. There is also no single point of failure, as data is replicated on at least 2 nodes. If scaling out is required, just add more nodes to the cluster. Cheaper and more flexible.


Types of replication
- Synchronous - all replicas are updated on every write. All nodes are always up to date.
- Asynchronous - writes the data as soon as possible, but reads could be out of date. Eventual consistency.

Consistency
In relational databases, ACID consistency maintains data integrity. In NoSQL, consistency refers to whether or not reads reflect previous writes.

- Strict consistency - a read is guaranteed to return up-to-date data.
- Eventual consistency (MongoDB uses this) - read data may be stale, but writes are very quick. This provides high performance.

Inconsistencies occur if two replicas are updated at the same time, or a read is made from one machine while it has still not been updated.
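As one concrete illustration, MongoDB lets the client choose this trade-off per collection through write/read concerns - a hedged pymongo sketch (assumes a local replica set; the database and collection names are examples):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017")   # assumes a local replica set
db = client["shop"]

# Fast, eventually-consistent style: acknowledge after one node has the write
fast = db.get_collection("orders", write_concern=WriteConcern(w=1))

# Stricter style: wait until a majority of replicas have the write,
# and read only majority-committed data
strict = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)

fast.insert_one({"order": 1})
strict.insert_one({"order": 2})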
