UCS551

Chapter 2:
Data Understanding

AZLIN BINTI AHMAD (DR.)


EZZATUL AKMAL KAMARU-ZAMAN
SAYANG MOHD DENI (DR.)
NORSHAHIDA SHAADAN (DR.)
Ref: https://www.coursera.org/lecture/data-analytics-business/0-introduction-to-data-analysis-in-real-world-1TNLZ
Outline
DATA UNDERSTANDING
1. Data Type: structured and unstructured
2. Data Structure (vector, matrix, data frame, array, factor, list)
3. Level of Measurement
4. Univariate and Multivariate Data
5. Data representation with example
– Understanding data based on an example data set (i.e. a data frame)
2.1 Data Type: Structured And Unstructured

Definition 1: Structured Data

Information with a degree of organization that is readily searchable and quickly consolidated into facts.

Example: RDBMS (relational database management system), spreadsheet

Definition 2: Unstructured Data

Information that lacks structure, so searching it and consolidating it into facts with traditional mechanisms costs time and energy.

Example: email, documents, images, reports


2.2 Data Structure (Vector, Matrix, Data Frame, Array,
Factor, List)

DATA STRUCTURE
A data structure is a specialized format for organizing and storing data. General data structure types include the array, the vector, the matrix, the list, the data frame and so on. Each data structure is designed to organize data to suit a specific purpose so that it can be accessed and worked with in appropriate ways.

VECTOR

• A vector, in computing, is generally a one-dimensional array, typically storing numbers. Vectors typically have fixed sizes, unlike lists and queues. The vector data structure can be used to represent the mathematical vector used in linear algebra.

ARRAY

• An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called a one-dimensional array.
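
A minimal sketch of the "position from index tuple" idea, assuming a two-dimensional array stored row by row (row-major order) in one linear array and 0-based indices. The name flat_index and the sample values are illustrative only.

```python
def flat_index(i, j, n_cols):
    """Position of element (i, j) in a row-major flat array (0-based)."""
    return i * n_cols + j


# A 2 x 3 array stored as one linear (one-dimensional) array.
n_rows, n_cols = 2, 3
flat = [10, 20, 30,
        40, 50, 60]

# Element in row 1, column 2 (counting from 0).
value = flat[flat_index(1, 2, n_cols)]
print(value)  # 60
```

Other layouts (for example column by column) use a different formula, but the position is still computed directly from the indices.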

MATRIX

• A matrix is composed of numbers in two dimensions: rows and columns. It is similar to a data table composed only of numbers. A member is identified by two indices, one for rows and another for columns. ... Matrices are equal if they are of the same size and each corresponding member is equal.
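
The two-index access and the equality rule can be sketched as follows, assuming a matrix is held as a list of equally long rows. The helper names shape and matrices_equal are hypothetical.

```python
def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]) if m else 0)


def matrices_equal(a, b):
    """True if a and b have the same size and equal corresponding members."""
    if shape(a) != shape(b):
        return False
    return all(x == y
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[1, 2, 3],
     [4, 5, 6]]
C = [[1, 2],
     [3, 4]]

print(A[1][2])               # 6: row index 1, column index 2
print(matrices_equal(A, B))  # True
print(matrices_equal(A, C))  # False: different sizes
```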

DATA FRAME
• Generated by combining multiple vectors such that each vector becomes a separate column.
• The concept of a data frame comes from the world of statistical software used in empirical research; it generally refers to "tabular" data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns). Alternatively, each row may be treated as a single observation of multiple "variables".
• In any case, all rows share one data type and each column has its own single data type: the row ("record") type may be heterogeneous (a tuple of different types), while each column must be homogeneous.
• Data frames usually contain some metadata in addition to data; for example, column and row names.
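
A rough sketch of the idea using only plain Python, assuming a data frame can be approximated by a dictionary of equal-length column vectors. The column names and values are made up for illustration, and statistical software provides a dedicated data frame type instead.

```python
# Each column is a vector holding one data type (values are made up).
age = [34, 67, 56]              # whole numbers
gender = ["F", "M", "F"]        # text
weight_kg = [58.5, 72.0, 64.3]  # decimals

# Combine the vectors so that each one becomes a named column.
frame = {"age": age, "gender": gender, "weight_kg": weight_kg}

# All columns must have the same length (one entry per case).
assert len({len(col) for col in frame.values()}) == 1


def row(frame, i):
    """Return case i as a tuple, one value per column."""
    return tuple(col[i] for col in frame.values())


print(list(frame))    # column names (metadata): ['age', 'gender', 'weight_kg']
print(row(frame, 0))  # (34, 'F', 58.5): a row mixes types
```

Each column stays homogeneous (all ages are integers), while a row is a tuple of different types.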
2.3 Level of Measurement
Measured data can be categorized into FOUR (4) levels: nominal, ordinal, interval and ratio.

• The lowest level of measurement is nominal, while the highest is ratio. An analyst must know these levels well. Why? Different levels of measurement require different types of statistical analysis technique.
• Nominal data can be analysed with only a few techniques, while a wide range of statistical techniques can be used on ratio data.
• Suitable techniques for nominal and ordinal data are mode, count and frequency; for ordinal data the median is also meaningful.
• Interval and ratio data allow more flexible techniques for data analysis.
• Measures of central tendency and dispersion such as mean, mode, median, variance, range and standard deviation can all be used on interval and ratio data.
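
As a rough illustration of matching the technique to the level of measurement, the sketch below uses Python's standard library on made-up values. The middle category for ordinal data is included as a common additional summary. The variable names and the three-step satisfaction scale are assumptions.

```python
from collections import Counter
from statistics import mean, median_low, mode, stdev

# Nominal: categories with no order, so count frequencies and take the mode.
race = ["Malay", "Chinese", "Malay", "Indian", "Malay"]
frequency = Counter(race)
print(frequency["Malay"])  # 3
print(mode(race))          # Malay

# Ordinal: ordered categories; map each answer to its rank, then take the
# middle rank (median_low always returns a value present in the data).
levels = ["low", "medium", "high"]
satisfaction = ["low", "high", "medium", "high", "high"]
ranks = [levels.index(s) for s in satisfaction]
middle = levels[median_low(ranks)]
print(middle)  # high

# Ratio: numeric with a true zero, so mean and standard deviation apply.
weight_kg = [58.5, 72.0, 64.3, 80.1, 69.4]
print(round(mean(weight_kg), 2))
print(round(stdev(weight_kg), 2))
```

Taking a mean of the race labels would be meaningless, which is why the level of measurement has to be identified before choosing a technique.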

Nominal, Eg:
• Gender (male, female)
• Race (Malay, Chinese, Indian)
• Education level (SPM, STPM, Diploma, etc.)

Ordinal, Eg:
• Level of preference
• Level of importance
• Level of agreement
• Level of satisfaction

Interval, Eg:
• Temperature
• Shoe size
• Age range: (15-20), (21-25), etc.

Ratio, Eg:
• Age in years
• Expenses (RM)
• Sales (RM)
• Weight (kg)
• Height (cm)
2.4 Univariate Data versus Multivariate
Data

Univariate Data:
• Data set that consists of a single variable
– Data structure for analysis: a vector
– Example: Age of 5 patients: [34, 67, 56, 30, 43]
Multivariate Data:
• Data set that consists of multiple variables
– Data structure for analysis: matrix or data frame
– Example: Data set A consists of 4 variables: Age, Weight, Height, Gender
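
A small sketch of the difference, assuming the numeric variables of a multivariate data set are held as a matrix with one row per patient. Only the ages come from the example above. The weights and heights are made-up placeholder values, and the helper name column is hypothetical.

```python
# Univariate: a single variable stored as a vector (ages from the example).
ages = [34, 67, 56, 30, 43]

# Multivariate: one row per patient, one column per numeric variable
# (Age, Weight in kg, Height in cm). Weights and heights are made up.
columns = ["Age", "Weight", "Height"]
matrix = [
    [34, 60.2, 158],
    [67, 71.5, 172],
    [56, 66.0, 165],
    [30, 54.8, 160],
    [43, 80.3, 175],
]


def column(matrix, columns, name):
    """Pull one variable out of the matrix as a univariate vector."""
    j = columns.index(name)
    return [row[j] for row in matrix]


print(column(matrix, columns, "Age") == ages)  # True
```

A variable such as Gender is text rather than a number, so a data set that includes it fits a data frame better than a numeric matrix.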
2.5 Data Representation

When you have a data set, you need to know the data you are using, including:
• What variables or attributes did we collect?
• How are those variables or attributes coded?
• What level of measurement does each variable or attribute have?
• What does each data point mean?
• What is the quality of the data?
• Does the data contain missing values?
• Is the data complete and relevant to solving the business problem?
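
Some of the questions above can be answered mechanically. The sketch below assumes a small hypothetical data set stored as a list of records, with None marking a missing value. The function names are illustrative.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "gender": "F", "weight_kg": 58.5},
    {"age": None, "gender": "M", "weight_kg": 72.0},
    {"age": 56, "gender": None, "weight_kg": None},
]


def missing_counts(records):
    """Count missing (None) values per variable."""
    counts = {}
    for rec in records:
        for name, value in rec.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def variable_types(records):
    """List the Python types seen for each variable, ignoring missing values."""
    types = {}
    for rec in records:
        for name, value in rec.items():
            if value is not None:
                types.setdefault(name, set()).add(type(value).__name__)
    return types


print(missing_counts(records))  # {'age': 1, 'gender': 1, 'weight_kg': 1}
print(variable_types(records))
```

A variable that shows more than one type, or many missing values, is a sign that its coding and quality need a closer look before analysis.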

What do we do to understand data?

Data exploration is an approach that may help to understand data.

Data exploration might reveal unexpected information and properties such as:
• relative importance
• data distribution
• key attributes
• correlation and association among variables
• pattern of behavior
• outliers

Methods to explore data can be divided into two major categories:

Descriptive Statistics
• Measures of central tendency: Mean,
mode, median
• Measures of dispersion: Range, Variance,
Standard Deviation

Graphical Approach – Visualization
• Representing data as visual graphics
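
The descriptive statistics listed above can be computed with Python's standard statistics module. This sketch reuses the ages of 5 patients from section 2.4 and assumes the population versions of variance and standard deviation; the sample versions (variance, stdev) divide by n - 1 instead.

```python
from statistics import mean, median, multimode, pstdev, pvariance

ages = [34, 67, 56, 30, 43]

# Measures of central tendency
print(mean(ages))       # 46
print(median(ages))     # 43
print(multimode(ages))  # every age occurs once, so all five are returned

# Measures of dispersion
value_range = max(ages) - min(ages)
print(value_range)      # 37
print(pvariance(ages))  # population variance
print(pstdev(ages))     # population standard deviation
```

Because no age repeats, the mode says little here; it is more useful for categorical or heavily repeated values.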