Chapter 2:
Data Understanding
Definition 2: Unstructured
DATA STRUCTURE
A data structure is a specialized format for organizing
and storing data. General data structure types include
the vector, the array, the matrix, the list, the data frame
and so on. Each data structure is designed to organize
data for a specific purpose so that it can be accessed
and worked with in appropriate ways.
2.2 Data Structure (Vector, Matrix, Data Frame, Array, Factor, List)
VECTOR
ARRAY
MATRIX
DATA FRAME
• Generated by combining multiple vectors such that
each vector becomes a separate column. The concept
of a data frame comes from the world of statistical
software used in empirical research; it generally refers
to "tabular" data: a data structure representing cases
(rows), each of which consists of a number of
observations or measurements (columns).
Alternatively, each row may be treated as a single
observation of multiple "variables". In either case, all
rows share one record type and every value within a
column shares one data type: a row ("record") may be
heterogeneous (a tuple of different types), while each
column must be homogeneous. Data frames usually
contain some metadata in addition to the data, for
example column and row names.
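The column-wise view described above can be sketched in plain Python, assuming a data frame is modelled as a dictionary of equal-length columns. This is only an illustration, not how any statistical package implements data frames: the column names and values are hypothetical, and `row` and `column_is_homogeneous` are helper names invented for this sketch.

```python
# A data frame modelled as a dict of columns (assumption: plain Python, no library).
# Column names and all values are hypothetical illustration data.
frame = {
    "name": ["Ali", "Mei", "Sam"],
    "age": [34, 67, 56],
    "smoker": [False, True, False],
}

def row(frame, i):
    """Return record i as a tuple: one value taken from each column."""
    return tuple(column[i] for column in frame.values())

def column_is_homogeneous(column):
    """True when every value in the column has the same type."""
    return len({type(value) for value in column}) <= 1

first = row(frame, 0)        # a record mixes types: str, int, bool
all_ok = all(column_is_homogeneous(c) for c in frame.values())
column_names = list(frame)   # metadata: the column names
```

Each record is a tuple of different types, while every column holds values of a single type, which matches the heterogeneous-row, homogeneous-column description.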
2.3 Level of Measurement
Measured data can be categorized into FOUR (4) levels: nominal, ordinal, interval and ratio.
2.4 Univariate Data versus Multivariate Data
Univariate Data:
• Data set that consists of a single variable
– Data structure for analysis: A vector
– Example: Age of 5 patients: [34, 67, 56, 30, 43]
Multivariate Data:
• Data set that consists of multiple variables
– Data structure for analysis: Matrix or data frame
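As a rough illustration of the distinction, the sketch below stores the patient ages as a vector and a two-variable data set as a matrix. It assumes plain Python lists stand in for both structures; the weight column is hypothetical, and `shape` is a helper name invented here.

```python
# Univariate: one variable, stored as a vector (here a Python list).
ages = [34, 67, 56, 30, 43]  # ages of 5 patients, from the example above

# Multivariate: several variables per patient, stored as a matrix
# (a list of rows). The second column, weight in kg, is hypothetical.
patients = [
    [34, 70.5],
    [67, 82.0],
    [56, 64.3],
    [30, 58.9],
    [43, 77.1],
]

def shape(matrix):
    """Return (number of rows, number of columns) of a rectangular matrix."""
    return len(matrix), len(matrix[0])

n_cases, n_variables = shape(patients)
age_column = [r[0] for r in patients]  # a single column is again a vector
```

Extracting one column from the multivariate data gives back a univariate vector, which is why the same vector-based summaries can be applied column by column.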
When you have a data set, you need to understand the data you are
using, including:
• What variables or attributes did we collect?
• How are those variables or attributes coded?
• What level of measurement does each variable or attribute have?
• What does each data point mean?
• What is the quality of the data?
• Does the data contain missing values?
• Is the data complete and relevant to solving the business problem?
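Some of these questions can be checked mechanically. The sketch below is one possible way to list the variables, the value types observed for each, and the number of missing values. It assumes records are Python dictionaries and that `None` marks a missing value; the records themselves and the `summarize` helper are hypothetical.

```python
# Hypothetical records; None marks a missing value (an assumption of this sketch).
records = [
    {"age": 34, "gender": "F"},
    {"age": None, "gender": "M"},
    {"age": 56, "gender": None},
    {"age": 30, "gender": "F"},
]

def summarize(records):
    """Per variable: the value types observed and the count of missing values."""
    summary = {}
    for record in records:
        for name, value in record.items():
            info = summary.setdefault(name, {"types": set(), "missing": 0})
            if value is None:
                info["missing"] += 1
            else:
                info["types"].add(type(value).__name__)
    return summary

report = summarize(records)
variables = sorted(report)  # which variables were collected
```

Questions about meaning, quality and business relevance still need domain knowledge; a summary like this only covers coding and completeness.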
2.5 Data Representation
Descriptive Statistics
• Measures of central tendency: Mean,
mode, median
• Measures of dispersion: Range, Variance,
Standard Deviation
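The measures listed above can be computed with Python's standard `statistics` module. The sketch below reuses the five patient ages from the univariate example and assumes they are a sample, so the variance divides by n - 1.

```python
import statistics

ages = [34, 67, 56, 30, 43]  # ages of 5 patients, from the univariate example

# Measures of central tendency
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
modes = statistics.multimode(ages)    # every age occurs once, so all are returned

# Measures of dispersion
age_range = max(ages) - min(ages)
variance = statistics.variance(ages)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)      # square root of the sample variance
```

If the five ages were the whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` would be the matching functions.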