- Icebreaker questions
- About data
- Data cleaning
- Data manipulation
- Analysis of data spread
- Data visualization
- Key Terms in Data Science
Icebreaker questions
1. Describe an analytics project you recently completed?
The answer could be in the context of the data visualization or machine learning domain.
Explain the objective of the study, the data you used, and the final outcome.
A clear objective allows us to define the business or research problem more concisely and helps scope the analysis.
3. Name the objectives of a couple of Analytics projects you have worked on?
Reporting
R programming, Python, Tableau, SQL, Excel, and related tools. Be sure to specify which tool you used for which tasks.
6. What were the outcomes of the Analytics projects you worked on?
Describe any key decisions you were able to make using data-driven insights.
7. What mistakes did you make when working on an analytics project, and how did you
improve on them?
You can describe mistakes as learning steps: list the mistakes you made and explain how
you corrected them in later projects.
About data
8. How did you determine which data is needed for your study or project?
The data we collect should be sufficient to derive the predicted outcome or to describe the past event under study.
9. How did you collect the data? Describe the sources of the data?
12. Give an example of Nominal, Ordinal, Interval and Ratio data types
Nominal – gender, eye color; Ordinal – satisfaction ratings (low, medium, high); Interval – temperature in Celsius; Ratio – height, weight, number of houses
Data mining, Web scraping, Survey data collected for the purpose of the study
Extracting structured and unstructured data from web pages to gain competitive
intelligence
17. What is the difference between primary and secondary data sources?
Primary data addresses the specific problem we are studying and is collected first-hand
for that purpose. Secondary data might give useful facts concerning our study objective,
but its quality and collection method cannot always be verified.
18. Give an example on when you would use Primary data and Secondary data?
Running your own survey on mobile phone usage is an example of a primary source;
collecting external survey results on mobile phone usage is an example of a secondary
source.
19. Explain how you would collect data in order to prove your hypothesis?
We first define a hypothesis that rejects or supports an existing belief. We can then collect
data capable of testing that hypothesis.
Structured or unstructured data generated at a scale which impacts volume, velocity, and variety.
Real-time service providers, such as ride-hailing platforms, need to analyze real-time, unstructured, high-velocity data.
22. Explain the differences between small data and big data?
Small data is mostly static, with limited growth in volume and variety. Big data exhibits
rapid growth in both volume and variety, and the data is generated at a much faster rate.
23. A company is generating 1GB of data a day from various IoT devices. Is this data big
data?
It depends on context: is 1 GB too much data to store and analyze in a day with the
available infrastructure? In the 1990s you could have called it a big data problem, due to
limited infrastructure for storage and processing; today it usually is not.
It can be analyzed using machine learning algorithms and packages designed to process data at this scale.
We perform EDA to understand past trends and gain insights into predictor and outcome
variables. It should give you important insights into the structure of the data set.
26. During what stages of a data analytics or data science projects do you perform exploratory
data analysis?
27. What kind of business insights does EDA provide? Give an example
Data cleaning
Identify the types of data you are cleaning, and identify the range of values you expect for each attribute.
29. Which tools did you use to clean the data?
Check the collection process and see whether the data collected pertains to the problem
you are trying to solve.
Text can be cleaned using Excel, where you can split the numerical and text
sub-components of values into separate columns.
Duplicate data can influence outcomes, so these values are removed before data
analysis. Care must be taken to confirm that rows flagged as duplicates do not actually
carry unique ID values.
37. What measures can you take to prevent collecting missing data values?
Survey design and data collection have to be properly standardized so that all required
observations are duly recorded.
Data Manipulation
38. How are outliers identified?
For any data distribution, any data value which lies far away from the median can be
examined as a potential outlier
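As an illustration, the common 1.5 × IQR rule can flag such values. The data below and the quantile interpolation convention are assumptions for this sketch, not taken from the text:

```python
# Hypothetical data with one injected extreme value (250).
data = [12, 14, 15, 15, 16, 18, 19, 21, 250]

def quartile(sorted_vals, q):
    # Linear-interpolation quantile; one common convention among several.
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (idx - lo) * (sorted_vals[hi] - sorted_vals[lo])

vals = sorted(data)
q1, q3 = quartile(vals, 0.25), quartile(vals, 0.75)
iqr = q3 - q1
outliers = [v for v in vals if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)  # the extreme value is flagged
```

Any value more than 1.5 × IQR beyond the quartiles is treated as a potential outlier and examined further.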
Salaries of highest-paid players in any sports team might skew the salary distribution and
change the average values
When proving a hypothesis, we reject the null hypothesis or the status quo. Valid outliers
could help us reject the null hypothesis
If outliers are invalid, they can be excluded from our analysis. The presence of valid
outliers has to be acknowledged, and values can be standardized before analysis.
Missing values can be substituted with the median value of the attribute in which they
occur. However, the type of imputation carried out needs to be done on the basis of an
understanding of the dataset and its variables.
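A minimal sketch of median imputation, assuming missing values are marked with `None` (the ages are hypothetical):

```python
import statistics

# Hypothetical ages; None marks a missing observation (an assumption of this sketch).
ages = [23, 25, None, 31, 29, None, 27]
observed = [a for a in ages if a is not None]
med = statistics.median(observed)            # median of the observed values
imputed = [a if a is not None else med for a in ages]
print(imputed)
```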
43. What methods are used to deal with missing value imputation?
44. Why do we normalize data?
For numerical variables, normalization rescales values so that they fall within a specified
range while maintaining the differences between them.
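Min-max normalization, one common scheme, can be sketched as follows (the heights are hypothetical):

```python
# Hypothetical heights in centimeters.
heights_cm = [150, 160, 170, 180, 190]
lo, hi = min(heights_cm), max(heights_cm)
# Min-max normalization: rescale each value into [0, 1], preserving relative differences.
normalized = [(h - lo) / (hi - lo) for h in heights_cm]
print(normalized)
```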
46. What is metadata? How is it useful for gaining insights, give an example?
Metadata provides information about other data and is useful for tagging unstructured
data formats.
In an outcome variable, if certain classes of data (e.g. Yes & No observations) are
underrepresented, then we can assume a class imbalance.
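A quick imbalance check is simply counting the classes; the labels below are hypothetical:

```python
from collections import Counter

# Hypothetical outcome labels, heavily skewed toward "No".
labels = ["No"] * 95 + ["Yes"] * 5
counts = Counter(labels)
minority_share = counts["Yes"] / len(labels)  # 5% minority class
print(counts["No"], counts["Yes"], minority_share)
```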
All of the above terms describe the distribution of the data: the center indicates the median,
the mean, or the mode of the distribution, and the shape indicates the skewness and
modality. The mode indicates the most common value in a data set. Unimodal and
multimodal distributions have one peak and more than one peak, respectively. A standard
normal distribution is unimodal, with a mean of zero and the most common value at the center.
50. What is kurtosis and what does it imply in terms of deriving insights?
Data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails and few outliers.
The distribution of a statistical data set (or a population) is a listing or function showing all
the possible values (or intervals) of the data and how often they occur
A distribution is skewed if one of its tails is longer than the other. The mean is higher than the median in a positively (right-) skewed distribution.
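The relationship between mean and median under skew can be checked directly on a small hypothetical salary sample:

```python
import statistics

# Hypothetical right-skewed salaries: one very large value stretches the right tail.
salaries = [30, 32, 35, 36, 38, 40, 300]
mean = statistics.mean(salaries)
median = statistics.median(salaries)
print(mean, median, mean > median)  # the mean is pulled above the median
```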
53. Can you guess about the distribution and skewness of data for salaries of IPL Players?
The highest- and lowest-paid players' salary values could create a skew in the salary
distribution; a few very large salaries typically produce a right skew.
54. How do you summarize categorical data?
Categorical variables with different levels can be summarized using count functions or their relative frequencies (proportions).
Most commonly we use Mean, Median and Mode to measure the centrality of data
Mean defines the average value, Median defines the 50th percentile value, Mode defines
the most common value
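Python's standard `statistics` module computes all three measures of centrality; the data is illustrative:

```python
import statistics

# Illustrative observations.
data = [2, 3, 3, 4, 5, 5, 5, 9]
print(statistics.mean(data))    # average value
print(statistics.median(data))  # 50th percentile value
print(statistics.mode(data))    # most common value
```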
The median, since it reflects the 50th percentile value and is not influenced by outliers;
outliers skew the mean value.
58. Explain why the median is the more robust measure
The median, since it reflects the 50th percentile value and is not influenced by outliers;
outliers skew the mean value.
When the sample contains no outliers the mean provides a better measure of central
tendency
Standard deviation and variance allow us to determine the variability of the data.
e.g. the age variable of students in class ten will have less variability, than the age variable
of students in an executive MBA program
Variance measures how far a set of numbers is spread out from their average value.
Standard deviation measures the dispersion of the dataset relative to its mean and is
calculated as the square root of the variance.
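A sketch of the variance/standard-deviation relationship using the standard `statistics` module (the ages are hypothetical):

```python
import statistics

# Hypothetical ages of students in a single class: low variability.
ages = [25, 26, 27, 28, 29]
var = statistics.pvariance(ages)  # population variance
sd = statistics.pstdev(ages)      # population standard deviation = sqrt(variance)
print(var, sd)
```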
63. Explain how standard deviation values are interpreted using an example
A small standard deviation can be a goal in certain situations where the results are
restricted, for example, in product manufacturing and quality control. A particular type of
car part that has to be two centimeters in diameter to fit properly had better not have a very
big standard deviation during the manufacturing process. A big standard deviation, in this
case, would mean that lots of parts end up in the trash because they don’t fit right; either
that or the cars will have problems down the road. (Source: dummies)
Correlation shows how strongly pairs of variables are related. e.g. height and weight
Variables can exhibit a certain relationship
In univariate analysis, we focus on a single variable and try to understand its measures of
central tendency and dispersion.
In multivariate analysis, we try to understand the relationships between variables and see
whether they have a positive or negative influence on each other, which can be useful in
determining predictor and response variables.
Also called the bell curve occurs naturally in many situations. For example, the bell curve
is seen in tests like the GRE. The bell curve is symmetrical. Half of the data will fall to the
left of the mean; half will fall to the right. (Source: Statistics How To)
To extrapolate is to estimate a value outside the known range of a data set, inferred from
the existing information. Interpolation is an estimation of a value between two known
values in a sequence of values.
68. What is the difference between the sample data and the population from which it is
obtained?
Sample data, when collected randomly, should contain information about the whole
population, from which we can draw inferences. Collecting samples is more feasible than
a full census in real-world statistical studies.
The power of any test of statistical significance is defined as the probability that it will reject
a false null hypothesis
E.g. to determine voter preferences for a political party, we randomly select people
from a voter list and then collect opinions. Non-random selection of the sample will lead
to bias.
Cluster sampling breaks the population down into clusters, while systematic
sampling uses fixed intervals from the larger population to create the sample
72. How will you assess the statistical significance of an insight whether it is a real insight or
just by chance?
Statistical significance is the probability of obtaining an observation at least as extreme as
the one found, assuming the null hypothesis is true. An arbitrary but common convention
is to reject the null hypothesis if p < 0.05. The presence of invalid outliers should be
carefully examined before rejecting the null hypothesis.
In four steps: 1. State your hypothesis. 2. Formulate your analysis plan. 3. Collect data.
4. Analyze the data and calculate statistical significance.
74. What is a z-score?
It is a measure of how many standard deviations below or above the population mean a
raw score is. Z-scores allow us to standardize all raw observations in a data set and to
better compare variables measured on different scales.
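Z-scores are computed by subtracting the mean and dividing by the standard deviation; the exam scores below are hypothetical:

```python
import statistics

# Hypothetical raw exam scores.
scores = [60, 70, 80, 90, 100]
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)
# z-score: how many standard deviations a raw score lies from the mean.
z = [(s - mu) / sigma for s in scores]
print([round(v, 2) for v in z])
```

After standardization, the z-scores always have mean 0 and standard deviation 1.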
A p-value helps determine the significance of our statistical analysis. The p-value is a
number between 0 and 1 and interpreted in the following way. A small p-value
(typically ≤ 0.05) indicates strong evidence against the null hypothesis, so we can reject
the null hypothesis.
76. What are the usual significance criteria you set for the p-value?
Typical criteria are p ≤ 0.01, p ≤ 0.05, or p ≤ 0.10. The criterion chosen for the p-value is
specific to the business domain; hypothesis testing in the healthcare sector might require a
stricter significance criterion, such as p ≤ 0.01, though the definition of 'stringent' depends
on how the hypothesis is being tested. The p-value can also be described as the
probability of rejecting the null hypothesis when it is actually true (a Type I error).
The null hypothesis (H0) represents the status quo or an existing belief, e.g. the average
summer temperature in the Himalayas does not cross 30 °C.
The alternative hypothesis (Ha) challenges the existing belief; in this case, we state that
the average summer temperature in the Himalayas is above 30 °C.
79. What happens when you fail to reject the null hypothesis? Give an example?
80. Explain the consequences of rejecting the null hypothesis if in case it’s valid?
We make a wrong decision, e.g. diagnosing cancer in a patient who is in reality
healthy (a Type I error).
81. In case of low sample size which test would you perform to evaluate significance?
We use Student's t-test when the sample size is less than about 30 observations.
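For illustration, the one-sample t statistic can be computed by hand; the sample and the H0 mean of 30 are assumptions for this sketch, and in practice |t| is compared against a critical value for n − 1 degrees of freedom:

```python
import statistics

# Hypothetical small sample; H0: the population mean temperature is 30.
sample = [28.5, 29.1, 30.2, 27.8, 29.5, 28.9, 30.1, 29.3]
n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
t = (mean - 30) / (s / n ** 0.5)      # one-sample t statistic
print(round(t, 3))
```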
82. What are one-tailed and two-tailed tests? Give examples.
When we expand our hypothesis on average summer temperature in the Himalayas to
state that the temperature could be either less than or greater than the 30 °C value in the
null hypothesis, we are testing for deviation in both directions and therefore perform a
two-tailed test. If we test only whether the temperature is above 30 °C, a one-tailed test is
appropriate.
A random variable is a variable whose value is unknown or a function that assigns values
to each of an experiment's outcomes. Random variables are often designated by letters
and can be classified as discrete, which are variables that have specific values, or
continuous, which are variables that can have any values within a continuous range
(Source: Investopedia)
85. What is the correlation coefficient value? How does it signify the relationship
between variables?
The correlation coefficient, represented by the symbol r, measures the strength and
direction of a linear relationship between two variables on a scatterplot. The value
of r is always between +1 and −1.
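A hand-rolled Pearson r matching the definition above; the height/weight pairs are hypothetical:

```python
# Pearson correlation coefficient, computed from first principles.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

heights = [150, 160, 170, 180, 190]  # hypothetical data
weights = [50, 58, 65, 74, 80]
r = pearson_r(heights, weights)
print(round(r, 3))  # close to +1: strong positive linear relationship
```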
86. What is the covariance? How is it different from correlation coefficient values?
Covariance is a measure of how much two random variables vary together. It’s
similar to variance, but where variance tells you how a single variable varies,
covariance tells you how two variables vary together. A large covariance could
mean a strong relationship exists between variables.
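Sample covariance can be sketched in a few lines; the data is illustrative:

```python
def sample_covariance(xs, ys):
    # Average co-deviation from the means, with an n - 1 denominator.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Illustrative data: ys rises with xs, so the covariance is positive.
cov_xy = sample_covariance([1, 2, 3, 4], [2, 4, 6, 8])
print(cov_xy)
```

Unlike the correlation coefficient, covariance is not bounded to [−1, +1], which is why its magnitude alone is hard to interpret.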
87. Does correlation imply causation?
No. Two variables can be strongly correlated without one causing the other; a confounding variable or coincidence may explain the association.
Data visualization
The representation of data and information in the form of a chart, diagram, picture
It is easier to perceive information through visuals, especially when we are dealing with
a large number of observations.
Define business objective → Collect data → Clean and process data → EDA →
Communicate insights through reports
It allows gaining basic insights into trends and identifying predictor and response variables.
94. How do you plot two quantitative variables against each other?
We use a scatter plot to plot two quantitative variables against each other
96. Which plot would you use to analyze the distribution of a data points in a single
variable?
The whiskers represent the minimum and maximum observed values of the variable (in
many conventions, the most extreme values within 1.5 × IQR of the quartiles).
Values lying far away from the median, beyond the reach of the whiskers, are
considered outlier values.
Difference between 75th percentile value and the 25th percentile value
100. What other statistical measures can be obtained using box plots
Median, Range
101. Explain when would you use a Stem and Leaf plot?
Key Data Science Terms
Accuracy
Accuracy is the fraction of predictions that a classification model got right:
Accuracy = (number of correct predictions / total number of examples), in the multi-class
classification setting.
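The formula above can be sketched directly; the true labels and predictions are hypothetical:

```python
# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)   # correct predictions / total examples
print(accuracy)
```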
Activation function
It is a function that generates and passes an output value (usually nonlinear) to the next
layer, taking in the weighted sum of all the inputs from the previous layer. ‘ReLU’ and
‘Sigmoid’ are examples of activation functions.
AdaGrad
AdaGrad is an advanced gradient descent algorithm that rescales the gradients of each
parameter. It allows each parameter to have an independent learning rate.
AUC (Area Under the ROC Curve)
AUC is an evaluation metric that considers all the possible classification thresholds.
Algorithm
An algorithm is a series of repeatable steps for executing a specific type of task with the
given data. It is a process governed by a set of rules, to be followed by a computer to
perform operations on the data.
Artificial Intelligence
Artificial intelligence, or AI, is machines acting with apparent intelligence. Modern AI
employs statistical and predictive analysis of large amounts of data to ‘train’ computer
systems to make decisions that appear intelligent.
Backpropagation or ‘backprop.’
Backpropagation is the primary algorithm for implementing ‘gradient descent’ on ‘neural
networks.’ In a backprop algorithm, the output values of each node are calculated in a
forward pass. Then, the partial derivative of the error corresponding to each parameter is
calculated in a backward pass through the graph.
Baseline
A baseline is a reference point for comparing how well a given model is performing. It is
a simple model that helps model developers quantify the minimal expected performance
of a system on a particular problem.
Batch
A batch is a set of examples used in ‘one gradient update’ or iteration of ‘model training.’
Bayes’ theorem
Bayes’ theorem describes the probability of an event, based on prior knowledge of
conditions that might be related to the event. For an observed outcome, Bayes’ theorem
describes that the conditional probability of each of the set of possible causes can be
computed from the knowledge of the probability of each cause and the underlying
conditional probability of the outcome of each event.
P(A/B)=(P(B/A)P(A))/P(B)
Binary classification
A binary classification is a type of classification task that outputs one of the two mutually
exclusive conditions as a result. For example, a machine learning model either outputs
‘Spam’ or ‘Not spam’ if it evaluates the email messages.
Big Data
Big data corresponds to extremely large datasets, that had been ‘impractical’ to use
before because of their ‘volume,' ‘velocity’ and ‘variety.' Such datasets are analyzed
computationally to reveal patterns, trends, associations & conditional relations, especially
relating to human behavior and interactions.
Crunching down such extensive data requires data science skills to reveal useful insights
and patterns hidden from general human intelligence.
Bucketing
Bucketing is the conversion of continuous features, based on their value range, into
multiple binary features called ‘buckets’ or ‘bins.’ Instead of using a variable as a
continuous ‘floating-point’ feature, its value ranges can be chopped down to fit into
discrete buckets.
For example, given a temperature data, all temperatures ranging from 0.0 to 15.0 can be
put into one bin or bucket, 15.1 to 30.0 into another and so on.
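A minimal bucketing function over the temperature example; the bucket edges extend the illustrative ranges from the text with an assumed upper bin:

```python
def bucketize(temp, edges=(15.0, 30.0, 45.0)):
    # Map a continuous temperature to a discrete bucket index.
    for i, edge in enumerate(edges):
        if temp <= edge:
            return i
    return len(edges)  # everything above the last edge

temps = [3.2, 14.9, 22.0, 31.5]
print([bucketize(t) for t in temps])
```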
Calibration layer
A calibration layer is used in post-prediction adjustment to minimize the ‘Prediction bias.’
The calibrated predictions and probabilities must match the distribution value of an
observed set of labels.
Candidate Sampling
Candidate sampling is the ‘training-time’ optimization method, in which the probability is
calculated for all the positive labels, but only for ‘random’ samples of negative labels. The
idea is based on an empirical observation that ‘negative classes’ can learn from ‘less
frequent negative reinforcement’ as long as the ‘positive classes’ always get ‘proper
positive reinforcement.’
For example, given an input example labeled ‘Ferrari,’ candidate sampling computes the
predicted probabilities and the corresponding loss terms for the ‘Ferrari’ and ‘Car’ class
outputs, in addition to a random subset of the other remaining classes such as
‘Trucks,’ ‘Aircraft,’ and ‘Motorcycles.’
Checkpoint
The ‘captured state of variables’ of a model at a given time is called ‘Checkpoint’ data.
Checkpoint data enables performing training across multiple sessions and aids in
‘exporting’ model weights. It also enables training to resume after task preemption.
Chi-square test
Chi-square test is an analytical method to determine ‘whether the classification of data
can be attributed to some underlying law or chance.’ The chi-square analysis is used to
estimate whether two variables in a ‘cross-tabulation’ are associated. It is a test of the
‘independence’ of the variables.
Classification or class
Classification is used to determine the categories to which an item belongs. It is an
example of a classic machine learning task. The two types of classification are ‘binary
classification’ and ‘multiclass classification.’
Example of a binary classification model is where a system detects ‘spam’ or ‘not spam’
emails.
Example of a multiclass classification is where a model identifies ‘cars,' the classes being
‘Ferrari,' ‘Porsche,' ‘Mercedes’ and so on.
Class-imbalanced dataset
A class-imbalanced dataset is a binary classification problem in which the two classes
have a wide frequency gap.
For example, a viral flu dataset in which 0.0004 of examples have positive labels and
0.9996 examples have negative labels pose a class-imbalance problem.
Whereas ‘a marriage success predictor’ in which 0.55 of examples label the couple
keeping a long-term marriage and 0.45 of examples label the couple ending up in divorce
is ‘not’ a class-imbalanced problem.
Classification threshold
A classification threshold is a scalar-value that is used when mapping ‘logistic regression’
results to binary classification. This threshold value is applied to a model’s predicted
score to separate the ‘positive class’ from ‘negative class.’
For example, consider a logistic regression model, with a classification threshold value of
0.8, which estimates the probability of a given email message as being spam or not
spam. The logistic regression values above 0.8 will be classified as ‘spam’ and values
below 0.8 are classed as ‘not spam.’
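Applying the 0.8 threshold from the example to hypothetical logistic-regression scores:

```python
# Hypothetical predicted probabilities from a logistic regression model.
scores = [0.95, 0.40, 0.81, 0.79]
threshold = 0.8
labels = ["spam" if s > threshold else "not spam" for s in scores]
print(labels)
```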
Clustering
Clustering is an unsupervised algorithm for dividing ‘data instances’ into groups that are
not a ‘predetermined set of groups’; the groups formed by such an algorithm arise from
similarities found among the instances.
‘Centroid’ is the term used to denote the center of each such cluster.
Coefficient
A coefficient is the ‘multiplier value’ prefixed to a variable. It can be a number or an
algebraic symbol. Data statistics involve the usage of specific coefficient terms such as
Cramer’s coefficient and Gini coefficient.
Confidence interval
The confidence interval is a specific range around a ‘prediction’ or ‘estimate’ to indicate
the scope of error by the model. The confidence interval is also combined with the
probability that a predicted value will fall within that specified range.
Confusion matrix
A confusion matrix is an N×N matrix that depicts how successful a classification model’s
predictions were. One axis of the matrix represents the label that the model predicted,
and the other axis depicts the actual label.
The confusion matrix, in case of a multi-class model, helps in determining the mistake
patterns. Such a confusion matrix contains sufficient information to calculate performance
metrics, like ‘precision’ and ‘recall.’
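Precision and recall can be read directly off a 2×2 confusion matrix; the counts and the TN/FP/FN/TP split below are illustrative assumptions:

```python
# Illustrative 2x2 confusion-matrix counts (true negative, false positive,
# false negative, true positive).
tn, fp, fn, tp = 50, 5, 10, 35
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
print(round(precision, 3), round(recall, 3))
```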
Continuous variable
A continuous variable can have an infinite number of values within a particular range. Its
nature contrasts with the ‘discrete variables’ or ‘discrete feature.’ For example, if you can
express a value as a decimal number, then it is a continuous variable.
Convergence
In simple words, convergence is a point where additional training on the data will not
improve the model anymore. At convergence, the ‘training loss’ and ‘validation loss’ do
not change with further iterations.
In deep learning models, loss values can stay constant or unchanging for many numbers
of iterations, before finally descending. This observation might produce a false sense of
convergence.
Convex function
A convex function is usually a ‘U-shaped’ curve. In degenerate cases, however, the
convex function is shaped like a line. These functions represent loss functions. The sum
of two convex functions is always a convex function. Example of a convex function is ‘Log
Loss’ function.
Correlation
Correlation is the measure of how closely two data sets move together. Take, for
example, two data sets, ‘subscriptions’ and ‘magazine ads.’ When more ads get
displayed, more subscriptions for the magazine get added, i.e., these data sets correlate. A
correlation coefficient of 1 is a perfect correlation, 0.8 represents a strong correlation,
while a value of 0.12 represents a weak correlation.
The correlation coefficient can also be negative. In the cases where data sets are
inversely related to each other, a negative correlation might occur. For example, when
‘mileage’ goes up, the ‘fuel costs’ go down. A correlation coefficient of -1 is a perfect
negative correlation.
Covariance
Covariance is ‘the mean value of the product of two variables, diminished by the product
of their average values.’ It represents how ‘two variables vary together from their mean.’
Cross-entropy
Cross-entropy is a means to quantify the difference between two probability distributions.
It is a generalization of ‘Log Loss’ function to multi-class classification problems.
Data-driven Documents or D3
D3 is a popular JavaScript library used by the data scientists, to present their results of
the analysis in the form of interactive visualizations embedded in web pages.
Data mining
Data mining is the analysis of large structured datasets by a computer to find hidden
patterns, relations, trends, and insights within them. It is a core activity of data science.
Dataset
A data set is a collection of structured information, used as ‘examples’ in machine
learning.
Data science
Data science is the field of study employing scientific methods, processes, and systems
to extract knowledge and insights from complex data in various forms.
Data structure
A data structure represents the way in which the information in the data is arranged,
for example, an ‘array’ or a ‘tree’ structure.
Data wrangling
Data wrangling or data munging is the conversion of data to make it easier to work with. It
is achieved by using scripting languages like ‘Perl.’
Decision boundary
The decision boundary is the separating line between the classes learned by a model in
binary-class or multiclass classification problems.
Decision trees
A decision tree represents the number of possible decision paths and an outcome for
each path, in the form of a tree structure.
Deep model
A deep model is a neural network that depends on ‘trainable nonlinearities.’ It is a popular
model for image classification.
Dense feature
A dense feature is a feature in which most values are non-zero, typically a ‘Tensor’ of
floating point values.
Dependent variable
A dependent variable’s value is influenced by the value of an independent variable. For
example, ‘The magazine ad budget’ is an independent variable value. However, the
number of ‘subscriptions’ made is dependent on the former variable.
Dimension reduction
Dimension reduction is the extraction of one or more ‘dimensions’ that ‘capture’ as many
variations in the data as possible. It is implemented with a technique called ‘Principal
component analysis.’ Dimension reduction is useful in finding a small subset of features
that captures ‘most of the variation’ in a given dataset.
Dropout regularization
A dropout regularization ‘removes a random selection of a fixed number of units in a
network layer’ for a single gradient step. This form of regularization is used in training
neural networks. The more units are dropped out, the stronger the
regularization.
Dynamic model
A dynamic model is trained online with the continuously updated data. In such a model,
the data keeps entering it continually.
Early stopping
If the loss on the ‘validation dataset’ increases, ‘generalization performance’ worsens,
and model training should be ended. This is known as early stopping: a method of
regularization in which model training ends before the ‘training loss’ finishes
decreasing.
Embeddings
Embeddings are categorical features represented as continuous-valued features. An
embedding is a translation of a ‘High-dimensional vector’ into a ‘Low dimensional space.’
Embeddings are trained by ‘Backpropagating loss’ like any other parameter in a neural
network.
Empirical Risk Minimization (ERM)
ERM is the selection of ‘model function’ that minimizes training losses. It contrasts with
‘Structural risk minimization.’
Ensemble
To ‘ensemble’ is to merge the predictions of multiple models; for example, ‘deep and
wide models’ form an ensemble. An ensemble can be created via different initializations,
different overall structures, or different hyperparameters.
Estimator
An estimator encapsulates or contains the logic that builds a TensorFlow graph and runs
a TensorFlow session.
Example
An ‘example’ represents ‘one row’ of a given data set. It contains one or more
‘features’ and might also carry a label; hence, examples can be labeled or unlabeled.
False Negative or FN
If a model mistakenly predicts an example to be of the ‘negative class,’ the outcome is
called a false negative. For example, a model predicts an email as ‘not spam’ (negative
class), but it actually was ‘spam.’
False Positive or FP
If a model mistakenly predicts an example to be of the ‘positive class,’ the outcome is
called a false positive. For example, a model predicts an email as ‘spam’ (positive class),
but it actually was ‘not spam.’
Feature
A feature is an input variable value used to make predictions. It represents ‘pieces of
measurable information’ about something. For example, a person’s age, height, and
weight represent three features about him/her. A feature can also be called property or
an attribute.
Feature column
A feature column is a set of related features of an example. For instance, ‘a set of all
possible languages’ a person might know will be listed under one feature column. A
feature column might also contain a single feature.
Feature cross
A feature cross represents non-linear relationships between features. It is formed by
multiplying or taking a Cartesian product of individual features.
Feature engineering
Feature engineering involves ‘determining which feature will be useful in training a
model.’ The ‘raw data’ from log files and other sources is then converted into the said
features. Feature engineering is also referred to as ‘feature extraction.’
Feature set
It is the ‘set of features’ on which a machine learning model trains. Take, for example, the
model of a used car, its age, distance covered, etc. These ‘set of features’ can be used to
predict the price of that car.
Generalization
Generalization is the ability of a model to make correct predictions on ‘fresh and
unseen’ data, rather than on the data previously used to train it.
Generalized linear model
A generalized linear model cannot learn ‘new features’ the way a deep learning model does.
Gradient
A gradient is the ‘vector of partial derivatives’ with respect to all the independent
variables. A gradient always points in the direction of ‘steepest ascent.’
Gradient boosting
Gradient boosting produces a prediction model in the form of an ensemble of weak
prediction models. This is a machine learning technique for regression and solving
classification problems.
Gradient boosting builds the model stage-wise and generalizes them by allowing
optimization of arbitrary differentiable loss functions.
Gradient clipping
Gradient clipping is the method of ensuring numerical stability by ‘capping’ gradient
values before applying them.
Gradient descent
Gradient descent is a loss minimization technique, which involves computing of gradients
of loss with respect to the model’s parameters, learned or trained on training data.
Gradient descent works by adjusting parameters and finding the optimum combination of
‘weights’ and bias to minimize loss.
Graph
A graph represents a ‘computation specification’ to be processed in TensorFlow. Such a
graph is visualized using TensorBoard. The nodes on the graph depict operations and
edges represent the passing of the result as an operand to another operation (or Tensor).
Heuristic
A heuristic is a practical solution to a problem that aids in learning and making progress.
Hidden Layer
A hidden layer in a neural network lies between the input layer (or feature) and the output
layer (or prediction). A neural network can contain single or multiple hidden layers.
Hinge loss
A hinge loss is a loss function designed for classification models, to find the decision
boundary as far as possible from each training example. A hinge loss function maximizes
the margin between examples and the boundary.
Histogram
A histogram represents the distribution of numerical data through a vertical bar graph.
Holdout data
These are the datasets that are intentionally held-out during the model’s training. Holdout
data helps in evaluation of the model’s ability to generalize to data, other than the data it
was trained on. Examples of holdout datasets are validation dataset and test data set.
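As a minimal illustration, holdout sets can be carved out of a dataset by shuffling once and slicing; the 70/15/15 split and the fixed seed here are arbitrary choices:

```python
# Minimal sketch of a train / validation / test holdout split.
import random

def holdout_split(data, train_frac=0.7, val_frac=0.15, seed=42):
    data = list(data)
    random.Random(seed).shuffle(data)  # shuffle reproducibly
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train = data[:n_train]
    validation = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]      # remainder is held out for testing
    return train, validation, test

train, validation, test = holdout_split(range(100))
print(len(train), len(validation), len(test))  # 70 15 15
```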
Hyperparameter
The parameters that can be ‘changed’ or ‘tweaked’ during successive training runs of a
model are known as hyperparameters.
IID (independent and identically distributed)
An example of IID data is ‘a fairly rolled die.’ Here, every face always has an equal
probability of coming up, irrespective of which faces have already come up.
Inference
In machine learning terms, inference is the process through which a trained model makes
predictions on unlabeled examples.
Input layer
The input layer is the first layer to receive the input data in a neural network.
Inter-rater agreement
Inter-rater agreement is a way to measure the ‘agreement’ between human raters while
undertaking a task. A disagreement amongst the raters calls for the improvement in ‘task
instructions.’
K-means clustering
It is a data-mining algorithm to classify or group or ‘cluster’ ‘N’ number of objects based
on their features into ‘K’ number of groups (or clusters).
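A bare-bones, illustrative implementation on 1-D points follows; a real project would use a library such as scikit-learn. Assign each point to its nearest centroid, recompute each centroid as its cluster’s mean, and repeat:

```python
# Minimal sketch of K-means on 1-D points (illustrative only).
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to the nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute centroids as cluster means (keep old value if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # roughly [1.0, 9.53]
```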
Latent variable
Latent variables are hidden variables whose presence is inferred, through a
mathematical model, from directly measured observed variables.
Label
In machine learning terms, a label represents the ‘answer’ or ‘result’ associated with an
example.
Layer
A layer is a set of neurons that process a set of input features or the output of those
neurons in a neural network.
Lift
Lift signifies ‘how frequently a pattern would be observed by chance.’ If the lift is 1, the
pattern is assumed to occur coincidentally. The higher the lift, the higher the chance
that the occurring pattern is real.
Linear regression
Linear regression is the method of expressing the relationship between a scalar
dependent variable ‘y’ and one or more independent variables ‘X.’ For example, the
relationship between ‘price’ and ‘sales’ can be expressed as the equation of a straight
line on a graph.
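A minimal sketch using the closed-form least-squares solution for a straight line; the data here are made up to lie exactly on y = 2 + 3x, and in practice a library such as scikit-learn or statsmodels would be used:

```python
# Minimal sketch of simple linear regression: closed-form least squares
# for y = intercept + slope * x.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 2 + 3x:
intercept, slope = fit_line([1, 2, 3, 4], [5, 8, 11, 14])
print(intercept, slope)  # 2.0 3.0
```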
Logistic regression
Logistic regression is a model similar to linear regression, except that the output is
made to fit the logistic function. In other words, the potential results are not continuous
but a ‘specific set of categories.’
Machine learning
Machine learning or ML involves the development of algorithms to figure out insights from
extensive and vast data. ‘Learning’ refers to ‘refining’ of the models by supplying
additional data, to make it perform better with each iteration.
Markov chain
A Markov chain is an algorithm used to determine the probability that an event occurs,
based on which other events have already occurred. This algorithm works with data
describing a ‘series of events.’
Matrix
A matrix is simply a set of data arranged in rows and columns.
Mean
Mean, or arithmetic mean is the average value of numbers.
Mean squared error or MSE
MSE is the average of the squared differences between the predicted values and the
observed values.
Median
The central or middle value of sorted data is called the median. If the number of values
in the data is even, the average of the two central values becomes the median.
Mode
For a given set of data values, the value that appears most frequently is called the mode.
Mode, like median, is a way to measure the central tendency.
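The three measures of central tendency defined above (mean, median, mode) can be computed with Python’s standard statistics module:

```python
# Mean, median, and mode via the standard library.
import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 4.0 (average of central values 3 and 5)
print(statistics.mode(data))    # 3   (most frequent value)
```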
Model
In statistical analysis, modeling refers to the specification of a probabilistic relationship
existing between different variables. A ‘model’ runs on algorithms and training data to
‘learn’ and then make predictions.
Moving average
Moving average represents the ‘continuous average’ of new time series data. The mean
of such data is calculated at equal time intervals and is updated according to the most
recent value, while the older value gets dropped.
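A minimal sketch with a window of three values: each new mean drops the oldest value and includes the most recent one.

```python
# Minimal sketch of a simple moving average.
def moving_average(series, window=3):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```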
Multivariate analysis
The analysis of ‘dependency of multiple variables over each other’ is called the
multivariate analysis.
N-gram
N-gram is the ‘scanning of patterns in a sequence of ‘N’ items.’ It is typically used in
natural language processing. For example, unigram analysis, bigram analysis, trigram
analysis and so on.
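A minimal sketch of extracting n-grams from tokenized text; whitespace tokenization is a simplification here, and n = 2 produces the bigrams of the sentence:

```python
# Minimal sketch of n-gram extraction from a token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data science is fun".split()
print(ngrams(tokens, 2))
# [('data', 'science'), ('science', 'is'), ('is', 'fun')]
```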
Natural language processing
Natural language processing or NLP is a collection of techniques to structurize and
process raw text from human spoken languages to extract information.
Neural network
A neural network uses algorithmic processes that mimic the human brain. It attempts to
find insights and hidden patterns from vast data sets. A neural network runs on learning
architectures and is ‘trained’ on large data sets to make such predictions.
Normal distribution
Normal distribution or ‘bell curve’ or ‘Gaussian distribution’ is a continuous bell-shaped
graph with the mean value at the center. It is a widely used distribution curve in statistics.
Null hypothesis
A null hypothesis states that ‘a single variable is no different from its mean, or that no
variation exists between variables.’ Hence, according to a null hypothesis, the given
observations hold no ‘statistical significance.’
Objective function
An objective function expresses a ‘result’ (or objective) to be maximized or minimized
by changing the values of other quantities, such as decision variables, subject to
constraints.
Ordinal variable
Ordinal variables are ordered variables with discrete values.
Outlier
Observations that diverge far away from the overall pattern in a sample are called
outliers. An outlier may also indicate an error or rare events.
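One common rule of thumb (among several) flags values more than 1.5 * IQR beyond the first or third quartile as outliers; a minimal sketch using Python’s statistics module (statistics.quantiles requires Python 3.8+):

```python
# Minimal sketch of outlier detection with the 1.5 * IQR rule.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # [95]
```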
Overfitting
Overfitting describes an overly complicated model of the data that takes too many
outliers or ‘intrinsic data quirks’ into account. A model that overfits the training data is
of little use for finding patterns in test data.
P value
The P value is the probability of getting a result equal to or more extreme than the actual
observation, under the null hypothesis. It measures the chance of seeing ‘a gap
between the groups when there actually isn’t any gap.’
Perceptron
A perceptron is the simplest neural network: a single neuron that takes ‘n’ inputs,
computes their weighted sum, and produces a binary output.
Pivot table
A pivot table allows long lists of data to be easily rearranged and summarized. The act
of rearranging the data is known as ‘pivoting.’ A pivot table also allows for dynamic
rearrangement of the data by just creating a pivot summary, removing the need to
write formulas or copy data around.
Poisson distribution
It is the distribution of independent events over a defined time period and space. Poisson
distribution is used to predict the probability of occurrence of an event.
Predictive analytics
Predictive analytics involves extraction of information from existing data sets to determine
patterns and insights. These patterns and insights are used to predict future outcomes or
event occurrences.
Precision and recall
Precision is the fraction of positive predictions that are actually correct. Recall, on the
other hand, is the fraction of actual positives that the model correctly identifies.
For example, take a visual recognition model that recognizes ‘oranges.’ It recognizes
seven oranges in a picture containing ten oranges along with some apples.
Out of those seven ‘oranges,’ five are actually oranges (true positives), and the other
two are apples (false positives).
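Working the numbers from the orange-detector example above in Python:

```python
# Precision and recall for the orange-detector example:
# 7 predictions, of which 5 are real oranges; 10 real oranges in total.
true_positives = 5
false_positives = 2   # apples mistaken for oranges
false_negatives = 5   # oranges the model failed to find (10 - 5)

precision = true_positives / (true_positives + false_positives)  # 5 / 7
recall = true_positives / (true_positives + false_negatives)     # 5 / 10
print(round(precision, 3), recall)  # 0.714 0.5
```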
Predictor variables
Predictor variables are the variables used to predict the value of dependent variables.
Prior distribution
In Bayesian statistics, ‘prior’ probability distribution of an uncertain quantity is based on
assumptions and beliefs, without taking any evidence into account.
R Programming
R is an open-source programming language for statistical analysis and graph generation,
available for different operating systems.
Random forest
An algorithm that employs ‘a collection of decision trees’ for the classification task.
Each tree ‘votes’ for a classification of the input, and the random forest chooses the
classification with the most ‘votes’ across all the trees.
Range
The range is the difference between the highest and the lowest value in a given set of
numbers. For example, consider the set 2, 4, 5, 7, 8, 9, 12: the range = 12 - 2 = 10.
Regression
Regression aims to measure the relationship between a dependent variable and one
or more independent variables. Examples: linear regression, logistic regression, lasso
regression, etc.
Reinforcement learning
Reinforcement learning or RL is a learning algorithm that allows a model to interact with
an environment and make decisions. The model is not given specific goals, but when it
does something ‘right’, it is given feedback. This ‘reinforcement’ helps the classification
model in learning to make right predictions. RL model also learns from its ‘past’
experiences.
Response variable
The response variable is the one that is affected by the other variables. It is also
called the dependent variable.
Ridge regression
Ridge regression performs the ‘L2 regularization’ function on the optimization objective.
In other words, it adds the factor of the sum of squares of coefficients to the objective.
Root mean squared error or RMSE
RMSE denotes the standard deviation of the prediction errors from the regression line. It is
simply the square root of the mean squared error. RMSE signifies the ‘spread’ or
‘concentration’ of the data around the regression line.
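A minimal sketch tying MSE and RMSE together, using made-up observed and predicted values:

```python
# RMSE is simply the square root of the mean squared error.
import math

def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    return math.sqrt(mse(observed, predicted))

observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mse(observed, predicted))   # (1 + 0 + 4) / 3, about 1.667
print(rmse(observed, predicted))  # about 1.291
```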
S curve
As the name suggests, ‘S-curve’ is a graph shaped like the letter ‘S.' It is a curve that
plots variables like cost, number, population, etc. against time.
Scalar
A scalar quantity represents the ‘magnitude’ or ‘intensity’ of a measure and not its
direction in space or time. For example, temperature, volume, etc.
Semi-supervised learning
Semi-supervised learning involves the use of extensive ‘unlabeled data’ at the input. Only
a small portion of the data is ‘labeled,’ from which the model ‘learns’ to make the right
classifications without much external supervision.
Serial correlation
Serial correlation or autocorrelation is a pattern in a series where each value is directly
influenced by the values adjacent to or preceding it. It is calculated by shifting the time
series against itself by an interval called the ‘lag.’
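A minimal sketch of the sample lag-1 autocorrelation: correlate the series with a copy of itself shifted by one step (the ‘lag’):

```python
# Minimal sketch of sample autocorrelation at a given lag.
def autocorrelation(series, lag=1):
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    # covariance of the series with itself shifted by 'lag'
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# A steadily increasing series shows positive serial correlation.
print(autocorrelation([1, 2, 3, 4, 5, 6, 7, 8]))  # 0.625
```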
Skewness
Skewness represents the asymmetry of a distribution or data set, to the left or the right
of its center point.
Spatiotemporal data
A spatiotemporal data includes the space and time information about its values. In other
words, it is a time-series data with geographic identifiers.
Standard deviation
Standard deviation represents the ‘dispersion of the data.’ It is the square root of the
variance to show how far an observation is from the mean value.
Standard error
Standard error signifies the ‘statistical accuracy of an estimate.’ It is equal to the standard
deviation of the sampling distribution of a statistic.
Standardized score
The standard score, normal score or Z-score is the transformation of a raw score into
units of standard deviation above or below the mean, allowing it to be evaluated in
reference to the standard normal distribution.
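A minimal sketch, using the population standard deviation from Python’s statistics module:

```python
# Z-score: standard deviations above or below the mean.
import statistics

def z_score(value, data):
    return (value - statistics.mean(data)) / statistics.pstdev(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std dev 2
print(z_score(9, data))            # 2.0
```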
Strata
Dividing the data into homogeneous groups and drawing random samples from each
group produces ‘strata.’ For example, forming ‘strata’ of population or demographic
data.
Supervised learning
Supervised learning involves using algorithms to classify the input into specific pre-
determined or known classes. In such a case, the prediction made by the model is based
on a ‘given set of predictors.’
Some examples of supervised learning algorithms are random forest, decision tree,
and kNN.
T-distribution
The T-distribution, also known as ‘Student’s T-distribution,’ is used in place of the
normal distribution when estimating the mean of a normally distributed population
from a small sample with unknown variance.
Type I error
The incorrect decision to reject the null hypothesis when it is actually true is called a type I error.
Type II error
The incorrect decision to retain or keep the null hypothesis when it is actually false is called a type II error.
T-test
A T-test is the analysis of ‘two population data sets,’ comparing the difference between their means.
Univariate analysis
The purpose of univariate analysis is to describe the data. It analyzes a single variable
at a time.
Unsupervised learning
An algorithm that groups data without knowing in advance what the groups will be. There
is no target or outcome variable to predict or estimate. Unsupervised learning focuses
mainly on learning the underlying structure of the data from its attributes.
Variance
It is the ‘variation’ of the numbers in a given data from the mean value. Variance
represents the magnitude of differences in a given set of numbers.
Vector
In mathematical terms, a vector denotes a quantity with both magnitude and direction in
space or time. In data science terms, it means an ‘ordered set of real numbers, each
representing a distance on a coordinate axis.’ For example, velocity, momentum, or any
other series of values around which a model is being built.
Vector-space
Vector space is a collection of vectors. For example, a matrix is a vector space.
Weka
Weka is a collection of machine learning algorithms and tools for mining data. Using
Weka, the data can be pre-processed, regressed, classified, associated with rules and
visualized.
There we have it! The updated glossary of machine learning definitions.
Next Go Through!
SQL prep
Disclaimer:
A small fraction of the information presented in this prep is researched and adapted to the context of the prep. Various
online platforms like GitHub and blogs provided the required information on Data Science domain interview scenarios.
We do not claim that this prep is a comprehensive resource to prepare yourself facing any interviews.
Any questions can be addressed to data science prep authors: actiondatas@gmail.com