
DATA SCIENCE INTERVIEW PREP

Contents of the Prep

o Icebreaker questions
o About data
o Data cleaning
o Data manipulation
o Analysis of data Spread
o Data visualization
o Key Terms in Data Science

Icebreaker questions
1. Explain any analytics project you have recently completed?

The answer could be in the context of the data visualization or machine learning domain.

Explain the objective of the study, the data you used and the final outcome

2. What is a business or research objective for the project?

The objective allows us to define the business or research problem more concisely and helps us frame the data science scope for solving it

3. Name the objectives of a couple of Analytics projects you have worked on?

E.g. I developed a regression model to estimate the price of a house

4. Describe the process of your data analytics project?

Defining objective → Data collection → Data cleaning → EDA → Regression analysis → Reporting

5. What different tools did you use to complete this project?

R programming, Python, Tableau, SQL, Excel and related tools. Be sure to specify the purpose for which each of these tools was used

6. What were the outcomes of the Analytics projects you worked on?

Describe any key decisions you were able to make using data-driven insights

7. What mistakes did you make when working on an analytics project, and how did you improve on them?

You can describe mistakes as learning steps and cover the mistakes you made in any of the above-mentioned projects

About data

8. How did you determine which data is needed for your study or project?

The data we collect should allow us to derive the predicted outcome or describe a past event

9. How did you collect the data? Describe the sources of the data?

Different sources are Primary & Secondary

Primary: Data we collect specifically for a study

Secondary: Data gathered from external sources

10. What are the different types of data?

Quantitative and Qualitative data

11. What are the different levels of measurement?

Nominal, Ordinal, Interval and Ratio

12. Give an example of Nominal, Ordinal, Interval and Ratio data types

Nominal – Gender, represented as 0 & 1

Ordinal – Top ranks, represented as 1, 2, 3

Interval – Time difference between years

Ratio – Height, weight, Number of houses

13. Explain different methods of data collection?

Data mining, Web scraping, Survey data collected for the purpose of the study

14. Explain web scraping

Extracting structured and unstructured data from web pages, for example to gain competitive intelligence

15. What is data mining?

Examining past databases to yield useful information

16. How is data collected through experiments?

Mainly through surveys or by conducting real blinded experiments on subjects

17. What is the difference between primary and secondary data sources?

Primary data addresses the specific problem we are studying; collecting data via a primary source is expensive and time-consuming.

Secondary data might give useful facts concerning our study objective, but its quality cannot always be determined

18. Give an example on when you would use Primary data and Secondary data?

Interacting with customers to gather information on customer sentiments on our recently

launched products is an example of a primary source

Collecting external survey results on mobile phone usage is an example of a secondary

source

19. Explain how you would collect data in order to prove your hypothesis?

We need to define a hypothesis with which we reject or prove a belief. We can then collect primary or secondary data specific to that hypothesis

20. Describe big data? Which companies deal with it?

Structured or unstructured data generated at a scale where the volume and velocity factors pose a challenge to its storage and processing.

Uber and Google are examples of companies that deal with big data scenarios

21. Which business scenarios need big data analysis?

Real-time service providers, such as ride-hailing companies, need to analyze real-time unstructured and structured data to provide insights on pricing and travel route optimization

22. Explain the differences between small data and big data?

Small data is mostly static, with limited growth in volume and variety. Big data exhibits growth in both volume and variety, and the data is generated at a faster rate.

23. A company is generating 1GB of data a day from various IOT devices. Is this data, big

data?

It depends on the context: is 1GB of data too much to store and analyze in a day? In the 1990s you could have called it a big data problem, due to limited infrastructure for storage and limited computation power

24. How can you analyze unstructured data?

It can be analyzed using machine learning algorithms and packages which can analyze meta tags and perform analysis of unstructured data

25. Explain the aim of Exploratory data analysis (EDA)?

To understand past trends and gain insights into predictor and outcome variables. This should give you important insights into the structure of the data set

26. During what stages of a data analytics or data science projects do you perform exploratory

data analysis?

After data cleaning

27. What kind of business insights does EDA provide? Give an example

e.g. Past revenue trends

Data cleaning

28. Explain best practices for cleaning data

Identify the types of data you are cleaning. Identify the range of values which you

encounter in a variable and define the cardinality

29. Which tools did you use to clean the data?

R, Python, Excel, SQL

30. What is the cardinality of a variable?

Defined as the number of distinct values in a variable

31. Why does data cleaning play a vital role in the analysis?

Poor quality data leads to poor quality insights

32. How do you validate the accuracy of your data?

Check the collection process and see if the data collected pertained to the problem you
were trying to solve

33. How would you clean unstructured data?

Text can be cleaned using Excel, where you can divide sub-components of numerical and text values into separate columns

34. How do you treat duplicate data?

Duplicate data can influence outcomes, so these values are removed before data analysis. Care must be taken to check that the records being removed are true duplicates and not unique records (for example, by checking their ID values)

35. How do you standardize data? Give examples

We can use z-scores to standardize data, i.e. subtract the mean and divide by the standard deviation
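For illustration, a minimal Python sketch of z-score standardization (the column name 'height' and the values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"height": [150, 160, 170, 180, 190]})  # hypothetical sample data
    # z-score: subtract the mean and divide by the standard deviation
    df["height_z"] = (df["height"] - df["height"].mean()) / df["height"].std()
    print(df)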

36. How do missing data values create bias?

Missing values cause an improper representation of the samples collected from a population

37. What measures can you take to prevent collecting missing data values?

Survey design and data collection have to be properly standardized so that all required observations are duly recorded.

Data Manipulation
38. How are outliers identified?

For any data distribution, any data value which lies far away from the median can be
examined as a potential outlier

39. Give an example of outliers influencing your analysis?

Salaries of highest-paid players in any sports team might skew the salary distribution and
change the average values

40. When would you consider including outliers in your analysis?

When proving a hypothesis, we reject the null hypothesis or the status quo. Valid outliers
could help us reject the null hypothesis

41. How can outliers be treated?

If outliers are invalid, they can be excluded from our analysis. The presence of valid outliers has to be accounted for, and values can be standardized before analysis

42. What is missing value imputation?

Missing values can be substituted with the median value of the attribute in which they occur. However, the type of imputation carried out needs to be done on the basis of an understanding of the dataset and variables.
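As a sketch of median imputation with pandas (the column name 'age' and the values are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, 31, np.nan, 45, np.nan, 38]})  # hypothetical data with gaps
    # substitute missing values with the median of the observed values
    df["age"] = df["age"].fillna(df["age"].median())
    print(df)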

43. What methods are used to deal with missing value imputation?

Common approaches include mean, median or mode imputation, regression-based imputation, and dropping rows or columns with a very high proportion of missing values.
44. Why do we normalize data?

For numerical variables measured on different scales, normalization makes the values fall within a specified range while maintaining the relative differences

45. Explain a method to normalize data?

Range (min-max) normalization is one approach to normalizing data. Typical normalization ranges are [0,1] and [-1,1]
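A minimal sketch of range (min-max) normalization to [0,1], assuming a hypothetical numeric column 'price':

    import pandas as pd

    df = pd.DataFrame({"price": [100, 250, 400, 800, 1200]})  # hypothetical values
    # scale values into the [0, 1] range while preserving relative differences
    df["price_norm"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())
    print(df)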

46. What is metadata? How is it useful for gaining insights, give an example?

Metadata provides information about other data. It is useful for tagging unstructured data formats such as images, audio or text documents

47. What is a class imbalance, give an example?

In an outcome variable, if certain classes of data (e.g. Yes & No observations) are
underrepresented, then we can assume a class imbalance.

Analysis of data spread

48. What is center, shape, and spread of data?

All of the above terms refer to the distribution of the data: the center indicates the median, mean or mode of the data distribution, the shape indicates the skewness, and the spread indicates the variability of the data distribution

49. What is uni-modal, bi-modal and multi-modal distribution of data?

The mode indicates the most common value in a data set. A uni-modal distribution has a single peak, a bi-modal distribution has two peaks, and a multi-modal distribution has two or more peaks. A normal distribution is unimodal and symmetric; the most common values occur at its single peak

50. What is kurtosis and what does it imply in terms of deriving insights?

Data sets with high kurtosis tend to have heavy tails or outliers. Data sets with low kurtosis

tend to have light tails or lack of outliers

51. What does uniform data distribution mean?

The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. In a uniform distribution, every value (or interval) occurs with roughly equal frequency, so there are no pronounced peaks

52. Explain the positive and negative skew of data distributions

A distribution is skewed if one of its tails is longer than the other. In a right-skewed (positively skewed) distribution the mean is pulled above the median, while in a left-skewed (negatively skewed) distribution it falls below the median

53. Can you guess about the distribution and skewness of data for salaries of IPL Players?

The highest and lowest paid player’s salary values could create a skew in the salary

distribution. Positive or negative skews of the distribution are observed.

54. How do you summarize categorical data?

Categorical variables with different levels can be summarized using count functions or their

relative proportions to each other can be estimated using tables

55. How do you summarize numerical data?

Most commonly we use Mean, Median and Mode to measure the centrality of data

distribution. The measures of dispersion of a data distribution can be performed using

standard deviation and variance

56. Explain measures of central tendency?

Mean defines the average value, Median defines the 50th percentile value, Mode defines
the most common value

57. Between median and mean, which is a more robust measure?

The median, since it reflects the 50th percentile value and is not influenced by outliers. Outliers skew the mean value

58. Explain why the median is the more robust measure

The median reflects the 50th percentile value and is not influenced by outliers, whereas outliers skew the mean value.

59. Explain when would you prefer to use the mean

When the sample contains no outliers the mean provides a better measure of central
tendency

60. Explain the applications of measures of dispersion

Standard deviation and variance allow us to determine the variability of the data.

e.g. the age variable of students in class ten will have less variability, than the age variable
of students in an executive MBA program

61. What is the statistical variance? How is it calculated?

Variance measures how far a set of random numbers are spread out from their average

62. What is the standard deviation? How is it calculated?

It measures the dispersion of the dataset relative to its mean and is calculated as the square root of the variance.
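For example, a short NumPy sketch computing both measures on made-up observations:

    import numpy as np

    values = np.array([4, 8, 6, 5, 3, 7])  # hypothetical observations
    variance = values.var(ddof=1)          # sample variance (divides by n - 1)
    std_dev = np.sqrt(variance)            # standard deviation = square root of the variance
    print(variance, std_dev)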

63. Explain how standard deviation values are interpreted using an example

A small standard deviation can be a goal in certain situations where the results are
restricted, for example, in product manufacturing and quality control. A particular type of
car part that has to be two centimeters in diameter to fit properly had better not have a very
big standard deviation during the manufacturing process. A big standard deviation, in this
case, would mean that lots of parts end up in the trash because they don’t fit right; either
that or the cars will have problems down the road. (Source: dummies)

64. Explain statistical correlation

Correlation shows how strongly pairs of variables are related, e.g. height and weight can exhibit a certain relationship

65. Differentiate between univariate and multivariate analysis?

In the Univariate analysis, we will be focusing on a single variable and try to understand its
measure of central tendency and dispersion

In multivariate analysis, we try to understand the relationships between variables and see if they have a positive or negative influence on each other, which can be useful in determining predictor and response variables

66. What is a normal distribution of data?

Also called a Gaussian distribution

Also called the bell curve occurs naturally in many situations. For example, the bell curve
is seen in tests like the GRE. The bell curve is symmetrical. Half of the data will fall to the
left of the mean; half will fall to the right. (Source: Statistics How To)

67. What is the difference between Interpolation and extrapolation?

To extrapolate is to infer something that is not explicitly stated from existing information
from a data set. Interpolation is an estimation of a value within two known values in a
sequence of values.

68. What is the difference between sample data and the population from which it is obtained?

Sample data, when collected randomly, should carry information about the whole population, from which we can make inferences. Collecting samples is more feasible in real-world statistical studies

69. What is statistical power?

The power of any test of statistical significance is defined as the probability that it will reject
a false null hypothesis

70. Explain sampling and its types with an example?

Random sampling allows us to select observations from a population. Each observation


will have an equal probability of being selected.

E.g. To determine the voter preferences for a political party we randomly select people
from a voter list and then collect opinions. Non-random selection of samples will lead to bias.

71. What is the difference between a cluster and systematic sampling?

Cluster sampling breaks the population down into clusters, while systematic
sampling uses fixed intervals from the larger population to create the sample

72. How will you assess the statistical significance of an insight whether it is a real insight or
just by chance?

Statistical significance is the probability of finding a given observation against the null
hypothesis. An arbitrary convention is to reject the null hypothesis if p < 0.05. The
presence of invalid outliers should be carefully examined while rejecting the null
hypothesis.

73. How do you test your hypothesis?

In 4 steps: 1. State your hypothesis → 2. Formulate your analysis plan → 3. Collect data → 4. Analyze the data and calculate statistical significance

74. What is z - score?

It is a measure of how many standard deviations below or above the population mean a raw score is: z = (x − mean) / standard deviation. Z-scores allow us to standardize all raw observations in a data set and to compare variables measured on different scales.

75. Explain p-value

A p-value helps determine the significance of our statistical analysis. The p-value is a number between 0 and 1 and is interpreted in the following way: a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so we can reject the null hypothesis.
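A minimal sketch, using SciPy and made-up sample data, of obtaining a p-value from a one-sample t-test:

    import numpy as np
    from scipy import stats

    temps = np.array([31.2, 29.8, 32.1, 30.5, 31.7, 30.9])  # hypothetical summer temperatures
    # test H0: the population mean equals 30, against the two-sided alternative
    t_stat, p_value = stats.ttest_1samp(temps, popmean=30)
    print(p_value < 0.05)  # True suggests rejecting H0 at the 5% significance level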

76. What are the usual significance criteria you set for the p-value?

Typical thresholds are p ≤ 0.01, p ≤ 0.05 or p ≤ 0.10. The criteria for setting a p-value threshold are specific to the business domain; hypothesis testing in the healthcare sector might require a stricter significance criterion

77. Which p-value is stringent? Why?

p ≤ 0.01 is the most stringent, because it demands stronger evidence before rejecting the null hypothesis; how stringent the threshold should be depends on how the hypothesis is being tested. The significance level is also described as the probability of rejecting the null hypothesis when it is actually true (a Type I error)

78. Explain the difference between alternative and null hypothesis?

The null hypothesis (H0) expresses the status quo or an existing belief, e.g. the average summer temperature in the Himalayas does not cross 30C.

The alternative hypothesis (Ha) challenges the existing belief; in this case, we state that the average summer temperature in the Himalayas is above 30C.

79. What happens when you fail to reject the null hypothesis? Give an example?

We retain the status quo and take no action. If the null hypothesis is actually false, this is a Type II error, e.g. failing to diagnose cancer when it exists in a patient

80. Explain the consequences of rejecting the null hypothesis if in case it’s valid?

We make a wrong decision (a Type I error), e.g. diagnosing cancer in a patient who is in reality healthy

81. In case of low sample size which test would you perform to evaluate significance?

We use Student's t-test when the sample size is less than 30 observations

82. What are a one-tailed test and two-tailed tests? Give examples?

Extending our hypothesis on average summer temperature in the Himalayas: if the alternative hypothesis allows the temperature to be either below or above the stated value of 30C, we perform a two-tailed test.

If the alternative hypothesis specifies a direction only (e.g. temperatures are above 30C), we perform a one-tailed test.

83. What is a random variable?

A random variable is a variable whose value is unknown or a function that assigns values
to each of an experiment's outcomes. Random variables are often designated by letters
and can be classified as discrete, which are variables that have specific values, or
continuous, which are variables that can have any values within a continuous range
(Source: Investopedia)

84. What insights can you derive out of a correlation matrix?

A correlation matrix is a table showing correlation coefficients between sets of variables. Each random variable (Xi) in the table is correlated with each of the other variables in the table (Xj)
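As an illustration, a correlation matrix can be computed with pandas (the column names and values below are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "height": [150, 160, 170, 180, 190],
        "weight": [50, 58, 66, 77, 85],
        "age": [20, 25, 32, 41, 50],
    })  # hypothetical data
    # pairwise Pearson correlation coefficients between all numeric columns
    print(df.corr())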

85. What is the correlation coefficient value? How does it signify the relationship
between variables?

The correlation coefficient, represented by the symbol r, measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1

86. What is the covariance? How is it different from correlation coefficient values?

Covariance is a measure of how much two random variables vary together. It’s
similar to variance, but where variance tells you how a single variable varies,
covariance tells you how two variables vary together. A large covariance could
mean a strong relationship exists between variables.

The correlation coefficient has several advantages over covariance for determining the strength of relationships between variables: covariance can take on practically any number, while correlation is measured on a scale of -1 to +1.

87. Does correlation imply causation?

No; we need to do diligent research into whether any confounding factors are causing the events we observe

Data visualization

88. Explain data visualization to a layman?

The representation of data and information in the form of a chart, diagram, picture

89. Why do we visualize Data?

It is easier to perceive information through visuals, especially when we are dealing with a large number of observations

90. Explain the process of data visualization in EDA

Define business objective → Collect data → Clean and process data → EDA → Communicate insights through reports

91. Explain the importance of visualization in EDA

It allows us to gain basic insights into trends and identify predictor and response variables

92. When do you use bar charts?

Bar charts allow us to represent and compare categorical information

93. When do you use line charts?

Line charts allow us to plot trend lines and observe patterns over time

94. How do you plot two quantitative variables against each other?

We use a scatter plot to plot two quantitative variables against each other

95. What are the applications of a scatter plot?

Allows us to understand the relationships between variables

96. Which plot would you use to analyze the distribution of data points in a single variable?

A box plot can be used to understand the distribution of a single variable

97. What are whiskers in a box plot?

They represent the minimum and maximum observed values in the variable (excluding outliers)

98. How does boxplot help us identify outliers?

Values lying far away from the median value beyond the reach of whiskers are
considered outlier values

99. What is IQR in a box plot?

The interquartile range (IQR) is the difference between the 75th percentile value and the 25th percentile value
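A minimal sketch of the common IQR rule for flagging outliers (the 1.5 multiplier is the usual convention; the data are made up):

    import numpy as np

    values = np.array([12, 14, 15, 15, 16, 17, 18, 45])  # hypothetical data with one extreme value
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    print(outliers)  # values lying beyond the whiskers of a box plot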

100. What other statistical measures can be obtained using box plots

Median, Range

101. Explain when would you use a Stem and Leaf plot?

To count frequency values in a smaller dataset

102. What is a correlation matrix?

A correlation matrix is a table showing correlation coefficients between sets of variables

Key Data Science Terms

Accuracy
Accuracy is the fraction of correct predictions made by a classification model. In multi-class classification, Accuracy = (No. of correct predictions)/(Total no. of examples).

Activation function
It is a function that generates and passes an output value (usually nonlinear) to the next
layer, by taking in the weighted sum of all the inputs from the previous layer. ‘ReLU’ or
‘Sigmoid’ are the examples of activation functions.

AdaGrad
AdaGrad is an advanced gradient descent algorithm that rescales the gradients of each
parameter. It allows each parameter to have an independent learning rate.

AUC (Area Under the ROC Curve)
The area under the ROC curve represents the probability that a classifier will be more confident that 'a randomly chosen positive example is actually positive' than that 'a randomly chosen negative example is positive.'

AUC is an evaluation metric that considers all the possible classification thresholds.

Algorithm
An algorithm is a series of repeatable steps for executing a specific type of task with the
given data. It is a process governed by a set of rules, to be followed by a computer to
perform operations on the data.

Artificial Intelligence
Artificial intelligence or AI is machines acting with apparent intelligence. Modern AI employs statistical and predictive analysis of large amounts of data to 'train' computer systems to make decisions that appear as intelligence.

Backpropagation or ‘backprop.’
Backpropagation is the primary algorithm for implementing ‘gradient descent’ on ‘neural
networks.’ In a backprop algorithm, the output values of each node are calculated in a
forward pass. Then, the partial derivative of the error corresponding to each parameter is
calculated in a backward pass through the graph.

Baseline
A baseline is a reference point for comparing how well a given model is performing. It is a simple model that helps model developers quantify the minimal expected performance of a system for a particular problem.

Batch
A batch is a set of examples used in ‘one gradient update’ or iteration of ‘model training.’

Similarly, a ‘batch size’ represents the number of examples in a batch.

Bayes’ theorem
Bayes’ theorem describes the probability of an event, based on prior knowledge of
conditions that might be related to the event. For an observed outcome, Bayes’ theorem
describes that the conditional probability of each of the set of possible causes can be
computed from the knowledge of the probability of each cause and the underlying
conditional probability of the outcome of each event.

Mathematically, Bayes’ theorem says,

P(A|B) = P(B|A) P(A) / P(B)
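A small numeric sketch of Bayes' theorem in Python (the prevalence and test accuracy figures below are made up):

    # P(disease | positive test) from a made-up prior and likelihoods
    p_disease = 0.01            # P(A): prior probability of the disease
    p_pos_given_disease = 0.95  # P(B|A): test sensitivity
    p_pos_given_healthy = 0.05  # false positive rate of the test
    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)  # P(B)
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos  # Bayes' theorem
    print(round(p_disease_given_pos, 3))  # ~0.161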

Bayesian Network or Bayes Net
A Bayesian network is used for reasoning or decision making in the face of uncertainty. It consists of a graph that represents the relationships between random variables for a particular problem. The reasoning in a Bayes net depends heavily on Bayes' rule.

Bias term or bias
A bias term is an intercept or the offset from an origin. Bias is represented as 'b' or 'w0'.

Binary classification
A binary classification is a type of classification task that outputs one of the two mutually
exclusive conditions as a result. For example, a machine learning model either outputs
‘Spam’ or ‘Not spam’ if it evaluates the email messages.

Big Data
Big data corresponds to extremely large datasets, that had been ‘impractical’ to use
before because of their ‘volume,' ‘velocity’ and ‘variety.' Such datasets are analyzed
computationally to reveal patterns, trends, associations & conditional relations, especially
relating to human behavior and interactions.

Crunching down such extensive data requires data science skills to reveal useful insights
and patterns hidden from general human intelligence.

Bucketing
Bucketing is the conversion of continuous features, based on their value range, into
multiple binary features called ‘buckets’ or ‘bins.’ Instead of using a variable as a

continuous ‘floating-point’ feature, its value ranges can be chopped down to fit into
discrete buckets.

For example, given temperature data, all temperatures ranging from 0.0 to 15.0 can be put into one bin or bucket, 15.1 to 30.0 into another, and so on.
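As a sketch of the temperature example with pandas (the temperature values are made up, and the bucket labels are arbitrary):

    import pandas as pd

    temps = pd.Series([3.2, 14.9, 15.1, 22.0, 29.7, 31.4])  # hypothetical temperatures
    # chop the continuous values into discrete buckets
    buckets = pd.cut(temps, bins=[0.0, 15.0, 30.0, 45.0], labels=["cold", "mild", "hot"])
    print(buckets)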

Calibration layer
A calibration layer is used in post-prediction adjustment to minimize the ‘Prediction bias.’
The calibrated predictions and probabilities must match the distribution value of an
observed set of labels.

Candidate Sampling
Candidate sampling is the ‘training-time’ optimization method, in which the probability is
calculated for all the positive labels, but only for ‘random’ samples of negative labels. The
idea is based on an empirical observation that ‘negative classes’ can learn from ‘less
frequent negative reinforcement’ as long as the ‘positive classes’ always get ‘proper
positive reinforcement.’

For example, for a training example labeled as 'Ferrari' and 'Car', candidate sampling computes the predicted probabilities and the corresponding loss terms for the 'Ferrari' and 'Car' class outputs, in addition to a random subset of the remaining classes such as 'Trucks', 'Aircraft' and 'Motorcycles.'

Checkpoint
The ‘captured state of variables’ of a model at a given time is called ‘Checkpoint’ data.
Checkpoint data enables performing training across multiple sessions. It also aids in exporting model weights and enables training to continue after task preemption.

Chi-square test
Chi-square test is an analytical method to determine ‘whether the classification of data
can be attributed to some underlying law or chance.’ The chi-square analysis is used to
estimate whether two variables in a ‘cross-tabulation’ are correlated. It is a test to check
for the ‘independence’ or ‘degree of freedom’ of variables.

Classification or class
Classification is used to determine the categories to which an item belongs. It is an
example of classic machine learning task. The two types of classifications are ‘binary
classification’ and ‘multiclass classification.’

Example of a binary classification model is where a system detects ‘spam’ or ‘not spam’
emails.

Example of a multiclass classification is where a model identifies ‘cars,' the classes being
‘Ferrari,' ‘Porsche,' ‘Mercedes’ and so on.

Class-imbalanced dataset
A class-imbalanced dataset is a binary classification problem in which the two classes have a wide frequency gap.
For example, a viral flu dataset in which 0.0004 of examples have positive labels and 0.9996 of examples have negative labels poses a class-imbalance problem.
Whereas 'a marriage success predictor' in which 0.55 of examples label the couple keeping a long-term marriage and 0.45 of examples label the couple ending up in divorce does not pose a class-imbalance problem.

Classification threshold
A classification threshold is a scalar-value that is used when mapping ‘logistic regression’
results to binary classification. This threshold value is applied to a model’s predicted
score to separate the ‘positive class’ from ‘negative class.’

For example, consider a logistic regression model, with a classification threshold value of
0.8, which estimates the probability of a given email message as being spam or not
spam. The logistic regression values above 0.8 will be classified as ‘spam’ and values
below 0.8 are classed as ‘not spam.’
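A minimal sketch of applying such a threshold to hypothetical predicted probabilities:

    import numpy as np

    predicted_prob_spam = np.array([0.95, 0.40, 0.81, 0.79, 0.10])  # hypothetical model scores
    threshold = 0.8
    # scores above the classification threshold map to the positive class ('spam')
    labels = np.where(predicted_prob_spam > threshold, "spam", "not spam")
    print(labels)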

Clustering
Clustering is an unsupervised algorithm for dividing data instances into groups that are not a predetermined set of groups; the groups are 'clustered' because of similarities found amongst the instances.

'Centroid' is the term used to denote the center of each such cluster.

Coefficient
A coefficient is the ‘multiplier value’ prefixed to a variable. It can be a number or an
algebraic symbol. Data statistics involve the usage of specific coefficient terms such as
Cramer’s coefficient and Gini coefficient.

Computational linguistics or Natural language processing or NLP


Computational linguistics or NLP is a branch of computer science to analyze the text of
spoken languages like Spanish or English, and convert it into structured data that can be
used to drive the program logic.

For example, a model can analyze and process text documents, Facebook posts, etc. to
mine for potentially valuable information.

Confidence interval
The confidence interval is a specific range around a ‘prediction’ or ‘estimate’ to indicate
the scope of error by the model. The confidence interval is also combined with the
probability that a predicted value will fall within that specified range.

Confusion matrix
A confusion matrix is an NxN matrix that depicts how successful a classification model's predictions were. One axis of the matrix represents the label that the model predicted, and the other axis depicts the actual label.

The confusion matrix, in case of a multi-class model, helps in determining the mistake
patterns. Such a confusion matrix contains sufficient information to calculate performance
metrics, like ‘precision’ and ‘recall.’
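For example, a confusion matrix can be computed with scikit-learn (the labels below are made up):

    from sklearn.metrics import confusion_matrix

    y_true = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]  # actual labels
    y_pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]  # predicted labels
    # rows correspond to actual labels, columns to predicted labels
    print(confusion_matrix(y_true, y_pred, labels=["spam", "not spam"]))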

Continuous variable
A continuous variable can have an infinite number of values within a particular range. Its
nature contrasts with the ‘discrete variables’ or ‘discrete feature.’ For example, if you can
express a value as a decimal number, then it is a continuous variable.

Convergence
In simple words, convergence is a point where additional training on the data will not
improve the model anymore. At convergence, the ‘training loss’ and ‘validation loss’ do
not change with further iterations.

In deep learning models, loss values can stay constant or unchanging for many numbers
of iterations, before finally descending. This observation might produce a false sense of
convergence.

Convex function
A convex function is usually a ‘U-shaped’ curve. In degenerate cases, however, the
convex function is shaped like a line. These functions represent loss functions. The sum
of two convex functions is always a convex function. Example of a convex function is ‘Log
Loss’ function.

Correlation
Correlation is the measure of how closely two data sets are related. Take for example two data sets, 'subscriptions' and 'magazine ads': when more ads get displayed, more subscriptions for the magazine get added, i.e., these data sets correlate. A correlation coefficient of '1' is a perfect correlation, 0.8 represents a strong correlation, while a value of 0.12 represents a weak correlation.

The correlation coefficient can also be negative. In the cases where data sets are
inversely related to each other, a negative correlation might occur. For example, when
‘mileage’ goes up, the ‘fuel costs’ go down. A correlation coefficient of -1 is a perfect
negative correlation.

Covariance
Covariance is the average value of the product of two variables, diminished by the product of their average values. It represents how two variables vary together around their means.

Cross-entropy
Cross-entropy is a means to quantify the difference between two probability distributions.
It is a generalization of ‘Log Loss’ function to multi-class classification problems.

Data-driven Documents or D3
D3 is a popular JavaScript library used by the data scientists, to present their results of
the analysis in the form of interactive visualizations embedded in web pages.

Data mining
Data mining is the analysis of large structured datasets by a computer to find hidden
patterns, relations, trends and insights within it. Data mining comes from data science.

Dataset
A data set is a collection of structured information, used as ‘examples’ in machine
learning.

Data science
Data science is the field of study employing scientific methods, processes, and systems
to extract knowledge and insights from complex data in various forms.

Data structure
A data structure represents the way in which the information of the data is arranged.
Example, array structure or ‘tree’ data structure.

Data wrangling
Data wrangling or data munging is the conversion of data to make it easier to work with. It
is achieved by using scripting languages like ‘Perl.’

Decision boundary
Decision boundary is the separating line between the classes learned by a model in a
‘binary class’ or ‘multiclass classification’ problems.

Decision trees
A decision tree represents the number of possible decision paths and an outcome for
each path, in the form of a tree structure.

Deep learning or deep model


It is a type of ‘neural network’ containing a multi-level algorithm to process data at
increasing level of abstraction. For example, the first level of the algorithm may identify
lines, and the second recognizes the combination of lines as shapes and the third level
recognizes the combination of shapes as objects.

Deep models depend on ‘trainable nonlinearities.’ It is a popular model for image
classification.

Dense feature
It is a feature in which most values are non-zero. A dense feature is typically a 'Tensor' of floating point values.

Dependent variable
A dependent variable’s value is influenced by the value of an independent variable. For
example, ‘The magazine ad budget’ is an independent variable value. However, the
number of ‘subscriptions’ made is dependent on the former variable.

Dimension reduction
Dimension reduction is the extraction of one or more ‘dimensions’ that ‘capture’ as many
variations in the data as possible. It is implemented with a technique called ‘Principal
component analysis.’ Data reduction is useful in finding a small subset of data that
captures ‘most of the variation’ in a given dataset.

Discrete feature or discrete variable


A discrete feature is a variable whose possible values are finite. It contrasts with
‘continuous feature.’

Dropout regularization
A dropout regularization ‘removes a random selection of a fixed number of units in a
network layer’ for a single gradient step. This form of regularization is used in training
neural networks. The more the number of units dropped out, the stronger will be the
regularization.

Dynamic model
A dynamic model is trained online with the continuously updated data. In such a model,
the data keeps entering it continually.

Early stopping
If the loss on ‘validation dataset’ increases, the ‘generalization performance’ worsens.
Hence, the model training has to be ended; this is known as early stopping. 'Early stopping' is a method of regularization in which model training ends before the 'training loss' has finished decreasing.

Embeddings
Embeddings are categorical features represented as continuous-valued features. An
embedding is a translation of a ‘High-dimensional vector’ into a ‘Low dimensional space.’

Embeddings are trained by ‘Backpropagating loss’ like any other parameter in a neural
network.

Empirical Risk Minimization (ERM)
ERM is the selection of ‘model function’ that minimizes training losses. It contrasts with
‘Structural risk minimization.’

Ensemble
To ‘ensemble’ is to merge the predictions of multiple models. For example, ‘deep and
wide models’ are the ensemble. An ensemble can be created via different initializations,
different overall structures or different hyperparameters.

Estimator
An estimator encapsulates or contains the logic that builds a TensorFlow graph and runs
a TensorFlow session.

Example
An ‘example’ represents ‘one row’ of a given data set. It also contains one or more
‘features.' It might also carry labels. Hence, examples can be labeled or unlabeled.

False Negative or FN
If a model mistakenly predicts an example to be of ‘negative class,' the outcome is called
false negative. Example, if a model predicts an email as ‘not spam’(negative class) but it
actually was ‘spam.’

False Positive or FP
If a model mistakenly predicts an example to be of ‘positive class,' the outcome is called
false positive. Example, if a model predicts an email as ‘spam’(positive class) but it
actually was ‘not spam.’

False positive rate


Mathematically, the false positive rate is defined as;

FP rate=(Number of false positives)/(Number of false positives+number of true negatives)

The FP rate is represented by the x-axis of a ROC curve.

Feature
A feature is an input variable value used to make predictions. It represents ‘pieces of
measurable information’ about something. For example, a person’s age, height, and
weight represent three features about him/her. A feature can also be called property or
an attribute.

Feature columns or FeatureColumns

A feature column is a set of related features of an example. For instance, ‘a set of all
possible languages,' a person might know, will be listed under one feature column. A
feature column might contain a single feature as well.

Feature cross
A feature cross represents non-linear relationships between features. It is formed by
multiplying or taking a Cartesian product of individual features.

Feature engineering
Feature engineering involves ‘determining which feature will be useful in training a
model.’ The ‘raw data’ from log files and other sources is then converted into the said
features. Feature engineering is also referred to as ‘feature extraction.’

Feature set
It is the ‘set of features’ on which a machine learning model trains. Take, for example, the
model of a used car, its age, distance covered, etc. These ‘set of features’ can be used to
predict the price of that car.

GATE or General Architecture for Text Engineering


GATE is an open-source Java-based framework for natural language processing tasks.
This framework allows the user to integrate other tools designed to be plugged into it.

Generalization
Generalization is the ability of a model to make correct predictions on fresh, unseen data, rather than on the data previously used to train the model.

Generalized linear model


A generalized linear model is a generalization of ‘least squares regression models’ based
on Gaussian noise, to other types of models based on other types of noises. The
examples of generalized linear models are ‘Logistic regression’ and ‘multiclass
regression.’

A generalized linear model cannot learn ‘new features’ like a deep learning model does.

Gradient
A gradient represents the vector of partial derivatives with respect to all the independent variables. A gradient always points towards the steepest ascent.

Gradient boosting
Gradient boosting produces a prediction model in the form of an ensemble of weak
prediction models. This is a machine learning technique for regression and solving
classification problems.

Gradient boosting builds the model stage-wise and generalizes them by allowing
optimization of arbitrary differentiable loss functions.

Gradient clipping
Gradient clipping is the method of ensuring numerical stability by ‘capping’ gradient
values before applying them.

Gradient descent
Gradient descent is a loss minimization technique, which involves computing of gradients
of loss with respect to the model’s parameters, learned or trained on training data.
Gradient descent works by adjusting parameters and finding the optimum combination of
‘weights’ and bias to minimize loss.
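A minimal sketch of gradient descent for a one-feature linear model (the data, learning rate and iteration count are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical feature
    y = np.array([3.1, 4.9, 7.2, 8.8])  # hypothetical target, roughly 2x + 1
    w, b, lr = 0.0, 0.0, 0.01           # initial weight, bias and learning rate

    for _ in range(5000):
        error = (w * x + b) - y
        # gradients of the mean squared error loss with respect to w and b
        grad_w = 2 * np.mean(error * x)
        grad_b = 2 * np.mean(error)
        # step in the direction opposite to the gradient (steepest descent)
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)  # approaches roughly w = 2, b = 1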

Graph
A graph represents a ‘computation specification’ to be processed in TensorFlow. Such a
graph is visualized using TensorBoard. The nodes on the graph depict operations and
edges represent the passing of the result as an operand to another operation (or Tensor).

Heuristic
A heuristic is a practical solution to a problem that aids in learning and making progress.

Hidden Layer
A hidden layer in a neural network lies between the input layer (or feature) and the output
layer (or prediction). A neural network can contain single or multiple hidden layers.

Hinge loss
A hinge loss is a loss function designed for classification models, to find the decision
boundary as far as possible from each training example. A hinge loss function maximizes
the margin between examples and the boundary.

Histogram
A histogram represents the distribution of numerical data through a vertical bar graph.

Holdout data
These are the datasets that are intentionally held-out during the model’s training. Holdout
data helps in evaluation of the model’s ability to generalize to data, other than the data it
was trained on. Examples of holdout datasets are validation dataset and test data set.

Hyperparameter
The parameters that can be ‘changed’ or ‘tweaked’ during successive training runs of a
model are known as hyperparameters.


Independently and identically distributed (IID)


IID represents a collective of data or variables that have ‘same probability distribution’ as
the others and are mutually independent. In case of IIDs, the probability of a predicted
outcome is ‘no more’ or ‘less’ likely than any other prediction.

An example of an IID process is 'a fairly rolled die': all the faces always have an equal probability of coming up, irrespective of the faces that have already come up.

Inference
Inference is the process through which a trained model makes predictions on unlabeled examples. This definition is in regard to machine learning.

Input layer
The input layer is the first layer to receive the input data in a neural network.

Inter-rater agreement
Inter-rater agreement is a way to measure the ‘agreement’ between human raters while
undertaking a task. A disagreement amongst the raters calls for the improvement in ‘task
instructions.’

Kernel Support Vector Machines (KSVMs)


A KSVM maps the input data vectors to a higher dimensional space for maximizing the
margin between positive and negative classes. KSVMs employ hinge loss as a loss
function.

K-means clustering
It is a data-mining algorithm to classify or group or ‘cluster’ ‘N’ number of objects based
on their features into ‘K’ number of groups (or clusters).
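A short sketch with scikit-learn, grouping made-up 2-D points into K = 2 clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])  # hypothetical data
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster assignment of each point
    print(kmeans.cluster_centers_)  # the centroid of each cluster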

K-nearest neighbors or kNN


It is a machine learning algorithm that examines ‘k’ number of ‘neighbors’ to classify
things based on their similarity. Here, ‘similarity’ means the comparison of ‘feature values’
in the neighbors being compared.

Latent variable
Latent variables are hidden variables, whose presence is inferred by directly measuring
the observed variables. The inference of these variables is made through a mathematical
model.

Label
In machine learning terms, a label represents the ‘answer’ or ‘result’ associated with an
example.

Layer
A layer is a set of neurons that process a set of input features or the output of those
neurons in a neural network.

Lift
Lift signifies how frequently a pattern will be observed by chance. If the lift is 1, then the pattern is assumed to be occurring coincidentally. The higher the lift, the higher the chance that the observed pattern is real.

Linear regression
Linear regression is the method of graphically expressing the relationship between a
scalar dependent variable ‘y,' and one or more independent variable ‘X.' For example,
the relationship between ‘price’ and ‘sales’ can be expressed with an equation as a
straight line on the graph.

Logistic regression
Logistic regression is a model similar to linear regression, except that the output is made to fit the logistic function. In other words, the potential results are not continuous but a specific set of categories.

Machine learning
Machine learning or ML involves the development of algorithms to figure out insights from
extensive and vast data. ‘Learning’ refers to ‘refining’ of the models by supplying
additional data, to make it perform better with each iteration.

Markov chain
Markov chain is an algorithm, used to determine the possibility of occurrence of an event,
based on which other events have already occurred. This algorithm works with the data
of ‘series of events.’

Matrix
Matrix is merely a set of data arranged in rows and columns.

Mean
Mean, or arithmetic mean is the average value of numbers.

Mean Absolute error


Mean Absolute Error or MAE is the average absolute error of the predicted values as compared to the observed values.

Mean Squared Error or MSE
MSE is the average of the squared differences between the predicted values and the observed values.

Median
The central or middle value of sorted data is called the median. If the number of values in the data is even, the average of the two central values becomes the median.

Mode
For a given set of data values, the value that appears most frequently is called the mode.
Mode, like median, is a way to measure the central tendency.

Model
In statistical analysis, modeling refers to the specification of a probabilistic relationship
existing between different variables. A ‘model’ runs on algorithms and training data to
‘learn’ and then make predictions.

Monte Carlo method


Monte Carlo method is a technique to solve numerical problems by studying numerous
randomly generated numbers, to find an approximate solution. Such a numerical problem
is often challenging to solve by other mathematical methods.

The Monte Carlo method is often used together with Markov chain algorithms (Markov chain Monte Carlo).

Moving average
Moving average represents the ‘continuous average’ of new time series data. The mean
of such data is calculated at equal time intervals and is updated according to the most
recent value, while the older value gets dropped.

Multivariate analysis
The analysis of ‘dependency of multiple variables over each other’ is called the
multivariate analysis.

N-gram
N-gram is the ‘scanning of patterns in a sequence of ‘N’ items.’ It is typically used in
natural language processing. For example, unigram analysis, bigram analysis, trigram
analysis and so on.

Naive Bayes classifier


A naive Bayes classifier is an algorithm based on Bayes’ theorem, which classifies
features with an assumption that ‘every feature is independent of every other feature.’
This classification algorithm is called ‘naive’ because all the features might not
necessarily be independent, and it becomes one downside of this algorithm.

Natural language processing
Natural language processing or NLP is a collection of techniques to structurize and
process raw text from human spoken languages to extract information.

Neural network
A neural network uses algorithmic processes that mimic the human brain. It attempts to
find insights and hidden patterns from vast data sets. A neural network runs on learning
architectures and is ‘trained’ on large data sets to make such predictions.

Normal distribution
Normal distribution or ‘bell curve’ or ‘Gaussian distribution’ is a continuous bell-shaped
graph with the mean value at the center. It is a widely used distribution curve in statistics.

Null hypothesis
A null hypothesis states that 'a single variable is not different from its mean, or that no variation exists between variables.' Hence, according to the null hypothesis, the given observations hold no 'statistical significance.'

Objective function
An objective function maximizes or minimizes a result (or objective) by changing the values of other quantities, such as decision variables, subject to constraints.

One hot encoding
One hot encoding converts categorical variables into numerical (binary indicator) variables, to make them interpretable to the learning model.
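For instance, with pandas (the 'color' column is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})  # hypothetical categories
    # each category becomes its own binary indicator column
    print(pd.get_dummies(df, columns=["color"]))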

Ordinal variable
Ordinal variables are ordered variables with discrete values.

Outlier
Observations that diverge far away from the overall pattern in a sample are called
outliers. An outlier may also indicate an error or rare events.

Overfitting
An overly complicated model of the data that takes too many outliers or 'intrinsic data quirks' into account. An overfitted model of the training data is not very useful for finding patterns in test data.

P value

The p-value depicts the probability of getting a result equal to or more extreme than the actual observation, under the null hypothesis. It is a measure of how likely we are to see a gap between the groups as large as the one observed when there actually isn't any gap.

Perceptron
A perceptron is the simplest neural network, in which a single neuron takes 'n' binary inputs, computes a weighted sum and applies a step function.

Pivot table
A pivot table allows for easily rearranging long lists of data and summarizing them. The act of rearranging the data is known as 'pivoting.' A pivot table also allows for the dynamic rearrangement of the data by just creating a pivot summary, taking away the need to employ formulas or copy the data into a new arrangement.

Poisson distribution
It is the distribution of independent events over a defined time period and space. Poisson
distribution is used to predict the probability of occurrence of an event.

Predictive analytics
Predictive analytics involves extraction of information from existing data sets to determine
patterns and insights. These patterns and insights are used to predict future outcomes or
event occurrences.

Precision and recall


Precision is simply the measure of 'true positive predictions' out of all the positive predictions. Mathematically, Precision = (True positives)/(True positives + false positives).

Recall, on the other hand, is the measure of correct positive predictions out of all actual positives: Recall = (True positives)/(True positives + false negatives).

For example, take a visual recognition model that recognizes 'oranges.' It flags seven items as oranges in a picture containing ten oranges and some apples.

Out of those seven, five are actually oranges (true positives), and the remaining two are apples (false positives).

Then, 'precision' = 5/7 and 'recall' = 5/10.
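The same numbers can be reproduced with scikit-learn, using labels constructed to match the example above (ten actual oranges, seven predicted, five correct):

    from sklearn.metrics import precision_score, recall_score

    # 1 = orange, 0 = apple; ten actual oranges and five apples
    y_true = [1] * 10 + [0] * 5
    # five oranges found, five missed, two apples wrongly flagged as oranges
    y_pred = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 3
    print(precision_score(y_true, y_pred))  # 5/7, approximately 0.714
    print(recall_score(y_true, y_pred))     # 5/10 = 0.5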

Predictor variables
Predictor variables are used to predict the values of dependent (response) variables.

Principal component analysis
This algorithm identifies the direction of highest variance in the given data; this direction is tagged as the first principal component.

Prior distribution

In Bayesian statistics, ‘prior’ probability distribution of an uncertain quantity is based on
assumptions and beliefs, without taking any evidence into account.

Quantiles and quartiles


Division of sorted values into groups having the same number of values is called a
‘quantile’ group. If the number of these groups is four, they are called ‘quartiles.’

R Programming
R is an open-source programming language for statistical analysis and graph generation,
available for different operating systems.

Random forest
An algorithm that employs ‘a collection of tree data structures’ for the classification task.
The input is classified or ‘voted’ for by each tree. ‘Random forest’ chooses the
classification with the highest ‘votes’ compared to all the trees.

Range
The range is the difference between the highest and the lowest value in a given set of
numbers. For example, consider the set 2,4,5,7,8,9,12. The range=12-2 i.e. 10.

Regression
Regression aims to measure the dependency of a dependent variable on other changing variables. Examples: linear regression, logistic regression, lasso regression, etc.

Reinforcement learning
Reinforcement learning or RL is a learning algorithm that allows a model to interact with
an environment and make decisions. The model is not given specific goals, but when it
does something ‘right’, it is given feedback. This ‘reinforcement’ helps the classification
model in learning to make right predictions. RL model also learns from its ‘past’
experiences.

Response variable
The response variable is the one that is influenced by other (predictor) variables. It is also called the dependent variable.

Ridge regression
Ridge regression performs the ‘L2 regularization’ function on the optimization objective.
In other words, it adds the factor of the sum of squares of coefficients to the objective.

Root Mean Squared Error or RMSE

RMSE denotes the standard deviation of prediction errors from the regression line. It is
simply the square root of the mean squared error. RMSE signifies the ‘spread’ or
‘concentration’ of data around the regression line.

S curve
As the name suggests, ‘S-curve’ is a graph shaped like the letter ‘S.' It is a curve that
plots variables like cost, number, population, etc. against time.

Scalar
A scalar quantity represents the ‘magnitude’ or ‘intensity’ of a measure and not its
direction in space or time. For example, temperature, volume, etc.

Semi-supervised learning
Semi-supervised learning involves the use of extensive unlabeled data at the input. Only a small amount of data is labeled, from which the model learns to make the right classifications without much external supervision.

Serial correlation
Serial correlation or autocorrelation is a pattern in a series, where each value is directly
influenced by the value next to it or preceding it. It is calculated by shifting a time-series
over the numerical series by an interval called ’lag.’

Skewness
Skewness represents the asymmetry of a distribution or data set, to the left or the right of its center point.

Spatiotemporal data
A spatiotemporal data includes the space and time information about its values. In other
words, it is a time-series data with geographic identifiers.

Standard deviation
Standard deviation represents the ‘dispersion of the data.’ It is the square root of the
variance to show how far an observation is from the mean value.

Standard error
Standard error signifies the ‘statistical accuracy of an estimate.’ It is equal to the standard
deviation of the sampling distribution of a statistic.

Standard normal distribution


It is the same as the normal distribution, just with a mean of 0 and a standard deviation equal to 1.

Standardized score

The standard score, normal score or z-score is the transformation of a raw score for evaluating it in reference to the standard normal distribution, by converting it into units of standard deviation above or below the mean.

Strata
Division of the data into homogeneous groups and drawing random samples from each group produces 'strata.' For example, forming strata of the population based on demographic data.

Supervised learning
Supervised learning involves using algorithms to classify the input into specific pre-
determined or known classes. In such a case, the prediction made by the model is based
on a ‘given set of predictors.’

Some examples of supervised learning algorithms are Random forest, decision tree, and
kNN, etc.

Support vector machine or SVM


A support vector machine is a discriminative classifier, which plots data-items in ‘n’
dimensional space. Here, ‘n’ represents the number of features each data-item (or data
point) has. The data points are plotted on the coordinates (support vectors).

T-distribution
The t-distribution is used in place of the normal distribution when working with small samples for which the population standard deviation is unknown. It is also known as 'Student's t-distribution.'

Type I error
The incorrect decision to reject a true null hypothesis is called a Type I error.

Type II error
The incorrect decision to retain or keep a false null hypothesis is called a Type II error.

T-test
It is the analysis of two population datasets by testing whether the difference between their means is statistically significant.

Univariate analysis
The purpose of univariate analysis is to describe the data. It analyzes a single variable at a time, summarizing its central tendency and dispersion.

Unsupervised learning

An algorithm that classifies groups of data without knowing what the groups will be. There
is no target or outcome variable to predict or estimate. Unsupervised learning focuses
majorly on learning from the underlying data based on its attributes.

Variance
It is the ‘variation’ of the numbers in a given data from the mean value. Variance
represents the magnitude of differences in a given set of numbers.

Vector
In mathematical terms, vector denotes the quantities with magnitude and direction in the
space or time. In data science terms, it means ‘ordered set of real numbers, each
representing a distance on a coordinate axis.’ For example, velocity, momentum, or any
other series of details around which the model is being built.

Vector-space
Vector space is the collection of vectors. For example, a matrix is a vector-space.

Weka
Weka is a collection of machine learning algorithms and tools for mining data. Using
Weka, the data can be pre-processed, regressed, classified, associated with rules and
visualized.


Next Go Through!

o Machine Learning prep

o SQL prep

o Data Visualization prep

o Business Analytics Case study prep

Disclaimer:

A small fraction of the information presented in this prep is researched and adapted to the context of the prep. Various online platforms like GitHub and blogs provided the required information on data science interview scenarios.

We do not claim that this prep is a comprehensive resource for preparing to face any interview.
Any questions can be addressed to data science prep authors: actiondatas@gmail.com

