
Data Visualization

IIM Udaipur
“A picture is worth a thousand words” – Chinese
proverb
Uses
• Preprocessing and cleaning the data: “illegal”
values, missing values, duplicate rows,
columns with all the same values etc
• Variable selection: which variables to be
selected in the analysis
• Exploration: hidden trends in the data
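The cleaning checks listed above are usually run as simple tabular counts alongside the plots. A minimal sketch with pandas follows; the toy data, the column names and the rule "age cannot be negative" are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data with typical problems: a negative age ("illegal" value),
# a missing income, a duplicated row, and a column with all the same values.
df = pd.DataFrame({
    "age": [25, 40, -3, 40, 31],
    "income": [50.0, 72.5, 61.0, 72.5, None],
    "country": ["IN", "IN", "IN", "IN", "IN"],
})

# Missing values per column
missing_per_column = df.isna().sum()

# Rows that exactly repeat an earlier row
n_duplicate_rows = int(df.duplicated().sum())

# Columns that contain a single distinct value
constant_columns = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Assumed validity rule for this example: age must be non-negative
illegal_age_rows = df[df["age"] < 0]

print(missing_per_column)
print("duplicate rows:", n_duplicate_rows)
print("constant columns:", constant_columns)
print(illegal_age_rows)
```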
Some of the Most Commonly Used Graphs in
the Business World
• Bar chart
• Line chart
• Scatterplot
• Histogram
• Boxplot
• Scatterplot matrix
• Parallel co-ordinates plot
Use
• Generally speaking, these graphs display one
or two columns (i.e., variables) of data at a
time
• Useful to explore data type, volume, relations
etc.
Bar Chart
• May be used to compare a single
statistic/measure (e.g., average, count,
percentage) across groups
• Typically used for qualitative (categorical)
data, where it shows the frequency/count for
each category
• Height of the bar represents the value of the
statistic/measure (typically, count)
• Different bars correspond to different groups
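As an illustration, a bar chart of counts per category might be drawn as follows. This is a minimal sketch with matplotlib; the "region" data is made up, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data: one region label per customer
df = pd.DataFrame({"region": ["North", "South", "North", "East", "South", "North"]})

# The statistic shown by each bar: the count of records in each group
counts = df["region"].value_counts().sort_index()

fig, ax = plt.subplots()
bars = ax.bar(list(counts.index), list(counts.values))  # one bar per group
ax.set_xlabel("Region")
ax.set_ylabel("Count")
ax.set_title("Customers per region")
```

Replacing `value_counts()` with a grouped mean or percentage would give the other statistics mentioned above.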
Line Chart
• Useful for time series data
• Choice of the time frame should be made
keeping in mind the forecasting task
• Shows the trend in the values
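A line chart of a monthly series could be sketched like this. The sales figures and the twelve-month time frame are invented for the example; in practice the time frame would be chosen to suit the forecasting task.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly sales, indexed by the first day of each month
dates = pd.date_range("2023-01-01", periods=12, freq="MS")
sales = pd.Series(
    [100, 104, 110, 108, 115, 121, 125, 130, 128, 135, 142, 150],
    index=dates,
)

fig, ax = plt.subplots()
(line,) = ax.plot(sales.index, sales.values, marker="o")  # time on the x-axis
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly sales")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```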
Scatterplot
• Shows the relationship between two variables
• Suggests the type of correlation two variables may
have: positive or negative
• Can show any non-linear relationship that may be
present between two variables
• Extremely important for prediction problems, as we
would like to see the relationship between the
response and predictors
• Cannot be used for a classification task in its basic form,
as the response in a classification task is categorical (often binary)
• If colour-coded, one more variable can be looked at
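A basic scatterplot of a response against one predictor might look like the sketch below. The data is simulated with an assumed positive linear relationship, so the variable names and numbers are illustrative only; the correlation coefficient is computed to go with the visual impression.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Simulated data (assumption): the response rises with the predictor, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)            # predictor
y = 2.0 * x + rng.normal(0, 2, size=100)    # response

# Pearson correlation: the sign indicates a positive or negative relationship
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
points = ax.scatter(x, y)
ax.set_xlabel("Predictor")
ax.set_ylabel("Response")
ax.set_title(f"Scatterplot (r = {r:.2f})")
```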
Histogram
• Shows the distribution of a numerical
(continuous) variable
• Useful in supervised learning, for determining
potential data mining methods and variable
transformations (for example, transforming a
skewed variable to a symmetric one, for
regression)
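The transformation idea above can be sketched by drawing histograms before and after a log transform. The "income" variable is simulated from a log-normal distribution (an assumption chosen because it is right-skewed), and `skewness` is a small helper written for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

# Simulated right-skewed variable (assumption for illustration)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=1000)
log_income = np.log(income)  # log transform pulls in the long right tail

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_raw.hist(income, bins=30)
ax_raw.set_title(f"Raw (skewness {skewness(income):.2f})")
ax_log.hist(log_income, bins=30)
ax_log.set_title(f"Log-transformed (skewness {skewness(log_income):.2f})")
```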
Box Plot
• Shows the distribution of a numerical
(continuous) variable
• Useful in supervised learning, for determining
potential data mining methods and variable
transformations (for example, transforming a
skewed variable to a symmetric one, for
regression)
• Effective for comparing subgroups by
generating side-by-side box plots
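Side-by-side box plots for two subgroups could be produced as in this sketch. The groups "A" and "B" and their simulated values are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Simulated values for two hypothetical subgroups
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(50, 5, size=200),
    "B": rng.normal(60, 10, size=200),
}

fig, ax = plt.subplots()
# One box per subgroup, drawn next to each other at positions 1 and 2
result = ax.boxplot(list(groups.values()))
ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Value")
ax.set_title("Distribution by subgroup")
```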
Scatterplot Matrix
• Relations between variables
• Diagonal histograms show the individual
distribution of the variables
• Choose the number of variables wisely, so that
each panel remains readable
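A scatterplot matrix with histograms on the diagonal can be sketched with `pandas.plotting.scatter_matrix`. The three variables below are simulated and their names are hypothetical; only three are used so the panels stay readable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Simulated data (assumption): price depends on area, age is unrelated
rng = np.random.default_rng(0)
n = 200
area = rng.normal(100, 20, size=n)
df = pd.DataFrame({
    "area": area,
    "price": 3 * area + rng.normal(0, 30, size=n),
    "age": rng.uniform(0, 50, size=n),
})

# Off-diagonal panels: pairwise scatterplots; diagonal panels: histograms
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
```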
Parallel co-ordinates plot
• A vertical axis is drawn for each variable
• Each record is represented by drawing a line
that connects its values on different axes
• Creates a multivariate profile for every record
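One way to draw such a plot is `pandas.plotting.parallel_coordinates`, sketched below. That function needs a class column to colour the lines, so a hypothetical "segment" column is included; the variables are also rescaled to the 0 to 1 range first, an assumed preprocessing step, because the function draws all axes on one shared scale.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical product records
df = pd.DataFrame({
    "price": [12.0, 15.5, 9.0, 20.0, 18.0, 11.0],
    "rating": [3.5, 4.0, 2.5, 4.8, 4.5, 3.0],
    "weight": [200.0, 350.0, 150.0, 500.0, 420.0, 180.0],
    "segment": ["budget", "premium", "budget", "premium", "premium", "budget"],
})
variables = ["price", "rating", "weight"]

# Min-max scale each variable so that no single variable dominates the shared axis
scaled = df.copy()
scaled[variables] = (df[variables] - df[variables].min()) / (
    df[variables].max() - df[variables].min()
)

fig, ax = plt.subplots()
# One vertical axis per variable; one connected line per record
parallel_coordinates(
    scaled,
    class_column="segment",
    cols=variables,
    color=["tab:blue", "tab:orange"],
    ax=ax,
)
ax.set_ylabel("Scaled value")
```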
Some Methods Useful for Large
Datasets
• The graphs/charts discussed so far are very
powerful, and give accurate information
• But they cannot accommodate large amounts
of data and/or many variables at once
• Some methods: Heatmap
Heatmap
• Useful for: (i) visualizing correlation tables, (ii)
visualizing missing values in the data
• Colour is used to denote values
• Darker shades correspond to stronger
correlation
• Caution: They are not replacements for more
accurate graphs already discussed, as colour
differences cannot be perceived accurately!
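Both uses can be sketched with matplotlib's `imshow`. The data below is simulated, with missing values injected on purpose. The absolute correlation is plotted with a sequential colour map so that darker cells mean stronger correlation; this is a choice made for the example and it hides the sign of the correlation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data (assumption): b is strongly related to a, c is unrelated
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})
df.loc[::10, "c"] = np.nan  # inject a missing value in every tenth row of c

corr = df.corr()  # pairwise correlations, ignoring missing values

fig, (ax_corr, ax_miss) = plt.subplots(1, 2, figsize=(8, 3))

# (i) Correlation table: darker cell = stronger (absolute) correlation
image = ax_corr.imshow(corr.abs().to_numpy(), cmap="Blues", vmin=0, vmax=1)
ax_corr.set_xticks(range(len(corr.columns)))
ax_corr.set_xticklabels(list(corr.columns))
ax_corr.set_yticks(range(len(corr.columns)))
ax_corr.set_yticklabels(list(corr.columns))
fig.colorbar(image, ax=ax_corr)

# (ii) Missing values: rows are records, columns are variables, dark = missing
missing = df.isna().to_numpy().astype(int)
ax_miss.imshow(missing, aspect="auto", cmap="gray_r", interpolation="nearest")
ax_miss.set_xticks(range(len(df.columns)))
ax_miss.set_xticklabels(list(df.columns))
ax_miss.set_ylabel("Record")
```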
Colour-coded Scatterplot
• Colour-coding to bring in a categorical variable
• In this enhanced version, the scatterplot can be
used for a classification task
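A colour-coded scatterplot for a binary response might be drawn as follows. The predictors "income" and "spend" and the class label "bought" are hypothetical, and the data is simulated so that the two classes are visibly separated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data (assumption): two numeric predictors and a binary class label
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": np.concatenate([rng.normal(40, 8, 50), rng.normal(70, 8, 50)]),
    "spend": np.concatenate([rng.normal(20, 5, 50), rng.normal(35, 5, 50)]),
    "bought": ["no"] * 50 + ["yes"] * 50,
})

colours = {"no": "tab:red", "yes": "tab:green"}

fig, ax = plt.subplots()
# Draw one set of points per class so that each class gets its own colour
for label, group in df.groupby("bought"):
    ax.scatter(group["income"], group["spend"],
               color=colours[label], label=label, alpha=0.7)
ax.set_xlabel("Income")
ax.set_ylabel("Spend")
ax.legend(title="bought")
```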
