
https://seaborn.pydata.org/tutorial/axis_grids.html (Cluster or Area Groups) or maybe CHGr-

https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0

https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07

https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html

https://jovianlin.io/data-visualization-seaborn-part-2/

https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid

FacetGrid (seaborn): a very important class

https://jovianlin.io/data-visualization-seaborn-part-3/

https://elitedatascience.com/python-seaborn-tutorial

https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850

https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-1/

https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-2/

http://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap

https://www.drawingfromdata.com/setting-figure-size-using-seaborn-and-matplotlib

Matplotlib reference for everything

https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html

https://www.analyticsvidhya.com/blog/2019/09/comprehensive-data-visualization-guide-seaborn-python/

https://gilberttanner.com/blog/introduction-to-data-visualization-inpython

https://likegeeks.com/seaborn-heatmap-tutorial/

https://wellsr.com/python/seaborn-line-plot-data-visualization/

Python Data Science Handbook by Jake VanderPlas, Chapter 4: Matplotlib


Visualizations

For a book-length treatment of Matplotlib, I would recommend Interactive
Applications Using Matplotlib, written by Matplotlib core developer Ben Root.

Other Python Graphics Libraries

Although Matplotlib is the most prominent Python visualization library, there
are other more modern tools that are worth exploring as well. I’ll mention a
few of them briefly here:

• Bokeh is a JavaScript visualization library with a Python frontend that creates
highly interactive visualizations capable of handling very large and/or streaming
datasets. The Python frontend outputs a JSON data structure that can be interpreted
by the Bokeh JS engine.
• Plotly is the eponymous open source product of the Plotly company, and is
similar in spirit to Bokeh. Because Plotly is the main product of a startup, it is
receiving a high level of development effort. Use of the library is entirely free.
• Vispy is an actively developed project focused on dynamic visualizations of
very large datasets. Because it is built to target OpenGL and make use of efficient
graphics processors in your computer, it is able to render some quite large and
stunning visualizations.
• Vega and Vega-Lite are declarative graphics representations, and are the
product of years of research into the fundamental language of data visualization.
The reference rendering implementation is JavaScript, but the API is language
agnostic. There is a Python API under development in the Altair package. Though
it’s not mature yet, I’m quite excited for the possibilities of this project to provide a
common reference point for visualization in Python and other languages.

# Exploratory plots; `data` is assumed to be a pandas DataFrame loaded elsewhere.
# sns.distplot(data['Path Loss Diff. > 0 dB (%)'])
# sns.distplot(data['Power Red. BS = 0 dB (%)']);

# Subplot

# Box Plot
# plt.hist(data['Path Loss Diff. > 0 dB (%)'], density=True, alpha=0.5)  # density= replaces the removed normed=
sns.kdeplot(data['Path Loss Diff. > 0 dB (%)'], shade=True)  # seaborn 0.11+ spells shade= as fill=
# sns.kdeplot(data['Path Loss DL > 150 dB (%)'], shade=True)
# plt.hist(data['Path Loss DL > 150 dB (%)'], density=True, alpha=0.5)

# Cluster or Area Groups
# g = sns.FacetGrid(tips, col="time")
# g.map(plt.hist, "tip");

data['Path Loss Diff. > 0 dB (%)'].describe()

seaborn 0.10.0

Building structured multi-plot grids

When exploring medium-dimensional data, a useful approach is to draw
multiple instances of the same plot on different subsets of your dataset.
This technique is sometimes called either “lattice” or “trellis” plotting, and
it is related to the idea of “small multiples”. It allows a viewer to quickly
extract a large amount of information about complex data. Matplotlib offers
good support for making figures with multiple axes; seaborn builds on top of
this to directly link the structure of the plot to the structure of your dataset.

To use these features, your data has to be in a Pandas DataFrame and it
must take the form of what Hadley Wickham calls “tidy” data. In brief, that
means your dataframe should be structured such that each column is a
variable and each row is an observation.

For advanced use, you can use the objects discussed in this part of the
tutorial directly, which will provide maximum flexibility. Some seaborn
functions (such as lmplot(), catplot(), and pairplot()) also use them
behind the scenes. Unlike other seaborn functions that are “Axes-level” and
draw onto specific (possibly already-existing) matplotlib Axes without
otherwise manipulating the figure, these higher-level functions create a
figure when called and are generally more strict about how it gets set up. In
some cases, arguments either to those functions or to the constructor of
the class they rely on will provide a different interface to attributes like the
figure size, as in the case of lmplot() where you can set the height and
aspect ratio for each facet rather than the overall size of the figure. Any
function that uses one of these objects will always return it after plotting,
though, and most of these objects have convenience methods for changing
how the plot is drawn, often in a more abstract and easy way.
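
As a small illustration of the “tidy” layout described above, the sketch below reshapes a made-up wide table into long form with pandas. The column names (store, lunch, dinner) are hypothetical and are not taken from any dataset used in the tutorial.

```python
import pandas as pd

# Hypothetical wide table: one column per meal time (not tidy).
wide = pd.DataFrame({
    "store": ["north", "south"],
    "lunch": [12.5, 9.0],
    "dinner": [20.0, 17.5],
})

# Tidy form: each column is a variable, each row is one observation.
tidy = wide.melt(id_vars="store", var_name="time", value_name="total_bill")
print(tidy)
```

In the long form, "time" can be handed to a grid as a faceting variable, which is not possible while lunch and dinner live in separate columns.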

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")

Conditional small multiples

The FacetGrid class is useful when you want to visualize the distribution of
a variable or the relationship between multiple variables separately within
subsets of your dataset. A FacetGrid can be drawn with up to three
dimensions: row, col, and hue. The first two have obvious correspondence
with the resulting array of axes; think of the hue variable as a third
dimension along a depth axis, where different levels are plotted with
different colors.

The class is used by initializing a FacetGrid object with a dataframe and
the names of the variables that will form the row, column, or hue
dimensions of the grid. These variables should be categorical or discrete,
and then the data at each level of the variable will be used for a facet along
that axis. For example, say we wanted to examine differences between
lunch and dinner in the tips dataset.

Additionally, each of relplot(), catplot(), and lmplot() uses this object
internally, and they return the object when they are finished so that it can
be used for further tweaking.

tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")

Initializing the grid like this sets up the matplotlib figure and axes, but
doesn’t draw anything on them.
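
A quick way to check that the constructor only prepares empty axes is sketched below. It uses a tiny made-up frame rather than the tips dataset so that it runs without a download; the values are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Small made-up stand-in for the tips dataset.
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "tip": [1.5, 2.0, 3.0, 3.5],
})

g = sns.FacetGrid(toy, col="time")
print(g.axes.shape)                            # one row, one column per level of "time"
print([ax.has_data() for ax in g.axes.flat])   # nothing has been drawn yet
```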

The main approach for visualizing data on this grid is with
the FacetGrid.map() method. Provide it with a plotting function and the
name(s) of variable(s) in the dataframe to plot. Let’s look at the distribution
of tips in each of these subsets, using a histogram.

g = sns.FacetGrid(tips, col="time")
g.map(plt.hist, "tip");

This function will draw the figure and annotate the axes, hopefully
producing a finished plot in one step. To make a relational plot, just pass
multiple variable names. You can also provide keyword arguments, which
will be passed to the plotting function:

g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(plt.scatter, "total_bill", "tip", alpha=.7)
g.add_legend();

There are several options for controlling the look of the grid that can be
passed to the class constructor.

g = sns.FacetGrid(tips, row="smoker", col="time", margin_titles=True)
g.map(sns.regplot, "size", "total_bill", color=".3", fit_reg=False,
      x_jitter=.1);

Note that margin_titles isn’t formally supported by the matplotlib API, and
may not work well in all cases. In particular, it currently can’t be used with
a legend that lies outside of the plot.

The size of the figure is set by providing the height of each facet, along
with the aspect ratio:

g = sns.FacetGrid(tips, col="day", height=4, aspect=.5)
g.map(sns.barplot, "sex", "total_bill", order=["Male", "Female"]);

Passing order explicitly matters here: without it, FacetGrid.map() emits a
UserWarning that using the barplot function without specifying order is likely
to produce an incorrect plot.

The default ordering of the facets is derived from the information in the
DataFrame. If the variable used to define facets has a categorical type, then
the order of the categories is used. Otherwise, the facets will be in the
order of appearance of the category levels. It is possible, however, to
specify an ordering of any facet dimension with the
appropriate *_order parameter:

ordered_days = tips.day.value_counts().index
g = sns.FacetGrid(tips, row="day", row_order=ordered_days,
                  height=1.7, aspect=4,)
g.map(sns.distplot, "total_bill", hist=False, rug=True);

Any seaborn color palette (i.e., something that can be passed
to color_palette()) can be provided. You can also use a dictionary that
maps the names of values in the hue variable to valid matplotlib colors:

pal = dict(Lunch="seagreen", Dinner="gray")
g = sns.FacetGrid(tips, hue="time", palette=pal, height=5)
g.map(plt.scatter, "total_bill", "tip", s=50, alpha=.7, linewidth=.5,
      edgecolor="white")
g.add_legend();

You can also let other aspects of the plot vary across levels of the hue
variable, which can be helpful for making plots that will be more
comprehensible when printed in black-and-white. To do this, pass a
dictionary to hue_kws where keys are the names of plotting function
keyword arguments and values are lists of keyword values, one for each
level of the hue variable.

g = sns.FacetGrid(tips, hue="sex", palette="Set1", height=5,
                  hue_kws={"marker": ["^", "v"]})
g.map(plt.scatter, "total_bill", "tip", s=100, linewidth=.5,
      edgecolor="white")
g.add_legend();

If you have many levels of one variable, you can plot it along the columns
but “wrap” them so that they span multiple rows. When doing this, you
cannot use a row variable.
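
One visible consequence of wrapping, sketched here with a made-up frame of ten subjects (illustrative values, not the attention dataset), is that the grid's axes attribute becomes a flat array with one entry per facet instead of a two-dimensional row-by-column array.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Ten made-up subjects with two observations each.
toy = pd.DataFrame({
    "subject": [s for s in range(1, 11) for _ in range(2)],
    "score": list(range(20)),
})

g = sns.FacetGrid(toy, col="subject", col_wrap=4, height=1.5)
print(g.axes.shape)   # flat array with one axes per subject
```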

attend = sns.load_dataset("attention").query("subject <= 12")
g = sns.FacetGrid(attend, col="subject", col_wrap=4, height=2, ylim=(0, 10))
g.map(sns.pointplot, "solutions", "score", order=[1, 2, 3], color=".3",
      ci=None);

Once you’ve drawn a plot using FacetGrid.map() (which can be called
multiple times), you may want to adjust some aspects of the plot. There are
also a number of methods on the FacetGrid object for manipulating the
figure at a higher level of abstraction. The most general is FacetGrid.set(),
and there are other more specialized methods
like FacetGrid.set_axis_labels(), which respects the fact that interior
facets do not have axis labels. For example:

with sns.axes_style("white"):
    g = sns.FacetGrid(tips, row="sex", col="smoker", margin_titles=True,
                      height=2.5)
g.map(plt.scatter, "total_bill", "tip", color="#334488", edgecolor="white",
      lw=.5);
g.set_axis_labels("Total bill (US Dollars)", "Tip");
g.set(xticks=[10, 30, 50], yticks=[2, 6, 10]);
g.fig.subplots_adjust(wspace=.02, hspace=.02);

For even more customization, you can work directly with the underlying
matplotlib Figure and Axes objects, which are stored as member attributes
at fig and axes (a two-dimensional array), respectively. When making a
figure without row or column faceting, you can also use the ax attribute to
directly access the single axes.

g = sns.FacetGrid(tips, col="smoker", margin_titles=True, height=4)
g.map(plt.scatter, "total_bill", "tip", color="#338844", edgecolor="white",
      s=50, lw=1)
for ax in g.axes.flat:
    ax.plot((0, 50), (0, .2 * 50), c=".2", ls="--")
g.set(xlim=(0, 60), ylim=(0, 14));

Using custom functions

You’re not limited to existing matplotlib and seaborn functions when
using FacetGrid. However, to work properly, any function you use must
follow a few rules:

1. It must plot onto the “currently active” matplotlib Axes. This will be
true of functions in the matplotlib.pyplot namespace, and you can
call matplotlib.pyplot.gca() to get a reference to the current Axes if
you want to work directly with its methods.
2. It must accept the data that it plots in positional arguments.
Internally, FacetGrid will pass a Series of data for each of the named
positional arguments passed to FacetGrid.map().
3. It must be able to accept color and label keyword arguments, and,
ideally, it will do something useful with them. In most cases, it’s easiest
to catch a generic dictionary of **kwargs and pass it along to the
underlying plotting function.

Let’s look at a minimal example of a function you can plot with. This function
will just take a single vector of data for each facet:

from scipy import stats

def quantile_plot(x, **kwargs):
    qntls, xr = stats.probplot(x, fit=False)
    plt.scatter(xr, qntls, **kwargs)

g = sns.FacetGrid(tips, col="sex", height=4)
g.map(quantile_plot, "total_bill");

If you want to make a bivariate plot, you should write the function so that it
accepts the x-axis variable first and the y-axis variable second:

def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)

g = sns.FacetGrid(tips, col="smoker", height=4)
g.map(qqplot, "total_bill", "tip");

Because matplotlib.pyplot.scatter() accepts color and label keyword
arguments and does the right thing with them, we can add a hue facet
without any difficulty:

g = sns.FacetGrid(tips, hue="time", col="sex", height=4)
g.map(qqplot, "total_bill", "tip")
g.add_legend();

This approach also lets us use additional aesthetics to distinguish the
levels of the hue variable, along with keyword arguments that won’t be
dependent on the faceting variables:

g = sns.FacetGrid(tips, hue="time", col="sex", height=4,
                  hue_kws={"marker": ["s", "D"]})
g.map(qqplot, "total_bill", "tip", s=40, edgecolor="w")
g.add_legend();

Sometimes, though, you’ll want to map a function that doesn’t work the way
you expect with the color and label keyword arguments. In this case,
you’ll want to explicitly catch them and handle them in the logic of your
custom function. For example, this approach will allow us to
map matplotlib.pyplot.hexbin(), which otherwise does not play well with
the FacetGrid API:

def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=15, cmap=cmap, **kwargs)

with sns.axes_style("dark"):
    g = sns.FacetGrid(tips, hue="time", col="time", height=4)
g.map(hexbin, "total_bill", "tip", extent=[0, 50, 0, 10]);

Plotting pairwise data relationships

PairGrid also allows you to quickly draw a grid of small subplots using the
same plot type to visualize data in each. In a PairGrid, each row and
column is assigned to a different variable, so the resulting plot shows each
pairwise relationship in the dataset. This style of plot is sometimes called a
“scatterplot matrix”, as this is the most common way to show each
relationship, but PairGrid is not limited to scatterplots.

It’s important to understand the differences between a FacetGrid and
a PairGrid. In the former, each facet shows the same relationship
conditioned on different levels of other variables. In the latter, each plot
shows a different relationship (although the upper and lower triangles will
have mirrored plots). Using PairGrid can give you a very quick, very
high-level summary of interesting relationships in your dataset.

The basic usage of the class is very similar to FacetGrid. First you initialize
the grid, then you pass a plotting function to a map method and it will be
called on each subplot. There is also a companion function, pairplot(), that
trades off some flexibility for faster plotting.

iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter);

It’s possible to plot a different function on the diagonal to show the
univariate distribution of the variable in each column. Note that the axis
ticks won’t correspond to the count or density axis of this plot, though.

g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);

A very common way to use this plot colors the observations by a separate
categorical variable. For example, the iris dataset has four measurements
for each of three different species of iris flowers so you can see how they
differ.

g = sns.PairGrid(iris, hue="species")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend();

By default every numeric column in the dataset is used, but you can focus
on particular relationships if you want.

g = sns.PairGrid(iris, vars=["sepal_length", "sepal_width"], hue="species")
g.map(plt.scatter);

It’s also possible to use a different function in the upper and lower triangles
to emphasize different aspects of the relationship.

g = sns.PairGrid(iris)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3, legend=False);

The square grid with identity relationships on the diagonal is actually just a
special case, and you can plot with different variables in the rows and
columns.

g = sns.PairGrid(tips, y_vars=["tip"], x_vars=["total_bill", "size"],
                 height=4)
g.map(sns.regplot, color=".3")
g.set(ylim=(-1, 11), yticks=[0, 5, 10]);

Of course, the aesthetic attributes are configurable. For instance, you can
use a different palette (say, to show an ordering of the hue variable) and
pass keyword arguments into the plotting functions.

g = sns.PairGrid(tips, hue="size", palette="GnBu_d")
g.map(plt.scatter, s=50, edgecolor="white")
g.add_legend();

PairGrid is flexible, but to take a quick look at a dataset, it can be easier to
use pairplot(). This function uses scatterplots and histograms by default,
although a few other kinds will be added (currently, you can also plot
regression plots on the off-diagonals and KDEs on the diagonal).

sns.pairplot(iris, hue="species", height=2.5);


You can also control the aesthetics of the plot with keyword arguments,
and it returns the PairGrid instance for further tweaking.

g = sns.pairplot(iris, hue="species", palette="Set2", diag_kind="kde",
                 height=2.5)

© Copyright 2012-2020, Michael Waskom. Created using Sphinx 2.3.1.


https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0

A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib

Although there are tons of great visualization tools in Python, Matplotlib + Seaborn still
stands out for its capability to create and customize all sorts of plots.

Shiu-Tang Li
Mar 26, 2019

In this article, I will go through a few sections first to prepare background knowledge for some
readers who are new to Matplotlib:

1. Understand the two different Matplotlib interfaces (they have caused a lot of confusion!).

2. Understand the elements in a figure, so that you can easily look up the APIs to solve your
problem.

3. Take a glance at a few common types of plots so the readers have a better idea
about when / how to use them.

4. Learn how to increase the ‘dimension’ of your plots.

5. Learn how to partition the figure using GridSpec.

Then I’ll talk about the process of creating advanced visualizations with an example:

1. Set up a goal.

2. Prepare the variables.

3. Prepare the visualization.

Let’s start the journey.

Two different Matplotlib interfaces

There are two ways to code in Matplotlib. The first one is state-based:

import matplotlib.pyplot as plt
plt.figure()
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('Test figure')
plt.show()

This is good for creating simple plots (you call a bunch of plt.XXX functions to plot each
component in the graph), but you don’t have much control over the graph. The other one is
object-oriented:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
ax.bar(x=['A','B','C'], height=[3.1,7,4.2], color='r')
ax.set_xlabel(xlabel='X title', size=20)
ax.set_ylabel(ylabel='Y title' , color='b', size=20)
plt.show()

It will take more time to code but you’ll have full control of your figure. The idea is that you create
a ‘figure’ object, which you can think of as a bounding box of the whole visualization you’re
going to build, and one or more ‘axes’ objects, which are subplots of the visualization (don’t ask
me why these subplots are called ‘axes’. The name just sucks…). The subplots can be manipulated
through the methods of these ‘axes’ objects.
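
The two interfaces act on the same underlying objects, which the following minimal sketch tries to make visible. It assumes nothing beyond Matplotlib itself; the "Agg" backend is chosen only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Object-oriented: keep explicit handles to the figure and the axes.
fig, ax = plt.subplots(figsize=(3, 3))

# State-based: pyplot tracks a "current" figure and axes behind the scenes.
same_axes = plt.gca() is ax
same_figure = plt.gcf() is fig

plt.title("Test figure")            # state-based call...
title_via_object = ax.get_title()   # ...lands on the same axes object
print(same_axes, same_figure, title_via_object)
```

So plt.XXX calls are a shortcut that targets whichever axes is current, while the object-oriented style names the target explicitly.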

(For detailed explanations of these two interfaces, the reader may refer to
https://matplotlib.org/tutorials/introductory/lifecycle.html
or
https://pbpython.com/effective-matplotlib.html )

Let’s stick with the object-oriented approach in this tutorial.

Elements in a figure in object-oriented interface

The following figure, taken from https://pbpython.com/effective-matplotlib.html, explains the
components of a figure pretty well:

Let’s look at one simple example of how to create a line chart with the object-oriented interface.

fig, ax = plt.subplots(figsize=(3,3))
ax.plot(['Alice','Bob','Catherine'], [4,6,3], color='r')
ax.set_xlabel('TITLE 1')
for tick in ax.get_xticklabels():
    tick.set_rotation(45)
plt.show()

In the code above, we created an axes object, created a line plot on top of it, added an x-axis
label, and rotated all the x-tick labels by 45 degrees counterclockwise.

Check out the official API to see how to manipulate axes
objects: https://matplotlib.org/api/axes_api.html

A few common types of plots

After getting a rough idea about how Matplotlib works, it’s time to check out some commonly
seen plots. They are:

Scatter plots (x: Numerical #1, y: Numerical #2).
Line plots (x: Categorical - ordinal #1, y: Numerical #1) [Thanks to Michael Arons for pointing out
an issue in the previous figure].
Bar plots (x: Categorical #1, y: Numerical #1). Numerical #1 is often the count of Categorical #1.
Histogram (x: Numerical #1, y: Numerical #2). Numerical #1 is combined into groups (converted
to a categorical variable), and Numerical #2 is usually the count of this categorical variable.
Kernel density plot (x: Numerical #1, y: Numerical #2). Numerical #2 is the estimated density of
Numerical #1.
2-D kernel density plot (x: Numerical #1, y: Numerical #2, color: Numerical #3). Numerical #3 is
the joint density of Numerical #1 and Numerical #2.
Box plot (x: Categorical #1, y: Numerical #1, marks: Numerical #2). A box plot shows the statistics
of each value in Categorical #1 so we’ll get an idea of the distribution in the other variable.
y-value: the value for the other variable; marks: showing how these values are distributed (range,
Q1, median, Q3).
Violin plot (x: Categorical #1, y: Numerical #1, Width/Mark: Numerical #2). A violin plot is sort of
similar to a box plot but it shows the distribution better.
Heat map (x: Categorical #1, y: Categorical #2, Color: Numerical #1). Numerical #1 could be the
count for Categorical #1 and Categorical #2 jointly, or it could be other numerical attributes for
each value in the pair (Categorical #1, Categorical #2).
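
To make the histogram entry in the list above concrete, here is a small sketch of the binning step on made-up data: a numerical variable is grouped into bins and each bin is counted. The sample values and the choice of 10 bins are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=200)   # made-up "Numerical #1"

# Group the values into 10 equal-width bins and count each bin.
counts, edges = np.histogram(values, bins=10)
print(counts)   # "Numerical #2": how many values fell into each bin
print(edges)    # 11 bin boundaries for 10 bins
```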
To learn how to plot these figures, the readers can check out the seaborn APIs by googling for the
following list:

sns.barplot / sns.distplot / sns.lineplot / sns.kdeplot / sns.violinplot
sns.scatterplot / sns.boxplot / sns.heatmap

I’ll give two code examples showing how 2D kde plots and heat maps are generated in the
object-oriented interface.
# 2D kde plots
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1)
numerical_1 = np.random.randn(100)
np.random.seed(2)
numerical_2 = np.random.randn(100)

fig, ax = plt.subplots(figsize=(3,3))
# data/data2, shade and bw are the seaborn 0.9/0.10 argument names;
# seaborn 0.11+ uses x/y, fill and bw_method/bw_adjust instead.
sns.kdeplot(data=numerical_1,
            data2=numerical_2,
            ax=ax,
            shade=True,
            color="blue",
            bw=1)
plt.show()

The key is the argument ax=ax. When sns.kdeplot() runs, seaborn draws onto ax, an ‘axes’
object.
# heat map
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(dict(categorical_1=['apple', 'banana', 'grapes',
                                      'apple', 'banana', 'grapes',
                                      'apple', 'banana', 'grapes'],
                       categorical_2=['A','A','A','B','B','B','C','C','C'],
                       value=[10,2,5,7,3,15,1,6,8]))
# Keyword arguments keep pivot() working on pandas versions that no longer accept positional ones.
pivot_table = df.pivot(index="categorical_1", columns="categorical_2", values="value")
# try printing out pivot_table to see what it looks like!

fig, ax = plt.subplots(figsize=(5,5))
sns.heatmap(data=pivot_table,
            cmap=sns.color_palette("Blues"),
            ax=ax)

plt.show()

Increase the dimension of your plots

For these basic plots, only a limited amount of information can be displayed (2–3 variables). What
if we’d like to show more information in these plots? Here are a few ways.

1. Overlay plots
If several line charts share the same x and y variables, you can call Seaborn plots multiple
times and plot all of them on the same figure. In the example below, we add one more
categorical variable [value = alpha, beta] to the plot by overlaying plots.
fig, ax = plt.subplots(figsize=(4,4))
sns.lineplot(x=['A','B','C','D'],
y=[4,2,5,3],
color='r',
ax=ax)
sns.lineplot(x=['A','B','C','D'],
y=[1,6,2,4],
color='b',
ax=ax)
ax.legend(['alpha', 'beta'], facecolor='w')
plt.show()

Or, we can combine a bar chart and a line chart with the same x-axis but different y-axis:
sns.set(style="white", rc={"lines.linewidth": 3})

fig, ax1 = plt.subplots(figsize=(4,4))
ax2 = ax1.twinx()

sns.barplot(x=['A','B','C','D'],
            y=[100,200,135,98],
            color='#004488',
            ax=ax1)
sns.lineplot(x=['A','B','C','D'],
             y=[4,2,5,3],
             color='r',
             marker="o",
             ax=ax2)
plt.show()
sns.set()

A few comments here. Because the two plots have different y-axes, we need to create another
‘axes’ object with the same x-axis (using .twinx()) and then plot on the different ‘axes’. sns.set(…)
sets specific aesthetics for the current plot, and we run sns.set() at the end to set everything back
to the default settings.

Combining different barplots into one grouped barplot also adds one categorical dimension to
the plot (one more categorical variable).

import numpy as np
import matplotlib.pyplot as plt

categorical_1 = ['A', 'B', 'C', 'D']
colors = ['green', 'red', 'blue', 'orange']
numerical = [[6, 9, 2, 7],
             [6, 7, 3, 8],
             [9, 11, 13, 15],
             [3, 5, 9, 6]]
number_groups = len(categorical_1)
bin_width = 1.0/(number_groups+1)

fig, ax = plt.subplots(figsize=(6,6))
for i in range(number_groups):
    ax.bar(x=np.arange(len(categorical_1)) + i*bin_width,
           height=numerical[i],
           width=bin_width,
           color=colors[i],
           align='center')
# number_groups/(2*(number_groups+1)): offset of xticklabel
ax.set_xticks(np.arange(len(categorical_1)) + number_groups/(2*(number_groups+1)))
ax.set_xticklabels(categorical_1)
ax.legend(categorical_1, facecolor='w')
plt.show()

In the code example above, you can customize variable names, colors, and figure size.
number_groups and bin_width are calculated based on the input data. I then wrote a for-loop to
plot the bars, one color at a time, and set the ticks and legends in the very end.
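
The tick offset in the grouped bar example can be checked with a little arithmetic. The sketch below recomputes the bar centers for the same four groups and compares the geometric midpoint of each group with the offset used in the article; whether the small difference is intentional is not stated in the source, so treat this only as a way to reason about the positions.

```python
import numpy as np

number_groups = 4
bin_width = 1.0 / (number_groups + 1)   # 0.2, as in the example above
base = np.arange(number_groups)         # first bar of each group sits at 0, 1, 2, 3

# Center of the i-th bar within every group (bars are drawn with align='center').
centers = np.array([base + i * bin_width for i in range(number_groups)])

# Midpoint of each group = mean of its bar centers.
group_mid = centers.mean(axis=0)
offset_mid = (number_groups - 1) * bin_width / 2          # geometric midpoint offset
offset_used = number_groups / (2 * (number_groups + 1))   # offset used in the article
print(group_mid - base, offset_mid, offset_used)
```

With four groups the midpoint offset works out to 0.3 while the article's formula gives 0.4, so the tick labels sit slightly to the right of the center of each group.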

2. Facet: mapping the dataset onto multiple axes that differ by one or two categorical
variable(s). The reader can find a bunch of examples
at https://seaborn.pydata.org/generated/seaborn.FacetGrid.html

3. Color / Shape / Size of nodes in a scatter plot: The following code example taken from Seaborn
Scatter Plot API shows how it works.
(https://seaborn.pydata.org/generated/seaborn.scatterplot.html)
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip",
                     hue="size", size="size",
                     sizes=(20, 200), hue_norm=(0, 7),
                     legend="full", data=tips)
plt.show()

Partition the figure using GridSpec

One of the advantages of the object-oriented interface is that we can easily partition our figure into
several subplots and manipulate each subplot with the ‘axes’ API.
import matplotlib.pyplot as plt
from matplotlib import gridspec

fig = plt.figure(figsize=(7,7))
gs = gridspec.GridSpec(nrows=3,
                       ncols=3,
                       figure=fig,
                       width_ratios= [1, 1, 1],
                       height_ratios=[1, 1, 1],
                       wspace=0.3,
                       hspace=0.3)

ax1 = fig.add_subplot(gs[0, 0])
ax1.text(0.5, 0.5, 'ax1: gs[0, 0]', fontsize=12, fontweight="bold",
         va="center", ha="center")  # adding text to ax1

ax2 = fig.add_subplot(gs[0, 1:3])
ax2.text(0.5, 0.5, 'ax2: gs[0, 1:3]', fontsize=12, fontweight="bold",
         va="center", ha="center")

ax3 = fig.add_subplot(gs[1:3, 0:2])
ax3.text(0.5, 0.5, 'ax3: gs[1:3, 0:2]', fontsize=12, fontweight="bold",
         va="center", ha="center")

ax4 = fig.add_subplot(gs[1:3, 2])
ax4.text(0.5, 0.5, 'ax4: gs[1:3, 2]', fontsize=12, fontweight="bold",
         va="center", ha="center")

plt.show()

In the example, we first partition the figure into 3*3 = 9 small boxes with gridspec.GridSpec(),
and then define a few axes objects. Each axes object can contain one or more boxes. For instance,
in the code above, gs[0, 1:3] = gs[0, 1] + gs[0, 2] is assigned to the axes object ax2. wspace and
hspace are parameters controlling the space between plots.
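
Which boxes an axes object covers can also be read back from the axes itself. The following sketch assumes a Matplotlib version that exposes rowspan and colspan on SubplotSpec (3.2 or later) and inspects two of the slices used above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib import gridspec

fig = plt.figure(figsize=(4, 4))
gs = gridspec.GridSpec(nrows=3, ncols=3, figure=fig)

ax2 = fig.add_subplot(gs[0, 1:3])     # top row, last two columns
ax3 = fig.add_subplot(gs[1:3, 0:2])   # bottom-left 2x2 block

spec2 = ax2.get_subplotspec()
spec3 = ax3.get_subplotspec()
print(list(spec2.rowspan), list(spec2.colspan))   # rows and columns covered by ax2
print(list(spec3.rowspan), list(spec3.colspan))   # rows and columns covered by ax3
```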

Create advanced visualizations

With the tutorials from the previous sections behind us, it’s time to produce some cool stuff. Let’s
download the Analytics Vidhya Black Friday Sales data from
https://www.kaggle.com/mehdidag/black-friday and do some easy data preprocessing:
df = pd.read_csv('BlackFriday.csv', usecols = ['User_ID', 'Gender', 'Age', 'Purchase'])

# average purchase per user
df_gp_1 = df[['User_ID', 'Purchase']].groupby('User_ID').agg('mean').reset_index()
# one gender / age group per user
df_gp_2 = df[['User_ID', 'Gender', 'Age']].groupby('User_ID').agg('max').reset_index()
df_gp = pd.merge(df_gp_1, df_gp_2, on = ['User_ID'])

You’ll then get a table of user ID, gender, age, and the average price of items in each customer’s
purchase.
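
To see what this preprocessing produces without downloading the file, the same steps can be run on a few made-up rows. The values below are invented for illustration; only the four column names come from the article.

```python
import pandas as pd

# Made-up rows with the same four columns as the Black Friday file.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 2],
    "Gender": ["F", "F", "M", "M", "M"],
    "Age": ["26-35", "26-35", "18-25", "18-25", "18-25"],
    "Purchase": [100, 300, 50, 150, 250],
})

# Average purchase per user.
df_gp_1 = df[["User_ID", "Purchase"]].groupby("User_ID").agg("mean").reset_index()
# One Gender/Age value per user (they are constant within a user here).
df_gp_2 = df[["User_ID", "Gender", "Age"]].groupby("User_ID").agg("max").reset_index()
df_gp = pd.merge(df_gp_1, df_gp_2, on=["User_ID"])
print(df_gp)
```

The result has one row per user, which is the shape the later kde and bar plots rely on.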

Step 1. Goal
We’re curious about how age and gender would affect the average purchased item price during
Black Friday, and we hope to see the price distribution as well. We also want to know the
percentages for each age group.

Step 2. Variables
We’d like to include age group (categorical), gender (categorical), average item price (numerical),
and the distribution of average item price (numerical) in the plot. We need to include another
plot with the percentage for each age group (age group + count/frequency).
To show average item price + its distributions, we can go with kernel density plot, box plot, or
violin plot. Among these, kde shows the distribution the best. We then plot two or more kde plots
in the same figure and then do facet plots, so age group and gender info can be both included. For
the other plot, a bar plot can do the job well.

Step 3. Visualization
Once we have a plan about the variables, we could then think about how to visualize it. We need
to do figure partitions first, hide some boundaries, xticks, and yticks, and then add a bar chart to
the right.

The plot below is what we’re going to create. From the figure, we can clearly see that men tend to
purchase more expensive items than women do based on the data, and older people tend to
purchase more expensive items (the trend is clearer for the top 4 age groups). We also found that
people aged 18–45 are the major buyers in Black Friday sales.
The code below generates the plot (explanations are included in the comments):

from matplotlib import gridspec

# freq = the percentage for each age group, and there are 7 age groups.
# sort_index() orders the age groups by label on any pandas version.
freq = (df_gp.Age.value_counts(normalize=True).sort_index() * 100).tolist()
number_gp = 7

def ax_settings(ax, var_name, x_min, x_max):
    ax.set_xlim(x_min, x_max)
    ax.set_yticks([])

    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

    ax.spines['bottom'].set_edgecolor('#444444')
    ax.spines['bottom'].set_linewidth(2)

    ax.text(0.02, 0.05, var_name, fontsize=17, fontweight="bold",
            transform=ax.transAxes)
    return None

# Manipulate each axes object in the left. Try to tune some parameters and
# you'll know how each command works.
fig = plt.figure(figsize=(12,7))
gs = gridspec.GridSpec(nrows=number_gp,
                       ncols=2,
                       figure=fig,
                       width_ratios= [3, 1],
                       height_ratios= [1]*number_gp,
                       wspace=0.2, hspace=0.05)
ax = [None]*(number_gp + 1)
features = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']

# Create a figure, partition the figure into 7*2 boxes, set up an ax array
# to store axes objects, and create a list of age group names.
for i in range(number_gp):
    ax[i] = fig.add_subplot(gs[i, 0])

    ax_settings(ax[i], 'Age: ' + str(features[i]), -1000, 20000)

    # shade= and bw= are the seaborn 0.9/0.10 argument names
    # (seaborn 0.11+ uses fill= and bw_method=/bw_adjust=).
    sns.kdeplot(data=df_gp[(df_gp.Gender == 'M') & (df_gp.Age == features[i])].Purchase,
                ax=ax[i], shade=True, color="blue", bw=300, legend=False)
    sns.kdeplot(data=df_gp[(df_gp.Gender == 'F') & (df_gp.Age == features[i])].Purchase,
                ax=ax[i], shade=True, color="red", bw=300, legend=False)

    if i < (number_gp - 1):
        ax[i].set_xticks([])
# this 'for loop' is to create a bunch of axes objects, and link them to
# GridSpec boxes. Then, we manipulate them with sns.kdeplot() and the
# ax_settings() we just defined.

ax[0].legend(['Male', 'Female'], facecolor='w')
# adding legends on the top axes object

ax[number_gp] = fig.add_subplot(gs[:, 1])
ax[number_gp].spines['right'].set_visible(False)
ax[number_gp].spines['top'].set_visible(False)
ax[number_gp].barh(features, freq, color='#004c99', height=0.4)
ax[number_gp].set_xlim(0,100)

Data Visualization with Seaborn (Part #2)
Jovian Lin

Part #2 of a 3-Part Series

Continuing from Part 1 of my  seaborn  series, we'll proceed to cover 2D plots.

This notebook is a reorganization of the many ideas shared in this Github repo and this blog post. What
you see here is a modified version that works for me that I hope will work for you as well. Also, enjoy the
cat GIFs.

2D: Visualizing Data in Two Dimensions

One of the best ways to check out potential relationships or correlations amongst the different data
attributes is to leverage a pair-wise correlation matrix and depict it as a heatmap.
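Each cell of such a matrix holds the Pearson correlation coefficient between two attributes. The sketch below computes it by hand to show what the number means; the `alcohol` and `density` values are made up for illustration and are not taken from the wine dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values, for illustration only.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
density = [0.9978, 0.9968, 0.9956, 0.9940, 0.9925]

print(round(pearson(alcohol, alcohol), 6))  # an attribute against itself
print(round(pearson(alcohol, density), 6))  # negative: density falls as alcohol rises
```

The diagonal of a correlation matrix is always 1 for this reason, and the matrix is symmetric because swapping the two arguments does not change the result.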

2D: Heatmap on Correlation Matrix

# Compute pairwise correlation of Dataframe's attributes


corr = wines.corr()
corr
                                                                                       free      total
                           fixed   volatile     citric   residual                sulfur     sulfur
                         acidity    acidity       acid      sugar  chlorides    dioxide    dioxide    density         pH  sulphates    alcohol
fixed acidity           1.000000   0.219008   0.324436  -0.111981   0.298195  -0.282735  -0.329054   0.458910  -0.252700   0.299568  -0.095452
volatile acidity        0.219008   1.000000  -0.377981  -0.196011   0.377124  -0.352557  -0.414476   0.271296   0.261454   0.225984  -0.037640
citric acid             0.324436  -0.377981   1.000000   0.142451   0.038998   0.133126   0.195242   0.096154  -0.329808   0.056197  -0.010493
residual sugar         -0.111981  -0.196011   0.142451   1.000000  -0.128940   0.402871   0.495482   0.552517  -0.267320  -0.185927  -0.359415
chlorides               0.298195   0.377124   0.038998  -0.128940   1.000000  -0.195045  -0.279630   0.362615   0.044708   0.395593  -0.256916
free sulfur dioxide    -0.282735  -0.352557   0.133126   0.402871  -0.195045   1.000000   0.720934   0.025717  -0.145854  -0.188457  -0.179838
total sulfur dioxide   -0.329054  -0.414476   0.195242   0.495482  -0.279630   0.720934   1.000000   0.032395  -0.238413  -0.275727  -0.265740
density                 0.458910   0.271296   0.096154   0.552517   0.362615   0.025717   0.032395   1.000000   0.011686   0.259478  -0.686745
pH                     -0.252700   0.261454  -0.329808  -0.267320   0.044708  -0.145854  -0.238413   0.011686   1.000000   0.192123   0.121248
sulphates               0.299568   0.225984   0.056197  -0.185927   0.395593  -0.188457  -0.275727   0.259478   0.192123   1.000000  -0.003029
alcohol                -0.095452  -0.037640  -0.010493  -0.359415  -0.256916  -0.179838  -0.265740  -0.686745   0.121248  -0.003029   1.000000
quality                -0.076743  -0.265699   0.085532  -0.036980  -0.200666   0.055463  -0.041385  -0.305858   0.019506   0.038485   0.444319

fig, ax = plt.subplots(1, 1, figsize=(10, 6))

hm = sns.heatmap(corr,
                 ax=ax,            # Axes in which to draw the plot; otherwise the currently-active Axes is used.
                 cmap="coolwarm",  # Color map.
                 # square=True,    # If True, set the Axes aspect to "equal" so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',        # String formatting code to use when adding annotations.
                 # annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('Wine Attributes Correlation Heatmap',
             fontsize=14,
             fontweight='bold')

• The gradients in the heatmap vary with the strength of the correlation.
• Attributes that are strongly correlated with each other are easy to spot.

2D: Pair-Wise Scatter Plots


Another way to visualize the same relationships is to use pair-wise scatter plots amongst attributes
of interest.

Note: The diagonal Axes are treated differently: each one shows the univariate distribution of the
data for the variable in that column.

# Attributes of interest
cols = ['density',
'residual sugar',
'total sulfur dioxide',
'free sulfur dioxide',
'fixed acidity']
pp = sns.pairplot(wines[cols],
                  height=1.8, aspect=1.2,     # `height` replaced the older `size` argument in seaborn 0.9
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kws=dict(shade=True),  # "diag" adjusts/tunes the diagonal plots
                  diag_kind="kde")            # use "kde" for diagonal plots

fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots',
             fontsize=14, fontweight='bold')
Bonus: You can also fit linear regression models to the scatter plots. See [😀].

pp = sns.pairplot(wines[cols],
                  diag_kws=dict(shade=True),  # "diag" adjusts/tunes the diagonal plots
                  diag_kind="kde",            # use "kde" for diagonal plots
                  kind="reg")                 # <== 😀 fit linear regressions to the scatter plots

fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14,
             fontweight='bold')
Based on the above plot, you can see that scatter plots are also a decent way of observing potential
relationships or patterns in two-dimensions for data attributes.

2D: Parallel Coordinates


Another way of visualizing multivariate data for multiple attributes together (or concurrently) is to
use parallel coordinates.

More about parallel coordinates: https://github.com/matloff/parcoordtutorial

Important: Before we proceed to run  parallel_coordinates() , we'll need to scale our data first,
as different attributes are measured on different scales.

We'll be using  StandardScaler  in  sklearn.preprocessing  to do the job.

Note: I have another blog post on Feature Scaling (should you be interested to know more).
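To make the scaling step concrete, here is a minimal sketch of what standardization does to a single column: subtract the mean and divide by the population standard deviation, which is the convention StandardScaler follows. The column values below are made up for illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns measured on very different scales.
residual_sugar = [1.9, 2.6, 20.7, 1.6, 8.5]
total_sulfur_dioxide = [34.0, 67.0, 170.0, 132.0, 97.0]

scaled_sugar = standardize(residual_sugar)
scaled_so2 = standardize(total_sulfur_dioxide)

print([round(v, 3) for v in scaled_sugar])
print([round(v, 3) for v in scaled_so2])
```

After scaling, both columns are centred on 0 with the same spread, so no single attribute dominates the vertical range of the parallel coordinates plot.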

# Attributes of interest
cols = ['density',
'residual sugar',
'total sulfur dioxide',
'free sulfur dioxide',
'fixed acidity']

subset_df = wines[cols]

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled_df = ss.fit_transform(subset_df)
scaled_df = pd.DataFrame(scaled_df, columns=cols)
final_df = pd.concat([scaled_df, wines['wine_type']], axis=1)
final_df.head()

    density  residual sugar  total sulfur dioxide  free sulfur dioxide  fixed acidity  wine_type
0 -0.165631        1.546371              0.181456            -0.367664      -0.166089      white
1  0.301278       -0.681719              0.305311             0.083090       0.373895        red
2 -0.859324        0.411306              0.305311             0.421155      -0.320370      white
3  0.408001        1.210056              1.189993             1.717074      -0.706073      white
4  1.395180        1.777588              2.003900             1.829762       0.142473      white

from pandas.plotting import parallel_coordinates

fig = plt.figure(figsize=(12, 10))


title = fig.suptitle("Parallel Coordinates", fontsize=18)
fig.subplots_adjust(top=0.93, wspace=0)

pc = parallel_coordinates(final_df,
'wine_type',
color=('skyblue', 'firebrick'))
In this visualization, each data point is drawn as a set of connected line segments.

• Each vertical line represents one data attribute (e.g. residual sugar).
• One complete set of connected line segments across all the attributes represents one data
point.
• Hence points that tend to cluster will appear closer together.

Just by looking at the plot, we can see that density is slightly higher for red wines than for
white, since more red lines are clustered above the white ones.
Also, residual sugar and total sulfur dioxide are higher for white wines than for red,
while fixed acidity is higher for red wines than for white.

Note: If you don't perform scaling beforehand, this is what you'll get:

fig = plt.figure(figsize=(12, 10))
title = fig.suptitle("Parallel Coordinates", fontsize=18)
fig.subplots_adjust(top=0.93, wspace=0)

new_cols = ['density', 'residual sugar', 'total sulfur dioxide',
            'free sulfur dioxide', 'fixed acidity', 'wine_type']
pc = parallel_coordinates(wines[new_cols],
                          'wine_type',
                          color=('skyblue', 'firebrick'))
2D: Two Continuous Numeric Attributes [📈]
[💔] The traditional way — using  matplotlib :

plt.scatter(wines['sulphates'],
wines['alcohol'],
alpha=0.4, edgecolors='w')

plt.xlabel('Sulphates')
plt.ylabel('Alcohol')
plt.title('Wine Sulphates - Alcohol Content', y=1.05)

[💚] The better alternative — using Seaborn's  jointplot() :

jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='reg',  # <== 😀 Add regression and kernel density fits
                   space=0, height=6, ratio=4)  # `height` replaced the older `size` argument
😀 Replace the scatterplot with a joint histogram using hexagonal bins:

jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='hex',  # <== 😀 Replace the scatterplot with a joint histogram using hexagonal bins
                   space=0, height=6, ratio=4)  # `height` replaced the older `size` argument
😀 KDE:

jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='kde',  # <== 😀 KDE
                   space=0, height=6, ratio=4)  # `height` replaced the older `size` argument
2D: Two Discrete Categorical Attributes [📊]
Now that we've covered two continuous numeric attributes, how about visualizing two discrete,
categorical attributes?

One way is to leverage separate subplots or facets for one of the categorical dimensions.

[💔] The traditional way — using  matplotlib :

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Wine Type - Quality", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency")
rw_q = red_wine['quality'].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0,2500])
ax1.tick_params(axis='both', which='major', labelsize=8.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color='red',
edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality")
ax2.set_ylabel("Frequency")
ww_q = white_wine['quality'].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0,2500])
ax2.tick_params(axis='both', which='major', labelsize=8.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color='white',
edgecolor='black', linewidth=1)

While the above is a good way to visualize categorical data, as you can see,
leveraging  matplotlib  has resulted in writing a lot of code 😤.

[💚] The better alternative — using Seaborn's  countplot() :

In addition, another good way is to use stacked bars or multiple bars for the different attributes in a
single plot. We can leverage  seaborn  for the same easily.
fig = plt.figure(figsize=(10, 7))

cp = sns.countplot(data=wines,
x="quality",
hue="wine_type",
palette={"red": "#FF9999", "white": "#FFE888"})

This definitely looks cleaner and you can also effectively compare the different categories easily from
this single plot.
2D: Mixed Attributes [📈+📊]
Let’s look at visualizing mixed attributes in 2-D (essentially numeric and categorical together).

One way is to use faceting/subplots along with generic histograms or density plots.

[💔] Again, let's first look at the traditional way — using  matplotlib  (histograms):

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 1200])
ax1.text(1.2, 800, r'$\mu$='+str(round(red_wine['sulphates'].mean(),2)),
fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(red_wine['sulphates'], color='red',
bins=15,
edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 1200])
ax2.text(0.8, 800, r'$\mu$='+str(round(white_wine['sulphates'].mean(),2)),
fontsize=12)
w_freq, w_bins, w_patches = ax2.hist(white_wine['sulphates'],
color='white', bins=15,
edgecolor='black', linewidth=1)

[💔] Using  matplotlib  (density plots):

fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density")
sns.kdeplot(red_wine['sulphates'], ax=ax1, shade=True, color='r')

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density")
sns.kdeplot(white_wine['sulphates'], ax=ax2, shade=True, color='y')
While this is good, once again we have a lot of boilerplate code which we can avoid by
leveraging  seaborn  and even depict the plots in one single chart.

[💚] The better alternative — using Seaborn's  FacetGrid() :

The FacetGrid is an object that links a Pandas DataFrame to a  matplotlib  figure with a particular
structure.

In particular, FacetGrid is used to draw plots with multiple Axes where each Axes shows the same
relationship conditioned on different levels of some variable. It's possible to condition on up to three
variables by assigning variables to the rows and columns of the grid and using different colors for the
plot elements.

The basic workflow is to initialize the FacetGrid object with the dataset and the variables that are used to
structure the grid. Then one or more plotting functions can be applied to each subset by
calling  FacetGrid.map()  or  FacetGrid.map_dataframe() .

Finally, the plot can be tweaked with other methods to do things like change the axis labels, use different
ticks, or add a legend. See the detailed code examples here for more information.
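Conceptually, the workflow described above amounts to splitting the rows by the levels of the facet variable and applying the same function to each subset. The sketch below mimics that idea without any plotting. The `facet` and `summarize` helpers and the sample rows are hypothetical and are not part of seaborn.

```python
from collections import defaultdict

# Hypothetical rows standing in for a DataFrame.
rows = [
    {"wine_type": "red", "sulphates": 0.56},
    {"wine_type": "red", "sulphates": 0.68},
    {"wine_type": "red", "sulphates": 0.65},
    {"wine_type": "white", "sulphates": 0.45},
    {"wine_type": "white", "sulphates": 0.49},
    {"wine_type": "white", "sulphates": 0.44},
]

def facet(rows, by):
    """Split rows into one subset per level of the `by` variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return dict(groups)

def summarize(subset, column):
    """Stand-in for a plotting function: it is applied once per subset."""
    values = [row[column] for row in subset]
    return {"n": len(values), "mean": sum(values) / len(values)}

panels = {level: summarize(subset, "sulphates")
          for level, subset in facet(rows, "wine_type").items()}
print(panels)
```

In a real FacetGrid, the function passed to map() plays the role of `summarize` and draws into the Axes (or hue level) that corresponds to each subset.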

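# Create the figure and the single Axes that both distributions will share.
fig = plt.figure(figsize=(10,8))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.93, wspace=0.3)

ax = fig.add_subplot(1,1,1)
ax.set_xlabel("Sulphates")
ax.set_ylabel("Frequency")

# FacetGrid opens a second figure of its own; hue splits the data by wine type.
g = sns.FacetGrid(data=wines,
                  hue='wine_type',
                  palette={"red": "r", "white": "y"})

# Passing ax=ax draws every hue level onto the Axes created above.
# Note: distplot() is deprecated as of seaborn 0.11.
g.map(sns.distplot, 'sulphates',
      kde=True, bins=15, ax=ax)

ax.legend(title='Wine Type')
plt.close(2)  # close the empty figure that FacetGrid created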

The plot generated above is clear and concise, and it lets us compare the two distributions
easily.
2D: Box [📦] and Violin [🎻] Plots
[📦] Box plots are another way of effectively depicting groups of numeric data based on the different
values in the categorical attribute.

Additionally, box plots are a good way to know the quartile values in the data and
also potential outliers.

f, (ax) = plt.subplots(1, 1, figsize=(12, 4))


f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(data=wines,
x="quality",
y="alcohol",
ax=ax)

ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)
[🎻] Another similar visualization is violin plots, which is also an effective way to visualize grouped
numeric data using kernel density plots — depicting the probability density of the data at different
values.
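The kernel density estimate behind each violin can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The data values and the bandwidth of 0.05 below are arbitrary choices for illustration, and seaborn picks its own bandwidth by default.

```python
from math import exp, pi, sqrt

def gaussian_kde(x, data, bandwidth):
    """Density estimate at x: the average of Gaussian bumps centred on the data."""
    norm = len(data) * bandwidth * sqrt(2 * pi)
    return sum(exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

# Hypothetical sulphates values for one quality level.
sulphates = [0.45, 0.50, 0.52, 0.60, 0.85]

step = 0.01
grid = [i * step for i in range(0, 131)]  # 0.00 .. 1.30
density = [gaussian_kde(x, sulphates, bandwidth=0.05) for x in grid]

peak = grid[density.index(max(density))]
print(f"density peaks near x = {peak:.2f}")
print(f"area under the curve = {sum(density) * step:.3f}")
```

The width of a violin at a given height is proportional to this density, so the violin is widest where observations are most concentrated.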

f, (ax) = plt.subplots(1, 1, figsize=(12, 4))


f.suptitle('Wine Quality - Sulphates Content', fontsize=14)

sns.violinplot(data=wines,
x="quality",
y="sulphates",
ax=ax)

ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size=12,alpha=0.8)
The violins above show the density of sulphates within each wine quality category.

🍷 That's it for Part #2 🎻



https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850

Plotting with Categorical data

In the relational plot tutorial we saw how to use different visual representations to show the
relationship between multiple variables in a dataset. In the examples, we focused on cases where
the main relationship was between two numerical variables. If one of the main variables is
“categorical” (divided into discrete groups) it may be helpful to use a more specialized approach
to visualization.

In seaborn, there are several different ways to visualize a relationship involving categorical data.
Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two
ways to make these plots. There are a number of axes-level functions for plotting categorical data
in different ways and a figure-level interface, catplot(), that gives unified higher-level access to
them.

It’s helpful to think of the different categorical plot kinds as belonging to three different families,
which we’ll discuss in detail below. They are:

Categorical scatterplots:
• stripplot() (with kind="strip"; the default)
• swarmplot() (with kind="swarm")

Categorical distribution plots:
• boxplot() (with kind="box")
• violinplot() (with kind="violin")
• boxenplot() (with kind="boxen")

Categorical estimate plots:
• pointplot() (with kind="point")
• barplot() (with kind="bar")
• countplot() (with kind="count")

These families represent the data using different levels of granularity.

The default representation of the data in catplot() uses a scatterplot. There are actually two
different categorical scatter plots in seaborn. They take different approaches to resolving the
main challenge in representing categorical data with a scatter plot, which is that all of the points
belonging to one category would fall on the same position along the axis corresponding to the
categorical variable. The approach used by stripplot(), which is the default “kind” in catplot() is to
adjust the positions of points on the categorical axis with a small amount of random “jitter”:

In [16]:
sns.catplot(x="age",y="marital_status",data=census_data)

Out[16]:
<seaborn.axisgrid.FacetGrid at 0xdb18470>
Figure 17
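The jitter idea can be sketched in a few lines: every point gets the integer position of its category plus a small random offset. This is a simplified illustration and not seaborn's implementation; the `jitter_positions` helper, the 0.1 width, and the sample labels are all assumptions.

```python
import random

def jitter_positions(categories, width=0.1, seed=0):
    """Return one position per observation: category index plus random jitter."""
    rng = random.Random(seed)
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    return [index[c] + rng.uniform(-width, width) for c in categories]

# Hypothetical categorical observations.
days = ["Thur", "Fri", "Sat", "Sat", "Sun", "Fri", "Sat", "Thur"]
positions = jitter_positions(days)

for day, pos in zip(days, positions):
    print(f"{day:>4} -> {pos:.3f}")
```

Without the offset, all points of one category would share the same coordinate and overlap on a single line.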

The second approach adjusts the points along the categorical axis using an algorithm that
prevents them from overlapping. It can give a better representation of the distribution of
observations, although it only works well for relatively small datasets. This kind of plot is
sometimes called a “beeswarm” and is drawn in seaborn by swarmplot(), which is activated by
setting kind=”swarm” in catplot():

In [27]:
#sns.catplot(x="age",y="relationship",kind='swarm',data=census_data)
# or
#sns.swarmplot(x="relationship",y="age",data=census_data)
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);

Out[27]:
Figure 18

Similar to the relational plots, it’s possible to add another dimension to a categorical plot by using
a hue semantic. (The categorical plots do not currently support size or style semantics). Each
different categorical plotting function handles the hue semantic differently. For the scatter plots,
it is only necessary to change the color of the points:
In [29]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips);

Out[29]:
Figure 19

Box plot

The first is the familiar boxplot(). This kind of plot shows the three quartile values of the
distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of
the lower and upper quartile, and then observations that fall outside this range are displayed
independently. This means that each value in the boxplot corresponds to an actual observation in
the data.
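The numbers a box plot draws can be worked out by hand. The sketch below computes the quartiles, the whisker ends under the 1.5 IQR rule, and the points shown individually. The `box_stats` helper and the sample values are illustrative, and the "inclusive" quartile interpolation is an assumption; plotting libraries may interpolate slightly differently.

```python
from statistics import quantiles

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers under the whis * IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical observations with one extreme value.
sample = [3, 5, 7, 8, 9, 11, 13, 15, 40]
stats = box_stats(sample)
print(stats)
```

Here the fences are at -2 and 22, so the whiskers stop at the observations 3 and 15, and 40 is drawn as a separate point.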

In [32]:
sns.catplot(x="age",y="marital_status",kind='box',data=census_data)

Out[32]:
<seaborn.axisgrid.FacetGrid at 0xd411860>
Figure 20

When adding a hue semantic, the box for each level of the semantic variable is moved along the
categorical axis so they don’t overlap:

In [37]:
sns.catplot(x="age", y="marital_status", kind='box', hue='gender', data=census_data)

Out[37]:
<seaborn.axisgrid.FacetGrid at 0xde8a8d0>
Figure 21

Violin plots

A different approach is a violinplot(), which combines a boxplot with the kernel density
estimation procedure described in the distributions tutorial:
In [38]:
sns.catplot(x="age",y="marital_status",kind='violin',data=census_data)

Out[38]:
<seaborn.axisgrid.FacetGrid at 0x184c4080>

Figure 22
This approach uses the kernel density estimate to provide a richer description of the distribution
of values. Additionally, the quartile and whisker values from the boxplot are shown inside the
violin. The downside is that, because the violinplot uses a KDE, there are some other parameters
that may need tweaking, adding some complexity relative to the straightforward boxplot:

In [41]:
sns.catplot(x="age",y="marital_status",kind='violin',bw=.15,
cut=0,data=census_data)

Out[41]:
<seaborn.axisgrid.FacetGrid at 0xfdea320>
Figure 23

Statistical estimation within categories

For other applications, rather than showing the distribution within each category, you might
want to show an estimate of the central tendency of the values. Seaborn has two main ways to
show this information. Importantly, the basic API for these functions is identical to that for the
ones discussed above.

Bar plots

A familiar style of plot that accomplishes this goal is a bar plot. In seaborn, the barplot() function
operates on a full dataset and applies a function to obtain the estimate (taking the mean by
default). When there are multiple observations in each category, it also uses bootstrapping to
compute a confidence interval around the estimate and plots that using error bars:

In [46]:
sns.catplot(x="income_bracket",y="age",kind='bar',data=census_data)

Out[46]:
<seaborn.axisgrid.FacetGrid at 0x160588d0>
Figure 24

In [47]:
sns.catplot(x="income_bracket", y="age", kind='bar', hue='gender', data=census_data)
Out[47]:
<seaborn.axisgrid.FacetGrid at 0xdf262e8>

Figure 25
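The error bars on these bar plots come from bootstrapping. As a rough sketch of the idea, the code below resamples one category's values with replacement, recomputes the mean each time, and takes percentiles of those means. The `bootstrap_ci` helper, the 1000 resamples, and the sample ages are assumptions for illustration, not seaborn's exact procedure.

```python
import random
from statistics import fmean

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Mean of each resample (same size as the data, drawn with replacement).
    means = sorted(
        fmean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    tail = (100 - ci) / 200
    low = means[int(tail * n_boot)]
    high = means[int((1 - tail) * n_boot) - 1]
    return low, high

# Hypothetical ages for one income bracket.
ages = [25, 32, 47, 51, 38, 29, 60, 44, 36, 41]
low, high = bootstrap_ci(ages)

print(f"mean = {fmean(ages):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The bar height corresponds to the mean, and the error bar spans the interval from `low` to `high`.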

A special case for the bar plot is when you want to show the number of observations in each
category rather than computing a statistic for a second variable. This is similar to a histogram
over a categorical, rather than quantitative, variable. In seaborn, it’s easy to do so with the
countplot() function:

In [61]:
ax = sns.catplot(x='marital_status', kind='count', data=census_data, orient="h")
ax.fig.autofmt_xdate()

Out[61]:
Figure 26

Point plots

An alternative style for visualizing the same information is offered by the pointplot() function.
This function also encodes the value of the estimate with height on the other axis, but rather than
showing a full bar, it plots the point estimate and confidence interval. Additionally, pointplot()
connects points from the same hue category. This makes it easy to see how the main relationship
is changing as a function of the hue semantic because your eyes are quite good at picking up on
differences of slopes:

In [67]:
ax = sns.catplot(x='marital_status', y='age', hue='relationship', kind='point', data=census_data)
ax.fig.autofmt_xdate()

Out[67]:
Figure 27
Showing multiple relationships with facets

Just like relplot(), the fact that catplot() is built on a FacetGrid means that it is easy to add
faceting variables to visualize higher-dimensional relationships:

In [78]:
sns.catplot(x="age", y="marital_status", hue="income_bracket",
col="gender", aspect=.6,
kind="box", data=census_data);

seaborn.heatmap(data, vmin=None, vmax=None, cmap=None, center=None,
robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0,
linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False,
xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)
Plot rectangular data as a color-encoded matrix.

This is an Axes-level function and will draw the heatmap into the
currently-active Axes if none is provided to the ax argument. Part of
this Axes space will be taken and used to plot a colormap,
unless cbar is False or a separate Axes is provided to cbar_ax.

Parameters
data : rectangular dataset
    2D dataset that can be coerced into an ndarray. If a Pandas
    DataFrame is provided, the index/column information will be used to
    label the columns and rows.
vmin, vmax : floats, optional
    Values to anchor the colormap, otherwise they are inferred from the
    data and other keyword arguments.
cmap : matplotlib colormap name or object, or list of colors, optional
    The mapping from data values to color space. If not provided, the
    default will depend on whether center is set.
center : float, optional
    The value at which to center the colormap when plotting divergent
    data. Using this parameter will change the default cmap if none is
    specified.
robust : bool, optional
    If True and vmin or vmax are absent, the colormap range is computed
    with robust quantiles instead of the extreme values.
annot : bool or rectangular dataset, optional
    If True, write the data value in each cell. If an array-like with the
    same shape as data, then use this to annotate the heatmap instead of
    the data. Note that DataFrames will match on position, not index.
fmt : string, optional
    String formatting code to use when adding annotations.
annot_kws : dict of key, value mappings, optional
    Keyword arguments for ax.text when annot is True.
linewidths : float, optional
    Width of the lines that will divide each cell.
linecolor : color, optional
    Color of the lines that will divide each cell.
cbar : boolean, optional
    Whether to draw a colorbar.
cbar_kws : dict of key, value mappings, optional
    Keyword arguments for fig.colorbar.
cbar_ax : matplotlib Axes, optional
    Axes in which to draw the colorbar, otherwise take space from the
    main Axes.
square : boolean, optional
    If True, set the Axes aspect to "equal" so each cell will be
    square-shaped.
xticklabels, yticklabels : "auto", bool, list-like, or int, optional
    If True, plot the column names of the dataframe. If False, don't plot
    the column names. If list-like, plot these alternate labels as the
    xticklabels. If an integer, use the column names but plot only every
    nth label. If "auto", try to densely plot non-overlapping labels.
mask : boolean array or DataFrame, optional
    If passed, data will not be shown in cells where mask is True. Cells
    with missing values are automatically masked.
ax : matplotlib Axes, optional
    Axes in which to draw the plot, otherwise use the currently-active
    Axes.
kwargs : other keyword arguments
    All other keyword arguments are passed
    to matplotlib.axes.Axes.pcolormesh().

Returns
ax : matplotlib Axes
    Axes object with the heatmap.

See also
clustermap
    Plot a matrix using hierarchical clustering to arrange the rows and
    columns.

Examples

Plot a heatmap for a numpy array:

>>> import numpy as np; np.random.seed(0)


>>> import seaborn as sns; sns.set()
>>> uniform_data = np.random.rand(10, 12)
>>> ax = sns.heatmap(uniform_data)

Change the limits of the colormap:

>>> ax = sns.heatmap(uniform_data, vmin=0, vmax=1)


Plot a heatmap for data centered on 0 with a diverging colormap:

>>> normal_data = np.random.randn(10, 12)


>>> ax = sns.heatmap(normal_data, center=0)
Plot a dataframe with meaningful row and column labels:

>>> flights = sns.load_dataset("flights")
>>> flights = flights.pivot(index="month", columns="year", values="passengers")
>>> ax = sns.heatmap(flights)

Annotate each cell with the numeric value using integer formatting:

>>> ax = sns.heatmap(flights, annot=True, fmt="d")


Add lines between each cell:

>>> ax = sns.heatmap(flights, linewidths=.5)


Use a different colormap:

>>> ax = sns.heatmap(flights, cmap="YlGnBu")


Center the colormap at a specific value:

>>> ax = sns.heatmap(flights, center=flights.loc["January", 1955])


Plot every other column label and don’t plot row labels:

>>> data = np.random.randn(50, 20)


>>> ax = sns.heatmap(data, xticklabels=2, yticklabels=False)
Don’t draw a colorbar:

>>> ax = sns.heatmap(flights, cbar=False)


Use different axes for the colorbar:

>>> import matplotlib.pyplot as plt
>>> grid_kws = {"height_ratios": (.9, .05), "hspace": .3}
>>> f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws)
>>> ax = sns.heatmap(flights, ax=ax,
...                  cbar_ax=cbar_ax,
...                  cbar_kws={"orientation": "horizontal"})

Use a mask to plot only part of a matrix:

>>> corr = np.corrcoef(np.random.randn(10, 200))
>>> mask = np.zeros_like(corr)
>>> mask[np.triu_indices_from(mask)] = True
>>> with sns.axes_style("white"):
...     f, ax = plt.subplots(figsize=(7, 5))
...     ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
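To see which cells the mask in the last example hides, the same pattern can be built with plain nested lists. This is only an illustration of the upper-triangle rule (diagonal included); the `upper_triangle_mask` helper is hypothetical.

```python
def upper_triangle_mask(n):
    """True where a cell is on or above the diagonal of an n x n matrix."""
    return [[col >= row for col in range(n)] for row in range(n)]

mask = upper_triangle_mask(4)

# "x" marks hidden cells, "." marks cells that are still drawn.
for row in mask:
    print(" ".join("x" if hidden else "." for hidden in row))
```

Because a correlation matrix is symmetric, hiding this half (and the diagonal of ones) removes only redundant cells.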
https://stackoverflow.com/questions/55389247/how-to-plot-a-matrix-of-seaborn-distplots-for-all-columns-in-the-dataframe
How to Plot a Matrix of Seaborn Distplots for All Columns in the Dataframe
Imagine I have a dataframe with 9 columns. I want to be able to achieve the same
effect as df.hist(), but with sns.distplot().

In other words, I want to be able to plot the sns.distplot() for each column in the
dataframe in a visualization of 3 rows and 3 columns where each sub figure
represents the unique sns.distplot() of each column for the total number of columns
in the dataframe.

I experimented a bit with using a for loop over axes and columns for the dataframe,
but I'm only able to achieve results for specifying columns. I'm not sure how to
represent the code to work for rows and columns.

I also looked into sns.FacetGrid, but I'm not sure how to go about solving this
problem using FacetGrid.

I find that the df.hist() function does exactly what I want, but I want to be able to do it with
sns.distplot for all the columns, in the same layout as the output of
df.hist().

If it helps to put the context of the dataframe, I'm essentially reading Google
Colab's training and testing sets for the California Housing Dataset which contains
all the columns except for the ocean_proximity. If you want to help me figure out
this problem using that dataset, please get it from Kaggle and drop the
ocean_proximity column.

My approach for 9 columns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('housing.csv')
df.drop('ocean_proximity', axis=1, inplace=True)
fig, axes = plt.subplots(ncols=len(df.columns), figsize=(30,15))
for ax, col in zip(axes, df.columns):
    sns.distplot(df[col], ax=ax)
    plt.tight_layout()
plt.show()
Asked Mar 28 '19 at 2:20 by pythonRCNewbie (tags: python, pandas, seaborn)
1 Answer

You can create a grid of subplots with matplotlib using subplots, like this:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=3, ncols=3)

Then you can plot each column in a different subplot with the ax argument,
using ax=axes[nrow, ncol] to specify the subplot you want to draw in:

for i, column in enumerate(df.columns):
    sns.distplot(df[column], ax=axes[i//3, i%3])
Answered Mar 28 '19 at 8:58 by Bruce Swain (edited Mar 7 at 22:45)

Comment: Beautiful solution, Bruce! Thank you so much. I got exactly the visualization I
wanted :) (pythonRCNewbie, Mar 29 '19 at 9:18)
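The index arithmetic in the answer, i//3 for the row and i%3 for the column, generalizes to any number of columns. The sketch below shows the mapping on its own, without plotting; the `grid_positions` helper is hypothetical and assumes the grid is filled row by row.

```python
from math import ceil

def grid_positions(n_plots, ncols=3):
    """Number of rows needed, plus the (row, col) cell for each plot index."""
    nrows = ceil(n_plots / ncols)
    # divmod(i, ncols) is the same as (i // ncols, i % ncols).
    return nrows, [divmod(i, ncols) for i in range(n_plots)]

nrows, positions = grid_positions(9)
print(nrows)      # rows needed for 9 plots in 3 columns
print(positions)  # cell assigned to each of the 9 columns of the dataframe
```

With a count that does not fill the grid, such as 7 plots in 3 columns, the last row has unused cells whose Axes can be hidden or removed.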

https://www.geeksforgeeks.org/box-plot-visualization-with-pandas-and-seaborn/

Box plot visualization with Pandas and Seaborn

A box plot is a visual representation of groups of numerical data through their quartiles. Box
plots are also used to detect outliers in a data set. They capture a summary of the data
efficiently with a simple box and whiskers, and make it easy to compare across groups.
A box plot summarizes sample data using the 25th, 50th and 75th percentiles, also known as
the lower quartile, median and upper quartile.
A box plot consists of 5 things:
• Minimum
• First Quartile or 25%
• Median (Second Quartile) or 50%
• Third Quartile or 75%
• Maximum
To download the dataset used, click here.

Draw the box plot with Pandas:


One way to plot boxplot using pandas dataframe is to use boxplot() function that is part of
pandas library.
filter_none

brightness_4
# import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# load the dataset
df = pd.read_csv("tips.csv")

# display the first 5 rows of the dataset
df.head()

Boxplot of total_bill grouped by day.

df.boxplot(by ='day', column =['total_bill'], grid = False)

 
Boxplot of tip grouped by size.
df.boxplot(by ='size', column =['tip'], grid = False)
 
Draw the boxplot using seaborn library:
Syntax :
seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, notch=False,
ax=None, **kwargs)
Parameters:
x = feature of dataset (column drawn on the x-axis)
y = feature of dataset (column drawn on the y-axis)
hue = feature of dataset used to split each group by color
data = DataFrame or full dataset
color = color name
Let’s see how to create the box plot with the seaborn library.
Information about the “tips” dataset:
import seaborn as sns

# load the dataset
tips = sns.load_dataset('tips')

tips.head()

Boxplot of total_bill grouped by day.

# Draw a vertical boxplot grouped 
# by a categorical variable:
sns.set_style("whitegrid")

  
sns.boxplot(x = 'day', y = 'total_bill', data = tips)

Let’s take the first box plot, i.e. the blue box of the figure, and read off the statistics it shows:
 The bottom black horizontal line (the lower whisker cap) is the smallest value that is not an
outlier. It equals the minimum only when no point lies below it.
 The lower edge of the rectangle is the first quartile, or 25%.
 The line inside the rectangle is the second quartile, 50%, or median.
 The upper edge of the rectangle is the third quartile, or 75%.
 The top black horizontal line (the upper whisker cap) is the largest value that is not an outlier.
 The small diamond shapes are outliers: points that lie more than whis=1.5 times the
interquartile range beyond the box. They are unusual values, not necessarily erroneous data.
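The quantities listed above can be computed by hand. The sketch below assumes the whis=1.5 rule from the function signature and uses a made-up sample; box_stats is a hypothetical helper written for this illustration, not a seaborn function.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws, using the whis * IQR whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside.min(),    # lowest point that is not an outlier
        "whisker_high": inside.max(),   # highest point that is not an outlier
        "outliers": values[(values < low_fence) | (values > high_fence)],
    }

# Made-up sample: nine ordinary values and one extreme one.
stats = box_stats([10, 12, 13, 14, 15, 16, 17, 18, 20, 50])
print(stats)
```

Here the upper whisker stops at 20 rather than at the maximum of 50, because 50 lies beyond q3 + 1.5 * IQR and is drawn as an outlier instead.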
https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07

Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations

For data scientists, checking correlations is an important part of the exploratory data
analysis process. This analysis is one of the methods used to decide which features affect the
target variable the most, and in turn, get used in predicting this target variable. In other words,
it’s a commonly-used method for feature selection in machine learning.

And because visualization is generally easier to understand than reading tabular data, heatmaps
are typically used to visualize correlation matrices. A simple way to plot a heatmap in Python is
to use the Seaborn library.
From seaborn documentation

Seaborn heatmap arguments

Seaborn heatmaps are appealing to the eyes, and they tend to send clear messages about data
almost immediately. This is why this method for correlation matrix visualization is widely used
by data analysts and data scientists alike.

But what else can we get from the heatmap apart from a simple plot of the correlation matrix?

In two words: A LOT.

Surprisingly, the Seaborn heatmap function has 18 arguments that can be used to customize a
correlation matrix, improving how fast insights can be derived. For the purposes of this tutorial,
we’re going to use 13 of those arguments.

Let’s get right to it


Getting started with Seaborn

To make things a bit simpler for the purposes of this tutorial, we’re going to use one of the
pre-installed datasets in Seaborn. The first thing we need to do is import the Seaborn library and
load the data.

Please note: If using Google Colab or any Anaconda package, there’s no need to install
Seaborn; you’ll only need to import it. Otherwise, use this link to install Seaborn.

The data

Our data, which is called Tips (a pre-installed dataset on Seaborn library), has 7 columns
consisting of 3 numeric features and 4 categorical features. Each entry or row captures a type of
customer (male or female, smoker or non-smoker) having either dinner or lunch on a
particular day of the week. It also captures the total bill, the tip given and the table size
of a customer. (For more info about pre-installed datasets on the Seaborn library, check here)

One important thing to note when plotting a correlation matrix is that only numeric columns
take part in it. Older pandas versions silently drop non-numeric columns in .corr(), while
pandas 2.0 and later raise an error unless numeric_only=True is passed. For the purposes of this
tutorial, all the categorical variables were changed to numeric variables.

This is what the DataFrame looks like after wrangling.


Take a look at how the data was wrangled here.

As mentioned previously, the Seaborn heatmap function can take in 18 arguments.

This is what the function looks like with all the arguments:

sns.heatmap(data, vmin=None, vmax=None, cmap=None, center=None, robust=False,
annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True,
cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto',
mask=None, ax=None, **kwargs)

Just taking a look at the code and not having any idea about how it works can be very
overwhelming. Let’s dissect it together.

To better understand the arguments, we’re going to group them into 4 categories:

1. The Essentials

2. Adjusting the axis (the measurement bar)

3. Aesthetics

4. Changing the matrix shape


The Essentials

1. The most important argument in the function is the data, since the end goal is to
plot a correlation. The .corr() method is called on the DataFrame and the result is passed as
the first argument.
2. Interpreting the insights by just using the first argument is sufficient. For an even easier
interpretation, the argument annot=True should be passed as well, which displays
the correlation coefficient in each cell.
3. There are times where correlation coefficients may run to 5 decimal digits. A
good trick to reduce the number of digits displayed and improve readability is to pass the
argument fmt='.3g' or fmt='.1g'. By default the function uses fmt='.2g', which displays two
significant digits (this is not always the same as two decimal places). Let's set fmt='.1g'.
For the rest of this tutorial, we will stick to the default fmt='.2g'
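The three essentials can be sketched as follows. The DataFrame is a made-up numeric table standing in for the wrangled Tips data, so the coefficients will not match the article's figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric table standing in for the wrangled Tips data.
rng = np.random.default_rng(1)
total_bill = rng.uniform(5, 50, size=100)
df = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=100),
    "size": rng.integers(1, 7, size=100),
})

corr = df.corr()                               # essential 1: the correlation matrix
ax = sns.heatmap(corr, annot=True, fmt=".1g")  # essentials 2 and 3

# fmt is an ordinary Python format spec: 'g' counts significant digits.
print(format(0.6757, ".2g"), format(0.6757, ".1g"), format(12.345, ".2g"))
# 0.68 0.7 12
```

The last line shows why '.2g' is not "two decimal places": 12.345 is printed as 12.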

Adjusting the axis (the measurement bar)

4. The next three arguments have to do with rescaling the color bar. By default the color bar
simply spans the values present in the matrix, so it may not start at zero or at a negative number,
end at a particular number of choice, or have a distinct center. All this can be customized by
specifying these three arguments: vmin, which is the minimum value of the bar; vmax, which is
the maximum value of the bar; and center, the value at which the colormap is centered. By default,
all three aren’t specified. Let’s say we want our color bar to run from -1 to 1 and be centered at 0.
One obvious change, apart from the rescaling, is that the color changed. This has to do with
changing center from None to zero (or any other number), which switches the default to a
diverging colormap. But this does not mean we can’t
change the color back or to any other available color. Let’s see how to do this.
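A minimal sketch of the rescaling step, using a correlation matrix built from random made-up data. The color limits are read back from the drawn mesh to confirm what the color bar will show.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: four random columns named a to d.
rng = np.random.default_rng(2)
corr = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd")).corr()

# Pin the color scale to the full range of a correlation coefficient.
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, annot=True)

# The drawn mesh carries the limits that the color bar displays.
mesh = ax.collections[0]
print(mesh.get_clim())
```

Fixing the limits this way makes heatmaps of different datasets directly comparable, since the same color always means the same coefficient.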

Aesthetics

5. Let's change the color by specifying the argument cmap.


Check here for more information on the available color codes.

6. By default, the thickness and color of the border of each cell of the matrix are set to 0 and
white, respectively. There are times where the heatmap may look better with some border thickness
and a change of color. This is where the arguments linewidths and linecolor apply. Let's set
linewidths to 3 and linecolor to black.
For the rest of this tutorial, we’ll switch back to the default cmap, linecolor,
and linewidths. This can be done either by passing the following
arguments: cmap=None, linecolor='white', and linewidths=0; or by not passing the
arguments at all (which we’re going to do).

7. So far, the heatmap has had its color bar displayed vertically. This can be changed to
horizontal by passing a dictionary such as {'orientation': 'horizontal'} in the argument cbar_kws.
8. There also might be instances where a heatmap may be better off not having a color bar at all.
This can be done by specifying cbar=False.
For the rest of this tutorial, we will display the color bar.

9. Take a closer look at the shape of each matrix cell above. They’re all rectangular in shape. We
can change them into squares by specifying square=True.
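Steps 5 to 9 can be combined in one short sketch. The correlation matrix comes from random made-up data, and "coolwarm" is just one of matplotlib's named colormaps chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: four random columns named a to d.
rng = np.random.default_rng(3)
corr = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd")).corr()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Steps 5 to 7: colormap, cell borders, horizontal color bar.
sns.heatmap(corr, cmap="coolwarm", linewidths=3, linecolor="black",
            cbar_kws={"orientation": "horizontal"}, ax=ax1)

# Steps 8 and 9: no color bar, square cells.
sns.heatmap(corr, cbar=False, square=True, ax=ax2)
```

The figure ends up with three axes: the two heatmaps plus the color bar that belongs to the first one.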
Changing the matrix shape

Changing the whole shape of the matrix from rectangular to triangular is a little tricky. For this,
we’ll need the NumPy functions np.triu() and np.tril(), together with the Seaborn
heatmap argument mask.
np.triu() is a NumPy function that returns the upper triangle of any matrix given to it
(everything below the diagonal is set to zero), while np.tril() returns the lower triangle.

The idea is to pass the correlation matrix into the NumPy function and then pass the result as the
mask argument. The heatmap hides every cell where the mask is True (nonzero), so masking with
np.triu() hides the upper triangle and leaves the lower one visible, and masking with np.tril()
does the opposite. Let’s see how this works below.

First using the np.triu() function:


Then using the np.tril() function:
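A sketch of both masks, assuming a correlation matrix built from random made-up data. Boolean masks built with np.ones_like are used here instead of masking with the correlation values themselves, so that a coefficient of exactly zero cannot slip through the mask.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: four random columns named a to d.
rng = np.random.default_rng(4)
corr = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd")).corr()

# Boolean masks: True marks a cell that the heatmap will NOT draw.
upper = np.triu(np.ones_like(corr, dtype=bool))  # diagonal and everything above it
lower = np.tril(np.ones_like(corr, dtype=bool))  # diagonal and everything below it

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(corr, mask=upper, annot=True, ax=ax1)  # shows the lower triangle
sns.heatmap(corr, mask=lower, annot=True, ax=ax2)  # shows the upper triangle
```

Passing k=1 to np.triu (or k=-1 to np.tril) would keep the diagonal visible, if that is preferred.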
In conclusion

We discovered 13 ways to customize our Seaborn heatmap for a correlation matrix. The
remaining 5 arguments are rarely used because they’re very specific to the nature of the data and
the associated goals. Full source code for this tutorial can be found on GitHub:
anitaokoh/Understanding-the-Seaborn-heatmap-function

References

Learn more about the Seaborn function using the documentation here

To learn more about improving the EDA process through visualization, check out this
Dataquest tutorial (login required).


https://www.drawingfromdata.com/setting-figure-size-using-seaborn-and-matplotlib

How to set the size of a figure in matplotlib and seaborn

TL;DR

 if you're using plot() on a pandas Series or DataFrame, use the figsize keyword
 if you're using matplotlib directly, use matplotlib.pyplot.figure with the figsize keyword
 if you're using a seaborn function that draws a single plot,
use matplotlib.pyplot.figure with the figsize keyword
 if you're using a seaborn function that draws multiple plots, use
the height and aspect keyword arguments
Introduction

Setting figure sizes, like rotating axis tick labels, is one of those things that feels
like it should be very straightforward. However, it still manages to show up on the
first page of stackoverflow questions for both matplotlib and seaborn. Part of
the confusion arises because there are so many ways to do the same thing - this
highly upvoted question has six suggested solutions:
 manually create an Axes object with the desired size
 pass some configuration parameters to seaborn so that the size you want
is the default
 call a method on the figure once it's been created
 pass height and aspect keywords to the seaborn plotting function
 use the matplotlib.pyplot interface and call the figure() function
 use the matplotlib.pyplot interface to get the current figure then set its
size using a method

each of which will work in some circumstances but not others!

Drawing a figure using pandas

Let's jump in. As an example we'll use the olympic medal dataset, which we can
load directly from a URL:

import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv(
    "https://raw.githubusercontent.com/mojones/binders/master/olympics.csv",
    sep="\t")
data
       City     Year  Sport      ...  Medal   Country    Int Olympic Committee code
0      Athens   1896  Aquatics   ...  Gold    Hungary    HUN
1      Athens   1896  Aquatics   ...  Silver  Austria    AUT
2      Athens   1896  Aquatics   ...  Bronze  Greece     GRE
3      Athens   1896  Aquatics   ...  Gold    Greece     GRE
4      Athens   1896  Aquatics   ...  Silver  Greece     GRE
...    ...      ...   ...        ...  ...     ...        ...
29211  Beijing  2008  Wrestling  ...  Silver  Germany    GER
29212  Beijing  2008  Wrestling  ...  Bronze  Lithuania  LTU
29213  Beijing  2008  Wrestling  ...  Bronze  Armenia    ARM
29214  Beijing  2008  Wrestling  ...  Gold    Cuba       CUB
29215  Beijing  2008  Wrestling  ...  Silver  Russia     RUS

29216 rows × 12 columns

For our first figure, we'll count how many medals have been won in total by each
country, then take the top thirty:

data['Country'].value_counts().head(30)
United States 4335
Soviet Union 2049
United Kingdom 1594
France 1314
Italy 1228
...
Spain 377
Switzerland 376
Brazil 372
Bulgaria 331
Czechoslovakia 329
Name: Country, Length: 30, dtype: int64

And turn it into a bar chart:

data['Country'].value_counts().head(30).plot(kind='barh')
Ignoring other aesthetic aspects of the plot, it's obvious that we need to change
the size - or rather the shape. Part of the confusion over sizes in plotting is that
sometimes we need to just make the chart bigger or smaller, and sometimes we
need to make it thinner or fatter. If we just scaled up this plot so that it was big
enough to read the names on the vertical axis, then it would also be very wide.
We can set the size by adding a figsize keyword argument to our
pandas plot() function. The value has to be a tuple of sizes - it's actually the
horizontal and vertical size in inches, but for most purposes we can think of them
as arbitrary units.

Here's what happens if we make the plot bigger, but keep the original shape:

import matplotlib.pyplot as plt

data['Country'].value_counts().head(30).plot(kind='barh', figsize=(20,10))

And here's a version that keeps the large vertical size but shrinks the chart
horizontally so it doesn't take up so much space:

import matplotlib.pyplot as plt

data['Country'].value_counts().head(30).plot(kind='barh', figsize=(6,10))

Drawing a figure using matplotlib

OK, but what if we aren't using pandas' convenient plot() method but drawing
the chart using matplotlib directly? Let's look at the number of medals awarded
in each year:
plt.plot(data['Year'].value_counts().sort_index())

This time, we'll say that we want to make the plot longer in the horizontal
direction, to better see the pattern over time. If we search the documentation for
the matplotlib plot() function, we won't find any mention of size or shape. This
actually makes sense in the design of matplotlib - plots don't really have a size,
figures do. So to change it we have to call the figure() function:
plt.figure(figsize=(15,4))
plt.plot(data['Year'].value_counts().sort_index())
Notice that with the figure() function we have to call it before we make the call
to plot(), otherwise it won't take effect:
plt.plot(data['Year'].value_counts().sort_index())
# no effect, the plot has already been drawn
plt.figure(figsize=(15,4))
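The ordering rule can be checked directly by reading the size back from the figure. In this sketch the yearly counts are made-up numbers, and set_size_inches() is shown as one way to resize a figure that already exists.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

years, counts = [1896, 1900, 1904], [151, 512, 470]  # made-up yearly medal counts

# Size set up front: the figure is created with the requested shape.
fig = plt.figure(figsize=(15, 4))
plt.plot(years, counts)
print(fig.get_size_inches())  # width and height in inches

# Size changed afterwards: resize the existing figure instead of creating a new one.
fig2, ax = plt.subplots()
ax.plot(years, counts)
fig2.set_size_inches(15, 4)
print(fig2.get_size_inches())
```

Calling plt.figure() after plotting does not resize anything because it opens a second, empty figure, whereas set_size_inches() acts on the figure that holds the plot.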

Drawing a figure with seaborn

OK, now what if we're using seaborn rather than matplotlib? Well, happily the
same technique will work. We know from our first plot which countries have won
the most medals overall, but now let's look at how this varies by year. We'll
create a summary table to show the number of medals per year for all countries
that have won at least 500 medals total.

(ignore this pandas stuff if it seems confusing, and just look at the final table)

summary = (
    data
    .groupby('Country')
    .filter(lambda x : len(x) > 500)
    .groupby(['Country', 'Year'])
    .size()
    .to_frame('medal count')
    .reset_index()
)

# wrap long country names
summary['Country'] = summary['Country'].str.replace(' ', '\n')
summary
     Country         Year  medal count
0    Australia       1896  2
1    Australia       1900  5
2    Australia       1920  6
3    Australia       1924  10
4    Australia       192…  4
...  ...             ...   ...
309  United\nStates  1992  224
310  United\nStates  1996  260
311  United\nStates  2000  248
312  United\nStates  2004  264
313  United\nStates  2008  315

314 rows × 3 columns

Now we can do a box plot to show the distribution of yearly medal totals for each
country:

import seaborn as sns

sns.boxplot(
    data=summary,
    x='Country',
    y='medal count',
    color='red')

This is hard to read because of all the names, so let's space them out a bit:

plt.figure(figsize=(20,5))
sns.boxplot(
    data=summary,
    x='Country',
    y='medal count',
    color='red')
Now we come to the final complication; let's say we want to look at the
distributions of the different medal types separately. We'll make a new summary
table - again, ignore the pandas stuff if it's confusing, and just look at the final
table:

summary_by_medal = (
    data
    .groupby('Country')
    .filter(lambda x : len(x) > 500)
    .groupby(['Country', 'Year', 'Medal'])
    .size()
    .to_frame('medal count')
    .reset_index()
)
summary_by_medal['Country'] = summary_by_medal['Country'].str.replace(' ', '\n')
summary_by_medal
     Country         Year  Medal   medal count
0    Australia       1896  Gold    2
1    Australia       1900  Bronze  3
2    Australia       1900  Gold    2
3    Australia       1920  Bronze  1
4    Australia       1920  Silver  5
...  ...             ...   ...     ...
881  United\nStates  2004  Gold    116
882  United\nStates  2004  Silver  75
883  United\nStates  2008  Bronze  81
884  United\nStates  2008  Gold    125
885  United\nStates  2008  Silver  109

886 rows × 4 columns


Now we will switch from boxplot() to the higher level catplot(), as this makes
it easy to switch between different plot types. But notice that now our call
to plt.figure() gets ignored:
plt.figure(figsize=(20,5))

sns.catplot(
    data=summary_by_medal,
    x='Country',
    y='medal count',
    hue='Medal',
    kind='box')

The reason for this is that the higher level plotting functions in seaborn (what the
documentation calls Figure-level interfaces) have a different way of managing
size, largely due to the fact that they often produce multiple subplots. To set the
size when using catplot() or relplot() (also pairplot() and lmplot()),
use the height keyword to control the size and the aspect keyword to control the
shape. jointplot() also takes height, but it has no aspect keyword:
sns.catplot(
    data=summary_by_medal,
    x='Country',
    y='medal count',
    hue='Medal',
    kind='box',
    height=5, # make the plot 5 units high
    aspect=3) # width should be three times height

Because we often end up drawing small multiples with catplot() and relplot(),
being able to control the shape separately from the size is very convenient.
The height and aspect keywords apply to each subplot separately, not to the
figure as a whole. So if we put each medal on a separate row rather than using
hue, we'll end up with three subplots, so we'll want to set the height to be
smaller, but the aspect ratio to be bigger:
sns.catplot(
    data=summary_by_medal,
    x='Country',
    y='medal count',
    row='Medal',
    kind='box',
    height=3,
    aspect=4,
    color='blue')
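To see that height and aspect apply per subplot, the size of the finished figure can be read back. This sketch uses a small made-up table with the same column names as summary_by_medal, since the real table depends on downloading the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up long-format table with the same column names as summary_by_medal.
rng = np.random.default_rng(5)
demo = pd.DataFrame({
    "Country": np.tile(["France", "Italy", "Spain"], 30),
    "Medal": np.repeat(["Gold", "Silver", "Bronze"], 30),
    "medal count": rng.integers(1, 100, size=90),
})

g = sns.catplot(data=demo, x="Country", y="medal count",
                row="Medal", kind="box", height=3, aspect=4)

# Each of the 3 facets is 3 inches high and 3 * 4 = 12 inches wide,
# so the whole figure is 12 wide and 3 * 3 = 9 high.
width, height = g.figure.get_size_inches()
print(width, height)
```

The g.figure attribute assumes a reasonably recent seaborn release; older versions expose the same object as g.fig.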

Printing a figure

Finally, a word about printing. If the reason that you need to change the size of a
plot, rather than the shape, is because you need to print it, then don't worry
about the size - get the shape that you want, then use savefig() to make the
plot in SVG format:
plt.savefig('medals.svg')
This will give you a plot in Scalable Vector Graphics format, which stores the
actual lines and shapes of the chart so that you can print it at any size - even a
giant poster - and it will look sharp. As a nice bonus, you can also edit individual
bits of the chart using a graphical SVG editor (Inkscape is free and powerful,
though takes a bit of effort to learn).

# More Heat map Learn

plt.figure(figsize=(30,10))

sns.heatmap(corr,annot=True,cmap="YlGnBu",fmt='.1g',square=True)

# plt.figure(figsize=(15,4))

# plt.savefig('medals.svg')
Introduction

There is just something extraordinary about a well-designed visualization. The colors stand out,
the layers blend nicely together, the contours flow throughout, and the overall package not only
has a nice aesthetic quality, but it provides meaningful insights to us as well.

This is quite important in data science where we often work with a lot of messy data. Having the
ability to visualize it is critical for a data scientist. Our stakeholders or clients will more often than
not rely on visual cues rather than the intricacies of a machine learning model.

There are plenty of excellent Python visualization libraries available, including the widely
used matplotlib. But seaborn stands out for me. It combines aesthetic appeal seamlessly with
technical insights, as we’ll soon see.
In this article, we’ll learn what seaborn is and why you should use it ahead of matplotlib. We’ll
then use seaborn to generate all sorts of different data visualizations in Python. So put your
creative hats on and let’s get rolling!


Table of Contents

 What is Seaborn?
 Why should you use Seaborn versus matplotlib?
 Setting up the Environment
 Data Visualization using Seaborn
o Visualizing Statistical Relationships
o Plotting with Categorical Data
o Visualizing the Distribution of a Dataset

What is Seaborn?

Have you ever used the ggplot2 library in R? It’s one of the best visualization packages in any
tool or language. Seaborn gives me the same overall feel.

Seaborn is an amazing Python visualization library built on top of matplotlib.

It gives us the capability to create rich statistical visuals. This helps us understand the data by
displaying it in a visual context to unearth any hidden correlations between variables or trends
that might not be obvious initially. Seaborn offers a high-level interface, in contrast to the
low-level interface of matplotlib.

Why should you use Seaborn versus matplotlib?

I’ve been talking about how awesome seaborn is so you might be wondering what all the fuss is
about.

I’ll answer that question comprehensively in a practical manner when we generate plots using
seaborn. For now, let’s quickly talk about how seaborn feels like it’s a step above matplotlib.

Seaborn makes our charts and plots look engaging and enables some of the common data
visualization needs (like mapping color to a variable or using faceting). Basically, it makes the
data visualization and exploration easy to conquer. And trust me, that is no easy task in data
science.
“If Matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make
a well-defined set of hard things easy too.” – Michael Waskom (Creator of Seaborn)
There are essentially a couple of (big) limitations in matplotlib that Seaborn fixes:

1. Seaborn comes with a large number of high-level interfaces and customized themes that
matplotlib lacks as it’s not easy to figure out the settings that make plots attractive
2. Matplotlib functions don’t work well with dataframes, whereas seaborn does

That second point stands out in data science since we work quite a lot with dataframes.

Setting up the Environment

The seaborn library has four mandatory dependencies you need to have:

 NumPy (>= 1.9.3)
 SciPy (>= 0.14.0)
 matplotlib (>= 1.4.3)
 Pandas (>= 0.15.2)

To install Seaborn and use it effectively, we first need to install the aforementioned
dependencies. Once this step is done, we are all set to install Seaborn and enjoy its
mesmerizing plots.

To install the latest release of seaborn, you can use pip:

pip install seaborn

You can also use conda to install the latest version of seaborn:

conda install seaborn

To import the dependencies and seaborn itself in your code, you can use the following code:
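A typical set of imports uses the conventional aliases below. This is a sketch of common practice rather than the article's exact listing.

```python
import numpy as np               # numerical arrays
import pandas as pd              # DataFrames
import matplotlib.pyplot as plt  # underlying plotting library
import seaborn as sns            # sns is the conventional alias for seaborn

print(sns.__version__)
```

Printing the version is a quick way to confirm that the installation worked and to know which features, such as histplot(), are available.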

That’s it! We are all set to explore seaborn in detail.

 
Datasets Used for Data Visualization

We’ll be working primarily with two datasets:

 HR analytics challenge
 Predict the number of Upvotes

I’ve picked these two because they contain a multitude of variables so we have plenty of options
to play around with. Both these datasets also mimic real-world scenarios so you’ll get an idea of
how data visualization and exploration work in the industry.

You can check out this and other high-quality datasets and hackathons on the DataHack
platform. So go ahead and download the above two datasets before you proceed. We’ll be
using them in tandem.

Data Visualization using Seaborn

Let’s get started! I have divided this implementation section into two categories:

 Visualizing statistical relationships


 Plotting categorical data

We’ll look at multiple examples of each category and how to plot it using seaborn.

Visualizing statistical relationships

Analyzing statistical relationships means understanding how the variables in a dataset relate
to each other and how those relationships depend on other variables.

Here, we’ll be using seaborn to generate the below plots:

 Scatter plot
 SNS.relplot
 Hue plot

I have picked the ‘Predict the number of upvotes‘ project for this. So, let’s start by importing
the dataset in our working environment:
 

Scatterplot using Seaborn

A scatterplot is perhaps the most common example of visualizing relationships between two
variables. Each point shows an observation in the dataset and these observations are
represented by dot-like structures. The plot shows the joint distribution of two variables using a
cloud of points.

To draw the scatter plot, we’ll be using the relplot() function of the seaborn library. It is a
figure-level function for visualizing statistical relationships. By default, relplot() produces a
scatter plot:
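A minimal sketch of such a call. The DataFrame is made up, and the column names Views, Upvotes and Tag are assumptions modeled on the variables the text mentions; the real dataset's schema may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the upvotes dataset (hypothetical column names).
rng = np.random.default_rng(6)
views = rng.integers(100, 10_000, size=200)
df = pd.DataFrame({
    "Views": views,
    "Upvotes": (views * rng.uniform(0.01, 0.05, size=200)).astype(int),
    "Tag": rng.choice(list("abc"), size=200),
})

# kind="scatter" is the default, so this draws one point per row.
g = sns.relplot(x="Views", y="Upvotes", data=df)
```

relplot() returns a FacetGrid, so the underlying matplotlib axes are available as g.ax for further tweaks such as titles or limits.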
sns.relplot using Seaborn

sns.relplot is the relplot function of the seaborn module; sns is the conventional alias under
which we imported seaborn above with the other dependencies.

The parameters x, y, and data represent the variable on the X-axis, the variable on the Y-axis
and the data we are plotting, respectively. Here, we’ve found a relationship between
the views and upvotes.

Next, if we want to see the tag associated with the data, we can use the below code:
Hue Plot

We can add another dimension in our plot with the help of hue as it gives color to the points and
each color has some meaning attached to it.

In the above plot, the hue semantic is categorical. That’s why it has a different color palette. If
the hue semantic is numeric, then the coloring becomes sequential.
We can also map the size of each point to a variable with the size parameter.
We can also set the range of point sizes manually with another parameter, sizes, e.g. sizes=(15, 200).
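The hue, size and sizes parameters can be combined in one call. As before, the data is made up and the column names (including Answers, used here for the point size) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the upvotes dataset (hypothetical column names).
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Views": rng.integers(100, 10_000, size=200),
    "Upvotes": rng.integers(0, 500, size=200),
    "Answers": rng.integers(1, 20, size=200),
    "Tag": rng.choice(list("abc"), size=200),  # categorical, so hue uses distinct colors
})

g = sns.relplot(x="Views", y="Upvotes", hue="Tag",
                size="Answers", sizes=(15, 200), data=df)

# Marker areas are scaled into the requested (15, 200) range.
marker_sizes = g.ax.collections[0].get_sizes()
```

With sizes=(15, 200) the smallest value of the size variable gets a marker area of 15 and the largest gets 200, which keeps small points visible.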

Plotting Categorical Data

 Jitter
 Hue
 Boxplot
 Violin Plot
 Pointplot

In the above section, we saw how we can use different visual representations to show the
relationship between multiple variables. We drew the plots between two numeric variables. In
this section, we’ll see the relationship between two variables of which one is categorical
(divided into different groups).

We’ll be using the catplot() function of the seaborn library to draw the plots of categorical
data. Let’s dive in.
Jitter Plot

For the jitter plot we’ll use another dataset, from the HR analytics challenge. Let’s
import the dataset now.

Now, we’ll plot the columns education and avg_training_score using the
catplot() function.


By default the points are spread out sideways within each category. This spread is called
jitter: a small random offset along the categorical axis that keeps points with similar values
from hiding each other. It does not change the values themselves. To line the points up on a
single axis per category, we can turn it off by passing jitter=False.
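A sketch of both variants. The hr DataFrame is made up; only the column names education and avg_training_score come from the text, and the category labels are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the HR dataset (category labels are hypothetical).
rng = np.random.default_rng(8)
hr = pd.DataFrame({
    "education": np.repeat(["Below Secondary", "Bachelor's", "Master's & above"], 50),
    "avg_training_score": rng.integers(40, 100, size=150),
})

# Default strip plot: points get a small random sideways offset (jitter).
g_jitter = sns.catplot(x="education", y="avg_training_score", data=hr)

# jitter=False: every point sits exactly on its category position.
g_plain = sns.catplot(x="education", y="avg_training_score", jitter=False, data=hr)
```

Without jitter, points with equal scores are drawn on top of each other, so the default is usually the better choice for larger samples.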

Hue Plot

Next, if we want to introduce another variable or another dimension in our plot, we can use
the hue parameter just like we used in the above section. Let’s say we want to see the gender
distribution in the plot of education and avg_training_score. To do that, we can use the
following code:
In the above plots, the points overlap each other. To avoid this, we can set kind="swarm".
A swarm plot uses an algorithm that adjusts the points along the categorical axis so that they
do not overlap. Let’s see how it looks.
Pretty amazing, right? What if we want to see the swarmed version of the plot as well as a third
dimension? Let’s see how it goes if we introduce is_promoted as a new variable.
Clearly people with higher scores got a promotion.
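A sketch of the hue and swarm variants on made-up data. The column names follow the text; the values are random, so the plot will not reproduce the promotion pattern described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the HR dataset (values and labels are hypothetical).
rng = np.random.default_rng(9)
hr = pd.DataFrame({
    "education": np.repeat(["Below Secondary", "Bachelor's", "Master's & above"], 20),
    "gender": np.tile(["m", "f"], 30),
    "is_promoted": np.tile([0, 0, 1], 20),
    "avg_training_score": rng.integers(40, 100, size=60),
})

# hue adds gender as a third variable, encoded as color.
g_hue = sns.catplot(x="education", y="avg_training_score", hue="gender", data=hr)

# kind="swarm" spreads the points so that none overlap; hue works the same way.
g_swarm = sns.catplot(x="education", y="avg_training_score",
                      hue="is_promoted", kind="swarm", data=hr)
```

Swarm plots place every point individually, so they become slow and crowded on large datasets; sampling the data first is a common workaround.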

Boxplot using seaborn

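Another kind of plot we can draw is a boxplot, which shows the three quartile values of the
distribution along with the extreme values. The whiskers extend to points that lie within 1.5
IQRs of the lower and upper quartile, and observations beyond that range are drawn individually,
so each value in the boxplot corresponds to an actual observation in the data. Let’s draw the
boxplot now.
When we use the hue semantic with a boxplot, the box for each level of the hue variable is
moved along the categorical axis so they don’t
overlap. The boxplot with hue would look like this: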
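Both box plots can be sketched with catplot() and kind="box". The data is made up and only the column names follow the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the HR dataset (values and labels are hypothetical).
rng = np.random.default_rng(10)
hr = pd.DataFrame({
    "education": np.repeat(["Below Secondary", "Bachelor's", "Master's & above"], 50),
    "is_promoted": np.tile([0, 1], 75),
    "avg_training_score": rng.integers(40, 100, size=150),
})

# One box per education level.
g_box = sns.catplot(x="education", y="avg_training_score", kind="box", data=hr)

# With hue, each education level gets one box per is_promoted value, side by side.
g_box_hue = sns.catplot(x="education", y="avg_training_score",
                        hue="is_promoted", kind="box", data=hr)
```

The side-by-side placement is controlled by the dodge parameter shown in the boxplot signature earlier in these notes.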
Violin Plot using seaborn

We can also represent the above variables differently by using violin plots. Let’s try it out.
A violin plot combines the boxplot with a kernel density estimate to provide a richer
description of the distribution of values. The quartile values are displayed inside the violin. We
can also split the violin when the hue variable has only two levels, which
helps save space on the plot. Let’s look at the violin plot with a split of levels.
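A sketch of the plain and split violin plots on made-up data. split=True is only meaningful here because the hypothetical is_promoted column has exactly two levels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the HR dataset (values and labels are hypothetical).
rng = np.random.default_rng(11)
hr = pd.DataFrame({
    "education": np.repeat(["Below Secondary", "Bachelor's", "Master's & above"], 50),
    "is_promoted": np.tile([0, 1], 75),  # two levels, present in every education group
    "avg_training_score": rng.normal(65, 10, size=150),
})

# One violin per education level.
g_violin = sns.catplot(x="education", y="avg_training_score", kind="violin", data=hr)

# split=True draws the two hue levels as the two halves of a single violin.
g_split = sns.catplot(x="education", y="avg_training_score", hue="is_promoted",
                      kind="violin", split=True, data=hr)
```

The split form makes it easy to compare the two groups' distributions within each category without doubling the number of violins.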
These amazing plots are the reason why I started using seaborn. It gives you a lot of options to
display the data. Next in line is the bar plot.

Bar plot using seaborn

A bar plot operates on the full dataset and shows the mean value of each category by default.
Let’s try it now.
Pointplot using seaborn

Another type of plot is the pointplot, which shows a point estimate and its
confidence interval. A pointplot connects points from the same hue category. This helps in
identifying how the relationship changes within a particular hue category. You can see how
a pointplot displays the information below.
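The two estimate-based plots, bar and point, can be sketched together. The data is made up and only the column names follow the text; both kinds show the mean of each group by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the HR dataset (values and labels are hypothetical).
rng = np.random.default_rng(12)
hr = pd.DataFrame({
    "education": np.repeat(["Below Secondary", "Bachelor's", "Master's & above"], 50),
    "is_promoted": np.tile([0, 1], 75),
    "avg_training_score": rng.integers(40, 100, size=150),
})

# Bar plot: bar height is the mean score per education level, with an error bar.
g_bar = sns.catplot(x="education", y="avg_training_score", kind="bar", data=hr)

# Point plot: one connected line of means per is_promoted level.
g_point = sns.catplot(x="education", y="avg_training_score",
                      hue="is_promoted", kind="point", data=hr)
```

Because the lines join the means of the same hue level, a steeper line for one level than the other indicates that the relationship differs between the two groups.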
As the above plot shows, employees with higher scores are more likely to get a
promotion.

This is not the end. Seaborn is a huge library with a lot of plotting functions for different
purposes. One such purpose is to introduce multiple dimensions. We can visualize higher
dimension relationships as well. Let’s check it out using a swarm plot.

Swarm plot using seaborn


It becomes so easy to visualize the insights when we combine multiple concepts into one. Here
the swarm plot uses the is_promoted attribute as the hue semantic and the gender attribute as a
faceting variable.

Visualizing the Distribution of a Dataset

Whenever we are dealing with a dataset, we want to know how the data or the variables are
being distributed. Distribution of data could tell us a lot about the nature of the data, so let’s dive
into it.

Plotting Univariate Distributions

 Histogram

One of the most common plots you’ll come across while examining the distribution of a variable
is distplot. By default, the distplot() function draws a histogram and fits a Kernel Density
Estimate. (distplot() is deprecated since seaborn 0.11, where histplot() and displot() replace it.)
Let’s check out how age is distributed across the data.
This clearly shows that the majority of people are in their late twenties and early thirties.

Histogram using Seaborn

Another kind of plot that we use for univariate distribution is a histogram.

A histogram represents the distribution of data in the form of bins and uses bars to show the
number of observations falling under each bin. We can also add a rugplot to it instead of using
KDE (Kernel Density Estimate), which draws a small vertical tick at every
observation.
 

Plotting Bivariate Distributions

• Hexplot
• KDE plot
• Boxen plot
• Ridge plot (Joyplot)

Apart from visualizing the distribution of a single variable, we can see how two variables are
distributed with respect to each other. A bivariate distribution is the joint distribution of
two variables, and to visualize it we use the jointplot() function of the seaborn library. By
default, jointplot draws a scatter plot. Let’s check out the bivariate distribution between age
and avg_training_score.
There are multiple ways to visualize a bivariate distribution. Let’s look at a couple more.

Hexplot using Seaborn

Hexplot is a bivariate analog of the histogram, as it shows the number of observations that fall
within hexagonal bins. It is a plot that copes well with large datasets. To draw a
hexplot, we’ll set the kind attribute to hex. Let’s check it out now.
 

KDE Plot using Seaborn

That’s not the end of this; next comes the KDE plot. It’s another very useful method to visualize
the bivariate distribution. Let’s see how the above observations could also be shown by
using the jointplot() function and setting the kind attribute to kde.
 

Heatmaps using Seaborn

Now let’s talk about my absolute favorite plot, the heatmap. Heatmaps are graphical
representations of a matrix in which each value is represented as a color.

Let’s go ahead and generate one:
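
One common way to generate a heatmap is to plot a correlation matrix. The sketch below does that for a synthetic data frame; the column names are hypothetical, and the choice of a correlation matrix as input is an assumption, since the original figure is not available here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric columns (names are assumptions for this sketch).
rng = np.random.default_rng(4)
n = 200
age = rng.integers(22, 60, size=n)
df = pd.DataFrame({
    "age": age,
    "length_of_service": age - 21 + rng.integers(0, 3, size=n),
    "avg_training_score": rng.normal(65, 10, size=n),
})

# A square matrix with meaningful row and column labels.
corr = df.corr()

# Each cell is coloured by its value; annot=True writes the number in the cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```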


Boxen Plot using Seaborn

Another plot that we can use to show the bivariate distribution is the boxen plot. Boxen plots
were originally named letter-value plots because they show a large number of quantiles of a
variable, and these quantiles are also known as letter values. Plotting a large number of
quantiles provides more insight into the shape of the distribution. Boxen plots are similar to
box plots, so let’s see how they can be used.
Ridge Plot using seaborn

The next plot is quite fascinating. It’s called ridge plot. It is also called joyplot. Ridge plot helps
in visualizing the distribution of a numeric value for several groups. These distributions could be
represented by using KDE plots or histograms. Now, let’s try to plot a ridge plot for age with
respect to gender.
 

Visualizing Pairwise Relationships in a Dataset

We can also plot multiple bivariate distributions in a dataset by using the pairplot() function
of the seaborn library. This shows the relationship between each pair of numeric columns in the
dataset. It also draws the univariate distribution of each variable on the diagonal axes. Let’s
see how it looks.
 

End Notes

We’ve covered a lot of plots here. We saw how the seaborn library can be so effective when it
comes to visualizing and exploring data (especially large datasets). We also discussed how we
can plot different functions of the seaborn library for different kinds of data.

Like I mentioned earlier, the best way to learn seaborn (or any concept or library) is by
practicing it. The more you generate new visualizations on your own, the more confident you’ll
become. Go ahead and try your hand at any practice problem on the DataHack platform and
start becoming a data visualization master!
https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-1/

Seaborn Library for Data Visualization in Python: Part 1


By Usman Malik

Introduction

In the previous article, we looked at how Python's Matplotlib library can be
used for data visualization. In this article we will look at Seaborn which is
another extremely useful library for data visualization in Python. The Seaborn
library is built on top of Matplotlib and offers many advanced data
visualization capabilities.

Though the Seaborn library can be used to draw a variety of charts such as
matrix plots, grid plots, regression plots etc., in this article we will see how
the Seaborn library can be used to draw distributional and categorical plots. In
the second part of the series, we will see how to draw regression plots,
matrix plots, and grid plots.

Downloading the Seaborn Library

The seaborn library can be downloaded in a couple of ways. If you are using
the pip installer for Python libraries, you can execute the following command to
download the library:

pip install seaborn

Alternatively, if you are using the Anaconda distribution of Python, you can
execute the following command to download the seaborn library:

conda install seaborn

The Dataset

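The dataset that we are going to use to draw our plots will be the Titanic
dataset, which is one of the example datasets available through the Seaborn
library. All you have to do is use the load_dataset function and pass it the
name of the dataset. The function fetches the data from Seaborn's online
example-data repository, so it needs an internet connection the first time it
is called.
Let's see what the Titanic dataset looks like. Execute the following script: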

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('titanic')

dataset.head()

The script above loads the Titanic dataset and displays the first five rows of
the dataset using the head function. The output looks like this:

The dataset contains 891 rows and 15 columns and contains information
about the passengers who boarded the unfortunate Titanic ship. The original
task is to predict whether or not the passenger survived depending upon
different features such as their age, ticket, cabin they boarded, the class of
the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
Distributional Plots

Distributional plots, as the name suggests, are plots that show the
statistical distribution of data. In this section we will see some of the most
commonly used distribution plots in Seaborn.

The Dist Plot

The distplot() function shows the histogram of the data in a single column.
The column is passed as a parameter to the distplot() function. Let's
see how the price of the ticket for each passenger is distributed. Execute the
following script:

sns.distplot(dataset['fare'])

Output:

You can see that most of the tickets were sold for a fare between 0 and 50.
The line that you see represents the kernel density estimation. You can
remove this line by passing False as the value for the kde parameter as
shown below:
sns.distplot(dataset['fare'], kde=False)

Output:

Now you can see there is no line for the kernel density estimation on the
plot.

You can also pass a value for the bins parameter in order to see more or
less detail in the graph. Take a look at the following script:

sns.distplot(dataset['fare'], kde=False, bins=10)

Here we set the number of bins to 10. In the output, you will see data
distributed in 10 bins as shown below:

Output:
You can clearly see that for more than 700 passengers, the ticket price is
between 0 and 50.

The Joint Plot

The jointplot() function is used to display the joint distribution of two
columns. You need to pass three parameters to jointplot. The first parameter is the
name of the column whose distribution you want to display on the x-axis.
The second parameter is the name of the column whose distribution you want to
display on the y-axis. Finally, the third parameter is the name of the
data frame.

Let's plot a joint plot of age and fare columns to see if we can find any
relationship between the two.

sns.jointplot(x='age', y='fare', data=dataset)

Output:
From the output, you can see that a joint plot has three parts: a distribution
plot at the top for the column on the x-axis, a distribution plot on the right for
the column on the y-axis, and a scatter plot in between that shows the joint
distribution of data for both the columns. You can see that there is no
correlation observed between the ages and the fares.

You can change the type of the joint plot by passing a value for
the kind parameter. For instance, if instead of scatter plot, you want to
display the distribution of data in the form of a hexagonal plot, you can pass
the value hex for the kind parameter. Look at the following script:
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')

Output:

In the hexagonal plot, the hexagon with the largest number of points gets the
darkest color. So if you look at the above plot, you can see that most of the
passengers are between age 20 and 30 and most of them paid between 10 and 50
for the tickets.
The Pair Plot

The pairplot() function draws a type of distribution plot that basically plots
a joint plot for all the possible combinations of numeric and Boolean columns
in your dataset. You only need to pass your dataset as the parameter to
the pairplot() function as shown below:

sns.pairplot(dataset)

A snapshot of the portion of the output is shown below:


Note: Before executing the script above, remove all null values from the
dataset using the following command:

dataset = dataset.dropna()

From the output of the pair plot you can see the joint plots for all the numeric
and Boolean columns in the Titanic dataset.
To add information from the categorical column to the pair plot, you can pass
the name of the categorical column to the hue parameter. For instance, if
we want to plot the gender information on the pair plot, we can execute the
following script:

sns.pairplot(dataset, hue='sex')

Output:
In the output you can see the information about the males in orange and the
information about the female in blue (as shown in the legend). From the joint
plot on the top left, you can clearly see that among the surviving passengers,
the majority were female.

The Rug Plot


The rugplot() is used to draw small bars along x-axis for each point in the
dataset. To plot a rug plot, you need to pass the name of the column. Let's
plot a rug plot for fare.

sns.rugplot(dataset['fare'])

Output:

From the output, you can see that as was the case with the distplot(), most
of the instances for the fares have values between 0 and 100.

These are some of the most commonly used distribution plots offered by
Python's Seaborn library. Let's see some of the categorical plots in the
Seaborn library.

Categorical Plots

Categorical plots, as the name suggests, are normally used to plot
categorical data. The categorical plots plot the values in the categorical
column against another categorical column or a numeric column. Let's see
some of the most commonly used categorical plots.

The Bar Plot

The barplot() function is used to display the mean of a numeric column for
each value in a categorical column. The first parameter is the
categorical column, the second parameter is the numeric column while the
third parameter is the dataset. For instance, if you want to know the mean
value of the age of the male and female passengers, you can use the bar plot
as follows.

sns.barplot(x='sex', y='age', data=dataset)

Output:

From the output, you can clearly see that the average age of male
passengers is just less than 40 while the average age of female passengers
is around 33.
In addition to finding the average, the bar plot can also be used to calculate
other aggregate values for each category. To do so, you need to pass the
aggregate function to the estimator parameter. For instance, you can calculate
the standard deviation for the age of each gender as follows:

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)

Notice, in the above script we use the std aggregate function from
the numpy library to calculate the standard deviation for the ages of male and
female passengers. The output looks like this:
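
The bar heights correspond to a per-category aggregate, which can be checked with a plain pandas groupby. The sketch below uses a tiny hand-made data frame (not the Titanic data) so that the numbers are easy to verify. It also shows that NumPy's std defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1), which is worth keeping in mind when comparing numbers.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example (an assumption for this sketch, not Titanic data).
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "age": [20, 30, 40, 10, 30],
})

# Mean age per category: the quantity a default bar plot displays.
mean_age = df.groupby("sex")["age"].mean()

# Standard deviation per category, pandas default (sample, ddof=1).
std_age = df.groupby("sex")["age"].std()

# NumPy's default is the population standard deviation (ddof=0).
male_std_population = np.std(df.loc[df["sex"] == "male", "age"].to_numpy())

print(mean_age)
print(std_age)
print(male_std_population)
```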
The Count Plot

The count plot is similar to the bar plot, however it displays the count of the
categories in a specific column. For instance, if we want to count the number
of male and female passengers we can do so using the count plot as follows:

sns.countplot(x='sex', data=dataset)

The output shows the count as follows:

The Box Plot

The box plot is used to display the distribution of a numeric column for each
category in the form of quartiles. The line inside the box shows the median
value. The range from the lower whisker to the bottom of the box shows the
first quartile. From the bottom of the box to the median line lies the second
quartile. From the median line to the top of the box lies the third quartile
and finally from the top of the box to the top whisker lies the last quartile.

You can study more about quartiles and box plots at this link.

Now let's plot a box plot that displays the distribution for the age with
respect to each gender. You need to pass the categorical column as the first
parameter (which is sex in our case) and the numeric column (age in our
case) as the second parameter. Finally, the dataset is passed as the third
parameter. Take a look at the following script:

sns.boxplot(x='sex', y='age', data=dataset)

Output:
Let's try to understand the box plot for females. The first quartile starts at
around 5 and ends at 22, which means that 25% of the female passengers are aged
between 5 and 22. The second quartile starts at around 23 and ends at
around 32, which means that 25% of the female passengers are aged between 23
and 32. Similarly, the third quartile starts and ends between 34 and 42, hence
25% of the female passengers are aged within this range, and finally the fourth
or last quartile starts at 43 and ends around 65.

Passengers whose ages fall outside the whiskers, and so do not belong to any of
the quartiles, are called outliers and are represented by dots on the box
plot.
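
The numbers behind a box plot can be reproduced with NumPy. The sketch below uses a small hand-made list of ages (an assumption, not the Titanic data) and the common rule that points farther than 1.5 times the interquartile range from the box are drawn as outliers, which is the default whisker rule in Matplotlib and Seaborn.

```python
import numpy as np

# Hand-made ages, chosen so the quartiles are easy to verify by eye.
ages = np.array([2, 20, 22, 24, 28, 32, 36, 40, 75])

# Box edges and the median line.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points beyond these limits are drawn as individual dots.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(q1, median, q3)            # 22.0 28.0 36.0
print(lower_limit, upper_limit)  # 1.0 57.0
print(outliers)                  # [75]
```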

You can make your box plots fancier by adding another layer of
distribution. For instance, if you want to see the box plots for the age of
passengers of both genders, along with the information about whether or not
they survived, you can pass survived as the value of the hue parameter as
shown below:

sns.boxplot(x='sex', y='age', data=dataset, hue="survived")

Output:
Now in addition to the information about the age of each gender, you can
also see the distribution of the passengers who survived. For instance, you
can see that among the male passengers, on average more younger people
survived as compared to the older ones. Similarly, you can see that the
variation among the age of female passengers who did not survive is much
greater than the age of the surviving female passengers.

The Violin Plot

The violin plot is similar to the box plot, however, instead of only
summarizing the data with quartiles, the violin plot shows the estimated density
of the data across the whole range of values.
The violinplot() function is used to plot the violin plot. Like the box plot,
the first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset.

Let's plot a violin plot that displays the distribution for the age with respect
to each gender.

sns.violinplot(x='sex', y='age', data=dataset)


Output:

You can see from the figure above that violin plots provide much more
information about the data as compared to the box plot. Instead of plotting
only the quartiles, the violin plot shows the estimated density of the data at
every value. The area where the violin plot is thicker has a higher
number of instances for the age. For instance, from the violin plot for males,
it is clearly evident that the number of passengers with age between 20 and
40 is higher than in all the other age brackets.

Like box plots, you can also add another categorical variable to the violin
plot using the hue parameter as shown below:

sns.violinplot(x='sex', y='age', data=dataset, hue='survived')


Now you can see a lot of information on the violin plot. For instance, if you
look at the bottom of the violin plot for the males who survived (left-orange),
you can see that it is thicker than the bottom of the violin plot for the males
who didn't survive (left-blue). This means that the number of young male
passengers who survived is greater than the number of young male
passengers who did not survive. The violin plots convey a lot of information,
however, on the downside, it takes a bit of time and effort to understand the
violin plots.

Instead of plotting two different graphs for the passengers who survived and
those who did not, you can have one violin plot divided into two halves,
where one half represents surviving while the other half represents the non-
surviving passengers. To do so, you need to pass True as value for
the split parameter of the violinplot() function. Let's see how we can do
this:

sns.violinplot(x='sex', y='age', data=dataset, hue='survived', split=True)

The output looks like this:


Now you can clearly see the comparison between the age of the passengers
who survived and who did not for both males and females.

Both violin and box plots can be extremely useful. However, as a rule of
thumb if you are presenting your data to a non-technical audience, box plots
should be preferred since they are easy to comprehend. On the other hand, if
you are presenting your results to the research community it is more
convenient to use violin plot to save space and to convey more information in
less time.

The Strip Plot

The strip plot draws a scatter plot where one of the variables is categorical.
We have seen scatter plots in the joint plot and the pair plot sections where
we had two numeric variables. The strip plot is different in a way that one of
the variables is categorical in this case, and for each category in the
categorical variable, you will see scatter plot with respect to the numeric
column.
The stripplot() function is used to plot the strip plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset. Look at the
following script:

sns.stripplot(x='sex', y='age', data=dataset)

Output:

You can see the scattered plots of age for both males and females. The data
points look like strips. It is difficult to comprehend the distribution of data in
this form. To better comprehend the data, pass True for
the jitter parameter which adds some random noise to the data. Look at
the following script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True)

Output:
Now you have a better view for the distribution of age across the genders.

Like violin and box plots, you can add an additional categorical column to
strip plot using hue parameter as shown below:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived')


Again you can see there are more points for the males who survived near the
bottom of the plot compared to those who did not survive.

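Like violin plots, we can also separate the strip plots by the hue variable.
Older Seaborn releases exposed this through a split parameter of stripplot(),
while current releases call the parameter dodge. Execute the following
script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived', dodge=True)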

Output:
Now you can clearly see the difference in the distribution for the age of both
male and female passengers who survived and those who did not survive.

The Swarm Plot

The swarm plot is a combination of the strip and the violin plots. In the
swarm plots, the points are adjusted in such a way that they don't overlap.
Let's plot a swarm plot for the distribution of age against gender.
The swarmplot() function is used to plot the swarm plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset. Look at the
following script:

sns.swarmplot(x='sex', y='age', data=dataset)


You can clearly see that the above plot contains scattered data points like
the strip plot and the data points are not overlapping. Rather they are
arranged to give a view similar to that of a violin plot.

Let's add another categorical column to the swarm plot using
the hue parameter.

sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')

Output:
From the output, it is evident that the ratio of surviving males is less than the
ratio of surviving females, since for the male plot there are more blue points
and fewer orange points. On the other hand, for females, there are more
orange points (surviving) than blue points (not surviving). Another
observation is that amongst males of age less than 10, more passengers
survived as compared to those who didn't.

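We can also separate swarm plots by the hue variable as we did in the case of
strip and violin plots. As with stripplot(), current Seaborn releases call the
parameter dodge, while older releases called it split. Execute the following
script to do so:

sns.swarmplot(x='sex', y='age', data=dataset, hue='survived', dodge=True)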

Output:
Now you can clearly see that more women survived, as compared to men.

Combining Swarm and Violin Plots

Swarm plots are not recommended if you have a huge dataset since they do
not scale well because they have to plot each data point. If you really like
swarm plots, a better way is to combine two plots. For instance, to combine
a violin plot with swarm plot, you need to execute the following script:

sns.violinplot(x='sex', y='age', data=dataset)

sns.swarmplot(x='sex', y='age', data=dataset, color='black')

Output:
Conclusion

Seaborn is an advanced data visualization library built on top of the Matplotlib
library. In this article, we looked at how we can draw distributional and
categorical plots using the Seaborn library. This is Part 1 of the series of
articles on Seaborn. In the second article of the series, we will see how we
can play around with grid functionalities in Seaborn and how we can draw Matrix
and Regression plots in Seaborn.
https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-2/

Seaborn Library for Data Visualization in Python: Part 2


By Usman Malik

In the previous article, Seaborn Library for Data Visualization in Python: Part
1, we looked at how the Seaborn library is used to plot distributional and
categorical plots. In this article we will continue our discussion and will see
some of the other functionalities offered by Seaborn to draw different types
of plots. We will start our discussion with Matrix Plots.

Matrix Plots

Matrix plots are the type of plots that show data in the form of rows and
columns. Heat maps are the prime examples of matrix plots.

Heat Maps

Heat maps are normally used to plot the correlation between numeric columns in
the form of a matrix. It is important to mention here that to draw matrix
plots, you need to have meaningful information on rows as well as columns.
Continuing with the theme from the last article, let's display the first five
rows of the Titanic dataset to see if both the rows and column headers have
meaningful information. Execute the following script:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('titanic')

dataset.head()
In the output, you will see the following result:

From the output, you can see that the column headers contain useful
information such as passengers survived, their age, fare etc. However the
row headers only contain the indexes 0, 1, 2, etc. To plot matrix plots, we need
useful information on both columns and row headers. One way to do this is to
call the corr() method on the dataset. The corr() method returns the
correlation between all the numeric columns of the dataset. On recent pandas
releases the non-numeric columns have to be excluded explicitly, which is what
numeric_only=True does. Execute the following script:

dataset.corr(numeric_only=True)

In the output, you will see that both the columns and the rows have
meaningful header information, as shown below:
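
The point about headers can be checked on a small hand-made frame. In the sketch below (the values are illustrative assumptions, not rows of the real dataset), the correlation matrix carries the column names on both its rows and its columns, and the text column is left out because only numeric columns are selected.

```python
import numpy as np
import pandas as pd

# Small hand-made frame (illustrative values, an assumption for this sketch).
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86],
    "survived": [0, 1, 1, 1, 0],
    "sex": ["male", "female", "female", "female", "male"],
})

# Keep only numeric columns, then correlate every pair of them.
corr = df.select_dtypes(include="number").corr()

print(corr.index.tolist())    # row headers
print(corr.columns.tolist())  # column headers
```

Each column correlates perfectly with itself, so the diagonal of the matrix is 1.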
Now to create a heat map with these correlation values, you need to call
the heatmap() function and pass it your correlation dataframe. Look at the
following script:

corr = dataset.corr(numeric_only=True)

sns.heatmap(corr)

The output looks like this:


From the output, it can be seen that what heatmap essentially does is plot a
box for every combination of row and column value. The color of the
box depends upon the color gradient. For instance, in the above image if there
is a high correlation between two features, the corresponding cell or box is
white; on the other hand, if there is no correlation, the corresponding cell
remains black.

The correlation values can also be plotted on the heatmap by
passing True for the annot parameter. Execute the following script to see
this in action:

corr = dataset.corr(numeric_only=True)

sns.heatmap(corr, annot=True)

Output:
You can also change the color of the heatmap by passing an argument for
the cmap parameter. For now, just look at the following script:

corr = dataset.corr(numeric_only=True)

sns.heatmap(corr, cmap='winter')

The output looks like this:


In addition to simply using correlation between all the columns, you can also
use pivot_table function to specify the index, the column and the values
that you want to see corresponding to the index and the columns. To
see pivot_table function in action, we will use the "flights" data set that
contains the information about the year, the month and the number of
passengers that traveled in that month.

Execute the following script to import the data set and to see the first five
rows of the dataset:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns


dataset = sns.load_dataset('flights')

dataset.head()

Output:

Now using the pivot_table function, we can create a heat map that displays
the number of passengers that traveled in a specific month of a specific year.
To do so, we will pass month as the value for the index parameter. The
index parameter corresponds to the rows. Next we need to pass year as the value
for the columns parameter. And finally for the values parameter, we will pass
the passengers column. Execute the following script:

data = dataset.pivot_table(index='month', columns='year', values='passengers')

sns.heatmap(data)

The output looks like this:


It is evident from the output that in the early years the number of passengers
who took the flights was less. As the years progress, the number of
passengers increases.
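
What pivot_table does to the data can be seen on a four-row example. The frame below is a hand-made miniature in the same long format (year, month, passengers), used here as an assumption in place of the full flights dataset.

```python
import pandas as pd

# Long format: one row per (year, month) observation.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide format: months become the rows, years become the columns.
wide = long_df.pivot_table(index="month", columns="year", values="passengers")

print(wide)
print(wide.loc["Jan", 1950])  # passengers in January 1950
```

The wide frame has meaningful labels on both axes, which is the shape that heatmap() expects.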

Currently, you can see that the distinction between the boundaries of the
cells is not very clear. To create a clear boundary between the cells, you can
make use of the linecolor and linewidths parameters. Take a look at the
following script:

data = dataset.pivot_table(index='month', columns='year', values='passengers')

sns.heatmap(data, linecolor='blue', linewidths=1)

In the script above, we passed "blue" as the value for
the linecolor parameter, while the linewidths parameter is set to 1. In the
output you will see a blue boundary around each cell:
You can increase the value for the linewidths parameter if you want thicker
boundaries.

Cluster Map

In addition to heat map, another commonly used matrix plot is the cluster
map. The cluster map basically uses Hierarchical Clustering to cluster the
rows and columns of the matrix.

Let's plot a cluster map for the number of passengers who traveled in a
specific month of a specific year. Execute the following script:

data = dataset.pivot_table(index='month', columns='year', values='passengers')

sns.clustermap(data)
To plot a cluster map, clustermap function is used, and like the heat map
function, the dataset passed should have meaningful headers for both rows
and columns. The output of the script above looks like this:

In the output, you can see months and years clustered together on the basis
of number of passengers that traveled in a specific month.

With this, we conclude our discussion about the Matrix plots. In the next
section we will start our discussion about grid capabilities of the Seaborn
library.
Seaborn Grids

Grids in Seaborn allow us to manipulate the subplots depending upon the
features used in the plots.

Pair Grid

In Part 1 of this article series, we saw how pair plot can be used to draw
scatter plot for all possible combinations of the numeric columns in the
dataset.

Let's review the pair plot here before we move on to the pair grid. The
dataset we are going to use for the pair grid section is the "iris" dataset,
which is another of the example datasets that the load_dataset function can
fetch. Execute the following script to load the iris dataset:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('iris')

dataset.head()

The first five rows of the iris dataset look like this:
Now let's draw a pair plot on the iris dataset. Execute the following script:

sns.pairplot(dataset)

A snapshot of the output looks like this:


Now let's plot pair grid and see the difference between the pair plot and the
pair grid. To create a pair grid, you simply have to pass the dataset to
the PairGrid function, as shown below:
sns.PairGrid(dataset)

Output:
In the output, you can see empty grids. This is essentially what the pair grid
function does. It returns an empty set of grids for all the features in the
dataset.

Next, you need to call map function on the object returned by the pair grid
function and pass it the type of plot that you want to draw on the grids. Let's
plot a scatter plot using the pair grid.

grids = sns.PairGrid(dataset)

grids.map(plt.scatter)

The output looks like this:


You can see scatter plots for all the combinations of numeric columns in the
"iris" dataset.
You can also plot different types of graphs on the same pair grid. For
instance, if you want to plot a "distribution" plot on the diagonal, "kdeplot"
on the upper half of the diagonal, and "scatter" plot on the lower part of the
diagonal you can use the map_diag, map_upper, and map_lower functions,
respectively. The type of plot to be drawn is passed as the parameter to
these functions. Take a look at the following script:


grids = sns.PairGrid(dataset)

grids.map_diag(sns.distplot)

grids.map_upper(sns.kdeplot)

grids.map_lower(plt.scatter)

The output of the script above looks like this:


You can see the true power of the pair grid function from the image above.
On the diagonals we have distribution plots, on the upper half we have the
kernel density plots, while on the lower half we have the scatter plots.
Facet Grids

Facet grids draw the same kind of plot on a grid of subplots, with one subplot
for each combination of the values of one or more categorical features. Let's
plot a facet grid which shows the distribution of the age of the passengers
for each combination of gender and alive.

For this section, we will again use the Titanic dataset. Execute the following
script to load the Titanic dataset:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('titanic')

To draw a facet grid, the FacetGrid() function is used. The first parameter to
the function is the dataset, the second parameter col specifies the feature
to plot on columns while the row parameter specifies the feature on the
rows. The FacetGrid() function returns an object. Like the pair grid, you can
use the map function to specify the type of plot you want to draw.

Execute the following script:

grid = sns.FacetGrid(data=dataset, col='alive', row='sex')

grid.map(sns.distplot, 'age')

In the above script, we plot the distributional plot for age on the facet grid.
The output looks like this:
From the output, you can see four plots. One for each combination of gender
and survival of the passenger. The columns contain information about the
survival while the rows contain information about the sex, as specified by
the FacetGrid() function.

The first row and first column contain age distribution of the passengers
where sex is male and the passengers did not survive. The first row and
second column contain age distribution of the passengers where sex is male
and the passengers survived. Similarly, the second row and first column
contain age distribution of the passengers where sex is female and the
passengers did not survive while the second row and second column contain
age distribution of the passengers where sex is female and the passengers
survived.

In addition to distributional plots for one feature, we can also plot scatter
plots that involve two features on the facet grid.

For instance, the following script plots the scatter plot for age and fare for
both the genders of the passengers who survived and who didn't.

grid = sns.FacetGrid(data=dataset, col='alive', row='sex')

grid.map(plt.scatter, 'age', 'fare')

The output of the script above looks like this:


Regression Plots

Regression plots, as the name suggests, are used to perform regression
analysis between two or more variables.

In this section, we will study the linear model plot that plots a linear
relationship between two variables along with the best-fit regression line
depending upon the data.
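
To make the idea of a best-fit line concrete, here is a small sketch of an ordinary least-squares fit written in plain Python. It assumes the default settings of the linear model plot, where a straight line is fitted; the fit_line helper and the carat and price values are made up for illustration and are not part of Seaborn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values, only to show the call.
carats = [0.2, 0.5, 1.0, 1.5, 2.0]
prices = [400, 1500, 5000, 9000, 15000]

slope, intercept = fit_line(carats, prices)
print(slope, intercept)
```

A positive slope means that the fitted price rises with the carat weight, which is the relationship the regression plot lets you read off visually.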
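
The dataset that we are going to use for this section is the "diamonds"
dataset, one of the example datasets that sns.load_dataset() fetches from
Seaborn's online data repository, so an internet connection is needed the
first time it is loaded. Execute the following script to load the dataset: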

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('diamonds')

dataset.head()

The dataset looks like this:

The dataset contains different features of a diamond such as weight in
carats, color, clarity, price, etc.

Let's plot a linear relationship between the carat and the price of the
diamond. Ideally, the heavier the diamond is, the higher the price should be.
Let's see if this is actually true based on the information available in the
diamonds dataset.

To plot the linear model, the lmplot() function is used. The x parameter is
the feature you want to plot on the x-axis, while the y parameter is the
feature you want to plot on the y-axis. The data parameter is the dataset.
Execute the following script:

sns.lmplot(x='carat', y='price', data=dataset)

The output looks like this:

You can also plot multiple linear models based on a categorical feature. The
feature name is passed as value to the hue parameter. For instance, if you
want to plot multiple linear models for the relationship between carat and
price feature, based on the cut of the diamond, you can use lmplot function
as follows:

sns.lmplot(x='carat', y='price', data=dataset, hue='cut')

The output looks like this:

From the output, you can see that the linear relationship between the carat
and the price of the diamond is steepest for the ideal cut diamond, as
expected, and shallowest for the fair cut diamond.

In addition to plotting the data for the cut feature with different hues, we
can also have one plot for each cut. To do so, you need to pass the column
name to the col parameter. Take a look at the following script:

sns.lmplot(x='carat', y='price', data=dataset, col='cut')

In the output, you will see a separate column for each value in the cut
column of the diamonds dataset as shown below:

You can also change the size and aspect ratio of the plots using the height
and aspect parameters (height was called size before Seaborn 0.9). Take a
look at the following script:

sns.lmplot(x='carat', y='price', data=dataset, col='cut', aspect=0.5,
           height=8)

The aspect parameter defines the ratio of width to height for each plot in
the grid. An aspect ratio of 0.5 means that the width is half of the height,
as shown in the output.
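
The arithmetic behind these two parameters can be sketched in a few lines. The facet_size helper below is hypothetical and only mirrors the rule that the width of each plot is its height multiplied by the aspect ratio; the total width is approximate because Seaborn also adds some space for labels.

```python
def facet_size(height, aspect):
    """Return (width, height) in inches of a single plot in the grid."""
    return height * aspect, height

width, height = facet_size(height=8, aspect=0.5)

# The 'cut' column has five categories, so col='cut' gives five plots side by side.
n_facets = 5
approx_total_width = n_facets * width

print(width, height, approx_total_width)
```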

You can see that although the size of the plot has changed, the font size is
still very small. In the next section, we will see how to control the fonts
and styles of the Seaborn plots.

Plot Styling

The Seaborn library comes with a variety of styling options. In this section,
we will see some of them.

Set Style

The set_style() function sets the overall look of the plot background and
grid. You can pass darkgrid, whitegrid, dark, white or ticks as the argument
to the set_style() function.

For this section, we will again use the Titanic dataset, so reload it first
with dataset = sns.load_dataset('titanic'). Execute the following script to
see the darkgrid style.

sns.set_style('darkgrid')

sns.histplot(dataset['fare'], kde=True)

The output looks like this:

In the output, you can see that we have a dark background with grid lines.
Let's see what whitegrid looks like. Execute the following script:

sns.set_style('whitegrid')

sns.histplot(dataset['fare'], kde=True)

The output looks like this:


Now you can see that we still have grids in the background but the dark grey
background is not visible. I would suggest that you try and play with the rest
of the options and see which style suits you.
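
If you want to compare the styles without drawing five plots, you can inspect the settings behind each style name. This is a small sketch that uses sns.axes_style(), which returns the style settings as a dictionary; the grid_enabled variable is just a name chosen for this example.

```python
import seaborn as sns

styles = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# For each style, look up whether grid lines are switched on.
grid_enabled = {style: sns.axes_style(style)["axes.grid"] for style in styles}

for style in styles:
    print(style, grid_enabled[style])
```

Only the two styles with grid in their name switch the grid lines on; the other three differ in background color and tick marks.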

Change Figure Size

Since Seaborn uses Matplotlib functions behind the scenes, you can use
Matplotlib's pyplot package to change the figure size as shown below:

plt.figure(figsize=(8,4))

sns.histplot(dataset['fare'], kde=True)

In the script above, we set the width and height of the plot to 8 and 4 inches
respectively. The output of the script above looks like this:
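
The figsize values are given in inches, and the size in pixels follows from the resolution of the figure. The sketch below only uses Matplotlib calls; the variable names are chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))  # width and height in inches

width_in, height_in = fig.get_size_inches()
# Pixels = inches * dots per inch of the figure.
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(width_in, height_in, width_px, height_px)
```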

Set Context

Apart from plots for a notebook, you may need to create plots for posters. To
do so, you can use the set_context() function and pass it 'poster' as the
argument, as shown below:

sns.set_context('poster')

sns.histplot(dataset['fare'], kde=True)

In the output, you should see a plot with the poster specifications as shown
below. For instance, you can see that the fonts are much bigger compared to
normal plots.
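
Seaborn has four named contexts: paper, notebook, talk and poster. As a quick check of how they scale the text, the sketch below reads the base font size of each context with sns.plotting_context(); the font_sizes variable is just a name chosen for this example.

```python
import seaborn as sns

contexts = ["paper", "notebook", "talk", "poster"]

# Base font size that each context would apply to a plot.
font_sizes = {ctx: sns.plotting_context(ctx)["font.size"] for ctx in contexts}

for ctx in contexts:
    print(ctx, font_sizes[ctx])
```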

Conclusion

Seaborn is an advanced Python library for data visualization. This article is
Part 2 of the series of articles on Seaborn for Data Visualization in Python.
In this article, we saw how to plot regression and matrix plots in Seaborn.
We also saw how to change plot styles and use grid functions to manipulate
subplots. In the next article, we will see how Python's Pandas library's
built-in capabilities can be used for data visualization.

https://gilberttanner.com/blog/introduction-to-data-visualization-inpython

Introduction to Data Visualization in Python


Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.

Python offers multiple great graphing libraries that come packed with lots of different features.
Whether you want to create interactive, live or highly customized plots, Python has an
excellent library for you.

To give a brief overview, here are a few popular plotting libraries:

• Matplotlib: low level, provides lots of freedom
• Pandas Visualization: easy-to-use interface, built on Matplotlib
• Seaborn: high-level interface, great default styles
• ggplot: based on R's ggplot2, uses the Grammar of Graphics
• Plotly: can create interactive plots

In this article, we will learn how to create basic plots using Matplotlib, Pandas visualization and
Seaborn as well as how to use some specific features of each library. This article will focus on the
syntax and not on interpreting the graphs, which I will cover in another blog post.

In further articles, I will go over interactive plotting tools like Plotly, which is built on D3 and can
also be used with JavaScript.

Importing Datasets

In this article, we will use two freely available datasets, the Iris and the Wine Reviews
datasets, both of which we can load using the pandas read_csv method.


import pandas as pd
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width',
'petal_length', 'petal_width', 'class'])
print(iris.head())

Figure 2: Iris dataset head

wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)


wine_reviews.head()
Figure 3: Wine Review dataset head

Matplotlib

Matplotlib is the most popular Python plotting library. It is a low-level library with a MATLAB-like
interface which offers lots of freedom at the cost of having to write more code.

To install Matplotlib, either pip or conda can be used.

pip install matplotlib

or

conda install matplotlib

Matplotlib is especially good for creating basic graphs like line charts, bar charts, histograms and
many more. It can be imported by typing:

import matplotlib.pyplot as plt

Scatter Plot

To create a scatter plot in Matplotlib we can use the scatter method. We will also create a figure
and an axis using plt.subplots so we can give our plot a title and labels.

# create a figure and axis
fig, ax = plt.subplots()
# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
Figure 4: Matplotlib Scatter plot

We can give the graph more meaning by coloring each data point by its class. This can be done
by creating a dictionary which maps from class to color and then scattering each point on its own
using a for-loop and passing the respective color.

# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point in the color of its class
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i],
               color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
Figure 5: Scatter Plot colored by class
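
The loop above calls scatter once per data point. As an alternative sketch (not the approach used in this article), the class column can be mapped to colors in one step and passed to a single scatter call. The toy DataFrame below is made up so that the example runs without the Iris CSV file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows in the same layout as the Iris columns used above.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "sepal_width":  [3.5, 3.2, 3.3, 3.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

colors = {"Iris-setosa": "r", "Iris-versicolor": "g", "Iris-virginica": "b"}

# Translate every class label into its color in one vectorized step.
point_colors = toy["class"].map(colors)

fig, ax = plt.subplots()
ax.scatter(toy["sepal_length"], toy["sepal_width"], c=point_colors.tolist())
ax.set_title("Iris Dataset")
```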

Line Chart

In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple
columns in one graph by looping through the columns we want and plotting each column on the
same axis.

# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column, labelling it so that the legend has entries
for column in columns:
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()
Figure 6: Line Chart

Histogram

In Matplotlib we can create a histogram using the hist method. If we pass it numeric data like
the points column from the wine-review dataset, it groups the values into bins (ten by default)
and counts how many values fall into each bin.

# create figure and axis


fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_reviews['points'])
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
Figure 7: Histogram
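
To see the numbers behind such a histogram, the binning can be reproduced with NumPy. This is a small sketch with made-up scores; np.histogram uses ten equal-width bins by default, the same default bin count as the hist method.

```python
import numpy as np

# Made-up review scores, standing in for the points column.
scores = np.array([80, 82, 85, 85, 87, 88, 88, 88, 90, 92, 95, 100])

# counts[i] is the number of scores between edges[i] and edges[i + 1].
counts, edges = np.histogram(scores)

print(counts)
print(edges)
```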

Bar Chart

A bar chart can be created using the bar method. The bar method does not calculate the
frequency of a category automatically, so we are going to use the pandas value_counts function
to do this. The bar chart is useful for categorical data that doesn't have a lot of different
categories (fewer than 30), because otherwise it can get quite messy.

# create a figure and axis


fig, ax = plt.subplots()
# count the occurrence of each class
data = wine_reviews['points'].value_counts()
# get x and y data
points = data.index
frequency = data.values
# create bar chart
ax.bar(points, frequency)
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')

Figure 8: Bar-Chart

Pandas Visualization

Pandas is an open-source, high-performance, easy-to-use library providing data structures, such as
dataframes, and data analysis tools like the visualization tools we will use in this article.

Pandas Visualization makes it really easy to create plots out of a pandas dataframe and series. It
also has a higher-level API than Matplotlib and therefore we need less code for the same results.

Pandas can be installed using either pip or conda.

pip install pandas

or

conda install pandas

Scatter Plot

To create a scatter plot in Pandas we can call <dataframe>.plot.scatter() and pass it two arguments,
the name of the x-column as well as the name of the y-column. Optionally we can also pass it a
title.

iris.plot.scatter(x='sepal_length', y='sepal_width', title='Iris Dataset')

Figure 9: Scatter Plot

As you can see in the image it is automatically setting the x and y label to the column names.

Line Chart

To create a line chart in Pandas we can call <dataframe>.plot.line(). Whilst in Matplotlib we
needed to loop through each column we wanted to plot, in Pandas we don't need to do this
because it automatically plots all available numeric columns (unless we specify particular
columns).

iris.drop(['class'], axis=1).plot.line(title='Iris Dataset')

Figure 10: Line Chart

If we have more than one feature Pandas automatically creates a legend for us, as can be seen in

the image above.

Histogram

In Pandas, we can create a histogram with the plot.hist method. There aren't any required
arguments but we can optionally pass some, like the number of bins.

wine_reviews['points'].plot.hist()
Figure 11: Histogram

It’s also really easy to create multiple histograms.

iris.plot.hist(subplots=True, layout=(2,2), figsize=(10, 10), bins=20)


Figure 12: Multiple Histograms

The subplots argument specifies that we want a separate plot for each feature and the layout
argument specifies the number of rows and columns of the grid of plots.

Bar Chart
To plot a bar chart we can use the plot.bar() method, but before we can call this we need to get
our data. For this we will first count the occurrences using the value_counts() method and then
sort the result by score, from the lowest to the highest, using the sort_index() method.

wine_reviews['points'].value_counts().sort_index().plot.bar()

Figure 13: Vertical Bar-Chart
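
The difference between the two steps is easiest to see on a tiny example. The Series below is made up; the variable names by_count and by_score are chosen only for this sketch.

```python
import pandas as pd

# Made-up review scores.
points = pd.Series([88, 90, 88, 85, 90, 88])

# value_counts() orders the result by frequency, most common first.
by_count = points.value_counts()

# sort_index() reorders the same counts by the score itself.
by_score = by_count.sort_index()

print(by_count)
print(by_score)
```

Without sort_index() the bars would be ordered by how often each score occurs, not by the score.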

It’s also really simple to make a horizontal bar-chart using the plot.barh() method.

wine_reviews['points'].value_counts().sort_index().plot.barh()

Figure 14: Horizontal Bar-Chart


We can also plot other data than the number of occurrences.

wine_reviews.groupby("country").price.mean().sort_values(ascending=False)[:5].plot.bar()

Figure 15: Countries with the most expensive wine (on average)

In the example above we grouped the data by country and then took the mean of the wine prices,

ordered it, and plotted the 5 countries with the highest average wine price.
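
The same chain of steps can be followed on a small made-up table, which makes the intermediate result easy to inspect. The country names and prices below are invented for this sketch and are not taken from the wine reviews data.

```python
import pandas as pd

# Invented prices, only to show the grouping steps.
toy = pd.DataFrame({
    "country": ["France", "France", "Italy", "Italy", "Chile"],
    "price":   [100, 50, 20, 40, 10],
})

# Mean price per country, highest first, then keep the top two.
top = toy.groupby("country")["price"].mean().sort_values(ascending=False)[:2]

print(top)
```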

Seaborn

Seaborn is a Python data visualization library based on Matplotlib. It provides  a high-level

interface for creating attractive graphs.

Seaborn has a lot to offer. You can create graphs in one line that would take you dozens of
lines in Matplotlib. Its default styles look great and it also has a nice interface for working
with pandas dataframes.


It can be imported by typing:

import seaborn as sns

Scatter plot

We can use the sns.scatterplot function for creating a scatter plot, and just as in Pandas we need to
pass it the column names of the x and y data, but now we also need to pass the data as an
additional argument because we aren't calling the function on the data directly as we did in
Pandas.

sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)

Figure 16: Scatterplot

We can also highlight the points by class using the hue argument, which is a lot easier than in
Matplotlib.

sns.scatterplot(x='sepal_length', y='sepal_width', hue='class', data=iris)

Figure 17: Scatterplot colored by class

Line chart

To create a line chart the sns.lineplot function can be used. The only required argument is the
data, which in our case are the four numeric columns from the Iris dataset. We could also use the
sns.kdeplot function, which draws a smoothed density curve of each column's distribution instead
of the raw values and therefore looks cleaner if you have a lot of outliers in your dataset.

sns.lineplot(data=iris.drop(['class'], axis=1))
Figure 18: Line Chart

Histogram

To create a histogram in Seaborn we use the sns.histplot function (older tutorials use
sns.distplot, which is deprecated in current Seaborn releases). We need to pass it the column
we want to plot and it will count the occurrences itself. We can also pass it the number of
bins, and whether we want to draw a Gaussian kernel density estimate on top of the histogram.

sns.histplot(wine_reviews['points'], bins=10, kde=False)

Figure 19: Histogram

sns.histplot(wine_reviews['points'], bins=10, kde=True)

Figure 20: Histogram with Gaussian kernel density estimate

Bar chart

In Seaborn a bar chart can be created using the sns.countplot function and passing it the data.

sns.countplot(x='points', data=wine_reviews)

Figure 21: Bar-Chart


Other graphs

Now that you have a basic understanding of the Matplotlib, Pandas Visualization and Seaborn
syntax, I want to show you a few other graph types that are useful for extracting insights.

For most of them, Seaborn is the go-to library because of its high-level interface that allows for
the creation of beautiful graphs in just a few lines of code.

Box plots

A box plot is a graphical method of displaying the five-number summary. We can create box plots
using Seaborn's sns.boxplot function and passing it the data as well as the x and y column names.

df = wine_reviews[(wine_reviews['points']>=95) &
                  (wine_reviews['price']<1000)]
sns.boxplot(x='points', y='price', data=df)

Figure 22: Boxplot

Box Plots, just like bar-charts are great for data with only a few categories but can get messy

really quickly.
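
The five-number summary consists of the minimum, the first quartile, the median, the third quartile and the maximum. The sketch below computes it with NumPy for a made-up list of values. Note that in the default box plot the whiskers do not always reach the minimum and maximum, because points far from the box are drawn separately as outliers.

```python
import numpy as np

# Made-up values, standing in for the prices of one group.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```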

Heatmap

A Heatmap is a graphical representation of data where the individual values contained in

a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features

in a dataset.

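To get the correlation of the features inside a dataset we can call <dataframe>.corr(), which is a
Pandas dataframe method. This will give us the correlation matrix. Because corr() works on
numeric data, non-numeric columns such as the class column of the Iris data should be dropped
first (recent Pandas releases raise an error otherwise).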
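
What the correlation matrix contains is easiest to see on a tiny made-up dataframe, where column b rises with column a and column c falls as a rises:

```python
import pandas as pd

# Made-up columns with an obvious relationship to each other.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # always twice a
    "c": [4, 3, 2, 1],   # decreases as a increases
})

# Pairwise Pearson correlation of all columns.
corr = df.corr()

print(corr)
```

The matrix is symmetric, the diagonal is 1 because every column correlates perfectly with itself, and values close to -1 indicate that one column falls as the other rises.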

We can now use either Matplotlib or Seaborn to create the heatmap.

Matplotlib:

import numpy as np

# get correlation matrix of the numeric columns (the class column holds strings)
corr = iris.drop(columns='class').corr()
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)

# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# rotate the tick labels and set their alignment
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

Figure 23: Heatmap without annotations

To add annotations to the heatmap we need to add two for loops:

import numpy as np

# get correlation matrix of the numeric columns (the class column holds strings)
corr = iris.drop(columns='class').corr()
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)

# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# rotate the tick labels and set their alignment
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# loop over data dimensions and create text annotations
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        text = ax.text(j, i, np.around(corr.iloc[i, j], decimals=2),
                       ha="center", va="center", color="black")

Figure 24: Heatmap with annotations

Seaborn makes it way easier to create a heatmap and add annotations:

sns.heatmap(iris.drop(columns='class').corr(), annot=True)

Figure 25: Heatmap with annotations

Faceting

Faceting is the act of breaking data variables up across multiple subplots and combining those

subplots into a single figure.

Faceting is really helpful if you want to quickly explore your dataset.

To use one kind of faceting in Seaborn we can use the FacetGrid. First of all, we need to define

the FacetGrid and pass it our data as well as a row or column, which will be used to split the data.

Then we need to call the map function on our FacetGrid object and define the plot type we want

to use, as well as the column we want to graph.

g = sns.FacetGrid(iris, col='class')
g = g.map(sns.kdeplot, 'sepal_length')

Figure 26: Facet-plot

You can make plots a lot bigger and more complicated than the example above. You can find a

few examples here.

Pairplot

Lastly, I will show you Seaborn's pairplot and Pandas' scatter_matrix, which enable you to plot a
grid of pairwise relationships in a dataset.

sns.pairplot(iris)

Figure 27: Pairplot


from pandas.plotting import scatter_matrix

fig, ax = plt.subplots(figsize=(12,12))
scatter_matrix(iris, alpha=1, ax=ax)

Figure 28: Scatter matrix


As you can see in the images above, these techniques always plot two features against each
other. The diagonal of the graph is filled with histograms and the other plots are scatter plots.

Conclusion

Data visualization is the discipline of trying to understand data by placing it in a visual context so

that patterns, trends and correlations that might not otherwise be detected can be exposed.

Python offers multiple great graphing libraries that come packed with lots of different features. In

this article, we looked at Matplotlib, Pandas visualization and Seaborn.


The code covered in this article is available as a Github Repository.
