https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0
https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html
https://jovianlin.io/data-visualization-seaborn-part-2/
https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid
https://jovianlin.io/data-visualization-seaborn-part-3/
https://elitedatascience.com/python-seaborn-tutorial
https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850
https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-1/
https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-2/
http://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap
https://www.drawingfromdata.com/setting-figure-size-using-seaborn-and-matplotlib
https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
https://www.analyticsvidhya.com/blog/2019/09/comprehensive-data-visualization-guide-seaborn-python/
https://gilberttanner.com/blog/introduction-to-data-visualization-inpython
https://likegeeks.com/seaborn-heatmap-tutorial/
https://wellsr.com/python/seaborn-line-plot-data-visualization/
#Subplot
#Box Plot
# plt.hist(data['Path Loss Diff. > 0 dB (%)'], normed=True, alpha=0.5)
# g = sns.FacetGrid(tips, col="time")
# g.map(plt.hist, "tip");
seaborn0.10.0
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
Initializing the grid like this sets up the matplotlib figure and axes, but
doesn’t draw anything on them.
g = sns.FacetGrid(tips, col="time")
g.map(plt.hist, "tip");
This function will draw the figure and annotate the axes, hopefully
producing a finished plot in one step. To make a relational plot, just pass
multiple variable names. You can also provide keyword arguments, which
will be passed to the plotting function:
There are several options for controlling the look of the grid that can be
passed to the class constructor.
The size of the figure is set by providing the height of each facet, along
with the aspect ratio:
The default ordering of the facets is derived from the information in the
DataFrame. If the variable used to define facets has a categorical type, then
the order of the categories is used. Otherwise, the facets will be in the
order of appearance of the category levels. It is possible, however, to
specify an ordering of any facet dimension with the
appropriate *_order parameter:
ordered_days = tips.day.value_counts().index
g = sns.FacetGrid(tips, row="day", row_order=ordered_days,
height=1.7, aspect=4,)
g.map(sns.distplot, "total_bill", hist=False, rug=True);
Any seaborn color palette (i.e., something that can be passed
to color_palette()) can be provided. You can also use a dictionary that
maps the names of values in the hue variable to valid matplotlib colors:
with sns.axes_style("white"):
    g = sns.FacetGrid(tips, row="sex", col="smoker", margin_titles=True,
                      height=2.5)
g.map(plt.scatter, "total_bill", "tip", color="#334488", edgecolor="white",
      lw=.5);
g.set_axis_labels("Total bill (US Dollars)", "Tip");
g.set(xticks=[10, 30, 50], yticks=[2, 6, 10]);
g.fig.subplots_adjust(wspace=.02, hspace=.02);
For even more customization, you can work directly with the underlying
matplotlib Figure and Axes objects, which are stored as member attributes
at fig and axes (a two-dimensional array), respectively. When making a
figure without row or column faceting, you can also use the ax attribute to
directly access the single axes.
If you want to make a bivariate plot, you should write the function so that it
accepts the x-axis variable first and the y-axis variable second:
Sometimes, though, you’ll want to map a function that doesn’t work the way
you expect with the color and label keyword arguments. In this case,
you’ll want to explicitly catch them and handle them in the logic of your
custom function. For example, this approach will allow us to
map matplotlib.pyplot.hexbin(), which otherwise does not play well with
the FacetGrid API:
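def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=15, cmap=cmap, **kwargs)
with sns.axes_style("dark"):
    g = sns.FacetGrid(tips, hue="time", col="time", height=4)
g.map(hexbin, "total_bill", "tip", extent=[0, 50, 0, 10]);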
Plotting pairwise data relationships
PairGrid also allows you to quickly draw a grid of small subplots using the
same plot type to visualize data in each. In a PairGrid, each row and
column is assigned to a different variable, so the resulting plot shows each
pairwise relationship in the dataset. This style of plot is sometimes called a
“scatterplot matrix”, as this is the most common way to show each
relationship, but PairGrid is not limited to scatterplots.
The basic usage of the class is very similar to FacetGrid. First you initialize
the grid, then you pass a plotting function to a map method and it will be
called on each subplot. There is also a companion function, pairplot(), that
trades off some flexibility for faster plotting.
iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter);
It’s possible to plot a different function on the diagonal to show the
univariate distribution of the variable in each column. Note that the axis
ticks won’t correspond to the count or density axis of this plot, though.
g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);
A very common way to use this plot colors the observations by a separate
categorical variable. For example, the iris dataset has four measurements
for each of three different species of iris flowers so you can see how they
differ.
g = sns.PairGrid(iris, hue="species")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend();
By default every numeric column in the dataset is used, but you can focus
on particular relationships if you want.
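Restricting the grid to a subset of columns can be sketched with the vars parameter. The random DataFrame below is an assumption standing in for the iris dataset, with columns a, b, c and a categorical group column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up numeric data standing in for the iris measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])
df["group"] = np.repeat(["x", "y", "z"], 10)

# Only the columns listed in vars are used for the rows and columns of the grid.
g = sns.PairGrid(df, vars=["a", "b"], hue="group")
g.map(plt.scatter)
```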
g = sns.PairGrid(iris)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3, legend=False);
The square grid with identity relationships on the diagonal is actually just a
special case, and you can plot with different variables in the rows and
columns.
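A non-square grid might be set up as in this sketch, which puts two variables on the columns and a third on the single row via x_vars and y_vars. The data is randomly generated for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up numeric data (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

# x_vars define the columns of the grid, y_vars define the rows.
g = sns.PairGrid(df, x_vars=["a", "b"], y_vars=["c"], height=3)
g.map(plt.scatter)
```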
Of course, the aesthetic attributes are configurable. For instance, you can
use a different palette (say, to show an ordering of the hue variable) and
pass keyword arguments into the plotting functions.
Seaborn / Matplotlib
Although there’re tons of great visualization tools in Python, Matplotlib + Seaborn still
stands out for its capability to create and customize all sorts of plots.
Shiu-Tang Li
Mar 26, 2019 · 10 min read
In this article, I will go through a few sections first to prepare background knowledge for some
readers who are new to Matplotlib:
1. Understand the two different Matplotlib interfaces (they have caused a lot of confusion!).
2. Understand the elements in a figure, so that you can easily look up the APIs to solve your
problem.
3. Take a glance at a few common types of plots, so that readers have a better idea of
when and how to use them.
Then I’ll talk about the process of creating advanced visualizations with an example:
1. Set up a goal.
There are two ways to code in Matplotlib. The first is the state-based (pyplot) interface, which is
good for creating easy plots (you call a bunch of plt.XXX functions to plot each component in the
graph), but you don’t have much control over the graph. The other one is object-oriented:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3,3))
ax.bar(x=['A','B','C'], height=[3.1,7,4.2], color='r')
ax.set_xlabel(xlabel='X title', size=20)
ax.set_ylabel(ylabel='Y title' , color='b', size=20)
plt.show()
It will take more time to code, but you’ll have full control of your figure. The idea is that you create
a ‘figure’ object, which you can think of as a bounding box of the whole visualization you’re
going to build, and one or more ‘axes’ objects, which are subplots of the visualization (don’t ask
me why these subplots are called ‘axes’; the name just sucks…). The subplots can be manipulated
through the methods of these ‘axes’ objects.
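For comparison, here is a sketch of how the same bar chart might be written in the state-based style, where every plt.XXX call acts on the current figure and axes instead of on an explicit object. It is meant as a rough equivalent of the object-oriented snippet, not a definitive translation.

```python
import matplotlib.pyplot as plt

# Each call below implicitly targets the "current" figure and axes.
plt.figure(figsize=(3, 3))
plt.bar(x=['A', 'B', 'C'], height=[3.1, 7, 4.2], color='r')
plt.xlabel('X title', size=20)
plt.ylabel('Y title', color='b', size=20)

current_ax = plt.gca()  # the axes that the calls above were drawing on
```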
(For detailed explanations of these two interfaces, the reader may refer to
https://matplotlib.org/tutorials/introductory/lifecycle.html
or
https://pbpython.com/effective-matplotlib.html )
In the code above, we created an axes object, created a line plot on top of it, added a title, and
rotated all the x-tick labels by 45 degrees counterclockwise.
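A minimal sketch of those four steps with the object-oriented interface is given below. The data, the title text, and the figure size are made up; tick_params(labelrotation=...) is one of several ways to rotate tick labels.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))            # create the figure and one axes object
ax.plot(['Jan', 'Feb', 'Mar', 'Apr'], [3, 5, 2, 6])  # line plot on that axes (made-up data)
ax.set_title('Monthly values')                    # add a title
ax.tick_params(axis='x', labelrotation=45)        # positive angles rotate counterclockwise
```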
After getting a rough idea about how Matplotlib works, it’s time to check out some commonly
seen plots. They are
Scatter plots (x: Numerical #1, y: Numerical #2),
Line plots (x: Categorical - ordinal#1, y: Numerical #1) [Thanks to Michael Arons for pointing out
an issue in the previous figure],
Bar plots (x: Categorical #1, y: Numerical #1). Numerical #1 is often the count of Categorical #1.
Histogram (x: Numerical #1, y: Numerical #2). Numerical #1 is combined into groups (converted
to a categorical variable), and Numerical #2 is usually the count of this categorical variable.
Kernel density plot (x: Numerical #1, y: Numerical #2). Numerical #2 is the frequency of
Numerical #1.
2-D kernel density plot (x: Numerical #1, y: Numerical #2, color: Numerical #3). Numerical #3 is
the joint frequency of Numerical #1 and Numerical #2.
Box plot (x: Categorical #1, y: Numerical #1, marks: Numerical #2). A box plot shows the statistics
of each value in Categorical #1, so we get an idea of the distribution of the other variable.
y-value: the value of the other variable; marks: how these values are distributed (range,
Q1, median, Q3).
Violin plot (x: Categorical #1, y: Numerical #1, Width/Mark: Numerical #2). Violin plot is sort of
similar to box plot but it shows the distribution better.
Heat map (x: Categorical #1, y: Categorical #2, Color: Numerical #1). Numerical #1 could be the
count for Categorical #1 and Categorical #2 jointly, or it could be other numerical attributes for
each value in the pair (Categorical #1, Categorical #2).
To learn how to plot these figures, the readers can check out the seaborn APIs by googling for the
following list:
I’ll give two code examples showing how 2D kde plots and heat maps are generated with the
object-oriented interface.
# 2D kde plots
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1)
numerical_1 = np.random.randn(100)
np.random.seed(2)
numerical_2 = np.random.randn(100)

fig, ax = plt.subplots(figsize=(3,3))
sns.kdeplot(data=numerical_1,
            data2=numerical_2,
            ax=ax,
            shade=True,
            color="blue",
            bw=1)
plt.show()
The key is the argument ax=ax. When running .kdeplot() method, seaborn would apply the
changes to ax, an ‘axes’ object.
# heat map
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(dict(categorical_1=['apple', 'banana', 'grapes',
                                      'apple', 'banana', 'grapes',
                                      'apple', 'banana', 'grapes'],
                       categorical_2=['A','A','A','B','B','B','C','C','C'],
                       value=[10,2,5,7,3,15,1,6,8]))
# keyword arguments: recent pandas versions no longer accept these positionally
pivot_table = df.pivot(index="categorical_1", columns="categorical_2", values="value")
# try printing out pivot_table to see what it looks like!

fig, ax = plt.subplots(figsize=(5,5))
sns.heatmap(data=pivot_table,
            cmap=sns.color_palette("Blues"),
            ax=ax)
plt.show()
For these basic plots, only a limited amount of information can be displayed (2–3 variables). What
if we’d like to show more information in these plots? Here are a few ways.
1. Overlay plots
If several line charts share the same x and y variables, you can call Seaborn plots multiple
times and plot all of them on the same figure. In the example below, we added one more
categorical variable [value = alpha, beta] in the plot with overlaying plots.
fig, ax = plt.subplots(figsize=(4,4))
sns.lineplot(x=['A','B','C','D'],
y=[4,2,5,3],
color='r',
ax=ax)
sns.lineplot(x=['A','B','C','D'],
y=[1,6,2,4],
color='b',
ax=ax)
ax.legend(['alpha', 'beta'], facecolor='w')
plt.show()
Or, we can combine a bar chart and a line chart with the same x-axis but different y-axes:
sns.set(style="white", rc={"lines.linewidth": 3})

fig, ax1 = plt.subplots(figsize=(4,4))
ax2 = ax1.twinx()

sns.barplot(x=['A','B','C','D'],
            y=[100,200,135,98],
            color='#004488',
            ax=ax1)

sns.lineplot(x=['A','B','C','D'],
             y=[4,2,5,3],
             color='r',
             marker="o",
             ax=ax2)
plt.show()
sns.set()
A few comments here. Because the two plots have different y-axes, we need to create another
‘axes’ object with the same x-axis (using .twinx()) and then plot on the different ‘axes’. sns.set(…)
sets specific aesthetics for the current plot, and we run sns.set() at the end to set everything back
to the default settings.
Combining different barplots into one grouped barplot also adds one categorical dimension to
the plot (one more categorical variable).
In the code example above, you can customize variable names, colors, and figure size.
number_groups and bin_width are calculated based on the input data. I then wrote a for-loop to
plot the bars, one color at a time, and set the ticks and legends in the very end.
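A grouped bar plot along the lines described here could be sketched as follows. Only the names number_groups and bin_width come from the text; the data, the colors, and every other name are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: one list of bar heights per group, one entry per category.
categories = ['A', 'B', 'C']
groups = {'alpha': [3, 5, 2], 'beta': [4, 1, 6], 'gamma': [2, 2, 5]}
colors = ['r', 'g', 'b']

# Both values are derived from the input data.
number_groups = len(groups)
bin_width = 1.0 / (number_groups + 1)   # leaves a gap between neighbouring categories
positions = np.arange(len(categories))

fig, ax = plt.subplots(figsize=(5, 4))
# Plot the bars one group (one color) at a time, shifting each group sideways.
for i, ((name, heights), color) in enumerate(zip(groups.items(), colors)):
    ax.bar(positions + i * bin_width, heights,
           width=bin_width, color=color, label=name)

# Set the ticks and the legend at the very end.
ax.set_xticks(positions + bin_width * (number_groups - 1) / 2)
ax.set_xticklabels(categories)
ax.legend(facecolor='w')
```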
2. Facet: map the dataset onto multiple axes that differ by one or two categorical
variable(s). The reader can find a bunch of examples
in https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
3. Color / Shape / Size of nodes in a scatter plot: The following code example taken from Seaborn
Scatter Plot API shows how it works.
(https://seaborn.pydata.org/generated/seaborn.scatterplot.html)
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip",
                     hue="size", size="size",
                     sizes=(20, 200), hue_norm=(0, 7),
                     legend="full", data=tips)
plt.show()
One of the advantages of the object-oriented interface is that we can easily partition our figure into
several subplots and manipulate each subplot with the ‘axes’ API.
import matplotlib.pyplot as plt
from matplotlib import gridspec

fig = plt.figure(figsize=(7,7))
gs = gridspec.GridSpec(nrows=3,
                       ncols=3,
                       figure=fig,
                       width_ratios= [1, 1, 1],
                       height_ratios=[1, 1, 1],
                       wspace=0.3,
                       hspace=0.3)

ax1 = fig.add_subplot(gs[0, 0])
ax1.text(0.5, 0.5, 'ax1: gs[0, 0]', fontsize=12, fontweight="bold",
         va="center", ha="center")  # adding text to ax1

ax2 = fig.add_subplot(gs[0, 1:3])
ax2.text(0.5, 0.5, 'ax2: gs[0, 1:3]', fontsize=12, fontweight="bold",
         va="center", ha="center")

ax3 = fig.add_subplot(gs[1:3, 0:2])
ax3.text(0.5, 0.5, 'ax3: gs[1:3, 0:2]', fontsize=12, fontweight="bold",
         va="center", ha="center")

ax4 = fig.add_subplot(gs[1:3, 2])
ax4.text(0.5, 0.5, 'ax4: gs[1:3, 2]', fontsize=12, fontweight="bold",
         va="center", ha="center")

plt.show()
In the example, we first partition the figure into 3*3 = 9 small boxes with gridspec.GridSpec(),
and then define a few axes objects. Each axes object can contain one or more boxes. In the
code above, for instance, gs[0, 1:3] = gs[0, 1] + gs[0, 2] is assigned to the axes object ax2. wspace
and hspace are parameters controlling the space between plots.
With some tutorials from the previous sections, it’s time to produce some cool stuff. Let’s
download the Analytics Vidhya Black Friday Sales data from
https://www.kaggle.com/mehdidag/black-friday and do some easy data preprocessing:
df = pd.read_csv('BlackFriday.csv', usecols = ['User_ID', 'Gender',
                                               'Age', 'Purchase'])
df_gp_1 = df[['User_ID', 'Purchase']].groupby('User_ID').agg(np.mean).reset_index()
df_gp_2 = df[['User_ID', 'Gender', 'Age']].groupby('User_ID').agg(max).reset_index()
df_gp = pd.merge(df_gp_1, df_gp_2, on = ['User_ID'])
You’ll then get a table of user ID, gender, age, and the average price of items in each customer’s
purchase.
Step 1. Goal
We’re curious about how age and gender would affect the average purchased item price during
Black Friday, and we hope to see the price distribution as well. We also want to know the
percentages for each age group.
Step 2. Variables
We’d like to include age group (categorical), gender (categorical), average item price (numerical),
and the distribution of average item price (numerical) in the plot. We need to include another
plot with the percentage for each age group (age group + count/frequency).
To show average item price + its distributions, we can go with kernel density plot, box plot, or
violin plot. Among these, kde shows the distribution the best. We then plot two or more kde plots
in the same figure and then do facet plots, so age group and gender info can be both included. For
the other plot, a bar plot can do the job well.
Step 3. Visualization
Once we have a plan about the variables, we could then think about how to visualize it. We need
to do figure partitions first, hide some boundaries, xticks, and yticks, and then add a bar chart to
the right.
The plot below is what we’re going to create. From the figure, we can clearly see that, based on
the data, men tend to purchase more expensive items than women do, and older people tend to
purchase more expensive items (the trend is clearer for the top 4 age groups). We also found that
people aged 18–45 are the major buyers in Black Friday sales.
The code below generates the plot (explanations are included in the comments):
# freq = the percentage for each age group, and there are 7 age groups.
# sort_index() orders the percentages by age-group label.
freq = (df_gp.Age.value_counts(normalize=True).sort_index() * 100).tolist()
number_gp = 7

def ax_settings(ax, var_name, x_min, x_max):
    ax.set_xlim(x_min,x_max)
    ax.set_yticks([])
    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_edgecolor('#444444')
    ax.spines['bottom'].set_linewidth(2)
Continuing from Part 1 of my seaborn series, we'll proceed to cover 2D plots.
This notebook is a reorganization of the many ideas shared in this Github repo and this blog post. What
you see here is a modified version that works for me that I hope will work for you as well. Also, enjoy the
cat GIFs.
One of the best ways to check out potential relationships or correlations amongst the different data
attributes is to leverage a pair-wise correlation matrix and depict it as a heatmap.
2D: Heatmap on Correlation Matrix
[Table: pairwise correlation matrix of the wine attributes]
hm = sns.heatmap(corr,
                 ax=ax,            # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm",  # Color Map.
                 #square=True,     # If True, set the Axes aspect to "equal" so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',        # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Wine Attributes Correlation Heatmap',
             fontsize=14,
             fontweight='bold')
The gradients in the heatmap vary based on the strength of the correlation.
You can clearly see that it is very easy to spot potential attributes having strong correlations
amongst themselves.
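For readers without the wine data at hand, the same idea can be tried on synthetic data. In the sketch below, the column names and the way the columns are generated are assumptions chosen so that one pair is strongly positively correlated, one strongly negatively, and one not at all.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with known relationships (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y_pos": x + rng.normal(scale=0.5, size=200),    # positively related to x
    "y_neg": -x + rng.normal(scale=0.5, size=200),   # negatively related to x
    "noise": rng.normal(size=200),                   # unrelated to x
})

corr = df.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, ax=ax, cmap="coolwarm", annot=True, fmt=".2f",
            linewidths=.05, vmin=-1, vmax=1)
```

Fixing vmin and vmax at -1 and 1 keeps the color scale symmetric, so the neutral color always corresponds to zero correlation.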
# Attributes of interest
cols = ['density',
        'residual sugar',
        'total sulfur dioxide',
        'free sulfur dioxide',
        'fixed acidity']
pp = sns.pairplot(wines[cols],
                  height=1.8, aspect=1.2,  # "height" replaces the deprecated "size"
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kws=dict(shade=True),  # "diag" adjusts/tunes the diagonal plots
                  diag_kind="kde")  # use "kde" for diagonal plots
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots',
             fontsize=14, fontweight='bold')
Bonus: You can also fit linear regression models to the scatter plots. See [😀].
pp = sns.pairplot(wines[cols],
                  diag_kws=dict(shade=True),  # "diag" adjusts/tunes the diagonal plots
                  diag_kind="kde",  # use "kde" for diagonal plots
                  kind="reg")  # <== 😀 linear regression to the scatter plots
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14,
             fontweight='bold')
Based on the above plot, you can see that scatter plots are also a decent way of observing potential
relationships or patterns in two-dimensions for data attributes.
Important: Before we proceed to run parallel_coordinates() , we'll need to scale our data first,
as different attributes are measured on different scales.
Note: I have another blog post on Feature Scaling (should you be interested to know more).
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Attributes of interest
cols = ['density',
        'residual sugar',
        'total sulfur dioxide',
        'free sulfur dioxide',
        'fixed acidity']
subset_df = wines[cols]
ss = StandardScaler()
scaled_df = ss.fit_transform(subset_df)
scaled_df = pd.DataFrame(scaled_df, columns=cols)
final_df = pd.concat([scaled_df, wines['wine_type']], axis=1)
final_df.head()
from pandas.plotting import parallel_coordinates

pc = parallel_coordinates(final_df,
                          'wine_type',
                          color=('skyblue', 'firebrick'))
Basically, in this visualization as depicted above, points are represented as connected line segments.
Just by looking at it, we can clearly see that density is slightly more for red wines as compared
to white — since there are more red lines clustered above the white ones.
Also residual sugar and total sulfur dioxide are higher for white wines as compared
to red, while fixed acidity is higher for red wines as compared to white.
plt.scatter(wines['sulphates'],
wines['alcohol'],
alpha=0.4, edgecolors='w')
plt.xlabel('Sulphates')
plt.ylabel('Alcohol')
plt.title('Wine Sulphates - Alcohol Content', y=1.05)
jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='reg',  # <== 😀 Add regression and kernel density fits
                   space=0, height=6, ratio=4)  # "height" replaces the deprecated "size"
😀 Replace the scatterplot with a joint histogram using hexagonal bins:
jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='hex',  # <== 😀 Replace the scatterplot with a joint histogram using hexagonal bins
                   space=0, height=6, ratio=4)  # "height" replaces the deprecated "size"
😀 KDE:
jp = sns.jointplot(data=wines,
                   x='sulphates',
                   y='alcohol',
                   kind='kde',  # <== 😀 KDE
                   space=0, height=6, ratio=4)  # "height" replaces the deprecated "size"
2D: Two Discrete Categorical Attributes [📊]
Now that we've covered two continuous numeric attributes, how about visualizing two discrete,
categorical attributes?
fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Wine Type - Quality", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)
ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency")
rw_q = red_wine['quality'].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0,2500])
ax1.tick_params(axis='both', which='major', labelsize=8.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color='red',
edgecolor='black', linewidth=1)
ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality")
ax2.set_ylabel("Frequency")
ww_q = white_wine['quality'].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0,2500])
ax2.tick_params(axis='both', which='major', labelsize=8.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color='white',
edgecolor='black', linewidth=1)
While the above is a good way to visualize categorical data, as you can see,
leveraging matplotlib has resulted in writing a lot of code 😤.
In addition, another good way is to use stacked bars or multiple bars for the different attributes in a
single plot. We can leverage seaborn for the same easily.
fig = plt.figure(figsize=(10, 7))
cp = sns.countplot(data=wines,
x="quality",
hue="wine_type",
palette={"red": "#FF9999", "white": "#FFE888"})
This definitely looks cleaner, and you can effectively compare the different categories from
this single plot.
2D: Mixed Attributes [📈+📊]
Let’s look at visualizing mixed attributes in 2-D (essentially numeric and categorical together).
[💔] Again, let's first look at the traditional way — using matplotlib (histograms):
fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)
ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 1200])
ax1.text(1.2, 800, r'$\mu$='+str(round(red_wine['sulphates'].mean(),2)),
fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(red_wine['sulphates'], color='red',
bins=15,
edgecolor='black', linewidth=1)
ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 1200])
ax2.text(0.8, 800, r'$\mu$='+str(round(white_wine['sulphates'].mean(),2)),
fontsize=12)
w_freq, w_bins, w_patches = ax2.hist(white_wine['sulphates'],
color='white', bins=15,
edgecolor='black', linewidth=1)
fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)
ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density")
sns.kdeplot(red_wine['sulphates'], ax=ax1, shade=True, color='r')
ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density")
sns.kdeplot(white_wine['sulphates'], ax=ax2, shade=True, color='y')
While this is good, once again we have a lot of boilerplate code which we can avoid by
leveraging seaborn and even depict the plots in one single chart.
The FacetGrid is an object that links a Pandas DataFrame to a matplotlib figure with a particular
structure.
In particular, FacetGrid is used to draw plots with multiple Axes where each Axes shows the same
relationship conditioned on different levels of some variable. It's possible to condition on up to three
variables by assigning variables to the rows and columns of the grid and using different colors for the
plot elements.
The basic workflow is to initialize the FacetGrid object with the dataset and the variables that are used to
structure the grid. Then one or more plotting functions can be applied to each subset by
calling FacetGrid.map() or FacetGrid.map_dataframe() .
Finally, the plot can be tweaked with other methods to do things like change the axis labels, use different
ticks, or add a legend. See the detailed code examples here for more information.
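The map_dataframe() variant can be sketched as follows: instead of receiving the column values directly, the plotting function gets column names plus the facet's subset of the DataFrame through a data keyword. The function name scatter_from_frame and the DataFrame are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips dataset (values are illustrative only).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
})

def scatter_from_frame(x, y, data=None, **kwargs):
    # x and y are column names; data is the subset belonging to the current facet.
    plt.scatter(data[x], data[y], **kwargs)

# 1. Initialize the grid, 2. map a plotting function, 3. tweak the result.
g = sns.FacetGrid(df, col="time")
g.map_dataframe(scatter_from_frame, "total_bill", "tip")
g.set_axis_labels("Total bill", "Tip")
```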
fig = plt.figure(figsize=(10,8))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.93, wspace=0.3)
ax = fig.add_subplot(1,1,1)
ax.set_xlabel("Sulphates")
ax.set_ylabel("Frequency")
g = sns.FacetGrid(data=wines,
hue='wine_type',
palette={"red": "r", "white": "y"})
g.map(sns.distplot, 'sulphates',
kde=True, bins=15, ax=ax)
ax.legend(title='Wine Type')
plt.close(2)
You can see that the plot generated above is clear and concise, and we can easily compare the
distributions.
2D: Box [📦] and Violin [🎻] Plots
[📦] Box plots are another way of effectively depicting groups of numeric data based on the different
values in the categorical attribute.
Additionally, box plots are a good way to know the quartile values in the data and
also potential outliers.
sns.boxplot(data=wines,
x="quality",
y="alcohol",
ax=ax)
ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)
[🎻] Another similar visualization is the violin plot, which is also an effective way to visualize grouped
numeric data using kernel density plots, depicting the probability density of the data at different
values.
sns.violinplot(data=wines,
x="quality",
y="sulphates",
ax=ax)
ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size=12,alpha=0.8)
You can clearly see the density plots above for the different wine quality categories for wine
sulphates.
https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850
In the relational plot tutorial we saw how to use different visual representations to show the
relationship between multiple variables in a dataset. In the examples, we focused on cases where
the main relationship was between two numerical variables. If one of the main variables is
“categorical” (divided into discrete groups) it may be helpful to use a more specialized approach
to visualization.
In seaborn, there are several different ways to visualize a relationship involving categorical data.
Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two
ways to make these plots. There are a number of axes-level functions for plotting categorical data
in different ways and a figure-level interface, catplot(), that gives unified higher-level access to
them.
It’s helpful to think of the different categorical plot kinds as belonging to three different families,
which we’ll discuss in detail below. They are:
Categorical scatterplots:
The default representation of the data in catplot() uses a scatterplot. There are actually two
different categorical scatter plots in seaborn. They take different approaches to resolving the
main challenge in representing categorical data with a scatter plot, which is that all of the points
belonging to one category would fall on the same position along the axis corresponding to the
categorical variable. The approach used by stripplot(), which is the default “kind” in catplot() is to
adjust the positions of points on the categorical axis with a small amount of random “jitter”:
In [16]:
sns.catplot(x="age",y="marital_status",data=census_data)
Out[16]:
<seaborn.axisgrid.FacetGrid at 0xdb18470>
Figure 17
The second approach adjusts the points along the categorical axis using an algorithm that
prevents them from overlapping. It can give a better representation of the distribution of
observations, although it only works well for relatively small datasets. This kind of plot is
sometimes called a “beeswarm” and is drawn in seaborn by swarmplot(), which is activated by
setting kind="swarm" in catplot():
In [27]:
#sns.catplot(x="age",y="relationship",kind='swarm',data=census_data)
# or
#sns.swarmplot(x="relationship",y="age",data=census_data)
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);
Out[27]:
Figure 18
Similar to the relational plots, it’s possible to add another dimension to a categorical plot by using
a hue semantic. (The categorical plots do not currently support size or style semantics). Each
different categorical plotting function handles the hue semantic differently. For the scatter plots,
it is only necessary to change the color of the points:
In [29]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips);
Out[29]:
Figure 19
Box plot
The first is the familiar boxplot(). This kind of plot shows the three quartile values of the
distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of
the lower and upper quartile, and then observations that fall outside this range are displayed
independently. This means that each value in the boxplot corresponds to an actual observation in
the data.
In [32]:
sns.catplot(x="age",y="marital_status",kind='box',data=census_data)
Out[32]:
<seaborn.axisgrid.FacetGrid at 0xd411860>
Figure 20
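The whisker rule described above can be worked through by hand with NumPy. The ten values below are made up, and the quartiles use NumPy's default linear interpolation, so treat this as a sketch of the rule and not as seaborn's exact internals.

```python
import numpy as np

# Made-up observations with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])  # the three quartile values
iqr = q3 - q1

# Whiskers end at the most extreme observations within 1.5 IQRs of the quartiles.
inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the whiskers is drawn as an individual point.
outliers = values[(values < whisker_low) | (values > whisker_high)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [30]
```

Note that the whisker ends (1 and 9) are actual observations, not the computed fences at -3.5 and 14.5.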
When adding a hue semantic, the box for each level of the semantic variable is moved along the
categorical axis so they don’t overlap:
In [37]:
sns.catplot(x="age",y="marital_status",kind='box',hue='gender',data=census_data)
Out[37]:
<seaborn.axisgrid.FacetGrid at 0xde8a8d0>
Figure 21
Violin plots
A different approach is a violinplot(), which combines a boxplot with the kernel density
estimation procedure described in the distributions tutorial:
In [38]:
sns.catplot(x="age",y="marital_status",kind='violin',data=census_data)
Out[38]:
<seaborn.axisgrid.FacetGrid at 0x184c4080>
Figure 22
This approach uses the kernel density estimate to provide a richer description of the distribution
of values. Additionally, the quartile and whisker values from the boxplot are shown inside the
violin. The downside is that, because the violinplot uses a KDE, there are some other parameters
that may need tweaking, adding some complexity relative to the straightforward boxplot:
In [41]:
sns.catplot(x="age",y="marital_status",kind='violin',bw=.15,
cut=0,data=census_data)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0xfdea320>
Figure 23
For other applications, rather than showing the distribution within each category, you might
want to show an estimate of the central tendency of the values. Seaborn has two main ways to
show this information. Importantly, the basic API for these functions is identical to that for the
ones discussed above.
Bar plots
A familiar style of plot that accomplishes this goal is a bar plot. In seaborn, the barplot() function
operates on a full dataset and applies a function to obtain the estimate (taking the mean by
default). When there are multiple observations in each category, it also uses bootstrapping to
compute a confidence interval around the estimate and plots that using error bars:
In [46]:
sns.catplot(x="income_bracket",y="age",kind='bar',data=census_data)
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x160588d0>
Figure 24
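To make the bootstrapping step concrete, here is a minimal sketch of a percentile bootstrap for the mean, written with NumPy only. It illustrates the general idea and is not seaborn's internal implementation; the function name bootstrap_ci, the number of resamples, and the sample ages are assumptions chosen for the example.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw n_boot resamples with replacement, each the size of the original sample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Take the central ci% of the resampled means as the interval.
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), (low, high)

ages = np.array([23, 35, 41, 29, 52, 38, 47, 31, 60, 27])  # made-up sample
mean_age, (low, high) = bootstrap_ci(ages)
print(mean_age, low, high)
```

The bar height corresponds to mean_age, and the error bar would span from low to high.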
In [47]:
sns.catplot(x="income_bracket",y="age",kind='bar',hue='gender',data=census_data)
Out[47]:
<seaborn.axisgrid.FacetGrid at 0xdf262e8>
Figure 25
A special case for the bar plot is when you want to show the number of observations in each
category rather than computing a statistic for a second variable. This is similar to a histogram
over a categorical, rather than quantitative, variable. In seaborn, it’s easy to do so with the
countplot() function:
In [61]:
ax = sns.catplot(x='marital_status',kind='count',data=census_data,orient="h")
ax.fig.autofmt_xdate()
Out[61]:
Figure 26
Point plots
An alternative style for visualizing the same information is offered by the pointplot() function.
This function also encodes the value of the estimate with height on the other axis, but rather than
showing a full bar, it plots the point estimate and confidence interval. Additionally, pointplot()
connects points from the same hue category. This makes it easy to see how the main relationship
is changing as a function of the hue semantic because your eyes are quite good at picking up on
differences of slopes:
In [67]:
ax = sns.catplot(x='marital_status',y='age',hue='relationship',kind='point',data=census_data)
ax.fig.autofmt_xdate()
Out[67]:
Figure 27
Showing multiple relationships with facets
Just like relplot(), the fact that catplot() is built on a FacetGrid means that it is easy to add
faceting variables to visualize higher-dimensional relationships:
In [78]:
sns.catplot(x="age", y="marital_status", hue="income_bracket",
col="gender", aspect=.6,
kind="box", data=census_data);
Out[78]:
seaborn.heatmap(data, vmin=None, vmax=None, cmap=None, center=None,
robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0,
linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False,
xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)
Plot rectangular data as a color-encoded matrix.
This is an Axes-level function and will draw the heatmap into the
currently-active Axes if none is provided to the ax argument. Part of
this Axes space will be taken and used to plot a colormap,
unless cbar is False or a separate Axes is provided to cbar_ax.
Parameters
data : rectangular dataset
2D dataset that can be coerced into an ndarray. If a Pandas
DataFrame is provided, the index/column information will be used to
label the columns and rows.
vmin, vmax : floats, optional
Values to anchor the colormap, otherwise they are inferred from the
data and other keyword arguments.
cmap : matplotlib colormap name or object, or list of colors, optional
The mapping from data values to color space. If not provided, the
default will depend on whether center is set.
center : float, optional
The value at which to center the colormap when plotting divergent
data. Using this parameter will change the default cmap if none is
specified.
robust : bool, optional
If True and vmin or vmax are absent, the colormap range is computed
with robust quantiles instead of the extreme values.
annot : bool or rectangular dataset, optional
If True, write the data value in each cell. If an array-like with the
same shape as data, then use this to annotate the heatmap instead of
the data. Note that DataFrames will match on position, not index.
fmt : string, optional
String formatting code to use when adding annotations.
annot_kws : dict of key, value mappings, optional
Keyword arguments for ax.text when annot is True.
linewidths : float, optional
Width of the lines that will divide each cell.
linecolor : color, optional
Color of the lines that will divide each cell.
cbar : boolean, optional
Whether to draw a colorbar.
cbar_kws : dict of key, value mappings, optional
Keyword arguments for fig.colorbar.
cbar_ax : matplotlib Axes, optional
Axes in which to draw the colorbar, otherwise take space from the
main Axes.
square : boolean, optional
If True, set the Axes aspect to "equal" so each cell will be square-shaped.
xticklabels, yticklabels : "auto", bool, list-like, or int, optional
If True, plot the column names of the dataframe. If False, don't plot
the column names. If list-like, plot these alternate labels as the
xticklabels. If an integer, use the column names but plot only every
n-th label. If "auto", try to densely plot non-overlapping labels.
mask : boolean array or DataFrame, optional
If passed, data will not be shown in cells where mask is True. Cells
with missing values are automatically masked.
ax : matplotlib Axes, optional
Axes in which to draw the plot, otherwise use the currently-active
Axes.
kwargs : other keyword arguments
All other keyword arguments are passed
to matplotlib.axes.Axes.pcolormesh().
Returns
ax : matplotlib Axes
Axes object with the heatmap.
See also
clustermap
Plot a matrix using hierarchical clustering to arrange the rows and
columns.
Examples
In other words, I want to be able to plot the sns.distplot() for each column in the
dataframe in a visualization of 3 rows and 3 columns where each sub figure
represents the unique sns.distplot() of each column for the total number of columns
in the dataframe.
I experimented a bit with using a for loop over axes and columns for the dataframe,
but I'm only able to achieve results for specifying columns. I'm not sure how to
represent the code to work for rows and columns.
I also looked into sns.FacetGrid, but I'm not sure how to go about solving this
problem using FacetGrid.
I find the df.hist() function does exactly what I want, but I want to be able to do it with
the sns.distplot for all the columns in that same representation as the output of
df.hist().
If it helps to put the context of the dataframe, I'm essentially reading Google
Colab's training and testing sets for the California Housing Dataset which contains
all the columns except for the ocean_proximity. If you want to help me figure out
this problem using that dataset, please get it from Kaggle and drop the
ocean_proximity column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('housing.csv')
df.drop('ocean_proximity', axis=1, inplace=True)
fig, axes = plt.subplots(ncols=len(df.columns), figsize=(30,15))
for ax, col in zip(axes, df.columns):
    sns.distplot(df[col], ax=ax)
plt.tight_layout()
plt.show()
python pandas seaborn
asked Mar 28 '19 at 2:20
pythonRCNewbie
1 Answer
You can create multiple subplots in one figure with matplotlib's subplots(), like this:
Bruce Swain
Beautiful solution, Bruce! Thank you so much. I got exactly the visualization I
wanted :) – pythonRCNewbie Mar 29 '19 at 9:18
https://www.geeksforgeeks.org/box-plot-visualization-with-pandas-and-seaborn/
A box plot is a visual representation of groups of numerical data through their
quartiles. It is also used to detect outliers in a data set. It summarizes the data
efficiently with a simple box and whiskers and makes it easy to compare across groups.
A box plot summarizes a sample using the 25th, 50th and 75th percentiles, which are
also known as the lower quartile, median and upper quartile.
A box plot consists of 5 things.
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
To download the dataset used, click here.
# import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the seaborn examples further down
%matplotlib inline
# load the dataset
df = pd.read_csv("tips.csv")
# display 5 rows of dataset
df.head()
df.boxplot(by ='day', column =['total_bill'], grid = False)
Boxplot of tip with respect to size.
df.boxplot(by ='size', column =['tip'], grid = False)
Draw the boxplot using seaborn library:
Syntax :
seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, notch=False,
ax=None, **kwargs)
Parameters:
x = feature of dataset
y = feature of dataset
hue = feature of dataset
data = dataframe or full dataset
color = color name
Let’s see how to create the box plot through seaborn library.
Information about “tips” dataset.
# load the dataset
tips = sns.load_dataset('tips')
tips.head()
# Draw a vertical boxplot grouped
# by a categorical variable:
sns.set_style("whitegrid")
sns.boxplot(x = 'day', y = 'total_bill', data = tips)
Let's take the first box plot, i.e. the blue box plot in the figure, and read off these statistics:
The bottom black horizontal line (the end of the lower whisker) is the minimum value, excluding outliers.
The first black horizontal line of the rectangle is the first quartile, or 25%.
The second black horizontal line of the rectangle is the second quartile, 50%, or median.
The third black horizontal line of the rectangle is the third quartile, or 75%.
The top black horizontal line (the end of the upper whisker) is the maximum value, excluding outliers.
The small diamond shapes are outliers: observations that lie beyond the whiskers (with the
default whis=1.5, more than 1.5 IQRs away from the box). They may or may not be erroneous data.
https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
And because visualization is generally easier to understand than reading tabular data, heatmaps
are typically used to visualize correlation matrices. A simple way to plot a heatmap in Python is
by importing and implementing the Seaborn library.
From seaborn documentation
Seaborn heatmaps are appealing to the eyes, and they tend to send clear messages about data
almost immediately. This is why this method for correlation matrix visualization is widely used
by data analysts and data scientists alike.
But what else can we get from the heatmap apart from a simple plot of the correlation matrix?
Surprisingly, the Seaborn heatmap function has 18 arguments that can be used to customize a
correlation matrix, improving how fast insights can be derived. For the purposes of this tutorial,
we’re going to use 13 of those arguments.
To make things a bit simpler for the purposes of this tutorial, we’re going to use one of the pre-
installed datasets in Seaborn. The first thing we need to do is import the Seaborn library and load
the data.
Please note: If using Google Colab or any Anaconda package, there’s no need to install
Seaborn; you’ll only need to import it. Otherwise, use this link to install Seaborn.
The data
Our data, which is called Tips (a pre-installed dataset on Seaborn library), has 7 columns
consisting of 3 numeric features and 4 categorical features. Each entry or row captures a type of
customer (be it male or female or smoker or non-smoker ) having either dinner or lunch on a
particular day of the week. It also captures the amount of total bill, the tip given and the table size
of a customer. (For more info about pre-installed datasets on the Seaborn library, check here)
One important thing to note when plotting a correlation matrix is that it completely ignores any
non-numeric column. For the purposes of this tutorial, all the categorical variables were converted to
numeric variables.
This is what the function looks like with all the arguments:
Just taking a look at the code and not having any idea about how it works can be very
overwhelming. Let’s dissect it together.
To better understand the arguments, we’re going to group them into 4 categories:
1. The Essentials
3. Aesthetics
1. The most important argument in the function is the input data, since the end goal is to
plot a correlation matrix. Call the .corr() method on the DataFrame and pass the result as the first
argument.
2. Interpreting the insights by just using the first argument is sufficient. For an even easier
interpretation, an argument called annot=True should be passed as well, which helps display
the correlation coefficient.
3. Correlation coefficients can run to five or more decimal digits. A good trick to shorten the
numbers displayed and improve readability is to pass the argument fmt='.3g' or fmt='.1g'.
By default the function uses fmt='.2g', which shows two significant digits (this does not
always mean two decimal places). Let's set fmt='.1g'.
For the rest of this tutorial, we will stick to the default fmt='.2g'
4. The next three arguments have to do with rescaling the color bar. There are times where the
correlation matrix bar doesn’t start at zero, a negative number, or end at a particular number of
choice—or even have a distinct center. All this can be customized by specifying these three
arguments: vmin, which is the minimum value of the bar; vmax, which is the maximum value of
the bar; and center= . By default, all three aren’t specified. Let’s say we want our color bar to be
between -1 to 1 and be centered at 0.
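As a hedged illustration of the arguments discussed so far (the data, annot, fmt, vmin, vmax, and center), here is a minimal sketch. It builds a small random DataFrame as a stand-in for the numeric columns of the Tips data, so the column values are made up and the resulting coefficients will not match the article's figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data with three numeric columns (values are synthetic).
rng = np.random.default_rng(1)
df = pd.DataFrame({"total_bill": rng.uniform(5, 50, 100)})
df["tip"] = 0.15 * df["total_bill"] + rng.normal(0, 1, 100)
df["size"] = rng.integers(1, 7, 100)

corr = df.corr()  # correlation matrix passed as the first argument
ax = sns.heatmap(
    corr,
    annot=True,   # write the coefficient in each cell
    fmt=".2g",    # two significant digits (the default)
    vmin=-1,      # color bar starts at -1
    vmax=1,       # color bar ends at 1
    center=0,     # color map is centered on 0
)
```

Fixing the range to [-1, 1] makes heatmaps of different correlation matrices directly comparable, because the same color always stands for the same coefficient.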
One obvious change, apart from the rescaling, is that the color changed. This has to do with
changing the center from None to Zero or any other number. But this does not mean we can’t
change the color back or to any other available color. Let’s see how to do this.
Aesthetic
6. By default, the thickness and color border of each row of the matrix are set at 0 and white,
respectively. There are times where the heatmap may look better with some border thickness and
a change of color. This is where the arguments linewidths and linecolor apply. Let's specify
the linewidths and the linecolor to 3 and black, respectively.
For the rest of this tutorial, we’ll switch back to the default cmap , linecolor,
and linewidths . This can be done either by passing the following
arguments: cmap=None , linecolor='white', and linewidths=0; or not passing the
arguments at all (which we’re going to do).
7. So far, the heatmap used has its color bar displayed vertically. This can be customized to be
horizontal instead by specifying the argument cbar_kws={'orientation': 'horizontal'}
8. There also might be instances where a heatmap may be better off not having a color bar at all.
This can be done by specifying cbar=False
For the rest of this tutorial, we will display the color bar.
9. Take a closer look at the shape of each matrix box above. They’re all rectangular in shape. We
can change them into squares by specifying the argument to square=True
Changing the matrix shape
Changing the whole shape of the matrix from rectangular to triangular is a little tricky. For this,
we'll need the NumPy functions np.triu() and np.tril(), together with the Seaborn
heatmap argument called mask=
np.triu() returns the upper triangle of any matrix given to it (entries below the diagonal
are set to zero), while np.tril() returns the lower triangle.
The idea is to pass the correlation matrix into the NumPy method and then pass this into the
mask argument in order to create a mask on the heatmap matrix. Let’s see how this works below.
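The masking idea can be sketched with NumPy alone. The 3 x 3 correlation matrix below is made up for illustration; the point is that cells where the mask is True are hidden, so a mask built with np.triu() hides the upper triangle and leaves the lower one visible.

```python
import numpy as np

# A made-up, symmetric correlation matrix.
corr = np.array([
    [1.0, 0.7, 0.6],
    [0.7, 1.0, 0.5],
    [0.6, 0.5, 1.0],
])

# True on and above the diagonal: these are the cells the heatmap will hide.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# For comparison, np.tril keeps the lower triangle and zeroes the rest.
lower = np.tril(corr)
print(lower)
```

Passing this to the heatmap as sns.heatmap(corr, mask=mask) would then draw only the cells below the diagonal; swapping np.triu for np.tril in the mask would show the upper triangle instead.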
We discovered 13 ways to customize our Seaborn heatmap for a correlation matrix. The
remaining 5 arguments are rarely used because they’re very specific to the nature of the data and
the associated goals. Full source code for this tutorial can be found on GitHub:
anitaokoh/Understanding-the-Seaborn-heatmap-function
This is a tutorial for the purpose of dissecting the seaborn.heatmap() function. The
arguments were broken down into 4…
github.com
https://www.drawingfromdata.com/setting-figure-size-using-seaborn-and-matplotlib
TL;DR
Setting figure sizes, like rotating axis tick labels, is one of those things that feels
like it should be very straightforward. However, it still manages to show up on the
first page of stackoverflow questions for both matplotlib and seaborn. Part of
the confusion arises because there are so many ways to do the same thing - this
highly upvoted question has six suggested solutions:
manually create an Axes object with the desired size
pass some configuration parameters to seaborn so that the size you want
is the default
call a method on the figure once it's been created
pass height and aspect keywords to the seaborn plotting function
use the matplotlib.pyplot interface and call the figure() function
use the matplotlib.pyplot interface to get the current figure then set its
size using a method
Let's jump in. As an example we'll use the olympic medal dataset, which we can
load directly from a URL:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
data = pd.read_csv("https://raw.githubusercontent.com/mojones/binders/master/olympics.csv", sep="\t")
data
     City    Year  Sport     ...  Medal   Country  Int Olympic Committee code
0    Athens  1896  Aquatics  ...  Gold    Hungary  HUN
1    Athens  1896  Aquatics  ...  Silver  Austria  AUT
3    Athens  1896  Aquatics  ...  Gold    Greece   GRE
...  ...     ...   ...       ...  ...     ...      ...
For our first figure, we'll count how many medals have been won in total by each
country, then take the top thirty:
data['Country'].value_counts().head(30)
United States 4335
Soviet Union 2049
United Kingdom 1594
France 1314
Italy 1228
...
Spain 377
Switzerland 376
Brazil 372
Bulgaria 331
Czechoslovakia 329
Name: Country, Length: 30, dtype: int64
data['Country'].value_counts().head(30).plot(kind='barh')
Ignoring other aesthetic aspects of the plot, it's obvious that we need to change
the size - or rather the shape. Part of the confusion over sizes in plotting is that
sometimes we need to just make the chart bigger or smaller, and sometimes we
need to make it thinner or fatter. If we just scaled up this plot so that it was big
enough to read the names on the vertical axis, then it would also be very wide.
We can set the size by adding a figsize keyword argument to our
pandas plot() function. The value has to be a tuple of sizes - it's actually the
horizontal and vertical size in inches, but for most purposes we can think of them
as arbitrary units.
Here's what happens if we make the plot bigger, but keep the original shape:
And here's a version that keeps the large vertical size but shrinks the chart
horizontally so it doesn't take up so much space:
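The two resized versions described above can be sketched as follows. The exact sizes used in the original figures are not given in the text, so (20, 10) and (7, 10) are assumptions, and the Series holds only the first five medal counts from the listing above instead of the full top thirty.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the value_counts() output shown earlier.
counts = pd.Series({
    "United States": 4335,
    "Soviet Union": 2049,
    "United Kingdom": 1594,
    "France": 1314,
    "Italy": 1228,
})

# Bigger plot, same 2:1 shape (width, height in inches; sizes are assumed).
ax_big = counts.plot(kind="barh", figsize=(20, 10))
big_size = tuple(ax_big.figure.get_size_inches())
plt.close(ax_big.figure)  # start the next plot on a fresh figure

# Same height, but narrower so it takes up less horizontal space.
ax_tall = counts.plot(kind="barh", figsize=(7, 10))
tall_size = tuple(ax_tall.figure.get_size_inches())
print(big_size, tall_size)
```

The first number in figsize is always the width and the second the height, regardless of whether the bars run horizontally or vertically.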
This time, we'll say that we want to make the plot longer in the horizontal
direction, to better see the pattern over time. If we search the documentation for
the matplotlib plot() function, we won't find any mention of size or shape. This
actually makes sense in the design of matplotlib - plots don't really have a size,
figures do. So to change it we have to call the figure() function:
plt.figure(figsize=(15,4))
plt.plot(data['Year'].value_counts().sort_index())
Notice that with the figure() function we have to call it before we make the call
to plot(), otherwise it won't take effect:
plt.plot(data['Year'].value_counts().sort_index())
plt.figure(figsize=(15,4))  # too late: this opens a new, empty figure
OK, now what if we're using seaborn rather than matplotlib? Well, happily the
same technique will work. We know from our first plot which countries have won
the most medals overall, but now let's look at how this varies by year. We'll
create a summary table to show the number of medals per year for all countries
that have won at least 500 medals total.
(ignore this pandas stuff if it seems confusing, and just look at the final table)
summary = (
data
.groupby('Country')
.filter(lambda x : len(x) > 500)
.groupby(['Country', 'Year'])
.size()
.to_frame('medal count')
.reset_index()
)
     Country         Year  medal count
0    Australia       1896  2
1    Australia       1900  5
2    Australia       1920  6
3    Australia       1924  10
4    Australia       192…  4
...  ...             ...   ...
309  United\nStates  1992  224
310  United\nStates  1996  260
311  United\nStates  2000  248
312  United\nStates  2004  264
313  United\nStates  2008  315
Now we can do a box plot to show the distribution of yearly medal totals for each
country:
This is hard to read because of all the names, so let's space them out a bit:
plt.figure(figsize=(20,5))
sns.boxplot(
data=summary,
x='Country',
y='medal count',
color='red')
Now we come to the final complication; let's say we want to look at the
distributions of the different medal types separately. We'll make a new summary
table - again, ignore the pandas stuff if it's confusing, and just look at the final
table:
summary_by_medal = (
data
.groupby('Country')
.filter(lambda x : len(x) > 500)
.groupby(['Country', 'Year', 'Medal'])
.size()
.to_frame('medal count')
.reset_index()
)
summary_by_medal['Country'] = summary_by_medal['Country'].str.replace(' ', '\n')
summary_by_medal
     Country         Year  Medal   medal count
0    Australia       1896  Gold    2
1    Australia       1900  Bronze  3
2    Australia       1900  Gold    2
3    Australia       1920  Bronze  1
4    Australia       1920  Silver  5
...  ...             ...   ...     ...
881  United\nStates  2004  Gold    116
882  United\nStates  2004  Silver  75
884  United\nStates  2008  Gold    125
885  United\nStates  2008  Silver  109
sns.catplot(
data=summary_by_medal,
x='Country',
y='medal count',
hue='Medal',
kind='box')
The reason for this is that the higher level plotting functions in seaborn (what the
documentation calls Figure-level interfaces) have a different way of managing
size, largely due to the fact that they often produce multiple subplots. To set the
size when using catplot() or relplot() (also pairplot() and lmplot(); jointplot()
takes height only, since its figure is always square), use the height keyword to
control the size and the aspect keyword to control the shape:
sns.catplot(
data=summary_by_medal,
x='Country',
y='medal count',
hue='Medal',
kind='box',
height=5, # make each facet 5 units high
aspect=3) # width should be three times height
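A quick way to convince yourself of how height and aspect combine is to inspect the resulting figure size. This is a minimal sketch on made-up data (the country labels A, B, C and the counts are hypothetical); it leaves out hue so that no legend is added, since an outside legend can widen the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the summary table: 3 countries, 20 yearly totals each.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "Country": np.repeat(["A", "B", "C"], 20),
    "medal count": rng.integers(0, 100, 60),
})

g = sns.catplot(
    data=toy,
    x="Country",
    y="medal count",
    kind="box",
    height=5,   # each facet is 5 inches high
    aspect=3,   # width = aspect * height = 15 inches
)
width, height = g.fig.get_size_inches()
print(width, height)
```

With a single facet, the figure comes out 15 inches wide and 5 inches high, i.e. width equals aspect times height.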
Printing a figure
Finally, a word about printing. If the reason that you need to change the size of a
plot, rather than the shape, is because you need to print it, then don't worry
about the size - get the shape that you want, then use savefig() to make the
plot in SVG format:
plt.savefig('medals.svg')
This will give you a plot in Scalable Vector Graphics format, which stores the
actual lines and shapes of the chart so that you can print it at any size - even a
giant poster - and it will look sharp. As a nice bonus, you can also edit individual
bits of the chart using a graphical SVG editor (Inkscape is free and powerful,
though takes a bit of effort to learn).
plt.figure(figsize=(30,10))
sns.heatmap(corr,annot=True,cmap="YlGnBu",fmt='.1g',square=True)
# plt.figure(figsize=(15,4))
# plt.savefig('medals.svg')
Introduction
There is just something extraordinary about a well-designed visualization. The colors stand out,
the layers blend nicely together, the contours flow throughout, and the overall package not only
has a nice aesthetic quality, but it provides meaningful insights to us as well.
This is quite important in data science where we often work with a lot of messy data. Having the
ability to visualize it is critical for a data scientist. Our stakeholders or clients will more often than
not rely on visual cues rather than the intricacies of a machine learning model.
There are plenty of excellent Python visualization libraries available, including the widely
used matplotlib. But seaborn stands out for me. It combines aesthetic appeal seamlessly with
technical insights, as we'll soon see.
In this article, we’ll learn what seaborn is and why you should use it ahead of matplotlib. We’ll
then use seaborn to generate all sorts of different data visualizations in Python. So put your
creative hats on and let’s get rolling!
Table of Contents
What is Seaborn?
Why should you use Seaborn versus matplotlib?
Setting up the Environment
Data Visualization using Seaborn
o Visualizing Statistical Relationships
o Plotting with Categorical Data
o Visualizing the Distribution of a Dataset
What is Seaborn?
Have you ever used the ggplot2 library in R? It’s one of the best visualization packages in any
tool or language. Seaborn gives me the same overall feel.
I’ve been talking about how awesome seaborn is so you might be wondering what all the fuss is
about.
I’ll answer that question comprehensively in a practical manner when we generate plots using
seaborn. For now, let’s quickly talk about how seaborn feels like it’s a step above matplotlib.
Seaborn makes our charts and plots look engaging and enables some of the common data
visualization needs (like mapping color to a variable or using faceting). Basically, it makes the
data visualization and exploration easy to conquer. And trust me, that is no easy task in data
science.
“If Matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make
a well-defined set of hard things easy too.” – Michael Waskom (Creator of Seaborn)
There are essentially a couple of (big) limitations in matplotlib that Seaborn fixes:
1. Seaborn comes with a large number of high-level interfaces and customized themes that
matplotlib lacks as it’s not easy to figure out the settings that make plots attractive
2. Matplotlib functions don’t work well with dataframes, whereas seaborn does
That second point stands out in data science since we work quite a lot with dataframes.
The seaborn library has four mandatory dependencies you need to have:
To install Seaborn and use it effectively, first, we need to install the aforementioned
dependencies. Once this step is done, we are all set to install Seaborn and enjoy its
mesmerizing plots. To install Seaborn, you can use the following line of code:
To import the dependencies and seaborn itself in your code, you can use the following code:
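The install and import snippets are missing from the text above, so the following is a minimal sketch of what they usually look like. It assumes installation with pip and uses the conventional aliases np, pd, plt, and sns; the list of four dependencies is not given in the text, so only the libraries used throughout these notes are imported here.

```python
# Install once from a shell, not from inside the script (assumes pip):
#   pip install seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Quick sanity check that the import worked.
print(sns.__version__)
```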
Datasets Used for Data Visualization
HR analytics challenge
Predict the number of Upvotes
I’ve picked these two because they contain a multitude of variables so we have plenty of options
to play around with. Both these datasets also mimic real-world scenarios so you’ll get an idea of
how data visualization and exploration work in the industry.
You can check out this and other high-quality datasets and hackathons on the DataHack
platform. So go ahead and download the above two datasets before you proceed. We’ll be
using them in tandem.
Let’s get started! I have divided this implementation section into two categories:
We’ll look at multiple examples of each category and how to plot it using seaborn.
Scatter plot
SNS.relplot
Hue plot
I have picked the ‘Predict the number of upvotes‘ project for this. So, let’s start by importing
the dataset in our working environment:
A scatterplot is perhaps the most common example of visualizing relationships between two
variables. Each point shows an observation in the dataset and these observations are
represented by dot-like structures. The plot shows the joint distribution of two variables using a
cloud of points.
To draw the scatter plot, we'll be using the relplot() function of the seaborn library. It is a
figure-level function for visualizing statistical relationships. By default, relplot() produces a
scatter plot:
SNS.relplot using Seaborn
The parameters – x, y, and data – represent the variables on X-axis, Y-axis and the data we
are using to plot respectively. Here, we’ve found a relationship between
the views and upvotes.
Next, if we want to see the tag associated with the data, we can use the below code:
Hue Plot
We can add another dimension in our plot with the help of hue as it gives color to the points and
each color has some meaning attached to it.
In the above plot, the hue semantic is categorical. That’s why it has a different color palette. If
the hue semantic is numeric, then the coloring becomes sequential.
We can also change the size of each point:
We can also change the size manually by using another parameter sizes as sizes = (15, 200).
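The plotting calls for this section did not survive in the text, so here is a hedged sketch of the progression it describes: a basic relplot(), then hue, then size with an explicit sizes range. The data is synthetic; views, upvotes, and tag are the column names mentioned above, while the answers column used for size and all the values are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the upvotes dataset.
rng = np.random.default_rng(3)
df = pd.DataFrame({"views": rng.integers(100, 10000, 150)})
df["upvotes"] = (df["views"] * 0.05 + rng.normal(0, 30, 150)).clip(lower=0)
df["tag"] = rng.choice(["a", "c", "j"], 150)   # categorical -> qualitative palette
df["answers"] = rng.integers(1, 10, 150)       # hypothetical column used for point size

# Color encodes the tag, point area encodes the number of answers,
# and sizes=(15, 200) sets the smallest and largest marker sizes.
g = sns.relplot(x="views", y="upvotes", hue="tag",
                size="answers", sizes=(15, 200), data=df)
```

Dropping hue, size, and sizes from the call gives the plain scatter plot that relplot() draws by default.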
Jitter
Hue
Boxplot
Violin Plot
Pointplot
In the above section, we saw how we can use different visual representations to show the
relationship between multiple variables. We drew the plots between two numeric variables. In
this section, we’ll see the relationship between two variables of which one would be categorical
(divided into different groups).
We’ll be using catplot() function of seaborn library to draw the plots of categorical data. Let’s
dive in
Jitter Plot
For the jitter plot we'll be using another dataset, from the HR analysis challenge problem. Let's
import the dataset now.
By default, catplot() displaces the points by a small random amount along the categorical axis so
that they overlap less. This random displacement from the true position is called jitter, and it is
why the plot looks scattered. We can turn it off by setting the jitter parameter to False.
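A minimal sketch of switching jitter off, on synthetic data. The column names education and avg_training_score come from the text; the category labels and scores are made up, since the HR dataset itself is not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the HR data (labels and values are hypothetical).
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "education": rng.choice(["Secondary", "Bachelor", "Master"], 120),
    "avg_training_score": rng.integers(40, 100, 120),
})

# jitter=False places every point exactly on its category position.
g = sns.catplot(x="education", y="avg_training_score", jitter=False, data=df)

# Collect the x positions of all plotted points to inspect them.
x_positions = np.concatenate(
    [np.asarray(c.get_offsets())[:, 0] for c in g.ax.collections]
)
print(np.unique(x_positions))
```

With jitter disabled, the points in each category line up in a single vertical column, which makes overlaps more visible.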
Hue Plot
Next, if we want to introduce another variable or another dimension in our plot, we can use
the hue parameter just like we used in the above section. Let’s say we want to see the gender
distribution in the plot of education and avg_training_score, to do that, we can use the following
code
In the above plots, we can see that the points overlap each other. To eliminate this, we can set
kind = "swarm"; swarm uses an algorithm that adjusts the points along the categorical axis so
that they don't overlap. Let's see what it looks like:
Pretty amazing, right? What if we want to see the swarmed version of the plot as well as a third
dimension? Let’s see how it goes if we introduce is_promoted as a new variable
Clearly people with higher scores got a promotion.
Another kind of plot we can draw is a boxplot, which shows the three quartile values of the
distribution along with the end values. Each value in the boxplot corresponds to an actual
observation in the data. Let's draw the boxplot now:
When we use a hue semantic with a boxplot, the box for each hue level is shifted along the
categorical axis so the boxes don't overlap. The boxplot with hue looks like this:
Violin Plot using seaborn
We can also represent the above variables differently by using violin plots. Let’s try it out
The violin plots combine the boxplot and kernel density estimation procedure to provide richer
description of the distribution of values. The quartile values are displayed inside the violin. We
can also split the violin when the hue semantic parameter has only two levels, which could also
be helpful in saving space on the plot. Let’s look at the violin plot with a split of levels.
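A hedged sketch of the split violin described above, on synthetic data. The columns education, avg_training_score, and gender are the ones named in the text, but the values are random and the category labels are hypothetical; split=True requires the hue variable to have exactly two levels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the HR data.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "education": rng.choice(["Secondary", "Bachelor", "Master"], 300),
    "avg_training_score": rng.normal(65, 12, 300),
    "gender": rng.choice(["f", "m"], 300),  # exactly two hue levels
})

# Each violin is split down the middle: one half per gender.
g = sns.catplot(x="education", y="avg_training_score", hue="gender",
                kind="violin", split=True, data=df)
```

Without split=True, seaborn would draw two full violins side by side for each education level, which takes twice the horizontal space.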
These amazing plots are the reason why I started using seaborn. It gives you a lot of options to
display the data. Next in line is the bar plot.
barplot() operates on the full dataset and shows the mean value by default. Let's look at it now.
Pointplot using seaborn
Another type of plot is the pointplot, which shows the point estimate and its
confidence interval. Pointplot connects points from the same hue category. This helps in
identifying how the relationship changes within a particular hue category. You can see how
a pointplot displays the information below.
As is clear from the above plot, those with higher scores are more likely to get a
promotion.
This is not the end, seaborn is a huge library with a lot of plotting functions for different
purposes. One such purpose is to introduce multiple dimensions. We can visualize higher
dimension relationships as well. Let’s check it out using swarm plot.
Whenever we are dealing with a dataset, we want to know how the data or the variables are
being distributed. Distribution of data could tell us a lot about the nature of the data, so let’s dive
into it.
Histogram
One of the most common plots you’ll come across while examining the distribution of a variable
is distplot. By default, distplot() function draws histogram and fits a Kernel Density Estimate.
Let’s check out how age is distributed across the data.
This clearly shows that the majority of people are in their late twenties and early thirties.
A histogram divides the data into bins and uses bars to show the number of observations
falling in each bin. We can also add a rugplot to it instead of the KDE (Kernel Density
Estimate). A rugplot draws a small vertical tick at every observation.
Hexplot
KDE plot
Boxen plot
Ridge plot (Joyplot)
Apart from visualizing the distribution of a single variable, we can see how two independent
variables are distributed with respect to each other. Bivariate means joint, so to visualize it, we
use jointplot() function of seaborn library. By default, jointplot draws a scatter plot. Let’s check
out the bivariate distribution between age and avg_training_score.
There are multiple ways to visualize a bivariate distribution. Let's look at a couple more.
A hexplot is the bivariate analog of a histogram: it shows the number of observations that fall
within hexagonal bins. This plot works well with large datasets. To draw a
hexplot, we set the kind parameter to "hex". Let's check it out now.
Next comes the KDE plot, another useful way to visualize
a bivariate distribution. Let's see how the above observations could also be shown by
using the jointplot() function and setting the kind parameter to "kde".
Now let's talk about my absolute favorite plot, the heatmap. Heatmaps are graphical
representations in which each value is represented as a color.
Another plot that we can use to show the bivariate distribution is the boxen plot. Boxen plots
were originally named letter-value plots because they show a large number of quantiles of a
variable, and these quantiles are also known as letter values. Plotting a large number of
quantiles provides more insight into the shape of the distribution. Boxen plots are similar to
box plots. Let's see how they can be used.
Ridge Plot using seaborn
The next plot is quite fascinating. It’s called ridge plot. It is also called joyplot. Ridge plot helps
in visualizing the distribution of a numeric value for several groups. These distributions could be
represented by using KDE plots or histograms. Now, let’s try to plot a ridge plot for age with
respect to gender.
We can also plot multiple bivariate distributions in a dataset by using the pairplot() function
of the seaborn library. This shows the relationship between each pair of columns in the dataset.
It also draws the univariate distribution of each variable on the diagonal axes. Let's see how it
looks.
End Notes
We've covered a lot of plots here. We saw how effective the seaborn library can be when it
comes to visualizing and exploring data (especially large datasets). We also discussed how to
use different plotting functions of the seaborn library for different kinds of data.
Like I mentioned earlier, the best way to learn seaborn (or any concept or library) is by
practicing it. The more you generate new visualizations on your own, the more confident you’ll
become. Go ahead and try your hand at any practice problem on the DataHack platform and
start becoming a data visualization master!
https://stackabuse.com/seaborn-library-for-data-visualization-in-python-part-1/
Introduction
Although the Seaborn library can be used to draw a variety of charts such as
matrix plots, grid plots, and regression plots, in this article we will see how
the Seaborn library can be used to draw distributional and categorical plots. In
the second part of the series, we will see how to draw regression plots,
matrix plots, and grid plots.
Alternatively, if you are using the Anaconda distribution of Python, you can
execute the following command to download the seaborn library:
The Dataset
The dataset that we are going to use to draw our plots will be the Titanic
dataset, one of the example datasets that Seaborn can fetch from its online
repository. All you have to do is use the load_dataset function and pass it
the name of the dataset.
Let's see what the Titanic dataset looks like. Execute the following script:
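import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the example Titanic dataset and preview the first five rows
dataset = sns.load_dataset('titanic')
dataset.head()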
The script above loads the Titanic dataset and displays the first five rows of
the dataset using the head function. The output looks like this:
The dataset contains 891 rows and 15 columns and contains information
about the passengers who boarded the unfortunate Titanic ship. The original
task is to predict whether or not the passenger survived depending upon
different features such as their age, ticket, cabin they boarded, the class of
the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
Distributional Plots
Distributional plots, as the name suggests, are plots that show the
statistical distribution of data. In this section we will see some of the most
commonly used distribution plots in Seaborn.
sns.distplot(dataset['fare'])
Output:
You can see that most of the tickets were sold for between 0 and 50 dollars.
The line that you see represents the kernel density estimation. You can
remove this line by passing False as the value of the kde parameter as
shown below:
sns.distplot(dataset['fare'], kde=False)
Output:
Now you can see there is no line for the kernel density estimation on the
plot.
You can also pass a value for the bins parameter in order to see more or
less detail in the graph. Take a look at the following script:
Here we set the number of bins to 10. In the output, you will see the data
distributed in 10 bins as shown below:
Output:
You can clearly see that for more than 700 passengers, the ticket price is
between 0 and 50.
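As a rough illustration of what the bins parameter controls, the sketch below counts synthetic, made-up fares (not the Titanic data) into 10 bins with NumPy and then draws the same histogram with plain Matplotlib instead of distplot(). The variable names are assumptions chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed "fares" (an assumption, not the Titanic data)
rng = np.random.default_rng(0)
fares = rng.exponential(scale=30.0, size=500)

# Count the observations that fall into each of 10 equal-width bins
counts, edges = np.histogram(fares, bins=10)

# Draw the same 10-bin histogram
fig, ax = plt.subplots()
ax.hist(fares, bins=10)
ax.set_xlabel("fare")
ax.set_ylabel("number of passengers")
```

Ten bins always produce eleven bin edges, and the counts add up to the number of observations.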
Output:
From the output, you can see that a joint plot has three parts: a distribution
plot at the top for the column on the x-axis, a distribution plot on the right for
the column on the y-axis, and a scatter plot in between that shows the mutual
distribution of data for both the columns. You can see that there is no
correlation observed between the ages and the fares.
You can change the type of the joint plot by passing a value for
the kind parameter. For instance, if instead of scatter plot, you want to
display the distribution of data in the form of a hexagonal plot, you can pass
the value hex for the kind parameter. Look at the following script:
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')
Output:
In the hexagonal plot, the hexagon with the largest number of points gets the darkest
color. So if you look at the above plot, you can see that most of the
passengers are between age 20 and 30 and most of them paid between 10 and 50
for the tickets.
The Pair Plot
# Remove rows with missing values before drawing the pair plot
dataset = dataset.dropna()
sns.pairplot(dataset)
From the output of the pair plot you can see the joint plots for all the numeric
and Boolean columns in the Titanic dataset.
To add information from the categorical column to the pair plot, you can pass
the name of the categorical column to the hue parameter. For instance, if
we want to plot the gender information on the pair plot, we can execute the
following script:
sns.pairplot(dataset, hue='sex')
Output:
In the output you can see the information about the males in orange and the
information about the females in blue (as shown in the legend). From the joint
plot on the top left, you can clearly see that among the surviving passengers,
the majority were female.
sns.rugplot(dataset['fare'])
Output:
From the output, you can see that, as was the case with the distplot(), most
of the instances for the fares have values between 0 and 100.
These are some of the most commonly used distribution plots offered by
Python's Seaborn library. Let's see some of the categorical plots in the Seaborn
library.
Categorical Plots
Output:
From the output, you can clearly see that the average age of male
passengers is just less than 40 while the average age of female passengers
is around 33.
In addition to finding the average, the bar plot can also be used to calculate
other aggregate values for each category. To do so, you need to pass the
aggregate function to the estimator parameter. For instance, you can calculate the
standard deviation for the age of each gender as follows:
import numpy as np
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
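To make the aggregation concrete, here is a sketch that computes the per-category mean and standard deviation with pandas alone, on a tiny hand-made table (the values are assumptions, not Titanic data). These are the numbers a bar plot would draw as bar heights for the corresponding estimator.

```python
import numpy as np
import pandas as pd

# Tiny hand-made table (an assumption, not the Titanic data)
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20, 30, 40, 25, 35, 45],
})

by_sex = df.groupby("sex")["age"]
mean_age = by_sex.mean()      # what the default (mean) estimator shows
std_age = by_sex.std(ddof=0)  # population std, the same convention as np.std
```

Note that pandas uses the sample standard deviation (ddof=1) by default, while np.std uses ddof=0, which is why ddof is set explicitly here.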
The count plot is similar to the bar plot; however, it displays the count of the
categories in a specific column. For instance, if we want to count the number
of male and female passengers, we can do so using the count plot as follows:
sns.countplot(x='sex', data=dataset)
Output:
The Box Plot
The box plot is used to display the distribution of the data in each category in the
form of quartiles. The line in the center of the box shows the median value. The range
from the lower whisker to the bottom of the box covers the first quartile.
From the bottom of the box to the middle of the box lies the second quartile.
From the middle of the box to the top of the box lies the third quartile, and
finally from the top of the box to the top whisker lies the last quartile.
Now let's plot a box plot that displays the distribution for the age with
respect to each gender. You need to pass the categorical column as the first
parameter (which is sex in our case) and the numeric column (age in our
case) as the second parameter. Finally, the dataset is passed as the third
parameter. Take a look at the following script:
Output:
Let's try to understand the box plot for females. The first quartile starts at
around 5 and ends at 22, which means that 25% of the female passengers are aged
between 5 and 22. The second quartile starts at around 23 and ends at
around 32, which means that 25% of the passengers are aged between 23 and
32. Similarly, the third quartile starts and ends between 34 and 42, hence
25% of the passengers are aged within this range, and finally the fourth or last
quartile starts at 43 and ends around 65.
Passengers whose ages lie beyond the whiskers, and therefore do not belong to any
of the quartile ranges, are called outliers and are represented by dots on the box
plot.
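The quantities a box plot draws can be computed directly. The sketch below uses made-up ages and the common 1.5 × IQR rule for the whisker limits, which is the default in Matplotlib and seaborn box plots. The variable names are assumptions for illustration.

```python
import numpy as np

# Made-up ages (an assumption, not the Titanic data)
ages = np.array([2, 18, 22, 24, 27, 29, 32, 35, 38, 42, 80])

# Quartile boundaries: bottom of the box, median line, top of the box
q1, median, q3 = np.percentile(ages, [25, 50, 75])

# Whiskers reach the most extreme points inside these fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is drawn as an individual outlier dot
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
```

For these values the box spans 23.0 to 36.5 with the median at 29.0, and the ages 2 and 80 fall outside the fences.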
You can make your box plots more informative by adding another layer of
distribution. For instance, if you want to see the box plots for the age of
passengers of both genders, along with the information about whether or not
they survived, you can pass survived as the value of the hue parameter as
shown below:
Output:
Now, in addition to the information about the age of each gender, you can
also see the distribution of the passengers who survived. For instance, you
can see that among the male passengers, on average younger people
survived more often than older ones. Similarly, you can see that the
variation in the age of female passengers who did not survive is much
greater than that of the surviving female passengers.
The violin plot is similar to the box plot; however, instead of showing only the
quartiles, it shows an estimate of the full distribution of the data points.
The violinplot() function is used to plot the violin plot. Like the box plot,
the first parameter is the categorical column, the second parameter is the
numeric column, and the third parameter is the dataset.
Let's plot a violin plot that displays the distribution for the age with respect
to each gender.
You can see from the figure above that violin plots provide much more
information about the data than the box plot. Instead of plotting only
the quartiles, the violin plot shows the estimated density of the
data. The area where the violin plot is thicker has a higher
number of instances for the age. For instance, from the violin plot for males,
it is clearly evident that the number of passengers with age between 20 and
40 is higher than in all the other age brackets.
Like box plots, you can also add another categorical variable to the violin
plot using the hue parameter as shown below:
Instead of plotting two different graphs for the passengers who survived and
those who did not, you can have one violin plot divided into two halves,
where one half represents surviving while the other half represents the non-
surviving passengers. To do so, you need to pass True as value for
the split parameter of the violinplot() function. Let's see how we can do
this:
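A minimal sketch of a split violin plot on synthetic data is given here. The column names mirror the Titanic columns sex, age, and alive, but the values are randomly generated assumptions. The hue column needs exactly two levels for split=True to apply.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic passengers (assumptions, not the Titanic data)
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], n),
    "alive": rng.choice(["no", "yes"], n),
    "age": rng.normal(30, 12, n).clip(1, 80),
})

# One violin per sex; each half shows one level of the two-level hue column
fig, ax = plt.subplots()
sns.violinplot(x="sex", y="age", hue="alive", data=df, split=True, ax=ax)
```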
Both violin and box plots can be extremely useful. However, as a rule of
thumb if you are presenting your data to a non-technical audience, box plots
should be preferred since they are easy to comprehend. On the other hand, if
you are presenting your results to the research community it is more
convenient to use violin plot to save space and to convey more information in
less time.
The strip plot draws a scatter plot where one of the variables is categorical.
We have seen scatter plots in the joint plot and the pair plot sections where
we had two numeric variables. The strip plot is different in a way that one of
the variables is categorical in this case, and for each category in the
categorical variable, you will see scatter plot with respect to the numeric
column.
The stripplot() function is used to plot the strip plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the
numeric column, and the third parameter is the dataset. Look at the
following script:
Output:
You can see the scatter plots of age for both males and females. The data
points look like strips. It is difficult to comprehend the distribution of data in
this form. To better comprehend the data, pass True for
the jitter parameter, which adds some random noise to the position of the points
along the categorical axis. Look at the following script:
Output:
Now you have a better view for the distribution of age across the genders.
Like violin and box plots, you can add an additional categorical column to
strip plot using hue parameter as shown below:
Like violin plots, we can also split the strip plots. Execute the following
script:
Output:
Now you can clearly see the difference in the distribution for the age of both
male and female passengers who survived and those who did not survive.
The swarm plot is a combination of the strip and the violin plots. In
swarm plots, the points are adjusted in such a way that they don't overlap.
Let's plot a swarm plot for the distribution of age against gender.
The swarmplot() function is used to plot the swarm plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the
numeric column, and the third parameter is the dataset. Look at the
following script:
Output:
From the output, it is evident that the ratio of surviving males is less than the
ratio of surviving females, since for the male plot there are more blue points
and fewer orange points. On the other hand, for females, there are more
orange points (surviving) than blue points (not surviving). Another
observation is that amongst males of age less than 10, more passengers
survived as compared to those who didn't.
We can also split swarm plots as we did in the case of strip and violin plots.
Execute the following script to do so:
Output:
Now you can clearly see that more women survived, as compared to men.
Swarm plots are not recommended if you have a huge dataset since they do
not scale well because they have to plot each data point. If you really like
swarm plots, a better way is to combine two plots. For instance, to combine
a violin plot with swarm plot, you need to execute the following script:
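One way to sketch this combination is to draw both plots on the same Matplotlib axes. The data below are synthetic assumptions, and the black marker color is only a styling choice so the points stand out against the violins.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small synthetic sample (an assumption); swarm plots suit small data
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], 60),
    "age": rng.normal(30, 12, 60).clip(1, 80),
})

# Draw the violins first, then overlay the individual points
fig, ax = plt.subplots()
sns.violinplot(x="sex", y="age", data=df, ax=ax)
sns.swarmplot(x="sex", y="age", data=df, color="black", ax=ax)
```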
Output:
Conclusion
Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and
columns. Heat maps are the prime examples of matrix plots.
Heat Maps
Heat maps are normally used to plot correlation between numeric columns in
the form of a matrix. It is important to mention here that to draw matrix
plots, you need to have meaningful information on rows as well as columns.
Continuing with the theme from the last article, let's display the first five rows
of the Titanic dataset to see if both the rows and column headers have
meaningful information. Execute the following script:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
In the output, you will see the following result:
From the output, you can see that the column headers contain useful
information such as whether passengers survived, their age, fare, etc. However, the
row headers only contain the indexes 0, 1, 2, etc. To plot matrix plots, we need
useful information on both column and row headers. One way to do this is to
call the corr() method on the dataset. The corr() method returns the
correlation between all the numeric columns of the dataset. Execute the
following script:
# numeric_only=True (pandas 1.5+) skips text columns such as 'sex'
dataset.corr(numeric_only=True)
In the output, you will see that both the columns and the rows have
meaningful header information, as shown below:
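The same idea can be checked on a small hand-made table (the values are assumptions for illustration). Only the numeric columns are kept, and the resulting correlation matrix is labeled with the column names on both its rows and its columns, which is what a matrix plot needs.

```python
import pandas as pd

# Hand-made table with one non-numeric column (values are assumptions)
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "pclass": [3, 1, 3, 1, 1],
    "sex": ["male", "female", "female", "female", "male"],
})

# Keep numeric columns only, then compute pairwise correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
```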
Now to create a heat map with these correlation values, you need to call
the heatmap() function and pass it your correlation dataframe. Look at the
following script:
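corr = dataset.corr(numeric_only=True)
sns.heatmap(corr)
# Pass annot=True to print each correlation value inside its cell
corr = dataset.corr(numeric_only=True)
sns.heatmap(corr, annot=True)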
Output:
You can also change the color of the heatmap by passing an argument for
the cmap parameter. For now, just look at the following script:
corr = dataset.corr(numeric_only=True)
sns.heatmap(corr, cmap='winter')
Execute the following script to import the data set and to see the first five
rows of the dataset:
import pandas as pd
import numpy as np
dataset.head()
Output:
sns.heatmap(data)
Currently, you can see that the boxes or the cells are overlapping in some
cases and the distinction between the boundaries of the cells is not very
clear. To create a clear boundary between the cells, you can make use of
the linecolor and linewidths parameters. Take a look at the following
script:
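A minimal sketch of such a script follows, assuming a small hand-made table of passenger counts per month and year (the numbers and the blue line color are assumptions). The long table is first pivoted so that both rows and columns carry meaningful labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made long-format table (values are assumptions)
long_df = pd.DataFrame({
    "year": [1949] * 3 + [1950] * 3,
    "month": ["Jan", "Feb", "Mar"] * 2,
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Months become row labels, years become column labels
data = long_df.pivot_table(index="month", columns="year", values="passengers")

# linecolor and linewidths draw a visible border around every cell
fig, ax = plt.subplots()
sns.heatmap(data, linecolor="blue", linewidths=1, ax=ax)
```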
Cluster Map
In addition to heat map, another commonly used matrix plot is the cluster
map. The cluster map basically uses Hierarchical Clustering to cluster the
rows and columns of the matrix.
Let's plot a cluster map for the number of passengers who traveled in a
specific month of a specific year. Execute the following script:
sns.clustermap(data)
To plot a cluster map, clustermap function is used, and like the heat map
function, the dataset passed should have meaningful headers for both rows
and columns. The output of the script above looks like this:
In the output, you can see months and years clustered together on the basis
of number of passengers that traveled in a specific month.
With this, we conclude our discussion about the Matrix plots. In the next
section we will start our discussion about grid capabilities of the Seaborn
library.
Seaborn Grids
Pair Grid
In Part 1 of this article series, we saw how pair plot can be used to draw
scatter plot for all possible combinations of the numeric columns in the
dataset.
Let's revisit the pair plot here before we move on to the pair grid. The
dataset we are going to use for the pair grid section is the "iris" dataset,
another example dataset that the load_dataset function can fetch.
Execute the following script to load the iris dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = sns.load_dataset('iris')
dataset.head()
The first five rows of the iris dataset look like this:
Now let's draw a pair plot on the iris dataset. Execute the following script:
sns.pairplot(dataset)
Output:
In the output, you can see empty grids. This is essentially what the pair grid
function does. It returns an empty set of grids for all the features in the
dataset.
Next, you need to call map function on the object returned by the pair grid
function and pass it the type of plot that you want to draw on the grids. Let's
plot a scatter plot using the pair grid.
grids = sns.PairGrid(dataset)
grids.map(plt.scatter)
grids = sns.PairGrid(dataset)
grids.map_diag(sns.distplot)
grids.map_upper(sns.kdeplot)
grids.map_lower(plt.scatter)
Facet grids are used to plot one or more numeric features separately for each
combination of levels of one or more categorical features. Let's plot a facet grid which
shows the distribution of the age of the passengers for each combination of gender
and alive.
For this section, we will again use the Titanic dataset. Execute the following
script to load the Titanic dataset:
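import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
# One facet per combination: rows split by sex, columns by survival
grid = sns.FacetGrid(data=dataset, col='alive', row='sex')
grid.map(sns.distplot, 'age')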
In the above script, we plot the distributional plot for age on the facet grid.
The output looks like this:
From the output, you can see four plots. One for each combination of gender
and survival of the passenger. The columns contain information about the
survival while the rows contain information about the sex, as specified by
the FacetGrid() function.
The first row and first column contain age distribution of the passengers
where sex is male and the passengers did not survive. The first row and
second column contain age distribution of the passengers where sex is male
and the passengers survived. Similarly, the second row and first column
contain age distribution of the passengers where sex is female and the
passengers did not survive while the second row and second column contain
age distribution of the passengers where sex is female and the passengers
survived.
In addition to distributional plots for one feature, we can also plot scatter
plots that involve two features on the facet grid.
For instance, the following script plots the scatter plot for age and fare for
both the genders of the passengers who survived and who didn't.
In this section, we will study the linear model plot that plots a linear
relationship between two variables along with the best-fit regression line
depending upon the data.
The dataset that we are going to use for this section is the "diamonds"
dataset which is downloaded by default with the seaborn library. Execute
the following script to load the dataset:
import pandas as pd
import numpy as np
dataset = sns.load_dataset('diamonds')
dataset.head()
Let's plot the linear relationship between the carat and the price of the diamond.
Ideally, the heavier the diamond is, the higher the price should be. Let's see
if this is actually true based on the information available in the diamonds
dataset.
You can also plot multiple linear models based on a categorical feature. The
feature name is passed as value to the hue parameter. For instance, if you
want to plot multiple linear models for the relationship between carat and
price feature, based on the cut of the diamond, you can use lmplot function
as follows:
From the output, you can see that the linear relationship between the carat
and the price of the diamond is steepest for the ideal cut diamond as
expected and the linear model is shallowest for fair cut diamond.
In addition to plotting the data for the cut feature with different hues, we can
also have one plot for each cut. To do so, you need to pass the column name
to the col parameter. Take a look at the following script:
In the output, you will see a separate column for each value in the cut
column of the diamonds dataset as shown below:
You can also change the size and aspect ratio of the plots using
the aspect and height parameters (height was called size before seaborn 0.9).
Take a look at the following script:
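A sketch of such a script on synthetic data follows, assuming seaborn 0.9 or later so that the height parameter is available. The carat, price, and cut values are invented for illustration, and the price is constructed to grow roughly linearly with carat.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic diamonds: price grows roughly linearly with carat (assumptions)
rng = np.random.default_rng(6)
carat = rng.uniform(0.3, 2.0, 100)
df = pd.DataFrame({
    "carat": carat,
    "price": 4000 * carat + rng.normal(0, 300, 100),
    "cut": np.tile(["Fair", "Ideal"], 50),
})

# One column of plots per cut; each facet is 4 inches tall and 0.8 * 4 inches wide
grid = sns.lmplot(x="carat", y="price", data=df, col="cut", height=4, aspect=0.8)
```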
Plot Styling
Set Style
The set_style() function is used to set the style of the plot background and grid.
You can pass darkgrid, whitegrid, dark, white, or ticks as the argument to
the set_style function.
For this section, we will again use the Titanic dataset. Execute the
following script to see the darkgrid style.
sns.set_style('darkgrid')
sns.distplot(dataset['fare'])
In the output, you can see that we have a dark background with grid lines. Let's
see what whitegrid looks like. Execute the following script:
sns.set_style('whitegrid')
sns.distplot(dataset['fare'])
Since Seaborn uses Matplotlib functions behind the scenes, you can use
Matplotlib's pyplot package to change the figure size as shown below:
plt.figure(figsize=(8,4))
sns.distplot(dataset['fare'])
In the script above, we set the width and height of the plot to 8 and 4 inches
respectively. The output of the script above looks like this:
Set Context
Apart from the notebook, you may need to create plots for posters. To do so,
you can use the set_context() function and pass it poster as the only
argument as shown below:
sns.set_context('poster')
sns.distplot(dataset['fare'])
In the output, you should see a plot with the poster specifications as shown
below. For instance, you can see that the fonts are much bigger compared to
normal plots.
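The effect of the context on font sizes can also be inspected without drawing anything. The sketch below compares the base font size of the notebook and poster contexts using seaborn's plotting_context() function and then sets the style, context, and figure size together. The variable names are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Background and grid style
sns.set_style("whitegrid")

# plotting_context() returns the parameters a context would apply
notebook_font = sns.plotting_context("notebook")["font.size"]
poster_font = sns.plotting_context("poster")["font.size"]

# Apply the poster context, then create an 8 x 4 inch figure
sns.set_context("poster")
fig = plt.figure(figsize=(8, 4))
width, height = fig.get_size_inches()
```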
Conclusion
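Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.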
Python offers multiple great graphing libraries that come packed with lots of different features.
No matter if you want to create interactive, live or highly customized plots python has an
In this article, we will learn how to create basic plots using Matplotlib, Pandas visualization and
Seaborn as well as how to use some specific features of each library. This article will focus on the
syntax and not on interpreting the graphs, which I will cover in another blog post.
In further articles, I will go over interactive plotting tools like Plotly, which is built on D3 and can
Importing Datasets
In this article, we will use two datasets which are freely available. The Iris and Wine
Matplotlib
Matplotlib is the most popular python plotting library. It is a low-level library with a Matlab like
interface which offers lots of freedom at the cost of having to write more code.
Matplotlib is specifically good for creating basic graphs like line charts, bar charts, histograms and
Scatter Plot
To create a scatter plot in Matplotlib we can use the scatter method. We will also create a figure
and an axis using plt.subplots so we can give our plot a title and labels.
We can give the graph more meaning by coloring in each data-point by its class. This can be done
by creating a dictionary which maps from class to color and then scattering each point on its own
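A sketch of this coloring approach is shown below on a six-row hand-made table. The measurements, class labels, and color codes are assumptions for illustration, and the points are scattered one class at a time rather than one point at a time.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hand-made stand-in for the Iris data (values are assumptions)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Dictionary mapping each class to a color
colors = {"Iris-setosa": "r", "Iris-versicolor": "g", "Iris-virginica": "b"}

fig, ax = plt.subplots()
for name, group in iris.groupby("class"):
    # One scatter call per class, colored through the dictionary
    ax.scatter(group["sepal_length"], group["sepal_width"],
               color=colors[name], label=name)
ax.set_title("Iris Dataset")
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.legend()
```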
Line Chart
In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple
columns in one graph, by looping through the columns we want and plotting each column on the
same axis.
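The loop described above could look roughly like this. The four-row table is a hand-made assumption standing in for the Iris data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hand-made stand-in for the Iris data (values are assumptions)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3],
    "sepal_width": [3.5, 3.0, 3.2, 3.3],
    "petal_length": [1.4, 1.4, 4.7, 6.0],
    "petal_width": [0.2, 0.2, 1.4, 2.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Every column except the class label
columns = iris.columns.drop("class")
x_data = range(len(iris))

fig, ax = plt.subplots()
for column in columns:
    # Plot each numeric column as its own line on the same axis
    ax.plot(x_data, iris[column], label=column)
ax.set_title("Iris Dataset")
ax.legend()
```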
Histogram
In Matplotlib we can create a Histogram using the hist method. If we pass it categorical data like
the points column from the wine-review dataset it will automatically calculate how often each
class occurs.
Bar Chart
A bar chart can be created using the bar method. The bar chart doesn't automatically calculate the
frequency of a category, so we are going to use the pandas value_counts function to do this. The
bar chart is useful for categorical data that doesn't have a lot of different categories (fewer than 30).
Figure 8: Bar-Chart
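A minimal sketch of this two-step approach follows, using a handful of made-up point scores rather than the wine-review data. The frequencies are counted with value_counts() first and then passed to the bar method.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up review scores (an assumption, not the wine-review data)
points = pd.Series([87, 90, 87, 92, 90, 87])

# Count how often each score occurs, ordered by score
counts = points.value_counts().sort_index()

# One bar per distinct score, with the frequency as bar height
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("Points")
ax.set_ylabel("Frequency")
```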
Pandas Visualization
Pandas is an open source high-performance, easy-to-use library providing data structures, such as
dataframes, and data analysis tools like the visualization tools we will use in this article.
Pandas Visualization makes it really easy to create plots out of a pandas dataframe and series. It
also has a higher level API than Matplotlib and therefore we need less code for the same results.
To create a scatter plot in Pandas we can call <dataset>.plot.scatter() and pass it two arguments,
the name of the x-column as well as the name of the y-column. Optionally we can also pass it a
title.
As you can see in the image it is automatically setting the x and y label to the column names.
Line Chart
needed to loop-through each column we wanted to plot, in Pandas we don’t need to do this
because it automatically plots all available numeric columns (at least if we don’t specify a specific
column/s).
iris.drop(['class'], axis=1).plot.line(title='Iris Dataset')
If we have more than one feature Pandas automatically creates a legend for us, as can be seen in
Histogram
In Pandas, we can create a Histogram with the plot.hist method. There aren’t any required
arguments but we can optionally pass some like the bin size.
wine_reviews['points'].plot.hist()
Figure 11: Histogram
The subplots argument specifies that we want a separate plot for each feature and the layout
Bar Chart
To plot a bar-chart we can use the plot.bar() method, but before we can call this we need to get
our data. For this we will first count the occurrences using the value_counts() method and then sort
wine_reviews['points'].value_counts().sort_index().plot.bar()
It’s also really simple to make a horizontal bar-chart using the plot.barh() method.
wine_reviews['points'].value_counts().sort_index().plot.barh()
wine_reviews.groupby("country").price.mean().sort_values(ascending=False)[:5].plot.bar()
In the example above we grouped the data by country and then took the mean of the wine prices,
ordered it, and plotted the 5 countries with the highest average wine price.
Seaborn
Seaborn has a lot to offer. You can create graphs in one line that would take you dozens of
lines in Matplotlib. Its standard designs are awesome and it also has a nice interface for working
Scatter plot
We can use the .scatterplot method for creating a scatterplot, and just as in Pandas we need to
pass it the column names of the x and y data, but now we also need to pass the data as an
additional argument because we aren’t calling the function on the data directly as we did in
Pandas.
We can also highlight the points by class using the hue argument, which is a lot easier than in
Matplotlib.
sns.scatterplot(x='sepal_length', y='sepal_width', hue='class', data=iris)
Line chart
To create a line-chart the sns.lineplot method can be used. The only required argument is the
data, which in our case are the four numeric columns from the Iris dataset. We could also use the
sns.kdeplot method which rounds off the edges of the curves and therefore is cleaner if you have
sns.lineplot(data=iris.drop(['class'], axis=1))
Figure 18: Line Chart
Histogram
To create a histogram in Seaborn we use the sns.distplot method. We need to pass it the column
we want to plot and it will calculate the occurrences itself. We can also pass it the number of
bins, and whether we want to plot a gaussian kernel density estimate inside the graph.
Bar chart
In Seaborn a bar-chart can be created using the sns.countplot method and passing it the data.
sns.countplot(x=wine_reviews['points'])
Now that you have a basic understanding of the Matplotlib, Pandas Visualization and Seaborn
syntax I want to show you a few other graph types that are useful for extracting insights.
For most of them, Seaborn is the go-to library because of its high-level interface that allows for
Box plots
A Box Plot is a graphical method of displaying the five-number summary. We can create box plots
using Seaborn's sns.boxplot method and passing it the data as well as the x and y column names.
df = wine_reviews[(wine_reviews['points']>=95) &
(wine_reviews['price']<1000)]
sns.boxplot(x='points', y='price', data=df)
Box Plots, just like bar-charts are great for data with only a few categories but can get messy
really quickly.
Heatmap
A heatmap is a graphical representation of data in which the individual values of
a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features
in a dataset.
To get the correlation of the features inside a dataset we can call <dataset>.corr(), which is a
Matplotlib:
# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
# numeric_only=True (pandas 1.5+) skips the text column 'class'
sns.heatmap(iris.corr(numeric_only=True), annot=True)
Faceting is the act of breaking data variables up across multiple subplots and combining those
To use one kind of faceting in Seaborn we can use the FacetGrid. First of all, we need to define
the FacetGrid and pass it our data as well as a row or column, which will be used to split the data.
Then we need to call the map function on our FacetGrid object and define the plot type we want
g = sns.FacetGrid(iris, col='class')
g = g.map(sns.kdeplot, 'sepal_length')
Figure 26: Facet-plot
You can make plots a lot bigger and more complicated than the example above. You can find a
few examples here.
Pairplot
Lastly, I will show you Seaborn's pairplot and Pandas' scatter_matrix, which enable you to plot a
sns.pairplot(iris)
from pandas.plotting import scatter_matrix
fig, ax = plt.subplots(figsize=(12,12))
scatter_matrix(iris, alpha=1, ax=ax)
other. The diagonal of the graph is filled with histograms and the other plots are scatter plots.
Conclusion
Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.
Python offers multiple great graphing libraries that come packed with lots of different features. In
If you liked this article consider subscribing on my Youtube Channel and following me on social
media.
If you have any questions, recommendations or critiques, I can be reached via Twitter or the
comment section.