
### py DATA VIS

Multiple plots on single axis


The data set here comes from records of undergraduate degrees awarded to women in a
variety of fields from 1970 to 2011. You can compare trends in degrees most easily
by viewing two curves on the same set of axes. Here, three NumPy arrays have been
pre-loaded for you: year (enumerating years, 1970 - 2011 inclusive),
physical_sciences (the percentage of Physical Sci degrees awarded to women in each
corresponding year), and computer_science (the percentage of Computer Sci degrees
awarded to women in each year).

You will issue two plt.plot() commands to draw line plots of different colors on
the same set of axes. Here, year represents the x-axis, while physical_sciences and
computer_science are plotted on the y-axis.

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot in blue the % of degrees awarded to women in the Physical Sciences


plt.plot(year, physical_sciences, color='blue')
# Plot in red the % of degrees awarded to women in Computer Science
plt.plot(year, computer_science, color='red')
plt.show()

Using axes()
Rather than overlaying line plots on common axes, you may prefer to plot different
line plots on distinct axes. The command plt.axes() is one way to do this (but it
requires specifying coordinates relative to the size of the figure).

Calling plt.axes([xlo, ylo, width, height]) creates a set of axes and makes it
active, with its lower-left corner at (xlo, ylo) and the specified width and height.
These coordinates can be passed to plt.axes() in the form of a list or a tuple. The
coordinates and lengths are values between 0 and 1 representing lengths relative to
the dimensions of the figure. After a call to plt.axes(), subsequent plots are drawn
in that set of axes.
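
As a minimal sketch of the arithmetic behind these figure-relative coordinates, the helper below computes [xlo, ylo, width, height] rectangles for n panels placed side by side. The name side_by_side_rects and its margin and gap parameters are hypothetical (they are not part of Matplotlib), and the sketch assumes the same margin on every edge and the same gap between panels.

```python
def side_by_side_rects(n, margin=0.05, gap=0.05):
    """Return [xlo, ylo, width, height] for n panels in a single row.

    Hypothetical helper for illustration only. All values are fractions
    of the figure size, which is what plt.axes() expects.
    """
    # Width left over after removing both outer margins and the inner gaps
    width = (1.0 - 2 * margin - (n - 1) * gap) / n
    height = 1.0 - 2 * margin
    return [[margin + i * (width + gap), margin, width, height]
            for i in range(n)]


rects = side_by_side_rects(2)
for rect in rects:
    print([round(v, 3) for v in rect])
```

Each rectangle can then be passed to plt.axes(rect) before the corresponding plt.plot() call.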

# Create plot axes for the first line plot: blue for %women Phys Sci degree
plt.axes([.05,.05,.425,.9])
plt.plot(year, physical_sciences, color='blue')

# Create plot axes for the second line plot: red %women Comp-Sci
plt.axes([.525, .05, .425, .9])
plt.plot(year, computer_science, color='red')

# Display the plot


plt.show()

Using subplot() (1)


With plt.axes(), the coordinates of the axes need to be set manually. A better
alternative is plt.subplot(), which lays out the axes automatically.

In this exercise, rather than plt.axes(), use plt.subplot(m, n, k) to make a
subplot grid of dimensions m by n and to make the kth subplot active (subplots are
numbered starting from 1, row-wise from the top left corner of the grid).
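
The row-wise numbering can be sketched with a small helper that converts the index k into a zero-based (row, column) position. The function subplot_position is a hypothetical name used only for illustration; Matplotlib does this bookkeeping internally.

```python
def subplot_position(m, n, k):
    """Return the zero-based (row, column) of subplot k in an m-by-n grid.

    Hypothetical helper: k counts from 1, left to right, then top to bottom.
    """
    if not 1 <= k <= m * n:
        raise ValueError("k must be between 1 and m*n")
    return divmod(k - 1, n)


for k in range(1, 5):
    print(k, subplot_position(2, 2, k))
```

In a 2x2 grid, for example, k=3 lands in the second row, first column (the bottom left subplot).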

# Create a figure with a 1x2 subplot grid and make the left subplot active:
# blue for % women in Physical Sciences
plt.subplot(1,2,1)
plt.plot(year, physical_sciences, color='blue')
plt.title('Physical Sciences')

# Make the right subplot active in the current 1x2 subplot grid:
# red for % women in Computer Science
plt.subplot(1,2,2)
plt.plot(year, computer_science, color='red')
plt.title('Computer Science')

# Use plt.tight_layout() to improve the spacing between subplots


plt.tight_layout()
plt.show()

Using subplot() (2)


Here, you will make a 2×2 grid of subplots and plot the percentage of degrees
awarded to women in Physical Sciences (using physical_sciences), in Computer
Science (using computer_science), in Health Professions (using health), and in
Education (using education).

# Create a figure with a 2x2 grid; top left subplot:
# plot in blue the % of degrees awarded to women in the Physical Sciences
plt.subplot(2,2,1)
plt.plot(year, physical_sciences, color='blue')
plt.title('Physical Sciences')

# Top right subplot: red %women in Computer Science


plt.subplot(2,2,2)
plt.plot(year, computer_science, color='red')
plt.title('Computer Science')

# Bottom left: green the %women in Health Professions


plt.subplot(2,2,3)
plt.plot(year, health, color='green')
plt.title('Health Professions')

# Bottom right: yellow %women in Education


plt.subplot(2,2,4)
plt.plot(year, education, color='yellow')
plt.title('Education')

# Improve the spacing between subplots and display them


plt.tight_layout()
plt.show()

Using xlim(), ylim()


In this exercise, you will work with the matplotlib.pyplot interface to quickly set
the x- and y-limits of your plots.
You will now create the same figure as in the previous exercise using plt.plot(),
this time setting the axis extents using plt.xlim() and plt.ylim(). These commands
allow you to either zoom or expand the plot or to set the axis ranges to include
important values (such as the origin).
Use plt.savefig() to export the image produced to a file.

# Plot the % of degrees awarded to women in Computer Science and the Physical
# Sciences
plt.plot(year,computer_science, color='red')
plt.plot(year, physical_sciences, color='blue')

# Add the axis labels


plt.xlabel('Year')
plt.ylabel('Degrees awarded to women (%)')

# Set the x-axis range, # Set the y-axis range


plt.xlim(1990,2010)
plt.ylim(0, 50)

# Add a title
plt.title('Degrees awarded to women (1990-2010)\nComputer Science (red)\nPhysical Sciences (blue)')

# Save the image as 'xlim_and_ylim.png' before showing it; saving after
# plt.show() can produce a blank file
plt.savefig('xlim_and_ylim.png')

# Display the plot
plt.show()

Using axis()
plt.xlim() and plt.ylim() are useful for setting the axis limits individually. In
this exercise, you will see how you can pass a 4-tuple to plt.axis() to set limits
for both axes at once. For example, plt.axis((1980,1990,0,75)) would set the extent
of the x-axis to the period between 1980 and 1990, and would set the y-axis extent
from 0 to 75% of degrees awarded.
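
A short sketch of this equivalence, using placeholder data rather than the course arrays and assuming a non-interactive backend so that it runs without a display: after one plt.axis() call, plt.xlim() and plt.ylim() report the limits that were set.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Placeholder data standing in for year and a percentage series
plt.plot([1970, 2011], [10, 40])

# One call sets both axis ranges: (xmin, xmax, ymin, ymax)
plt.axis((1980, 1990, 0, 75))

# Called without arguments, xlim() and ylim() return the current limits
x_limits = plt.xlim()
y_limits = plt.ylim()
print(x_limits, y_limits)
```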

# Plot in blue % women in Computer Science, in red % women in Phys Sci


plt.plot(year,computer_science, color='blue')
plt.plot(year, physical_sciences,color='red')

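# Set the x-axis and y-axis limits in one call (a list or a tuple both work)
plt.axis([1990,2010,0,50])

# Save the figure as 'axis_limits.png' before showing it; saving after
# plt.show() can produce a blank file
plt.savefig('axis_limits.png')
plt.show()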

Using legend()
Legends are useful for distinguishing between multiple datasets displayed on common
axes. Each dataset is drawn with its own line color or marker in a separate plot
command. Using the keyword argument label in the plotting function associates a
string with that dataset for use in the legend.
For example, here, you will plot enrollment of women in the Physical Sciences and
in Computer Science over time. You can label each curve by passing a label argument
to the plotting call, and request a legend using plt.legend(). Specifying the
keyword argument loc determines where the legend will be placed.
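
The loc argument accepts either a location string or an integer code. The mapping below lists the codes documented by Matplotlib; the dictionary name LEGEND_LOC_CODES is only an illustrative name, not something the library exports.

```python
# Integer codes accepted by plt.legend(loc=...), with their string equivalents.
# The dictionary itself is a hypothetical lookup table for illustration.
LEGEND_LOC_CODES = {
    0: 'best',
    1: 'upper right',
    2: 'upper left',
    3: 'lower left',
    4: 'lower right',
    5: 'right',
    6: 'center left',
    7: 'center right',
    8: 'lower center',
    9: 'upper center',
    10: 'center',
}

print(LEGEND_LOC_CODES[8])
```

The string form is usually easier to read, so plt.legend(loc='lower center') is often preferable to plt.legend(loc=8).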

# Specify the label 'Computer Science'


plt.plot(year, computer_science, color='red', label='Computer Science')
# Specify the label 'Physical Sciences'
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')
# Add a legend at the lower center # 8 = 'lower center'
plt.legend(loc=8)

# Add axis labels and title


plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

Using annotate()
Plot enrollment of women in the Physical Sciences and Computer science over time,
with legend. Additionally, mark the point when enrollment of women in CompSci
reached a peak and started declining using plt.annotate().
To enable an arrow, set arrowprops=dict(facecolor='black'). The arrow will point to
the location given by xy and the text will appear at the location given by xytext.

# Plot with legend as before


plt.plot(year, computer_science, color='red', label='Computer Science')
plt.plot(year, physical_sciences, color='blue', label='Physical Sciences')
plt.legend(loc='lower right')

# Compute the max enrollment of women CompSci: cs_max


cs_max = computer_science.max()

# Calculate in which year was max enrollment women CompSci: yr_max


yr_max = year[computer_science.argmax()]

# Add a black arrow annotation


plt.annotate('Maximum', xy = (yr_max, cs_max), xytext=(yr_max+5, cs_max+5),
arrowprops = dict(facecolor='black'))

# Add axis labels and title


plt.xlabel('Year')
plt.ylabel('Enrollment (%)')
plt.title('Undergraduate enrollment of women')
plt.show()

Modifying styles
Matplotlib comes with a number of different stylesheets to customize the overall
look of different plots. To activate a particular stylesheet you can simply call
plt.style.use() with the name of the style sheet you want. To list all the
available style sheets you can execute: print(plt.style.available).
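
As a hedged sketch that goes slightly beyond plt.style.use(), a style sheet can also be applied temporarily with the plt.style.context() context manager, so that the settings revert afterwards. The plotted data are placeholders, and the Agg backend is assumed only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# A few of the style sheet names shipped with this Matplotlib installation
print(sorted(plt.style.available)[:5])

before = plt.rcParams['axes.facecolor']
with plt.style.context('ggplot'):
    # Inside the block, the ggplot settings are active
    inside = plt.rcParams['axes.facecolor']
    plt.plot([0, 1, 2], [0, 1, 4])  # placeholder data
after = plt.rcParams['axes.facecolor']

# The setting is restored once the block ends
print(before, inside, after)
```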

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Set the style to 'ggplot'


plt.style.use('ggplot')

# 2x2 layout: Top-left % women PhySci


plt.subplot(2, 2, 1)
plt.plot(year, physical_sciences, color='blue')
plt.title('Physical Sciences')

# Plot % women CompSci


plt.subplot(2, 2, 2)
plt.plot(year, computer_science, color='red')
plt.title('Computer Science')

# Add annotation
cs_max = computer_science.max()
yr_max = year[computer_science.argmax()]
plt.annotate('Maximum', xy=(yr_max, cs_max), xytext=(yr_max-1, cs_max-10),
arrowprops=dict(facecolor='black'))

# Plot the enrollment % of women in Health professions


plt.subplot(2, 2, 3)
plt.plot(year, health, color='green')
plt.title('Health Professions')

# Plot the enrollment % of women in Education


plt.subplot(2, 2, 4)
plt.plot(year, education, color='yellow')
plt.title('Education')

# Improve spacing between subplots and display them


plt.tight_layout()
plt.show()

Generating meshes
To visualize two-dimensional arrays of data, it is necessary to understand how to
generate and manipulate 2-D arrays. Many Matplotlib plots support arrays as input
and, in particular, they support NumPy arrays. The NumPy library is the most widely
supported way of working with numeric arrays in Python.

Use the meshgrid function in NumPy to generate 2-D arrays, then visualize them
using plt.imshow(). The simplest way to generate a meshgrid is as follows:
import numpy as np
X,Y = np.meshgrid(range(10),range(20))
This will create two arrays with a shape of (20,10): 20 rows along the Y-axis and
10 columns along the X-axis.
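
A quick check of these shapes, as a minimal sketch using NumPy's default indexing: the first returned array repeats the first argument along every row, and the second repeats the second argument down every column.

```python
import numpy as np

X, Y = np.meshgrid(range(10), range(20))

# Both arrays have 20 rows and 10 columns
print(X.shape, Y.shape)

# X changes along a row (x-direction); Y changes down a column (y-direction)
print(X[0, :3])
print(Y[:3, 0])
```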

import numpy as np
import matplotlib.pyplot as plt

# Generate two 1-D arrays: u, v


u = np.linspace(-2, 2, 41)
v = np.linspace(-1,1,21)

# Generate 2-D arrays from u and v: X, Y


X,Y = np.meshgrid(u,v)

# Compute Z based on X and Y


Z = np.sin(3*np.sqrt(X**2 + Y**2))

# Display the resulting image with pcolor()
plt.pcolor(Z)

# Save the figure to 'sine_mesh.png' before showing it; saving after
# plt.show() can produce a blank file
plt.savefig('sine_mesh.png')
plt.show()

Array orientation
[Figure: pseudocolor plot of the array A]

The commands
plt.pcolor(A, cmap='Blues')
plt.colorbar()
plt.show()
produce the pseudocolor plot above using a Numpy array A. Which of the commands
below could have generated A?

Contour & filled contour plots


Although plt.imshow() or plt.pcolor() are often used to visualize a 2-D array in
entirety, there are other ways of visualizing such data without displaying all the
available sample values. One option is to use the array to compute contours that
are visualized instead.
There are two types of contour plot: plt.contour() and plt.contourf() display
contours as lines and filled areas, respectively. Both accept a two-dimensional
array from which contours are computed.

In this exercise, you will visualize a 2-D array repeatedly using both
plt.contour() and plt.contourf(). You will use plt.subplot() to display several
contour plots in a common figure, using the meshgrid X, Y as the axes. For example,
plt.contour(X, Y, Z) generates a default contour map of the array Z.
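
Besides a number of contours, the contour functions also accept explicit level values through the levels keyword. The sketch below rebuilds the X, Y, Z arrays from the meshgrid exercise and requests 11 evenly spaced levels; the choice of levels is an assumption made for illustration, and the Agg backend is assumed only so the sketch runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Same grid and function as in the meshgrid exercise
u = np.linspace(-2, 2, 41)
v = np.linspace(-1, 1, 21)
X, Y = np.meshgrid(u, v)
Z = np.sin(3 * np.sqrt(X**2 + Y**2))

# Request contours at explicit values instead of an automatic choice
levels = np.linspace(-1, 1, 11)
cs = plt.contour(X, Y, Z, levels=levels)

# The returned object records the levels that were used
print(cs.levels)
```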

#Generate a default contour map of the array Z


plt.subplot(2,2,1)
plt.contour(X,Y,Z)

# Generate a contour map with 20 contours


plt.subplot(2,2,2)
plt.contour(X,Y,Z,20)

# Generate a default filled contour map of the array Z


plt.subplot(2,2,3)
plt.contourf(X,Y,Z)

# Generate a default filled contour map with 20 contours


plt.subplot(2,2,4)
plt.contourf(X,Y,Z,20)

# Improve the spacing between subplots


plt.tight_layout()
plt.show()

Modifying colormaps
When displaying a 2-D array with plt.imshow() or plt.pcolor(), the values of the
array are mapped to a corresponding color. The set of colors used is determined by
a colormap which smoothly maps values to colors, making it easy to understand the
structure of the data at a glance.

It is often useful to change the colormap from Matplotlib's default ('viridis'
since Matplotlib 2.0; earlier versions used 'jet'). A good colormap is visually
pleasing and conveys the structure of the data faithfully and in a way that makes
sense for the application.

Matplotlib colormaps
The option cmap=<name> in most Matplotlib plotting functions changes the colormap
of the resulting plot. Some available colormaps, by group:
Unique names: 'jet', 'coolwarm', 'magma' and 'viridis'.
Overall color: 'Greens', 'Blues', 'Reds' and 'Purples'.
Seasons: 'summer', 'autumn', 'winter' and 'spring'.
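
To see what a colormap does with a value, the sketch below looks up two colormaps by name with plt.get_cmap() and evaluates them at a few points in the range 0 to 1; each result is an RGBA tuple. This is only an illustration of the value-to-color mapping, not something the exercises require.

```python
import matplotlib.pyplot as plt

gray = plt.get_cmap('gray')
viridis = plt.get_cmap('viridis')

# A colormap maps a number in [0, 1] to (red, green, blue, alpha)
print(gray(0.0))     # lowest value: black
print(gray(1.0))     # highest value: white
print(viridis(0.5))  # a mid-range color
```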

# Create a filled contour plot with a color map of 'viridis'


plt.subplot(2,2,1)
plt.contourf(X,Y,Z,20, cmap='viridis')
plt.colorbar()
plt.title('Viridis')

# Create a filled contour plot with a color map of 'gray'


plt.subplot(2,2,2)
plt.contourf(X,Y,Z,20, cmap='gray')
plt.colorbar()
plt.title('Gray')

# Create a filled contour plot with a color map of 'autumn'


plt.subplot(2,2,3)
plt.contourf(X,Y,Z,20, cmap='autumn')
plt.colorbar()
plt.title('Autumn')

# Create a filled contour plot with a color map of 'winter'


plt.subplot(2,2,4)
plt.contourf(X,Y,Z,20, cmap='winter')
plt.colorbar()
plt.title('Winter')

# Improve the spacing between subplots and display them


plt.tight_layout()
plt.show()

Using hist2d()
Given a set of ordered pairs describing data points, you can count the number of
points with similar values to construct a two-dimensional histogram. This is
similar to a one-dimensional histogram, but it describes the joint variation of two
random variables rather than just one.

In matplotlib, one function to visualize 2-D histograms is plt.hist2d().

Specify the coordinates of the points using plt.hist2d(x,y), assuming x and y are
two vectors of the same length.
Specify the number of bins with the argument bins=(nx, ny), where nx is the number
of bins to use in the horizontal direction and ny is the number of bins to use in
the vertical direction.
Optionally, specify the rectangular region in which the samples are counted with
the parameter range=((xmin, xmax), (ymin, ymax)), where xmin and xmax are the lower
and upper limits for the variable on the x-axis and ymin and ymax are the lower and
upper limits for the variable on the y-axis. The range argument can use nested
tuples or lists.

In this exercise, you'll use some data from the auto-mpg data set. There are two
arrays, mpg and hp, that contain the miles per gallon and horse power ratings,
respectively, of over three hundred automobiles.
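
The binning itself can be sketched without plotting by using np.histogram2d(), which takes the same bins and range arguments. The arrays below are synthetic stand-ins generated at random (they are not the auto-mpg data), chosen so that every point falls inside the counted region.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(40, 235, size=300)  # synthetic stand-in for hp
y = rng.uniform(8, 48, size=300)    # synthetic stand-in for mpg

counts, xedges, yedges = np.histogram2d(
    x, y, bins=(20, 20), range=((40, 235), (8, 48)))

# One count per (x-bin, y-bin) cell, and one more edge than bins per axis
print(counts.shape)
print(len(xedges), len(yedges))

# Every point lies inside the range, so the counts add up to the sample size
print(int(counts.sum()))
```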

# Generate a 2-D histogram


plt.hist2d(hp,mpg,bins=(20,20),range=[(40,235),(8,48)])

# Add a color bar to the histogram


plt.colorbar()

# Add labels, title, and display the plot


plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hist2d() plot')
plt.show()

Using hexbin()
The function plt.hist2d() uses rectangular bins to construct a two dimensional
histogram. As an alternative, the function plt.hexbin() uses hexagonal bins. The
underlying algorithm (based on an article from 1987) constructs a hexagonal
tessellation of a planar region and aggregates points inside hexagonal bins.

The optional gridsize argument (default 100) gives the number of hexagons across
the x-direction used in the hexagonal tiling. If specified as a list or a tuple of
length two, gridsize fixes the number of hexagons in the x- and y-directions
respectively in the tiling.
The optional parameter extent=(xmin, xmax, ymin, ymax) specifies the rectangular
region covered by the hexagonal tiling. In that case, xmin and xmax are the
respective lower and upper limits for the variable on the x-axis and ymin and ymax
are the respective lower and upper limits for the variable on the y-axis.
In this exercise, you'll use the same auto-mpg data as in the last exercise (again
using arrays mpg and hp). This time, you'll use plt.hexbin() to visualize the two-
dimensional histogram.

# Generate a 2d histogram with hexagonal bins


plt.hexbin(hp,mpg, gridsize=(15,12), extent=[40,235,8,48])

# Add a color bar to the histogram


plt.colorbar()

# Add labels, title, and display the plot


plt.xlabel('Horse power [hp]')
plt.ylabel('Miles per gallon [mpg]')
plt.title('hexbin() plot')
plt.show()

Loading, examining images


Color images such as photographs contain the intensity of the red, green and blue
color channels.

To read an image from file, use plt.imread() by passing the path to a file, such as
a PNG or JPG file.
The color image can be plotted as usual using plt.imshow().
The loaded image is a three-dimensional NumPy array. The array typically has
dimensions M×N×3, where M×N are the dimensions of the image. The third dimension
indexes the color channels (typically red, green, and blue).
The color channels can be extracted by NumPy array slicing.
In this exercise, you will load & display an image of an astronaut (by NASA (Public
domain), via Wikimedia Commons). You will also examine its attributes to understand
how color images are represented.
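
Channel extraction by slicing can be sketched on a tiny synthetic array, used here in place of a real photograph (the pixel values are made up for illustration):

```python
import numpy as np

# Synthetic 4x6 RGB "image": strong red, no green, a little blue
img = np.zeros((4, 6, 3), dtype=np.uint8)
img[:, :, 0] = 200
img[:, :, 2] = 50

# Slice out each color channel along the third dimension
red = img[:, :, 0]
green = img[:, :, 1]
blue = img[:, :, 2]

print(img.shape, red.shape)
print(red.max(), green.max(), blue.max())
```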

# Load the image into an array: img


img = plt.imread('480px-Astronaut-EVA.jpg')
# Print the shape of the image
print(img.shape)

# Display the image


plt.imshow(img)
# Hide the axes
plt.axis('off')
plt.show()

Pseudocolor plot from image data


Image data comes in many forms and it is not always appropriate to display the
available channels in RGB space. In many situations, an image may be processed and
analysed in some way before it is visualized in pseudocolor, also known as 'false'
color.

In this exercise, you will perform a simple analysis using the image showing an
astronaut as viewed from space. Instead of simply displaying the image, you will
compute the total intensity across the red, green and blue channels. The result is
a single two dimensional array which you will display using plt.imshow() with the
'gray' colormap.
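
A minimal sketch of the channel sum on a synthetic array (not the astronaut image): summing along axis=2 collapses the color dimension, and NumPy accumulates unsigned 8-bit values in a wider integer type by default, so the total can exceed 255 without wrapping around.

```python
import numpy as np

# Synthetic 2x3 image in which every channel is at its maximum value
img = np.full((2, 3, 3), 255, dtype=np.uint8)

# Sum the red, green and blue channels for each pixel
intensity = img.sum(axis=2)

print(intensity.shape)
print(intensity.dtype, intensity.max())
```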

# Load the image into an array: img


img = plt.imread('480px-Astronaut-EVA.jpg')
# Print the shape of the image
print(img.shape)

# Compute the sum of the red, green and blue channels: intensity
intensity = img.sum(axis=2)

# Print the shape of the intensity


print(intensity.shape)

# Display the intensity with a colormap of 'gray'


plt.imshow(intensity, cmap='gray')

# Add a colorbar
plt.colorbar()

# Hide the axes and show the figure


plt.axis('off')
plt.show()

Extent and aspect


When using plt.imshow() to display an array, the default behavior is to keep pixels
square so that the height to width ratio of the output matches the ratio determined
by the shape of the array. In addition, by default, the x- and y-axes are labeled
by the number of samples in each direction.
The relative scaling of the displayed height and width is known as the image
aspect, and the range used to label the x- and y-axes is known as the image extent.
The default aspect value of 'equal' keeps the pixels square ('auto' instead
stretches the image to fill the axes), and the extents are automatically computed
from the shape of the array if not specified otherwise.

# Load the image into an array: img


img = plt.imread('480px-Astronaut-EVA.jpg')

# Specify the extent and aspect ratio of the top left subplot
plt.subplot(2,2,1)
plt.title('extent=(-1,1,-1,1),\naspect=0.5')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent=(-1,1,-1,1), aspect=0.5)
# Specify the extent and aspect ratio of the top right subplot
plt.subplot(2,2,2)
plt.title('extent=(-1,1,-1,1),\naspect=1')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent=(-1,1,-1,1), aspect=1)

# Specify the extent and aspect ratio of the bottom left subplot
plt.subplot(2,2,3)
plt.title('extent=(-1,1,-1,1),\naspect=2')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.imshow(img, extent=(-1,1,-1,1), aspect=2)

# Specify the extent and aspect ratio of the bottom right subplot
plt.subplot(2,2,4)
plt.title('extent=(-2,2,-1,1),\naspect=2')
plt.xticks([-2,-1,0,1,2])
plt.yticks([-1,0,1])
plt.imshow(img, extent=(-2,2,-1,1), aspect=2)

# Improve spacing and display the figure


plt.tight_layout()
plt.show()

Rescaling pixel intensities


Sometimes, low contrast images can be improved by rescaling their intensities. For
instance, this image of Hawkes Bay, New Zealand (originally by Phillip Capper,
modified by User:Konstable, via Wikimedia Commons, CC BY 2.0) has no pixel values
near 0 or near 255 (the limits of valid intensities).
For this exercise, you will do a simple rescaling (remember, an image is NumPy
array) to translate and stretch the pixel intensities so that the intensities of
the new image fill the range from 0 to 255.
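
The translate-and-stretch step can be sketched on a tiny synthetic array (the pixel values below are made up): subtracting the minimum shifts the darkest pixel to 0, and scaling by 255/(pmax - pmin) moves the brightest pixel to 255.

```python
import numpy as np

# Synthetic low-contrast "image" with values between 60 and 180
image = np.array([[60, 90], [120, 180]], dtype=np.uint8)
pmin, pmax = image.min(), image.max()

# Convert to float first so the arithmetic is not limited to 8-bit integers
rescaled = 255 * (image.astype(float) - pmin) / (pmax - pmin)

print(rescaled)
print(rescaled.min(), rescaled.max())
```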

# Load the image into an array: image


image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Extract minimum and maximum values from the image: pmin, pmax
pmin, pmax = image.min(), image.max()
print("The smallest & largest pixel intensities are %d & %d." % (pmin, pmax))

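# Rescale the pixels to the range 0-255: rescaled_image
# (cast to float first so the arithmetic does not overflow the integer pixel type)
rescaled_image = 255*(image.astype(float) - pmin) / (pmax - pmin)
print("The rescaled smallest & largest pixel intensities are %.1f & %.1f." %
(rescaled_image.min(), rescaled_image.max()))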

# Display the original image in the top subplot


plt.subplot(2,1,1)
plt.title('original image')
plt.axis('off')
plt.imshow(image)

# Display the rescaled image in the bottom subplot


plt.subplot(2,1,2)
plt.title('rescaled image')
plt.axis('off')
plt.imshow(rescaled_image)
plt.show()

Simple linear regressions


As you have seen, seaborn provides a convenient interface to generate complex and
great-looking statistical plots. One of the simplest things you can do using
seaborn is to fit and visualize a simple linear regression between two variables
using sns.lmplot().
One difference between seaborn and regular matplotlib plotting is that you can pass
pandas DataFrames directly to the plot and refer to each column by name. For
example, if you were to plot the column 'price' vs the column 'area' from a
DataFrame df, you could call sns.lmplot(x='area', y='price', data=df).

# Import plotting modules


import matplotlib.pyplot as plt
import seaborn as sns

# Plot a linear regression between 'weight' and 'hp'


sns.lmplot(x='weight', y='hp', data=auto)
plt.show()

Plotting residuals of a regression


Seaborn provides sns.residplot(), visualizing how far datapoints diverge from the
regression line.
In this exercise, you will visualize the residuals of a regression between the 'hp'
column (horse power) and the 'mpg' column (miles per gallon) of the auto DataFrame
used previously.
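
A residual is the observed value minus the value predicted by the fitted line. The sketch below computes residuals for a straight-line fit with np.polyfit() on a few made-up points; it illustrates the quantity being plotted and is not seaborn's own implementation.

```python
import numpy as np

# Made-up data roughly following a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Residuals: how far each point lies above or below the fitted line
residuals = y - (slope * x + intercept)
print(residuals)
```

For a least-squares line with an intercept, the residuals sum to zero up to rounding error, so a residual plot should scatter around the horizontal zero line.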

# Generate a green residual plot of the regression between 'hp' and 'mpg'
sns.residplot(x='hp', y='mpg', data=auto, color='green')
plt.show()

Higher-order regressions
When there are more complex relationships between two variables, a simple first
order regression is often not sufficient to accurately capture the relationship
between the variables. Seaborn makes it simple to compute and visualize regressions
of varying orders.

Here, you will plot a second order regression between weight ('weight') and
miles per gallon ('mpg') using sns.regplot() (the function sns.lmplot() is a
higher-level interface to sns.regplot()). However, before plotting this
relationship, compare how the residual changes depending on the order of the
regression. Does a second order regression perform significantly better than a
simple linear regression?

A principal difference between sns.lmplot() and sns.regplot() is the way in which
matplotlib options are passed (sns.regplot() is more permissive). For both
sns.lmplot() and sns.regplot(), the keyword order is used to control the order of
polynomial regression. Passing scatter=None to sns.regplot() prevents the scatter
plot points from being plotted again.
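
To make the effect of the order concrete, the sketch below fits polynomials of order 1 and 2 with np.polyfit() and compares their residual sums of squares. The data are synthetic and exactly quadratic by construction, which is an assumption made so that the difference is obvious; the helper name rss is hypothetical.

```python
import numpy as np

# Synthetic data that follow an exact quadratic relationship
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - 3 * x + 4


def rss(order):
    """Residual sum of squares of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, order)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))


rss1, rss2 = rss(1), rss(2)
print(rss1, rss2)
```

Here the second order fit reproduces the data almost exactly, while the straight line leaves a large residual. On real data the improvement is usually less dramatic.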

# Generate a scatter plot of 'weight' and 'mpg' using red circles


plt.scatter(auto['weight'], auto['mpg'], label='data', color='red', marker='o')

# Plot in blue a linear regression of order 1 between 'weight' and 'mpg'


sns.regplot(x='weight', y='mpg', data=auto, scatter=None, color='blue',
label='order 1')
# Plot in green a linear regression of order 2 between 'weight' and 'mpg'
sns.regplot(x='weight', y='mpg', data=auto, scatter=None, order=2, color='green',
label='order 2')

# Add a legend and display the plot


plt.legend(loc='upper right')
plt.show()

Grouping linear regressions by hue


Often it is useful to compare and contrast trends between different groups. Seaborn
makes it possible to apply linear regressions separately for subsets of the data by
applying a groupby operation. Using the hue argument, you can specify a categorical
variable by which to group data observations. The distinct groups of points are
used to produce distinct regressions with different hues in the plot.

In the automobile dataset - which has been pre-loaded here as auto - you can view
the relationship between weight ('weight') and horsepower ('hp') of the cars and
group them by their origin ('origin'), giving you a quick visual indication how the
relationship differs by continent.

# linear regr between 'weight' and 'hp', hue of 'origin' and palette 'Set1'
sns.lmplot(x='weight', y='hp', data=auto, hue='origin', palette='Set1')
plt.show()

Grouping linear regressions by row or column


Rather than overlaying linear regressions of grouped data in the same plot, we may
want to use a grid of subplots. The sns.lmplot() function accepts the arguments row
and/or col to arrange the regressions in a grid of subplots.

You'll use the automobile dataset again and, this time, you'll use the keyword
argument row to display the subplots organized in rows. That is, you'll produce
horsepower vs. weight regressions grouped by continent of origin in separate
subplots stacked vertically.

# Plot lr between 'weight' and 'hp' grouped col-wise by 'origin'


sns.lmplot(x='weight', y='hp', data=auto, col='origin')

# Plot lr between 'weight' and 'hp' grouped row-wise by 'origin'


sns.lmplot(x='weight', y='hp', data=auto, row='origin')

# Display the plot


plt.show()

Constructing strip plots


Regressions are useful to understand relationships between two continuous
variables. Often we want to explore how the distribution of a single continuous
variable is affected by a second categorical variable. Seaborn provides a variety
of plot types to perform these types of comparisons between univariate
distributions.

The strip plot is one way of visualizing this kind of data. It plots the
distribution of variables for each category as individual datapoints. For vertical
strip plots (the default), distributions of continuous values are laid out parallel
to the y-axis and the distinct categories are spaced out along the x-axis.
For example, sns.stripplot(x='type', y='length', data=df) produces a sequence of
vertical strip plots of length distributions grouped by type (assuming length is a
continuous column and type is a categorical column of the DataFrame df).
Overlapping points can be difficult to distinguish in strip plots. The argument
jitter=True helps spread out overlapping points.
Other matplotlib arguments can be passed to sns.stripplot(), e.g., marker, color,
size, etc.
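
The idea behind jitter can be sketched with NumPy alone: a small random offset is added to each point's position along the categorical axis, while the continuous values are left unchanged. The jitter width of 0.1 and the data below are assumptions for illustration; seaborn chooses its own amount when jitter=True.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two categories at positions 0 and 1, with identical values inside each
categories = np.array([0, 0, 0, 1, 1, 1])
values = np.array([5.0, 5.0, 5.0, 7.0, 7.0, 7.0])

# Assumed jitter amount; without it, the three points per category coincide
jitter_width = 0.1
offsets = rng.uniform(-jitter_width, jitter_width, size=categories.size)
x_jittered = categories + offsets

print(x_jittered)
```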

# Make a strip plot of 'hp' grouped by 'cyl'


plt.subplot(2,1,1)
sns.stripplot(x='cyl', y='hp', data=auto)

# Make the strip plot again using jitter and a smaller point size
plt.subplot(2,1,2)
sns.stripplot(x='cyl', y='hp', data=auto, size=3, jitter=True)

# Display the plot


plt.show()

Constructing swarm plots


A strip plot can be visually crowded even with jitter applied and smaller point
sizes. An alternative is provided by the swarm plot (sns.swarmplot()), which
spreads out the points to avoid overlap and provides a better visual overview of
the data.

The syntax is similar to that of sns.stripplot(), e.g., sns.swarmplot(x='type',
y='length', data=df).
The orientation for the continuous variable in the strip/swarm plot can be inferred
from the choice of x and y from the DataFrame. The orientation can be set
explicitly using orient='h' or orient='v'.
Another grouping can be added in using the hue keyword. For instance, using
sns.swarmplot(x='type', y='length', data=df, hue='build year') makes a swarm plot
from the DataFrame df with the 'length' column values spread out vertically,
horizontally grouped by the column 'type' and each point colored by the categorical
column 'build year'.
In this exercise, you'll use the auto DataFrame again to illustrate the use of
sns.swarmplot() with grouping by hue and with explicit specification of the
orientation using the keyword orient.

# swarm plot of 'hp' grouped horizontally by 'cyl'


plt.subplot(2,1,1)
sns.swarmplot(x='cyl', y='hp', data=auto)

# swarm plot of 'hp' grouped vertically by 'cyl' with a hue of 'origin'


plt.subplot(2,1,2)
sns.swarmplot(x='hp', y='cyl', data=auto, hue='origin', orient='h')

plt.show()

Constructing violin plots


Both strip and swarm plots visualize all the datapoints. For large datasets, this
can result in significant overplotting. Therefore, it is often useful to use plot
types which reduce a dataset to more descriptive statistics and provide a good
summary of the data. Box and whisker plots are a classic way of summarizing
univariate distributions but seaborn provides a more sophisticated extension of the
standard box plot, called a violin plot.
Here, you will produce violin plots of the distribution of horse power ('hp') by
the number of cylinders ('cyl'). Additionally, you will combine two different plot
types by overlaying a strip plot on the violin plot.

As before, the DataFrame has been pre-loaded for you as auto.

# Generate a violin plot of 'hp' grouped horizontally by 'cyl'


plt.subplot(2,1,1)
sns.violinplot(x='cyl', y='hp', data=auto)

# Generate the same violin plot again with a color of 'lightgray' and without inner
# annotations
plt.subplot(2,1,2)
sns.violinplot(x='cyl', y='hp', data=auto, inner=None, color='lightgray')

# Overlay a strip plot on the violin plot


sns.stripplot(x='cyl', y='hp', data=auto, jitter=True, size=1.5)

plt.show()

Plotting joint distributions (1)


There are numerous strategies to visualize how pairs of continuous random variables
vary jointly. Regression and residual plots are one strategy. Another is to
visualize a bivariate distribution. Seaborn's sns.jointplot() provides means of
visualizing bivariate distributions. The basic calling syntax is similar to that of
sns.lmplot(). By default, calling sns.jointplot(x, y, data) renders a few things:

A scatter plot using the specified columns x and y from the DataFrame data.
A (univariate) histogram along the top of the scatter plot showing distribution of
the column x.
A (univariate) histogram along the right of the scatter plot showing distribution
of the column y.

# Generate a joint plot of 'hp' and 'mpg'


sns.jointplot(x='hp', y='mpg', data=auto)
plt.show()

Plotting joint distributions (2)


sns.jointplot() has a parameter kind to specify how to visualize the joint
variation of two continuous random variables (i.e., two columns of a DataFrame):
kind='scatter' uses a scatter plot of the data points
kind='reg' uses a regression plot (default order 1)
kind='resid' uses a residual plot
kind='kde' uses a kernel density estimate of the joint distribution
kind='hex' uses a hexbin plot of the joint distribution

For this exercise, you will use kind='hex' to generate a hexbin plot of the joint
distribution.

# Generate a joint plot of 'hp' and 'mpg' using a hexbin plot


sns.jointplot(x='hp', y='mpg', data=auto, kind='hex')
plt.show()

Plotting distributions pairwise (1)


sns.jointplot() is restricted to representing joint variation between only two
quantities (i.e., two columns of a DataFrame). sns.pairplot() constructs a grid of
all joint plots pairwise from all pairs of (non-categorical) columns in a
DataFrame. The syntax is very simple: sns.pairplot(df), where df is a DataFrame.
The non-categorical columns are identified and the corresponding joint plots are
plotted in a square grid of subplots. The diagonal of the subplot grid shows the
univariate histograms of the individual columns.

# Plot the pairwise joint distributions from the DataFrame


sns.pairplot(auto)
plt.show()

Plotting distributions pairwise (2)


To show regressions as well as scatter plots in the off-diagonal subplots, use
kind='reg' (where 'reg' means 'regression'). kind='scatter' (the default) shows
only scatter plots in the off-diagonal subplots.

# Plot the pairwise joint distributions grouped by 'origin' along with regression lines
sns.pairplot(auto, kind='reg', hue='origin')
plt.show()

Visualizing correlations with a heatmap


Plotting relationships between many variables using a pair plot can quickly get
visually overwhelming. It is therefore often useful to compute the pairwise
correlations (normalized covariances) between the variables instead. The resulting
matrix can then easily be visualized as a heatmap. A heatmap is effectively a
pseudocolor plot with labelled rows and columns (i.e., a pseudocolor plot based on
a pandas DataFrame rather than a matrix). The DataFrame does not have to be square
or symmetric (but, in the context of a covariance or correlation matrix, it is
both).

In this exercise, you will view the matrix cov_matrix computed from the continuous
variables in the auto-mpg dataset. You do not have to know here how it is computed;
the important point is that its diagonal entries are all 1s and its off-diagonal
entries are between -1 and +1 (quantifying the degree to which variable pairs vary
jointly). Strictly speaking, these are the properties of a correlation matrix, i.e.
a covariance matrix normalized by the standard deviations; a raw covariance matrix
has the variances on its diagonal. The matrix is also symmetric.
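
The properties described above (unit diagonal, entries between -1 and +1, symmetry) can be checked numerically. The following is a minimal sketch on synthetic data, not the auto-mpg dataset; the array data and its column scales are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 variables on very different scales
data = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
data[:, 1] += 5.0 * data[:, 0]  # make columns 0 and 1 vary jointly

# Raw covariance matrix: the diagonal holds the variances, not 1s
cov = np.cov(data, rowvar=False)

# Normalize by the standard deviations to obtain the correlation matrix
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print(np.round(cov, 2))
print(np.round(corr, 2))
```

Only the normalized matrix has a unit diagonal, which is why a heatmap of it is comparable across variables measured in different units.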

# Print the covariance matrix


print(cov_matrix)

# Visualize the covariance matrix using a heatmap


sns.heatmap(cov_matrix)
plt.show()

Multiple time series on common axes


For this exercise, you will construct a plot showing four stock time series on the
same axes. The time series in question are represented in the session using the
identifiers aapl, ibm, csco, and msft. You'll generate a single plot showing all
the time series on common axes with a legend.

# Import matplotlib.pyplot as plt


import matplotlib.pyplot as plt

# Plot the aapl time series in blue, ibm green, csco red, msft magenta
plt.plot(aapl, color='blue', label='AAPL')
plt.plot(ibm, color='green', label='IBM')
plt.plot(csco, color='red', label='CSCO')
plt.plot(msft, color='magenta', label='MSFT')

# Add a legend in the top left corner, xticks orientation


plt.legend(loc='upper left')
plt.xticks(rotation=60)
plt.show()

Multiple time series slices (1)


You can slice subsets corresponding to different time intervals from a time series.
For example, the label slices '2001':'2005', '2011-03':'2011-12', and
'2010-04-19':'2010-04-30' extract intervals of length 5 years, 10 months, and
12 days respectively.

Unlike slicing from standard Python lists, tuples, and strings, when slicing time
series by labels (and other pandas Series & DataFrames by labels), the slice
includes the right endpoint.
You can use partial strings or datetime objects for indexing and slicing from time
series.
For this exercise, you will use time series slicing to plot the time series aapl
over its full 11-year range and also over a shorter 2-year range. You'll arrange
these plots in a 2×1 grid of subplots.
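
The difference between label-based and positional slicing can be seen on a small example. This is a sketch using a made-up daily series named prices as a stand-in for aapl; the values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for aapl
index = pd.date_range('2007-01-01', '2008-12-31', freq='D')
prices = pd.Series(np.arange(len(index), dtype=float), index=index)

# Label-based slice with partial date strings: both endpoints are included
view = prices.loc['2007-11':'2008-04']
print(view.index[0].date(), view.index[-1].date())

# Positional slice on a list: the right endpoint is excluded
letters = ['a', 'b', 'c', 'd']
print(letters[1:3])
```

The slice runs from the first day of November 2007 through the last day of April 2008, whereas the list slice stops before position 3.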

# Plot the series in the top subplot in blue


plt.subplot(2,1,1)
plt.xticks(rotation=45)
plt.title('AAPL: 2001 to 2011')
plt.plot(aapl, color='blue')

# Slice aapl from '2007' to '2008' inclusive: view


view = aapl['2007':'2008']

# Plot the sliced data in the bottom subplot in black


plt.subplot(2,1,2)
plt.xticks(rotation=45)
plt.title('AAPL: 2007 to 2008')
plt.plot(view, color='k')
plt.tight_layout()
plt.show()

Multiple time series slices (2)


In this exercise, you will use the same time series aapl from the previous exercise
and plot tighter views of the data.

Partial string indexing works without slicing as well. For instance, using
my_time_series['1995'], my_time_series['1999-05'], and my_time_series['2000-11-04']
respectively extracts views of the time series my_time_series corresponding to the
entire year 1995, the entire month May 1999, and the entire day November 4, 2000.
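
As a small illustration of partial string indexing, the sketch below selects a whole year and a whole month from a made-up daily series; the name series and its values are assumptions, not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering 2007 and 2008
index = pd.date_range('2007-01-01', '2008-12-31', freq='D')
series = pd.Series(np.arange(len(index), dtype=float), index=index)

# A partial date string selects every row that falls inside that period
year_2008 = series.loc['2008']
january_2008 = series.loc['2008-01']

print(len(year_2008), len(january_2008))
```

Using .loc makes the label-based intent explicit; plain square brackets on a Series behave the same way for these strings.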

# Slice aapl from Nov. 2007 to Apr. 2008 inclusive: view_1; slice January 2008: view_2
view_1 = aapl['2007-11':'2008-04']
view_2 = aapl['2008-01']

# Plot the sliced series in the top subplot in red


plt.subplot(2,1,1)
plt.xticks(rotation=45)
plt.title('AAPL: Nov. 2007 to Apr. 2008')
plt.plot(view_1, color='red')

# Plot the sliced series in the bottom subplot in green


plt.subplot(2,1,2)
plt.xticks(rotation=45)
plt.title('AAPL: Jan. 2008')
plt.plot(view_2, color='green')

# Improve spacing and display the plot


plt.tight_layout()
plt.show()

Plotting an inset view


Remember, rather than comparing plots with subplots or overlaid plots, you can
generate an inset view directly using plt.axes(). In this exercise, you'll
reproduce two of the time series plots from the preceding two exercises. Your
figure will contain an inset plot to highlight the dramatic changes in AAPL stock
price between November 2007 and April 2008 (as compared to the 11 years from 2001
to 2011).

# Slice aapl from Nov. 2007 to Apr. 2008 inclusive: view


view = aapl['2007-11':'2008-04']

# Plot the entire series


plt.plot(aapl)
plt.xticks(rotation=45)
plt.title('AAPL: 2001-2011')

# Create inset axes from [xlo, ylo, width, height] in figure coordinates


plt.axes([.25,.5,.35,.35])

# Plot the sliced series in red using the current axes


plt.plot(view, color='red')
plt.xticks(rotation=45)
plt.title('2007/11-2008/04')
plt.show()

Plotting moving averages


In this exercise, you will plot pre-computed moving averages of AAPL stock prices
in distinct subplots. The time series aapl is overlaid in black in each subplot
for comparison. The time series mean_30, mean_75, mean_125, and mean_250 have been
computed for you (containing the windowed averages of the series aapl computed over
windows of width 30 days, 75 days, 125 days, and 250 days respectively).
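
The notes do not say how mean_30 and the other series were computed. One common way to obtain windowed statistics is the pandas rolling API, sketched below on a synthetic series. The name prices and the choice of a 30-row window are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series (a random walk around 100)
rng = np.random.default_rng(1)
index = pd.date_range('2010-01-01', periods=300, freq='D')
prices = pd.Series(100 + rng.normal(size=300).cumsum(), index=index)

# Rolling statistics over a window of 30 consecutive rows
mean_30 = prices.rolling(window=30).mean()
std_30 = prices.rolling(window=30).std()

# The first 29 entries are NaN because the window is not yet full
print(mean_30.iloc[27:31])
```

A wider window smooths more but also leaves more leading NaN values, which is why the longer averages start later in a plot.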

# Plot the 30-day moving average in the top left subplot in green
plt.subplot(2,2,1)
plt.plot(mean_30, color='green')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('30d averages')

# Plot the 75-day moving average in the top right subplot in red
plt.subplot(2,2,2)
plt.plot(mean_75, 'red')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('75d averages')

# Plot the 125-day moving average in the bottom left subplot in magenta
plt.subplot(2, 2, 3)
plt.plot(mean_125, 'magenta')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('125d averages')

# Plot the 250-day moving average in the bottom right subplot in cyan
plt.subplot(2,2,4)
plt.plot(mean_250, 'cyan')
plt.plot(aapl, 'k-.')
plt.xticks(rotation=60)
plt.title('250d averages')

plt.show()

Plotting moving standard deviations


Having plotted pre-computed moving averages of AAPL stock prices on distinct
subplots in the previous exercise, you will now plot pre-computed moving standard
deviations of the same stock prices, this time together on common axes.

The time series aapl is not plotted in this case; its values are on a different
scale than the standard deviations.
The time series std_30, std_75, std_125, & std_250 have been computed for you
(containing the windowed standard deviations of the series aapl computed over
windows of width 30 days, 75 days, 125 days, & 250 days respectively).

# Plot std_30 in red, 75 cyan, 125 green, 250 magenta


plt.plot(std_30, 'red', label='30d')
plt.plot(std_75, 'cyan', label='75d')
plt.plot(std_125, 'green', label='125d')
plt.plot(std_250, 'magenta', label='250d')

# Add a legend to the 'upper left'


plt.legend(loc=2)
plt.title('Moving standard deviations')
plt.show()

Extracting a histogram from a grayscale image


For grayscale images, various image processing algorithms use an image histogram.
Recall that an image is a two-dimensional array of numerical intensities. An image
histogram, then, is computed by counting the occurrences of distinct pixel
intensities over all the pixels in the image.
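
The counting described above can be sketched directly with NumPy. The image here is random 8-bit noise generated for illustration, not the Hawkes Bay photograph.

```python
import numpy as np

# Hypothetical 8-bit grayscale image (random intensities in 0..255)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(48, 64), dtype=np.uint8)

# Flatten to 1-D and count how often each intensity occurs
pixels = image.flatten()
counts = np.bincount(pixels, minlength=256)

# counts[v] is the number of pixels whose intensity equals v
print(counts[:8])
```

The same counts are what plt.hist() draws when it is given one bin per intensity level.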

For this exercise, you will load an unequalized low contrast image of Hawkes Bay,
New Zealand (originally by Phillip Capper, modified by User:Konstable, via
Wikimedia Commons, CC BY 2.0). You will plot the image and use the pixel intensity
values to plot a normalized histogram of pixel intensities.

# Load the image into an array: image


image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Display image in top subplot using color map 'gray'


plt.subplot(2,1,1)
plt.title('Original image')
plt.axis('off')
plt.imshow(image, 'gray')

# Flatten the image into 1 dimension: pixels


pixels = image.flatten()

# Display a histogram of the pixels in the bottom subplot


plt.subplot(2,1,2)
plt.xlim((0,255))
plt.title('Normalized histogram')
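# color must be passed by keyword (the second positional argument is bins);
# density=True replaces the normed=True flag removed from recent Matplotlib
plt.hist(pixels, bins=64, range=(0,256), density=True, color='red', alpha=0.4)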

# Display the plot


plt.show()

Cumulative Distribution Function from an image histogram


A normalized histogram of a continuous random variable approximates its Probability
Density Function (or PDF). The area under the PDF up to a given value (a definite
integral) is called the Cumulative Distribution Function (or CDF). The CDF
quantifies the probability of observing a pixel intensity at or below a given value.
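
The relationship between the two functions can be sketched numerically: normalize the bin counts into a density, then accumulate the area bin by bin. The pixel values below are random and purely illustrative.

```python
import numpy as np

# Hypothetical pixel intensities
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=10_000)

# Histogram counts over 64 equal-width bins
counts, edges = np.histogram(pixels, bins=64, range=(0, 256))
widths = np.diff(edges)

# Density estimate: bar heights chosen so the total area is 1
pdf = counts / (counts.sum() * widths)

# CDF: running total of the area under the density
cdf = np.cumsum(pdf * widths)

print(cdf[-1])
```

The CDF is non-decreasing and ends at 1, which is what the cumulative=True histogram shows on its own y-axis scale.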

Your task here is to plot the PDF and CDF of pixel intensities from a grayscale
image. You will use the grayscale image of Hawkes Bay, New Zealand (originally by
Phillip Capper, modified by User:Konstable, via Wikimedia Commons, CC BY 2.0).

cumulative=True permits viewing the CDF instead of the PDF.
plt.grid(False) switches off distracting grid lines.
plt.twinx() allows two plots to be overlaid sharing the x-axis but with different
scales on the y-axis.

# Load the image into an array: image


image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')

# Display image in top subplot using color map 'gray'


plt.subplot(2,1,1)
plt.imshow(image, cmap='gray')
plt.title('Original image')
plt.axis('off')

# Flatten the image into 1 dimension: pixels


pixels = image.flatten()

# Display a histogram of the pixels in the bottom subplot


plt.subplot(2,1,2)
pdf = plt.hist(pixels, bins=64, range=(0,256), density=False, color='red',
               alpha=0.4)
plt.grid(False)

# Use plt.twinx() to overlay the CDF in the bottom subplot


plt.twinx()
# Display a cumulative histogram of the pixels
cdf = plt.hist(pixels, bins=64, range=(0,256), density=True, cumulative=True,
               color='blue', alpha=0.4)

# Specify x-axis range, hide grid lines, add title and display plot
plt.xlim((0,256))
plt.grid(False)
plt.title('PDF & CDF (original image)')
plt.show()

Equalizing an image histogram


Histogram equalization is an image processing procedure that reassigns image pixel
intensities. The basic idea is to use interpolation to map the original CDF of
pixel intensities to a CDF that is almost a straight line. In essence, the pixel
intensities are spread out and this has the practical effect of making a sharper,
contrast-enhanced image. This is particularly useful in astronomy and medical
imaging to help us see more features.
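
The interpolation idea can be sketched without any plotting. This version uses np.histogram in place of plt.hist() to obtain the CDF, and a synthetic low-contrast image instead of a photograph; both substitutions are assumptions made to keep the example self-contained.

```python
import numpy as np

# Hypothetical low-contrast image: intensities confined to 100..149
rng = np.random.default_rng(0)
image = rng.integers(100, 150, size=(64, 64)).astype(np.uint8)
pixels = image.flatten()

# Cumulative distribution of the original intensities (one bin per level)
counts, bin_edges = np.histogram(pixels, bins=256, range=(0, 256))
cdf = np.cumsum(counts) / pixels.size

# Map each intensity through the CDF, rescaled to the 0..255 range
new_pixels = np.interp(pixels, bin_edges[:-1], cdf * 255)
new_image = new_pixels.reshape(image.shape)

print(image.min(), image.max())
print(new_image.min(), new_image.max())
```

After the mapping, the intensities occupy nearly the full 0 to 255 range instead of a band about 50 levels wide, which is the contrast enhancement described above.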

For this exercise, you will again work with the grayscale image of Hawkes Bay, New
Zealand (originally by Phillip Capper, modified by User:Konstable, via Wikimedia
Commons, CC BY 2.0). Notice the sample code produces the same plot as the previous
exercise. Your task is to modify the code from the previous exercise to plot the
new equalized image as well as its PDF and CDF.

The CDF of the original image is computed using plt.hist().


Notice: new_pixels interpolates new pixel values using the original image CDF.

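# Import NumPy (needed below for np.interp)
import numpy as np

# Load the image into an array: image
image = plt.imread('640px-Unequalized_Hawkes_Bay_NZ.jpg')
# Flatten the image into 1 dimension: pixels
pixels = image.flatten()
# Generate a cumulative histogram
cdf, bins, patches = plt.hist(pixels, bins=256, range=(0,256), density=True,
                              cumulative=True)
# Clear the helper histogram so it does not show behind the subplots
plt.clf()
# Interpolate new pixel values from the original CDF, scaled to 0-255
new_pixels = np.interp(pixels, bins[:-1], cdf*255)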
# Reshape new_pixels as a 2-D array: new_image
new_image = new_pixels.reshape(image.shape)

# Display the new image with 'gray' color map


plt.subplot(2,1,1)
plt.title('Equalized image')
plt.axis('off')
plt.imshow(new_image, 'gray')

# Generate a histogram of the new pixels


plt.subplot(2,1,2)
pdf = plt.hist(new_pixels, bins=64, range=(0,256), density=False, color='red',
               alpha=0.4)
plt.grid(False)

# Use plt.twinx() to overlay the CDF in the bottom subplot


plt.twinx()
plt.xlim((0,256))
plt.grid(False)
plt.title('PDF & CDF (equalized image)')

# Generate a cumulative histogram of the new pixels


cdf = plt.hist(new_pixels, bins=64, range=(0,256), cumulative=True, density=True,
               color='blue', alpha=0.4)
plt.show()

Extracting histograms from a color image


This exercise resembles the last in that you will plot histograms from an image.
This time, you will use a color image of the Helix Nebula as seen by the Hubble and
the Cerro Tololo Inter-American Observatory. The separate RGB (red-green-blue)
channels will be extracted for you as two-dimensional arrays red, green, and blue
respectively. You will plot three overlaid color histograms on common axes (one for
each channel) in a subplot as well as the original image in a separate subplot.

# Load the image into an array: image


image = plt.imread('hs-2004-32-b-small_web.jpg')
# Display image in top subplot
plt.subplot(2,1,1)
plt.title('Original image')
plt.axis('off')
plt.imshow(image)

# Extract 2-D arrays of the RGB channels: red, green, blue
# (channel 0 is red, channel 1 is green, channel 2 is blue)
red, green, blue = image[:,:,0], image[:,:,1], image[:,:,2]

# Flatten the 2-D arrays of the RGB channels into 1-D
red_pixels = red.flatten()
green_pixels = green.flatten()
blue_pixels = blue.flatten()

# Overlay histograms of the pixels of each color in the bottom subplot


plt.subplot(2,1,2)
plt.title('Histograms from color image')
plt.xlim((0,256))
plt.hist(red_pixels, bins=64, density=True, color='red', alpha=.2)
plt.hist(blue_pixels, bins=64, density=True, color='blue', alpha=.2)
plt.hist(green_pixels, bins=64, density=True, color='green', alpha=.2)

# Display the plot


plt.show()

Extracting bivariate histograms from a color image


Rather than overlaying univariate histograms of intensities in distinct channels,
it is also possible to view the joint variation of pixel intensity in two different
channels.

For this final exercise, you will use the same color image of the Helix Nebula as
seen by the Hubble and the Cerro Tololo Inter-American Observatory. The separate
RGB (red-green-blue) channels will be extracted for you as one-dimensional arrays
red_pixels, green_pixels, & blue_pixels respectively.
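
What a bivariate histogram counts can be sketched with np.histogram2d, which bins pairs of values on a 2-D grid (plt.hist2d draws the same counts as a pseudocolor plot). The two channels below are synthetic and deliberately correlated; they are not taken from the Helix Nebula image.

```python
import numpy as np

# Hypothetical flattened channels: green tracks red with some noise
rng = np.random.default_rng(0)
red_pixels = rng.integers(0, 256, size=5000)
green_pixels = np.clip(red_pixels + rng.integers(-20, 21, size=5000), 0, 255)

# Count pixel pairs falling into each cell of a 32 x 32 grid
counts, xedges, yedges = np.histogram2d(
    red_pixels, green_pixels, bins=(32, 32), range=[[0, 256], [0, 256]]
)

print(counts.shape, int(counts.sum()))
```

Because the two channels vary together here, most of the counts concentrate near the diagonal of the grid.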

# Load the image into an array: image


image = plt.imread('hs-2004-32-b-small_web.jpg')

# Extract RGB channels (0 is red, 1 is green, 2 is blue) and flatten into 1-D arrays
red, green, blue = image[:,:,0], image[:,:,1], image[:,:,2]
red_pixels = red.flatten()
green_pixels = green.flatten()
blue_pixels = blue.flatten()

# Generate a 2-D histogram of the red and green pixels


plt.subplot(2,2,1)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('red')
plt.ylabel('green')
plt.hist2d(red_pixels, green_pixels, bins=(32,32))

# Generate a 2-D histogram of the green and blue pixels


plt.subplot(2,2,2)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('green')
plt.ylabel('blue')
plt.hist2d(green_pixels, blue_pixels, bins=(32,32))

# Generate a 2-D histogram of the blue and red pixels


plt.subplot(2,2,3)
plt.grid(False)
plt.xticks(rotation=60)
plt.xlabel('blue')
plt.ylabel('red')
plt.hist2d(blue_pixels, red_pixels, bins=(32,32))

# Display the plot


plt.show()

####
####
#### VIS BOKEH

A simple scatter plot


In this example, you're going to make a scatter plot of female literacy vs
fertility using data from the European Environmental Agency. This dataset
highlights that countries with low female literacy have high birthrates. The x-axis
data has been loaded for you as fertility and the y-axis data has been loaded as
female_literacy.

Your job is to create a figure, assign x-axis and y-axis labels, and plot
female_literacy vs fertility using the circle glyph.

After you have created the figure, in this exercise and the ones to follow, play
around with it! Explore the different options available to you on the tab to the
right, such as "Pan", "Box Zoom", and "Wheel Zoom". You can click on the question
mark sign for more details on any of these tools.

Note: You may have to scroll down to view the lower portion of the figure.

# Import figure from bokeh.plotting


from bokeh.plotting import figure

# Import output_file and show from bokeh.io


from bokeh.io import output_file, show

# Create the figure: p


p = figure(x_axis_label='fertility (children per woman)',
y_axis_label='female_literacy (% population)')

# Add a circle glyph to the figure p


p.circle(fertility, female_literacy)

# Call the output_file() function and specify the name of the file
output_file('fert_lit.html')

# Display the plot


show(p)

A scatter plot with different shapes


By calling multiple glyph functions on the same figure object, we can overlay
multiple data sets in the same figure.

In this exercise, you will plot female literacy vs fertility for two different
regions, Africa and Latin America. Each set of x and y data has been loaded
separately for you as fertility_africa, female_literacy_africa,
fertility_latinamerica, and female_literacy_latinamerica. Plot the Latin America
data with the circle() glyph, and the Africa data with the x() glyph. figure has
already been imported for you from bokeh.plotting.

# Create the figure: p


p = figure(x_axis_label='fertility', y_axis_label='female_literacy (% population)')

# Add a circle glyph to the figure p


p.circle(fertility_latinamerica,female_literacy_latinamerica)

# Add an x glyph to the figure p


p.x(fertility_africa, female_literacy_africa)

# Specify the name of the file


output_file('fert_lit_separate.html')

# Display the plot


show(p)

Customizing your scatter plots


The three most important arguments to customize scatter glyphs are color, size, and
alpha. Bokeh accepts colors as hexadecimal strings, tuples of RGB values between 0
and 255, and any of the 147 CSS color names. Size values are supplied in screen
space units (pixels), so a marker keeps the same on-screen size when you zoom.

# Create the figure: p


p = figure(x_axis_label='fertility (children per woman)',
y_axis_label='female_literacy (% population)')

# Add a blue circle glyph to the figure p


p.circle(fertility_latinamerica, female_literacy_latinamerica, color='blue',
size=10, alpha=.8)

# Add a red circle glyph to the figure p


p.circle(fertility_africa, female_literacy_africa, color='red', size=10, alpha=.8)

# Specify the name of the file


output_file('fert_lit_separate_colors.html')

# Display the plot


show(p)

Lines
We can draw lines on Bokeh plots with the line() glyph function.
In this exercise, you'll plot the daily adjusted closing price of Apple Inc.'s
stock (AAPL) from 2000 to 2013.
The data points are provided for you as lists. date is a list of datetime objects
to plot on the x-axis and price is a list of prices to plot on the y-axis.
Since we are plotting dates on the x-axis, you must add x_axis_type='datetime' when
creating the figure object.
# Import figure from bokeh.plotting
from bokeh.plotting import figure

# Create a figure with x_axis_type="datetime": p


p = figure(x_axis_type='datetime', x_axis_label='Date', y_axis_label='US Dollars')

# Plot date along the x axis and price along the y axis
p.line(date, price)

# Specify the name of the output file and show the result
output_file('line.html')
show(p)

Lines and markers


Lines and markers can be combined by plotting them separately using the same data
points.
In this exercise, you'll plot a line and circle glyph for the AAPL stock prices.
Further, you'll adjust the fill_color keyword argument of the circle() glyph
function while leaving the line_color at the default value. The date and price
lists are provided. The Bokeh figure object p that you created in the previous
exercise has also been provided.

# Import figure from bokeh.plotting


from bokeh.plotting import figure

# Create a figure with x_axis_type='datetime': p


p = figure(x_axis_type='datetime', x_axis_label='Date', y_axis_label='US Dollars')
p.line(date, price)

# With date on the x-axis and price on the y-axis, add a white circle glyph of size 4
p.circle(date, price, fill_color='white', size=4)

# Specify the name of the output file and show the result
output_file('line.html')
show(p)

Patches
In Bokeh, extended geometrical shapes can be plotted by using the patches() glyph
function. The patches glyph takes as input a list-of-lists collection of numeric
values specifying the vertices in x and y directions of each distinct patch to
plot.
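
The list-of-lists input can be illustrated with two simple squares. The coordinates and names below are made up for this sketch and are unrelated to the state border data.

```python
from bokeh.plotting import figure

# Each inner list holds the vertices of one patch (hypothetical unit squares)
xs = [[0, 1, 1, 0], [2, 3, 3, 2]]
ys = [[0, 0, 1, 1], [0, 0, 1, 1]]

p = figure()

# One call draws both patches; xs[i] and ys[i] describe patch i
renderer = p.patches(xs, ys, line_color='white')

print(len(renderer.data_source.data['xs']))
```

Calling output_file() and show(p) afterwards would write and open the plot, as in the exercises.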

In this exercise, you will plot the state borders of Arizona, Colorado, New Mexico
and Utah. The latitude and longitude vertices for each state have been prepared as
lists. Your job is to plot longitude on the x-axis and latitude on the y-axis. The
figure object has been created for you as p.

# Create a list of az_lons, co_lons, nm_lons and ut_lons: x


x = [az_lons, co_lons, nm_lons, ut_lons]

# Create a list of az_lats, co_lats, nm_lats and ut_lats: y


y = [az_lats, co_lats, nm_lats, ut_lats]

# Add patches to figure p with line_color=white for x and y


p.patches(x,y, line_color='white')
## p.axis.visible=False

# Specify the name of the output file and show the result
output_file('four_corners.html')
show(p)

Plotting data from NumPy arrays


In the previous exercises, you made plots using data stored in lists. You learned
that Bokeh can plot both numbers and datetime objects.

In this exercise, you'll generate NumPy arrays using np.linspace() and np.cos() and
plot them using the circle glyph.

np.linspace() is a function that returns an array of evenly spaced numbers over a
specified interval. For example, np.linspace(0, 10, 5) returns an array of 5 evenly
spaced samples calculated over the interval [0, 10]. np.cos(x) calculates the
element-wise cosine of some array x.
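
The example call mentioned above can be run directly to see the samples it produces:

```python
import numpy as np

# Five evenly spaced samples over [0, 10]; both endpoints are included
x = np.linspace(0, 10, 5)
print(x.tolist())  # [0.0, 2.5, 5.0, 7.5, 10.0]

# Element-wise cosine: the result has the same shape as x
y = np.cos(x)
print(y.shape)  # (5,)
```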

For more information on NumPy functions, you can refer to the NumPy User Guide and
NumPy Reference.

The figure p has been provided for you.

# Import numpy as np
import numpy as np

# Create array using np.linspace: x


x = np.linspace(0,5,100)

# Create array using np.cos: y


y = np.cos(x)

# Add circles at x and y


p.circle(x,y)

# Specify the name of the output file and show the result
output_file('numpy.html')
show(p)

Plotting data from Pandas DataFrames


You can create Bokeh plots from Pandas DataFrames by passing column selections to
the glyph functions.
Bokeh can plot floating point numbers, integers, and datetime data types. In this
example, you will read a CSV file containing information on 392 automobiles
manufactured in the US, Europe and Asia from 1970 to 1982.

The CSV file is provided for you as 'auto.csv'. Your job is to plot miles-per-
gallon (mpg) vs horsepower (hp) by passing Pandas column selections into the
p.circle() function. Additionally, each glyph will be colored according to values
in the color column.

# Import pandas as pd
import pandas as pd
# Read in the CSV file: df
df = pd.read_csv('auto.csv')

# Import figure from bokeh.plotting


from bokeh.plotting import figure

# Create the figure: p


p = figure(x_axis_label='HP', y_axis_label='MPG')

# Plot mpg vs hp by color


p.circle(df.hp, df.mpg, color = df.color, size = 10)

# Specify the name of the output file and show the result
output_file('auto-df.html')
show(p)

The Bokeh ColumnDataSource (continued)


You can create a ColumnDataSource object directly from a Pandas DataFrame by
passing the DataFrame to the class initializer.
In this exercise, we have imported pandas as pd and read in a data set containing
all Olympic medals awarded in the 100 meter sprint from 1896 to 2012. A color
column has been added indicating the CSS colorname we wish to use in the plot for
every data point. Your job is to import the ColumnDataSource class, create a new
ColumnDataSource object from the DataFrame df, and plot circle glyphs with 'Year'
on the x-axis and 'Time' on the y-axis. Color each glyph by the color column.

The figure object p has already been created for you.
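
To see what the conversion produces, the sketch below builds a ColumnDataSource from a tiny DataFrame. The rows are hypothetical placeholders, not the Olympic data set, and the class is imported from bokeh.models here.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical rows with the same column names as the exercise
df = pd.DataFrame({
    'Year': [2004, 2008, 2012],
    'Time': [10.0, 9.9, 9.8],
    'color': ['goldenrod', 'goldenrod', 'goldenrod'],
})

# Each DataFrame column becomes a named column of the data source
source = ColumnDataSource(df)

print(sorted(source.data.keys()))
```

Glyph functions can then refer to columns by name, as in p.circle('Year', 'Time', source=source).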

# Import the ColumnDataSource class from bokeh.plotting


from bokeh.plotting import ColumnDataSource

# Create a ColumnDataSource from df: source


source = ColumnDataSource(df)

# Add circle glyphs to the figure p


p.circle('Year', 'Time', source=source, size=8, color='color')

# Specify the name of the output file and show the result
output_file('sprint.html')
show(p)

Selection and non-selection glyphs


In this exercise, you're going to add the box_select tool to a figure and change
the selected and non-selected circle glyph properties so that selected glyphs are
red and non-selected glyphs are transparent blue.

You'll use the ColumnDataSource object of the Olympic Sprint dataset you made in
the last exercise. It is provided to you with the name source. After you have
created the figure, be sure to experiment with the Box Select tool you added! As in
previous exercises, you may have to scroll down to view the lower portion of the
figure.

# Create a figure with the "box_select" tool: p


p = figure(x_axis_label='Year', y_axis_label='Time', tools='box_select')

# Add circle glyphs to the figure p with the selected and non-selected properties
p.circle('Year','Time', source=source, selection_color='red',
nonselection_alpha=.1)

# Specify the name of the output file and show the result
output_file('selection_glyph.html')
show(p)

Hover glyphs
Now let's practice using and customizing the hover tool.
In this exercise, you're going to plot the blood glucose levels for an unknown
patient. The blood glucose levels were recorded every 5 minutes on October 7th
starting at 3 minutes past midnight.

The date and time of each measurement are provided to you as x and the blood
glucose levels in mg/dL are provided as y. A bokeh figure is also provided in the
workspace as p. Your job is to add a circle glyph that will appear red when the
mouse is hovered near the data points. You will also add a customized hover tool
object to the plot.

# import the HoverTool


from bokeh.models import HoverTool

# Add circle glyphs to figure p


p.circle(x, y, size=10,
fill_color='grey', alpha=.1, line_color=None,
hover_fill_color='firebrick', hover_alpha=.5, hover_line_color='white')

# Create a HoverTool: hover


hover = HoverTool(tooltips=None, mode='vline')

# Add the hover tool to the figure p


p.add_tools(hover)

# Specify the name of the output file and show the result
output_file('hover_glyph.html')
show(p)

Colormapping
The final glyph customization we'll practice is using the CategoricalColorMapper to
color each glyph by a categorical property. Here, you're going to use the
automobile dataset to plot miles-per-gallon vs weight and color each circle glyph
by the region where the automobile was manufactured.

The origin column will be used in the ColorMapper to color automobiles manufactured
in the US as blue, Europe as red and Asia as green. The automobile data set is
provided to you as a Pandas DataFrame called df. The figure is provided for you as
p.

#Import CategoricalColorMapper from bokeh.models


from bokeh.models import CategoricalColorMapper

# Convert df to a ColumnDataSource: source


source = ColumnDataSource(df)

# Make a CategoricalColorMapper object: color_mapper


color_mapper = CategoricalColorMapper(factors=['Europe', 'Asia', 'US'],
palette=['red', 'green', 'blue'])

# Add a circle glyph to the figure p


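# legend_field groups legend entries by the 'origin' column (older Bokeh
# releases used legend='origin' here)
p.circle('weight', 'mpg', source=source, legend_field='origin',
         color=dict(field='origin', transform=color_mapper))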

# Specify the name of the output file and show the result
output_file('colormap.html')
show(p)

Creating rows of plots

# Import row from bokeh.layouts


from bokeh.layouts import row

# Create the first figure: p1


p1 = figure(x_axis_label='fertility (children per woman)',
y_axis_label='female_literacy (% population)')

# Add a circle glyph to p1


p1.circle('fertility', 'female_literacy', source=source)

# Create the second figure: p2


p2 = figure(x_axis_label='population',
            y_axis_label='female_literacy (% population)')

# Add a circle glyph to p2


p2.circle('population', 'female_literacy', source=source)

# Put p1 and p2 into a horizontal row: layout


layout = row(p1,p2)

# Specify the name of the output_file and show the result


output_file('fert_row.html')
show(layout)

Creating columns of plots


In this exercise, you're going to use the column() function to create a single
column layout of the two plots you created in the previous exercise.

Figure p1 has been created for you.

In this exercise and the ones to follow, you may have to scroll down to view the
lower portion of the figure.

# Import column from the bokeh.layouts module


from bokeh.layouts import column

# Create a blank figure: p1


p1 = figure(x_axis_label='fertility (children per woman)',
y_axis_label='female_literacy (% population)')

# Add circle scatter to the figure p1


p1.circle('fertility', 'female_literacy', source=source)

# Create a new blank figure: p2


p2 = figure(x_axis_label='population',
            y_axis_label='female_literacy (% population)')
# Add circle scatter to the figure p2
p2.circle('population', 'female_literacy', source=source)

# Put plots p1 and p2 in a column: layout


layout = column(p1,p2)

# Specify the name of the output_file and show the result


output_file('fert_column.html')
show(layout)

Nesting rows and columns of plots


You can create nested layouts of plots by combining row and column layouts.

In this exercise, you'll make a 3-plot layout in two rows using the auto-mpg data
set.

Three plots have been created for you of average mpg vs year, mpg vs hp, and mpg vs
weight.

Your job is to use the column() and row() functions to make a two-row layout where
the first row will have only the average mpg vs year plot and the second row will
have mpg vs hp and mpg vs weight plots as columns.

By using the sizing_mode argument, you can scale the widths to fill the whole
figure.

# Import column and row from bokeh.layouts


from bokeh.layouts import row, column

# Make a row layout that will be used as the second row: row2
row2 = row([mpg_hp, mpg_weight], sizing_mode='scale_width')

# Make a column layout that stacks avg_mpg above row2: layout
layout = column([avg_mpg, row2], sizing_mode='scale_width')

# Specify the name of the output_file and show the result


output_file('layout_custom.html')
show(layout)

Creating gridded layouts


Regular grids of Bokeh plots can be generated with gridplot.

In this example, you're going to display four plots of fertility vs female literacy
for four regions: Latin America, Africa, Asia and Europe.

Your job is to create a list-of-lists for the four Bokeh plots that have been
provided to you as p1, p2, p3 and p4. The list-of-lists defines the row and column
placement of each plot.

# Import gridplot from bokeh.layouts


from bokeh.layouts import gridplot

# Create a list containing plots p1 and p2: row1


row1 = [p1,p2]
# Create a list containing plots p3 and p4: row2
row2 = [p3,p4]

# Create a gridplot using row1 and row2: layout


layout = gridplot([row1, row2])

# Specify the name of the output_file and show the result


output_file('grid.html')
show(layout)

Starting tabbed layouts


Tabbed layouts can be created in Bokeh by placing plots or layouts in Panels.
In this exercise, you'll take the four fertility vs female literacy plots from the
last exercise and make a Panel() for each.

# Import Panel from bokeh.models.widgets
# (Bokeh 3.x renamed this class TabPanel and exposes it from bokeh.models)


from bokeh.models.widgets import Panel

# Create one Panel per plot p1 through p4: tab1 through tab4


tab1 = Panel(child=p1, title='Latin America')
tab2 = Panel(child=p2, title='Africa')
tab3 = Panel(child=p3, title='Asia')
tab4 = Panel(child=p4, title='Europe')

Displaying tabbed layouts


Tabbed layouts are collections of Panel objects. Using the figures and Panels from
the previous two exercises, you'll create a tabbed layout to change the region in
the fertility vs female literacy plots.
Your job is to create the layout using Tabs() and assign the tabs keyword argument
to your list of Panels. The Panels have been created for you as tab1, tab2, tab3
and tab4.

# Import Tabs from bokeh.models.widgets


from bokeh.models.widgets import Tabs

# Create a Tabs layout: layout


layout = Tabs(tabs=[tab1, tab2, tab3, tab4])

# Specify the name of the output_file and show the result


output_file('tabs.html')
show(layout)

Linked axes
Linking axes between plots is achieved by sharing range objects.

In this exercise, you'll link four plots of female literacy vs fertility so that
when one plot is zoomed or dragged, one or more of the other plots will respond.

The four plots p1, p2, p3 and p4 along with the layout that you created in the last
section have been provided for you.

Your job is to link p1 with the three other plots by assigning their .x_range and
.y_range attributes.

After you have linked the axes, explore the plots by clicking and dragging along
the x or y axes of any of the plots, and notice how the linked plots change
together.

# Link the x_range of p2 to p1: p2.x_range


p2.x_range = p1.x_range

# Link the y_range of p2 to p1: p2.y_range


p2.y_range = p1.y_range

# Link the x_range of p3 to p1: p3.x_range


p3.x_range = p1.x_range

# Link the y_range of p4 to p1: p4.y_range


p4.y_range = p1.y_range

# Specify the name of the output_file and show the result


output_file('linked_range.html')
show(layout)
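
Linking by assignment works because both plots end up holding the same range object rather than two copies. The sketch below shows that idea in plain Python. The Range and Plot classes are hypothetical stand-ins written for this illustration, not the Bokeh classes.

```python
class Range:
    """Hypothetical stand-in for a plot range with start and end values."""
    def __init__(self, start, end):
        self.start = start
        self.end = end

class Plot:
    """Hypothetical stand-in for a figure; each plot starts with its own ranges."""
    def __init__(self):
        self.x_range = Range(0, 10)
        self.y_range = Range(0, 100)

p1, p2, p3 = Plot(), Plot(), Plot()

# Link p2 to p1 by sharing the same range object (not a copy).
p2.x_range = p1.x_range

# "Zooming" p1 changes the shared object, so p2 sees the change too.
p1.x_range.start, p1.x_range.end = 2, 5

linked = (p2.x_range.start, p2.x_range.end)      # follows p1
unlinked = (p3.x_range.start, p3.x_range.end)    # keeps its own range
print(linked, unlinked)
```

In this sketch p3 was never linked, so its range stays at its initial values while p2 follows p1.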

Linked brushing
By sharing the same ColumnDataSource object between multiple plots, selection tools
like BoxSelect and LassoSelect will highlight points in both plots that share a row
in the ColumnDataSource.

In this exercise, you'll plot female literacy vs fertility and population vs
fertility in two plots using the same ColumnDataSource.

After you have built the figure, experiment with the Lasso Select and Box Select
tools. Use your mouse to drag a box or lasso around points in one figure, and
notice how points in the other figure that share a row in the ColumnDataSource also
get highlighted.

Before experimenting with the Lasso Select, however, click the Bokeh plot pop-out
icon to pop out the figure so that you can definitely see everything that you're
doing.

# Create ColumnDataSource: source


source = ColumnDataSource(data)

# Create the first figure: p1


p1 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='female literacy (% population)',
            tools='box_select,lasso_select')
# Add a circle glyph to p1
p1.circle('fertility', 'female_literacy', source=source)

# Create the second figure: p2


p2 = figure(x_axis_label='fertility (children per woman)',
            y_axis_label='population (millions)',
            tools='box_select,lasso_select')
# Add a circle glyph to p2
p2.circle('fertility', 'population', source=source)

# Create row layout of figures p1 and p2: layout


layout = row(p1,p2)

# Specify the name of the output_file and show the result


output_file('linked_brush.html')
show(layout)
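
The reason a selection in one plot shows up in the other can be sketched in plain Python: both plots read rows from one shared source, and the selection is stored as row indices on that source. The Source class, the highlighted helper, and the numbers below are hypothetical and made up for illustration; they are not the Bokeh API or the course data.

```python
class Source:
    """Hypothetical stand-in for a shared data source with a selection."""
    def __init__(self, data):
        self.data = data          # dict of column name -> list of values
        self.selected = []        # row indices currently selected

def highlighted(source, x_col, y_col):
    """Return the (x, y) points a plot would highlight for the current selection."""
    return [(source.data[x_col][i], source.data[y_col][i]) for i in source.selected]

# Made-up illustrative values, not the course data.
source = Source({
    "fertility": [1.8, 5.2, 2.4],
    "female_literacy": [95.0, 40.0, 88.0],
    "population": [50.0, 20.0, 120.0],
})

# Selecting rows 0 and 2 in either plot updates the one shared selection.
source.selected = [0, 2]

in_plot1 = highlighted(source, "fertility", "female_literacy")
in_plot2 = highlighted(source, "fertility", "population")
print(in_plot1)
print(in_plot2)
```

Both plots highlight rows 0 and 2, each using its own pair of columns.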

How to create legends


Legends can be added to any glyph by using the legend keyword argument. In this
exercise, you will plot two circle glyphs for female literacy vs fertility in
Africa and Latin America.

Two ColumnDataSources called latin_america and africa have been provided. Plot two
circle glyphs for these two objects. The figure p has been provided for you.

# Add the first circle glyph to the figure p


p.circle('fertility', 'female_literacy', source=latin_america, size=10,
color='red', legend='Latin America')

# Add the second circle glyph to the figure p


p.circle('fertility', 'female_literacy', source=africa, size=10, color='blue',
legend='Africa')

# Specify the name of the output_file and show the result


output_file('fert_lit_groups.html')
show(p)

Positioning and styling legends


Properties of the legend can be changed by using the legend member attribute of a
Bokeh figure after the glyphs have been plotted.

In this exercise, you'll adjust the background color and legend location of the
female literacy vs fertility plot from the previous exercise.

The figure object p has been created for you along with the circle glyphs.

# Assign the legend to the bottom left


p.legend.location = 'bottom_left'
# Fill the legend background with the color 'lightgray':
p.legend.background_fill_color = 'lightgray'

# Specify the name of the output_file and show the result


output_file('fert_lit_groups.html')
show(p)

Adding a hover tooltip


Working with the HoverTool is easy for data stored in a ColumnDataSource.

In this exercise, you will create a HoverTool object and display the country for
each circle glyph in the figure that you created in the last exercise. This is done
by assigning the tooltips keyword argument to a list-of-tuples specifying the label
and the column of values from the ColumnDataSource using the @ operator.

The figure object has been prepared for you as p.

# Import HoverTool from bokeh.models


from bokeh.models import HoverTool

# Create a HoverTool object: hover


hover = HoverTool(tooltips=[('Country','@Country')])
# Add the HoverTool object to figure p
p.add_tools(hover)

# Specify the name of the output_file and show the result


output_file('hover.html')
show(p)
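
To make the role of the @ prefix concrete, here is a simplified plain-Python model of how a tooltip template could be filled in for one hovered point. It assumes that '@name' means "take the value of column name at the hovered row". The function render_tooltips and the sample values are hypothetical; Bokeh's real implementation runs in the browser and supports more syntax than this.

```python
import re

def render_tooltips(tooltips, columns, index):
    """Fill each '@name' placeholder with the value at `index` in that column."""
    def lookup(match):
        return str(columns[match.group(1)][index])
    return [(label, re.sub(r"@(\w+)", lookup, template))
            for label, template in tooltips]

# Made-up illustrative columns, not the course data.
columns = {"Country": ["Country A", "Country B"], "fertility": [1.9, 4.6]}
tooltips = [("Country", "@Country"), ("Fertility", "@fertility")]

# Hovering over the second point (row index 1).
rendered = render_tooltips(tooltips, columns, 1)
print(rendered)
```

Note that the name after @ must match a column name exactly, including case, which is why the exercise uses '@Country' for a column called Country.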

Understanding Bokeh apps


The main purpose of the Bokeh server is to synchronize Python objects with web
applications in a browser, so that rich, interactive data applications can be
connected to powerful PyData libraries such as NumPy, SciPy, Pandas, and
scikit-learn.

What sort of properties can the Bokeh server automatically keep in sync? Bokeh
server will automatically keep every property of any Bokeh object in sync.

Using the current document


Let's get started with building an interactive Bokeh app. This typically begins
with importing the curdoc, or "current document", function from bokeh.io. This
current document will eventually hold all the plots, controls, and layouts that you
create. Your job in this exercise is to use this function to add a single plot to
your application.

In the video, Bryan described the process for running a Bokeh app using the bokeh
serve command line tool. In this chapter and the one that follows, the DataCamp
environment does this for you behind the scenes. Notice that your code is part of a
script.py file. When you hit 'Submit Answer', you'll see in the IPython Shell that
we call bokeh serve script.py for you.

Remember, as in the previous chapters, that there are different options available
for you to interact with your plots, and as before, you may have to scroll down to
view the lower portion of the plots.

# Perform necessary imports


from bokeh.io import curdoc
from bokeh.plotting import figure

# Create a new plot: plot


plot = figure()

# Add a line to the plot


plot.line([1,2,3,4,5],[2,5,4,6,7])

# Add the plot to the current document


curdoc().add_root(plot)

Add a single slider


In the previous exercise, you added a single plot to the "current document" of your
application. In this exercise, you'll practice adding a layout to your current
document.

Your job here is to create a single slider, use it to create a widgetbox layout,
and then add this layout to the current document.
The slider you create here cannot be used for much, but in the later exercises,
you'll use it to update your plots!

# Perform the necessary imports


from bokeh.io import curdoc
from bokeh.layouts import widgetbox
from bokeh.models import Slider

# Create a slider: slider


slider = Slider(title='my slider', start=0, end=10, step=0.1, value=2)

# Create a widgetbox layout: layout


layout = widgetbox(slider)

# Add the layout to the current document


curdoc().add_root(layout)

Multiple sliders in one document


Having added a single slider in a widgetbox layout to your current document, you'll
now add multiple sliders into the current document.

Your job in this exercise is to create two sliders, add them to a widgetbox layout,
and then add the layout into the current document.

# Perform necessary imports


from bokeh.io import curdoc
from bokeh.layouts import widgetbox
from bokeh.models import Slider

# Create first slider: slider1


slider1 = Slider(title='slider1', start=0, end=10, step=.1, value=2)

# Create second slider: slider2


slider2 = Slider(title='slider2', start=10, end=100, step=1, value=20)

# Add slider1 and slider2 to a widgetbox


layout = widgetbox(slider1, slider2)

# Add the layout to the current document


curdoc().add_root(layout)

How to combine Bokeh models into layouts


Let's begin making a Bokeh application that has a simple slider and plot, that also
updates the plot based on the slider.

In this exercise, your job is to first explicitly create a ColumnDataSource. You'll
then combine a plot and a slider into a single column layout, and add it to the
current document.

After you are done, notice how in the figure you generate, the slider will not
actually update the plot, because a widget callback has not been defined. You'll
learn how to update the plot using widget callbacks in the next exercise.

All the necessary modules have been imported for you. The plot is available in the
workspace as plot, and the slider is available as slider.

# Create ColumnDataSource: source


source = ColumnDataSource(data={'x': x, 'y': y})

# Add a line to the plot


plot.line('x', 'y', source=source)

# Create a column layout: layout


layout = column(widgetbox(slider), plot)

# Add the layout to the current document


curdoc().add_root(layout)

Learn about widget callbacks


You'll now learn how to use widget callbacks to update the state of a Bokeh
application, and in turn, the data that is presented to the user.

Your job in this exercise is to use the slider's on_change() method to update the
plot's data from the previous example. NumPy's sin() function will be used to
update the y-axis data of the plot.

Now that you have added a widget callback, notice how as you move the slider of
your app, the figure also updates!

Define a callback function with the parameters attr, old, new.
Read the current value of slider as a variable scale. You can do this using
slider.value.
Compute values for the updated y using np.sin(scale/x).
Update source.data with the new data dictionary.
Attach the callback to the 'value' property of slider. This can be done using
on_change() and passing in 'value' and callback.

# Define a callback function: callback
def callback(attr, old, new):
    # Read the current value of the slider: scale
    scale = slider.value

    # Compute the updated y using np.sin(scale/x): new_y
    new_y = np.sin(scale/x)

    # Update source with the new data values
    source.data = {'x': x, 'y': new_y}

# Attach the callback to the 'value' property of slider


slider.on_change('value', callback)

# Create layout and add to current document


layout = column(widgetbox(slider), plot)
curdoc().add_root(layout)
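
The (attr, old, new) callback signature can be tried out without a running Bokeh server. The sketch below uses a hypothetical FakeSlider class that mimics only the part of the behavior described above: callbacks registered with on_change('value', ...) are called whenever value changes. It is an illustration of the pattern, not the Bokeh implementation, and it uses math.sin on a plain list in place of np.sin on an array.

```python
import math

class FakeSlider:
    """Hypothetical stand-in for a slider widget; not the Bokeh class."""
    def __init__(self, value):
        self._value = value
        self._callbacks = []

    def on_change(self, attr, callback):
        # This sketch only tracks the 'value' attribute.
        if attr != "value":
            raise ValueError("only 'value' is supported in this sketch")
        self._callbacks.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new):
        old, self._value = self._value, new
        for cb in self._callbacks:
            cb("value", old, new)

x = [0.5, 1.0, 2.0]
source_data = {"x": x, "y": [math.sin(1 / xi) for xi in x]}
slider = FakeSlider(value=1)

def callback(attr, old, new):
    # Same steps as the exercise: read the scale, recompute y, replace the data.
    scale = slider.value
    source_data["y"] = [math.sin(scale / xi) for xi in x]

slider.on_change("value", callback)
slider.value = 2   # simulates dragging the slider
print(source_data["y"])
```

Inside the callback, new and slider.value hold the same number, so either can be used to read the scale.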

Updating data sources from dropdown callbacks


You'll now learn to update the plot's data using a drop down menu instead of a
slider. This would allow users to do things like select between different data
sources to view.

The ColumnDataSource source has been created for you along with the plot. Your job
in this exercise is to add a drop down menu to update the plot's data.

All necessary modules have been imported for you.

# Perform necessary imports


from bokeh.models import ColumnDataSource, Select

# Create ColumnDataSource: source


source = ColumnDataSource(data={
'x' : fertility,
'y' : female_literacy
})

# Create a new plot: plot


plot = figure()

# Add circles to the plot


plot.circle('x', 'y', source=source)

# Define a callback function: update_plot
def update_plot(attr, old, new):
    # If the new Selection is 'female_literacy', update 'y' to female_literacy
    if new == 'female_literacy':
        source.data = {
            'x' : fertility,
            'y' : female_literacy
        }
    # Else, update 'y' to population
    else:
        source.data = {
            'x' : fertility,
            'y' : population
        }

# Create a dropdown Select widget: select


select = Select(title="distribution", options=['female_literacy', 'population'],
value='female_literacy')

# Attach the update_plot callback to the 'value' property of select


select.on_change('value', update_plot)

# Create layout and add to current document


layout = row(select, plot)
curdoc().add_root(layout)

Synchronize two dropdowns


Here, you'll practice using a dropdown callback to update another dropdown's
options. This will allow you to customize your applications even further and is a
powerful addition to your toolbox.
Your job in this exercise is to create two dropdown select widgets and then define
a callback such that one dropdown is used to update the other dropdown. All modules
necessary have been imported.

# Create two dropdown Select widgets: select1, select2


select1 = Select(title='First', options=['A', 'B'], value='A')
select2 = Select(title='Second', options=['1', '2', '3'], value='1')

# Define a callback function: callback
def callback(attr, old, new):
    # If select1 is 'A'
    if select1.value == 'A':
        # Set select2 options to ['1', '2', '3']
        select2.options = ['1','2','3']

        # Set select2 value to '1'
        select2.value = '1'
    else:
        # Set select2 options to ['100', '200', '300']
        select2.options = ['100','200','300']

        # Set select2 value to '100'
        select2.value = '100'

# Attach the callback to the 'value' property of select1


select1.on_change('value', callback)

# Create layout and add to current document


layout = widgetbox(select1, select2)
curdoc().add_root(layout)
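
When the first dropdown has more than two choices, the if/else in the callback above can grow quickly. One possible alternative, shown here as a plain-Python sketch, is to keep the dependent options in a dictionary. The names DEPENDENT_OPTIONS and options_and_default are hypothetical and introduced only for this illustration.

```python
# Map each value of the first dropdown to the options of the second.
DEPENDENT_OPTIONS = {
    "A": ["1", "2", "3"],
    "B": ["100", "200", "300"],
}

def options_and_default(first_value):
    """Return (options, default value) for the second dropdown."""
    options = DEPENDENT_OPTIONS[first_value]
    return options, options[0]

options, default = options_and_default("B")
print(options, default)
```

Inside a callback, the two returned values would be assigned to select2.options and select2.value. Resetting the value together with the options matters because the previous value (for example '1') is usually not among the new options.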

Button widgets
It's time to practice adding buttons to your interactive visualizations. Your job
in this exercise is to create a button and use its on_click() method to update a
plot.
All necessary modules have been imported for you. In addition, the ColumnDataSource
with data x and y as well as the figure have been created for you and are available
in the workspace as source and plot.

# Create a Button with label 'Update Data'


button = Button(label='Update Data')

# Define an update callback with no arguments: update
def update():
    # Compute new y values: y
    y = np.sin(x) + np.random.random(N)
    # Update the ColumnDataSource data dictionary
    source.data = {'x':x,'y':y}

# Add the update callback to the button


button.on_click(update)

# Create layout and add to current document


layout = column(widgetbox(button), plot)
curdoc().add_root(layout)
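
Unlike the on_change() callbacks, the button callback above takes no arguments. The following plain-Python sketch mirrors its body using math.sin and random.random in place of the NumPy calls, and calls update() directly to simulate a click. The dictionary source_data is a hypothetical stand-in for source.data, and the seed is fixed only to make the run repeatable.

```python
import math
import random

N = 5
x = [i * 0.5 for i in range(N)]
source_data = {"x": x, "y": [math.sin(xi) for xi in x]}

def update():
    """No-argument callback: recompute y as sin(x) plus uniform noise in [0, 1)."""
    y = [math.sin(xi) + random.random() for xi in x]
    source_data["y"] = y

random.seed(0)   # fixed seed so repeated runs give the same noise
update()         # simulates one button click
print(source_data["y"])
```

Each click draws fresh noise, so every new y value lies between sin(x) and sin(x) + 1.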

Button styles
You can also get really creative with your Button widgets.
In this exercise, you'll practice using CheckboxGroup, RadioGroup, and Toggle to
add multiple Button widgets with different styles. curdoc and widgetbox have
already been imported for you.

# Import CheckboxGroup, RadioGroup, Toggle from bokeh.models


from bokeh.models import CheckboxGroup, RadioGroup, Toggle

# Add a Toggle: toggle


toggle = Toggle(button_type='success', label='Toggle button')

# Add a CheckboxGroup: checkbox


checkbox = CheckboxGroup(labels=['Option 1', 'Option 2', 'Option 3'])

# Add a RadioGroup: radio


radio = RadioGroup(labels=['Option 1', 'Option 2', 'Option 3'])

# Add widgetbox(toggle, checkbox, radio) to the current document


curdoc().add_root(widgetbox(toggle,checkbox,radio))

Introducing the project dataset


For the final chapter, you'll be looking at some of the Gapminder datasets combined
into one tidy file called "gapminder_tidy.csv". This data set is available as a
pandas DataFrame under the variable name data.

It is always a good idea to begin with some Exploratory Data Analysis. Pandas has a
number of built-in methods that help with this. For example, data.head() displays
the first five rows/entries of data, while data.tail() displays the last five
rows/entries. data.shape gives you information about how many rows and columns
there are in the data set. Another particularly useful method is data.info(), which
provides a concise summary of data, including information about the number of
entries, columns, data type of each column, and number of non-null entries in each
column.

Use the IPython Shell and the pandas methods mentioned above to explore this data
set. How many entries and columns does this data set have?

data.shape

Some exploratory plots of the data


Here, you'll continue your Exploratory Data Analysis by making a simple plot of
Life Expectancy vs Fertility for the year 1970.
Your job is to import the relevant Bokeh modules and then prepare a
ColumnDataSource object with the fertility, life and Country columns, where you
only select the rows with the index value 1970. Remember, as with the figures you
generated in previous chapters, you can interact with your figures here with a
variety of tools.

# Perform necessary imports


from bokeh.io import output_file, show
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource

# Make the ColumnDataSource: source


source = ColumnDataSource(data={
'x' : data.loc[1970].fertility,
'y' : data.loc[1970].life,
'country' : data.loc[1970].Country
})

# Create the figure: p


p = figure(title='1970', x_axis_label='Fertility (children per woman)',
y_axis_label='Life Expectancy (years)',
plot_height=400, plot_width=700,
tools=[HoverTool(tooltips='@country')])

# Add a circle glyph to the figure p


p.circle(x='x', y='y', source=source)

# Output the file and show the figure


output_file('gapminder.html')
show(p)

Beginning with just a plot


Let's get started on the Gapminder app. Your job is to make the ColumnDataSource
object, prepare the plot, and add circles for Life expectancy vs Fertility. You'll
also set x and y ranges for the axes.

As in the previous chapter, the DataCamp environment executes the bokeh serve
command to run the app for you. When you hit 'Submit Answer', you'll see in the
IPython Shell that bokeh serve script.py gets called to run the app. This is
something to keep in mind when you are creating your own interactive visualizations
outside of the DataCamp environment.

# Import the necessary modules


from bokeh.io import curdoc
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Make the ColumnDataSource: source


source = ColumnDataSource(data={
'x' : data.loc[1970].fertility,
'y' : data.loc[1970].life,
'country' : data.loc[1970].Country,
'pop' : (data.loc[1970].population / 20000000) + 2,
'region' : data.loc[1970].region,
})

# Save the minimum and maximum values of the fertility column: xmin, xmax
xmin, xmax = min(data.fertility), max(data.fertility)

# Save the minimum and maximum values of the life expectancy column: ymin, ymax
ymin, ymax = min(data.life), max(data.life)

# Create the figure: plot


plot = figure(title='Gapminder Data for 1970', plot_height=400, plot_width=800,
x_range=(xmin, xmax), y_range=(ymin, ymax))

# Add circle glyphs to the plot


plot.circle(x='x', y='y', fill_alpha=.8, source=source)

# Set the x-axis label


plot.xaxis.axis_label = 'Fertility (children per woman)'
# Set the y-axis label
plot.yaxis.axis_label = 'Life Expectancy (years)'

# Add the plot to the current document and add a title


curdoc().add_root(plot)
curdoc().title = 'Gapminder'
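
Note that xmin, xmax, ymin and ymax above are computed from the whole data set rather than from the 1970 rows alone. A plausible reason, stated here as an assumption, is that the axes then stay fixed when other years are displayed later. The plain-Python sketch below contrasts the two choices using made-up records; the helper axis_range is hypothetical.

```python
# Made-up illustrative records: (year, fertility, life expectancy).
records = [
    (1970, 6.0, 50.0),
    (1970, 2.5, 70.0),
    (2010, 4.0, 62.0),
    (2010, 1.5, 81.0),
]

def axis_range(values):
    """Return (min, max) of an iterable of numbers."""
    values = list(values)
    return min(values), max(values)

# Range over every year: the axes stay fixed when the displayed year changes.
xmin, xmax = axis_range(r[1] for r in records)
ymin, ymax = axis_range(r[2] for r in records)

# Range over 1970 only: points from other years could fall outside it.
x1970 = axis_range(r[1] for r in records if r[0] == 1970)
print((xmin, xmax), (ymin, ymax), x1970)
```

In this example the 2010 fertility value of 1.5 lies outside the 1970-only range, so a plot limited to that range would not show it.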

Enhancing the plot with some shading


Now that you have the base plot ready, you can enhance it by coloring each circle
glyph by region.

Your job is to make a list of the unique regions from the data frame, prepare a
ColorMapper, and add it to the circle glyph.

# Make a list of the unique values from the region column: regions_list
regions_list = data.region.unique().tolist()

# Import CategoricalColorMapper from bokeh.models and the Spectral6 palette from
# bokeh.palettes
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Spectral6

# Make a color mapper: color_mapper


color_mapper = CategoricalColorMapper(factors=regions_list, palette=Spectral6)

# Add the color mapper to the circle glyph


plot.circle(x='x', y='y', fill_alpha=0.8, source=source,
color=dict(field='region', transform=color_mapper), legend='region')

# Set the legend.location attribute of the plot to 'top_right'


plot.legend.location = 'top_right'

# Add the plot to the current document and add the title
curdoc().add_root(plot)
curdoc().title = 'Gapminder'
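
Conceptually, a categorical color mapper pairs each factor with one palette entry. The plain-Python sketch below shows that pairing with a dictionary. The region labels, the color names, the helper make_color_map, and the choice to raise an error when the palette is too short are all assumptions made for this illustration, not a description of how Bokeh behaves.

```python
def make_color_map(factors, palette):
    """Pair each factor with a palette color, in order."""
    if len(factors) > len(palette):
        raise ValueError("palette has fewer colors than there are factors")
    return dict(zip(factors, palette))

# Hypothetical region labels and colors, for illustration only.
regions = ["Asia", "Europe", "Asia", "Africa", "Europe"]
palette = ["red", "green", "blue", "orange"]

# Unique values in first-seen order.
regions_list = list(dict.fromkeys(regions))
color_map = make_color_map(regions_list, palette)
colors = [color_map[r] for r in regions]
print(regions_list, colors)
```

Every row with the same region receives the same color, which is what makes the legend entries meaningful.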

Adding a slider to vary the year


Until now, we've been plotting data only for 1970. In this exercise, you'll add a
slider to your plot to change the year being plotted. To do this, you'll create an
update_plot() function and associate it with a slider to select values between 1970
and 2010.

After you are done, you may have to scroll to the right to view the entire plot. As
you play around with the slider, notice that the title of the plot is not updated
along with the year. This is something you'll fix in the next exercise!

# Import the necessary modules


from bokeh.layouts import widgetbox, row
from bokeh.models import Slider

# Define the callback function: update_plot
def update_plot(attr, old, new):
    # Set the yr name to slider.value and new_data to source.data
    yr = slider.value
    new_data = {
        'x' : data.loc[yr].fertility,
        'y' : data.loc[yr].life,
        'country' : data.loc[yr].Country,
        'pop' : (data.loc[yr].population / 20000000) + 2,
        'region' : data.loc[yr].region,
    }
    source.data = new_data

# Make a slider object: slider


slider = Slider(title='Year', start=1970, end=2010, step=1, value=1970)
# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Make a row layout of widgetbox(slider) and plot and add it to the current
# document
layout = row(widgetbox(slider), plot)
curdoc().add_root(layout)
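
The core of update_plot() is "select the rows for one year, then rebuild the dictionary of columns". The sketch below shows the same step on a plain list of tuples instead of a pandas DataFrame indexed by year. The rows and the helper columns_for_year are hypothetical and the values are made up for illustration.

```python
# Made-up illustrative rows: (year, country, fertility, life).
rows = [
    (1970, "Country A", 6.0, 50.0),
    (1970, "Country B", 2.5, 70.0),
    (1971, "Country A", 5.9, 50.5),
    (1971, "Country B", 2.4, 70.3),
]

def columns_for_year(rows, yr):
    """Build the dict of columns that a callback would assign to the data source."""
    selected = [r for r in rows if r[0] == yr]
    return {
        "x": [r[2] for r in selected],
        "y": [r[3] for r in selected],
        "country": [r[1] for r in selected],
    }

new_data = columns_for_year(rows, 1971)
print(new_data)
```

All columns are rebuilt together so that they keep the same length and row order, which a column-based data source relies on.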

Customizing based on user input


Remember how in the plot from the previous exercise, the title did not update along
with the slider? In this exercise, you'll fix this.

In Python, you can format strings by specifying placeholders with the % operator.
For example, if you have a string company = 'DataCamp', you can use print('%s' %
company) to print DataCamp. Placeholders are useful when you are printing values
that are not static, such as the value of the year slider. You can specify a
placeholder for a number with %d. Here, when you're updating the plot title inside
your callback function, you should make use of a placeholder so that the year
displayed is in accordance with the value of the year slider.
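
A short, self-contained illustration of the placeholders described above follows. The year value 1985 is an arbitrary example chosen here, not a value from the exercise.

```python
yr = 1985
company = "DataCamp"

title = "Gapminder data for %d" % yr     # %d: integer placeholder
greeting = "%s" % company                # %s: string placeholder
both = "%s, %d" % (company, yr)          # several values are passed as a tuple

print(title)
print(greeting)
print(both)
```

Because the title string is rebuilt from yr each time, running the same line inside a callback keeps the displayed year in step with the slider.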

In addition to updating the plot title, you'll also create the callback function
and slider as you did in the previous exercise, so you get a chance to practice
these concepts further. All necessary modules have been imported for you, and as in
the previous exercise, you may have to scroll to the right to view the entire
figure.

# Define the callback function: update_plot
def update_plot(attr, old, new):
    # Assign the value of the slider: yr
    yr = slider.value
    # Set new_data
    new_data = {
        'x' : data.loc[yr].fertility,
        'y' : data.loc[yr].life,
        'country' : data.loc[yr].Country,
        'pop' : (data.loc[yr].population / 20000000) + 2,
        'region' : data.loc[yr].region,
    }
    # Assign new_data to: source.data
    source.data = new_data

    # Add title to figure: plot.title.text
    plot.title.text = 'Gapminder data for %d' % yr

# Make a slider object: slider


slider = Slider(title='Year', start=1970, end=2010, step=1, value=1970)
# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Make a row layout of widgetbox(slider) and plot and add it to the current
# document
layout = row(widgetbox(slider), plot)
curdoc().add_root(layout)

Adding a hover tool


In this exercise, you'll practice adding a hover tool to drill down into data
column values and display more detailed information about each scatter point.

After you're done, experiment with the hover tool and see how it displays the name
of the country when your mouse hovers over a point! The figure and slider have been
created for you and are available in the workspace as plot and slider.

# Import HoverTool from bokeh.models


from bokeh.models import HoverTool

# Create a HoverTool: hover


hover = HoverTool(tooltips=[('Country','@country')])
# Add the HoverTool to the plot
plot.add_tools(hover)

# Create layout: layout


layout = row(widgetbox(slider), plot)
# Add layout to current document
curdoc().add_root(layout)

Adding dropdowns to the app


As a final step in enhancing your application, in this exercise you'll add
dropdowns for interactively selecting different data features. In combination with
the hover tool you added in the previous exercise, as well as the slider to change
the year, you'll have a powerful app that allows you to interactively and quickly
extract some great insights from the dataset!

All necessary modules have been imported, and the previous code you wrote is taken
care of. In the provided sample code, the dropdown for selecting features on the
x-axis has been added for you. Using this as a reference, your job in this final
exercise is to add a dropdown menu for selecting features on the y-axis.

Take a moment, after you are done, to enjoy exploring the visualization by
experimenting with the hover tools, sliders, and dropdown menus that you have
learned how to implement in this course.

# Define the callback: update_plot
def update_plot(attr, old, new):
    # Read the current value off the slider and 2 dropdowns: yr, x, y
    yr = slider.value
    x = x_select.value
    y = y_select.value
    # Label axes of plot
    plot.xaxis.axis_label = x
    plot.yaxis.axis_label = y
    # Set new_data
    new_data = {
        'x' : data.loc[yr][x],
        'y' : data.loc[yr][y],
        'country' : data.loc[yr].Country,
        'pop' : (data.loc[yr].population / 20000000) + 2,
        'region' : data.loc[yr].region,
    }
    # Assign new_data to source.data
    source.data = new_data

    # Set the range of all axes
    plot.x_range.start = min(data[x])
    plot.x_range.end = max(data[x])
    plot.y_range.start = min(data[y])
    plot.y_range.end = max(data[y])

    # Add title to plot
    plot.title.text = 'Gapminder data for %d' % yr

# Create a slider widget: slider


slider = Slider(start=1970, end=2010, step=1, value=1970, title='Year')
# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)

# Create a dropdown Select widget for the x data: x_select


x_select = Select(
options=['fertility', 'life', 'child_mortality', 'gdp'],
value='fertility',
title='x-axis data'
)
# Attach the update_plot callback to the 'value' property of x_select
x_select.on_change('value', update_plot)

# Create a dropdown Select widget for the y data: y_select


y_select = Select(
options=['fertility', 'life', 'child_mortality', 'gdp'],
value='life',
title='y-axis data'
)
# Attach the update_plot callback to the 'value' property of y_select
y_select.on_change('value', update_plot)

# Create layout and add to current document


layout = row(widgetbox(slider, x_select, y_select), plot)
curdoc().add_root(layout)
