With functools.partial, a function that adds 1 to its argument could be written as
add1 = partial(add, 1)
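For instance, using add from the operator module:

from functools import partial
from operator import add

add1 = partial(add, 1)   # fixes add's first argument at 1
add1(10)                 # 11
add1(2.5)                # 3.5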
from functools import reduce
So now reduce(). This is actually the one I've always hated most, because, apart
from a few examples involving + or *, almost every time I see a reduce() call with a
non-trivial function argument, I need to grab pen and paper to diagram what's
actually being fed into that function before I understand what the reduce() is
supposed to do. So in my mind, the applicability of reduce() is pretty much limited
to associative operators, and in all other cases it's better to write out the
accumulation loop explicitly.
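For example, a reduce over + reads fine; for anything fancier, the explicit accumulation loop is what this advice suggests:

from functools import reduce
from operator import add

reduce(add, [1, 2, 3, 4])   # ((1 + 2) + 3) + 4 == 10

# the same computation as an explicit accumulation loop
total = 0
for x in [1, 2, 3, 4]:
    total = add(total, x)   # total == 10 afterwards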
Iterators: take their values with next.

In [2]: it = iter([1, 2, 3])

In [3]: next(it)
Out[3]: 1

In [4]: next(it)
Out[4]: 2

In [5]: next(it)
Out[5]: 3

In [6]: next(it)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-6-5c05586d40e8> in <module>()
----> 1 next(it)

StopIteration:
(laziness)
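The slides use lazy_integers without showing it; a minimal definition consistent with the outputs below (an infinite generator of consecutive integers):

def lazy_integers(n=0):
    # an infinite, lazy stream: 0, 1, 2, ... computed one value at a time
    while True:
        yield n
        n += 1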
xs = lazy_integers()   # an infinite sequence!

[next(xs) for _ in range(10)]
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# the generator maintains state between calls
[next(xs) for _ in range(10)]
# [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
generator comprehensions
# computes nothing until next or for
squares = (x**2 for x in lazy_integers())
doubles = (2*x for x in lazy_integers())
next(squares) # 0
next(squares) # 1
next(squares) # 4
next(squares) # 9
# don't do this!!! a list comprehension is eager,
# so it never finishes consuming the infinite sequence:
bad_squares = [x**2 for x in lazy_integers()]
generators and pipelines
$ cat euler.hs | grep -i prime | wc -l
63
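The same shell-pipeline idea translates directly into a chain of lazy generators (a sketch; euler.hs is just whatever file you're counting 'prime' lines in):

def count_prime_lines(path):
    # each generator is one lazy stage, like a stage in a shell pipeline
    with open(path) as f:
        lowered = (line.lower() for line in f)                    # cat
        matches = (line for line in lowered if 'prime' in line)   # grep -i prime
        return sum(1 for _ in matches)                            # wc -l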
count([start=0], [step=1])   # start, start+step, start+2*step, …
tee(it, [n=2])               # split it into n independent iterators
repeat(elem, [n=forever])    # elem, n times
cycle(p)                     # p's elements, repeated forever
chain(p, q, …)               # all of p, then all of q, …
accumulate(p, [func=add])    # running totals (func defaults to add)
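A few of these in action (islice cuts the infinite ones short):

from itertools import count, cycle, chain, accumulate, islice
from operator import add

list(islice(count(10, 2), 5))          # [10, 12, 14, 16, 18]
list(islice(cycle('ab'), 5))           # ['a', 'b', 'a', 'b', 'a']
list(chain([1, 2], [3, 4]))            # [1, 2, 3, 4]
list(accumulate([1, 2, 3, 4]))         # [1, 3, 6, 10]
list(accumulate([1, 2, 3, 4], add))    # same thing, func given explicitly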
take(10, lazy_integers())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
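take (used above and throughout) isn't part of itertools itself; the standard itertools recipes define it as:

from itertools import islice

def take(n, it):
    # first n values of a (possibly infinite) iterator, as a list
    return list(islice(it, n))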
fibonacci numbers

# super inefficient: exponentially many recursive calls
def fib(n):
    if n == 0: return 1
    if n == 1: return 1
    return fib(n-1) + fib(n-2)
take(10, fibs())
# [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
fibonacci numbers

"Haskellic" version:

def fibs():
    yield 1
    yield 1
    yield from map(add, fibs(), tail(fibs()))

or, using iterate:

def fibs():
    return (y for x, y in iterate(next_fib, (0, 1)))
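These versions lean on a few helpers not shown on the slide; sketches consistent with how they're used here:

from operator import add

def tail(it):
    # everything but the first element (it must be an iterator)
    next(it)
    return it

def iterate(f, x):
    # x, f(x), f(f(x)), ... forever
    while True:
        yield x
        x = f(x)

def next_fib(pair):
    # (a, b) -> (b, a + b), so the second slot walks the Fibonacci numbers
    x, y = pair
    return (y, x + y)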
def all_primes():
return filter_primes(count(2))
take(10, all_primes())
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
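filter_primes isn't shown; one plausible lazy-sieve definition (expository only, the recursion gets deep as primes accumulate):

def filter_primes(it):
    # lazy sieve: emit the next prime p, then sieve p's multiples
    # out of everything that follows
    p = next(it)
    yield p
    yield from filter_primes(x for x in it if x % p != 0)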
What does any of this have to do with data science?
k-means clustering
● have some points
● want to group them into k clusters
● want clusters to be "small"
● iterative approach:
○ choose k means
○ assign each point to cluster of "closest" mean
○ compute new means
○ repeat
● implementation intended to be expository, not efficient
Not-functional approach
import random
from functools import partial, reduce
from operator import add

class KMeans:
    def __init__(self, k):
        self.k = k
        self.means = [None for _ in range(k)]

    def predict(self, point):
        """return index of closest mean"""
        d_min = float('inf')
        for j, m in enumerate(self.means):
            d = sum((m_i - p_i)**2
                    for m_i, p_i in zip(m, point))
            if d < d_min:
                prediction = j
                d_min = d
        return prediction

    def fit(self, points, num_iters=10):
        """find the k means"""
        assignments = [None for _ in points]
        self.means = random.sample(list(points), self.k)
        for _ in range(num_iters):
            # assign each point to its closest mean
            for i, point in enumerate(points):
                assignments[i] = self.predict(point)
            # compute new means
            for j in range(self.k):
                cluster = [p for p, c in zip(points, assignments)
                           if c == j]
                # sum the cluster's points coordinate-wise, then divide
                # (note: breaks if a cluster ends up empty)
                self.means[j] = list(
                    map(lambda x: x / len(cluster),
                        reduce(partial(map, add), cluster)))
# 100 random points in the unit square, 5 clusters
# (requires: import numpy as np)
points = np.random.random((100, 2))
model = KMeans(5)
model.fit(points)
assignments = [model.predict(point) for point in points]
k_meanses
Let's do it crazy functional!
def k_meanses(points, k):
    """returns the infinite sequence of meanses"""
    initial_means = random.sample(points, k)
    return iterate(partial(new_means, points),
                   initial_means)
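new_means and closest_index aren't shown here; sketches that do one k-means update step, matching the class version's behavior:

def closest_index(point, means):
    # index of the mean with the smallest squared distance to point
    return min(range(len(means)),
               key=lambda j: sum((m_i - p_i)**2
                                 for m_i, p_i in zip(means[j], point)))

def new_means(points, means):
    # one iteration: assign every point to its closest mean,
    # then replace each mean with its cluster's coordinate-wise average
    assignments = [closest_index(p, means) for p in points]
    clusters = [[p for p, a in zip(points, assignments) if a == j]
                for j in range(len(means))]
    return [[sum(xs) / len(xs) for xs in zip(*cluster)]
            for cluster in clusters]   # note: breaks on an empty cluster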
# 10 iterations
meanses = take(10, k_meanses(points, 5))
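until_convergence isn't shown either; a plausible definition, assuming "converged" means two consecutive values are equal:

def until_convergence(it):
    # yield values until one repeats its predecessor, then stop
    prev = next(it)
    yield prev
    for x in it:
        yield x
        if x == prev:
            return
        prev = x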
# until convergence
meanses = until_convergence(k_meanses(points, 5))
import matplotlib.pyplot as plt
from matplotlib import animation

def animation_frame(nframe):
    plt.cla()
    x, y = get_data_for(nframe)   # get_data_for, num_frames: your own code
    plt.plot(x, y)

fig = plt.figure(figsize=(5, 4))
anim = animation.FuncAnimation(fig, animation_frame,
                               frames=num_frames)
anim.save('animation.gif', writer='imagemagick', fps=4)
data = [(random.random(), random.random()) for _ in range(500)]
meanses = [means for means in until_convergence(k_meanses(data, k))]
colors = ['r', 'g', 'b', 'c', 'm']   # one color per cluster (here k = 5)

def animation_frame(nframe):
    means = meanses[nframe]
    plt.cla()
    assignments = [closest_index(point, means)
                   for point in data]
    clusters = [[point
                 for point, c in zip(data, assignments)
                 if c == j] for j in range(k)]
Gradient Descent

def f(x_i):
    return sum(x_ij**2 for x_ij in x_i)

its gradient is

def df(x_i):
    return [2 * x_ij for x_ij in x_i]

def gradient_step(df, alpha, x_i):
    # steps by alpha * gradient; use a negative alpha to minimize
    return [x_ij + alpha * df_j
            for x_ij, df_j in zip(x_i, df(x_i))]
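Combined with iterate and take from earlier, descent is just another lazy sequence (alpha = -0.1 and the start point are arbitrary choices for illustration):

from functools import partial

# each step adds alpha * gradient; negative alpha walks downhill
steps = iterate(partial(gradient_step, df, -0.1), [5.0, 3.0])
take(4, steps)
# [[5.0, 3.0], [4.0, 2.4], [3.2, 1.92], [2.56, 1.536]]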
# let's try a more complex function
import math

def f(x):
    """f(x, y) = -exp(-x^3 / 3 + x - y^2)
    has min at (1,0), saddle point at (-1,0)"""
    return -math.exp(x[0]**3/-3 + x[0] - x[1]**2)

def df(x):
    """just the gradient"""
    return ((1 - x[0]**2) * f(x), -2 * x[1] * f(x))

def random_point():
    return (3 * random.random() - 1, 3 * random.random() - 1)
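Same machinery on the trickier surface (a sketch: the step size is a guess, and starting points are drawn from (-1, 2) x (-1, 2)):

from functools import partial

path = iterate(partial(gradient_step, df, -0.5), random_point())
take(3, path)
# for most starting points this heads toward the minimum at (1, 0)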
Stochastic Gradient Descent
In the previous example, we just minimized a function of a single point.
For regression, treat x and y as fixed (i.e. curry!) and look for the optimal value of beta.

def sgd_step(df, alpha, prev_beta, xy_i):
    """df is a function of x_i, y_i, beta"""
    # deal with one point at a time by zip-ing x and y together
    x_i, y_i = xy_i
    return [beta_j + alpha * df_j
            for beta_j, df_j in zip(prev_beta, df(x_i, y_i, prev_beta))]
Feed it the sequence beta_0, (x0, y0), (x1, y1), (x2, y2), …, (x0, y0), (x1, y1), (x2, y2), …: the initial beta followed by the zipped (x, y) pairs, cycled forever.
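That sequence plus accumulate gives SGD as one lazy pipeline (a sketch; the names other than sgd_step are assumptions):

from itertools import accumulate, chain, cycle
from functools import partial

def sgd(df, alpha, beta_0, xys):
    # yields beta_0, then one sgd_step per (x_i, y_i), cycling forever
    return accumulate(chain([beta_0], cycle(xys)),
                      partial(sgd_step, df, alpha))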
SGD for linear regression

x = [(1, random.random()) for _ in range(100)]
y = [-5 * x_i[0] + 10 * x_i[1] + random.random() for x_i in x]

def animation_frame(nframe):
    # subresults: a sampling of the betas from the sgd sequence
    a, b = subresults[nframe]
    # regression line goes through (0, a) and (1, a + b)
    plt.plot([0, 1], [a, a+b])
Moral of the story
● itertools is awesome
● laziness is awesome
● infinite sequences are awesome
● matplotlib animation is awesome
Thanks!
follow me on twitter: @joelgrus
code is at
https://github.com/joelgrus/stupid-itertools-tricks-pydata