
How Shazam Works
+ Intro to machine learning for multimedia analysis

Perception & Multimedia Computing
Lecture 19

Rebecca Fiebrink
Lecturer, Department of Computing
Goldsmiths, University of London
Last time
•  Perceptually-motivated features for IR of audio and music
•  Evaluation of IR systems
Today
•  Shazam
•  Machine learning intro
  •  Supervised learning introduction
  •  Wekinator demo
Shazam

Using Shazam
1.  Open the app
2.  Record a few seconds of audio
3.  Immediately find out which song you're listening to (and have the opportunity to buy it)
Fingerprinting
Goal: find a function f(song) such that
f(song1) = f(song2)
IF AND ONLY IF
song1 = song2
(*** where "song1 = song2" means they are the same recording, but possibly with different compression, background noise, etc.)
Challenge
Requires the ability to quickly find the matching song out of a database of millions!
We can't just compare f(query) to f(song1), f(song2), … f(songN) for all N songs in the database.
Solution: use hashing
Typical hash functions in computing
Hash table: each data item is mapped onto a fixed-length (often shorter) key. This facilitates rapid search (only search among collisions).
http://en.wikipedia.org/wiki/Hash_table
Review: Perceptual hash
•  A hash is a function computed on a song: h(song)
•  Perceptually identical (or possibly "very similar") items end up in the same "bucket" (e.g., MP3 and WAV versions of the same song)
•  A bucket is a specific location in memory / on disk
•  When a new query arrives, compute h(query) and look in that "bucket" for possible matches
Creating a Shazam hash function
Requirements:
1. Ability to use a short snippet of a song as a query (rather than having to use the whole song)
•  The hash should be computable from just a small part of a song, and shouldn't require knowing where in the song the query lies
2. Robust to degradation (compression, background noise, etc.)
•  Hash values for the clean DB version of a song should exactly match hash values computed on degraded versions
3. The hash should be "high entropy": don't map too many songs to the same bucket! (That makes search take longer.)
High-level solution
•  For each song in the DB, compute many hash values using slightly different local segments of the song.
•  Create on disk a series of "buckets" (hash values), each of which contains a list of songs with hashes that fall into that bucket.
•  For each song, also store a list of hashes and the song times associated with them.
•  For a query, compute several hash values using slightly different local segments.
•  For each hash value, go to the corresponding bucket.
•  For each song in the bucket, determine whether the query matches that song. (*** some magic here)
Hash function
1. Compute the spectrogram.
Hash function
2. Identify peaks in the spectrogram.
A peak is a time-frequency point with higher energy than its neighbors in a small region. Peaks are pretty robust to encoding scheme, background noise, and EQ.
Hash function
3. Find pairs of nearby peaks.
Hash value = (peak1_freq, peak2_freq, time2 − time1)
Each hash can be packed into a single 32-bit unsigned int.
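The peak-pair hash can be sketched in Python. The bit layout below (10 bits per frequency bin, 12 bits for the time delta) is an illustrative assumption, not Shazam's actual format:

```python
def pair_hash(f1_bin, f2_bin, dt_frames):
    """Pack (peak1 freq, peak2 freq, time2 - time1) into one 32-bit uint.

    Assumed layout: 10 bits per frequency bin, 12 bits for the delta.
    """
    assert 0 <= f1_bin < 1024 and 0 <= f2_bin < 1024
    assert 0 <= dt_frames < 4096
    return (f1_bin << 22) | (f2_bin << 12) | dt_frames

h = pair_hash(523, 610, 87)  # one 32-bit hash for this peak pair
```

Because the packing is invertible, the three fields can be recovered from the hash with shifts and masks.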
Storing hashes
4. Within each hash bucket, store (song number, time point of peak1).

e.g., contents of bucket "3235293":

3235293:
(32572, 39280)
(209371, 94830)
(32572, 3927)

3235294:
(2324, 2323)
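The bucket store behaves like an inverted index from hash value to (song number, time of peak1) pairs. A minimal in-memory sketch, using the slide's example values (a real deployment keeps this on disk):

```python
from collections import defaultdict

index = defaultdict(list)  # hash value -> [(song_id, t_peak1), ...]

def add_song(index, song_id, hashes):
    """hashes: list of (hash_value, time_of_peak1) for one song."""
    for h, t in hashes:
        index[h].append((song_id, t))

# The example bucket contents from the slide
add_song(index, 32572, [(3235293, 39280), (3235293, 3927)])
add_song(index, 209371, [(3235293, 94830)])
add_song(index, 2324, [(3235294, 2323)])

index[3235293]  # all (song, time) entries sharing hash 3235293
```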
Receiving a query
Repeat the hash process for the query:
1.  Compute spectral peaks
2.  Find pairs of nearby peaks
3.  Compute the list of hashes: (freq1, freq2, time2 − time1)
4.  For each hash value:
a. Find the hash bucket.
b. For each song in the bucket, check whether the query matches the song.

We want this to be VERY FAST!
Checking for a match
Does the query match the song?
Create two lists:
1. (hash value, time) pairs for the song
2. (hash value, time) pairs for the query

Assumption: if the query is a snippet of the song, then
1. they'll share lots of hash values, and
2. the relative timing of the shared hashes will be the same!
Relative hash timings
If the query matches the song, we expect to see something like this:

Song hash (value, time)    Query hash (value, time)
235, 4                     23958, 5
23958, 10                  9384, 17
9384, 23                   39582, 23
39582, 28
273, 35

The shared hashes (23958, 9384, 39582) appear in both lists with a (nearly) constant time offset.

[Figure: scatter plot of matching hash times; the query begins 40 seconds into the song]
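The timing check can be implemented as a histogram of offsets: for every hash the query shares with a candidate song, compute (song time − query time); a true match makes many pairs agree on one offset. A sketch using the example values from the table above (the function name is mine):

```python
from collections import Counter, defaultdict

def best_offset(song_hashes, query_hashes):
    """Each argument is a list of (hash_value, time) pairs.

    Returns the most common (song_time - query_time) offset and its
    count; a real match yields one offset with a large count.
    """
    song_times = defaultdict(list)
    for h, t in song_hashes:
        song_times[h].append(t)
    offsets = Counter()
    for h, tq in query_hashes:
        for ts in song_times[h]:
            offsets[ts - tq] += 1
    if not offsets:
        return None, 0
    return offsets.most_common(1)[0]

song = [(235, 4), (23958, 10), (9384, 23), (39582, 28), (273, 35)]
query = [(23958, 5), (9384, 17), (39582, 23)]
best_offset(song, query)  # -> (5, 2): two of three shared hashes agree on offset 5
```

A non-matching song produces no dominant offset, so its count stays low and it is rejected.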
Results
Robust to noise: only 1-2% of hash values must survive in order to enable identification.
Results
Very fast: on a regular PC, searching through 20,000 tracks takes 5-500 ms.

Very robust: few false positives.
For more info
Read the academic paper with the details:
http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
Intro to machine learning
Feature values give a data item (e.g., a song) a point in feature space.

[Figure: 2-D feature space; Feature 1 = average RMS (x-axis), Feature 2 = average centroid (y-axis)]
Retrieval: find the items closest to a query.

[Figure: a query point and the nearby items in the same (average RMS, average centroid) feature space]
Classification: assigning a data point a label based on its position in feature space.

[Figure: the same feature space with a cluster of David Bowie songs and a cluster of Lady Gaga songs]
You be the classifier: Bowie or Gaga?

[Figure: an unlabeled query point plotted among the Bowie and Lady Gaga songs]
K-nearest-neighbor (kNN) classifier: take a vote of the k nearest points.

[Figure: the k = 3 nearest neighbors of the query point in the Bowie/Gaga feature space]
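The kNN vote can be sketched directly; the toy Bowie/Gaga points below are made up for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label). Vote of the k nearest
    training points to `query` under Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical points in the (average RMS, average centroid) space
train = [((0.20, 0.30), "Bowie"), ((0.25, 0.35), "Bowie"),
         ((0.80, 0.70), "Gaga"), ((0.75, 0.80), "Gaga"),
         ((0.70, 0.75), "Gaga")]
knn_classify(train, (0.72, 0.70), k=3)  # -> "Gaga"
```

Increasing k smooths the decision boundary at the cost of blurring small-scale structure.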
Supervised learning builds models. Models are functions.

[Diagram: training — training data (inputs and outputs) feeds an algorithm, which produces a model]
Supervised learning algorithms build models from data.
Each example is represented as a feature vector:

inputs: (.01, .59, .03, 32), (.05, 1.2, 3.2, 31), (-.1, .34, .20, 8.2)
outputs: "C Major", "F minor", "G7"

Training: the training data (inputs and outputs) feed an algorithm, which produces a model.
Running: a new input, e.g. (.01, .64, .02, 20), is fed to the model, which outputs "C Major".
Classification: assign 1 of N discrete labels to each point in feature space.

[Figure: two classes in a (feature1, feature2) space; this model is a separating line or hyperplane (a decision boundary)]
Regression

[Figure: output plotted against a feature; this model is a real-valued function of the input features]
Unsupervised learning
The dataset includes examples, but no labels.
Example: infer clusters from the data.

[Figure: unlabeled points forming clusters in a (feature1, feature2) space]
How supervised learning algorithms work (the basics)
The learning problem
Goal: build the best** model given the training data.
** The definition of "best" depends on context, assumptions…
Which classifier is best?

[Figure: three decision boundaries ranging from "underfit" to "overfit"; image from Andrew Ng]

Competing goals:
•  Accurately model the training data
•  **Accurately classify unseen data points**
A simple classifier: nearest neighbor

[Figure: a query point "?" in a (feature1, feature2) space, labeled by its single nearest neighbor]
Another simple classifier: decision tree

Images: http://ai.cs.umbc.edu/~oates/classes/2009/ML/homework1.html, http://nghiaho.com/?p=1300
AdaBoost: iteratively train a "weak" learner

Image from http://www.cc.gatech.edu/~kihwan23/imageCV/Final2005/FinalProject_KH.htm
Support vector machine
Re-map the input space into a higher number of dimensions and find a separating hyperplane.
Supervised learning and music
Create classifiers for:
•  Genre (rock, pop, …)
•  Artist, album
•  Pitch (applied frame by frame to audio)
•  Mood
•  Tag/semantic descriptor (e.g., last.fm tags)
•  Instrument identification
•  Beatbox vocalization identification & remapping
•  … and many others
Practical tips
•  You may have to explore many feature sets.
•  Normalize features! (e.g., scale each to [0, 1])
•  Using too many features can hurt performance: the "curse of dimensionality".
•  You may have to explore many learning algorithms (and many parameter settings for each).
•  Compare algorithms using test accuracy or cross-validation, NOT performance on the training data.
•  Usually, more examples is better.
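The normalization tip can be sketched as a per-feature min-max rescale (mapping a constant feature to 0.0 is my convention here, not a standard):

```python
def minmax_normalize(rows):
    """Rescale each feature (column) of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Two features on very different scales, rescaled to [0, 1]
minmax_normalize([[1, 320], [5, 80], [3, 200]])
# -> [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
```

In practice, the (min, max) pairs are computed on the training set and reused to scale test points.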
Choosing a classifier: practical considerations

k-nearest neighbor
+ Can tune k to adjust the smoothness of decision boundaries
− Sensitive to noisy, redundant, and irrelevant features; prone to overfitting; behaves oddly in high dimensions

Decision tree
+ Can be pruned to reduce overfitting; produces a human-understandable model
− Can still overfit

AdaBoost
+ Theoretical benefits; less prone to overfitting
+ Can be tuned by changing the base learner and the number of training rounds

Support vector machine
+ Theoretical benefits similar to AdaBoost
− Many parameters to tune; training can take a long time

Other considerations…
For more info
Wekinator: http://wekinator.cs.princeton.edu/

For non-realtime ML: Weka – http://www.cs.waikato.ac.nz/ml/weka/

A good intro book on machine learning: Data Mining by Witten, Frank, and Hall
More detailed and technical textbooks
•  Bishop, 2006: Pattern Recognition and Machine Learning. Springer Science+Business Media
•  Duda, 2001: Pattern Classification. Wiley-Interscience
Next week
Visualization

