
Haverford College

Automatic Chord Recognition Using Hidden Markov Models
Alex Wang

supervised by
Dr. Jane Chandlee

December 27, 2016

Contents

1 Abstract
2 Introduction
  2.1 Music Information Retrieval
    2.1.1 Practical Applications
3 Music Theory
  3.1 Melody
  3.2 Harmony
    3.2.1 Intervals
    3.2.2 Triad
    3.2.3 Chord Extensions
    3.2.4 Inversions
  3.3 Rhythm
4 Chord Recognition System Details
  4.1 Markov Chains
  4.2 First-Order Markov Assumption
  4.3 Hidden Markov Models
  4.4 Viterbi Algorithm
  4.5 Baum-Welch Algorithm
5 Literature Review
  5.1 Fujishima's Seminal Research
  5.2 Variations on a PCP
  5.3 Statistical Learning Models
    5.3.1 Training Data
  5.4 Language Models
  5.5 Alternate Approaches
  5.6 Future Work
  5.7 Conclusion
1 Abstract

In this paper, I implement a hidden Markov model (HMM) to accomplish the task of
recognizing chords from audio input. I plan to build a system that can output
chord inversions, can recognize key changes, and can handle mistuned notes. This
system will be trained on classical music chord progressions, as this kind of music
is more varied harmonically and is practical for musicology research.

2 Introduction

The main goal of automatic chord recognition is to identify chord labels from
audio data. This can be accomplished through feature extraction, search algorithms,
machine learning, probabilistic models, and other techniques. Chord recognition
falls under a broader field of study called Music Information Retrieval (MIR).

2.1 Music Information Retrieval

Music Information Retrieval (MIR) is a multidisciplinary field that combines many
other areas of study such as machine learning, signal processing, algorithm design,
and music theory. MIR systems are designed to "make sense" of music in the same
way a listener hears and analyzes music. While it is not completely understood
exactly how the human brain processes music, an MIR system understands music
by extracting high-level features such as melody, harmony, and rhythm from a
low-level audio representation through a variety of techniques.
There are many different types of MIR systems that accept different kinds of
input. Though music can be represented in many different forms, there are three
general categories that all music falls under:

Figure 1: Basic representations of music [1]



The types of representations an MIR system accepts depend on the task it is
trying to accomplish. An MIR system designed to identify covers of a song would
not have to analyze a music score. Generally, automatic chord recognition systems
only make use of audio signals and possibly time-stamped events.
2.1.1 Practical Applications

Because of the versatility of MIR systems, Music Information Retrieval technology
can be found in many practical applications [2]:

- Identifying a piece of music by pointing a smartphone at an audio source.
  Applications such as Shazam, SoundHound, and Amazon Firefly perform this
  task by applying MIR technology.
- Searching surveillance footage for suspicious sounds.
- Letting a composer search for a theme or a motif (a short musical idea) to see
  if there is any possible plagiarism of their work.
- Recommending music and generating playlists of songs similar to what a user
  enjoys.
- Given a query, searching a database of musical scores to find similar music
  based on melody, chord progressions, etc. This was normally done manually by
  musicologists, but a search engine could free up their time and energy.

For this thesis, we will be focusing on the task of extracting chord information
from digital audio. The input to a chord recognition system will be audio data
and the output will be the chord progression of that data.

3 Music Theory

Before we discuss the details of a chord recognition system, we will cover the
music theory background necessary to understand such a system. This thesis
focuses on "classical" music that follows "common practice tonality". The
so-called common practice era lasted from about 1600 to 1900. We will not be
discussing avant-garde, serialist, atonal music, etc. because traditional rules
of harmony do not apply to these genres. Regardless of when a piece was composed,
all music has three basic elements: melody, harmony, and rhythm.

3.1 Melody

A melody is often the most memorable part of a piece and is the linear progression
from one note to the next. The melody of Beethoven's "Ode to Joy" from his Ninth
Symphony is shown in Figure 2 below:

Figure 2: A lovely and simple melody with notes following each other in a linear
fashion

3.2 Harmony

The main focus of this thesis will be on harmony, which is the simultaneous sounding
of notes (chords) and how these chords are joined together to form a progression.
3.2.1 Intervals

The type of chord formed depends on the interval relationships between its
notes. An interval is defined as the distance between two notes and is expressed as
a quality and a number. The number describes the distance between the two notes
and the quality describes the "type" of distance. There are many different qualities,
but the most common are: minor, major, perfect, augmented, and diminished. These
intervals are summarized in Table 1.
Interval Name            Distance (in half-steps)
Unison (P1)              0
Minor second (m2)        1
Major second (M2)        2
Minor third (m3)         3
Major third (M3)         4
Perfect fourth (P4)      5
Augmented fourth (A4)    6
Diminished fifth (d5)    6
Perfect fifth (P5)       7
Augmented fifth (A5)     8
Minor sixth (m6)         8
Major sixth (M6)         9
Minor seventh (m7)       10
Major seventh (M7)       11
Octave (P8)              12

Table 1: Some common intervals
Intervals that span the same distance but have different names (such as the
augmented fourth and diminished fifth) are said to be enharmonic. They sound the
same but are spelled differently depending on the musical context.
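With the twelve pitch classes fixed, the half-step distance of an interval is simple modular arithmetic. A small illustrative sketch (not from the thesis; it uses sharps-only spelling, so enharmonic intervals collapse to the same number):

```python
# The twelve pitch classes, spelled with sharps only.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def half_steps(low, high):
    """Distance in half-steps between two pitch classes, measured upward."""
    return (NOTES.index(high) - NOTES.index(low)) % 12

print(half_steps("C", "E"))   # 4: a major third
print(half_steps("C", "G"))   # 7: a perfect fifth
print(half_steps("F#", "C"))  # 6: a tritone (enharmonically A4 or d5)
```

Because the spelling is collapsed, this gives the distance of an interval but not its quality; C to D# and C to E♭ both come out as 3 half-steps.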
3.2.2 Triad

A chord is a set of at least three different notes that are perceived as sounding
simultaneously. A chord progression is a sequence of different chords that can form
a musical phrase. The most basic type of chord is called a triad, which has exactly
three components called chord factors. For a triad, the chord factors are called
the root, the third, and the fifth. The root note determines the name of the chord.
For example, if we have a chord with a root of C, then we know that the chord is
a C chord of some sort. The third is called such because it is an interval of a third
above the root, and likewise with the fifth.
To illustrate, suppose we have a chord with notes C-E-G. The C is the root of
the triad, the E is the third of the triad because its distance from C is a major third,
and the G is the fifth of the triad because its distance from C is a perfect fifth.
The C-major chord is shown in Figure 3 below:

Figure 3: A C-major triad in root position


3.2.3 Chord Extensions

Any triad can be extended with another note, which is usually the seventh of the
chord. It is the addition of this fourth note that opens a new world of harmonic
possibilities. While there exist even more chord extensions (e.g. ninths, elevenths,
thirteenths, etc.), we will not go into detail as these chords are merely variations on
chord types rather than new types of chords. The chord types we will be aiming to
identify are summarized in Table 2.
Chord Quality                    Third   Fifth       Seventh     Example
Major triad (M)                  major   perfect     N/A         C-E-G
Minor triad (m)                  minor   perfect     N/A         C-E♭-G
Augmented triad                  major   augmented   N/A         C-E-G♯
Diminished triad (dim)           minor   diminished  N/A         D-F-A♭
Diminished seventh (dim7)        minor   diminished  diminished  B-D-F-A♭
Half-diminished seventh (hdim7)  minor   diminished  minor       B-D-F-A
Minor seventh (m7)               minor   perfect     minor       C-E♭-G-B♭
Dominant seventh (dom7)          major   perfect     minor       G-B-D-F

Table 2: Common chord types [3]


3.2.4 Inversions

An inverted chord is formed when a note other than the root is the lowest-sounding
one. For a C-major chord the root position is C-E-G, where the note C is in the
bass. If we move the C from the bass to the top, we get E-G-C; this is the first
inversion. If we then move the bass note E of this inverted chord to the top, the
result is G-C-E, which is the second inversion. Inverting this second inversion
chord turns it back into root position. The process of inverting chords works
likewise with seventh chords, except that there are three possible inversions. The
inversions of the C-major chord are shown in Figure 4.

Figure 4: Root position, first inversion, and second inversion of the C-major chord
Inversions of chords affect the harmonic function and stability of the chord. A
root position chord has a strong sense of stability and feels "rooted", while a
second inversion chord is inherently unstable. Understanding how individual chords
get their own characteristics requires a detailed discussion of chord functions that
will not be covered here. But it is important to note that most automatic chord
recognition systems do not detect inversions of chords; they only give enough
information about the chord for a user with a good working knowledge of music
theory to deduce whether it is in root position or inverted. This thesis will
attempt to incorporate inverted chord detection, which would save the user the
work of having to adjust the results afterward.
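The inversion procedure described above is just a rotation of the chord's notes, with the old bass note moved up an octave. A small illustrative sketch (the note representation here is my own, not part of the thesis):

```python
def invert(chord):
    """Move the lowest note of a chord to the top, one octave higher.

    Notes are (pitch_class, octave) pairs, ordered from lowest to highest.
    """
    lowest = chord[0]
    return chord[1:] + [(lowest[0], lowest[1] + 1)]

# Root position C-major triad: C4-E4-G4.
root = [("C", 4), ("E", 4), ("G", 4)]

first = invert(root)    # E4-G4-C5: first inversion
second = invert(first)  # G4-C5-E5: second inversion
third = invert(second)  # C5-E5-G5: root position again, an octave higher

print([n for n, _ in first])   # ['E', 'G', 'C']
print([n for n, _ in second])  # ['G', 'C', 'E']
```

For a seventh chord (four notes) the same function yields three distinct inversions before the rotation returns to root position.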

3.3 Rhythm

Rhythm is the organization of musical sounds in relation to time and is characterized
by a cyclic pattern of strong and weak beats. Because chords tend to change on the
strong beat, this information can be used to aid chord recognition. We will not be
implementing any sort of beat detection mechanism, as this is outside the scope of
this thesis.

4 Chord Recognition System Details

4.1 Markov Chains

The Markov chain is one of the most commonly used statistical models, not just in
automatic chord recognition but in a variety of other applications including speech
recognition, gesture recognition, part-of-speech tagging, and bioinformatics. Markov
chains can be represented by a weighted finite state automaton: a set of weighted
transitions between a set of states. In the context of chord progressions, each state
is a chord and each weighted edge is the transition probability of moving from one
chord to another, as shown in Figure 5.

Figure 5: Example of a Markov chain with three chords (C major, G major, and
D major) and made-up transition probabilities
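A Markov chain like the one in Figure 5 can be represented directly as a transition table and sampled to generate chord progressions. A minimal sketch; the probabilities below are made up for illustration (each row sums to 1):

```python
import random

# Transition probabilities between chords (invented values; rows sum to 1).
TRANSITIONS = {
    "G": {"G": 0.2, "C": 0.5, "D": 0.3},
    "C": {"G": 0.4, "C": 0.2, "D": 0.4},
    "D": {"G": 0.6, "C": 0.3, "D": 0.1},
}

def next_chord(current, rng=random):
    """Sample the next chord given the current one."""
    chords, weights = zip(*TRANSITIONS[current].items())
    return rng.choices(chords, weights=weights, k=1)[0]

def generate(start, length, seed=0):
    """Generate a chord sequence by walking the chain."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        seq.append(next_chord(seq[-1], rng))
    return seq

print(generate("C", 8))
```

Sampling from the chain is the generative view; the recognition task runs in the other direction, inferring which chain of states best explains what was heard.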

4.2 First-Order Markov Assumption

Let c_n be the nth chord in a chord sequence. The probability we are trying to find
is that of the next chord given the full sequence so far, expressed formally as:

P(c_n | c_{n-1}, c_{n-2}, c_{n-3}, ..., c_1)

In a long chord sequence, however, it is difficult to consider all the previous
observations. And with the addition of chord inversions there will be even more
possible states to consider. To deal with this issue, we can use the first-order
Markov assumption, which assumes that the probability of the current state depends
only on the previous state:

P(c_n | c_{n-1}, c_{n-2}, c_{n-3}, ..., c_1) = P(c_n | c_{n-1})
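Under the first-order assumption, the transition probabilities P(c_n | c_{n-1}) can be estimated from labeled chord sequences simply by counting transitions. A sketch with an invented toy corpus:

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Maximum-likelihood estimate of P(next chord | current chord)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(nexts.values()) for cur, n in nexts.items()}
        for prev, nexts in counts.items()
    }

# Invented toy corpus of labeled chord progressions.
corpus = [["C", "F", "G", "C"], ["C", "G", "C", "F", "G", "C"]]
probs = estimate_transitions(corpus)
print(probs["G"])  # G always moved to C in this toy corpus: {'C': 1.0}
```

With real data the table would be smoothed so that unseen transitions do not get probability zero.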

4.3 Hidden Markov Models

Markov chains are useful for sequences where the states are observed directly but
are limited in situations where the events cannot be observed directly. The Hidden
Markov Model is an extension of the Markov chain that models the underlying
stochastic process that causes an observable one. In chord recognition the hidden
stochastic process is the chord sequence that we are interested in finding, and the
observed event is a feature vector: a representation of audio information as an
n-dimensional vector. Given a feature vector, the question that needs to be
answered is: what is the most likely chord that generated this feature vector? This
value is called the observation likelihood. Once we determine the likelihood that
the observation was generated by a certain chord c_i, we need to find out how we
transitioned into that state. The probability values of moving from one state to
another are stored in a transition probability matrix. The components of an HMM
are defined below:
1. A set of N states (chords):
   C = c_1, c_2, ..., c_N
2. A transition probability matrix A, where each a_ij represents the probability
   of transitioning from state i to state j:
   A = a_11, a_12, ..., a_N1, ..., a_NN
3. A sequence of T observations, which in this context are the feature vectors
   extracted from waveform data:
   O = o_1, o_2, ..., o_T
4. A sequence of observation likelihoods, where b_i(o_t) is the probability of
   observation o_t being generated from state i:
   B = b_i(o_t)
5. A special start state and end state which are not associated with any of
   the observations:
   c_0, c_F
6. An initial probability distribution, which describes the probability of
   starting in each state. π_i is the probability of starting in state i. Some
   states will have a probability of 0, which means they cannot be starting states:
   π = π_1, π_2, ..., π_N

[4]
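These components can be collected into a small data structure that also checks the stochastic constraints: each row of A and the distribution π must sum to 1. The toy numbers below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list  # hidden states (chord labels)
    A: dict       # A[i][j] = P(state j | state i), the transition matrix
    B: dict       # B[i][o] = P(observation o | state i), observation likelihoods
    pi: dict      # pi[i] = P(starting in state i)

    def __post_init__(self):
        # Stochastic constraints: pi and every row of A must sum to 1.
        assert abs(sum(self.pi.values()) - 1.0) < 1e-9
        for i in self.states:
            assert abs(sum(self.A[i].values()) - 1.0) < 1e-9

# Toy two-chord model over a discrete observation alphabet {"x", "y"}.
model = HMM(
    states=["G", "D"],
    A={"G": {"G": 0.7, "D": 0.3}, "D": {"G": 0.4, "D": 0.6}},
    B={"G": {"x": 0.9, "y": 0.1}, "D": {"x": 0.2, "y": 0.8}},
    pi={"G": 0.5, "D": 0.5},
)
print(model.states)  # ['G', 'D']
```

In a real chord recognizer B would be a continuous density over feature vectors (e.g. a Gaussian per chord) rather than a lookup table over two symbols.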

We will make two simplifying assumptions about Hidden Markov Models. The
first is the first-order Markov assumption, which we also made for Markov
chains. The second assumption is output independence, which means that an
observation o_i depends only on the state c_i that produced the observation and
not on any other states or observations:

P(o_i | c_1, ..., c_i, ..., c_T, o_1, ..., o_i, ..., o_T) = P(o_i | c_i)
Now that we have seen how an HMM is constructed, we will discuss applications
of this kind of model. Lawrence Rabiner postulated three main problems that can
be solved with an HMM [5]:
1. Likelihood: Given an HMM λ = (A, B), what is the likelihood of a particular
observation sequence O?
2. Decoding: Given an HMM λ = (A, B) and an observation sequence O, what is
the best possible hidden state sequence C that generated those observations?
3. Learning and Training: Given an observation sequence O and set of states
C, learn the HMM parameters A and B.
For this thesis, we will be focusing on decoding: given a sequence of feature
vectors (observation sequence), what is the most likely chord sequence (hidden state
sequence) that generated these vectors? We will also discuss how to train the
transition probabilities A and observation likelihoods B to model real-life situations.

4.4 Viterbi Algorithm

The Viterbi algorithm can be used to decode the best possible hidden state sequence
for a particular observation sequence. It is a dynamic programming algorithm that
makes use of two matrices: a probability matrix V, whose cells v_t(j) store the
probability of the most likely path that ends in state c_j at time t, and a
backtrace matrix that keeps track of the hidden state sequence as the decoder
calculates the probabilities. The components of the Viterbi algorithm are defined
formally below.

1. Initialization:

   v_1(j) = a_0j b_j(o_1),  1 ≤ j ≤ N
   bt_1(j) = 0

   v_t(j) is a cell in the probability matrix that represents the probability
   that we are in state j at time step t. This value is calculated by recursively
   taking the most probable path that could lead us to this cell. The probability
   matrix contains probability values, but the actual path is computed in parallel
   and stored in a separate backtrace matrix. bt_t(j) represents the backtrace at
   state j at time step t.

2. Recursion:

   v_t(j) = max_{i=1..N} [v_{t-1}(i) a_ij b_j(o_t)],  1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1..N} [v_{t-1}(i) a_ij b_j(o_t)],  1 ≤ j ≤ N, 1 < t ≤ T

3. Termination:

   Best score: P* = v_T(c_F) = max_{i=1..N} [v_T(i) a_iF]
   Start of backtrace: c_T* = bt_T(c_F) = argmax_{i=1..N} [v_T(i) a_iF]

   The algorithm terminates when it has reached the final time step T.

[4]
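The definition above translates almost line for line into code. The following sketch decodes a toy two-state model with a discrete observation alphabet instead of real feature vectors; all probabilities are invented, and for brevity it uses an initial distribution pi in place of the dedicated start and end states:

```python
def viterbi(obs, states, A, B, pi):
    """Most likely hidden state sequence for an observation sequence.

    A[i][j]: transition probability, B[j][o]: observation likelihood,
    pi[j]: initial probability of state j.
    """
    # Initialization: v[0][j] = pi[j] * b_j(o_1).
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    bt = [{j: None for j in states}]
    # Recursion: v[t][j] = max_i v[t-1][i] * a_ij * b_j(o_t).
    for t in range(1, len(obs)):
        v.append({})
        bt.append({})
        for j in states:
            best = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best] * A[best][j] * B[j][obs[t]]
            bt[t][j] = best
    # Termination: pick the best final state, then follow the backtrace.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(bt[t][path[-1]])
    return list(reversed(path))

A = {"G": {"G": 0.7, "D": 0.3}, "D": {"G": 0.4, "D": 0.6}}
B = {"G": {"x": 0.9, "y": 0.1}, "D": {"x": 0.2, "y": 0.8}}
pi = {"G": 0.5, "D": 0.5}
print(viterbi(["x", "x", "y"], ["G", "D"], A, B, pi))  # ['G', 'G', 'D']
```

A production decoder would work in log space to avoid the probabilities underflowing to zero on long sequences.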

The Viterbi algorithm is crucial to the process of chord recognition. Given an
observation sequence of feature vectors and a trained HMM, the Viterbi decoder will
determine the most likely chord sequence, which is precisely what we are looking for.
Figure 6 illustrates a trellis with only two possible states (G and D) and an
observation sequence O with only three observations. The probabilities listed are
the product of the transition probability of moving from one state to the next and
the observation likelihood of the observation once we are in that state. For
example, the probability p_4 = P(G|D) P(o_2|G) is the probability of transitioning
from state D to G times the probability of observing o_2 once we are in state G.
At each state c in time step t, the probability of the optimal path is computed
and is expressed as v_t(c). The result is the most probable path that generated
the observation sequence. In the context of chord recognition, the result is the
most likely chord sequence that generated the given audio signal.

p_1 = P(G|start) P(o_1|G)
p_2 = P(D|start) P(o_1|D)
p_3 = P(G|G) P(o_2|G)
p_4 = P(G|D) P(o_2|G)
p_5 = P(D|G) P(o_2|D)
p_6 = P(D|D) P(o_2|D)
p_7 = P(G|G) P(o_3|G)
p_8 = P(D|G) P(o_3|D)
p_9 = P(G|D) P(o_3|G)
p_10 = P(D|D) P(o_3|D)
p_11 = P(end|G)
p_12 = P(end|D)

4.5 Baum-Welch Algorithm

Before decoding the observation sequence, we must have an HMM that has values for
its transition probabilities and emission probabilities. The Baum-Welch Algorithm,
which makes use of the Forward-Backward Algorithm, can be used to train the
HMM and to generate these values.
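For a discrete observation alphabet, one Baum-Welch re-estimation step can be written directly from the forward and backward probabilities. This is an illustrative toy sketch with invented numbers, not necessarily the training procedure used later in this thesis:

```python
def forward(obs, states, A, B, pi):
    """alpha[t][j]: probability of seeing o_1..o_t and ending in state j."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha

def backward(obs, states, A, B):
    """beta[t][i]: probability of seeing o_{t+1}..o_T given state i at time t."""
    beta = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * beta[0][j] for j in states)
                        for i in states})
    return beta

def baum_welch_step(obs, states, symbols, A, B, pi):
    """One EM re-estimation of (A, B, pi) from a single observation sequence."""
    T = len(obs)
    alpha, beta = forward(obs, states, A, B, pi), backward(obs, states, A, B)
    like = sum(alpha[-1][i] for i in states)  # P(O | current model)
    # gamma[t][i] = P(state i at t | O); xi[t][i][j] = P(i at t, j at t+1 | O).
    gamma = [{i: alpha[t][i] * beta[t][i] / like for i in states}
             for t in range(T)]
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
               for j in states} for i in states} for t in range(T - 1)]
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {i: {s: sum(g[i] for g, o in zip(gamma, obs) if o == s) /
                 sum(g[i] for g in gamma) for s in symbols} for i in states}
    return new_A, new_B, new_pi

# Toy model with invented numbers and a discrete observation alphabet.
states, symbols = ["G", "D"], ["x", "y"]
A = {"G": {"G": 0.7, "D": 0.3}, "D": {"G": 0.4, "D": 0.6}}
B = {"G": {"x": 0.8, "y": 0.2}, "D": {"x": 0.3, "y": 0.7}}
pi = {"G": 0.6, "D": 0.4}
obs = ["x", "x", "y", "y", "x"]

A2, B2, pi2 = baum_welch_step(obs, states, symbols, A, B, pi)
print(sum(forward(obs, states, A2, B2, pi2)[-1][i] for i in states))
```

Iterating this step is the Baum-Welch algorithm; each iteration is guaranteed not to decrease the likelihood of the training data.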

5 Literature Review

We will now discuss what researchers have done to approach the problem of chord
recognition. Modern chord recognition research began with Takuya Fujishima's
pitch class profile, a twelve-dimensional feature vector. The traditional approach
to chord recognition is through polyphonic transcription: extracting all the
individual notes from an input signal [6]. Then, drawing on knowledge of music
theory, we would infer what the chord is based on the notes extracted. This
method, however, is error-prone because a single note has multiple harmonics [7].

Figure 7: The traditional approach to chord recognition [7]


5.1 Fujishima's Seminal Research

Fujishima's major contribution to chord recognition research is the pitch class
profile (PCP) vector, also called a chroma vector. He believed that a recognition
system should analyze chords as a single entity and not just a collection of
individual notes. Inspired by the concept of pitch class from musical set theory,
a system used to categorize musical objects, he created a chord recognition
algorithm based on the twelve tones in Western music. Each tone has two components:
chroma and height. For example, A4 has an octave number of 4 and its chroma is the
note A. Both the chroma and the octave number determine the frequency of the pitch,
which in this case is 440 Hz. A pitch class is the set of all notes that are an
octave apart, and there are 12 distinct pitch classes in Western music. The pitch
class "A" can be defined as:

{A_n : n ∈ Z} = {..., A1, A2, A3, A4, ...}

[8]

A PCP vector is created by measuring the relative intensities of each of the twelve
pitch classes. To generate a PCP vector, first a Discrete Fourier Transform (DFT) is
performed on sample windows of audio data. This procedure creates a spectral map
of the energy levels at different frequencies. The next step is to derive the
PCP vector from this DFT spectrum by summing the spectral energies across the
different pitch classes. Once the vectors are created for each window, it is just a
matter of pattern matching.
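The PCP computation can be sketched end to end: synthesize a C-major triad, evaluate the DFT magnitude at the frequencies of each pitch class over a few octaves, and sum the energies into 12 bins. This toy version (the sample rate, octave range, and normalization are my own choices) skips the windowing and bin-mapping details of a real implementation:

```python
import math

SR = 8000  # sample rate in Hz (chosen for this toy example)
N = 2048   # analysis window length in samples
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_freq(pc, octave):
    """Frequency of a pitch in equal temperament with A4 = 440 Hz."""
    semitones_from_a4 = NOTES.index(pc) - NOTES.index("A") + 12 * (octave - 4)
    return 440.0 * 2 ** (semitones_from_a4 / 12)

def dft_magnitude(signal, freq):
    """Magnitude of the DFT evaluated at a single target frequency."""
    re = sum(x * math.cos(2 * math.pi * freq * n / SR)
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * n / SR)
             for n, x in enumerate(signal))
    return math.hypot(re, im)

def pcp(signal):
    """12-bin pitch class profile: sum spectral energy over octaves 3-5."""
    v = [sum(dft_magnitude(signal, pitch_freq(pc, octave))
             for octave in (3, 4, 5)) for pc in NOTES]
    total = sum(v) or 1.0
    return [x / total for x in v]

# Synthesize a C-major triad (C4, E4, G4) and compute its PCP.
freqs = [pitch_freq("C", 4), pitch_freq("E", 4), pitch_freq("G", 4)]
signal = [sum(math.sin(2 * math.pi * f * n / SR) for f in freqs)
          for n in range(N)]
profile = pcp(signal)
top3 = sorted(range(12), key=lambda i: profile[i], reverse=True)[:3]
print(sorted(NOTES[i] for i in top3))  # ['C', 'E', 'G']
```

The three chord tones dominate the profile, which is exactly what the template matching stage relies on.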

Figure 8: A PCP vector of a C major triad showing the relative intensities of the
chroma [8]
Fujishima approaches pattern matching by using Chord Type Templates (CTT),
which represent the vocabulary of chords. For his research, he decided to use 27
chord types that are split into two groups: chords that stay within an octave and
chords that span more than an octave (e.g. chords with extensions such as ninths).
He explains how the CTT works with a C major seventh chord:
CTT_c(p) is set to 1 if the chord type c has the pitch class p in it, and 0
otherwise. For instance, the CTT for "M7" is (1,0,0,0,1,0,0,1,0,0,0,1). [7]
We see that the CTT is a 12-dimensional vector just like the PCP vector, but
unlike the PCP vector, the CTT contains boolean values. Any note that is part of a
chord is represented with a 1. So the CTT (1,0,0,0,1,0,0,1,0,0,0,1), where the
template labeling scheme is (C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B), represents
C, E, G, B, which are the notes that belong to a C major seventh chord. Fujishima
then uses two matching methods, the nearest neighbor and the weighted sum, and
compares the results of the two. [7]
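Template matching against binary CTTs can then be a simple weighted-sum search. The sketch below uses a handful of hand-written templates rather than Fujishima's full 27-type vocabulary, and (as one simple design choice of my own) normalizes each score by the number of notes in the template so that larger templates are not automatically favored:

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def template(*pitch_classes):
    """Binary 12-dimensional chord type template."""
    return [1 if n in pitch_classes else 0 for n in NOTES]

TEMPLATES = {
    "C major":  template("C", "E", "G"),
    "C minor":  template("C", "D#", "G"),
    "C major7": template("C", "E", "G", "B"),
    "G major":  template("G", "B", "D"),
}

def match(pcp):
    """Pick the template with the best PCP/template weighted sum.

    Normalizing by the template's note count keeps a triad's score
    comparable to a seventh chord's.
    """
    def score(t):
        return sum(p * x for p, x in zip(pcp, t)) / sum(t)
    return max(TEMPLATES, key=lambda name: score(TEMPLATES[name]))

# A made-up PCP with energy concentrated on C, E, and G.
pcp = [0.30, 0.01, 0.02, 0.01, 0.28, 0.02, 0.01, 0.27,
       0.02, 0.02, 0.02, 0.02]
print(match(pcp))  # C major
```

Note that the "C major7" template matches the 12-dimensional CTT for "M7" quoted above; without the normalization it would always score at least as high as the plain triad, since it contains all the triad's notes.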

Figure 9: Overview of Fujishima's chord recognition system [7]

5.2 Variations on a PCP

Though the introduction of the pitch class profile was a seminal contribution to the
literature, one caveat is that the vector discards bass note information: because all
the spectral energies of a pitch class are added up, information about the
lowest-frequency note is lost. This makes identifying chord inversions impossible,
as these depend on knowing what the lowest note is.
Another issue with PCP vectors is that they often contain information about
overtones that is irrelevant to identifying the chord. Recall that in chord template
matching, the chord template is a 12-dimensional boolean vector. Even though the
chord templates are binary, the chroma vectors are not. This is because every note
has a fundamental frequency and multiple harmonics, and all of this information is
captured by the PCP vector. Notice in Figure 10 that though C, E, and G have
the most prominent peaks, there are smaller peaks at tones that do not belong to
the C major chord.

Figure 10: Chroma vector of a C major chord played on a piano [9]



This problem is compounded with chords that share notes, such as an A-minor
seventh chord and a C-major chord. To deal with this issue, Lee created his
own vector he called the Enhanced Pitch Class Profile (EPCP) [9]. The EPCP vector
is a PCP vector in which the overtone intensities are suppressed and the fundamental
frequencies are emphasized. The process of creating the EPCP is as follows:
1. Derive the harmonic product spectrum (HPS) from the DFT of the input
signal. The HPS algorithm is used to detect the fundamental frequency of an
input signal by finding the greatest common divisor of its overtones.
2. Compute the EPCP vector the same way as the standard PCP vector, except
from the HPS instead of the DFT.
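The harmonic product spectrum in step 1 can be sketched as follows: the magnitude spectrum is multiplied by downsampled copies of itself, so a bin stays large only if all of its harmonics are also present. The toy spectrum below is invented for illustration:

```python
def harmonic_product_spectrum(mag, n_harmonics=3):
    """Multiply the spectrum with downsampled copies of itself.

    mag[k] is the magnitude at frequency bin k; hps[k] is large only if
    mag[k], mag[2k], mag[3k], ... are all large, i.e. k is a fundamental.
    """
    limit = len(mag) // n_harmonics
    hps = []
    for k in range(limit):
        product = 1.0
        for h in range(1, n_harmonics + 1):
            product *= mag[k * h]
        hps.append(product)
    return hps

# Toy spectrum: a fundamental at bin 10 with harmonics at bins 20 and 30,
# plus a lone spurious peak at bin 17 with no harmonic support.
mag = [0.1] * 60
for k in (10, 20, 30):
    mag[k] = 1.0
mag[17] = 1.0

hps = harmonic_product_spectrum(mag)
print(hps.index(max(hps)))  # 10: the true fundamental wins
```

The spurious peak at bin 17 is crushed because its would-be harmonics (bins 34 and 51) carry only noise-floor energy, which is exactly the overtone suppression the EPCP relies on.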

Figure 11: Comparison between the DFT and the HPS of an A-minor chord [9]

We can see that the intensities of non-chord tones are suppressed in the HPS.
This gives a clearer picture of the chord that we are trying to identify. Based on
a frame-level recognition of Bach's Prelude in C major performed by Glenn Gould,
Lee concluded that the EPCP vector outperforms the conventional PCP vector in
identifying chords, particularly chords that share notes with others [9].
Another way a chord recognition system could be improved is having it deal with
mistuning. If there are any mistuned notes, or if the recording uses a different
tuning system such as that of microtonal music, spectral energy can "leak" and
recognition accuracy is reduced because the energy does not go to the note it is
supposed to go to. To deal with this issue, we can use a higher resolution chroma
vector. Harte and Sandler computed a 36-bins-per-octave transform across five
octaves so that each note gets 3 bins [10]. The extra bins give room for mistunings
(too flat or too sharp) so that they can be corrected. This requires a 743 ms window
length, which for a musical signal is a long analysis time. To compensate, they
computed overlapping analysis frames that are 1/8 the size of the window, so the
frame length is effectively 93 ms. The last step is to convert this 36-bin vector
into a 12-bin chromagram. Many researchers follow a similar method [9][11][12].


5.3 Statistical Learning Models

Many innovations have been made since Fujishima's original research in 1999, most
notably the use of Hidden Markov Models and even networks of Hidden Markov
Models. Most of the research involving HMMs follows the same basic process with
slight variations. An overview of this process is shown in Figure 12.

Figure 12: Overview of a chord recognition system implemented with HMMs [12]

To train an HMM, many researchers [9][11][12] have used the expectation-maximization
(EM) algorithm. The application of the EM algorithm to HMMs is known as the
Baum-Welch algorithm, as mentioned in Section 4.5. The goal of the algorithm
is to maximize the complete-data likelihood so that the parameters of the HMM will
reflect real-life situations. Once the parameters of the model are set, the Viterbi
algorithm is applied to decode the most likely chord sequence from an input signal.
algorithm is applied to decode the most likely chord sequence from an input signal.
5.3.1 Training Data

Implementing HMMs generally yielded good results, but one common challenge all
researchers had to deal with is the lack of training data. Audio data marked up
with chord boundaries is scarce, and it is extremely time-consuming and error-prone
to mark up music by hand. So researchers have come up with different approaches to
this problem. Sheh & Ellis, while training their HMM, assume that only the
chord sequence of the work is known and treat the chord labels as hidden values.
Bello & Pickens took an unsupervised learning approach to train their HMM. One
of the main arguments of their research was that music theory knowledge should play
a role in music information retrieval tasks. To initialize the values of the HMM,
they set the initial probability of beginning in any state based on their
understanding of music theory. In their case, they had an HMM with 24 states
(12 major chords and 12 minor chords), and assumed the initial probability is 1/24
for each state. To create the state transition probability matrix, they use the
notion of the circle of fifths, as shown in Figure 13, to determine which chords
are more likely to transition to others. For instance, a C major chord is much more
likely to transition to a G major chord than to an E♭ minor chord. Once they
established all the initial values of their HMM, they trained their model with the
standard expectation-maximization algorithm.


Figure 13: Keys that are closer to each other on the circle of fifths have higher
transitional probabilities [13]
Both Sheh & Ellis and Bello & Pickens seem to only circumvent the issue of the
lack of training data. As a result, the accuracy ratings were not great: 22% and
75% recognition, respectively. Lee & Slaney took a supervised learning approach
and came up with a method to automatically generate labeled training data, as
shown in Figure 14. Taking advantage of music encoded in MIDI format, they took
MIDI files and concurrently converted them into a format that could be put through a
chord analyzer and generated audio data through a sample-based synthesizer. There
is potentially as much training data available as there are pieces encoded in MIDI
format.


Figure 14: Overview of the process of generating labeled training data [14]

5.4 Language Models

The application of HMMs to chord recognition was a fresh approach to an old
problem. Another approach researchers took is integrating a language model into
the HMM framework. An N-gram model is usually associated with automatic speech
recognition, but it also has applications in chord recognition. The question that
can be answered by an N-gram model is: given N − 1 predecessors, what is the
most probable next element? Stated formally in the context of chord recognition:

p(c_1, c_2, ..., c_k) = ∏_t p(c_t | c_1, c_2, ..., c_{t-1}) = ∏_t p(c_t | h_t)

where k is the number of chords and h_t is the history of all the previous chords.
In other words, given the history of all previous chords, what is the probability
that the next chord will be c_t? In practice the history h_t contains information
about only the immediately preceding N − 1 chords. For example, a trigram model
(3-gram model) is represented as p(c_t | c_{t-1}, c_{t-2}) and a bigram model
is represented as p(c_t | c_{t-1}).
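An N-gram chord model can be estimated from labeled chord sequences by counting histories, exactly as in speech recognition. A maximum-likelihood sketch with an invented toy corpus and no smoothing:

```python
from collections import Counter

def ngram_probs(sequences, n):
    """Maximum-likelihood n-gram chord model: P(c_t | previous n-1 chords)."""
    ngrams, histories = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + seq  # pad so early chords have a history
        for t in range(n - 1, len(padded)):
            history = tuple(padded[t - n + 1:t])
            ngrams[history + (padded[t],)] += 1
            histories[history] += 1
    return {g: c / histories[g[:-1]] for g, c in ngrams.items()}

# Invented toy corpus of chord progressions.
corpus = [["C", "F", "G", "C"], ["C", "G", "C", "F", "G", "C"]]
trigram = ngram_probs(corpus, 3)
# P(C | F, G): every time the history (F, G) occurred, C followed.
print(trigram[("F", "G", "C")])  # 1.0
```

As the text notes, the small chord lexicon makes such tables far easier to estimate than their word-level counterparts, though real systems still need smoothing for unseen histories.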
Cheng et al. list several reasons why an HMM approach would benefit from
incorporating an N-gram model into its framework [15]:


1. In any chord progression, a chord depends not only on the immediately previous
chord but on all of the chords that came before it. The N-gram model reflects
this property of chord sequences.
2. Most HMM-based systems make the first-order Markov assumption, that is, the
outcome p(c_n) only depends on the outcome that came before it, p(c_{n-1}).
In other words, they only consider the probability between two consecutive
chords. This information from only two adjacent chords is insufficient.
3. The chord lexicon (usually around 24 chords) is small compared with a word
lexicon, which makes training the N-gram model manageable.
4. The Viterbi algorithm is computationally expensive, and training an HMM is
challenging because of the lack of massive quantities of training data.
In the training phase of the HMM, the N-gram model starts out with N = 2
based on the circle-of-fifths relationship between chords. Cheng et al. initialize
the transition probabilities between more closely related chords with higher values,
just like in Bello & Pickens' methodology [11]. From there, higher-order N-gram
models are trained on hand-labeled chord transcriptions. Because the N-gram model
is trained on chord sequences only, there is no issue with obtaining training data.
Another addition Cheng et al. made to the typical HMM approach is in the
decoding process. Most HMM approaches have relied on the Viterbi algorithm to
decode the most likely chord sequence. In the Viterbi algorithm, the most likely
previous chord is estimated for each time step and the paths are stored in a matrix.
Cheng et al. have come up with an algorithm that augments the Viterbi algorithm
with a bigram model and a trigram model. Their algorithm obtains the maximum
likelihood of a chord by weighing the bigram model and trigram model and by
calculating the correlation between an observation and its set of chord templates.
This is expressed formally as:
ci = argmaxci ( log Pbi + log Ptri + (1 ) log P (ci |oi )
where Pbi = P (ci |ci1 ), Ptri = P (ci |ci2 , ci1 ) and and are non-negative
weights such that + 1. Cheng et al. report that the N-gram based approach
outperforms the HMM-based approach without the N-gram model by 7% with an
overall accuracy of 67.3% but this is only compared to their implementation of an
HMM. Lees HMM implementation without an N-gram model actually outperformed
Cheng et al.s N-gram based approach with an overall accuracy of 72.8%. Though
this is a novel approach to the HMM framework, Cheng et al. likely obtained a
lower accuracy rating than Lee because of the lack of training data. In other words,
having more data has a greater effect on accuracy than adding an N-gram model.
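The interpolated scoring step described above could be sketched as follows. The dictionary-based probability tables and the particular values of α and β are hypothetical, chosen only to make the example runnable:

```python
import math

def best_chord(chords, prev2, prev, bigram, trigram, obs_lik,
               alpha=0.4, beta=0.3):
    """Return the chord maximizing the interpolated log-score
    alpha*log P(c|prev) + beta*log P(c|prev2,prev)
        + (1 - alpha - beta)*log P(c|observation)."""
    assert alpha >= 0 and beta >= 0 and alpha + beta <= 1
    def score(c):
        return (alpha * math.log(bigram[(prev, c)])
                + beta * math.log(trigram[(prev2, prev, c)])
                + (1 - alpha - beta) * math.log(obs_lik[c]))
    return max(chords, key=score)

# Toy example: both chords match the observation equally well, so the
# trigram history (F -> C -> ?) breaks the tie in favor of G.
chords = ["C", "G"]
bigram = {("C", "C"): 0.5, ("C", "G"): 0.5}
trigram = {("F", "C", "C"): 0.2, ("F", "C", "G"): 0.8}
obs_lik = {"C": 0.5, "G": 0.5}
```

The design point is that when the acoustic evidence is ambiguous, the language-model terms dominate the decision, which is exactly the motivation for augmenting the Viterbi search.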
Another paper implements a similar approach but with an extended N-gram
model called a factored language model [16]. In a standard N-gram model, a chord
is a single unit, whereas in a factored language model a chord is represented as a
group of factors. The factored language model has the advantage of incorporating
additional information such as chord durations. Khadkevich reports that, compared
to the standard language model, the factored language model showed a very slight
improvement of 0.25%. These studies show that language models hold much potential
for automatic chord recognition systems.
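As an illustration only (the factor set and its names are hypothetical, not Khadkevich and Omologo's exact scheme), a factored chord token might look like:

```python
from collections import namedtuple

# Hypothetical factored representation: instead of one opaque symbol,
# each token exposes several factors the language model can condition on.
ChordToken = namedtuple("ChordToken", ["root", "quality", "duration"])

progression = [
    ChordToken("C", "maj", 4),  # duration in beats (illustrative)
    ChordToken("A", "min", 2),
    ChordToken("G", "maj", 2),
]

# A factored LM can back off over subsets of factors, e.g. predicting the
# next root from previous (root, quality) pairs when the full history of
# complete tokens was never seen in training.
roots = [t.root for t in progression]
```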
5.5 Alternate Approaches
So far we have described a few common approaches to solving the problem of chord
recognition. More recently, the growing interest in deep learning within the machine
learning community has made its way to MIR tasks. Deep learning is useful for
untangling complicated patterns in large amounts of data and is well suited to the
task of chord recognition. Though more research has yet to be done on applying
deep learning to automatic chord recognition, present research shows promising
results [17].

5.6 Future Work

All of the studies up to this point have only considered fairly basic chord types:
major, minor, diminished, and augmented. They do not handle dominant seventh chords,
augmented sixth chords, Neapolitan chords, and other chords that are prevalent in
the Western harmonic vocabulary. And because of the nature of the PCP vector,
information about the bass note is lost, which means there is no way to distinguish
a chord in root position from an inverted one.
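This octave-folding loss can be seen in a small sketch (the count-based PCP and the MIDI note numbers are illustrative simplifications of a real chroma extractor):

```python
import numpy as np

def pcp(midi_notes):
    """Fold MIDI note numbers into a normalized 12-bin pitch class profile."""
    v = np.zeros(12)
    for n in midi_notes:
        v[n % 12] += 1.0  # the octave, and thus the bass note, is discarded
    return v / v.sum()

# C major in root position (C4 E4 G4) and in first inversion (E3 G3 C4):
root_position = pcp([60, 64, 67])
first_inversion = pcp([52, 55, 60])
# Both voicings fold to the identical 12-bin vector, so no PCP-based
# system can tell which note was in the bass.
```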
There is also no reliable method of handling modulations, or key changes. Every
chord serves a different function within a different key, so the transitional
probabilities would change when a modulation occurs. This is something I
would want to explore further in my thesis.

5.7 Conclusion

The goal of all automatic chord recognition is to extract chord information from an
audio signal. The task of chord detection is important to the analysis of classical music
and can help facilitate musicological insights. Over the years, different researchers
have implemented different approaches to this problem, including chord template
matching, hidden Markov models, language models, and most recently deep learning
models. Though the focus of this paper is on automatic chord recognition, it is
just one of many problems MIR research is striving to solve.

References
[1] D. Byrd and T. Crawford, Problems of music information retrieval in the real
world, Information Processing & Management, vol. 38, no. 2, pp. 249–272, 2002.
[2] R. Typke, F. Wiering, R. C. Veltkamp, et al., A survey of music information
retrieval systems, in ISMIR, pp. 153–160, 2005.
[3] P. Li, Automated Identification of Chord Progression in Classical Music. B.S.
Thesis, 2014.
[4] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Pearson Prentice Hall, 2009.
[5] L. R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[6] C. Chafe and D. Jaffe, Source separation and note identification in polyphonic
music, in Acoustics, Speech, and Signal Processing, IEEE International Conference
on ICASSP '86, vol. 11, pp. 1289–1292, IEEE, 1986.
[7] T. Fujishima, Realtime chord recognition of musical sound: A system using
Common Lisp Music, in Proc. ICMC, vol. 1999, pp. 464–467, 1999.
[8] J. Leung, Automatic chord extraction.
[9] K. Lee, Automatic chord recognition from audio using enhanced pitch class
profile, in Proc. of the International Computer Music Conference, p. 26, 2006.
[10] C. Harte, M. Sandler, and M. Gasser, Detecting harmonic change in musical
audio, in Proceedings of the 1st ACM Workshop on Audio and Music Computing
Multimedia, pp. 21–26, ACM, 2006.
[11] J. P. Bello and J. Pickens, A robust mid-level representation for harmonic
content in music signals, in ISMIR, vol. 5, pp. 304–311, 2005.
[12] A. Sheh and D. P. Ellis, Chord segmentation and recognition using EM-trained
hidden Markov models, in ISMIR, vol. 3, pp. 183–189, 2003.
[13] F. Jargstorff, The circle of fifths.
[14] K. Lee, Automatic chord recognition from audio using enhanced pitch class
profile, in Proc. of the International Computer Music Conference, p. 26, 2006.
[15] H.-T. Cheng, Y.-H. Yang, Y.-C. Lin, I.-B. Liao, and H. H. Chen, Automatic
chord recognition for music classification and retrieval, IEEE, 2008.
[16] M. Khadkevich and M. Omologo, Use of hidden Markov models and factored
language models for automatic chord recognition, in ISMIR, pp. 561–566, 2009.
[17] X. Zhou and A. Lerch, Chord detection using deep learning, in Proceedings
of the 16th ISMIR Conference, vol. 53, 2015.