
Maja Bjelica

SPEECH RHYTHM IN
ENGLISH AND SERBIAN:
A Critical Study of Traditional and Modern Approaches

Novi Sad
2012.

FACULTY OF PHILOSOPHY IN NOVI SAD
Department of English Studies

For the publisher:
Prof. Dr Ljiljana Subotić, Dean

Reviewers:
Prof. Dr Tatjana Paunović
Asst. Prof. Dr Biljana Radić-Bojanić
Asst. Prof. Dr Nataša Bikicki

ISBN 978-86-6065-111-4

Reprinting and copying prohibited.

All rights reserved by the publisher and the author.


Filozofski fakultet u Novom Sadu
Odsek za anglistiku

Dr Zorana Đinđića 2
21 000 Novi Sad
Tel: +381214853900
+381214853852

www.ff.uns.ac.rs

Preface

The book Speech Rhythm in English and Serbian: A Critical Study of Traditional and Modern Approaches is a revised version of my unpublished Master’s thesis, “Characteristics of Speech Rhythm in English and Serbian”. It is an attempt to draw attention to the confusing situation in the theory of speech rhythm, and to emphasize how important it is for Serbian linguists to study this topic more thoroughly and for speech rhythm to be integrated into pronunciation classes from an early age. The book offers a classification of existing approaches, which shows a gradual movement from the traditional and descriptive to the modern and experimental as instrumental techniques developed. It identifies the biggest problem in the study of speech rhythm: the lack of any general agreement on basic principles and research methodologies, which has resulted in a plethora of different and even opposing approaches that need to be critically analysed and classified.
I became interested in this topic when I became aware of a huge clash between the traditional theory and modern approaches based on experimental research. I cannot think of another situation in which a theory that has been proved to be wrong is still taught in English phonology classes because no better solution has yet been offered. I was also intrigued by the fact that this topic has been widely neglected by Serbian phonologists for no obvious reason. Writing a critical overview of the existing theories has been quite a challenge, since all of them have their good points as well as their drawbacks.
I would like to thank first and foremost my dear colleague Biljana, who encouraged me to take this leap of faith and start appreciating my own work by publishing this book. I would also like to thank my mentor, assistant professor Maja Marković, who gave me immense support and advice while I was writing my Master’s thesis; the members of my thesis committee, assistant professor Gordana Petričić and Tanja Milićev; and the reviewers of this book, assistant professors Tatjana Paunović, Nataša Bikicki, and Biljana Radić-Bojanić, for their effort and suggestions, which helped me finalize this book. I would like to give my special thanks to my family, my father, mother and brother, my dear friends and, most of all, my patient fiancé, who were extremely supportive and understanding while I was writing first the thesis and then this book. Finally, I would like to dedicate this book to my beloved sister Nataša, who was my biggest fan in the world, who believed in me even when nobody else did, even when I did not believe in myself, and who contributed to my thesis by proofreading the Serbian abstract at a point when I was so lost in English phrases and English rhythm that at times I forgot my own native language.
Hopefully, this book will serve as a good starting point for future studies of speech rhythm in Serbian and help future phonologists and students of English find all the relevant information about speech rhythm in one place.
I take full responsibility for any omissions and deficiencies that may be found in this book.

The author

Novi Sad, May 2012


Contents

1 INTRODUCTION: Why Speech Rhythm?

2 WHERE TO START: Problems in Defining Basic Concepts and Research Methodologies
2.1 Rhythm in Speech
2.2 The Relationship between Syllables and Stresses
2.3 Accent, Stress or Stress Accent: a Problem of Terminology
2.4 Characteristics of Serbian Accentual System
2.5 Speech Segmentation: Syllable, Foot, Timing

3 BETWEEN TRADITION AND REALITY: Classification of Different Approaches to Rhythmic Studies
3.1 Typological Approach to the Study of Speech Rhythm
3.1.1 The Rhythm Class Hypothesis: Stress-timed and Syllable-timed Languages
3.1.2 Isochrony Accepted: Physiology of Speech Production
3.1.3 Isochrony Questioned: Full-vowel Timing Theory
3.1.4 Isochrony Rejected: Setting Grounds for Future Experimental Studies
3.1.5 Rhythmic Studies in Serbian
3.2 Phonological Approach to the Study of Speech Rhythm
3.3 Perceptual Approach to the Study of Speech Rhythm
3.3.1 Languages in the Middle: the Existence of Intermediate Languages
3.3.2 Corpus Selection
3.3.3 Data Segmentation
3.3.4 Is There Rhythm to Begin with: Instrumental Studies of Rhythm
3.3.5 Nobody Puts Babies in the Corner: the Role of Rhythm Perception in Language Acquisition
3.3.6 It’s Not That Easy: Drawbacks of Instrumental Studies
3.3.7 Serbian: the Scarcity of Instrumental Studies

4 HOW TO APPLY THE STUDY OF SPEECH RHYTHM: Speech Synthesis and Rhythm Teaching
4.1 Why Should Speech Rhythm Be Taught in Language Classes?

5 CONCLUSION

REFERENCES

1 INTRODUCTION:
Why Speech Rhythm?

The interest in speech rhythm comes from the observation that different languages give rise to the perception of different types of rhythm. Rhythm is one of the basic prosodic features and has been a topic of debate for many years, yet even today linguists have not found a theory that accounts for all rhythm-related phenomena across languages. It has also been one of the most controversial, and thus very often neglected, issues in prosodic studies. It is said to be the most problematic of all prosodic variables, and there have been many different and even opposing approaches to it. Individual languages are often perceived as having distinct rhythmic styles, and this perception was the basis for the traditional theory of speech rhythm. According to this still widely applied theory, all the languages of the world fall into three distinct rhythmic classes: stress-timed, syllable-timed, and mora-timed.
These typological labels rely on the hypothesis that isochrony holds at the level of stressed syllables, individual syllables, or morae, depending on the language. However, a wealth of research done in this area over the last few decades has found little phonetic evidence to support this classification. The aim of the current study is to present both sides of the coin, to compare the existing approaches to speech rhythm, both traditional and modern, and to try to apply the existing theory to the study of Serbian speech rhythm, in order to prepare the ground for future empirical studies on Serbian corpora.
Apart from the controversy and complexity of the topic itself, a contrastive study of speech rhythm in the two languages faces a huge disproportion between the English and Serbian sources available.
Because very little work has been done on speech rhythm in Serbian, the current study mostly revolves around different approaches to speech rhythm in English, since all the hypotheses were put forward on the basis of English corpora, as well as corpora in some other widely studied languages such as Spanish and French. The book stresses the need for Serbian phonologists to study this issue more thoroughly, as well as the need to establish some universal parameters and methodologies for research in this area. Without such a universal approach to research, the study of speech rhythm lacks a common theoretical basis, which in turn creates confusion and opposing views that will persist unless the problem is addressed. Hopefully, this book is one small step towards that goal, since it compares and contrasts the existing approaches by pointing out the similarities and differences between them and presents their strengths as well as their drawbacks.
To explain why speech rhythm needs to be studied, it is important to emphasize that the rhythm of a language is one of its fundamental features, one that is acquired early by a child and is quite difficult for adults to learn, change, or even modify when they want to learn to pronounce a foreign language. It is said to play an essential role in the first stages of language acquisition by a newborn child, as a basis for early language discrimination. However, rhythm seems to be a much neglected factor not only in the study of English prosody but also in English language teaching. Moreover, the study of speech rhythm is very important for developing a reliable speech synthesis programme which will generate more natural and thus more accurate speech. Consequently, messages conveyed by such a programme will take much less time to be understood.

2 WHERE TO START:
Problems in Defining Basic Concepts and Research Methodologies

The first problem this study faced is the lack of universal definitions of the basic concepts related to speech rhythm. These concepts have been among the most controversial issues in linguistics, and describing and explaining them raises some general theoretical questions. Nevertheless, it is necessary to establish the phonetic principles required as a frame of reference for the specific discussions of speech rhythm in English and Serbian.
The major problem in defining the basic concepts is that linguists have approached them from many different points of view. Moreover, since these different approaches have led authors to use different methodologies, it is rather difficult to find a uniform way to compare their research results, which further complicates the study of speech rhythm.

2.1 Rhythm in Speech

Although the term “rhythm” occurs in many contexts besides speech, such as music, poetics, or neurology (most of the definitions are listed in The Oxford Companion to the English Language, 1992: 869), the definition used in this paper concerns only the rhythm of speech.
Unfortunately, there is no universally accepted definition of rhythm. Roach (2002: 67) describes speech as a sequence of events in time, and the way these events are distributed in time is said to be the rhythm of speech. Since people do not normally perceive speech rhythm in everyday communication, they tend to compare speech with the only rhythm they are aware of, the rhythm of music, and conclude that speech cannot have rhythm.
As the most obvious examples of speech having rhythm, Roach (2002) mentions chanting as part of games (such as children calling out words while skipping, or football crowds calling their team’s name) or in connection with work of some kind (such as sailors using chants to synchronise the pulling on an anchor rope). Conversational speech is far more complicated, but most phonologists agree that some kind of regular timing is definitely present even there.
Generally speaking, rhythm is said to be the repetition of an event at more or less regular intervals of time. In other words, the rhythm of speech, like any other rhythm, arises from the periodic recurrence of specific units, producing an expectation that the regularity of succession will continue (Abercrombie 1967: 96). However, these units of rhythmic succession are precisely what phonologists disagree about, because they are said to vary across languages. In English, the repetition basically concerns the distribution of stresses in utterances, which means that the foot is taken to be the basic unit which occurs periodically. In some other languages, such as Spanish or Italian, the repetition concerns the distribution of syllables in time, the syllable being the basic unit of speech rhythm.
However, more recent approaches to the theory of speech rhythm point out that this picture is anything but “black and white”, as was thought earlier. In addition, Patel (2008) warns us to be careful when we define rhythm in terms of periodicity, i.e. a pattern repeating regularly in time. Although it has been said that rhythm denotes periodicity, since it arises out of the periodic recurrence of some sort of movement (Abercrombie 1967), Patel points out a crucial difference between the terms “periodic” and “rhythmic”.
Namely, Patel (2008: 96) says: “Although periodic patterns are rhythmic, not all rhythmic patterns are periodic.” This means that units whose recurrence is perceived as rhythmic are not necessarily repeated at regular intervals of time. Periodicity is thus only one type of rhythmic organization, although speech rhythm has had a long and largely unfruitful association with the notion of periodicity. With this in mind, Patel considers it highly important to leave the issue of periodicity open in any definition of rhythm, and he himself defines rhythm as “the systematic patterning of sound in terms of timing, accent, and grouping” (Patel 2008: 96). Therefore, speech, like music, is characterized by systematic temporal, accentual, and phrasal patterning.
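
The notion of periodicity discussed above can be made concrete with a small sketch. It is only an illustration, not a measure proposed by any of the authors cited: it assumes that the onsets of the recurring units (for example stressed syllables) are available as times in seconds, and it uses the coefficient of variation of the intervals between them as a hypothetical indicator of how far a sequence departs from strict periodicity. The function names are invented for this example.

```python
from statistics import mean, pstdev


def inter_onset_intervals(onsets):
    """Return the durations between consecutive unit onsets (in seconds)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def interval_variability(onsets):
    """Coefficient of variation of the inter-onset intervals.

    0.0 means the units recur at perfectly equal intervals (strict
    periodicity); larger values mean less regular timing.
    """
    intervals = inter_onset_intervals(onsets)
    if len(intervals) < 2:
        raise ValueError("at least three onsets are needed")
    return pstdev(intervals) / mean(intervals)


# Hypothetical onset times: a strictly periodic and an irregular sequence.
periodic = [0.0, 0.5, 1.0, 1.5]
irregular = [0.0, 0.4, 1.0, 1.3]
print(interval_variability(periodic))
print(interval_variability(irregular))
```

A sequence with a non-zero value may still be perceived as rhythmic, which is the point of Patel's distinction: the figure describes periodicity only, not rhythm as a whole.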

2.2 The Relationship between Syllables and Stresses

In order to describe the pronunciation of a language and compare it to the pronunciation of other languages, it is necessary to analyse speech into units. Many different approaches have been proposed, but the most applicable one here seems to be the one that takes the syllable as the starting point of speech analysis.¹
Although the approaches to the syllable vary among phoneticians and phonologists, most of them agree that the syllable seems to be the most basic unit of speech: every language has syllables, babies learn to produce syllables before they learn any word of their native language, and people with speech disorders still display syllabic organization (Roach 2002: 77).
Defining the syllable has always been problematic, although it appears to be a concept which speakers of more or less every language in the world can recognize intuitively. In many languages, including Serbian, the syllable is very often defined in terms of its hierarchically organised structure, which consists of consonantal and vocalic segments. A syllable always has one vowel (or a syllabic consonant) as its nucleus, and a number of consonants may precede and follow it (in the onset and coda respectively). In English, the number of consonants that precede the nucleus ranges from zero to three, while the number of consonants that follow the nucleus does not exceed four.
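
The onset, nucleus and coda structure just described can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class and function names are invented for the example, the transcription of ‘strengths’ is an assumed one, and the check covers only the number of consonants allowed in English, not which particular clusters are permissible.

```python
from dataclasses import dataclass, field

MAX_ONSET = 3  # English: zero to three consonants before the nucleus
MAX_CODA = 4   # English: at most four consonants after the nucleus


@dataclass
class Syllable:
    """A syllable as onset consonants, a nucleus, and coda consonants."""
    nucleus: str
    onset: list[str] = field(default_factory=list)
    coda: list[str] = field(default_factory=list)


def fits_english_template(syl: Syllable) -> bool:
    """Check only the size limits on onset and coda, not cluster content."""
    return (
        bool(syl.nucleus)
        and len(syl.onset) <= MAX_ONSET
        and len(syl.coda) <= MAX_CODA
    )


# Assumed transcription of 'strengths': three onset and four coda consonants.
strengths = Syllable(onset=["s", "t", "r"], nucleus="e", coda=["ŋ", "k", "θ", "s"])
print(fits_english_template(strengths))
```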
Gimson (1978) provides two definitions of the syllable based on two different approaches: phonetic and linguistic. The phonetic approach seeks a universal definition in phonetic terms, while the linguistic approach treats the syllable as a language-specific issue and stresses the importance of having language-specific definitions rather than a universal one.

¹ The analysis of individual phonemes will be put aside for now, since it is not that relevant for the study of speech rhythm.

The phonetic approach to defining the syllable has been divided into two theories: the Prominence Theory and the Pulse Theory. According to the Prominence Theory, some sounds in an utterance are perceived to be more prominent than the rest of the sounds in a sequence, i.e. they stand out in relation to their neighbouring sounds. On this approach, an utterance contains as many syllables as there are peaks of prominence. Vowels are perceived to be more prominent than other sounds (consonants), which is why they are taken to be the central part of each syllable. Consequently, syllable boundaries occur at the points of relatively weak prominence (so-called “valleys”). Since this approach is mainly based on auditory judgements, its major drawback is that it sometimes cannot determine to which syllable the “weak” sound at a syllable boundary belongs, especially in large consonant clusters. For example, the word ‘extra’ /ˈekstrə/, which is said to show three peaks of prominence but intuitively has only two syllables, can be segmented in two ways: [ek-strə] or [eks-trə] (Gimson 1978: 52). Similarly, Daniel Jones (1962: 327) points out that it is often impossible to specify the points at which a syllable begins and ends. For example, although the sound /t/ in the word ‘letter’ has no sonority at all, it is impossible to say at which part of the sound /t/ the syllable separation takes place (is /ˈletə/ segmented as [let-ə] or [le-tə]?).
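
The counting of prominence peaks can be illustrated with a short sketch. The numerical sonority scale below is an assumption made for the example (plosives lowest, vowels highest) and is not taken from Gimson or Jones; the function name is likewise invented. A sound counts as a peak when it is more sonorous than both of its neighbours.

```python
# Assumed sonority scale (higher = more prominent); illustrative only.
SONORITY = {
    "p": 1, "t": 1, "k": 1, "b": 1, "d": 1, "g": 1,   # plosives
    "f": 2, "v": 2, "s": 2, "z": 2,                   # fricatives
    "m": 3, "n": 3,                                   # nasals
    "l": 4, "r": 4,                                   # liquids
    "a": 6, "e": 6, "i": 6, "o": 6, "u": 6, "ə": 6,   # vowels
}


def prominence_peaks(phones):
    """Return the sounds that are more sonorous than both neighbours."""
    values = [SONORITY[p] for p in phones]
    peaks = []
    for i, value in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        if value > left and value > right:
            peaks.append(phones[i])
    return peaks


# 'extra': the fricative /s/ between two plosives forms a third peak,
# although the word is intuitively felt to have two syllables.
print(prominence_peaks(["e", "k", "s", "t", "r", "ə"]))
print(prominence_peaks(["l", "e", "t", "ə"]))
```

The sketch only counts peaks; like the theory itself, it says nothing about where exactly the boundary between two syllables falls.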
The Pulse Theory, on the other hand, is a theory of the syllable originally proposed by R.H. Stetson in his book Motor Phonetics: A Study of Speech Movements in Action (1951), later adopted by David Abercrombie (1967) and formulated in terms of the pulmonic air-stream mechanism. It is concerned with the muscular activity controlling lung movement during speech. The syllable-producing movement of the respiratory muscles has been called a chest-pulse (because the intercostal muscles in the chest are responsible for it), a breath-pulse, or a syllable-pulse (the term “pulse” being used because of its recurrent and periodic nature, thus defining the rhythm of speech in terms of periodicity). Each chest-pulse is accompanied by an increase in air pressure, and the number of chest-pulses determines the number of syllables in an utterance. Such a pulse therefore serves as the basis for the syllable, and a succession of such pulses creates a series of beats in the flow of syllables. According to Abercrombie (1967), the syllable is essentially a movement of the speech organs, and not a characteristic of the sound of speech. This means that the definition of the syllable has to do not with the structure of the sounds that make it up, but with the process that takes place in our speech organs when we utter sentences.
“A syllable is the minimum utterance, and nothing less than a syllable can be produced” (Abercrombie 1967: 35). The syllable is essentially an audible movement (at least, in most cases) of the speech organs. After the air is released from the lungs, the pulse is associated with the movement of other speech organs, such as the vocal cords, the velum, and eventually the tongue and lips, in order to articulate sounds. “All these movements, combined together, are superimposed on the fundamental syllable- and stress-producing processes of the pulmonic mechanism, and they are felt by both speaker and hearer to constitute one single speech-producing act” (Abercrombie 1967: 37). Because of this unity of action, the syllable is an integrated whole, although it is a complex act.² That is why it is taken to be the smallest unit of speech, and nothing less can be produced.
Another approach to the syllable mentioned by Gimson (1978: 52) is the so-called linguistic approach. Gimson states that it is more useful than the phonetic one to define the syllable “with reference to the structure of one particular language rather than in general, phonetic terms with universal application” (Gimson 1978: 52). It may be more appropriate to divide a similar sound sequence differently in different languages, depending on the language-specific rules concerning the possible combinations of segments (phonemes). However, this approach also fails to explain the division of the English word ‘extra’ /ˈekstrə/ into syllables, since both /-k/ and /-ks/ are found at the end of English words, while both /str-/ and /tr-/ are possible initial consonant clusters in English (Gimson 1978: 52).

² A perfect example of one such complex act is given by Abercrombie (1967): a golf swing, in which movements of the fingers, wrists, arms, trunk, legs, and other body parts are involved and coordinated in order to produce a single effect, so much so that the component parts of the swing, i.e. the independent movements of each body part, are not easily disentangled.

Similarly to Abercrombie, the Serbian authors Stanojčić and Popović (1999) define the syllable as a phonetic unit which is pronounced with one articulatory movement of the speech organs. It can be composed of only one sound as long as that sound is a vowel³ (e.g. u ‘in’ as in u torbi ‘in the bag’), but usually it is composed of one vowel preceded by one or more consonants. The general rule for the placement of syllable boundaries is that the boundary is placed after the vowel of one syllable and before the consonant (onset) of the following syllable. For example:

(1) raditi ‘to do’ [ra-di-ti]
(2) lasta ‘swallow’ [la-sta]
(3) avioni ‘airplanes’ [a-vi-o-ni]
    but: avion ‘airplane’ [a-vi-on]
(4) leptir ‘butterfly’ [lep-tir]

(Stanojčić and Popović 1999: 37)
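
The general rule illustrated above can be sketched as a simple procedure. This is a minimal illustration, not the authors' own formalisation: the function name is invented, only the five vowel letters are treated as syllable nuclei (so syllabic r is not handled), and word-final consonants are assumed to join the last syllable. Exceptions to the rule, such as leptir, are deliberately not covered.

```python
VOWELS = set("aeiou")


def syllabify_general_rule(word):
    """Split a Serbian word by placing a boundary after every vowel.

    Consonants left over at the end of the word are attached to the
    last syllable. Only the general rule is implemented, not its
    exceptions, and syllabic consonants are not recognised.
    """
    syllables = []
    current = ""
    for ch in word.lower():
        current += ch
        if ch in VOWELS:
            syllables.append(current)
            current = ""
    if current:
        if syllables:
            syllables[-1] += current   # word-final consonants join the last syllable
        else:
            syllables.append(current)  # no vowel found at all
    return syllables


for w in ["raditi", "lasta", "avioni", "avion", "leptir"]:
    print(w, "-".join(syllabify_general_rule(w)))
```

For the first four words the sketch reproduces the divisions quoted above; for leptir it yields le-ptir rather than the cited lep-tir, which shows that the cited division departs from the general rule.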

As the last two examples show, the rule has exceptions, which are stated by Stanojčić and Popović (1999: 37). However, from the examples they give, it can be concluded that phonotactic rules do not play the same role in Serbian as they do in English. Namely, on the basis of the rule above, words such as grožđe ‘grapes’ or voćka ‘fruit’ are divided into syllables as [gro-žđe] and [vo-ćka]. This division seems problematic, since the consonant clusters /žđ/ and /ćk/ do not normally occur word-initially in Serbian, and a better solution would be to divide the words as [grož-đe] and [voć-ka]. It can thus be concluded that Stanojčić and Popović adopt the phonetic approach to the syllable. However, for certain words the so-called semantic (also called psychological) approach is preferred over the phonetic one. An example is the word razljutiti ‘to make somebody angry’, which the phonetic approach divides into [ra-zlju-ti-ti] and the semantic approach into [raz-lju-ti-ti] (Stanojčić and Popović 1999: 37). According to the linguistic approach as well, the first division is not justified, since the cluster /zlj/ does not occur word-initially in Serbian. The word isterati ‘to cast out’ is a more problematic case: the phonetic approach would divide it into [i-ste-ra-ti], where the initial cluster /st/ is possible in Serbian, while the semantic one would divide it into [is-te-ra-ti]. The latter division is said to be more appropriate in this case (Stanojčić and Popović 1999: 37).

³ Words like rđati ‘to rust’ show that this is not strictly true, since the first syllable of the word consists of a syllabic consonant alone.

Whichever approach to the syllable we adopt, it is clear that the syllable is the starting point of any discussion of speech rhythm. Syllables differ not only in structure but also in the effort made in producing them, i.e. the amount of air expelled from the lungs when they are uttered in connected speech. Thus, some syllables are in some sense stronger and more prominent than others. Abercrombie (1967) says that a chest-pulse, the movement which produces syllables, can also be produced by exceptionally great muscular action. A pulse produced in this way is called a stress-pulse (Abercrombie 1967: 37).
As a result of this process, a stronger puff of air than usual is expelled from the lungs, which causes a louder sound, among other things. A syllable produced in this manner is said to be stressed, or to carry stress. According to Roach (2002), although stress has been widely discussed and extensively studied, there still remain many areas of disagreement or lack of understanding. It seems likely that stressed syllables are produced with greater effort than unstressed ones, and that this effort is manifested in the air pressure generated in the lungs for producing the syllable and also in the articulatory movements in the vocal tract. These effects produce different audible results: pitch prominence, where a stressed syllable stands out from its context (a feature of Serbian accent); length, since stressed syllables tend to be longer than unstressed ones (a feature which is highly noticeable in English but much less so in other languages); and loudness, since stressed syllables tend to be louder than unstressed ones.
Stretches of connected speech are combinations of stressed and unstressed syllables. Certain words, namely lexical or content words, are predisposed by their function in a language to receive stress or accent, while function words, such as auxiliary verbs, conjunctions, prepositions, and pronouns, are more likely to be unstressed or unaccented in connected speech. “In an extended dialogue in normal conversational style, the number of weak syllables (unaccented) tends to exceed that of those carrying an accent (primary or secondary)” (Gimson 1978: 259).

2.3 Accent, Stress or Stress Accent: a Problem of Terminology

The two terms stress and accent were deliberately used side by side at the end of the previous section. Before we go any further, it is very important to clarify the usage of the two terms in order to avoid ambiguity.
According to Steiner (2004), accent is a phonological feature which, when realized, promotes the perception of one particular syllable in relation to others. This means that stressed syllables are marked as having a specific accent.
Stress, on the other hand, is just the phonetic realization of a certain accent. However, as with much else in the theory of speech rhythm, there is no universal agreement on the use of these two terms: authors very often use them interchangeably, without clarifying any difference between them.
The problem with comparative studies of English and Serbian is the problem of defining stress, since this phenomenon differs significantly in the two languages. The way stress is manifested is highly language-dependent.
Namely, languages like Serbian, ancient Greek, Latin, and even Japanese use variations in pitch to give prominence to a syllable (or mora) within a word. These languages are said to have pitch accent⁴ (or tonic accent) and use phonemic tone to mark the prominence of a specific syllable in a word. On the other hand, languages like English and Spanish are said to exhibit stress accent (or dynamic accent), which uses the impression of loudness to mark the difference between the most prominent syllable in a word and less prominent ones.

⁴ This usage of the term ‘pitch accent’ was proposed by Dwight Bolinger, taken up by Janet Pierrehumbert (1980), and described by D. Robert Ladd.

Pitch-accented languages usually have a more complex accentual system than stress-accented languages. Serbian distinguishes four types of pitch accent, which result from different combinations of the tone and quantity (length) of syllables, while in English there are no such variations: accented syllables are just louder.
Moreover, while in English stress accent is said to give rise to the most prominent syllables in an utterance without adding any particular meaning, the placement of the tone, or the way pitch accent is realized, in a Serbian word influences the meaning of the word: the misuse of pitch accent can lead to misunderstanding among the participants in a conversation.

In order to illustrate the changes in meaning depending on the type of accent, the following examples are given:

(5) short-falling pitch accent (◌̏) vs. long-falling pitch accent (◌̑)

lȕk (n.) a round white vegetable which has a strong taste and smell (Eng. onion, or garlic)
lȗk (n.) part of a curved line or a circle (Eng. arc)

grȁd (n.) frozen rain drops which fall as hard balls of ice (Eng. hail)
grȃd (n.) a large area with houses, shops, offices etc. where people live and work (Eng. town)

(6) short-rising pitch accent (◌̀) vs. long-rising pitch accent (◌́)

vàljati (v.) to be good
váljati (v.) to roll

(7) pitch accent placed on the second syllable of the word vs. pitch accent placed on the first syllable of the word

govòriti (v., infinitive) to speak
gòvorīm (v., 1st person sg. present of “to speak”) I’m speaking

However, it is not entirely true that stress accent in English does not influence the meaning of words. According to Roach (2002: 73), the position of stress in a word can change its meaning. For example:

(8) ‘import’ (noun) /ˈɪmpɔːt/ vs. ‘import’ (verb) /ɪmˈpɔːt/
(9) ‘permit’ (noun) /ˈpɜː(r)mɪt/ vs. ‘permit’ (verb) /pə(r)ˈmɪt/

While the syllable is said to recur regularly in some languages, in languages like English it is the units defined by stressed syllables that tend to do so. Units defined by stress are called feet. Foot is the term used by phoneticians and phonologists to describe the unit of rhythm in languages such as English. It denotes the stretch between two consecutive stressed syllables: each foot consists of one stressed syllable and a number of unstressed syllables (possibly none). Feet which consist of no more than two syllables are called “bounded feet”, while a foot which contains only one syllable is called “a degenerate foot” (Crystal 2008: 234).
The problem with the definition of a foot is that not all linguists agree
where a foot starts and where it ends. Namely, most of them define foot
as a sequence of syllables which start with a stressed syllable and ends
with an unstressed one before some other stressed syllable, which means
that the following foot again starts with a stressed syllable (“the next foot
begins when another stressed syllable is produced”, Roach 2002: 29).
For example:

(10) Here is the news at nine o’clock.
     |Here is the |news at |nine o’|clock|5
(Roach 2002: 29)
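
The left-headed segmentation shown in (10) can be illustrated with a minimal Python sketch. It assumes that every syllable has already been marked as stressed or unstressed; the function name segment_into_feet, the syllabification, and the stress marks assigned to the utterance are assumptions made for illustration only and are not part of Roach’s description.

```python
from typing import List, Tuple

Syllable = Tuple[str, bool]  # (text, is_stressed)


def segment_into_feet(syllables: List[Syllable]) -> List[List[str]]:
    """Group syllables into left-headed feet.

    Every stressed syllable opens a new foot; unstressed syllables
    are attached to the foot opened by the preceding stress.
    """
    feet: List[List[str]] = []
    for text, stressed in syllables:
        if stressed or not feet:
            # Unstressed syllables before the first stress form a
            # separate leading group (an assumption of this sketch).
            feet.append([text])
        else:
            feet[-1].append(text)
    return feet


# Assumed stresses: "Here", "news", "nine", "clock".
utterance = [
    ("Here", True), ("is", False), ("the", False),
    ("news", True), ("at", False),
    ("nine", True), ("o'", False),
    ("clock", True),
]

feet = segment_into_feet(utterance)
for foot in feet:
    print(" ".join(foot))
```

Keeping the unstressed syllables that precede the first stress as a separate leading group is only one possible choice; the text does not say how such syllables should be treated.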

However, in metrical phonology, there are two types of feet: left-headed
feet are those where the leftmost syllable of the foot is stressed, i.e.
the most prominent syllable comes first (as in the above-mentioned example
given by Roach 2002), while right-headed feet are those where the
rightmost syllable is stressed, i.e. the most prominent syllable comes
last (Crystal 2008: 193).
For the purpose of this paper, Roach’s “left-headed” approach to the
segmentation of utterances into feet will be adopted. For a detailed clas-
sification of feet in English, see Bjelica (2010: 19).

5 Stressed syllables are underlined while foot divisions are marked with vertical
lines.

2.4 Characteristics of Serbian Accentual System

Given the complexity of accent in Serbian, we need to inspect this
prosodic feature in more detail for Serbian than for English. In Serbian,
the term “accent” is used rather than the term “stress” (used in English)
due to the fact that pitch and length are involved rather than intensity.
Jovičić (1999: 407) points out the difference between Serbian and English
with respect to the accented syllable. Namely, the English accented
(stressed) syllable is characterised by intensity (which results in such
syllables being the most prominent in an utterance), longer duration, and
higher fundamental frequency F0. On the other hand, the Serbian accented
syllable
is characterised by longer duration and pitch change in relation to the
unaccented syllable, while intensity does not make much difference be-
tween accented and unaccented (especially post-accented) syllables, nor
between different types of accent. Moreover, the duration of the accented
syllable and pitch change are directly responsible for the perception of
different types of accent. Similarly, Crystal (1969) defines stress in Eng-
lish as variations in linguistically contrastive prominence primarily due
to loudness, while Lehiste and Ivić (1986) state that the decisive cue for
“stressedness” in Serbian is duration. The Serbian language is a system
where both tone and stress play a role in phonology. According to Inke-
las and Zec (1988: 227), the accents of Serbian “are decomposed into
two independent subcomponents within the accentual system: tone and
stress”, which are said to be separate phenomena in Serbian. Stress is
manifested as increase in relative duration, while tone is manifested as
relative difference in pitch. While tone is said to participate in lexical
contrasts (since accent in Serbian sometimes makes distinction between
otherwise the same words, as is shown in some previous examples), the
location of stress is said to be predictable from that of tone and makes
no contribution to lexical contrasts. This means that, for example, high
tone can be assigned lexically to any syllable in the word whereas stress
can only be assigned to a syllable containing this high tone (Inkelas and
Zec 1988: 244). The system of pitch accents in Serbian is traditionally
described in terms of two tonal movements within the accented syllable,
“falling” and “rising” (Lehiste and Ivić 1986: 1). Accented syllables,
both long and short, are termed either rising or falling. Thus, Serbian
recognizes four lexically contrastive accents, based on the combinations
of the two criteria: long-rising (á), long-falling (ȃ), short-rising (à),
and short-falling (ȁ). For example:

(11) long-rising:   razlika ‘difference’   ra azli ka
     long-falling:  zastava ‘flag’         za astava
     short-rising:  paprika ‘pepper’       papri ka
     short-falling: jezero ‘lake’          je zero
(Inkelas and Zec 1988: 228)

The pitch contours of the words are given in the rightmost column of
example (11). As can be seen from the given contours, the falling accents
“reside” within a single syllable, while the rising accents “stretch”
over two syllables, the first of which is perceived as stressed (in
example (11), the stressed syllables are bolded). As a result, there are
some distributional constraints on the four accents. Namely, the accent
in Serbian is said to be relatively free as it can occur on any syllable in
the word but the last one (unless the word is monosyllabic). The term
“relatively free” is used since, although the main accent always falls on a
particular syllable of any given word (so the accentual pattern of Serbian
is fixed in a way), it is not tied to any particular syllable in the sequence
of syllables which constitutes a word (like in French, Polish, or Czech).
However, not every type of accent can occur on every syllable. Falling
accents generally occur in monosyllabic words or in the first syllable of
a polysyllabic word. On the other hand, rising accents generally occur
in every syllable of a polysyllabic word except the last one and never in
monosyllabic words. This last point is understandable given their pitch
contour (see examples above). As can be seen in the example given, all
four accents can occur on the first syllable of the word
(unless the word has only one syllable).
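
The distributional constraints described above can be summarised in a short Python sketch. It encodes only the general tendencies stated in the text (the text itself says “generally”), so it should be read as an illustration rather than as a complete account of Serbian accentuation; the function name allowed_accents and the 1-based syllable numbering are assumptions of the sketch.

```python
from typing import Set

FALLING = {"long-falling", "short-falling"}
RISING = {"long-rising", "short-rising"}


def allowed_accents(n_syllables: int, position: int) -> Set[str]:
    """Return the accent types that may fall on a given syllable.

    position is 1-based: 1 is the first syllable of the word.
    """
    if not 1 <= position <= n_syllables:
        raise ValueError("position must lie within the word")
    allowed: Set[str] = set()
    if position == 1:
        # Falling accents: monosyllables or the first syllable only.
        allowed |= FALLING
    if n_syllables > 1 and position < n_syllables:
        # Rising accents: any syllable but the last, never monosyllables.
        allowed |= RISING
    return allowed


for n, pos in [(1, 1), (3, 1), (3, 2), (3, 3)]:
    print(n, pos, sorted(allowed_accents(n, pos)))
```

The empty result for the final syllable of a polysyllabic word corresponds to the statement that the accent can occur on any syllable but the last one.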
As far as vowel length is concerned, long and short vowels are possible
in both accented and unaccented syllables (the long unaccented syllables
are usually related to post-accentual positions). Short unaccented
syllables are sometimes (for the purpose of marking all the syllables in
a word) marked with (ă), while long unaccented syllables are marked
with (ā). Thus the word nacionalni ‘national’ can be marked as follows:

(12) nacionalni     
(five syllables: short accented + 2 short unaccented + long
unaccented + short unaccented)

2.5 Speech Segmentation: Syllable, Foot, Timing

In order to achieve rhythmic succession, each language needs to
determine its own segments which tend to occur more or less regularly6.
According to the existing works on the present topic, this choice can be
made according to two types of units mentioned previously: a language
can persevere with the syllable as a common unit of sound (as many lan-
guages do) or select a larger unit consisting of a number of syllables (the
foot). According to some linguists, there is also the third unit of rhythmic
organization called the mora. Moras7 are often said to be units which
consist mostly of consonant–vowel (CV, V or CjV) combinations, single
vowels, or the nasal /n/ (e.g. na-ka-mu-ra and to-o-kyo-o each comprises
four moras). Some authors do not make the distinction between mora
and syllable since they treat mora as nothing more than a type of syllable
which is simple and reflects the simple structures of Japanese (Grabe and
Low 2002). Others say that mora is a unit out of which all other units of
rhythmic succession are composed. All in all, since a precise definition
of mora is difficult to determine, different authors define the term in ways
which suit their own theoretical or descriptive principles. For more on
moras, see Bjelica (2010: 21) and Arai and Greenberg (1997).
If we compare all three units, mora seems to be the smallest unit of
rhythmic succession, while foot is taken to be the largest, consisting of
a number of syllables, both stressed and unstressed8. What is most
important here is that the choice of rhythmic unit is determined by
language-specific rules. From all that has been mentioned previously, we
can conclude that the syllable seems to be the starting point for all
other segmentations, since it is the general unit out of which all other
units are composed (Gore 2004: 65), given that most phonologists treat
the mora as a simple syllable of the CV type. However, until a general
phonological definition of the syllable is presented, it cannot be
regarded as a universal segment of rhythmic succession.

6 The controversial issue of the regularity of succession will be discussed in
detail in the later chapters of the book.
7 The plural form ‘morae’ is also used in some papers because the word is of
Latin origin (in Latin, ‘mora’ means ‘linger, delay, space of time’). In this paper,
the anglicized plural ‘moras’ will be used.
8 Gore (2004: 65) gives an example in order to illustrate how these three types
of units are perceived: if a heavy syllable is followed by a light syllable, it can
be perceived either as three moras, two syllables, or one foot, depending on the
language and its specific rules about speech segmentation.

The traditional consonant/vowel segmentation does not seem to be
problematic, since every language has its inventory of consonants and
vowels. However, it is more complicated in connected speech. Although
consonant/vowel segmentation varies across languages, it is formulated
in general terms, considering not consonants and vowels in the narrow
sense of the word, but rather highs and lows in the universal sonority
curve – “highs” being vowels, since they are more sonorous than con-
sonants, which are represented as “lows” on the sonority curve (Ramus,
Nespor, and Mehler 1999). The problem with consonant/vowel segmen-
tation is the treatment of certain phonemes in connected speech. For ex-
ample, the treatment of syllabic consonants varies among linguists, as
well as the treatment of glides. This particular problem can directly affect
the placement of segment boundaries, which consequently influences the
interpretation of data attained during the experiments (especially those
experiments based on the measurements of vocalic and consonantal in-
tervals, e.g. Ramus, Nespor, and Mehler 1999).
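
The effect of such labelling decisions can be illustrated with a short Python sketch, offered only as an illustration under stated assumptions. It computes two interval-based measures of the kind used by Ramus, Nespor, and Mehler (1999): the proportion of vocalic intervals (%V) and the standard deviation of consonantal intervals (ΔC). The segment durations are invented, the function names are hypothetical, and the use of the population standard deviation is an assumption of the sketch.

```python
from statistics import pstdev
from typing import List, Tuple

Interval = Tuple[str, float]  # (label, duration in seconds)


def merge_adjacent(segments: List[Interval]) -> List[Interval]:
    """Merge neighbouring segments with the same label into one interval."""
    merged: List[Interval] = []
    for label, duration in segments:
        if merged and merged[-1][0] == label:
            merged[-1] = (label, merged[-1][1] + duration)
        else:
            merged.append((label, duration))
    return merged


def percent_v(segments: List[Interval]) -> float:
    """Percentage of the total duration that is labelled vocalic."""
    total = sum(d for _, d in segments)
    vocalic = sum(d for label, d in segments if label == "V")
    return 100.0 * vocalic / total


def delta_c(segments: List[Interval]) -> float:
    """Standard deviation of the consonantal interval durations."""
    consonantal = [d for label, d in merge_adjacent(segments) if label == "C"]
    return pstdev(consonantal)


def label_glide(segments: List[Interval], glide_as: str) -> List[Interval]:
    """Relabel every glide ("G") as vocalic ("V") or consonantal ("C")."""
    return [(glide_as if label == "G" else label, d) for label, d in segments]


# Invented durations: consonant, vowel, glide, vowel, consonant.
segments = [("C", 0.08), ("V", 0.10), ("G", 0.05), ("V", 0.12), ("C", 0.07)]

glide_as_vowel = label_glide(segments, "V")
glide_as_consonant = label_glide(segments, "C")

print(percent_v(glide_as_vowel), delta_c(glide_as_vowel))
print(percent_v(glide_as_consonant), delta_c(glide_as_consonant))
```

With the glide counted as vocalic, the stretch contains one long vocalic interval; with the glide counted as consonantal, it contains two vocalic and three consonantal intervals, so both measures change even though the signal is the same.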
If the syllable is defined in linguistic terms and is determined by
language-specific rules, even a non-linguist can often segment an
utterance into syllables without any difficulty. However, stress is a
more problematic issue, as was mentioned earlier in the text. It is still
unclear what the general rules for segmenting utterances into feet are.
Bertrán (1999), for
example, used a traditional method of segmentation. He segmented the
utterances under study into feet, from the onset of the stressed vowel
until the next stressed vowel, in order to measure the absolute duration
of feet.
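
A measurement of this kind can be sketched in Python as follows. The sketch assumes that the onset times of the stressed vowels have already been obtained from an annotated recording; the onset values below are invented, and the function name foot_durations is hypothetical rather than taken from Bertrán (1999).

```python
from typing import List, Sequence


def foot_durations(stressed_vowel_onsets: Sequence[float]) -> List[float]:
    """Return the absolute duration of each foot in seconds.

    A foot is measured from the onset of one stressed vowel to the
    onset of the next stressed vowel.
    """
    onsets = list(stressed_vowel_onsets)
    if any(later <= earlier for earlier, later in zip(onsets, onsets[1:])):
        raise ValueError("onset times must be strictly increasing")
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


# Invented onset times (in seconds) of four stressed vowels.
onsets = [0.12, 0.61, 1.05, 1.58]
durations = foot_durations(onsets)
print([round(d, 2) for d in durations])
```

Because each foot is closed by the following stressed vowel, n stressed vowels yield n - 1 measurable feet; the stretch after the last stressed vowel is left unmeasured under this definition.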
Once the stretches of speech are segmented, the question of the rhythmic
succession of units arises. Language timing is the way in which a
particular language distributes its rhythmic units across time. According
to this feature of speech, there are three types of timing:
syllable-timing, stress-timing, and mora-timing9, depending on which unit
is taken to be the unit of rhythmic succession: the syllable, the foot, or
the mora, respectively. On this view, each language belongs to one of the
three classes.
However, some linguists, including Roach (1982), claim that there is no
language which is totally stress-timed or syllable-timed (leaving
mora-timed languages aside for now, since this is not a widely accepted
classification). Since each language is a mixture of different segments,
Roach (1982) states that every language displays both sorts of timing
depending on the context and occasion. The main difference between
languages, however, lies in the distribution of the two types of timing
in a language, i.e. which type of timing predominates in the particular
language.
Gore (2004: 64) gives a very interesting example which illustrates the
fact that languages do look alike at some points with respect to rhythmic
properties. Namely, he points out that linguistic similarity in prosodic tim-
ing can be seen in the rhythm of counting from one to ten. The counting
is based on the timing of the heavy syllable and does not vary noticeably
from language to language, or among different age groups. Moreover, an
example from Japanese is also given. Although mora-timing prevails in
Japanese (according to more recent studies), some larger units can also
be found in common, everyday greetings. “In such utterances, the heavy
syllable is clearly the most prominent unit and the one that determines
the rhythm of the whole phrase” (Gore 2004: 64). Such examples can be
found even in English. The language of advertisements usually tends to
use such heavy syllables in order for an advertising slogan to sound more
exciting and to draw the attention of potential customers. For example:

(13) “Never stop playing” (McDonald’s 2007)


(14) “What you want is what you get” (McDonald’s 1993)

However, we should be careful with all the examples mentioned previously
since their language is highly marked in some way, and such examples
cannot be taken as typical representatives of the rhythmic patterns in
their respective languages.

9 Although not mentioned in earlier works on the topic of speech rhythm,
mora-timing is becoming more popular in contemporary works.

The classification of languages into the three classes mentioned above
is the most disputable topic in the study of speech rhythm and will be
dealt with in this book. However, before doing any further study on
speech rhythm in different languages, it is highly important to come to a
general agreement on how to segment utterances. Such agreement would
benefit the present study as well, because the studies done by different
linguists, and the results of those studies, would then be easily
comparable.

3 BETWEEN TRADITION AND REALITY:
Classification of Different Approaches to Rhythmic Studies

Although pauses, hesitations, and other forms of interrupting the
continuous flow of speech tend to disguise that fact, it can be said that all
human languages have rhythm. However, there are some languages, like
Chinese or Japanese, that may sound like “a machine-gun” (Lloyd James
1940), while when we hear an Italian speaking, it sounds like music.
Due to these perceptions, many people would disagree that all languages
have rhythm. Although many theories about language rhythm exist, the
question is whether they are valid since there is no empirical evidence to
support them.
Even though the studies of rhythm in poetry date back to ancient
Greek, Latin, and even Indian texts, the study of speech rhythm is rela-
tively recent in linguistics (for more on this, see Bjelica 2010: 28). Re-
searchers have taken at least three different approaches to this topic,
so their research methodologies differ in this respect (Patel 2008). All
the important studies analysed here can be classified into three differ-
ent groups depending on the approach taken in the study of rhythm in
spoken language.
The first approach is typological and it seeks to understand the rhyth-
mic similarities and differences among human languages. According to
this approach, languages are grouped into distinct categories according
to their speech rhythm property. One of the most influential and wide-
spread typological classifications is based on the notion of periodicity in
speech and classifies languages on the basis of whether they have stress-
timed rhythm (like English, Arabic, and Thai), or syllable-timed rhythm
(like French, Hindi, and Yoruba).
This approach was introduced by Kenneth Pike (1945) and accepted
later by many of his successors. As is evident from these few examples
of languages which fall into either of the two categories, membership in
a rhythmic class is not determined by the historical relationship of classi-
fied languages. This means that, on the basis of the typological approach,
rhythm can group languages which are otherwise quite distant both his-
torically and geographically.

The second approach to speech rhythm is theoretical or phonologi-
cal, and seeks “to uncover the principles that govern the rhythmic shape
of words and utterances in a given language or languages” (Patel 2008:
118). This type of research includes an area called metrical phonology
and tries to bring the study of speech rhythm in line with the rest of
modern linguistics by formalising rules and using these rules to observe
the phenomena of speech rhythm. The first linguist who proposed the
phonological account of rhythm, putting forward the rhythmic properties
of languages, was Rebecca Dauer (1987).
The third approach is perceptual and is said to examine the role that
rhythm plays in the perception of ordinary speech. The research done in
this area includes the perceptual segmentation of words from connected
speech and examining the effects of rhythmic predictability in speech
perception. Some later works use this particular approach (e.g. Ramus,
Nespor, and Mehler 1999, Ramus et al. 2000, Tatham and Morton 2001,
Ramus 2002, Setter and Ordin 2008, etc.).

3.1 Typological Approach to the Study of Speech Rhythm

Daniel Jones (1978: 240), in his book called An Outline of English
Phonetics (1918, reprinted in 1978), notices that in every spoken word
or phrase there is at least one sound which is perceived as louder than
the sounds next to it. This high prominence of certain sounds may be
the result of inherent sonority, length, stress, or special intonation, or
the combination of all of these factors (Jones 1978: 55). These “peaks
of prominence” (as he calls them, as opposed to “troughs”, which denote
minimal prominence) are said to be easily counted in a word or a phrase.
He also noticed the pattern in speech according to which these highly
prominent syllables, i.e. stressed syllables, tend to follow each other “as
nearly as possible at equal distances” in connected speech (Jones 1978:
237).
Jones (1978: 242) pointed out that those syllable quantities which
tend to regularly follow each other are not the lengths of syllables but the
lengths separating the “stress-points” or “peaks of prominence” of the
syllables. He claims that one of the principal characteristics of rhythm
in the English language is that these “interstress spaces” are
approximately of equal length, i.e. that they are isochronous. By
interstress spaces he means the stretch of speech between two consecutive
stressed syllables. Also, many other authors who dealt with this issue
give similar
definitions. For example, M. A. K. Halliday, in his book called An In-
troduction to Functional Grammar (1985) expresses his opinion in the
following way: “[…] there is a strong tendency in English for the salient
syllables to occur at regular intervals; speakers of English like their feet
to be all roughly the same length” (quoted in Bertrán 1999: 2).
André Classe, in his book called The Rhythm of English Prose (1939),
measured the quantity of syllables of different phonetic types, in differ-
ent phonetic places in relation to stress groups and grammatical struc-
ture. He tested some of the rhythm theories of Daniel Jones and con-
cluded that “an English sentence is normally composed of a number of
more or less isochronous groups which include a varying number of
syllables” (quoted in Steiner 2004: 3). He also concluded that, while the
length of syllables must vary, stress groups tend to have approximately
the same duration, although containing a different number of syllables.
He explained this as an effect of the increased speed of articulation for
the longer groups, which seems to be the result of a desire to make stress
groups isochronous. This approach was adopted and further elaborated
by Kenneth Pike (1945), among others.

3.1.1 The Rhythm Class Hypothesis: Stress-timed and
Syllable-timed Languages

Arthur Lloyd James, in his work called Speech Signals in Telephony
(1940), was one of the first writers to discuss in detail speech rhythm in
language and even to note down differences among languages concern-
ing this issue. He says in his work that languages like Spanish or French
have a type of rhythm in language which he described as a “machine-gun
rhythm”.
He used this metaphor because each underlying rhythmical unit is of
the same duration, similar to the transient bullet noise of a machine-gun.
On the other hand, languages like English tend to sound more like the
Morse code, and hence the term “Morse code rhythm” for such languages.
Lloyd James coined these terms “machine-gun rhythm” and “Morse code
rhythm”10 in order to draw attention to different perceptions of speech in
different languages11. His principle of classifying languages according to
the perception of the hearer was adopted by his followers like Kenneth L.
Pike and David Abercrombie, but was criticised in much later works by
Roach and Dauer, among others. They criticised this perceptual approach
because it is said to observe the phenomenon of speech rhythm from only
one point of view – the perception of speech, while its acoustics is left
aside, maybe due to the lack of empirical evidence.
This difference between languages on the perceptual level was adopt-
ed by Kenneth L. Pike in his book called Intonation of American Eng-
lish (1945). He created the most influential typology of language rhythm
based on the notion of periodicity in speech. Namely, his theory of
speech rhythm was based on a dichotomy between languages in terms of
syllable and stress patterns. Pike (1945) replaced Lloyd James’s metaphors
with the more convenient terms “syllable-timed rhythm” (for “machine-gun
rhythm”) and “stress-timed rhythm” (for “Morse code rhythm”).

10 Found in Abercrombie (1965).
11 Besides terms “Morse-code like” and “machine-gun like” rhythm, Crystal
(1996: 8) also mentions terms like “bouncing”, “heart-beat”, and “tum-te-tum”
for the former type of rhythm, and “staccato”, “pattering”, and “rat-a-tat” for the
latter type of rhythm, which characterize these different auditory impressions.

These terms were coined on the basis of Pike’s theory according to which
languages differ from each other as to which movements will periodically
recur12. According to his classification, languages like Spanish and
French are said to be “syllable-timed”, based on the idea that syllables
last roughly the same amount of time, i.e. they are pronounced in roughly
equal temporal intervals. On the other hand, according to Pike, there are
languages such as English that are said to be “stress-timed”, based on
the assumption that they have roughly equal temporal intervals between
stresses, stress-points, or peaks of prominence, as Jones (1978) calls
them. To illustrate stress-timed rhythm in English, Pike points out that
in the following example the reader can notice “the more or less equal
lapses of time between the stresses in the sentence” (Pike 1945: 34):

(1) The teacher is interested in buying some books.
    The |teacher is| interested in| buying some| books|
(Pike 1945: 34)

The vertical lines in the example show the division of the sentence
into rhythmic units which, according to Pike, tend to last approximately
an equal amount of time. Each unit has one stressed and a number of un-
stressed syllables following or preceding it. For comparison, he provides
yet another example in order to show that despite the different number
of syllables, the intervals between stressed syllables are approximately
equal:

(2) Big battles are fought daily.
    |Big |battles are| fought |daily|
(Pike 1945: 34)

12 Since this book deals with both English and Serbian, it would be appropriate
to offer a terminology in Serbian as well. However, I have not come across any
translation of the terms “syllable-timed” and “stress-timed,” so I am forced to
offer the Serbian descriptive equivalents “ritmično ponavljanje slogova” and
“ritmično ponavljanje naglašenih slogova” for the two terms respectively.

Apparently, rhythmic units have a different number of syllables (only one
stressed and a varying number of unstressed syllables), but they have a
similar time value. In order for them to be pronounced in a roughly equal
amount of time, unstressed syllables of longer rhythm units need to be
somehow “crushed together” and pronounced very rapidly. In order to
achieve evenly timed feet, syllables need to be contracted and compressed
to fit into the typical foot duration. Depending on the number of
syllables per foot, these unstressed syllables are sometimes so
contracted that they are barely audible. This is how Pike (1945) accounts
for many abbreviations which exist in English, in which syllables may be
omitted entirely, not only in pronunciation, but in orthography as
well13.
Jones (1978) also talks about the processes which make feet last
approximately the same amount of time. Namely, if a stressed syllable is
followed by a number of unstressed syllables, the vowel or diphthong of
the stressed syllable is generally shorter than if the stressed syllable
is followed by another stressed syllable or by a syntactic boundary.
Moreover, “the greater the number of following unstressed syllables the
shorter is the stressed vowel” (Jones 1978: 237). Evidently, not only
unstressed syllables but stressed ones as well are affected by the
processes of contraction and compression in order for the feet to be of
equal duration in time, thus producing the rhythmic succession of units.
On the other hand, languages like Spanish and French, which are
characterized by having a syllable-timed rhythm, according to Pike
(1945), have individual syllables which tend to come at approximately
evenly recurrent intervals of time. In this case, phrases with more than
one syllable are said to take proportionally more time and their syllables,
or vowels in those syllables, are less likely to be compressed, shortened,
or even omitted.

13 Some linguists, like Dauer (1987), tried to modify Pike’s theory on the basis
of the phenomenon of vowel reduction, which will be further elaborated in one
of the later sections.

Since it is said that in such languages syllables tend to last the same
amount of time, it seems that all syllables are thus of equal prominence
and duration. This consequently means that no syllable compression or
reduction is necessary. The syllables which are stressed more in the
process of word or phrase accentuation are said to be just extra strong
and extra long, but this does not affect the pattern of recurrent
syllabic prominence.

3.1.2 Isochrony Accepted: Physiology of Speech Production

The Rhythm Class Hypothesis proposed by Pike (1945) was adopted by David
Abercrombie in his books called Studies in Phonetics and Linguistics
(1965) and Elements of General Phonetics (1967). Abercrombie
went further on with the theory of speech rhythm by proposing a physi-
ological basis for Pike’s stress- versus syllable-timing. His contribution
to the theory was based on a specific hypothesis on how syllables are
produced. According to Abercrombie (1965), the most appropriate ap-
proach in defining the notion of syllable is the one that explains the syl-
lable in terms of the pulmonic air-stream mechanism. Speech depends
on breathing since the sounds of speech are produced when the air is
released from the lungs (by an air-stream from the lungs). However, this
air-stream is not released from the lungs in a continuous flow, but the
flow is rather “pulse-like” in nature.
There is a continuous and rapid fluctuation in the air-pressure, which
is the result of alternate contractions and relaxations of the breathing
muscles. Each of these muscular contractions, and the consequent rise
in the air-pressure, is a chest-pulse, since intercostal muscles in the chest
are responsible for it. Each chest-pulse is said to constitute a syllable.
That is why this process is called a syllable producing process, which is
the basis of human speech (Abercrombie 1965: 16-17). However, there
is yet another system relevant for human speech. This system, in part,
depends on the first one and consists of a series of less frequent, but
more powerful contractions of the breathing muscles which every now
and then coincide with, and reinforce, a chest-pulse, and cause more sig-
nificant and more sudden rise in the air-pressure. These movements in
the air-pressure constitute the system of stress-pulses. In human speech,
these two processes, the syllable-producing process and the
stress-producing process, are combined and their rhythm constitutes the
rhythm of speech.
Abercrombie (1965) proposed that in any given language one or the
other kind of pulse occurs rhythmically, equating rhythm with periodic-
ity, like Pike did before him (1945):

“Rhythm, in speech as in other human activities, arises out of the
periodic recurrence of some sort of movement, producing an expectation
that the regularity of succession will continue”.
(Abercrombie 1967: 96)

Speech rhythm is a product of the way these two processes are com-
bined in producing an air-stream for talking. Abercrombie (1965: 17)
points out that, in fact, the rhythm is there in the air-stream even be-
fore the actual vowels and consonants are produced in order to make
words. Furthermore, since the combination of these two processes and
their rhythm does not depend on the actual sounds of a language, we can
thus conclude that all the languages in the world have speech rhythm,
regardless of what their sound inventory is. People of all languages, in
order to speak, need to start from releasing the air-stream from the lungs,
and since the rhythm is in the air-stream itself, we can then conclude that
rhythm is a universal feature of all languages.
However, studies have shown that not all languages have the same
type of speech rhythm. This is because different languages co-ordinate
the two processes differently. The status of Serbian in this classification
is not clear since there seems to be a lack of studies in this area. In the
later chapters of this book some general conclusions will be made for
rhythmic properties of Serbian in order to see whether it is reasonable
to believe that Serbian is closer to syllable-timed than stress-timed lan-
guages.
Abercrombie (1967), thus, agrees with Pike (1945) that not all lan-
guages have the same type of rhythm, adopting his classification of lan-
guages into stress-timed and syllable-timed. Not only does he accept his
Rhythm Class Hypothesis, but he also proposes that one language cannot
belong to both groups at the same time, i.e. that the two types of speech
rhythm are mutually exclusive (Abercrombie 1967: 97). This means that
one language cannot have both stress-producing and syllable-producing
processes isochronous at the same time, but it is one or the other. For
example, if English has the stress-pulses isochronous, then the syllable
pulses cannot be isochronous, i.e. they will occur at unequal intervals
of time, and vice versa. This actually means that, if a language shows
a tendency towards uniform spacing between the stresses in feet with
different numbers of syllables, that language cannot have syllables of
the same utterance last an equal amount of time, i.e. in a given language
either all the syllables or only the intervals between stressed syllables
will last approximately an equal amount of time. One of the most
problematic points he made, and the most criticized one, is that he
grouped all the languages in the world into the two proposed classes. Not
only is this approach a bit utopian in assuming that such a large variety
of languages can be put in no more than two groups on the basis of their
rhythmic properties, but it is also the result of testing a small number
of languages in comparison to all the languages that exist, a great many
of whose properties were not available to linguists at that time.
Furthermore, Abercrombie agrees with Pike in saying that in order
to equalize the duration of interstress intervals in languages like Eng-
lish which are said to have stress-timed rhythm, some adjustments need
to be done “in order to fit varying numbers of syllables into the same
time interval” (Abercrombie 1967: 98). Since the unstressed syllables
are unequally distributed between the stressed ones, they are said to be
spoken at varying speeds to fill the spaces between the stressed
syllables. This produces an impression that the stressed syllables are
pronounced at equal intervals, with unstressed syllables sometimes being
contracted and compressed so as to fit into the intervals between two
stressed syllables.
The exact number of unstressed syllables is not important here. What is
important is that the more of them there are, the shorter they will be
in speech, thus producing the impression of isochrony, or in other
words, a tendency in English to place stresses at approximately equal
intervals of time. One result of the process of contracting unstressed
vowels is the weakest vowel in English, /ə/ (schwa). In certain
contexts, an unstressed vowel is so contracted that it is pronounced as
if it did not exist in the word. A few simple sentences illustrate the
phenomenon of vowel contraction:

(3) |John was| late.|
first foot: 2 syllables [stressed + unstressed]; second foot: one
syllable [stressed]
(4) |Jenny was| late.|
first foot: 3 syllables [stressed + 2 unstressed]; second foot: one
syllable [stressed]
(5) |Jennifer was| late.|
first foot: 4 syllables [stressed + 3 unstressed]; second foot: one
syllable [stressed]

Each of these three sentences consists of two stressed syllables
combined with a number of unstressed ones, but the number of unstressed
syllables varies as we change the subject of the sentence. In order for
the stressed syllables to follow one another at equal time distances, the
unstressed syllables of the first foot need to be compressed. According
to Jones (1978), some contractions affect the stressed syllable as well.
Here is one more example to illustrate this. In the sentence:

(6) What’s the | difference between a | sick | elephant and a | dead | bee?
        2                 5              1          5            1      1
    (Cruttenden 1986: 20)

although the number of syllables in each rhythmic unit varies
considerably due to the fact that there are more unstressed than stressed
syllables, the rhythmic units will be said in roughly the same amount of
time, even the group which has five syllables and the groups of only one
syllable (Cruttenden 1986: 20).
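The compression implied by strict stress-timing can be made concrete
with a small calculation. The sketch below is only an idealisation: it
assumes perfectly isochronous feet and a hypothetical foot duration of
0.5 s (a value not taken from any of the cited sources), and it divides
that time evenly among the syllables of each foot. The function name and
all figures are illustrative assumptions.

```python
# Idealised stress-timing: every foot lasts the same time, so the average
# syllable duration shrinks as the number of syllables in the foot grows.
# FOOT_DURATION is a hypothetical value chosen only for illustration.
FOOT_DURATION = 0.5  # seconds per foot (assumed)


def mean_syllable_duration(syllables_in_foot, foot_duration=FOOT_DURATION):
    """Average syllable duration if the foot duration is held constant."""
    if syllables_in_foot < 1:
        raise ValueError("a foot contains at least one (stressed) syllable")
    return foot_duration / syllables_in_foot


# First feet of examples (3)-(5): "John was", "Jenny was", "Jennifer was".
first_feet = {"John was": 2, "Jenny was": 3, "Jennifer was": 4}
durations = {foot: mean_syllable_duration(n) for foot, n in first_feet.items()}

# Syllable counts of the rhythmic units in example (6).
units_in_6 = [2, 5, 1, 5, 1, 1]
durations_6 = [mean_syllable_duration(n) for n in units_in_6]

for foot, d in durations.items():
    print(f"{foot}: {d:.3f} s per syllable")
```

Under these assumptions the syllables of a five-syllable unit would have
to be, on average, five times shorter than a unit consisting of a single
stressed syllable, which is the degree of compression that the
instrumental studies discussed later failed to find.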
Abercrombie (1967: 97), like Pike before him, approaches the issue
of rhythm from the point of view of perception. As he points out, “‘the
identity of speaker and hearer’ is essential to an understanding of many
aspects of speech perception” (Abercrombie 1967: 97). Not only the
speaker, who utters a stretch of speech, but also the hearer experiences
the rhythm of movement. He thus talks about “hearing” the rhythm of a
language, for which he was later criticized by some linguists, since he
does not provide any experimental way to prove the theory of speech
rhythm but rather leaves it to mere perception. Some later linguists,
like Ramus et al. (1999), tried to find correlates of linguistic rhythm
in the speech signal which is perceived by the hearer, i.e. to establish
what in the speech signal triggers the perception of rhythm. However,
since it remains elusive which physical events contribute to the
perception of rhythm as a feature of speech, it is widely believed that
rhythm is just a perceived effect which may or may not have reliable
acoustic correlates. Abercrombie (1967) further elaborates his theory of
rhythm perception by saying that rhythm is intuitively experienced
through “phonetic empathy”. This can only be achieved if the speaker and
the hearer have the same mother tongue. It can be illustrated by taking
verse as a perfect example of rhythm in language. English poetry will
not be appreciated in the same way by a native speaker of English and by
a native speaker of French who learned English at school, for example,
owing to the different rhythmic patterns of their native languages. The
same would happen if a French speaker tried to compose verse in English:
he or she would use the rhythmic patterns of his or her own language
(which differs from English in this respect), and many native English
speakers would probably not feel it to be English verse at all. Such a
clear-cut theory seems rather neat, but it has its drawbacks. Since
there has been no empirical evidence for this classification, the Rhythm
Class Hypothesis seems weak and thus prone to criticism. The
classification is said to rely on the perception of speech rather than
on any real evidence: listeners get the impression that there are two
different kinds of rhythm. That is why the first classification of this
kind, proposed by Lloyd James (1940), uses terms which merely describe
what people hear by comparing it to other, similar sounds.
Actually, the terms “machine-gun rhythm” and “Morse code rhythm”
aptly anticipate what researchers concluded many years later: rhythm
lies not in the production of speech but rather in its perception. Some
authors, like Tatham and Morton (2001), ask whether speakers can control
the isochrony of speech, or whether isochrony is only perceived. The
answer to this question would help in discovering how to synthesize
speech which sounds natural. Empirical measurements have failed to
provide any support for Pike’s theory of speech rhythm, or any valid
evidence that the isochrony of stresses or syllables really exists.
Abercrombie’s theory of chest pulses has also been attacked by later
linguists, who tested it experimentally and came to conclusions which
oppose the theory (Roach 1982, Dauer 1983, Ramus et al. 1999, etc.).
However, the existing theory, despite its many flaws, still persists.
One reason may be that it matches our subjective intuitions about
rhythm. Abercrombie (1967: 171) suggests that the idea of isochronous
stress in English dates back to the eighteenth century, although it was
first pointed out by Arthur Lloyd James in Speech Signals in Telephony
(1940), and further elaborated by K.L. Pike (1945). This means that even
without modern technology people were able to identify that stresses
in English tend to be isochronous. Another reason for the persistence
of this theory may be that it correctly groups together languages that
are perceived as rhythmically similar, even if the physical basis for this
grouping is not clearly understood.

3.1.3 Isochrony Questioned: Full-vowel Timing Theory

Yet another account of speech rhythm comes from Dwight Bolinger’s book
Two Kinds of Vowels, Two Kinds of Rhythm (1981). He adopts Abercrombie’s
hypothesis that vowels undergo some kind of reduction in unstressed
positions so that stressed syllables can follow one another at equal
temporal distances. However, he goes one step further in proposing that
there are actually two types of syllables: those containing full vowels,
which Abercrombie calls “stressed syllables”, and those containing
reduced vowels, which Abercrombie calls “unstressed syllables”. Bolinger
suggests that the most important factor is neither the number of
syllables nor the number of stresses, but the pattern made in any
section of continuous speech by the mixture of syllables containing full
vowels with syllables containing reduced vowels.
According to Bolinger, the basic unit of speech rhythm is a
full-vowelled syllable together with any number of reduced-vowelled
syllables that follow it. Each rhythm unit must thus contain one and
only one full-vowelled syllable.
There is one fundamental difference between Pike’s stress-timing
theory and Bolinger’s full-vowel timing theory which can be illustrated
using the following examples taken from Cruttenden (1986: 22):

(7) Those porcupines aren’t dangerous.

    Abercrombie: |Those  |por    |cu     |pines  |aren’t |dan    |ge     |rous   |
                  U       S       U       U       U       S       U       U

    Bolinger:     Those  |por    |cu     |pines  |aren’t |dan    |ge     |rous.
                  F       F       F       F       F       F       R       R

    The wallabies are dangerous.

    Abercrombie: |The    |wal    |la     |bies   |are    |dan    |ge     |rous   |
                  U       S       U       U       U       S       U       U

    Bolinger:     The    |wal    |la     |bies   |are    |dan    |ge     |rous.
                  R       F       R       R       R       F       R       R

Stress-timed isochrony (Pike 1945 and Abercrombie 1967) would
suggest the same rhythm in both sentences: namely, the two sentences
are said to contain two “rhythm-groups” (Cruttenden 1986: 20) with
an unstressed syllable at the very beginning (Cruttenden 1986: 21 calls
such initial unstressed syllables at syntactic boundaries “anacrusis”).
Contrary to this, Bolinger’s full-vowel timing suggests that there are
six rhythmic units in the first example (the three syllables of
“dangerous” make one single unit) and only two units in the second
example (since there are only two full vowels, each combined with a
number of reduced vowels to make a unit). The central idea behind
full-vowel timing is that a reduced-vowel syllable which follows a
full-vowel syllable “borrows” time from the full vowel, so that together
they are roughly equal in duration to a single full-vowel syllable,
which can be a rhythmic unit on its own.
However, any further reduced-vowelled syllable, i.e. one which follows a
reduced-vowelled syllable that is itself right next to a full-vowelled
syllable, does not borrow time from the full-vowel syllable, which means
that it adds to the length of the rhythmic unit.

wal|la|bies
wal: a full-vowel syllable
la: a reduced-vowel syllable which borrows time from the preceding
full-vowel syllable
bies: a reduced-vowel syllable which does not borrow time from the
full-vowel syllable
Full-vowel timing thus seems to account for the instrumentally
measured facts of English syllable durations more successfully than
stress-timed isochrony. According to it, rhythm-groups which consist of
an unequal number of syllables (one full-vowelled and a number of
reduced-vowelled syllables) cannot have the same duration, since only
the first reduced-vowelled syllable borrows time from the full-vowelled
syllable, while the reduced-vowelled syllables which follow only add to
the duration of that particular rhythm-group. This cannot, however, lead
us to discount completely some tendency towards stress-timed isochrony,
since without it there would be no reason for the reduction of some
syllables, i.e. the reduction of the vowels which make up the unstressed
syllables. Bolinger (1965) showed that the duration of interstress
intervals is influenced by the specific types of syllables they contain
as well as by the position of the interval within the utterance.
Interstress intervals thus do not seem to have a constant duration, as
was predicted by the theory of isochrony proposed by Abercrombie (1967)
and rejected for the first time by Roach’s experimental study (1982).
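The grouping and the “borrowing” principle described above can be
sketched in a few lines of code. This is a toy model rather than
Bolinger’s own formalisation: the two duration constants are
hypothetical values in arbitrary units, and the function names are
introduced here only for illustration. F stands for a full-vowel
syllable and R for a reduced-vowel syllable, as in example (7).

```python
# Sketch of Bolinger-style full-vowel timing. F = full-vowel syllable,
# R = reduced-vowel syllable. Durations are hypothetical, in arbitrary units.
FULL = 1.0      # assumed duration of a full-vowel syllable on its own
REDUCED = 0.4   # assumed duration added by each non-borrowing reduced syllable


def rhythm_units(pattern):
    """Group syllables into units: one F plus all R syllables that follow it.

    Reduced syllables before the first F (anacrusis) are returned separately.
    """
    anacrusis, units = [], []
    for syllable in pattern:
        if syllable == "F":
            units.append(["F"])
        elif syllable == "R":
            (units[-1] if units else anacrusis).append("R")
        else:
            raise ValueError(f"unknown syllable type: {syllable!r}")
    return anacrusis, units


def unit_duration(unit):
    """The first R borrows time from the F; every later R adds REDUCED."""
    extra_reduced = max(len(unit) - 2, 0)
    return FULL + extra_reduced * REDUCED


porcupines = list("FFFFFFRR")   # Those porcupines aren't dangerous.
wallabies = list("RFRRRFRR")    # The wallabies are dangerous.

for name, pattern in [("porcupines", porcupines), ("wallabies", wallabies)]:
    anacrusis, units = rhythm_units(pattern)
    print(name, len(units), [unit_duration(u) for u in units])
```

With these assumed values the first sentence yields six units and the
second only two, and units with more than one reduced syllable come out
longer than the others, which is why the model does not predict equal
interstress intervals.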

3.1.4 Isochrony Rejected: Setting Grounds for Future Experimental Studies

One of the turning points in the study of speech rhythm was Peter
Roach’s paper “On the distinction between ‘stress-timed’ and
‘syllable-timed’ languages” (1982), which criticised Pike and
Abercrombie’s Rhythm Class Hypothesis. According to Roach, Abercrombie’s
theory of speech rhythm has several drawbacks. First of all, he
criticises Abercrombie (1967) for being too categorical in saying that
all languages in the world belong to either of the two categories,
syllable-timed or stress-timed, without setting out clear rules for
assigning a language to one or the other category. Although it is easy
to give examples of utterances from different languages which support
this account of speech rhythm, the question of how to set out rules for
classifying languages into the two groups is rather problematic. The
answer is hard to test experimentally, and there is no empirical
evidence that languages really belong to either of the two groups.
Rather, Roach (1982) says that Abercrombie’s claims that the phonetician
needs to “empathize” with the speaker to apprehend speech rhythm, and
that people need to learn to listen differently in order to be able to
analyse speech rhythm, suggest that the distinction between stress-timed
and syllable-timed languages may rest entirely on perceptual skills
acquired through training. However, if someone has to be “trained” to
assign languages to one or the other category, there must already be a
person who knows how to do so and can act as a “trainer”.
As the second major problem of the existing theory, Roach points out
the lack of empirical evidence to support it and outlines the main
difficulties which linguists have faced in measuring aspects of rhythm
in continuous speech. He identified the need to test Abercrombie’s
hypothesis on spoken data by measuring time intervals in speech. Roach
wanted to test the hypothesis concerning the difference in syllable
length between syllable-timed and stress-timed languages, according to
which stress-timed languages show considerable variation in syllable
length, while syllable-timed languages have syllables that tend to last
the same amount of time. For this purpose, he set up a small corpus
which consisted of stretches of spontaneous, uninterrupted speech in the
six languages named by Abercrombie (1967) (English, Russian, and Arabic
as stress-timed, and French, Telugu, and Yoruba as syllable-timed
languages). The results of his experiment show that there is no
empirical evidence to support the claim that in syllable-timed languages
syllables are equal in length.
Another claim that Roach wanted to test in his experiment was that in
syllable-timed languages stress-pulses are unevenly spaced, while
languages like English have regular stress beats. He concludes that the
abovementioned isochrony is anything but straightforward. Namely, he
does not deny it entirely, but rather points out that it is more
apparent than real and that “listeners tend to perceive isochrony even
in sequences of interstress intervals that are manifestly far from
equal” (Roach 1982: 2). However, instead of rejecting isochrony
altogether on the basis of the corpus he himself tested, Roach was
realistic about his results: aware of the scarcity of the data used in
his experiment, he refrained from making generalisations and prepared
the ground for further studies which would test the hypothesis by
instrumental means. Not only did he identify the necessity of further
research in order to test Abercrombie’s claims, but he also stressed the
importance of testing more syllable-timed languages, since there is a
disproportion between the studies done for this group of languages and
those done for English as a typical representative of the stress-timed
category. One of the languages which should obviously be included in
this type of study is Serbian.
Furthermore, in order to run a relevant instrumental study of
languages, some agreement needs to be reached about the characteristics
of the test to be used in the experiment. Roach (1982) recognizes three
major problems in designing a measurement-based test. The first is the
identification of stresses, i.e. deciding what counts as a stressed
syllable and what does not. Since at that time there was no instrumental
technique for identifying stressed syllables, a phonetician needed to do
it auditorily, by identifying the peaks of prominence and, consequently,
the stress placement in an utterance, something which is difficult for a
non-phonetician to do, especially in spontaneous speech. Even a
specialist in this area can be subject to many influences, since a
phonetician’s intuitions about his or her native language may interfere
with judgements about other languages. An additional problem can arise
from the disagreement among phoneticians on what a syllable is and how
to segment a stretch of speech into syllables. The same disagreement
occurs when it comes to defining stress and dividing speech into feet,
which makes the problem even more serious. The second problem, according
to Roach (1982), is that of identifying and, consequently, measuring the
interstress intervals in an utterance, since it is disputable where the
measurement of such an interval should start. Some researchers have
measured it from the intensity peak of the vowel in the stressed
syllable to the intensity peak of the following stressed syllable, while
others, including Roach, thought that it would be intuitively more
satisfying to start from the phonological beginning of the stressed
syllable, including not only the vowel of the stressed syllable but also
the consonant cluster, if the syllable starts with one. The third
problem is that, as Roach admits, although spontaneous, uninterrupted
speech seems to be the most suitable for research, it is likely to be
“heavily influenced by tempo variations” (Roach 1982: 3) in different
contexts and social occasions. Consequently, a language would exhibit
different types of timing depending on the context and occasion, which
leads to the conclusion that no language is exclusively stress-timed or
syllable-timed. Some later authors support this view by saying that one
and the same utterance can have different timing patterns depending not
only on the communicative situation (context) and on speech tempo, but
also on the emotional expression of the speaker (Cummins 2002: 2).
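The kind of measurement at issue here can be illustrated with a minimal
sketch. It is not a reconstruction of Roach’s own procedure: it simply
assumes that the onsets of the stressed syllables have already been
identified (the onset times below are invented) and expresses the
variability of the resulting interstress intervals as a coefficient of
variation, a statistic chosen here for convenience.

```python
# Interstress intervals measured from the onset of one stressed syllable to
# the onset of the next. The onset times below are invented for illustration.
from statistics import mean, pstdev


def interstress_intervals(stress_onsets):
    """Durations between successive stressed-syllable onsets (seconds)."""
    return [b - a for a, b in zip(stress_onsets, stress_onsets[1:])]


def coefficient_of_variation(intervals):
    """Standard deviation divided by the mean; 0 means perfect isochrony."""
    return pstdev(intervals) / mean(intervals)


onsets = [0.10, 0.55, 1.20, 1.60, 2.45]   # hypothetical onset times
intervals = interstress_intervals(onsets)
variability = coefficient_of_variation(intervals)
print([round(i, 2) for i in intervals], round(variability, 3))
```

Choosing a different measuring point, for instance the intensity peak of
the stressed vowel instead of the syllable onset, would change the list
of onset times and therefore the result, which is exactly the second
problem raised above.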
Finally, even if we decide to adopt the notion of isochrony and the
existing types of rhythm, Roach warns us that no language is totally
syllable-timed or totally stress-timed, as Abercrombie believed; rather,
every language displays both sorts of timing. “Languages will, however,
differ in which type of timing predominates” (Roach 1982: 5). Moreover,
the theory which divides languages into syllable-timed and stress-timed
depends only on the intuitions of speakers of various Germanic, mostly
stress-timed, languages, and more effort needs to be made to examine
more languages, especially those belonging to the syllable-timed group,
in order to see whether the distinction is empirically supported or just
based on the subjective impressions of the listener. The only way to
prove the validity of Pike’s classification is by designing a test which
would measure acoustic or articulatory information and thus confirm or
reject the idea that certain information “hidden” in the acoustic signal
triggers the perception of one or the other type of rhythm.

3.1.5 Rhythmic Studies in Serbian

At the very beginning it must be said that speech rhythm has been a
neglected topic in Serbian linguistics. This may be due to the fact that
accent in Serbian is a more complex issue than accent in English, encom-
passing not only the prominence of a stressed syllable but also the infor-
mation on tone, duration, and intonation. The problem which this study
was faced with at the very beginning was the lack of relevant rhythmic
studies done on a Serbian corpus. Some of the linguists who dealt with
this issue were Pavle Ivić and Ilse Lehiste, Jelica Jokanović–Mihajlov,
and Slobodan T. Jovičić.
Jelica Jokanović–Mihajlov (1990) discusses the models of rhythmic
organization in utterances, more precisely the prosodic features of Serbi-
an spoken utterances, focusing mainly on temporal organization of their
segments. She points out that focusing only on one element of prosody,
neglecting all the others, is rather difficult, especially having in mind that
Serbian pitch accent is composed of several elements.
This groundbreaking study suggests a unit of rhythmic organisation
which closely resembles that of English. Although the syllable is said
to be the basic articulatory and acoustic unit of speech in Serbian, the
basic unit of rhythmic organisation is said to be something beyond the
syllable. The combination of different syllables produces a word which
has its own prosodic contour, and such a word, with its own prosodic
organisation, is said to be an accentual word. Most lexical words in
Serbian have their own accents, while some words need to lean on a word
in front of them or behind them in order to receive an accent. This
means that an accentual word does not always correspond to a
phonological word. Thus an accentual word is said to be a word or a word
combination with only one accentual pattern (Jokanović–Mihajlov 1990:
108). In connected speech, utterances are organised as sequences of
accentual words. Between the accented syllables in an utterance there
are a number of unaccented syllables which are inferior both in
prominence and quantity. Thus an utterance is, in fact, a succession of
pulses which correspond to the accented syllables, with all the
unaccented ones in between.
Many Serbian linguists deny the existence of rhythmic organisation in
prose. They say that it should be reserved for poetry, since it is
poetry that aims to produce such a rhythmic effect. However,
Jokanović–Mihajlov (1990) carried out an experiment on Serbian data in
order to show that this type of rhythmic organization is also present in
prose texts and in everyday conversation. She points out that our
impression that rhythm does not exist in speech comes from the fact that
in such contexts listeners tend to concentrate on semantic and syntactic
units, while rhythmic pulses are neutralized and thus hardly
perceivable.
According to Jokanović–Mihajlov (1990), the semantic properties of an
utterance cannot be neglected because the listener’s attention is
concentrated on them.
Thus, instead of taking the syllable as the unit of rhythmic
organization, she suggests a rhythmic group which encompasses one or
more phonetic words. Such groups are said to be both semantically and
syntactically coherent and, as such, are perceived as the basic units of
rhythmic organization. One rhythmic group is said to be a semantic unit
with only one accentual pattern.
Syllables, on the other hand, do not carry any semantic content un-
less combined with other syllables: such combinations usually consist of
one accented and a number of unaccented syllables since one accentual
pattern is said to have only one syllable which bears accent. We can-
not help but notice that in this respect, Serbian looks a lot like English
and that these rhythmic groups are nothing more than units similar to
English feet. This brings us to the conclusion that in this respect Serbian
behaves like a typical stress-timed language, since the orientation point
in segmentation and thus understanding of an utterance is said to be the
stressed (accented) syllable.

3.2 Phonological Approach to the Study of Speech Rhythm

Over the years of studying speech rhythm, researchers have tried to
develop methods and reliable means of testing the perceived isochrony of
rhythmic units. After the turning point brought about by Roach’s
experiment, which rejected isochrony as a physical reality and left it
within the scope of subjective perception, there have been numerous
approaches to speech rhythm which set aside the notion of isochrony.
Once reliable instrumental means had been developed, most of these
approaches relied upon them to refute the Rhythm Class Hypothesis
proposed by Pike (1945) and Abercrombie (1967). However, it should be
stressed that although isochrony seems to be more apparent than real,
the idea of speech rhythm should not be discarded.
In order to test Roach’s claim that stress-timed languages are likely
to have a more complex syllable structure than syllable-timed languages
and, unlike them, to exhibit vowel reduction, Rebecca Dauer (1983, 1987)
developed a phonological account of speech rhythm, based on the
observation that languages exhibit different properties in speech
production. Having realized that rhythm was nothing more than the total
effect of a number of different components (phonetic and phonological,
as well as segmental and prosodic in nature) and a property of all
languages, she laid the ground for a general phonological theory. The
need for such an account came from the fact that one of the major
drawbacks of Pike’s classification is that it groups languages into the
two categories without stating any parameters for assigning languages to
one or the other category. Roach’s experiment showed that the
classification could not be based only on measurements of time intervals
in speech. Moreover, since the concepts of both syllable and stress lack
general phonetic definitions, a purely phonetic definition of the rhythm
classes is impossible.
According to Dauer (1987), the components of speech which allow us to
compare languages on the basis of speech rhythm include the relative
length, pitch, and segmental quality of accented and unaccented
syllables, as well as phonological components such as syllable structure
and the function of accent. Dauer points out that rhythm can be broken
down into these components and that, like any other distinctive feature,
each of them can be assigned a plus or a minus value. According to her,
languages and language varieties differ in their combinations of these
values, since rhythm is said to be the total effect of combining all the
features mentioned above.
Dauer recognized three features of speech which are said to influence
rhythm in spoken language and to which she assigned [+], [0], and [-]
values¹⁴:
1) Duration;
2) Syllable structure;
3) Vowel reduction.

A language is assigned one of the three values on the basis of the
extent to which it exhibits a particular feature. Namely, if a language
is marked positively with respect to duration, accented heavy syllables
tend to be longer than unaccented light syllables. As examples of
languages which have regularly longer accented than unaccented
syllables, Dauer mentions English and Serbo-Croatian. On the other hand,
in languages which are assigned the [-] value, accent does not influence
the duration of syllables, whether accented or unaccented.
Concerning the second property, syllable structure, or more
appropriately syllable complexity, languages which are marked positively
have a high percentage of complex syllables. In such languages, like
English or German, a great number of syllables have complex consonant
clusters in both the onset and the coda (three or even four consonants
in a cluster), whereas in languages which are marked negatively for this
feature, like Italian, most complex syllables have at most a
double-consonant coda and a single consonant in the onset. Moreover,
those languages are said to have simpler syllable structures, the
predominant syllable types being CV and CVC. Such languages are also
said to exhibit many active processes which break up or prevent the
formation of particularly heavy syllables (an example of such a language
is Japanese). Simple syllable structures include the syllables which
lack consonant clusters in either the onset or the coda, namely
structures such as CV, V, VC¹⁵, and CVC (and even C if there is a
syllabic consonant). On the other hand, all syllable structures which
involve any kind of consonant cluster will be treated as complex (CCV,
VCC, CCVC, CVCC, CCCV, among others). However, some authors, like
Dankovičová and Dellwo (2007), treat even CVC structures as complex, so
in order to compare the statistics reported for several languages,
including Serbian (Jovičić 1999), the same principle will be adopted in
this study.

¹⁴ Dauer (1987) mentions yet another feature, quantity distinctions, but
it will not be discussed in the present study since not all languages
(including English) exhibit this feature.
¹⁵ Dauer (1987) does not make a distinction between syllables and moras.
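The two conventions for separating simple from complex syllables can be
stated as a short procedure. The sketch below assumes that syllables are
already transcribed as CV templates; the function names, the flag
`cvc_is_complex`, and the sample list are illustrative assumptions and
do not come from Dauer (1987) or Dankovičová and Dellwo (2007).

```python
# Classify CV-skeleton syllable templates as simple or complex.
# Dauer's convention: only templates with a consonant cluster are complex.
# The convention adopted in the study: CVC also counts as complex.
def has_cluster(template):
    """True if two or more consonants are adjacent (e.g. CCV, VCC)."""
    return "CC" in template


def is_complex(template, cvc_is_complex=True):
    if set(template) - {"C", "V"}:
        raise ValueError(f"not a CV template: {template!r}")
    if has_cluster(template):
        return True
    return cvc_is_complex and template == "CVC"


def percent_simple(templates, cvc_is_complex=True):
    """Percentage of simple syllables in a list of templates."""
    simple = [t for t in templates if not is_complex(t, cvc_is_complex)]
    return 100 * len(simple) / len(templates)


sample = ["CV", "CV", "V", "CVC", "CCV", "VC", "CVCC", "C"]  # invented sample
print(percent_simple(sample, cvc_is_complex=False))  # Dauer's convention
print(percent_simple(sample, cvc_is_complex=True))   # convention of this study
```

The same sample yields a lower percentage of simple syllables under the
stricter convention, which is why figures from different studies are
comparable only if they treat CVC in the same way.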
With respect to vowel reduction, i.e. whether or not the same vowel
system and a similar articulation are found in all syllables regardless
of context, languages which are positively marked for this feature are
said to exhibit vowel reduction in unaccented syllables, while the
maximal vowel system is used only in accented syllables (English).
Unaccented weak forms undergo reductions in the length of sounds,
centralisation of vowels towards /ə/, and very often the elision of
vowels and consonants (Gimson 1978: 263). On the other hand, languages
which lack this feature (and are marked negatively for it) are said to
have the same vowel system and a similar articulation for all syllables,
independent of accentuation. If processes like reduction in the length
of vowels or elision do exist, they affect accented and unaccented
syllables equally and are determined by the phonetic environment rather
than by accent. Such languages are Spanish and Japanese (Dauer 1987).
As the combination of values assigned to each feature makes clear,
languages with more “pluses” than “minuses” are said to be stress-timed.
Such languages are said to have “strong stress” (Dauer 1987) and tend to
maximize the differences between accented and unaccented syllables.
Because of a dynamic and “expiratory” accent, accented syllables have
longer duration and their vowels are fully realized, while unaccented
syllables have shorter duration and tend to be reduced in length or
centralized. On the other hand, languages with more “minuses” are said
to be syllable-timed, since their vowels are of roughly equal duration
and are fully realized in all positions.
Moreover, stress-timed languages have more complex syllable structures
than syllable-timed languages and, as a result, the syllables in
stress-timed languages tend to be heavier, making them suitable for
carrying stress. On the other hand, in syllable-timed languages, stress
and syllable weight tend to be independent. “This in turn, creates the
impression that there are different types of rhythm” (Ramus, Nespor and
Mehler 1999: 5). What is interesting about Dauer’s approach is the third
value that she introduced, [0] (zero). Languages which are marked [0]
for a certain feature are said to exhibit that feature partially. For
example, if a language is marked [0] for duration, it is said to have
accented syllables only slightly longer than unaccented ones. Similarly,
if a language is marked [0] for vowel reduction, its unaccented
syllables are not necessarily reduced or centralized, although some
processes, like devoicing or raising, still occur in them (examples are
Russian and Portuguese). By introducing this third value, Dauer (1987)
tried to contribute to the theory by suggesting that the distinction
between different types of languages on the basis of speech rhythm is
not bimodal but scalar. Thus, languages should not be put into either of
the two existing categories but placed along a continuum depending on
the extent to which each feature is present in a language. This means
that these properties are cumulative: typical stress-timed and typical
syllable-timed languages lie at either end of the continuum, with less
typical ones placed between them.
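The idea of a continuum built from cumulative feature values can be
sketched as follows. Only the English profile reflects what is said in
the text; “Language A” and “Language B” are hypothetical profiles, and
summing the three marks with equal weight is a simplifying assumption
made here for illustration, not a procedure proposed by Dauer.

```python
# Dauer-style feature profiles: each feature is marked "+", "0" or "-".
# Summing the marks places a language on a continuum rather than in a class.
VALUE = {"+": 1, "0": 0, "-": -1}
FEATURES = ("duration", "syllable_structure", "vowel_reduction")


def rhythm_score(profile):
    """Sum of feature marks: higher = more stress-timed, lower = more syllable-timed."""
    return sum(VALUE[profile[feature]] for feature in FEATURES)


profiles = {
    # English is described in the text as positive for all three features.
    "English": {"duration": "+", "syllable_structure": "+", "vowel_reduction": "+"},
    # The two profiles below are hypothetical, for illustration only.
    "Language A": {"duration": "0", "syllable_structure": "-", "vowel_reduction": "0"},
    "Language B": {"duration": "-", "syllable_structure": "-", "vowel_reduction": "-"},
}

continuum = sorted(profiles, key=lambda name: rhythm_score(profiles[name]))
print(continuum)
```

A language with mixed or [0] values, such as the hypothetical Language
A, ends up between the two extremes instead of being forced into one of
the two traditional classes.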
To support these claims, Dankovičová and Dellwo (2007) performed
an experiment which showed that the languages traditionally classified
as syllable-timed (French and Italian) indeed have a much lower per-
centage of complex syllables than those traditionally classified as stress-
timed (English and German). Namely, in French, about 80% of syllables
consist of a single vowel or a consonant and a vowel (CV or, less fre-
quently, VC structure). In Italian, this percentage is even higher – 90%.
On the other hand, the percentage of complex CVC or CCV syllables
is not more than 10% in Italian and 18.3% in French, with a negligible
number of CVCC and CCVC syllables – 1.2% in Italian and 2.3% in
French (Dankovičová and Dellwo 2007: 1241).
English and German spoken examples, by contrast, contained a
considerably higher percentage of complex syllables: in German, for
instance, complex syllables outnumber simple ones (only 35% of syllables
are simple). In English, the percentage of complex syllables is a little
over 50%. Both languages thus contain a considerable proportion of
complex syllables, which places them in the stress-timed group according
to Dauer’s parameters.
Czech has traditionally been classified as a syllable-timed language.
However, the studies of Dankovičová and Dellwo (2007), as well as other
linguists, have shown that this picture is far from clear. Czech’s
syllable complexity is far from that typical of syllable-timed languages
like French and Italian. Their experiment showed that Czech contains 65%
simple syllables, which is clearly less than in typical syllable-timed
languages like French and Italian, but much more than in a typical
stress-timed language such as German. This result supports Dauer’s view
that certain languages lie somewhere between the typically stress-timed
and the typically syllable-timed languages.
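
The comparison above rests on a simple count: each syllable is reduced to
its consonant-vowel skeleton, and the share of simple skeletons is
computed. A minimal Python sketch of that count follows. The set of shapes
treated as simple (V, CV, VC) follows the description given above for
French; the function name, the toy sample, and the idea of ranking
languages by this single figure are assumptions of the sketch, not the
procedure of Dankovičová and Dellwo (2007). The percentages in
REPORTED_SIMPLE_SHARE are the ones quoted in the text; English is left
out because only an approximate figure is given for it.

```python
# Shapes counted as "simple" in this sketch: a lone vowel, or one consonant
# plus one vowel (an assumption based on the description in the text).
SIMPLE_SHAPES = {"V", "CV", "VC"}


def simple_share(skeletons):
    """Return the percentage of syllables whose CV skeleton is simple."""
    if not skeletons:
        raise ValueError("at least one syllable is required")
    simple = sum(1 for shape in skeletons if shape in SIMPLE_SHAPES)
    return 100.0 * simple / len(skeletons)


# Hypothetical toy sample of syllable skeletons (not data from any study).
sample = ["CV", "CVC", "CV", "V", "CCVC", "CV", "VC", "CCV", "CV", "CV"]
print(f"simple syllables in toy sample: {simple_share(sample):.1f}%")

# Simple-syllable percentages as quoted in the text above.
REPORTED_SIMPLE_SHARE = {"Italian": 90, "French": 80, "Czech": 65, "German": 35}

# Sorting by this one figure gives a rough, one-dimensional ordering.
ordering = sorted(REPORTED_SIMPLE_SHARE, key=REPORTED_SIMPLE_SHARE.get,
                  reverse=True)
print(" > ".join(ordering))
```

In this ordering Czech falls between French and German, which is the
intermediate position described above. A single percentage is of course
only one of the properties Dauer considers.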

Figure 1: Syllable complexity for Czech, English, French, German, and
Italian (Dankovičová and Dellwo 2007: 1242)

Serbian has a variety of syllable structures but an obvious preference
for simple syllables. According to Jovičić (1999: 95), 73% of Serbian
syllables have simple structures (V, CV, VC, or C), a property of typical
syllable-timed languages. In this respect, Serbian is similar to Czech,
since it exhibits a high percentage of simple syllables. However, this
percentage is higher than in Czech and thus closer to that of French, a
typical syllable-timed language. On the other hand, slightly over 22% of
syllables exhibit CVC, CCV, or VCC structure, 3% of syllables have CVCC
or CCVC structure, while a negligible number of syllables (only 2% in
the data examined) have other complex structures like CCCV, CCCVC,
CCVCC, CCCVCC, CVCCC, etc. Although Serbian has a high percentage of
simple syllables, this great variety of possible syllable types is
similar to that of typical stress-timed languages.
Table 1: The frequency of existing syllable structures in Serbian
(examined on the data consisting of 401,076 syllables) –
taken from Jovičić (1999: 95)¹⁶

Rank   Frequency   Syllable structure   Total [%]
1      242,591     CV                   60.485
2      45,965      CCV                  11.458
3      43,565      CVC                  10.862
4      38,939      V                    9.708
5      11,327      CCVC                 2.824
6      10,112      VC                   2.521
7      3,046       CCCV                 0.759
8      2,593       CC                   0.646
9      1,007       CVCC                 0.251
10     554         C                    0.138
11     486         CCCVC                0.121
12     415         CCC                  0.103
13     331         CCVCC                0.0825
14     98          VCC                  0.0244
15     40          CCCC                 0.0097
16     7           CCCVCC               0.0017
17     5           CCCCV                0.0012
18     3           CVCCC                0.0007
19     1           CCCCC                0.0002
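
The percentages quoted in the running text (73% simple, slightly over
22%, 3%, and about 2%) can be recomputed from the frequencies in Table 1.
The Python sketch below does so. The frequencies are copied from the
table; the group labels, the function name, and the choice to normalise
by the sum of the listed counts rather than by the stated corpus size are
assumptions of this sketch.

```python
# Frequencies copied from Table 1 (Jovičić 1999: 95).
FREQ = {
    "CV": 242591, "CCV": 45965, "CVC": 43565, "V": 38939, "CCVC": 11327,
    "VC": 10112, "CCCV": 3046, "CC": 2593, "CVCC": 1007, "C": 554,
    "CCCVC": 486, "CCC": 415, "CCVCC": 331, "VCC": 98, "CCCC": 40,
    "CCCVCC": 7, "CCCCV": 5, "CVCCC": 3, "CCCCC": 1,
}

# Groupings as described in the text; the labels are this sketch's own.
GROUPS = {
    "simple": {"V", "CV", "VC", "C"},
    "CVC/CCV/VCC": {"CVC", "CCV", "VCC"},
    "CVCC/CCVC": {"CVCC", "CCVC"},
}


def group_shares(freq, groups):
    """Return the percentage share of each group, plus an 'other' bucket."""
    total = sum(freq.values())
    shares = {
        name: 100.0 * sum(freq[shape] for shape in shapes) / total
        for name, shapes in groups.items()
    }
    shares["other"] = 100.0 - sum(shares.values())
    return shares


shares = group_shares(FREQ, GROUPS)
for name, value in shares.items():
    print(f"{name:12s} {value:6.2f}%")
```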

16 Jokanović-Mihajlov (1990: 109) reaches a similar conclusion from her
data: the CV structure prevails in Serbian.

Figure 2: Syllable structure and syllable complexity in Serbian

In view of all of the above, it can be concluded that English and French
may be fairly near the extremes of this scale, while languages like
Spanish (analysed in Bertrán 1999), Czech (analysed in Dankovičová and
Dellwo 2007) and even Serbian (analysed in Jovičić 1999) fall on the
scale between the two extremes.
To conclude the discussion of the account of speech rhythm based on
phonological properties of language, it should be noted that Dauer’s and
Pike’s approaches differ in the point of view from which they observe
the isochrony of rhythmic units. Namely, Pike regards both syllable
structure and vowel reduction as consequences of isochrony: the
isochrony of rhythmic units triggers the reduction of unstressed
syllables, which in turn creates an impression that there are different
syllable types. On the other hand, according to Dauer’s approach,
isochrony in speech is perceived as a result of these two properties:
the combination of the two features in different languages triggers the
perception of isochronous segments in speech.
Many instrumental experiments were done in an attempt to apply the
existing Rhythm Class Hypothesis (Pike 1945) to various languages.
However, many of those experiments were not successful in proving that
the theory can be applied. They show that “a language cannot be assigned
to one or the other category on the basis of instrumental measurements of
interstress intervals or syllabic durations” (Dauer 1987: 447). According
to various discussions and experiments done, it can be concluded that the
Rhythm Class Hypothesis, founded on a notion of isochrony and widely
discussed and used in the past, has been refuted by instrumental means.
However, because of its universality (which is the tendency in all other
language theories) it has remained a popular view among many linguists.
Moreover, on the basis of all the tests mentioned so far which tested
the physical reality of the isochrony theory on stress- and syllable-timed
languages, it can be concluded that isochrony seems to exist only on the
level of subjective perception since there is no physical evidence to the
contrary. Future studies aim at the reformulation of the theory so that it
still includes isochrony, but on the perceptual level only.

3.3 Perceptual Approach to the Study of Speech Rhythm

Ever since people noticed that languages are spoken in a manner which is
perceived to be rhythmic, they have been describing their impressions of
how these languages “sound” to them. Rhythmic beats in the acoustic
signal which is sent to the listener sound similar to either a machine
gun (Spanish or Italian) or Morse code (English or Dutch), according to
Lloyd James (1940). This description of our perception of the spoken
word in several languages gave us, as a result, the classification of
languages made by Pike (1945) and widely used even today: languages are
spoken with either stress-timed or syllable-timed rhythm (according to
some other authors, there is also a third, mora-timed, category).
If we want to scientifically explain the process of speech production,
we have to take into account something more objective than a simple
description of what we hear. When the appropriate technology for testing
this type of data was established, instrumental studies showed a great
inconsistency with the starting hypothesis: the perceived isochrony in
the speech signal, either of syllables or feet, does not have much to do
with the physical reality. However, it seems that both the listener and
the speaker are aware of the perception of this “patterned temporal oc-
currence of pre-defined rhythmic units” (Tatham and Morton 2001: 3).
That challenged the phonologists to try to discover what it was in the
acoustic signal that triggered such perception, i.e. what the correlates
of linguistic rhythm in the speech signal were.
An extensive study in this area was done by Franck Ramus, Marina Nespor
and Jacques Mehler (1999). They realized the importance of finding
correlates of linguistic rhythm in the acoustic signal without relying
entirely on the subjective perception of isochrony. They support Dauer’s
view that languages differ in speech rhythm because their rhythm is a
product of their phonological properties, of which the most important are
syllable structure and vowel reduction.
Moreover, Dauer proposes the independent and cumulative nature of
the properties in question, which places languages along the continuum
depending on how much each property is present in a language. Typical
stress-timed and syllable-timed languages would be placed on the two
ends of the continuum, while the less typical ones, like Spanish, Polish,
or Catalan (among others) would be scattered along the continuum, de-
pending on how much each property is present.
A language can be near either of the ends depending on whether it has
more properties characteristic of stress-timed or of syllable-timed
languages or, as Antonio Pamies Bertrán (1999: 1) calls them, languages
of accentual isochrony and languages of syllabic isochrony, respectively.

3.3.1 Languages in the Middle: the Existence of Intermediate Languages

Once the values of the abovementioned properties are assigned to
languages, it becomes obvious that there are languages which will not
have all the “pluses” or all the “minuses”. There will be languages,
like Spanish, which seem to have some properties of the syllable-timed
group as well as share some properties with the stress-timed group of
languages. This is one of the reasons why Dauer (1987) suggested the
existence of the continuum on whose ends are the two extremes: languages
which are typically stress-timed, like English, on one end, and
languages which are typically syllable-timed, like French¹⁷, on the
other end of the continuum. In the middle of the continuum there are
so-called “intermediate languages”, i.e. languages whose phonological
properties match neither those of typical stress-timed nor those of
typical syllable-timed languages, or which share some properties typical
of both types of languages. Examples of such languages are Polish,
Catalan, and Czech.

17 Although French is widely accepted to be a typical representative of
syllable-timed rhythm, some French linguists do not agree with this
(Wenck and Wioland 1982). Ramus, Nespor and Mehler (1999) are also very
careful where French is concerned.

Catalan has often been described as a syllable-timed language, since
it has a syllable structure similar to Spanish, which means that Catalan
does not have a great variety of syllable types. However, studies done
for this particular language have shown that it exhibits vowel
reduction, a property typically associated with stress-timed languages.
On the other hand, there are languages, like Polish, which paint the
opposite picture: although Polish has a great variety of syllable types
and shows remarkable syllable complexity, it does not exhibit vowel
reduction at normal speech rates. Although Czech is traditionally
classified as a syllable-timed language, its status remains unclear.
Dankovičová and Dellwo (2007) follow the rationale that syllable struc-
ture of languages is responsible for their rhythmic characteristics. On
the basis of this, they studied the complexity of Czech syllables. Czech’s
syllable complexity is far from that typical of syllable-timed languages
like French and Italian. With slightly less than 65% of simple syllables,
it is closer to the stress-timed group with German and English as typical
representatives (with the percentage of simple syllables 35% and slight-
ly over 50% respectively) than to the syllable-timed group (French has
80% while Italian 90% of simple syllables). However, Czech does not
allow vowel reduction, a fact which places this language in the group
of syllable-timed languages. This property might be “stronger” from the
point of view of perception, neutralizing the syllable complexity feature,
which may be the reason why Czech is perceived to be a syllable-timed
language.
On the basis of the syllable structure property, Serbian should be
treated as a syllable-timed language (see Chapter 3.2). However,
Jokanović–Mihajlov (1990) suggests a kind of rhythmic organization
similar to that of English: the unit of rhythmic succession is said to
be a so-called rhythmic group which consists of one accented and a
number of unaccented syllables, a unit which resembles an English foot.
Thus, the event which tends to occur periodically is not any syllable
but the accented one. Moreover, Dauer (1987) suggests that Serbian is
similar to English concerning the difference in duration between
accented and unaccented syllables. Accented syllables in Serbian, as
well as in English, are said to last longer than unaccented syllables, a
fact which may contribute to the perception of the rhythmic groups
mentioned by Jokanović–Mihajlov (1990). However, the other phonological
property which is used to determine the status of a language in the
existing typology, vowel reduction, has not yet been thoroughly studied
by Serbian phonologists. Therefore, the crucial step in classifying
Serbian on the basis of the existing rhythmic typology would be to
determine whether Serbian allows vowel reduction or not.
Before Dauer (1987) placed these types of languages on her “rhyth-
mic continuum”, languages such as Catalan and Polish (and even Czech)
did not have their rhythmic status in phonology (Ramus, Nespor, and
Mehler 1999: 5). However, Dauer’s approach, although perfectly ac-
ceptable and sensible, fails to explain how rhythm is extracted from the
speech signal by the perceptual system. Also, Dauer does not explicitly
state where each intermediate language is placed on the rhythmic con-
tinuum. She does not offer the exact parameters which determine the
status of each language along the continuum. Rather, she states that these
“intermediate languages” are just “scattered along a continuum.” Moreo-
ver, she does not say how much each phonological property contributes
to the perception of rhythm, nor how these properties interact with each
other. For example, it is not clear whether in an intermediate language
like Catalan its exhibited vowel reduction property is “stronger” than its
variety of syllable types, which will place this language near the end of
the continuum reserved for typically stress-timed languages like English,
or it is the other way around.
Due to this drawback of Dauer’s approach, Ramus, Nespor and Me-
hler (1999) go one step further and pose a question whether there is a
possibility that there are more classes instead of a continuum. This idea
should be taken into serious consideration. For instance, because of a
number of different possible syllable types, it seems likely that there are
more classes in which these different syllable types are grouped, three
of which should correspond to the three existing classes (syllable-timed,
stress-timed, and mora-timed). All in all, this is an empirical question
which can only be answered after a detailed study of a great number of
unrelated languages, which requires exhaustive research in this
particular area.
In order to carry out such exhaustive research, it is necessary to
establish the steps the research will follow:

I. The first thing that needs to be done before any kind of analysis
takes place is data collection. In order to obtain easily comparable
data, the same method needs to be applied for all the languages in
question.
II. Secondly, some universal and more appropriate methodology
needs to be established for studying this phenomenon.
III. In the end, both things need to be approached: empirical research
and its possible interpretation, but without any presuppositions.
This means that we start from the very beginning, not bearing in
mind the existing typology of rhythm, not even presupposing that
languages have rhythm. That is the only way we can make an ob-
jective study and not be constrained by something which already
exists and for which we are not yet sure whether it is true or not.

3.3.2 Corpus Selection

In order to carry out research that would give us reliable results in
the study of speech rhythm, a corpus needs to be defined. Since the
topic at issue deals with spoken language, recorded speech data need to
be gathered. As is obvious from the typological approaches to speech
rhythm, most of the studies presented here deal with a limited number of
languages, probably the ones which were at the disposal of the
researchers. That is why most of the things they hypothesized about can
be easily applied to English but not to many other languages. When
conducting such an experiment, it is highly important to select the
corpus properly. Most of the experiments use a corpus of authentic data
such as literary works, like Dauer (1983), who uses a fragment of
literary prose, or recordings of spontaneous speech, like Ramus, Nespor,
and Mehler (1999), among many others, who use sentences taken from a
multi-language corpus initially recorded by Nazzi and his associates
(Language discrimination by newborns: towards an understanding of the
role of rhythm, 1998, mentioned in Ramus et al. 1999). Given all the
works discussed in this book, the corpora used can be categorized as
follows:

Table 2: Types of corpora

Type                 Examples                              Found in
Read text            Independent sentences                 Ramus, Nespor, & Mehler 1999;
                                                           Bertrán 1999;
                                                           Lehiste and Ivić (1986)
                     Literary passages¹⁸: poetry           Navarro Tomás 1922¹⁹
                     Literary passages: literary prose     Cummins 2002; Dellwo 2002
                     Literary passages: newspaper          Tatham & Morton 2001;
                     articles                              Jokanović–Mihajlov (1999)
                     Segments from radio programmes        Jokanović–Mihajlov (1999)
Spontaneous speech   On the particular topic: informal     Arai & Greenberg 1997
                     telephone conversation
                     On the particular topic: in-class     Setter 2008
                     presentation
                     Radio programmes (monologues and      Jokanović–Mihajlov (1999)
                     dialogues)
                     Picture description                   Setter 2008
                     Re-telling of a pre-read text         not found in analysed works

18 For more on what literary works were used in the studies of speech
rhythm, see Bjelica 2010: 62 & Appendix.
19 Mentioned by Bertrán (1999: 6).

For the purpose of their study, Ramus, Nespor, and Mehler (1999) used a
multi-language corpus which consisted of short news-like declarative
statements, whose number of syllables per sentence was in the range of
15 to 19, and each sentence had an average duration of about 3 seconds.
Every sentence was translated into all eight languages under study
(English, Dutch, French, Spanish, Italian, Japanese, Polish, and Catalan)
so that they had similar semantic content. What is interesting about the
corpus used is that all the sentences were initially in French, but were
later translated into the other seven languages. The exact translation
was not as important as the number of syllables each translation
contained. Bertrán (1999), on the other hand, did not use semantically
similar sentences in all seven languages he studied. On the contrary, he
mainly concentrated on sentences which have rhythmic units of different
length so that he could measure them without any difficulties. Tatham
and Morton (2001) chose to use four articles from the front page of The
Los Angeles Times issued on 25th December 2000, which were read by only
one female speaker with the general accent of Southern California. The
corpus consisted of half a dozen stories in marginally different
journalistic styles²⁰.
Most of the researchers used several already existing databases for the
purpose of studying speech properties (especially speech rhythm). Some
of the databases used are the BonnTempo Corpus, the OGI TS Corpus (The
Oregon Graduate Institute Telephone Speech Corpus), and the SCRIBE
Corpus (Spoken Corpus Recordings in British English). The BonnTempo
Corpus is a collection of read speech which uses a short passage from
the novel Selbs Betrug by Bernhard Schlink, translated into the
languages under investigation by philologically educated native speakers
of the target languages. The OGI TS Corpus contains recordings of
telephone conversations in which speakers responded in their native
language to an automatically generated series of prompts, while the
SCRIBE Corpus consists of a mixture of read speech and spontaneous
speech, where the speakers are given pictures to describe. For more on
these databases, see Bjelica (2010: 62).

20 For the purpose of this pilot investigation, the current data was
enough. If more extensive research is to be done, more data and a number
of different speakers should be provided.

In each of the corpora used, the input is quantitatively valuable,
because a very small corpus cannot be relevant in studying language
phenomena: in such cases, the attained results may not reflect the true
state of the particular phenomenon in a language. Besides this, the
abovementioned corpora are qualitatively heterogeneous “and therefore
susceptible to uncontrolled variables that determine the result of the
measurements” (Bertrán 1999: 108). However, one needs to be careful
with this kind of corpus. For example, if one uses a literary work,
especially a poem, as a corpus for studying speech rhythm, one could be
committing a methodological mistake. Namely, the isochrony of rhythmic
units within lines of verse in a poem is intentional rather than
accidental, since very often the organization of verse dictates the
regularity of the occurrence of rhythmic units. Therefore, Bertrán
(1999) poses a very interesting question which casts a shadow over many
of the studies which used poetry for studying rhythmic organization in a
language:

“How can the basis of normal speech rhythm be the same as that in
verse? Why should the poet and the critic worry about the isochronic
distribution of stress if it is an inherent property of the language in
question?”
(Bertrán 1999: 107).

Therefore, it is better practice to use fragments of everyday speech
for the purpose of studying this particular phenomenon.
In studying the Serbian prosodic system, Lehiste and Ivić (1986) based
their research primarily on the acoustic analysis of a corpus which
consisted of 272 sentences of several types (different types of
statements, questions, exclamatory sentences, etc.) produced by two
informants (one of them being one of the authors, Pavle Ivić, himself).
Sentences were constructed in such a manner as to encompass words with
all four types of accent in Serbian. Although they tried to include as
many different structures as they could in order to avoid repetition, it
seems that repeated productions of a smaller set of sentences would have
produced more reliable data (Lehiste and Ivić 1986: 180). They are
realistic about their corpus in saying that it does not provide enough
material to answer all the questions relevant for their research.
Similarly, Jelica Jokanović–Mihajlov (1990) ran an experiment on a
spoken corpus of Serbian. She included different examples of spoken
language in her corpus for the purpose of studying Serbian prosody:
segments of read text from radio programmes, samples of spontaneous
speech (monologues as well as dialogues) from radio programmes and
everyday conversations, and read passages taken from literary prose and
newspaper articles. This last type of text is especially used in the
study of speech rhythm in Serbian.
What is common to
most of the studies presented in this book is the use of a controlled corpus
for the purpose of the research. Bertrán (1999) used a series of “artifi-
cially” created utterances which were similar, but which had the vari-
able distance between the stresses, i.e. the variable number of unstressed
syllables between the stressed ones. Ramus, Nespor, and Mehler (1999)
created their corpus out of short news-like declarative statements, whose
number of syllables per sentence was in range of 15 to 19, and each sen-
tence had an average duration of about 3 seconds. Every sentence was
translated into all eight languages under study so that they have similar
semantic content. Tatham and Morton (2001) also used a controlled cor-
pus made out of read speech. They excluded short sentences or unnatural
utterances within frames, since these types of utterances tend to develop
a rhythm of their own. Also, they decided to exclude ordinary conversa-
tion because it is made out of too many “interruptions”, such as false
starts, pauses, hesitations, and other interruptive effects. Lehiste and Ivić
(1986) included as many different structures as they could in order to
avoid repetition and to include all four accents equally.
The importance of creating such a controlled environment for research on
the rhythmic properties of language lies in concentrating on one
particular aspect of the speech process while neutralizing the other
factors which may influence the flow of speech. For the purpose of
designing a reliable speech synthesis programme, for example, it is more
useful to analyse a stretch of read speech than recorded conversations
or short sentences, because the programme is more likely to produce
speech in a read-speech manner (for example, when reciting some
information retrieved from a database). However, utterances created ad
hoc seem to be more reliable than fragments of literary language if we
want to isolate the variables which are to be studied, as well as
neutralize other factors. For example, since Bertrán’s “artificially”
created utterances served the purpose of determining whether the
addition of unstressed syllables affected the duration of the stressed
syllable in a foot, it was necessary to create feet of different sizes
and to measure their absolute durations.
The older studies suggested the compression of all the unstressed syl-
lables within a foot, in order to preserve the perceived isochrony. Ac-
cording to Bertrán, if this was the case, then the process of compression
would not only affect the unstressed syllables within a foot, but also the
stressed one. Bertrán mentions intrinsic duration of every vowel, the type
of consonant following the vowel, the type of syllable (open or closed),
as well as intonation pattern of the utterance to be the factors which may
also influence the duration of a stressed vowel. For the purpose of this
study, all these other factors were neglected, while special attention was
paid to the accents which followed one another at variable distances.
Syllables were more or less similarly structured. Had the environment
not been so controlled, other factors would have intervened and made the
data more difficult to analyze.
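
The logic of this kind of test can be made concrete. Under strict
accentual isochrony, foot duration would stay roughly constant whatever
the number of syllables in the foot; if the mean duration instead grows
with foot size, compression is not preserving isochrony. The Python
sketch below groups measured feet by syllable count and compares the mean
durations. The measurements, the function names, and the
strictly-increasing check are illustrative assumptions, not Bertrán’s
data or procedure.

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by_size(feet):
    """feet: iterable of (number_of_syllables, duration_in_seconds)."""
    by_size = defaultdict(list)
    for n_syllables, duration in feet:
        by_size[n_syllables].append(duration)
    # Sort by foot size so the means can be compared in order.
    return {n: mean(durations) for n, durations in sorted(by_size.items())}


def grows_with_size(means):
    """True if each larger foot size has a longer mean duration."""
    values = list(means.values())
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical measurements (syllables per foot, duration in seconds).
measured_feet = [(1, 0.21), (1, 0.25), (2, 0.38), (2, 0.42),
                 (3, 0.55), (3, 0.61)]

means = mean_duration_by_size(measured_feet)
for size, value in means.items():
    print(f"{size}-syllable feet: mean {value:.2f} s")
print("duration grows with foot size:", grows_with_size(means))
```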

3.3.3 Data Segmentation

One of the major problems of the rhythmic studies throughout history has
been the lack of a consistent principle of data segmentation. Different
methods in segmenting utterances are used and they consequently produce
different results, which are difficult to compare.
According to some linguists, including Roach (1982), there is no
language which is totally stress-timed or syllable-timed. He says that
each language displays both sorts of timing, which only means that each
spoken language is a mixture of different segments. Since these segments
are not equally distributed within a language, languages differ with
respect to the dominant segments and types of timing. This view was
supported by Tatham and Morton (2001), who say that rhythmic differences
are detectable not only between two languages but also within the same
language, depending on the style and context²¹.

21 Examples given by Gore (2004) are already mentioned in Chapter 2.5.

For the purpose of his study, Roach (1982) segmented his recorded
utterances by hand into syllables and feet. He also identified two major
problems of speech segmentation: the first was identifying stresses in
an utterance, while the second was identifying the beginning and the end
of an interstress interval. Roach (1982) decided to measure feet from
the beginning of the stressed syllable (its onset) rather than from the
intensity peak of the vowel in the stressed syllable, as many linguists
before him did. Bertrán (1999) does the same in order to measure the
absolute duration of feet. However, unlike Roach (1982), who segmented
utterances manually, Bertrán (1999) segmented his utterances using the
system of computational prosodic analysis called C.E.C.I.L (Computerised
Extraction of Components of Intonation in Language).
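
Measuring feet from the onset of each stressed syllable can be sketched
in a few lines of Python. The sketch assumes that syllable onsets and
stress marks have already been annotated (the two problems Roach names
are not solved here); the data, the function names, and the use of the
coefficient of variation as a measure of regularity are assumptions of
this illustration, not part of either study.

```python
from statistics import mean, pstdev


def interstress_intervals(syllables):
    """Durations between successive stressed-syllable onsets.

    syllables: list of (onset_time_in_seconds, is_stressed) in temporal
    order. The last stressed syllable opens a foot with no closing onset,
    so it contributes no interval.
    """
    onsets = [time for time, stressed in syllables if stressed]
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def coefficient_of_variation(values):
    """Standard deviation relative to the mean (0 means perfectly even)."""
    return pstdev(values) / mean(values)


# Hypothetical annotation of one utterance: (syllable onset, stressed?).
utterance = [(0.00, True), (0.18, False), (0.30, True), (0.52, False),
             (0.61, False), (0.80, True), (1.05, True)]

feet = interstress_intervals(utterance)
print("interstress intervals:", [round(d, 2) for d in feet])
print("variability:", round(coefficient_of_variation(feet), 2))
```

Choosing the intensity peak of the stressed vowel instead of the syllable
onset would only change the time stamps passed in, which is why results
obtained with different boundary conventions are hard to compare.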
Ramus, Nespor, and Mehler (1999) segmented their utterances in
eight different languages into vocalic and consonantal (i.e. intervocalic)
intervals. Vowels and consonants were defined as highs and lows in a
universal sonority curve.
The “highs” on that curve represent vowels, since they are the most
sonorous sounds, while the “lows” on the curve represent consonants,
sounds less sonorous than vowels. However, possible problems may oc-
cur with syllabic consonants, especially liquids and glides. Glides, for
example, were treated both as consonants and vowels depending on their
position within an utterance. Namely, pre-vocalic and intervocalic glides
(i.e. glides which are placed before a vowel and between two vowels)
were treated as consonants, while post-vocalic glides (the ones that fol-
low a vowel) were treated as vowels, for example:

(8) English: pre-vocalic: /kwiːn/ “queen”
             intervocalic: /vawəl/ “vowel”
             post-vocalic: /haw/ “how”
             (Ramus, Nespor, and Mehler 1999: 7)
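
The glide rule just described can be written down as a small procedure:
classify each glide by its neighbours, then merge adjacent segments of
the same class into intervals. The Python sketch below is one possible
reading of that rule. The segment durations are hypothetical, the "V",
"C" and "G" labels and function names are this sketch's own, and the
treatment of a glide with no adjacent vowel (counted as a consonant) is
an assumption.

```python
from itertools import groupby


def resolve_glides(segments):
    """Relabel glides ('G') as consonants or vowels by position.

    segments: list of (phone, kind, duration) with kind in {'V', 'C', 'G'}.
    A glide followed by a vowel (pre-vocalic or intervocalic) becomes 'C';
    a glide that only follows a vowel (post-vocalic) becomes 'V'.
    """
    resolved = []
    for i, (phone, kind, duration) in enumerate(segments):
        if kind == "G":
            next_is_vowel = i + 1 < len(segments) and segments[i + 1][1] == "V"
            prev_is_vowel = i > 0 and segments[i - 1][1] == "V"
            kind = "V" if prev_is_vowel and not next_is_vowel else "C"
        resolved.append((phone, kind, duration))
    return resolved


def intervals(segments):
    """Merge adjacent segments of the same class into timed intervals."""
    resolved = resolve_glides(segments)
    return [(kind, sum(d for _, _, d in group))
            for kind, group in groupby(resolved, key=lambda seg: seg[1])]


# Hypothetical durations in seconds.
queen = [("k", "C", 0.08), ("w", "G", 0.05), ("i", "V", 0.15), ("n", "C", 0.07)]
how = [("h", "C", 0.06), ("a", "V", 0.12), ("w", "G", 0.08)]

print(intervals(queen))  # the glide joins the initial consonantal interval
print(intervals(how))    # the glide joins the vocalic interval
```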

On the other hand, Grabe and Low (2002) approached data segmentation
from an acoustic rather than a phonological point of view. Namely, they
segment utterances into vocalic and intervocalic intervals on the basis
of the vowel formants:

“vocalic intervals were defined as the stretch of signal between vowel
onset and vowel offset, characterized by vowel formants, regardless
of the number of vowels included in the section”
(Grabe and Low 2002: 5).

Consequently, intervocalic intervals were defined as the stretch of
signal between vowel offset and vowel onset, regardless of the number of
consonants included. One major difference between Ramus et al.’s
approach to segmentation and Grabe and Low’s is that, in the latter, not
all vowels in an utterance are regarded as belonging to vocalic
intervals. There are languages, like Japanese, where devoicing of vowels
between voiceless consonants is common, and such vowels have different
formant patterns from voiced vowels. Because of this, they are included
in intervocalic intervals (on the basis of their acoustic rather than
phonological properties), which consequently influences the duration of
intervocalic segments and leads to results different from those of Ramus
et al.’s study.
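
The effect of the two criteria on the same stretch of speech can be shown
with a short Python sketch. It approximates the acoustic criterion with a
simple voiced/devoiced flag on each vowel, which is a simplification
assumed here (Grabe and Low refer to vowel formants); the segment
sequence, durations, and function name are hypothetical.

```python
from itertools import groupby


def intervals_by_criterion(segments, criterion):
    """Merge segments into vocalic ('V') and intervocalic ('C') intervals.

    segments: list of (phone, is_vowel, is_voiced, duration_in_seconds).
    criterion 'phonological': every vowel counts as vocalic.
    criterion 'acoustic': only voiced vowels count as vocalic (a stand-in
    for "characterized by vowel formants").
    """
    if criterion not in ("phonological", "acoustic"):
        raise ValueError(f"unknown criterion: {criterion}")

    def kind(segment):
        _, is_vowel, is_voiced, _ = segment
        if criterion == "phonological":
            return "V" if is_vowel else "C"
        return "V" if is_vowel and is_voiced else "C"

    return [(k, sum(seg[3] for seg in group))
            for k, group in groupby(segments, key=kind)]


# Hypothetical sequence with a devoiced vowel between voiceless consonants.
sequence = [("s", False, False, 0.09), ("u", True, False, 0.04),
            ("k", False, False, 0.07), ("i", True, True, 0.10)]

print(intervals_by_criterion(sequence, "phonological"))
print(intervals_by_criterion(sequence, "acoustic"))
```

Under the phonological criterion the sequence yields four intervals;
under the acoustic one the devoiced vowel is absorbed into a single long
intervocalic interval, so any measure computed over interval durations
changes accordingly.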
However, Ramus, Nespor and Mehler (1999), as well as Grabe and
Low (2002), manually determined interval boundaries, which seems to
be one major limitation common to both of these studies. Despite the
best efforts of phoneticians to provide clear labelling principles, the
manual segmentation still seems to be largely subjective. Ensuring that
different researchers will employ exactly the same criteria, especially
in the studies of new languages, is virtually impossible (Ramus 2002:
5). Not only is it highly subjective, but it is also a time-consuming and
tedious process. In the BonnTempo Corpus as well, the collected
recordings were manually segmented into syllables, as well as into
consonantal and vocalic intervals, based on acoustic and visual cues
(Steiner 2004: 2).
On the other hand, Setter (2008) segmented utterances the same way
Ramus, Nespor and Mehler (1999) did, but she used a programme for
automatic processing of the speech corpus called Speech Analyser 2.2
(SIL software). The duration of vocalic and intervocalic (consonantal)
segments was measured using wide-band spectrograms and waveforms
(Setter 2008: 2). One of the most popular programmes for speech analysis
and synthesis today is Praat (see Boersma and Weenink, 1992–2001).
Moreover, there is a special collection of Praat-based software, called
the BonnTempo-Tools, used to facilitate access to and analysis of the
BonnTempo Corpus (see Dellwo et al. 2002).
The initial data preparation, which includes the segmentation of
recorded data, is anything but easy. As mentioned at the beginning, in
order to do this, there should be an agreement on where to place the
segment boundaries. The lack of a uniform methodology has led to
different approaches and thus different results in instrumental studies,
since all the conclusions based on the measures derived from segmental
durations crucially depend on the placement of segment boundaries.
Consequently, results attained this way are rather difficult to compare.

3.3.4 Is There Rhythm to Begin with: Instrumental Studies of Rhythm

A number of studies and experiments have shown that the existing Rhythm
Class Hypothesis proposed by Pike (1945) and adopted by Abercrombie
(1967) and many other linguists is based mainly on the subjective
perception of speech. Even the early research done by Roach
(1982) and Dauer (1983) showed a completely different picture from
the one proposed by the abovementioned linguists. After doing an ex-
periment involving six languages, three stress-timed and three syllable-
timed, Roach (1982) came to the conclusion that not only was the vari-
ation in syllable duration similar in all six languages, thus contradicting
the existing hypothesis about stress-timed languages, but also the stress
pulses were not more evenly spaced in stress-timed languages than in
syllable-timed languages.
The results of the research thus questioned the classification of
languages on the basis of these properties. Similarly, Dauer (1983)
conducted an experiment on four typical representatives of stress-timed
and syllable-timed languages (English being stress-timed, while Spanish,
Italian, and Greek were syllable-timed). She came to the same conclusion:
stresses recurred no more regularly in English than in the other
languages tested. Additionally, she concluded that the duration of
interstress intervals in all the languages analysed was directly
proportional to the number of syllables they contained, i.e. the more
syllables, the longer the interstress interval. On the basis of all the
tests which examined the physical reality of isochrony in stress- and
syllable-timed languages, it can be concluded that isochrony seems to
exist only on the level of subjective perception, since no physical
evidence for it has been found. As some linguists predicted, the
instrumental studies did not prove the existence of isochrony; moreover,
they completely negated the very basis of the theory. For example, the results
of the study done by Antonio Pamies Bertrán (1999), which included
Romance languages (Spanish, Catalan, Portuguese, Italian, and French)
as well as English and Russian, show that feet tend to last longer as
the number of syllables increases, which confirms the lack of accentual
isochrony. Furthermore, syllables tend to last longer when they contain
more sounds, which confirms the lack of syllabic isochrony. His results
also show that stressed vowels are not affected by the addition of
syllables, i.e. there is no compression of stressed syllables in feet. On the
basis of the results of Bertrán’s study, it seems that not only does the
concept of isochrony fail, but the typology of languages based on it fails
too. This is because all the languages studied responded similarly to the
three tests he used22, which can only mean that in this respect languages
behave the same.
The purpose of Bertrán’s study was not only to negate the concept
of isochrony, so widely accepted in the past, but also to negate the
existence of rhythm in everyday speech. He starts by saying that the very term
“rhythm” is a metaphor borrowed from music to explain certain aspects
of verse and thus everyday language cannot be treated the same as the
language of verse. He also mentions the fact that even in music rhythm
is not always isochronic, but that there are even symbols to transcribe
all kinds of anisochronic phenomena. If music is not always isochronic,
there is no reason for speech to be isochronic at all. Although the main
idea of speech rhythm originated from music, even music does not
justify this simplistic view of speech rhythm where all the languages in
the world are said to possess either of the two types of isochrony: accentual or

22 Bertrán (1999) studied Romance languages (Spanish, Catalan, Portuguese,
Italian, and French) and compared them to English and Russian. He reached
the abovementioned results by using three tests which he applied to the data
gathered through the experiment. First, he compared the feet durations to the
number of syllables they consist of. Also, he compared the duration of a syllable
to the number of sounds it contains. Secondly, he compared the temporal ratio
between formally similar feet that started having more and more syllables in
utterances which do not differ in any other way except this. Thirdly, he measured
the time variation of stressed vowels in relation to the increase of syllables per
foot in order to test whether the addition of syllables caused the compression of
sounds inside the feet.

syllabic. Although this kind of symmetry can seem attractive to phonolo-
gists, it is far from reality since the inherited vision of rhythm contradicts
the empirical data attained through numerous analyses. Bertrán (1999)
suggests that rhythm does not necessarily need to be isochronic, and that
there are languages which have what he calls anisochronic rhythm (like
in music). Additionally, he does not rule out the possibility that there are
languages in the world which lack any kind of rhythm. However, most
linguists did not completely discard the notion of isochrony; instead, they
reformulated the theory so that it accounts for isochrony on the perceptual
level only. One such study was done by Franck Ramus, Marina
Nespor, and Jacques Mehler (1999).
Since syllable duration highly correlates with both syllable complexi-
ty and vowel reduction, Ramus, Nespor, and Mehler (1999) support Dau-
er’s approach by arguing that indeed there are some languages “whose
features match neither those of typical stress-timed languages, nor those
of typical syllable-timed languages” (Ramus, Nespor, and Mehler 1999:
5) and can be placed on the abovementioned continuum somewhere be-
tween the typical representatives of the two classes. While Dauer (1987)
proposes that languages which contain properties of both classes are ran-
domly placed along the continuum, Ramus, Nespor, and Mehler (1999)
go a step further and suggest the possibility that there are more classes
than those originally proposed by Pike (1945) rather than a continuum
(Dauer 1987). They conclude that this is an empirical question which can
only be answered after a series of empirical investigations on a number
of languages belonging to unrelated families (Ramus, Nespor, and Me-
hler 1999: 5).

3.3.5 Nobody Puts Babies in the Corner:
the Role of Rhythm Perception in Language Acquisition

Ramus, Nespor, and Mehler (1999) conducted research from which an
acoustic account of speech rhythm developed. This account classifies
languages on the basis of statistical measures of the duration of two
types of interval: vocalic and consonantal. Moreover, the measures they
obtained correlate with the two rhythm class parameters mentioned
previously: syllable complexity and vowel reduction. Their study was based on a
consonant/vowel segmentation of utterances in eight languages (English,
Polish, Dutch, French, Spanish, Italian, Catalan, and Japanese).
The aim of their research was to determine the correlates of linguistic
rhythm in the speech signal in order to explain how infants extract the
rhythm of their native language from the speech signal and use it to
discriminate languages on the basis of rhythm perception and segmentation of speech.
Moreover, they wanted to explain the role of rhythm perception in lan-
guage acquisition. Namely, psycholinguists have relied on the existing
classification of languages on the basis of their rhythmic properties to
explain the infants’ capacity to discriminate languages. Despite exten-
sive research done over the last thirty years, scientists have failed to
identify reliable acoustic properties of language classes. From the results
of the studies discussed so far (Roach 1982; Dauer 1987), it can
be concluded that the existing hypothesis was negated by various
instrumental measurements. Ramus, Nespor, and Mehler (1999) ask
whether instrumental means should be discarded altogether, since they
fail to explain the rhythmic classification of languages, or whether we
should try to find more effective instrumental measurements that could
account for the perception of speech rhythm. Although the phonological
account, which explains rhythmic classes through phonological properties
(Dauer 1987), seems preferable, it fails to explain how rhythm is
extracted from the speech signal by the perceptual system. To clarify some
of these questions, Ramus, Nespor, and Mehler (1999) did a comparative
study of eight languages belonging to different rhythm classes.
They examine infants and not adults for the purpose of hypothesiz-
ing about speech perception and language discrimination simply because
adults use not only speech signal but also some other cues to differen-
tiate between their mother tongue and other languages or between the
two native languages in a bilingual environment. Apart from language
rhythm extracted from speech signal, adults use intonation, phonetics
and phonotactics, recognition of known words, and more generally, any
knowledge or experience related to the target languages and to languages
in general.
On the other hand, infants (the “newborns”) do not have any previ-
ous knowledge, so they have to look for the cues in the speech signal
they are exposed to from the very beginning of their lives. They rely on
the speech signal alone and extract every possible piece of information
available through speech signal. Since language rhythm is the only thing
infants can extract from speech signal at an early age, Ramus, Nespor,
and Mehler (1999) claim that infants’ language discrimination behaviour
relies on the stress-timed/syllable-timed dichotomy.
As already mentioned, their main interest was to explain
how infants learn a part of the phonology of their native language. In
order to do so, Ramus, Nespor, and Mehler (1999) hypothesized that
the rhythm type should be correlated with the speech representation unit
in any given language. These representations may be feet (stress-timed
languages), syllables (syllable-timed languages), or moras (mora-timed
languages). Infants “decide” which representation to use by detecting
the rhythm type of their native language in the speech signal they are
exposed to. Ramus, Nespor, and Mehler (1999) tried to discover devices
which help infants make this decision, since it seems to be crucial in
acquiring their native language.
In doing so, Ramus, Nespor, and Mehler (1999) started with
predictions about bilingual environments. If the two languages belong to
the same rhythm class and thus have the same representation unit,
infants will have no trouble selecting it and will acquire both languages easily.
On the other hand, if the two languages have different representation
units, children will receive contradictory data and, unless they are able to
discriminate between the two languages without speech segmentation,
acquisition is predicted to be much more difficult. Infants are said to use rhythm
to discriminate between languages when they are exposed to languages of
different rhythmic classes.
However, a problem which this approach may encounter has to do
with the fact that only well-classified languages have been used in the
experiments which tested the abovementioned predictions, and thus it
cannot be predicted how infants would deal with the issue of intermedi-
ate languages.
The question of intermediate languages, those which seem to belong
to both rhythmic categories (stress-timed and syllable-timed, leaving mo-
ra-timed aside for now) or which are placed in the middle of the rhythm
scale (Dauer 1987), is inevitable. It is questionable whether infants are
able to discriminate between languages such as Catalan and Polish (Ra-
mus, Nespor, and Mehler 1999), or Catalan and any other stress-timed
language (since Catalan is said to share some properties of stress-timed
languages).
Answering all the questions stated so far would be crucial in under-
standing how infants perceive speech rhythm, how they learn the pho-
nology of their native language, and how they deal with any kind of
bilingual environment.
Since vowels are more sonorous than consonants, Ramus, Nespor and
Mehler (1999) point out that infant speech perception is concentrated on
vowels.

“Vowels carry most of the energy in the speech signal, they last longer
than most consonants, and they have greater stability. They also carry
accent and signal whether a syllable is strong or weak”
(Mehler et al. 1996: 112,
quoted by Ramus, Nespor, and Mehler 1999: 6).

To support this assumption, many experiments were carried out (Ber-
toncini, Bijeljac-Babić, Jusczyk, Kennedy, and Mehler 1988, among oth-
ers) and their results show that infants do pay more attention to vowels
than to consonants. Also, it is said that newborns are able to “count”
syllables in a word, independently of syllable structure or weight. Ra-
mus, Nespor, and Mehler (1999) thus assume that “an infant primarily
perceives speech as a succession of vowels of variable durations and in-
tensities, alternating with periods of unanalyzed noise (i.e. consonants)”
(Ramus, Nespor, and Mehler 1999: 7). By proposing a hypothesis about
simple speech segmentation into consonants and vowels, Ramus, Ne-
spor, and Mehler (1999) wanted to show that this type of segmentation
can account for the standard stress-timed/syllable-timed dichotomy, as
well as to investigate the possibility of other types of rhythm. Moreover,
such simple segmentation should also account for language discrimina-
tion behaviour of infants, and in the end, it should be able to clarify
how rhythm might be extracted from speech signal. Ramus, Nespor, and
Mehler (1999) did not measure the duration of every single phoneme
individually, since infants are still not able to tell the difference between
the phonemes. Rather, infants have the capacity to tell the difference
only between vowels and consonants. This is why Ramus, Nespor, and
Mehler (1999) measured the durations of sequences of consecutive vow-
els, which they called vocalic intervals, and the durations of consecu-
tive consonants, better known as consonantal intervals, or as they, more
conveniently, termed them “intervocalic intervals”.
For example, if we take a random utterance and transcribe it, we can
segment the utterance as follows:

(9) We’re going to the playground
    C V C V C V C V C V C

This sentence is said to have 21 individual phonemes, but five vocalic
and six consonantal intervals23.
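
The grouping of consecutive segments into intervals can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by Ramus, Nespor, and Mehler (1999): the function name `to_intervals`, the labelling of each segment as "C" or "V", and the durations (in milliseconds) are all hypothetical, and the word "playground" is used on its own, with each diphthong counted as a single vocalic segment.

```python
from itertools import groupby


def to_intervals(segments):
    """Collapse consecutive segments of the same kind into intervals.

    `segments` is a list of (kind, duration_ms) pairs, where kind is
    "C" (consonant) or "V" (vowel). Returns (kind, total_duration_ms) pairs.
    """
    intervals = []
    for kind, run in groupby(segments, key=lambda seg: seg[0]):
        intervals.append((kind, sum(duration for _, duration in run)))
    return intervals


# Hypothetical durations (ms) for the segments of "playground":
# p, l, ei, g, r, au, n, d
segments = [
    ("C", 60), ("C", 40), ("V", 150),
    ("C", 50), ("C", 45), ("V", 160),
    ("C", 55), ("C", 65),
]

intervals = to_intervals(segments)
pattern = " ".join(kind for kind, _ in intervals)
print(pattern)
print(intervals)
```

Eight individual segments are reduced here to three consonantal and two vocalic intervals, which is the kind of reduction described for example (9).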

23 Because of its pre-vocalic position, the glide /w/ is treated as a consonant.

From the measurements of the two types of interval, Ramus, Nespor,
and Mehler (1999: 7) derived three variables, each computed over a
single sentence:
1. The proportion of vocalic intervals in the sentence, marked as %V;
2. The standard deviation of vocalic intervals within the sentence,
marked as ΔV;
3. The standard deviation of consonantal intervals within the sen-
tence, marked as ΔC.
The percentage of vocalic intervals of the overall utterance duration
is referred to as %V. This parameter shows how much the duration of vo-
calic intervals takes from the duration of entire utterance, or the portion
of vocalic intervals within the utterance. On the other hand, standard
deviation24 is a parameter which shows us how much variation there is
from the average; in this case, how much variation in duration there is
from the average duration of vocalic or consonantal intervals. This pa-
rameter, termed as ΔC or ΔV, is important because a low standard devia-
tion indicates that the data points tend to be very close to average, which
means that intervals tend to last the same amount of time. On the other
hand, a high standard deviation indicates the differences in duration of
vocalic or consonantal intervals, which indicates a greater variety of syl-
lable types.
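
The three variables can be computed as in the following minimal Python sketch. It rests on assumptions that the text above does not settle: the function name `rhythm_metrics` and the sample durations are hypothetical, and the population standard deviation (division by N) is used, whereas a sample standard deviation (division by N - 1) would give slightly larger values.

```python
from statistics import pstdev


def rhythm_metrics(intervals):
    """Return (%V, deltaV, deltaC) for one sentence.

    `intervals` is a list of (kind, duration) pairs with kind "V" or "C".
    Durations may be in any unit as long as it is used consistently.
    """
    vocalic = [d for kind, d in intervals if kind == "V"]
    consonantal = [d for kind, d in intervals if kind == "C"]
    total = sum(vocalic) + sum(consonantal)
    percent_v = 100 * sum(vocalic) / total
    # Assumption: population standard deviation (divide by N).
    delta_v = pstdev(vocalic)
    delta_c = pstdev(consonantal)
    return percent_v, delta_v, delta_c


# Hypothetical sentence: interval durations in seconds.
sentence = [
    ("C", 0.10), ("V", 0.15), ("C", 0.10), ("V", 0.05),
    ("C", 0.20), ("V", 0.10), ("C", 0.10),
]

percent_v, delta_v, delta_c = rhythm_metrics(sentence)
print(f"%V = {percent_v:.1f}, deltaV = {delta_v:.4f}, deltaC = {delta_c:.4f}")
```

In this invented sentence the vocalic intervals account for 0.30 s out of 0.80 s, so %V is 37.5, while the single long consonantal interval (0.20 s) is what raises ΔC.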
After measuring vocalic and consonantal intervals and calculating the
three parameters, Ramus, Nespor, and Mehler (1999) concluded several
things. Namely, it seems that ΔC and %V are directly related to syllable
structure. As it was already mentioned, higher ΔC means more variabil-
ity in the number of consonants which in turn means a greater variety
of complex syllable types, while consequently the percentage of vowels
(%V) is lower. On the other hand, higher %V means the opposite – the
higher percentage of vowels in an utterance can only mean that the lan-
guage in question has a high percentage of simple syllables, and thus
lower standard deviation in the duration of consonantal intervals (ΔC).
However, the ΔV parameter cannot be interpreted as transparently as the
previous two, since there are a number of factors which influence the
variability of vocalic intervals: vowel reduction (English, Dutch, Catalan),

24 Gauss used the term “mean error”.

contrastive vowel length (Dutch, Japanese), vowel lengthening in spe-
cific contexts (Italian), etc. Dauer (1987) proposed that the two factors
which directly influence rhythm are only vowel reduction and contras-
tive vowel length. Ramus, Nespor, and Mehler (1999: 9) thus conclude
that ΔV still tells us something about the phonology of languages, but
it remains an empirical question whether it tells us something about the
perception of rhythm.
Due to this observation, the two parameters which are relevant for the
present study are ΔC and %V.

Table 3: Proportion of vocalic intervals (%V) and standard deviation of
consonantal intervals (ΔC) over a sentence, averaged by language
(taken from Ramus, Nespor, and Mehler 1999: 25)

Language    %V      ΔC (×100)
English     40.1    5.35
Polish      41.0    5.14
Dutch       42.3    5.33
French      43.6    4.39
Spanish     43.8    4.74
Italian     45.2    4.81
Catalan     45.6    4.52
Japanese    53.1    3.56
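
The inverse relationship between %V and ΔC described earlier can be checked against the values in Table 3. The sketch below is an illustration only: the Pearson correlation coefficient is not a measure reported in the text, and the names `TABLE_3` and `pearson` are hypothetical.

```python
from math import sqrt

# Values copied from Table 3: language -> (%V, deltaC x 100).
TABLE_3 = {
    "English": (40.1, 5.35),
    "Polish": (41.0, 5.14),
    "Dutch": (42.3, 5.33),
    "French": (43.6, 4.39),
    "Spanish": (43.8, 4.74),
    "Italian": (45.2, 4.81),
    "Catalan": (45.6, 4.52),
    "Japanese": (53.1, 3.56),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


percent_v = [v for v, _ in TABLE_3.values()]
delta_c = [c for _, c in TABLE_3.values()]
r = pearson(percent_v, delta_c)
print(f"r = {r:.2f}")
```

The coefficient comes out strongly negative, which agrees with the observation that languages with a higher proportion of vocalic intervals tend to show less variable consonantal intervals. With only eight languages, however, it should be read as a description of this table rather than as a general result.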

After doing extensive research on eight different languages, Ramus,
Nespor, and Mehler (1999) concluded that the measurements of
the speech signal seem to support the idea that rhythmic classes really
do exist, not only in our intuitions about speech rhythm, but also as
meaningful categories which reflect the actual properties of the speech
signal in different languages. Not only do they support the Rhythm Class
Hypothesis, but they also incorporate Dauer’s approach by stating that not all
languages belong to the three categories. Since they studied only eight
languages, selected from those studied by other linguists in order to
support the existing three classes, the data measured fit the account
of the rhythmic classes perfectly, which only led to the conclusion that more
languages must be measured to get the complete picture on the issue of
speech rhythm. However, Ramus, Nespor, and Mehler (1999) are realistic
about future studies, stating that further research and the addition of
more languages could dissolve the existing rhythmic categories. Since
the languages used in the study are well classified as belonging to the
three existing categories, Ramus, Nespor, and Mehler (1999) propose
that the spaces between the three categories may become occupied by
the already mentioned intermediate languages. The idea of a continuous
distribution of languages would challenge the notion that languages cluster
into classes and undermine the very existence of the rhythmic categories,
supporting the idea that languages are placed along a continuum (Dauer
1987). Alternatively, adding new languages to the study of speech
rhythm could also reveal the existence of more than three rhythmic
classes. For example, an attempt to do so came from Levelt and van
de Vijver (1998). They used syllable complexity as a property directly
influencing the rhythmic properties of languages. On the basis of increas-
ing rhythmic complexity among languages under study, they proposed
the existence of five different classes. Three of those correspond to the
existing three rhythmic groups. In one of the other two groups, there
are languages which are said to have properties of both syllable-timed
and stress-timed languages – so-called intermediate languages. The fifth
group is reserved for languages which have the simplest syllables of all,
strictly CV languages.

Table 4: Different rhythmic classes based on syllable complexity
(Levelt, van de Vijver 1998)25

complex
  MARKED I:    stress-timed languages (English, Dutch)
  MARKED II:   intermediate languages (Catalan, Polish)
  MARKED III:  syllable-timed languages (Spanish, Italian, French)
  MARKED IV:   mora-timed (Japanese)
  UNMARKED:    strictly CV languages
simple
25 Mentioned by Ramus, Nespor, and Mehler (1999: 17).

To conclude their story, Ramus, Nespor, and Mehler (1999) state that
the notion of three distinct and exclusive rhythmic classes has not yet
been definitely proven, but it is, in their opinion, the best description
of the current evidence. Everything mentioned so far only further
stresses the importance of collecting more evidence from less studied
languages, which need to be included in the study. Ramus, Nespor, and
Mehler (1999) are aware of the problem of incomplete data and propose
ways to overcome it. They also suggest a line of research which needs to
be pursued. However, they draw conclusions from the evidence they
collected and preserve the notion of rhythmic groupings, since no other
evidence is presented by the data they used in their study. Moreover, if
we take a closer look at the parameters they used, it seems that they
subconsciously went for the most appropriate combination of parameters
(Figure 3), the combination which would eventually prove the existence
of the three rhythmic categories. No other combination of the three
parameters they used produced the desired results, so they conveniently
decided to ignore them (Figures 4 and 5).
Ramus, Nespor and Mehler (1999) do not discard the idea that languages
have some kind of rhythmic organisation; on the contrary, they compare
spoken language to other well-organised motor sequences which require
precise and predictable timing, so there is every reason
to expect spoken language, like walking or typing, to have such rhythmical
organisation. Consequently, the temporal organisation of speech should
not be arbitrary. In the light of Chomsky’s Universality Theory and
Principles and Parameters, Ramus, Nespor and Mehler (1999) suggest the
existence of a basic rhythm common to all languages, with the differences
being due to a few adjusted “settings” (parameters). This approach looks
quite tempting since it is heading towards a universal theory of rhythm,
but in order to answer all the questions about it and to develop this
universal idea further, more research needs to be done and more
languages need to be included in the study.

3.3.6 It’s Not That Easy: Drawbacks of Instrumental Studies

Some linguists criticize Ramus et al.’s approach for several reasons.
First of all, the corpus that they use is said to be too controlled, which
does not favour the generalisation of the results. Ramus (2002) recognises
this problem and states that any further extension of the corpus would
face one major difficulty: extensions would need to follow an identical
method in order to produce comparable results, i.e. the same methodology
needs to be applied to all the languages in question, although the
strict control of the data can sometimes be too subjective and limiting.
He recognises the importance of controlling speech rate when recording
the data, since durational measurements would be affected by different
speech rates.
In order to control speech rate, Ramus, Nespor and Mehler (1999) chose a
corpus in which both the number of syllables per sentence and sentence
duration are matched across languages. This approach is questionable since it
requires that the speech rate is predefined and the speakers are asked to
adopt it, thus turning their own spontaneous speech into something which
is pseudo-spontaneous. Ramus (2002) states that future research would
need to contain not only more languages and more speakers per
language, but also more speech samples produced at different
speech rates, in different registers, etc.
Esther Grabe and Ee Ling Low (2002) tried to overcome this prob-
lem by introducing more languages, but not constraining the data used
– speakers were asked to speak at their own speech rates. Although the
number of languages tested is considerably higher than in Ramus et al.
(1999) (18 languages were tested, those well classified according to the
existing rhythmic classification as well as those less studied and not yet
classified languages), the number of speakers per language is smaller –
namely, only one speaker per language was recorded. This can be a serious
problem since the data may reflect that speaker’s speech characteristics and
personal style as well as language characteristics, so the need to have
more speakers in order to average the data across several speakers is
beyond question. “The more numerous the speakers, the safer the
conclusions” (Ramus 2002: 2). Instead of tightly controlling the data used,
Grabe and Low (2002) normalized their results for changes in speech
rate (which will be discussed in detail later).
The main methodological difference between the two studies was
the segmentation of the data used. Namely, Ramus, Nespor, and Mehler
(1999) segmented their utterances into vocalic and consonantal intervals
by using their phonological properties – the vowels are said to be more
sonorous than consonants. On the other hand, Grabe and Low (2002)
approached the segmentation from the acoustic point of view. They seg-
mented the utterances into vocalic and intervocalic segments on the basis
of their acoustic properties, which means that they measured the duration
of a vowel only if there was evidence of a voiced vowel in the acoustic
signal. Due to the existence of devoiced vowels in some languages (like
Japanese), instead of using the term “consonantal intervals”, they decided
to use the more convenient term “intervocalic”, since not only consonants
can be included in these segments (see Data Segmentation). This in turn
created a serious methodological problem since the results of the two
experiments were considerably different for some languages due to the
increase in the duration of intervocalic intervals after including devoiced
vowels.
Furthermore, Grabe and Low (2002) introduced a different parameter
for calculating the differences in the duration of vocalic and intervocalic
intervals. The Pairwise Variability Index (PVI) calculates the average
difference in duration between successive intervals (for example, two
successive vowels) over a whole utterance. It allows us to determine the
difference in prominence in the pairs of analysed segments and expresses
the level of variability in successive measurements. In order to prevent
speech rate from influencing the results, they normalized each difference
between two intervals by the average duration of the two intervals in
the pair. Ramus (2002) criticised this approach, arguing that normalising
all the data from all the languages tested would mean neglecting the
language-specific phonotactic rules and segmental inventories of the
languages under study. In response, Grabe and Low (2002) argue that
normalisation is desirable for vocalic, but not for intervocalic intervals,
since the latter depend on the abovementioned language-specific
properties. They therefore compute the normalized PVI (nPVI) for vocalic
intervals and the raw (unnormalized) PVI (rPVI) for intervocalic
intervals. Moreover, rPVI should definitely be computed for the
intervocalic intervals of the Japanese data.
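
The two indices can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than Grabe and Low's own implementation: the function names `rpvi` and `npvi` and the sample durations are hypothetical, and the multiplication by 100 follows the usual formulation of the normalized index, which is not spelled out in the text above.

```python
def rpvi(durations):
    """Raw PVI: mean absolute difference between successive durations.

    Requires at least two durations.
    """
    pairs = list(zip(durations, durations[1:]))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def npvi(durations):
    """Normalized PVI: each difference is divided by the mean of its pair.

    Requires at least two durations. The factor 100 is an assumption
    taken from the usual formulation of the index.
    """
    pairs = list(zip(durations, durations[1:]))
    return 100 * sum(abs(a - b) / ((a + b) / 2) for a, b in pairs) / len(pairs)


# Hypothetical vocalic interval durations (ms) for one utterance.
slow = [120, 60, 180, 60]
# The same utterance spoken at twice the speech rate.
fast = [d / 2 for d in slow]

print(rpvi(slow), rpvi(fast))
print(round(npvi(slow), 2), round(npvi(fast), 2))
```

Halving every duration halves the raw index but leaves the normalized index unchanged, which is the sense in which normalisation controls for speech rate.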
Just like in Ramus et al.’s study (1999), the duration measurements in
Grabe and Low’s study provide acoustic evidence for the rhythmic clas-
sifications of speech. When they computed an acoustic variability index
which expresses the level of variability in vocalic and intervocalic inter-
vals, their data supported a weak categorical distinction between stress-
timing and syllable-timing. Namely, stress-timed languages are said to
exhibit high vocalic nPVI as well as high intervocalic rPVI values. This
is due to the fact that languages like English (stress-timed) have both full
and reduced vowels, which contributes to a high level of variability in
vowel durations.
Consequently, intervocalic intervals show a high level of duration
variability as well. On the other hand, syllable-timed languages are said
to lack vowel reduction, and their syllables are simple and more or less
of similar duration. Because of that, their level of duration variability in
both vocalic and intervocalic segments is expectedly low. However, due
to the existence of intermediate languages and their durational measures
which show that languages can be more or less stress-timed or syllable-
timed, Grabe and Low (2002: 10) opt for a gradient nature of rhythmic
classification rather than a strict categorical distinction between the two
groups (as Ramus et al. 1999 propose).
A problem which occurred in their approach was the inability to clas-
sify the newly studied languages into either of the two categories26. Ra-
mus (2002) gives as a possible explanation an insufficient number of
speakers per language. He thus concludes that in order to do reliable
research it is essential to have a variety of speakers for each language
and to control for speech rate either by constraining the corpus (Ramus,
Nespor, and Mehler 1999) or by using a normalization procedure (Grabe
and Low 2002).

3.3.7 Serbian: the Scarcity of Instrumental Studies

Given the scarcity of instrumental studies of speech rhythm in Serbian,
a classification of this language on the basis of the traditional
stress-timing/syllable-timing dichotomy is left without firm evidence.
Namely, phonology teachers tend to make a distinction between English
and Serbian in this respect and classify Serbian as a syllable-timed
language. However, without empirical evidence, this classification is
nothing more than an assertion on paper. From the studies of Serbian speech rhythm
which have been analysed in this book, it can be concluded that the
picture of Serbian speech rhythm characteristics is far from clear. These
studies (Jokanović–Mihajlov 1990, Jovičić 1999) show that Serbian ex-
hibits both sorts of timing and has characteristics of both stress-timed
and syllable-timed languages. Supporting the view that there is either a
rhythmic continuum or more rhythmic classes, it can be concluded that
Serbian, in this respect, looks quite like Czech and belongs to the group
of so-called “intermediate languages”.
As already mentioned in previous chapters of the book,
Jokanović–Mihajlov (1990: 109), in her study of speech rhythm in Serbian,
discusses the structure of Serbian rhythmic groups. Namely, she proposes
a groundbreaking theory, suggesting a unit of Serbian rhythmic
organisation which closely resembles that of English. She discards

26 They exclude mora-timed as a third category and regard Japanese as a
syllable-timed language.

the syllable as a unit of rhythmic organisation and instead introduces
rhythmic groups. Her study shows that most of these rhythmic groups
in Serbian are made up of two or three syllables, mostly CV in structure
(61.6%). Then come groups with four (21.1%) and five syllables
(8.62%), while monosyllabic groups which have their own accents are
very rare (only 3%). This last percentage was expected due to the fact
that a large number of monosyllabic words in Serbian are clitics (either
proclitics or enclitics), words which do not have an accentual pattern on
their own but need to group with the word which precedes them (enclit-
ics) or the word which follows them (proclitics) to receive an accent
(consequently, they are treated as unaccented weak syllables). According
to the analysis done by Jokanović–Mihajlov (1990: 110), most monosyl-
labic rhythmic groups which have their own accents are of the CVC type
(clitics are of the CV type mostly).
Since most Serbian syllables are of the CV type, rhythmic groups are
made up of such syllables. Disyllabic rhythmic groups are of the
CV-CV type, trisyllabic and other polysyllabic groups of the
CV-CV-CV-… type, with the possibility of having V or VC type
syllables as well. Not only do these groups consist of a
miscellaneous set of syllables, but the variety of their structures
is even greater due to the different positions of the accent in
similar structures. The duration of such segments was also measured
on the corpus used in the study.
The mean duration of vowels in accented syllables, regardless of the
accent type, was 79.19 ms for the corpus in question. Long accented
vowels lasted about 96.92 ms, while the average duration of short
accented syllables was 73.69 ms. Unaccented syllables had an average
duration of 52.01 ms. Although it is said that accented syllables are
significantly longer than unaccented syllables in Serbian,
Jokanović–Mihajlov’s results show that the difference in duration
between these two types of syllables is not as great as expected, and
that this difference is even smaller when a word is pronounced in a
sequence than in isolation. This brings us to the conclusion that
even Serbian syllables, especially the accented ones, undergo some
kind of contraction in speech (discussed by Jones 1978).
Jokanović–Mihajlov (1990: 110) states that this is evidence that
syllables need to be modified inside a rhythmic group in order for a
stretch of speech to be rhythmically organised. She even notices the
reduction in length of vowels in post-accented syllables in some
contexts, as well as a general tendency to reduce vowels in
pre-accented syllables inside a rhythmic group in order for the ones
in post-accented syllables to be lengthened (Jokanović–Mihajlov
1990: 110). The longer the rhythmic group, the shorter its segments
will be, an idea which is well known from the typological studies of
speech rhythm based on the notion of perceived isochrony.
As far as vowel reduction is concerned, it is a well-known fact that
speakers of the dialect of Serbian spoken in some parts of Bosnia and
Herzegovina tend to reduce their vowels considerably in
post-accentual positions. According to Brown and Alt (2004), they
tend to reduce their post-accentual short vowels (especially /i/ and
/u/), while the long ones are heard clearly. Not only do they reduce
short post-accentual vowels, but they very often drop them
completely, for example: Zen’ca instead of ZenIca (the name of a town
in Bosnia), slan’na instead of slanIna (‘bacon’), napomen’ti instead
of napomenUti (‘to remark’), etc. However, Bosnians are said to make
fewer accent and length distinctions than speakers in Serbia do.
Moreover, even in the dialect(s) of Serbian spoken in Serbia, the
syllable following the falling accent is said to have a weak vowel
(even voiceless, according to Trager 1940, although Serbian does not
have voiceless vowels), while this is not the case with a syllable
after the rising accent, which is said to have a full vowel (Trager
1940: 30). It thus seems that the vowels of post-accentual syllables
are prone to reduction if preceded by a falling accent. In any case,
the weakening of a post-accentual syllable is a good starting point
for the vowel reduction process to take place. Obviously, some kind
of vowel reduction does occur in Serbian data.
If we dared to classify Serbian according to the existing rhythm
typology of languages, Serbian would be somewhere between the
typically stress-timed and syllable-timed languages, just like Czech.
As already mentioned, Serbian has a high percentage of simple
syllables (73% – Jovičić 1999), a property of typically
syllable-timed languages. Moreover, the difference between accented
and unaccented syllables in terms of duration and intensity
(prominence) is not as great as expected – again, a characteristic of
syllable-timed languages. However, the basic unit of rhythmic
organisation is not the syllable but something more complex – a
rhythmic group, a semantic unit consisting of one accented and a
number of unaccented syllables, which has one accentual pattern.
Furthermore, according to the results of a study done on Serbian
data, it seems likely that Serbian unaccented syllables (either
pre-accented or post-accented) undergo some kind of vowel reduction
in connected speech – a property of a typical stress-timed language.
The problem with Dauer’s phonological account of speech rhythm, as
already mentioned, is that she does not state how much each property
contributes to the perception of rhythm, i.e. which property is
“stronger” and thus more important in determining the exact position
of a language on the rhythmic continuum.
However, due to the scarcity of studies concerning speech rhythm in
Serbian, such a conclusion will be left open, with the hope that some
day valid research will be done on a Serbian corpus, similar to the
studies done for many other languages by Bertrán (1999), Ramus et al.
(1999), Dankovičová and Dellwo (2007), Setter (2008), among others.
Future studies of speech rhythm in Serbian should thus involve the
kind of empirical research which has already been done for other
languages, including English, French, Italian, Spanish, and even
Arabic, Japanese, and Czech. First of all, in order to collect the
data for a small study on the topic of speech rhythm in Serbian, some
kind of corpus would need to be established.
Such a corpus could be constructed similarly to the corpora used in
the abovementioned studies. For example, Serbian data should be
included in the BonnTempo Corpus and some other multilanguage
corpora. Moreover, it is necessary to agree on an appropriate
methodology to be applied in this type of research. Due to the
controversial and delicate nature of speech rhythm, the experiment
needs to be conducted in a highly controlled environment. Secondly,
the corpus should be segmented in several ways depending on what we
want to examine. Namely, we should measure consonantal and vocalic
segments in order to calculate the proportion of vocalic intervals
(%V) as well as the standard deviation of consonantal intervals
within the utterance (ΔC). On the basis of these measurements,
Serbian could be placed on the %V/ΔC diagram in order to precisely
determine its position in relation to the existing rhythmic classes.
Moreover, since syllable structure has been widely studied on Serbian
corpora, the vowel reduction phenomenon deserves more attention. When
both of these properties are studied carefully, Serbian could be
described in relation to them and placed on the rhythmic continuum
proposed by Dauer (1987).
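
The two measures named here, %V and ΔC, can be illustrated with a
short Python sketch. It is only a minimal illustration, assuming that
an utterance has already been segmented by hand into vocalic ("V") and
consonantal ("C") intervals with durations in milliseconds. The
function name percent_v_delta_c, the use of the population standard
deviation, and the sample durations are all assumptions made for the
example; none of them comes from a Serbian corpus or from the studies
cited.

```python
from statistics import pstdev


def percent_v_delta_c(intervals):
    """Return (%V, delta C) for one utterance.

    intervals: list of (label, duration_ms) pairs, where label is
    "V" for a vocalic interval and "C" for a consonantal interval.
    """
    vocalic = [dur for label, dur in intervals if label == "V"]
    consonantal = [dur for label, dur in intervals if label == "C"]
    total = sum(vocalic) + sum(consonantal)
    # %V: share of the utterance duration taken up by vocalic intervals
    percent_v = 100.0 * sum(vocalic) / total
    # delta C: standard deviation of consonantal interval durations
    # (population formula; an assumption of this sketch)
    delta_c = pstdev(consonantal)
    return percent_v, delta_c


# Hypothetical, hand-made durations in ms (not measured data)
utterance = [("C", 80), ("V", 100), ("C", 120), ("V", 100)]
percent_v, delta_c = percent_v_delta_c(utterance)
print(f"%V = {percent_v:.1f}, deltaC = {delta_c:.1f} ms")
```

Each utterance in a corpus would yield one such pair of values, and
averaging the pairs per language would give a point that could be
plotted on a %V/ΔC diagram.
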

4 HOW TO APPLY THE STUDY OF SPEECH RHYTHM:
Speech Synthesis and Rhythm Teaching

The reasons for studying prosodic features are both scientific and
non-scientific in nature. Not only do these sorts of information help lis-
teners segment speech utterances and enhance their understanding, as
well as help learners of foreign languages sound more native-like, but
they can also help develop or improve programmes for speech synthesis
and speech-recognition devices. Consequently, the speech produced by
a machine can sound more natural, as if pronounced by humans, and
thus more accurate. It is of great importance to create a reliable speech
synthesis programme, as well as to introduce some exercises into English
language classes which would help students master the accurate English
way of speaking and, consequently, enable listeners to understand the
message more quickly and easily.
Mark Tatham and Katherine Morton’s (2001) study of speech rhythm
aimed to help in designing a reliable speech synthesis programme. As
in all other studies based on the perceptual approach to speech
rhythm, they concluded that the assumed isochrony was only perceived,
and they sought to find the correlates of that perceived isochrony in
the acoustic signal. The question they pose in the paper is why
listeners hear a rhythmic succession of units if that succession does
not exist in the speech signal produced by the speaker. Many
researchers before them tried to find some measurable parameters in
the acoustic signal which trigger the perception of a regular
rhythmic succession of pre-determined speech units (cf. Ramus,
Nespor, and Mehler 1999). Although many of the extensive statistical
studies done on different corpora negate the existence of any kind of
isochrony, researchers had a hard time leaving the existing theory
behind, and in the process of study they made a number of
methodological mistakes. Instead of stating that isochrony does not
exist in the physical reality of speech and looking for some other
explanation for the perceived isochrony, they manipulated their data
in order to find any isochrony model in the acoustic signal. In this
manner, they did not follow a uniform path of data segmentation but
segmented the utterances in different ways to find the segmentation
which best fit the frame they wanted to present. In order to have a
uniform approach to rhythmic theory and comparable results, specific
rules for data segmentation must be set. Tatham and Morton (2001) set
such rules for their data analysis.
Tatham and Morton (2001) state that both the listener and the speaker
are aware of isochrony in speech. If isochrony is an expected feature of
human speech, the acoustic effects which would trigger the perceived
isochrony in the listener seem to be highly important in designing a reli-
able synthesis system. If isochrony is lost due to the lack of such infor-
mation, the results of the synthesis process would sound unnatural and
difficult to perceive.
So the question which they pose is concerned with the way people
successfully generate the acoustic signal in order to cause an appropri-
ate response in the listener. Contrary to expectations, the results of most
studies show the lack of isochrony in the acoustic signal, so the task gets
more complicated since we obviously need to synthesise rhythm which
is not isochronic in nature but which gives rise to the perception of isoch-
rony in the listener.
Tatham and Morton (2001) conducted a pilot study in order to test
whether their expectations about the speech signal were true or not.
The starting hypotheses are the ones which exist in the theory of
speech rhythm proposed by Pike (1945) and Abercrombie (1967) and
their supporters. Namely, they start from the proposition that
isochrony exists, i.e. that the rhythmic units to be tested are
isochronous. Also, they assume that there is no correlation between
the duration of a rhythmic unit and the number of syllables it
contains.
This means that there has to be some kind of compression and
contraction of the syllables which are added to the rhythmic unit.
Finally, they propose that syntactic boundaries have no effect on
rhythmic units, i.e. rhythmic units will not increase in duration
before particular syntactic boundaries, nor will they decrease in
duration right after the boundaries (however, this can only affect
the so-called “hanging” rhythmic units which occur at the beginning
of syntactic units but do not have any stressed syllables within).
As they expected, their research negated all of the hypotheses
proposed at the very beginning. Tatham and Morton (2001) showed in their
study that isochrony does not exist, but that there is some stability in the
duration of rhythmic units, i.e. that rhythmic unit duration is not random
and that variations in the duration of rhythmic units, though wide, show
a remarkable consistency (marked by v in the table):

Table 5: Durations of rhythmic units for the newspaper articles used in the
study (Tatham and Morton 2001: 15)

They speculate that this may be the reason why these variations in
duration are neutralized easily by the perceptual system of the listener,
which consequently leads towards the perceived isochrony (Tatham and
Morton 2001: 15).
Furthermore, their investigation revealed a correlation between the
duration of a rhythmic unit and the number of syllables it contains.
Their study shows a regular increase in the duration of rhythmic
units as the number of syllables in the unit increases. Tatham and
Morton (2001) even calculated a correlation coefficient of +0.54,
which is a fair positive correlation at the 95% confidence level.
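
A correlation coefficient of this kind can be computed as in the
following Python sketch. It is only an illustration of the standard
Pearson formula, which is an assumption here, since the text does not
say which coefficient Tatham and Morton used. The function name
pearson_r and the sample values are hypothetical and are not Tatham
and Morton's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rhythmic units: syllable counts and durations in ms
syllables = [1, 2, 2, 3, 3, 4, 5]
durations = [360, 410, 470, 480, 540, 570, 690]
print(f"r = {pearson_r(syllables, durations):.2f}")
```

A value close to +1 would mean that unit duration grows almost
linearly with the number of syllables, while a value close to 0 would
be expected if units were truly isochronous.
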
This increase in the duration can only mean one thing: there is no
contraction of existing syllables in order for rhythmic units to have the
same duration, thus there is no isochrony of rhythmic units, at least not
isochrony of this type.
The real contribution of Tatham and Morton’s study (2001) lies in
an attempt to create the perceived isochrony for the purpose of
speech synthesis by constructing a predictive rhythm unit duration
model. They defined the basic rhythmic unit, on the basis of which
they generated a consistency in the speech signal in the following
way: the basic rhythm unit is said to follow the model “stressed +
unstressed” (i.e. two syllables); it is marked by L and has an
average (or mean) duration of 436.7 ms. All the other rhythmic units
are calculated from it as follows:

One-syllable unit: L - (L*20/100)
Two-syllable unit: L = basic rhythm unit (duration = 436.7 ms)
Three-syllable unit: L + (L*15/100)
Four-syllable unit: L + (L*35/100)
Five-syllable unit: L + (L*55/100)

The ratio is basically as follows: [62] : 80 : 100 : 115 : 135 : [155]
(Tatham and Morton 2001: 16)

They calculated the predicted durations of all the syllables they used
in their data and compared them to the measured values. The two types
of data showed very few inconsistencies, which means that their predic-
tions were correct.
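
The duration rules listed above can be written out as a small Python
sketch. The base value of 436.7 ms and the percentage adjustments for
units of one to five syllables are taken from the text. The names
BASE_L_MS, ADJUSTMENT_PERCENT and predicted_duration are hypothetical
and serve only this illustration, which is not Tatham and Morton's own
implementation.

```python
# Mean duration of the basic two-syllable unit L, as reported in the text
BASE_L_MS = 436.7

# Percentage adjustment relative to L, by number of syllables in the unit
ADJUSTMENT_PERCENT = {1: -20, 2: 0, 3: 15, 4: 35, 5: 55}


def predicted_duration(n_syllables, base=BASE_L_MS):
    """Predicted duration (ms) of a rhythmic unit with 1 to 5 syllables."""
    percent = ADJUSTMENT_PERCENT[n_syllables]
    return base + base * percent / 100


for n in sorted(ADJUSTMENT_PERCENT):
    print(f"{n}-syllable unit: {predicted_duration(n):.1f} ms")
```

Units with more than five syllables are not covered by the listed
rules, so the sketch deliberately raises a KeyError for them instead
of guessing a value.
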
However, in order to make predictions even more reliable, Tatham
and Morton needed to test the third hypothesis they had posed at the very
beginning. They stated that the syntactic boundaries had no influence on
the duration of rhythmic units, but as they had expected, this proved to
be wrong.
That is why they took into account that the units before syntactic
boundaries tend to last longer (by as much as around 20%). Moreover,
there are units which occur right after a syntactic boundary and do
not include any stressed syllable. Such units tend to be shorter. In
order to account for these “irregularities” in the speech signal,
they calculated the unit that immediately follows a pause using the
value of L which corresponds to a unit with one fewer syllable. On
the other hand, a unit which immediately precedes a pause (a
syntactic boundary) uses the value of L which corresponds to a unit
with one more syllable. When they took all these pauses in the data
into account, the predicted and measured results fit almost
perfectly:

Figure 6: Predicted rhythm unit durations shown against measured unit
durations in the test data with utterance block end corrections before each
pause (Tatham and Morton 2001: 18)
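
The boundary correction described above (one syllable fewer after a
pause, one syllable more before a pause) can be sketched in Python as
follows. The base duration and percentage adjustments come from the
text. The function name corrected_duration, the position labels, and
above all the clamping of the shifted syllable count to the range 1 to
5 are assumptions of this sketch, since the text does not say how
units at the edges of the table are handled.

```python
# Values reported in the text for the basic unit L and its adjustments
BASE_L_MS = 436.7
ADJUSTMENT_PERCENT = {1: -20, 2: 0, 3: 15, 4: 35, 5: 55}

# Shift in the syllable count used for the lookup, by position
POSITION_SHIFT = {"medial": 0, "post_pause": -1, "pre_pause": 1}


def corrected_duration(n_syllables, position="medial", base=BASE_L_MS):
    """Predicted duration (ms) with the pause correction applied.

    position: "medial", "post_pause" or "pre_pause".
    """
    shifted = n_syllables + POSITION_SHIFT[position]
    # Assumption: stay within the 1-5 syllable range the rules cover
    effective = min(max(shifted, 1), 5)
    return base + base * ADJUSTMENT_PERCENT[effective] / 100


for pos in ("post_pause", "medial", "pre_pause"):
    print(f"3-syllable unit, {pos}: {corrected_duration(3, pos):.1f} ms")
```

With this lookup a three-syllable unit before a pause receives the
four-syllable value, which is roughly in line with the lengthening of
around 20% mentioned in the text.
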

Many researchers before them (Bertrán 1999, Ramus, Nespor, and Mehler
1999, among others) neglected the influence of syntactic boundaries
and only measured perfectly defined rhythmic units, which is one
example of data manipulation in order to get expected results. What
Tatham and Morton wanted to show is that although there seems to be
no isochrony of rhythmic units in the acoustic signal, there are some
regularities in the speech signal which the listener perceives as
isochronous. If there is something which is regular in speech, there
has to be a way to create a model which would predict the durations
of the rhythmic units in speech production. The results of this study
can be used to generate a model of speech synthesis that would
produce sound which is natural and closer to real human speech. Since
their analysis revolves around English and generates a model for
English, it would be useful to test their predictive model on other
languages as well. Studying prosodic properties of speech is of the
utmost importance in the process of creating a reliable programme for
speech synthesis. Such projects require multidisciplinary teams of
experts in acoustics, linguistics, programming, mathematics, and
signal processing. One such Research and Development group at the
Faculty of Technical Sciences (University of Novi Sad, Serbia), named
AlfaNum, has developed Automatic Speech Recognition (ASR) and
Text-To-Speech (TTS) engines for the Serbian language. The goal of
ASR is to train computers to understand human speech, while TTS
synthesis has to teach computers to read any text.

Due to a great variability in speech signal, it is impossible to create
reliable programmes to perform such complicated tasks without study-
ing prosodic features of a language – in this case, Serbian. It is crucial
in creating a programme which would synthesise speech that would
sound natural and human-like. This is important not only because such
speech is nice to hear but also because it is easier to understand – listen-
ers have a hard time identifying words and sentence boundaries and thus
understanding the message if the natural flow of speech is in any way
interrupted. Many such programmes for speech synthesis are designed
to produce speech that has a constant fundamental frequency, which in
turn creates a problem for the listeners who make additional effort to
concentrate on understanding what is being said.
While processing the information about the phoneme inventory of a
language is relatively easy, processing prosodic information is
anything but. Especially in Serbian, it is important to have proper
intonation and accentuation, since words very often change meaning or
lexical category depending on the type and the position of the accent
within a word, which in turn creates confusion in processing the
message.
As it has already been mentioned (see Data Selection), for the purpose
of designing a reliable speech synthesis programme, it is more useful to
analyse a stretch of read speech than recorded conversations or short sen-
tences because the programme is more likely to produce speech in a read
speech manner, when reciting retrieved information from a database.

4.1 Why Should Speech Rhythm Be Taught in Language Classes?

Gilbert (2008: 2) states that “time spent helping students
concentrate on the major rhythmic and melodic signals of English is
more important than any other efforts to improve their
pronunciation”. However, in English teaching practice, the study of
pronunciation has mainly concentrated on the segmental aspects of
English, and speech rhythm continues to be a much neglected part of
language teaching courses. Students have been taught phonemes,
phoneme contrasts, as well as phoneme sequences, while stress and
rhythm have been traditionally neglected, especially in the classes
of English as a foreign language. According to Sabater (1991: 145),
“an appropriate stress and rhythmic pattern is more important for
intelligibility than the correct pronunciation of isolated segments”,
since these two prosodic features are said to determine the correct
pronunciation of segments in English. This is so because stress and
rhythm give overall shape to the word or sequence of words.
Rhythm is as problematic for teaching as it is for learning. If it is
taught in the first stages of language learning, many segmental and
other problems can be avoided. It is difficult to teach rhythmic
patterns since it is hard to concentrate on them as a separate unit
of speech, without paying attention to other speech properties
(segments, for example). Sabater (1991) points out that when the
pronunciation of the right rhythmic pattern is required, students
tend to concentrate on the stress pattern, neglecting all the other
properties of speech and thus making unnecessary mistakes. However,
teachers can try to help learners develop, at least, an awareness of
rhythm by highlighting rhythmic patterns apart from words and
meaning. A good practice for doing so is to “divorce” rhythm from its
context and content. In that way, teachers can draw learners’
attention to it, help them acquire it, and then, finally, practise
meaningful phrases with it. One way of divorcing the rhythm from its
environment is to practise nonsense phrases with appropriate rhythmic
patterns. Once students are able to hear and also reproduce the
selected patterns themselves using the nonsense syllables, they can
try to distinguish actual phrases. Moreover, rhythm practice is most
effective when physical activity is included, which is a good way of
showing students the difference between stressed and unstressed
syllables in an utterance. Such activities can include tapping,
clapping, or using a piece of rubber material which is stretched if a
syllable is perceived as stressed and squashed if a syllable is
perceived as unstressed, etc.
The importance of mastering the rhythmic properties of a foreign
language lies in the fact that a person who studies the particular
language will communicate much better with people whose native
language he or she has been learning. Not only would such speakers
sound more natural and native-like, but the listeners would not have
a hard time understanding a message presented in a manner which is
more natural and more usual for their native language. Speech
segmentation would thus be much easier if the speaker used the
representation units which are characteristic of the language he or
she is trying to speak. Although many people think that the message
can be understood even if different rhythmic patterns are used as
long as the segments, words, and phrases are intelligible, the
appropriate segmentation of speech utterances has the same value as
spaces in written texts – without it the message would indeed be
understood, but it would take more time to get to it.

5 CONCLUSION

This book is a critical overview of the existing theories of speech
rhythm, both traditional and more modern ones. It compares different
approaches and methodologies and classifies them into three groups,
on the basis of their attitude towards speech rhythm and the types of
research they involve: 1) the typological approach, which is based on
the notion of isochrony and classifies languages into two (sometimes
three) different categories, 2) the phonological approach, which
seeks to find phonological features responsible for the perception of
isochrony, 3) the perceptual approach, which either questions the
existence of speech rhythm altogether on the basis of extensive
instrumental research or seeks to find correlates of the perceived
rhythm in the speech signal in order to explain how infants extract
rhythm from the speech signal and use it to discriminate languages.
Moreover, it considers the application of the study of speech rhythm
both in teaching a foreign language and in creating a reliable
programme for speech synthesis and recognition.
Being one of the most controversial issues in language theory, speech
rhythm has caused a lot of problems for many linguists who have dealt
with it so far. To make things even more difficult, a contrastive
study needed to be done in order to see whether, and to what extent,
languages such as English and Serbian differ in terms of this
language feature. However, one of the major problems appeared at the
very beginning of this research: the disproportion between the
literature about English and that about Serbian related to the topic
of speech rhythm. Comparing the abundance of papers and books related
to this topic in English with the scarcity of such studies done on
Serbian data, we immediately start wondering whether this feature
plays the same role in the two languages. The true reason for this
disproportion is still unknown, but it can be speculated that Serbian
linguists do not regard the rhythm of speech as a relevant language
feature. Interestingly enough, most linguists who have dealt with
this issue are not native speakers of Serbian (Ilse Lehiste, G.L.
Trager, R.G.A. de Bray, etc). Due to this disproportion, the current
study needs to observe all the phenomena concerning speech rhythm
through the rhythmic studies of English and to try to apply the
proposed rules to the rhythm of Serbian in order to see whether these
languages differ and to what extent.
Many problems in forming the theory of speech rhythm come from
different approaches to this particular issue. First of all, research
methodologies vary significantly across different studies, which
contributes to their different outcomes. Also, the point of view
seems to be a problem, since linguists cannot agree whether to define
rhythm from the point of view of the speaker or the hearer.
Furthermore, the lack of empirical evidence to support the earlier
approaches is a serious drawback, and these early theories are thus
susceptible to criticism. However, recent approaches to speech
rhythm, although very critical towards the existing theory, have not
yet offered a valid, fully-fledged, empirically-based rhythm theory,
although the experimental means have helped to clarify many
problematic issues of the early theories.
The aim of this study is to give an overview of the existing
approaches to the issue of speech rhythm and to point out the
differences between them, without trying to decide which of them is
the most appropriate. Moreover, it stresses the importance of doing
similar studies on the topic in question for the Serbian language. It
also offers some guidelines for future studies of Serbian rhythmic
organisation, as well as some guidelines for teachers on how to
integrate the topic of speech rhythm into their teaching practice.
Finally, this book can help those who want to study speech rhythm to
find all relevant pieces of information in one place, which is a
small but not insignificant contribution to the study of speech
rhythm.

REFERENCES

1. Abercrombie, D. (1965). A phonetician’s view of verse structure.
Oxford: OUP.
2. Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh
University Press.
3. Arai, T, Greenberg, S. (1997). “The temporal properties of spoken
Japanese are similar to those of English”. Proceedings of the 5th
European Conference on Speech Communication and Technology
(Eurospeech-97): 1011-1014. http://www.splab.ee.sophia.ac.jp/pa-
pers/1998/1998_13.pdf
4. Bjelica, M. (2010). Characteristics of Speech Rhythm in English and
Serbian. Unpublished Master’s Thesis. Novi Sad: Faculty of Phi-
losophy.
5. Boersma, P, Weenink, D. (1992-2001). “Praat: A system for doing
phonetics by computer”. Available from: http://www.praat.org/
6. Bolinger, D. (1981). Two Kinds of Vowels, Two Kinds of Rhythm.
Indiana University Linguistics Club.
7. Brown, W, Alt, T. (2004). A Handbook of Bosnian, Serbian, and Cro-
atian. SEELRC. http://seelrc.org:8080/grammar/pdf/stand_alone_
bcs.pdf
8. Chela Flores, B. (1997). “Rhythmic Patterns as Basic Units in Pro-
nunciation Teaching”. Chile: ONOMAZEIN 2: 111-134: http://on-
omazein.net/2/patterns.pdf
9. Cruttenden, A. (1986). Intonation. Cambridge: CUP.
10. Crystal, D. (1995). The Cambridge Encyclopedia of the English Lan-
guage. Cambridge: CUP.
11. Crystal, D. (1996). “The past, present and future of English rhythm”.
Speak Out, Newsletter of the IATEFL Pronunciation Special Inter-
est Group, 18: 8-13: http://www.davidcrystal.com/DC_articles/Eng-
lish46.pdf
12. Crystal, D. (2008). A Dictionary of Linguistics and Phonetics 6th edi-
tion. Oxford: Blackwell Publishing.
13. Cummins, F, Gers, F, Schmidhuber, J. (1999). “Comparing Prosody
Across Many Languages”. I.D.S.I.A. Technical Report IDSIA-07:
ftp://ftp.idsia.ch/pub/techrep/IDSIA-07-99.ps.gz
14. Cummins, F, Port, R. F. (1998). “Rhythmic constraints on stress tim-
ing in English”. Journal of Phonetics, 26(2): 145–171. http://www.
asel.udel.edu/icslp/cdrom/vol4/437/a437.pdf
15. Cummins, F. (2002). “Speech rhythm and rhythmic taxonomy”. Pro-
ceedings of speech prosody, Aix-en-Provence: 121-136.
16. Dankovičová, J, Dellwo, V. (2007). “Czech Speech Rhythm and the
Rhythm Class Hypothesis”. Proceedings of the 16th ICPhS, Saar-
bruecken: 1241-1244: http://www.icphs2007.de/conference/Pa-
pers/1538/1538.pdf
17. Dauer, R. M. (1983). “Stress-timing and syllable-timing reanalysed”.
Journal of Phonetics, vol.11: 51-62.
18. Dauer, R. M. (1987). “Phonetic and phonological components of
language rhythm”. Proceedings of the XIth ICPhS, Tallinn, Estonia,
vol. 5: 447-450.
19. Dauer, R. M. (1993). Accurate English: A Complete Course in Pro-
nunciation. Englewood Cliffs, NJ: Prentice Hall Regents.
20. De Bray, R. G. A. (1960). “The Pitch of Serbo-Croatian Word Ac-
cent in Statements and Questions”. The Slavonic and East European
Review, vol. 38 (91). The Modern Humanities Research Association
and University College London, School of Slavonic and Eastern Eu-
ropean Studies: 380-393: http://www.jstor.org/pss/4205174
21. Dellwo, V, Koreman, J. (2008). “How speaker idiosyncratic is meas-
urable speech rhythm?” Proceedings, IAFPA 2008, Swiss Federal In-
stitute of Technology Lausanne (EPFL): http://www.hf.ntnu.no/isk/
koreman/Publications/2008/IAFPA2008abstract_DellwoKoreman.
pdf
22. Dellwo, V, Steiner, I, Aschenberner, B, Dankovičová, J, Wagner, P.
(2004). “The BonnTempo-Corpus and BonnTempo-Tools: A data-
base for the combined study of speech rhythm and rate”. Proceed-
ings of the 8th ICSLP, Jeju Island, Korea: http://www.phonetiklabor.
de/Phonetiklabor/Inhalt/Ver%F6ffentlichungen/PDFs/BonnTempo.
pdf
23. Dellwo, V, Wagner, P. (2003). “Relations between language rhythm
and speech rate”. Proceedings of the International Congress of
Phonetics Science, Barcelona: 471-474: http://www.phonetiklabor.de/
Phonetiklabor/Inhalt/Ver%F6ffentlichungen/PDFs/Rhythm&Rate.
pdf
24. Fenk-Oczlon, G, Fenk, A. (2006). “Speech Rhythm and Speech
Rate in Crosslinguistic Comparison”. In: Sun, R, Miyake, N. (eds.).
Proceedings of the 28th Annual Conference of the Cognitive Science
Society. Mahwah, NJ: Erlbaum: 2480: http://wwwu.uni-klu.ac.at/
gfenk/Speech%20Rhythmfinal.pdf
25. Fox, A. (2002). Prosodic Feature and Prosodic Structure: the Pho-
nology of Suprasegmentals. Oxford: OUP.
26. Gilbert, J. (2005). Clear Speech: Pronunciation and Listening Com-
prehension in North American English. Cambridge: CUP.
27. Gilbert, J. (2008). Teaching Pronunciation: Using the Prosody
Pyramid. Cambridge: CUP.
28. Gimson, A.C. (1978). An Introduction to the pronunciation of Eng-
lish. 2nd ed. London: Arnold.
29. Gore, M. (2004). “A Review of Perceptual Approaches to Lan-
guage Rhythm”. http://ir.kagoshima-u.ac.jp/bitstream/10232/864/1/
KJ00004193565.pdf
30. Grabe, E, Low, E. L. (2002). “Durational variability in speech and
the rhythm class hypothesis”. Papers in laboratory phonology (7):
515-546: http://wwwhomes.uni-bielefeld.de/gibbon/AK-Phon/
Rhythmus/Grabe/Grabe_Low-reformatted.pdf
31. Hamdi, R, Barkat-Defradas, M, Ferragne, E, Pellegrino, F. (2004).
“Speech Timing and Rhythmic Structure in Arabic dialects: a compar-
ison of two approaches”. INTERSPEECH-2004: 1613-1616: http://
www.isca-speech.org/archive/archive_papers/interspeech_2004/
i04_1613.pdf
32. Harris, J. (1994). English Sound Structure. Oxford: OUP.
33. Inkelas, S, Zec, D. (1988). “Serbo-Croatian pitch accent: the interac-
tion of tone, stress, and intonation”. Language, vol. 64 (2). Linguistic
Society of America: 227-248: www.jstor.org/stable/415433
34. Jokanović-Mihajlov, J. (1990). „O modelima ritmičke organizacije
iskaza“. Naučni sastanak slavista u Vukove dane: 105-113.
35. Jokanović-Mihajlov, J. (2007). Akcenat i intonacija govora na radiju
i televiziji. Beograd: Društvo za srpski jezik i književnost Srbije.
36. Jones, D. (1978). An Outline of English Phonetics. Cambridge: CUP.

37. Jovičić, S. T. (1999). Govorna komunikacija: fiziologija, psihoakus-
tika i percepcija. Beograd: Izdavačko preduzeće „Nauka“.
38. Lehiste, I, Ivić, P. (1986). Word and Sentence Prosody in Serbocroa-
tian. Cambridge, MA: MIT Press.
39. Levelt, C, Van de Vijver, R. (1998). “Syllable types in cross-linguis-
tic and developmental grammars”. The Third Biannual Utrecht Pho-
nology Workshop, Utrecht.
40. McArthur, T. (ed.). (1992). The Oxford Companion to the English
Language. Oxford: OUP.
41. Nava, E, Zubizarreta, M. L. (2008). “Prosodic Transfer in L2 Speech:
Evidence from Phrasal Prominence and Rhythm”. Speech Prosody
2008. Campinas, Brazil: 335-338:
http://www.isca-speech.org/archive/sp2008/papers/sp08_335.pdf
42. O’Connor, J. D. (1991). Phonetics. London: Penguin Books.
43. Ordin, M.Yu, Setter, J.E. (2008a). “Objective Indicators of Rhyth-
mic Russian-English Transfer”. XX Session of the Russian Acousti-
cal Society, Moscow: 649-652: http://www.akin.ru/Docs/Rao/Ses20/
AR15.PDF
44. Ordin, M.Yu, Setter, J.E. (2008b). “Comparative Research of Tem-
poral Organization of the Syllable Structure in Hong Kong English,
Russian English, and British English”. XX Session of the Russian
Acoustical Society, Moscow: 653-656: http://www.akin.ru/Docs/
Rao/Ses20/AR16.PDF
45. Pamies Bertrán, A. (1999). “Prosodic Typology: On the Dichotomy
between Stress-Timed and Syllable-Timed Languages”. Language
Design, vol.2: 103-130: http://elies.rediris.es/Language_Design/
LD2/pamies.pdf
46. Patel, A. (2008). Music, Language, and the Brain. Oxford: OUP.
47. Pierrehumbert, J. (1980). The phonology and phonetics of English
intonation. PhD thesis. MIT: Indiana University Linguistics Club:
http://faculty.wcas.northwestern.edu/~jbp/publications/Pierrehumbert_PhD.pdf
48. Pike, K. L. (1945). Intonation of American English. Ann Arbor: Uni-
versity of Michigan Press.
49. Ramus, F, Dupoux, E, Mehler, J. (2003). “The psychological reality
of rhythm classes: Perceptual studies”. Proceedings of the 15th
International Congress of Phonetic Sciences, Barcelona: 337-342:
http://www.ehess.fr/lscp/persons/ramus/docs/ICPhS03.pdf
50. Ramus, F, Dupoux, E, Zangl, R, Mehler, J. (2000). “An empirical
study of the perception of language rhythm”. EHESS/CNRS:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.9167&rep=rep1&type=pdf
51. Ramus, F, Nespor, M, Mehler, J. (1999). “Correlates of linguistic
rhythm in the speech signal”. Cognition, vol.73(3): 265-292.
52. Ramus, F. (2002). “Acoustic correlates of linguistic rhythm:
Perspectives”. Proceedings of Speech Prosody 2002, Aix-en-Provence:
115-120: http://www.ehess.fr/lscp/persons/ramus/docs/ramus_sp02.pdf
53. Roach, P. (1982). “On the distinction between ‘stress-timed’ and
‘syllable-timed’ languages”. In: D. Crystal (ed.). Linguistic contro-
versies, Essays in linguistic theory and practice. London: Edward
Arnold: 73-79.
54. Roach, P. (1998). “Some Languages are Spoken More Quickly than
Others”. In: Bauer, L, Trudgill, P (eds.). Language Myths. Penguin:
150-158: http://www.personal.rdg.ac.uk/~llsroach/phon2/tempopr.htm
55. Roach, P. (2002). A Little Encyclopedia of Phonetics:
http://www.1insaat.com/uploads/TrbBlogs/pdfs_4/40625_1232781196_70.pdf
56. Sabater, M-J. S. (1991). “Stress and Rhythm in English”. Revista
Alicantina de Estudios Ingleses 4: 145-62: http://rua.ua.es/dspace/
bitstream/10045/5496/1/RAEI_04_13.pdf
57. Schiering, R. (2006). “Towards a Typology of Linguistic Rhythm”.
14th Manchester Phonology Meeting, University of Manchester:
http://www.rene.punksinscience.org/Schiering_Rhythm_14mfm.pdf
58. Sečujski, M. (2002). „Akcenatski rečnik srpskog jezika namenjen
sintezi govora na osnovu teksta“. DOGS: 17-20: http://alfanum.ftn.
ns.ac.yu/radovi/TTS.3.pdf
59. Setter, J. (2008). “L2 Prosody Research: Rhythm and Intonation”.
Talking English Phonetics: Proceedings of the 1st Belgrade Inter-
national Meeting of English Phoneticians (BIMEP 2008), Belgrade:
93-104.

60. Stanojčić, Ž, Popović, Lj. (1999). Gramatika srpskog jezika. Beo-
grad: Zavod za udžbenike i nastavna sredstva.
61. Steiner, I. (2003). “On the Analysis of Rhythm through Acoustic
Parameters (Zur Rhythmusanalyse mittels akustischer Parameter)”.
MA thesis. Institute for Communications Research & Phonetics,
University of Bonn: http://www.coli.uni-saarland.de/~steiner/pdf/
MA-Abstract.pdf
62. Steiner, I. (2004). “Tutorial 5: Analyzing Speech Rhythm”. 5th Euro-
pean Masters in Language and Speech Summer School, Institute for
Communications Research & Phonetics, University of Bonn: http://
www.cstr.ed.ac.uk/emasters/previous_summer_schools/2004_bonn/
steiner.pdf
63. Tajima, K, Zawaydeh, B. A, Kitahara, M. (1999). “A Comparative
Study of Speech Rhythm in Arabic, English, and Japanese”: http://
www.cs.indiana.edu/hyplan/mkitahar/Papers/0
64. Tajima, K. (1998). “Speech Rhythm in English and Japanese: Exper-
iments in Speech Cycling”. PhD thesis. Indiana University, Bloom-
ington, IN: http://ftp.cs.indiana.edu/hyplan/ktajima/thesis-1s
65. Tatham, M, Morton, K. (2001). “Intrinsic and Adjusted Unit Length
in English Rhythm Synthesis”. Proceedings of the Institute of
Acoustics – WISP 2001. St. Albans: Institute of Acoustics: 189-200:
http://www.morton-tatham.co.uk/publications/from1995/Tatham_Morton_2001.pdf
66. Trager, G. L. (1940). “Serbo-Croatian Accents and Quantities”.
Language, vol. 16 (1). Linguistic Society of America: 29-32:
www.jstor.org/stable/409091
67. Vidović, V. (1967). Engleski glasovi, naglasak, ritam i intonacija.
Beograd: Zavod za izdavanje udžbenika.
68. Wagner, P. S, Dellwo, V. (2004). “Introducing YARD (Yet Another
Rhythm Determination) and Re-Introducing Isochrony to Rhythm
Research”. Proceedings of Speech Prosody, Nara: http://aune.lpl.
univ-aix.fr/~sprosig/sp2004/PDF/Wagner-Dellwo.pdf
69. Zec, D, Zsiga, E. (2009). “Interactions of tone and stress in
Standard Serbian: phonological and phonetic evidence”. FASL 18.
Cornell University, New York:
http://conf.ling.cornell.edu/FASL18/Abstracts/Zec-Zsiga.pdf

Filozofski fakultet u Novom Sadu
Odsek za anglistiku
Dr Zorana Đinđića 2
21 000 Novi Sad
Tel: +381214853900
+381214853852
www.ff.uns.ac.rs

Printing and layout:
Štamparija FELJTON, Novi Sad
Stražilovska 17, Tel: 021/6622-867, 424-527

Print run:
150

CIP - Cataloguing in Publication
Matica Srpska Library, Novi Sad

811.111'342.9
811.163.41'342.9

BJELICA, Maja
Speech rhythm in English and Serbian : a critical study of
traditional and modern approaches / Maja Bjelica. - Novi Sad :
Filozofski fakultet, Odsek za anglistiku, 2012 (Novi Sad : Feljton). -
1 electronic optical disc (CD-ROM) ; 12 cm

Print run 150. - Notes and bibliographic references with the text. -
Bibliography.

ISBN 978-86-6065-111-4

a) English language - Speech - Rhythm b) Serbian language - Speech - Rhythm
COBISS.SR-ID 272612871
ISBN 866065111-1
ISBN 978-86-6065-112-1

