
English Read by Japanese Phonetic Corpus: An Interim Report

Takehiko Makino (Chuo University) and Rika Aoki (The University of Tokyo)
12/17/2011, Accents 2011, Łódź

Background
The purpose of this paper is to explain the development procedure of ERJ Phonetic Corpus and to report some findings from the small part of the Corpus that has been completed. A series of preliminary studies (Makino 2007, 2008, 2009) made it clear that a computerized, phonetically transcribed corpus of Japanese speakers' English speech was worth building, so the first author began compiling ERJ Phonetic Corpus from the ERJ speech database (Minematsu, et al. 2002a), which he had also used in the preliminary studies. Corpus studies of L2 pronunciation have been very rare (cf. Gut 2009, Meng, et al. 2009); we intend to fill this gap with this study.

ERJ speech database


ERJ stands for English Read by Japanese; the database was collected mainly to support CALL system development (Minematsu, et al. 2002a). 807 different sentences and 1,009 different words were read aloud by 100 male and 100 female speakers at 20 recording sites in Japan. All of the sites were universities and all the speakers were students there. Each sentence was read by about 12 speakers, and each word by about 20 speakers, of each sex. In total, the ERJ speech database consists of more than 70,000 speech files: 24,744 sentence files and 45,495 word files.

ERJ recording procedure


1. Before the recording, speakers were asked to practice pronouncing the sentences and words on the given sheets. In this practice, they were permitted to refer to reading sheets with phonemic and prosodic symbols.
2. In the recording sessions, speakers were asked to read the sentences and words aloud repeatedly until they were sure they had pronounced them correctly. If they made the same pronunciation error three times, they were allowed to skip that item and go on to the next one.


3. After the recording, each speech sample was checked by the technical staff of the recording site. If they found technical errors in any sentences or words, those items were recorded again. Minematsu, et al. (2002a) claim that, with this procedure, the pronunciation errors remaining in the database were made purely because of the speakers' lack of knowledge of English articulation.

Phonemic symbols used in the training sheets


The phonemic symbols used in the training sheets are based on those of the TIMIT database and the CMU pronouncing dictionary. The pronunciation model is therefore Mainstream American English.
Consonants: B, D, G, P, T, K, JH, CH, S, SH, Z, ZH, F, TH, V, DH, M, N, NG, L, R, W, Y, HH
Vowels: IY, IH, EH, EY, AE, AA, AW, AY, AH, AO, OY, OW, UH, UW, ER, AXR, AX
Each vowel was specified for degree of stress: 1 for primary, 2 for secondary and 0 for unstressed.
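For illustration, the correspondence between these symbols and the IPA notation familiar to Japanese learners can be expressed as a small lookup table. The sketch below follows general ARPAbet/CMU conventions and is our reconstruction, not a table taken from the ERJ materials:

    # Illustrative ARPAbet-to-IPA vowel mapping (general CMU conventions;
    # reconstructed for illustration, not taken from the ERJ sheets).
    ARPABET_VOWELS = {
        "IY": "i", "IH": "ɪ", "EH": "ɛ", "EY": "eɪ", "AE": "æ",
        "AA": "ɑ", "AW": "aʊ", "AY": "aɪ", "AH": "ʌ", "AO": "ɔ",
        "OY": "ɔɪ", "OW": "oʊ", "UH": "ʊ", "UW": "u",
        "ER": "ɝ", "AXR": "ɚ", "AX": "ə",
    }

    def parse_vowel(symbol: str):
        """Split a stress-marked vowel such as 'IY1' into (IPA, stress)."""
        if symbol[-1] in "012":
            return ARPABET_VOWELS[symbol[:-1]], int(symbol[-1])
        return ARPABET_VOWELS[symbol], None

    print(parse_vowel("IY1"))  # ('i', 1): primary stress
    print(parse_vowel("AX0"))  # ('ə', 0): unstressed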

Since IPA is used in English dictionaries in Japan, the above sets of symbols are unfamiliar to Japanese learners. To ensure that the speakers understood these symbols correctly, a website was prepared where they could listen to word examples of each phonemic symbol. On that website, they could also listen to example sentences with prosodic notations so that they could understand what those notations meant. However, how far the speakers made use of these learning materials was entirely up to them; it is possible that some speakers were influenced more by the spelling than by the phonemic notation.


Examples of training sheets


S1_0001 This was easy for us.
[DH IH1 S] [W AA1 Z] [IY1 Z IY0] [F AO1 R] [AH1 S]
S1_0002 Is this seesaw safe?
[IH1 Z] [DH IH1 S] [S IY1 S AO2] [S EY1 F]
S1_0003 Those thieves stole thirty jewels.
[DH OW1 Z] [TH IY1 V Z] [S T OW1 L] [TH ER1 T IY0] [JH UW1 AX0 L Z]

The phonemic notations were removed from the sheets used in the recording sessions, because it was inferred that reading sentences with phonemic notation could induce unnatural pronunciation.

Examples of the sentences with rhythmic specifications are shown below. @ stands for nuclear stress, + for secondary stress, and - for unstressed syllables. Here again, the phonemic notations were removed from the reading sheets used in the recording sessions, while the rhythmic specifications were retained.
S1_0106 Come to tea with John. /+ - + @/
[K AH1 M] [T UW1] [T IY1] [W IH1 DH] [JH AA1 N]
S1_0107 Come to tea with John and Mary. /+ - @/+ @ -/
[K AH1 M] [T UW1] [T IY1] [W IH1 DH] [JH AA1 N] [AE1 N D] [M EH1 R IY0]

Examples of the sentences specified for their intonation are shown below. Note that the intonation curves are not based on any theoretical framework; they only indicate what was considered important. Again, the phonemic notations were removed from the reading sheets used in the recording sessions, while the intonation curves were retained. The last line of each sentence is an instruction in Japanese about the meanings/attitudes which were (supposedly) conveyed by the intonation.


Corpus building procedure


Obviously, it is quite impractical to use the whole database for corpus building because of its sheer size. Fortunately, 9,494 of the sound files were given pronunciation proficiency scores by American teachers trained in phonetics in another study (Minematsu, et al. 2002b). They are grouped into five sets:
Sentence files scored for their individual sounds: 1,902
Sentence files scored for their rhythm: 952
Sentence files scored for their intonation: 952
Word files scored for their individual sounds: 3,786
Word files scored for their stress pattern: 1,902


In ERJ Phonetic Corpus, we have chosen to transcribe only the first group, i.e., the 1,902 sentence files scored for their individual sounds. The reason for this choice is that the other sentence groups carried rhythm or intonation specifications, which could have distorted what Japanese speakers normally do when they read English aloud. The word groups were not chosen because we are not interested in the pronunciation of isolated words.


To reduce the effort of manual transcription, the files were pre-processed with the Penn Phonetics Lab Forced Aligner (Yuan and Liberman 2008; http://www.ling.upenn.edu/phonetics/p2fa/), which produced forced-aligned transcriptions of English words and phonemes for each file in the TextGrid format.
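As a rough sketch of this pre-processing step: the aligner is driven by its align.py script, which takes a wav file and a plain-text transcript and writes a TextGrid. The directory layout and file names below are hypothetical, and p2fa itself requires a working HTK installation:

    # Batch forced alignment with p2fa (hypothetical paths; align.py is
    # the entry-point script shipped with the aligner).
    import os
    import subprocess

    WAV_DIR = "erj/wav"          # hypothetical layout
    TXT_DIR = "erj/transcripts"
    TG_DIR = "erj/textgrids"

    for name in sorted(os.listdir(WAV_DIR)):
        if not name.endswith(".wav"):
            continue
        stem = name[:-4]
        subprocess.run(
            ["python", "align.py",
             os.path.join(WAV_DIR, name),
             os.path.join(TXT_DIR, stem + ".txt"),
             os.path.join(TG_DIR, stem + ".TextGrid")],
            check=True,
        )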

The p2fa is designed for Mainstream American speech, so it was inevitable that the Japanese speakers' speech resulted in transcriptions full of errors.


Then, using Praat (Boersma and Weenink 2011), the TextGrids were re-formatted into four tiers (target words, target phones, substitutions and actual phones). The actual phones were manually transcribed, and boundaries of target phones and target words were manually aligned to the actual phones. The second author was involved at this very important stage.
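This restructuring was done inside Praat; an equivalent step can be sketched in Python, assuming the third-party 'textgrid' package and the tier names produced by p2fa (both assumptions, not the authors' actual script):

    # Sketch: reshape a p2fa TextGrid into the four tiers used in the
    # Corpus. The source tier names "word" and "phone" are assumptions.
    import textgrid

    tg = textgrid.TextGrid.fromFile("S1_0001.TextGrid")  # hypothetical file
    out = textgrid.TextGrid(maxTime=tg.maxTime)

    # Copy the aligner's output into the two "target" tiers.
    for new_name, old_name in [("target words", "word"),
                               ("target phones", "phone")]:
        src = tg.getFirst(old_name)
        tier = textgrid.IntervalTier(name=new_name, maxTime=tg.maxTime)
        for iv in src:
            tier.add(iv.minTime, iv.maxTime, iv.mark)
        out.append(tier)

    # Empty tiers to be filled in during manual transcription.
    out.append(textgrid.IntervalTier(name="substitutions", maxTime=tg.maxTime))
    out.append(textgrid.IntervalTier(name="actual phones", maxTime=tg.maxTime))
    out.write("S1_0001_4tier.TextGrid")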

The substitution tier was the same as the actual-phone tier, except that two or more consecutive actual phones were conflated into one unit when they corresponded to a single target phone. This tier exists only for searching purposes, while the actual-phone tier retains the duration information of each actual phone.
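A sketch of this conflation step; the interval layout, tolerance and toy data are our assumptions, not the authors' exact procedure:

    # Consecutive actual phones falling inside one target-phone interval
    # are joined into a single substitution label, so that [d]+[z]
    # against target /DH/ becomes one searchable unit "dz".
    # Intervals are (start, end, label) triples.
    def build_substitution_tier(target_phones, actual_phones, eps=1e-3):
        substitutions = []
        for t_start, t_end, t_label in target_phones:
            inside = [a for a in actual_phones
                      if a[0] >= t_start - eps and a[1] <= t_end + eps]
            label = "".join(a[2] for a in inside)
            substitutions.append((t_start, t_end, label))
        return substitutions

    target = [(0.10, 0.25, "DH"), (0.25, 0.40, "IH")]
    actual = [(0.10, 0.18, "d"), (0.18, 0.25, "z"), (0.25, 0.40, "i")]
    print(build_substitution_tier(target, actual))
    # [(0.1, 0.25, 'dz'), (0.25, 0.4, 'i')]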

The corrected TextGrids were then imported into ELAN (Sloetjes and Wittenburg 2008; http://www.lat-mpi.eu/tools/elan/), which has much better search functionality.
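For batch queries outside the ELAN GUI, the finished .eaf files can also be read programmatically. A sketch assuming the third-party 'pympi' package and the tier names introduced above:

    # Sketch: find all tokens whose substitution label is "dz"
    # (e.g. target /DH/ realized as [dz]). The tier name and file
    # layout are assumptions; pympi returns times in milliseconds.
    import glob
    from pympi import Eaf

    for path in glob.glob("corpus/*.eaf"):  # hypothetical layout
        eaf = Eaf(path)
        for start, end, value in eaf.get_annotation_data_for_tier("substitutions"):
            if value == "dz":
                print(path, start, end, value)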


The resulting .eaf files and the original .wav files constitute the complete individual data of the Corpus. So far, fewer than 10% of the files have been completed, and the corpus building is still in its initial stage.


Preliminary findings
In this tiny micro-corpus, the following consonantal tendencies, among others, have been found:


Voiced plosives
The voiced plosive phonemes are frequently spirantized: 35% for /b/, 9% for /d/ and 7% for /g/. These phonemes are regularly spirantized between vowels in Japanese, so this distribution seems entirely natural. But look at the individual cases below:


/b/ - ranking (target -> actual, percentage, count)
|B| -> |b|                     53.19%   25
|B| -> (symbol lost)           31.91%   15   (presumably the spirantized variant)
|B| -> |b| (diacritic lost)     2.13%    1   (four such rows)
|B| -> |p|                      2.13%    1
|B| -> |v|                      2.13%    1
|B| -> (symbol lost)            2.13%    1

/b/ -> spirantized variant: phonetic contexts
(The context table did not survive conversion; recoverable fragments show |B| in contexts such as |AE|_|N|, |AA|_|AH0|, |AH0|_|AH0|, |IH|_|L|, |sp|_|R|, |AH0|_|R|, |ER0|_|R| and |AH0|_|T|, with one context occurring 3 times (7.32%) and the rest once each (2.44%).)

/d/ - ranking (target -> actual, percentage, count)
|D| -> |d|                     65.52%   38
|D| -> |d| (diacritic lost)    12.07%    7
|D| -> |t|                      5.17%    3
|D| -> (symbol lost)            5.17%    3
|D| -> |d| (diacritic lost)     1.72%    1
|D| -> |t| (diacritic lost)     1.72%    1   (two such rows)
|D| -> |z|                      1.72%    1
|D| -> (symbol lost)            1.72%    1   (three such rows)

/d/ -> minor variants ([z] and four variants whose symbols were lost): phonetic contexts
(The context table did not survive conversion; recoverable fragments show |D| occurring after |L|, |ER0|, |IH|, |B|, |IH1|, |Z|, |AH0| and |UW|, each context once, 1.85% apiece.)

/g/ - ranking (target -> actual, percentage, count)
|G| -> |g|                     83.34%   25
|G| -> |g| (diacritic lost)     3.33%    1   (two such rows)
|G| -> |x|                      3.33%    1
|G| -> (symbol lost)            3.33%    1   (two such rows)

Variants of /g/: phonetic contexts
(The context table did not survive conversion; recoverable fragments show |G| occurring after |AH0|, |IH0|, |EH|, |UH|, |Z|, |DH| and pause.)

Voiceless plosives
The voiceless plosive phonemes are also often spirantized: 14% for /p/, 7% for /t/ and 6% for /k/. This cannot be a case of L1 transfer, because this sort of weakening is not considered normal in Japanese speech. There might be other reasons for it.


/p/ - ranking (target -> actual, percentage, count)
|P| -> |p|                     51.72%   30
|P| -> |p| (diacritic lost)    29.31%   17
|P| -> (symbol lost)           13.79%    8   (presumably the spirantized variant)
|P| -> |p| (diacritic lost)     1.72%    1   (three such rows)

/p/ -> minor variants: phonetic contexts
(The context table did not survive conversion; recoverable fragments show |P| occurring after |IY|, |L|, |N| and pause, and before |G|, |R|, |S| and pause.)

/t/ - ranking (target -> actual, percentage, count)
|T| -> |t|                     61.6%    77
|T| -> |t| (diacritic lost)    22.4%    28
|T| -> |t| (diacritic lost)     3.2%     4
|T| -> |t| (diacritic lost)     2.4%     3
|T| -> |ts|                     1.6%     2
|T| -> |t| (diacritic lost)     1.6%     2   (two such rows)
|T| -> |d|                      0.8%     1
|T| -> |s|                      0.8%     1
|T| -> |t| (diacritic lost)     0.8%     1   (two such rows)
|T| -> (symbol lost)            0.8%     1   (three such rows)

/k/ - ranking (target -> actual, percentage, count)
|K| -> |k|                     52.56%   41
|K| -> |k| (diacritic lost)    38.46%   30
|K| -> |x|                      5.13%    4
|K| -> |k| (diacritic lost)     1.28%    1   (two such rows)
|K| -> |xk|                     1.28%    1

Voiced (inter)dental fricatives


/ð/ is very frequently mispronounced: only 13% of tokens are realized as [ð]. The most frequent realization is [d], which accounts for 32%; the next most frequent are [dz] (27%) and [z] (21%).

/ð/ - ranking (target -> actual, percentage, count)
|DH| -> |d|                    32.43%   12
|DH| -> |dz|                   27.03%   10
|DH| -> |z|                    21.62%    8
|DH| -> |ð|                    13.51%    5
|DH| -> |d| + lost symbol       5.41%    2

/ð/ -> [d]: phonetic contexts
(The context table did not survive conversion; recoverable fragments show the [d] tokens occurring after pause, |ER0|, |G| and |Z|, and before |AE|, |AH|, |IH| and |OW|; 12 tokens in all.)

/ð/ -> [dz]: phonetic contexts
(Recoverable fragments show the [dz] tokens occurring after |ER0|, |N|, |AH|, |R|, |IY0| and pause, and before |AH0|, |ER0| and |OW|; 10 tokens in all.)

/ð/ -> [z]: phonetic contexts
(Recoverable fragments show the [z] tokens occurring after |AH|, |AH1|, |P|, |N|, |L| and pause, and before |ER0|, |AH0|, |IY0| and |OW|; 8 tokens in all.)

/n/
/n/ is pronounced as some sort of nasalized vowel in more than 30% of the cases. This again can be predicted from Japanese phonology, whose moraic nasal /N/ is regularly realized as a nasalized vowel before a vowel, semivowel, or /h/. The details are too numerous to discuss in this presentation, however.

/n/ - ranking (target -> actual, percentage, count)
|N| -> |n|                     45.14%   65
|N| -> (symbol lost)           18.75%   27   (presumably a nasalized vowel)
|N| -> |m|                      9.72%   14
|N| -> (symbol lost)            6.25%    9
|N| -> (symbol lost)            4.17%    6
|N| -> (symbol lost)            3.47%    5
|N| -> (symbol lost)            2.08%    3   (three such rows)
|N| -> (symbol lost)            0.69%    1   (nine such rows; some were |n| with lost diacritics)

/n/ - phonetic contexts
(The context table did not survive conversion; recoverable fragments show |N| most often preceded by |AH0|, |AE|, |EH|, |IH|, |ER|, |AA|, |AH| and |AY|, and followed by |S|, |Z|, |SH|, |DH|, |W|, vowels or pause.)

Remaining problems
1. Lack of prosodic notation.
The Corpus is intended to be a source of all the phonetic characteristics of Japanese speakers' English speech, so prosodic notation is also necessary. But L2 prosody is very difficult to describe. Studies such as Gut (2009) and Li, et al. (2011) use English ToBI, which I think is the wrong thing to do: an L2 prosodic system is neither that of the L1 nor that of the target language, but a mixture of the two. I will discuss this problem and propose a notational system for the prosody of Japanese speakers' English in Makino (forthcoming).

2. Inefficiency of manual transcription.
The development of spoken corpora lags far behind that of written corpora for obvious reasons, although it can be facilitated by automatic speech recognition technologies. The development of L2 spoken corpora is more difficult still, because ASR technologies have not been developed for non-native speech. Even more difficult is an L2 phonetic corpus like the one I am building, because narrow phonetic transcription (independent of any particular language) is required.

Tsubaki and Kondo (2011) tried using ASR technologies in the development of their L2 English corpus, with reasonably good results, but this entailed enriching the dictionary with all the possible pronunciations of each entry that could be conceived of in terms of the contrastive phonetics of the two languages. Unless the necessary dictionary is very small, as theirs was, I do not think this practical. So for now all we can do is transcribe the sounds manually and wait for the advent of a new, useful technology.
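Their dictionary-enrichment idea can be sketched as a simple variant expansion over each entry's phoneme string; the rules and entry below are toy examples suggested by the substitutions reported earlier, not Tsubaki and Kondo's actual rule set:

    # Sketch: expand a dictionary entry with L1-predicted pronunciation
    # variants (toy rules; a real system would be far more elaborate).
    from itertools import product

    VARIANTS = {
        "DH": ["DH", "D", "DZ", "Z"],  # /ð/ variants observed above
        "TH": ["TH", "S"],             # assumed /θ/ variants
    }

    def expand(pron):
        options = [VARIANTS.get(p, [p]) for p in pron]
        return [" ".join(v) for v in product(*options)]

    for variant in expand(["DH", "IH1", "S"]):  # "this"
        print(variant)
    # DH IH1 S / D IH1 S / DZ IH1 S / Z IH1 S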

Further work
A different set of files (about 800) is to be included in ERJ Phonetic Corpus. Those files were selected diagonally from the ERJ database for another study (Minematsu, et al. 2011), in which the recordings were played over the telephone to Americans who were unfamiliar with Japanese speakers' English and who were asked to repeat the sentences they heard. With these data, we will be able to explore what sorts of actual phones tend to be misheard or not understood at all, which can serve as the basis for a study of intelligibility. In fact, this set of data is currently given priority over the one dealt with in this study.


Acknowledgement
The research for this study was supported by Grant-in-Aid for Scientific Research (B) No.23300067 (project leader: Nobuaki Minematsu) from the Japan Society for the Promotion of Science, and a Chuo University Grant for Special Research.


References
Boersma, P. and D. Weenink. (2011) Praat: doing phonetics by computer. http://www.praat.org/ [Accessed 30 September 2011]
Gut, U. (2009) Non-native Speech: A Corpus-based Analysis of Phonological and Phonetic Properties of L2 English and German. Frankfurt: Peter Lang.
Li, M., S. Zhang, K. Li, A. Harrison, W.-K. Lo and H. Meng. (2011) Design and collection of an L2 English corpus with a suprasegmental focus for Chinese learners of English. Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong.
Makino, T. (2007) A corpus of Japanese speakers' pronunciation of American English: preliminary research. Proceedings of the Phonetics Teaching and Learning Conference 2007 (PTLC2007), University College London, UK.
Makino, T. (2008) Consonant substitution patterns in Japanese speakers' English. Paper read at Accents 2008: II International Conference on Native and Non-native Accents of English, University of Łódź, Poland.
Makino, T. (2009) Vowel substitution patterns in Japanese speakers' English. In Biljana Čubrović and Tatjana Paunović (eds.), Ta(l)king English Phonetics Across Frontiers, pp. 19-31. Newcastle: Cambridge Scholars Publishing.
Makino, T. (forthcoming) Devising a notational system for the interlanguage prosody of Japanese speakers' English speech: a pilot study for corpus building.
Meng, H., C. Tseng, M. Kondo, A. Harrison and T. Visceglia. (2009) Studying L2 suprasegmental features in Asian Englishes: a position paper. Proceedings of Interspeech 2009, Brighton, UK.
Minematsu, N., et al. (2002a) English speech database read by Japanese learners for CALL system development. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2002), pp. 896-903. Las Palmas, Spain.
Minematsu, N., et al. (2002b) Rating of Japanese students' utterances of English by native English teachers. (In Japanese.) Proceedings of the Fall 2002 General Meeting of the Acoustical Society of Japan.
Minematsu, N., et al. (2011) Measurement of objective intelligibility of Japanese accented English using ERJ (English Read by Japanese) database. Proceedings of Interspeech 2011, Florence, Italy.
Sloetjes, H. and P. Wittenburg. (2008) Annotation by category - ELAN and ISO DCR. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Tsubaki, H. and M. Kondo. (2011) Analysis of L2 English speech corpus by automatic phoneme alignment. Proceedings of SLaTE 2011, Venice, Italy.
Yuan, J. and M. Liberman. (2008) Speaker identification on the SCOTUS corpus. Proceedings of Acoustics '08, Paris, France.


Dziękuję!

