Windows install
Make a CMUSphinx directory. Mine is d:\Stephans\CMUSphinx
From http://cmusphinx.sourceforge.net/wiki/download/, download snapshots of pocketsphinx, sphinxbase, and sphinxtrain to your CMUSphinx directory
I have made Windows binaries. They are available from the class web page.
If you get the binaries, you still need the full sphinxtrain download as well (so you will need to download two versions of sphinxtrain)
First, get and decompress the complete version
Second, get the executables and put them in SphinxTrain\bin\Release (you will need to make this directory)
This way the directory and file structure is the same as if you had compiled the files
Put binaries of sphinxbase in CMUSphinx/sphinxbase/bin/Release
Put binaries of pocketsphinx in CMUSphinx/pocketsphinx/bin/Release
To run on Android, you need the full version of pocketsphinx, but this only compiles on Linux. We will do this later
If you do not use the Windows binaries, compiling requires MS Visual Studio 2010 (I used Visual Studio 2010 Ultimate)
If you are a student, you can get it for free from
http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=29950cc3-3670-e011-971f0030487d8897&vsro=8&JSEnabled=1
Or find the MSDNAA link at https://www.eecis.udel.edu/wiki/ececis-docs/index.php/FAQ/Applications
You will also need an eecis account. You can sign up for one.
The snapshots include .sln Visual Studio 2010 project files (earlier versions will not work)
First download and build sphinxbase
Open Visual Studio. File -> Open -> Project/Solution. Navigate to and select the .sln file
Before building, switch to Release. Select Build -> Configuration Manager; under Active solution configuration, change from Debug to Release
To build: Build -> Build Solution
For Python, get v2.7.X
Once Python is installed, add its directory to your path
Add the path to the sphinxtrain binaries
If you downloaded the binaries, add the path to where you placed them
Running pocketsphinx
Note the audio file in CMUSphinx\pocketsphinx\test\data\goforward.raw
Open a terminal and change directory to d:\Stephans\CMUSphinx\pocketsphinx\bin\Release
pocketsphinx_batch.exe should be there, unless the compile failed
Make file ctlFile.txt containing the name of the file we will decode
    goforward
Make file called argFile.txt with contents (more about these later)
    -hmm ../../model/hmm/en_US/hub4wsj_sc_8k -lm ../../model/lm/en/turtle.DMP -dict ../../model/lm/en/turtle.dic
Move
    CMUSphinx/sphinxbase/bin/Release/sphinxbase.dll
    To CMUSphinx/pocketsphinx/bin/Release
Move
    CMUSphinx\pocketsphinx\test\data\goforward.raw
    To CMUSphinx\pocketsphinx\bin\Release\goforward.raw
Run
    pocketsphinx_batch.exe -argfile argFile.txt -cepdir ../../test/data -ctl ctlFile.txt -cepext .raw -adcin true -hyp out.txt
Note: the command line arguments must be in this order! In particular, -cepdir must come before -ctl
Where
    -argfile argFile.txt defines the name of the arguments file. These arguments are displayed on the screen when the program runs, so you can check that they match
    -cepdir ../../test/data defines the path to the files to be processed
    -ctl ctlFile.txt defines the ctlFile, which contains the names of the files to process. These names cannot have the path or the extension
    -cepext .raw defines the extension of the files in the ctlFile
    -adcin true means that the files are audio files
    -hyp out.txt defines the output file
More details on the parameters are at http://manpages.ubuntu.com/manpages/lucid/man1/pocketsphinx_batch.1.html
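The two text files and the command line can also be produced by a script. The following is a minimal Python sketch, not part of the original slides: the helper names write_decode_files and build_batch_command are made up for illustration, and the flags are simply the ones used in the command above. It only writes the files and prints the command; actually running pocketsphinx_batch.exe is left as a commented hint.

```python
import os
import tempfile

# Flags copied from the argFile.txt contents shown in the slides.
ARG_FILE_TEXT = (
    "-hmm ../../model/hmm/en_US/hub4wsj_sc_8k "
    "-lm ../../model/lm/en/turtle.DMP "
    "-dict ../../model/lm/en/turtle.dic\n"
)


def write_decode_files(directory, file_ids):
    """Write ctlFile.txt (one name per line, no path or extension) and argFile.txt."""
    ctl_path = os.path.join(directory, "ctlFile.txt")
    arg_path = os.path.join(directory, "argFile.txt")
    with open(ctl_path, "w") as ctl:
        for file_id in file_ids:
            ctl.write(file_id + "\n")
    with open(arg_path, "w") as arg:
        arg.write(ARG_FILE_TEXT)
    return ctl_path, arg_path


def build_batch_command(cepdir, cepext, hyp):
    """Return the argument list in the order the slides use (-cepdir before -ctl)."""
    return [
        "pocketsphinx_batch.exe",
        "-argfile", "argFile.txt",
        "-cepdir", cepdir,
        "-ctl", "ctlFile.txt",
        "-cepext", cepext,
        "-adcin", "true",
        "-hyp", hyp,
    ]


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()  # stand-in for pocketsphinx\bin\Release
    write_decode_files(work_dir, ["goforward"])
    command = build_batch_command("../../test/data", ".raw", "out.txt")
    print(" ".join(command))
    # To decode for real, run from pocketsphinx\bin\Release:
    #   import subprocess; subprocess.call(command)
```

Keeping the arguments in a list makes the required ordering explicit and easy to check.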
Change ctlFile.txt to
myGoForward
In terminal run
pocketsphinx_batch.exe -argfile argFile.txt -cepdir ./ -ctl ctlFile.txt -cepext .wav -adcin true -hyp out2.txt
Download data
http://www.speech.cs.cmu.edu/databases/an4/index.html
Get the mswav version
Save it to your CMUSphinx directory
Decompress
To CMUSphinx\an4\etc
Other changes
Copy sphinxbase.dll from
CMUSphinx\sphinxbase\bin\Release to CMUSphinx\SphinxTrain\bin\Release
Copy files
From CMUSphinx\pocketsphinx\bin\Release, copy
pocketsphinx_batch.exe and pocketsphinx.dll to CMUSphinx\SphinxTrain\bin\Release
Try skipping this and setting line 243 of .cfg
check
Open a cmd prompt
Type path and make sure that
the directory to python is there
SphinxTrain\bin\Release is there
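The same check can be done from Python instead of reading the output of path by eye. This is a small sketch under the assumption that a simple substring match on each PATH entry is good enough; the function name missing_from_path is made up for illustration.

```python
import os


def missing_from_path(required_fragments, path_value=None):
    """Return the fragments that do not appear in any PATH entry (case-insensitive)."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [entry.lower() for entry in path_value.split(os.pathsep) if entry]
    missing = []
    for fragment in required_fragments:
        if not any(fragment.lower() in entry for entry in entries):
            missing.append(fragment)
    return missing


if __name__ == "__main__":
    # The two entries the check asks for: python and the SphinxTrain binaries.
    wanted = ["python", os.path.join("SphinxTrain", "bin", "Release")]
    print(missing_from_path(wanted))
```

An empty list means both directories were found; anything printed still needs to be added to the path.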
Run training
Change to the CMUSphinx\an4 directory
Run
python ..\SphinxTrain\scripts\sphinxtrain.in run
This can fail because Python was not installed, or because the path to Python or to SphinxTrain was not set
Check log
Open an4.html
Check for errors
MODULE: 30 Training Context Dependent models
A few errors of type: Failed to align audio to trancript: final state of the search is not reached are acceptable
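Separating the acceptable alignment failures from other errors can be sketched in Python. This is an illustration only: it assumes error lines in the log contain the word ERROR, which is an assumption about the log format and not something stated in the slides, and the function name unexpected_errors is made up.

```python
# Message fragment quoted in the slides as acceptable in small numbers.
ACCEPTABLE = "final state of the search is not reached"


def unexpected_errors(log_lines):
    """Return lines that mention ERROR but are not the acceptable alignment failure.

    Assumption: error lines contain the word ERROR (not confirmed by the slides).
    """
    flagged = []
    for line in log_lines:
        if "ERROR" in line and ACCEPTABLE not in line:
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    # Made-up log lines, for illustration only.
    sample = [
        "MODULE: 30 Training Context Dependent models",
        "ERROR: Failed to align audio to trancript: final state of the search is not reached",
        "ERROR: some other problem",
    ]
    for line in unexpected_errors(sample):
        print(line)
```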
test
background
To a first approximation, words are sequences of sounds, where each sound is a phone. However, the exact pronunciation of a phone depends on the phones before and after it. Diphones are two phones. Diphones are less impacted by the phones that come before or after. Triphones and quinphones are also possible. In CMU Sphinx, the detectors for these context-dependent sounds are called senones.
While there are many phones, not every combination of phones is a word. Thus, we should not simply recognize phones, but recognize words as sequences of phones.
Besides phones, there are fillers (e.g., breath, um). An utterance is a sequence of words and fillers. Utterances are separated by pauses.
models
Three types of models are used
acoustic model
Used to model the sound of a phone
Typically, an HMM is used
Each phone has an HMM
Mapping from HMMs to phones
Since the acoustic model is an HMM, in CMU Sphinx the HMM is the same as the acoustic model
phonetic dictionary
Maps words to phones
In CMU Sphinx, .dic files are dictionary files
language model
Used to determine which sequences of words are allowed. For example, "he super run the sally" is not allowed in the language model
Accurate transcription of the recording
There are many acoustic models available online
It is possible to take an existing model and quickly adapt it to a particular speaker
Language Model
Different systems need different language models
A voice control for your TV needs to recognize only a few words like volume up, change channel
A voice driven email composer needs to recognize a different set of words
The performance of the recognizer is improved if your language model only considers the relevant words. You can take an existing language model and trim it to what you need, or make one from scratch
Many models are available from http://www.ldc.upenn.edu/Catalog/index.jsp
example
To explore acoustic and language models, get the AN4 database
http://www.speech.cs.cmu.edu/databases/an4/index.html Save it to your CMUSphinx directory Decompress
This data is from letters and numbers, e.g., A, B, 19
We can test this system by saying things like A, B, etc.
The acoustic model is used to translate recorded sounds into labeled phones,
e.g., recorded sound in file asc.wav is AH
Acoustic model
Roughly speaking, acoustic models take the sound sample as input and give the quality of fit as output
asc.wav -> AH-Model -> -12
asc.wav -> AY-Model -> -14
AH-Model gives a better fit of the recorded sound
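Choosing the phone amounts to picking the model with the highest (least negative) score. A tiny Python sketch using the two scores from the example; the function name best_model is made up for illustration.

```python
def best_model(scores):
    """Return the model name with the highest score (the least negative fit)."""
    return max(scores, key=scores.get)


# Scores for asc.wav taken from the example in the slides.
fit_scores = {"AH": -12, "AY": -14}
print(best_model(fit_scores))  # AH
```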
Challenge: Usually the audio file has many phones, not just one
E.g., from AN4 data set, an audio file contains a recording of the words TWO SIX EIGHT FOUR FOUR ONE EIGHT
CMUSphinx\an4\wav\an4_clstk\fash\cen7-fash-b.wav
E.g. from PDA data set, an audio file might contain a recording of the words: MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS
CMUSphinx\PDA\PDAs\001\PDAs01_001_1.wav
Transcriptions
Approach one: the recording from the PDA set is transcribed as: M AA R JH AX N Z SIL HH IX S T AO R IX K AX L IY SIL ...
Two problems with approach one
If the word margins occurs in other files, we need to enter the pronunciation of the word again for each file
There are two ways that people pronounce historically
HH IX S T AO R IX K AX L IY
HH IX S T AO R IX K L IY (this one actually says historicly, which is incorrect)
Dictionary file
A mapping from words to phones (elementary spoken sounds)
Allows words to have multiple pronunciations
E.g., the AN4 dataset includes the file an4.dic and it includes the lines
ELEVEN     IH L EH V AH N
ELEVEN(2)  IY L EH V AH N
E          IY
By combining the transcript file and dictionary file, the sounds in each recorded audio file can be determined
This is a major challenge facing training
Recall, the overall goal of training is to find models for each sound. But to make the training process easier for the users, we only provide recordings of words and sentences.
However, it is a bit tricky to determine which part of the audio file corresponds to which sound.
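Combining a transcript with the dictionary can be sketched in a few lines of Python. This is an illustration, not SphinxTrain's own code: it parses the three an4.dic lines quoted in the slides, treats a suffix like (2) as an alternate pronunciation of the same word, and expands a word sequence using the first pronunciation of each word. The function names are made up.

```python
import re

# The three an4.dic entries quoted in the slides.
DIC_TEXT = """\
ELEVEN     IH L EH V AH N
ELEVEN(2)  IY L EH V AH N
E          IY
"""


def parse_dictionary(text):
    """Map each word to a list of pronunciations, each a list of phones."""
    pronunciations = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Strip an alternate-pronunciation marker such as "(2)".
        word = re.sub(r"\(\d+\)$", "", fields[0])
        pronunciations.setdefault(word, []).append(fields[1:])
    return pronunciations


def transcript_to_phones(words, pronunciations):
    """Expand a word sequence to phones using the first pronunciation of each word."""
    phones = []
    for word in words:
        phones.extend(pronunciations[word][0])
    return phones


if __name__ == "__main__":
    dictionary = parse_dictionary(DIC_TEXT)
    print(len(dictionary["ELEVEN"]))  # 2
    print(" ".join(transcript_to_phones(["E", "ELEVEN"], dictionary)))
```

The expansion gives the phone sequence but not where each phone starts and ends in the audio, which is the tricky part that training has to solve.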
Files needed
Training
your_db_train.fileids - List of the training audio files, one per line, as path/filename (without extension!)
The path is from where the SphinxTrain program is executed
E.g., in an4_train.fileids the path is relative to the AN4 directory that holds /etc, so SphinxTrain needs to be run from this directory
your_db_train.transcription - Transcription for training (described on previous slide)
your_db.dic - Phonetic dictionary (described on previous slide)
your_db.filler - List of fillers and what they map to
Fillers are things like silence, breathing, um, etc.
Fillers should also be used in the transcript
E.g., <s> TWO +UM+ SIX EIGHT FOUR FOUR ONE EIGHT </s>
Fillers use the + sign before and after
During training, models for fillers will be computed
Decoding is more complicated
Fillers are allowed to be added, but there is some penalty
The fillers are ignored when computing the probability of a sequence of words
E.g., the language model might tell us that go to bed is common, and go up bed is uncommon. If the decoder detects go um to bed, it translates it to go to bed
For some reason, fillers are not used in the an4 and PDA transcript files
Only the silences <s>, </s>, and SIL are included
SMACK is listed in the PDA filler file, but not in the transcript
E.g., a filler file contains
</s>        SIL
<s>         SIL
<sil>       SIL
++INHALE++  +INHALE+
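Ignoring fillers when looking at a word sequence can be illustrated with a short Python sketch. It is not the decoder's actual logic: it simply assumes that fillers are either one of the silence markers listed in the slides or a token wrapped in + signs, and drops them. The names SILENCE_TOKENS, is_filler and strip_fillers are made up.

```python
# Silence markers mentioned in the slides.
SILENCE_TOKENS = {"<s>", "</s>", "<sil>", "SIL"}


def is_filler(token):
    """Treat silence markers and +NAME+ style tokens as fillers."""
    if token in SILENCE_TOKENS:
        return True
    return len(token) > 2 and token.startswith("+") and token.endswith("+")


def strip_fillers(tokens):
    """Return only the real words, in their original order."""
    return [token for token in tokens if not is_filler(token)]


if __name__ == "__main__":
    decoded = "<s> TWO +UM+ SIX EIGHT FOUR FOUR ONE EIGHT </s>".split()
    print(" ".join(strip_fillers(decoded)))
```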
File format
Must have sphinxtrain/bin/debug in path
Must copy sphinxbase.dll to sphinxtrain/bin/debug or set the path to it
Move the pocketsphinx exe and dll
Edit sphinxtrain.in to remove /log and set prefix to path
Must use python 2.7
Delete an4.html before running
This is a log file. It will not exist before the first run. If you run and find errors, you can check it. But make sure to delete it before running again so you only see the errors from the latest run
Language model
Language models define which combinations of words are allowed.
And, which combinations are more common or less common How often a word appears
Words: Go, stop, hi, bye
Combinations with 2 words: Go forward; go back
Note that the length of these sequences can be 2, 3, ...
The language model cannot specify all combinations of any length, so only combinations up to some length (e.g., 2 or 3) are specified
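Counting word combinations up to a fixed length can be sketched in Python as follows. This only produces raw counts and is an illustration; a real language model tool also turns the counts into smoothed probabilities. The sentence markers <s> and </s> are added around each sentence, and the function name count_ngrams is made up.

```python
from collections import Counter


def count_ngrams(sentences, max_order=2):
    """Count every word combination of length 1..max_order in the sentences."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    counts = count_ngrams(["go forward", "go back"])
    for ngram, count in sorted(counts[2].items()):
        print(" ".join(ngram), count)
```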
There is an online language model maker that takes sentences, counts the combinations of words, and makes an ARPA file
If you make your own ARPA file, you must sort it before using it
sphinx_lm_sort < unsorted.arpa > sorted.arpa
sphinx_lm_convert -i sorted.arpa -o sorted.lm.DMP
Note that sometimes files that end in .lm are in the ARPA format
<header - information ignored by applications>
\data\
ngram 1=9
ngram 2=11
ngram 3=3

\1-grams:
-0.8953 <unk> -0.7373
-0.7404 </s> -0.6515
-0.7861 <s> -0.1764
-1.0414 When -0.4754
-1.0414 will -0.1315
-0.9622 the 0.0080
-1.4393 Stock -0.3100
-1.0414 Go -0.3852
-0.9622 Up -0.1286

\2-grams:
-0.3626 <s> When -0.1736
-1.2765 <s> the 0.0000
-1.2765 <s> Up 0.0000
-0.2359 When will 0.1011
-1.0212 will </s> 0.0000
-0.4191 will the 0.0000
-1.1004 the </s> 0.0000
-1.1004 the Go 0.0000
-0.6232 Stock Go 0.0000
-0.2359 Go Up 0.0587
-0.4983 Up </s>

\3-grams:
-0.4260 <s> When will
-0.6601 When will the
-0.6601 Go Up </s>

\end\
ARPA format
The \data\ section specifies how many entries there are of each order
The numbers are log10 of probabilities
For the 3-gram entry
-1.2 go to bed -.1
The first number, -1.2, is the log10 of the probability that the last word (bed) occurs given that the first two words (go to) have occurred
There might be other 3-grams like go to sleep, etc.
The second number, -.1, is the log10 of the back-off weight, which is used when a word that follows this 3-gram is not covered by a longer listed entry
For the 2-gram entry
-.2 go to -10.1
The first number is the log10 of the probability that to occurs after go
The second number, -10.1, is the back-off weight applied to words that follow go to without a listed 3-gram
Such continuations are not so likely
The second number is not the log10 of a probability, but the log10 of a weight (it could be the log10 of a probability, but does not have to be)
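Reading these entries programmatically can be sketched in Python. The parser below is a simplified illustration, not the sphinxbase implementation: it assumes one entry per line, with the log10 probability first, then the words, then an optional second number. ARPA_SAMPLE is a short excerpt of entries from the example file, so its ngram counts differ from the full file. The function name parse_arpa is made up.

```python
# A short excerpt built from entries in the example ARPA file.
ARPA_SAMPLE = r"""
\data\
ngram 1=2
ngram 2=2

\1-grams:
-1.0414 Go -0.3852
-0.9622 Up -0.1286

\2-grams:
-0.2359 Go Up 0.0587
-0.4983 Up </s>

\end\
"""


def parse_arpa(text):
    """Return {order: {ngram tuple: (log10 prob, second number or None)}}."""
    model = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue  # header text before the first n-gram section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        second = float(fields[1 + order]) if len(fields) > 1 + order else None
        model[order][words] = (logprob, second)
    return model


if __name__ == "__main__":
    model = parse_arpa(ARPA_SAMPLE)
    logprob, _ = model[2][("Go", "Up")]
    # Convert the log10 value back to a probability.
    print(10 ** logprob)
```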
The instructions here http://cmusphinx.sourceforge.net/2011/05/building-pocketsphinx-on-android/ are almost correct
Follow the instructions for getting and compiling sphinxbase and pocketsphinx
Get PocketSphinxDemo.tar.gz
Import that into Eclipse
File -> Import -> Existing Projects into Workspace -> Next -> Select archive file -> Browse and select PocketSphinxDemo.tar.gz
To
On phone
(the directory should be /mnt/sdcard/Android/data/edu.cmu.pocketsphinx)
adb shell
mkdir /mnt/sdcard/Android/data/edu.cmu.pocketsphinx
cd /mnt/sdcard/Android/data/edu.cmu.pocketsphinx
Make the directory structure as shown on the web page
/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm
/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US
/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/hub4wsj_sc_8k
Not sure if this is needed.
Cd to CMUSphinx/pocketsphinx/model/hmm/en_US/
Android/android-sdk/platform-tools/adb push ./hub4wsj_sc_8k /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k
Cd to CMUSphinx/pocketsphinx/model/lm
Android/android-sdk/platform-tools/adb push ./en_US /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/
In eclipse
In RecognizerTask.java, change code to include the correct path
This path must match the path where the model files are located
pocketsphinx.setLogfile("/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/pocketsphinx.log");
Config c = new Config();
/*
 * In 2.2 and above we can use getExternalFilesDir() or whatever it's called
 */
c.setString("-hmm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k");
c.setString("-dict", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.dic");
c.setString("-lm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.DMP");
c.setString("-rawlogdir", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx"); // Only use it to store the audio
Note that these lines must also be changed if you use different models
Build, run and test
Windows install
Requires Android NDK
Flex for Windows: http://gnuwin32.sourceforge.net/packages/flex.htm
Bison for Windows: http://gnuwin32.sourceforge.net/packages/bison.htm
Get CMUSphinx from here: ??
Note that this contains the
Or:
But the order of the libs at the end needs to be reversed
Only compiles on Linux, because it needs yacc
resources: http://www.speech.cs.cmu.edu/sphinxman/