[GSoC 2017 with CMUSphinx] Post #9-10: Training Acoustic Model on Voxforge Dataset
Now that we have collected pronunciation dictionaries and the corresponding pretrained G2P models, we want to compare the newly collected dictionaries with existing ones. We decided to benchmark the collected English dictionary against CMUDict: we train an acoustic model with each dictionary and compare the resulting error rates. We chose the VoxForge speech corpus as the dataset and use the existing language model built on VoxForge.
During these two weeks, I mainly trained acoustic models on the VoxForge dataset using `sphinxtrain`, following the CMUSphinx tutorial.
- Toolkit Installation

Before training acoustic models with `sphinxtrain`, the toolkit and its dependencies need to be installed correctly. https://cmusphinx.github.io/wiki/tutorialam/#compilation-of-the-required-packages can be referred to for installing the required packages. I used a Linux machine and the latest code. Download `sphinxbase`, `sphinxtrain` and `pocketsphinx` from GitHub and compile them in order. Note that a base directory like `sphinx` should be created first and the three packages need to be put into that same root directory. The compilation process is nearly the same for each package: `cd` into the corresponding package's directory and run

```
./autogen.sh
./configure
make -j $(nproc) && make install
```

in order. If you want to install into a custom directory or you do not have `sudo` privileges, use

```
./configure --prefix=/path/to/dir/
```

to install under a custom path. The three packages need to be installed in order. Finally, remember to export the installation path to the environment variables, for example

```
export PATH=/path/to/dir/bin:$PATH
export LD_LIBRARY_PATH=/path/to/dir/lib
export PKG_CONFIG_PATH=/path/to/dir/lib/pkgconfig
```
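Putting those steps together, a small build loop might look like the following; the side-by-side layout under `~/sphinx` and the install prefix are just an example, not something the tutorial requires:

```
# Example only: build the three packages in order under ~/sphinx,
# installing into a custom prefix so sudo is not needed.
for pkg in sphinxbase sphinxtrain pocketsphinx; do
    cd ~/sphinx/"$pkg" || break
    ./autogen.sh
    ./configure --prefix="$HOME/sphinx/install"
    make -j "$(nproc)" && make install
done
```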
After installation, you should be able to run `sphinxtrain` from the command line. If not, make sure the `PATH` and `LD_LIBRARY_PATH` environment variables are set correctly. These are the basic packages required for training an acoustic model.

For VoxForge there is one more issue: some transcriptions are noisy and do not match the corresponding audio properly. The training process usually detects this and emits warnings in the logs. We can address it by enabling the forced alignment stage in training. Forced alignment needs to run `sphinx3_align`, which is not included in the packages installed above. `sphinx3_align` is only available in `sphinx3`, so we need to install `sphinx3` to enable forced alignment. Searching for "sphinx3" on Google mostly turns up historical releases on SourceForge, such as version 0.8, which is about nine years old. The old `sphinx3` releases won't work with the newest `sphinxbase` and `sphinxtrain`; you need to check out the latest `sphinx3` from the SourceForge SVN repository. Downloading `sphinx3`'s code from SVN and compiling it works the same way as the previous installations. After `make install`, find the `sphinx3_align` binary in the installation path and copy it under `sphinxtrain`'s installation directory.
- Data Preparation

Next, we need to arrange the dataset in the layout `sphinxtrain` expects. The file structure for the dataset is:

```
voxforge
+-- etc
|   +-- voxforge.dic                  # pronunciation dictionary
|   +-- voxforge.phone                # phoneme file
|   +-- voxforge.lm.DMP               # language model dump
|   +-- voxforge.filler               # list of fillers
|   +-- voxforge_train.fileids        # list of files for training
|   +-- voxforge_train.transcription  # transcription for training
|   +-- voxforge_test.fileids         # list of files for testing
|   +-- voxforge_test.transcription   # transcription for testing
+-- wav
|   +-- folder_1
|   |   +-- file_1-1.wav              # recording of speech utterance
|   |   +-- file_1-2.wav
|   |   +-- ...
|   +-- folder_2
|   |   +-- file_2-1.wav
|   |   +-- file_2-2.wav
|   |   +-- ...
|   +-- ...
+-- ...
```

https://cmusphinx.github.io/wiki/tutorialam/#data-preparation can be referred to for preparing the data. First, we need to download the speech corpus from VoxForge. I used the 16kHz audio located at http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/. Download all the `tgz` files and uncompress them into a folder named `wav`. Most of the uncompressed folders have a plain text file named `PROMPT` in their root directory which records the audio transcriptions; we need to concatenate those `PROMPT` files into one combined file. Because the transcription file accepted by `sphinxtrain` requires a filename rather than a file path at the end of each line, we need to make every wav file's name unique regardless of its prefix path. One simple solution is renaming each file to `prefix_path.filename`, which makes them all unique. Some audio files in the VoxForge dataset are in `flac` format, which is not accepted by `sphinxtrain`, so we need to convert them into `wav` files using `ffmpeg` or `sox`. Finally, we need to record every audio file's name in the fileids files and build the corresponding transcription files from the `PROMPT` data. Note that the VoxForge transcriptions contain a lot of noise, such as numbers, special symbols, and mixtures of upper-case and lower-case letters; it is better to normalize them all into `[A-Z\' ]+` form. An example of data preparation scripts can be found at wikt2pron/egs/voxforge/scripts; a rough sketch follows at the end of this section. Then put the dictionary and phoneme file under `etc/`. The last necessary file is the language model; a language model pretrained on VoxForge can be downloaded at http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/.
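Here is a minimal sketch of those preparation steps under the layout above; the helper and file names are illustrative only, and the real scripts live in wikt2pron/egs/voxforge/scripts:

```
cd voxforge

# Convert flac recordings to wav, since sphinxtrain only takes wav here.
find wav -name '*.flac' | while read -r f; do
    ffmpeg -loglevel error -i "$f" "${f%.flac}.wav" && rm "$f"
done

# Collect every utterance id (file name without extension) for the fileids
# file, assuming the renaming step already made all names unique.
find wav -name '*.wav' -exec basename {} .wav \; | sort > etc/voxforge_train.fileids

# Normalize one line of transcription text: upper-case it and keep only
# [A-Z' ] characters; the trailing "(fileid)" still has to be appended
# when assembling etc/voxforge_train.transcription.
clean_text() {
    tr '[:lower:]' '[:upper:]' | tr -cd "A-Z' \n"
}
```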
- Training Configuration

Run

```
sphinxtrain -t voxforge setup
```

at the root directory of `voxforge/` to set up the training configuration. `feat.params` and `sphinx_train.cfg` will be generated automatically under `etc/`. We need to edit `sphinx_train.cfg` to fit the training configuration to our needs; details can be found at https://cmusphinx.github.io/wiki/tutorialam/#setting-up-the-training-scripts. Here are some variables we need to configure for VoxForge:
```
$CFG_HMM_TYPE  = '.cont.'; # Sphinx 4, PocketSphinx
#$CFG_HMM_TYPE = '.semi.'; # PocketSphinx
#$CFG_HMM_TYPE = '.ptm.';  # PocketSphinx (larger data sets)
```

I tried both the `.cont` type model and the `.ptm` type model in training. The `.cont` model provides the best accuracy, while the `.ptm` model provides a nice balance between accuracy and speed. Make sure that one and only one model type is uncommented in the config file. Also set `$CFG_FINAL_NUM_DENSITIES = 16;` for the `.cont` type model.

For the number of tied states, I set 3000 for the `.cont` type model and 1000 for the `.ptm` model on the VoxForge dataset.
```
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 3000;
```

Next, I used an LDA/MLLT transform when training the `.cont` model, since it uses single-stream features.

```
# Calculate an LDA/MLLT transform?
$CFG_LDA_MLLT = 'yes';
# Dimensionality of LDA/MLLT output
$CFG_LDA_DIMENSION = 29;
```

Forced alignment is used on the VoxForge dataset because the transcriptions are noisy. Make sure `sphinx3` has been installed and `sphinx3_align` can be accessed from the command line.

```
# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'yes';
```

Finally, to run the training using all CPUs on your machine, change the queue type and set `NPART` to the number of CPUs on your machine, which can be printed with the `nproc` command in a shell:
```
# How many parts to run Forward-Backward estimatinon in
$CFG_NPART = 16;
...
# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue::POSIX";
...
$DEC_CFG_NPART = 16; # Define how many pieces to split decode in
```
- Training Acoustic Model

To start training, go to the `voxforge/` root directory and simply run `sphinxtrain run`. The training process takes a long time, so it is better to run it in the background, for example with

```
nohup sphinxtrain run &
```

or inside `screen`, etc. https://cmusphinx.github.io/wiki/tutorialam/#training can be referred to for details of the training process. Although the whole training boils down to executing one command and waiting for results, many errors can interrupt it. When training is interrupted, find out which step it was in and grep the corresponding logs for errors, for example

```
grep -r --include "*.log" "ERROR" logdir/20.ci_hmm/
```

https://cmusphinx.github.io/wiki/tutorialam/#troubleshooting describes many of the problems you may meet. Other problems can also occur, for example `sphinxtrain` hanging at a certain step; that is probably an out-of-memory issue, and you can reduce `NPART` and run `sphinxtrain` again. Training an acoustic model not only takes time but also requires you to keep debugging issues in the earlier data preparation or in the `sphinx_train.cfg` configuration.

Here are the models and logs I trained using CMUDict and the collected dictionary on the VoxForge dataset.
- Decoding and Results

After training, the acoustic model can be found in the `model_parameters/` directory for each dictionary. It is named `voxforge.cd_cont_3000` or `voxforge.cd_ptm_1000`, and that directory is all you need for testing. You can find the decoding results in the `result` directory, in a file named `result.align`. If you need to decode again using the trained acoustic model, simply run

```
sphinxtrain -s decode run
```

My results of training acoustic models on VoxForge can be found at Dropbox.
| Dictionary | CMUDict | Wikt Dict |
|---|---|---|
| SENTENCE ERROR (%) | 54.5 (.ptm) / 46.4 (.cont) | 61.9 (.ptm) / 50.0 (.cont) |
| WORD ERROR RATE (%) | 29.9 (.ptm) / 22.6 (.cont) | 36.1 (.ptm) / 26.9 (.cont) |
The results show that the `.ptm` type model is faster but less accurate than the `.cont` type model, and that the WER with the collected dictionary is worse than with CMUDict. The main reasons are that the collected dictionary has fewer entries than CMUDict, and that the pronunciations of OOV words generated by the G2P model are not as good as CMUDict's.

By comparing the `result.align` files of the `.cont` models trained with CMUDict and with the collected dictionary, we extracted the top ten error words that appear in the results with the collected dictionary but not in the CMUDict results:

```
INADEQUACY   AY N AX D IY K Y UW AX K IY
WHITEFISH    W IH T IY F IH S
OPPRESSION   P P IY P R EH S IY W AA N
CHAIN        T SH EY N
SCHOOLBOY    S CH OW OW L B OW
ENCOURAGED   EH N K UW AX R AX G IY D
FOOLISHLY    F OW L IY SH EH EH SH L AY
HORRIBLE(3)  B AX L
ANNOUNCED    AE N AX UH UW N S IY D
CHUCKLED     CH AX K S K EH L IY D
```

Obviously, these words' pronunciations are bad; for example, `SCHOOLBOY` is pronounced as `S CH OW OW L B OW`.
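The comparison itself is easy to script once the misrecognized words from each run have been dumped into plain word lists; the extraction from `result.align` is not shown here, and the file names are placeholders:

```
# Words misrecognized with the collected dictionary but not with CMUDict.
comm -23 <(sort -u wikt_error_words.txt) <(sort -u cmudict_error_words.txt)
```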
Tracing these words back to their source, we found that 8 of these 10 entries were generated by the G2P model rather than parsed from Wiktionary. The G2P model trained on the collected English dictionary only achieves around 50% WER, and this is the main cause of the worse WER in acoustic model training. Since we cannot increase the size of the pronunciation dictionary extracted from Wiktionary, a better option is to clean up the phoneme set of the collected dictionary, that is, to map special phonemes to the nearest phone in the CMUDict phoneme set: for example, mapping `o` and `ɒ` both to `OW`, mapping `ɝ` and `ɚ` both to `ER`, etc.
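As a minimal sketch of that cleanup, assuming the collected dictionary is a plain `WORD PH1 PH2 ...` text file (the file names and the tiny mapping table below are placeholders):

```
# Map phones outside the CMUDict phone set to their nearest CMUDict phone,
# field by field; only the example mappings from above are listed here.
awk 'BEGIN {
         map["o"] = "OW"; map["ɒ"] = "OW";
         map["ɝ"] = "ER"; map["ɚ"] = "ER";
     }
     {
         printf "%s", $1                          # the word itself
         for (i = 2; i <= NF; i++)
             printf " %s", ($i in map) ? map[$i] : $i
         print ""
     }' collected.dic > collected.cleaned.dic
```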
The remaining two words, `CHAIN` and `HORRIBLE`, come from Wiktionary. The pronunciation of `CHAIN` looks normal, while `HORRIBLE` omits half of its pronunciation: referring to the entry "horrible" in Wiktionary, its third English pronunciation, `[-bəɫ]`, is not a complete pronunciation. This situation can be avoided by adding rules to the parser to drop such cases.

Next, I will continue with benchmark experiments on other datasets to compare the collected pronunciation dictionary and CMUDict.