[GSoC 2017 with CMUSphinx] Post 9-10#: Training Acoustic Model on Voxforge Dataset

Now that we have collected pronunciation dictionaries and the corresponding pretrained G2P models, we want to compare the newly collected dictionaries with existing ones. We decided to benchmark the collected English dictionary against CMUDict: we train an acoustic model with each dictionary and compare the resulting error rates. We chose the VoxForge speech corpus as the dataset and used the existing language model trained on VoxForge.

During these two weeks, I mainly trained acoustic models on the VoxForge dataset with `sphinxtrain`, following the CMUSphinx tutorial.

  • Toolkit installation
To train acoustic models with sphinxtrain, the toolkit and its dependencies must be installed correctly first. See https://cmusphinx.github.io/wiki/tutorialam/#compilation-of-the-required-packages for the list of required packages. I used Linux and the latest code. Download sphinxbase, sphinxtrain and pocketsphinx from GitHub and compile them in that order. Note that a base directory such as sphinx/ should be created first and the three packages should be placed under that same root directory. The compilation process is nearly identical for each package: cd into the package's directory and run
./autogen.sh
./configure
make -j $(nproc) && make install
in order. If you want to install in a custom directory or you do not have sudo privileges, use
./configure --prefix=/path/to/dir/
to install to a custom path. The three packages need to be installed in that order (sphinxbase first). Finally, remember to export the installation path in your environment variables, for example
export PATH=/path/to/dir/bin:$PATH
export LD_LIBRARY_PATH=/path/to/dir/lib
export PKG_CONFIG_PATH=/path/to/dir/lib/pkgconfig
After installation, you should be able to run sphinxtrain from the command line. If not, make sure the PATH and LD_LIBRARY_PATH environment variables are set correctly. Those are the basic packages required for training an acoustic model. For VoxForge, a common issue is that some transcriptions are noisy and do not match the corresponding audio; the training process usually detects this and reports it in the logs. We can work around the issue by enabling the forced alignment stage in training. Forced alignment runs sphinx3_align, which is not included in the packages installed above: sphinx3_align is only available in sphinx3, so we also need to install sphinx3. Searching for "sphinx3" mostly turns up old releases on SourceForge, such as version 0.8, which is about nine years old. Those old releases will not work with the newest sphinxbase and sphinxtrain, so you need to check out the latest sphinx3 from the SourceForge SVN repository. Building it is the same as the previous installations. After make install, find the sphinx3_align binary in the installation path and copy it under sphinxtrain's installation directory.
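For reference, here is a rough sketch of those sphinx3 steps. The SVN URL and the destination directory (libexec/sphinxtrain/, where sphinxtrain keeps its other training binaries such as bw and norm) are assumptions; adjust them to your installation.
# check out and build the latest sphinx3 (SVN URL assumed)
svn checkout https://svn.code.sf.net/p/cmusphinx/code/trunk/sphinx3 sphinx3
cd sphinx3
./autogen.sh
./configure --prefix=/path/to/dir/
make -j $(nproc) && make install
# copy sphinx3_align next to sphinxtrain's training binaries (assumed location)
cp /path/to/dir/bin/sphinx3_align /path/to/dir/libexec/sphinxtrain/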

  • Data preparation
To use the VoxForge dataset for acoustic model training, we need to download it from VoxForge and prepare the necessary files in the specific format sphinxtrain expects. The required file structure is:
voxforge
+-- etc
|   +-- voxforge.dic                  # pronunciation dictionary
|   +-- voxforge.phone                # phoneme file
|   +-- voxforge.lm.DMP               # language model dump
|   +-- voxforge.filler               # list of fillers
|   +-- voxforge_train.fileids        # list of files for training
|   +-- voxforge_train.transcription  # transcription for training
|   +-- voxforge_test.fileids         # list of files for testing
|   +-- voxforge_test.transcription   # transcription for testing
+-- wav
|   +-- folder_1
|   |   +-- file_1-1.wav              # recording of speech utterance
|   |   +-- file_1-2.wav
|   |   +-- ...
|   +-- folder_2
|   |   +-- file_2-1.wav
|   |   +-- file_2-2.wav
|   |   +-- ...
|   +-- ...
+-- ...
See https://cmusphinx.github.io/wiki/tutorialam/#data-preparation for details on preparing the data. First, download the speech corpus from VoxForge. I used the 16 kHz audio located at http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/. Download all the tgz files and extract them into a folder named wav.

Most of the extracted folders contain a plain text file named PROMPT in their root directory that records the transcription of each audio file; we need to concatenate those PROMPT files into one combined file. Because the transcription file accepted by sphinxtrain expects a file name rather than a file path at the end of each line, every wav file's name must be unique regardless of its prefix path. One simple solution is to rename each file to prefix_path.filename, which makes all names unique. Some audio files in the VoxForge dataset are in FLAC format, which sphinxtrain does not accept, so we need to convert them to wav using ffmpeg or sox (see the sketch below).

Finally, record every audio file's name in the fileids files and build the corresponding transcription files from the PROMPT data. Note that the VoxForge transcriptions contain a lot of noise: numbers, special symbols, mixed upper and lower case letters, etc. It is best to normalize them all to the [A-Z\' ]+ format. Example data preparation scripts can be found at wikt2pron/egs/voxforge/scripts. Then put the dictionary and phoneme files under etc/. The last required file is the language model; a language model pretrained on VoxForge can be downloaded from http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/.
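As a rough sketch of the conversion and normalization steps (this is not the actual script in wikt2pron/egs/voxforge/scripts; the paths and the exact normalization are assumptions):
# convert FLAC recordings to 16 kHz mono wav so sphinxtrain can read them
find wav -name '*.flac' | while read -r f; do
    ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.flac}.wav" && rm "$f"
done
# normalize one transcription line to the [A-Z' ]+ format
echo "It's 5 o'clock, isn't it?" \
  | tr '[:lower:]' '[:upper:]' \
  | sed "s/[^A-Z' ]//g; s/  */ /g"
# prints: IT'S O'CLOCK ISN'T IT   (digits are simply dropped here; in practice spell them out first)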

  • Training Configuration
After preparing the dataset, we need to run
sphinxtrain -t voxforge setup
at the root of the voxforge/ directory to set up the training configuration. feat.params and sphinx_train.cfg will be generated automatically under etc/. We need to edit sphinx_train.cfg to fit our training setup; see https://cmusphinx.github.io/wiki/tutorialam/#setting-up-the-training-scripts for details. Here are the variables we need to configure for VoxForge:
$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
#$CFG_HMM_TYPE  = '.semi.'; # PocketSphinx
#$CFG_HMM_TYPE  = '.ptm.'; # PocketSphinx (larger data sets)
I tried both the .cont and the .ptm model types. The .cont model gives the best accuracy, while the .ptm model offers a nice balance between accuracy and speed. Make sure that one and only one model type is uncommented in the config file. Also set $CFG_FINAL_NUM_DENSITIES = 16; for the .cont model.
For the number of tied states, I set 3000 for the .cont model and 1000 for the .ptm model on the VoxForge dataset.
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 3000;
Next, I used the LDA/MLLT transform when training the .cont model, since it uses single-stream features.
# Calculate an LDA/MLLT transform?
$CFG_LDA_MLLT = 'yes';
# Dimensionality of LDA/MLLT output
$CFG_LDA_DIMENSION = 29;
Forced alignment is used on the VoxForge dataset because the transcriptions contain a lot of noise. Make sure sphinx3 is installed and sphinx3_align is accessible from the command line.
# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'yes';
Finally, to run the training on all the CPUs of your machine, change the queue type and set NPART to the number of CPUs, which you can get with the nproc command.
# How many parts to run Forward-Backward estimatinon in
$CFG_NPART = 16;

...

# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue::POSIX";

...

$DEC_CFG_NPART = 16;  #  Define how many pieces to split decode in

  • Training Acoustic Model
After setting up the config file, change to the voxforge/ root directory and simply run sphinxtrain run. Training takes a long time, so it is better to run it in the background, for example
nohup sphinxtrain run &
or inside screen, etc. See https://cmusphinx.github.io/wiki/tutorialam/#training for details on training. Although the whole training is a matter of running one command and waiting for results, many errors can interrupt it. When training is interrupted, find out which step it stopped at and grep for errors in the corresponding logs, for example
grep -r --include "*.log" "ERROR" logdir/20.ci_hmm/
https://cmusphinx.github.io/wiki/tutorialam/#troubleshooting covers many of the problems you may run into. Other problems may also occur, such as sphinxtrain hanging at a certain step; that is probably an out-of-memory issue, and you can reduce NPART and run sphinxtrain again. Training an acoustic model not only takes time but also requires you to keep debugging issues in the earlier data preparation or in the sphinx_train.cfg configuration.
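To scan every training stage at once rather than one log directory at a time, something like the following works (logdir/ is the log directory sphinxtrain generates):
grep -rl --include "*.log" "ERROR" logdir/ | sort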
Here are the models and logs from training with CMUDict and with the collected dictionary on the VoxForge dataset.

  • Decoding and Results
After training completes, you can find the acoustic model in the model_parameters/ directory, named voxforge.cd_cont_3000 or voxforge.cd_ptm_1000. Only that directory is needed for testing. The decoding results are in the result directory, in the file result.align. If you need to decode again with the trained acoustic model, simply run
sphinxtrain -s decode run
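The overall word and sentence error rates are printed when decoding finishes, and a summary should also appear at the end of the alignment file; a quick way to check it (assuming the default result/ layout):
tail result/*.align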
My results of training acoustic models on VoxForge can be found on Dropbox.

   Dictionary            CMUDict                       Wikt Dict
   SENTENCE ERROR (%)    54.5 (.ptm) / 46.4 (.cont)    61.9 (.ptm) / 50.0 (.cont)
   WORD ERROR RATE (%)   29.9 (.ptm) / 22.6 (.cont)    36.1 (.ptm) / 26.9 (.cont)

The results show that the .ptm model is faster but less accurate than the .cont model, and that the WER with the collected dictionary is worse than with CMUDict. The main reasons are that the collected dictionary has fewer entries than CMUDict and that the pronunciations the G2P model generates for OOV words are not as good as CMUDict's.

By comparing the result.align files of the .cont models trained with CMUDict and with the collected dictionary, we extracted the top ten misrecognized words that appear in the collected-dictionary results but not in the CMUDict results. Here are those words with their pronunciations in the collected dictionary:
INADEQUACY AY N AX D IY K Y UW AX K IY
WHITEFISH W IH T IY F IH S
OPPRESSION P P IY P R EH S IY W AA N
CHAIN T SH EY N
SCHOOLBOY S CH OW OW L B OW
ENCOURAGED EH N K UW AX R AX G IY D
FOOLISHLY F OW L IY SH EH EH SH L AY
HORRIBLE(3) B AX L
ANNOUNCED AE N AX UH UW N S IY D
CHUCKLED CH AX K S K EH L IY D
Obviously, these words' pronunciations are bad; for example, SCHOOLBOY is pronounced as S CH OW OW L B OW. Tracing these words back to their source, we found that 8 of the 10 were generated by the G2P model rather than parsed from Wiktionary. The G2P model achieves only around 50% WER when trained on the collected English dictionary, and this is the main cause of the worse WER in acoustic model training. Since we cannot increase the number of pronunciations extracted from Wiktionary, a better option is to clean up the phoneme set of the collected dictionary, that is, map unusual phonemes to the nearest phone in the CMUDict phoneme set, for example mapping both o and ɒ to OW and both ɝ and ɚ to ER (a sketch of such a mapping follows below). The remaining two words, CHAIN and HORRIBLE, come from Wiktionary. The pronunciation of CHAIN looks normal, while HORRIBLE omits half of its pronunciation: the third English pronunciation of "horrible" in Wiktionary, [-bəɫ], is not a complete pronunciation. Such cases can be avoided by adding rules to the parser to drop them.
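As a rough sketch of the phoneme mapping, assuming the collected dictionary stores the phones above as space-separated labels in the .dic file (the file names and the exact labels are assumptions; extend the mapping to the full phone set):
# map phones outside the CMUDict phone set to their nearest CMUDict phone
awk '{
    # field 1 is the word entry; the remaining fields are phones
    for (i = 2; i <= NF; i++) {
        if ($i == "o" || $i == "ɒ") $i = "OW"
        if ($i == "ɝ" || $i == "ɚ") $i = "ER"
    }
    print
}' etc/voxforge.dic > etc/voxforge.cleaned.dic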

Next, I will continue with benchmark experiments on other datasets to compare the collected pronunciation dictionary with CMUDict.
