[GSoC 2017 with CMUSphinx] Post #11: Training Acoustic Model on LibriSpeech

Having trained an acoustic model with the collected English dictionary on the Voxforge dataset, I continued training acoustic models to compare the performance of the newly collected dictionary on other datasets. LibriSpeech is a cleaner English speech corpus than Voxforge. Since the whole LibriSpeech set is too large for me to train on and would take much longer, I used the train-clean-100 subset as the training set and the test-clean set for testing. Here are the details of training on LibriSpeech.

Since we have already trained acoustic models on the Voxforge dataset, the toolkits and environment are prepared. First, we need to arrange the LibriSpeech data in the file structure sphinxtrain expects, as introduced in the previous blog post. Download the speech corpus from OpenSLR, namely the train-clean-100 and test-clean sets. Create a root directory librispeech/, uncompress the downloaded tarballs and move the speech data into the librispeech/wav/ directory. Since the speech files in LibriSpeech are all in flac format, we need to convert them to wav format using sox or ffmpeg. Next, we need to generate the transcription files from the *.trans.txt files provided in LibriSpeech. The data preparation scripts can be found at wikt2pron/egs/librispeech/scripts. Then put the dictionaries and phoneme file under librispeech/etc/ and download the language model for LibriSpeech from OpenSLR; note that we need to use the G2P model to generate pronunciations for OOV words in the dictionaries first.
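For reference, here is a minimal sketch of the flac-to-wav conversion using sox; the librispeech/wav/ layout is an assumption based on my setup, and ffmpeg works just as well:

find librispeech/wav -name '*.flac' | while read f; do
    # LibriSpeech audio is 16 kHz mono; write a 16-bit wav next to each flac
    sox "$f" -b 16 "${f%.flac}.wav"
done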

After the dataset is prepared, we need to run
sphinxtrain -t librispeech setup
at the root directory librispeech/ to set up the training configuration. feat.params and sphinx_train.cfg will be generated automatically under librispeech/etc/. We need to edit sphinx_train.cfg to fit the training configuration to our needs; details can be found at https://cmusphinx.github.io/wiki/tutorialam/#setting-up-the-training-scripts and in the previous blog post. Since the speech files we use are clean compared to the Voxforge wav files, we do not need forced alignment in training.
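As an illustration, these are the kinds of lines I edit in librispeech/etc/sphinx_train.cfg; the density count below is an example value and should be tuned to the amount of training data (the continuous model reported later used 3000 tied states):

$CFG_WAVFILE_EXTENSION = 'wav';    # flac files were converted to wav
$CFG_WAVFILE_TYPE = 'mswav';
$CFG_HMM_TYPE = '.cont.';          # or '.ptm.' for the PTM model
$CFG_FINAL_NUM_DENSITIES = 32;     # Gaussians per state (example value)
$CFG_N_TIED_STATES = 3000;         # number of senones
$CFG_FORCEDALIGN = 'no';           # transcripts are clean, skip forced alignment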

After setting up the config file, change to the librispeech/ root directory and simply run sphinxtrain run. The training process is similar to the Voxforge training. Here are the models and logs I trained using CMUDict and the collected dictionary on the LibriSpeech dataset.

After training completes, the acoustic model will be located in the model_parameters/ directory, named librispeech.cd_cont_3000 or librispeech.cd_ptm_1000. That directory is all you need for testing. You can find the decoding results in the result/ directory, in a file named result.align. If you need to decode again using the trained acoustic model, simply run
sphinxtrain -s decode run
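Alternatively, decoding can be run directly with pocketsphinx_batch; the sketch below is only a rough illustration, and the file names (librispeech.dic, librispeech.lm.bin, the fileids and hypothesis files) are assumptions based on my directory layout:

pocketsphinx_batch \
    -hmm model_parameters/librispeech.cd_cont_3000 \
    -lm etc/librispeech.lm.bin \
    -dict etc/librispeech.dic \
    -ctl etc/librispeech_test.fileids \
    -cepdir wav -cepext .wav -adcin yes \
    -hyp result/librispeech_test.hyp

The hypothesis file can then be scored against etc/librispeech_test.transcription, for example with the word_align.pl script shipped with sphinxtrain.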
My results of training acoustic models on LibriSpeech can be found at Dropbox.

Dictionary             CMUDict                       Wikt Dict
SENTENCE ERROR (%)     91.7 (.ptm) / 84.5 (.cont)    94.5 (.ptm) / 90.6 (.cont)
WORD ERROR RATE (%)    30.3 (.ptm) / 19.4 (.cont)    39.1 (.ptm) / 28.0 (.cont)

The result is similar to Voxforge. WER using the collected dictionary is worse than using CMUDict; the main reasons are that the collected dictionary has fewer entries than CMUDict, and the pronunciations of OOV words generated by the G2P model are not as good as those in CMUDict. Overall the results are not great, since training only used a small subset of LibriSpeech containing about 100 hours of speech.

Apart from the Voxforge and LibriSpeech datasets, I also trained on a smaller and cleaner Voxforge set, which is available at http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/. It is the old 2010 version of Voxforge; the wav files are all 8 kHz and the transcriptions are cleaner than those in the Voxforge dataset from the previous blog post. It contains about 42 hours of speech, roughly half of the current Voxforge dataset. The models, logs and results can be found at Dropbox.
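Since this corpus is 8 kHz, the feature extraction parameters have to match the sample rate. Following the CMUSphinx tutorial recommendations for 8 kHz audio, the relevant sphinx_train.cfg settings look roughly like this:

$CFG_WAVFILE_SRATE = 8000.0;
$CFG_NUM_FILT = 31;      # fewer mel filters for the narrower band
$CFG_LO_FILT = 200;
$CFG_HI_FILT = 3500;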

Dictionary             CMUDict         Wikt Dict
SENTENCE ERROR (%)     34.1 (.cont)    41.1 (.cont)
WORD ERROR RATE (%)    16.6 (.cont)    19.9 (.cont)

WER using the collected dictionary is still worse than using CMUDict, but the gap in WER becomes smaller on this smaller dataset.
