[GSoC 2017 with CMUSphinx] Post 11#: Training Acoustic Model on LibriSpeech
Having trained an acoustic model on the Voxforge dataset with the collected English dictionary, I went on to train acoustic models on another dataset to compare how the newly collected dictionary performs there. LibriSpeech is a cleaner English speech corpus than Voxforge. Since the whole of LibriSpeech is too large and would take too long for me to train on, I used the train-clean-100 subset as the training set and the test-clean set for testing. Here are the details of training on LibriSpeech.
Since we have already trained acoustic models on the Voxforge dataset, the toolkits and environment are ready. First, we need to arrange the LibriSpeech data in the file structure that `sphinxtrain` needs, as the previous blog introduced. Download the speech corpus from OpenSLR: the `train-clean-100` set and the `test-clean` set. Create a root directory `librispeech/`, uncompress the downloaded tarballs, and move the speech directories into the `librispeech/wav/` directory. Since the speech files in LibriSpeech are all in `flac` format, we need to convert them into `wav` format using `sox` or `ffmpeg`. Next, generate the transcription files from the `*.trans.txt` files provided in LibriSpeech. The data preparation scripts can be found at wikt2pron/egs/librispeech/scripts. Then put the dictionaries and the phoneme file in `librispeech/etc/` and download the language model for LibriSpeech from OpenSLR. Note that a G2P model must first be used to generate pronunciations for the OOV words in the dictionaries.

After preparing the dataset, run `sphinxtrain -t librispeech setup` in the `librispeech/` root directory to set up the training configuration. `feat.params` and `sphinx_train.cfg` will be generated automatically under `librispeech/etc/`. Edit `sphinx_train.cfg` to fit the training configuration to our needs; for details, refer to https://cmusphinx.github.io/wiki/tutorialam/#setting-up-the-training-scripts and the previous blog. Since the speech files we use are clean compared to the Voxforge wav files, we don't need forced alignment in training.

After setting up the config file, change to the `librispeech/` root directory and simply run `sphinxtrain run`. The training process is similar to Voxforge training. Here are the models and logs I trained using CMUDict and the collected dictionary on the LibriSpeech dataset.

After training completes, the acoustic model will be located in the `model_parameters/` directory. It is named `librispeech.cd_cont_3000` or `librispeech.cd_ptm_1000`. Only that directory is needed for testing. You can find the decoded results in the `result` directory, in a file named `result.align`. If you need to decode using the pretrained acoustic model, simply run `sphinxtrain -s decode run`.

My results of training acoustic models on LibriSpeech can be found at Dropbox.
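Going back to the data preparation step, here is a minimal sketch of how the `*.trans.txt` files could be turned into the fileids and transcription lines that `sphinxtrain` reads. It is not the script from wikt2pron; the function names, the `lowercase` flag and the example directory layout are my own assumptions for illustration.

```python
from pathlib import PurePosixPath


def parse_trans_line(line):
    """Split one LibriSpeech *.trans.txt line into (utterance_id, text)."""
    utt_id, _, text = line.strip().partition(" ")
    return utt_id, text


def to_sphinx_entries(trans_lines, rel_dir, lowercase=False):
    """Build sphinxtrain-style fileids and transcription lines.

    rel_dir is the directory of the audio files relative to wav/,
    e.g. "test-clean/1089/134686" (hypothetical layout).
    """
    fileids, transcriptions = [], []
    for line in trans_lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, text = parse_trans_line(line)
        if lowercase:
            # Assumption: only needed when the dictionary is lower-case.
            text = text.lower()
        # fileids entries are paths relative to wav/ without the extension.
        fileids.append(str(PurePosixPath(rel_dir) / utt_id))
        # Transcription lines wrap the text in <s> ... </s> plus (utt_id).
        transcriptions.append(f"<s> {text} </s> ({utt_id})")
    return fileids, transcriptions


def flac_to_wav_command(flac_path, wav_path):
    """Return a sox command line that converts one FLAC file to WAV."""
    return ["sox", str(flac_path), str(wav_path)]
```

The command list returned by `flac_to_wav_command` can be passed to `subprocess.run(..., check=True)`, assuming `sox` is installed with FLAC support. Whether to lower-case the text depends on the casing used in the dictionary and language model.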
Dictionary | CMUDict | Wikt Dict |
---|---|---|
SENTENCE ERROR (%) | 91.7 (.ptm) 84.5 (.cont) | 94.5 (.ptm) 90.6 (.cont) |
WORD ERROR RATE (%) | 30.3 (.ptm) 19.4 (.cont) | 39.1 (.ptm) 28.0 (.cont) |
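For readers unfamiliar with the two metrics in the table, here is a small sketch of how word error rate and sentence error rate are usually defined: WER is the word-level edit distance divided by the number of reference words, and sentence error is the share of utterances with at least one error. This is the textbook definition, not the toolkit's own scoring script, which may differ in details such as text normalization.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rates(pairs):
    """Return (WER %, sentence error %) over (reference, hypothesis) pairs."""
    pairs = list(pairs)
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    total_errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    wrong_sentences = sum(ref.split() != hyp.split() for ref, hyp in pairs)
    return (100.0 * total_errors / total_words,
            100.0 * wrong_sentences / len(pairs))
```

Because a single wrong word makes the whole sentence count as an error, the sentence error is much higher than the WER on long utterances such as those in LibriSpeech.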
The result is similar to Voxforge. WER using the collected dictionary is worse than using CMUDict. A main reason is that the collected dictionary has fewer entries than CMUDict, and the pronunciations that the G2P model generates for OOV words are not as good as those in CMUDict. Overall, the results are not good because training used only a small subset of LibriSpeech containing 100 hours of speech.
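To see how many words depend on G2P pronunciations, the transcripts can be compared against a dictionary. The sketch below assumes the usual Sphinx dictionary layout (`word PH1 PH2 ...`, with alternatives written as `word(2)`) and transcription lines of the form `<s> ... </s> (utt-id)`; the function names are hypothetical.

```python
def load_dictionary_words(dict_lines):
    """Collect the words of a Sphinx-format dictionary ("word PH1 PH2 ...")."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Alternative pronunciations look like "word(2)"; strip the suffix.
        words.add(parts[0].split("(")[0])
    return words


def find_oov(transcript_lines, dictionary_words):
    """Return the sorted transcript words that are missing from the dictionary."""
    oov = set()
    for line in transcript_lines:
        for token in line.split():
            # Skip sentence markers and the trailing "(utterance-id)" field.
            if token in ("<s>", "</s>") or token.startswith("("):
                continue
            if token not in dictionary_words:
                oov.add(token)
    return sorted(oov)
```

Running this once with CMUDict and once with the collected dictionary gives two OOV lists whose sizes can be compared directly.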
Apart from the Voxforge and LibriSpeech datasets, I also trained on a smaller and cleaner Voxforge set, which is available at http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/. It is the old 2010 version of Voxforge: the wav files are all 8 kHz and the transcriptions are cleaner than those of the Voxforge dataset in the previous blog. It contains about 42 hours of speech, which is half of the present Voxforge dataset. The models, logs and results can be found at Dropbox.
Dictionary | CMUDict | Wikt Dict |
---|---|---|
SENTENCE ERROR (%) | 34.1 (.cont) | 41.1 (.cont) |
WORD ERROR RATE (%) | 16.6 (.cont) | 19.9 (.cont) |
WER using the collected dictionary is still worse than using CMUDict, but the gap in WER becomes smaller on the smaller dataset.
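As a quick check on the size of the gap, the snippet below compares the `.cont` WER values from the two tables in this post, both as an absolute difference and relative to the CMUDict WER. The helper name and the dataset labels are only for illustration.

```python
def wer_gap(wer_cmudict, wer_wikt):
    """Return (absolute gap in points, gap relative to the CMUDict WER in %)."""
    absolute = wer_wikt - wer_cmudict
    relative = 100.0 * absolute / wer_cmudict
    return absolute, relative


# WER (%) of the .cont models, copied from the two tables above.
results = {
    "LibriSpeech train-clean-100": (19.4, 28.0),
    "Voxforge 2010": (16.6, 19.9),
}
for name, (cmu, wikt) in results.items():
    absolute, relative = wer_gap(cmu, wikt)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")
```

On these numbers the gap is 8.6 points on LibriSpeech and 3.3 points on the 2010 Voxforge set.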