[GSoC 2017 with CMUSphinx] Post #8: Grapheme to Phoneme Conversion

By now we have collected pronunciation dictionaries for several languages. Since the dictionaries are gathered from web data, most of them are not large enough to cover all the words needed for speech recognition. One solution to that problem is to generate pronunciations for out-of-vocabulary words from rules, that is, grapheme-to-phoneme conversion. CMUSphinx provides a grapheme-to-phoneme (G2P) toolkit based on a recurrent neural network (RNN) with long short-term memory (LSTM) units, implemented in the TensorFlow framework, which provides state-of-the-art conversion accuracy.

During this week I mainly used the CMUSphinx g2p-seq2seq toolkit to extend the collected English dictionaries by training a G2P model.

  • Installation
The CMUSphinx Sequence-to-Sequence G2P toolkit can be downloaded from GitHub. First, follow the instructions on GitHub to install the toolkit. The package is built on the TensorFlow Python library, so TensorFlow must be installed beforehand. Note that the toolkit relies on TensorFlow's seq2seq module, which seems to work only with version 1.0, so it is better to use version 1.0 rather than the latest version 1.2, which also changed much of the RNN-related API. TensorFlow has both CPU and GPU versions; since tensorflow and tensorflow-gpu are two different pip packages, be sure to change tensorflow to tensorflow-gpu in  g2p-seq2seq/setup.py#L41  if you use the GPU-supported version. Install the toolkit with  python setup.py install --prefix=/path/to/install_dir/  and test it with  python setup.py test ; with TensorFlow 1.2 the tests fail because of both RNN API and seq2seq issues.
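A minimal install sequence might look like the following sketch; the TensorFlow version pin and the install prefix are assumptions for illustration:

# install TensorFlow 1.0 (use tensorflow-gpu==1.0.0 for the GPU build)
pip install tensorflow==1.0.0
# fetch the toolkit, install it into a custom prefix, and run its tests
git clone https://github.com/cmusphinx/g2p-seq2seq.git
cd g2p-seq2seq
python setup.py install --prefix=/path/to/install_dir/
python setup.py test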

  • Preparation
After installation, you should be able to access g2p-seq2seq from the command line. If not, please make sure that the installation directory is in the system PATH environment variable. There is a pretrained G2P model, a 2-layer LSTM with 512 hidden units, which can be downloaded from sourceforge. The model is trained on the unstressed CMUDict, which is different from the stressed one on GitHub; the corresponding unstressed dictionary is also available at sourceforge. After downloading and uncompressing the pretrained model, one should be able to follow the simple examples on GitHub.
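For a quick check that everything works, the pretrained model can be queried interactively; in this sketch the archive name is an assumption, and the unpacked directory is assumed to be g2p-seq2seq-cmudict as used later in this post:

# unpack the pretrained 2-layer model downloaded from sourceforge
tar xzf g2p-seq2seq-cmudict.tar.gz
# type a word, get its generated pronunciation back
g2p-seq2seq --interactive --model g2p-seq2seq-cmudict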

  • Training
To use the g2p-seq2seq toolkit to train a G2P model on our collected pronunciation dictionary, a CMUDict-formatted dictionary must be prepared first: one word per line, followed by its space-separated pronunciation. Then we can run  g2p-seq2seq --train train.dict --model /path/to/model  to train a G2P model. The default learning_rate and learning_rate_decay_factor are 0.5 and 0.8 respectively, which are rather large for training. We can use --learning_rate and --learning_rate_decay_factor to set them to 0.1 and 0.5 respectively, and use --size and --num_layers to change the model architecture; for example, run  g2p-seq2seq --train train.dict --max_steps 0 --size 512 --num_layers 3 --learning_rate 0.1 --learning_rate_decay_factor 0.5 --model g2p-512-3  to train a 3-layer LSTM with 512 hidden units. Training uses all available GPUs by default; to run on one specific GPU, set the CUDA_VISIBLE_DEVICES environment variable before training. Here's the training result of a 2-layer LSTM model on the collected English dictionary (a sketch of the dictionary format and a single-GPU training command follows the log):
Loading vocabularies from g2p-512-2
Creating 2 layers of 512 units.
Reading model parameters from g2p-512-2
Beginning calculation word error rate (WER) on test sample.
Words: 4507
Errors: 2248
WER: 0.499
Accuracy: 0.501
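
For concreteness, here is a sketch of what the dictionary entries and a single-GPU training invocation look like; the two example words are only illustrative, and GPU 0 is an arbitrary choice:

# train.dict: one word per line followed by its space-separated (unstressed) pronunciation
HELLO HH AH L OW
WORLD W ER L D

# restrict training to GPU 0 and train a 3-layer LSTM with 512 hidden units
CUDA_VISIBLE_DEVICES=0 g2p-seq2seq --train train.dict --max_steps 0 --size 512 --num_layers 3 --learning_rate 0.1 --learning_rate_decay_factor 0.5 --model g2p-512-3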

  • Testing
Now that we have a trained G2P model, we can use it to extend the dictionary by generating pronunciations for out-of-vocabulary words. First, prepare a word list file oov.vocab with one word per line. Then, generate the corresponding pronunciations with the pretrained model by running g2p-seq2seq --decode oov.vocab --output oov.dict --model g2p-seq2seq-cmudict, where oov.dict is the output pronunciation dictionary. Finally, we can run cat train.dict oov.dict | sort -u > dictionary.dict to merge the original and generated dictionaries for training the acoustic model. A consolidated sketch of these commands is shown below.
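Putting those steps together, the whole out-of-vocabulary workflow is just a few commands (file and model names as in the text above):

# oov.vocab: one out-of-vocabulary word per line
g2p-seq2seq --decode oov.vocab --output oov.dict --model g2p-seq2seq-cmudict
# merge the original and generated dictionaries, dropping duplicates
cat train.dict oov.dict | sort -u > dictionary.dict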


As the next step, I will train acoustic models on the VoxForge dataset to compare CMUDict with the collected dictionary. The G2P model trained on the collected dictionary will be used to generate pronunciations for out-of-vocabulary words in the transcriptions.
