[GSoC 2017 with CMUSphinx] Post 12#: Final Report: A Summary
I have been working on the "Collect Pronunciation Dictionaries from Wiktionary" project with CMUSphinx for the last three months, as part of Google Summer of Code 2017. The project aims to expand pronunciation dictionaries with new words and more languages, and to use them in CMUSphinx, for example to train acoustic models.
My work falls into two parts: in the first two months, I developed a standalone toolkit that parses pronunciations from a Wiktionary dump and converts them into IPA and X-SAMPA formats; in the third month, I ran some benchmarks using the collected dictionaries. Here's the status of the project goals:
- Developed a toolkit, wikt2pron, which can be used as a black box to collect pronunciations in IPA and X-SAMPA format. It is well tested and makes it easy to expand dictionaries by parsing pronunciations from a Wiktionary dump.
- The toolkit has detailed documentation covering usage and the API.
- Ten phonetic dictionaries have been collected from Wiktionary using the toolkit. One problem is that the pronunciation sources in Wiktionary are limited, so some of the collected dictionaries are small. The collected English dictionary has about 69,000 entries, compared to CMUDict's 135,000.
- Ran experiments comparing G2P model and acoustic model performance between the collected English dictionary and CMUDict. The WER of the acoustic model trained with the collected English dictionary is worse than that of the one trained with CMUDict, due to the smaller dictionary size.
- code repo:
https://github.com/abuccts/wikt2pron
- documentation:
http://wikt2pron.readthedocs.io/en/latest/
- collected dictionaries and example models:
https://www.dropbox.com/sh/1anleakrnm5ednt/AAAXeSY0abHxFLcXOr4OkVJ9a
- gitter room for discussion:
https://gitter.im/enwiktionary2cmudict/Lobby
- gsoc blogs:
https://abuccts.blogspot.com/search/label/GSoC
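For readers unfamiliar with the WER (word error rate) metric mentioned in the status list above, here is a minimal sketch of how it is usually computed: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. This is only an illustration and not the scoring script used in my experiments; the function name `word_error_rate` and the example sentences are made up for this post.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER of `hypothesis` against `reference`.

    Hypothetical helper for illustration only: both arguments are plain
    strings that are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    # Total substitutions + deletions + insertions per reference word.
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors too, WER can exceed 1.0 when the hypothesis is much longer than the reference.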
If you'd like to try the toolkit, you can follow the instructions on GitHub to get started quickly. More detailed usage is covered in its documentation.
It's a pity that some of the collected dictionaries are small because of the limited pronunciation entries in Wiktionary, which leaves the benchmark results short of the state of the art. This problem might be addressed by using resources other than Wiktionary. Also, the dictionary collection process cannot yet be run automatically with a single command; this could be improved by using Docker to ship and run the toolkit. The toolkit therefore still needs improvement in the future.
This is the end of my GSoC 2017 blog series. It has been a wonderful experience to work with people in the CMUSphinx community and to develop an open-source toolkit from scratch. Although some problems remain in the project, that also inspires me to keep improving it. Finally, thanks to the CMUSphinx community and my mentors, John, Imran and Arseniy, who have given me much help during the past three months.
Happy coding!