[GSoC 2017 with CMUSphinx] Post #0: Community Bonding
My proposal "Collect Pronunciation Dictionaries from Wiktionary" with CMUSphinx has been accepted in Google Summer of Code 2017 and I will spend this summer working on this project.
The Collect Pronunciation Dictionaries from Wiktionary project aims to expand the pronunciation dictionaries in CMUSphinx to cover new words and more languages, drawing on Wikimedia Foundation projects. The current Sphinx dictionaries cover only a limited set of words and languages, which falls short of what many applications need, so expanding them is urgent. Reusing existing pronunciation resources is critical for improving system performance and for supporting recently coined words and additional languages. A valuable source is Wiktionary, a multilingual, web-based project to create a free content dictionary of all words in all languages. Although Wiktionary records pronunciations for many words and languages in a standard notation such as IPA, it is not easy to parse those pronunciations from pages with varying formats, or to convert phonemes from different languages into a common phone set such as CMUBET that Sphinx can use. This project will solve those problems and produce at least ten pronunciation dictionaries, which will be evaluated on several ASR benchmarks for Sphinx.
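To make the IPA-to-CMUBET step concrete, here is a minimal, illustrative sketch in Python. The mapping table covers only a handful of English phonemes, and the symbol names are my assumption about how the CMUBET set lines up with IPA; the real project needs a verified table for each language's phone inventory.

```python
# Illustrative (assumed) IPA -> CMUBET mapping for a few English phonemes.
# The real conversion needs a verified table per language.
IPA_TO_CMUBET = {
    "æ": "AE", "ɪ": "IH", "iː": "IY", "ʃ": "SH",
    "ŋ": "NG", "k": "K", "t": "T", "s": "S",
}

def ipa_to_cmubet(ipa):
    """Greedily convert an IPA string to CMUBET phones, longest match first."""
    phones, i = [], 0
    symbols = sorted(IPA_TO_CMUBET, key=len, reverse=True)
    while i < len(ipa):
        for sym in symbols:
            if ipa.startswith(sym, i):
                phones.append(IPA_TO_CMUBET[sym])
                i += len(sym)
                break
        else:
            i += 1  # skip stress marks, length marks, etc. not in the table
    return phones

print(ipa_to_cmubet("stɪk"))  # ['S', 'T', 'IH', 'K']
```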
Mentors who will work with me:
- John: John has very extensive experience with multilingual Wikimedia projects and is familiar with a wide variety of Wikimedia-related tools.
- Imran: Imran is an experienced researcher working on speech recognition.
- Arseniy: Arseniy is a speech recognition and machine learning researcher and I worked with him during the application period of this project.
Progress:
- During the application period I wrote code that retrieves a given word from Wiktionary, parses the pronunciation part (including IPA) and returns a structured Python dict (a minimal sketch of this idea appears after this list).
- Based on the acoustic models already trained with Sphinx, we agreed on a draft list of ten languages for the pronunciation dictionaries to be collected in this project: English, French, German, Spanish, Italian, Russian, Hindi, Greek, Dutch and Mandarin. Words for all of these languages will be collected from enwiktionary.
- Attempted to use pywikibot-wiktionary for wikitext parsing, without success. I went through the whole pywikibot-wiktionary code base and tried it on some examples, but it failed on many words. The problem is that the tool tries to parse every part of the Wiktionary wiki source with a long chain of `if` conditions (for example pywikibot-wiktionary/wiktionarypage.py#L102-L444); those enumerated cases break on some pages and cause further problems. The author also notes at pywikibot-wiktionary/wiktionarypage.py#L4-L17 that the tool is still rather limited, because it tries to enumerate every situation (part of speech, for instance), which is very hard to complete. I tried to improve the code, but as mentioned, enumerating all cases is both hard and unnecessary for this project. Scanning the whole wikitext many times in Python, working on raw strings and characters, also makes the scripts slow. In fact we only need the pronunciation section of each page, not a full-page parse, which is much simpler; the terms we need from that section are limited to IPA, enPR and a few others. The number of bugs grows with the number of lines of code, especially code that uses so many `if` statements on strings that may not be well formatted.
- I found that Wiktionary provides dumps in XML format, and we decided to parse the pronunciation sections from the enwiktionary XML dump instead of fetching wikitext through the Wiktionary API one entry at a time, to reduce network cost (see the dump-streaming sketch after this list).
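As a reference for the first progress item above, this is a minimal sketch of the idea behind the existing code: fetch a page's wikitext through the Wiktionary API and pull IPA strings out of the Pronunciation section only, instead of parsing the whole page. The regexes and the `{{IPA|...}}` template form are assumptions and will need adjusting to the actual page formats.

```python
# Sketch: fetch wikitext via the Wiktionary API and extract IPA from the
# Pronunciation section only. Template/heading patterns are assumptions.
import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def fetch_wikitext(title):
    resp = requests.get(API, params={
        "action": "parse", "page": title,
        "prop": "wikitext", "format": "json",
    })
    return resp.json()["parse"]["wikitext"]["*"]

def extract_ipa(wikitext):
    """Return {'ipa': [...]} from the Pronunciation section, ignoring the rest."""
    # Grab only the ===Pronunciation=== section instead of parsing the whole page.
    m = re.search(r"===\s*Pronunciation\s*===(.*?)(?:\n==|\Z)", wikitext, re.S)
    if not m:
        return {"ipa": []}
    section = m.group(1)
    # First parameter of {{IPA|...}} templates; keep only /.../ or [...] strings.
    candidates = re.findall(r"\{\{IPA\|([^|}]+)", section)
    return {"ipa": [p for p in candidates if p.startswith(("/", "["))]}

print(extract_ipa(fetch_wikitext("dictionary")))
```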
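And, as mentioned in the last progress item, the plan is to read pronunciations from the enwiktionary XML dump rather than the API. Below is a rough sketch of streaming the dump with `ElementTree.iterparse`; the dump file name and the export-0.10 namespace are assumptions to be checked against the actual dump.

```python
# Sketch: stream the enwiktionary XML dump (pages-articles, bz2) page by page
# without loading it all into memory. File name and namespace are assumptions.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"
DUMP = "enwiktionary-latest-pages-articles.xml.bz2"

def iter_pages(path=DUMP):
    """Yield (title, wikitext) for every page in the dump, one at a time."""
    with bz2.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, text
                elem.clear()  # free the finished page element

if __name__ == "__main__":
    for title, text in iter_pages():
        if "===Pronunciation===" in text:
            pass  # hand the wikitext to the pronunciation parser
```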
Plans for next week:
- Survey MediaWiki XML parsing tools and parse the Wiktionary XML dump.
- Migrate the already implemented functions to the new framework.