[GSoC 2017 with CMUSphinx] Post #3: IPA Extraction and X-SAMPA Conversion
This week I finished the code for IPA extraction and IPA to X-SAMPA conversion, ran it on the whole Wiktionary XML dump, and prepared to release the first version of the toolkit.
Progress:
- Finished the code for wikitext parsing and IPA extraction. There are currently two main classes in the code, `Wiktionary` and `Parser`:
  - `Wiktionary`: Provides functions to set the language (by default all languages are parsed) and to choose whether IPA is converted to X-SAMPA (off by default). It exposes `Wiktionary.get_entry_pronunciation(wiki_text)` to parse wikitext (for the enwiktionary dump) and `Wiktionary.lookup(word)` to look up the pronunciation of a word on https://en.wiktionary.org/ through the MediaWiki API.
  - `Parser`: The main part of the toolkit. `Parser.parse(wiki_text)` searches for the level 3 heading `===Pronunciation===` or the level 4 heading `====Pronunciation====`, finds the `{{IPA}}` or `{{*-IPA}}` templates, and extracts their content. The pronunciation in an `{{IPA}}` template can be extracted directly, but a `{{*-IPA}}` template such as `{{ru-IPA}}` is expanded by a template call into Lua scripts, in this case Module:ru-pron. Those Lua scripts cannot easily be run locally, so for now such `{{*-IPA}}` templates are expanded online through the MediaWiki API, which is much slower. This issue will be fixed later.
- Finished the code for X-SAMPA conversion. For the IPA to X-SAMPA mapping, I mainly referred to IPA_Kiel_2015.pdf, IPA_chart_2005.pdf and the full IPA chart. The revisions differ in some places, and the mapping rules in the code are not perfect yet. More cases need to be tested next week.
- I have extracted the IPA of all English and French words from the enwiktionary 20170601 XML dump. The results can be downloaded at https://www.dropbox.com/s/v754grzii3z2rfk/en.dict.json?dl=0 (en.dict.json) and https://www.dropbox.com/s/4nh5q16117my5cm/fr.dict.json?dl=0 (fr.dict.json). There are 45,091 words and 135,273 pronunciations in `en.dict.json`, and 41,699 words and 125,097 pronunciations in `fr.dict.json`. The converted X-SAMPA in these dictionaries is not perfect and will be improved next week.
- Prepared a distribution of the package, which will be released next week.
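
To make the extraction and conversion steps above more concrete, here is a minimal sketch. It is not the toolkit's actual code: the function names `extract_ipa` and `ipa_to_xsampa`, the regular expressions, and the tiny mapping table are all assumptions made for illustration. It only handles plain `{{IPA}}` templates (not `{{*-IPA}}` ones, which need Lua expansion) and maps symbols one character at a time, so diacritics and multi-character symbols are left untouched.

```python
import re

# Level 3 or level 4 "Pronunciation" heading, e.g. ===Pronunciation===
HEADING_RE = re.compile(r"^(={3,4})\s*Pronunciation\s*\1\s*$", re.MULTILINE)
# Any wikitext heading, used to find where the Pronunciation section ends
ANY_HEADING_RE = re.compile(r"^={2,6}[^=\n].*?={2,6}\s*$", re.MULTILINE)
# Plain {{IPA|...}} templates only; {{ru-IPA|...}} and friends do not match
IPA_TEMPLATE_RE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")
# A transcription is wrapped in /slashes/ (phonemic) or [brackets] (phonetic)
TRANSCRIPTION_RE = re.compile(r"^(/.+/|\[.+\])$")

# Illustrative subset of an IPA to X-SAMPA table (assumed, far from complete)
IPA_TO_XSAMPA = {
    "ˈ": '"', "ˌ": "%", "ː": ":",
    "ɪ": "I", "ə": "@", "ɔ": "O",
    "ʃ": "S", "ŋ": "N", "θ": "T",
}


def extract_ipa(wiki_text):
    """Return the IPA strings found in {{IPA}} templates under Pronunciation headings."""
    results = []
    for heading in HEADING_RE.finditer(wiki_text):
        start = heading.end()
        next_heading = ANY_HEADING_RE.search(wiki_text, start)
        end = next_heading.start() if next_heading else len(wiki_text)
        section = wiki_text[start:end]
        for template in IPA_TEMPLATE_RE.finditer(section):
            # Skip non-transcription arguments such as "lang=en" or "en"
            for arg in template.group(1).split("|"):
                arg = arg.strip()
                if TRANSCRIPTION_RE.match(arg):
                    results.append(arg)
    return results


def ipa_to_xsampa(ipa):
    """Convert character by character; unknown symbols pass through unchanged."""
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in ipa)


if __name__ == "__main__":
    sample = (
        "==English==\n"
        "===Pronunciation===\n"
        "* {{IPA|/θɪŋ/|lang=en}}\n"
        "\n"
        "===Noun===\n"
        "{{en-noun}}\n"
    )
    for ipa in extract_ipa(sample):
        print(ipa, ipa_to_xsampa(ipa))
```

A real mapping needs to handle diacritics, tie bars and the differences between IPA chart revisions, which is exactly where the current rules still need more testing.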
Plans for next week:
- Complete the docstrings in the Python code and write a detailed README for the toolkit.
- Complete the distribution of the toolkit and release the first version.
- Check the IPA and X-SAMPA results in detail for several languages and collect the target pronunciation dictionaries.
- Prepare for the first evaluation.