[GSoC 2017 with CMUSphinx] Post #4: First Evaluation Report - A Beta Version of pywiktionary
This week I added a nearly perfect module, with test cases, for IPA to X-SAMPA conversion, completed the comments and README for the existing code, and set up Travis CI for continuous integration.
The installation and usage of the toolkit can be found at enwiktionary2cmudict/README.md. Nose and pylint are used as the test suite and code style linter, respectively.
The toolkit currently has three main features:
1. Extract pronunciation from Wiktionary XML dump
The example code for extracting IPA from the Wiktionary XML dump is at enwiktionary2cmudict/expr.py; here is a simple example:
>>> import mwxml
>>> from pywiktionary import Wiktionary
>>>
>>> # Create an instance of Wiktionary class
>>> wikt = Wiktionary(lang=None, XSAMPA=True)
>>> # Read enwiktionary XML dump using mwxml
>>> # (dump_file is the path to the downloaded XML dump)
>>> dump = mwxml.Dump.from_file(open(dump_file, "rb"))
>>>
>>> # Traverse the dump and extract IPA for each page
>>> for page in dump:
...     for revision in page:
...         pronunciation = wikt.get_entry_pronunciation(revision.text)
...         print(pronunciation)
...
[{"IPA": ..., "X-SAMPA": ..., "lang": ...}, ...]
This part still needs to be integrated into the `Wiktionary` class behind a better interface. Unit tests also need to be added for it.
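To give a rough idea of what extracting a pronunciation from a page's wikitext involves, here is a minimal sketch using only the standard library. It is not the toolkit's actual parser: `extract_ipa` is a hypothetical helper, and it assumes the `{{IPA|/.../|lang=en}}` template form with a `lang=` argument. Other template forms would need extra handling.

```python
import re

# Matches a single {{IPA|...}} template without nested templates inside it.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")


def extract_ipa(wikitext):
    """Return a list of {"IPA": ..., "lang": ...} dicts found in wikitext.

    Hypothetical helper for illustration only; it assumes the language is
    given as a `lang=xx` argument and every other unnamed argument is a
    transcription.
    """
    results = []
    for match in IPA_TEMPLATE.finditer(wikitext):
        lang = None
        transcriptions = []
        for arg in match.group(1).split("|"):
            arg = arg.strip()
            if arg.startswith("lang="):
                lang = arg[len("lang="):]
            elif arg and "=" not in arg:
                transcriptions.append(arg)
        for ipa in transcriptions:
            results.append({"IPA": ipa, "lang": lang})
    return results


sample = "* {{a|UK}} {{IPA|/ri:d/|lang=en}}\n* {{IPA|/rEd/|/red/|lang=en}}"
print(extract_ipa(sample))
```

The sample uses ASCII stand-ins for the transcriptions to keep the snippet easy to paste; real entries contain IPA characters, which the regular expression handles the same way.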
mediawiki-utilities/python-mwxml raises a MalformedXML error at mwxml/iteration/page.py#L86 when it parses "DiscussionThreading"-related tags, which are legal according to the MediaWiki XML Schema. This problem needs to be fixed through a PR to the upstream dependency.
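Until that is fixed upstream, one possible workaround (not what the toolkit currently does) is to stream the dump with the standard library's `xml.etree.ElementTree.iterparse`, which does not check tag names against the schema and therefore simply passes over unknown elements. A minimal sketch, assuming the usual `<page>` / `<title>` / `<revision>` / `<text>` layout; `iter_page_texts`, `local_name` and the tiny `SAMPLE` dump are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(stream):
    """Yield (title, text) for every <page> element in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title" and title is None:
                title = node.text
            elif name == "text":
                # If a page has several revisions, the last one wins.
                text = node.text or ""
        yield title, text
        # Drop the page's children once we are done with it.
        elem.clear()


# Made-up miniature dump containing a DiscussionThreading element.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>change</title>
    <DiscussionThreading><ThreadSubject>demo</ThreadSubject></DiscussionThreading>
    <revision><text>{{IPA|/tSeIndZ/|lang=en}}</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_page_texts(io.BytesIO(SAMPLE)):
    print(title, text)
```

This trades mwxml's convenient page and revision objects for robustness, so it is only a stopgap for the traversal step.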
2. Look up pronunciation for a word from en.wiktionary.org
The toolkit also provides a convenient way to look up the IPA for a given word through the Wiktionary online API.
First, create an instance of the Wiktionary class:
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)
Look up a word using the `lookup` method:
>>> word = wikt.lookup("present")
The entry for the word "present" is at en: [[present]], and here is the lookup result:
>>> from pprint import pprint
>>> pprint(word)
{'Catalan': 'IPA not found.',
 'Danish': [{'IPA': '/prɛsanɡ/', 'X-SAMPA': '/prEsang/', 'lang': 'da'},
            {'IPA': '[pʰʁ̥ɛˈsɑŋ]', 'X-SAMPA': '[p_hR_0E"sAN]', 'lang': 'da'}],
 'English': [{'IPA': '/ˈpɹɛzənt/', 'X-SAMPA': '/"pr\\Ez@nt/', 'lang': 'en'},
             {'IPA': '/pɹɪˈzɛnt/', 'X-SAMPA': '/pr\\I"zEnt/', 'lang': 'en'},
             {'IPA': '/pɹəˈzɛnt/', 'X-SAMPA': '/pr\\@"zEnt/', 'lang': 'en'}],
 'Ladin': 'IPA not found.',
 'Middle French': 'IPA not found.',
 'Old French': 'IPA not found.',
 'Swedish': [{'IPA': '/preˈsent/', 'X-SAMPA': '/pre"sent/', 'lang': 'sv'}]}
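The result is keyed by language because a Wiktionary page holds one level-2 section (`==English==`, `==Swedish==`, ...) per language. As a rough illustration of that grouping step, and not the toolkit's real implementation, a page's wikitext could be split like this; `split_by_language` is a hypothetical helper and the sample text is invented:

```python
import re

# A level-2 heading such as "==English=="; "===Pronunciation===" does not match
# because the captured name may not start with "=".
LANGUAGE_HEADING = re.compile(r"^==([^=\n]+)==[ \t]*$", re.MULTILINE)


def split_by_language(wikitext):
    """Map each language name to the body of its level-2 section."""
    sections = {}
    matches = list(LANGUAGE_HEADING.finditer(wikitext))
    for i, match in enumerate(matches):
        # A section runs until the next language heading or the end of the page.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1).strip()] = wikitext[match.end():end].strip()
    return sections


page = (
    "==English==\n"
    "===Pronunciation===\n"
    "* {{IPA|/rEd/|lang=en}}\n"
    "\n"
    "==Swedish==\n"
    "===Noun===\n"
    "present\n"
)
for language, body in split_by_language(page).items():
    print(language, "->", repr(body))
```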
To look up a word in a specific language, set the lang parameter, for example "English" or "French":
>>> wikt = Wiktionary(lang="English", XSAMPA=True)
>>> word = wikt.lookup("read")
>>> pprint(word)
[{'IPA': '/ɹiːd/', 'X-SAMPA': '/r\\i:d/', 'lang': 'en'},
 {'IPA': '/ɹɛd/', 'X-SAMPA': '/r\\Ed/', 'lang': 'en'}]
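For readers curious what a lookup through the online API might look like underneath, the sketch below only builds a request URL for the standard MediaWiki `action=query` endpoint, which returns a page's wikitext. Whether pywiktionary uses exactly these parameters is an assumption, and `build_wikitext_query` is a hypothetical name; no network request is made here.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint of the English Wiktionary.
API_ENDPOINT = "https://en.wiktionary.org/w/api.php"


def build_wikitext_query(title):
    """Return a URL asking the MediaWiki API for the wikitext of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_wikitext_query("present"))
```

Fetching that URL and pulling the revision content out of the JSON response would then feed the same wikitext parsing as in the dump case.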
3. IPA -> X-SAMPA conversion
The IPA and X-SAMPA symbol sets are all taken from https://en.wiktionary.org/wiki/Module:IPA/data/symbols, which is exactly what enwiktionary itself uses. The mapping rules are partially adapted from the https://en.wiktionary.org/wiki/Module:IPA Lua module, so the conversion code is well matched to the IPA found in enwiktionary. Here's a conversion example:
>>> from pywiktionary import IPA
>>> IPA_text = "/t͡ʃeɪnd͡ʒ/" # en: [[change]]
>>> XSAMPA_text = IPA.IPA_to_XSAMPA(IPA_text)
>>> XSAMPA_text
"/t__SeInd__Z/"
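At its core this kind of conversion is a table lookup. The following sketch shows the idea with a handful of symbols whose X-SAMPA equivalents appear in the outputs in this post; it is not the module's real table, which covers the full symbol set and more mapping rules. `ipa_to_xsampa` and the small `IPA_TO_XSAMPA` dict are hypothetical names.

```python
# A tiny subset of the symbol table, written with escapes to avoid
# look-alike characters. The tie bar mapping follows the example above.
IPA_TO_XSAMPA = {
    "\u0361": "__",   # combining tie bar
    "\u0283": "S",    # voiceless postalveolar fricative
    "\u0292": "Z",    # voiced postalveolar fricative
    "\u026a": "I",    # near-close front unrounded vowel
    "\u025b": "E",    # open-mid front unrounded vowel
    "\u0259": "@",    # schwa
    "\u0279": "r\\",  # alveolar approximant
    "\u02c8": '"',    # primary stress
    "\u02d0": ":",    # length mark
}


def ipa_to_xsampa(text):
    """Convert character by character, leaving unknown symbols unchanged."""
    return "".join(IPA_TO_XSAMPA.get(char, char) for char in text)


print(ipa_to_xsampa("/t\u0361\u0283e\u026and\u0361\u0292/"))
print(ipa_to_xsampa("/p\u0279\u026a\u02c8z\u025bnt/"))
```

A per-character lookup is enough here because every key is a single code point; symbols that need context (for example diacritics that change a preceding letter) would call for a longest-match or rule-based pass instead.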
Plans for next week:
- Test cases for wiktionary.py and parser.py
- PRs to python-mwxml to fix the DiscussionThreading tag problem and Python 2 compatibility.
- Convert more {{*-IPA}} Lua modules at https://en.wiktionary.org/wiki/Module:*-IPA/ to Python modules for this toolkit, in order to avoid expanding *-IPA templates through the online API.