[GSoC 2017 with CMUSphinx] Post 5#: Documentation
This post covers this week's improvements and the new documentation for the project.
There are mainly two improvements:
- Fix unexpected tag error in python-mwxml
The project depends on python-mwxml, a collection of utilities for efficiently processing MediaWiki's XML database dumps. To extract pronunciation sections from Wiktionary, we use the enwiktionary-20170601-pages-articles-multistream.xml dump. The dump contains many <DiscussionThreading> tags; the first one appears in the page with id 2113483 and title "Thread:User talk:ArielGlenn/lqt test number 1", and it causes a MalformedXML error in mwxml/iteration/page.py. To take a quick look at that part of the XML dump, use the following command:
$ grep -A 30 -B 5 "<id>2113483</id>" enwiktionary-20170601-pages-articles-multistream.xml
</revision>
</page>
<page>
<title>Thread:User talk:ArielGlenn/lqt test number 1</title>
<ns>90</ns>
<id>2113483</id>
<DiscussionThreading>
<ThreadSubject>lqt test number 1</ThreadSubject>
<ThreadPage>User talk:ArielGlenn</ThreadPage>
<ThreadID>1</ThreadID>
<ThreadAuthor>ArielGlenn</ThreadAuthor>
<ThreadEditStatus>has-reply</ThreadEditStatus>
<ThreadType>normal</ThreadType>
<ThreadSignature>[[User:ArielGlenn|ArielGlenn]]</ThreadSignature>
</DiscussionThreading>
<revision>
<id>9033408</id>
<timestamp>2010-05-17T06:11:56Z</timestamp>
<contributor>
<username>ArielGlenn</username>
<id>33073</id>
</contributor>
<comment>New thread: lqt test number 1</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">here it is, the first lqt thread on en wikt. Did I break anything? --</text>
<sha1>oxu9xhsuif5idtzx8k0he872vxbyubf</sha1>
</revision>
</page>
<page>
<title>min of meer</title>
<ns>0</ns>
<id>2113484</id>
<revision>
<id>21906521</id>
<parentid>19657063</parentid>
The <DiscussionThreading> tags only appear in pages with namespace 90, which this project does not use, so it is safe to skip them. The solution is to add an elif branch to the parsing code that skips the <DiscussionThreading> tag; it can be found at python-mwxml PR#6.
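To illustrate the idea behind the fix (this is a self-contained sketch with invented element handling, not python-mwxml's actual code): when iterating a page's child elements, an unrecognized tag like <DiscussionThreading> is skipped instead of raising an error.

```python
# Sketch of skipping unexpected tags while parsing a <page> element.
# The parsing logic here is illustrative only, not python-mwxml's code.
import xml.etree.ElementTree as ET

PAGE_XML = """<page>
  <title>Thread:User talk:ArielGlenn/lqt test number 1</title>
  <ns>90</ns>
  <id>2113483</id>
  <DiscussionThreading><ThreadID>1</ThreadID></DiscussionThreading>
  <revision><id>9033408</id></revision>
</page>"""

def parse_page(xml_text):
    page = ET.fromstring(xml_text)
    parsed = {}
    for child in page:
        if child.tag in ("title", "ns", "id"):
            parsed[child.tag] = child.text
        elif child.tag == "revision":
            parsed.setdefault("revisions", []).append(child)
        elif child.tag == "DiscussionThreading":
            continue  # skip LiquidThreads metadata instead of failing
        # without the elif above, an unknown tag would raise MalformedXML
    return parsed

print(parse_page(PAGE_XML)["id"])  # → 2113483
```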
- Add `extract_IPA` method in `Wiktionary` class
The previous version of the `Wiktionary` class only provided a `get_entry_pronunciation` method to parse the wikitext of a single Wiktionary entry, so users had to traverse the XML dump in a for loop and handle every entry in the dump themselves, for example:
>>> import mwxml
>>> from pywiktionary import Wiktionary
>>>
>>> wikt = Wiktionary(lang=None, XSAMPA=True)
>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> dump = mwxml.Dump.from_file(open(dump_file, "rb"))
>>>
>>> for page in dump:
... for revision in page:
... pronunciation = wikt.get_entry_pronunciation(revision.text)
... print(pronunciation)
...
[{"IPA": ..., "X-SAMPA": ..., "lang": ...}, ...]
Now we have added a new `extract_IPA` method to the `Wiktionary` class, so users only need to specify the dump file path and get the pronunciation list by calling `extract_IPA`, for example:
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(lang=None, XSAMPA=True)
>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> wikt.extract_IPA(dump_file)
[{'id': 16,
'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
'lang': 'en'},
{'IPA': '/ˈdɪkʃənɛɹi/',
'X-SAMPA': '/"dIkS@nEr\\i/',
'lang': 'en'}]},
'title': 'dictionary'},
{'id': 65195,
'pronunciation': {'English': 'IPA not found.'},
'title': 'battleship'},
{'id': 39478,
'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
'X-SAMPA': '/"m3:d@(r\\)/',
'lang': 'en'},
{'IPA': '/ˈmɝ.dɚ/',
'X-SAMPA': '/"m3`.d@`/',
'lang': 'en'}]},
'title': 'murder'},
{'id': 80141,
'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
'X-SAMPA': '/"d{z@l/',
'lang': 'en'}]},
'title': 'dazzle'}]
To extract IPA for a specific language, simply specify the `lang` parameter of the `Wiktionary` class.
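Conceptually, filtering by language means keeping only the entries whose pronunciation dict contains that language. A self-contained sketch of that idea, operating on the result structure shown above (this is not pywiktionary's actual internals, just an illustration):

```python
# Toy filter over the extract_IPA-style result structure shown above.
# Entries whose pronunciation is the string "IPA not found." are dropped.
def filter_by_lang(results, lang):
    filtered = []
    for page in results:
        pron = page["pronunciation"].get(lang)
        if isinstance(pron, list):  # skip "IPA not found." strings
            filtered.append({"title": page["title"], lang: pron})
    return filtered

sample = [
    {"id": 16, "title": "dictionary",
     "pronunciation": {"English": [{"IPA": "/ˈdɪkʃ(ə)n(ə)ɹɪ/", "lang": "en"}]}},
    {"id": 65195, "title": "battleship",
     "pronunciation": {"English": "IPA not found."}},
]
print(filter_by_lang(sample, "English"))
```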
Documentation:
Documentation for this project was created this week and will be improved continuously. We use sphinx-doc for documentation generation; docstrings in the Python code follow the NumPy style over reStructuredText and are included in the documentation via the Sphinx autodoc extension. The published documentation for this project can be found at readthedocs.
Here's the structure of the documentation:
1. Introduction: The basic information and installation guide for the package.
2. Usage: Example code for the three features of the package.
3. pywiktionary API: Detailed references for the `Wiktionary` class and `Parser` class in the package, along with utility functions.
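For readers unfamiliar with the NumPy docstring style mentioned above, here is a toy example of the format that autodoc picks up (the function and its mapping are invented for illustration, not part of the package):

```python
def to_xsampa(ipa):
    """Convert an IPA string to X-SAMPA (toy example of NumPy style).

    Parameters
    ----------
    ipa : str
        Pronunciation in IPA notation.

    Returns
    -------
    str
        The equivalent X-SAMPA notation (only a few symbols mapped here).
    """
    mapping = {"ˈ": '"', "ə": "@", "ɪ": "I", "ʃ": "S"}
    return "".join(mapping.get(ch, ch) for ch in ipa)
```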
Next step:
Currently, `{{*-IPA}}` templates are expanded through the Wiktionary online API, which is unstable and costs a lot of time. To solve this problem, the Lua expansion scripts that Wiktionary uses need to be converted into offline Python code in this package. Roughly, to extract pronunciations for all the languages mentioned in post #0, the {{fr-IPA}} (French), {{ru-IPA}} (Russian), {{hi-IPA}} (Hindi) and {{zh-pron}} (Mandarin) templates need to be converted.
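As a rough illustration of what converting these Lua scripts involves: modules like fr-IPA are largely tables of grapheme-to-IPA rewrite rules applied in order, which translate naturally into Python. The rules below are invented and incomplete, purely to show the shape of the work, not Wiktionary's actual fr-IPA logic:

```python
# Toy rule-based grapheme-to-IPA conversion, sketching the kind of logic
# the Lua {{fr-IPA}} module implements. Rules are invented examples.
RULES = [
    ("ch", "ʃ"),   # e.g. "chou" → "ʃou"
    ("ou", "u"),   # e.g. "ʃou" → "ʃu"
    ("on", "ɔ̃"),   # nasal vowel
]

def toy_fr_ipa(word):
    out = word
    for src, dst in RULES:  # apply rules in table order
        out = out.replace(src, dst)
    return "/" + out + "/"

print(toy_fr_ipa("chou"))  # → /ʃu/
```

The real modules also handle rule ordering, exceptions, and context-sensitive matches, which is why a faithful offline port is the main remaining effort.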