[GSoC 2017 with CMUSphinx] Post #5: Documentation

This post covers this week's improvements and the project's new documentation.

There are two main improvements:

  1. Fix unexpected tag error in python-mwxml

    The project depends on python-mwxml, a collection of utilities for efficiently processing MediaWiki’s XML database dumps. To extract pronunciation sections from Wiktionary, we use the enwiktionary-20170601-pages-articles-multistream.xml dump. The dump contains many <DiscussionThreading> tags. The first one appears in the page with id 2113483 and title "Thread:User talk:ArielGlenn/lqt test number 1", and it causes a MalformedXML error in mwxml/iteration/page.py. To take a quick look at that part of the XML dump, use the following command:

    $ grep -A 30 -B 5 "<id>2113483</id>" enwiktionary-20170601-pages-articles-multistream.xml
        </revision>
      </page>
      <page>
        <title>Thread:User talk:ArielGlenn/lqt test number 1</title>
        <ns>90</ns>
        <id>2113483</id>
    <DiscussionThreading>
            <ThreadSubject>lqt test number 1</ThreadSubject>
            <ThreadPage>User talk:ArielGlenn</ThreadPage>
            <ThreadID>1</ThreadID>
            <ThreadAuthor>ArielGlenn</ThreadAuthor>
            <ThreadEditStatus>has-reply</ThreadEditStatus>
            <ThreadType>normal</ThreadType>
            <ThreadSignature>[[User:ArielGlenn|ArielGlenn]]</ThreadSignature>
    </DiscussionThreading>
        <revision>
          <id>9033408</id>
          <timestamp>2010-05-17T06:11:56Z</timestamp>
          <contributor>
            <username>ArielGlenn</username>
            <id>33073</id>
          </contributor>
          <comment>New thread: lqt test number 1</comment>
          <model>wikitext</model>
          <format>text/x-wiki</format>
          <text xml:space="preserve">here it is, the first lqt thread on en wikt.  Did I break anything? --</text>
          <sha1>oxu9xhsuif5idtzx8k0he872vxbyubf</sha1>
        </revision>
      </page>
      <page>
        <title>min of meer</title>
        <ns>0</ns>
        <id>2113484</id>
        <revision>
          <id>21906521</id>
          <parentid>19657063</parentid>


    The <DiscussionThreading> tags appear in entries with namespace 90, which our project does not use, so it is safe to skip them. The fix adds an elif branch to the code that skips the <DiscussionThreading> tag; it can be found at python-mwxml PR#6.
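The idea of skipping an unneeded tag can be illustrated with a minimal, self-contained sketch. This is not the code from python-mwxml or from the PR: it uses only the standard library, a simplified sample without the XML namespace that real dumps carry, and hypothetical names (`read_page`, `read_pages`, `SKIPPED_TAGS`, and a local `MalformedXML` class) chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Simplified excerpt modelled on the dump fragment above (assumption: no
# XML namespace, only a few fields kept).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Thread:User talk:ArielGlenn/lqt test number 1</title>
    <ns>90</ns>
    <id>2113483</id>
    <DiscussionThreading>
      <ThreadSubject>lqt test number 1</ThreadSubject>
    </DiscussionThreading>
    <revision>
      <id>9033408</id>
      <text>here it is</text>
    </revision>
  </page>
</mediawiki>"""

# Hypothetical set of tags that are safe to ignore inside <page>.
SKIPPED_TAGS = {"DiscussionThreading"}


class MalformedXML(Exception):
    """Raised when a <page> contains a tag the reader does not expect."""


def read_page(page_elem):
    """Collect the known children of a <page> element."""
    page = {"revisions": []}
    for child in page_elem:
        if child.tag == "revision":
            page["revisions"].append(child.findtext("text"))
        elif child.tag in ("title", "ns", "id"):
            page[child.tag] = child.text
        elif child.tag in SKIPPED_TAGS:
            # The extra branch: ignore the whole subtree instead of failing.
            continue
        else:
            raise MalformedXML("Unexpected tag found: {0}".format(child.tag))
    return page


def read_pages(stream):
    """Yield one dict per <page>, freeing each element after use."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield read_page(elem)
            elem.clear()


pages = list(read_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Without the `SKIPPED_TAGS` branch, the same input would end in the `else` clause and raise the error, which mirrors the failure described above.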

  2. Add extract_IPA method in Wiktionary class

    The previous version of the Wiktionary class only provided a get_entry_pronunciation method, which parses the wikitext of a single Wiktionary entry. Users therefore had to traverse the XML dump in a for loop and handle every entry themselves, for example:

    >>> import mwxml
    >>> from pywiktionary import Wiktionary
    >>>
    >>> wikt = Wiktionary(lang=None, XSAMPA=True)
    >>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
    >>> dump = mwxml.Dump.from_file(open(dump_file, "rb"))
    >>>
    >>> for page in dump:
    ...     for revision in page:
    ...         pronunciation = wikt.get_entry_pronunciation(revision.text)
    ...         print(pronunciation)
    ...
    [{"IPA": ..., "X-SAMPA": ..., "lang": ...}, ...]


    We have now added a new extract_IPA method to the Wiktionary class, so users only need to pass the dump file path to extract_IPA to get the pronunciation list, for example:

    >>> from pywiktionary import Wiktionary
    >>> wikt = Wiktionary(lang=None, XSAMPA=True)
    >>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
    >>> wikt.extract_IPA(dump_file)
    [{'id': 16,
      'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
                                     'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
                                     'lang': 'en'},
                                    {'IPA': '/ˈdɪkʃənɛɹi/',
                                     'X-SAMPA': '/"dIkS@nEr\\i/',
                                     'lang': 'en'}]},
      'title': 'dictionary'},
     {'id': 65195,
      'pronunciation': {'English': 'IPA not found.'},
      'title': 'battleship'},
     {'id': 39478,
      'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
                                     'X-SAMPA': '/"m3:d@(r\\)/',
                                     'lang': 'en'},
                                    {'IPA': '/ˈmɝ.dɚ/',
                                     'X-SAMPA': '/"m3`.d@`/',
                                     'lang': 'en'}]},
      'title': 'murder'},
     {'id': 80141,
      'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
                                     'X-SAMPA': '/"d{z@l/',
                                     'lang': 'en'}]},
      'title': 'dazzle'}]


    To extract IPA for a specific language, just specify the `lang` parameter when creating the Wiktionary instance.
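For downstream use, the list returned by extract_IPA can be flattened into (title, transcription) pairs with plain Python. The sketch below assumes only the result shape shown in the output above; `iter_ipa` is a hypothetical helper, not part of pywiktionary, and the sample data is copied from that output.

```python
def iter_ipa(entries, language="English", key="IPA"):
    """Yield (title, transcription) pairs for one language.

    `entries` is assumed to have the shape printed by extract_IPA above.
    """
    for entry in entries:
        prons = entry["pronunciation"].get(language)
        # Entries without a transcription carry a plain string such as
        # 'IPA not found.' instead of a list of dicts, so skip them.
        if not isinstance(prons, list):
            continue
        for pron in prons:
            yield entry["title"], pron[key]


# Two entries taken from the example output above.
sample = [
    {"id": 65195,
     "pronunciation": {"English": "IPA not found."},
     "title": "battleship"},
    {"id": 80141,
     "pronunciation": {"English": [{"IPA": "/ˈdæzəl/",
                                    "X-SAMPA": '/"d{z@l/',
                                    "lang": "en"}]},
     "title": "dazzle"},
]

pairs = list(iter_ipa(sample))
xsampa_pairs = list(iter_ipa(sample, key="X-SAMPA"))
print(pairs)
print(xsampa_pairs)
```

Checking for a list rather than comparing against the literal message keeps the helper working even if the "not found" text changes.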

Documentation:

Documentation for this project was created this week and will be improved continuously. We use sphinx-doc to generate it. Docstrings in the Python code are written in NumPy style on top of reStructuredText and are included in the documentation by the Sphinx autodoc extension. The published documentation for this project can be found at readthedocs.
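As an illustration of the docstring convention, here is a small function documented in NumPy style. The function `convert_with_table` is hypothetical and written only for this example; it is not part of the pywiktionary API.

```python
def convert_with_table(ipa, table):
    """Convert an IPA string to X-SAMPA with a lookup table.

    Parameters
    ----------
    ipa : str
        IPA transcription, for example ``"/ˈdæzəl/"``.
    table : dict of str to str
        Mapping from single IPA characters to X-SAMPA symbols.

    Returns
    -------
    str
        The transcription with every character found in ``table``
        replaced; all other characters are kept unchanged.
    """
    return "".join(table.get(ch, ch) for ch in ipa)


# A tiny table covering only the characters of this one example.
demo_table = {"ˈ": '"', "æ": "{", "ə": "@"}
converted = convert_with_table("/ˈdæzəl/", demo_table)
print(converted)
```

The section headers underlined with dashes ("Parameters", "Returns") are what distinguish NumPy style from plain reStructuredText field lists.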

Here's the structure of the documentation:
1. Introduction: Basic information and an installation guide for the package.
2. Usage: Example code for three features of the package.
3. pywiktionary API: Detailed reference for the `Wiktionary` and `Parser` classes in the package, plus the utility functions.


Next step:

Currently, {{*-IPA}} templates are expanded through the Wiktionary online API, which is unstable and slow.
To solve this problem, the Lua scripts that Wiktionary uses for template expansion need to be converted into offline Python scripts in this package. Roughly, to extract pronunciations for all languages mentioned in post #0, {{fr-IPA}} (French), {{ru-IPA}} (Russian), {{hi-IPA}} (Hindi) and {{zh-pron}} (Mandarin) need to be converted.
