Metadata-Version: 2.1
Name: wiki-dump-parser
Version: 2.0.1
Summary: A simple but fast Python script that reads the XML dump of a wiki and outputs the processed data in a CSV file.
Home-page: https://github.com/Grasia/wiki-scripts/tree/master/wiki_dump_parser
Author: Abel 'Akronix' Serrano Juste
Author-email: akronix5@gmail.com
License: AGPL-3.0
Description: wiki dump parser
        ================
        
        A simple but fast Python script that reads the XML dump of a wiki and
        outputs the processed data in a CSV file.
        
        `The full revision history of a MediaWiki wiki can be backed up as an
        XML file, known as an XML
        dump. <https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)>`__
        This file is a record of all the edits made in a wiki, with all the
        corresponding data regarding date, page, author and the full content
        of each edit.
        
        Very often we just want the metadata of each edit (date, author and
        page) and therefore do not need the content of the edit, which is
        by far the longest piece of data.
        
        This script converts this very long XML dump into CSV files that are
        much smaller and easier to read and work with. It takes care of
        
        Usage
        -----
        
        Install the package using pip:
        
        ``pip install wiki_dump_parser``
        
        Then, use it directly from the command line:
        
        ``python -m wiki_dump_parser <dump.xml>``
        
        Or from Python code:
        
        .. code:: python
        
            import wiki_dump_parser as parser
            parser.xml_to_csv('dump.xml')
        
        The output CSV files use '\|' as the quote character for strings, so
        it must be passed as the quote character when loading them. For
        example, the output file "dump.csv" generated by this script can be
        loaded with pandas like this:
        
        .. code:: python
        
            import pandas as pd
            # '|' is the quote character used in the generated CSV files
            df = pd.read_csv('dump.csv', quotechar='|', index_col=False)
            df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%dT%H:%M:%SZ')
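        
        Why the quote character matters can be seen in a small self-contained
        sketch. It uses only the standard library ``csv`` module and an
        in-memory sample; the sample rows and every column name other than
        ``timestamp`` are made up for illustration and are not necessarily the
        columns this script writes:
        
        .. code:: python
        
            import csv
            import io
            from datetime import datetime
        
            # Hypothetical sample: fields containing commas are wrapped in '|'
            # instead of the usual '"'.
            SAMPLE = (
                "page_title,timestamp,contributor_name\n"
                "|Main Page, archived|,2018-03-01T12:30:00Z,|Alice|\n"
                "Sandbox,2018-03-02T08:15:45Z,Bob\n"
            )
        
            def load_rows(text):
                """Parse CSV text quoted with '|' and convert timestamps to datetime."""
                reader = csv.DictReader(io.StringIO(text), quotechar='|')
                rows = []
                for row in reader:
                    row['timestamp'] = datetime.strptime(
                        row['timestamp'], '%Y-%m-%dT%H:%M:%SZ')
                    rows.append(row)
                return rows
        
            rows = load_rows(SAMPLE)
        
        With the default quote character, the first data row of this sample
        would be split into four fields instead of three.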
        
        Dependencies
        ------------
        
        -  python 3
        
        *Yes, nothing more.*
        
        How to get a wiki history dump
        ------------------------------
        
        There are several ways to get the wiki dump:
        
        -  If you have access to the server, follow the `instructions in the
           mediawiki
           docs <https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)>`__.
        -  For **Wikia wikis** and `many other
           domains <https://github.com/Grasia/wiki-scripts/tree/master/wiki_dump_downloader#domains-tested>`__,
           you can use our in-house script developed for this task. It is
           straightforward to use and very fast.
        -  **Wikimedia project wikis**: For wikis belonging to the Wikimedia
           project, there is a regularly updated repository with all the dumps
           at http://dumps.wikimedia.org. `Select your target wiki from the
           list <https://dumps.wikimedia.org/backup-index-bydb.html>`__, then
           download the complete edit history dump and uncompress it.
        -  For **other wikis**, like self-hosted wikis, you can use
           WikiTeam's dumpgenerator.py script. There is a simple tutorial `in
           their
           wiki <https://github.com/WikiTeam/wikiteam/wiki/Tutorial#I_have_no_shell_access_to_server>`__.
           Its usage is very straightforward and the script is well maintained.
           Remember to use the ``--xml`` option to download the full history dump.
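        
        Downloaded dumps are often compressed and have to be uncompressed
        before parsing. The following is a minimal sketch for the case of a
        bz2-compressed file, using only the standard library; the function
        name and the file names are hypothetical, and other archive formats
        (such as 7z) need other tools:
        
        .. code:: python
        
            import bz2
            import shutil
        
            def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
                """Stream-decompress a bz2 file so it is never fully held in memory."""
                with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
                    shutil.copyfileobj(src, dst, chunk_size)
                return dst_path
        
            # Example with hypothetical file names:
            # decompress_bz2('dump.xml.bz2', 'dump.xml')
        
        Copying in chunks keeps memory use low, which matters because full
        history dumps can be very large.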
        
Keywords: wiki dump parser Wikia xml csv pandas processing history data
Platform: UNKNOWN
Requires-Python: >=3
Description-Content-Type: text/x-rst
