Metadata-Version: 2.1
Name: wikivector
Version: 1.2.1
Summary: WikiVector: Tools for encoding Wikipedia articles as vectors
Home-page: https://github.com/mortonne/wikivector
Author: Neal Morton
Author-email: mortonne@gmail.com
License: GPL-3.0-or-later
Description: # wikivector
        
        [![PyPI version](https://badge.fury.io/py/wikivector.svg)](https://badge.fury.io/py/wikivector)
        [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4453878.svg)](https://doi.org/10.5281/zenodo.4453878)
        
        Tools for encoding Wikipedia articles as vectors.
        
        ## Installation
        
        To get the latest stable version:
        
        ```bash
        pip install wikivector
        ```
        
        To get the development version:
        
        ```bash
        pip install git+https://github.com/mortonne/wikivector
        ```
        
        ## Exporting Wikipedia text
        
        First, run [WikiExtractor](https://github.com/attardi/wikiextractor)
        on a Wikipedia dump. This will generate a directory with many 
        subdirectories and text files within each subdirectory. Next, build 
        a header file with a list of all articles in the extracted text data:
        
        ```bash
        wiki_header wiki_dir
        ```
        
        where `wiki_dir` is the path to the output from `WikiExtractor`. 
        This will create a CSV file called `header.csv` with the title of each 
        article and the file in which it can be found.
        
        To extract specific articles, write a CSV file with two columns: "item"
        and "title". The "title" for each item must exactly match an article
        title in the Wikipedia dump. We refer to this file as the `map_file`.
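        
        As a minimal sketch, a map file with these two columns can be written
        from Python with the standard `csv` module. The helper name
        `write_map_file`, the file name `map.csv`, and the example items are
        illustrative assumptions and are not part of wikivector.
        
        ```python
        import csv


        def write_map_file(path, pairs):
            """Write (item, title) pairs to a CSV map file with a header row."""
            with open(path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(["item", "title"])  # the two required columns
                writer.writerows(pairs)


        # Hypothetical items; each title must exactly match a title in the dump.
        pairs = [
            ("Eiffel Tower", "Eiffel Tower"),
            ("Einstein", "Albert Einstein"),
        ]
        write_map_file("map.csv", pairs)
        ```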
        
        If you are working with an older Wikipedia dump, it can be difficult to 
        find the correct titles for article pages, as page titles may have changed
        between the archive and the current online version of Wikipedia. To help 
        identify mismatches between the map file and the Wikipedia dump, you can 
        run:
        
        ```bash
        wiki_check_map header_file map_file
        ```
        
        to display any items whose article is not found in the header file. You
        can then use the command-line utility `grep` to search the header file
        for the correct title of each missing item.
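        
        If `grep` is not available, a similar search can be sketched in Python.
        The function name `find_titles` is hypothetical. Because the exact
        column layout of `header.csv` is not described here, the sketch checks
        every field in each row, so a query may also match the file column.
        
        ```python
        import csv


        def find_titles(header_file, query):
            """Return header rows where any field contains query (case-insensitive)."""
            query = query.lower()
            matches = []
            with open(header_file, newline="", encoding="utf-8") as f:
                for row in csv.reader(f):
                    # Compare in lowercase so "einstein" finds "Albert Einstein".
                    if any(query in field.lower() for field in row):
                        matches.append(row)
            return matches
        ```
        
        For example, `find_titles("header.csv", "einstein")` would list every
        row that mentions "einstein", which can then be copied into the map
        file.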
        
        When your map file is ready, extract the text for each item:
        
        ```bash
        export_articles header_file map_file output_dir
        ```
        
        where `map_file` is the CSV file with your items, and `output_dir` is
        where you want to save text files with each item's article. Check the
        output carefully to ensure that you have the correct text for each item
        and that XML tags have been stripped out.
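        
        One way to partly automate this check is to scan the exported files for
        anything that still looks like a tag. The sketch below is a heuristic,
        not part of wikivector: the name `files_with_tags` and the regular
        expression are assumptions, and the pattern can flag ordinary text that
        happens to be wrapped in angle brackets.
        
        ```python
        import re
        from pathlib import Path

        # Matches anything shaped like an XML/HTML tag, e.g. <doc id="1"> or </ref>.
        TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


        def files_with_tags(output_dir):
            """Map each file name in output_dir to the tag-like strings found in it."""
            flagged = {}
            for path in sorted(Path(output_dir).iterdir()):
                if not path.is_file():
                    continue
                tags = TAG_PATTERN.findall(path.read_text(encoding="utf-8"))
                if tags:
                    flagged[path.name] = tags
            return flagged
        ```
        
        An empty dictionary means no tag-like strings were found. Reading a few
        files by eye is still worthwhile, since this does not confirm that each
        file holds the right article.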
        
        ## Universal Sentence Encoder
        
        Once articles have been exported, you can calculate a vector embedding
        for each item using the Universal Sentence Encoder.
        
        ```bash
        embed_articles map_file text_dir h5_file
        ```
        
        This reads a map file specifying an item pool (only the "item" field is
        used) and writes the vectors to an HDF5 file. To read the vectors in
        Python:
        
        ```python
        from wikivector import vector

        # Hypothetical path; use the file written by embed_articles.
        h5_file = "vectors.h5"
        vectors, items = vector.load_vectors(h5_file)
        ```
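        
        A common next step is to compare items by the cosine similarity of
        their vectors. The sketch below assumes that `vectors` is a 2D NumPy
        array with one row per item, which is not stated above. The function
        name `cosine_similarity_matrix` is hypothetical, and the toy array is a
        stand-in rather than real embeddings.
        
        ```python
        import numpy as np


        def cosine_similarity_matrix(vectors):
            """Return the item-by-item cosine similarity of the row vectors."""
            vectors = np.asarray(vectors, dtype=float)
            norms = np.linalg.norm(vectors, axis=1, keepdims=True)
            unit = vectors / norms  # scale each row to unit length
            return unit @ unit.T


        # Toy stand-in for the array returned by load_vectors (assumed layout).
        toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
        sim = cosine_similarity_matrix(toy)
        ```
        
        Entry `sim[i, j]` compares item `i` with item `j`. A row of all zeros
        would cause a division by zero, so such rows need handling first.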
        
        ## Citation
        
        If you use wikivector, please cite the following paper:
        
        Morton, NW*, Zippi, EL*, Noh, S, Preston, AR. In press.
        Semantic knowledge of famous people and places is represented in hippocampus and distinct cortical networks.
        Journal of Neuroscience. *authors contributed equally
        
Keywords: NLP,Wikipedia
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
