Metadata-Version: 1.1
Name: wikipedia2vec
Version: 0.1.11
Summary: A tool for learning embeddings of words and entities from Wikipedia
Home-page: http://studio-ousia.github.io/wikipedia2vec/
Author: Studio Ousia
Author-email: ikuya@ousia.jp
License: UNKNOWN
Description: Wikipedia2Vec
        =============
        
        |Fury badge| |CircleCI|
        
        Introduction
        ------------
        
        Wikipedia2Vec is a tool for obtaining high-quality embeddings (vector
        representations) of words and entities from Wikipedia. It is developed
        and maintained by `Studio Ousia <http://www.ousia.jp>`__.
        
        This tool enables you to learn embeddings that map words and entities
        into a unified continuous vector space. The embeddings can be used as
        word embeddings, as entity embeddings, or as unified embeddings of
        words and entities. They are used in state-of-the-art models for
        various tasks such as `entity linking <https://arxiv.org/abs/1601.01343>`__,
        `named entity recognition <http://www.aclweb.org/anthology/I17-2017>`__,
        `entity relatedness <https://arxiv.org/abs/1601.01343>`__, and `question
        answering <https://arxiv.org/abs/1803.08652>`__.
        
        The embeddings can be easily trained from a publicly available Wikipedia
        dump. The code is implemented in Python, and optimized using Cython and
        BLAS.
        
        How It Works
        ------------
        
        Wikipedia2Vec is based on `Word2vec's skip-gram
        model <https://en.wikipedia.org/wiki/Word2vec>`__, which learns to
        predict neighboring words given each word in a corpus. We extend the
        skip-gram model by adding the following two submodels:
        
        -  *The KB link graph model* that learns to estimate neighboring
           entities given an entity in the link graph of Wikipedia entities.
        -  *The anchor context model* that learns to predict neighboring words
           given an entity by using an anchor link that points to the entity and
           its neighboring words.
        
        By jointly optimizing the skip-gram model and these two submodels, our
        model simultaneously learns the embedding of words and entities from
        Wikipedia. For further details, please refer to our paper: `Joint
        Learning of the Embedding of Words and Entities for Named Entity
        Disambiguation <https://arxiv.org/abs/1601.01343>`__.
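        
        As a rough illustration of what the three models are trained on, the
        sketch below enumerates the (input, output) pairs each one would see
        for a toy example. It is not the library's implementation: the function
        names, the fixed-size window, the anchor span format, and the toy data
        are all assumptions made for this example.
        
        .. code:: python
        
            def word_pairs(tokens, window):
                """Skip-gram: each word predicts the words within `window` positions."""
                for i, target in enumerate(tokens):
                    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                        if j != i:
                            yield target, tokens[j]
        
        
            def anchor_pairs(tokens, anchors, window):
                """Anchor context: an entity predicts the words around its anchor link.
        
                `anchors` holds (start, end, entity) token spans (hypothetical format).
                """
                for start, end, entity in anchors:
                    for j in range(max(0, start - window), min(len(tokens), end + window)):
                        # Assumption: the anchor text itself is not used as context.
                        if j < start or j >= end:
                            yield entity, tokens[j]
        
        
            def link_pairs(link_graph):
                """KB link graph: an entity predicts the entities it is linked to."""
                for entity in sorted(link_graph):
                    for neighbor in sorted(link_graph[entity]):
                        yield entity, neighbor
        
        
            tokens = ["yoda", "trained", "luke", "skywalker"]
            anchors = [(2, 4, "Luke Skywalker")]  # tokens[2:4] link to this entity
            links = {"Yoda": ["Luke Skywalker", "Jedi"]}
        
            print(list(word_pairs(tokens, window=1)))
            print(list(anchor_pairs(tokens, anchors, window=1)))
            print(list(link_pairs(links)))
        
        Because all three kinds of pairs update the same vector space, words
        and entities that share contexts end up close to each other.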
        
        Pretrained Embeddings
        ---------------------
        
        (coming soon)
        
        Installation
        ------------
        
        If you want to train embeddings on your machine, it is highly
        recommended to install a BLAS library before installing this tool. We
        recommend using `OpenBLAS <https://www.openblas.net/>`__ or `Intel Math
        Kernel Library <https://software.intel.com/en-us/mkl>`__.
        
        Wikipedia2Vec can be installed from PyPI:
        
        ::
        
            % pip install Wikipedia2Vec
        
        To process Japanese Wikipedia dumps, you also need to install
        `MeCab <http://taku910.github.io/mecab/>`__ and `its Python
        binding <https://pypi.python.org/pypi/mecab-python3>`__.
        
        Learning Embeddings
        -------------------
        
        First, you need to download a source Wikipedia dump file (e.g.,
        enwiki-latest-pages-articles.xml.bz2) from `Wikimedia
        Downloads <https://dumps.wikimedia.org/>`__. The English dump file can
        be obtained by running the following command.
        
        ::
        
            % wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
        
        Note that you do not need to decompress the dump file.
        
        Then, the embeddings can be trained from a Wikipedia dump using the
        *train* command.
        
        ::
        
            % wikipedia2vec train DUMP_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_FILE*: The Wikipedia dump file
        -  *OUT\_FILE*: The output file
        
        **Options:**
        
        -  *--dim-size*: The number of dimensions of the embeddings (default:
           100)
        -  *--window*: The maximum distance between the target item (word or
           entity) and the context word to be predicted (default: 5)
        -  *--iteration*: The number of iterations for Wikipedia pages (default:
           3)
        -  *--negative*: The number of negative samples (default: 5)
        -  *--lowercase/--no-lowercase*: Whether to lowercase words and phrases
           (default: True)
        -  *--min-word-count*: A word is ignored if the total frequency of the
           word is less than this value (default: 10)
        -  *--min-entity-count*: An entity is ignored if the total frequency of
           the entity appearing as the referent of an anchor link is less than
           this value (default: 5)
        -  *--min-paragraph-len*: A paragraph is ignored if its length is
           shorter than this value (default: 5)
        -  *--category*: If this option is specified, categories are included as
           entities in the dictionary (default: False)
        -  *--link-graph/--no-link-graph*: Whether to learn from the Wikipedia
           link graph (default: True)
        -  *--entities-per-page*: For processing each page, the specified number
           of randomly chosen entities are used to predict their neighboring
           entities in the link graph (default: 10)
        -  *--phrase/--no-phrase*: Whether to learn the embeddings of phrases
           (default: True)
        -  *--min-link-count*: A phrase is ignored if the total frequency of the
           phrase appearing as an anchor link is less than this value (default:
           10)
        -  *--min-link-prob*: A phrase is ignored if the probability of the
           phrase appearing as an anchor link is less than this value (default:
           0.1)
        -  *--max-phrase-len*: The maximum number of words in a phrase (default:
           4)
        -  *--init-alpha*: The initial learning rate (default: 0.025)
        -  *--min-alpha*: The minimum learning rate (default: 0.0001)
        -  *--sample*: The parameter that controls the downsampling of frequent
           words (default: 1e-4)
        
        The *train* command internally calls the five commands described below
        (namely, *build\_dump\_db*, *build\_phrase\_dictionary*,
        *build\_dictionary*, *build\_link\_graph*, and *train\_embedding*).
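        
        If you prefer to run the five steps yourself, the sketch below builds
        the corresponding command lines in order. The intermediate file names
        are hypothetical, and placing the *--phrase* and *--link-graph* options
        before the positional arguments is an assumption; only the command
        names, arguments, and option names come from the sections below.
        
        .. code:: python
        
            import os
        
        
            def pipeline_commands(dump_file, out_file, work_dir="."):
                """Return the five commands as argument lists, in execution order."""
                # Hypothetical intermediate file names; any writable paths would do.
                dump_db = os.path.join(work_dir, "dump.db")
                phrase_dic = os.path.join(work_dir, "phrase.dic")
                dic = os.path.join(work_dir, "dictionary.dic")
                link_graph = os.path.join(work_dir, "link_graph.bin")
                tool = "wikipedia2vec"
                return [
                    [tool, "build_dump_db", dump_file, dump_db],
                    [tool, "build_phrase_dictionary", dump_db, phrase_dic],
                    [tool, "build_dictionary", "--phrase", phrase_dic, dump_db, dic],
                    [tool, "build_link_graph", dump_db, dic, link_graph],
                    [tool, "train_embedding", "--link-graph", link_graph,
                     dump_db, dic, out_file],
                ]
        
        
            commands = pipeline_commands("enwiki-latest-pages-articles.xml.bz2",
                                         "enwiki.model")
            for command in commands:
                # To actually run a step, pass `command` to subprocess.check_call.
                print(" ".join(command))
        
        Running the steps separately lets you reuse the dump database, which is
        the largest intermediate file, across several training runs.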
        
        Building Dump Database
        ~~~~~~~~~~~~~~~~~~~~~~
        
        The *build\_dump\_db* command creates a database of Wikipedia pages,
        each of which consists of its text and the anchor links it contains.
        The database built from an English Wikipedia dump is approximately
        15GB.
        
        ::
        
            % wikipedia2vec build_dump_db DUMP_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_FILE*: The Wikipedia dump file
        -  *OUT\_FILE*: The output file
        
        Building Phrase Dictionary
        ~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        The *build\_phrase\_dictionary* command constructs a dictionary
        consisting of phrases extracted from Wikipedia. This command extracts
        all phrases that appear as an anchor link in Wikipedia, and filters
        them using three configurable thresholds, namely *min\_link\_count*,
        *min\_link\_prob*, and *max\_phrase\_len*. Detected phrases are treated
        as words in the subsequent steps.
        
        ::
        
            % wikipedia2vec build_phrase_dictionary DUMP_DB_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_DB\_FILE*: The database file generated using the
           *build\_dump\_db* command
        -  *OUT\_FILE*: The output file
        
        **Options:**
        
        -  *--lowercase/--no-lowercase*: Whether to lowercase phrases (default:
           True)
        -  *--min-link-count*: A phrase is ignored if the total frequency of the
           phrase appearing as an anchor link is less than this value (default:
           30)
        -  *--min-link-prob*: A phrase is ignored if the probability of the
           phrase appearing as an anchor link is less than this value (default:
           0.1)
        -  *--max-phrase-len*: The maximum number of words in a phrase (default:
           4)
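        
        The effect of the three thresholds can be sketched as follows. This is
        an illustration rather than the library's code: the function name, the
        input format, and the reading of "link probability" as the number of
        anchor-link occurrences divided by the number of all occurrences are
        assumptions.
        
        .. code:: python
        
            def filter_phrases(stats, min_link_count=30, min_link_prob=0.1,
                               max_phrase_len=4):
                """Keep phrases that pass all three thresholds.
        
                `stats` maps a phrase to (link_count, total_count), where link_count
                is how often it appears as an anchor link and total_count is how
                often it appears at all (hypothetical input format).
                """
                kept = []
                for phrase, (link_count, total_count) in sorted(stats.items()):
                    if len(phrase.split()) > max_phrase_len:
                        continue  # too many words
                    if link_count < min_link_count:
                        continue  # too rarely used as an anchor link
                    if total_count == 0:
                        continue  # no occurrences, so no probability to compute
                    if float(link_count) / total_count < min_link_prob:
                        continue  # mostly appears as plain text, not as a link
                    kept.append(phrase)
                return kept
        
        
            stats = {
                "new york": (500, 1000),     # frequent and usually linked: kept
                "of the": (40, 100000),      # frequent but almost never linked
                "rare band": (3, 4),         # usually linked but too rare
                "a b c d e": (100, 100),     # five words, exceeds max_phrase_len
            }
            print(filter_phrases(stats))
        
        The probability threshold is what removes common word sequences that
        happen to be linked occasionally.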
        
        Building Dictionary
        ~~~~~~~~~~~~~~~~~~~
        
        The *build\_dictionary* command builds a dictionary of words and
        entities.
        
        ::
        
            % wikipedia2vec build_dictionary DUMP_DB_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_DB\_FILE*: The database file generated using the
           *build\_dump\_db* command
        -  *OUT\_FILE*: The output file
        
        **Options:**
        
        -  *--phrase*: The phrase dictionary file generated using the
           *build\_phrase\_dictionary* command
        -  *--lowercase/--no-lowercase*: Whether to lowercase words (default:
           True)
        -  *--min-word-count*: A word is ignored if the total frequency of the
           word is less than this value (default: 10)
        -  *--min-entity-count*: An entity is ignored if the total frequency of
           the entity appearing as the referent of an anchor link is less than
           this value (default: 5)
        -  *--min-paragraph-len*: A paragraph is ignored if its length is
           shorter than this value (default: 5)
        -  *--category*: If this option is specified, categories are included as
           entities in the dictionary (default: False)
        
        Building Link Graph
        ~~~~~~~~~~~~~~~~~~~
        
        The *build\_link\_graph* command generates a sparse matrix representing
        the link structure between Wikipedia entities.
        
        ::
        
            % wikipedia2vec build_link_graph DUMP_DB_FILE DIC_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_DB\_FILE*: The database file generated using the
           *build\_dump\_db* command
        -  *DIC\_FILE*: The dictionary file generated by the *build\_dictionary*
           command
        -  *OUT\_FILE*: The output file
        
        This command has no options.
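        
        To make the idea of a sparse link matrix concrete, the sketch below
        stores a toy link structure in compressed sparse row (CSR) form using
        plain Python lists. The on-disk format that *build\_link\_graph*
        produces is not described here, so the layout, the function names, and
        the treatment of links as directed are assumptions for illustration.
        
        .. code:: python
        
            def build_csr(links, entities):
                """Return (indptr, indices) for the entity-to-entity link matrix.
        
                Row i lists the column indices of the entities that entity i links
                to. Links to entities outside `entities` are dropped.
                """
                index = {entity: i for i, entity in enumerate(entities)}
                indptr = [0]
                indices = []
                for entity in entities:
                    targets = links.get(entity, ())
                    columns = sorted({index[t] for t in targets if t in index})
                    indices.extend(columns)
                    indptr.append(len(indices))
                return indptr, indices
        
        
            def neighbors(indptr, indices, entities, name):
                """Look up the entities linked from `name` using the CSR arrays."""
                row = entities.index(name)
                return [entities[col] for col in indices[indptr[row]:indptr[row + 1]]]
        
        
            entities = ["Yoda", "Jedi", "Star Wars"]
            links = {
                "Yoda": ["Jedi", "Star Wars", "Not In Dictionary"],
                "Jedi": ["Star Wars"],
            }
            indptr, indices = build_csr(links, entities)
            print(indptr, indices)
            print(neighbors(indptr, indices, entities, "Yoda"))
        
        Only the non-zero entries are stored, which matters because each
        Wikipedia entity links to a tiny fraction of all other entities.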
        
        Learning Embeddings
        ~~~~~~~~~~~~~~~~~~~
        
        The *train\_embedding* command runs the training of the embeddings.
        
        ::
        
            % wikipedia2vec train_embedding DUMP_DB_FILE DIC_FILE OUT_FILE
        
        **Arguments:**
        
        -  *DUMP\_DB\_FILE*: The database file generated using the
           *build\_dump\_db* command
        -  *DIC\_FILE*: The dictionary file generated by the *build\_dictionary*
           command
        -  *OUT\_FILE*: The output file
        
        **Options:**
        
        -  *--link-graph*: The link graph file generated using the
           *build\_link\_graph* command
        -  *--dim-size*: The number of dimensions of the embeddings (default:
           100)
        -  *--window*: The maximum distance between the target item (word or
           entity) and the context word to be predicted (default: 5)
        -  *--iteration*: The number of iterations for Wikipedia pages (default:
           3)
        -  *--negative*: The number of negative samples (default: 5)
        -  *--entities-per-page*: For processing each page, the specified number
           of randomly chosen entities are used to predict their neighboring
           entities in the link graph (default: 10)
        -  *--init-alpha*: The initial learning rate (default: 0.025)
        -  *--min-alpha*: The minimum learning rate (default: 0.0001)
        -  *--sample*: The parameter that controls the downsampling of frequent
           words (default: 1e-4)
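        
        The sketch below shows one common way the *--init-alpha*, *--min-alpha*,
        and *--sample* options are interpreted in Word2vec-style training: a
        learning rate that decays linearly with training progress, and a
        keep-probability for frequent words taken from the Word2vec paper.
        Whether Wikipedia2Vec uses exactly these formulas is an assumption, and
        the function names are hypothetical.
        
        .. code:: python
        
            import math
        
        
            def learning_rate(progress, init_alpha=0.025, min_alpha=0.0001):
                """Linearly decay from init_alpha to min_alpha as progress goes 0 -> 1."""
                progress = min(max(progress, 0.0), 1.0)
                alpha = init_alpha - (init_alpha - min_alpha) * progress
                return max(min_alpha, alpha)
        
        
            def keep_probability(frequency, sample=1e-4):
                """Probability of keeping a word whose relative frequency is `frequency`.
        
                Uses min(1, sqrt(sample / frequency)): words rarer than `sample`
                are always kept, and more frequent words are dropped more often.
                """
                if frequency <= 0:
                    return 1.0
                return min(1.0, math.sqrt(sample / frequency))
        
        
            for progress in (0.0, 0.5, 1.0):
                print(progress, learning_rate(progress))
            for frequency in (0.05, 0.01, 1e-6):
                print(frequency, keep_probability(frequency))
        
        Under these assumptions, a smaller *--sample* value discards frequent
        words such as "the" more aggressively.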
        
        Saving Embeddings in Text Format
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        The *save\_text* command outputs a model in text format.
        
        ::
        
            % wikipedia2vec save_text MODEL_FILE OUT_FILE
        
        **Arguments:**
        
        -  *MODEL\_FILE*: The model file generated by the *train\_embedding*
           command
        -  *OUT\_FILE*: The output file
        
        This command has no options.
        
        Sample Usage
        ------------
        
        .. code:: python
        
            >>> from wikipedia2vec import Wikipedia2Vec
        
            >>> wiki2vec = Wikipedia2Vec.load(MODEL_FILE)
        
            >>> wiki2vec.get_word_vector(u'the')
            memmap([ 0.01617998, -0.03325786, -0.01397999, -0.00150471,  0.03237337,
            ...
                   -0.04226106, -0.19677088, -0.31087297,  0.1071524 , -0.09824426], dtype=float32)
        
            >>> wiki2vec.get_entity_vector(u'Scarlett Johansson')
            memmap([-0.19793572,  0.30861306,  0.29620451, -0.01193621,  0.18228433,
            ...
                    0.04986198,  0.24383858, -0.01466644,  0.10835337, -0.0697331 ], dtype=float32)
        
            >>> wiki2vec.most_similar(wiki2vec.get_word(u'yoda'), 5)
            [(<Word yoda>, 1.0),
             (<Entity Yoda>, 0.84333622),
             (<Word darth>, 0.73328167),
             (<Word kenobi>, 0.7328127),
             (<Word jedi>, 0.7223742)]
        
            >>> wiki2vec.most_similar(wiki2vec.get_entity(u'Scarlett Johansson'), 5)
            [(<Entity Scarlett Johansson>, 1.0),
             (<Entity Natalie Portman>, 0.75090045),
             (<Entity Eva Mendes>, 0.73651594),
             (<Entity Emma Stone>, 0.72868186),
             (<Entity Cameron Diaz>, 0.72390842)]
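        
        The scores above behave like cosine similarities (an item compared with
        itself scores 1.0). Assuming that is the measure used, the ranking can
        be reproduced on toy vectors with the standalone sketch below. It does
        not call the Wikipedia2Vec API; the function names and the
        two-dimensional vectors are made up for illustration.
        
        .. code:: python
        
            import math
        
        
            def cosine(u, v):
                """Cosine similarity between two equal-length vectors."""
                dot = sum(a * b for a, b in zip(u, v))
                norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
                return dot / norm if norm else 0.0
        
        
            def rank_by_similarity(query, vectors, count):
                """Return the `count` items most similar to `query`, best first."""
                scored = [(name, cosine(vectors[query], vec))
                          for name, vec in vectors.items()]
                scored.sort(key=lambda item: item[1], reverse=True)
                return scored[:count]
        
        
            # Toy embeddings; real vectors have --dim-size dimensions.
            toy_vectors = {
                "yoda": [1.0, 0.0],
                "jedi": [0.8, 0.6],
                "pizza": [0.0, 1.0],
            }
            print(rank_by_similarity("yoda", toy_vectors, 2))
        
        Because words and entities share one vector space, the same ranking
        works across both kinds of items, as in the *yoda* example above.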
        
        Reference
        ---------
        
        If you use Wikipedia2Vec in a scientific publication, please cite the
        following paper:
        
        ::
        
            @InProceedings{yamada-EtAl:2016:CoNLL,
              author    = {Yamada, Ikuya  and  Shindo, Hiroyuki  and  Takeda, Hideaki  and  Takefuji, Yoshiyasu},
              title     = {Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
              booktitle = {Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
              month     = {August},
              year      = {2016},
              address   = {Berlin, Germany},
              pages     = {250--259},
              publisher = {Association for Computational Linguistics}
            }
        
        License
        -------
        
        `Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__
        
        .. |Fury badge| image:: https://badge.fury.io/py/wikipedia2vec.png
           :target: http://badge.fury.io/py/wikipedia2vec
        .. |CircleCI| image:: https://circleci.com/gh/studio-ousia/wikipedia2vec/tree/master.svg?style=svg
           :target: https://circleci.com/gh/studio-ousia/wikipedia2vec/tree/master
        
Keywords: wikipedia,embedding,wikipedia2vec
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
