Metadata-Version: 1.1
Name: extremetext
Version: 0.8.3
Summary: A Python interface for the extremeText library
Home-page: https://github.com/mwydmuch/extremeText
Author: Marek Wydmuch
Author-email: mwydmuch@cs.put.poznan.pl
License: BSD
Description: extremeText
        ===========
        
        `extremeText <https://github.com/mwydmuch/extremeText>`__ is an
        extension of the `fastText <https://github.com/facebookresearch/fastText>`__
        library for multi-label classification, including extreme cases with
        hundreds of thousands or millions of labels.
        
        `extremeText <https://github.com/mwydmuch/extremeText>`__ implements:
        
        -  Probabilistic Labels Tree (PLT) loss for extreme multi-label
           classification with top-down hierarchical clustering (k-means) for
           tree building,
        -  sigmoid loss for multi-label classification,
        -  L2 regularization and FOBOS update for all losses,
        -  ensemble of loss layers with bagging,
        -  calculation of hidden (document) vector as a weighted average of the
           word vectors,
        -  calculation of TF-IDF weights for words.
        
        Requirements
        ------------
        
        `extremeText <https://github.com/mwydmuch/extremeText>`__ builds on
        modern macOS and Linux distributions. Since it uses C++11 features, it
        requires a compiler with good C++11 support, such as:
        
        -  gcc-4.8 or newer, or clang-3.3 or newer
        
        You will need:
        
        -  `Python <https://www.python.org/>`__ version 2.7 or >=3.4
        -  `NumPy <http://www.numpy.org/>`__ &
           `SciPy <https://www.scipy.org/>`__
        -  `pybind11 <https://github.com/pybind/pybind11>`__
        
        Installing extremeText
        ----------------------
        
        The easiest way to get
        `extremeText <https://github.com/mwydmuch/extremeText>`__ is to use
        `pip <https://pip.pypa.io/en/stable/>`__.
        
        ::
        
            $ pip install extremetext
        
        Installing on macOS may require setting
        ``MACOSX_DEPLOYMENT_TARGET=10.9`` first:
        
        ::
        
            $ export MACOSX_DEPLOYMENT_TARGET=10.9
            $ pip install extremetext
        
        The latest version of
        `extremeText <https://github.com/mwydmuch/extremeText>`__ can be built
        from source using pip or, alternatively, setuptools.
        
        ::
        
            $ git clone https://github.com/mwydmuch/extremeText.git
            $ cd extremeText
            $ pip install .
            (or) $ python setup.py install
        
        Now you can import this library with:
        
        ::
        
            import extremeText
        
        Examples
        --------
        
        This guide assumes that the reader already has a good knowledge of
        fastText/extremeText. For background, see the main
        `README <https://github.com/mwydmuch/extremeText/blob/master/README.md>`__
        and `the tutorials on the fastText
        website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.
        
        We recommend you look at the `examples within the doc
        folder <https://github.com/mwydmuch/extremeText/tree/master/python/doc/examples>`__.
        
        As with any package, you can get help on any Python function using the
        built-in ``help`` function.
        
        For example:
        
        ::
        
            >>> import extremeText
            >>> help(extremeText.ExtremeText)
        
            Help on module extremeText.ExtremeText in extremeText:
        
            NAME
                extremeText.ExtremeText
        
            DESCRIPTION
                # Copyright (c) 2017-present, Facebook, Inc.
                # All rights reserved.
                #
                # This source code is licensed under the BSD-style license found in the
                # LICENSE file in the root directory of this source tree. An additional grant
                # of patent rights can be found in the PATENTS file in the same directory.
        
            FUNCTIONS
                load_model(path)
                    Load a model given a filepath and return a model object.
        
                tokenize(text)
                    Given a string of text, tokenize it and return a list of tokens
            [...]
        
        IMPORTANT: Preprocessing data / encoding conventions
        ----------------------------------------------------
        
        In general it is important to properly preprocess your data. Example
        scripts in the `root
        folder <https://github.com/mwydmuch/extremeText/extremeText>`__ do this.
        
        extremeText, like fastText, assumes UTF-8 encoded text. All text must be
        `unicode for
        Python 2 <https://docs.python.org/2/library/functions.html#unicode>`__
        and `str for
        Python 3 <https://docs.python.org/3.5/library/stdtypes.html#textseq>`__.
        The passed text will be `encoded as UTF-8 by
        pybind11 <https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions>`__
        before being passed to the extremeText C++ library. This means it is
        important to use UTF-8 encoded text when building a model. On Unix-like
        systems you can convert text using
        `iconv <https://en.wikipedia.org/wiki/Iconv>`__.
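        
        Where ``iconv`` is not available, the same conversion can be sketched
        in Python. The helper name ``convert_to_utf8`` and the default source
        encoding are assumptions made for illustration; they are not part of
        extremeText.
        
        ```python
        import io


        def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
            """Re-encode a text file as UTF-8 (hypothetical helper, not an extremeText API)."""
            # newline="" keeps the line endings exactly as they are in the source file.
            with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
                with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
                    for line in src:
                        dst.write(line)
        ```
        
        The source encoding has to be known in advance; decoding with the wrong
        one may succeed silently and produce garbled text.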
        
        extremeText will tokenize (split text into pieces) based on the
        following ASCII characters (bytes). In particular, it is not aware of
        UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word
        boundaries into one of the following symbols as appropriate.
        
        -  space
        -  tab
        -  vertical tab
        -  carriage return
        -  formfeed
        -  the null character
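        
        One possible way to do this conversion before training is sketched
        below. It replaces every run of non-newline whitespace, including
        non-ASCII whitespace such as the no-break space, with a single ASCII
        space. The helper ``normalize_whitespace`` is hypothetical and not part
        of extremeText.
        
        ```python
        import re

        # Any run of whitespace except the newline, which delimits lines of text.
        # With re.UNICODE, \s also matches non-ASCII whitespace (e.g. U+00A0, U+3000).
        _NON_NEWLINE_WS = re.compile(u"[^\\S\\n]+", re.UNICODE)


        def normalize_whitespace(text):
            """Replace each run of non-newline whitespace with one ASCII space."""
            return _NON_NEWLINE_WS.sub(u" ", text)
        ```
        
        Newlines are left untouched on purpose, so that line boundaries in the
        input survive the normalization.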
        
        The newline character is used to delimit lines of text. In particular,
        the EOS token is appended to a line of text if a newline character is
        encountered. The only exception is if the number of tokens exceeds the
        MAX\_LINE\_SIZE constant as defined in the `Dictionary
        header <https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.h>`__.
        This means if you have text that is not separated by newlines, such as
        the `fil9 dataset <http://mattmahoney.net/dc/textdata>`__, it will be
        broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is
        not appended.
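        
        The behaviour described above can be illustrated with a small
        pure-Python sketch. It is not the library's implementation, and edge
        cases may differ from the C++ code. The function name
        ``chunk_tokens``, the EOS string ``</s>`` and the ``max_line_size``
        argument are assumptions for illustration; the real values are defined
        in the Dictionary header.
        
        ```python
        import re

        # The ASCII token separators listed above (newline is handled separately).
        _SEPARATORS = re.compile("[ \t\v\f\r\0]+")


        def chunk_tokens(text, max_line_size, eos="</s>"):
            """Split text into token lists, mimicking the line handling described above."""
            lines, current = [], []
            pieces = text.split("\n")
            for i, piece in enumerate(pieces):
                for token in _SEPARATORS.split(piece):
                    if not token:
                        continue
                    current.append(token)
                    if len(current) == max_line_size:
                        # Line is too long: cut it here, without an EOS token.
                        lines.append(current)
                        current = []
                if i < len(pieces) - 1:
                    # A newline was encountered: close the line with the EOS token.
                    current.append(eos)
                    lines.append(current)
                    current = []
            if current:
                lines.append(current)
            return lines
        ```
        
        For example, with ``max_line_size=3`` the text ``"a b c d e\n"`` is
        split into ``["a", "b", "c"]`` (cut, no EOS) and
        ``["d", "e", "</s>"]``.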
        
        The length of a token is counted in UTF-8 characters rather than bytes,
        using the `leading two bits of a
        byte <https://en.wikipedia.org/wiki/UTF-8#Description>`__ to identify
        `subsequent bytes of a multi-byte
        sequence <https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.cc>`__.
        Knowing this is especially important when choosing the minimum and
        maximum length of subwords. Further, the EOS token (as specified in the
        `Dictionary
        header <https://github.com/mwydmuch/extremeText/blob/master/src/dictionary.h>`__)
        is considered a character and will not be broken into subwords.
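        
        As a rough illustration of this counting rule: in UTF-8, continuation
        bytes have the bit pattern ``10xxxxxx``, so a token's length in
        characters is the number of bytes whose leading two bits are not
        ``10``. The helper ``utf8_char_count`` below is a hypothetical sketch,
        not a function exported by extremeText.
        
        ```python
        def utf8_char_count(token):
            """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
            data = bytearray(token.encode("utf-8"))
            # (byte & 0xC0) == 0x80 holds exactly for continuation bytes.
            return sum(1 for byte in data if (byte & 0xC0) != 0x80)
        ```
        
        A token such as ``naïve`` occupies 6 bytes in UTF-8 but counts as 5
        characters, which is the length that matters when choosing the minimum
        and maximum subword lengths.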
        
        Reference
        ---------
        
        Please cite the work below if you use this package for extreme classification.
        
        M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński,
        `A no-regret generalization of hierarchical softmax to extreme
        multi-label classification <https://arxiv.org/abs/1810.11671>`__
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
