Metadata-Version: 2.1
Name: prefixspan
Version: 0.5
Summary: PrefixSpan, BIDE, and FEAT in Python 3
Home-page: https://github.com/chuanconggao/PrefixSpan-py
Author: Chuancong Gao
Author-email: chuancong@gmail.com
License: MIT
Download-URL: https://github.com/chuanconggao/PrefixSpan-py/tarball/0.5
Description: [![PyPi version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        [![PyPi pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        [![PyPi license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
        
        The shortest yet efficient implementation of famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/) in Python 3, with less than 20 lines in core part (scan in `scan.py` and extend in `frequent.py`).
        
        - You can also try the Scala [version](https://github.com/chuanconggao/PrefixSpan-scala).
        
        Also includes the implementation of famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986), which extends the framework of PrefixSpan algorithm (in `closed.py`).
        
        - BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.
        
        Also includes the implementation of frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651), which extends the framework of PrefixSpan algorithm (in `generator.py`).
        
        - FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.
        
        # Features
        
        Outputs traditional single-item sequential patterns, where gaps are allowed between items.
        
        - Mining top-k patterns is supported, with respective optimizations on efficiency.
        
        - You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.
        
        - Custom key function and custom filter function can be applied.
        
        # Installation
        
        This package is available on PyPi. Just use `pip3 install -U prefixspan` to install it.
        
        # CLI Usage
        
        You can simply use the algorithms on terminal.
        
        ``` text
        Usage:
            prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]
        
            prefixspan-cli --help
        
        
        Options:
            --text             Treat each item as text instead of integer.
        
            --closed           Return only closed patterns.
            --generator        Return only generator patterns.
        
            --key=<key>        Custom key function. [default: ]
                               Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
            --bound=<bound>    The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
                               Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
                               Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).
        
            --filter=<filter>  Custom filter function. [default: ]
                               Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.
        
            --minlen=<minlen>  Minimum length of patterns. [default: 1]
            --maxlen=<maxlen>  Maximum length of patterns. [default: 1000]
        ```
        
        * Sequences are read from standard input. Each sequence is integers separated by space, like this example:
        
        ``` text
        cat test.dat
        
        0 1 2 3 4
        1 1 1 3 4
        2 1 2 2 0
        1 1 1 2 2
        ```
        
        - When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
        
        ``` text
        cat test.txt
        
        a b c d e
        b b b d e
        c b c c a
        b b b c c
        ```
        
        * The patterns and their respective frequencies are printed to standard output.
        
        ``` text
        prefixspan-cli frequent 2 test.dat
        
        0 : 2
        1 : 4
        1 2 : 3
        1 2 2 : 2
        1 3 : 2
        1 3 4 : 2
        1 4 : 2
        1 1 : 2
        1 1 1 : 2
        2 : 3
        2 2 : 2
        3 : 2
        3 4 : 2
        4 : 2
        ```
        
        ``` text
        prefixspan-cli frequent 2 --text test.txt
        
        a : 2
        b : 4
        b c : 3
        b c c : 2
        b d : 2
        b d e : 2
        b e : 2
        b b : 2
        b b b : 2
        c : 3
        c c : 2
        d : 2
        d e : 2
        e : 2
        ```
        
        # API Usage
        
        Alternatively, you can use the algorithms via API.
        
        ``` python
        from prefixspan import PrefixSpan
        
        db = [
            [0, 1, 2, 3, 4],
            [1, 1, 1, 3, 4],
            [2, 1, 2, 2, 0],
            [1, 1, 1, 2, 2],
        ]
        
        ps = PrefixSpan(db)
        ```
        
        For details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.
        
        ``` python
        print(ps.frequent(2))
        # [(2, [0]),
        #  (4, [1]),
        #  (3, [1, 2]),
        #  (2, [1, 2, 2]),
        #  (2, [1, 3]),
        #  (2, [1, 3, 4]),
        #  (2, [1, 4]),
        #  (2, [1, 1]),
        #  (2, [1, 1, 1]),
        #  (3, [2]),
        #  (2, [2, 2]),
        #  (2, [3]),
        #  (2, [3, 4]),
        #  (2, [4])]
        
        print(ps.topk(5))
        # [(4, [1]),
        #  (3, [2]),
        #  (3, [1, 2]),
        #  (2, [1, 3]),
        #  (2, [1, 3, 4])]
        
        
        print(ps.frequent(2, closed=True))
        
        print(ps.topk(5, closed=True))
        
        
        print(ps.frequent(2, generator=True))
        
        print(ps.topk(5, generator=True))
        ```
        
        # Closed Patterns and Generator Patterns
        
        The closed patterns are much more compact due to the smaller number.
        
        - A pattern is closed if there is no super-pattern with the same frequency.
        
        ``` text
        prefixspan-cli frequent 2 --closed test.dat
        
        0 : 2
        1 : 4
        1 2 : 3
        1 2 2 : 2
        1 3 4 : 2
        1 1 1 : 2
        ```
        
        The generator patterns are even more compact due to both the smaller number and the shorter lengths.
        
        - A pattern is generator if there is no sub-pattern with the same frequency.
        
        - Due to the high compactness, generator patterns are useful as features for classification, etc.
        
        ``` text
        prefixspan-cli frequent 2 --generator test.dat
        
        0 : 2
        1 1 : 2
        2 : 3
        2 2 : 2
        3 : 2
        4 : 2
        ```
        
        There are patterns that are both closed and generator.
        
        ``` text
        prefixspan-cli frequent 2 --closed --generator test.dat
        
        0 : 2
        ```
        
        # Custom Key Function
        
        For both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
            
        - In default, `len(matches)` is used denoting the frequency of current pattern.
        
        - Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.
        
        - For efficiency, an anti-monotone upper-bound function should also be specified for pruning.
        
            - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
        
        ``` python
        print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
        # [(20, [1]),
        #  (15, [2]),
        #  (15, [1, 2]),
        #  (10, [1, 3]),
        #  (10, [1, 3, 4])]
        ```
        
        # Custom Filter Function
        
        For both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
        
        - In default, `filter` is not applied and all the patterns are returned.
        
        - Alternatively, any function can be used. As an example, `matches[0][0] > 0` can be used to exclude the patterns covering the first sequence.
        
        ``` python
        print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
        # [(2, [1, 1]),
        #  (2, [1, 1, 1]),
        #  (2, [1, 2, 2]),
        #  (2, [2, 2]),
        #  (1, [1, 2, 2, 0])]
        ```
        
        # Tip
        
        I strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.
        
Keywords: data-mining,pattern-mining
Platform: UNKNOWN
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >= 3
Description-Content-Type: text/markdown
