Metadata-Version: 2.1
Name: datafilter
Version: 0.3.0
Summary: Quickly find tokens (words, phrases, etc) within your data.
Home-page: https://github.com/jcp/datafilter
Author: James C. Palmer
Author-email: me@jcp.io
License: BSD 3-Clause
Description: # Data Filter
        
        [![pypi](https://img.shields.io/pypi/v/datafilter.svg?color=brightgreen)](https://pypi.org/project/datafilter/)
        [![pypi](https://img.shields.io/pypi/pyversions/datafilter.svg)](https://pypi.org/project/datafilter/)
        [![codecov](https://codecov.io/gh/jcp/datafilter/branch/master/graph/badge.svg)](https://codecov.io/gh/jcp/datafilter/)
        [![Build Status](https://travis-ci.org/jcp/datafilter.svg?branch=master)](https://travis-ci.org/jcp/datafilter/)
        
        Quickly find tokens (words, phrases, etc.) within your data.
        
        Data Filter is a lightweight [data cleansing](https://en.wikipedia.org/wiki/Data_cleansing) framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:
        
        * CSV files
        * Text files
        * Text strings
        
        # Table of Contents
        
        * [Requirements](#requirements)
        * [Installation](#installation)
        * [Basic Usage](#basic-usage)
        * [Features](#features)
            * [Base](#base)
            * [Filters](#filters)
                * [CSV](#csv)
                * [Text](#text)
                * [TextFile](#textfile)
        
        # Requirements
        
        * Python 3.6+
        
        # Installation
        
        To install, simply use [pipenv](http://pipenv.org/) (or pip):
        
        ```bash
        $ pipenv install datafilter
        ```
        
        # Basic Usage
        
        Each example below returns a generator that yields [parsed](#parse) data.
        
        ## CSV
        
        ```python
        from datafilter import CSV
        
        tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
        data = CSV("test.csv", tokens=tokens)
        print(next(data.results()))
        ```
        
        ## Text
        
        ```python
        from datafilter import Text
        
        text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
        data = Text(text, tokens=["Lorem"])
        print(next(data.results()))
        ```
        
        ## Text File
        
        ```python
        from datafilter import TextFile
        
        data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
        print(next(data.results()))
        ```
        
        # Features
        
        Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:
        
        * Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
        * Filters that can handle incoming data from external APIs.
        
        ## Base
        
        Abstract base class that's subclassed by every filter.
        
        `Base` includes several methods to ensure data is properly normalized, formatted and returned. The `results` method is an `@abstractmethod` to enforce its use in subclasses.
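        
        The abstract-method pattern described above can be sketched without Data Filter itself. The classes below (`SketchBase`, `WordFilter`) are hypothetical stand-ins, not the library's implementation. They only illustrate how an `@abstractmethod` forces every subclass to define `results`.
        
        ```python
        from abc import ABC, abstractmethod


        class SketchBase(ABC):
            """Hypothetical stand-in for a filter base class (not datafilter's Base)."""

            def __init__(self, tokens):
                self.tokens = tokens

            @abstractmethod
            def results(self):
                """Subclasses must return an iterable of results."""


        class WordFilter(SketchBase):
            """Hypothetical subclass that yields every word found in `tokens`."""

            def __init__(self, text, tokens):
                super().__init__(tokens)
                self.text = text

            def results(self):
                for word in self.text.split():
                    if word in self.tokens:
                        yield word


        # Instantiating the abstract class directly fails because `results` is missing.
        try:
            SketchBase(tokens=["Lorem"])
        except TypeError:
            abstract_instantiation_failed = True
        else:
            abstract_instantiation_failed = False

        matches = list(WordFilter("Lorem ipsum dolor", tokens=["Lorem"]).results())
        ```
        
        A subclass that omits `results` would fail with the same `TypeError` on instantiation, so the mistake surfaces before any data is processed.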
        
        ### Parameters
        
        #### tokens
        
        `type <list>`
        
        A list of strings that will be searched for within a set of data.
        
        #### translations
        
        `type <list>`
        
        A list of strings that will be removed during normalization.
        
        **Default**
        
        ```python
        ['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~–—/\\_', '\t\n\r\x0c\x0b']
        ```
        
        #### bidirectional
        
        `type <bool>`
        
        When `True`, tokens are matched both as written and in reverse.
        
        **Default**
        
        ```python
        True
        ```
        
        > **Note:**
        >
        > A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.
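        
        As a rough illustration of the idea in the note, the sketch below checks each token both as written and reversed. The helper `find_tokens` is hypothetical and uses plain substring matching; the library's actual matching logic may differ.
        
        ```python
        def find_tokens(text, tokens, bidirectional=True):
            """Return the tokens found in `text`, optionally checking reversed spellings."""
            found = []
            for token in tokens:
                candidates = [token]
                if bidirectional:
                    candidates.append(token[::-1])  # "Lorem" becomes "meroL"
                if any(candidate in text for candidate in candidates):
                    found.append(token)
            return found


        # The reversed spelling is only caught when bidirectional matching is on.
        forward_only = find_tokens("meroL ipsum", ["Lorem"], bidirectional=False)
        both_ways = find_tokens("meroL ipsum", ["Lorem"], bidirectional=True)
        ```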
        
        #### caseinsensitive
        
        `type <bool>`
        
        When `True`, tokens and data are converted to lowercase during normalization.
        
        **Default**
        
        ```python
        True
        ```
        
        ### Methods
        
        #### results
        
        Abstract method used to return results within a filter. It must be implemented by each `Base` subclass.
        
        #### maketrans
        
        Returns a translation table used during normalization.
        
        **Returns**
        
        `type <dict>`
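        
        One plausible way to build such a table is Python's built-in `str.maketrans`, which returns a `dict` when given characters to delete. This is a sketch under that assumption, using the default `translations` value listed earlier; it is not necessarily how the library builds its table.
        
        ```python
        # Default `translations` value, copied from the parameter description.
        translations = ['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~–—/\\_', '\t\n\r\x0c\x0b']

        # The third argument of str.maketrans lists the characters to delete.
        table = str.maketrans("", "", "".join(translations))

        # Digits and punctuation are removed; ordinary spaces are kept.
        cleaned = "Lorem, ipsum (2019)!".translate(table)
        ```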
        
        #### normalize
        
        Returns normalized data. Normalization includes converting data to [lowercase](#caseinsensitive) and [removing strings](#translations).
        
        **Returns**
        
        `type <tuple>`
        
        > **Note:**
        >
        > Normalized data is returned as a tuple. The first element is the original data. The second element is the normalized data.
        >
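        
        The tuple shape described in the note can be sketched as follows. The standalone `normalize` function and the shortened translation table are assumptions made for illustration, not the library's method.
        
        ```python
        def normalize(data, table, caseinsensitive=True):
            """Return (original, normalized); a sketch, not the library's method."""
            normalized = data.translate(table)
            if caseinsensitive:
                normalized = normalized.lower()
            return data, normalized


        # Hypothetical, shortened translation table: delete digits and some punctuation.
        table = str.maketrans("", "", "0123456789.,!?")

        original, normalized = normalize("Lorem Ipsum, 2019!", table)
        ```
        
        Keeping the original alongside the normalized text lets a filter match against the cleaned form while still reporting the untouched input.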
        
        #### parse
        
        Returns parsed and properly formatted data.
        
        **Returns**
        
        `type <dict>`
        
        > **Example:**
        >
        > Assume we're searching for the token "Lorem" in a very short string.
        >
        > ```python
        > data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
        > print(next(data.results()))
        > ```
        >
        > The returned result would be formatted as:
        >
        > ```python
        > {
        >     "data": "Lorem ipsum dolor sit amet",
        >     "flagged": True,
        >     "describe": {
        >         "tokens": {
        >             "detected": ["Lorem"],
        >             "count": 1,
        >             "frequency": {
        >                 "Lorem": 1,
        >             },
        >         }
        >     },
        > }
        > ```
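        
        A result in this shape is an ordinary dictionary, so it can be inspected with normal key access. The sketch below reuses the sample result from the example; the `summarize` helper is hypothetical.
        
        ```python
        # Sample result copied from the example above.
        result = {
            "data": "Lorem ipsum dolor sit amet",
            "flagged": True,
            "describe": {
                "tokens": {
                    "detected": ["Lorem"],
                    "count": 1,
                    "frequency": {"Lorem": 1},
                }
            },
        }


        def summarize(result):
            """Return a one-line summary of a parsed result (hypothetical helper)."""
            if not result["flagged"]:
                return "clean"
            frequency = result["describe"]["tokens"]["frequency"]
            parts = [f"{token} x{count}" for token, count in sorted(frequency.items())]
            return "flagged: " + ", ".join(parts)


        summary = summarize(result)
        ```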
        
        ## Filters
        
        Filters subclass and extend the `Base` class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.
        
        ## CSV
        
        ### Parameters
        
        `CSV` is a subclass of `Base` and inherits all parameters.
        
        #### path
        
        `type <str>`
        
        Path to a CSV file.
        
        ### Methods
        
        `CSV` is a subclass of `Base` and inherits all methods.
        
        ## Text
        
        ### Parameters
        
        `Text` is a subclass of `Base` and inherits all parameters.
        
        #### text
        
        `type <str>`
        
        A text string.
        
        #### re_split
        
        `type <str>`
        
        A regular expression pattern or string that will be applied to `text` with `re.split` before normalization.
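        
        To show what such a pattern does on its own, the sketch below applies the lookbehind pattern from the Basic Usage section directly with `re.split`. It assumes Python 3.7 or newer, where `re.split` can split on a zero-width match; the cleanup step afterwards is an illustrative choice, not documented library behavior.
        
        ```python
        import re

        text = "Lorem ipsum. Dolor sit amet. Consectetur."

        # Zero-width lookbehind: split after each period, keeping the period.
        parts = re.split(r"(?<=\.)", text)

        # Drop the empty trailing piece and trim surrounding whitespace.
        sentences = [part.strip() for part in parts if part.strip()]
        ```
        
        Splitting first means each sentence can be normalized and reported as its own result.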
        
        ### Methods
        
        `Text` is a subclass of `Base` and inherits all methods.
        
        ## TextFile
        
        ### Parameters
        
        `TextFile` is a subclass of `Base` and inherits all parameters.
        
        #### path
        
        `type <str>`
        
        Path to a text file.
        
        #### re_split
        
        `type <str>`
        
        A regular expression pattern or string that will be applied to the file's contents with `re.split` before normalization.
        
        ### Methods
        
        `TextFile` is a subclass of `Base` and inherits all methods.
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.6
Description-Content-Type: text/markdown
