Metadata-Version: 1.1
Name: csvmatch
Version: 1.7
Summary: Find (fuzzy) matches between two CSV files in the terminal.
Home-page: https://github.com/maxharlow/csvmatch
Author: Max Harlow
Author-email: maxharlow@gmail.com
License: Apache
Description: CSV Match
        =========
        
        Find (fuzzy) matches between two CSV files in the terminal.
        
        Tested on Python 2.7 and 3.5.
        
        
        Installing
        ----------
        
            pip install csvmatch
        
        
        Usage
        -----
        
        Say you have one CSV file such as:
        
        ```
        name
        George Smiley
        Percy Alleline
        Roy Bland
        Toby Esterhase
        Peter Guillam
        Bill Haydon
        Oliver Lacon
        Jim Prideaux
        Connie Sachs
        ```
        
        And another such as:
        
        ```
        name
        Maria Andreyevna Ostrakova
        Otto Leipzig
        George SMILEY
        Peter Guillam
        Konny Saks
        Saul Enderby
        Sam Collins
        Tony Esterhase
        Claus Kretzschmar
        ```
        
        You can then find which rows match:
        
        ```bash
        $ csvmatch data1.csv data2.csv
        
        name,name
        Peter Guillam,Peter Guillam
        ```
        
        By default this is case-sensitive. We can make it case-insensitive with `-i`:
        
        ```bash
        $ csvmatch data1.csv data2.csv -i
        
        name,name
        George Smiley,George SMILEY
        Peter Guillam,Peter Guillam
        ```
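        
        Conceptually, exact matching pairs every row of the first file with each identical row of the second. The sketch below illustrates that idea in Python; it is not csvmatch's actual implementation, and the function name `exact_matches` and its `ignore_case` parameter are made up for this example.
        
        ```python
        import csv
        import io
        
        def exact_matches(text1, text2, ignore_case=False):
            """Pair rows of two CSV texts whose values are all equal (illustrative only)."""
            rows1 = list(csv.reader(io.StringIO(text1)))
            rows2 = list(csv.reader(io.StringIO(text2)))
            header = rows1[0] + rows2[0]
        
            def key(row):
                # Lower-case every value when a case-insensitive comparison is wanted.
                return tuple(v.lower() for v in row) if ignore_case else tuple(row)
        
            # Index the second file by comparison key so each lookup is cheap.
            index = {}
            for row in rows2[1:]:
                index.setdefault(key(row), []).append(row)
        
            matches = [r1 + r2 for r1 in rows1[1:] for r2 in index.get(key(r1), [])]
            return [header] + matches
        
        data1 = "name\nGeorge Smiley\nPeter Guillam\n"
        data2 = "name\nGeorge SMILEY\nPeter Guillam\n"
        for row in exact_matches(data1, data2, ignore_case=True):
            print(",".join(row))
        ```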
        
        There are also options to strip non-alphanumeric characters (`-a`) and to sort words (`-s`) before comparison. Specific terms can be filtered out before comparison by passing a text file with the `-l` argument. A predefined list of common English name prefixes (Mr, Ms, etc.) can be used with `-t`.
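        
        To make the effect of these options concrete, here is a rough Python sketch of this kind of preprocessing. It is an assumption about how such steps could be combined, not csvmatch's own code: the function `normalise`, its parameters, the order of the steps, and the short `PREFIXES` set are all hypothetical. Note that this sketch also collapses repeated whitespace.
        
        ```python
        # Hypothetical prefix list for illustration; csvmatch ships its own.
        PREFIXES = {"mr", "mrs", "ms", "dr"}
        
        def normalise(value, ignore_case=False, alphanumeric=False,
                      sort_words=False, strip_prefixes=False):
            """Apply the selected preprocessing steps to a single value."""
            if ignore_case:
                value = value.lower()
            if alphanumeric:
                # Keep only letters, digits, and whitespace.
                value = "".join(ch for ch in value if ch.isalnum() or ch.isspace())
            words = value.split()
            if strip_prefixes:
                # Drop words such as "Mr" or "Mr." regardless of case.
                words = [w for w in words if w.lower().rstrip(".") not in PREFIXES]
            if sort_words:
                words = sorted(words)
            return " ".join(words)
        
        print(normalise("Smiley, George", ignore_case=True, alphanumeric=True, sort_words=True))
        ```
        
        Values normalised this way would be compared in place of the raw cell contents, so "Smiley, George" and "George Smiley" end up identical once case, punctuation, and word order are ignored.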
        
        By default, all columns are used to compare rows. Specific columns can also be given for comparison, listed in the same order for both files. Column headers that contain a space should be enclosed in quotes.
        
        ```bash
        $ csvmatch dataA.csv dataB.csv \
            --fields1 name address \
            --fields2 'Person Name' Address \
            > results.csv
        ```
        
        (This example also uses output redirection to save the results to a file.)
        
        Either file can also be piped in using `-` as a placeholder:
        
        ```bash
        $ cat data1.csv | csvmatch - data2.csv
        ```
        
        ### Fuzzy matching
        
        CSV Match also supports fuzzy matching. This can be combined with any of the above options.
        
        #### Bilenko
        
        The default fuzzy mode makes use of the [Dedupe library](https://github.com/datamade/dedupe) built by Forest Gregg and Derek Eder based on the work of Mikhail Bilenko. This algorithm asks you to give a number of examples of records from each dataset that are the same -- this information is extrapolated to link the rest of the dataset.
        
        ```bash
        $ csvmatch data1.csv data2.csv --fuzzy
        ```
        
        The more examples you give it, the better the results will be. At minimum, you should try to provide 10 positive matches and 10 negative matches.
        
        #### Levenshtein
        
        [Damerau-Levenshtein](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance) is a string distance metric, which counts the number of changes that would have to be made to transform one string into another.
        
        For two strings to be considered a match, we require 60% of the longer string to be the same as the shorter one.
        
        ```bash
        $ csvmatch data1.csv data2.csv --fuzzy levenshtein
        
        name,name
        George Smiley,George SMILEY
        Toby Esterhase,Tony Esterhase
        Peter Guillam,Peter Guillam
        ```
        
        Here this matches Toby Esterhase and Tony Esterhase -- Levenshtein is good at picking up typos and other small differences in spelling.
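        
        As an illustration of how an edit distance can be turned into a match decision, the sketch below computes the optimal string alignment distance (a restricted form of Damerau-Levenshtein) and compares a similarity ratio against a 60% threshold. The exact formula csvmatch uses is not given here, so the ratio `1 - distance / length of the longer string` and the names `osa_distance` and `is_match` are assumptions made for this example.
        
        ```python
        def osa_distance(a, b):
            """Count insertions, deletions, substitutions, and adjacent transpositions."""
            rows, cols = len(a) + 1, len(b) + 1
            d = [[0] * cols for _ in range(rows)]
            for i in range(rows):
                d[i][0] = i  # delete every character of a
            for j in range(cols):
                d[0][j] = j  # insert every character of b
            for i in range(1, rows):
                for j in range(1, cols):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
                    if (i > 1 and j > 1
                            and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                        d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
            return d[-1][-1]
        
        def is_match(a, b, threshold=0.6):
            """Assumed rule: similarity relative to the longer string must reach the threshold."""
            longest = max(len(a), len(b))
            if longest == 0:
                return True
            similarity = 1 - osa_distance(a, b) / float(longest)
            return similarity >= threshold
        
        print(is_match("Toby Esterhase", "Tony Esterhase"))
        ```
        
        Under these assumptions, Toby Esterhase and Tony Esterhase are one substitution apart across 14 characters, which is well above the threshold.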
        
        #### Metaphone
        
        [Double Metaphone](https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone) is a phonetic matching algorithm, which compares strings based on how they are pronounced:
        
        ```bash
        $ csvmatch data1.csv data2.csv --fuzzy metaphone
        
        name,name
        George Smiley,George SMILEY
        Peter Guillam,Peter Guillam
        Connie Sachs,Konny Saks
        ```
        
        This shows a match for Connie Sachs and Konny Saks, despite their very different spellings.
        
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.5
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
