Metadata-Version: 1.1
Name: ruamel.pdfdouble
Version: 0.1
Summary: Documentation for ruamel.pdfdouble
Home-page: https://bitbucket.org/ruamel/pdfdouble
Author: Anthon van der Neut
Author-email: a.van.der.neut@ruamel.eu
License: MIT license
Description: 
        ruamel.pdfdouble
        ================
        
        
        This package provides the ``pdfdbl`` command::
        
            pdfdbl scan dir1 dir2
        
        This will walk down the directories provided as arguments and, for each
        PDF file found, create a hash based on (in order):
        
        - metadata if unique
        - images if the number of images
        - text
        
        This assumes that ``pdfinfo``, ``pdfimages`` and ``pdftotext`` from the
        ``poppler-utils`` package are available.
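        
        As an illustration of the text-based step only, the following is a minimal
        sketch and not the package's actual code. The function names ``hash_text``
        and ``pdf_text_hash`` and the choice of SHA-256 are assumptions made for
        this example.
        
        ```python
        import hashlib
        import subprocess
        
        
        def hash_text(data):
            """Return a hex digest for extracted text, given as bytes.
        
            SHA-256 is an assumption; the package may use another algorithm.
            """
            return hashlib.sha256(data).hexdigest()
        
        
        def pdf_text_hash(pdf_path):
            """Hash the text that ``pdftotext`` extracts from one PDF file."""
            # 'pdftotext FILE -' writes the extracted text to stdout
            text = subprocess.check_output(['pdftotext', pdf_path, '-'])
            return hash_text(text)
        ```
        
        Two files whose extracted text is byte-for-byte identical get the same
        digest from ``hash_text``, regardless of their file names or locations.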
        
        A "database" is built up in ``~/.config/pdfdbl/pdf.lst``
        against which further scans are tested.
        
        Removing markings
        -----------------
        
        In ``ruamel/pdfdouble/pdfdouble.py`` there are two methods that can be
        enhanced to filter out markings in the PDF, i.e. marks that cause
        virtually identical files to have different hashes.
        
        For text the method ``PdfData.filter_for_marking`` should be extended to remove
        any markings from the string that is its argument and return the result.
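        
        A text filter of this kind could look roughly like the sketch below. It is
        written as a standalone function, since the real method's signature is not
        shown here, and the "Licensed to ..." line is a purely hypothetical marking
        chosen for the example.
        
        ```python
        import re
        
        # Hypothetical marking: a whole line such as "Licensed to Jane Doe"
        MARKING_RE = re.compile(r'^Licensed to .*$\n?', re.MULTILINE)
        
        
        def filter_for_marking(text):
            """Return text with every marking line removed."""
            return MARKING_RE.sub('', text)
        
        
        copy_a = "Title\nLicensed to Jane Doe\nBody\n"
        copy_b = "Title\nLicensed to John Roe\nBody\n"
        # After filtering, both copies reduce to the same string
        same = filter_for_marking(copy_a) == filter_for_marking(copy_b)
        ```
        
        Once the per-copy marking is gone, hashing the filtered text gives both
        copies the same value.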
        
        For scanned images the method ``PdfData.process_image_and_update`` needs to be
        enhanced, e.g. by cutting off the image's bottom and top X lines, and
        by removing any gray background text by setting all black pixels to white.
        This function needs to update the hash passed in using the ``.update()`` method
        passing in the filtered data.
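        
        The cropping idea can be sketched as follows. This is an assumption-laden
        example, not the package's method: ``crop_and_update`` and its ``margin``
        parameter are hypothetical, and the image is modelled simply as a list of
        byte strings, one per pixel line. Background removal is not covered.
        
        ```python
        import hashlib
        
        
        def crop_and_update(rows, hash_obj, margin=2):
            """Feed all but the top and bottom `margin` lines into hash_obj."""
            kept = rows[margin:len(rows) - margin]
            for row in kept:
                hash_obj.update(row)
            return hash_obj
        
        
        body = [b'\x00\xff\x00', b'\xff\xff\x00', b'\x00\x00\xff']
        # Two scans that differ only in their top and bottom two lines
        scan_a = [b'AAA', b'AAA'] + body + [b'AAA', b'AAA']
        scan_b = [b'BBB', b'BBB'] + body + [b'BBB', b'BBB']
        digest_a = crop_and_update(scan_a, hashlib.sha256()).hexdigest()
        digest_b = crop_and_update(scan_b, hashlib.sha256()).hexdigest()
        ```
        
        Here ``digest_a`` and ``digest_b`` are equal, because only the unmarked
        middle lines reach ``.update()``.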
        
        Restrictions
        ------------
        
        The current "database" cannot handle paths that contain newlines.
        
        
        This utility is currently Python 2.6/2.7 only.
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2 :: Only
Classifier: Topic :: Text Processing :: Indexing
