Metadata-Version: 2.1
Name: documentparser
Version: 1.0a1
Summary: A simple CLI tool that extracts all the text contained in a document.
Home-page: https://github.com/RobyFerro/DocumentParser
Author: Roberto Ferro
Author-email: roberto.ferro@ikdev.eu
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 2.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown

# Document Parser
>A simple CLI tool that extracts all the text contained in a document.

## Installation
Run the following commands before installing DocumentParser.

#### Debian/Ubuntu 
* sudo apt-get update
* sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
* sudo apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
* pip install docparser
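
Before running pip, it can help to confirm that the external programs are actually on your PATH. The following is a minimal sketch, not part of DocumentParser: `missing_tools` is a hypothetical helper, and the executable names are assumptions about what the packages above install (for example, `pdftotext` from poppler-utils and `tesseract` from tesseract-ocr).

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed executable names for the packages installed above.
if missing_tools antiword unrtf pdftotext pstotext tesseract sox ffmpeg swig; then
    echo "All external tools were found."
else
    echo "The tools listed above are missing; install them first."
fi
```

Adjust the list of names to the file formats you actually need to parse.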

#### macOS
* brew install pkg-config poppler
* brew install --cask xquartz
* brew install poppler antiword unrtf tesseract swig

#### Fedora / CentOS
Before you start, be aware that there is no quick way to install DocParser on a Fedora-based system, because some dependencies are missing.
This is the hardest route, but in the end you'll be proud of yourself XD.

* yum -y update
* yum install python-pip

>Required by the .docx parser which uses lxml via python-docx.

* yum install libxml2 libxslt-devel libxml2-devel

>Required by the .docx parser which uses lxml via python-docx.

* yum install libxslt

>Required by the .doc and .ps parsers.

* wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
* rpm -Uvh cert-forensics-tools-release*rpm
* yum --enablerepo=forensics install antiword
* yum --enablerepo=forensics install pstotext

>Required by the .pdf parser

* yum install poppler-utils

>Required by the .jpg, .png, and .gif parsers

* cd /opt

* yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel

>Install AutoConf-Archive

* wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm
* rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm

>Install Leptonica from Source

* wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
* tar -zxvf leptonica-1.75.3.tar.gz
* cd leptonica-1.75.3
* ./autobuild
* ./configure
* make
* make install
* cd ..
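
Once Leptonica is installed, you may want to confirm that pkg-config can see it, since the Tesseract configure step relies on `PKG_CONFIG_PATH=/usr/local/lib/pkgconfig`. This is only a sketch: `pkg_found` is a hypothetical helper, and it assumes Leptonica registered itself under the pkg-config module name `lept`.

```shell
#!/bin/sh
# Succeed when pkg-config finds module $1 while also searching directory $2.
pkg_found() {
    PKG_CONFIG_PATH="$2" pkg-config --exists "$1"
}

# Assumption: Leptonica installs its .pc file as "lept" under /usr/local.
if pkg_found lept /usr/local/lib/pkgconfig; then
    echo "Leptonica is visible to pkg-config."
else
    echo "Leptonica was not found; check the install prefix."
fi
```

If the check fails, look for `lept.pc` under the prefix you installed to and adjust the directory accordingly.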

>Install Tesseract from Source

* wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
* tar -zxvf 3.05.01.tar.gz
* cd tesseract-3.05.01
* ./autogen.sh
* PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
* LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
* make install
* ldconfig
* cd ..

>Download and install tesseract language files

* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
* mv *.traineddata /usr/local/share/tessdata
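
The downloads above all follow the same URL pattern, so they can be expressed as a loop. This is a rough sketch rather than an official script: `tessdata_url` and `fetch_tessdata` are hypothetical helpers, and the base URL is simply copied from the wget commands above.

```shell
#!/bin/sh
# Base URL taken from the wget commands above.
TESSDATA_BASE="https://github.com/tesseract-ocr/tessdata/raw/3.04.00"

# Print the download URL for one language code, e.g. "eng".
tessdata_url() {
    echo "$TESSDATA_BASE/$1.traineddata"
}

# Download every listed language into the directory given as $1.
# Stops at the first failed download.
fetch_tessdata() {
    dest="$1"
    shift
    for lang in "$@"; do
        wget -P "$dest" "$(tessdata_url "$lang")" || return 1
    done
}

# Dry run: only list what would be downloaded.
for lang in ben eng hin tha osd; do
    tessdata_url "$lang"
done
```

To perform the actual download, call `fetch_tessdata /usr/local/share/tessdata ben eng hin tha osd`, which saves the files directly into the tessdata directory and makes the separate `mv` step unnecessary.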

>Download Hindi Cube data

* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq
* wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn
* mv hin.* /usr/local/share/tessdata

* ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest
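
At this point you can check that Tesseract sees the language files. The sketch below is an assumption-laden example, not part of the official steps: `langs_present` is a hypothetical helper that reads the output of `tesseract --list-langs` and looks for each language code on a line of its own.

```shell
#!/bin/sh
# Succeed when every language code given as an argument appears
# as a whole line in the text read from stdin.
langs_present() {
    listing=$(cat)
    for lang in "$@"; do
        printf '%s\n' "$listing" | grep -qx "$lang" || return 1
    done
}

if command -v tesseract >/dev/null 2>&1; then
    # Some Tesseract versions print the list on stderr, so merge both streams.
    if tesseract --list-langs 2>&1 | langs_present ben eng hin tha osd; then
        echo "All language files are installed."
    else
        echo "Some language files are missing from the tessdata directory."
    fi
else
    echo "tesseract was not found on PATH."
fi
```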

>Required by the .mp3 and .ogg parsers

* yum install sox
* rm cert-forensics-tools-release-el7.rpm

>Install textract without unsupported features

* git clone https://github.com/deanmalmgren/textract.git
* rm textract/requirements/python && cp requirements/textract/python textract/requirements/python
* cd textract && chmod +x setup.py
* python setup.py install

* yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config



