Metadata-Version: 2.1
Name: pdfsyntax
Version: 0.0.7
Summary: A Python PDF parsing library and tool built on top to browse the internal structure of a PDF file
Author-email: "Martin D." <desgeeko@gmail.com>
License: MIT License
        
        Copyright (c) 2022-2023 Martin D. <desgeeko@gmail.com>
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/desgeeko/pdfsyntax
Project-URL: Bug Tracker, https://github.com/desgeeko/pdfsyntax/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

PDFSyntax
=========

*A Python PDF parsing library and tool built on top to browse the internal structure of a PDF file*

## Introduction

The project is focused on chapter 7 ("Syntax") of the Portable Document Format (PDF) Specification.

PDFSyntax is lightweight (no dependencies) and written from scratch in pure Python. 

1. CLI: It started as a command-line interface to inspect the internal structure of a PDF file.
2. API: Now the internal functions are being exposed as a toolkit for PDF read/write operations.

## Project status

WORK IN PROGRESS! This is ALPHA quality software. The API may change anytime.

## Design

PDFSyntax favors non-destructive edits allowed by the PDF Specification: by default incremental updates are added at the end of the original file.

It is mostly made of simple functions working on built-in types and named tuples. Shallow copying of the Doc object structure performed by pure functions offers some kind of - *experimental* - immutability.

## Installation

You can install from PyPI:

    pip install pdfsyntax

## CLI overview

Please refer to the [CLI README](https://github.com/desgeeko/pdfsyntax/blob/main/docs/cli.md) for details.

The general form of the CLI usage is:

    python3 -m pdfsyntax COMMAND FILE

You can get quick insights on a PDF file with these commands:
- `overview` outputs text data about the structure and the metadata. 
- `inspect` outputs static html data that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.

## API overview

Please refer to the [API README](https://github.com/desgeeko/pdfsyntax/blob/main/docs/api.md) for details.

PDFSyntax is mostly made of simple functions. Example:

```Python
>>> from pdfsyntax import read, metadata
>>> doc = read("samples/simple_text_string.pdf")
>>> metadata(doc) #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...
```

The Doc object is probably the only dedicated class you will need to handle. It is a black box that stores all the internal states of a document:
- content that is cached/memoized from an original file,
- modifications that add/modifiy/delete content and that are tracked as incremental updates.

```Python
>>> doc
<PDF Doc with 1 revisions(s), ready to start update/revision 2, cache loaded with 0 / 7 objects>
```

This object exposes as a method the same metadata function, therefore you can get the same result with:

```Python
>>> doc.metadata() #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...
```

Low-level functions like `get_object` or `update_object` allow you to directly access and manipulate the inner objects of the document structure.
You may also use higher-level functions like `rotate`:

```Python
>>> from pdfsyntax import rotate, write
>>> doc180 = rotate(doc, 180) #rotate pages by 180°
```

The orignal object is unchanged and a new object is created with an incremental update (revision 2) that encloses the ongoing orientation modification:

```Python
>>> doc180
<PDF Doc with 2 revisions(s), current update/revision containing 1 modifications, cache loaded with 3 / 7 objects>
```

You then can write the modified PDF to disk. Note that the resulting file contains a new section appended to the original content. You may cut this section to revert the change.

```Python
>>> write(doc180, "rotated_doc.pdf")
```


## Open-Source, not Open-Contribution yet

PDFSyntax is [MIT licensed](https://github.com/desgeeko/pdfsyntax/blob/main/LICENCE) but is currently closed to contributions.
> Personal note: this is a pet projet of mine and my time is limited. First I need to focus on my roadmap (new features and refactoring) and then I will happily accept contributions when everything is a little more stabilised. 


