Metadata-Version: 2.1
Name: podcast-transcript-convert
Version: 0.1.0
Summary: Convert podcast transcripts from HTML, SRT, WebVTT, Podlove, etc. into PodcastIndex JSON.
Author-email: Harold Martin <Harold.Martin@gmail.com>
Project-URL: Homepage, https://github.com/hbmartin/podcast-transcript-convert
Keywords: convert,podcast,podcastindex,transcripts,srt,vtt,webvtt,podlove,pci
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: beautifulsoup4
Requires-Dist: loguru
Requires-Dist: lxml
Requires-Dist: webvtt-py
Provides-Extra: lint
Requires-Dist: pyroma; extra == "lint"
Requires-Dist: pytype; extra == "lint"
Requires-Dist: ruff; extra == "lint"
Requires-Dist: pytest; extra == "lint"

# podcast-transcript-convert

[![PyPI](https://img.shields.io/pypi/v/podcast-transcript-convert.svg)](https://pypi.org/project/podcast-transcript-convert/)
[![Lint and Test](https://github.com/hbmartin/podcast-transcript-convert/actions/workflows/lint.yml/badge.svg)](https://github.com/hbmartin/podcast-transcript-convert/actions/workflows/lint.yml)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Code style: black](https://img.shields.io/badge/🐧️-black-000000.svg)](https://github.com/psf/black)
[![Checked with pytype](https://img.shields.io/badge/🦆-pytype-437f30.svg)](https://google.github.io/pytype/)
[![twitter](https://img.shields.io/badge/@hmartin-00aced.svg?logo=twitter&logoColor=black)](https://twitter.com/hmartin)

Convert podcast transcripts from HTML, SRT, WebVTT, Podlove, etc. into [PodcastIndex JSON](https://github.com/Podcastindex-org/podcast-namespace/blob/main/transcripts/transcripts.md).


## Installation

We recommend installing and running the CLI tool with [pipx](https://pipx.pypa.io/stable/). To use the package as a library, install it with `pip` instead. The example below installs pipx with Homebrew on macOS; see the pipx documentation for other platforms.

```bash
brew install pipx
pipx install podcast-transcript-convert
```

## Usage

Run the `transcript2json` command with the directory containing your transcripts and the directory that should receive the converted files.

```bash
transcript2json transcripts/ converted/
```
You can then inspect the output JSON files in the `converted/` directory.
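
As a quick way to inspect the results, the sketch below summarizes each converted file. It assumes the output files end in `.json` and follow the PodcastIndex layout (a top-level `segments` list whose entries carry `startTime`, `endTime`, and `body`). The helper names `summarize_transcript` and `summarize_directory` are hypothetical and not part of this package.

```python
import json
from pathlib import Path


def summarize_transcript(path: Path) -> dict:
    """Return the segment count and last end time for one converted file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: PodcastIndex-style JSON with a top-level "segments" list.
    segments = data.get("segments", [])
    end_times = [seg["endTime"] for seg in segments if "endTime" in seg]
    return {
        "file": path.name,
        "segments": len(segments),
        "last_end_time": max(end_times, default=0.0),
    }


def summarize_directory(directory: str) -> list[dict]:
    """Summarize every .json file in a directory, sorted by file name."""
    return [summarize_transcript(p) for p in sorted(Path(directory).glob("*.json"))]
```

Calling `summarize_directory("converted/")` after a conversion run would return one summary dictionary per output file.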

## Library Usage
```python
from podcast_transcript_convert.convert import bulk_convert

bulk_convert("transcripts_dir/", "converted_dir/")
```

Individual file type converters are in the `converters` package. You can use them directly if you know the file type.
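
To illustrate what a single-format converter does, here is a minimal standard-library sketch that turns SRT text into a PodcastIndex-style dictionary. It is not the implementation used by this package's `converters`; the function name `srt_to_podcastindex` is hypothetical, and the output shape (`version` plus `segments` with `startTime`, `endTime`, `body`) is assumed from the PodcastIndex transcript specification.

```python
import re

# SRT timing lines look like "00:01:02,345 --> 00:01:04,000".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert hour, minute, second, and millisecond fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def srt_to_podcastindex(srt_text: str) -> dict:
    """Hypothetical sketch: convert SRT text to a PodcastIndex-style dict."""
    segments = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match is None:
                continue
            fields = match.groups()
            # Everything after the timing line is the cue text.
            body = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
            segments.append(
                {
                    "startTime": _seconds(*fields[:4]),
                    "endTime": _seconds(*fields[4:]),
                    "body": body,
                }
            )
            break
    return {"version": "1.0.0", "segments": segments}
```

The sketch ignores speaker labels and inline formatting tags, which a real converter would likely need to handle.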

You can use `file_typing.identify_file_type(file)` to determine the file type of a transcript file.
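
For intuition, content-based type detection can be sketched with a few heuristics. The function below is a hypothetical illustration, not the logic behind `identify_file_type`; its name, its return values, and its rules are all assumptions, and it does not distinguish HTML from other markup such as XML.

```python
import re

# Matches an SRT cue timing line such as "00:00:01,000 --> 00:00:02,500".
_SRT_TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def guess_transcript_type(text: str) -> str:
    """Hypothetical helper: guess a transcript format from its content."""
    # Skip a byte-order mark and leading whitespace before sniffing.
    head = text.lstrip("\ufeff \t\r\n")
    if head.startswith("WEBVTT"):
        return "vtt"  # WebVTT files begin with the "WEBVTT" signature.
    if head.startswith("{"):
        return "json"
    if head.startswith("<"):
        return "html"  # Assumption: any leading tag is treated as HTML.
    if _SRT_TIMING.search(head):
        return "srt"
    return "unknown"
```

A caller could read a file's text, pass it to `guess_transcript_type`, and pick a converter based on the returned label.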


## Development

Pull requests are very welcome! For major changes, please open an issue first to discuss what you would like to change.

```bash
git clone git@github.com:hbmartin/podcast-transcript-convert.git
cd podcast-transcript-convert
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Replace with the actual path to your transcript files
python -m podcast_transcript_convert ~/Downloads/overcast-to-sqlite/archive/transcripts converted/
```

### Code Formatting

This project is linted with [ruff](https://docs.astral.sh/ruff/) and formatted with [Black](https://github.com/psf/black).


## Authors
- [Harold Martin](https://www.linkedin.com/in/harold-martin-98526971/) - harold.martin at gmail
