Metadata-Version: 2.1
Name: wpextract
Version: 1.0.0rc1
Summary: Create datasets from WordPress sites
Home-page: https://gatenlp.github.io/wordpress-site-extractor/
License: Apache-2.0
Author: Freddy Heppell
Author-email: frheppell1@sheffield.ac.uk
Requires-Python: >=3.9.0,<3.13
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: beautifulsoup4 (>=4.12.0)
Requires-Dist: click (>=8.0.1)
Requires-Dist: click-option-group (>=0.5.3)
Requires-Dist: langcodes (>=3.3.0)
Requires-Dist: lxml (>=5.0.0)
Requires-Dist: numpy (>=1.23.0)
Requires-Dist: pandas (>=1.5.2)
Requires-Dist: requests (>=2.32.3)
Requires-Dist: tqdm (>=4.65.0)
Project-URL: Documentation, https://gatenlp.github.io/wordpress-site-extractor/
Project-URL: Repository, https://github.com/GateNLP/wordpress-site-extractor
Description-Content-Type: text/markdown

# WPextract - WordPress Site Extractor

<a href="https://pypi.org/project/wpextract/"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/wpextract"></a>
<a href="https://anaconda.org/conda-forge/wpextract"><img alt="Conda Version" src="https://img.shields.io/conda/vn/conda-forge/wpextract"></a>

**WPextract is a tool to create datasets from WordPress sites.**

- Archives posts, pages, tags, categories, media (including files), comments, and users
- Uses the WordPress API to guarantee 100% accurate and complete content
- Resolves internal links and media to IDs
- Automatically parses multilingual sites to create parallel datasets
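
As a rough illustration of the API-based approach, the sketch below builds paginated request URLs for the content types listed above. It assumes "the WordPress API" means the core WordPress REST API (routes under `wp-json/wp/v2/`, with the `per_page` and `page` query parameters). The `collection_url` helper and `ENDPOINTS` tuple are hypothetical names for this example and are not part of WPextract.

```python
from urllib.parse import urlencode, urljoin

# Core WordPress REST API collection routes (assumed to match the list above).
ENDPOINTS = ("posts", "pages", "tags", "categories", "media", "comments", "users")


def collection_url(site: str, endpoint: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of a WordPress REST API collection.

    Hypothetical helper for illustration; not WPextract's own code.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    if not 1 <= per_page <= 100:
        # The WordPress REST API caps per_page at 100.
        raise ValueError("per_page must be between 1 and 100")
    # Normalise the trailing slash so sites hosted in a subdirectory still work.
    base = urljoin(site.rstrip("/") + "/", f"wp-json/wp/v2/{endpoint}")
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base}?{query}"
```

For example, `collection_url("https://example.org", "posts")` gives `https://example.org/wp-json/wp/v2/posts?per_page=100&page=1`.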


## Quickstart

See the [complete documentation](https://gatenlp.github.io/wpextract/) for more detailed usage.

1. Install with `pipx`
    ```shell-session
    $ pipx install wpextract
    ```
2. Download site data
    ```shell-session
    $ wpextract download "https://example.org" out_dl
    ```
3. Process into a dataset
    ```shell-session
    $ wpextract extract out_dl out_data
    ```
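
The download and extract steps can also be run from a Python script. This minimal sketch only wraps the two commands shown above with `subprocess`, and assumes `wpextract` is installed and on your `PATH`. The function names are hypothetical and it does not use any WPextract Python API.

```python
import subprocess


def download_command(site_url: str, download_dir: str) -> list[str]:
    """Arguments for step 2: download the site data."""
    return ["wpextract", "download", site_url, download_dir]


def extract_command(download_dir: str, dataset_dir: str) -> list[str]:
    """Arguments for step 3: process the download into a dataset."""
    return ["wpextract", "extract", download_dir, dataset_dir]


def build_dataset(site_url: str, download_dir: str = "out_dl", dataset_dir: str = "out_data") -> None:
    """Run both steps in order, stopping if either command fails."""
    commands = (
        download_command(site_url, download_dir),
        extract_command(download_dir, dataset_dir),
    )
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
```

Calling `build_dataset("https://example.org")` runs the same two commands as steps 2 and 3.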

## About WPextract

WPextract was built by [Freddy Heppell](https://freddyheppell.com) of the [GATE Project](https://gate.ac.uk) at the [School of Computer Science, University of Sheffield](https://sheffield.ac.uk/cs). It was originally created to scrape mis/disinformation websites for research.

## License

Available under the Apache 2.0 license. See [LICENSE](LICENSE) for more information.

## Citing

> [!NOTE]
> This software was developed for our EMNLP 2023 paper [_Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study_](https://aclanthology.org/2023.emnlp-main.349/). The code has been updated since the paper was written; for archival purposes, the precise version used for the study is [available on Zenodo](https://zenodo.org/records/10008086).

We'd love to hear about your use of our tool; you can [email us](mailto:frheppell1@sheffield.ac.uk) to let us know! Feel free to create issues and/or pull requests for new features or bugs.

If you use this tool in published work, please cite [our EMNLP paper](https://aclanthology.org/2023.emnlp-main.349/):

> Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.

