Metadata-Version: 2.1
Name: datagov-harvesting-logic
Version: 0.3.7
Summary: 
Home-page: https://github.com/GSA/datagov-harvesting-logic
License: LICENSE.md
Author: Datagov Team
Author-email: datagov@gsa.gov
Maintainer: Datagov Team
Maintainer-email: datagov@gsa.gov
Requires-Python: >=3.10
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: beautifulsoup4 (>=4.12.2,<5.0.0)
Requires-Dist: boto3 (>=1.34.29,<2.0.0)
Requires-Dist: ckanapi (>=4.7)
Requires-Dist: cloudfoundry-client (>=1.36.0,<2.0.0)
Requires-Dist: deepdiff (>=6)
Requires-Dist: flask (>=3.0.2,<4.0.0)
Requires-Dist: flask-bootstrap (>=3.3.7.1,<4.0.0.0)
Requires-Dist: flask-migrate (>=4.0.7,<5.0.0)
Requires-Dist: flask-sqlalchemy (>=3.1.1,<4.0.0)
Requires-Dist: flask-wtf (>=1.2.1,<2.0.0)
Requires-Dist: jsonschema (>=4)
Requires-Dist: psycopg2-binary (>=2.9.9,<3.0.0)
Requires-Dist: pytest (>=7.3.2)
Requires-Dist: python-dotenv (>=1)
Requires-Dist: sansjson (>=0.3.0,<0.4.0)
Requires-Dist: sqlalchemy (>=2.0.25,<3.0.0)
Project-URL: Repository, https://github.com/GSA/datagov-harvesting-logic
Description-Content-Type: text/markdown

# datagov-harvesting-logic

This library handles metadata extraction, validation, transformation, and
loading into the data.gov catalog.

## Features

- Extract
  - General purpose fetching and downloading of web resources.
  - Catered extraction to the following data formats:
    - DCAT-US
- Validation
  - DCAT-US
    - `jsonschema` validation using draft 2020-12.
- Load
  - DCAT-US
    - Conversion of a DCAT-US catalog into the CKAN dataset schema
    - Create, delete, update, and patch of a CKAN package/dataset
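
To make the conversion step concrete, here is a minimal sketch of mapping one DCAT-US dataset onto a CKAN package dict. It is not this library's implementation: the function name `dcatus_to_ckan`, the choice of fields, the slug rule, and the sample title and description are all assumptions made for illustration.

```python
import re


def dcatus_to_ckan(dataset: dict) -> dict:
    """Map a few DCAT-US fields onto a CKAN package dict (illustrative subset)."""
    # CKAN package names must be lowercase and URL-safe, so slugify the title.
    slug = re.sub(r"[^a-z0-9]+", "-", dataset["title"].lower()).strip("-")
    return {
        "name": slug,
        "title": dataset["title"],
        "notes": dataset.get("description", ""),
        # CKAN represents tags as a list of {"name": ...} dicts.
        "tags": [{"name": kw} for kw in dataset.get("keyword", [])],
        # Assumption: fields without a direct CKAN equivalent go into extras.
        "extras": [
            {"key": "identifier", "value": dataset["identifier"]},
            {"key": "modified", "value": dataset.get("modified", "")},
        ],
    }


# Sample record; the title and description are made up for this example.
example = {
    "title": "Cotton On Call",
    "identifier": "cftc-dc1",
    "description": "Example description.",
    "keyword": ["cotton on call", "cotton on-call"],
    "modified": "R/P1M",
}
package = dcatus_to_ckan(example)
print(package["name"])
```

A dict shaped like `package` is the kind of payload CKAN's package create and update actions accept; the real mapping in this project likely covers many more fields.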

## Requirements

This project uses `poetry` for dependency management. Installation instructions are [here](https://python-poetry.org/docs/#installation).

Once installed, `poetry install` installs dependencies into a local virtual environment.

## Testing

### CKAN load testing

- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- [catalog-dev](https://catalog-dev.data.gov/) is used for CKAN load testing.
- Create an API key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the API key.
- Run tests with the command `poetry run pytest ./tests/load/ckan`
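
As a rough sketch, `credentials.py` might look like the following. Only the variable name `ckan_catalog_dev_api_key` comes from the steps above; reading the value from an environment variable, the name `CKAN_CATALOG_DEV_API_KEY`, and the placeholder fallback are assumptions. A plain string assignment works just as well.

```python
# credentials.py (project root)
import os

# The tests expect this exact variable name. The environment variable name
# and the placeholder fallback are illustrative, not part of the project.
ckan_catalog_dev_api_key = os.environ.get(
    "CKAN_CATALOG_DEV_API_KEY", "put api key here"
)
```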

### Harvester testing

- These tests are found in `extract` and `validate`. Some of them rely on services in the `docker-compose.yml`. Start the services with `docker compose up -d`, then run the tests with `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for `CKAN load testing` and `Harvester testing` you can simply run `poetry run pytest` to run all tests.

### Integration testing

- To run integration tests locally, add the following environment variables, with appropriate values, to your `.env` file:
  - CF_SERVICE_USER = "put username here"
  - CF_SERVICE_AUTH = "put password here"
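
Before running the integration tests, it can help to confirm both variables are actually set. The helper below is a hypothetical sketch (the name `missing_env_vars` is not part of this project). It inspects the process environment only, so it assumes the `.env` file has already been loaded, for example with `load_dotenv()` from `python-dotenv`.

```python
import os

REQUIRED_VARS = ("CF_SERVICE_USER", "CF_SERVICE_AUTH")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing integration-test settings:", ", ".join(missing))
```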

## Comparison

- `./tests/harvest_sources/ckan_datasets_resp.json`
  - Represents what CKAN would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
  - Represents a changed harvest source
  - Created:
    - datasets[0]

        ```diff
        + "identifier": "cftc-dc10"
        ```

  - Deleted:
    - datasets[0]

        ```diff
        - "identifier": "cftc-dc1"
        ```

  - Updated:
    - datasets[1]

        ```diff
        - "modified": "R/P1M"
        + "modified": "R/P1M Update"
        ```

    - datasets[2]

        ```diff
        - "keyword": ["cotton on call", "cotton on-call"]
        + "keyword": ["cotton on call", "cotton on-call", "update keyword"]
        ```

    - datasets[3]

        ```diff
        "publisher": {
          "name": "U.S. Commodity Futures Trading Commission",
          "subOrganizationOf": {
        -   "name": "U.S. Government"
        +   "name": "Changed Value"
          }
        }
        ```

- `./tests/harvest_sources/dcatus.json`
  - Represents the original harvest source before any changes occurred.
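
The create/delete/update classification described above can be sketched with the standard library alone. This is not the project's comparison code: the function names, the hashing approach, and the sample records (including the identifiers `cftc-dc2` and `cftc-dc3`) are assumptions for illustration.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record's canonical JSON form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare(source: dict, catalog: dict) -> dict:
    """Classify identifiers as create, delete, or update.

    Both arguments map identifier -> hash: `source` for the harvest source,
    `catalog` for what is already stored.
    """
    source_ids, catalog_ids = set(source), set(catalog)
    return {
        "create": sorted(source_ids - catalog_ids),
        "delete": sorted(catalog_ids - source_ids),
        "update": sorted(
            i for i in source_ids & catalog_ids if source[i] != catalog[i]
        ),
    }


# Made-up records loosely modeled on the fixtures described above.
original = [
    {"identifier": "cftc-dc1", "modified": "R/P1M"},
    {"identifier": "cftc-dc2", "modified": "R/P1M"},
    {"identifier": "cftc-dc3", "modified": "R/P1M"},
]
changed = [
    {"identifier": "cftc-dc10", "modified": "R/P1M"},
    {"identifier": "cftc-dc2", "modified": "R/P1M Update"},
    {"identifier": "cftc-dc3", "modified": "R/P1M"},
]

catalog_hashes = {r["identifier"]: record_hash(r) for r in original}
source_hashes = {r["identifier"]: record_hash(r) for r in changed}
result = compare(source_hashes, catalog_hashes)
print(result)
```

Comparing hashes rather than full records keeps the check cheap, at the cost of not saying which field changed; a structural diff would be needed for that.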


## Flask App

### Local development 

1. Set your local configuration in the `.env` file.

2. Use the Makefile to set up local Docker containers, including a PostgreSQL database and the Flask application:

   ```bash
   make build 
   make up
   make test
   make clean
   ```

   This will start the necessary services and execute the tests.

3. When there are database DDL changes, use the following steps to generate migration scripts and update the database:

    ```bash
    docker compose up -d db
    docker compose run app flask db migrate -m "migration description"
    docker compose run app flask db upgrade
    ```

### Deployment to cloud.gov

#### Database Service Setup

A database service is required for use on cloud.gov.

In a given Cloud Foundry `space`, a db can be created with 
`cf create-service <service offering> <plan> <service instance>`. 

In dev, for example, the db was created with 
`cf create-service aws-rds micro-psql harvesting-logic-db`. 

Creating databases for the other spaces should follow the same pattern, though the size may need to be adjusted (see available AWS RDS service offerings with `cf marketplace -e aws-rds`).

Any created service needs to be bound to an app with `cf bind-service <app> <service>`. With the above example, the db can be bound with 
`cf bind-service harvesting-logic harvesting-logic-db`.
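
Once a service is bound, Cloud Foundry exposes its credentials to the app through the `VCAP_SERVICES` environment variable. The sketch below shows one way an app could look up the database connection string. The function name `database_uri` is hypothetical, the sample payload is fabricated, and the presence of a `uri` field in the credentials is an assumption about the service broker.

```python
import json
import os


def database_uri(instance_name, vcap_json=None):
    """Return the `uri` credential of a bound service instance, or None."""
    raw = vcap_json if vcap_json is not None else os.environ.get("VCAP_SERVICES", "{}")
    services = json.loads(raw)
    # VCAP_SERVICES maps each service offering label to a list of bound instances.
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == instance_name:
                return instance.get("credentials", {}).get("uri")
    return None


# Fabricated payload for illustration; real credentials come from the platform.
sample = json.dumps(
    {
        "aws-rds": [
            {
                "name": "harvesting-logic-db",
                "credentials": {"uri": "postgres://user:pass@host:5432/dbname"},
            }
        ]
    }
)
print(database_uri("harvesting-logic-db", sample))
```

If the returned value starts with `postgres://`, it may need rewriting to `postgresql://` before being handed to SQLAlchemy, since recent SQLAlchemy versions no longer accept the shorter scheme name.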

Accessing the service can be done with service keys. They can be created with `cf create-service-key <service instance> <service-key-name>`, listed with `cf service-keys <service instance>`, and shown with `cf service-key <service instance> <service-key-name>`.

#### Manually Deploying the Flask Application to development

1. Ensure you have a `manifest.yml` and `vars.development.yml` file configured for your Flask application. The vars file may include variables: 

    ```yaml
    app_name: harvesting-logic
    database_name: harvesting-logic-db
    route-external: harvester-dev-datagov.app.cloud.gov
    ```

2. Deploy the application using Cloud Foundry's `cf push` command with the variable file:

   ```bash
   poetry export -f requirements.txt --output requirements.txt --without-hashes
   cf push --vars-file vars.development.yml
   ```

3. When there are database DDL changes, use the following command to update the database:

    ```bash
    cf run-task harvesting-logic --command "flask db upgrade" --name database-upgrade
    ```
