Metadata-Version: 2.1
Name: modis-crawler-utils
Version: 0.2.81
Summary: Scrapy utils for Modis crawlers projects.
Home-page: https://gitlab.at.ispras.ru/crawlers/crawler-utils
License: BSD
Author: Varlamov
Author-email: varlamov@ispras.ru
Requires-Python: >=3.9,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Scrapy
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: Pillow (>=7.1.2)
Requires-Dist: certifi
Requires-Dist: cryptography (==36.0.2)
Requires-Dist: dateparser (>=1.0.0)
Requires-Dist: elasticsearch (==7.8.1)
Requires-Dist: ephemeral-port-reserve (>=1.1.1)
Requires-Dist: itemadapter (>=0.2.0)
Requires-Dist: kafka-python (>=2.0.1)
Requires-Dist: pymongo (>=3.10.1)
Requires-Dist: pyopenssl (==22.0.0)
Requires-Dist: python-logstash (>=0.4.6)
Requires-Dist: requests (>=2.23.0)
Requires-Dist: scrapy (>=2.6.0)
Requires-Dist: scrapy-puppeteer-client (>=0.0.6)
Requires-Dist: scrapy-splash (>=0.8.0)
Requires-Dist: twisted
Project-URL: Repository, https://gitlab.at.ispras.ru/crawlers/crawler-utils
Description-Content-Type: text/markdown

# crawler-utils

Scrapy utils for Modis crawlers projects.

## MongoDB

Utilities for working with MongoDB.

MongoDBPipeline - pipeline for saving items to MongoDB.

Params:
* MONGODB_SERVER - address of the MongoDB server.
* MONGODB_PORT - port of the MongoDB server.
* MONGODB_DB - database in which to save data.
* MONGODB_USERNAME - username for authentication against the MONGODB_DB database.
* MONGODB_PWD - password for authentication.
* DEFAULT_MONGODB_COLLECTION - default collection in which to save data (default value is `test`).
* MONGODB_COLLECTION_KEY - item key that holds the name of the collection (`MONGO_COLLECTION`) in which to save the item (default value is `collection`).
* MONGODB_UNIQUE_KEY - item key that uniquely identifies the item.
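
As a rough illustration of the parameters above, a Scrapy `settings.py` fragment might look like the sketch below. All values are placeholders, and `resolve_collection` is a hypothetical helper that only mirrors the documented collection-selection rule; it is not part of this package. The pipeline itself still has to be enabled in `ITEM_PIPELINES` under its import path, which this README does not state.

```python
# Hypothetical settings.py fragment; every value here is a placeholder.
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "crawler"
MONGODB_USERNAME = "crawler_user"
MONGODB_PWD = "change-me"
DEFAULT_MONGODB_COLLECTION = "test"    # documented default
MONGODB_COLLECTION_KEY = "collection"  # documented default


def resolve_collection(item: dict) -> str:
    """Illustrative only: choose the target collection for an item.

    Assumes the pipeline reads the collection name from the item under
    MONGODB_COLLECTION_KEY and falls back to DEFAULT_MONGODB_COLLECTION.
    """
    return item.get(MONGODB_COLLECTION_KEY, DEFAULT_MONGODB_COLLECTION)
```

Under these assumptions, an item such as `{"collection": "posts", "title": "..."}` would be saved to `posts`, while an item without a `collection` key would go to `test`.
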

## Kafka

Utilities for working with Kafka.

KafkaPipeline - pipeline for pushing items into Kafka.

The pipeline writes data to a stream named `{RESOURCE_TAG}.{DATA_TYPE}`,
where `RESOURCE_TAG` is the tag of the resource the data was crawled from and `DATA_TYPE` is the type of
crawled data: `data`, `post`, `comment`, `like`, `user`, `friend`, `share`, `member`, `news`, or
`community`.
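
The naming scheme can be sketched as follows. `stream_name` is a hypothetical helper, and rejecting unknown data types is a choice made for this illustration; this README does not say whether the pipeline validates them.

```python
# Data types listed in this README.
DATA_TYPES = {
    "data", "post", "comment", "like", "user",
    "friend", "share", "member", "news", "community",
}


def stream_name(resource_tag: str, data_type: str) -> str:
    """Build a stream name of the form {RESOURCE_TAG}.{DATA_TYPE}."""
    if data_type not in DATA_TYPES:
        # Assumption for this sketch only: unknown types are treated as errors.
        raise ValueError(f"unknown data type: {data_type!r}")
    return f"{resource_tag}.{data_type}"


print(stream_name("example_site", "post"))  # example_site.post
```
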

Params:
* KAFKA_ADDRESS - address of the Kafka broker.
* KAFKA_KEY - item key whose value is used as the Kafka record key.
* KAFKA_RESOURCE_TAG_KEY - item key that holds the item's `RESOURCE_TAG` (default value is `platform`).
* KAFKA_DEFAULT_RESOURCE_TAG - default `RESOURCE_TAG` for crawled items without `KAFKA_RESOURCE_TAG_KEY` (default value is `crawler`).
* KAFKA_DATA_TYPE_KEY - item key that holds the item's `DATA_TYPE` (default value is `type`).
* KAFKA_DEFAULT_DATA_TYPE - default `DATA_TYPE` for crawled items without `KAFKA_DATA_TYPE_KEY` (default value is `data`).
* KAFKA_COMPRESSION_TYPE - type of data compression used in Kafka, for example `gzip`.
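
A possible `settings.py` fragment for these parameters is sketched below. The broker address and the `KAFKA_KEY` value are placeholders, and `resolve_tag_and_type` is a hypothetical helper that only mirrors the documented fallback to default values; the pipeline's actual code may differ.

```python
# Hypothetical settings.py fragment; broker address and record key are placeholders.
KAFKA_ADDRESS = "localhost:9092"
KAFKA_KEY = "id"                        # assumed item field, for illustration
KAFKA_RESOURCE_TAG_KEY = "platform"     # documented default
KAFKA_DEFAULT_RESOURCE_TAG = "crawler"  # documented default
KAFKA_DATA_TYPE_KEY = "type"            # documented default
KAFKA_DEFAULT_DATA_TYPE = "data"        # documented default
KAFKA_COMPRESSION_TYPE = "gzip"


def resolve_tag_and_type(item: dict) -> tuple[str, str]:
    """Illustrative only: read RESOURCE_TAG and DATA_TYPE from an item.

    Falls back to the configured defaults when the item lacks the keys.
    """
    tag = item.get(KAFKA_RESOURCE_TAG_KEY, KAFKA_DEFAULT_RESOURCE_TAG)
    data_type = item.get(KAFKA_DATA_TYPE_KEY, KAFKA_DEFAULT_DATA_TYPE)
    return tag, data_type
```

With these settings, an item carrying neither `platform` nor `type` would resolve to `("crawler", "data")`, which corresponds to the stream `crawler.data`.
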


