Metadata-Version: 2.1
Name: datasetscraper
Version: 0.0.4
Summary: Tool to create image datasets for machine learning problemsby scraping search engines like Google, Bing and Baidu. 
Home-page: https://github.com/TimeTraveller-San/datasetscraper
Author: Time Traveller
Author-email: time.traveller.san@gmail.com
License: GPLv2
Keywords: dataset_scraper,machine learning,dataset,images,scrape,yandex,google,bing,baidu
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3
Description-Content-Type: text/markdown
Requires-Dist: pyppeteer (>=0.0.25)
Requires-Dist: fastprogress (>=0.1.21)
Requires-Dist: requests (>=2.19.1)

# DatasetScraper
Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.

# Features:
- **Search engine support**: Google, Bing, Baidu. (in-production): Yahoo, Yandex, Duckduckgo
- **Image format support**: jpg, png, svg, gif, jpeg
- Fast multiprocessing enabled scraper
- Very fast multithreaded downloader
- Data verification after download for assertion of image files

# Installation
- COMING SOON on pypi

# Usage:
- Import
`from datasetscraper import Scraper`

- Defaults
```python
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')
```

- Specify a search engine
```python
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')
```

- Specify a list of search engines
```python
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')
```

- Specify max images (default was 200)
```python
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')
```

# FAQs
- Why aren't yandex, yahoo, duckduckgo and other search engines supported?
They are hard to scrape, I am working on them and will update as soon as I can.

- I set maxlist=[500] why are only (x<500) images downloaded?
There can be several reasons for this:
    - Search ran out: This happens very often, google/bing might not have enough images for your query
    - Slow internet: Increase the timeout (default is 60 seconds) as follows: ```obj.download(urls, directory='kiniro_mosaic/', timeout=100)```

- How to debug?
You can change the logging level while making the scraper object : `obj = Scraper(logger.INFO)`

# TODO:
- More search engines
- Better debug
- Write documentation
- Text data? Audio data?


