Metadata-Version: 2.1
Name: htmlparsingbs4based
Version: 1.1.0
Summary: This package extracts/parses information from source HTML.
Home-page: https://finquest.com/
Author: Yaxiong Yuan
Author-email: yaxiong.yuan@finquest.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Requires-Dist: beautifulsoup4 (==4.11.1)
Requires-Dist: json-lines (==0.5.0)
Requires-Dist: pandas (==1.5.0)
Requires-Dist: scrapy (==1.8.0)
Requires-Dist: scrapy-fake-useragent (==1.2.0)
Requires-Dist: tqdm (>=4.50.2)
Requires-Dist: retry (==0.9.2)
Requires-Dist: cryptography (>=3.2)
Requires-Dist: spacy (==3.2)
Requires-Dist: boto3 (==1.21.35)
Requires-Dist: cleanco (==2.2)
Requires-Dist: bson (==0.5.10)
Requires-Dist: pymongo (==4.3.2)
Requires-Dist: elasticsearch (==8.4.3)
Requires-Dist: openpyxl (==3.0.10)
Requires-Dist: pika (==1.3.1)
Requires-Dist: jsonlines (==3.1.0)

# HTML Parser

Extracts and parses information from source HTML.

# build and upload a PyPI package

* python3 setup.py sdist bdist_wheel
* twine upload dist/*

# create CLI from dist (if you have a dist file)

* python3 -m pip install /home/yaxiong/html_parsing/dist/htmlparsingbs4based-1.1.0.tar.gz

# install package and CLI

* pip install htmlparsingbs4based
* OR python3 -m pip install htmlparsingbs4based

# run from script

* from htmlparsingbs4based.html_parsing.html_parser_custombs4_script import parse_single_page
* parse_single_page(input_url='https://bryansfuel.on.ca/about/', path_to_crawled_files='/home/yaxiong/data_crawled_websites/crawled_websites_first_batch', min_length=1, prefix='')
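
To run the same call over several pages, a small wrapper along these lines may help. It is only a sketch: `parse_many` is a hypothetical helper, not part of this package, and it makes no assumption about what `parse_single_page` returns. The parser is passed in as an argument, and the keyword names are the ones used in the call above.

```python
from typing import Any, Callable, Dict, Iterable


def parse_many(
    parse_fn: Callable[..., Any],
    urls: Iterable[str],
    path_to_crawled_files: str,
    min_length: int = 1,
    prefix: str = "",
) -> Dict[str, Any]:
    """Hypothetical helper: call parse_fn once per URL and collect the results.

    parse_fn is expected to accept the same keyword arguments as the
    parse_single_page call shown in this README.
    """
    results: Dict[str, Any] = {}
    for url in urls:
        try:
            results[url] = parse_fn(
                input_url=url,
                path_to_crawled_files=path_to_crawled_files,
                min_length=min_length,
                prefix=prefix,
            )
        except Exception as exc:
            # Store the exception instead of aborting, so one bad page
            # does not stop the remaining URLs from being processed.
            results[url] = exc
    return results


# Possible usage, assuming the package is installed:
# from htmlparsingbs4based.html_parsing.html_parser_custombs4_script import parse_single_page
# results = parse_many(
#     parse_single_page,
#     ["https://bryansfuel.on.ca/about/", "https://www.conpak.com/About-Conpak/"],
#     "/home/yaxiong/data_crawled_websites/crawled_websites_first_batch",
# )
```

Entries whose value is an exception mark pages that failed and can be retried separately.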

# run CLI (examples)

* mode_1: elasticsearch
* PARSE -gpf elasticsearch -i 'http://www.mineracamargo.com/MCA_Investors.html' -esusr readwrite -espw ''

* mode_2: local
* PARSE -gpf local -i 'https://bryansfuel.on.ca/about/' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch
* PARSE -gpf local -i 'http://www.mineracamargo.com/MCA_Investors.html' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch
* PARSE -gpf local -i 'https://www.conpak.com/About-Conpak/' -fo /home/yaxiong/data_crawled_websites/crawled_websites_first_batch

* mode_3: html
* PARSE -gpf html -fi /home/yaxiong/html_parsing/html_example/parsed_html.json
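
The three modes above can also be assembled programmatically, for example when calling the CLI from another script. The sketch below is an illustration only: `build_parse_command` is a hypothetical helper, and the meaning given to each flag (`-i` as input URL, `-fo` as folder of crawled files, `-fi` as input file, `-esusr`/`-espw` as Elasticsearch credentials) is inferred from the examples in this README rather than from the CLI's own help text.

```python
import shlex
from typing import List, Optional

# Modes as they appear in the examples of this README.
KNOWN_MODES = ("elasticsearch", "local", "html")


def build_parse_command(
    mode: str,
    input_url: Optional[str] = None,
    folder: Optional[str] = None,
    input_file: Optional[str] = None,
    es_user: Optional[str] = None,
    es_password: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build the argument list for a PARSE call."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["PARSE", "-gpf", mode]
    # Flag meanings below are assumptions based on the README examples.
    if input_url is not None:
        cmd += ["-i", input_url]
    if folder is not None:
        cmd += ["-fo", folder]
    if input_file is not None:
        cmd += ["-fi", input_file]
    if es_user is not None:
        cmd += ["-esusr", es_user]
    if es_password is not None:
        cmd += ["-espw", es_password]
    return cmd


cmd = build_parse_command(
    "local",
    input_url="https://bryansfuel.on.ca/about/",
    folder="/home/yaxiong/data_crawled_websites/crawled_websites_first_batch",
)
# shlex.join (Python 3.8+) renders the list as a shell-safe command line.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package and its `PARSE` entry point are installed.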
