Metadata-Version: 1.0
Name: webhist
Version: 1.0.0
Summary: Saved webpage index and search
Home-page: https://github.com/projreality/webhist
Author: Samuel Li
Author-email: sli@projreality.com
License: https://www.gnu.org/licenses/lgpl.html
Description: WebHist
        =======
        
        .. contents:: Table of Contents
          :local:
        
        WebHist indexes a collection of saved webpages and provides an interface to search the index.
        
        WebHist can handle the following archive file types:
        
        - MAFF files generated by Mozilla Archive Format, with MHT and Faithful Save
        - HTML files generated by Save Page WE
        
        Installation
        ------------
        
        The package is available on `PyPI <https://pypi.org/project/webhist>`_.
        
        You can install it with pip::
        
          $ pip install webhist
        
        Usage
        -----
        
        Create an index of archived webpages:
        
        .. code:: python
        
          import webhist
          i = webhist.Index("/path/to/index")
        
        Index a single file:
        
        .. code:: python
        
          i.add("/path/to/file")
        
        A file that is already in the index is not re-indexed unless explicitly requested. Files are tracked by the path string passed to :literal:`add()`, so an absolute path and a relative path to the same file are treated as two different files.
        
        To re-index a file that is already in the index, pass :literal:`update=True`:
        
        .. code:: python
        
          i.add("/path/to/file", update=True)
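        
        Because files are tracked by the exact path string, it can help to normalize every path before passing it to :literal:`add()`. The sketch below is one way to do that using only the standard library; :literal:`canonical_path` is a hypothetical helper and not part of WebHist.
        
        .. code:: python
        
          import os
          
          def canonical_path(path):
              """Return one stable, absolute spelling for a file path."""
              # expanduser handles "~"; abspath resolves the path against the
              # current working directory and collapses ".." components.
              return os.path.abspath(os.path.expanduser(path))
          
          # A relative and an absolute spelling of the same file map to one string.
          relative = os.path.join("archive", "page.maff")
          absolute = os.path.join(os.getcwd(), "archive", "page.maff")
          assert canonical_path(relative) == canonical_path(absolute)
          
          # With an index object ``i`` created as shown earlier:
          # i.add(canonical_path("archive/page.maff"))
        
        Note that this does not resolve symbolic links, so two links to the same file would still be tracked separately.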
        
        Add all files in a directory (subdirectories are not searched):
        
        .. code:: python
        
          i.add_path("/path/to/directory")
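        
        Since :literal:`add_path()` does not descend into subdirectories, a whole directory tree can be indexed by calling it once per directory. The following is a minimal sketch using :literal:`os.walk`; :literal:`add_tree` is a hypothetical helper, and it assumes only that the index object has the :literal:`add_path()` method shown above.
        
        .. code:: python
        
          import os
          
          def add_tree(index, root, **kwargs):
              """Call index.add_path() on root and on every directory below it."""
              visited = []
              for dirpath, dirnames, _filenames in os.walk(root):
                  dirnames.sort()  # make the traversal order deterministic
                  index.add_path(dirpath, **kwargs)
                  visited.append(dirpath)
              return visited
          
          # With an index object ``i`` created as shown earlier:
          # add_tree(i, "/path/to/directory")
        
        If the index itself is stored inside the directory being walked, you will probably want to skip that subdirectory.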
        
        Again, you can specify :literal:`update=True` to re-index files. You can also specify :literal:`verbose=True` to print whether or not each file was indexed:
        
        .. code:: python
        
          i.add_path("/path/to/directory", verbose=True)
        
        The output will look something like::
        
          file1
          - file2 (already in index)
          - file3 (exception type: error message)
        
        In the example output above:
        
        - file1 was indexed correctly
        - file2 was already in the index, and was not re-indexed
        - file3 had a problem and was not indexed (the Python exception type and message are shown)
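        
        If you capture this output, the three cases can be told apart mechanically. The sketch below assumes the lines look exactly like the example above, which is only an approximation of the real output; :literal:`classify_verbose_line` is a hypothetical helper and not part of WebHist.
        
        .. code:: python
        
          ALREADY = " (already in index)"
          
          def classify_verbose_line(line):
              """Return (status, filename, reason) for one line of verbose output."""
              text = line.strip()
              if not text.startswith("- "):
                  # No leading dash: the file was indexed.
                  return ("indexed", text, None)
              body = text[2:]
              if body.endswith(ALREADY):
                  return ("skipped", body[:-len(ALREADY)], None)
              # Otherwise assume "name (exception type: error message)".
              name, sep, reason = body.partition(" (")
              if sep and reason.endswith(")"):
                  reason = reason[:-1]
              return ("failed", name, reason if sep else None)
        
        File names that themselves contain :literal:`" ("` would confuse this simple split, so treat it as a starting point only.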
        
        After adding files, commit the changes to the index:
        
        .. code:: python
        
          i.commit()
        
        You can also cancel the pending changes instead of committing them:
        
        .. code:: python
        
          i.cancel()
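        
        The commit-or-cancel choice fits naturally into a context manager: commit if the block finishes normally, cancel if it raises. This is a sketch, not something WebHist provides; :literal:`transaction` is a hypothetical helper that assumes only the :literal:`commit()` and :literal:`cancel()` methods shown above.
        
        .. code:: python
        
          from contextlib import contextmanager
          
          @contextmanager
          def transaction(index):
              """Commit pending changes on success, cancel them on any exception."""
              try:
                  yield index
              except BaseException:
                  index.cancel()
                  raise  # re-raise so the caller still sees the error
              else:
                  index.commit()
          
          # With an index object ``i`` created as shown earlier:
          # with transaction(i) as idx:
          #     idx.add("/path/to/file")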
        
        Once an index has been populated, you can run search queries against it. The syntax follows the Whoosh default query language, which is described in the `Whoosh query language documentation <https://whoosh.readthedocs.io/en/latest/querylang.html>`_.
        
        The code below searches for webpage archives that contain both "webhist" and "installation":
        
        .. code:: python
        
          results = i.search("webhist installation")
        
        The field searched by default is the :literal:`content` field. The following fields are indexed and searchable:
        
        - title (title of page)
        - content (content of page)
        - url (full URL of page)
        - fqdn (fully qualified domain name, e.g. packaging.python.org)
        - dn (domain name, e.g. python.org)
        - date (the date the webpage archive was saved)
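        
        To illustrate the difference between the :literal:`fqdn` and :literal:`dn` fields, the sketch below derives both from a URL with the standard library. How WebHist computes them internally is not described here; both function names are hypothetical, and keeping the last two labels is a naive rule that gives the wrong answer for suffixes such as :literal:`co.uk`.
        
        .. code:: python
        
          from urllib.parse import urlsplit
          
          def fqdn_of(url):
              """Return the host part of a URL, e.g. packaging.python.org."""
              return urlsplit(url).hostname or ""
          
          def naive_dn_of(fqdn):
              """Keep only the last two labels, e.g. python.org (naive rule)."""
              return ".".join(fqdn.split(".")[-2:])
          
          host = fqdn_of("https://packaging.python.org/en/latest/")
          assert host == "packaging.python.org"
          assert naive_dn_of(host) == "python.org"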
        
        For example, you can search for webpages saved from example.com whose title contains "webhist":
        
        .. code:: python
        
          results = i.search("title:webhist dn:example.com")
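        
        When the search terms come from variables, the query string can be assembled from field/value pairs. The helper below is a hypothetical sketch, not part of WebHist; it relies only on the :literal:`field:value` form shown above and on the Whoosh convention of double quotes for phrases, and it does not escape quotes inside values.
        
        .. code:: python
        
          def build_query(terms=(), **fields):
              """Join plain terms and field:value pairs into one query string."""
              parts = list(terms)
              for field, value in fields.items():
                  if " " in value:
                      # Quote multi-word values so they are searched as a phrase.
                      value = '"%s"' % value
                  parts.append("%s:%s" % (field, value))
              return " ".join(parts)
          
          query = build_query(title="webhist", dn="example.com")
          assert query == "title:webhist dn:example.com"
          
          # With an index object ``i`` created as shown earlier:
          # results = i.search(query)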
        
        Shell Interface
        ---------------
        
        A simple shell interface to a WebHist index is provided in :literal:`examples/shell.py`. You can clone the webhist repo and run it from the repo root::
        
          $ python examples/shell.py /path/to/archive -i /path/to/index
        
        The :literal:`-i` parameter is optional. The default index location is :literal:`/path/to/archive/index`.
        
        Run a search query::
        
          webhist> search title:webhist dn:example.com
        
        The output will look something like::
        
          0: [2010-01-02 12:30:01] Title of page
          1: [2011-02-03 16:20:25] Another page
          2: [2013-06-12 00:00:01] Yet another page
        
        To open page #2 from the search results::
        
          webhist> open 2
        
        To get more help::
        
          webhist> help
        
        To exit the shell::
        
          webhist> exit
        
        License
        -------
        
        WebHist is released under the GNU Lesser General Public License, Version 3.
        
Platform: UNKNOWN
