Scheduling

Meta::

    Author: Christian Theune <ct@gocept.com>
    Valid for: CMF Link checker
    CVS: 
    Status: work in progress

Describes a scheduling mechanism to avoid long locks.

Reason

    On a large site with thousands of references to external resources,
    checking every link can take a long time.

Approach

    Therefore a threaded approach is used: each request to check a link is
    placed in a queue and run asynchronously.
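
    The queue-and-worker idea can be sketched as follows. This is only an
    illustration, not the link checker's actual code: the names check_queue,
    schedule, worker and fake_check are hypothetical, and fake_check stands
    in for the real network check.

```python
import itertools
import queue
import threading

# All names here are hypothetical and exist only for illustration.
check_queue = queue.PriorityQueue()
_sequence = itertools.count()   # tie-breaker so equal priorities stay FIFO
results = {}


def fake_check(url):
    """Stand-in for the real network check; always reports "ok"."""
    return "ok"


def schedule(url, priority=10):
    """Queue a link check; lower numbers are handled first."""
    check_queue.put((priority, next(_sequence), url))


def worker():
    """Process queued checks until the None sentinel arrives."""
    while True:
        _priority, _seq, url = check_queue.get()
        try:
            if url is None:
                return
            results[url] = fake_check(url)
        finally:
            check_queue.task_done()


def start_worker():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def stop_worker(thread):
    """Send the sentinel with the lowest urgency and wait for the worker."""
    check_queue.put((float("inf"), next(_sequence), None))
    thread.join()
```

    A caller would invoke schedule(url) and return immediately; the worker
    thread fills in results without blocking the request.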

    If a check is requested and the cache holds availability information that
    is not older than MAX_CACHE_AGE, that information is returned. If it is
    also not older than GRACEFUL_UPDATE_AGE, no check is scheduled; if it is
    older than GRACEFUL_UPDATE_AGE (but still within MAX_CACHE_AGE), a new
    check is scheduled.

    If the cached response is older than MAX_CACHE_AGE, the string "unknown"
    is returned and a high-priority check is scheduled.
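
    The cache rules above can be sketched as a single lookup function. The
    concrete values of MAX_CACHE_AGE and GRACEFUL_UPDATE_AGE, the priority
    names "high" and "normal", and the schedule callback are assumptions
    made for this example; treating a URL with no cache entry like an
    expired one is also an assumption, since the text does not cover it.

```python
from datetime import datetime, timedelta

# Assumed values; the design note does not specify them.
MAX_CACHE_AGE = timedelta(days=7)
GRACEFUL_UPDATE_AGE = timedelta(days=1)


def lookup(cache, url, now, schedule):
    """Return the cached state for url and schedule a recheck if needed.

    cache maps URLs to dicts with "LastCheck" and "State" keys.
    schedule(url, priority) is a hypothetical callback that queues a check.
    """
    entry = cache.get(url)
    if entry is None:
        # Assumption: a missing entry is handled like an expired one.
        schedule(url, "high")
        return "unknown"

    age = now - entry["LastCheck"]
    if age > MAX_CACHE_AGE:
        # Too old to trust: report "unknown" and recheck urgently.
        schedule(url, "high")
        return "unknown"
    if age > GRACEFUL_UPDATE_AGE:
        # Still usable, but refresh it asynchronously.
        schedule(url, "normal")
    return entry["State"]
```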

    If the "lock" parameter is given, the request for a link check will wait
    until a valid result has been retrieved.
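
    One possible way to implement the blocking behaviour is an event that
    the worker sets once the check finishes. The PendingCheck class and the
    timeout argument are hypothetical and not part of the described design.

```python
import threading


class PendingCheck:
    """Hypothetical holder for the result of one scheduled check."""

    def __init__(self):
        self._done = threading.Event()
        self.state = "unknown"

    def finish(self, state):
        """Called by the worker thread when the check has completed."""
        self.state = state
        self._done.set()

    def result(self, lock=False, timeout=None):
        """Return the state; with lock=True, wait for the check first."""
        if lock:
            self._done.wait(timeout)
        return self.state
```

    Without "lock" the caller gets "unknown" straight away; with it, the
    caller blocks until finish() has been called or the timeout expires.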

Avoiding bandwidth wastage

    Retrieving the whole HTTP response for a URL can be quite expensive. If
    the server supports the HEAD request method, it should be used. Otherwise
    a partial HTTP response is accepted: the transfer should be stopped once
    all relevant headers have been received. This also reduces the risk of
    denial-of-service (DoS) casualties.
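
    A minimal sketch of this strategy using Python's standard http.client
    module is shown here. The function name fetch_headers is hypothetical,
    and treating the status codes 405 and 501 as "HEAD not supported" is an
    assumption. getresponse() returns once the status line and headers have
    been parsed, so closing the connection afterwards drops the body unread.

```python
import http.client
from urllib.parse import urlsplit


def fetch_headers(url, timeout=10):
    """Return (status, headers) without downloading the response body."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_class = http.client.HTTPSConnection
    else:
        conn_class = http.client.HTTPConnection
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # Try HEAD first; fall back to GET if the server rejects HEAD.
    for method in ("HEAD", "GET"):
        conn = conn_class(parts.netloc, timeout=timeout)
        try:
            conn.request(method, path)
            response = conn.getresponse()   # headers only, body unread
            if method == "HEAD" and response.status in (405, 501):
                continue
            return response.status, dict(response.getheaders())
        finally:
            # Closing here ends the transfer before the body is read.
            conn.close()
```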

Information stored

    The cache stores the following information for each URL:

        URL -- The full URL that is checked; also used as the key in the
            cache.
        LastCheck -- A DateTime object representing the point in time when
            the last check was performed.
        State -- The result of the last check: one of "ok", "timeout" or
            "network", or an integer representing an HTTP response error
            code.
        Time -- How long the last check took.
        Headers -- The HTTP response headers.
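
    As an illustration, one cache record could be modelled as shown here.
    The class name CacheEntry, the lower-case attribute names, the use of
    the standard datetime type in place of Zope's DateTime, and measuring
    Time in seconds are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Union


@dataclass
class CacheEntry:
    """Hypothetical record mirroring the fields listed above."""
    url: str                  # URL: also the cache key
    last_check: datetime      # LastCheck: when the last check ran
    state: Union[str, int]    # State: "ok", "timeout", "network" or a code
    time: float               # Time: duration of the check (assumed seconds)
    headers: Dict[str, str] = field(default_factory=dict)  # Headers

    def is_ok(self):
        """True only if the last check succeeded."""
        return self.state == "ok"


def store(cache, entry):
    """Insert or replace an entry, keyed by its URL."""
    cache[entry.url] = entry
```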
        
