Metadata-Version: 1.1
Name: trec-dd
Version: 0.2.2.dev20
Summary: TREC Dynamic Domain (DD) evaluation test harness for simulating user interaction with a search engine
Home-page: http://trec-dd.org/
Author: Diffeo, Inc.
Author-email: support@diffeo.com
License: MIT/X11 license http://opensource.org/licenses/MIT
Description: # trec-dd-simulation-harness
        
        This is the official "jig" for simulating a user interacting with a
        TREC DD system during an interactive query session.
        
        # Simulation Harness
        
        If you wish to evaluate a TREC DD system, you must run it against the
        TREC DD simulation harness. A system interacting with the simulation
        harness will produce a "runfile" that summarizes the simulation
        session.  For each of the system's responses, the "runfile" encodes
        information such as (1) "was the system's response on topic?", (2)
        "which subtopics were contained within the system's response?", and
        (3) "how relevant was the system's response?". Please see the
        "runfile" specification for more information.
        
        A TREC DD system can interact with the simulation harness in one of
        two ways. If the system is written in Python, it can use the
        HarnessAmbassador interface found under `trec_dd/system`. Otherwise,
        the TREC DD system must invoke harness commands via the command
        line. Please see `trec_dd/harness` for more information.
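
        As a rough illustration, a Python system might drive the harness
        through HarnessAmbassador along the lines below. This is a sketch
        under assumptions: the import path and the `start`/`step`/`stop`
        method names are hypothetical placeholders, so consult the code under
        `trec_dd/system` for the actual interface.

            # Sketch only: the import path and the start/step/stop method
            # names are assumptions, not the actual HarnessAmbassador API.
            from trec_dd.system import HarnessAmbassador  # hypothetical import path

            ambassador = HarnessAmbassador('path/to/truth_data.kvl')
            ambassador.start('topic_id_1')                   # hypothetical: open a session
            feedback = ambassador.step(['doc-1', 'doc-2'])   # hypothetical: submit a batch
            # 'feedback' would carry the harness's judgments on each result.
            ambassador.stop()                                # hypothetical: close the session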
        
        Once you have a "runfile", you may then score your run. Please
        see the section "Gathering Scores" for more information.
        
        # Example TREC DD Systems
        
        The directory `trec_dd/system` holds example TREC DD systems that
        demonstrate interaction with the simulation harness. Right now, the
        only example system is `random_system.py`.
        
        # Executing the Random System
        
        ## Requirements
        
        To run the random system, you must have truth data stored in a
        file-backed kvlayer store. The truth data must be stored as
        dossier.label Label objects, using a LabelStore. If all you have is a
        runfile generated by human assessors, the utility script
        `trec_dd/harness/generate_labels_from_runfile.py` should help you
        convert your truth data into the required format.
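
        For orientation, the snippet below sketches storing one truth-data
        label with kvlayer and dossier.label. Treat it as a sketch under
        assumptions: the kvlayer configuration keys for a file-backed store
        and the exact Label constructor arguments should be checked against
        those libraries' documentation.

            # Sketch: writing one truth-data label into a file-backed kvlayer
            # store via dossier.label. The config keys and Label arguments are
            # assumptions; check the kvlayer and dossier.label docs.
            import kvlayer
            from dossier.label import Label, LabelStore, CorefValue

            config = {
                'storage_type': 'filestorage',         # assumed key for a file-backed store
                'filename': 'path/to/truth_data.kvl',  # assumed key
                'app_name': 'trec_dd',
                'namespace': 'truth_data',
            }
            store = LabelStore(kvlayer.client(config))

            # A positive label relating a topic to a document judged relevant.
            store.put(Label('topic_id_1', 'doc-1', 'assessor-1', CorefValue.Positive))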
        
        You also need a "topic sequence" file that describes *how* you want to
        evaluate your system. The "topic sequence" file specifies which topics
        to explore and how many batches to request for each topic. The file
        should be YAML, and simply encodes a mapping from topic_id to an
        integer representing how many batches to execute for that
        topic_id. You can find an example "topic sequence" file at
        `trec_dd/system/example_topic_seq.py`.
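
        For instance, a topic sequence requesting five batches for one topic
        and three for another might look like this (the topic ids here are
        made up for illustration):

            # topic_id: number of batches to execute for that topic
            DD15-1: 5
            DD15-2: 3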
        
        ## Running the System
        
        You can run the random system in the simulation harness by
        calling
        
            python random_system.py path/to/topic_sequence.yaml path/to/truth_data.kvl path/to/runfile_out.runfile
        
        After this command executes, you should find the resulting system
        runfile at the path you specified in the command. The runfile
        summarizes the responses the random system gave to the harness, as
        well as the harness's assessments of those responses. This runfile
        captures everything one needs to know in order to score a system.
        
        ## Scoring the System
        
        To score your runfile, you may use the `trec_dd/scorer/run.py` script.
        
            python trec_dd/scorer/run.py path/to/runfile path/to/truthdata.kvl --scorer scorer1 scorer2 scorer3 ...
        
        Please see the section titled "Gathering Scores" for more information on the scoring
        subsystem.
        
        # Gathering Scores
        
        ## Requirements
        
        You must have a runfile generated for your system if you wish to score
        it. You must also have access to the truth data used by the harness
        when generating the runfile.
        
        ## Running the Scorer
        
        The top-level scoring script `trec_dd/scorer/run.py` is used to
        generate scores. To run it:
        
            python run.py path/to/runfile path/to/truthdata.kvl --scorer scorer1 scorer2 ...
        
        This will go through your runfile and use each of the specified
        scorers to evaluate your system's run. The scorers named after the
        `--scorer` option must be the names of scorers known to the system,
        which are exactly the following:
        
         * `reciprocal_rank_at_recall`
         * `precision_at_recall`
         * `modified_precision_at_recall`
         * `average_err_arithmetic`
         * `average_err_harmonic`
         * `average_err_arithmetic_binary`
         * `average_err_harmonic_binary`
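
        For example, to score a run with precision at recall and both graded
        ERR averages:

            python trec_dd/scorer/run.py path/to/runfile path/to/truthdata.kvl --scorer precision_at_recall average_err_arithmetic average_err_harmonic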
        
        # Description of Scorers
        
         * `reciprocal_rank_at_recall` calculates the reciprocal of the rank by
         which every subtopic for a topic is accounted for.

         * `precision_at_recall` calculates the precision of all results up to
         the point where every subtopic for a topic is accounted for.

         * `average_err_arithmetic` calculates the expected reciprocal rank
         for each subtopic, and then averages the scores across subtopics
         using an arithmetic mean. It uses graded relevance for computing
         stopping probabilities (a short sketch of the ERR computation
         follows this list).

         * `average_err_arithmetic_binary` calculates the expected reciprocal
         rank for each subtopic, and then averages the scores across
         subtopics using an arithmetic mean. It uses binary relevance for
         computing stopping probabilities. Hence, this scorer ignores the
         'rating' field in the runfile.

         * `average_err_harmonic` calculates the expected reciprocal rank for
         each subtopic, and then averages the scores across subtopics using
         a harmonic mean. It uses graded relevance for computing stopping
         probabilities.

         * `average_err_harmonic_binary` calculates the expected reciprocal
         rank for each subtopic, and then averages the scores across
         subtopics using a harmonic mean. It uses binary relevance for
         computing stopping probabilities. Hence, this scorer ignores the
         'rating' field in the runfile.
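
        To make the ERR computation concrete, here is a minimal sketch for a
        single subtopic. It assumes the conventional grade-to-stopping-
        probability mapping (2^g - 1) / 2^g_max from the ERR literature; the
        harness's exact mapping may differ.

            # Minimal ERR sketch for one subtopic. The stopping-probability
            # mapping (2**g - 1) / 2**g_max is the conventional choice and an
            # assumption here; the harness's exact mapping may differ.
            def expected_reciprocal_rank(grades, g_max=4):
                '''grades: graded relevance of each ranked result for this subtopic.'''
                err = 0.0
                p_reach = 1.0  # probability the simulated user reaches this rank
                for rank, grade in enumerate(grades, start=1):
                    p_stop = (2 ** grade - 1) / float(2 ** g_max)
                    err += p_reach * p_stop / rank
                    p_reach *= 1.0 - p_stop
                return err

        The arithmetic scorers then combine the per-subtopic ERR values with
        an arithmetic mean, while the harmonic variants use a harmonic mean;
        the binary variants first collapse grades to relevant/not-relevant.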
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Utilities
Classifier: License :: OSI Approved :: MIT License
