Metadata-Version: 2.0
Name: pyConTextNLP
Version: 0.6.0.5
Summary: A Python implementation of the ConText algorithm
Home-page: https://github.com/chapmanbe/pyConTextNLP
Author: Brian Chapman
Author-email: brian.chapman@utah.edu
License: http://www.apache.org/licenses/LICENSE-2.0
Keywords: ConText NLP
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Requires-Dist: networkx
Requires-Dist: nose
Requires-Dist: textblob
Provides-Extra: dev
Requires-Dist: check-manifest; extra == 'dev'
Provides-Extra: test
Requires-Dist: coverage; extra == 'test'

pyConTextNLP
============

pyConTextNLP is a Python implementation/extension/modification of the
ConText algorithm described in `CITE <>`__ which is itself a
generalization of the NegEx algorithm described in `CITE <>`__.

The package is maintained by Brian Chapman at the University of Utah.
Other active and past developers include:

-  Wendy W. Chapman
-  Glenn Dayton

Introduction
------------

pyConTextNLP is a partial implementation of the ConText algorithm using
Python. The original description of pyConTextNLP was provided in Chapman
BE, Lee S, Kang HP, Chapman WW, "Document-level classification of CT
pulmonary angiography reports based on an extension of the ConText
algorithm." `J Biomed Inform. 2011
Oct;44(5):728-37 <http://www.sciencedirect.com/science/article/pii/S1532046411000621>`__.

Other publications/presentations based on pyConText include:

-  Wilson RA, et al. "Automated ancillary cancer history classification
   for mesothelioma patients from free-text clinical reports." J Pathol
   Inform. 2010 Oct 11;1:24.
-  Chapman BE, Lee S, Kang HP, Chapman WW. Using ConText to Identify
   Candidate Pulmonary Embolism Subjects Based on Dictated Radiology
   Reports. (Presented at AMIA Clinical Research Informatics Summit
   2011)
-  Wilson RA, Chapman WW, DeFries SJ, Becich MJ, Chapman BE. Identifying
   History of Ancillary Cancers in Mesothelioma Patients from Free-Text
   Clinical Reports. (Presented at AMIA 2010).

Note: we changed the package name from pyConText to pyConTextNLP because
of a name conflict on PyPI.

Installation
------------

pyConTextNLP can be downloaded from the Downloads page here on the negex
Google Code project. Alternatively, it can be downloaded from the PyPI
repository at http://pypi.python.org/pypi/pyConTextNLP. Since
pyConTextNLP is registered with PyPI, it can be installed with
easy\_install or pip:

::

    easy_install pyConTextNLP
    pip install pyConTextNLP

The listed dependencies are NetworkX, nose, and textblob; easy\_install
or pip should install them for you if they are not already installed.
However, there is optional functionality that depends on pygraphviz,
which I have not yet worked into the setuptools script.

Code Structure
--------------

The original code used in the JBI paper is in the top-level pyConTextNLP
package. A simplification of this original algorithm that uses
`NetworkX <http://networkx.lanl.gov/>`__ is in the subpackage
pyConTextNLP.pyConTextGraph. pyConTextGraph is what we are currently
developing and is what is described here.

The package has three files:

-  *itemData.py*. This is where the essential domain knowledge is stored
   in 4-tuples as described in the paper. For a new application, this is
   where the user will encapsulate the domain knowledge for their
   application.
-  *pyConTextGraph.py*. This module defines the algorithm.
-  *pyConTextSql.py*.

How to Use
----------

I am working on improving the documentation and (hopefully) adding some
testing to the code.

Some preliminary comments:

-  pyConTextNLP marks up text one sentence at a time.
-  pyConTextNLP facilitates reasoning from multi-sentence documents, but
   the markup (e.g. negation) is limited to the scope of a single
   sentence.
-  pyConTextNLP assumes the sentence is a string, not a list of words.

The Skeleton of an Example
~~~~~~~~~~~~~~~~~~~~~~~~~~

To illustrate how to use pyConTextNLP, I've taken some code excerpts
from a simple application that was written to identify critical findings
in radiology reports.

The first step in building an application is to define *itemData*
objects for your problem. The package contains *itemData* objects
defined in pyConTextNLP.pyConTextGraph.itemData. Common negation terms,
conjunctions, pseudo-negations, etc. are defined in here. An itemData
instance consists of a 4-tuple. Here is an excerpt:

::


    probableNegations = itemData(
    ["can rule out","PROBABLE_NEGATED_EXISTENCE","","forward"],
    ["cannot be excluded","PROBABLE_NEGATED_EXISTENCE",r"""cannot\sbe\s((entirely|completely)\s)?(excluded|ruled out)""","backward"])

The four parts are:

1. The *literal*: "can rule out", "cannot be excluded".
2. The *category*: "PROBABLE\_NEGATED\_EXISTENCE".
3. An optional regular expression used to capture the literal in the
   text. If no regular expression is provided, a regular expression is
   generated directly from the literal.
4. An optional rule. If the itemData is being used as a modifier, the
   rule states in which direction the modifier operates in the sentence.
   Current valid values are: "forward", the item can modify objects
   following it in the sentence; "backward", the item can modify objects
   preceding it in the sentence; or "bidirectional", the item can modify
   objects preceding and following it in the sentence.
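
The four fields and the fallback regular expression can be sketched in a
few lines. This is a minimal illustration, not the package's own code:
the ``Item`` tuple, ``compile_item``, and ``may_modify`` are hypothetical
names, the word-boundary pattern built from the literal is an assumption
about how a literal might be turned into a regular expression, and
comparing start offsets is a simplified reading of the direction rule.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for one itemData entry: (literal, category, regex, rule).
Item = namedtuple("Item", ["literal", "category", "regex", "rule"])


def compile_item(item):
    """Compile the item's regex, falling back to the escaped literal (assumed behavior)."""
    pattern = item.regex or r"\b" + re.escape(item.literal) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


def may_modify(rule, modifier_start, target_start):
    """Return True if a modifier at modifier_start may act on a target at target_start."""
    if rule == "forward":
        return target_start > modifier_start
    if rule == "backward":
        return target_start < modifier_start
    return rule == "bidirectional"


item = Item("cannot be excluded", "PROBABLE_NEGATED_EXISTENCE",
            r"cannot\sbe\s((entirely|completely)\s)?(excluded|ruled out)", "backward")
sentence = "Pulmonary embolism cannot be entirely excluded."
match = compile_item(item).search(sentence)
target_start = sentence.lower().find("pulmonary embolism")
print(match.group(0), may_modify(item.rule, match.start(), target_start))
```

Here the modifier is found after the target, so a "backward" rule allows
it to apply, whereas a "forward" rule would not.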

For the criticalFinderGraph.py application, we defined *itemData* for
the critical findings we wanted to identify in the text, for example
pulmonary emboli and aortic dissections. These new *itemData* objects
were defined in a file named critfindingItemData.py:

::

    critItems = itemData(
    ['pulmonary embolism','PULMONARY_EMBOLISM',r'''pulmonary\s(artery )?(embol[a-z]+)''',''], 
    ['pe','PULMONARY_EMBOLISM',r'''\bpe\b''',''],
    ['embolism','PULMONARY_EMBOLISM',r'''\b(emboli|embolism|embolus)\b''',''],
    ['aortic dissection','AORTIC_DISSECTION','',''])

We also added negation terms that were not originally defined in
pyConTextNLP:

::

    definiteNegations.prepend([["nor","DEFINITE_NEGATED_EXISTENCE","","forward"],])

Once we have all our *itemData* defined, we're now ready to start
processing text.

In our application we need to import the relevant modules from
pyConTextNLP as well as our own *itemData* definitions:

::

    import pyConTextNLP.pyConTextGraph.pyConTextGraph as pyConText
    import pyConTextNLP.helpers as helpers
    from critfindingItemData import *

Assuming we have read in our documents to process and that the basic
document unit is a *report*, we can write a simple function to process
the report:

::

        def analyzeReport(report, targets, modifiers ):
            """given an individual radiology report, markup the report based on targets and modifiers"""
            # create the pyConText instance
            context = pyConText.pyConText()

            # split the report into individual sentences. Note this is a very simple sentence splitter. You probably
            # want to write your own or use a sentence splitter from nltk or the like.
            sentences = helpers.sentenceSplitter(report)

            # process each sentence in the report
            for s in sentences:
                context.setTxt(s) 
                context.markItems(modifiers, mode="modifier")
                context.markItems(targets, mode="target")

                # some itemData are subsets of larger itemData instances. At this point they will all have been
                # marked. Drop any marked targets and modifiers that are a proper subset of another marked
                # target or modifier
                context.pruneMarks()

                # drop any marks that have the CATEGORY "Exclusion"; these are phrases we want to ignore.
                context.dropMarks('Exclusion')

                # match modifiers to targets
                context.applyModifiers()

                # Drop any modifiers that didn't get hooked up with a target
                context.dropInactiveModifiers()

                # put the current markup into an "archive". The archive will later be used to reason across the entire report.


            return context
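
The pruning step described in the comments above can be pictured with
character spans. The sketch below is an illustration under assumptions,
not the package's ``pruneMarks()`` implementation: marks are reduced to
hypothetical ``(start, end, phrase)`` tuples, and a mark is dropped when
its span lies inside a different, larger span.

```python
def prune_spans(marks):
    """Drop marks whose (start, end) span is a proper subset of another mark's span."""
    kept = []
    for start, end, phrase in marks:
        contained = any(
            o_start <= start and end <= o_end and (o_start, o_end) != (start, end)
            for o_start, o_end, _ in marks
        )
        if not contained:
            kept.append((start, end, phrase))
    return kept


# "embolism" (10-18) sits inside "pulmonary embolism" (0-18), so only the longer mark survives.
marks = [(0, 18, "pulmonary embolism"), (10, 18, "embolism"), (25, 42, "aortic dissection")]
print(prune_spans(marks))
```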

The markup is stored as a directed graph, so to determine whether a
target is, for example, negated, you simply check whether an immediate
predecessor of the target node is a negation. This is all done with
NetworkX commands.
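
As a rough illustration of that check, the sketch below builds a toy
directed graph by hand and looks at a target's predecessors. It assumes
NetworkX is installed; the ``Tag`` tuple and its ``getCategory()`` method
are hypothetical stand-ins for the objects pyConTextNLP stores as nodes,
and the graph is constructed manually rather than by the package.

```python
from collections import namedtuple

import networkx as nx


class Tag(namedtuple("Tag", ["phrase", "category"])):
    """Hypothetical stand-in for a marked-up phrase stored as a graph node."""

    def getCategory(self):
        return self.category


negation = Tag("no evidence of", "DEFINITE_NEGATED_EXISTENCE")
finding = Tag("pulmonary embolism", "PULMONARY_EMBOLISM")

g = nx.DiGraph()
g.add_node(negation, category="modifier")
g.add_node(finding, category="target")
g.add_edge(negation, finding)  # edge points from the modifier to the target it modifies

# A target is negated if any immediate predecessor carries a negation category.
negated = any(
    p.getCategory().lower() == "definite_negated_existence"
    for p in g.predecessors(finding)
)
print(negated)
```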

To access the underlying graph from the context object, invoke the
getCurrentGraph() method:

::

    g = context.getCurrentGraph()

Here is some code to get a list of all the target nodes in the markup:

::

    targets = [n[0] for n in g.nodes(data = True) if n[1].get("category","") == 'target']

Here is a function to test whether a node is modified by any of the
categories in a list:

::


    def modifies(g, n, modifiers):
        """g: directed graph representing the ConText markup
            n: a node in g
            modifiers: a list of categories e.g. ["definite_negated_existence","probable_existence"]
            modifies() tests whether n is modified by an object whose category is in modifiers"""
        # materialize the predecessors so the emptiness check also works when
        # NetworkX returns an iterator rather than a list
        pred = list(g.predecessors(n))
        if not pred:
            return False
        # compare categories case-insensitively
        pcats = [p.getCategory().lower() for p in pred]
        return bool(set(pcats).intersection([m.lower() for m in modifiers]))


