Metadata-Version: 2.1
Name: buzz
Version: 2.0.5
Summary: Sophisticated corpus linguistics
Home-page: http://github.com/interrogator/buzz
Author: Daniel McDonald
Author-email: mcddjx@gmail.com
License: MIT
Description: [![Build Status](https://travis-ci.org/interrogator/buzz.svg?branch=master)](https://travis-ci.org/interrogator/buzz)
        [![codecov.io](https://codecov.io/gh/interrogator/buzz/branch/master/graph/badge.svg)](https://codecov.io/gh/interrogator/buzz)
        [![readthedocs](https://readthedocs.org/projects/buzz/badge/?version=latest)](https://buzz.readthedocs.io/en/latest/)
        [![PyPI version](https://badge.fury.io/py/buzz.svg)](https://badge.fury.io/py/buzz)
        [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
        
        # buzz: python corpus linguistics
        
        <!--- Don't edit the version line below manually. Let bump2version do it for you. -->
        > Version 2.0.5
        
        > *buzz* is a linguistics tool for parsing and then exploring plain or metadata-rich text. This README provides an overview of functionality. Visit the [full documentation](https://buzz.readthedocs.io/en/latest/) for a more complete user guide.
        
        ## Install
        
        ```bash
        pip install buzz
        # or
        git clone http://github.com/interrogator/buzz
        cd buzz
        python setup.py install
        ```
        
        ## Creating a corpus
        
        *buzz* models plain text or [CONLL-U formatted](https://universaldependencies.org/format.html) files. This guide assumes that you have plain-text data and want to process and analyse it.
        
        So, first, you need to make sure that your corpus is in a format and structure that *buzz* can work with. This simply means putting all your text files into a folder, optionally organised into subfolders (representing subcorpora).
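
        For example, a corpus might be organised like this (a hypothetical layout, matching the example file path used below):

        ```
        sopranos/
        ├── s1/
        │   ├── e01.txt
        │   └── e02.txt
        └── s2/
            └── e01.txt
        ```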
        
        Text files should be plain text, with a `.txt` extension. Importantly though, they can be augmented with metadata, which can be stored in two ways. First, speaker names can be added by using capital letters and a colon, much like in a script. Second, you can use XML style metadata markup. Here is an example file, `sopranos/s1/e01.txt`:
        
        ```html
        <metadata aired="10.01.1999">
        MELFI: My understanding from Dr. Cusamano, your family physician, is you collapsed? Possibly a panic attack? <metadata exposition="true" interrogative-type="intonation" move="info-request">
        TONY: They said it was a panic attack <metadata emph-token="0" move="refute">
        MELFI: You don't agree that you had a panic attack? <metadata move="info-request" question="in">
        ...
        ```
        
        If you add a metadata element at the start of the text file, it will be understood as file-level metadata. For sentence-specific metadata, the element should follow the sentence, ideally at the end of a line. All metadata will be searchable later, so the more you can add, the more you can do with your corpus.
        
        ## Parsing
        
        *buzz* uses [`spaCy`](https://spacy.io/) to parse your text, saving the results as CONLL-U files to your hard drive. Parsing a corpus is very simple:
        
        ```python
        from buzz import Corpus
        
        corpus = Corpus('sopranos')
        parsed = corpus.parse()
        # if you don't need constituency parses, you can speed things up with:
        parsed = corpus.parse(cons_parser=None)
        ```
        
        The main advantages of parsing with *buzz* are that:
        
        * Parse results are stored as valid CONLL-U 2.0 (see the sketch below)
        * Metadata is respected, and transferred into the output files
        * You can do constituency and dependency parsing at the same time (with parse trees being stored as CONLL-U metadata)
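
        For illustration, the first few tokens of the example sentence above end up stored roughly like this (a sketch reconstructed from the parse results shown later; the exact comment lines and MISC fields may differ):

        ```
        # sent_id = 1
        # speaker = MELFI
        # text = My understanding from Dr. Cusamano, your family physician, is you collapsed?
        1	My	-PRON-	DET	PRP$	_	2	poss	_	_
        2	understanding	understanding	NOUN	NN	_	13	nsubjpass	_	_
        3	from	from	ADP	IN	_	2	prep	_	_
        ```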
        
        The `parse()` method returns another `Corpus` object, representing the newly created files. We can explore this corpus via commands like:
        
        ```python
        parsed.subcorpora.s1.files.e01
        parsed.files[0]
        parsed.subcorpora.s1[:5]
        parsed.subcorpora['s1']
        ```
        
        ### Loading corpora into memory
        
        You can use the `load()` method to load a whole or partial corpus into memory as a `Dataset` object, which extends the [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
        
        ```python
        loaded = parsed.load()
        ```
        
        You don't need to load corpora into memory to work on them, but it's great for small corpora. As a rule of thumb, datasets under a million words should be easily loadable on a personal computer.
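
        Loading part of a corpus works the same way (a sketch, assuming the subcorpus objects seen earlier expose the same `load()` method):

        ```python
        # load only season 1 into memory
        s1 = parsed.subcorpora.s1.load()
        ```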
        
        The loaded corpus is a `Dataset` object, which is based on the pandas DataFrame. So, you can use pandas methods on it:
        
        
        ```python
        loaded.head()
        ```
        
        <table border="1" class="dataframe">
          <thead>
            <tr>
              <th>file</th>
              <th>s</th>
              <th>i</th>
              <th>w</th>
              <th>l</th>
              <th>x</th>
              <th>p</th>
              <th>g</th>
              <th>f</th>
              <th>e</th>
              <th>aired</th>
              <th>emph_token</th>
              <th>exposition</th>
              <th>interrogative_type</th>
              <th>move</th>
              <th>parse</th>
              <th>question</th>
              <th>sent_id</th>
              <th>sent_len</th>
              <th>speaker</th>
              <th>text</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <th rowspan="5" valign="top">e01</th>
              <th rowspan="5" valign="top">1</th>
              <th>1</th>
              <td>My</td>
              <td>-PRON-</td>
              <td>DET</td>
              <td>PRP$</td>
              <td>2</td>
              <td>poss</td>
              <td>_</td>
              <td>10.01.1999</td>
              <td>_</td>
              <td>True</td>
              <td>intonation</td>
              <td>info-request</td>
              <td>(S (NP (NP (PRP$ My) (NN understanding)) (PP (...</td>
              <td>_</td>
              <td>1</td>
              <td>14</td>
              <td>MELFI</td>
              <td>My understanding from Dr. Cusamano, your famil...</td>
            </tr>
            <tr>
              <th>2</th>
              <td>understanding</td>
              <td>understanding</td>
              <td>NOUN</td>
              <td>NN</td>
              <td>13</td>
              <td>nsubjpass</td>
              <td>_</td>
              <td>10.01.1999</td>
              <td>_</td>
              <td>True</td>
              <td>intonation</td>
              <td>info-request</td>
              <td>(S (NP (NP (PRP$ My) (NN understanding)) (PP (...</td>
              <td>_</td>
              <td>1</td>
              <td>14</td>
              <td>MELFI</td>
              <td>My understanding from Dr. Cusamano, your famil...</td>
            </tr>
            <tr>
              <th>3</th>
              <td>from</td>
              <td>from</td>
              <td>ADP</td>
              <td>IN</td>
              <td>2</td>
              <td>prep</td>
              <td>_</td>
              <td>10.01.1999</td>
              <td>_</td>
              <td>True</td>
              <td>intonation</td>
              <td>info-request</td>
              <td>(S (NP (NP (PRP$ My) (NN understanding)) (PP (...</td>
              <td>_</td>
              <td>1</td>
              <td>14</td>
              <td>MELFI</td>
              <td>My understanding from Dr. Cusamano, your famil...</td>
            </tr>
            <tr>
              <th>4</th>
              <td>Dr.</td>
              <td>Dr.</td>
              <td>PROPN</td>
              <td>NNP</td>
              <td>5</td>
              <td>compound</td>
              <td>_</td>
              <td>10.01.1999</td>
              <td>_</td>
              <td>True</td>
              <td>intonation</td>
              <td>info-request</td>
              <td>(S (NP (NP (PRP$ My) (NN understanding)) (PP (...</td>
              <td>_</td>
              <td>1</td>
              <td>14</td>
              <td>MELFI</td>
              <td>My understanding from Dr. Cusamano, your famil...</td>
            </tr>
            <tr>
              <th>5</th>
              <td>Cusamano</td>
              <td>Cusamano</td>
              <td>PROPN</td>
              <td>NNP</td>
              <td>3</td>
              <td>pobj</td>
              <td>_</td>
              <td>10.01.1999</td>
              <td>_</td>
              <td>True</td>
              <td>intonation</td>
              <td>info-request</td>
              <td>(S (NP (NP (PRP$ My) (NN understanding)) (PP (...</td>
              <td>_</td>
              <td>1</td>
              <td>14</td>
              <td>MELFI</td>
              <td>My understanding from Dr. Cusamano, your famil...</td>
            </tr>
          </tbody>
        </table>
        
        You can also interactively explore the corpus with [tabview](https://github.com/TabViewer/tabview) using the `view()` method:
        
        ```python
        loaded.view()
        ```
        
        The interactive view has a number of cool features, such as the ability to sort by row or column. Also, pressing `enter` on a given line will generate a concordance based on that line's contents. Neat!
        
        ## Exploring parsed and loaded corpora
        
        A loaded corpus is a pandas DataFrame object. The index is a MultiIndex comprising `filename`, `sent_id` and `token`, so each token in the corpus is uniquely identifiable through this index. The columns of the loaded corpus are all the CONLL-U columns, plus anything included as metadata.
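
        Because of this index, a single token can be looked up with standard pandas indexing (a minimal sketch; the level values simply follow the example data above):

        ```python
        # file 'e01', sentence 1, token 1: the word 'My'
        loaded.loc[('e01', 1, 1)]
        ```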
        
        ```python
        # get the first sentence using buzz.sent()
        first = loaded.sent(0)
        # using pandas syntax to get first 5 words
        first.iloc[:5]['w']
        # join the wordclasses and words
        print(' '.join(first.x.str.cat(first.w, sep='/')))
        ```
        
        ```
        DET/My NOUN/understanding ADP/from PROPN/Dr. PROPN/Cusamano PUNCT/, DET/your NOUN/family NOUN/physician PUNCT/, VERB/is PRON/you VERB/collapsed PUNCT/?
        ```
        
        You don't need to know pandas in order to use *buzz*, however, because *buzz* provides some more intuitive methods designed with linguistics in mind. For example, if you want to slice the corpus in some way, you can do this easily using the `just` and `skip` properties:
        
        ```python
        tony = loaded.just.speaker.TONY
        # for regular expressions, like removing all punctuation:
        no_punct = loaded.skip.wordclass.PUNCT
        ```
        
        Any object created by *buzz* has a `.view()` method, which launches a `tabview` interactive space where you can explore corpora, frequencies or concordances.
        
        ## spaCy
        
        [`spaCy`](https://spacy.io/) is used under the hood for dependency parsing, and a couple of other things. spaCy brings with it many state-of-the-art methods in NLP. You can access the `spaCy` representation of your data with:
        
        ```python
        corpus.to_spacy()
        # or
        loaded.to_spacy()
        ```
        
        ## Searching dependencies
        
        To search the dependency graph generated by spaCy during parsing, you can use the *depgrep* method.
        
        
        ```python
        # search dependencies for nominal subjects with definite articles
        nsubj = loaded.depgrep('f/nsubj.*/ -> (w"the" & x"DET")')
        ```
        
        The search language works by modelling nodes and the links between them. Specifying a node, like `f/nsubj/`, is done by specifying the feature you want to match (`f` for `function`), and a query inside slashes (for regular expressions) or inside quotation marks (for literal matches).
        
        The arrow-like link specifies that the `nsubj` must govern the determiner. The `&` relation specifies that the two nodes are actually the same node. Brackets may be necessary to contain the query.
        
        This language is based on `tgrep` syntax, customised for dependencies. It is still a work in progress, but documentation should emerge [here](https://github.com/interrogator/depgrep).
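
        As a further illustration of the operators just described (a sketch; `x` matches the wordclass column and `f` the function column, as in the corpus table earlier):

        ```python
        # verbs governing an adverbial modifier
        verbs = loaded.depgrep('x"VERB" -> f/advmod/')
        ```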
        
        When you search a Corpus or a Dataset, the result is simply another Dataset, representing a subset of the Corpus. Therefore, rather than trying to construct one query string that gets everything you want, it is often easier to perform multiple small searches:
        
        ```python
        tony_subjects = loaded.just.speaker.TONY.depgrep('f/nsubj/ <- f/ROOT/')
        ```
        
        rather than the more error-prone:
        
        ```python
        tony_subjects = loaded.depgrep('f/nsubj/ <- f/ROOT/ & speaker/TONY/')
        ```
        
        Note that for any searches that do not require traversal of the grammatical structure, you should use the `skip` and `just` methods. *tgrep* and *depgrep* only need to be used when your search involves the grammar, and not just token features.
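
        To make the distinction concrete, both lines below are sketches built from the queries shown earlier:

        ```python
        # grammar involved: use depgrep
        roots_with_subjects = loaded.depgrep('f/ROOT/ -> f/nsubj/')
        # token features only: use just/skip
        nsubjs = loaded.just.function.nsubj
        ```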
        
        ## Searching constituency trees
        
        Constituency tree searching can be done with the `tgrep` method, which provides a Python implementation of the [`tgrep2` query syntax](https://web.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html): 
        
        ```python
        nps_with_adjectives = loaded.tgrep('NP < JJ')
        ```
        
        It also works with nodes and links, though there are numerous differences. In particular, note that arrows appear reversed --- `NP < JJ` is an NP that dominates a JJ, while something similar in depgrep would be `f/nsubj/ -> f/amod/`, a nominal subject governing an adjective.
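
        For example, to find verb phrases that dominate a noun phrase (a sketch using the same dominance operator as above):

        ```python
        vps = loaded.tgrep('VP < NP')
        ```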
        
        ## Viewing search results
        
        An important principle in *buzz* is the separation of searching and viewing results. Unlike many other tools, you do not search for a concordance---instead, you search the corpus, and then visualise the output of the data as a concordance.
        
        ### Concordancing
        
        Concordancing is a nice way of looking at results. The main thing you have to do is tell *buzz* how you want the match column to look---it can be just the matching words, but also any combination of things. To show words and their parts of speech, you can do:
        
        ```python
        nsubj = loaded.just.function.nsubj
        nsubj.concordance(show=['w', 'p'])
        ```
        
        ### Frequency tables
        
        You can turn your dataset into frequency tables, either before or after searching or filtering. Tabling takes a `show` argument similar to the `show` argument for concordancing, as well as an additional `subcorpora` argument. `show` determines how the columns will be formatted, and `subcorpora` is used as the index. Below, we create a frequency table of `nsubj` tokens, in lemma form, organised by speaker.
        
        ```python
        tab = nsubj.table(show='l', subcorpora=['speaker'])
        ```
        
        Possible keyword arguments for the `.table()` method are as follows:
        
        | Argument           | Description                                                                                                                                                                                                                                                                                      | Default  |
        |--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|
        | `subcorpora`         | Feature(s) to use as the index of the table. Passing in a list of multiple features will create a multiindex                                                                                                                                                                                     | `['file']` |
        | `show`               | Feature(s) to use as the columns of the table. Passing a list will join the features with slash, so `['w', 'p']` results in columns with names like `'friend/NN'`                                                                                                                                | `['w']`    |
        | `sort`               | How to sort the results: `'total'`/`'infreq'`, `'increase'`/`'decrease'`, `'static'`/`'turbulent'`, `'name'`/`'inverse'`                                                                                                                                                                          | `'total'`  |
        | `relative`           | Use relative, rather than absolute frequencies with `True`. You can also pass in Series, DataFrame or buzz objects to calculate relative frequencies against the passed in data.                                                                                                                 | `False`    |
        | `remove_above_p`     | Sorting by increase/decrease/static/turbulent calculates the slope of the frequencies across each subcorpus, and p-values where the null hypothesis is no slope. If you pass in a float, entries with p-values above this float are dropped from the results. Passing in `True` will use `0.05`. | `False`    |
        | `keep_stats`         | If True, keep generated statistics related to the trajectory calculation                                                                                                                                                                                                                         | `False`    |
        | `preserve_case`      | Keep the original case for `show` (column) values                                                                                                                                                                                                                                                | `False`    |
        | `multiindex_columns` | When `show` is a list with multiple features, rather than joining `show` with slashes, build a multiindex                                                                                                                                                                                        | `False`    |
        
        
        This creates a `Table` object, which is also based on DataFrame. You can use its `.view()` method to quickly explore results. Pressing enter on a given frequency will bring up a concordance of instances of this entry.
        
        ### Plotting
        
        You can also use *buzz* to create high-quality visualisations of frequency data. This relies completely on pandas' plotting method. A `plot` method more tailored to language datasets is still in development.
        
        ```python
        tab.plot(...)
        ```
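
        Since `Table` is based on the DataFrame, the standard pandas plotting arguments apply (a sketch; the keyword values and filename here are just examples):

        ```python
        # e.g. a bar chart of the table, saved to disk
        ax = tab.plot(kind='bar', figsize=(10, 4))
        ax.get_figure().savefig('nsubj-by-speaker.png')
        ```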
Keywords: corpus,linguistics,nlp
Platform: UNKNOWN
Description-Content-Type: text/markdown
