Metadata-Version: 2.1
Name: code-tokenizers
Version: 0.0.5
Summary: Aligning BPE and AST
Home-page: https://github.com/ncoop57/code_tokenizers
Author: ncoop57
Author-email: nacooper01@wm.edu
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python tokenizer bpe ast
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

code_tokenizers
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

This library is built on top of the awesome
[transformers](https://github.com/huggingface/transformers) and
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter) libraries.
It provides a simple interface to align the tokens produced by a BPE
tokenizer with the tokens produced by a tree-sitter parser.

## Install

``` sh
pip install code_tokenizers
```

## How to use

The main interface of `code_tokenizers` is the
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
class. It combines a pretrained BPE tokenizer from the
[transformers](https://huggingface.co/docs/transformers/quicktour#autotokenizer)
library with a tree-sitter parser from the
[tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#python)
library.

To create a
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
that uses the `gpt2` BPE tokenizer and the `python` tree-sitter parser:

``` python
from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
```

    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

You can specify any pretrained BPE tokenizer from the [Hugging Face
Hub](https://hf.co/models) or a local directory, along with the
language whose AST you want to parse.

Now, we can tokenize some code:

``` python
from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
```

    {'ast_ids': [...],
     'attention_mask': [...],
     'input_ids': [...],
     'is_builtins': [...],
     'is_internal_methods': [...],
     'merged_ast': [...],
     'offset_mapping': [...],
     'parent_ast_ids': [...]}
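
The `offset_mapping` entry is what ties each BPE token back to the source text. The sketch below assumes it follows the Hugging Face fast-tokenizer convention of one `(start, end)` character span per token, which this README does not state explicitly. The offsets are hand-written for illustration rather than real `gpt2` output, and `token_texts` is a hypothetical helper, not part of `code_tokenizers`.

``` python
from typing import List, Tuple


def token_texts(source: str, offsets: List[Tuple[int, int]]) -> List[str]:
    """Slice the source string once per (start, end) character span."""
    return [source[start:end] for start, end in offsets]


# Illustrative input: the spans below are made up, not produced by a real tokenizer.
sample_code = "def foo():\n    pass\n"
sample_offsets = [(0, 3), (3, 7), (7, 9), (9, 10), (10, 11), (11, 15), (15, 19), (19, 20)]

for text in token_texts(sample_code, sample_offsets):
    print(repr(text))
```

With a real encoding, the same slicing could be applied to `code` and `encoding["offset_mapping"]`, provided the assumption about the span format holds.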

And we can print out the associated AST types:

<div>

> **Note**
>
> In the output below, `N/A` marks tokens that are not part of the AST,
> such as spaces and newline characters. Their IDs are set to -1.

</div>

``` python
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
```

    N/A
    function_definition def
    function_definition identifier
    parameters (
    N/A
    N/A
    N/A
    N/A
    call identifier
    argument_list (
    argument_list string
    argument_list string
    argument_list string
    argument_list )
    N/A
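
To make the idea of aligning BPE tokens with AST nodes concrete, here is a minimal sketch of one possible approach: trim whitespace from each token's character span, then look for an AST leaf whose span fully contains what is left. This is an assumption made for illustration, not necessarily how `code_tokenizers` does it. The function names, token offsets, and leaf spans are all hypothetical and hand-written, so the example runs without `transformers` or `tree-sitter`.

``` python
from typing import List, Tuple

Leaf = Tuple[int, int, str]  # (start, end, node type) of an AST leaf


def trim_span(source: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink a span so it no longer starts or ends with whitespace."""
    while start < end and source[start].isspace():
        start += 1
    while end > start and source[end - 1].isspace():
        end -= 1
    return start, end


def align(source: str, offsets: List[Tuple[int, int]], leaves: List[Leaf]) -> List[str]:
    """Return one node type per token, or "N/A" when no leaf covers the token."""
    aligned = []
    for start, end in offsets:
        start, end = trim_span(source, start, end)
        node_type = "N/A"
        if start < end:  # whitespace-only tokens stay unaligned
            for leaf_start, leaf_end, leaf_type in leaves:
                if leaf_start <= start and end <= leaf_end:
                    node_type = leaf_type
                    break
        aligned.append(node_type)
    return aligned


# Hand-written example data (not real tokenizer or parser output).
snippet = "def foo():\n    pass\n"
bpe_offsets = [(0, 3), (3, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 15), (15, 19), (19, 20)]
ast_leaves = [(0, 3, "def"), (4, 7, "identifier"), (7, 8, "("), (8, 9, ")"), (9, 10, ":"), (15, 19, "pass")]

print(align(snippet, bpe_offsets, ast_leaves))
```

Under this rule, a token that straddles two leaves (for example a single `()` token) has no covering leaf and is reported as `N/A`; a real implementation would need a policy for that case.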
