Metadata-Version: 2.1
Name: chemdescriptor
Version: 0.2.0
Summary: A standalone module to help generate molecular descriptors from various cheminformatics software
Home-page: https://github.com/darkreactions/chemdescriptor
Author: DRP Project
Author-email: darkreactionproject@haverford.edu
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: pandas

# chemdescriptor - Molecular descriptor generator
Generic molecular descriptor generator wrapper around various software packages to simplify the process of getting descriptors

## To install
Type:
```git clone https://github.com/darkreactions/chemdescriptor```

```cd chemdescriptor```

```git checkout cxcalc_rewrite```

```pip install .```

## Requirements
1. Pandas
2. ChemAxon descriptors
    - Working copy of ChemAxon cxcalc
3. RDKit descriptors
    - RDKit installed

## Usage
Currently only supports ChemAxon cxcalc and RDKit. The module can be expanded to cover other generators as well.
Example input files can be found in the examples/ folder of this repo as well as the pip installed package.

### CXCalc

**Important! The code requires an environment variable CXCALC_PATH to be set, which points to the folder where cxcalc is installed!**

### Command Line
```
chemdescriptor-cx -m /path/to/SMILES/file -d /path/to/descriptor/whitelist/json -p 6.8 7.0 7.2 -o output.csv
```

```
usage: chemdescriptor-cx [-h] -m MOLECULE -d DESCRIPTORS -p PH [PH ...]
                         [-c COMMANDS] -o OUTPUT

optional arguments:
  -h, --help            show this help message and exit
  -m MOLECULE, --molecule MOLECULE
                        Path to input SMILES file
  -d DESCRIPTORS, --descriptors DESCRIPTORS
                        Path to descriptor white list json file
  -p PH [PH ...], --pH PH [PH ...]
                        List of pH values at which to calculate descriptors
  -c COMMANDS, --commands COMMANDS
                        Optional command stems for descriptors in json format
  -o OUTPUT, --output OUTPUT
                        Path to output file
```

### In code

The package will initially search cxcalc executable in the PATH variable if not
will fall back to CXCALC_PATH

Set CXCALC_PATH

```
import os
os.environ['CXCALC_PATH'] = '/path/to/cxcalc'
```

Import the generator class

``` from chemdescriptor.generator.chemaxon import ChemAxonDescriptorGenerator as CAG```

Import SMILES and whitelist

```
with open('/path/to/SMILES/file', 'r') as f:
    smiles_list = f.read().splitlines()

with open('/path/to/descriptor/whitelist/json', 'r') as f:
    whitelist = json.load(f)
```

Instantiate a generator. ```smiles_list``` is a list of smiles and ```whitelist```
is a dictionary of keys in the command_dict 
```logfile``` is the path to a log which contains information such as the final cxcalc
command, columns that were renamed and other errors for debugging

``` 
cag = CAG(smiles_list,
          whitelist,
          ph_values=[6, 7, 8],
          command_dict={},
          logfile='/path/to/logfile')
```

Generate csv output
``` cag.generate('output.csv', dataframe=False, lec=False) ```

Optional keyword arguments for `generate` include `dataframe` boolean (default False) which returns a pandas dataframe in addition to writing a csv if True
and `lec` boolean (default False) which converts the Smiles code to an intermediate "Low Energy Conformer (LEC)" representation before generating descriptors.
A license is most likely required to generate LECs.

## Notes:

Descriptor whitelist is a python dictionary of the form:
```
{
    "descriptors": [
        "refractivity",
        "maximalprojectionarea",
        "maximalprojectionradius",
        "maximalprojectionsize",
        "minimalprojectionarea",
        "minimalprojectionradius",
        "minimalprojectionsize"
    ],
    "ph_descriptors": [
        "avgpol",
        "molpol",
        "vanderwaals",
        "asa",
        "asa+",
        "asa-",
        "asa_hydrophobic",
        "asa_polar",
        "hbda_acc",
        "hbda_don",
        "polar_surface_area"
    ]
}
```

chemdescriptor expects 2 keys in the whitelist where "descriptors" are generic and "ph_descriptors" are ph dependent descriptors

An **optional** dictionary can be passed to the ChemAxonDescriptorGenerator, "command_dict" which
"translates" the above descriptor names into commands that ChemAxon cxcalc can understand.

It also consists of column names that will be added to the final output.

**Note:** If the command_dict is not given or is empty, a default command dict is used whose definition is [here](https://github.com/darkreactions/chemdescriptor/blob/cxcalc_rewrite/chemdescriptor/defaults/cxcalc.py)

An example of a command_dict is:

```
command_dict = {
    "descriptors": {
        "atomcount_c": {
            "command": [
                "atomcount",
                "-z",
                "6"
            ],
            "column_names": [
                "_feat_AtomCount_C"
            ]
        },
        "wateraccessiblesurfacearea": {
            "command": [
                "wateraccessiblesurfacearea"
            ],
            "column_names": [
                "_feat_ASA",
                "_feat_ASA+",
                "_feat_ASA-",
                "_feat_ASA_H",
                "_feat_ASA_P"
            ]
        }
    "ph_descriptors": {
        "acceptorcount": {
            "command": [
                "acceptorcount"
            ],
            "column_names": [
                "_feat_Hacceptorcount"
            ]
        },
        "donorcount": {
            "command": [
                "donorcount"
            ],
            "column_names": [
                "_feat_Hdonorcount"
            ]
        }
    }

```
```command_dict``` consists of 2 dictionaries with keys ```descriptors``` and 
```ph_descriptors```. Within each dictionary are descriptor names referred in the whitelist. 

Under each descriptor, two lists are required ```command``` and ```column_names```

Command refers to the command line options for cxcalc as documented 
[here](https://docs.chemaxon.com/display/docs/cxcalc+calculator+functions)
**Note:** that commands with multiple words are entries in a list. For example, the command 
```atomcount -z 6``` is represented in the dictionary as ```['atomcount', '-z', '6']```

```column_names``` is a list of names the user wants to rename the cxcalc generated
csv column names.

Certain commands generate multiple columns for example, ```wateraccessiblesurfacearea```
generates 5 columns. Therefore, the ```column_names``` list becomes
```
"column_names": [
                "_feat_ASA",
                "_feat_ASA+",
                "_feat_ASA-",
                "_feat_ASA_H",
                "_feat_ASA_P"
            ]
```

**Note** : If the number of columns generated by cxcalc do not match the expected count, 
none of the column names are renamed.

### RDKit

Much easier to use. Only needs a list of descriptors similar to cxcalc. 


# To Do
[ ] Test on different machines

[ ] Get feedback on what needs to be changed/improved

[ ] Expand to cover other descriptor generators


