Metadata-Version: 2.1
Name: easytrajh5
Version: 0.2.3
Summary: MD trajectory library
Author: Bosco Ho
Author-email: apposite@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: ParmEd (==3.4.3)
Requires-Dist: addict
Requires-Dist: click
Requires-Dist: deepdiff
Requires-Dist: h5py
Requires-Dist: mdtraj
Requires-Dist: numpy
Requires-Dist: path
Requires-Dist: pydash
Requires-Dist: rich
Requires-Dist: ruyaml
Description-Content-Type: text/markdown


# EasyTrajH5

Trajectory management for mdtraj H5 files

## Installation

    pip install easytrajh5

## Quick Guide

Our main file object `EasyTrajH5` is a drop-in replacement
for `mdtraj.H5TrajectryFile` with extra functionality.

```python
from easytrajh5.traj import EasyTrajH5File
 
h5 = EasyTrajH5File('traj.h5')
traj = h5.read_as_traj()
```
This loads the data progressively in chunks, allowing online streaming 
in advanced usage. 

Load individual frames

```python
last_frame_traj = h5.read_frame_as_traj(-1)
```

As we use the `h5py` library, we can use efficient
fancy indexing to load just certain atoms:

```python
atom_indices = [100, 115, 116]
three_atom_traj = h5.read_as_traj(atom_indices=atom_indices)
```

We provide atom selection using a new selection language (described in detail below).
This is particular efficient as it only loads the atoms you want, without 
requiring the entire trajectory to be loaded into memory:

```python
from easytrajh5.traj import EasyTrajH5File
 
mask = "intersect {mdtraj name CA} {protein}"
ca_trace_traj = EasyTrajH5File('traj.h5', atom_mask=mask).read_as_traj()
```

Drop in replacement for `mdtraj.reporters.HDF5Reporter` in openmm
that uses `EasyTrajH5File`: 

```python
from easytrajh5.traj import EasyTrajH5Reporter
```


## Atom Selection Language

Why another atom selection language (we have AMBER and MDTRAJ)?
Two main reasons. 

First, we wanted user-defined 
residue selections. These are stored in `easytrajh5/data/select.yaml`. 
Edit this file to create any new residue selections.

Second, we wanted to fix residue selection. The problem
is that AMBER uses residue numbering (`:3,5,10-12`) defined in the PDB file 
and not 0-based residue indexing. This means that in PDB files with multiple
chains, the residue number is not unique. MDTRAJ on the other hand, uses 
0-based indexing, but only allows you to use ranges (`resi 10 to 15`). 

We've combined these ideas to provide our new flexible 0-based residue indexing 
`resi 3,5,10-12,100-150,300`.

We also allow you to easily drop in to AMBER and MDTRAJ simply by 
using the `amber` and `mdtraj` keywords. When combined with set 
operations, everything is now at your disposal.

Some useful masks:

- no solvent: `not {solvent}`
- just the protein: `protein`
- ligand and specific residues: `ligand resi 5,1,22-200`
- heavy protein atoms: `diff {protein} {amber @/H}`
- no hydrogens: `not {amber @/H}`
- ligand and 6 closest residues: `pocket ligand`
- specified ligand with 10 closest neighbours: `resname UNL near UNL 10`

#### User-defined and operator keywords

If more than one keyword is specified, it is assumed they are joined with "or"
operation (i.e. `ligand protein` will return both ligand and protein atom indices).

This default keywords are:
- `ligand`, `protein`, `water`, `lipid`, `salt`, `solvent`, `lipid`, `nucleic`
- as defined in `easytrajh5/data/select.yaml`
- `ligand` will find the residues `LIG`, `UNL`, `UNK`

Special operator keywords:

- `pocket` will find the closest 6 residues to the `ligand` group.
- `near` will require a following resname, with an optional integer, e.g.:
    `near ATP`
    `near ATP 5`
- `resname` identifies a single residue type
    `resname LEU`
- `resi` for 0-indexed residue selections
    `resi 0,10-13` - selects atoms in the first and 11th to 14th residues
- `atom` for 0-indexed atoms selections
    `atom 0,55,43,101-105` - selects the first, 56th, 44th, 102 to 106th atom

#### AMBER-style atom selection

- https://parmed.github.io/ParmEd/html/amber.html#amber-mask-syntax
- `amber :ALA,LYS` - selects all alanine and lysine residues

#### MDTraj-style atom selection 
- https://mdtraj.org/1.9.4/atom_selection.html
- `mdtraj protein and water` - selects protein and water

#### Set operations

Selections can be combined with set operators: `not`, `intersect`, `merge`, `diff`:

- `intersect {not {amber :ALA}} {protein}`
- `diff {protein} {not {amber :ALA}}`
- `not {resname LEU}`
- `merge {near BSM 8} {amber :ALA}`

#### Use in python

In your python code, there is a `select_mask` fn that operates on `parmed.Structure`
objects:

```python
from easytrajh5.traj import EasyTrajH5File
from easytrajh5.select import select_mask
from easytrajh5.struct import slice_parmed
 
pmd = EasyTrajH5File("traj.h5").get_parmed()
i_atoms = select_mask(pmd, "not {solvent}")
sliced_pmd = slice_parmed(pmd, i_atoms)
```

If you need to convert between different MD objects, in `easytrajh5.struct`, 
there are some common manipulations:

```python
def dump_parmed(pmd: parmed.Structure, fname: str): 
def load_parmed(fname: str) -> parmed.Structure:
def get_parmed_from_pdb(pdb: str) -> parmed.Structure:
def get_parmed_from_parmed_or_pdb(pdb_or_parmed: str) -> parmed.Structure:
def get_parmed_from_mdtraj(traj: mdtraj.Trajectory, i_frame=0) -> parmed.Structure:
def get_parmed_from_openmm(openmm_topology, openmm_positions=None) -> parmed.Structure:
def get_mdtraj_from_parmed(pmd: parmed.Structure) -> mdtraj.Trajectory:
def get_mdtraj_from_openmm(openmm_topology, openmm_positions) -> mdtraj.Trajectory:
```

## Use as H5

There are convenience functions to insert different types
of data. 

To save/load strings:

```python
h5.set_str_dataset('my_string', 'a string')
h5.flush()
new_str = h5.get_str_dataset('my_string')
```

To save/load json:
```python
h5.set_json_dataset('my_obj', {"a", "b"})
h5.flush()
new_obj = h5.get_json_dataset('my_obj')
```
To insert/extract binary files:

```python
h5.insert_file_to_dataset('blob', 'blob.bin')
h5.flush()
h5.extract_file_from_dataset('blob', 'new_blob.bin')
```

We can get information about the h5 file:

```python
schema_json = h5.get_schema()
dataset_keys = h5.get_dataset_keys()
attr_keys = h5.get_attr_keys()
```

We can extract data

```python
dataset = h5.get_dataset("coordinates")
value_list = dataset[:]
last_value = dataset[-1]

# if the attrs are set
value = h5.get_attr('user')
```

Convenience function to append values to an `h5` file without
worrying about file or dataset creation:

```python
from easytrajh5.h5 import dump_value_to_h5, EasyH5File

dump_value_to_h5('new.h5', [1,2], 'my_data_set')
dump_value_to_h5('new.h5', [3,4], 'my_data_set')
dump_value_to_h5('new.h5', [5,7], 'my_data_set')

return_values = EasyH5File('new.h5').get_dataset("my_data_set")[:]
# [[1,2], [3,4], [5,6]]
```

## Command-line utility `easyh5`

We have a command line `easyh5` to interrogate any `h5` file. To get a
schema of the dataset layout and attributes:

```bash
> easyh5 schema traj.h5
# {
# │   'datasets': [
# ....
# │   │   {
# │   │   │   'key': 'coordinates',
# │   │   │   'shape': [200, 3340, 3],
# │   │   │   'chunks': [3, 3340, 3],
# │   │   │   'is_extensible': True,
# │   │   │   'frame_shape': [3340, 3],
# │   │   │   'n_frame': 200,
# │   │   │   'dtype': 'float32',
# │   │   │   'attr': {'CLASS': 'EARRAY', 'EXTDIM': 0, 'TITLE': None, 'VERSION': '1.1', 'units': 'nanometers'}
# │   │   },
#
# ...
#
# │   │   {
# │   │   │   'key': 'topology',
# │   │   │   'shape': [1],
# │   │   │   'dtype': 'string(217329)',
# │   │   │   'attr': {'CLASS': 'ARRAY', 'FLAVOR': 'python', 'TITLE': None, 'VERSION': '2.4'}
# │   │   }
# │   ],
# │   'attr': {
# │   │   'CLASS': 'GROUP',
# │   │   'FILTERS': 65793,
# │   │   'PYTABLES_FORMAT_VERSION': '2.1',
# │   │   'TITLE': None,
# │   │   'VERSION': '1.0',
# │   │   'application': 'MDTraj',
# │   │   'conventionVersion': '1.1',
# │   │   'conventions': 'Pande',
# │   │   'program': 'MDTraj',
# │   │   'programVersion': '1.9.7',
# │   │   'title': 'title'
# │   }
# }
```

Or as a quick summary table:

```bash
> easyh5 dataset examples/trajectory.h5 
#Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.

                  sims/high_bf/trajectory.h5                  
                                                              
  dataset           shape              dtype       size (MB)  
 ──────────────────────────────────────────────────────────── 
  cell_angles       (1500, 3)          float32       0.02 MB  
  cell_lengths      (1500, 3)          float32       0.02 MB  
  coordinates       (1500, 25767, 3)   float32     442.32 MB  
  kineticEnergy     (1500,)            float32         <1 KB  
  potentialEnergy   (1500,)            float32         <1 KB  
  temperature       (1500,)            float32         <1 KB  
  time              (1500,)            float32         <1 KB  
  topology          (1,)               |S2083249     1.99 MB  
                                                              
  total                                            444.36 MB  
```

To get an overview of a dataset:

```bash
> easyh5 dataset examples/trajectory.h5 coordinates
# Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
# 
#   examples/trajectory
#      dataset=coordinates
#      shape=(1500, 25767, 3)
# 
# [[[1.291678   7.558739   1.5199517 ]
#   [1.368739   7.5888386  1.4620152 ]
#   [1.2175218  7.6268845  1.5275735 ]
#   ...
#   [2.375777   0.09478953 4.0356894 ]
#   [3.107005   3.3255231  2.8464174 ]
#   [3.0329072  3.9307644  1.3600407 ]]
# 
#  ...
# 
#  [[2.9693408  7.1466036  1.4656581 ]
#   [2.9327238  7.198606   1.3871984 ]
#   [3.0665123  7.171176   1.4781022 ]
#   ...
#   [4.944392   0.56028575 4.301907  ]
#   [2.6180382  0.3969128  1.4842175 ]
#   [3.281546   4.9666233  2.4855924 ]]]
```

Or to focus on a selected frames, use a numbered lis:

```bash
> easyh5 dataset examples/trajectory.h5 coordinates 1,3,4-10
# Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
# 
#   sims/high_bf/trajectory.h5
#      dataset=coordinates
#      shape=(1500, 25767, 3)
# 
# frames(1,3,4-10)=
# [[[1.2958181  7.5481067  1.5513833 ]
#   [1.2766361  7.4586782  1.5085387 ]
#   [1.3654946  7.58469    1.4880756 ]
#   ...
#   [2.0149727  0.20826703 3.712016  ]
#   [3.3603299  3.6615734  2.6487541 ]
#   [3.1595583  4.0199933  1.509442  ]]
# 
#  ...
# 
#  [[1.2550778  7.4836254  1.5989571 ]
#   [1.278228   7.403505   1.5419852 ]
#   [1.2919694  7.571082   1.5644412 ]
#   ...
#   [2.6036768  5.9193387  3.7148886 ]
#   [4.2752028  3.8813443  2.6205144 ]
#   [2.343824   3.9689744  0.05281828]]]
# 
```

To check atom selections of the protein:

```bash
> easyh5 mask sims/high_bf/trajectory.h5 "amber :PRO" --res
# Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
# EasyTrajH5File: fname='sims/high_bf/trajectory.h5' mode='a' atom_mask='' is_dry_cache=False
# open connection:started...
# open connection:finished in <1ms
# loading topology:started...
# loading topology:finished in 134ms
# select_mask "amber :PRO" -> 112 atoms, 8 residues
# <Residue PRO[7]; chain=1>
# <Residue PRO[12]; chain=1>
# <Residue PRO[42]; chain=1>
# <Residue PRO[48]; chain=1>
# <Residue PRO[50]; chain=1>
# <Residue PRO[87]; chain=1>
# <Residue PRO[129]; chain=1>
# <Residue PRO[167]; chain=1>
```

## Miscellaneous utility 

In `easytrajh5.quantity` we have some useful transforms to handle those
pesky unit objects from openmm. These transforms are used in our yaml and
json convenience functions

```python
from easytrajh5 import quantity
from parmed import unit

x = 5 * unit.nanosecond
d = quantity.get_dict_from_quantity(x)
# {
#│   'type': 'quantity',
#│   'value': 5,
#│   'unit': 'nanosecond',
#│   'unit_repr': 'Unit({BaseUnit(base_dim=BaseDimension("time"), name="nanosecond", symbol="ns"): 1.0})'
#}
y = quantity.get_quantity_from_dict(d)
# Quantity(value=5, unit=nanosecond)
```


