Metadata-Version: 2.1
Name: protloc_mex_X
Version: 0.0.9
Summary: ...
Home-page: https://github.com/yujuan-zhang/ProtLoc-mexl
License: MIT
Author: Ze Yu Luo
Author-email: 1024226968@qq.com
Requires-Python: >=3.9
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: matplotlib (>=3.5.1)
Requires-Dist: numpy (>=1.20.3)
Requires-Dist: pandas (>=1.4.1)
Requires-Dist: seaborn (>=0.11.2)
Project-URL: Documentation, https://github.com/yujuan-zhang/ProtLoc-mexl/issues
Project-URL: Repository, https://github.com/yujuan-zhang/ProtLoc-mexl
Description-Content-Type: text/markdown

# ProtLoc-mex_X

## Introduction to ProtLoc-mex_X

protloc_mex_X integrates two modules: ESM2_fr and feature_correlation. ESM2_fr is based on the ESM2 model (the ESM2_650m checkpoint is supported) and extracts feature representations from protein sequences, including 'cls', 'eos', 'mean', 'segment_mean', and 'pho'. The feature_correlation module provides Spearman correlation analysis, letting users visualize correlation heatmaps and conduct feature crossover regression analysis. Together, these allow users to explore the relationships between different data features and identify features that are relevant to the target feature.
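
To make the Spearman correlation mentioned above concrete, here is a minimal standard-library sketch of the statistic itself: rank both variables (ties share the average rank), then compute the Pearson correlation of the ranks. The helper names `average_ranks`, `pearson`, and `spearman` are illustrative only and are not the feature_correlation API.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still gives a coefficient of about 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only ranks enter the calculation, the coefficient captures monotonic relationships between features rather than strictly linear ones.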

## Installation

This project's core code has been uploaded to the PyPI repository. To install it in a conda virtual environment, follow the steps below.

First, create a new conda environment. On Windows, Conda Prompt is recommended for this task; on Linux, you can use the Terminal. The environment name can be changed as needed; here, we use "myenv" as an example:

```
conda create -n myenv python=3.10
```

Then, activate the environment you just created:

```
conda activate myenv
```

Finally, use pip to install 'protloc_mex_X' within this environment:

```
pip install protloc_mex_X
```

### Dependencies

ProtLoc-mex_X requires Python == 3.9 or 3.10.

Below are the Python packages required by ProtLoc-mex_X, which are automatically installed with it:

```
dependencies = [
        "numpy >=1.20.3",
        "pandas >=1.4.1",
        "seaborn >=0.11.2",
        "matplotlib >=3.5.1"
]
```

The following Python packages are also required but are not installed automatically:

```
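dependencies = [
       "torch ==1.12.1",
       "tqdm ==4.63.0",
       "re ==2.2.1",  # standard library module, ships with Python (no pip install needed)
       "scikit-learn ==1.0.2",  # imported in code as sklearn
       "transformers ==4.26.1"
]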
```

It is advised to obtain these dependent packages from their respective official sources, while carefully considering the implications of version compatibility.
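
One way to check the manually installed packages against the pinned versions is sketched below using only the standard library. The helper `check_pins` is hypothetical and not part of protloc_mex_X. It looks up `sklearn` under its PyPI distribution name `scikit-learn`, skips `re` because it ships with Python, and ignores local build suffixes such as `+cu113` when comparing versions.

```python
from importlib import metadata

# Versions copied from the list above; distribution names as used on PyPI.
PINS = {
    "torch": "1.12.1",
    "tqdm": "4.63.0",
    "scikit-learn": "1.0.2",
    "transformers": "4.26.1",
}


def check_pins(pins):
    """Return {distribution: status} with status 'ok', 'missing', or 'mismatch (...)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop a local version suffix such as '+cu113' before comparing.
        base = found.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINS).items():
        print(f"{name}: {status}")
```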

## How to use ProtLoc-mex_X

ProtLoc-mex_X includes two modules: ESM2_fr and feature_correlation.

### ESM2_fr

ESM2_fr is a pre-trained deep learning model based on the ESM2 model. It is capable of extracting representation features from protein sequences and further optimizing the feature representation through weighted averaging.

It contains one class and three functions. The class is named `Esm2LastHiddenFeatureExtractor`, which includes the following three methods: `get_last_hidden_features_combine()`, `get_last_hidden_phosphorylation_position_feature()`, and `get_amino_acid_representation()`. The functions present in the code are `get_last_hidden_features_single()`, `NetPhos_classic_txt_DataFrame()`, and `phospho_feature_sim_cosine_weighted_average()`.

#### Function `get_last_hidden_features_single()`:

The `get_last_hidden_features_single()` function extracts different types of representation features from the input protein sequences. It accepts protein sequence data `X_input`, along with the model tokenizer and model, and returns a DataFrame containing the extracted features. Note: only single-batch inputs are supported.

#### Class `Esm2LastHiddenFeatureExtractor()`:

The `Esm2LastHiddenFeatureExtractor()` class is used for extracting various types of representation features from protein sequences. It accepts amino acid sequence input, invokes the pre-trained ESM2 model, and obtains pre-trained representation vectors ('cls', 'eos', 'mean', 'segment_mean', 'pho').
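
As a rough illustration of what pooled vectors of this kind can look like, the sketch below pools a per-token matrix using plain Python lists. It assumes the first row belongs to the `<cls>` token, the last row to the `<eos>` token, and the rows in between to residues. The function `pool_last_hidden`, the `num_segments` parameter, and the equal-sized segment split are assumptions made for illustration and may differ from the library's actual computation; the 'pho' vector is not covered here.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def pool_last_hidden(hidden, num_segments=2):
    """Pool a token-by-dimension matrix into cls, eos, mean, and segment means.

    Assumed layout (illustrative): hidden[0] is <cls>, hidden[-1] is <eos>,
    and hidden[1:-1] holds one row per residue.
    """
    if len(hidden) < 3:
        raise ValueError("need <cls>, at least one residue, and <eos>")
    residues = hidden[1:-1]
    # Split residues into contiguous segments of (almost) equal size.
    size = math.ceil(len(residues) / num_segments)
    segments = [residues[i:i + size] for i in range(0, len(residues), size)]
    return {
        "cls": hidden[0],
        "eos": hidden[-1],
        "mean": mean_vector(residues),
        "segment_mean": [mean_vector(segment) for segment in segments],
    }


# Toy matrix: 4 residues with 2-dimensional embeddings, plus <cls> and <eos> rows.
toy = [[0, 0], [1, 2], [3, 4], [5, 6], [7, 8], [9, 9]]
print(pool_last_hidden(toy, num_segments=2))
```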

The `get_last_hidden_features_combine()` method serves the same purpose as `get_last_hidden_features_single()`, but it is designed to handle multiple batches of input data. It takes protein sequence data `X_input` as input and returns a DataFrame containing the combined features extracted from all batches of protein sequences.
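
The general batching pattern behind this kind of method can be sketched as follows: split the inputs into chunks of `batch_size`, run a per-batch extractor, and concatenate the results in order. The helpers `iter_batches` and `extract_in_batches` and the stand-in extractor are hypothetical; they show the idea only, not the library's implementation.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_in_batches(sequences, extract_batch, batch_size):
    """Apply `extract_batch` to each batch and concatenate the per-sequence rows."""
    rows = []
    for batch in iter_batches(sequences, batch_size):
        rows.extend(extract_batch(batch))
    return rows


# Stand-in extractor: one "feature" per sequence (its length).
lengths = extract_in_batches(["AC", "DEF", "G"], lambda batch: [len(s) for s in batch], 2)
print(lengths)
```

Smaller batches lower peak memory use at the cost of more model calls, which is the usual trade-off when choosing `batch_size`.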

The `get_last_hidden_phosphorylation_position_feature()` method extracts phosphorylation representation features from the input protein sequences. It takes protein sequence data `X_input` and returns a DataFrame containing those features.

The `get_amino_acid_representation()` method calculates representation features for a specific amino acid at a given position in a protein sequence. Its main purpose is to support the characterization of phosphorylation sites.
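
A position-based lookup of this kind might work as sketched below, assuming 1-based residue positions and a token matrix with one extra row at each end (`<cls>` first, `<eos>` last), so that residue `position` sits at row `position`. The function `residue_vector` and its `allowed` parameter are hypothetical and only illustrate the index bookkeeping.

```python
def residue_vector(sequence, hidden, position, allowed="STY"):
    """Return the embedding row for the residue at a 1-based position.

    Assumed layout (illustrative): hidden has len(sequence) + 2 rows, with
    <cls> at row 0 and <eos> at the last row.
    """
    if not 1 <= position <= len(sequence):
        raise IndexError("position is outside the sequence")
    if len(hidden) != len(sequence) + 2:
        raise ValueError("expected one row per residue plus <cls> and <eos>")
    residue = sequence[position - 1]
    if residue not in allowed:
        raise ValueError(f"residue {residue!r} at position {position} is not in {allowed!r}")
    # Row 0 is <cls>, so the 1-based position indexes the matrix directly.
    return hidden[position]


toy_hidden = [[0], [1], [2], [3], [4], [5]]  # <cls>, M, S, T, A, <eos>
print(residue_vector("MSTA", toy_hidden, 2))  # row for the serine at position 2
```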

#### Function `NetPhos_classic_txt_DataFrame()`:

The `NetPhos_classic_txt_DataFrame()` function extracts sequence information from text output produced by NetPhos (https://services.healthtech.dtu.dk/services/NetPhos-3.1/) and returns the extracted data as a DataFrame.
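
The idea of pattern-based extraction can be sketched with the standard library alone. The sample text and its column layout (name, position, residue, context, score, kinase, answer) are assumptions made up for illustration, and `parse_positive_sites` is a hypothetical helper, not the library function; real NetPhos output should be checked before relying on this layout.

```python
import re

# Hypothetical sample in an assumed column layout; not real NetPhos output.
SAMPLE = """\
# seq1     3 S   AASPLKEDD   0.993   unsp   YES
# seq1     7 T   LKETDDAAA   0.412   PKC    .
# seq2    12 Y   GGAYEEKLL   0.871   SRC    YES
"""


def parse_positive_sites(text, pattern=r".*YES"):
    """Collect lines matching `pattern` and split them into named fields."""
    rows = []
    for line in re.findall(pattern, text):
        fields = line.lstrip("# ").split()
        if len(fields) != 7:
            continue  # skip lines that do not fit the assumed layout
        name, position, residue, context, score, kinase, _answer = fields
        rows.append({
            "Sequence": name,
            "position": int(position),
            "residue": residue,
            "context": context,
            "score": float(score),
            "kinase": kinase,
        })
    return rows


for row in parse_positive_sites(SAMPLE):
    print(row)
```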

#### Function `phospho_feature_sim_cosine_weighted_average()`:

The `phospho_feature_sim_cosine_weighted_average()` function calculates the weighted average of phosphorylation features for protein sequences and returns the input DataFrame updated with weighted average values, which provide a characterization of the entire amino acid sequence's phosphorylation pattern.
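
One plausible reading of a cosine-similarity weighted average is sketched below: each phosphorylation-site vector is weighted by its cosine similarity to the protein's 'cls' vector, and the weights are normalized by their sum. This weighting scheme, the fallback to a plain mean when the weights sum to zero, and the helper names `cosine` and `weighted_site_average` are all assumptions for illustration; the library's actual formula may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_site_average(cls_vec, site_vecs):
    """Average site vectors, weighting each by its cosine similarity to cls_vec."""
    if not site_vecs:
        raise ValueError("at least one site vector is required")
    weights = [cosine(cls_vec, site) for site in site_vecs]
    total = sum(weights)
    if abs(total) < 1e-12:
        # Assumed fallback: use a plain mean when the weights cancel out.
        weights = [1.0] * len(site_vecs)
        total = float(len(site_vecs))
    return [
        sum(w * site[d] for w, site in zip(weights, site_vecs)) / total
        for d in range(len(cls_vec))
    ]


# The site aligned with cls dominates; the orthogonal site gets zero weight.
print(weighted_site_average([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```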

#### Example for using ESM2_fr:

```
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, get_last_hidden_features_single, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")

protein_sequence_df = pd.DataFrame({
    'Entry' : ['protein1','protein2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=True, compute_eos=True, compute_mean=True, compute_segments=True)

human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)
```

#### Example for pho feature representation:

```
import os
import protloc_mex_X
from protloc_mex_X.ESM2_fr import NetPhos_classic_txt_DataFrame
import random
import re

example_data = os.path.join(protloc_mex_X.__path__[0], "examples", "test1.txt")
# example_data holds output generated by running NetPhos on protein sequences.

with open(example_data, "r") as f:
    data = f.read()
# print(data)
pattern = r".*YES"

result_df = NetPhos_classic_txt_DataFrame(pattern, data)
result_df.loc[:, 'Entry'] = result_df.loc[:, 'Sequence']

"""
Generate a corresponding sequence at random for each entry. Please note that this is just an example.
In real scenarios, accurate gene sequences should be used because NetPhos only provides 6-bp phosphorylation sites,
which are not complete gene sequences.
Additionally, you need to convert the 'position' column in result_df to an integer type.
"""

# Function to generate random amino acid sequence with a minimum length
def generate_random_sequence(min_length):
    amino_acids = 'ACDEFGHIKLMNPQRSTVWY'  # 20 standard amino acids
    return ''.join(random.choice(amino_acids) for _ in range(min_length))

# Find the max position for each unique Entry
max_positions = result_df['position'].astype(int).groupby(result_df['Entry']).max()

# Generate a sequence for each Entry based on its max position
generated_sequences = {entry: generate_random_sequence(pos) for entry, pos in max_positions.items()}

# Update the 'Sequence' column
def update_sequence(row):
    entry = row['Entry']
    return generated_sequences[entry]

result_df['Sequence'] = result_df.apply(update_sequence, axis=1)


from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D",output_hidden_states=True)

protein_sequence_df = pd.DataFrame({
    'Entry' : ['seq1','seq2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=False, compute_eos=False, 
                                                   compute_mean=False, compute_segments=False)

Netphos_df_represent = feature_extractor.get_last_hidden_phosphorylation_position_feature(result_df, sequence_name='Sequence', phosphorylation_positions='position', batch_size=2)
human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)

Netphos_df_represent.set_index('Entry', inplace=True)

# Extract all column names that match the 'ESM2_clsX' format.
cols = [col for col in human_df_represent.columns if re.match(r'ESM2_cls\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
human_df_represent.set_index('Entry', inplace=True)
human_df_represent_cls = human_df_represent[cols]

# Extract all column names that match the 'ESM2_phospho_posX' format.
pho_cols = [col for col in Netphos_df_represent.columns if re.match(r'ESM2_phospho_pos\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
Netphos_df_represent_pho = Netphos_df_represent[pho_cols]

# Set the feature dimension.
dim = 1280

# The calculation function requires the amino acids of pho to be consistent with those of cls, so 'seq3' is removed from Netphos_df_represent_pho.
Netphos_df_represent_pho = Netphos_df_represent_pho.drop(index='seq3', errors='ignore')

# Return cls and pho_average (which is the result of pho_total).
human_df_represent_cls_pho = phospho_feature_sim_cosine_weighted_average(dim, human_df_represent_cls, Netphos_df_represent_pho)
```

###### Due to the ongoing review process of the articles related to the toolkit, not all information can be fully disclosed at the moment. If you require additional details or have specific inquiries about the toolkit, kindly contact the author, Zeyu Luo <1024226968@qq.com>, for further information. The author will be able to provide more comprehensive and accurate details about the toolkit's functionalities and features. Thank you for your understanding.
