Metadata-Version: 2.1
Name: fselect
Version: 1.0.2
Summary: Feature Selection for Clustering
License: MIT
Author: Billodal Roy
Author-email: billodalroy@gmail.com
Requires-Python: >=3.5,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: numpy (>=1.26.2,<2.0.0)
Requires-Dist: pandas (>=2.1.4,<3.0.0)
Requires-Dist: scikit-learn (>=1.3.0,<2.0.0)
Description-Content-Type: text/markdown

# Feature Selection for Clustering: fselect

A fast, scalable implementation of the A-RANK algorithm proposed
by Dash, M. and Liu, H. in their paper "Feature Selection for Clustering". It ranks the features
of a dataset using an entropy measure and is built on fast Python libraries: numpy, pandas and scikit-learn.

## Getting Started  

Install the package:

```bash
pip install fselect
```

Import the main function:

```python
from fselect import rank_features  
```

Prepare a dataframe with normalized continuous features:  

```python  
import pandas as pd

df = pd.DataFrame({
    'feature1': [...],
    'feature2': [...],
    # ... further feature columns
})
```
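
The README does not say which normalization is expected. As a minimal sketch, assuming min-max scaling to the range [0, 1] is acceptable, the dataframe could be prepared like this. The column names and the synthetic values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic raw data for illustration only; the column names are hypothetical.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=50),
    "weight_kg": rng.normal(70, 15, size=50),
})

# Min-max scaling (an assumption, not a documented requirement of fselect):
# each column is shifted and rescaled so its values lie in [0, 1].
df = (raw - raw.min()) / (raw.max() - raw.min())

print(df.describe().loc[["min", "max"]])
```

Scaling all columns to a common range keeps any single feature from dominating distance-based computations simply because of its units.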

Rank the features:

```python
ranked_df = rank_features(df)  
```

The returned dataframe `ranked_df` contains the columns "rank", "feature" and "entropy", sorted by entropy.

## Usage

The main parameters:  

- `dataframe: pd.DataFrame` - Input dataframe with continuous normalized features
- `remove_correlated_columns: bool` (optional) - Whether to remove highly correlated columns before ranking
- `correlation_threshold: float` (optional) - Correlation threshold to determine correlated columns (default 0.999)

**Remove correlated columns first**

```python
ranked_df = rank_features(df, remove_correlated_columns=True)  
```

**Custom correlation threshold**   

```python
ranked_df = rank_features(df, remove_correlated_columns=True, correlation_threshold=0.95) 
```
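
To illustrate what removing correlated columns can mean in practice, here is a rough sketch in plain pandas. It is not taken from the fselect source: the helper name `drop_correlated_columns`, the use of absolute Pearson correlation, and the rule of keeping the first column of each correlated pair are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated_columns(frame: pd.DataFrame, threshold: float = 0.999) -> pd.DataFrame:
    """Hypothetical helper: drop the later column of every highly correlated pair."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates above the threshold with an earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

demo = pd.DataFrame({
    "a": [0.0, 0.25, 0.5, 0.75, 1.0],
    "b": [0.0, 0.25, 0.5, 0.75, 1.0],  # exact copy of "a"
    "c": [1.0, 0.0, 0.8, 0.1, 0.5],
})
reduced = drop_correlated_columns(demo)
print(list(reduced.columns))  # ['a', 'c']
```

Lowering the threshold, for example to 0.95, removes more near-duplicate columns at the cost of possibly discarding features that carry some independent information.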

## Algorithm  

The entropy calculation follows the equations defined in the A-RANK paper. The package computes a similarity matrix from the dataframe and derives the entropy from it.
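
As a rough sketch of a similarity-based entropy in the spirit of the paper, the function below uses pairwise Euclidean distances, a similarity of the form `exp(-alpha * distance)` with `alpha` chosen so that the mean distance maps to a similarity of 0.5, and a binary-entropy sum over all pairs. The function name `similarity_entropy`, the natural logarithm, and the clipping constant are assumptions made for illustration; the package's actual implementation may differ.

```python
import numpy as np

def similarity_entropy(X) -> float:
    """Hypothetical sketch: entropy of a dataset computed from pairwise similarities."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between rows, via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Use each unordered pair once (strict upper triangle).
    d = dist[np.triu_indices(len(X), k=1)]
    if d.size == 0 or d.mean() == 0:
        return 0.0  # no pairs, or all points identical
    # alpha is set so that the mean distance yields a similarity of 0.5.
    alpha = -np.log(0.5) / d.mean()
    s = np.exp(-alpha * d)
    # Clip to avoid log(0) for identical or very distant points.
    s = np.clip(s, 1e-12, 1 - 1e-12)
    return float(-(s * np.log(s) + (1 - s) * np.log(1 - s)).sum())

# Two tight, well-separated groups versus points scattered at random.
offsets = np.linspace(0.0, 0.01, 10)[:, None]
clustered = np.vstack([offsets * [1, 1], 1.0 + offsets * [1, 1]])
scattered = np.random.default_rng(0).random((20, 2))
print(similarity_entropy(clustered) < similarity_entropy(scattered))
```

Under this measure, data with clear cluster structure produces similarities close to 0 or 1 and therefore lower entropy than data without structure, which is the property a ranking of features can build on.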

