Metadata-Version: 2.1
Name: tfIdfInheritVectorizer
Version: 0.1
Author: Berke Dilekoglu
License: MIT
Keywords: machine-learning tf-idf
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# TFIDFVectorizer

TFIDFVectorizer is a custom implementation of the TF-IDF transformation that uses scikit-learn's TfidfVectorizer as a base. It is written in Python and depends on NumPy, scikit-learn and other commonly used packages.

It aims to provide a simple and efficient way to turn a collection of text documents into a numeric matrix that machine learning algorithms can take as input.

## Installation

The package can be installed using pip:

```bash
pip install tfIdfInheritVectorizer
```

## Usage

To use TFIDFVectorizer, create an instance of the class and call its fit_transform method. The method takes a list of text documents and returns a sparse matrix of TF-IDF scores, with one row per document and one column per term in the learned vocabulary.

```python
from tfIdfInheritVectorizer.feature_extraction.vectorizer import TFIDFVectorizer


text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

vectorizer = TFIDFVectorizer()
tfidf = vectorizer.fit_transform(text_data)
```
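
To make the computation concrete, here is a minimal pure-Python sketch of TF-IDF using scikit-learn's default formulas (raw term counts, smoothed IDF, L2 row normalization). It assumes this package keeps those defaults. The helper names `tokenize` and `tfidf_fit_transform` are hypothetical and not part of the package, and the result is a dense list of lists rather than a sparse matrix.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring scikit-learn's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_fit_transform(documents):
    # Hypothetical helper: returns (vocabulary, dense TF-IDF rows).
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize each row; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vocabulary, matrix = tfidf_fit_transform(text_data)
print(vocabulary)
print([round(w, 3) for w in matrix[0]])
```

Terms that occur in fewer documents, such as "first", receive a higher weight than terms that occur in every document, such as "this".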

TFIDFVectorizer also has a transform method. Once the vectorizer has been fitted to the training data, transform converts new text data into a matrix using the vocabulary and IDF weights learned during fitting.

```python
new_text_data = [
    "This is a new document.",
    "Is this a new one?"
]

new_tfidf = vectorizer.transform(new_text_data)
```
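
The sketch below shows what a fixed vocabulary means in practice. It uses scikit-learn's own `TfidfVectorizer` as a stand-in, on the assumption that this package's class behaves the same way because it is built on it. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "This is the first document.",
    "And this is the third one.",
]
new_docs = [
    "This is a new document.",
    "Is this a new one?",
]

# Stand-in for TFIDFVectorizer (assumed to behave like its base class).
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_docs)
new_tfidf = vectorizer.transform(new_docs)

# Both matrices share the same columns: the vocabulary learned by fit.
print(train_tfidf.shape, new_tfidf.shape)

# Words never seen during fitting are ignored by transform.
print("new" in vectorizer.vocabulary_)
```

Because the columns stay the same, a model trained on the first matrix can be applied directly to the second.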

## Configuration

The TFIDFVectorizer has several parameters that can be configured to customize its behavior. Some of the most important parameters are:

- stop_words: a list of stop words to remove from the tokens, so that they never enter the vocabulary.

```python
vectorizer = TFIDFVectorizer(stop_words=["is", "the", "this"])
```

- max_features: the maximum number of features to keep. Only the terms with the highest frequency across the entire corpus are retained.

```python
vectorizer = TFIDFVectorizer(max_features=50)
```

- use_idf: a flag indicating whether to use the inverse document frequency (IDF) weighting.

```python
vectorizer = TFIDFVectorizer(use_idf=False)
```
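
To see what these three parameters do, the following sketch applies each one to a small corpus. It again uses scikit-learn's `TfidfVectorizer` as a stand-in and assumes that this package passes the parameters through unchanged.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# stop_words: the listed words never enter the vocabulary.
no_stop = TfidfVectorizer(stop_words=["is", "the", "this"]).fit(docs)
print(sorted(no_stop.vocabulary_))

# max_features: only the most frequent terms across the corpus are kept.
top3 = TfidfVectorizer(max_features=3).fit(docs)
print(sorted(top3.vocabulary_))

# use_idf=False: weights are L2-normalized term counts only, so every
# term that appears once in a document gets the same weight in that row.
tf_only = TfidfVectorizer(use_idf=False).fit_transform(docs)
print(tf_only[0].data)
```

With `use_idf=False`, rare and common terms are no longer distinguished, which can be a useful baseline when comparing feature sets.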

For a full list of parameters, see the [scikit-learn documentation for TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

## Conclusion

TFIDFVectorizer is a simple and efficient implementation of the TF-IDF transformation, suitable for preparing text features for machine learning models. Because it uses scikit-learn as a base, it offers the same range of customization options and can be easily integrated into existing machine learning workflows.

## License

[MIT](https://choosealicense.com/licenses/mit/)
