Metadata-Version: 2.1
Name: cleanmydata
Version: 0.1.0
Summary: A data cleaning library for text processing
Home-page: https://github.com/pranavnbapat/cleanmydata
Author: Pranav Bapat
Author-email: pranav.g33k@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: ==3.12
Description-Content-Type: text/markdown
Requires-Dist: numpy==2.0.2
Requires-Dist: pandas==2.2.3
Requires-Dist: spacy==3.8.2
Requires-Dist: langdetect==1.0.9
Requires-Dist: bs4==0.0.2
Requires-Dist: lxml==5.3.0

## cleanmydata

This library provides common functions for cleaning text data.

It takes a list of data cleaning operations and either a string or a pandas dataframe as input.

Functions:
1) Remove new lines
2) Remove emails
3) Remove URLs 
4) Remove hashtags (#hashtag)
5) Remove the string if it contains only numbers
6) Remove mentions (@user)
7) Remove retweets (RT...)
8) Remove text between the square brackets [ ]
9) Replace multiple whitespaces with a single whitespace
10) Replace characters repeated more than twice with a single occurrence
11) Remove emojis
12) Count characters (only for dataframe; creates a new column)
13) Count words (only for dataframe; creates a new column)
14) Calculate average word length (only for dataframe; creates a new column)
15) Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
16) Detect language (uses <a href="https://pypi.org/project/fasttext-langdetect/">fasttext-langdetect</a>) (only for dataframe; creates two new columns, lang and lang_prob)
17) Detect language (uses <a href="https://pypi.org/project/fasttext-langdetect/">fasttext-langdetect</a>) (only for dataframe; creates just one column with language and probability; takes less time)
18) Remove HTML tags

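As a rough illustration of what a few of these operations do to a string, here is a minimal sketch using only Python's standard `re` module. The function names and patterns are assumptions made for this example; they are not taken from the library and its own implementation may differ.

```python
import re


# Hypothetical helpers, written for illustration only.
def remove_urls(text: str) -> str:
    # Drop anything starting with http://, https:// or www.
    return re.sub(r"https?://\S+|www\.\S+", "", text)


def remove_mentions(text: str) -> str:
    # Drop @user style mentions.
    return re.sub(r"@\w+", "", text)


def remove_bracketed(text: str) -> str:
    # Drop text between square brackets, including the brackets.
    return re.sub(r"\[[^\]]*\]", "", text)


def collapse_whitespace(text: str) -> str:
    # Replace runs of whitespace with one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "Thanks @user [edited]   see https://example.com"
cleaned = collapse_whitespace(
    remove_bracketed(remove_mentions(remove_urls(sample)))
)
print(cleaned)  # Thanks see
```

Collapsing whitespace last is a deliberate choice in this sketch, since the other removals leave gaps behind.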

## How to install?
<code>pip install cleanmydata</code>


## Parameters
<ol>
   <li>lst (list) - List of data cleaning operations</li>
   <li>data (string or dataframe) - Data to be passed</li>
   <li>column (string) - Dataframe column to operate on; only used when data is a dataframe</li>
   <li>save (boolean) - Whether to save the results to a new file</li>
   <li>name (string) - Name of the new file if save is True</li>
</ol>
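
One way to picture how the numbers in `lst` could select operations is a lookup table from operation number to function. The sketch below is hypothetical: `OPERATIONS` and `apply_operations` are names invented for this example, only two operations are included, and applying them in list order is an assumption rather than documented behaviour of `clean_data`.

```python
import re
from typing import Callable, Dict, List

# Hypothetical table covering only operations 2 and 4 from the list above.
# The regular expressions are illustrative, not the library's own.
OPERATIONS: Dict[int, Callable[[str], str]] = {
    2: lambda text: re.sub(r"\S+@\S+\.\S+", "", text),  # remove emails
    4: lambda text: re.sub(r"#\w+", "", text),          # remove hashtags
}


def apply_operations(lst: List[int], data: str) -> str:
    """Apply each requested operation to the string, in the order given."""
    for code in lst:
        if code not in OPERATIONS:
            raise ValueError(f"Unknown operation number: {code}")
        data = OPERATIONS[code](data)
    return data


result = apply_operations([2, 4], "Hello folks. abc@example.com #hashtag")
print(result.strip())  # Hello folks.
```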

## Usage
1) Import the library
   <br><code>from cleanmydata.functions import *</code>
2) Call the clean_data function with the parameters you need.
3) By default, if a dataframe is passed, all NA values are dropped (dropna).

## Examples
1) To remove emails and hashtags<br>
   <code>mydata = "Hello folks. abc@example.com #hashtag"</code>
   <br>
   <code>mydata = clean_data(lst=[2, 4], data=mydata)</code>
   <br>
   <code>print(mydata)</code>
2) To count stopwords, remove mentions and URLs, and save the result to a file, starting from a dataframe<br>
   <code>import pandas as pd</code>
   <br>
   <code>df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')</code>
   <br>
   <code>df = clean_data(lst=[15, 6, 3], data=df, column='comments', save=True, name='my custom file name')</code>


## Other notes
If you use the stopword count, make sure the spaCy model en_core_web_sm is installed. <br>
<code>python -m spacy download en_core_web_sm</code>
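
Because spaCy models are installed as regular Python packages, you can check for the model before running a stopword operation. The following is a minimal sketch using the standard library; `is_model_installed` is a helper name made up for this example and is not part of cleanmydata.

```python
import importlib.util


def is_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if a package with this name can be found for import."""
    return importlib.util.find_spec(model_name) is not None


if not is_model_installed():
    print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```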


### More options and enhancements coming soon...
