Metadata-Version: 2.1
Name: clean-df
Version: 0.3.0
Summary: Python module to report, clean, and optimize Pandas Dataframes effectively
Home-page: https://github.com/naelaqel/clean_df
Author: Nael Aqel
Author-email: dev@naelaqel.com
License: MIT license
Keywords: clean_df,cleaning,data analysis,data science,wrangling,reporting,optimization,outliers,missing
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.7
License-File: LICENSE
License-File: AUTHORS.rst

========
clean_df
========

.. image:: https://img.shields.io/pypi/v/clean_df.svg
        :target: https://pypi.python.org/pypi/clean_df

.. image:: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml/badge.svg
   :target: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml

.. image:: https://readthedocs.org/projects/clean-df/badge/?version=latest
        :target: https://clean-df.readthedocs.io/en/latest/?version=latest
        :alt: Documentation Status

.. image:: https://img.shields.io/pypi/l/clean_df.svg
   :target: https://github.com/NaelAqel/clean_df/blob/main/LICENSE  
  
  
  
Python module to report, clean, and optimize **Pandas Dataframes** effectively.

**Full Documentation** `Here`_.

.. _Here: https://naelaqel.com/clean_df/
  
Description and Features
------------------------
The first step of any data analysis project is to check and clean the data, in this module I implemented a very effiecint code that can:  

* Automatically drop columns that have a unique value (these columns are useless, so it will be dropped).
* Report your **Pandas DataFrame** to decide for actions, this report will show:  

  #. Duplicated rows report.
  #. Columnsâ€™ Datatype to optimize memory report.
  #. Columns to convert to categorical report.
  #. Outliers report.
  #. Missing values report.


* Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows in the dataframe.

* Optimize the dataframe by converting columns to the desired data type and converting categorical columns to 'category' data type.

Installation
------------
To install ``clean_df``, run this command in your terminal:: 

    $ pip install clean_df

For more information on installation details for this project, please see the ``docs/installation.rst`` file.


    
Usage
-----
This module is very easy to use, for a full detailed example please see the ``docs/usage.rst`` file.

Importing the module
^^^^^^^^^^^^^^^^^^^^
::

        from clean_df import CleanDataFrame   

Defining the class
^^^^^^^^^^^^^^^^^^
Pass your pandas dataframe to ``CleanDataFrame`` class::

        cdf = CleanDataFrame(
                df=df,             # the dataframe to be cleaned
                max_num_cat=5      # maximum number of unique values in a column to be 
                )                  # converted to categorical datatype, default is 5

Reporting
^^^^^^^^^
Call ``report`` method to see a full report about the dataframe (duplications, columns to optimize its data types, categorical columns, outliers, and missing values::

        cdf.report(
                show_matrix=True,   # show matrix missing values (from missingno package), default is True
                show_heat=True,     # show heat missing values (from missingno package), default is True
                matrix_kws={},      # if need to pass any arguments to matrix plot, default is {}
                heat_kws={}         # if need to pass any arguments to heat plot, default is {}
                )

Cleaning
^^^^^^^^
Call ``clean`` method to drop high number of missing value columns, duplicated rows, and rows with missing values::

        cdf.clean(
                min_missing_ratio=0.05,    # the minimum ratio of missing values to drop a column, default is 0.05
                drop_nan=True              # if True, drop the rows with missing values after dropping columns 
                                           # with missingsa above min_missing_ratio
                drop_kws={},               # if need to pass any arguments to pd.DataFrame.drop(), default is {}
                drop_duplicates_kws={}     # same drop_kws, but for drop_duplicates function
                )

Optimizing
^^^^^^^^^^
Call ``optimize`` method to optimize the dataframe by changing columns' data types based on its values for maximum memory savings::

        cdf.optimize()


Accessing the Cleaned Data DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

        cdf.df 


  
Contributing
------------
See the ``CONTRIBUTING.rst`` for contribution details. Feel free to contact me for any subject through my:  

* `Email`_
* `LinkedIn`_
* `WhatsApp`_

Also, you are welcomed to visit my personal `blog`_ .

.. _Email: mailto:dev@naelaqel.com
.. _LinkedIn: https://www.linkedin.com/in/naelaqel1
.. _WhatsApp: https://wa.me/962796780232
.. _blog: https://naelaqel.com

   

License
-------
Free software: MIT license.

    

Documentation
-------------
* The full documentation is hosted on my `website`_, and on `ReadTheDocs`_.
* The source code is available in `GitHub`_.

.. _website: https://naelaqel.com/clean_df/
.. _ReadTheDocs: https://clean_df.readthedocs.io
.. _GitHub: https://github.com/naelaqel/clean_df

    
    
Credits
-------
* This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.  
* Here are `additional`_ resources I got a lot from them.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage
.. _`additional`: https://naelaqel.com/resources/


=======
History
=======
0.3.0 (2023-08-23)
------------------
* Improve the performance when calling ``report`` method.
* The ``pytest`` now is including the full methods in the module. 

0.2.3 (2023-03-04)
------------------
* Improve memory consumption and module performance.

0.2.2 (2023-03-03)
------------------
* Fix a bug that made "dict_keys" error in some speical cases.

0.2.1 (2023-03-03)
------------------
* Improve module performance.

0.2.0 (2023-03-02)
------------------
* Add a new report for categorical columns.
* Make the module more efficient.

0.1.1 (2023-02-27)
------------------
* Rectify and organize documentation.
* Provide test to GitHub Actions.

0.1.0 (2023-02-27)
------------------

* First release on PyPI.
