Metadata-Version: 2.1
Name: light-famd
Version: 0.0.3
Summary: Light Factor Analysis of Mixed Data
Home-page: https://github.com/Cauchemare/Light_FAMD
Author: telescopes
Author-email: luyaoli88@gmail.com
License: UNKNOWN
Description: 
        # Light_FAMD
        
        `Light_FAMD` is a library for performing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). It covers several related methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient, lightweight implementation of each algorithm behind a scikit-learn API.
        
        ## Table of contents
        
        - [Usage](#usage)
          - [Guidelines](#guidelines)
          - [Principal component analysis (PCA)](#principal-component-analysis-pca)
          - [Correspondence analysis (CA)](#correspondence-analysis-ca)
          - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
          - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
          - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
        - [Going faster](#going-faster)
        
        
        
        
        `Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`) which are included with Anaconda.
        
        
        
        ### Guidelines
        
        Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means you can directly use the `fit_transform`, `set_params` and `get_params` methods.
         
        Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition, using randomization to speed up the computations. It is particularly fast on large matrices from which you only want to extract a small number of components. For a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. Because the method is randomized, set the `random_state` parameter if you need reproducible results.
        
        The randomised version of SVD is an iterative method. Because each of `Light_FAMD`'s algorithms uses SVD, they all have an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (which is the default behaviour) is recommended.
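        
        To make the roles of `n_iter` and `random_state` concrete, here is a minimal NumPy sketch of a randomized truncated SVD in the spirit of the "Finding structure with randomness" paper listed further down. It is an illustration only, not the code `Light_FAMD` or scikit-learn runs; the function name `randomized_svd_sketch` and the `n_oversamples` argument are assumptions made for this example.
        
        ```python
        import numpy as np
        
        
        def randomized_svd_sketch(A, n_components, n_iter=2, n_oversamples=10, random_state=None):
            """Approximate truncated SVD via a randomized range finder (illustrative only)."""
            rng = np.random.default_rng(random_state)
            n_random = n_components + n_oversamples
            # Sample the range of A with a random Gaussian test matrix.
            Q = A @ rng.standard_normal((A.shape[1], n_random))
            # Each power iteration sharpens the sampled subspace: more precision, more time.
            for _step in range(n_iter):
                Q, _r = np.linalg.qr(A @ (A.T @ Q))
            Q, _r = np.linalg.qr(Q)
            # Solve the small projected problem exactly, then map back to the original space.
            U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
            U = Q @ U_small
            return U[:, :n_components], s[:n_components], Vt[:n_components]
        
        
        rng = np.random.default_rng(0)
        A = rng.standard_normal((200, 50))
        exact = np.linalg.svd(A, compute_uv=False)[:3]
        for n_iter in (0, 2, 5):
            _, s, _ = randomized_svd_sketch(A, n_components=3, n_iter=n_iter, random_state=42)
            # Largest absolute error on the three leading singular values.
            print(n_iter, np.abs(s - exact).max())
        ```
        
        Calling the function twice with the same `random_state` draws the same test matrix and therefore returns the same decomposition, which is the reason for fixing the seed when results must be reproducible.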
        
        The classes in this package are related by inheritance as follows (`A -> B` means that A is the superclass of B):
        
        - PCA -> MFA -> FAMD
        - CA -> MCA
        
        Choose the method that matches your data:
        
        - All your variables are numeric: use principal component analysis (`PCA`)
        - You have a contingency table: use correspondence analysis (`CA`)
        - You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)
        - You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
        - You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)
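        
        The decision rules above can be written down as a small function. This is only a sketch: `suggest_method` and its arguments are hypothetical names introduced for illustration and are not part of `Light_FAMD`.
        
        ```python
        def suggest_method(kinds, contingency_table=False, grouped=False):
            """Return the estimator name suggested by the guidelines (hypothetical helper).
        
            `kinds` holds one entry per variable, either "numeric" or "categorical".
            """
            if contingency_table:
                return "CA"
            if grouped:
                return "MFA"
            unique_kinds = set(kinds)
            if unique_kinds == {"numeric"}:
                return "PCA"
            if unique_kinds == {"categorical"}:
                # The guidelines only cover the case of more than two variables.
                return "MCA" if len(kinds) > 2 else None
            if unique_kinds == {"numeric", "categorical"}:
                return "FAMD"
            return None
        
        
        print(suggest_method(["numeric"] * 3))                        # PCA
        print(suggest_method(["categorical"] * 4))                    # MCA
        print(suggest_method(["numeric", "categorical", "numeric"]))  # FAMD
        ```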
        
        The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
        
        - [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
        - [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
        - [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
        - [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
        - [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
        - [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)
        
        Note that `Light_FAMD` doesn't support sparse input. See [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and large data.
        
        
        ###	Principal-Component-Analysis: PCA
        
        **PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3,
                         copy=True, check_input=True, random_state=None, engine='auto'):
        	
        **Args:**
        - `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
        - `rescale_with_std` (bool): Whether to divide each column by its standard deviation or not.
        - `n_components` (int): The number of principal components to compute.
        - `n_iter` (int): The number of iterations used for computing the SVD.
        - `copy` (bool): Whether to work on a copy of the input; if `False`, the computations are performed in place.
        - `check_input` (bool): Whether to check the consistency of the inputs or not.
        - `engine` (string): The SVD backend. `"auto"` uses scikit-learn's `randomized_svd`; `"fbpca"` uses Facebook's randomized SVD implementation.
        - `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator. If int, `random_state` is the seed used by the random number generator; if RandomState instance, `random_state` is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.
        
        Returns an ndarray of shape (M, K), where M is the number of samples and K is the number of components.
        
        **Examples:**
        ```
        >>> import numpy as np
        >>> import pandas as pd
        >>> np.random.seed(42)  # This is for doctests reproducibility
        
        >>> from light_famd import PCA
        >>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
        >>> pca = PCA(n_components=2)
        >>> pca.fit(X)
        PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
          random_state=None, rescale_with_mean=True, rescale_with_std=True)
        
        >>>print(pca.explained_variance_)
        [20.20385109  8.48246239]
        
        >>>print(pca.explained_variance_ratio_)
        [0.6734617029875277, 0.28274874633810754]
        >>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; NaN where the p-value is >= 0.05
                  0        1
        A -0.953482      NaN
        B  0.907314      NaN
        C       NaN  0.84211
        
        >>>print(pca.transform(X))
        [[-0.82262005  0.11730656]
         [ 0.05359079  1.62298683]
         [ 1.03052849  0.79973099]
         [-0.24313366  0.25651395]
         [-0.94630387 -1.04943025]
         [-0.70591749 -0.01282583]
         [-0.39948373 -1.52612436]
         [ 2.70164194  0.38048482]
         [-2.49373351  0.53655273]
         [ 1.8254311  -1.12519545]]
        >>>print(pca.fit_transform(X))
        [[-0.82262005  0.11730656]
         [ 0.05359079  1.62298683]
         [ 1.03052849  0.79973099]
         [-0.24313366  0.25651395]
         [-0.94630387 -1.04943025]
         [-0.70591749 -0.01282583]
         [-0.39948373 -1.52612436]
         [ 2.70164194  0.38048482]
         [-2.49373351  0.53655273]
         [ 1.8254311  -1.12519545]]
        
        ```
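        
        For intuition about `rescale_with_mean`, `rescale_with_std` and the explained-variance figures, the following NumPy sketch runs a plain PCA by hand. It is a textbook formulation under stated assumptions (population standard deviation, exact SVD instead of the randomized one), not the library's code. The ratios printed in the example above are consistent with this reading: 20.20 / 30 ≈ 0.673, where 30 = 10 rows × 3 standardized columns.
        
        ```python
        import numpy as np
        
        rng = np.random.default_rng(0)
        X = rng.integers(0, 10, size=(10, 3)).astype(float)
        
        # Centre each column and divide by its standard deviation
        # (population form, ddof=0, which is an assumption of this sketch).
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        
        # An exact SVD stands in for the randomized one on such a small matrix.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        
        n_components = 2
        explained = s[:n_components] ** 2                  # inertia carried by each component
        ratio = explained / (s ** 2).sum()                 # share of the total inertia
        coords = U[:, :n_components] * s[:n_components]    # row coordinates, shape (10, 2)
        
        print(explained, ratio)
        print((s ** 2).sum())  # total inertia: rows x columns for standardized data
        ```
        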
        ###	Correspondence-Analysis: CA
        
        **CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None,
                         engine='auto'):
        	
        **Args:**
        - `n_components` (int): The number of principal components to compute.
        - `n_iter` (int): The number of iterations used for computing the SVD.
        - `copy` (bool): Whether to work on a copy of the input; if `False`, the computations are performed in place.
        - `check_input` (bool): Whether to check the consistency of the inputs or not.
        - `engine` (string): The SVD backend. `"auto"` uses scikit-learn's `randomized_svd`; `"fbpca"` uses Facebook's randomized SVD implementation.
        - `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator. If int, `random_state` is the seed used by the random number generator; if RandomState instance, `random_state` is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.
        
        Returns an ndarray of shape (M, K), where M is the number of samples and K is the number of components.
        
        **Examples:**
        ```
        >>> import numpy as np
        >>> import pandas as pd
        >>> from light_famd import CA
        >>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
        >>> ca = CA(n_components=2, n_iter=2)
        >>> ca.fit(X)
        CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
          random_state=None)
        
        >>> print(ca.explained_variance_)
        [0.16892141 0.0746376 ]
        
        >>>print(ca.explained_variance_ratio_)
        [0.5650580210934917, 0.2496697790527281]
        
        >>>print(ca.transform(X))
        [[ 0.23150854 -0.39167802]
         [ 0.36006095  0.00301414]
         [-0.48192602 -0.13002647]
         [-0.06333533 -0.21475652]
         [-0.16438708 -0.10418312]
         [-0.38129126 -0.16515196]
         [ 0.2721296   0.46923757]
         [ 0.82953753  0.20638333]
         [-0.500007    0.36897935]
         [ 0.57932474 -0.1023383 ]]
        
        >>>print(ca.fit_transform(X))
        [[ 0.23150854 -0.39167802]
         [ 0.36006095  0.00301414]
         [-0.48192602 -0.13002647]
         [-0.06333533 -0.21475652]
         [-0.16438708 -0.10418312]
         [-0.38129126 -0.16515196]
         [ 0.2721296   0.46923757]
         [ 0.82953753  0.20638333]
         [-0.500007    0.36897935]
         [ 0.57932474 -0.1023383 ]]
        ```
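        
        As a rough guide to what correspondence analysis computes, here is the standard textbook formulation in NumPy: an SVD of the standardized residuals of a contingency table. It is a sketch with an exact SVD and may differ in detail from the code path `CA` actually takes; all variable names are chosen for this example.
        
        ```python
        import numpy as np
        
        rng = np.random.default_rng(0)
        N = rng.integers(1, 100, size=(10, 4)).astype(float)  # contingency table of counts
        
        P = N / N.sum()        # correspondence matrix
        r = P.sum(axis=1)      # row masses
        c = P.sum(axis=0)      # column masses
        
        # Standardized residuals: departure from independence, scaled by the masses.
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        
        n_components = 2
        inertia = s[:n_components] ** 2
        ratio = inertia / (s ** 2).sum()
        # Principal coordinates of the rows.
        row_coords = (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]
        
        print(inertia, ratio)
        print(row_coords.shape)
        ```
        
        In this formulation the total inertia, `(s ** 2).sum()`, equals the chi-square statistic of the table divided by the grand total of the counts.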
        
        ###	Multiple-Correspondence-Analysis: MCA
        The `MCA` class inherits from the `CA` class.
        
        ```
        >>> import numpy as np
        >>> import pandas as pd
        >>> from light_famd import MCA
        >>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
        >>>print(X)
              A  B  C  D
        0  d  e  a  d
        1  e  d  b  b
        2  e  d  a  e
        3  b  b  e  d
        4  b  d  b  b
        5  c  b  a  e
        6  e  d  b  a
        7  d  c  d  d
        8  b  c  d  a
        9  a  e  c  c
        >>>mca=MCA(n_components=2)
        >>>mca.fit(X)
        MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
          random_state=None)
        
        >>>print(mca.explained_variance_)
        [0.90150495 0.76979456]
        
        >>>print(mca.explained_variance_ratio_)
        [0.24040131974598467, 0.20527854948955893]
        
        >>>print(mca.transform(X)) 
        [[ 0.55603013  0.7016272 ]
         [-0.73558629 -1.17559462]
         [-0.44972794 -0.4973024 ]
         [-0.16248444  0.95706908]
         [-0.66969377 -0.79951057]
         [-0.21267777  0.39953562]
         [-0.67921667 -0.8707747 ]
         [ 0.05058625  1.34573057]
         [-0.31952341  0.77285922]
         [ 2.62229391 -0.83363941]]
        
        >>>print(mca.fit_transform(X)) 
        [[ 0.55603013  0.7016272 ]
         [-0.73558629 -1.17559462]
         [-0.44972794 -0.4973024 ]
         [-0.16248444  0.95706908]
         [-0.66969377 -0.79951057]
         [-0.21267777  0.39953562]
         [-0.67921667 -0.8707747 ]
         [ 0.05058625  1.34573057]
         [-0.31952341  0.77285922]
         [ 2.62229391 -0.83363941]]
        
        ```
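        
        Since `MCA` builds on `CA`, one way to picture it is as a correspondence analysis of the one-hot (indicator) matrix of the categorical variables. The NumPy sketch below follows that textbook view and is not taken from the library's source. With K variables and J distinct categories in total, the total inertia of the indicator matrix is J / K - 1, which matches the example above: 0.9015 / 0.2404 ≈ 3.75 = 19 / 4 - 1.
        
        ```python
        import numpy as np
        
        rng = np.random.default_rng(0)
        X = rng.choice(list("abcde"), size=(10, 4))   # 10 rows, 4 categorical variables
        
        # One-hot encode each variable; np.unique only returns categories that occur.
        blocks = []
        for j in range(X.shape[1]):
            categories = np.unique(X[:, j])
            blocks.append((X[:, j][:, None] == categories[None, :]).astype(float))
        Z = np.hstack(blocks)                          # indicator matrix, shape (10, J)
        
        # Correspondence analysis of the indicator matrix.
        P = Z / Z.sum()
        r, c = P.sum(axis=1), P.sum(axis=0)
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
        s = np.linalg.svd(S, compute_uv=False)
        
        K, J = X.shape[1], Z.shape[1]
        print((s ** 2).sum(), J / K - 1)               # the two totals agree
        ```
        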
        ###	Multiple-Factor-Analysis: MFA
        The `MFA` class inherits from the `PCA` class.
        `FAMD` in turn inherits from `MFA`, and the only thing it does beyond its superclass is determine the `groups` parameter. This section is therefore skipped in favour of `FAMD`.
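        
        One plausible reading of "determining `groups`" is that the columns are split by type into a numerical group and a categorical group before being handed to the multiple factor analysis. The pure-Python sketch below illustrates that idea; the helper `split_groups` and the group labels are hypothetical and are not claimed to match the library's internals.
        
        ```python
        from numbers import Number
        
        
        def split_groups(columns):
            """Partition column names into numerical and categorical groups.
        
            `columns` maps a column name to its list of values. The helper and the
            group labels are hypothetical, not light_famd API.
            """
            groups = {"Numerical": [], "Categorical": []}
            for name, values in columns.items():
                is_numeric = all(
                    isinstance(v, Number) and not isinstance(v, bool) for v in values
                )
                groups["Numerical" if is_numeric else "Categorical"].append(name)
            return groups
        
        
        table = {"A": [96, 11, 0], "B": [19, 46, 89], "C": ["b", "b", "a"], "D": ["d", "d", "a"]}
        print(split_groups(table))
        # {'Numerical': ['A', 'B'], 'Categorical': ['C', 'D']}
        ```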
        
        
        ###	Factor-Analysis-of-Mixed-Data: FAMD
        The `FAMD` class inherits from the `MFA` class, which means you have access to all the methods and properties of `MFA`.
        ```
        >>> import numpy as np
        >>> import pandas as pd
        >>> from light_famd import FAMD
        >>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
        >>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
        >>> X = pd.concat([X_n, X_c], axis=1)
        >>>print(X)
                A   B  C  D  E  F
        0  96  19  b  d  b  e
        1  11  46  b  d  a  e
        2   0  89  a  a  a  c
        3  13  63  c  a  e  d
        4  37  36  d  b  e  c
        5  10  99  a  b  d  c
        6  76   2  c  a  d  e
        7  32   5  c  a  e  d
        8  49   9  c  e  e  e
        9   4  22  c  c  b  d
        
        >>>famd = FAMD(n_components=2)
        >>>famd.fit(X)
        MCA PROCESS MCA PROCESS ELIMINATED 0  COLUMNS SINCE THEIR MISS_RATES >= 99%
        Out:
        FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
             random_state=None)
        
        >>>print(famd.explained_variance_)
        [17.40871219  9.73440949]
        
        >>>print(famd.explained_variance_ratio_)
        [0.32596621039327284, 0.1822701494502082]
        
        >>> print(famd.column_correlation(X))
                     0         1
        A         NaN       NaN
        B         NaN       NaN
        C_a       NaN       NaN
        C_b       NaN  0.824458
        C_c  0.922220       NaN
        C_d       NaN       NaN
        D_a       NaN       NaN
        D_b       NaN       NaN
        D_c       NaN       NaN
        D_d       NaN  0.824458
        D_e       NaN       NaN
        E_a       NaN       NaN
        E_b       NaN       NaN
        E_d       NaN       NaN
        E_e       NaN       NaN
        F_c       NaN -0.714447
        F_d  0.673375       NaN
        F_e       NaN  0.839324
        
        
        
        >>>print(famd.transform(X)) 
        [[ 2.23848136  5.75809647]
         [ 2.0845175   4.78930072]
         [ 2.6682068  -2.78991262]
         [ 6.2962962  -1.57451325]
         [ 2.52140085 -3.28279729]
         [ 1.58256681 -3.73135011]
         [ 5.19476759  1.18333717]
         [ 6.35288446 -1.33186723]
         [ 5.02971134  1.6216402 ]
         [ 4.05754963  0.69620997]]
        
        >>>print(famd.fit_transform(X))
        MCA PROCESS HAVE ELIMINATE 0  COLUMNS SINCE ITS MISSING RATE >= 99%
        [[ 2.23848136  5.75809647]
         [ 2.0845175   4.78930072]
         [ 2.6682068  -2.78991262]
         [ 6.2962962  -1.57451325]
         [ 2.52140085 -3.28279729]
         [ 1.58256681 -3.73135011]
         [ 5.19476759  1.18333717]
         [ 6.35288446 -1.33186723]
         [ 5.02971134  1.6216402 ]
         [ 4.05754963  0.69620997]]
        
        ```
        
        
        
        
        ## Going faster
        
        By default `Light_FAMD` uses `sklearn`'s randomized SVD implementation. One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend. For the time being, the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`. Alternatively, see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD), which selects the SVD solver automatically depending on the structure of the input:
        
        ```python
        >>> import light_famd
        >>> pca = light_famd.PCA(engine='fbpca')
        
        ```
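        
        To show what a pluggable SVD backend can look like, here is a hedged sketch of a dispatcher keyed on `engine`. The function `compute_svd` is a hypothetical name, not necessarily how `Light_FAMD` is organised. It relies on two real calls, `sklearn.utils.extmath.randomized_svd` and `fbpca.pca`, both imported lazily so that `fbpca` stays optional.
        
        ```python
        def compute_svd(X, n_components, n_iter, random_state=None, engine="auto"):
            """Dispatch a truncated SVD to the requested backend (hypothetical helper)."""
            if engine == "auto":
                # Per the parameter description, "auto" maps to scikit-learn's randomized SVD.
                from sklearn.utils.extmath import randomized_svd
                return randomized_svd(X, n_components=n_components, n_iter=n_iter,
                                      random_state=random_state)
            if engine == "fbpca":
                import fbpca  # optional dependency, imported only when requested
                # raw=True skips fbpca's own centering, so this is a plain SVD of X.
                return fbpca.pca(X, k=n_components, n_iter=n_iter, raw=True)
            raise ValueError(f"unknown engine: {engine!r}")
        ```
        
        Both branches return the triple `(U, s, Vt)`, so the caller does not need to know which backend produced it.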
        
        
Keywords: famd,factor analysis
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License 
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Science/Research
Description-Content-Type: text/markdown
