Metadata-Version: 2.1
Name: csr-utils
Version: 0.1.3
Summary: Utility functions for scalable handling of CSR matrices
Home-page: https://github.com/narges-rzv/csr_utils
Author: Narges Razavian
Author-email: nsr3@nyu.edu
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: scipy
Requires-Dist: numpy

# csr_utils

Scalable Operations for CSR matrices.

[![Build Status](https://travis-ci.org/narges-rzv/csr_utils.svg?branch=master)](https://travis-ci.org/narges-rzv/csr_utils)

Installation 
------------
For general users and if using conda etc.:

`pip install csr_utils` 


Without root access:

`pip install --user csr_utils`

Usage
-----

```
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> xcsr = csr_matrix(np.array([[1, 0], [3, 4], [2, 2]], dtype=float))

>>> import csr_utils
>>> xnorm, xmean, xstd, xixnormed = normalize_csr_matrix(xcsr)

```


Overview
--------
This package currently only has a fast and memory efficient implementation for normalizing nonzero values of a CSR array without un-sparsifying the function. This is useful step for machine learning on large matrices. Most algorithms work better with normalized input, in particular the commonly used [linear classification models in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 

There will be more functions added as the need arises, including turning a csr array into cuda sparse array directly. Stay tuned. 

normalize_csr_matrix:
---------------------
- Normalizes a CSR matrix only based on non-zero values, without turning it into dense array. 
   - In the CSR matrix, rows correspond to samples, and columns correspond to features.
   - Normalization will be such that each column (feature)'s non-zero values will have mean of 0.0 and standard deviation of 1.0.
- Will return the scalable equivalent of x = x.toarray(); x[(x==0)] = np.nan; (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
- We compute a faster and equivalent definition of standard deviation:
   - ```sigma = SquareRoot(ExpectedValue(|X - mean|^2)) # slow```
   - ```sigma = SquareRoot(ExpectedValue(X^2) - ExpectedValue(X)^2) # fast```
   - [For more info see the math](https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values)
- This function makes the following assumptions:
   - If we don't have any observations in a column i, mean_array[i] be set to 0.0, and std_array[i] will be set to 1.0.
   - If we have a single observation, or if standard deviation is 0.0 for a column, we only subtract the mean for that column.

- (Useful for normalizing test sets:) The function allows the normalization to be based on pre-specified mean and standard deviation arrays. 

- The function also allows only a given subset of features to be normalized.

Example
-------
```
>>> import numpy as np
>>> from scipy.sparse import csr_matrix

>>> x = csr_matrix(np.array([[1, 0, 0], [3, 0, 4], [2, 5, 2]], dtype=float))

>>> print(x.toarray())
[[ 1.  0.  0.]
[ 3.  0.  4.]
[ 2.  5.  2.]]

>>> xnorm, xmean, xstd, xixnormed = csr_utils.normalize_csr_matrix(x)

>>> print(xnorm.todense())
[[-1.22474487  0.          0.        ]
[ 1.22474487  0.          1.        ]
[ 0.          0.         -1.        ]]

>>> xmean
array([2., 5., 3.])

>>> xstd
array([0.81649658, 1.        , 1.        ])        

>>> xixnormed
array([0, 2])

```


