Impyute


Impyute is a library of missing data imputation algorithms written in Python 3. It was designed to be lightweight; here's a sneak peek at what impyute can do.

>>> import numpy as np
>>> n = 5
>>> arr = np.random.uniform(high=6, size=(n, n))
>>> for _ in range(3):
...     arr[np.random.randint(n), np.random.randint(n)] = np.nan
>>> arr
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834,        nan],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036,        nan, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541,        nan, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
>>> import impyute as impy
>>> impy.mean(arr)
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, 3.7122365 ],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, 1.99128649, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, 3.08994336, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])

Feature Support

  • Imputation of Cross Sectional Data
    • K-Nearest Neighbours
    • Multivariate Imputation by Chained Equations
    • Expectation Maximization
    • Mean Imputation
    • Mode Imputation
    • Median Imputation
    • Random Imputation
  • Imputation of Time Series Data
    • Last Observation Carried Forward
    • Moving Window
    • Autoregressive Integrated Moving Average (WIP)
  • Diagnostic Tools
    • Loggers
    • Distribution of Null Values
    • Comparison of imputations
    • Little’s MCAR Test (WIP)

Versions

Currently tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7.

Installation

To install impyute, run the following:

$ pip3 install impyute

Or to get the latest development build:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Documentation

Documentation is available here: http://impyute.readthedocs.io/

How to Contribute

Check out CONTRIBUTING

User Guide

Overview

About

impyute is a general purpose imputation library written in Python. In statistics, imputation is the process of estimating missing values in a data set. There are many different types of imputation, each suited to different types of datasets: on datasets with high percentages of missing values, some methods work better than others and vice versa. Datasets can be cross sectional or time series; linear or non-linear; continuous, categorical or boolean. As you can imagine, there are a lot of different specifications that need to be kept in mind.

Functionality

impyute was built for convenience: a one-stop shop so that users can impute their dataset with minimal effort and get on with their day. With that in mind, the following tools are provided for the user:

  • Imputations (Fill in missing values)
  • Deletions (Only use complete data)
  • Diagnostics to identify the skew and distribution of missing values
  • Comparison function to experiment with how different machine learning algorithms are affected by different imputation algorithms.
  • Dataset generation to experiment with different types of missingness and different types of data.

Formatting your Data

Prior to running, checks are run to ensure the given data is in an acceptable format. Please ensure that your data satisfies the following criteria:

  • numpy.ndarray with a float dtype
  • Columns (features) are along the x-axis and individual data points are along the y-axis (rows).
  • 2D Matrix (3D is also allowed in certain cases, but requires special treatment)
  • Missing values can be found with numpy.isnan
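A minimal sketch of these checks (check_format is a hypothetical helper used for illustration, not part of impyute's API):

```python
import numpy as np

def check_format(data):
    """Hypothetical helper illustrating the format criteria listed above."""
    assert isinstance(data, np.ndarray), "expected a numpy.ndarray"
    assert data.dtype.kind == "f", "expected a float dtype so np.nan can be represented"
    assert data.ndim == 2, "expected a 2D matrix"
    return True

arr = np.array([[1.0, np.nan], [3.0, 4.0]])
check_format(arr)
mask = np.isnan(arr)  # missing values are found with numpy.isnan
```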

Getting Started

Installation

Install via pip:

$ pip3 install impyute

From source:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Dependencies

  • NumPy
  • SciPy
  • scikit-learn

Versions

Currently, this package works with Python 2.7, 3.4, 3.5, 3.6 and 3.7.

Troubleshooting

Not working? Open an issue here: https://github.com/eltonlaw/impyute/issues

Tutorial

For the Standard User

Identify what type of data you have (cross-sectional or time-series), then read about the strengths and weaknesses of each type of approach and pick something suitable. I’ve compiled a small (and incomplete) list of Rules of Thumb that you can use to aid your decision making. After you’ve picked your imputation algorithm, run it on your data and sanity-check the results.

For the Researcher

Diagnostics

Little’s MCAR Test [1]

Take the mean of the data with missing values and the mean of the data without missing values. If they’re the same or similar, then it’s more likely that your data is MCAR.

[1]Roderick J. A. Little. (1988). A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association, 83(404), 1198-1202. doi:10.2307/2290157
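The intuition above can be sketched as a simple mean comparison (mcar_mean_check is a hypothetical helper for illustration; it is only the intuition, not Little's actual chi-square test statistic):

```python
import numpy as np

def mcar_mean_check(data, col=0):
    """Compare a column's mean over complete rows vs. rows that contain
    missing values. Similar means are weakly consistent with MCAR."""
    has_missing = np.isnan(data).any(axis=1)
    mean_complete = np.nanmean(data[~has_missing, col])
    mean_incomplete = np.nanmean(data[has_missing, col])
    return mean_complete, mean_incomplete

mc, mi = mcar_mean_check(np.array([[1., 2.], [3., np.nan], [5., 6.]]), col=0)
```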

Rules of Thumb

TBA

API

API Reference

Documentation is auto-generated from docstrings.

Dataset

impyute.dataset.mnist(missingness='mcar', thr=0.2)[source]

Loads corrupted MNIST

Parameters:
missingness: (‘mcar’, ‘mar’, ‘mnar’)

Type of missingness you want in your dataset

thr: float between [0,1]

Percentage of missing data in generated data

Returns:
numpy.ndarray
impyute.dataset.randn(theta=(0, 1), shape=(5, 5), missingness='mcar', thr=0.2, dtype='float')[source]

Return randomly generated dataset of numbers with normally distributed values with given mu and sigma.

Parameters:
theta: tuple (mu, sigma)

Determines the mean (mu) and standard deviation (sigma) of the generated values

shape: tuple (optional)

Size of the randomly generated data

missingness: (‘mcar’, ‘mar’, ‘mnar’)

Type of missingness you want in your dataset

thr: float between [0,1]

Percentage of missing data in generated data

dtype: (‘int’,’float’)

Type of data

Returns:
numpy.ndarray
impyute.dataset.randu(bound=(0, 10), shape=(5, 5), missingness='mcar', thr=0.2, dtype='int')[source]

Return randomly generated dataset of numbers with uniformly distributed values between bound[0] and bound[1]

Parameters:
bound: tuple (start, stop)

Determines the range of values in the matrix. Index 0 for start value and index 1 for stop value. Start is inclusive, stop is exclusive.

shape: tuple (optional)

Size of the randomly generated data

missingness: (‘mcar’, ‘mar’, ‘mnar’)

Type of missingness you want in your dataset

thr: float between [0,1]

Percentage of missing data in generated data

dtype: (‘int’,’float’)

Type of data

Returns:
numpy.ndarray

Deletion

Missing data approaches that delete values.

impyute.deletion.complete_case(data)[source]

Return only the rows where every column is observed (complete cases).

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
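The behaviour can be sketched in a few lines of NumPy (an illustrative re-implementation, not impyute's own code):

```python
import numpy as np

def complete_case_sketch(data):
    """Drop every row that contains at least one nan."""
    return data[~np.isnan(data).any(axis=1)]

kept = complete_case_sketch(np.array([[1., 2.], [np.nan, 4.], [5., 6.]]))
# rows 0 and 2 survive; the row containing nan is dropped
```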

Utility

Cross Sectional Imputation

Imputations for cross-sectional data.

impyute.imputation.cs.random(data)[source]

Fill missing values with a randomly selected observed value from the same column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.

impyute.imputation.cs.mean(data)[source]

Substitute missing values with the mean of that column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
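What mean imputation does can be sketched as follows (illustrative only; in practice use the impyute function documented above):

```python
import numpy as np

def mean_impute_sketch(data):
    """Replace each nan with the mean of the observed values in its column."""
    out = data.copy()
    col_means = np.nanmean(out, axis=0)   # per-column means, ignoring nans
    rows, cols = np.where(np.isnan(out))  # locations of missing values
    out[rows, cols] = col_means[cols]
    return out

filled = mean_impute_sketch(np.array([[1., np.nan], [3., 4.]]))
```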

impyute.imputation.cs.mode(data)[source]

Substitute missing values with the mode of that column (the most frequent value).

In the case of a tie (multiple most-frequent values) for a column, randomly pick one of them.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
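A sketch of mode imputation with the random tie-breaking described above (illustrative, not impyute's own implementation):

```python
import random
import numpy as np

def mode_impute_sketch(data):
    """Fill nans with the most frequent value in the column; ties broken at random."""
    out = data.copy()
    for j in range(out.shape[1]):
        col = out[:, j]                      # view into out, so assignment sticks
        observed = col[~np.isnan(col)]
        values, counts = np.unique(observed, return_counts=True)
        modes = values[counts == counts.max()]  # all tied most-frequent values
        col[np.isnan(col)] = random.choice(modes)
    return out

filled = mode_impute_sketch(np.array([[1., 1.], [1., 2.], [np.nan, 2.]]))
```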

impyute.imputation.cs.median(data)[source]

Substitute missing values with the median of that column (the middle value).

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.

impyute.imputation.cs.buck_iterative(data)[source]

Iterative variant of Buck’s method.

  • The variable to regress on is chosen at random.
  • The EM-style regression loop stops once the change in prediction from the previous iteration is < 10% for all columns with missing values.

S. F. Buck (1960). A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer. Journal of the Royal Statistical Society, Series B (Methodological), 22(2), 302-306.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.

impyute.imputation.cs.em(data, loops=50)[source]

Imputes given data using expectation maximization.

E-step: Calculates the expected complete-data log likelihood ratio.
M-step: Finds the parameters that maximize the log likelihood of the complete data.

Parameters:
data: numpy.ndarray

Data to impute.

loops: int

Number of em iterations to run before breaking.

inplace: boolean

If True, operate on the numpy array reference

Returns:
numpy.ndarray

Imputed data.

impyute.imputation.cs.fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=inf, leafsize=10, idw_fn=shepards, init_impute_fn=mean)[source]

Impute using a variant of the nearest neighbours approach

Basic idea: impute the array with a passed-in initial imputation function (mean imputation by default), then use the resulting complete array to construct a KDTree. Use this KDTree to compute nearest neighbours. After finding the k nearest neighbours, take their weighted average; in effect, the nearest rows in terms of distance fill in the missing value.

This approach is much, much faster than the other implementation (fit+transform for each subset) which is almost prohibitively expensive.

Parameters:
data: numpy.ndarray

2D matrix to impute.

k: int, optional

Parameter used for method querying the KDTree class object. Number of neighbours used in the KNN query. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

eps: nonnegative float, optional

Parameter used for method querying the KDTree class object. From the SciPy docs: “Return approximate nearest neighbors; the kth returned value is guaranteed to be no further than (1+eps) times the distance to the real kth nearest neighbor”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

p : float, 1<=p<=infinity, optional

Parameter used for method querying the KDTree class object. Straight from the SciPy docs: “Which Minkowski p-norm to use. 1 is the sum-of-absolute-values Manhattan distance, 2 is the usual Euclidean distance, infinity is the maximum-coordinate-difference distance”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

distance_upper_bound : nonnegative float, optional

Parameter used for method querying the KDTree class object. Straight from the SciPy docs: “Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point.” Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

leafsize: int, optional

Parameter used for construction of the KDTree class object. Straight from the SciPy docs: “The number of points at which the algorithm switches over to brute-force. Has to be positive”. Refer to the docs for [scipy.spatial.KDTree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html) for more information.

idw_fn: fn, optional

Function that takes one argument, a list of distances, and returns weighted percentages. You can define a custom one, or adapt the functions defined in impy.util.inverse_distance_weighting using functools.partial, for example: functools.partial(impy.util.inverse_distance_weighting.shepards, power=1)

init_impute_fn: fn, optional

Function used for the initial imputation pass, which produces the complete array that the KDTree is built from (mean imputation by default).

Returns:
numpy.ndarray

Imputed data.

Examples

>>> data = np.arange(25).reshape((5, 5)).astype(float)
>>> data[0][2] = np.nan
>>> data
array([[ 0.,  1., nan,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=1) # Weighted average (by distance) of nearest 1 neighbour
array([[ 0.,  1.,  7.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0.        ,  1.        , 10.08608891,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=3)
array([[ 0.        ,  1.        , 13.40249283,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=5) # There are at most only 4 neighbours; raises an error
...
IndexError: index 5 is out of bounds for axis 0 with size 5

Time Series Imputation

Imputations for time-series data.

impyute.imputation.ts.locf(data, axis=0)[source]

Last Observation Carried Forward

For each set of missing indices, use the value of one row before (same column). If the missing value is in the first row, look one row ahead instead; if that next row is also NaN, keep looking at the following rows until you find one in this column that’s not NaN. All the rows before it will be filled with this value.

Parameters:
data: numpy.ndarray

Data to impute.

axis: int (optional)

0 if the time series is in row format (e.g. data[0][:] is the 1st data point). 1 if the time series is in column format (e.g. data[:, 0] is the 1st data point).

Returns:
numpy.ndarray

Imputed data.
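The fill rules described above can be sketched for a single 1D series (locf_1d is an illustrative helper; the real function handles 2D data and an axis argument):

```python
import numpy as np

def locf_1d(series):
    """Carry the last observation forward; back-fill leading nans from the
    first observed value, per the rule described above."""
    out = series.copy()
    for i in range(1, len(out)):            # forward fill
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    for i in range(len(out) - 2, -1, -1):   # back-fill any leading nans
        if np.isnan(out[i]):
            out[i] = out[i + 1]
    return out

result = locf_1d(np.array([np.nan, 2., np.nan, 4.]))
```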

impyute.imputation.ts.moving_window(data, nindex=None, wsize=5, errors='coerce', func=mean, inplace=False)[source]

Interpolate the missing values based on nearby values.

For example, with an array like this:

array([[-1.24940, -1.38673, -0.03214945,  0.08255145, -0.007415],
       [ 2.14662,  0.32758, -0.82601414,  1.78124027,  0.873998],
       [-0.41400, -0.977629,        nan, -1.39255344,  1.680435],
       [ 0.40975,  1.067599,  0.29152388, -1.70160145, -0.565226],
       [-0.54592, -1.126187,  2.04004377,  0.16664863, -0.010677]])

Using a window size (wsize) of 3, the one missing value would be set to -1.18509122. The window operates on the horizontal axis.
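The quoted value can be checked by hand: a window of size 3 centred on the nan (row 2, columns 1 through 3) contains -0.977629, nan and -1.39255344, and the nan-ignoring mean of those values is the imputed result:

```python
import numpy as np

# Window of size 3 centred on the missing value at row 2, column 2.
window = np.array([-0.977629, np.nan, -1.39255344])
imputed = np.nanmean(window)  # (-0.977629 + -1.39255344) / 2 = -1.18509122
```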

Parameters:
data: numpy.ndarray

2D matrix to impute.

nindex: int

Null index. Index of the null value inside the moving average window. Use cases: Say you wanted to make value skewed toward the left or right side. 0 would only take the average of values from the right and -1 would only take the average of values from the left

wsize: int

Window size. Size of the moving average window/area of values being used for each local imputation. This number includes the missing value.

errors: {“raise”, “coerce”, “ignore”}

Errors can occur with the indexing of the windows: for example, if there is a nan at data[x][0] and nindex is set to -1, or a nan at data[x][-1] and nindex is set to 0. “raise” will raise an error, “coerce” will try again with nindex set to the middle, and “ignore” will leave it as a nan.

inplace: {True, False}

Whether to return a copy or run on the passed-in array

Returns:
numpy.ndarray

Imputed data.

Contributing

Contributing

See CONTRIBUTING

Philosophy

Looking Forward

(Not ordered by importance)

Implementations:

  • ARMA
  • ARIMA
  • Multiple Imputation
  • EM with Kalman Filter

Datasets:

  • Load more real world datasets
  • Generate MCAR, MAR and MNAR data

Feature Upgrades:

  • compare: Allow customization of used ML algorithms

Major Updates:

  • Imputation of n-dimensional data
  • Imputation on specific formats (text, image, audio)

References

Citations

Schmitt P, Mandel J, Guedj M (2015) A Comparison of Six Methods for Missing Data Imputation. J Biom Biostat 6:224. doi: 10.4172/2155-6180.1000224

Gelman A, Hill J (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models.

Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple Imputation by Chained Equations: What is it and how does it work? International journal of methods in psychiatric research. 2011;20(1):40-49. doi:10.1002/mpr.329.

Roderick J. A. Little. (1988). A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association, 83(404), 1198-1202. doi:10.2307/2290157