Cross-Sectional Imputation

Imputations for cross-sectional data.

impyute.imputation.cs.random(data)[source]

Fill missing values with a randomly selected value from the same column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
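As a minimal illustration of this strategy (a NumPy sketch, not impyute's own code), the function below fills each NaN with a value drawn from the observed entries of the same column:

```python
import numpy as np

def random_impute(data):
    """Fill each NaN with a randomly chosen observed value from its column."""
    data = data.copy()
    for j in range(data.shape[1]):
        col = data[:, j]
        missing = np.isnan(col)
        observed = col[~missing]
        # Draw one replacement per missing entry, sampling with replacement.
        col[missing] = np.random.choice(observed, size=missing.sum())
    return data

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, np.nan]])
imputed = random_impute(data)
# imputed[1, 0] is either 1.0 or 3.0; imputed[2, 1] is either 2.0 or 4.0
```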

impyute.imputation.cs.mean(data)[source]

Substitute missing values with the mean of that column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
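A minimal NumPy sketch of column-mean imputation (illustrative only, not impyute's internal code):

```python
import numpy as np

def mean_impute(data):
    # Column means computed over observed (non-NaN) values only.
    col_means = np.nanmean(data, axis=0)
    # Substitute each NaN with its column's mean; other entries are unchanged.
    return np.where(np.isnan(data), col_means, data)

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, 6.0]])
imputed = mean_impute(data)
# The missing entry becomes (1.0 + 3.0) / 2 = 2.0
```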

impyute.imputation.cs.mode(data)[source]

Substitute missing values with the mode (most frequent value) of that column.

If there is a tie in a column (multiple equally frequent values), one of them is picked at random.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
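A NumPy sketch of this behaviour, including the random tie-break (illustrative, not impyute's internal code):

```python
import numpy as np

def mode_impute(data, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    data = data.copy()
    for j in range(data.shape[1]):
        col = data[:, j]
        missing = np.isnan(col)
        if not missing.any():
            continue
        values, counts = np.unique(col[~missing], return_counts=True)
        # All values tied for the highest count; pick one of them at random.
        modes = values[counts == counts.max()]
        col[missing] = rng.choice(modes)
    return data

data = np.array([[1.0, 5.0],
                 [1.0, 5.0],
                 [2.0, np.nan]])
imputed = mode_impute(data)
# Column 1's mode is 5.0, so the NaN becomes 5.0
```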

impyute.imputation.cs.median(data)[source]

Substitute missing values with the median (middle value) of that column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
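A minimal NumPy sketch of column-median imputation (illustrative only, not impyute's internal code):

```python
import numpy as np

def median_impute(data):
    # Column medians computed over observed values only.
    col_medians = np.nanmedian(data, axis=0)
    return np.where(np.isnan(data), col_medians, data)

data = np.array([[1.0], [np.nan], [9.0], [3.0]])
imputed = median_impute(data)
# The observed values are 1, 9, 3; their median is 3.0
```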

impyute.imputation.cs.buck_iterative(data)[source]

Iterative variant of Buck's method.

  • The variable to regress on is chosen at random.
  • The EM-style regression loop stops once, for every column with missing values, the change from the previous prediction is below 10%.

S. F. Buck, "A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 22, No. 2 (1960), pp. 302–306.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
numpy.ndarray

Imputed data.
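The loop above might be sketched as follows. This is a hypothetical simplification: it cycles through the incomplete columns deterministically rather than choosing the regression variable at random, and it is not impyute's actual implementation.

```python
import numpy as np

def buck_iterative_sketch(data, tol=0.10, max_iter=100):
    data = data.copy()
    mask = np.isnan(data)
    if not mask.any():
        return data
    # Start from column-mean imputation so every regression has complete inputs.
    data = np.where(mask, np.nanmean(data, axis=0), data)
    for _ in range(max_iter):
        previous = data.copy()
        for j in np.where(mask.any(axis=0))[0]:
            others = np.delete(np.arange(data.shape[1]), j)
            # Regress column j on all other columns (with an intercept),
            # fitting only on rows where column j was actually observed.
            X = np.column_stack([np.ones(len(data)), data[:, others]])
            observed = ~mask[:, j]
            coef, *_ = np.linalg.lstsq(X[observed], data[observed, j], rcond=None)
            data[mask[:, j], j] = X[mask[:, j]] @ coef
        # Stop once every imputed value changed by less than tol (10%).
        change = np.abs(data[mask] - previous[mask]) / (np.abs(previous[mask]) + 1e-12)
        if change.max() < tol:
            break
    return data

data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
imputed = buck_iterative_sketch(data)
# The second column follows y = 2x, so the missing value is imputed near 8.0
```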

impyute.imputation.cs.em(data, loops=50)[source]

Imputes given data using expectation maximization.

E-step: calculates the expected complete-data log likelihood. M-step: finds the parameters that maximize the expected complete-data log likelihood.

Parameters:
data: numpy.ndarray

Data to impute.

loops: int

Number of EM iterations to run before stopping.

inplace: boolean

If True, operate on the numpy array in place rather than on a copy.

Returns:
numpy.ndarray

Imputed data.
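As a rough illustration of the E/M alternation (a hypothetical per-column Gaussian sketch, not impyute's actual implementation):

```python
import numpy as np

def em_impute_sketch(data, loops=50):
    data = data.copy()
    for j in range(data.shape[1]):
        col = data[:, j]          # view: writes update `data` in place
        missing = np.isnan(col)
        mu, var = np.nanmean(col), np.nanvar(col)
        for _ in range(loops):
            # E-step: replace missing entries with their expected value
            # under the current Gaussian fit for this column.
            col[missing] = mu
            # M-step: re-estimate the parameters from the completed column;
            # the missing entries also contribute their variance `var`.
            mu = col.mean()
            var = (np.sum((col - mu) ** 2) + missing.sum() * var) / len(col)
    return data

data = np.array([[1.0], [3.0], [np.nan]])
imputed = em_impute_sketch(data)
# The missing entry converges to the observed mean, 2.0
```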

impyute.imputation.cs.fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=inf, leafsize=10, idw_fn=shepards, init_impute_fn=mean)[source]

Impute using a variant of the nearest-neighbours approach.

Basic idea: impute the array with the passed-in initial imputation function (mean imputation by default), then use the resulting complete array to construct a KDTree and query the k nearest neighbours of each row containing missing values (the nearest rows in terms of distance). Each missing entry is then filled with the weighted average of its neighbours' values.

This approach is much faster than the alternative implementation (fit + transform for each subset), which is almost prohibitively expensive.

Parameters:
data: numpy.ndarray

2D matrix to impute.

k: int, optional

Parameter used when querying the KDTree. Number of neighbours used in the KNN query. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

eps: nonnegative float, optional

Parameter used when querying the KDTree. From the SciPy docs: “Return approximate nearest neighbors; the kth returned value is guaranteed to be no further than (1+eps) times the distance to the real kth nearest neighbor”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

p: float, 1 <= p <= infinity, optional

Parameter used when querying the KDTree. Straight from the SciPy docs: “Which Minkowski p-norm to use. 1 is the sum-of-absolute-values Manhattan distance, 2 is the usual Euclidean distance, infinity is the maximum-coordinate-difference distance”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

distance_upper_bound : nonnegative float, optional

Parameter used when querying the KDTree. Straight from the SciPy docs: “Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point.” Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).

leafsize: int, optional

Parameter used for construction of the KDTree class object. Straight from the SciPy docs: “The number of points at which the algorithm switches over to brute-force. Has to be positive”. Refer to the docs for [scipy.spatial.KDTree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html) for more information.

idw_fn: fn, optional

Function that takes one argument, a list of distances, and returns weighted percentages. You can define a custom one, or build one from the functions defined in impy.util.inverse_distance_weighting using functools.partial, for example: functools.partial(impy.util.inverse_distance_weighting.shepards, power=1)

init_impute_fn: fn, optional

Function used for the initial imputation pass that produces the complete array the KDTree is built from (defaults to mean imputation).

Returns:
numpy.ndarray

Imputed data.

Examples

>>> data = np.arange(25).reshape((5, 5)).astype(float)
>>> data[0][2] = np.nan
>>> data
array([[ 0.,  1., nan,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=1) # Weighted average (by distance) of nearest 1 neighbour
array([[ 0.,  1.,  7.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0.        ,  1.        , 10.08608891,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=3)
array([[ 0.        ,  1.        , 13.40249283,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=5) # There are only 4 other rows, so this raises an error
...
IndexError: index 5 is out of bounds for axis 0 with size 5
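For reference, the scheme above can be sketched with SciPy's cKDTree. This is a simplified stand-in (mean initialisation plus inverse-distance weighting), not impyute's actual implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_impute_sketch(data, k=3):
    mask = np.isnan(data)
    # Initial mean imputation so every row is complete for the KDTree.
    filled = np.where(mask, np.nanmean(data, axis=0), data)
    tree = cKDTree(filled)
    for i in np.where(mask.any(axis=1))[0]:
        # Query k + 1 neighbours because a row is its own nearest neighbour.
        dist, idx = tree.query(filled[i], k=k + 1)
        dist, idx = dist[1:], idx[1:]
        # Shepard-style inverse-distance weights, normalised to sum to 1.
        weights = 1.0 / np.maximum(dist, 1e-12)
        weights /= weights.sum()
        for j in np.where(mask[i])[0]:
            filled[i, j] = weights @ filled[idx, j]
    return filled

data = np.arange(25, dtype=float).reshape(5, 5)
data[0, 2] = np.nan
imputed = knn_impute_sketch(data, k=1)
# Row 1 is row 0's nearest neighbour, so the NaN becomes 7.0 (as above)
```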