Cross Sectional Imputation
Imputations for cross-sectional data.
- impyute.imputation.cs.random(data, **kwargs)
Fill missing values with a randomly selected value from the same column.
Parameters: - data: numpy.ndarray
Data to impute.
Returns: - numpy.ndarray
Imputed data.
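To make the idea concrete, here is a minimal NumPy sketch of random imputation. It is not impyute's code: the name `random_impute` and the `rng` argument are hypothetical, and the sketch assumes the draw is made only from the observed (non-NaN) values of each column.

```python
import numpy as np

def random_impute(data, rng=None):
    """Fill each NaN with a value drawn at random from the observed entries of its column."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(data, dtype=float)  # float copy; the input is left untouched
    for col in range(out.shape[1]):
        missing = np.isnan(out[:, col])
        observed = out[~missing, col]
        # Skip columns with nothing to fill or nothing to draw from.
        if missing.any() and observed.size > 0:
            out[missing, col] = rng.choice(observed, size=missing.sum())
    return out

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [5.0, np.nan]])
filled = random_impute(data, rng=np.random.default_rng(0))
```

Each filled cell is guaranteed to be a value that already occurs in its column, so the column's observed range is preserved.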
- impyute.imputation.cs.mean(data, **kwargs)
Substitute missing values with the mean of that column.
Parameters: - data: numpy.ndarray
Data to impute.
Returns: - numpy.ndarray
Imputed data.
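Column-mean substitution can be sketched in a few lines of NumPy. The function name `mean_impute` is hypothetical and this is an illustration of the technique, not the library's implementation.

```python
import numpy as np

def mean_impute(data):
    """Replace each NaN with the mean of the observed values in its column."""
    out = np.array(data, dtype=float)        # float copy of the input
    col_means = np.nanmean(out, axis=0)      # per-column mean, ignoring NaN
    rows, cols = np.nonzero(np.isnan(out))   # coordinates of the missing cells
    out[rows, cols] = col_means[cols]
    return out

data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])
filled = mean_impute(data)
# Column 0 is filled with (1 + 3) / 2 = 2.0, column 1 with (10 + 20) / 2 = 15.0.
```

A column with no observed values has no mean, so `np.nanmean` returns NaN for it (with a warning) and the sketch leaves that column unfilled.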
- impyute.imputation.cs.mode(data, **kwargs)
Substitute missing values with the mode (most frequent value) of that column.
If a column has a tie (several values share the highest frequency), one of the tied values is picked at random.
Parameters: - data: numpy.ndarray
Data to impute.
Returns: - numpy.ndarray
Imputed data.
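The tie-breaking rule is easiest to see in code. The sketch below is an assumption-laden illustration, not impyute's code: `mode_impute` and `rng` are hypothetical names, and it picks one tied value per column (the text above does not say whether the pick is per column or per missing cell).

```python
import numpy as np

def mode_impute(data, rng=None):
    """Replace each NaN with the most frequent observed value of its column."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(data, dtype=float)
    for col in range(out.shape[1]):
        missing = np.isnan(out[:, col])
        observed = out[~missing, col]
        if not missing.any() or observed.size == 0:
            continue
        values, counts = np.unique(observed, return_counts=True)
        modes = values[counts == counts.max()]  # more than one entry means a tie
        out[missing, col] = rng.choice(modes)   # assumption: one random pick per column
    return out

data = np.array([[1.0, 7.0],
                 [1.0, 8.0],
                 [2.0, np.nan],
                 [np.nan, 8.0],
                 [np.nan, 7.0]])
filled = mode_impute(data, rng=np.random.default_rng(0))
# Column 0 has the single mode 1.0; column 1 is a tie between 7.0 and 8.0.
```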
- impyute.imputation.cs.median(data, **kwargs)
Substitute missing values with the median (middle value) of that column.
Parameters: - data: numpy.ndarray
Data to impute.
Returns: - numpy.ndarray
Imputed data.
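A short NumPy sketch of median substitution follows. The name `median_impute` is hypothetical and the code is illustrative only. The sample data includes an outlier to show why the median can be a safer fill value than the mean.

```python
import numpy as np

def median_impute(data):
    """Replace each NaN with the median of the observed values in its column."""
    out = np.array(data, dtype=float)
    col_medians = np.nanmedian(out, axis=0)  # per-column median, ignoring NaN
    rows, cols = np.nonzero(np.isnan(out))
    out[rows, cols] = col_medians[cols]
    return out

data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, 30.0],
                 [100.0, np.nan]])
filled = median_impute(data)
# Column 0 observed values are 1, 3, 100: the median is 3.0,
# while the mean would be pulled to about 34.7 by the outlier.
```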
- impyute.imputation.cs.mice(data, **kwargs)
Multivariate Imputation by Chained Equations.
- Reference:
- Buuren, S. V., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3). doi:10.18637/jss.v045.i03
The implementation follows the main idea of the paper above, with two differences. First, the variable to regress on is chosen at random. Second, the stopping criterion is different: iteration stops once the prediction changes by less than 10% from the previous prediction.
Parameters: - data: numpy.ndarray
Data to impute.
Returns: - numpy.ndarray
Imputed data.
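The chained-equations idea can be sketched as follows. This is a simplified illustration, not impyute's implementation. It assumes ordinary least-squares regression on all other columns, a column-mean starting fill, a random visiting order for the columns, and a 10% relative-change stopping rule modelled on the description above. The names `chained_regression_impute`, `max_iter`, `tol`, and `rng` are hypothetical.

```python
import numpy as np

def chained_regression_impute(data, max_iter=20, tol=0.10, rng=None):
    """Iteratively re-predict missing cells by regressing each column on the others."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(data, dtype=float)
    missing = np.isnan(out)
    # Start from a column-mean fill so every regression has complete predictors.
    col_means = np.nanmean(out, axis=0)
    out[missing] = col_means[np.nonzero(missing)[1]]
    cols_with_gaps = np.flatnonzero(missing.any(axis=0))
    for _ in range(max_iter):
        previous = out[missing].copy()
        for col in rng.permutation(cols_with_gaps):  # random visiting order
            rows = missing[:, col]
            # Design matrix: intercept plus every other column (current fills included).
            others = np.delete(out, col, axis=1)
            design = np.column_stack([np.ones(len(out)), others])
            # Fit on rows where this column was observed, predict where it was missing.
            coef, *_ = np.linalg.lstsq(design[~rows], out[~rows, col], rcond=None)
            out[rows, col] = design[rows] @ coef
        # Stop once no imputed cell moved by 10% or more since the last sweep.
        change = np.abs(out[missing] - previous) / (np.abs(previous) + 1e-12)
        if change.size == 0 or change.max() < tol:
            break
    return out

# Second column follows y = 2x + 1; the missing entry should come out near 9.
data = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, np.nan], [5.0, 11.0]])
filled = chained_regression_impute(data, rng=np.random.default_rng(0))
```

Full MICE, as described in the referenced paper, draws several imputed data sets and pools the results. The sketch produces a single deterministic fill.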
- impyute.imputation.cs.em(data, loops=50, **kwargs)
Imputes given data using expectation maximization.
E-step: Calculates the expected complete-data log likelihood.
M-step: Finds the parameters that maximize the log likelihood of the complete data.
Parameters: - data: numpy.ndarray
Data to impute.
- loops: int
Maximum number of EM iterations to run.
- inplace: boolean
If True, operate on the input array directly instead of on a copy.
Returns: - numpy.ndarray
Imputed data.
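To illustrate the E-step and M-step, the sketch below runs textbook EM for a multivariate normal model. That model choice is an assumption, since the description above does not say which model impyute fits, so this should not be read as the library's implementation. The names `em_impute` and `tol` are hypothetical; `loops` mirrors the documented parameter.

```python
import numpy as np

def em_impute(data, loops=50, tol=1e-6):
    """EM for a multivariate normal: fill NaNs with their conditional expectations."""
    x = np.array(data, dtype=float)
    n, d = x.shape
    missing = np.isnan(x)
    # Initial parameters from a column-mean fill.
    mu = np.nanmean(x, axis=0)
    filled = np.where(missing, mu, x)
    centered = filled - mu
    sigma = centered.T @ centered / n + 1e-6 * np.eye(d)
    for _ in range(loops):
        cond_cov_sum = np.zeros((d, d))
        # E-step: expected value (and covariance) of missing entries given observed ones.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # fully missing row: fall back to the current mean
                filled[i] = mu
                cond_cov_sum += sigma
                continue
            s_mo = sigma[np.ix_(m, o)]
            gain = s_mo @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + gain @ (x[i, o] - mu[o])
            cond_cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ s_mo.T
        # M-step: parameters that maximize the expected complete-data log likelihood.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + cond_cov_sum) / n
        converged = np.max(np.abs(new_mu - mu)) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return filled

# Second column follows y = 2x + 1; EM should pull the missing entry towards 9.
data = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, np.nan], [5.0, 11.0]])
filled = em_impute(data)
```

The sketch stops early once the estimated means change by less than `tol`, and otherwise runs for `loops` iterations.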
- impyute.imputation.cs.fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=inf, leafsize=10, **kwargs)
Impute using a variant of the nearest neighbours approach.
Basic idea: Impute the array with a basic mean impute, then use the resulting complete array to construct a KDTree. Use this KDTree to find the k nearest neighbours of each row that has a missing value, and fill the missing value with the weighted average of those neighbours. In short, each missing value is filled from the rows nearest to it in distance.
This approach is much faster than the other implementation (fit and transform for each subset), which is almost prohibitively expensive.
Parameters: - data: numpy.ndarray
2D matrix to impute.
- k: int, optional
Number of neighbours used in the KNN query. Passed to the KDTree query method. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).
- eps: nonnegative float, optional
Passed to the KDTree query method. From the SciPy docs: “Return approximate nearest neighbors; the kth returned value is guaranteed to be no further than (1+eps) times the distance to the real kth nearest neighbor”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).
- p : float, 1<=p<=infinity, optional
Passed to the KDTree query method. From the SciPy docs: “Which Minkowski p-norm to use. 1 is the sum-of-absolute-values Manhattan distance; 2 is the usual Euclidean distance; infinity is the maximum-coordinate-difference distance”. Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).
- distance_upper_bound : nonnegative float, optional
Passed to the KDTree query method. From the SciPy docs: “Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point.” Refer to the docs for [scipy.spatial.KDTree.query](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html).
- leafsize: int, optional
Passed to the KDTree constructor. From the SciPy docs: “The number of points at which the algorithm switches over to brute-force. Has to be positive”. Refer to the docs for [scipy.spatial.KDTree](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html) for more information.
Returns: - numpy.ndarray
Imputed data.
Examples
>>> import numpy as np
>>> from impyute.imputation.cs import fast_knn
>>> data = np.arange(25).reshape((5, 5)).astype(float)
>>> data[0][2] = np.nan
>>> data
array([[ 0.,  1., nan,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=1) # Weighted average (by distance) of nearest 1 neighbour
array([[ 0.,  1.,  7.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
>>> fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0.        ,  1.        , 10.08608891,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=3)
array([[ 0.        ,  1.        , 13.40249283,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])
>>> fast_knn(data, k=5) # There are at most only 4 neighbours. Raises error
...
IndexError: index 5 is out of bounds for axis 0 with size 5
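The mean-impute-then-average procedure can be sketched without the library. The code below is an illustration under stated assumptions, not impyute's implementation. It computes Euclidean distances by brute force instead of querying a KDTree, and it weights each neighbour in proportion to its distance, a rule inferred only from the fact that it reproduces the k=2 and k=3 values in the Examples. The name `knn_impute_sketch` is hypothetical.

```python
import numpy as np

def knn_impute_sketch(data, k=3):
    """Mean-impute, then fill each NaN from a weighted average of the k nearest rows."""
    original = np.array(data, dtype=float)
    if k > len(original) - 1:
        raise ValueError("k must be smaller than the number of rows")
    missing = np.isnan(original)
    # Step 1: a basic mean impute gives a complete array to measure distances on.
    col_means = np.nanmean(original, axis=0)
    complete = np.where(missing, col_means, original)
    out = original.copy()
    for row, col in zip(*np.nonzero(missing)):
        # Step 2: Euclidean distance (p=2) from this row to every row, by brute force.
        dist = np.linalg.norm(complete - complete[row], axis=1)
        dist[row] = np.inf  # a row is not its own neighbour
        nearest = np.argsort(dist)[:k]
        # Step 3: weights proportional to distance (assumption inferred from the Examples).
        weights = dist[nearest] / dist[nearest].sum()
        out[row, col] = weights @ complete[nearest, col]
    return out

data = np.arange(25).reshape((5, 5)).astype(float)
data[0, 2] = np.nan
k1 = knn_impute_sketch(data, k=1)[0, 2]
k2 = knn_impute_sketch(data, k=2)[0, 2]
k3 = knn_impute_sketch(data, k=3)[0, 2]
```

With this weighting, farther neighbours contribute more than closer ones. Inverse-distance weights would be the more conventional choice if closer rows should dominate. The sketch also does not guard against a neighbour at distance zero, which would make the weights undefined.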