Overview¶
About¶
impyute is a general purpose, imputations library written in Python. In statistics, imputation is the method of estimating missing values in a data set. There are a lot of different types of imputation, the result of the various types of datasets. On datasets with high percentages of missing values, some methods work better than others and vice versa. Datasets can be cross sectional or time series, linear or non linear, continuous or categorical or boolean. As you can imagine, there are a lot of different specifications that need to be kept in mind.
Functionality¶
impyute was built for convenience, an all in one stop so that users can impute their dataset with minimal knowledge and get on with their day. With that in mind, the following tools are provided for the user:
- Imputations (Fill in missing values)
- Deletions (Only use complete data)
- Diagnostics to identify the skew and distribution of missing values
- Comparison function to experiment with how different machine learning algorithms are affected by different imputation algorithms.
- Dataset generation to experiment with different types of missingness and different types of data.
Formatting your Data¶
Prior to running, checks are run to ensure the given data is in an acceptable format. Please ensure that your data satisfies the following criterion:
- numpy.ndarray with type numpy.float
- Columns are along the x-axis and individual datapoints are along the y-axis.
- 2D Matrix (3D is also allowed in certain cases, but requires special treatment)
- Missing values can be found with numpy.isnan