Utility

Diagnostic tools to find information about data.

impyute.util.find_null(data)[source]

Finds the indices of all missing values.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
List of tuples

Indices of all missing values, each as an (i, j) tuple
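As a rough illustration of the return format, the NumPy-only sketch below collects the (i, j) indices of missing entries. It assumes missing values are encoded as NaN, and the helper name find_null_sketch is hypothetical; this is not the library's implementation.

```python
import numpy as np

def find_null_sketch(data):
    """Return (i, j) index tuples of NaN entries in a 2-D array (hypothetical helper)."""
    # Assumption: missing values are encoded as NaN.
    # np.argwhere yields one [row, col] pair per True entry, in row-major order.
    return [tuple(int(k) for k in idx) for idx in np.argwhere(np.isnan(data))]

data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, 5.0, 6.0]])
null_xy = find_null_sketch(data)
print(null_xy)  # [(0, 1), (1, 0)]
```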

impyute.util.describe(data)[source]

Summarize the input data: missingness, locations of null values, and basic row/column statistics

Parameters:
data: numpy.ndarray

The data you want to get a description from

verbose: boolean (optional)

Decides whether the description is short or long form

Returns:
dict
missingness: list

Confidence intervals for the data being MCAR, MAR, or MNAR, in that order

null_xy: list of tuples

Indices of all null points

null_n: list

Total number of null values for each column

pmissing_n: float

Percentage of missing values in dataset

null_rows: list

Indices of all rows that are completely null

null_cols: list

Indices of all columns that are completely null

mean_rows: list

Mean value of each row

mean_cols: list

Mean value of each column

std_dev: list

Standard deviation for each row/column

min_max: list

Minimum and maximum for each row
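To make a few of these fields concrete, here is a minimal NumPy sketch that derives null_xy, null_n, pmissing_n, null_rows, and null_cols from a NaN mask. It assumes missing values are NaN and reports pmissing_n as a fraction (the scale used by the library is not stated here); describe_sketch is a hypothetical name, not the library's code.

```python
import numpy as np

def describe_sketch(data):
    """Compute some of the fields listed above for a 2-D float array (hypothetical helper)."""
    mask = np.isnan(data)  # assumption: missing values are encoded as NaN
    return {
        "null_xy": [tuple(int(k) for k in idx) for idx in np.argwhere(mask)],
        "null_n": mask.sum(axis=0).tolist(),                   # null count per column
        "pmissing_n": float(mask.mean()),                      # assumption: reported as a fraction
        "null_rows": np.where(mask.all(axis=1))[0].tolist(),   # rows that are entirely null
        "null_cols": np.where(mask.all(axis=0))[0].tolist(),   # columns that are entirely null
    }

data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, np.nan, np.nan],
                 [7.0, np.nan, 9.0]])
summary = describe_sketch(data)
print(summary["null_n"], summary["null_rows"], summary["null_cols"])
```

Here row 1 and column 1 are completely null, so both show up in null_rows and null_cols respectively.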

impyute.util.count_missing(data)[source]

Calculate the total percentage of missing values and also the percentage in each column.

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
dict

Percentage of missing values in total and in each column.
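The calculation described above can be sketched with a NaN mask. The dictionary keys used here ("total" plus the column index) are assumptions for illustration, as is the helper name count_missing_sketch; the library's actual key names may differ.

```python
import numpy as np

def count_missing_sketch(data):
    """Percentage of NaN values overall and per column (hypothetical helper)."""
    mask = np.isnan(data)  # assumption: missing values are encoded as NaN
    report = {"total": float(mask.mean()) * 100}
    for col in range(data.shape[1]):
        # Key by column index; this naming is an assumption, not the library's.
        report[col] = float(mask[:, col].mean()) * 100
    return report

data = np.array([[1.0, np.nan],
                 [3.0, np.nan],
                 [np.nan, 6.0],
                 [7.0, 8.0]])
report = count_missing_sketch(data)
print(report)  # {'total': 37.5, 0: 25.0, 1: 50.0}
```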

impyute.util.checks(fn)[source]

Main check function to ensure input is correctly formatted

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
bool

True if data is correctly formatted
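Because the signature takes a function (fn), checks appears to be used as a decorator. The sketch below shows one way such a decorator could validate input before calling the wrapped function. The specific validations (ndarray type, two dimensions, float dtype) and the names checks_sketch and BadInputSketchError are assumptions for illustration, not the library's actual checks.

```python
import functools
import numpy as np

class BadInputSketchError(Exception):
    """Hypothetical stand-in for impyute.util.BadInputError."""

def checks_sketch(fn):
    """Validate the first argument before calling fn (hypothetical decorator)."""
    @functools.wraps(fn)
    def wrapper(data, *args, **kwargs):
        # The checks below are illustrative assumptions only.
        if not isinstance(data, np.ndarray):
            raise BadInputSketchError("data must be a numpy.ndarray")
        if data.ndim != 2:
            raise BadInputSketchError("data must be 2-dimensional")
        if not np.issubdtype(data.dtype, np.floating):
            raise BadInputSketchError("data must have a float dtype")
        return fn(data, *args, **kwargs)
    return wrapper

@checks_sketch
def column_means(data):
    # NaN-aware mean of each column
    return np.nanmean(data, axis=0)

print(column_means(np.array([[1.0, np.nan], [3.0, 4.0]])))
```

Passing a plain list instead of an array raises BadInputSketchError before column_means runs.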

impyute.util.compare(imputed, classifiers=['sklearn.svm.SVC'], log_path=None)[source]

Given an imputed dataset with labels and a list of supervised machine learning models, find the accuracy score of every model/imputation pair.

Parameters:
imputed: [(str, np.ndarray), (str, np.ndarray)…]

List of tuples (imputation_name, imputed_data), where imputation_name is a string and imputed_data is a tuple in which imputed_data[0] is the data, X, and imputed_data[1] is the labels, y

classifiers: [str, str…str] (optional)

Provide a list of classifiers to run the imputed data sets on. Right now, it ONLY works with sklearn; the format should be sklearn.SUBMODULE.FUNCTION, or more generally 'MODULE.SUBMODULE.FUNCTION'. If providing a custom classifier, add its file location to sys.path first; the classifier should also be structured like an sklearn classifier (with fit and predict methods).

log_path: str (optional)

To write results to a file, provide a relative path

Returns:
results.txt

Classification results on imputed data
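The 'MODULE.SUBMODULE.FUNCTION' format suggests the classifier is looked up by its dotted path. A minimal sketch of that kind of lookup, using only the standard library, is shown below; resolve_dotted_sketch is a hypothetical helper, and the library may resolve names differently.

```python
import importlib

def resolve_dotted_sketch(path):
    """Resolve 'MODULE.SUBMODULE.FUNCTION' to the object it names (hypothetical helper)."""
    # Split off the final component: everything before it is the module path.
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Demonstrated with a standard-library path so the example runs anywhere.
join = resolve_dotted_sketch("os.path.join")
print(join.__name__)  # join
```

With scikit-learn installed, the same call with 'sklearn.svm.SVC' would return the SVC class, which could then be instantiated and used through its fit and predict methods.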

exception impyute.util.BadInputError(value)[source]

impyute.util.preprocess(fn)[source]

Base preprocess function for commonly used preprocessing

Parameters:
data: numpy.ndarray

Data to impute.

Returns:
bool

True if data is correctly formatted