skoot.utils.check_dataframe

skoot.utils.check_dataframe(X, cols=None, assert_all_finite=False, column_diff=False)

Check an input dataframe.

Determine whether an input frame is a pandas DataFrame or can be coerced to one, and raise a TypeError if not. Optionally check that all values are finite. If columns are provided, check that every one of them is present in the dataframe and raise a ValueError if any is missing.

Note: if X is not a dataframe (e.g., a list of lists or a numpy array), no column names are available when the pandas dataframe is created, so the columns will be the default integer labels (0, 1, 2, ...). Any cols provided should account for this behavior.
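
The integer labels described in the note come from pandas itself, so they can be seen without skoot. The following is a minimal sketch that assumes only numpy and pandas are installed; the variable names are illustrative and not part of skoot:

```python
import numpy as np
import pandas as pd

# A plain array carries no column names, so pandas assigns
# integer labels 0..n_features-1 when it builds the frame.
arr = np.array([[5.1, 3.5, 1.4], [4.9, 3.0, 1.4]])
frame = pd.DataFrame(arr)

labels = frame.columns.tolist()
print(labels)

# Selecting by integer label works on such a frame.
subset = frame[[0, 2]]

# A string name is not among the labels, so a cols argument such as
# ['a', 'c'] would not match any column of this frame.
has_name = "a" in frame.columns
```

This is why cols should be given as integers, such as [0, 2], when X is an array or a list of lists.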

Parameters:

X : array-like, shape=(n_samples, n_features)

The input frame. Should be a pandas DataFrame, numpy ndarray or a similar array-like structure. Any non-pandas structure will be cast to a pandas DataFrame; if it cannot be cast, a TypeError is raised.

cols : list, iterable or None

Any columns to check for. If this is provided, all columns will be checked for presence in the X.columns index. If any are not present, a ValueError will be raised.

assert_all_finite : bool, optional (default=False)

Whether to assert that all values within the X frame are finite. Note that if cols is specified, this will only assert all values in the specified columns are finite.

column_diff : bool, optional (default=False)

Whether to also get the columns present in X that are not present in cols. This is returned as the third element in the output if column_diff is True.

Returns:

X_copy : DataFrame

A copy of the X dataframe.

cols : list

The list of columns on which to apply a function to this dataframe. If cols was specified in the function call, this is equal to cols as a list. Otherwise, it is the X.columns index.

diff : list

Only returned if column_diff is True, as the third element of the tuple: the columns that are within X but NOT present in cols.
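
The return contract above can be sketched in plain pandas. This is an illustration of the documented behavior only, not skoot's implementation: the helper name resolve_columns is hypothetical, the error message is made up, and returning diff in the column order of X is an assumption.

```python
import pandas as pd


def resolve_columns(X, cols=None, column_diff=False):
    """Hypothetical helper mirroring the documented return values."""
    # Work on a copy so the caller's frame is never modified.
    X_copy = pd.DataFrame(X).copy()

    if cols is None:
        # No columns requested: fall back to every column in X.
        cols = X_copy.columns.tolist()
    else:
        cols = list(cols)
        missing = [c for c in cols if c not in X_copy.columns]
        if missing:
            raise ValueError("columns not present in X: %r" % missing)

    if column_diff:
        # Columns in X that were not requested (assumed to keep X's order).
        diff = [c for c in X_copy.columns if c not in cols]
        return X_copy, cols, diff
    return X_copy, cols


frame = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [3.0]})
copy_, cols, diff = resolve_columns(frame, cols=("a", "c"), column_diff=True)
```

Note that the length of the returned tuple depends on column_diff, so callers should unpack two or three values accordingly.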

Examples

When providing a dataframe and columns, the columns should be present:

>>> from skoot.utils import check_dataframe
>>> from skoot.datasets import load_iris_df
>>> df = load_iris_df(include_tgt=False, names=['a', 'b', 'c', 'd'])
>>> df, cols = check_dataframe(df, cols=('a', 'c'))
>>> assert cols == ['a', 'c']
>>> df.head()
     a    b    c    d
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

When passing numpy arrays, account for the fact that the columns cannot be specified when creating the pandas dataframe:

>>> df2, cols = check_dataframe(df.values, cols=[0, 2])
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> df2.head()
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

To get the left-out columns (the column_diff), set column_diff=True; they are returned as the third element of the tuple:

>>> df2, cols, diff = check_dataframe(df.values, [0, 2], column_diff=True)
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> diff
[1, 3]
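
The assert_all_finite option is described as applying only to the specified columns when cols is given. The sketch below illustrates that idea with numpy and pandas alone. The helper all_finite is hypothetical and not part of skoot, and treating both NaN and infinity as non-finite is an assumption taken from numpy.isfinite:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [np.nan, 4.0],   # contains a NaN
    "c": [5.0, np.inf],   # contains an infinity
})


def all_finite(df, cols):
    """Return True when every value in the selected columns is finite."""
    values = df[list(cols)].to_numpy(dtype=float)
    return bool(np.isfinite(values).all())


# Only column 'a' is inspected, so the bad values in 'b' and 'c' are ignored.
only_a = all_finite(frame, ["a"])

# Including 'b' brings its NaN into the check.
a_and_b = all_finite(frame, ["a", "b"])
```

Under this reading, a frame with missing values in unrelated columns would still pass the check as long as the columns named in cols are clean.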