skoot.utils.check_dataframe

skoot.utils.check_dataframe(X, cols=None, assert_all_finite=False, column_diff=False)

Check an input dataframe.
Determine whether an input frame is a Pandas dataframe or whether it can be coerced as one, and raise a TypeError if not. Also check for finite values if specified. If columns are provided, checks that all columns are present within the dataframe and raises an assertion error if not.
Note: if X is not a dataframe (e.g., a list of lists or a numpy array), no column names are available when the pandas dataframe is created, so the columns will be integer positions. Any cols provided should account for this behavior.

Parameters:

X : array-like, shape=(n_samples, n_features)
The input frame. Should be a pandas DataFrame, numpy ndarray or a similar array-like structure. Any non-pandas structure will be cast to pandas if possible; if it cannot be cast, a TypeError is raised.

cols : list, iterable or None
Any columns to check for. If this is provided, all columns will be checked for presence in the X.columns index. If any are not present, a ValueError will be raised.

assert_all_finite : bool, optional (default=False)
Whether to assert that all values within the X frame are finite. Note that if cols is specified, this will only assert that all values in the specified columns are finite.

column_diff : bool, optional (default=False)
Whether to also get the columns present in X that are not present in cols. This is returned as the third element in the output if column_diff is True.

Returns:

X_copy : DataFrame
A copy of the X dataframe.

cols : list
The list of columns on which to apply a function to this dataframe. If cols was specified in the function, this is equal to cols as a list. Else, it’s the X.columns index.

diff : list
If column_diff is True, the third position in the returned tuple holds the columns that are within X but NOT present in cols.

Examples
When providing a dataframe and columns, the columns should be present:
>>> from skoot.utils import check_dataframe
>>> from skoot.datasets import load_iris_df
>>> df = load_iris_df(include_tgt=False, names=['a', 'b', 'c', 'd'])
>>> df, cols = check_dataframe(df, cols=('a', 'c'))
>>> assert cols == ['a', 'c']
>>> df.head()
     a    b    c    d
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
When passing numpy arrays, account for the fact that the columns cannot be specified when creating the pandas dataframe:
>>> df2, cols = check_dataframe(df.values, cols=[0, 2])
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> df2.head()
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
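If integer labels are inconvenient, one option is to wrap the array in a DataFrame with explicit names before validating it. The sketch below uses only pandas and numpy; the array values and the names 'a' through 'd' are illustrative and not part of skoot.

```python
import numpy as np
import pandas as pd

# Illustrative data: two rows, four features
arr = np.array([[5.1, 3.5, 1.4, 0.2],
                [4.9, 3.0, 1.4, 0.2]])

# Without names, pandas labels the columns with integer positions
unnamed = pd.DataFrame(arr)
print(unnamed.columns.tolist())

# With explicit names, string labels can be passed as cols instead
named = pd.DataFrame(arr, columns=['a', 'b', 'c', 'd'])
print(named.columns.tolist())
```

Either frame can then be passed as X; only the labels expected in cols differ.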
If you want to get the column_diff, or the left-out columns, this will be returned as a third element in the tuple when specified:
>>> df2, cols, diff = check_dataframe(df.values, [0, 2], column_diff=True)
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> diff
[1, 3]
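The behavior documented on this page can be approximated in plain pandas. The function below, check_frame_sketch, is a hypothetical stand-in written for illustration and is not skoot's implementation. In particular, raising ValueError for missing columns and for non-finite values is an assumption, as is the requirement that the checked columns be numeric.

```python
import numpy as np
import pandas as pd


def check_frame_sketch(X, cols=None, assert_all_finite=False,
                       column_diff=False):
    """Hypothetical approximation of the documented behavior."""
    # Copy a DataFrame as-is; otherwise try to coerce the input
    if isinstance(X, pd.DataFrame):
        X_copy = X.copy()
    else:
        try:
            X_copy = pd.DataFrame(X)
        except Exception as exc:
            raise TypeError("X cannot be cast to a DataFrame") from exc

    # Default to every column when none are requested
    if cols is None:
        cols = X_copy.columns.tolist()
    else:
        cols = list(cols)

    # Assumption: missing columns raise ValueError
    missing = [c for c in cols if c not in X_copy.columns]
    if missing:
        raise ValueError("Columns not found: %r" % missing)

    # Only the selected columns are checked (assumed numeric here)
    if assert_all_finite:
        values = X_copy[cols].to_numpy(dtype=float)
        if not np.isfinite(values).all():
            raise ValueError("Non-finite values found in X")

    # Columns in X that were not requested, in their original order
    if column_diff:
        diff = [c for c in X_copy.columns if c not in cols]
        return X_copy, cols, diff
    return X_copy, cols


# The NaN sits in column 1, which is outside cols, so the check passes
data = np.array([[1.0, np.nan, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0]])
frame, used, left_out = check_frame_sketch(
    data, [0, 2], assert_all_finite=True, column_diff=True)
```

The sketch mirrors the documented point that the finite check is restricted to cols when cols is given, which is why a NaN in an unselected column does not raise.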