skoot.utils.check_dataframe
skoot.utils.check_dataframe(X, cols=None, assert_all_finite=False, column_diff=False)
Check an input dataframe.
Determine whether an input frame is a Pandas dataframe or whether it can be coerced as one, and raise a TypeError if not. Also check for finite values if specified. If columns are provided, checks that all columns are present within the dataframe and raises an assertion error if not.
Note: if X is not a dataframe (i.e., a list of lists or a numpy array), no column names are specified when the pandas dataframe is created, so its columns are integer indices (0, 1, 2, ...). Any columns provided should account for this behavior.
Parameters:
X : array-like, shape=(n_samples, n_features)
The input frame. Should be a pandas DataFrame, numpy ndarray or a similar array-like structure. The function attempts to cast any non-pandas structure to a pandas DataFrame; if the cast fails, a TypeError is raised.
cols : list, iterable or None
Any columns to check for. If this is provided, all columns will be checked for presence in the X.columns index. If any are not present, a ValueError will be raised.
assert_all_finite : bool, optional (default=False)
Whether to assert that all values within the X frame are finite. Note that if cols is specified, this will only assert that all values in the specified columns are finite.
column_diff : bool, optional (default=False)
Whether to also get the columns present in X that are not present in cols. This is returned as the third element in the output if column_diff is True.
Returns:
X_copy : DataFrame
A copy of the X dataframe.
cols : list
The list of columns on which to apply a function to this dataframe. If cols was specified in the function, this is equal to cols as a list. Otherwise, it is the X.columns index.
diff : list
The columns that are within X but NOT present in cols. Only returned, as the third element of the tuple, if column_diff is True.
Examples
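The column handling described above can be sketched without pandas. The helper below is hypothetical (resolve_columns is not part of skoot) and works on a plain list of column labels in place of X.columns; it omits the dataframe copy that check_dataframe also returns. It assumes a ValueError for missing columns, as stated in the cols parameter description.

```python
def resolve_columns(frame_columns, cols=None, column_diff=False):
    """Hypothetical sketch of the documented column logic (not skoot code)."""
    frame_columns = list(frame_columns)

    if cols is None:
        # No columns requested: fall back to every column in the frame.
        selected = frame_columns
    else:
        # Always hand back a list, even if a tuple or other iterable came in.
        selected = list(cols)
        missing = [c for c in selected if c not in frame_columns]
        if missing:
            raise ValueError("columns not found in frame: %r" % missing)

    if column_diff:
        # Columns in the frame that were NOT requested, in frame order.
        diff = [c for c in frame_columns if c not in selected]
        return selected, diff
    return selected


print(resolve_columns(['a', 'b', 'c', 'd'], cols=('a', 'c')))
print(resolve_columns(range(4), cols=[0, 2], column_diff=True))
```

The second call mirrors the numpy-array case, where the frame's columns are the integer labels 0 through 3.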
When providing a dataframe and columns, the columns should be present:
>>> from skoot.datasets import load_iris_df
>>> from skoot.utils import check_dataframe
>>> df = load_iris_df(include_tgt=False, names=['a', 'b', 'c', 'd'])
>>> df, cols = check_dataframe(df, cols=('a', 'c'))
>>> assert cols == ['a', 'c']
>>> df.head()
     a    b    c    d
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
When passing numpy arrays, account for the fact that the columns cannot be specified when creating the pandas dataframe:
>>> df2, cols = check_dataframe(df.values, cols=[0, 2])
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> df2.head()
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
If you want the left-out columns, set column_diff=True; they are then returned as the third element of the tuple:
>>> df2, cols, diff = check_dataframe(df.values, [0, 2], column_diff=True)
>>> cols
[0, 2]
>>> df2.columns.tolist()
[0, 1, 2, 3]
>>> diff
[1, 3]
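The interaction between assert_all_finite and cols can be illustrated with a rough sketch using pandas and numpy directly. The helper assert_finite_subset is hypothetical and is not how skoot necessarily implements the check; in particular, raising a ValueError on non-finite values is an assumption, since the exception type is not stated above.

```python
import numpy as np
import pandas as pd


def assert_finite_subset(frame, cols=None):
    """Hypothetical helper: check only the requested columns for NaN/inf."""
    # With no cols given, every column is checked.
    subset = frame if cols is None else frame[list(cols)]
    values = subset.to_numpy(dtype=float)
    if not np.isfinite(values).all():
        # Assumed exception type, for illustration only.
        raise ValueError("non-finite values in columns %r"
                         % subset.columns.tolist())


# Built from a nested list, so the columns are the integer labels 0 and 1.
df = pd.DataFrame([[1.0, np.nan], [2.0, 4.0]])

# Column 0 is fully finite, so restricting the check to it passes.
assert_finite_subset(df, cols=[0])

# Checking the whole frame trips over the NaN in column 1.
try:
    assert_finite_subset(df)
    checked_all = "passed"
except ValueError:
    checked_all = "failed"
print(checked_all)
```

Restricting the check to cols means a NaN in an unrelated column does not cause a failure, which matches the behavior described for assert_all_finite.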