skoot.exploration.summarize

skoot.exploration.summarize(X)

Summarize a dataframe.

Create a more in-depth summary of a dataframe than pd.DataFrame.describe provides, including skewness, arity (for categorical features), and more. For continuous features (floats), this summary computes:

  • Mean
  • Median
  • Max
  • Min
  • Variance
  • Skewness
  • Kurtosis

For categorical features:

  • Least frequent class
  • Most frequent class
  • Class balance (n_least_freq / n_most_freq; higher is better)
  • Num Levels
  • Arity (n_unique_classes / n_samples; lower is better)
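
The categorical measures can be sketched in the same way. The helper `categorical_summary` below is hypothetical (it is not skoot's implementation) and simply applies the two ratios given in the list above, n_least_freq / n_most_freq and n_unique_classes / n_samples, to the counts from pandas' value_counts.

```python
import pandas as pd


def categorical_summary(col):
    """Hypothetical helper (not part of skoot): per-column categorical stats."""
    col = pd.Series(col)
    # value_counts sorts by frequency, most frequent first (NaN is dropped)
    counts = col.value_counts()
    n_levels = counts.shape[0]
    return {
        "Least Freq.": counts.index[-1],
        "Most Freq.": counts.index[0],
        # n_least_freq / n_most_freq; 1.0 means perfectly balanced classes
        "Class Balance": counts.iloc[-1] / counts.iloc[0],
        "Num Levels": n_levels,
        # n_unique_classes / n_samples; small values mean few levels per sample
        "Arity": n_levels / col.shape[0],
    }


# Eight samples of class "a" and two of class "b"
labels = ["a"] * 8 + ["b"] * 2
summary = categorical_summary(labels)
print(summary)
```

Here the class balance is 2 / 8 and the arity is 2 / 10. A column in which nearly every sample has its own level (an ID column, for instance) would push the arity toward 1.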

Parameters:

X : array-like, shape=(n_samples, n_features)

The input data. May contain categorical or continuous features; it is cast to a pandas DataFrame for the computations.

Returns:

stats : DataFrame

The summarized dataframe, with one column per feature in X and one row per statistic.

Notes

The skewness of a normal distribution is zero, and any symmetric data should exhibit a skewness near zero. Positive skewness indicates the data are skewed right (a longer right tail), and negative skewness indicates they are skewed left. If the data are multi-modal, this may affect the sign of the skewness.
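
The sign convention described above can be checked on synthetic data. This is a minimal sketch that uses pandas' Series.skew rather than skoot itself, so the exact values are not necessarily those summarize would report; the sample size and distributions are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)

# Exponential samples have a long right tail, so skewness is positive
right_tailed = pd.Series(rng.exponential(scale=1.0, size=10000))
right_skew = right_tailed.skew()

# Mirroring the same samples moves the long tail to the left
left_skew = (-right_tailed).skew()

# Normal samples are symmetric, so skewness should be close to zero
symmetric_skew = pd.Series(rng.normal(size=10000)).skew()

print("right-tailed:", right_skew)
print("left-tailed: ", left_skew)
print("symmetric:   ", symmetric_skew)
```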

References

[R12] Measures of Skewness and Kurtosis. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Examples

>>> import skoot
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=20,
...                            n_informative=12, random_state=1)
>>> X = pd.DataFrame.from_records(X[:, :5],
...                               columns=['a', 'b', 'c', 'd', 'e'])
>>> # Make one into a binary column
>>> X['d'] = (np.random.RandomState(1).rand(X.shape[0]) > 0.9).astype(int)
>>> skoot.summarize(X)
                      a         b         c         d         e
Mean          -1.036419 -0.382853 -0.007993       NaN  0.394417
Median        -0.968732 -0.382114 -0.047757       NaN  0.283779
Max            4.559433  9.863773  2.991107       NaN  7.344063
Min           -6.147430 -8.301872 -2.679137       NaN -5.866428
Variance       3.324646  5.832246  0.985764       NaN  3.938836
Skewness      -0.059496  0.148757  0.121908       NaN  0.021251
Kurtosis       0.069795 -0.040619 -0.098477       NaN -0.187570
Least Freq.         NaN       NaN       NaN      (1,)       NaN
Most Freq.          NaN       NaN       NaN      (0,)       NaN
Class Balance       NaN       NaN       NaN  0.113586       NaN
Num Levels          NaN       NaN       NaN         2       NaN
Arity               NaN       NaN       NaN     0.002       NaN
Missing        0.000000  0.000000  0.000000         0  0.000000