skoot.exploration.summarize
skoot.exploration.summarize(X)
Summarize a dataframe.
Create a more in-depth summary of a dataframe than pandas' DataFrame.describe gives you. This includes details on skewness, arity (for categorical features) and more.
For continuous features (floats), this summary computes:
- Mean
- Median
- Max
- Min
- Variance
- Skewness
- Kurtosis
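As a rough illustration of the continuous statistics listed above, the sketch below computes them for a single float column with plain pandas. The helper name `continuous_summary` is hypothetical and this is not skoot's implementation; in particular, it assumes pandas' default estimators (sample variance with ddof=1, bias-corrected skewness and excess kurtosis), which may differ slightly from what `summarize` uses internally.

```python
import pandas as pd


def continuous_summary(series):
    """Hypothetical helper: summary statistics for one float column."""
    return pd.Series({
        "Mean": series.mean(),
        "Median": series.median(),
        "Max": series.max(),
        "Min": series.min(),
        "Variance": series.var(),   # sample variance (ddof=1) in pandas
        "Skewness": series.skew(),  # bias-corrected sample skewness
        "Kurtosis": series.kurt(),  # excess kurtosis (about 0 for normal data)
    })


# A small column with one large value, so the right tail is longer.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
stats = continuous_summary(values)
print(stats)
```

The mean (4.0) sits above the median (3.0) here, which is the usual signature of a right-skewed column.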
For categorical features:
- Least frequent class
- Most frequent class
- Class balance (n_least_freq / n_most_freq; higher is better)
- Num Levels
- Arity (n_unique_classes / n_samples; lower is better)
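The two ratios above are easy to misread, so here is a minimal sketch of how they could be derived from class counts. The helper `categorical_summary` is hypothetical and only follows the formulas given in the list; it ignores ties between equally frequent classes and is not skoot's own code.

```python
import pandas as pd


def categorical_summary(series):
    """Hypothetical helper: summary statistics for one categorical column."""
    counts = series.value_counts()  # sorted from most to least frequent
    n_most = counts.iloc[0]
    n_least = counts.iloc[-1]
    return {
        "Least Freq.": counts.index[-1],
        "Most Freq.": counts.index[0],
        "Class Balance": n_least / n_most,    # 1.0 means perfectly balanced
        "Num Levels": len(counts),
        "Arity": len(counts) / len(series),   # unique classes per sample
    }


# Ten samples: eight of class 0 and two of class 1.
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
summary = categorical_summary(labels)
print(summary)
```

With eight zeros and two ones, the class balance is 2 / 8 = 0.25 and the arity is 2 / 10 = 0.2.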
Parameters:
X : array-like, shape=(n_samples, n_features)
    The input data. May contain categorical or continuous data, and is cast to a pandas DataFrame for the computations.
Returns:
stats : DataFrame
    The summarized dataframe.
Notes
The skewness of a normal distribution is zero, and symmetric data should exhibit a skewness near zero. Positive skewness values indicate the data are skewed right, and negative values indicate they are skewed left. If the data are multi-modal, this may impact the sign of the skewness.
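To make the sign convention concrete, the following sketch draws a symmetric sample and a right-skewed sample, mirrors the latter to get a left-skewed one, and compares their skewness. It uses pandas' `Series.skew` as a stand-in estimator, which is an assumption; the exact values from `summarize` may differ slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Normal data are symmetric; exponential data have a long right tail.
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.exponential(size=10_000))
left_skewed = -right_skewed  # mirroring flips the sign of the skewness

print("symmetric:   ", symmetric.skew())
print("right-skewed:", right_skewed.skew())
print("left-skewed: ", left_skewed.skew())
```

The symmetric sample should print a value close to zero, while the other two should be clearly positive and clearly negative, respectively.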
References
[R12] Measures of Skewness and Kurtosis. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Examples
>>> import skoot
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=20,
...                            n_informative=12, random_state=1)
>>> X = pd.DataFrame.from_records(X[:, :5],
...                               columns=['a', 'b', 'c', 'd', 'e'])
>>> # Make one into a binary column
>>> X['d'] = (np.random.RandomState(1).rand(X.shape[0]) > 0.9).astype(int)
>>> skoot.summarize(X)
                      a         b         c         d         e
Mean          -1.036419 -0.382853 -0.007993       NaN  0.394417
Median        -0.968732 -0.382114 -0.047757       NaN  0.283779
Max            4.559433  9.863773  2.991107       NaN  7.344063
Min           -6.147430 -8.301872 -2.679137       NaN -5.866428
Variance       3.324646  5.832246  0.985764       NaN  3.938836
Skewness      -0.059496  0.148757  0.121908       NaN  0.021251
Kurtosis       0.069795 -0.040619 -0.098477       NaN -0.187570
Least Freq.         NaN       NaN       NaN      (1,)       NaN
Most Freq.          NaN       NaN       NaN      (0,)       NaN
Class Balance       NaN       NaN       NaN  0.113586       NaN
Num Levels          NaN       NaN       NaN         2       NaN
Arity               NaN       NaN       NaN     0.002       NaN
Missing        0.000000  0.000000  0.000000         0  0.000000