Oversampling minority samples
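This example creates an imbalanced binary classification dataset, then oversamples the minority class until it reaches the requested ratio of the majority class. With balance_ratio=0.2, the minority class grows from 29 to 94 samples, roughly 20% of the 471 majority samples.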
Out:
Num zero class (pre-balance): 29
Num one class (pre-balance): 471
Num zero class (post-balance): 94
Num one class (post-balance): 471
Num samples (post-balance): 565
            0  ...         19
87  -2.153390  ...  -0.240686
367 -0.769973  ...  -0.175931
485 -0.946491  ...  -0.120715
290 -0.839210  ...  -0.958555
72  -1.225000  ...   0.078820

[5 rows x 20 columns]
print(__doc__)
# Author: Taylor Smith <taylor.smith@alkaline-ml.com>
from sklearn.datasets import make_classification
from skoot.balance import over_sample_balance
import pandas as pd
# #############################################################################
# Create an imbalanced dataset
X, y = make_classification(n_samples=500, n_classes=2, weights=[0.05, 0.95],
                           random_state=42)
# get counts:
zero_mask = y == 0
print("Num zero class (pre-balance): %i" % zero_mask.sum())
print("Num one class (pre-balance): %i\n" % (~zero_mask).sum())
# #############################################################################
# Balance the dataset
X_balance, y_balance = over_sample_balance(X, y, balance_ratio=0.2,
                                           random_state=42)
# get the new counts
new_mask = y_balance == 0
print("Num zero class (post-balance): %i" % new_mask.sum())
print("Num one class (post-balance): %i" % (~new_mask).sum())
print("Num samples (post-balance): %i" % X_balance.shape[0])
# #############################################################################
# This also works for pandas DataFrames
X_balance_df, _ = over_sample_balance(pd.DataFrame.from_records(X),
                                      y, balance_ratio=0.2,
                                      random_state=42)
print(X_balance_df.head())
Total running time of the script: ( 0 minutes 0.031 seconds)
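
To make the post-balance counts above easier to interpret, here is a minimal NumPy sketch of random oversampling. It is not skoot's implementation: the helper name `oversample_minority` is hypothetical, and the target size `int(balance_ratio * majority_count)` is an assumption inferred from the output above (0.2 * 471 rounds down to 94). It draws minority rows with replacement until that target is met.

```python
import numpy as np


def oversample_minority(X, y, balance_ratio=0.2, random_state=None):
    """Randomly duplicate minority-class rows (illustrative, not skoot's code)."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    majority_count = counts.max()
    # Assumed target: each under-represented class is brought up to
    # floor(balance_ratio * majority_count) rows.
    target = int(balance_ratio * majority_count)

    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_needed = target - count
        if n_needed <= 0:
            continue  # this class already meets the ratio
        cls_idx = np.flatnonzero(y == cls)
        # Draw with replacement from the existing rows of this class
        extra = rng.choice(cls_idx, size=n_needed, replace=True)
        X_parts.append(X[extra])
        y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data with the same class counts as the example above: 29 zeros, 471 ones
demo_rng = np.random.RandomState(42)
y_demo = np.array([0] * 29 + [1] * 471)
X_demo = demo_rng.normal(size=(500, 20))

X_bal, y_bal = oversample_minority(X_demo, y_demo, balance_ratio=0.2,
                                   random_state=42)
print("zeros: %i, ones: %i, total: %i"
      % ((y_bal == 0).sum(), (y_bal == 1).sum(), X_bal.shape[0]))
# prints "zeros: 94, ones: 471, total: 565"
```

Under these assumptions the sketch reproduces the counts reported above: 65 duplicated minority rows are appended to the original 500, and the majority class is left untouched. Unlike the output shown above, it does not shuffle the rows afterwards.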