Oversampling minority samples

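This example creates an imbalanced binary classification dataset and oversamples the minority class until it is roughly 20% the size of the majority class (balance_ratio=0.2), which brings the class ratio closer to balance.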


Out:

Num zero class (pre-balance): 29
Num one class (pre-balance): 471

Num zero class (post-balance): 94
Num one class (post-balance): 471
Num samples (post-balance): 565
            0  ...        19
87  -2.153390  ... -0.240686
367 -0.769973  ... -0.175931
485 -0.946491  ... -0.120715
290 -0.839210  ... -0.958555
72  -1.225000  ...  0.078820

[5 rows x 20 columns]

print(__doc__)

# Author: Taylor Smith <taylor.smith@alkaline-ml.com>

from sklearn.datasets import make_classification
from skoot.balance import over_sample_balance
import pandas as pd

# #############################################################################
# Create an imbalanced dataset
X, y = make_classification(n_samples=500, n_classes=2, weights=[0.05, 0.95],
                           random_state=42)

# get counts:
zero_mask = y == 0
print("Num zero class (pre-balance): %i" % zero_mask.sum())
print("Num one class (pre-balance): %i\n" % (~zero_mask).sum())

# #############################################################################
# Balance the dataset
X_balance, y_balance = over_sample_balance(X, y, balance_ratio=0.2,
                                           random_state=42)

# get the new counts
new_mask = y_balance == 0
print("Num zero class (post-balance): %i" % new_mask.sum())
print("Num one class (post-balance): %i" % (~new_mask).sum())
print("Num samples (post-balance): %i" % X_balance.shape[0])

# #############################################################################
# This also works for pandas DataFrames

X_balance_df, _ = over_sample_balance(pd.DataFrame.from_records(X),
                                      y, balance_ratio=0.2,
                                      random_state=42)

print(X_balance_df.head())
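
The post-balance counts in the output are consistent with the minority class being resampled with replacement up to about balance_ratio times the majority count: 0.2 * 471 = 94.2, which truncates to 94. The sketch below illustrates that idea using only the standard library. It is not skoot's implementation; the helper name random_oversample, its parameters, and the truncation rule are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, balance_ratio=0.2, seed=None):
    """Hypothetical helper (not part of skoot): duplicate minority-class rows,
    sampled with replacement, until each minority class has
    int(balance_ratio * majority_count) rows."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_count = max(counts.values())
    # Assumed rule: truncate the target size to an integer
    target = int(balance_ratio * majority_count)

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        if count >= target:
            # Class already meets the target ratio; leave it untouched
            continue
        # Draw the missing rows from the existing rows of this class
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        extra = rng.choices(pool, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data with the same class counts as the example: 29 zeros, 471 ones
labels = [0] * 29 + [1] * 471
rows = [[float(i)] for i in range(500)]

rows_bal, labels_bal = random_oversample(rows, labels,
                                         balance_ratio=0.2, seed=42)
balanced_counts = Counter(labels_bal)
print("Num zero class (post-balance): %i" % balanced_counts[0])
print("Num one class (post-balance): %i" % balanced_counts[1])
print("Num samples (post-balance): %i" % len(rows_bal))
```

Because the new rows are copies of existing minority rows, the majority class is left unchanged and no synthetic feature values are introduced. Only the class proportions shift.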

Total running time of the script: ( 0 minutes 0.031 seconds)
