GitBook: [#2881] update

CPol 2021-11-30 23:36:04 +00:00 committed by gitbook-bot
parent 1b45fddbff
commit 7a93916e07
1 changed files with 46 additions and 1 deletions

@ -124,4 +124,49 @@ print(dataset.target_column.value_counts())
In an imbalanced dataset there is always a **majority class or classes** and a **minority class or classes**.
There are 2 main ways to fix this problem:
* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.
```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1337)
X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column
X_under, y_under = rus.fit_resample(X,y)
print(y_under.value_counts()) #Confirm data isn't imbalanced anymore
```
* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.
```python
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1337)
X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column
X_over, y_over = ros.fit_resample(X,y)
print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
```
{% hint style="info" %}
Undersampling and oversampling aren't perfect: if you compute statistics (with `.describe()`) on the over/under-sampled data and compare them to the original, you will see **that they changed**. Therefore, oversampling and undersampling modify the training data.
{% endhint %}
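To make the hint above concrete, here is a minimal sketch that undersamples a toy dataset and compares the `.describe()` output before and after. The dataset is made up for illustration, and the undersampling is done with plain pandas (`groupby(...).sample(...)`) instead of `RandomUnderSampler`, so it only approximates what imblearn does.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced dataset: 900 majority rows (0) and 100 minority rows (1)
rng = np.random.default_rng(1337)
dataset = pd.DataFrame({
    "column1": np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)]),
    "target_column": [0] * 900 + [1] * 100,
})

# Plain-pandas random undersampling: keep as many rows per class as the minority class has
n_minority = dataset.target_column.value_counts().min()
dataset_under = dataset.groupby("target_column").sample(n=n_minority, random_state=1337)

# Side-by-side statistics of the feature before and after resampling
comparison = pd.concat(
    [dataset.column1.describe(), dataset_under.column1.describe()],
    axis=1,
    keys=["original", "undersampled"],
)
print(comparison)
```

In this toy data the minority class has larger `column1` values, so the mean of the undersampled column moves towards the minority class once both classes have the same weight.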
### SMOTE oversampling
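**SMOTE** (Synthetic Minority Over-sampling Technique) is usually a **more reliable way to oversample the data**: instead of duplicating existing rows, it creates new synthetic minority samples by interpolating between a sample and its nearest neighbours of the same class.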
```python
import pandas as pd
from imblearn.over_sampling import SMOTE
# For SMOTE the columns need to be numeric (it interpolates between samples); map target_column if necessary
smote = SMOTE(random_state=1337)
X_smote, y_smote = smote.fit_resample(dataset[['column1', 'column2', 'column3']], dataset.target_column)
dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
dataset_smote['target_column'] = y_smote  # attach the labels to the resampled frame, not the original dataset
print(y_smote.value_counts()) #Confirm data isn't imbalanced anymore
```
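As a rough illustration of the idea behind SMOTE, the sketch below generates synthetic points by picking a minority sample, picking one of its `k` nearest minority neighbours, and taking a random point on the segment between them. The function `smote_like_samples` and the toy data are hypothetical and heavily simplified. This is not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=1337):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, len(X_minority), size=n_new)         # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Hypothetical minority class: 20 points with 2 numeric features
X_min = np.random.default_rng(0).normal(size=(20, 2))
synthetic = smote_like_samples(X_min, n_new=80)
print(synthetic.shape)  # (80, 2)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies. This is also why the features have to be numeric.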