# Feature Engineering

## Basic types of possible data

Data can be **continuous** (an **infinite** number of possible values) or **categorical** (nominal), where the number of possible values is **limited**.

### Categorical types

#### Binary

Just **2 possible values**: 1 or 0. If the dataset stores the values as strings (e.g. "T" and "F"), you can assign numbers to them with:

```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```

#### **Ordinal**

The **values follow an order**, like 1st place, 2nd place... If the categories are strings (like "starter", "amateur", "professional", "expert"), you can map them to numbers as in the binary case.

```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```

* If the **alphabetical order** of the values matches the order you want, you can build the mapping more easily:

```python
# First get all the unique values, alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```
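As an alternative sketch, pandas can also store the order itself with an ordered categorical type, so the integer codes follow the order you declare instead of the alphabetical one. The sample data and the `column2_encoded` name below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
dataset = pd.DataFrame({"column2": ["amateur", "expert", "starter", "professional", "amateur"]})

# Declare the categories in their intended order (lowest to highest)
levels = ["starter", "amateur", "professional", "expert"]
ordered = pd.Categorical(dataset["column2"], categories=levels, ordered=True)

# .codes gives the position of each value in `levels` (-1 for values not listed)
dataset["column2_encoded"] = ordered.codes
print(dataset)
```

Note that plain alphabetical sorting would place "amateur" before "starter", so the sorted mapping only fits columns whose alphabetical order is also their logical order.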

#### **Cyclical**

It looks **like an ordinal value** because there is an order, but no value is bigger than another. Also, the **distance between two values depends on the direction** in which you count. Example: the days of the week, where Sunday isn't "bigger" than Monday.

* There are **different ways** to encode cyclical features, and some of them only work with certain algorithms. **In general, dummy encoding can be used**:

```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```
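Another common way to encode a cyclical feature is to project it onto a circle with sine and cosine, so the last value ends up next to the first one. This is a minimal sketch, assuming the days are already numbered 0 (Monday) to 6 (Sunday); the `dow` column and the `encoded_distance` helper are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical day-of-week column: Monday=0 ... Sunday=6
dataset = pd.DataFrame({"dow": [0, 1, 2, 3, 4, 5, 6]})

period = 7  # number of distinct values in one full cycle
angle = 2 * np.pi * dataset["dow"] / period
dataset["dow_sin"] = np.sin(angle)
dataset["dow_cos"] = np.cos(angle)

def encoded_distance(a, b):
    """Euclidean distance between the encoded versions of day a and day b."""
    pa = dataset.loc[a, ["dow_sin", "dow_cos"]].to_numpy(dtype=float)
    pb = dataset.loc[b, ["dow_sin", "dow_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(pa - pb))

# Sunday (6) is as close to Monday (0) as Monday is to Tuesday (1)
print(encoded_distance(6, 0), encoded_distance(0, 1))
```

This keeps the feature in two numeric columns instead of one column per value, which may suit distance-based or linear models better than dummy encoding.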

#### **Dates**

Dates are **continuous variables**. They can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a later time is bigger than an earlier one).

* Dates are usually used as the **index**:

```python
# Transform dates to datetime
dataset["column_date"] = pd.to_datetime(dataset.column_date)
# Make the date feature the index
dataset.set_index('column_date', inplace=True)
print(dataset.head())

# Sum usage column per day
daily_sum = dataset.groupby(dataset.index.date).agg({'usage':['sum']})
# Flatten and rename usage column
daily_sum.columns = daily_sum.columns.get_level_values(0)
daily_sum.columns = ['daily_usage']
print(daily_sum.head())

# Fill days with 0 usage
idx = pd.date_range('2020-01-01', '2020-12-31')
daily_sum.index = pd.DatetimeIndex(daily_sum.index)
df_filled = daily_sum.reindex(idx, fill_value=0) # Fill missing values


# Get day of the week (Monday=0, Sunday=6) and the day names
# (assumes transaction_date is a datetime column of the dataset)
dataset['DoW'] = dataset.transaction_date.dt.dayofweek
# .dt.weekday returns the same values
dataset['weekday'] = dataset.transaction_date.dt.weekday
# Get day names
dataset['day_name'] = dataset.transaction_date.dt.day_name()
```

#### Multi-category/nominal

**More than 2 categories** with no inherent order. Use `dataset.describe(include='all')` to get information about the categories of each feature.

* A **referring string** is a **column that identifies an example** (like the name of a person). It can contain duplicates (because 2 people may have the same name), but most values will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. This data is **useless and should be removed**.

To **encode multi-category columns into numbers** (so the ML algorithm understands them), **dummy encoding is used** (and **not one-hot encoding**, because one-hot encoding **doesn't avoid perfect multicollinearity**).

You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This transforms every class into a binary feature: it creates **one new column per possible class**, and in each row exactly **one column is set to 1 (True)** while the rest are 0 (False).

You can get a **multi-category column dummy encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This also transforms the classes into binary features, but it creates **one new column per possible class minus one**: the column of the first class is dropped, and that class is represented by **all the remaining columns being 0**. This avoids perfect multicollinearity, reducing the relations between columns.
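The difference between both encodings can be seen in a small sketch. The `column1` values below are made up for illustration, and `dtype=int` is only used so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical multi-category column
dataset = pd.DataFrame({"column1": ["red", "green", "blue", "green"]})

onehot = pd.get_dummies(dataset["column1"], dtype=int)
dummy = pd.get_dummies(dataset["column1"], drop_first=True, dtype=int)

print(onehot.columns.tolist())  # one column per class
print(dummy.columns.tolist())   # the first class (in sorted order) is dropped

# In one-hot encoding every row sums to 1, so any column can be derived from the others
print(onehot.sum(axis=1).tolist())
# In dummy encoding the dropped class is the row where all columns are 0
print(dummy[dataset["column1"] == "blue"])
```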

## Collinear/Multicollinearity

Collinearity appears when **2 features are related to each other**. Multicollinearity appears when more than 2 features are related.

In ML **you want your features to be related to the possible results, but you don't want them to be related to each other**. That's why **dummy encoding drops one of the columns** and **is better than one-hot encoding**, which keeps all of them and so creates a clear relation between all the new features derived from the multi-category column (each one can be deduced from the others).

VIF is the **Variance Inflation Factor**, which **measures the multicollinearity of the features**. As a rule of thumb, a value **above 5 means that one of the two or more collinear features should be removed**.

```python
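import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Dummy encoded alternative (drops the first class):
#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True, dtype=float)
onehot_encoded = pd.get_dummies(dataset.column1, dtype=float) # dtype=float keeps the columns numeric
X = add_constant(onehot_encoded) # Add a constant (intercept) column to the one-hot encoded data
# One VIF per column; one-hot columns plus a constant are perfectly collinear, so expect very large or infinite values
print(pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns))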
```
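To make the number less of a black box, the VIF of a feature can be written as `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others. The following is a minimal numpy sketch of that definition, not the statsmodels implementation; the `vif` helper and the random data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of a 2D numeric array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for i in range(n_cols):
        target = X[:, i]
        # Regress column i on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # A perfect fit means perfect multicollinearity
        result.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return result

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.01, size=200)  # almost a linear combination of a and b

print(vif(np.column_stack([a, b])))     # independent features: values near 1
print(vif(np.column_stack([a, b, c])))  # collinear features: very large values
```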

## Categorical Imbalance

This occurs when there is **not the same amount of each category** in the training data.

```python
# Get statistics of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```

In an imbalance there is always a **majority class or classes** and a **minority class or classes**.

There are 2 main ways to fix this problem:

* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.

```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1337)

X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column

X_under, y_under = rus.fit_resample(X,y)
print(y_under.value_counts()) #Confirm data isn't imbalanced anymore
```

* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.

```python
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1337)

X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column

X_over, y_over = ros.fit_resample(X,y)
print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
```

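You can use the **`sampling_strategy`** argument to indicate **how much** you want to **undersample or oversample**. With a float (only valid for binary targets) it is the desired ratio of minority to majority samples after resampling, so `1.0` means both classes end up with the same number of samples. The default is `'auto'`, which also balances the classes.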

{% hint style="info" %}
Undersampling and oversampling aren't perfect: if you get statistics (with `.describe()`) of the over/under-sampled data and compare them to the original, you will see **that they changed.** Therefore, oversampling and undersampling modify the training data.
{% endhint %}
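What random undersampling does, and how it shifts the statistics, can be sketched with pandas alone. This is a rough illustration rather than imblearn's implementation; the tiny dataset is made up.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 samples of class 0, 2 of class 1
dataset = pd.DataFrame({
    "column1": [1, 2, 3, 4, 5, 6, 7, 8, 50, 60],
    "target_column": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

minority_size = int(dataset["target_column"].value_counts().min())

# Random undersampling: keep `minority_size` random rows of every class
under = pd.concat([
    group.sample(n=minority_size, random_state=1337)
    for _, group in dataset.groupby("target_column")
])

print(under["target_column"].value_counts())
# Compare the statistics before and after resampling
print(dataset["column1"].describe())
print(under["column1"].describe())
```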

### SMOTE oversampling

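**SMOTE** (Synthetic Minority Over-sampling Technique) is usually a **more reliable way to oversample the data**: instead of duplicating existing rows, it creates new synthetic minority samples by interpolating between existing ones.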

```python
from imblearn.over_sampling import SMOTE

# SMOTE interpolates feature values, so the feature columns must be numeric; encode them (and the target, if needed) first
smote = SMOTE(random_state=1337)
X_smote, y_smote = smote.fit_resample(dataset[['column1', 'column2', 'column3']], dataset.target_column)
dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
dataset_smote['target_column'] = y_smote # Attach the resampled target to the resampled features
print(y_smote.value_counts()) #Confirm data isn't imbalanced anymore
```
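The core idea behind SMOTE, creating a new point on the segment between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of numpy. This is a simplified illustration, not imblearn's implementation; the `synthetic_sample` helper and the sample points are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1337)

# Hypothetical minority-class samples with 2 numeric features
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [1.2, 1.8]])

def synthetic_sample(samples, k=2):
    # Pick a random minority sample
    base = samples[rng.integers(len(samples))]
    # Find its k nearest minority neighbours (index 0 is the sample itself)
    distances = np.linalg.norm(samples - base, axis=1)
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbours)]
    # Place the new point somewhere on the segment between both
    gap = rng.random()
    return base + gap * (neighbour - base)

new_points = np.array([synthetic_sample(minority) for _ in range(5)])
print(new_points)
```

Because every new point lies between two real minority samples, the synthetic data stays inside the region already covered by the minority class.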

## Rarely Occurring Categories

Imagine a dataset where one of the target classes **occurs very few times**.

This is like the category imbalance from the previous section, but the rarely occurring category occurs even less often than the "minority class" in that case. The **raw oversampling** and **undersampling** methods can also be used here, but they generally **won't give really good results**.

### Weights

In some algorithms it's possible to **modify the weights of the target classes** so that some of them get more importance when generating the model.

```python
from sklearn.linear_model import LogisticRegression

weights = {0: 10, 1: 1} #Assign weight 10 to False and 1 to True
model = LogisticRegression(class_weight=weights)
```
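One common heuristic for choosing the weights is to make them inversely proportional to the class frequencies, which is what scikit-learn's `class_weight='balanced'` option computes as `n_samples / (n_classes * samples_in_class)`. A minimal numpy sketch of that formula, with a made-up target column:

```python
import numpy as np

# Hypothetical target column: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y, return_counts=True)
# Weight of each class = n_samples / (n_classes * samples_in_class)
balanced = len(y) / (len(classes) * counts)
weights = {int(c): float(w) for c, w in zip(classes, balanced)}
print(weights)
```

The resulting dictionary has the same shape as the `weights` dictionary above, so it could be passed as `class_weight=weights`.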

You can **mix the weights with over/under-sampling techniques** to try to improve the results.

### PCA - Principal Component Analysis

PCA is a method that helps to reduce the dimensionality of the data. It **combines different features** to **reduce their number**, generating **more useful features** (_less computation is needed_).

The resulting features aren't understandable by humans, so it can also help to **anonymize the data**.
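As a rough sketch of what "combining features" means, PCA can be written with numpy's SVD: center the data, find the directions of maximum variance, and project onto the first few of them. The random data and the choice of 2 components are assumptions for illustration; in practice a library implementation would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features, the third almost equal to the first
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.column_stack([x1, x2, x1 + rng.normal(scale=0.05, size=100)])

# 1. Center every feature
X_centered = X - X.mean(axis=0)
# 2. The rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# 3. Share of the total variance explained by each component
explained = S**2 / np.sum(S**2)
# 4. Keep the first 2 components: each one is a linear combination of all features
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

print(explained)
print(X_reduced.shape)
```

Each new column mixes all the original features, which is why the result is hard for a human to interpret.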

## Incongruent Label Categories

Data might contain mistakes caused by unsuccessful transformations or simply by human error when writing the data.

Therefore you might find the **same label with spelling mistakes**, different **capitalisation**, or **abbreviations**, like: _BLUE, Blue, b, bule_. You need to fix these label errors inside the data before training the model.

You can clean these issues by lowercasing everything and mapping misspelled labels to the correct ones.
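A minimal sketch of that cleanup with pandas follows. The `color` column and the `corrections` mapping are illustrative assumptions and have to be adapted to the labels found in the real data.

```python
import pandas as pd

# Hypothetical column with inconsistent labels
dataset = pd.DataFrame({"color": ["BLUE", "Blue", "b", "bule", "Red", " red"]})

# Normalise whitespace and capitalisation first
cleaned = dataset["color"].str.strip().str.lower()
# Then map the known misspellings and abbreviations to the correct label
corrections = {"b": "blue", "bule": "blue"}
dataset["color"] = cleaned.replace(corrections)

print(dataset["color"].unique())
```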

It's very important to check that **all the data you have is correctly labeled**, because, for example, a single misspelling in the data will, when dummy encoding the classes, generate a new column in the final features, with **bad consequences for the final model**. This can be detected very easily by one-hot encoding a column and checking the names of the columns created.
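The check described above could look like the following sketch, which assumes you already know the set of labels you expect (`expected` is a hypothetical name, as is the sample data).

```python
import pandas as pd

# Hypothetical column that should only contain "blue" and "red"
dataset = pd.DataFrame({"color": ["blue", "red", "bule", "blue", "Red"]})

expected = {"blue", "red"}
encoded_columns = set(pd.get_dummies(dataset["color"]).columns)

# Any extra column reveals a label that needs fixing
unexpected = sorted(encoded_columns - expected)
print(unexpected)
# value_counts() shows how often each suspicious label occurs
print(dataset["color"].value_counts())
```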