1
2
Fork 0
mirror of https://github.com/carlospolop/hacktricks.git synced 2023-12-14 19:12:55 +01:00
hacktricks/a.i.-exploiting/bra.i.nsmasher-presentation/ml-basics/feature-engineering.md

128 lines
6 KiB
Markdown
Raw Normal View History

2021-11-30 01:35:54 +01:00
# Feature Engineering
## Basic types of possible data
Data can be **continuous** (**infinity** values) or **categorical** (nominal) where the amount of possible values are **limited**.
### Categorical types
2021-12-01 00:18:19 +01:00
#### Binary
Just **2 possible values**: 1 or 0. In case in a dataset the values are in string format (e.g. "True" and "False") you assign numbers to those values with: 
```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```
#### **Ordinal**
The **values follows an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case.
```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```
* For **alphabetic columns** you can order them more easily:
```python
# First get all the uniq values alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```
#### **Cyclical**
Looks **like ordinal value** because there is an order, but it doesn't mean one is bigger than the other. Also the **distance between them depends on the direction** you are counting. Example: The days of the week, Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features, ones may work with only just some algorithms. **In general, dummy encode can be used**
```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```
#### **Dates**
Date are **continuous** **variables**. Can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).
* Usually dates are used as **index**
2021-11-30 01:35:54 +01:00
```python
# Transform dates to datetime
dataset["column_date"] = pd.to_datetime(dataset.column_date)
# Make the date feature the index
dataset.set_index('column_date', inplace=True)
print(dataset.head())
# Sum usage column per day
daily_sum = dataset.groupby(df_daily_usage.index.date).agg({'usage':['sum']})
# Flatten and rename usage column
daily_sum.columns = daily_sum.columns.get_level_values(0)
daily_sum.columns = ['daily_usage']
print(daily_sum.head())
# Fill days with 0 usage
idx = pd.date_range('2020-01-01', '2020-12-31')
daily_sum.index = pd.DatetimeIndex(daily_sum.index)
df_filled = daily_sum.reindex(idx, fill_value=0) # Fill missing values
# Get day of the week, Monday=0, Sunday=6, and week days names
dataset['DoW'] = dataset.transaction_date.dt.dayofweek
## do the same in a different way
dataset['weekday'] = dataset.transaction_date.dt.weekday
# get day names
dataset['day_name'] = dataset.transaction_date.apply(lambda x: x.day_name())
```
2021-12-01 00:18:19 +01:00
#### Multi-category/nominal
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. his data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understand them), **dummy encoding is used** (and **not one-hot encoding** because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes in binary features, so this will create **one new column per possible class** and will assign 1 **True value to one column**, and the rest will be false.
You can get a **multi-category column dummie encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes in binary features, so this will create **one new column per possible class minus one** as the **last 2 columns will be reflect as "1" or "0" in the last binary column created**. This will avoid perfect multicollinearity, reducing the relations between columns.
## Collinear/Multicollinearity
Collinear appears when **2 features are related to each other**. Multicollineratity appears when those are more than 2.
In ML **you want that your features are related with the possible results but you don't want them to be related between them**. That's why the **dummy encoding mix the last two columns** of that and **is better than one-hot encoding** which doesn't do that creating a clear relation between all the new featured from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True)
onehot_encoded = pd.get_dummies(dataset.column1)
X = add_constant(onehot_encoded) # Add previously one-hot encoded data
print(pd.Series([variance_inflation_factor(X.values,i) for i in range(X.shape[1])], index=X.columns))
```
## Categorical Imbalance
This occurs when there is **not the same amount of each category** in the training data.
```python
# Get statistic of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```
In an imbalance there is always a **majority class or classes** and a **minority class or classes**.
There are 2 main ways to fix this problem. Using undersampling: REmoving randomly selected data fom the majority class so it has the same numbe