Just **2 possible values**: 1 or 0. In case in a dataset the values are in string format (e.g. "True" and "False") you assign numbers to those values with: 
The **values follows an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case.
Looks **like ordinal value** because there is an order, but it doesn't mean one is bigger than the other. Also the **distance between them depends on the direction** you are counting. Example: The days of the week, Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features, ones may work with only just some algorithms. **In general, dummy encode can be used**
Date are **continuous****variables**. Can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. his data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understand them), **dummy encoding is used** (and **not one-hot encoding** because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes in binary features, so this will create **one new column per possible class** and will assign 1 **True value to one column**, and the rest will be false.
You can get a **multi-category column dummie encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes in binary features, so this will create **one new column per possible class minus one** as the **last 2 columns will be reflect as "1" or "0" in the last binary column created**. This will avoid perfect multicollinearity, reducing the relations between columns.
## Collinear/Multicollinearity
Collinear appears when **2 features are related to each other**. Multicollineratity appears when those are more than 2.
In ML **you want that your features are related with the possible results but you don't want them to be related between them**. That's why the **dummy encoding mix the last two columns** of that and **is better than one-hot encoding** which doesn't do that creating a clear relation between all the new featured from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.
```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUserSampler(random_state=1337)
X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column
X_under, y_under = rus.fit_resample(X,y)
print(y_under.value_counts()) #Confirm data isn't imbalanced anymore
```
* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.
```python
from imblearn.under_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1337)
X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column
X_over, y_over = ros.fit_resample(X,y)
print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
```
{% hint style="info" %}
Undersamplig or Oversampling aren't perfect if you get statistics (with `.describe()`) of the over/under-sampled data and compare them to the original you will see **that they changed.** Therefore oversampling and undersampling are modifying the training data.
{% endhint %}
### SMOTE oversampling
**SMOTE** is usually a **more trustable way to oversample the data**.
```python
from imblearn.over_sampling import SMOTE
# Form SMOTE the target_column need to be numeric, map it if necessary