1
2
Fork 0
mirror of https://github.com/carlospolop/hacktricks.git synced 2023-12-14 19:12:55 +01:00

GitBook: [#2890] update

This commit is contained in:
CPol 2021-12-02 19:09:43 +00:00 committed by gitbook-bot
parent da985d0c7b
commit 3b999f514b
No known key found for this signature in database
GPG key ID: 07D2180C7B12D0FF

View file

@ -205,3 +205,71 @@ Therefore you might find the **same label with spelling mistakes**, different **
You can clean this issues by lowercasing everything and mapping misspelled labels to the correct ones.
It's very important to check that **all the data that you have contains is correctly labeled**, because for example, one misspelling error in the data, when dummie encoding the classes, will generate a new column in the final features with **bad consequences for the final model**. This example can be detected very easily by one-hot encoding a column and checking the names of the columns created.
## Missing Data
Some data of the study may be missing.
It might happen that some complete random data is missing for some error. This is kind of da ta is **Missing Completely at Random** (**MCAR**).
It could be that some random data is missing but there is something making some specific details more probable to be missing, for example more frequently man will tell their their age but not women. This is call **Missing at Random** (**MAR**).
Finally, there could be data **Missing Not at Random** (**MNAR**). The vale of the data is directly related with the probability of having the data. For example, if you want to measure something embarrassing, the most embarrassing someone is, the less probable he is going to share it. 
The **two first categories** of missing data can be **ignorable**. But the **third one** requires to consider **only portions of the data** that isn't impacted or to try to **model the missing data somehow**.
One way to find about missing data is to use `.info()` function as it will indicate the **number of rows but also the number of values per category**. If some category has less values than number of rows, then there is some data missing:
```bash
# Get info of the dataset
dataset.info()
# Drop all rows where some value is missing
dataset.dropna(how='any', axis=0).info()
```
It's usually recommended that if a feature is **missing in more than the 20%** of the dataset, the **column should be removed:**
```bash
# Remove column
dataset.drop('Column_name', axis='columns', inplace=True)
dataset.info()
```
{% hint style="info" %}
Note that **not all the missing values are missing in the dataset**. It's possible that missing values have been giving the value "Unknown", "n/a", "", -1, 0... You need to check the dataset (using `dataset.column`_`name.value`_`counts(dropna=False)` to check the possible values).
{% endhint %}
If some data is missing in the dataset (in it's not too much) you need to find the **category of the missing data**. For that you basically need to know if the **missing data is at random or not**, and for that you need to find if the **missing data was correlated with other data** of the dataset.
To find if a missing value if correlated with another column, you can create a new column that put 1s and 0s if the data is missing or isn't and then calculate the correlation between them:
```bash
# The closer it's to 1 or -1 the more correlated the data is
# Note that columns are always perfectly correlated with themselves.
dataset[['column_name', 'cloumn_missing_data']].corr()
```
If you decide to ignore the missing data you still need to do what to do with it: You can **remove the rows** with missing data (the train data for the model will be smaller), you can r**emove the feature** completely, or could **model it**.
You should **check the correlation between the missing feature with the target column** to see how important that feature is for the target, if it's really **small** you can **drop it or fill it**.
To fill missing **continuous data** you could use: the **mean**, the **median** or use an **imputation** algorithm. The imputation algorithm can try to use other features to find a value for the missing feature:
```python
from sklearn.impute import KNNImputer
X = dataset[['column1', 'column2', 'column3']]
y = dataset.column_target
# Create the imputer that will fill the data
imputer = KNNImputer(n_neightbors=2, weights='uniform')
X_imp = imputer.fit_transform(X)
# Check new data
dataset_imp = pd.DataFrame(X_imp)
dataset.columns = ['column1', 'column2', 'column3']
dataset.iloc[10:20] # Get some indexes that contained empty data before
```
To fill categorical data first of all you need to think if there is any reason why the values are missing. If it's by **choice of the users** (they didn't want to give the data) maybe yo can **create a new category** indicating that. If it's because of human error you can **remove the rows** or the **feature** (check the steps mentioned before) or **fill it with the mode, the most used category** (not recommended).