---
title: ML_Landscape
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# ML Landscape
## What is ML
The science of programming computers so that they can learn from data, instead of following explicitly coded rules.
The data examples are called the **training set**. Each example in it is a **training instance**.
Performance is measured with a performance measure, e.g. **accuracy**.
ML is good at solving problems that:
1. are too complex for traditional approaches
2. have no known algorithm (ML can help find one)
3. involve fluctuating environments: an ML system can adapt to new data
4. involve large amounts of complex data, where ML can help humans gain insight => data mining
| | Description | Example |
| :--- | :--- | :--- |
| CNN | Convolutional Neural Network | Image classification |
| | Semantic segmentation | Brain scans |
| NLP | Natural Language Processing | News article classification |
| | | Text summarization |
| RNN | Recurrent Neural Network | News article classification |
| NLU | Natural Language Understanding | Chatbot/personal assistant |
| SVM | Support Vector Machine | Forecasting |
| RL | Reinforcement Learning | |
**Regression models**:
- Linear
- Polynomial
- Random Forest
- when the past must be taken into account: RNN, CNN, Transformers
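As a toy illustration of linear vs. polynomial regression (the data is made up: a noisy quadratic relationship, fitted with plain numpy least squares):

```python
import numpy as np

# Hypothetical data: a noisy quadratic relationship
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.size)

# Linear fit (degree 1) vs polynomial fit (degree 2)
linear_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)

def mse(coeffs):
    # mean squared error of a polynomial model on the training data
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# The quadratic model fits this curved data better than the straight line
print(mse(linear_coeffs) > mse(poly_coeffs))  # True
```

Degree 2 always fits at least as well as degree 1 on the training data, which is also why comparing models on training error alone favours overfitting (see below).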
Dimensionality reduction: simplify the data without losing too much information.
Feature extraction: merge correlated features into a single new feature that represents both.
Anomaly detection: e.g. spot unusual credit card transactions to prevent fraud.
Novelty detection: detect new instances that look different from all training instances.
Association rule learning: dig into large datasets and discover interesting relations between attributes.
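A minimal anomaly-detection sketch in the spirit of the credit-card example above, flagging transactions far from the mean via z-scores (all amounts are invented; real systems use richer models such as Isolation Forest):

```python
import numpy as np

# Hypothetical transaction amounts; one is clearly unusual
amounts = np.array([12.0, 15.5, 11.2, 14.8, 13.1, 950.0, 12.9])

# z-score: how many standard deviations each amount is from the mean
z = np.abs((amounts - amounts.mean()) / amounts.std())

# Flag anything more than 2 standard deviations out
anomalies = amounts[z > 2.0]
print(anomalies)  # [950.]
```

The threshold (here 2.0) plays the same role as a model hyperparameter: tighter thresholds catch more anomalies but produce more false alarms.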
## Types of ML
1. **Supervised Learning**
- Classification
- K-Nearest Neighbours
     - Linear/Logistic regression (both the predictors and their labels are required)
- SVM
- Decision Tree
- Random Forests
- Neural Networks
2. **Unsupervised Learning**
- Clustering
- K-Means
- DBSCAN
- HCA (Hierarchical Cluster Analysis)
- Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
   - Visualisation and dimensionality reduction
     - PCA (Principal Component Analysis)
     - Kernel PCA
     - LLE (Locally Linear Embedding)
     - t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Association rule learning
- Apriori
- Eclat
3. **Semi-supervised Learning**
   - partly labeled data
   - mostly combinations of supervised and unsupervised algorithms
   - DBN (Deep Belief Networks)
   - RBM (Restricted Boltzmann Machines)
4. **Reinforcement Learning**
   - The learning system, called an agent, observes the environment, selects and performs actions, and gets rewards or penalties in return. It learns the best strategy by itself.
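To make the unsupervised side of this list concrete, here is a hand-rolled k-means clustering loop on a toy two-cluster dataset (cluster positions and sizes are invented for illustration; in practice a library implementation would be used):

```python
import numpy as np

# Toy data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (20, 2)),
                 rng.normal(5, 0.5, (20, 2))])

centers = pts[[0, -1]].copy()  # crude init: one point from each end
for _ in range(10):
    # assign each point to its nearest center (squared Euclidean distance)
    labels = np.argmin(((pts[:, None] - centers) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers))  # centers end up near (0, 0) and (5, 5)
```

No labels were used anywhere: the algorithm discovers the two groups purely from the structure of the data, which is exactly what distinguishes unsupervised from supervised learning.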
Systems can also be classified by whether or not they can learn incrementally from a stream of incoming data:
1. **Batch Learning**
   - must be trained with all available data (offline learning)
   - training can take many hours and requires a lot of resources
   - with extremely large datasets or limited resources, training can become impossible
   - for new data, the system must be retrained from scratch on the full dataset and then (automatically) deployed
2. **Online learning**
   - train the system incrementally by feeding it data sequentially, either individually or in mini-batches
   - great when data arrives in a continuous flow
   - great approach when the system needs to adapt quickly to new data
   - requires fewer resources
   - when the data does not fit in memory (out-of-core learning), online learning is a perfect approach
   - important parameter: the learning rate
     - high: adapts quickly to changing data, but also forgets old data quickly
     - low: learns more slowly, but is less sensitive to noise in new data
   - problem when fed bad data: performance will decline, so the system needs to be monitored
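The online-learning loop and the role of the learning rate can be sketched with a one-weight stochastic gradient descent fit, processing one (made-up) sample at a time:

```python
import numpy as np

# Online learning sketch: fit y ≈ w*x one sample at a time with SGD.
rng = np.random.default_rng(1)
w = 0.0
lr = 0.05  # learning rate: higher adapts faster but forgets faster
for _ in range(500):
    x = rng.uniform(-1, 1)
    y = 3.0 * x + rng.normal(0, 0.1)  # stream of samples; true slope is 3
    w += lr * (y - w * x) * x         # gradient step on the squared error
print(round(w, 2))  # close to the true slope of 3.0
```

Each sample is seen once and then discarded, so memory use stays constant no matter how long the stream runs; that is the out-of-core property described above.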
ML systems can also be classified by how they generalize; they need to perform well on new, unseen data:
1. **Instance-based learning**
   - learn the examples by heart, then generalize to new cases by using a similarity measure to compare them to the learned examples
2. **Model-based learning**
   - build a model of the examples, then use that model to make predictions
   - use model selection to pick an appropriate model type and fully specify its architecture (incl. tuning parameters)
Inference: making predictions on new data.
## Main challenges of ML
"Bad algorithm" and "bad data"
1. Insufficient quantity of training data
   - It is not always easy or cheap to get extra training data
   - More data is generally better
2. Nonrepresentative training data
   - Crucial that the data represents the cases you want to generalize to,
     for both instance-based and model-based learning
   - If the sample is too small => sampling noise
   - If the sample is large but the sampling method is flawed => sampling bias
     (the sampling method is how the data is collected)
3. Poor-quality data
   - Errors, outliers, and noise => clean up the data
   - Clear outliers => discard them or try to fix the errors
   - Instances missing a few features => either ignore the attribute or fill in the values manually
4. Irrelevant features
   - Coming up with a good set of features to train on is called **feature engineering**:
     - feature selection (select the most useful features)
     - feature extraction (combine existing features into a more useful one)
     - creating new features by gathering new data
5. Overfitting the training data
   Overfitting => the model performs well on the training data, but does not generalize well.
   It happens when the model is too complex relative to the amount and noisiness of the training data.
   Solutions:
   - simplify the model: select one with fewer parameters, reduce the number of attributes, or constrain the model.
     Constraining a model is called **regularization**; the result fits the training data less well but generalizes better to new data.
     The amount of regularization is controlled by hyperparameters. A hyperparameter is a parameter of the learning algorithm (not of the model).
   - gather more training data
   - reduce the noise in the training data (fix errors, remove outliers)
6. Underfitting the training data
   - The model is too simple to learn the underlying structure of the data.
   - Solutions:
     - select a more powerful model, with more parameters
     - improve the feature engineering
     - reduce the constraints on the model (e.g. less regularization)
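Regularization, the main overfitting remedy above, can be sketched with a tiny closed-form ridge regression, where a hyperparameter alpha controls how strongly the weights are shrunk (the data, true weights, and alpha value are all made up):

```python
import numpy as np

# Toy regression problem: 5 features, only the first two actually matter
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, 30)

def ridge(X, y, alpha):
    # closed-form ridge solution: (X^T X + alpha*I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = ridge(X, y, alpha=0.0)   # ordinary least squares
w_reg = ridge(X, y, alpha=10.0)    # regularized: constrained model

# The penalty pulls the weight vector toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_plain))  # True
```

Alpha is a hyperparameter in the sense defined above: it is set before training and controls the learning algorithm, not learned from the data itself.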
## Testing and validation
Split the data into a training set and a test set; ratios range from 80-20% to 99-1%, depending on the absolute size of the dataset.
The error rate on the test set is the generalization error.
Low training error but high generalization error => overfitting.
### Hyperparameter tuning and model selection
Holdout validation: hold out part of the training set (the validation/development set) to evaluate several candidate models and select the best one. This usually works well, except when the validation set is too small.
Solution: cross-validation => use many small validation sets and evaluate each model once per validation set (drawback: training time is multiplied by the number of validation sets).
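Cross-validation as described here can be sketched as a manual k-fold loop; for simplicity the "model" is just the training-set mean, and the data is synthetic noise around 5.0:

```python
import numpy as np

# Synthetic data: 100 samples with mean 5.0 and noise variance 1.0
rng = np.random.default_rng(3)
y = rng.normal(5.0, 1.0, 100)

k = 5
folds = np.array_split(rng.permutation(100), k)  # 5 disjoint index sets

errors = []
for i in range(k):
    val_idx = folds[i]                                        # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = y[train_idx].mean()                                # "train" the model
    errors.append(np.mean((y[val_idx] - pred) ** 2))          # validate
print(float(np.mean(errors)))  # close to the noise variance of 1.0
```

Each fold serves as the validation set exactly once, which is why every model must be trained k times and training time is multiplied by the number of folds.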
### Data mismatch
Most important rule: the validation set and test set must be as representative as possible of the data used in production.
Training set => used to train the model
Test set => used for the final evaluation, once you are happy with the dev-set performance
Dev and test set have to come from the SAME distribution (randomly shuffle the data).
1. Define the dev set + metric, then iterate quickly:
   idea -> code -> experiment
| Set | Percentage of data |
| :--- | --: |
| training | 98 |
| dev | 1 |
| test | 1 |
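The shuffle-then-split procedure with the percentages from the table above might look like this (the dataset size is arbitrary):

```python
import numpy as np

# Shuffle first, so train/dev/test all come from the same distribution
rng = np.random.default_rng(4)
n = 10_000
indices = rng.permutation(n)

n_train = int(0.98 * n)  # 98% training
n_dev = int(0.01 * n)    # 1% dev, remainder is test

train_idx = indices[:n_train]
dev_idx = indices[n_train:n_train + n_dev]
test_idx = indices[n_train + n_dev:]
print(len(train_idx), len(dev_idx), len(test_idx))  # 9800 100 100
```

Splitting a pre-shuffled index array (rather than the raw data) keeps the three sets disjoint by construction and makes the split reproducible via the random seed.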