---
title: ML_Landscape
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# ML Landscape
## What is ML
The science of programming computers so that they can learn from data, instead of following explicitly coded rules.
The data examples are called the **training set**. Each example in it is a **training instance**.
Performance is measured with a performance measure, e.g. **accuracy**.
ML is good at solving problems that:
1. are too complex for traditional approaches
2. have no known algorithm (ML can help find one)
3. involve fluctuating environments: an ML system can adapt to new data
4. involve large amounts of complex data, where ML can help humans gain insight => data mining
| | Description | Example |
| :--- | :--- | :--- |
| CNN | Convolutional Neural Network | Image classification |
| | Semantic segmentation | Brain scans |
| NLP | Natural Language Processing | News article classification |
| | | Text summarization |
| RNN | Recurrent Neural Network | News article classification |
| NLU | Natural Language Understanding | Chatbot/personal assistant |
| SVM | Support Vector Machine | Forecasting |
| RL | Reinforcement Learning | |
**Regression models**:
- Linear
- Polynomial
- Random Forest
- when the past must be taken into account: RNN, CNN, Transformers
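As a toy illustration of linear vs. polynomial regression (the data is made up: a noisy quadratic relationship, fitted with plain numpy least squares):

```python
import numpy as np

# Hypothetical data: a noisy quadratic relationship
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.size)

# Linear fit (degree 1) vs polynomial fit (degree 2)
linear_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)

def mse(coeffs):
    # mean squared error of a polynomial model on the training data
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# The quadratic model fits this curved data better than the straight line
print(mse(linear_coeffs) > mse(poly_coeffs))  # True
```

Degree 2 always fits at least as well as degree 1 on the training data, which is also why comparing models on training error alone favours overfitting (see below).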
Dimensionality reduction: simplify the data without losing too much information.
Feature extraction: merge correlated features into a single new feature that represents both.
Anomaly detection: e.g. spot unusual credit card transactions to prevent fraud.
Novelty detection: detect new instances that look different from all training instances.
Association rule learning: dig into large datasets and discover interesting relations between attributes.
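A minimal anomaly-detection sketch in the spirit of the credit-card example above, flagging transactions far from the mean via z-scores (all amounts are invented; real systems use richer models such as Isolation Forest):

```python
import numpy as np

# Hypothetical transaction amounts; one is clearly unusual
amounts = np.array([12.0, 15.5, 11.2, 14.8, 13.1, 950.0, 12.9])

# z-score: how many standard deviations each amount is from the mean
z = np.abs((amounts - amounts.mean()) / amounts.std())

# Flag anything more than 2 standard deviations out
anomalies = amounts[z > 2.0]
print(anomalies)  # [950.]
```

The threshold (here 2.0) plays the same role as a model hyperparameter: tighter thresholds catch more anomalies but produce more false alarms.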
## Types of ML
1. **Supervised Learning**
- Classification
- K-Nearest Neighbours
     - Linear/Logistic regression (both the predictors and their labels are required)
- SVM
- Decision Tree
- Random Forests
- Neural Networks
2. **Unsupervised Learning**
- Clustering
- K-Means
- DBSCAN
- HCA (Hierarchical Cluster Analysis)
- Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
   - Visualisation and dimensionality reduction
     - PCA (Principal Component Analysis)
     - Kernel PCA
     - LLE (Locally Linear Embedding)
     - t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Association rule learning
- Apriori
- Eclat
3. **Semi-supervised Learning**
   - partly labeled data
   - mostly combinations of supervised and unsupervised algorithms
   - DBN (Deep Belief Networks)
   - RBM (Restricted Boltzmann Machines)
4. **Reinforcement Learning**
   - The learning system, called an agent, observes the environment, selects and performs actions, and gets rewards or penalties in return. It learns the best strategy by itself.
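To make the unsupervised side of this list concrete, here is a hand-rolled k-means clustering loop on a toy two-cluster dataset (cluster positions and sizes are invented for illustration; in practice a library implementation would be used):

```python
import numpy as np

# Toy data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (20, 2)),
                 rng.normal(5, 0.5, (20, 2))])

centers = pts[[0, -1]].copy()  # crude init: one point from each end
for _ in range(10):
    # assign each point to its nearest center (squared Euclidean distance)
    labels = np.argmin(((pts[:, None] - centers) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers))  # centers end up near (0, 0) and (5, 5)
```

No labels were used anywhere: the algorithm discovers the two groups purely from the structure of the data, which is exactly what distinguishes unsupervised from supervised learning.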
Systems can also be classified by whether or not they can learn incrementally from a stream of incoming data:
1. **Batch Learning**
   - must be trained with all available data (offline learning)
   - training can take many hours and requires a lot of resources
   - with extremely large datasets or limited resources, training can become impossible
   - for new data, the system must be retrained from scratch on the full dataset and then (automatically) deployed
2. **Online learning**
   - train the system incrementally by feeding it data sequentially, either individually or in mini-batches
   - great when data arrives in a continuous flow
   - great approach when the system needs to adapt quickly to new data
   - requires fewer resources
   - when the data does not fit in memory (out-of-core learning), online learning is a perfect approach
   - important parameter: the learning rate
     - high: adapts quickly to changing data, but also forgets old data quickly
     - low: learns more slowly, but is less sensitive to noise in new data
   - problem when fed bad data: performance will decline, so the system needs to be monitored
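The online-learning loop and the role of the learning rate can be sketched with a one-weight stochastic gradient descent fit, processing one (made-up) sample at a time:

```python
import numpy as np

# Online learning sketch: fit y ≈ w*x one sample at a time with SGD.
rng = np.random.default_rng(1)
w = 0.0
lr = 0.05  # learning rate: higher adapts faster but forgets faster
for _ in range(500):
    x = rng.uniform(-1, 1)
    y = 3.0 * x + rng.normal(0, 0.1)  # stream of samples; true slope is 3
    w += lr * (y - w * x) * x         # gradient step on the squared error
print(round(w, 2))  # close to the true slope of 3.0
```

Each sample is seen once and then discarded, so memory use stays constant no matter how long the stream runs; that is the out-of-core property described above.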
ML systems can also be classified by how they generalize; they need to perform well on new, unseen data:
1. **Instance-based learning**
   - learn the examples by heart, then generalize to new cases by using a similarity measure to compare them to the learned examples
2. **Model-based learning**
   - build a model of the examples, then use that model to make predictions
   - use model selection to pick an appropriate model type and fully specify its architecture (incl. tuning parameters)
Inference: making predictions on new data.
## Main challenges of ML
"Bad algorithm" and "bad data"
1. Insufficient quantity of training data
   - It is not always easy or cheap to get extra training data
   - More data is generally better
2. Nonrepresentative training data
   - Crucial that the data represents the cases you want to generalize to,
     for both instance-based and model-based learning
   - If the sample is too small => sampling noise
   - If the sample is large but the sampling method is flawed => sampling bias
     (the sampling method is how the data is collected)
3. Poor-quality data
   - Errors, outliers, and noise => clean up the data
   - Clear outliers => discard them or try to fix the errors
   - Instances missing a few features => either ignore the attribute or fill in the values manually
4. Irrelevant features
   - Coming up with a good set of features to train on is called **feature engineering**:
     - feature selection (select the most useful features)
     - feature extraction (combine existing features into a more useful one)
     - creating new features by gathering new data
5. Overfitting the training data
   Overfitting => the model performs well on the training data, but does not generalize well.
   It happens when the model is too complex relative to the amount and noisiness of the training data.
   Solutions:
   - simplify the model: select one with fewer parameters, reduce the number of attributes, or constrain the model.
     Constraining a model is called **regularization**; the result fits the training data less well but generalizes better to new data.
     The amount of regularization is controlled by hyperparameters. A hyperparameter is a parameter of the learning algorithm (not of the model).
   - gather more training data
   - reduce the noise in the training data (fix errors, remove outliers)
6. Underfitting the training data
   - The model is too simple to learn the underlying structure of the data.
   - Solutions:
     - select a more powerful model, with more parameters
     - improve the feature engineering
     - reduce the constraints on the model (e.g. less regularization)
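Regularization, the main overfitting remedy above, can be sketched with a tiny closed-form ridge regression, where a hyperparameter alpha controls how strongly the weights are shrunk (the data, true weights, and alpha value are all made up):

```python
import numpy as np

# Toy regression problem: 5 features, only the first two actually matter
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, 30)

def ridge(X, y, alpha):
    # closed-form ridge solution: (X^T X + alpha*I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = ridge(X, y, alpha=0.0)   # ordinary least squares
w_reg = ridge(X, y, alpha=10.0)    # regularized: constrained model

# The penalty pulls the weight vector toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_plain))  # True
```

Alpha is a hyperparameter in the sense defined above: it is set before training and controls the learning algorithm, not learned from the data itself.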
## Testing and validation
Split the data into a training set and a test set; ratios range from 80-20% to 99-1%, depending on the absolute size of the dataset.
The error rate on the test set is the generalization error.
Low training error but high generalization error => overfitting.
### Hyperparameter tuning and model selection
Holdout validation: hold out part of the training set (the validation/development set) to evaluate several candidate models and select the best one. This usually works well, except when the validation set is too small.
Solution: cross-validation => use many small validation sets and evaluate each model once per validation set (drawback: training time is multiplied by the number of validation sets).
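Cross-validation as described here can be sketched as a manual k-fold loop; for simplicity the "model" is just the training-set mean, and the data is synthetic noise around 5.0:

```python
import numpy as np

# Synthetic data: 100 samples with mean 5.0 and noise variance 1.0
rng = np.random.default_rng(3)
y = rng.normal(5.0, 1.0, 100)

k = 5
folds = np.array_split(rng.permutation(100), k)  # 5 disjoint index sets

errors = []
for i in range(k):
    val_idx = folds[i]                                        # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = y[train_idx].mean()                                # "train" the model
    errors.append(np.mean((y[val_idx] - pred) ** 2))          # validate
print(float(np.mean(errors)))  # close to the noise variance of 1.0
```

Each fold serves as the validation set exactly once, which is why every model must be trained k times and training time is multiplied by the number of folds.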
### Data mismatch
Most important rule: the validation set and test set must be as representative as possible of the data used in production.
Training set => used to train the model
Test set => used for the final evaluation, once you are happy with the dev-set performance
Dev and test set have to come from the SAME distribution (randomly shuffle the data).
1. Define the dev set + metric, then iterate quickly:
   idea -> code -> experiment
| Set | Percentage of data |
| :--- | --: |
| training | 98 |
| dev | 1 |
| test | 1 |
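The shuffle-then-split procedure with the percentages from the table above might look like this (the dataset size is arbitrary):

```python
import numpy as np

# Shuffle first, so train/dev/test all come from the same distribution
rng = np.random.default_rng(4)
n = 10_000
indices = rng.permutation(n)

n_train = int(0.98 * n)  # 98% training
n_dev = int(0.01 * n)    # 1% dev, remainder is test

train_idx = indices[:n_train]
dev_idx = indices[n_train:n_train + n_dev]
test_idx = indices[n_train + n_dev:]
print(len(train_idx), len(dev_idx), len(test_idx))  # 9800 100 100
```

Splitting a pre-shuffled index array (rather than the raw data) keeps the three sets disjoint by construction and makes the split reproducible via the random seed.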