---
title: ML_Landscape
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---

# ML Landscape

## What is ML

The science of programming computers so they can learn from data, instead of relying on explicitly coded rules.

The examples the system learns from are called the **training set**. Each example is called a **training instance**.

Performance is measured with a metric such as **accuracy**.

ML is good at solving problems that:

1. are too complex for traditional approaches
2. have no known algorithm (ML can help find one)
3. involve fluctuating environments: ML systems can adapt to new data
4. require insight into complex problems and large amounts of data => **data mining**

| | Description | Example |
| :--- | :--- | :--- |
| CNN | Convolutional Neural Network | Image classification |
| | Semantic segmentation | Brain scans |
| NLP | Natural Language Processing | News article classification |
| | | Text summarization |
| RNN | Recurrent Neural Network | News article classification |
| NLU | Natural Language Understanding | Chatbot / personal assistant |
| SVM | Support Vector Machine | Forecasting |
| RL | Reinforcement Learning | Game-playing agents (e.g. AlphaGo) |

**Regression models**:

- Linear
- Polynomial
- Random Forest
- when taking the past into account: RNNs, CNNs, Transformers

Dimensionality reduction: simplify the data without losing too much information.

Feature extraction: merge correlated features into a single new feature that represents both.

Anomaly detection: e.g. spotting unusual credit card transactions to prevent fraud.

Novelty detection: detect new instances that look different from all training instances.

Association rule learning: dig into large datasets and discover interesting relations between attributes.

## Types of ML

1. **Supervised Learning**
    - Classification
    - K-Nearest Neighbours
    - Linear/Logistic Regression (both the predictors and their labels are required)
    - SVM
    - Decision Trees
    - Random Forests
    - Neural Networks
2. **Unsupervised Learning**
    - Clustering
        - K-Means
        - DBSCAN
        - HCA (Hierarchical Cluster Analysis)
    - Anomaly detection and novelty detection
        - One-class SVM
        - Isolation Forest
    - Visualization and dimensionality reduction
        - PCA (Principal Component Analysis)
        - Kernel PCA
        - LLE (Locally Linear Embedding)
        - t-SNE (t-Distributed Stochastic Neighbor Embedding)
    - Association rule learning
        - Apriori
        - Eclat
3. **Semi-supervised Learning**
    - partly labeled data
    - mostly a combination of supervised and unsupervised learning
    - DBNs (Deep Belief Networks)
    - RBMs (Restricted Boltzmann Machines)
4. **Reinforcement Learning**
    - The learning system, called an agent, observes the environment, selects and performs actions, and receives rewards or penalties. It learns by itself.

Systems can also be classified by whether or not they can learn incrementally from a stream of incoming data:

1. **Batch Learning**
    - must be trained with all the available data (offline learning)
    - training can take many hours and requires a lot of resources
    - with extremely large datasets or limited resources, training can be impossible
    - to incorporate new data, train the system from scratch and (automatically) deploy the new version
2. **Online Learning**
    - train the system incrementally by feeding it data sequentially, either individually or in mini-batches
    - great when data arrives in a continuous flow
    - a great approach when the system needs to adapt quickly to new data
    - requires fewer resources
    - when the data does not fit in memory (**out-of-core learning**), online learning is the perfect approach
    - important parameter: the **learning rate**
        - high: adapts quickly to changing data, but also forgets quickly
        - low: learns more slowly, but is less sensitive to noise in new data
    - problem: when fed bad data, performance will decline, so the system needs to be monitored (see the sketch below)
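A minimal sketch of online learning, assuming scikit-learn is available (an assumption; these notes don't name a library) and using a synthetic data stream for illustration only: `SGDRegressor.partial_fit` updates the model one mini-batch at a time, and `eta0` sets the learning rate.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)

# eta0 is the learning rate: high adapts quickly but forgets quickly,
# low learns more slowly but is less sensitive to noise in new data
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Simulate a continuous flow of data arriving in mini-batches,
# drawn from y = 3x + 4 plus noise (synthetic, for illustration)
for _ in range(1000):
    X_batch = rng.rand(32, 1)
    y_batch = 3 * X_batch.ravel() + 4 + 0.1 * rng.randn(32)
    model.partial_fit(X_batch, y_batch)  # incremental update, no retraining

print(model.predict([[0.5]]))  # close to 3 * 0.5 + 4 = 5.5
```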
ML systems can also be classified by how they generalize. They need to perform well on new data!

1. **Instance-based learning**
    - learn the examples by heart, then generalize to new cases by using a similarity measure to compare them to the learned examples
2. **Model-based learning**
    - build a model of the examples, then use the model to make predictions
    - use model selection to pick an appropriate model and fully specify its architecture (including tuning its parameters)

(A sketch contrasting these two approaches appears at the end of this note.)

Inference: making predictions on new data.

## Main challenges of ML

"Bad algorithm" and "bad data":

1. Insufficient quantity of training data
    - it is not always easy and/or cheap to get extra training data
    - more data is better
2. Nonrepresentative training data
    - crucial that the data represents the cases you want to generalize to, for both instance-based and model-based learning
    - if the sample is too small => **sampling noise**
    - if the sample is large but the sampling method (how the data is collected) is flawed => **sampling bias**
3. Poor-quality data
    - errors, outliers and noise => clean up the data
    - clear outliers => discard them or try to fix the errors
    - instances missing a few features => either ignore the attribute or fill in the values manually
4. Irrelevant features
    - come up with a good set of features to train on => **feature engineering**:
        - feature selection (select the most useful features)
        - feature extraction (combine existing features into a more useful one)
        - creating new features by gathering new data
5. Overfitting the training data
    - overfitting => the model performs well on the training data but does not generalize well
    - happens when the model is too complex relative to the amount and noisiness of the training data
    - solutions:
        - simplify the model: select one with fewer parameters, reduce the number of attributes, or constrain the model. Constraining a model => **regularization**. Result: it fits the training data less closely but generalizes better to new data. The amount of regularization is controlled by hyperparameters. A hyperparameter is a parameter of the learning algorithm (not of the model).
        - gather more training data
        - reduce the noise in the training data (fix errors, remove outliers)
6. Underfitting the training data
    - the model is too simple to learn the structure of the data
    - solutions:
        - select a more powerful model with more parameters
        - improve the feature engineering
        - reduce the constraints on the model (e.g. lower the regularization)

## Testing and validation

Split the data into a training set and a test set; the ratio ranges from 80/20 to 99/1, depending on the absolute size of the dataset.

The error rate on the test set => **generalization error**.

Training error low but generalization error high => overfitting.

### Hyperparameter tuning and model selection

Holdout validation: hold out part of the training set (the validation/development set) to evaluate several candidate models and select the best one. This usually works very well, except when the validation set is too small.

Solution: cross-validation => use many small validation sets and evaluate each model once per validation set (drawback: training time is multiplied by the number of validation sets). The second sketch at the end of this note demonstrates this.

### Data Mismatch

Most important rule: the validation set and the test set must be as representative as possible of the data that will be used in production.

Training set => used to train the model.

Test set => used to evaluate only once you are happy with performance on the dev set.

The dev and test sets have to come from the SAME distribution (randomly shuffle the data).

First define the dev set and a metric, then quickly iterate: idea -> code -> experiment.

| Set | Percentage of data |
| :--- | ---: |
| training | 98 |
| dev | 1 |
| test | 1 |
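A minimal sketch contrasting instance-based and model-based learning, again assuming scikit-learn and synthetic data: `KNeighborsRegressor` is instance-based (it stores the training instances and predicts by similarity), while `LinearRegression` is model-based (it summarizes the data as fitted parameters).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: y = 2x plus noise (illustration only)
rng = np.random.RandomState(1)
X = rng.rand(50, 1)
y = 2 * X.ravel() + 0.1 * rng.randn(50)

# Instance-based: stores the training set, predicts by averaging
# the 3 most similar training instances
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: fits parameters (slope, intercept), then the
# training set is no longer needed to predict
lin = LinearRegression().fit(X, y)

X_new = [[0.5]]  # inference: predict on new data
print(knn.predict(X_new), lin.predict(X_new))
```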
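A minimal sketch tying the overfitting and validation sections together, assuming scikit-learn and synthetic data: hold out a test set, tune the regularization hyperparameter `alpha` of a `Ridge` model with cross-validation on the training set, and estimate the generalization error on the untouched test set only once.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(200)

# Hold out a test set; touch it only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Cross-validation on the training set: k small validation sets instead
# of one holdout set (drawback: training time is multiplied by k=5 here).
# alpha is a hyperparameter: a parameter of the learning algorithm,
# not of the model itself.
best_alpha, best_score = None, -np.inf
for alpha in (0.01, 1.0, 100.0):
    score = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5).mean()
    print(f"alpha={alpha}: mean validation R^2 = {score:.3f}")
    if score > best_score:
        best_alpha, best_score = alpha, score

# Estimate the generalization error on the untouched test set
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test R^2:", final.score(X_test, y_test))
```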