Init rest

This commit is contained in:
John 2022-08-09 21:04:44 +02:00
parent 08387ad8a4
commit 37c133aaba
476 changed files with 16195 additions and 23 deletions


@ -0,0 +1,67 @@
---
title: Ethics
updated: 2022-04-03 11:38:26Z
created: 2021-05-04 14:58:11Z
---
# Why is AI ethics becoming a problem now?
Machine learning (ML) through neural networks is advancing rapidly for three reasons:
1. Huge increase in the size of data sets;
2. Huge increase in computing power;
3. Huge improvement in ML algorithms and more human talent to write them.
All three of these trends centralize power, and “With great power comes great responsibility.”
## Sixteen Challenges
1. **Technical Safety**
Will AI systems work as they are promised or will they fail? If and when they fail, what will be the results of those failures? And if we are dependent upon them, will we be able to survive without them?
Terms and contracts might legally reduce a manufacturer's responsibility => unethical scheme to avoid legitimate responsibility.
2. **Transparency and Privacy**
Once we have determined that the technology functions adequately, can we actually understand how it works and properly gather data on its functioning? Ethical analysis always depends on getting the facts first—only then can evaluation begin.
It turns out that with some machine learning techniques such as deep learning in neural networks it can be difficult or impossible to really understand why the machine is making the choices that it makes. In other cases, it might be that the machine can explain something, but the explanation is too complex for humans to understand.
As an additional point, in general, the more powerful someone or something is, the more transparent it ought to be, while the weaker someone is, the more right to privacy he or she should have. Therefore the idea that powerful AIs might be intrinsically opaque is disconcerting.
3. **Beneficial Use & Capacity for Good**
The main purpose of AI is to help people lead longer, more flourishing, more fulfilling lives.
4. **Malicious Use & Capacity for Evil**
Artificial intelligence, like human intelligence, will be used maliciously; there is no doubt.
Competition and war are always primary drivers of technological advance, and militaries and corporations are working on these technologies right now. The resulting harm is not always intended. Because of this, forbidding, banning, and relinquishing certain types of technology would be the most prudent solution.
5. **Bias in Data, Training Sets, etc.**
One of the interesting things about neural networks is that they effectively merge a computer program with the data that is given to it. This has many benefits but also potential for harm.
Algorithmic bias is one of the major concerns in AI right now and will remain so in the future unless we endeavor to make our technological products better than we are.
6. **Unemployment / Lack of Purpose & Meaning**
Automation of industry has been a major contributing factor in job losses since the beginning of the industrial revolution. AI will simply extend this trend to more fields, e.g. law, medicine, and education.
Attached to the concern for employment is the concern for how humanity spends its time and what makes a life well-spent.
7. **Growing Socio-Economic Inequality**
Related to the unemployment problem is the question of how people will survive if unemployment rises to very high levels.
Universal basic income (UBI) => major restructuring of national economies (political)
8. **Environmental Effects**
Machine learning models require enormous amounts of energy to train ==> fossil fuels. AI is in some very basic ways a technology focused on efficiency, and energy efficiency is one way that its capabilities can be directed.
9. **Automating Ethics**
One strength of AI is that it can automate decision-making. Automation of decision-making will present huge problems for society, because if these automated decisions are good, society will benefit, but if they are bad, society will be harmed. => ethical standards needed.
The ethical decision-making process might be as simple as following a program to fairly distribute a benefit, wherein the decision is made by humans and executed by algorithms, but it also might entail much more detailed ethical analysis, even if we humans would prefer that it did not. This is because AI will operate so much faster than humans can that under some circumstances humans will be left “out of the loop” of control due to human slowness. This already occurs with cyberattacks and high-frequency trading (both of which are filled with ethical questions which are typically ignored), and it will only get worse as AI expands its role in society.
Since AI can be so powerful, the ethical standards we give to it had better be good.
10. **Moral Deskilling & Debility**
If we turn over our decision-making capacities to machines, we will become less experienced at making decisions. AI will either assist or replace humans in making certain types of decisions => humans may become worse at these skills.
11. **AI Consciousness, Personhood, and “Robot Rights”**
Some thinkers have wondered whether AIs might eventually become self-conscious, attain their own volition, or otherwise deserve recognition as persons like ourselves.
Legally => personhood has been given to corporations and (in other countries) rivers, so there is certainly no need for consciousness even before legal questions may arise.
Morally => perhaps someday they will be such good imitations.
12. **AGI and Superintelligence**
When AI reaches human levels of intelligence (AGI = Artificial General Intelligence), it could potentially become vastly more clever and capable than we are.
There is no reason why the improvement of AI would stop at AGI => more hardware to do more and faster. Dethroning of humanity as the most intelligent thing on Earth.
13. **Dependency on AI**
Our technological dependency is almost what defines us as a species. => complex and fragile
Intelligence dependence is a form of dependence like that of a child on an adult. This raises the question of what an infantilized human race would do if our AI parents ever malfunctioned. Without that AI, if dependent on it, we could become like lost children not knowing how to take care of ourselves or our technological society.
14. **AI-powered Addiction**
AI can exploit numerous human desires and weaknesses including purpose-seeking, gambling, greed, libido, violence, and so on. Addiction enslaves us and wastes our time when we could be doing something worthwhile. It is not the AIs that choose to treat people this way, it is other people.
15. **Isolation and Loneliness**
One might think that “social” media, smartphones, and AI could help, but in fact they are major causes of loneliness since people are facing screens instead of each other. What does help are strong in-person relationships, precisely the relationships that are being pushed out by addictive (often AI-powered) technology. Loneliness can be helped by dropping devices and building quality in-person relationships. In other words: caring.
16. **Effects on the Human Spirit**
By externalizing our intelligence and improving it beyond human intelligence, are we making ourselves second-class beings to our own creations?
[Artificial Intelligence and Ethics: Sixteen Challenges and Opportunities](https://www.scu.edu/ethics/all-about-ethics/artificial-intelligence-and-ethics-sixteen-challenges-and-opportunities/)


@ -0,0 +1,155 @@
---
title: DataStreaming
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# **Data Streaming**
Data streaming is optimal for time series and detecting patterns over time.
Things like traffic sensors, health sensors, transaction logs, and activity logs are all good candidates for data streaming.
## Data Streaming Challenges
- Plan for scalability.
- Plan for data durability.
- Incorporate fault tolerance in both the storage and processing layers.
## Data Streaming Tools (popular)
- [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) Managed, scalable, cloud-based service
- [Apache Kafka](https://kafka.apache.org/) Distributed publish-subscribe messaging system, integrates applications and data streams
- [Apache Flink](https://flink.apache.org/) Distributed computation over data streams
- [Apache Storm](https://storm.apache.org/) High data velocity.
# **Devoxx Apache Kafka Presentation of James Ward**
Integration Complexity:
- No system of record; unknown where the data originally came from
- Synchronisation is hard
- Scaling ETL is hard; horizontal scaling is best, but fails often
- Processing is error-prone: parsing, missing data ...
### Integration architecture
Events, not tables
Streams as ledger (records all events and can go back in time)
First-class partitioning (scale horizontally)
### Why not a messaging system?
- Is ordering guaranteed?
- Horizontal scaling
- Push? Back pressure? Difference in speed, i.e. one source is faster than another.
The Reactive Streams specification defines a model for **back pressure**. The elements in the Streams are produced by the producer at one end, and the elements are consumed by the consumer at the other end. The most favorable condition is where the rate at which the elements are produced and consumed is the same. But, in certain situations, the elements are emitted at a higher rate than they are consumed by the consumer. This scenario leads to the growing backlog of unconsumed elements. The more the backlog grows, the faster the application fails. Is it possible to stop the failure? And if yes, how to stop this failure? We can certainly stop the application from failing. One of the ways to do this is to communicate with the source ... etc
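The back-pressure idea described above can be sketched in a few lines. This is a simplified illustration, not the Reactive Streams API: it assumes a bounded in-memory buffer whose blocking `put` slows the producer down, whereas Reactive Streams signals demand explicitly. The names `run_pipeline`, `n_items` and `buffer_size` are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(n_items: int, buffer_size: int) -> tuple[list[int], int]:
    """Push n_items through a bounded buffer; return consumed items and peak backlog."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)  # bounded => back pressure
    consumed: list[int] = []
    peak = 0

    def producer() -> None:
        nonlocal peak
        for i in range(n_items):
            buffer.put(i)  # blocks while the buffer is full, slowing the producer
            peak = max(peak, buffer.qsize())
        buffer.put(None)  # sentinel: no more items

    def consumer() -> None:
        while (item := buffer.get()) is not None:
            time.sleep(0.001)  # simulate a consumer that is slower than the producer
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, peak


items, peak_backlog = run_pipeline(n_items=50, buffer_size=5)
print(len(items), peak_backlog)
```

Because the buffer is bounded, the backlog can never exceed `buffer_size`, no matter how much faster the producer is.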
### **Kafka = Event Ledger + Distributed & Redundant**
![Kafka](images/kafka1.png)
### **Kafka: Linear scaling** (near network speed between nodes/brokers)
Kafka Fundamentals:
![Kafka Fundamentals](images/kafka2.png)
- Messaging system semantics
- Clustering is core (horizontal scaling)
- Durability & ordering guarantees (no events are lost and they stay in the right order)
Use cases:
![Use Cases](images/kafka3.png)
- modern ETL/CDC
(*change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.*)
- Data Pipelines
![Data Pipeline example](images/kafka4.png)
- Big Data Ingest
**Records/events**
![Records/events](images/kafka5.png)
- Record: key, value, timestamp
- Immutable
- Append only to the ledger
- Persisted to disk (all on disk)
AKA: a log
## Producers & Consumers
![Producers and Consumers](images/kafka6.png)
* Broker = Node in the cluster
* Producer writes records to a broker
* Consumer reads records from a broker (is asking for records)
* Leader/follower for cluster distribution
## Topics & Partitions
![Topics & Partitions](images/kafka7.png)
* Topic = Logical name with 1..n partitions
* Partitions are replicated
* Ordering is guaranteed for a partition
# Offset
![Offset](images/kafka8.png)
The offset is the way Kafka keeps track of the ordering
* Unique sequential ID (per partition)
* Consumers track offsets (give me my offset)
* Benefits: replay, different speed consumers, etc
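A toy model of one partition may help to see why offsets enable replay and consumers of different speeds. This is only a sketch under the assumptions of the notes above (immutable records with key, value and timestamp, appended to a log); `Record` and `PartitionLog` are hypothetical names, not Kafka classes.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen => records are immutable
class Record:
    key: str
    value: str
    timestamp: float


@dataclass
class PartitionLog:
    _records: list[Record] = field(default_factory=list)

    def append(self, key: str, value: str) -> int:
        """Append a record at the end of the log and return its offset."""
        self._records.append(Record(key, value, time.time()))
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list[Record]:
        """Return up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for i in range(5):
    log.append("sensor-1", f"reading-{i}")

# Each consumer keeps its own offset, so they can read at different speeds.
fast_offset = 0
fast_offset += len(log.read(fast_offset, max_records=5))
slow_offset = 0
slow_offset += len(log.read(slow_offset, max_records=2))

# Replay: start again from offset 0; the log itself is never modified.
replayed = [r.value for r in log.read(0, max_records=5)]
print(fast_offset, slow_offset, replayed)
```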
# Producer offset
![Producer Offset](images/kafka9.png)
Send message to a given partition
* Writes always go to the leader of a partition
* Partitioning can be done manually or based on a key
* Replication factor is set per topic
* Auto-rebalancing is arranged by Kafka
* Followers (hot standby) take over when leader nodes go down. A new leader is selected from the followers
* The whole cluster uses ZooKeeper to keep track of all nodes
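Key-based partitioning from the list above can be illustrated as follows. The sketch assumes a simple "hash of the key modulo number of partitions" rule and uses CRC32 as a stand-in hash; it is not the hash function of the Kafka client. `choose_partition` is a hypothetical helper.

```python
import zlib
from collections import defaultdict


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]

partitions: dict[int, list[tuple[str, str]]] = defaultdict(list)
for key, value in events:
    partitions[choose_partition(key, 3)].append((key, value))

# All events of user-1 sit in one partition, in the order they were produced.
user1_partition = choose_partition("user-1", 3)
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]
print(user1_events)
```

Since ordering is guaranteed per partition, routing by key is what keeps the events of a single key in order.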
# Consumer groups
![Consumer groups](images/kafka10.png)
* Logical name for 1 or more consumers
* message consumption is load balanced across all consumers in a group
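Load balancing within a consumer group can be pictured as dividing the partitions of a topic among the group members, each partition being read by one member. The round-robin rule below is an assumption made for illustration; Kafka offers several assignment strategies and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over the consumers of one group, round-robin."""
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


two_consumers = assign_partitions(list(range(6)), ["c1", "c2"])
# A third consumer joins the group: the partitions are rebalanced.
three_consumers = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
print(two_consumers)
print(three_consumers)
```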
# Delivery Guarantees
* **Producers:**
* Async (no guarantee Kafka has recorded the message; best performance)
* Committed to leader
* depends on node saturation
* depends on disk speed
* Committed to leader and quorum (part of the followers)
* very sure not to lose data
* depends on latency between nodes, i.e. network latency
* **Consumer:**
* at-least-once (default)
* Kafka delivers a block of records requested by the consumer. Begin and end offset are committed. If sending fails, only the start was committed and Kafka redelivers all messages.
* Kafka delivers and waits until the consumer commits the delivery
* at-most-once: the offsets are committed right at the beginning, before all messages are received. If it fails halfway, the remaining messages are lost.
* effectively-once (at least one delivery)
* exactly-once (maybe; very difficult/impossible)
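The difference between at-most-once and at-least-once comes down to when the offset is committed relative to processing. The simulation below is a sketch with a hypothetical `run` function and one simulated crash; it does not use a Kafka client.

```python
def run(records: list[str], commit_before_processing: bool) -> list[str]:
    """Consume records with one simulated crash while handling the third record."""
    committed = 0  # offset stored on the (simulated) broker
    processed: list[str] = []
    crashed_once = False
    while committed < len(records):
        batch = records[committed:]
        if commit_before_processing:
            committed = len(records)  # at-most-once: commit first
        try:
            for i, record in enumerate(batch):
                if not crashed_once and i == 2:
                    crashed_once = True
                    raise RuntimeError("consumer crashed")
                processed.append(record)
            committed = len(records)  # at-least-once: commit after processing
        except RuntimeError:
            continue  # restart and resume from the committed offset
    return processed


at_most_once = run(list("abcd"), commit_before_processing=True)
at_least_once = run(list("abcd"), commit_before_processing=False)
print(at_most_once)   # records after the crash are lost
print(at_least_once)  # records before the crash are processed twice
```

With at-least-once the consumer has to tolerate duplicates, which is why idempotent processing is often described as "effectively-once".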
# Cool features of Kafka
* Log compaction option (removes duplicates); group-by based on key.
* Trade-off: disk reads/writes; background process
* Gaps in offset IDs. Kafka handles this.
* Disk, not heap
* Kafka uses the disk cache
* Page cache to socket: moves a piece of memory directly through the network, bypassing Kafka; useful for replication
* Balanced partitions & leaders (ZooKeeper)
* Producer and consumer quotas: to avoid fully saturated nodes
* Heroku Kafka
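Log compaction from the list above can be sketched as "keep only the latest record per key". The function below is an illustrative assumption about the outcome, not Kafka's background compaction process; `compact` and the tuple layout `(offset, key, value)` are hypothetical.

```python
def compact(log: list[tuple[int, str, str]]) -> list[tuple[int, str, str]]:
    """Keep only the latest record per key, preserving the original offsets."""
    latest: dict[str, tuple[int, str, str]] = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values())  # back in offset order


log = [
    (0, "a", "1"),
    (1, "b", "1"),
    (2, "a", "2"),
    (3, "c", "1"),
    (4, "b", "2"),
]
compacted = compact(log)
print(compacted)
```

The surviving records keep their original offsets, so the compacted log has gaps in its offset IDs (offsets 0 and 1 are gone here).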
# Clients
* JVM is official
* most other platforms via community
* polling based on consumption side
# Akka Streams
![AKKA](images/kafka11.png)
* Implementation of Reactive streams
* Source/Sink Stream programming
* back-pressure etc
* [kafka-adapter](https://github.com/akka/alpakka-kafka)
* [Code examples](https://github.com/jamesward/koober)
Sources:
https://dzone.com/articles/what-is-data-streaming
https://www.youtube.com/watch?v=UEg40Te8pnE DEVOXX KAFKA
https://en.wikipedia.org/wiki/Change_data_capture


@ -0,0 +1,82 @@
---
title: ELTvsETL
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# **ETL**
- Extract (from source to staging area)
- Transform (reformatted for data warehouse purposes)
- Load (from staging area into the data warehouse)
![ETL process](images/ETL.png)
## ETL == Pipeline approach
Many tools are available: DataStage, Informatica, or SQL Server Integration Services (SSIS) ==> mostly for transforming
All work in a similar way: read from source, perform changes, write to target
ETL steps can be performed multiple times for a particular load.
The transformation step can add business logic
Also, everything is done in one single step.
An ETL process should have data flowing steadily through it. There is a risk of running out of memory and/or disk space (sorting is a classic example: it holds the entire dataset; if the data is already sorted, it can go straight into the data warehouse). The pipeline should offer the possibility of buffering.
Many ETL tools facilitate parallel execution == multiple pipelines.
ETL can perform better, but needs more training and development cost.
When ETL is used:
- source and target differ in data types
- volumes of data are small
- transformations are compute-intensive
- data is structured
# **ELT**
- Extract
- Load
- Transform (so data is transformed **AFTER** loading)
Target system is performing the transformation.
![ELT process](images/ELT.png)
## ELT == NO transformation engine
When ELT is used:
- source and target share the same data types (e.g. DB2 source and target)
- large volumes
- target database is able to handle large data volumes
- data is unstructured
ETL or ELT? It depends on priorities.
ELT requires a powerful system in place as target. More such systems are available because of analytics.
E.g. Hadoop is a perfect platform, but it needs careful planning
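The difference in where the transformation runs can be shown with a small sketch. It assumes an in-memory SQLite database as a stand-in for the target system and made-up sample rows; `etl`, `elt`, `raw_rows` and the table names are hypothetical.

```python
import sqlite3

raw_rows = [("alice", "1200.50"), ("bob", "n/a"), ("carol", "980")]


def etl(rows: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """ETL: transform in the pipeline, then load only the clean result."""
    clean = [
        (name.title(), float(amount))
        for name, amount in rows
        if amount.replace(".", "", 1).isdigit()  # drop non-numeric amounts
    ]
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (name TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    return db.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()


def elt(rows: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """ELT: load the raw data first, then let the target do the transformation."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    db.execute(
        """
        CREATE TABLE sales AS
        SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name,
               CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE amount GLOB '[0-9]*'
        """
    )
    return db.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()


print(etl(raw_rows))
print(elt(raw_rows))
```

Both routes end with the same `sales` table; with ELT the raw data also stays available in the target, so it can be transformed again later in a different way.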
| | ETL | ELT |
| --------------- | ------------------------------------------ | --------------------------------------------- |
| **Maturity** | available for 20 years | not as widely adopted |
| | expertise | does not work well with structured data |
| **Flexibility** | Older ETL not suited for unstructured data | can handle structured and unstructured data |
| | Remap data => reload all previous data! | Data in the target is more flexible |
| **Hardware** | mostly own engine hardware | takes compute power from existing hardware |
| | modern ETL tools run in cloud | |
| **Better for** | - structured data | - Unstructured data |
| | - Smaller volumes and complex computations | - large volumes and less complex computations |
| | - On-premise relational databases | - Cloud environment |
| | | - Data lake |
## **Merge ETL and ELT approach**
Extract, Transform, Load, Transform
## **Tools**
1. https://www.getdbt.com/
2. https://fivetran.com/
3. https://www.stitchdata.com/
Sources:
https://dzone.com/articles/etl-vs-elt-differences-explained


@ -0,0 +1,177 @@
---
title: ML_Landscape
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# ML Landscape
## What is ML
The science of programming computers so they can learn from data, instead of following explicitly coded rules
The data examples are called the **training set**. Each training example is called a **training instance**
Performance is measured with a metric such as **accuracy**
ML is good at solving problems that:
1. are too complex for traditional approaches
2. have no known algorithm (ML can help find one)
3. involve fluctuating environments: ML systems can adapt to new data
4. require insight into complex and large amounts of data => data mining
| | Description | Example |
| :---| :---- | :---- |
| CNN | Convolutional Neural Network | Image Classification |
| | Semantic Segmentation | Brain scans |
| NLP | Natural Language Processing | News articles classification |
| | | Text Summary |
| RNN | Recurrent Neural Network | News articles classification |
| NLU | Natural Language Understanding | Chatbot/personal assistant |
| SVM | Support Vector Machine | Forecasting |
| RL | Reinforcement Learning | |
**Regression model**:
- Linear
- Polynomial
- Random Forest
- when the past must be taken into account: RNN, CNN, Transformers
Dimensionality reduction: simplify the data without losing too much information.
Feature extraction: merge correlated features into one new feature that represents them.
Anomaly detection: e.g. detect unusual credit card transactions to prevent fraud.
Novelty detection: detect new instances that look different from all training instances.
Association rule learning: dig into large datasets and discover interesting relations between attributes.
## Types of ML
1. **Supervised Learning**
- Classification
- K-Nearest Neighbours
- Linear/Logistic regression (both the predictors and their labels are required)
- SVM
- Decision Tree
- Random Forests
- Neural Networks
2. **Unsupervised Learning**
- Clustering
- K-Means
- DBSCAN
- HCA (Hierarchical Cluster Analysis)
- Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
- Visualization and dimensionality reduction
- PCA (Principal Component Analysis)
- Kernel PCA
- LLE (Locally Linear Embedding)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Association rule learning
- Apriori
- Eclat
3. **Semi-supervised Learning**
- partly labeled
- mostly combination of supervised and unsupervised learning
- DBN (Deep Belief Networks)
- RBM (Restricted Boltzmann Machines)
4. **Reinforcement Learning**
- An agent observes the environment, selects and performs actions, and gets a reward or penalty. It learns by itself.
Whether or not the system can learn incrementally from a stream of incoming data:
1. **Batch Learning**
- must train with all available data (offline learning)
- can take many hours to train and requires resources
- with extremely large datasets or limited resources, training can be impossible
- for new data, train the system from scratch and (automatically) deploy it
2. **Online learning**
- train the system incrementally, by feeding it data sequentially.
- either individually or in mini-batches
- great when data arrives in a continuous flow
- great approach when the system needs to adapt quickly to new data.
- requires fewer resources
- when data does not fit in memory => out-of-core learning
then online learning is the perfect approach.
- important parameter: learning rate
- High: adapts quickly to changing data, but also forgets quickly.
- Low: learns more slowly, but is less sensitive to new data.
- Problem when fed with bad data => performance will decline.
Need to monitor the system.
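The effect of the learning rate in online learning can be made concrete with a tiny sketch. It assumes a one-parameter model `y = w * x` updated one instance at a time with a gradient step; `online_fit` and the synthetic stream are hypothetical.

```python
def online_fit(stream: list[tuple[float, float]], learning_rate: float, w: float = 0.0) -> float:
    """Update the weight of y = w * x one training instance at a time."""
    for x, y in stream:
        error = w * x - y
        w -= learning_rate * error * x  # gradient step on the squared error
    return w


# The underlying relation changes from y = 2x to y = 5x near the end of the stream.
stream = [(1.0, 2.0)] * 20 + [(1.0, 5.0)] * 5

w_high = online_fit(stream, learning_rate=0.5)
w_low = online_fit(stream, learning_rate=0.05)
print(round(w_high, 3), round(w_low, 3))
```

With the high learning rate the weight ends close to the new slope of 5 and has largely forgotten the old one; with the low learning rate it has barely moved towards the new data.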
How ML systems generalize: they need to perform well on new data!
1. **Instance-based learning**
- learn the examples by heart, then generalize to new cases by using similarity measures to compare.
2. **Model-based learning**
- build a model of the examples and then use the model to make predictions
- Use model selection to select an appropriate model and fully specify its architecture (incl. tuning parameters)
Inference: make predictions on new data.
## Main challenges of ML
"Bad algorithm" and "bad data"
1. Insufficient quantity of training data
- Not always easy and/or cheap to get extra training data
- More data is better.
2. Nonrepresentative training data
- It is crucial that the data represents the cases to generalize to.
This holds for both instance-based and model-based learning
- If the sample is too small => sampling noise
- If the sample is large but the sampling method is flawed => sampling bias
Sampling method: how the data is collected
3. Poor-quality data
- Errors, outliers, noise => clean up the data
- Clear outliers => discard them or try to fix the errors
- Instances missing a few features => either ignore the attribute or fill in the values manually
4. Irrelevant Features
- Come up with a good set of features for training => feature engineering:
- feature selection (select most useful)
- feature extraction (combine existing features to more useful one)
- creating new features by gathering new data
5. Overfitting the training data
Overfitting => the model performs well on the training data, but does not generalize well.
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
Overfitting solutions:
- simplify the model by selecting a model with fewer parameters, reducing the number of attributes, or constraining the model.
Constraining the model => **regularization**. Result: the model fits the training data less closely but generalizes better to new data.
The amount of regularization is controlled by hyperparameters. A hyperparameter is a parameter of the learning algorithm (not of the model)
- gather more training data
- reduce the noise in the training data (fix errors, remove outliers)
6. Underfitting the training data
- Model is too simple to learn the structure of the data.
- Solutions:
- select more powerful model, with more parameters
- improve feature engineering tasks
- reduce the constraints on the model (e.g. regularization)
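Regularization as a constraint on the model, controlled by a hyperparameter, can be illustrated with a one-parameter ridge regression. This is a minimal sketch: the L2 penalty is just one common form of regularization, and `ridge_slope`, `alpha` and the data points are assumptions for illustration.

```python
def ridge_slope(xs: list[float], ys: list[float], alpha: float) -> float:
    """Slope w of y = w * x that minimises sum((w*x - y)^2) + alpha * w^2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)  # alpha > 0 shrinks the slope towards 0


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

unregularized = ridge_slope(xs, ys, alpha=0.0)
regularized = ridge_slope(xs, ys, alpha=10.0)
print(unregularized, regularized)
```

Here `alpha` is the hyperparameter: it belongs to the learning algorithm, not to the model, and a larger value constrains the model more (risking underfitting if it is set too high).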
## Testing and validation
Training set and test set relate to each other as 80-20% up to 99-1%, depending on the absolute size of the total data set.
Error rate on the test set => generalization error
Training error low, but generalization error high => overfitting
### Hyperparameter tuning and model selection
Holdout validation: keep part of the training set (= validation/development set) to validate several candidate models and select the best. This mostly works very well, except when the validation set is too small.
Solution: cross-validation => use many small validation sets and validate each model on every validation set (drawback: training time is multiplied by the number of validation sets).
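The cross-validation splits described above can be generated as follows. This is a sketch of plain k-fold splitting without shuffling; `k_fold_indices` is a hypothetical helper, and libraries such as scikit-learn provide ready-made equivalents.

```python
from collections.abc import Iterator


def k_fold_indices(n_samples: int, k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, validation_indices); each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each candidate model is trained once per fold, which is where the multiplied training time comes from.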
### Data Mismatch
Most important rule: the validation set and test set must be as representative as possible of the data used in production
Training set => to train the model
Test set => evaluate once you are happy with the results on the dev set
Dev + test set have to come from the SAME distribution (randomly shuffle the data)
1. Define the dev set + metric. Iterate quickly:
idea -> code -> experiment
| Set | percentage of data |
| :--- | --: |
| training | 98 |
| dev | 1 |
| test |1 |
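The 98/1/1 split in the table can be sketched like this, assuming the data is shuffled first so that dev and test set come from the same distribution. `split_dataset` and its parameters are hypothetical names.

```python
import random


def split_dataset(data: list, dev_frac: float = 0.01, test_frac: float = 0.01, seed: int = 0):
    """Shuffle the data and split it into training, dev and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed => reproducible split
    n_dev = round(len(items) * dev_frac)
    n_test = round(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_dataset(list(range(10_000)))
print(len(train), len(dev), len(test))
```

Percentages this small only make sense for large data sets; with 10,000 instances the dev and test set hold just 100 instances each.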


@ -0,0 +1,87 @@
---
title: Statistics
updated: 2022-04-02 15:10:58Z
created: 2021-05-04 14:58:11Z
---
# Statistics
## Data Types
- **Categorical**
- **Nominal Variables**
No intrinsic order of the labels:
Country of birth (Argentina, England, Germany)
Postcode
Vehicle make (Citroen, Peugeot, ...)
- **Ordinal Variables**
Categories that can be meaningfully ordered are called ordinal:
Student's grade in an exam (A, B, C or Fail)
Days of the week (Monday = 1 and Sunday = 7)
- **Numerical**
- **Discrete**
how many cards in a game?
integers
- **Continuous**
height of a room
floating point numbers
What are proportions?
A proportion is an aggregation of nominal data that provides a numerical figure, e.g. a percentage of nominal variables.
## Mixed Variables
- Observations show either numbers or categories among their values
- Number of credit accounts (1-100, U, T, M) U = unknown, T = unverified, M = unmatched)
- Observations show both numbers and categories in their values
- Cabin (Titanic) (A15, B18, ...)
## Distributions
![48751b057b60e03ec51f64e3235fa1b3.png](../../_resources/48751b057b60e03ec51f64e3235fa1b3.png)
Selecting a value in the middle of the x-axis has a higher probability than selecting a rarer value towards the edges.
Bell curve of Normal Distribution
![be8b17237548f72ecd8013f80df036dc.png](../../_resources/be8b17237548f72ecd8013f80df036dc.png)
Bimodal distribution
![b7f8b2f785a9637ea5a22abe2877bca5.png](../../_resources/b7f8b2f785a9637ea5a22abe2877bca5.png)
Skewed distribution
Sample Distribution
![34262aff59c5f5dd9a413b2b3d74629a.png](../../_resources/34262aff59c5f5dd9a413b2b3d74629a.png)
$$
\overline{X} = \text{sample mean, with some variance around it}
$$
## Sampling and Estimation
e.g.
the number of successes divided by the sample size gives an estimate: 3 / 10 = 0.3
$$
\Theta = \text{estimate, with some variance around it, to make a good guess from the sample}
$$
![846c953521751f708bd680556dc9ae0b.png](../../_resources/846c953521751f708bd680556dc9ae0b.png)
So given a sample, we have 95% confidence that the interval around our sample estimate contains the true value.
The less sure we are of this theta, the larger the confidence interval, e.g. because n is much smaller.
![d575f021de579d10e3855c763198e7bc.png](../../_resources/d575f021de579d10e3855c763198e7bc.png)
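How the sample size drives the width of the confidence interval can be checked with a short calculation. The sketch assumes the estimate is a proportion and uses the normal approximation with z = 1.96 for 95%; `proportion_ci` is a hypothetical helper and other interval formulas exist.

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


small_sample = proportion_ci(successes=3, n=10)
large_sample = proportion_ci(successes=300, n=1000)
print(small_sample)
print(large_sample)
```

Both samples give the same estimate of 0.3, but the interval for n = 10 is about ten times as wide as the one for n = 1000.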
## Hypothesis Testing
![981ea34418a595b422aab0b0df23f4b6.png](../../_resources/981ea34418a595b422aab0b0df23f4b6.png)
In hypothesis testing we never:
- prove anything
- accept the null hypothesis
## P-values
Consider a null hypothesis:
A hypothesis test assesses whether our sample is extreme enough to reject the null.
The p-value then measures how extreme our sample is.
![6af1399567c87fcda04a6414efbe18bf.png](../../_resources/6af1399567c87fcda04a6414efbe18bf.png)
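As a worked example of "how extreme is our sample", the sketch below computes a one-sided p-value for a coin-flip experiment. The null hypothesis of a fair coin, the exact binomial test and the name `binomial_p_value` are assumptions chosen for illustration.

```python
from math import comb


def binomial_p_value(successes: int, n: int, p_null: float = 0.5) -> float:
    """One-sided p-value: probability of at least this many successes under the null."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(successes, n + 1)
    )


p_extreme = binomial_p_value(successes=9, n=10)   # 9 heads out of 10 flips
p_ordinary = binomial_p_value(successes=6, n=10)  # 6 heads out of 10 flips
print(p_extreme, p_ordinary)
```

At the usual 0.05 threshold, 9 heads out of 10 is extreme enough to reject the null, while 6 out of 10 is not; in neither case is anything proven.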
## P-hacking
![ace369b638966681b9558c42e25dd0b4.png](../../_resources/ace369b638966681b9558c42e25dd0b4.png)


@ -0,0 +1,41 @@
---
title: Installation
updated: 2022-04-03 11:39:19Z
created: 2021-10-28 18:57:40Z
latitude: 52.38660000
longitude: 5.27820000
altitude: 0.0000
---
# Installation (no longer needed, see conda install)
[Installation cuda](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation)
[Download page](https://developer.nvidia.com/cuda-downloads)
### Installation (possible different version)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
sudo apt-get install cuda-compat-11-5
[Install cuDNN](https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-824/install-guide/index.html#install-linux)
[Download page](https://developer.nvidia.com/rdp/cudnn-download)
### Unzip the cuDNN package:
tar -xzvf cudnn-x.x-linux-x64-v8.x.x.x.tgz
### Copy the following files into the CUDA Toolkit directory:
$ sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
[Post actions](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions)
# environment variables add to .bashrc
export LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=/usr/local/cuda-11.5/bin${PATH:+:${PATH}}


@ -0,0 +1,39 @@
---
title: testingGPU
updated: 2022-04-03 09:01:13Z
created: 2021-05-04 14:58:11Z
---
### list devices CPU, GPU
```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# GPUs visible to TensorFlow, then all local devices (CPU and GPU)
print(tf.config.experimental.list_physical_devices("GPU"))
print(device_lib.list_local_devices())
```
### Init GPU; disable experimentals
```python
import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    # Allocate GPU memory on demand instead of reserving it all up front
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
tf.config.experimental.disable_mlir_graph_optimization()
tf.config.experimental.enable_tensor_float_32_execution(enabled=True)
```
### assign memory to GPU
```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    # Restrict TensorFlow to only allocate 7000 MB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=7000)],
        )
        logical_gpus = tf.config.experimental.list_logical_devices("GPU")
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
```

70
AI/nvidea/nvidea.md Normal file

@ -0,0 +1,70 @@
---
title: nvidea
updated: 2022-04-03 11:44:16Z
created: 2021-05-04 14:58:11Z
---
# NVIDIA
## show installed video drivers
nvidia-smi
[Latest drivers](https://www.nvidia.com/Download/index.aspx?lang=en-us)
---
## list installed hw
lspci | grep -i nvidia
sudo lshw -numeric -C display
## find NVIDIA modules
find /usr/lib/modules -name nvidia.ko
## Settings
nvidia-settings
## run
```bash
nvidia-smi          # show driver version and GPU status
nvidia-smi -L       # list the GPUs
nvidia-smi -l n     # refresh every n seconds
```
## monitoring nvidia
https://github.com/fbcotter/py3nvml
---
## Error: successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Fix: on the host, change the -1 to 0 in
/sys/bus/pci/devices/0000:2b:00.0/numa_node
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a "$a/numa_node"; done
https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ
---
## set numa value at start computer
```bash
sudo crontab -e
# Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
```
[Source](https://askubuntu.com/questions/1379119/how-to-set-the-numa-node-for-an-nvidia-gpu-persistently)
---
## start docker with --gpus=all every time, otherwise error
### failed call to cuInit: UNKNOWN ERROR (-1)
### no NVIDIA GPU device is present: /dev/nvidia0 does not exist
docker run -it -p 8888:8888 --gpus=all tensorflow/tensorflow:latest-gpu-jupyter
---
## update NVIDIA drivers
sudo ubuntu-drivers autoinstall


@ -0,0 +1,89 @@
---
title: Create table
updated: 2022-04-04 11:59:19Z
created: 2022-04-03 12:44:49Z
---
```hive
-- Use CREATE TEMPORARY TABLE instead for a table that only lives for the session
CREATE TABLE employee (
name STRING,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/aap/data/employee.txt'
OVERWRITE INTO TABLE employee;
```
```hive
CREATE TABLE IF NOT EXISTS employee_hr (
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/user/aap/data/employee_hr.txt'
OVERWRITE INTO TABLE employee_hr;
```
```hive
CREATE TABLE employee_id (
name STRING,
employee_id INT,
work_place ARRAY<STRING>,
gender_age STRUCT<gender:STRING,age:INT>,
skills_score MAP<STRING,INT>,
depart_title MAP<STRING,ARRAY<STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
LOAD DATA INPATH
'/user/aap/data/employee_id.txt'
OVERWRITE INTO TABLE employee_id;
```
```hive
CREATE TABLE IF NOT EXISTS employee_contract (
name string,
dept_num int,
employee_id int,
salary int,
type string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED as TEXTFILE;
LOAD DATA INPATH '/user/aap/data/employee_contract.txt'
OVERWRITE INTO TABLE employee_contract;
```
```hive
CREATE TABLE ctas_employee AS SELECT * FROM employee;
```
```hive
CREATE VIEW IF NOT EXISTS employee_skills
AS
SELECT
name, skills_score['DB'] as DB,
skills_score['Perl'] as Perl,
skills_score['Python'] as Python,
skills_score['Sales'] as Sales,
skills_score['HR'] as HR
FROM employee;
```


@ -0,0 +1,7 @@
---
title: 'Extracting queries from Hive logs '
updated: 2022-04-27 17:36:19Z
created: 2022-04-27 17:36:05Z
---
https://thisdataguy.com/2017/06/23/extracting-queries-from-hive-logs/


@ -0,0 +1,20 @@
---
title: Hive
updated: 2022-05-24 18:43:47Z
created: 2022-05-24 18:35:26Z
---
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Built on top of Apache Hadoop™, Hive provides the following features:
- Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in **Apache HDFS™** or in other data storage systems such as **Apache HBase™**
- Query execution via **Apache Tez™, Apache Spark™**, or **MapReduce**
- Procedural language with HPL-SQL
- Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
Hive's SQL can also be extended with user code via user defined functions (**UDF**s), user defined aggregates (UDAFs), and user defined table functions (UDTFs).
Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.


@ -0,0 +1,31 @@
---
title: '# Aggregations'
updated: 2022-04-03 17:09:07Z
created: 2022-04-03 17:00:47Z
---
```hive
SELECT
sum(CASE WHEN gender_age.gender = 'Male' THEN gender_age.age ELSE 0 END)/
count(CASE WHEN gender_age.gender = 'Male' THEN 1
ELSE NULL END) as male_age_avg
FROM employee;
SELECT
sum(coalesce(gender_age.age,0)) as age_sum,
sum(if(gender_age.gender = 'Female',gender_age.age,0)) as female_age_sum
FROM employee;
SELECT
if(name = 'Will', 1, 0) as name_group,
count(name) as name_cnt
FROM employee
GROUP BY if(name = 'Will', 1, 0);
```
```hive
SELECT
count(DISTINCT gender_age.gender) as gender_uni_cnt,
count(DISTINCT name) as name_uni_cnt
FROM employee;
```
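The first query above computes a conditional average: the CASE expression passes the age to `sum` only for male rows, and `count` skips the NULLs produced for everyone else. A minimal plain-Python sketch of the same logic, using made-up rows (the names and ages are illustrative assumptions, not the contents of `employee.txt`):

```python
# Hypothetical sample rows standing in for the employee table.
rows = [
    {"name": "Michael", "gender": "Male", "age": 30},
    {"name": "Will", "gender": "Male", "age": 35},
    {"name": "Shelley", "gender": "Female", "age": 27},
    {"name": "Lucy", "gender": "Female", "age": 57},
]

# sum(CASE WHEN gender = 'Male' THEN age ELSE 0 END)
male_age_sum = sum(r["age"] if r["gender"] == "Male" else 0 for r in rows)

# count(CASE WHEN gender = 'Male' THEN 1 ELSE NULL END): count() ignores NULLs
male_cnt = sum(1 for r in rows if r["gender"] == "Male")

male_age_avg = male_age_sum / male_cnt
print(male_age_avg)
```

Using `ELSE 0` inside `count` instead of `ELSE NULL` would count every row, which is why the query uses NULL there.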


@ -0,0 +1,17 @@
---
title: '# DDL'
updated: 2022-04-04 20:03:49Z
created: 2022-04-03 13:46:28Z
---
```hive
SHOW CREATE TABLE employee;
SHOW TABLES;
SHOW TABLES '*em*'; -- wildcard filter on table names
SHOW VIEWS;
SHOW COLUMNS IN employee;
DESCRIBE employee; -- DESC employee is the short form
SHOW TBLPROPERTIES employee;
```


@ -0,0 +1,41 @@
---
title: '# Data Sorting'
updated: 2022-04-03 16:53:16Z
created: 2022-04-03 16:50:40Z
---
- ORDER BY [ASC|DESC]
It performs a global sort using only one reducer, so it takes longer to return the result. Using LIMIT with ORDER BY is strongly recommended.
```hive
SELECT name
FROM employee -- Order by expression
ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;
SELECT * FROM emp_simple
ORDER BY work_place NULLS LAST;
```
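- SORT BY [ASC|DESC]: specifies which columns are used to sort the reducer input records. This means the sorting is completed before the data is sent to the reducer, so rows are ordered within each reducer but not globally (unless there is only one reducer).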
```hive
SELECT name FROM employee SORT BY name DESC;
```
- DISTRIBUTE BY: very similar to GROUP BY in the way the mapper decides to which reducer it delivers its output. Unlike GROUP BY, DISTRIBUTE BY does not perform data aggregations such as count(*); it only directs where the data goes.
```hive
SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;
SELECT name, start_date
FROM employee_hr
DISTRIBUTE BY start_date SORT BY name;
```
- CLUSTER BY: shortcut operator you can use to perform DISTRIBUTE BY and SORT BY operations on the same group of columns. The CLUSTER BY statement does not allow you to specify ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY statement sorts data in each distributed group:
```hive
SELECT name, employee_id FROM employee_hr CLUSTER BY name;
```
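To make the difference between these clauses concrete, here is a rough plain-Python simulation. It assumes two reducers and a simple modulo partitioner on `employee_id`; both are illustrative assumptions, not Hive's actual partitioning code, and the rows are made up.

```python
# Hypothetical (name, employee_id) rows.
rows = [("Will", 101), ("Lucy", 103), ("Michael", 100), ("Steven", 102)]
NUM_REDUCERS = 2  # assumed for this sketch


def distribute_by(rows, key, n):
    """Route each row to a reducer based on its key (DISTRIBUTE BY)."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[key(row) % n].append(row)
    return buckets


def sort_by(buckets, key):
    """Sort inside each reducer only (SORT BY)."""
    return [sorted(bucket, key=key) for bucket in buckets]


# DISTRIBUTE BY employee_id SORT BY name: sorted per reducer, not globally
per_reducer = sort_by(
    distribute_by(rows, lambda r: r[1], NUM_REDUCERS), lambda r: r[0]
)

# ORDER BY name: a single reducer produces one global order
global_order = sorted(rows, key=lambda r: r[0])

print(per_reducer)
print(global_order)
```

CLUSTER BY corresponds to calling `distribute_by` and `sort_by` with the same key.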
![e9effef3a9891b908b2197d351856eff.png](../../_resources/e9effef3a9891b908b2197d351856eff.png)


@ -0,0 +1,43 @@
---
title: '# Functions'
updated: 2022-04-03 16:58:17Z
created: 2022-04-03 13:36:09Z
---
```hive
SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
SELECT
SIZE(work_place) as array_size,
SIZE(skills_score) as map_size,
SIZE(depart_title) as complex_size,
SIZE(depart_title["Product"]) as nest_size
FROM employee;
SELECT
array_contains(work_place, 'Toronto') as is_Toronto,
sort_array(work_place) as sorted_array
FROM employee;
```
## Date
```hive
SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) as currentdate;
```
```hive
SELECT
reverse(split(reverse('/user/john/data/employee.txt'),'/')[0])
as linux_file_name;
```
## Opposite of explode
```hive
SELECT
collect_set(gender_age.gender) as gender_set,
collect_list(gender_age.gender) as gender_list
FROM employee;
```
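A small plain-Python analogy for the two aggregates, with made-up input values: `collect_list` keeps every value including duplicates, while `collect_set` keeps each distinct value once.

```python
genders = ["Male", "Male", "Female", "Female"]  # hypothetical column values

gender_list = list(genders)        # like collect_list: duplicates are kept
gender_set = sorted(set(genders))  # like collect_set: duplicates are removed
# sorted() is used here only for stable printing; Hive does not guarantee
# any particular element order for collect_set.

print(gender_list)
print(gender_set)
```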


@ -0,0 +1,108 @@
---
title: '# Select'
updated: 2022-04-07 19:29:20Z
created: 2022-04-03 13:25:15Z
---
```hive
select name, -- regular column
work_place[0], -- array
gender_age.gender, -- struct
skills_score['DB'], -- map
depart_title['Product'][0] -- map with array
from employee;
```
```hive
select name, work_place,
cities
from employee
LATERAL VIEW explode(work_place) C AS cities;
```
```hive
select name, work_place,
depart_title['Product'],
jobs
from employee
LATERAL VIEW explode(depart_title['Product']) C AS jobs;
```
```hive
SELECT
name,
dept_num as deptno,
salary,
count(*) OVER (PARTITION BY dept_num) as cnt,
count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,
sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as sum1,
sum(salary) OVER(ORDER BY dept_num) as sum2,
sum(salary) OVER(ORDER BY dept_num, name) as sum3
FROM employee_contract
ORDER BY deptno, name;
```
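`sum2` and `sum3` in the query above differ because of the default window frame: with ORDER BY and no explicit frame, Hive uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ordering key are summed together. A plain-Python sketch with made-up contract rows (names, departments and salaries are assumptions):

```python
# Hypothetical (name, dept_num, salary) rows.
rows = [
    ("Lucy", 1000, 5500),
    ("Michael", 1000, 5000),
    ("Steven", 1001, 6400),
    ("Will", 1001, 4000),
]


def running_sum(rows, order_key):
    """sum(salary) OVER (ORDER BY ...) with the default RANGE frame:
    each row gets the total of all rows whose key is <= its own key."""
    return {
        r[0]: sum(o[2] for o in rows if order_key(o) <= order_key(r))
        for r in rows
    }


sum2 = running_sum(rows, lambda r: r[1])          # ORDER BY dept_num
sum3 = running_sum(rows, lambda r: (r[1], r[0]))  # ORDER BY dept_num, name

print(sum2)
print(sum3)
```

Because `dept_num` has ties, `sum2` jumps one department at a time, while the unique `(dept_num, name)` key makes `sum3` a row-by-row running total.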
```hive
with r1 as (select name from employee),
r2 as (select name from employee)
select * from r1
union all
select * from r2
```
```hive
SELECT
CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
ELSE 'Mr.' END as title,
name,
IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;
```
```hive
SELECT
name, gender_age.gender as gender
FROM (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
) t1 -- t1 here is mandatory
```
```hive
SELECT name, gender_age FROM employee WHERE gender_age.age in (27, 30)
```
```hive
SELECT
name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN
(('Female', 27), ('Male', 27 + 3)) -- expression support version > v2.1.0
```
|Join type | Logic | Rows returned |
|---|---|---|
|table_m JOIN table_n | This returns all rows matched in both tables.| m ∩ n|
|table_m LEFT JOIN table_n | This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL in the right table.| m |
|table_m RIGHT JOIN table_n | This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL in the left table.| n |
|table_m FULL JOIN table_n| This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. | m + n - m ∩ n |
|table_m CROSS JOIN table_n | This returns all row combinations in both the tables to produce a Cartesian product.| m * n |
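
The "Rows returned" column can be checked with a plain-Python sketch. It assumes the join keys are unique within each table, so that the number of matched rows equals the size of the key intersection; with duplicate keys the counts grow. The key values are made up.

```python
# Hypothetical join keys of two tables; keys are unique per table.
table_m = {"Michael", "Will", "Shelley", "Lucy"}
table_n = {"Will", "Lucy", "Steven"}

matched = table_m & table_n

inner_rows = len(matched)                                # m ∩ n
left_rows = len(table_m)                                 # m
right_rows = len(table_n)                                # n
full_rows = len(table_m) + len(table_n) - len(matched)   # m + n - m ∩ n
cross_rows = len(table_m) * len(table_n)                 # m * n

print(inner_rows, left_rows, right_rows, full_rows, cross_rows)
```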
### Special joins for HiveQL
- MAPJOIN: The MapJoin statement reads all the data from the small table into memory and broadcasts it to all maps. During the map phase, the join operation is performed by comparing each row of data in the big table with the small tables against the join conditions. Because no reduce phase is needed, this kind of join usually performs better. In newer versions of Hive, Hive automatically converts a join to a MapJoin at runtime if possible. However, you can also manually specify the broadcast table by providing a join hint, /*+ MAPJOIN(table_name) */. The MapJoin operation does not support the following: using MapJoin after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY; using MapJoin before UNION, JOIN, and another MapJoin.
```hive
SELECT
/*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph
WHERE emp.name <> emph.name;
```
- LEFT SEMI JOIN: also a type of MapJoin. It is the same as a subquery with IN/EXISTS after v0.13.0 of Hive. However, it is not recommended for use since it is not part of standard SQL.
```hive
SELECT a.name FROM employee a
LEFT SEMI JOIN employee_id b ON a.name = b.name;
```
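The MapJoin idea described above (load the small table into memory, then stream the big table past it) is essentially a broadcast hash join. A minimal plain-Python sketch with made-up rows; it uses an equality join for brevity, whereas the MAPJOIN query above uses a cross join with a filter:

```python
# Hypothetical small table: name -> sin_number, small enough to fit in memory.
small_table = {"Will": "547-968-091", "Lucy": "577-928-094"}

# Hypothetical big table, streamed row by row during the map phase.
big_table = [("Michael", 100), ("Will", 101), ("Lucy", 103)]


def map_join(big_rows, small_lookup):
    """Join each big-table row against the in-memory lookup; no reduce step."""
    for name, employee_id in big_rows:
        if name in small_lookup:
            yield (name, employee_id, small_lookup[name])


joined = list(map_join(big_table, small_table))
print(joined)
```

The lookup costs memory on every mapper, which is why this only pays off when one side of the join is small.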


@ -0,0 +1,189 @@
---
title: Cheat Sheet
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Cheat Sheet
## Python, Spark setting
```bash
# Spark home for full install
export SPARK_HOME="/usr/local/spark/"
# Set a fixed value for the hash seed secret
export PYTHONHASHSEED=0
# Set an alternate Python executable
export PYSPARK_PYTHON=/usr/local/ipython/bin/ipython
# Augment the default search path for shared libraries
export LD_LIBRARY_PATH=/usr/local/ipython/bin/ipython
# Augment the default search path for private libraries
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-*-src.zip:$PYTHONPATH:$SPARK_HOME/python/
```
### Initializing SparkSession
```python
from pyspark.sql import SparkSession
# Parentheses allow the builder chain to span several lines
spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config("spark.executor.memory", "1gb")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
```
### Creating DataFrames
```python
from pyspark.sql import Row
from pyspark.sql.types import *
# Infer Schema
sc = spark.sparkContext
lines = sc.textFile("people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
df_people = spark.createDataFrame(people)
# Specify Schema (both fields are declared as StringType, so age stays a string)
people = parts.map(lambda p: Row(name=p[0],age=p[1].strip()))
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
spark.createDataFrame(people, schema).show()
```
### From Spark Data Sources
```python
# JSON
df = spark.read.json("customer.json")
df2 = spark.read.load("people.json", format="json")
# Parquet files
df3 = spark.read.load("users.parquet")
# TXT files
df4 = spark.read.text("people.txt")
```
| Inspect Data | Inspect Data |
| ------------ | --------------------- |
| df.dtypes | df.describe().show() |
| df.show() | df.columns |
| df.head() | df.count() |
| df.first() | df.distinct().count() |
| df.take(2) | df.printSchema() |
| df.schema | df.explain() |
### Duplicate Values
```python
df = df.dropDuplicates()
```
### Queries
```python
from pyspark.sql import functions as F
# Select
df.select('firstName',
          'lastName',
          F.explode('phoneNumber').alias('contactInfo'),
          "address.type", # type of address column
          df['age'] + 10
         ).show()
# When
# Show firstName and 0 or 1 depending on age > 30
df.select("firstName",F.when(df.age > 30, 1).otherwise(0)).show()
# Show firstName if in the given options
df[df.firstName.isin("Jane","Boris")].collect()
df1.withColumn("new column",F.when(df1["major"] == "J",1).otherwise(0)).show()
# Like
df.select("firstName", df.lastName.like("Smith")).show()
# Startswith - Endswith
df.select("firstName", df.lastName.startswith("Sm")).show()
df.select(df.lastName.endswith("th")).show()
# Substring
df.select(df.firstName.substr(1,3).alias("name"))
# Between
df.select(df.age.between(22, 24))
```
### Add, Update, Remove Columns
```python
# Adding Columns
from pyspark.sql.types import *
from pyspark.sql.functions import explode
# Parentheses allow the withColumn chain to span several lines
df = (df.withColumn('city',df.address.city)
    .withColumn('postalCode',df.address.postalCode)
    .withColumn('state',df.address.state)
    .withColumn('streetAddress',df.address.streetAddress)
    .withColumn('telePhoneNumber', explode(df.phoneNumber.number))
    .withColumn('telePhoneType', explode(df.phoneNumber.type))
    .withColumn("medianHouseValue", df["medianHouseValue"].cast(FloatType())))
from pyspark.sql.functions import add_months,current_date, year, dayofmonth, when
df2.select(add_months(df2.dt, 1).alias('next_month')).collect()
df3 = df2.withColumn("day",dayofmonth(current_date()))
df3.withColumn("year",when(year(current_date()) < 2020,year(current_date())).otherwise(2020)).show()
# Updating Column Name
df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')
# Removing Columns
df = df.drop("address", "phoneNumber")
df = df.drop(df.address).drop(df.phoneNumber)
# GroupBy
df.groupBy("age").count()
# Filter
df.filter(df["age"]>24)
# Sort
peopledf.sort(peopledf.age.desc())
df.sort("age", ascending=False)
df.orderBy(["age","city"],ascending=[0,1])
# Missing & Replacing Values
df.na.fill(50)
df.na.drop()
df.na.replace(10,20)
# Repartitioning
df.repartition(10).rdd.getNumPartitions() # df with 10 partitions
df.coalesce(1).rdd.getNumPartitions() # df with 1 partition
```
### Running SQL Queries Programmatically
```python
# Registering DataFrames & Query as Views
df.createTempView("customer") # fails if the view already exists
df.createOrReplaceTempView("customer") # replaces an existing view
df5 = spark.sql("SELECT * FROM customer")
peopledf.createGlobalTempView("people")
peopledf2 = spark.sql("SELECT * FROM global_temp.people")
```
### Output
```python
# Data Structures
rdd1 = df.rdd
df.toJSON().first()
df.toPandas()
# Write & Save to Files
df.select("firstName", "city").write.save('someName.parquet')
df.select("firstName", "age").write.save('someName.json',format='json')
```


@ -0,0 +1,23 @@
---
title: Configurations
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Spark configuration options
SPARK_LOCAL_IP environment variable
```bash
SPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell
```
In a program set the bindAddress
```scala
import org.apache.spark.{SparkConf, SparkContext}
val config = new SparkConf()
config.setMaster("local[*]")
config.setAppName("Test App")
config.set("spark.driver.bindAddress", "127.0.0.1")
val sc = new SparkContext(config)
```


@ -0,0 +1,26 @@
---
title: Files
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
Files
## Get remote files
```python
from pyspark import SparkContext
from pyspark import SparkFiles
from pyspark.sql import SQLContext
url = "https://raw.githubusercontent.com/.../data/adult.csv"
sc = SparkContext()
sc.addFile(url)
spark = SQLContext(sc)
df = spark \
.read \
.csv(SparkFiles.get("adult.csv"),header=True,inferSchema=True)
df.printSchema()
```


@ -0,0 +1,22 @@
---
title: General
updated: 2022-05-24 19:25:58Z
created: 2022-05-24 19:20:33Z
---
What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
### Batch/streaming data
> Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
### SQL analytics
> Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
### Data science at scale
> Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling
### Machine learning
> Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
[Difference between Spark DataFrame and Pandas DataFrame](https://www.geeksforgeeks.org/difference-between-spark-dataframe-and-pandas-dataframe/)


@ -0,0 +1,500 @@
---
title: Notes
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
Notes
# Apache Spark Architecture
A Spark application consists of:
- a driver process, which is responsible for
  - maintaining information about the Spark Application;
  - responding to a user's program or input;
  - analyzing, distributing, and scheduling work across the executors;
- executor processes, which are responsible for
  - executing code assigned to them by the driver;
  - reporting the state of the computation on that executor back to the driver node.
![Architecture Spark Application](https://raw.githubusercontent.com/jjmw/Summaries/master/images/architectureSparkApplication.png)
## cluster manager
Controls physical machines and allocates resources to Spark Applications:
- Spark's standalone cluster manager
- YARN
- Mesos
Spark in _local mode_: driver and executor are simply processes.
### Language APIs
All languages (R, Python, Scala, Java) have similar performance characteristics when using the structured APIs.
Performance drops when Python uses a UDF, because the Python code is executed in a separate Python process outside the JVM.
Spark has two fundamental sets of APIs:
- low-level "unstructured" (RDD)
- higher-level "structured" (Dataframe and Dataset)
### Spark Session
```scala
val myRange = spark.range(1000).toDF("number")
```
This range of numbers represents a distributed collection: each part of the range exists on a different executor.
### Partitions
To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame's partitions represent how the data is physically distributed across the cluster of machines during execution. If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource.
```scala
spark.conf.set("spark.sql.shuffle.partitions", "5")
```
The default is 200 shuffle partitions.
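
The two limiting cases described for partitions boil down to one rule: effective parallelism is the smaller of the number of partitions and the number of task slots. A simplified sketch that assumes one task slot per executor (a real executor can run several tasks at once, depending on its cores); the function name is hypothetical:

```python
def effective_parallelism(num_partitions: int, num_executors: int) -> int:
    """Tasks that can run at the same time, assuming one slot per executor."""
    return min(num_partitions, num_executors)


print(effective_parallelism(1, 1000))  # one partition, many executors
print(effective_parallelism(200, 1))   # many partitions, one executor
print(effective_parallelism(200, 8))   # default shuffle partitions, 8 executors
```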
### Transformations
Core data structures are immutable. To “change” one, e.g. a DataFrame, you instruct Spark how you would like to modify it. These instructions are called *transformations*. A transformation on its own returns no output (**lazy evaluation**): it only specifies an abstract transformation, and Spark will not act on transformations until we call an action. Until then it builds up a plan of transformations, which lets it apply optimizations such as predicate pushdown.
Types of transformations:
- narrow: each input partition contributes to only one output partition; performed in memory.
- wide: input partitions contribute to many output partitions (a shuffle, where Spark writes to disk), e.g. aggregation and sort.
- lazy evaluation: Spark will wait until the very last moment to execute the graph of computation instructions. Spark compiles this plan from your raw DataFrame transformations to a streamlined physical plan that will run as efficiently as possible across the cluster.
![physicalPlan](../../_resources/5_physical_plan.png)
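
Lazy evaluation can be imitated in a few lines of plain Python. The `LazyFrame` class below is a hypothetical toy, not a Spark API: transformations only append to a plan and return a new object, and nothing runs until an action such as `count` is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records transformations, runs on action."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan   # tuple of (kind, function) steps
        self.executed = 0   # how many times an action triggered work

    def where(self, predicate):
        # Transformation: returns a new object, computes nothing.
        return LazyFrame(self._data, self._plan + (("filter", predicate),))

    def select(self, func):
        # Transformation: returns a new object, computes nothing.
        return LazyFrame(self._data, self._plan + (("map", func),))

    def _run(self):
        rows = iter(self._data)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        self.executed += 1
        return list(rows)

    def count(self):
        # Action: only now is the recorded plan executed.
        return len(self._run())


my_range = LazyFrame(range(1000))
evens = my_range.where(lambda n: n % 2 == 0).select(lambda n: n * 10)
print(evens.executed)  # the plan has been built but not run
print(evens.count())   # the action triggers execution
```

Note that `my_range` itself is never modified, which mirrors the immutability of Spark's core data structures.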
### Actions
An action instructs Spark to compute a result from a series of transformations, e.g. count.
Kinds of actions:
- view data in the console
- collect data to native objects in the respective language
- write to output data sources
![Architecture Spark Application](../../_resources/ReadSortTakeDataframe.png)
### logical plan
The logical plan of transformations that we build up defines a lineage for the DataFrame so that at any given point in time, Spark knows how to recompute any partition by performing all of the operations it had before on the same input data
### DataFrames and SQL
Register any DataFrame as a table or view (a temporary table) and query it using pure SQL. There is **no performance** difference between writing SQL queries and writing DataFrame code; they both “compile” to the same underlying plan that we specify in DataFrame code.
```scala
flightData2015.createOrReplaceTempView("flight_data_2015")
val sqlWay = spark
.sql("""SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME
""")
val dataFrameWay = flightData2015
.groupBy('DEST_COUNTRY_NAME)
.count()
```
---
## Spark Toolset
![Spark toolset](../../_resources/spark_toolset.png)
### Running Production Applications (spark-submit)
spark-submit does one thing: it lets you send your application code to a cluster and launch it to execute there
On local machine:
```bash
## specify location of external jars
LIB=......
JARS=$(files=("$LIB"/*.jar); IFS=,; echo "${files[*]}")
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
--jars $JARS \
./examples/jars/spark-examples_2.11-2.2.0.jar 10
```
### Datasets: Type-Safe Structured APIs
- Datasets: for writing statically typed code in Java and Scala. A Dataset is parameterized as Dataset[T]; in Scala, T is typically a case class. It is a collection of typed objects, similar to a Scala Seq.
Reason to use Datasets: they are especially attractive for writing large applications, with which multiple software engineers must interact through well-defined interfaces.
```scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String, count: BigInt)
val flightsDF = spark.read.parquet("/data/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
```
The advantage of using a Dataset is that when you call collect or take on it, it collects objects of the proper type in your Dataset, not DataFrame Rows. This makes it easy to get type safety and securely perform manipulation in a distributed and a local manner without code changes.
### Dataframes
A distributed collection of objects of type Row.
### Structured Streaming
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2.
It also makes it easy to conceptualize because you can write your batch job as a way to prototype it and then you can convert it to a streaming job
### Machine Learning and Advanced Analytics
Machine learning algorithms in MLlib require that data is represented as numerical values. All machine learning algorithms in Spark take as input a Vector type
### Lower-Level APIs
Virtually everything in Spark is built on top of RDDs. One thing that you might use RDDs for is to parallelize raw data that you have stored in memory on the driver machine.
RDDs are available in Scala as well as Python. However, they're not equivalent.
### Spark's Ecosystem and Packages
[Spark Packages](https://spark-packages.org/)
---
## Structured APIs
The Structured APIs are the fundamental abstraction that you will use to write the majority of your data flows.
DataFrames and Datasets represent immutable, lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output.
Support schema on write and schema on read.
Spark uses an engine called **Catalyst**, and Spark is effectively a programming language of its own. The majority of our manipulations will operate strictly on Spark types.
Within the Structured APIs, there are two more APIs:
- untyped DataFrames; types are checked at runtime
- typed Datasets; types are checked at compile time
The “Row” type is Spark's internal representation of its optimized in-memory format for computation. DataFrames are simply Datasets of type Row.
### Columns
Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value.
### Spark Types
To work with the correct Scala types, use the following:
```scala
import org.apache.spark.sql.types._
val b = ByteType
```
To get the Spark type that corresponds to a Scala type, e.g.:
- Byte ==> ByteType
- Short ==> ShortType
- Int ==> IntegerType
- etc
### Overview of Structured API Execution
Steps of a single structured API query:
1. Write DataFrame/Dataset/SQL Code
2. If valid code, Spark converts this to a Logical Plan.
3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way
4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.
overview
![Catalyst Optimizer](../../_resources/CatalystOptimizer.png)
Logical plan is first created and represents a set of abstract transformations that do not refer to executors or drivers. This plan is unresolved because although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog. Packages can extend the Catalyst to include their own rules for domain-specific optimizations.
![LogicalPlan Spark](../../_resources/LogicalPlanSpark.png)
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model. Physical planning results in a series of RDDs and transformations
![PhysicalPlan Spark](../../_resources/PhysicalPlanSpark.png)
### Execution
Upon selecting a physical plan, Spark runs all of this code over RDDs. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution.
---
## Basic Structured Operations
### Schemas
A schema is a StructType made up of a number of fields (StructFields), each of which has:
- a name
- a type (Spark types)
- a Boolean flag: whether the column can contain missing or null values
- optionally, metadata associated with that column
The metadata is a way of storing information about this column (Spark uses this in its machine learning library).
```scala
import org.apache.spark.sql.types.{StructField,StructType, StringType,LongType}
import org.apache.spark.sql.types.Metadata
val myManualSchema = StructType(Array(
StructField("DEST_COUNTRY_NAME", StringType, true),
StructField("ORIGIN_COUNTRY_NAME", StringType, true),
StructField("count", LongType, false,
Metadata.fromJson("{\"hello\":\"world\"}"))
))
val df = spark
.read
.format("json")
.schema(myManualSchema)
.load("/data/2015-summary.json")
```
### Columns and Expressions
You cannot manipulate an individual column outside the context of a DataFrame; you must use Spark transformations within a DataFrame to modify the contents of a column.
Different ways to construct and refer to columns:
```scala
import org.apache.spark.sql.functions.{col, column}
col("someColumnName")
column("someColumnName")
$"someColumnName"
'someColumnName
df.col("count") // use
```
**Columns are not resolved until we compare the column names with those we are maintaining in the catalog. Column and table resolution happens in the analyzer phase.**
### Expressions
Columns are expressions. An expression is a set of transformations on one or more values in a record in a DataFrame (a function that takes as input one or more column names, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset).
Each row in a DataFrame is a single record, represented as an object of type Row.
### Creating Rows
Only DataFrames have schemas. Rows themselves do not have schemas.
```scala
import org.apache.spark.sql.Row
val myRow = Row("Hello", null, 1, false)
myRow(0) // type Any
myRow(0).asInstanceOf[String] // String
myRow.getString(0) // String
myRow.getInt(2) // Int
```
### Creating DataFrames
```scala
val df = spark
.read
.format("json")
.load("/data/2015-summary.json")
df.createOrReplaceTempView("dfTable")
```
or
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,StringType,LongType}
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark
.sparkContext
.parallelize(myRows)
val myDf = spark
.createDataFrame(myRDD, myManualSchema)
myDf.show()
```
### select and selectExpr
```scala
mDF.select("colA","colB").show()
mDF.select('colA).show()
mDF.select(col("colA")).show()
mDF.select(expr("colA as aap")).show() // most flexible
mDF.select(expr("colA").alias("aap")).show()
mDF.selectExpr("colA as aap", "colB").show() // daily use; opens up the true power of Spark.
```
### Adding, renaming and dropping Columns
The DataFrame is NOT modified; each call returns a new DataFrame.
```scala
df.withColumn("numberOne", lit(1))
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))
df.withColumnRenamed("DEST_COUNTRY_NAME","dest")
df.drop("DEST_COUNTRY_NAME")
```
### Case Sensitivity
By default Spark is case insensitive. To make it case sensitive:
```scala
spark.sql("""set spark.sql.caseSensitive true""")
```
### Changing a Column's Type (cast)
```scala
df.withColumn("count2", col("count").cast("long"))
```
### Filtering Rows and Unique Rows
```scala
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
.distinct().count()
```
### Random Samples and Split
```scala
// Random Sample
val seed = 5
val withReplacement = false
val fraction = 0.5
df.sample(withReplacement, fraction, seed).count()
// Random Splits
val dataFrames = df
.randomSplit(Array(0.25, 0.75), seed)
dataFrames(0).count() > dataFrames(1).count() // False
```
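Roughly, `randomSplit` normalizes the weights and assigns each row to one split at random, in proportion to those weights. The plain-Python sketch below illustrates that idea by drawing a random number per row and checking which cumulative weight range it falls into. It is not Spark's implementation, the function name is hypothetical, and the split sizes are only approximately proportional to the weights.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split with probability weight / sum(weights)."""
    total = sum(weights)
    bounds = list(accumulate(w / total for w in weights))  # cumulative ranges
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        # Clamp the index to guard against floating-point rounding in bounds.
        idx = min(bisect_right(bounds, rng.random()), len(weights) - 1)
        splits[idx].append(row)
    return splits


small, large = random_split(range(1000), [0.25, 0.75], seed=5)
print(len(small), len(large))
```

Every row lands in exactly one split, so the split sizes always add up to the input size even though each individual size varies with the seed.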
---
## Working with Different Types of Data
**lit function**. This function converts a type in another language to its corresponding Spark representation.
```scala
import org.apache.spark.sql.functions.lit
df.select(lit(5), lit("five"), lit(5.0))
```
```scala
// where as in sql; === equal =!= not equal
vlucht.where(col("count") === 15)
.select("*")
.show(10)
// best way
vlucht.where("count = 15").show(10)
// column aap boolean; == equal
vlucht.selectExpr("*","count == 15 as aap").show(10)
```
**Better just write SQL!!!!**
```scala
// compute summary statistics
df.describe().show()
```
### Working with Dates and Timestamps
There are dates, which focus exclusively on calendar dates, and timestamps, which include both date and time information. Spark's TimestampType class supports only second-level precision, which means that if you're going to be working with milliseconds or microseconds, you'll need to work around this problem by potentially operating on them as longs.
### Working with Complex Types
Structs:
Think of structs as DataFrames within DataFrames
```scala
import org.apache.spark.sql.functions.struct
val complexDF = df.select(struct("Description","InvoiceNo")
.alias("complex"))
complexDF.select(col("complex")
.getField("Description"))
.show()
```
split
```scala
import org.apache.spark.sql.functions.split
df.select(split(col("Description"), " ")
.alias("array_col"))
.selectExpr("array_col[0]")
.show(2)
```
### User-Defined Functions (UDF)
One of the most powerful things that you can do in Spark is define your own functions. Functions that operate on the data, record by record.
Performance considerations:
- UDFs written in Scala or Java run within the Java Virtual Machine (JVM).
- For Python UDFs, Spark starts a Python process on the workers and serializes all data to a format that Python understands.
```scala
val udfExampleDF = spark.range(5).toDF("num")
def power3(number:Double):Double = number * number * number
import org.apache.spark.sql.functions.udf
val power3udf = udf(power3(_:Double):Double)
udfExampleDF.select(power3udf(col("num"))).show()
// register
spark.udf.register("power3", power3(_:Double):Double)
udfExampleDF.selectExpr("power3(num)").show(2)
```
## Aggregations
Grouping types in Spark:
(all return a RelationalGroupedDataset)
- group by
- window
- grouping set
- rollup
- cube
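
Rollup and cube are both shorthand for a list of grouping sets. A plain-Python sketch of which column combinations each one aggregates over; the column names are made up and the helper functions are hypothetical, not Spark APIs:

```python
from itertools import combinations


def rollup(cols):
    """Hierarchical prefixes of the columns, ending with the grand total."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube(cols):
    """Every subset of the columns, largest first."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


print(rollup(["date", "country"]))
print(cube(["date", "country"]))
```

The empty tuple stands for the grand total across all rows; cube adds the `("country",)` combination that rollup leaves out.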


@ -0,0 +1,63 @@
---
title: '# jdbc'
updated: 2022-04-03 15:16:26Z
created: 2021-05-04 14:58:11Z
---
### method a load drivers
```python
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars file:/home/john/opt/jars/postgresql-42.2.5.jar pyspark-shell'
```
### method b load drivers
```bash
pyspark \
--packages org.postgresql:postgresql:42.2.5 \
--driver-class-path /home/john/opt/jars/postgresql-42.2.5.jar
```
Using `--driver-class-path` alone is also OK.
```python
from pyspark.sql import DataFrameReader, SparkSession
spark = SparkSession.builder \
.master("local") \
.appName("jdbc data sources") \
.config("spark.sql.shuffle.partitions", "4") \
.getOrCreate()
```
### method 1
```python
df_company = (
spark.read.format("jdbc")
.option("url", "jdbc:postgresql://172.17.0.2/postgres")
.option("dbtable", "public.company")
.option("user", "postgres")
.option("password", "qw12aap")
.option("driver", "org.postgresql.Driver")
.load()
)
df_company.show()
```
### method 2
```python
dataframe = (
spark.read.format("jdbc")
.options(
url="jdbc:postgresql://172.17.0.2/postgres?user=postgres&password=qw12aap",
database="public",
dbtable="company",
driver="org.postgresql.Driver"
)
.load()
)
dataframe.show()
```

View File

@ -0,0 +1,24 @@
---
title: snippets
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# snippets
```python
df_main.join(df_sub,['finr','belastingjaar'],'left').filter(df_sub["element"].isin(20,30)).drop("element").groupBy('finr','belastingjaar').sum("waarde22").show()
df_main.join(df_sub,['finr','belastingjaar'],'left').filter(df_sub["element"].isin(10,20)).show()
```
```python
df = spark.createDataFrame ([
("a", 1, 10, "m1"), ("a", 1, 10, "m2"), ("a", 1, 30, "m3"),
("a", 1, 11, "m4")],
("a", "b", "cnt", "major"))
df.show()
reshaped_df = df.groupby('a','b').pivot('major').max('cnt').fillna(0)
reshaped_df.show()
```

View File

@ -0,0 +1,495 @@
---
title: spark_notes
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Apache Spark Architecture
A Spark application consists of:
- a driver process, which is responsible for:
  - maintaining information about the Spark Application;
  - responding to a user's program or input;
  - analyzing, distributing, and scheduling work across the executors;
- executor processes, which are responsible for:
  - executing code assigned to them by the driver;
  - reporting the state of the computation on that executor back to the driver node.
![Architecture Spark Application](../../_resources/architectureSparkApplication.png)
## cluster manager
Controls physical machines and allocates resources to Spark Applications:
- Spark's standalone cluster manager
- YARN
- Mesos
Spark in _local mode_: driver and executor are simply processes.
### Language APIs
All languages (R, Python, Scala, Java) have similar performance characteristics when using the structured APIs.
Performance drops when Python uses UDFs, because the Python code is executed in a separate Python process outside the JVM.
Spark has two fundamental sets of APIs:
- low-level "unstructured" (RDD)
- higher-level "structured" (DataFrame and Dataset)
### Spark Session
```scala
val myRange = spark.range(1000).toDF("number")
```
This range of numbers represents a distributed collection: when run on a cluster, each part of the range exists on a different executor.
### Partitions
To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster
```scala
spark.conf.set("spark.sql.shuffle.partitions", "5")
```
The default number of shuffle partitions is 200.
### Transformations
Core data structures are immutable. To “change” one, e.g. a DataFrame, you instruct Spark how you would like to modify it. These instructions are called *transformations*. A transformation on its own returns no output (**lazy evaluation**): it specifies only an abstract transformation, and Spark will not act on transformations until we call an action. Instead, Spark builds up a plan of transformations, which lets it optimize the whole data flow (for example with predicate pushdown).
Types of transformations:
- narrow: each input partition contributes to only one output partition
- wide: each input partition contributes to many output partitions (a shuffle, during which Spark writes results to disk), e.g. aggregation and sort
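The difference between the two kinds of transformation can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the toy `partitions` data, the helper names `narrow_map` and `wide_sum_by_key`, and the byte-sum bucketing rule are all hypothetical and are not how Spark partitions data internally.

```python
from collections import defaultdict

# Hypothetical toy data: three "partitions" of (key, value) rows.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]

def narrow_map(parts, fn):
    """Narrow: every output partition is built from exactly one input partition."""
    return [[fn(row) for row in part] for part in parts]

def wide_sum_by_key(parts, num_out):
    """Wide: rows are re-bucketed by key, so one output partition
    may need rows from every input partition (a shuffle)."""
    buckets = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner (assumption).
            target = sum(key.encode()) % num_out
            buckets[target][key] += value
    return [sorted(bucket.items()) for bucket in buckets]

doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
totals = wide_sum_by_key(doubled, num_out=2)
print(totals)  # [[('b', 14)], [('a', 8), ('c', 20)]]
```

The narrow step keeps the three partitions separate, while the wide step has to look at every input partition before any output partition is complete.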
### Actions
An action instructs Spark to compute a result from a series of transformations, e.g. `count`.
Kinds of actions:
- view data in the console
- collect data to native objects in the respective language
- write to output data sources
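How lazy transformations and an action fit together can be sketched with a small stand-in class. This is a sketch only: `LazyFrame` and its methods are hypothetical names for illustration, not part of any Spark API, and there is no optimizer here.

```python
class LazyFrame:
    """Toy immutable collection that records transformations instead of running them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, computes nothing.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        return LazyFrame(self._rows, self._plan + (("map", fn),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

calls = []

def traced_double(x):
    calls.append(x)  # record when the function really runs
    return x * 2

frame = LazyFrame(range(5)).filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = frame.collect()
print(calls_before_action, result)  # [] [0, 4, 8]
```

Each transformation returns a new object and leaves the original untouched, which mirrors the immutability described under Transformations.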
![Architecture Spark Application](../../_resources/ReadSortTakeDataframe-1.png)
### logical plan
The logical plan of transformations that we build up defines a lineage for the DataFrame so that at any given point in time, Spark knows how to recompute any partition by performing all of the operations it had before on the same input data
### DataFrames and SQL
Register any DataFrame as a table or view (a temporary table) and query it using pure SQL. There is **no performance** difference between writing SQL queries and writing DataFrame code: they both “compile” to the same underlying plan.
```scala
flightData2015.createOrReplaceTempView("flight_data_2015")
val sqlWay = spark
.sql("""SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME
""")
val dataFrameWay = flightData2015
.groupBy('DEST_COUNTRY_NAME)
.count()
```
---
## Spark Toolset
![Spark toolset](../../_resources/spark_toolset-1.png)
### Running Production Applications (spark-submit)
spark-submit does one thing: it lets you send your application code to a cluster and launch it to execute there
On local machine:
```bash
## specify location of external jars
LIB=......
JARS=$(files=("$LIB"/*.jar); IFS=,; echo "${files[*]}")
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
--jars $JARS \
./examples/jars/spark-examples_2.11-2.2.0.jar 10
```
### Datasets: Type-Safe Structured APIs
Datasets: statically typed code in Java and Scala
DataFrames: distributed collection of objects of type Row
Datasets: collection of typed objects, e.g. a Scala Seq
Reason to use Datasets: especially attractive for writing large applications, with which multiple software engineers must interact through well-defined interfaces.
```scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String, count: BigInt)
val flightsDF = spark.read.parquet("/data/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
```
The advantage of using a Dataset is that when you call `collect` or `take` on it, you get objects of the proper type in your Dataset, not DataFrame Rows. This makes it easy to get type safety and to perform manipulation safely in a distributed and a local manner without code changes.
### Structured Streaming
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2.
It also makes it easy to conceptualize because you can write your batch job as a way to prototype it and then you can convert it to a streaming job
### Machine Learning and Advanced Analytics
Machine learning algorithms in MLlib require that data is represented as numerical values. All machine learning algorithms in Spark take as input a Vector type
### Lower-Level APIs
Virtually everything in Spark is built on top of RDDs. One thing that you might use RDDs for is to parallelize raw data that you have stored in memory on the driver machine.
RDDs are available in Scala as well as Python. However, they're not equivalent.
### Spark's Ecosystem and Packages
[Spark Packages](https://spark-packages.org/)
---
## Structured APIs
The Structured APIs are the fundamental abstraction that you will use to write the majority of your data flows.
DataFrames and Datasets represent immutable, lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output.
Support schema on write and schema on read.
Spark uses an engine called **Catalyst**, and Spark is effectively a programming language of its own. The majority of our manipulations will operate strictly on Spark types.
Within the Structured APIs, two more APIs:
- untyped DataFrames: types are checked at runtime
- typed Datasets: types are checked at compile time
The “Row” type is Spark's internal representation of its optimized in-memory format for computation. DataFrames are simply Datasets of type Row.
### Columns
Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value.
### Spark Types
To work with the correct Scala types, use the following:
```scala
import org.apache.spark.sql.types._
val b = ByteType
```
To get the Spark type that corresponds to a Scala type, e.g.:
- Short ==> ShortType
- Int ==> IntegerType
- etc
### Overview of Structured API Execution
Steps of a single structured API query:
1. Write DataFrame/Dataset/SQL Code
2. If valid code, Spark converts this to a Logical Plan.
3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way
4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.
overview
![Catalyst Optimizer](../../_resources/CatalystOptimizer-1.png)
Logical plan is first created and represents a set of abstract transformations that do not refer to executors or drivers. This plan is unresolved because although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog. Packages can extend the Catalyst to include their own rules for domain-specific optimizations.
![LogicalPlan Spark](../../_resources/LogicalPlanSpark-1.png)
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model. Physical planning results in a series of RDDs and transformations
![PhysicalPlan Spark](../../_resources/PhysicalPlanSpark-1.png)
### Execution
Upon selecting a physical plan, Spark runs all of this code over RDDs. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution.
---
## Basic Structured Operations
### Schemas
A schema is a StructType made up of a number of fields (StructFields), each of which has:
- a name
- a type (Spark types)
- a Boolean flag: whether the column can contain missing or null values
- optionally, associated metadata for that column
The metadata is a way of storing information about this column (Spark uses this in its machine learning library).
```scala
import org.apache.spark.sql.types.{StructField,StructType, StringType,LongType}
import org.apache.spark.sql.types.Metadata
val myManualSchema = StructType(Array(
StructField("DEST_COUNTRY_NAME", StringType, true),
StructField("ORIGIN_COUNTRY_NAME", StringType, true),
StructField("count", LongType, false,
Metadata.fromJson("{\"hello\":\"world\"}"))
))
val df = spark
.read
.format("json")
.schema(myManualSchema)
.load("/data/2015-summary.json")
```
### Columns and Expressions
You cannot manipulate an individual column outside the context of a DataFrame; you must use Spark transformations within a DataFrame to modify the contents of a column.
Different ways to construct and refer to columns:
```scala
import org.apache.spark.sql.functions.{col, column}
col("someColumnName")
column("someColumnName")
$"someColumnName"
'someColumnName
df.col("count") // refers to a column of a specific DataFrame
```
**Columns are not resolved until we compare the column names with those we are maintaining in the catalog. Column and table resolution happens in the analyzer phase.**
### Expressions
Columns are expressions. An expression is a set of transformations on one or more values in a record in a DataFrame: a function that takes as input one or more column names, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset.
Each row in a DataFrame is a single record, represented as an object of type Row.
### Creating Rows
Only DataFrames have schemas. Rows themselves do not have schemas.
```scala
import org.apache.spark.sql.Row
val myRow = Row("Hello", null, 1, false)
myRow(0) // type Any
myRow(0).asInstanceOf[String] // String
myRow.getString(0) // String
myRow.getInt(2) // Int
```
### Creating DataFrames
```scala
val df = spark
.read
.format("json")
.load("/data/2015-summary.json")
df.createOrReplaceTempView("dfTable")
```
or
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,StringType,LongType}
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark
.sparkContext
.parallelize(myRows)
val myDf = spark
.createDataFrame(myRDD, myManualSchema)
myDf.show()
```
### select and selectExpr
```scala
mDF.select("colA","colB").show()
mDF.select('colA).show()
mDF.select(col("colA")).show()
mDF.select(expr("colA as aap")).show() // most flexible
mDF.select(expr("colA").alias("aap")).show()
mDF.selectExpr("colA as aap", "colB").show() // daily use; opens up the true power of Spark.
```
### Adding, renaming and dropping Columns
The DataFrame is NOT modified; each call returns a new DataFrame.
```scala
df.withColumn("numberOne", lit(1))
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))
df.withColumnRenamed("DEST_COUNTRY_NAME","dest")
df.drop("DEST_COUNTRY_NAME")
```
### Case Sensitivity
By default Spark is case insensitive. To make it case sensitive:
```scala
spark.sql("""set spark.sql.caseSensitive true""")
```
### Changing a Column's Type (cast)
```scala
df.withColumn("count2", col("count").cast("long"))
```
### Filtering Rows and Unique Rows
```scala
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
.distinct().count()
```
### Random Samples and Split
```scala
// Random Sample
val seed = 5
val withReplacement = false
val fraction = 0.5
df.sample(withReplacement, fraction, seed).count()
// Random Splits
val dataFrames = df
.randomSplit(Array(0.25, 0.75), seed)
dataFrames(0)
.count() > dataFrames(1)
.count() // False
```
---
## Working with Different Types of Data
**lit function**. This function converts a type in another language to its corresponding Spark representation.
```scala
import org.apache.spark.sql.functions.lit
df.select(lit(5), lit("five"), lit(5.0))
```
```scala
// where as in sql; === equal =!= not equal
vlucht.where(col("count") === 15)
.select("*")
.show(10)
// best way
vlucht.where("count = 15").show(10)
// column aap boolean; == equal
vlucht.selectExpr("*","count == 15 as aap").show(10)
```
**Better just write SQL!!!!**
```scala
// compute summary statistics
df.describe().show()
```
### Working with Dates and Timestamps
There are dates, which focus exclusively on calendar dates, and timestamps, which include both date and time information. Spark's TimestampType class supports only second-level precision, which means that if you're going to be working with milliseconds or microseconds, you'll need to work around this problem by potentially operating on them as longs.
### Working with Complex Types
Structs:
Think of structs as DataFrames within DataFrames
```scala
import org.apache.spark.sql.functions.struct
val complexDF = df.select(struct("Description","InvoiceNo")
.alias("complex"))
complexDF.select(col("complex")
.getField("Description"))
.show()
```
split
```scala
import org.apache.spark.sql.functions.split
df.select(split(col("Description"), " ")
.alias("array_col"))
.selectExpr("array_col[0]")
.show(2)
```
### User-Defined Functions (UDF)
One of the most powerful things you can do in Spark is define your own functions: functions that operate on the data, record by record.
Performance considerations:
- UDFs written in Scala or Java run inside the Java Virtual Machine (JVM).
- For Python UDFs, Spark starts a Python process on the workers and serializes all data to a format that Python understands.
```scala
val udfExampleDF = spark.range(5).toDF("num")
def power3(number:Double):Double = number * number * number
import org.apache.spark.sql.functions.{col, udf}
val power3udf = udf(power3(_:Double):Double)
udfExampleDF.select(power3udf(col("num"))).show()
// register
spark.udf.register("power3", power3(_:Double):Double)
udfExampleDF.selectExpr("power3(num)").show(2)
```
## Aggregations
Grouping types in Spark:
(all return a RelationalGroupedDataset)
- group by
- window
- grouping set
- rollup
- cube

13
Apache/Camel.md Normal file
View File

@ -0,0 +1,13 @@
---
title: Camel
updated: 2022-05-24 19:42:30Z
created: 2022-05-24 19:38:56Z
---
Apache Camel is an Open Source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
Apache Camel™ is a versatile open-source integration framework based on known Enterprise Integration Patterns.
Camel empowers you to define routing and mediation rules in a variety of domain-specific languages (DSL, such as Java, XML, Groovy, Kotlin, and YAML). This means you get smart completion of routing rules in your IDE, whether in a Java or XML editor.
[source](https://camel.apache.org/)

12
Apache/Cassandra.md Normal file
View File

@ -0,0 +1,12 @@
---
title: Cassandra
updated: 2022-05-24 19:28:24Z
created: 2022-05-24 19:26:51Z
---
What is Apache Cassandra?
> Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
### Distributed
> Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down. There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.

45
Apache/Flink.md Normal file
View File

@ -0,0 +1,45 @@
---
title: Flink
updated: 2022-05-24 19:01:38Z
created: 2022-05-24 18:44:47Z
---
# Stateful Computations over Data Streams
![de4bf8596cb1518879ba2589540b3c7d.png](../_resources/de4bf8596cb1518879ba2589540b3c7d.png)
Apache Flink is a framework and **distributed processing** engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
* * *
### Streaming use cases
- Event-driven Applications
- Stream & Batch Analytics
- Data Pipelines & ETL
* * *
### Guaranteed correctness
- Exactly-once state consistency
- Event-time processing
- Sophisticated late data handling
* * *
### Layered APIs
- SQL on Stream & Batch Data
- DataStream API & DataSet API
- ProcessFunction (Time & State)
* * *
### Excellent Performance
- Low latency
- High throughput
- In-Memory computing
* * *
### Scales to any use case
- Scale-out architecture
- Support for very large state
- Incremental check-pointing
* * *
### Data can be processed as unbounded or bounded streams.
- **Unbounded streams have a start but no defined end.** They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested. It is not possible to wait for all input data to arrive because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which events occurred, to be able to reason about result completeness.
- **Bounded streams have a defined start and end**. Bounded streams can be processed by ingesting all data before performing any computations. Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
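The contrast between the two kinds of stream can be sketched with plain Python generators. This is an illustration only and uses no Flink API; the function names and sample values are hypothetical.

```python
import itertools

def bounded_stream():
    """Has a defined start and end: all data can be ingested before computing."""
    yield from [5, 3, 8]

def unbounded_stream():
    """Has a start but no defined end: values keep arriving."""
    for n in itertools.count(1):
        yield n

# Bounded: ingest everything, then sort, i.e. batch processing.
batch_result = sorted(bounded_stream())

# Unbounded: waiting for "all" input is impossible, so each event is
# handled as it arrives and a running result is kept up to date.
running_total = 0
running_totals = []
for value in itertools.islice(unbounded_stream(), 4):  # observe only the first 4 events
    running_total += value
    running_totals.append(running_total)

print(batch_result, running_totals)  # [3, 5, 8] [1, 3, 6, 10]
```

The `islice` call is there only so the sketch terminates; a real unbounded job would keep running.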
[source](https://flink.apache.org/)

9
Apache/Fluentd.md Normal file
View File

@ -0,0 +1,9 @@
---
title: Fluentd
updated: 2022-05-24 19:54:57Z
created: 2022-05-24 19:53:01Z
---
Fluentd is an open source data collector, which lets you unify the data collection and consumption for a better use and understanding of data.
![c2a9b03791fddf1aabea180f18076f55.png](../_resources/c2a9b03791fddf1aabea180f18076f55.png)

10
Apache/Flume-1.md Normal file
View File

@ -0,0 +1,10 @@
---
title: Flume
updated: 2022-05-24 19:52:11Z
created: 2022-05-24 19:50:29Z
---
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on **streaming data** flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
![e3bb1a88acbda798341b1985f38d888c.png](../_resources/e3bb1a88acbda798341b1985f38d888c.png)

51
Apache/Flume.md Normal file
View File

@ -0,0 +1,51 @@
---
title: Flume
updated: 2022-05-24 18:29:31Z
created: 2021-05-04 14:58:11Z
---
# Apache Flume
Streaming data into cluster
Developed with Hadoop in mind
- Built-in sinks for HDFS and HBase
- Originally made to handle log aggregation
***Flume is buffering data before delivering to the cluster.***
## Anatomy of a Flume Agent and Flow
![Flume Agent](https://flume.apache.org/_images/DevGuide_image00.png?)
### Three components of a Flume Agent:
- Source
  - Where the data is coming from
  - Optionally Channel Selectors and Interceptors
    - Selectors:
      - based on some selection, the data is sent somewhere
    - Interceptors:
      - can add to or reshape the data
- Channel
  - how the data is transferred between Source and Sink (via memory or files)
- Sink
  - Where the data is going
  - there can be multiple Sinks, which can be organized into Sink Groups
  - A Sink can connect to only ***one*** Channel
  - The Channel is notified to delete a message once the Sink has processed it
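The flow through the three components can be sketched in plain Python. This is a toy model under stated assumptions: `MemoryChannel`, `interceptor`, `run_source` and `run_sink` are hypothetical names, not Flume classes, and real Flume channels use transactions rather than this simple peek-and-commit pair.

```python
from collections import deque

class MemoryChannel:
    """Toy in-memory channel that buffers events between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def peek(self):
        return self._events[0] if self._events else None

    def commit(self):
        # Called once the sink has processed the event: only now is it deleted.
        self._events.popleft()

    def __len__(self):
        return len(self._events)

def interceptor(line):
    """Reshapes the raw data and adds a header before it enters the channel."""
    body = line.strip()
    return {"body": body, "headers": {"length": len(body)}}

def run_source(lines, channel):
    for line in lines:
        channel.put(interceptor(line))

def run_sink(channel, destination):
    while (event := channel.peek()) is not None:
        destination.append(event["body"])  # deliver first ...
        channel.commit()                   # ... then let the channel drop it

channel = MemoryChannel()
delivered = []
run_source(["login ok\n", "disk full\n"], channel)
buffered = len(channel)  # events wait in the channel until a sink takes them
run_sink(channel, delivered)
print(buffered, delivered, len(channel))  # 2 ['login ok', 'disk full'] 0
```

Deleting an event only after delivery is what lets the channel act as a buffer when the destination is slow or briefly unavailable.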
### Built-in Source Types:
- Spooling directory, Avro (specific Hadoop format), Kafka, Exec (command-line), Thrift, Netcat (tcp/ip), HTTP, Custom, etc
### Built-in Sink Types:
- HDFS, Hive, HBase, Avro, Thrift, Elasticsearch, Kafka, Custom
Flume Example
![example](../_resources/FlumeExample.png)
The first layer of agents sits close to the source, e.g. in a local datacenter, and processes the data.
The second layer collects from the first layer and ingests into the sink.
Between the first and second layers of agents, an Avro sink and an Avro source are used to transfer the data very efficiently.

11
Apache/HBase.md Normal file
View File

@ -0,0 +1,11 @@
---
title: HBase
updated: 2022-05-24 19:44:45Z
created: 2022-05-24 19:43:06Z
---
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
[Source](https://hbase.apache.org/)

160
Apache/Hadoop.md Normal file
View File

@ -0,0 +1,160 @@
---
title: Hadoop
updated: 2022-05-24 18:26:29Z
created: 2022-05-24 18:24:59Z
---
## **HDFS**: Hadoop Distributed File System
### Hadoop 2
- high availability
- federations
- snapshots
**YARN** was introduced in Hadoop version 2 to overcome scalability issues and resource management jobs.
### Hadoop 3
- Overhead due to data replication factor
  - Default replication factor of 3
  - fault-tolerant with better data locality and better load balancing of jobs among DataNodes
  - overhead cost of around 200%
  - rarely used or unused data consumes resources
  - Solution: **erasure coding**. This stores data durably while saving space significantly.
- YARN Timeline service re-architected
- YARN opportunistic containers & distributed scheduling
- Optimizing map output collector: use of Java Native Interface (JNI) for optimization. Useful for shuffle-intensive operations.
- Higher availability of the NameNode: the existing active/standby setup is highly available, but if the active (or standby) NameNode fails, it falls back to a non-HA mode. Support for ++more than one++ standby NameNode has been introduced.
- Dependency on Linux ephemeral port range: default ports are moved out of the ephemeral port range
- Disk-level data skew: multiple disks (or drives) managed by DataNodes. Sometimes, adding or replacing disks leads to significant data skew within a DataNode.
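The "around 200%" overhead figure, and the saving from erasure coding, follow from simple arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data cells and 3 parity cells for the erasure-coded case; that layout is an assumption made for illustration, and the helper name is hypothetical.

```python
def storage_overhead_pct(data_units, redundant_units):
    """Extra storage needed for redundancy, as a percentage of the original data."""
    return 100 * redundant_units / data_units

# Replication factor 3: one original copy plus two extra copies.
replication_overhead = storage_overhead_pct(data_units=1, redundant_units=2)

# Erasure coding, assuming 6 data cells protected by 3 parity cells.
erasure_coding_overhead = storage_overhead_pct(data_units=6, redundant_units=3)

print(replication_overhead, erasure_coding_overhead)  # 200.0 50.0
```

Under these assumptions the same data costs a quarter of the redundancy space, at the price of extra computation to encode and reconstruct blocks.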
### Origins
- Fault tolerant: The system should be able to handle any failure of the machines automatically, in an isolated manner. This means the failure of one machine should not affect the entire application.
- Load balancing: If one machine fails, then its work should be distributed automatically to the working machines in a fair manner.
- Data loss: once data is written to disk, it should never be lost even if one or two machines fail.
### Concept of blocks and replication
Blocks are created by splitting each file into 64 MB chunks (the size is configurable) and replicating each block three times by default so that, if a machine holding one block fails, then data can be served from another machine.
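A small calculation makes the block and replica counts concrete. This is a sketch with a hypothetical helper name; the 64 MB block size and replication factor 3 are the defaults named above, and the 1000 MB file is an arbitrary example.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, number of block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # the last block may be partial
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication       # a partial block is not padded
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 48 3000
```

With three replicas of each of the 16 blocks, losing one machine still leaves two copies of every block it held.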
### MapReduce model
- Provide parallelism
- Fault tolerance
- Data locality features.
(Data locality means a program is executed where data is stored instead of bringing the data to the program.)
**NameNodes** and **DataNodes** have a specific role in managing overall clusters.
NameNodes are responsible for maintaining metadata information.
### Hadoop Logical View
![Hadoop Logical View](../_resources/HadoopLogicalView.jpg)
### Ingress/egress/processing
- Ingesting (ingress) data
- Reading (Egress) data
- Processing already ingested data
These actions can be automated via the use of tools or automated code.
### Data integration components
For ingress/egress or data processing in Hadoop, you need data integration components. These components are tools, software, or custom code that help integrate the underlying Hadoop data with user views or actions. These components give end users a unified view of data in Hadoop across different distributed Hadoop folders, in different files and data formats.
ie Hue, Sqoop, Java Hadoop Clients, Hive, Beeline Clients
### Data access interfaces
Data access interfaces allow you to access underlying Hadoop data using different languages such as SQL or NoSQL, APIs such as REST and Java APIs, or different data formats such as search data formats and streams. Sometimes, the interface that you use to access data from Hadoop is tightly coupled with the underlying data processing engine, e.g. Spark SQL, Solr or Elasticsearch.
### Data Processing Engines
Data processing engines manipulate the underlying data; they differ in how they use system resources and offer completely different SLA guarantees. For example, the MapReduce processing engine is more disk I/O-bound (keeping RAM usage under control) and is suitable for batch-oriented data processing. Spark, an in-memory processing engine, is less disk I/O-bound and more dependent on RAM. It is more suitable for stream or micro-batch processing.
### Resource management frameworks
Expose abstract APIs to interact with underlying resource managers for task and job scheduling in Hadoop. These frameworks ensure there is a set of steps to follow for submitting jobs in Hadoop using designated resource managers such as YARN or MESOS. These frameworks help establish optimal performance by utilizing underlying resources systematically. ie Tez or Slider.
### Task and resource management
Task and resource management is about sharing a large cluster of machines across different, simultaneously running applications. Examples are YARN and Mesos.
YARN is a Unix process while Mesos is Linux-container-based.
### Data input/output
The data input/output layer is primarily responsible for different file formats, compression techniques, and data serialization for Hadoop storage.
### Data Storage Medium
HDFS is the primary data storage medium used in Hadoop. It is a Java-based, high-performance distributed filesystem that is based on the underlying UNIX File System.
## Core Hadoop Ecosystem
![coreHadoop](../_resources/CoreHadoopEcosystem.png)
Hadoop can handle big files effectively. Files are broken up into blocks of 64 or 128 MB (configurable) and stored across several commodity computers.
<img src="../images/File542.png" width="200">
## HDFS Read Mechanism
![read](../_resources/HDFS_Read_Mechanism.png)
## HDFS Write Mechanism
![write](../_resources/HDFS_Write_mechanism.jpeg)
# Mapreduce
## Conceptual
1. Raw Data
2. Mapper
3. Shuffle and Sort (happens automatically by Hadoop)
4. Reducer
![MapperReduce](../_resources/HadoopMapperReduce_conceptionally.png)
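The four conceptual stages can be sketched as a word count in plain Python. This is only an illustration of the model on a single machine: the sample lines and function names are hypothetical, and in Hadoop the shuffle-and-sort step is done by the framework across nodes.

```python
from itertools import groupby
from operator import itemgetter

# 1. Raw data (hypothetical sample lines)
raw_data = ["spark hadoop spark", "hadoop flume", "spark"]

# 2. Mapper: emit a (key, value) pair for every word
def mapper(line):
    for word in line.split():
        yield (word, 1)

# 3. Shuffle and sort: bring all values for the same key together
def shuffle_and_sort(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

# 4. Reducer: combine the values of one key into a single result
def reducer(key, values):
    return key, sum(values)

mapped = [pair for line in raw_data for pair in mapper(line)]
result = [reducer(key, values) for key, values in shuffle_and_sort(mapped)]
print(result)  # [('flume', 1), ('hadoop', 2), ('spark', 3)]
```

Only the mapper and the reducer contain application logic, which is why the model parallelizes well: each mapper call and each reducer call is independent of the others.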
## Distribution
Raw data is split up into partitions, and the partitions are distributed to different nodes.
## How all works together
![overview](../_resources/OverviewMapReduceMasterNodes.png)
Data locality is important. The client node stores data into HDFS. The DataNodes need to access this data, therefore the data has to be distributed efficiently.
## Handling Failure
1. Application master monitors worker tasks for errors or hanging
   - restarts as needed
   - preferably on a different node
2. Application master goes down
   - YARN can try to restart it
3. Entire node goes down
   - could be the application master (see 2)
   - resource manager will try to restart (Hadoop 3 has solution)
4. Resource manager goes down
   - recovery is only possible when ZooKeeper has a standby resource manager, which is started as a replacement
# References
[slides](https://www.slideshare.net/Simplilearn/hadoop-architecture-hdfs-architecture-hadoop-architecture-tutorial-hdfs-tutorial-simplilearn)
[Youtube for slides(1:31 hour)](https://www.youtube.com/watch?v=CI0QkZYsLmw)
[The Hadoop Ecosystem Table](https://hadoopecosystemtable.github.io/)

69
Apache/NiFi.md Normal file
View File

@ -0,0 +1,69 @@
---
title: NiFi
updated: 2022-05-24 18:29:11Z
created: 2022-05-21 13:19:51Z
---
## What is Apache NiFi used for:
- reliable and secure transfer of data between systems
- delivery of data from sources to analytics platforms => top use case
- enrichment and preparation of data:
  - conversion between formats => one thing at a time (json => csv)
  - extraction/parsing
  - route decisions => get the value of a JSON field and decide on that value: send the JSON to system A, otherwise to system B
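The routing decision described in the list above can be sketched in plain Python. This is not NiFi code: the function name, the `priority` field, and the destination labels are hypothetical and only illustrate choosing a destination from the value of one JSON field.

```python
import json

def route(flowfile_content, field="priority", expected="high"):
    """Pick a destination based on the value of a single JSON field."""
    record = json.loads(flowfile_content)
    # Matching value goes to system A; anything else (including a missing field) to system B.
    return "system_a" if record.get(field) == expected else "system_b"

print(route('{"priority": "high", "msg": "disk full"}'))  # system_a
print(route('{"priority": "low", "msg": "login ok"}'))    # system_b
```

The content itself is passed on unchanged; only the destination depends on the inspected field, which keeps routing a simple per-message decision rather than a computation over many messages.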
## What is Apache NiFi **NOT** used for?
- distributed computation
- complex event processing
- joins / complex rolling window operations
## Hadoop ecosystem integration examples
### HDFS ingest
- MergeContent
  - merges into appropriately sized files for HDFS
  - based on size, number of messages, and time
- UpdateAttribute
  - sets the HDFS directory and filename
  - use expression language to dynamically bin by date
- PutHDFS
  - writes FlowFile content to HDFS
  - supports conflict resolution strategy and Kerberos authentication
![c45b3dcdac107122793b14d8bdd76a0f.png](../_resources/c45b3dcdac107122793b14d8bdd76a0f.png)
### HDFS Retrieval
- ListHDFS
  - periodically performs a listing on an HDFS directory
  - produces a FlowFile per HDFS file
  - the FlowFile only contains the HDFS path & filename
- FetchHDFS
  - retrieves a file from HDFS
  - uses incoming FlowFiles to dynamically fetch
![a6ea2a07d58fac8a6739c7379c1b92f6.png](../_resources/a6ea2a07d58fac8a6739c7379c1b92f6.png)
### HBase integration
- HBase ingest - single cell => table, row id, col family and col qualifier
  - FlowFile content becomes the cell value
- HBase ingest - full row
  - Row id can be a field in JSON or a FlowFile attribute
## Kafka integration
- PutKafka
  - Provide broker and topic name
  - publishes FlowFile content as one or more messages
  - Ability to send large delimited content, split into messages by NiFi
- GetKafka
  - Provide ZK connection string and topic name
  - produces a FlowFile for each message consumed
## Stream Processing Integration
![1ce08014a43470c07e5314f1d69c6771.png](../_resources/1ce08014a43470c07e5314f1d69c6771.png)
- Spark Streaming - NiFi Spark Receiver
- Storm - NiFi Spout
- Flink - NiFi Source & Sink
- Apex - NiFi Input Operations & Output Operations
- and many more integrations available
[NiFi Videos](https://nifi.apache.org/videos.html)

13
Apache/RabbitMQ.md Normal file
View File

@ -0,0 +1,13 @@
---
title: RabbitMQ
updated: 2022-05-24 19:49:57Z
created: 2022-05-24 19:45:41Z
---
RabbitMQ is the most widely deployed open source message broker.
### Asynchronous Messaging
> Supports multiple messaging protocols, message queuing, delivery acknowledgement, flexible routing to queues, multiple exchange type.
[source](https://www.rabbitmq.com)

13
Apache/Samoa.md Normal file
View File

@ -0,0 +1,13 @@
---
title: Samoa
updated: 2022-05-24 18:34:24Z
created: 2022-05-24 18:30:37Z
---
# Scalable Advanced Massive Online Analysis
Apache SAMOA is a platform for mining on big data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.
Apache SAMOA enables development of new ML algorithms without dealing with the complexity of underlying streaming processing engines (SPE, such as Apache Storm and Apache S4). Apache SAMOA also provides extensibility in integrating new SPEs into the framework. These features allow Apache SAMOA users to develop distributed streaming ML algorithms once and to execute the algorithms in multiple SPEs, i.e., code the algorithms once and execute them in multiple SPEs.
[samoa-project](https://samoa-project.net/)

70
Apache/Storm.md Normal file
View File

@ -0,0 +1,70 @@
---
title: Storm
updated: 2022-05-24 19:15:01Z
created: 2021-05-04 14:58:11Z
---
# Why use Apache Storm?
Apache Storm is a free and open source distributed real-time computation system. Apache Storm makes it easy to reliably process **unbounded streams of data**, doing for real-time processing what Hadoop did for batch processing. Apache Storm is simple and can be used with any programming language.
Apache Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, re-partitioning the streams between each stage of the computation however needed. Read more in the tutorial.
- Real-time continuous streaming data on clusters
- Can run on top of YARN
- Works on individual events (**NOT** micro-batches like Spark)
- Storm is a better fit than Spark Streaming for per-event processing
- Storm is perfect for sub-second latency (fast)
## Storm Topology
<img src="../images/StormSpoutBolt.png" width="200">
- Streams consist of ___tuples___ that flow through
- Spouts are ___sources___ of stream data (from Kafka, Twitter, etc.)
- ___Bolts___ process stream data as it's received
- transform, aggregate, write to database / HDFS
- So there is no final state: the data stream goes on and on forever
- A Storm topology is a graph of spouts and bolts that process the stream
- can get complex (in Spark you get the DAG for free)
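
The spout and bolt roles above can be imitated in a single process with Python generators. This is a sketch of the dataflow idea only, not the Storm API; a real topology distributes these stages over a cluster and never terminates. All function names are made up for the example.

```python
from collections import Counter
from typing import Iterable, Iterator

def sentence_spout(lines: Iterable[str]) -> Iterator[tuple]:
    """Spout: source of the stream, emitting one tuple per line."""
    for line in lines:
        yield (line,)

def split_bolt(stream: Iterable[tuple]) -> Iterator[tuple]:
    """Bolt: transform each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)

def count_bolt(stream: Iterable[tuple]) -> Iterator[tuple]:
    """Bolt: aggregate, emitting the running count after every word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# Wire the graph: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(["the cat", "The dog"])))
results = list(topology)
```

Because each bolt emits as tuples arrive, there is no final result, only a running one, which matches the "no final state" point above.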
## Storm Architecture
<img src="../images/StormArchitecture.png" width="300">
- Nimbus is a single point of failure
- Job tracker
- can restart quickly without losing any data
- HA is available as a Nimbus backup server
- ZooKeeper (which is itself HA)
- Supervisors are doing the work
## Developing Storm applications
- usually in Java
- Bolts may be directed through scripts in other languages
- Selling point of Storm, but in practice in Java
- Storm Core
- lower-level API for Storm
- "At-least-once" semantics (possibility of duplicated data)
- Trident
- High-level API for Storm <=== prefer
- "Exactly once" semantics
- After submitted, Storm runs forever - until explicitly stopped
## Storm vs Spark Streaming
Storm
- tumbling window
- e.g. all events in the past 5 sec exactly; no overlap of events
- sliding window
- can overlap by design
Storm: in practice mostly Java
Spark
- graph, ML, micro-batch streaming
Spark in Scala and Python
Kafka and Storm => perfect combination
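
The difference between tumbling and sliding windows can be sketched as follows. This assumes events are `(timestamp, value)` pairs with integer timestamps in seconds; it illustrates window assignment only and does not use Storm's windowing API.

```python
def tumbling_windows(events, size):
    """Group events into non-overlapping windows of `size` seconds."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size  # each event falls in exactly one window
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))

def sliding_windows(events, size, slide):
    """Group events into windows of `size` seconds that start every `slide` seconds."""
    windows = {}
    for ts, value in events:
        # An event belongs to every window whose [start, start + size) covers it.
        start = (ts // slide) * slide
        while start > ts - size and start >= 0:
            windows.setdefault(start, []).append(value)
            start -= slide
    return dict(sorted(windows.items()))

events = [(1, "a"), (4, "b"), (6, "c")]
tumbling = tumbling_windows(events, size=5)
sliding = sliding_windows(events, size=4, slide=2)
```

With `size=4` and `slide=2`, event `b` lands in two windows, which is the overlap that a tumbling window rules out.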

160
Cloud/1 Cloud General.md Normal file
View File

@ -0,0 +1,160 @@
---
title: 1 Cloud General
updated: 2021-09-06 19:07:35Z
created: 2021-05-04 14:58:11Z
---
# Traditional IT Services Deployment Models
1. On Premises solutions (CPE Customer Premises Equipment)
- All equipment is located in your building
- All equipment is owned by you
- There are clear lines of demarcation: everything in the building is your responsibility, and the connections between offices are your network service provider's responsibility
- Equipment is CapEx
- New equipment will typically take over a week to deploy
- Equipment requires technology refreshes
- You need to consider redundancy
<p align="center"> <img src="https://s14-eu5.startpage.com/cgi-bin/serveimage?url=https%3A%2F%2Fbtinet.files.wordpress.com%2F2012%2F08%2Fcolored-tier-data-center.jpg%3Fw%3D630%26amp%3Bh%3D478&sp=a24786e7ada86060dba83d2a4ab59318" width="450" title="Data Center Tiers"></p>
2. Colocation or Colo services
- A colocation centre, or "colo", is a data center location where the owner of the facility rents out space to external customers
- You own your own server, storage, and networking equipment within the colo facility (CapEx cost)
- The facility owner provides power, cooling, and physical security for their customers' server, storage, and networking equipment
# Server Virtualization
<p align="center"> <img src=" https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Hyperviseur.png/400px-Hyperviseur.png" width="350" title="Data Center Tiers"></p>
1. Type 1 Hypervisors run directly on the system hardware (Bare Metal)
e.g. VMware ESXi, Microsoft Hyper-V, Red Hat KVM
2. Type 2 Hypervisors run on top of a host operating system
e.g. VMware Workstation, Player and Fusion, VirtualBox, Parallels, QEMU
# Definition Cloud Computing NIST [1]:
> Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
# Essential Characteristics
1. **On-demand self-service.** A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.
2. **Broad network access.** Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).
3. **Resource pooling.** The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or data center). Examples of resources include storage, processing, memory, and network bandwidth.
4. **Rapid elasticity.** Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.
>Comments: Servers can be quickly provisioned and decommissioned based on current demand. Elasticity allows customers to achieve cost savings and is often a core justification for adoption
5. **Measured service.** Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
# Service Models:
## On Premise (green customer managed, orange Provider managed)
![](../_resources/onpremise.png)
## Colo (green customer managed, orange Provider managed)
![](../_resources/colo.png)
1. ## Software as a Service (SaaS)
The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
![SaaS](../_resources/SaaS.png)
> examples:
- Microsoft Office 365
- Salesforce.com
- Intuit
- Adobe Creative Cloud
- Gmail
2. ## Platform as a Service (PaaS)
The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.
![PaaS](../_resources/PaaS.png)
> examples:
- AWS Elastic Beanstalk
- Microsoft Azure (offers both IaaS and PaaS services)
- Google Apps
- Salesforce Force.com
- IBM Bluemix
3. ## Infrastructure as a Service (IaaS)
The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).
![IaaS](../_resources/IaaS.png)
>Cloud Providers will often offer three distinct flavours of IaaS compute:
- Virtual machines on shared physical servers
- different customers share same underlying physical servers
- least expensive
- fewest options for vCPU, RAM and storage
- provision quickly
- most commonly deployed option
- Virtual machines on dedicated physical servers
- customer is guaranteed that the underlying physical server is dedicated to them
- more options for vCPU, RAM and storage
- may be required to sign a minimum length contract
- Dedicated bare-metal physical servers
- customer is given access to their own physical servers
- hypervisor is NOT installed and managed by cloud provider
- customer can install a hypervisor themselves, or an OS directly on the hardware
- most expensive option
- may be required to sign a minimum length contract
- AWS did not offer this option originally; bare-metal EC2 instances were added later
- Customers can mix and match between the three types
## Optional Service Model (not defined by NIST) XaaS
> Many cloud providers also offer other "as a Service" offerings
These are sometimes described as XaaS Anything as a Service
Examples include:
- DaaS Desktop As A Service
- DRaaS Disaster Recovery As A Service
- BaaS Backup As A Service
- Storage As A Service
- .... many more ..
# Deployment Models
1. ## Private cloud
The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
- How does it differ from On Prem?
- On-Demand Self-Service
- Rapid Elasticity
- Broad Network Access
- Resource Pooling
- Measured Service
- a new server is typically ordered through a web portal
- the company will use automation software
- Private Cloud is most suitable for large companies where the long term ROI and efficiency gains can outweigh the initial effort and cost to set up the infrastructure and automated workflows
2. ## Community cloud
The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.
- least common deployment model
- similar to a traditional extranet, but with full shared data center services instead of just network connectivity between On Prem offices
3. ## Public cloud
The cloud infrastructure is provisioned for open use by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider.
- examples
- AWS
- Microsoft Azure
- IBM Bluemix
- Salesforce
- Most common deployment model
4. ## Hybrid cloud
The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
- Companies with limited Private Cloud infrastructure may cloud burst into Public Cloud for additional capacity when required
- A company could also have Private Cloud at their main site and use Public Cloud for their Disaster Recovery location
sources:
1. [The NIST Definition of Cloud Computing](https://csrc.nist.gov/publications/detail/sp/800-145/final)
2. [A Practical Introduction to Cloud Computing](https://www.udemy.com/introduction-cloud-computing)

133
Cloud/2 Intro GC.md Normal file
View File

@ -0,0 +1,133 @@
---
title: 2 Intro GC
updated: 2021-09-06 19:07:42Z
created: 2021-09-06 07:30:13Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
06/09/2021 11:49
# Google Cloud Platform (GCP) Infrastructure
![5a61dcd70f5bb58e422ce7b3ef30f1b2.png](../_resources/5a61dcd70f5bb58e422ce7b3ef30f1b2.png)
![c53ad46a0de9b5989d9c61f607aad517.png](../_resources/c53ad46a0de9b5989d9c61f607aad517.png)
[https://cloud.google.com/video-intelligence]
![87bb8cb9d5ec12045bbbf38c08b8c7a0.png](../_resources/87bb8cb9d5ec12045bbbf38c08b8c7a0.png)
No, this will not save us. The growth of compute power has slowed dramatically because of fundamental physical limitations.
One solution is to limit the power consumption of a chip, and you can do that by building application-specific integrated circuits (ASICs).
![9816005cb285fba7d59895ab9686676c.png](../_resources/9816005cb285fba7d59895ab9686676c.png)
The **T**ensor **P**rocessing **U**nit or TPU is an ASIC specifically optimized for ML. It has more memory and a faster processor for ML workloads than traditional CPUs or GPUs.
ML model training and feature engineering is one of the most time-consuming parts of any machine learning project.
## Elastic Storage with Google Cloud Storage
Storage and compute power (VMs) are separated and independent of each other. This makes cloud computing different from desktop computing.
### Create Cloud Storage:
- through UI (browser)
- CLI: `gsutil mb -p [PROJECT NAME] -c [STORAGE CLASS] -l [BUCKET LOCATION] gs://[BUCKET NAME]/`
`mb` : make bucket
![65d46691cee632fc49a1054a983fba7c.png](../_resources/65d46691cee632fc49a1054a983fba7c.png)
All classes have multi-region, dual-region, and region location options. They differ based on the access speed and the cost.
For data analysis workloads, it's common to use a standard storage bucket within a region for staging your data. Why do I say within a region? That's because you need the data to be available to your data processing computing resources, and these will often be within a single region. Co-locating your resources this way maximizes the performance for data-intensive computations and could reduce network charges.
`-l` : bucket location, e.g. EUROPE-WEST4
Bucket names have to be globally unique, so you can use your project ID, which is itself globally unique, as the name for your bucket.
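
As a small sketch, the `gsutil mb` invocation described above can be assembled from Python. The helper name `gsutil_mb_command` and the value `my-project-id` are hypothetical; the code only builds the command line and does not run it.

```python
import shlex

def gsutil_mb_command(project: str, storage_class: str, location: str, bucket: str) -> list[str]:
    """Assemble the argv for `gsutil mb` without running it."""
    return ["gsutil", "mb", "-p", project, "-c", storage_class,
            "-l", location, f"gs://{bucket}/"]

# Hypothetical values: the project ID doubles as the bucket name.
argv = gsutil_mb_command("my-project-id", "STANDARD", "EUROPE-WEST4", "my-project-id")
command_line = shlex.join(argv)
```

If gsutil is installed and authenticated, the same `argv` list could be passed to `subprocess.run(argv, check=True)`.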
![35424d82fe826bba9b7f251153ef6278.png](../_resources/35424d82fe826bba9b7f251153ef6278.png)
### But what's a project and organisation?
A project is a base-level organizing entity for creating and using resources and services for managing billing, APIs, and permissions.
**Zones and regions** physically organize the GCP resources you use, whereas **projects** logically organize them. Projects can be created, managed, deleted, even recovered from accidental deletions.
**Folders** are another logical grouping you can have for collections of projects. __Having an organization is required to use folders__. What's an organization?
The organization is a root node of the entire GCP hierarchy. While it's not required, an organization is quite useful because it allows you to set policies that apply throughout your enterprise to all the projects and all the folders that are created in your enterprise.
Cloud Identity and Access Management, also called **Cloud IAM or IAM**, lets you fine-tune access control to all the GCP resources you use.
To move data around, both locally and in the cloud, use gsutil.
![ce13e5abfb93d0f151b38e19a04cf13b.png](../_resources/ce13e5abfb93d0f151b38e19a04cf13b.png)
## Networking
Google's data centers around the world are interconnected by the **private**, full-duplex Google Jupiter network.
Its petabit-scale bisection bandwidth makes it possible to separate compute and storage. There is no need to do everything on a single machine, or even on a single cluster of machines with its own dedicated storage, because the network is fast enough. Locality within the cluster is not important.
Edge points of presence are where Google's network interconnects with the public Internet, at more than 90 internet exchanges and more than 100 points of presence worldwide. Google's Edge caching network places content close to end-users to minimize latency.
## Security: On-premise vs Cloud-native
![e46733bcebbc39b3852081970b1ad9cd.png](../_resources/e46733bcebbc39b3852081970b1ad9cd.png)
Communications over the internet to our public cloud services are encrypted in transit
- In-transit encryption
- Multiple layers of security
- Backed by Google security, e.g. protection against DoS attacks
Stored data is automatically encrypted at rest and distributed for availability and reliability.
eg BigQuery:
- BigQuery table data is encrypted with keys (and the keys are also encrypted); you can also provide your own encryption keys.
- Monitor and flag queries for anomalous behavior
- Limit data access with authorized views
## Big data and ML products
![5f9685eb06bc6b913e29821e3c135f78.png](../_resources/5f9685eb06bc6b913e29821e3c135f78.png)
**GFS** Google File System to handle sharding and storing petabytes of data at scale.
**MapReduce**: manage large-scale data processing across large clusters of commodity servers. Automatically parallelized and executed on a large cluster of these commodity machines. Disadvantage: developers have to write code to manage all of the infrastructure of commodity servers.
(**Apache Hadoop**: now used in many industries for a huge variety of tasks that all share the common theme of volume, velocity and variety of structured, and unstructured data)
**Bigtable**: solved the problem of recording and retrieving millions of streaming user actions with high throughput (inspiration for HBase or MongoDB)
**Dremel** took a new approach to big data processing: it breaks data into small chunks called shards and compresses them into a columnar format across distributed storage. It then uses a query optimizer to farm out tasks between the many shards of data and the Google data centers full of commodity hardware to process a query in parallel and deliver the results. The big leap forward here was that the service auto-manages data imbalances and communications between workers, and auto-scales to meet different query demands. Dremel became the query engine behind BigQuery.
**Colossus**: next-generation distributed data store.
**Spanner** as a planet scale relational database.
**Flume** and **Millwheel** for data pipelines.
**Pub/Sub** for messaging.
**TensorFlow** for machine learning.
**TPU** (Hardware).
## [Google Cloud Public Datasets](https://services.google.com/fh/files/misc/public_datasets_one_pager.pdf)
Facilitate access to high-demand public datasets, hosted in BigQuery and Google Cloud Storage.
[Datasets](https://cloud.google.com/solutions/datasets)
## Choosing the right approach
Compute Engine: Infrastructure as a Service (***IaaS***)
A single VM instance; maximum flexibility, managed by the user.
Google Kubernetes Engine (***GKE***): a cluster of machines running containers. Containerization packages code so that it is highly portable and uses resources efficiently. GKE is the orchestrator.
App Engine: Platform as a Service (***PaaS***)
Use for long-living applications; can autoscale.
Cloud Functions: serverless environment, Functions as a Service (***FaaS***). Executes code in response to events.
![c3f2d93dc2ae40260e94b36083fc20a9.png](../_resources/c3f2d93dc2ae40260e94b36083fc20a9.png)
## What you can do with Google Cloud
[Google Customers solutions](https://cloud.google.com/customers)
For Products and Solutions, filter on big data analytics and also on machine learning. Select a customer use case that interests you, then answer these three questions.
1. what were the barriers or challenges the customer faced? The challenges are important, you want to understand what they were.
2. how were these challenges solved with a cloud solution? What products did they use?
3. what was the business impact?
Example architecture
![da8ec8c3f3b5290b4f2ec2c5fb3a7122.png](../_resources/da8ec8c3f3b5290b4f2ec2c5fb3a7122.png)
## Key roles in a data-driven organization
***Data engineers*** to build the pipelines and get you clean data.
***Decision makers***, to decide how deep you want to invest in a data-driven opportunity while weighing the benefits for the organization.
***Analysts***, to explore the data for insights and potential relationships that could be useful as features in a machine learning model.
***Statisticians***, to help make your data-inspired decisions become true data-driven decisions, with their added rigor.
***Applied machine learning engineers***, who have real-world experience building production machine learning models from the latest and best information and research by the researchers.
***Data scientists***, who have the mastery over analysis, statistics, and machine learning.
***Analytics managers*** to lead the team.
***Social scientists and ethicists*** to ensure that the quantitative impact is there for your project and that it's the right thing to do.
A single person might have a combination of these roles, but this depends on the size of your organization.

View File

@ -0,0 +1,39 @@
---
title: 3 Recommendation Systems
updated: 2021-09-07 18:33:42Z
created: 2021-09-06 16:26:21Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
# Cloud SQL and Cloud Dataproc
- Cloud SQL: managed relational database
- Cloud Dataproc: managed environment on which you can run Apache Spark
### Why move from on-premises to cloud?
- utilizing and tuning on-premises clusters is difficult
- so is moving from dedicated storage to off-cluster storage
A core aspect of a **recommendation system** is that you need to train and serve it at scale.
### what is managed?
................
### Recommendation Systems
The core pieces are:
- data
- the model
- infrastructure
to train and serve recommendations to users.
A core tenet of machine learning is to let the model learn for itself what the relationship is between the data that you have, like user preferences (labeled data), and the data that you don't have. A history of good labeled data is important.
Machine learning scales much better because it doesn't require hard-coded rules. It's all automated. Learning from data in an automated way, that's what machine learning is.
A machine learning recommendation model essentially asks two questions. First, "Who is this user like?" Second, is this subjectively an item that people tend to rate highly? The predicted rating is a combination of both these factors.
All things considered, the rating of an item for a particular user will be the average of the ratings of users like this user but it's calibrated with the quality of the item itself.
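
A minimal sketch of that combination, under a loud assumption: the notes do not say how the calibration is done, so the linear blend and the `item_weight` parameter below are hypothetical choices made only to illustrate the two factors.

```python
from statistics import mean

def predict_rating(similar_user_ratings, item_ratings, item_weight=0.5):
    """Blend the ratings of similar users with the item's overall average.

    item_weight is a hypothetical knob: 0 trusts only similar users,
    1 trusts only the item's general quality.
    """
    neighbour_avg = mean(similar_user_ratings)  # "Who is this user like?"
    item_avg = mean(item_ratings)               # "Do people rate this item highly?"
    return (1 - item_weight) * neighbour_avg + item_weight * item_avg

predicted = predict_rating([4, 5], [3, 3, 3], item_weight=0.5)
```

Here users similar to this one rated the item 4.5 on average while the item's overall average is 3, so the blended prediction sits between the two.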
Updating the data can be done in **batch**, because the ratings of the items don't change on, e.g., a daily basis. On the other hand, there is a lot of data that has to be updated in a **fault-tolerant way** that can scale to **large datasets** ==> Apache Hadoop.
When the user logs on, we want to show that user the recommendations that we precomputed specifically for them. So we need a transactional way to store the predictions (so that while the user is reading these predictions, we can update the predictions table as well). E.g. 1 million users with 5 predictions each = 5 million rows. A MySQL database is sufficient.

View File

@ -0,0 +1,126 @@
---
title: WK 1 Modernizing Data Lakes and Data Warehouses with GCP
updated: 2021-09-18 21:24:35Z
created: 2021-09-18 17:26:31Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
# Explore the role of a data engineer
![5908a4d876bce77f6cc35e42ca230be3.png](../_resources/5908a4d876bce77f6cc35e42ca230be3.png)
![73822f556381b2d599822c18a4e09d83.png](../_resources/73822f556381b2d599822c18a4e09d83.png)
The point of a data pipeline is to make data-driven decisions.
A data lake brings together data from across the enterprise into a single location. The purpose of a data lake is to make data accessible for analytics.
![0532ef04db4483bf2be14245b2c02b4b.png](../_resources/0532ef04db4483bf2be14245b2c02b4b.png)
Cloud storage is blob storage. So you might need to think about the granularity of what you store.
Cloud storage bucket is a good option for staging all of your raw data in one place before building transformation pipelines into your data warehouse.
![36ea2fef8a287ca026801e2d45632c20.png](../_resources/36ea2fef8a287ca026801e2d45632c20.png)
Because of Google's many data center locations and high network availability, storing data in a GCS bucket is durable and performant. As a data engineer, you will often use a cloud storage bucket as part of your data lake to store many different types of raw data files: CSV, JSON, Avro, Parquet, etc. You could then load or query them directly from BigQuery, which is a data warehouse.
Other Google Cloud platform products and services can easily query and integrate with your bucket once you've got it set up and loaded with data.
Common challenges encountered by data engineers:
- Access to data
A typical problem: data is scattered around different locations (e.g. departments with their own systems: you need to know how to combine the data)
![0745b249204a6a1ab9b5aaafaa18d53b.png](../_resources/0745b249204a6a1ab9b5aaafaa18d53b.png)
- Data accuracy and quality
Cleaning, formatting, and getting the data ready for insights requires that you build extract, transform, load (ETL) pipelines. **ETL pipelines** are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake but in a **data warehouse**. Like a data lake, a data warehouse is a consolidated place, but this time the data that we're storing is all easily joinable and queryable. Unlike a data lake, where the data is in a raw format, in the data warehouse the data is stored in a way that makes it very efficient to query.
- Availability of computational resources
The problem is that the compute that's needed by any specific ETL job is not constant over time. This means that when traffic is low, you're going to be wasting money because you have computers out there doing nothing, and when traffic is high, those computers are so busy that your jobs are taking way too long.
- Query performance
Once your data is in your data warehouse, you need to optimize the queries that your users are running to make the most efficient use of your compute resources.
## BigQuery
**BigQuery** is Google Cloud's petabyte scale **Serverless Data Warehouse.**
Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project.
![67abe953280b3a8006e5a19d4e094040.png](../_resources/67abe953280b3a8006e5a19d4e094040.png)
Cloud Identity and Access Management, or Cloud IAM, is used to grant permission to perform specific actions in BigQuery. This replaces the SQL grant and revoke statements that are used to manage access permissions in traditional SQL databases.
![362653ef1df287f348947de381643f48.png](../_resources/362653ef1df287f348947de381643f48.png)
BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them, and deallocated as you remove data or you drop tables. Query resources are allocated according to the query type and complexity. Each query uses some number of what are called slots. **Slots** are units of computation that comprise a certain amount of CPU and RAM.
## Data Lakes and Data Warehouses
![f5d936aaf30e3634df936c86cd99003e.png](../_resources/f5d936aaf30e3634df936c86cd99003e.png)
Considerations when choosing a data warehouse:
- The data warehouse is going to serve as a sink. Will it be fed by a batch pipeline or by a streaming pipeline?
Does it need to be up-to-the-minute correct, or is it enough to load data into it once a day or once a week?
- Will the data warehouse scale to meet my needs?
- How is the data organized? Is it cataloged? Is it access controlled?
Will you be able to share access to the data with all your stakeholders? Who will pay for the querying?
- Is the warehouse designed for performance? Carefully consider concurrent query performance, and whether that performance comes out of the box or whether you need to go around creating indexes and tuning the data warehouse.
- What level of maintenance is required by your engineering team?
![7b7de3102acdf026a8f7d05aba8e7217.png](../_resources/7b7de3102acdf026a8f7d05aba8e7217.png)
BigQuery provides mechanisms for automated data transfer and powers business applications using technologies like SQL that teams already know and use
### Other option
That is to treat BigQuery as just a query engine and allow it to query the data in the data lake, data in place. For example, you can use BigQuery to directly query database data in Cloud SQL, that is managed relational databases like PostgreSQL, MySQL, and SQL Server. You can also use BigQuery to directly query files in Cloud Storage as long as those files are in formats like CSV or Parquet.
![920bc1ebbbe823a40736ed122e695036.png](../_resources/920bc1ebbbe823a40736ed122e695036.png)
The real power comes when you can leave your data in place and still join it against other data in the data warehouse.
## Transactional Databases vs Data Warehouses
- Cloud SQL backend transactional Database systems that support your company's applications: optimized to be a database for transactions.
![703c231dfc0b7336315963917d74b038.png](../_resources/703c231dfc0b7336315963917d74b038.png)
- Data Warehouses that support your analytic workloads. Optimized for reads
![94a60b4329d9d14e300dd2f806af1521.png](../_resources/94a60b4329d9d14e300dd2f806af1521.png)
**The data lake is designed for durability and high availability.**
## How to provide access to the data warehouse while keeping to data governance best practices?
The three most common clients are
1. Machine Learning engineers
- how long does it take for a transaction to make it from raw data all the way into the data warehouse, to be available at prediction time?
- how difficult would it be to add more columns or more rows of data into certain datasets?
- make your datasets easily discoverable, documented, and available to them to experiment on quickly
2. data or business analysts
- rely on good clean data so that they can query it for insights and build dashboards.
- need datasets that have clearly defined schema definitions, the ability to quickly review rows, and the performance to scale to many concurrent dashboard users.
![8bc3cae2d26662dcf3d103e60e365a3f.png](../_resources/8bc3cae2d26662dcf3d103e60e365a3f.png)
3. other data engineers
- will the data always be available when we need it?
## Google Cloud Stackdriver
- monitoring tool
- track resource use
- create audit logs
- who has run what
- trace usage of sensitive datasets
- create alerts and send notifications
## Manage data access and governance
Overall governance covers how data is to be used, and not used, by your users, including privacy and security.
Clearly communicate a data governance model for who should and who should not be able to access the data.
How will our end users discover different data sets that we have for analysis?
1. One solution for data governance is Cloud **Data Catalog**. Data Catalog makes all the metadata about your datasets available for users to search.
![c7b0d8e8356eaef330da4a5ba9040df3.png](../_resources/c7b0d8e8356eaef330da4a5ba9040df3.png)
2. **Data Loss Prevention API**. This helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction for sensitive data elements, like credit card numbers and names.
## Productionize the data process
End-to-end and scalable data processing system
Data engineering team is responsible for the health of the plumbing, that is the pipelines and ensuring that the data is available and up-to-date for analytic and ML workloads.
Common questions:
![43d7aa7e1b2fb5703bc8874d6295cf1c.png](../_resources/43d7aa7e1b2fb5703bc8874d6295cf1c.png)
One common workflow orchestration tool used by enterprises is Apache Airflow and Google Cloud has a fully managed version of **Apache Airflow called Cloud Composer**.
Cloud Composer helps your data engineering team orchestrate the pieces of the data engineering puzzle that we have discussed to date, and even more that we haven't come across yet. The power of this tool comes from the fact that GCP big data products and services have API endpoints that you can call. A Cloud Composer job can then run every night or every hour and kick off your entire pipeline from raw data to the data lake and into the data warehouse for you.

View File

@ -0,0 +1,34 @@
---
title: WK 2 Data wharehouse
updated: 2021-09-20 11:29:16Z
created: 2021-09-20 09:08:49Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
![4c3b142e5e73a6ac715600d0331150c2.png](../_resources/4c3b142e5e73a6ac715600d0331150c2.png)
![d00c7365a714367186d44f165573aa33.png](../_resources/d00c7365a714367186d44f165573aa33.png)
# BigQuery
BigQuery organizes data tables into units called datasets
![3393db98a463262eb4c2453a6140b984.png](../_resources/3393db98a463262eb4c2453a6140b984.png)
The project is what the billing is associated with.
To run a query, you need to be logged into the GCP console. You'll run a query in your own GCP project and the query charges are then billed to your project.
In order to run a query in a project, you need Cloud IAM permissions to submit a job.
Access control is through Cloud IAM; it is set at the dataset level and applies to all tables in the dataset. BigQuery provides predefined roles for controlling access to resources. By defining authorized views and row-level permissions, you can give different users different access to the same data.
BigQuery data sets can be regional or multi-regional.
![3a3ffb85892084804f4ad8c23681d1b2.png](../_resources/3a3ffb85892084804f4ad8c23681d1b2.png)
Logs in BigQuery are immutable and are available to be exported to Stackdriver.
# Loading data into BigQuery
EL, ELT, ETL
![9c2b3ab61d6331389b844684b029f52f.png](../_resources/9c2b3ab61d6331389b844684b029f52f.png)
If your data is in Avro format, which is self-describing, BigQuery can determine the schema directly. If the data is in JSON or CSV format, BigQuery can auto-detect the schema, but manual verification is recommended.
**Backfilling data** means adding missing past data to make a dataset complete with no gaps, and to keep all analytic processes working as expected.
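
Backfilling starts with knowing which periods are missing. The sketch below is a minimal, plain-Python illustration of finding the gaps in a daily dataset; the function name `missing_days` and the sample dates are hypothetical and not part of BigQuery.

```python
from datetime import date, timedelta


def missing_days(loaded_days, start, end):
    """Return the days in [start, end] that have no data yet, in chronological order."""
    loaded = set(loaded_days)
    span = (end - start).days
    return [
        start + timedelta(days=i)
        for i in range(span + 1)
        if start + timedelta(days=i) not in loaded
    ]


# Hypothetical example: days 3 and 4 were never loaded.
loaded = [date(2021, 9, 1), date(2021, 9, 2), date(2021, 9, 5)]
gaps = missing_days(loaded, date(2021, 9, 1), date(2021, 9, 5))
print(gaps)
```

Each day in `gaps` is a candidate for a backfill load job.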

View File

@ -0,0 +1,145 @@
---
title: WK1 Big Data and Machine Learning Fundamentals
updated: 2021-09-11 16:36:12Z
created: 2021-09-07 18:38:10Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
07/09/2021 20:38
# Migrating workloads to the cloud
As an example: an Apache Hadoop and Spark platform and a MySQL database.
A migration needs to add value.
![410d3a1816f7f58c8d812d7047ada137.png](../_resources/410d3a1816f7f58c8d812d7047ada137.png)
Reference of choosing the right storage access platform
![57f223b1256deb84e40fbefc1bc5960a.png](../_resources/57f223b1256deb84e40fbefc1bc5960a.png)
**[Dataproc](https://cloud.google.com/dataproc)** is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.
- **Cloud Storage** as a global file system. Data is unstructured.
- **[Cloud SQL](https://cloud.google.com/sql)** as an RDBMS. Data is structured and transactional.
Cloud SQL generally plateaus out at a few gigabytes _One database_
Fully managed relational database service for MySQL, PostgreSQL, and SQL Server. Advantages:
- Familiar
- Flexible pricing
- Managed backups
- Connect from anywhere
- Automatic replication
- Fast connection from GCE (Google Compute Engine) & GAE (Google App Engine)
- Google security
- **Cloud Datastore** as a transactional NoSQL object-oriented database. Key-value pairs.
- **Cloud Bigtable** for high-throughput NoSQL append-only data. No transactions. A typical use case for Bigtable is, for example, sensor data from connected devices.
- **Cloud BigQuery** as a SQL data warehouse to power all your analytics needs
- **Cloud Spanner** transactional database that is horizontally scalable so that you can deal with data larger than a few gigabytes, or if you need multiple databases across different continents.
Selection of storage in a visual way
- ![765bc3651f87f986301649e1efafd128.png](../_resources/765bc3651f87f986301649e1efafd128.png)
## Challenge: Utilizing and tuning on-premise clusters
One of the most common challenges of managing on-premises Hadoop clusters is making sure they're efficiently utilized and tuned properly for all the workloads that their users throw at them.
The problem lies in the static nature of the on-premises cluster capacity.
GCP thinks of clusters as flexible resources.
Turn down clusters automatically with scheduled deletion:
- idle time (minimum 10 minutes)
- timestamp (e.g. maximum x days)
- duration (granularity 1 second)
- shutdown when a job is finished
![b4a66850e6f618c5b36f61b9efd52678.png](../_resources/b4a66850e6f618c5b36f61b9efd52678.png)
And you can use autoscaling as long as shutting down a cluster node doesn't remove any data. So you cannot store the data on the cluster, and that's why we store our data off cluster, on Cloud Storage, Bigtable, or BigQuery. So, **autoscaling works as long as you don't store your data in HDFS**.
![28f3c610f7afc0d7715e6bfa470edcdd.png](../_resources/28f3c610f7afc0d7715e6bfa470edcdd.png)
In addition to autoscaling, another advantage of running Hadoop clusters on GCP is that you can incorporate **preemptible virtual machines** into your cluster architecture. Preemptible VMs are highly affordable, short-lived compute instances that are suitable for batch jobs and fault-tolerant workloads.
### Why fault tolerant?
Because preemptible machines, although they offer the same machine types and options as regular compute instances, last at most 24 hours and can be taken away whenever someone else needs that compute capacity.
So, if your applications are fault tolerant, and Hadoop applications are, then preemptible instances can reduce your **Compute Engine costs** significantly. Preemptible VMs are up to 80% cheaper than regular instances.
The pricing is fixed: you get a discount of up to 80%.
**But just like autoscaling, preemptible VMs only work when your workload can function without the data being stored on the cluster!**
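
To get a feel for the savings, here is a rough cost sketch in plain Python. The hourly price of 1.0 is a placeholder rather than a real Compute Engine price, the 80% discount is the upper bound mentioned above, and `cluster_cost` is a hypothetical helper.

```python
def cluster_cost(regular_workers, preemptible_workers, hours,
                 hourly_price, preemptible_discount=0.80):
    """Estimate compute cost for a cluster mixing regular and preemptible workers."""
    if not 0.0 <= preemptible_discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    regular = regular_workers * hours * hourly_price
    # Preemptible workers pay only the non-discounted fraction of the price.
    preemptible = (preemptible_workers * hours * hourly_price
                   * (1.0 - preemptible_discount))
    return regular + preemptible


# Hypothetical 10-worker cluster running a 4-hour batch job.
all_regular = cluster_cost(10, 0, hours=4, hourly_price=1.0)
mixed = cluster_cost(2, 8, hours=4, hourly_price=1.0)
print(all_regular, round(mixed, 2))
```

Under these assumptions, moving 8 of the 10 workers to preemptible VMs cuts the bill by roughly two thirds, provided the job tolerates workers disappearing.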
# BigQuery
- BigQuery is actually two services in one, a fast SQL Query Engine and fully managed data storage for loading and storing your datasets
![200ebd8d5ccb744efcc58af1fa67988d.png](../_resources/200ebd8d5ccb744efcc58af1fa67988d.png)
The storage service and the query service work
together to internally organize the data to make your queries run
efficiently on terabytes and petabytes.
![56dff0bf25437b4d77b8a2504ff17ae8.png](../_resources/56dff0bf25437b4d77b8a2504ff17ae8.png)
The **storage service** automatically manages the data that you ingest into the platform. Data is contained within a project in what are called datasets, which have zero to many tables or views. The tables are stored as highly compressed columns. Each column of a table is stored highly compressed in Google's internal Colossus file system, which provides durability and global availability.
All the data stored here is only accessible to you and your project team as governed by your access policy.
The storage service can do both bulk data ingestion and streaming data ingestion via the API. For streaming the max row size for a streaming insert is one megabyte and the maximum throughput is 100,000 records per second per project.
BigQuery manages the storage and the metadata for your dataset, automatically replicated, backed up and set up to auto scale for your query needs
The **query service** runs interactive or batch queries that are submitted through the console, the BigQuery web UI, the BQ command-line tool, or via the REST API.
There are BigQuery connectors to other services such as Cloud Dataproc and Cloud Dataflow, which simplify creating those complex workflows between
BigQuery and other GCP data processing services.
The query service can also run query jobs on data contained in other locations.
For example, you can run queries on tables that are backed by a CSV file hosted
somewhere else, in Cloud Storage. Native BigQuery storage is the fastest.
- Serverless service, meaning it's fully managed, so you don't have to worry about how BigQuery stores data on disk or how it autoscales machines for large queries.
- BigQuery is designed to be an easy-to-use data warehouse.
- BigQuery's default pricing model is pay as you go: you pay for the number of bytes of data that your query processes and for any permanent data that's stored inside BigQuery. Query results are cached automatically, so you don't end up paying twice for the same query returning the same data.
- Data in BigQuery is encrypted at rest by default.
- Controlling access to your data can be as granular as specific columns, say any column tagged with PII (Personally Identifiable Information), or specific rows.
- BigQuery works in tandem with Cloud IAM to set these roles and permissions at a project level, which are then inherited down to the BigQuery level.
- BigQuery, as both a data warehouse and an advanced query engine, is foundational for your AI and ML workloads. It's common for data analysts, engineers, and data scientists to use BigQuery to store, transform, and then feed those large datasets directly into their ML models.
- Write ML models directly in BigQuery using SQL.
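
The pay-as-you-go bullet above can be sketched as a small model: cost is proportional to bytes processed, and a repeated query served from cache costs nothing. This is a simplified illustration, not BigQuery's actual billing logic. The price of 5.0 per TiB is a placeholder, and the class name `OnDemandBilling` and the query text are hypothetical.

```python
TIB = 2 ** 40  # bytes in one tebibyte


class OnDemandBilling:
    """Toy model of pay-per-bytes-processed pricing with result caching."""

    def __init__(self, price_per_tib):
        self.price_per_tib = price_per_tib
        self._seen_queries = set()  # stands in for the query result cache
        self.total_cost = 0.0

    def run(self, sql, bytes_processed):
        """Return the cost charged for this query run."""
        if sql in self._seen_queries:
            return 0.0  # served from cache: no bytes processed, no charge
        self._seen_queries.add(sql)
        cost = bytes_processed / TIB * self.price_per_tib
        self.total_cost += cost
        return cost


billing = OnDemandBilling(price_per_tib=5.0)  # placeholder price, an assumption
first = billing.run("SELECT name FROM demo.users", bytes_processed=2 * TIB)
second = billing.run("SELECT name FROM demo.users", bytes_processed=2 * TIB)
print(first, second, billing.total_cost)
```

The real cache has more conditions (for example, the underlying tables must be unchanged), which this sketch leaves out.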
![5f01121f49a2b85234a4dde5f3c243fa.png](../_resources/5f01121f49a2b85234a4dde5f3c243fa.png)
It stores all the incoming data from the left and allows you to do your analysis and your model-building.
![626a552aa3d1eb5331b7e1a0ed5161af.png](../_resources/626a552aa3d1eb5331b7e1a0ed5161af.png)
# Cloud Dataprep
After your transformation recipe is complete, when you run a Cloud Dataprep job, it farms out the work to Cloud Dataflow, which handles the actual processing of your data pipeline at scale. The main advantage of Cloud Dataprep is for teams that want to use a UI for data exploration and want to spend minimal time coding to build their pipelines. Lastly, with Dataprep you can schedule your pipeline to run at regular preset intervals. But if you prefer to do all of your SQL and exploration work inside BigQuery, you can also use SQL to set up scheduled queries by using the @run_time parameter, or the query scheduler in the BigQuery UI.
# Data security
So your insights are only shared with those people who should actually
have access to see your data. As you see here in this table, BigQuery inherits data security roles that you and your teams set up in Cloud IAM.
![92c4ab63a9cc743176d498e462af41af.png](../_resources/92c4ab63a9cc743176d498e462af41af.png)
Keep in mind that default access to datasets can be overridden on a per-dataset basis. Beyond Cloud IAM, you can also set up very granular controls over your columns and rows of data in BigQuery using the new Data Catalog service
and some of the advanced features in BigQuery, such as authorized views.
- Dataset users should have the minimum permission needed for their role.
- use separate projects or datasets for different environments (Dev, QA, PRD)
- Audit roles periodically
Define a data access policy for your organization; it should specify how, when, and why data should be shared, and with whom.
# ML using SQL with BigQuery
![9106b716ae276078eea921e3e03078ee.png](../_resources/9106b716ae276078eea921e3e03078ee.png)
Look at the type of label or special column of data that you're predicting.
Generally, if it's a numeric datatype ==> **forecasting**
String value ==> **classification**
This row is either in this class or this other class; with more than two classes ==> **multi-class classification**
**ML benchmark** is the performance threshold that you're willing to accept from your model before you even allow it to be near your production data. It's critical that you set your benchmark before you train your model. So you can really be truly objective in your decision making to use the model or not.
![a435b4325f8a97c83018933cc44dff54.png](../_resources/a435b4325f8a97c83018933cc44dff54.png)
The end-to-end BQML process
![d80f9b4acc1e1833c3fcb177f3a50dd8.png](../_resources/d80f9b4acc1e1833c3fcb177f3a50dd8.png)
1. Create model
![1166787a12cf4602040c27e3863abb43.png](../_resources/1166787a12cf4602040c27e3863abb43.png)
Inspect the model Weights
![7ad6809dc9b770b2bb4967f5c8795b2f.png](../_resources/7ad6809dc9b770b2bb4967f5c8795b2f.png)
Evaluate the model
![0dc381da545d4b7f9885e08555d4445a.png](../_resources/0dc381da545d4b7f9885e08555d4445a.png)
Make batch predictions with ML.PREDICT
![db70a1d3769f8483876fb44e83fbc35c.png](../_resources/db70a1d3769f8483876fb44e83fbc35c.png)
BQML Cheatsheet
![d2d5237d7c278a8fe8ffbe3175b6a7bc.png](../_resources/d2d5237d7c278a8fe8ffbe3175b6a7bc.png)
![60b4e3ed0ebd1c3fb65b18d812f5f8fb.png](../_resources/60b4e3ed0ebd1c3fb65b18d812f5f8fb.png)

145
Cloud/WK2 Data Lakes.md Normal file
View File

@ -0,0 +1,145 @@
---
title: WK2 Data Lakes
updated: 2021-09-19 15:15:03Z
created: 2021-09-19 10:18:20Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
# Introduction to Data Lakes
![4bd6a4b0fe00fd4735c0efd1b01892a1.png](../_resources/4bd6a4b0fe00fd4735c0efd1b01892a1.png)
**Data sources**: originating system or systems that are the source of all of your data
**Data sinks**: build those reliable ways of retrieving and storing that data
The first line of defense in an enterprise data environment is your **data lake**, which has to handle a variety of formats, volumes, and velocities.
**Data pipelines**: doing the transformations and processing
**Orchestration layer**: coordinates efforts between many of the different components at a regular or an event-driven cadence. (*Apache Airflow*)
It's important to first understand what you want to do, and then find which of the solutions best meets your needs.
# Data Storage and ETL options on GCP
![26d5943bae46c8a9bc9a4210eb56ed45.png](../_resources/26d5943bae46c8a9bc9a4210eb56ed45.png)
- Cloud SQL and Cloud Spanner for **relational data**
- Cloud Firestore and Cloud Bigtable for **NoSQL data**.
The path the data takes depends on:
- where the data is coming from
- Volume
- Where it has to go
- How much processing is needed to arrive in the sink
The method that you use to **load the data** into the cloud
depends on how much transformation is needed from that raw data
Cases:
- readily ingested (**EL** => Extract and Load, e.g. Avro format). Think also about federated queries.
- **ELT** => Extract, Load, Transform. Data is not yet in the right form for the sink, and the volume is not big. E.g. use SQL to do the transformation: select from the source and insert into the destination.
- **ETL** => Extract, Transform, Load. The transformation is essential or reduces the volume significantly before importing into the cloud.
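
The ELT case above (load first, then transform with SQL) can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. The table names, columns, and sample rows are hypothetical; in BigQuery the same idea would be a SQL statement run against the loaded table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in a staging table unchanged (all text).
conn.execute("CREATE TABLE staging_orders (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [("1", " 10.50", "nl"), ("2", "7.25 ", "NL"), ("3", "n/a", "de")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_rows)

# Transform: done inside the database with SQL,
# selecting from the source and inserting into the destination.
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
conn.execute("""
    INSERT INTO orders (order_id, amount, country)
    SELECT CAST(order_id AS INTEGER), CAST(TRIM(amount) AS REAL), UPPER(country)
    FROM staging_orders
    WHERE TRIM(amount) GLOB '[0-9]*'   -- skip rows whose amount is not numeric
""")

clean_rows = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(clean_rows)
```

In an ETL setup the cleaning step would instead run in a pipeline before anything is written to the destination.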
# Building a Data Lake using Cloud Storage
Google Cloud Storage:
- strongly persistent
- shared globally
- encrypted
- controlled and private if needed
- moderate latency and high throughput
- relatively inexpensive
- Object store: stores binary objects regardless of what data is contained in the objects
- to some extent it is file system compatible (objects can be copied in and out as if they were files). Cloud Storage uses the bucket name and the object name to simulate a file system
Use cases:
- archive data
- save the state of an application when an instance shuts down
![cbffc74d82c2acc542253f2a050fa649.png](../_resources/cbffc74d82c2acc542253f2a050fa649.png)
The two main entities in cloud storage are **buckets** and
**objects**
- buckets are containers which hold objects
- identified in a single globally unique namespace (no one else can use that name until the bucket is deleted and the name is released)
- associated with a particular region or multiple regions
- For a single region bucket the objects are replicated across zones within that one region (low-latency)
- multiple requesters could be retrieving the objects at the same time from different replicas (high throughput)
- objects exist inside of those buckets and not apart from them.
- When an object is stored, cloud storage replicates that object, it'll then monitor the replicas and if one of them is lost or corrupted it'll replace it automatically with a fresh copy. (high durability)
- stored with metadata. Used for access control, compression, encryption and lifecycle management of those objects and buckets.
![d9fd10fbec213e8ae877a457a76bfe82.png](../_resources/d9fd10fbec213e8ae877a457a76bfe82.png)
1. Choose the location of the bucket. The location is set when the bucket is created and can never be changed.
2. Decide whether the location should be a dual-region bucket.
If you select a single region, the data is replicated to multiple zones within that region.
3. Determine how often you need to access or change your data.
**[Storage classes](https://cloud.google.com/storage/docs/storage-classes)**: archival storage, backups, or disaster recovery.
Cloud storage uses the bucket name and the object name to simulate a file system
![d74445c2bc5cf347c94f572c4787c354.png](../_resources/d74445c2bc5cf347c94f572c4787c354.png)
For example:
bucket name is declass
object name is de/modules/O2/script.sh
the forward slashes are just characters in the name
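
The "simulated file system" can be sketched in a few lines: object names are flat strings, and a directory-style listing is just prefix matching with a delimiter. `list_prefix` is a hypothetical helper written for illustration, not a Cloud Storage client call, and every object name except `de/modules/O2/script.sh` is made up.

```python
def list_prefix(object_names, prefix="", delimiter="/"):
    """Return (objects, sub_prefixes) directly under `prefix`, like a directory listing."""
    objects, sub_prefixes = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # There is another delimiter further on: report a "subdirectory".
            sub_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(name)
    return sorted(objects), sorted(sub_prefixes)


# Flat object names in one bucket; the slashes are ordinary characters.
names = [
    "de/modules/O2/script.sh",
    "de/modules/O2/README.md",
    "de/labs/intro.txt",
    "notes.txt",
]
top_level = list_prefix(names)
under_de = list_prefix(names, "de/")
under_o2 = list_prefix(names, "de/modules/O2/")
print(top_level, under_de, under_o2)
```

No directory is ever created or stored; the hierarchy exists only in how the names are interpreted.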
A best practice is to avoid the use of sensitive information as part of bucket names, because bucket names are in a global namespace.
![4e05321688bdc9bd53da7809ff1e4961.png](../_resources/4e05321688bdc9bd53da7809ff1e4961.png)
# Securing Cloud Storage
![5b5bd4d599ad5e479af873c9f714ee7b.png](../_resources/5b5bd4d599ad5e479af873c9f714ee7b.png)
1. **IAM** is set at the bucket level.
- provides project roles and bucket roles:
- bucket reader
- bucket writer
- bucket owner.
The ability to create and delete buckets and to set IAM policy is a **project-level role**.
The ability to create or change access control lists is an **IAM bucket role**. **Custom roles** are also available.
2. Access control lists (**ACL**)
- applied at the bucket level or to individual objects.
So it provides more fine-grained access control.
Access lists are currently enabled by default
All data in Google Cloud is **encrypted at rest and in transit** and there is no way to turn off the encryption.
![ec27c52f379456fd1ccdcd586b69c7de.png](../_resources/ec27c52f379456fd1ccdcd586b69c7de.png)
Which data encryption option you use generally depend on your business, legal and regulatory requirements.
Two levels of encryption: data is encrypted using a data encryption key (DEK), and the data encryption key itself is then encrypted using a key encryption key, or **KEK**. KEKs are automatically rotated on a schedule, and the current KEK is stored in Cloud KMS, the Key Management Service.
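
The two-level scheme can be sketched as follows. This only shows the structure of the key hierarchy: the cipher is a toy XOR keystream built from SHA-256 and is NOT secure, and it stands in for whatever real cipher the service uses. All function names are hypothetical.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher (SHA-256 in counter mode, XORed with the data). NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def envelope_encrypt(plaintext: bytes, kek: bytes):
    dek = secrets.token_bytes(32)                # fresh data encryption key
    ciphertext = _keystream_xor(dek, plaintext)  # level 1: data under the DEK
    wrapped_dek = _keystream_xor(kek, dek)       # level 2: DEK under the KEK
    return ciphertext, wrapped_dek


def envelope_decrypt(ciphertext: bytes, wrapped_dek: bytes, kek: bytes) -> bytes:
    dek = _keystream_xor(kek, wrapped_dek)       # unwrap the DEK first
    return _keystream_xor(dek, ciphertext)


def rewrap_dek(wrapped_dek: bytes, old_kek: bytes, new_kek: bytes) -> bytes:
    """KEK rotation: only the small DEK is re-wrapped, the data is untouched."""
    return _keystream_xor(new_kek, _keystream_xor(old_kek, wrapped_dek))


message = b"sensor reading 42"
kek = secrets.token_bytes(32)
ciphertext, wrapped_dek = envelope_encrypt(message, kek)

new_kek = secrets.token_bytes(32)
rewrapped = rewrap_dek(wrapped_dek, kek, new_kek)
print(envelope_decrypt(ciphertext, rewrapped, new_kek))
```

The design point is that rotating a KEK never requires re-encrypting the stored data, only the wrapped DEK.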
![1798e9a44636860cd20db3b66ede4949.png](../_resources/1798e9a44636860cd20db3b66ede4949.png)
The fourth encryption option is client-side encryption. Client-side encryption simply means that you've encrypted the data before it's uploaded and then you have to decrypt the data yourself before it's used. Google Cloud storage still performs GMEK, CMEK, or CSEK encryption on the object.
**Data locking** is different from encryption. Where encryption prevents somebody from understanding the data, locking prevents them from modifying the data.
# Storing All Sorts of Data Types
Cloud Storage is not the right fit for transactional data or for analytics on structured data.
![0c0746feb4d1a4cb7802f32da7da5a43.png](../_resources/0c0746feb4d1a4cb7802f32da7da5a43.png)
![ca233fcb9a340843cc102e395c3b0a38.png](../_resources/ca233fcb9a340843cc102e395c3b0a38.png)
Online Transaction Processing or **OLTP**
Online Analytical Processing or **OLAP**
![396364a8c39247681880d65a1f9b9a8b.png](../_resources/396364a8c39247681880d65a1f9b9a8b.png)
# Storing Relational Data in the Cloud
**Cloud SQL**:
- managed service for third-party RDBMSs (MySQL, SQL Server, PostgreSQL)
- cost effective
- default choice for OLTP workloads
- fully managed
![da9fb26de9169d4567f32995e6380343.png](../_resources/da9fb26de9169d4567f32995e6380343.png)
**Cloud Spanner**:
- globally distributed database. Updates from applications running in different geographic regions.
- database is too big to fit in a single Cloud SQL instance
**Cloud Bigtable**
- consider Bigtable for really high-throughput inserts, like more than a million rows per second, or for low latency on the order of milliseconds
**Difference between fully managed and serverless:**
By fully managed, we mean that the service runs on hardware that you can control. Dataproc is fully managed.
A serverless product is just like an API that you're calling. BigQuery and Cloud Storage are serverless.
![1863886204ff15acbf0544007ff3c91a.png](../_resources/1863886204ff15acbf0544007ff3c91a.png)

View File

@ -0,0 +1,157 @@
---
title: Wk2 Big Data and Machine Learning Fundamentals
updated: 2021-09-12 20:50:42Z
created: 2021-09-11 16:35:51Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
![9948507e5860dd40b34a9f10e6b370c2.png](../_resources/9948507e5860dd40b34a9f10e6b370c2.png)
# Message-oriented architectures with Pub/Sub
## Distributed Messages
- Streaming data from various devices
- issues: bad data, delayed data, no data
- Distributing event notifications (ex: new user sign up)
- other services to subscribe to new messages that we're publishing out
- Scalable to handle volumes
- needs to handle an arbitrarily high amount of data so we don't lose any messages coming in.
- Reliable (no duplicates)
- We need all the messages and also a way to remove any duplicates if found
**Pub/Sub is a distributed messaging** service that can receive messages from a variety of different streams, **upstream data** systems like gaming events, IoT devices, applications streams, and more.
Pub/Sub will scale to meet that demand.
- Ensures at-least-once delivery and passes them to subscribing applications
- No provisioning, auto-everything
- Open API's
- Global by default
- End-to-end encryption
![51c2fc2521e7435aba198a5c23584999.png](../_resources/51c2fc2521e7435aba198a5c23584999.png)
Upstream data starts on the left and comes in from devices all around the globe. It is then **ingested** into Cloud Pub/Sub as the first point of contact with our system. Cloud Pub/Sub reads, stores, and then publishes messages out to any subscribers of this particular topic.
Cloud Dataflow is a subscriber to this Pub/Sub topic in particular, and it will ingest and transform those messages in an elastic streaming pipeline.
If you're doing analytics one common data sink is Google BigQuery.
## Architecture of Pub/Sub (like Kafka)
A central piece of Pub/Sub is the **topic**.
There can be zero, one, or many publishers.
There can be zero, one, or many subscribers relating to any given Pub/Sub topic.
Completely decoupled from each other.
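
The decoupling can be illustrated with a tiny in-memory model. This is a conceptual sketch, not the Pub/Sub client library: the `Topic` class, its methods, and the topic and subscription names are hypothetical, and acknowledgements and at-least-once redelivery are not modeled.

```python
from collections import deque


class Topic:
    """Minimal in-memory topic: publishers and subscribers only share the topic."""

    def __init__(self, name):
        self.name = name
        self._subscriptions = {}  # subscription name -> queue of pending messages

    def subscribe(self, subscription_name):
        self._subscriptions.setdefault(subscription_name, deque())

    def publish(self, message):
        # The publisher does not know who is listening;
        # every existing subscription gets its own copy.
        for queue in self._subscriptions.values():
            queue.append(message)
        return len(self._subscriptions)

    def pull(self, subscription_name):
        queue = self._subscriptions[subscription_name]
        return queue.popleft() if queue else None


topic = Topic("new-user-signups")
dropped = topic.publish("user-0")      # zero subscribers: nobody receives it
topic.subscribe("email-service")
topic.subscribe("analytics")
delivered = topic.publish("user-1")    # both subscriptions receive a copy
print(dropped, delivered, topic.pull("email-service"), topic.pull("analytics"))
```

Adding a third subscriber later would not require any change to the publishing code.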
## Designing streaming pipelines with Apache Beam
Questions to ask about the design of the pipeline in code, and about the implementation and serving of that pipeline at scale in production:
- Is the code compatible with both batch and streaming data? **YES**
- Does the pipeline code SDK support the transformations I need to do? **Likely**
- Does it have the ability to handle late data coming into the pipeline?
- Are there any existing templates or solutions that we can leverage to quickly get us started? **Choose from templates**
## What is Apache Beam?
Apache Beam is a **portable data processing programming model**.
- It's extensible (write and share new SDKs, IO connectors and transformation libraries) and open source
- can be run in a **highly distributed** fashion
- It's **unified**: use a single programming model for both batch and streaming use cases.
- **Portable**: Execute pipelines on multiple execution environments. No vendor lock-in.
- You can browse and write your own connectors
- Build transformation libraries too if needed.
- Apache Beam pipelines are written in Java, Python or Go.
- The SDK provides a host of libraries for transformations and existing data connectors to sources and sinks.
![5151c1c2b55de604999b2b6f38822b08.png](../_resources/5151c1c2b55de604999b2b6f38822b08.png)
- Apache Beam creates a model representation of your code, which is portable across many runners. Runners pass off your model to an execution environment, which you could run in many different possible engines.
Cloud Dataflow is one of the popular choices for running Apache Beam as an engine.
Example of a pipeline
![a6d9be269ce6bc83c9f5d824ef667013.png](../_resources/a6d9be269ce6bc83c9f5d824ef667013.png)
Transformations can be done in parallel, which is how you get that truly elastic pipeline
You can get input from many different and even multiple sources concurrently,
then you can write output to many different sinks, and the pipeline code remains the same.
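
The idea that the pipeline code stays the same for different sources can be shown with a plain-Python analogy. This is not the Apache Beam SDK: the transforms are ordinary generator functions, `run_pipeline` is a hypothetical helper, and the sensor data is made up. The same list of transforms runs over a bounded list (batch) and over a generator that yields elements as they arrive (streaming-style).

```python
def parse(lines):
    """Turn 'sensor,value' text into (sensor, float) pairs."""
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)


def keep_valid(readings):
    """Drop readings below absolute zero (in Celsius)."""
    return (r for r in readings if r[1] >= -273.15)


def to_fahrenheit(readings):
    return ((sensor, celsius * 9 / 5 + 32) for sensor, celsius in readings)


def run_pipeline(source, transforms):
    """Chain the transforms lazily over any iterable source."""
    stream = iter(source)
    for transform in transforms:
        stream = transform(stream)
    return stream


TRANSFORMS = [parse, keep_valid, to_fahrenheit]

# Bounded (batch) source: fully known up front.
batch_source = ["a,20.0", "b,-300.0", "c,100.0"]
batch_result = list(run_pipeline(batch_source, TRANSFORMS))


# Streaming-style source: elements are produced one at a time.
def sensor_feed():
    yield "a,20.0"
    yield "b,-300.0"
    yield "c,100.0"


stream_result = list(run_pipeline(sensor_feed(), TRANSFORMS))
print(batch_result, stream_result)
```

A real runner such as Dataflow additionally distributes the work across machines and handles windowing and late data, which this analogy does not attempt.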
## Implementing streaming pipelines on Dataflow
![1238eb4c4d458a3679c68dff091d8e0c.png](../_resources/1238eb4c4d458a3679c68dff091d8e0c.png)
![5ea764eaeeb247f7b43501c4d6e11653.png](../_resources/5ea764eaeeb247f7b43501c4d6e11653.png)
Many Hadoop workloads can be done easily and more maintainably with Dataflow. Plus, Dataflow is serverless and designed to be NoOps.
What do we mean by **serverless**? It means that Google will manage
all the infrastructure tasks for you, like resource provisioning and performance tuning, as well as ensuring that your pipeline is reliable.
![2fa4ce2967893077a569d372aed8a3ff.png](../_resources/2fa4ce2967893077a569d372aed8a3ff.png)
[Source for Google DataFlow templates:](https://github.com/GoogleCloudPlatform/Dataflowtemplates)
Recap:
![87df2df0f990da8682065eeb7f8381a9.png](../_resources/87df2df0f990da8682065eeb7f8381a9.png)
QPS: Queries Per Second
# Visualizing insights with Data Studio
The first thing you need to do is tell Data Studio the Data Source
A Data Studio report can have any number of data sources.
The Data Source picker shows all the data sources that you have access to.
Other people who can view the report can potentially see
all the data in
that data source if you share the data source with them.
Warning: anyone who can edit the report can also use
all the fields from any added data sources to create new charts with them.
# Creating charts with Data Studio
- Dimension chips are green.
**Dimensions** are things like categories or buckets of information. Dimension values could be things like names, descriptions, or other characteristics of a category.
- **Metric** chips are blue.
Metrics measure dimension values. Metrics represent measurements or
aggregations such as a sum, x plus y, a count, how many of x, or even a ratio, x over y. A calculated field can also be a dimension
Data Studio uses Google Drive for sharing and storing files.
Share your dashboards and collaborate on them with your team. A Google login is required to edit a report; no login is needed for viewing.
Keep in mind that when you share a report that is connected to an underlying data source like a BigQuery dataset, Data Studio does not automatically grant
viewers permissions on that data source if they don't already
have them. This is for data security reasons.
After you share your report, users can interact with filters and sort, and then you can collect feedback on the usage of your report through Data Studio's native integration with Google Analytics.
# Machine Learning on Unstructured Datasets
Comparing approaches to ML
- **Use pre-built AI**: Dialogflow or Auto ML (10-100 images per label)
- provided as services
- Cloud Translation API
- Cloud Natural Language API
- Cloud Speech-to-Text
- Cloud Video Intelligence API (recognizing content in motion and action videos)
- Cloud Vision API (recognizing content in still images)
- Dialogflow Enterprise Edition (to build chatbots)
- **Add Custom Models**: only when you have a lot of data,
like 100,000 plus to millions of examples.
- **Create new Models**: TensorFlow, Cloud AI, Cloud TPU
**Dialogflow** is a platform for building natural and rich conversational experiences. It achieves a conversational user experience by handling the natural language understanding for you.
It has built-in **entity recognition** which enables your agent to identify entities and label by types such as person, organization, location, events, products, and media. Also **sentiment analysis** to give an understanding of the overall sentiment expressed in a block of text. Even **content classification**, allows you to classify documents in over 700 predefined categories like common greetings and conversational styles. It has **multi-language support** so you can analyze text in multiple languages.
Dialogflow works by putting all of these ML capabilities together which you can then optimize for your own training data and use case.
Dialogflow then creates unique algorithms for each specific conversational agent, which continuously learns and is trained and retrained as more and more users engage with your agent.
![95d94272c1cec38e84f0dd6d9a5c9e48.png](../_resources/95d94272c1cec38e84f0dd6d9a5c9e48.png)
### Dialogflow benefits for users:
- Build faster:
- start training with only a few examples
- 40+ pre-built agents
- Engage more efficiently
- built-in natural language understanding
- multiple options to connect with backend systems
- Training and analytics
- Maximize reach
- Build once, deploy everywhere
- 20+ languages supported
- 14 single-click platform integrations and 7 SDKs
## Customizing pre-built models with AutoML
![c95aa196527b6a81b5ad0e3e6eb4d20d.png](../_resources/c95aa196527b6a81b5ad0e3e6eb4d20d.png)
So **precision** is the number of photos correctly classified as a particular label divided by the total number of photos classified with that label.
**Recall** is the number of photos correctly classified as a particular label divided
by the total number of photos that actually have that label.
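
Both metrics share the same numerator (photos correctly classified as the label) and differ only in the denominator. A minimal sketch with made-up labels; `precision_recall` is a hypothetical helper, not an AutoML API:

```python
def precision_recall(true_labels, predicted_labels, label):
    """Compute precision and recall for one label from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t == label and p == label)
    predicted_as_label = sum(1 for _, p in pairs if p == label)
    actually_label = sum(1 for t, _ in pairs if t == label)
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    return precision, recall


# Hypothetical evaluation set of six photos.
true_labels = ["cat", "cat", "cat", "cat", "dog", "dog"]
predicted_labels = ["cat", "cat", "dog", "dog", "cat", "dog"]

cat_precision, cat_recall = precision_recall(true_labels, predicted_labels, "cat")
print(round(cat_precision, 3), cat_recall)
```

Here three photos were predicted as "cat" and two of those were right (precision 2/3), while four photos really are cats and two of them were found (recall 1/2).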
![e9bdd75ea38db93bd06929d6d2371de2.png](../_resources/e9bdd75ea38db93bd06929d6d2371de2.png)
![317aa12b279ad4a9b0e289d1c654c4a3.png](../_resources/317aa12b279ad4a9b0e289d1c654c4a3.png)
![f7e03cc71bfb39115e9ccb74f9306f2e.png](../_resources/f7e03cc71bfb39115e9ccb74f9306f2e.png)

View File

@ -0,0 +1,178 @@
---
title: postgresql
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Postgresql
![postgresql](https://d1q6f0aelx0por.cloudfront.net/product-logos/a28dcd12-094d-4248-bfcc-f6fb954c7ab8-postgres.png?)
psql DBNAME USERNAME
\d \\ list all relations
\d tablename \\ ddl
```bash
sudo -u postgres createuser <username>
sudo -u postgres createdb <databasename>
```
```psql
psql# ALTER USER <username> WITH ENCRYPTED PASSWORD '<password>';
psql# GRANT ALL PRIVILEGES ON DATABASE <databasename> to <username>;
```
## create csv file
```sql
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
```
### Jupyter:
> first install:
> ipython-sql and psycopg2
```jupyter
* %load_ext sql
* %sql postgresql://john:****@localhost/testdb
* %%sql
- select * from aap;
```
## From python program
```python
import sys

import psycopg2

query = "select * from aap"

try:
    conn = psycopg2.connect("postgres://john:qw12aap@localhost:5432/testdb")
except psycopg2.OperationalError as e:
    # Format the message before printing; print() itself returns None.
    print("Unable to connect!\n{0}".format(e))
    sys.exit(1)
else:
    # Only reached when the connection succeeded.
    print("connected")

cur = conn.cursor()
cur.execute(query)
for x in cur.fetchall():
    print(x)

cur.close()
conn.close()
```
[Source PostgreSQL](https://www.postgresql.org/files/documentation/pdf/11/postgresql-11-A4.pdf)
## Connection from Apache Spark example:
```scala
val driver = "org.postgresql.Driver"
Class.forName(driver)
val df = spark.sqlContext
.read
.format("jdbc")
.options(Map("url"->"jdbc:postgresql://172.17.0.2:5432/postgres", "user"->"postgres","password"->"qw12aap","driver"->driver,"dbtable"->"company"))
.load()
```
## Install Docker
<https://hackernoon.com/dont-install-postgres-docker-pull-postgres-bee20e200198>
<https://docs.databricks.com/spark/latest/data-sources/sql-databases.html>
## create persistent storage location
```bash
mkdir -p $HOME/docker/volumes/postgres
```
## launch docker container
```bash
docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres
```
## connect to running container
```bash
docker exec -it pg-docker /bin/bash
```
## inside container
```bash
psql -h localhost -U postgres -d postgres
```
[Postgresql tutorial](https://www.tutorialspoint.com/postgresql)

View File

@ -0,0 +1,110 @@
---
title: Analysis
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Analysis
### Analysis is performed by an analyzer
- tokenizer: breaks a sentence into tokens and records the position of each token; optionally language-specific
- token filter: modifies or removes tokens, for example filtering out stopwords
- character filter
Reader -> character filter -> tokenizer -> token filter -> tokens
### Where is analysis used?
- query
- mapping parameter
- index setting
The analyzer is set in the mapping part
Example
### Analyzers
1. Standard
- max_token_length (default 255)
- stopwords (defaults \_none_)
- stopwords_path (path to file containing stopwords)
- keep numeric values
2. simple
- lowercase
- remove special characters (ie dog's -> [dog, s])
- remove numeric values
3. whitespace
- breaks text into terms whenever it encounters a whitespace character
- no lowercase transformation
- takes terms as they are
- keeps special characters
4. keyword
- no configuration
- takes all text as one keyword
5. stop
- stopwords, stopwords_path
6. pattern
- stopwords, stopwords_path, pattern, lowercase
- regular expression
7. custom
- tokenizer, char_filter, filter
### Example with standard analyzer
```json
PUT /test_analyzer
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
},
"mappings": {
"properties": {
"spreker_1": {
"type": "text",
"analyzer" : "my_analyzer" // or another analyzer; analyzers are set per field, and only text fields accept one
}
}
}
}
```
```json
GET /test_analyzer/_analyze
{
"analyzer": "my_analyzer",
"field": "spreker_1",
"text": ["What is the this builders"]
}
```
### without mapping; pattern analyzer
```json
PUT /test_analyzer
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_words": {
"type" : "pattern",
"pattern": "\\W|_|[a-c]", // separator: any non-word character, _ or one of the characters a, b, c
"lowercase": true
}
},
"analyzer": {
"rebuild_pattern": {
"tokenizer" : "split_on_words",
"filter": ["lowercase"]
}
}
}
}
}
```
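To see what this tokenizer pattern does to a piece of text, here is a rough Python sketch that splits on the same regular expression and then lowercases the tokens. It is only an approximation under stated assumptions: the function name and sample text are illustrative, empty tokens are dropped, and Python's `re` module differs from the Java regex engine in details such as Unicode handling.

```python
import re

# Same separator pattern as the "split_on_words" tokenizer above.
SEPARATOR = re.compile(r"\W|_|[a-c]")


def rebuild_pattern_analyze(text):
    """Approximate the analyzer: pattern tokenizer first, lowercase filter second."""
    tokens = [token for token in SEPARATOR.split(text) if token]  # drop empty tokens
    return [token.lower() for token in tokens]


print(rebuild_pattern_analyze("Whsat is_the dd@this 1 builder's"))
# ['whs', 't', 'is', 'the', 'dd', 'this', '1', 'uilder', 's']
```

Note that the tokenizer runs before the lowercase filter, so only lowercase a, b and c act as separators.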

View File

@ -0,0 +1,35 @@
---
title: Delete_DSL
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Delete DSL
### delete document
```json
DELETE /<index>/_doc/<id>
```
### delete by query
```json
POST /<index>/_delete_by_query
{
"query": { <query clause> }
}
```

View File

@ -0,0 +1,22 @@
---
title: ES_Docker
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Setup ES with Docker
## get ES Docker container and start
```bash
docker pull elasticsearch:7.6.2 # the "latest" tag is not supported by ES
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.6.2 # start a single node
```
## Test if running correctly
```bash
curl -I -XHEAD localhost:9200
```
[Link to ES with Kibana](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/docker.html#_pulling_the_image)

View File

@ -0,0 +1,202 @@
---
title: Index
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Indexes
### Create index without mapping
```json
PUT <index>
```
### Delete index
```json
DELETE <index>
```
### Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
[Mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html)
```json
GET /<index>/_mapping
GET /<index>/_mapping/field/<fieldname>
```
### create a new index with mapping (example)
- The number of shards cannot be changed after the index is created
- re-indexing is possible (worst case)
- replicas can be added later
### Types of fields
- object (may contain inner objects; JSON docs are hierarchical in nature)
- nested
```json
PUT /items
{
"settings": {
"index": {
"number_of_shards": <int>,
"number_of_replicas": <int>
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"production_date": {
"type": "date"
},
"location": {
"type": "geo_point"
},
"max_spots": {
"type": "integer"
},
"description": {
"type": "text"
}
}
}
}
```
[Field datatypes](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html)
## Elasticsearch Query DSL
```http
PUT /catalog
{
"settings": {
"index": {
"number_of_shards": <int>,
"number_of_replicas": <int>
}
},
"mappings": {
"properties": {
"speaker": {"type": "keyword"},
"play_name": {"type": "keyword"},
"line_id": {"type": "integer"},
"description": {"type": "text"},
"speech_number": {"type": "integer"}
}
}
}
```
### Adding a type mapping in an existing index
merged into the existing mappings of the _doc type
```json
PUT /<index>/_mapping
{
"properties": {
"name": {
"type": "text"
}
}
}
```
### Create NESTED mapping, insert data and query
```json
DELETE /developer
PUT /developer
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"skills": {
"type": "nested",
"properties": {
"language": {
"type": "keyword"
},
"level": {
"type": "keyword"
}
}
}
}
}
}
POST /developer/_doc/101
{
"name": "john Doe",
"skills": [
{
"language": "ruby",
"level": "expert"
},
{
"language": "javascript",
"level": "beginner"
}
]
}
GET /developer/_search
{
"query": {
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"match": {
"skills.language": "ruby"
}
},
{
"match": {
"skills.level": "expert"
}
}
]
}
}
}
}
}
```
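The reason the `nested` type matters here: with a plain `object` mapping, an array of objects is flattened into one multi-valued field per sub-key, so the link between a language and its level is lost. The sketch below imitates that behaviour in plain Python. The helper names are hypothetical, and the matching logic is a simplification of what Elasticsearch does.

```python
doc = {
    "name": "john Doe",
    "skills": [
        {"language": "ruby", "level": "expert"},
        {"language": "javascript", "level": "beginner"},
    ],
}


def flatten_object_field(doc, field):
    """Imitate an `object` field: one multi-valued entry per sub-key."""
    flat = {}
    for item in doc[field]:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", set()).add(value)
    return flat


def matches_object(doc, field, criteria):
    """All criteria must match somewhere in the flattened values."""
    flat = flatten_object_field(doc, field)
    return all(value in flat.get(f"{field}.{key}", set()) for key, value in criteria.items())


def matches_nested(doc, field, criteria):
    """All criteria must match within the same inner object."""
    return any(all(item.get(key) == value for key, value in criteria.items()) for item in doc[field])


wrong_pair = {"language": "javascript", "level": "expert"}
print(matches_object(doc, "skills", wrong_pair))  # True: the association is lost
print(matches_nested(doc, "skills", wrong_pair))  # False: no single skill matches both
```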
### Create OBJECT mapping, insert data and query

View File

@ -0,0 +1,15 @@
---
title: Python
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# ElasticSearch with Python
Libraries:
- pyelasticsearch (DSL Queries)
- elasticutils (on top of the former)
- django-haystack

View File

@ -0,0 +1,372 @@
---
title: Query_DSL
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Elasticsearch Query DSL
### Queries can be classified into three types
1. Filtering by exact values
2. Searching on analyzed text
3. A combination of the two
Every __document field__ can be classified:
- either as an exact value
- analyzed text (also called full text)
## Exact values
are fields like user_id, date, email_addresses
Querying documents can be done by specifying __filters over exact values__. Whether the document gets returned is a __binary__ yes or no
---
## Analyzed search
__Analyzed text__ is text data like product_description or email_body
- Querying documents by searching analyzed text returns results based on __relevance__ (score)
- Highly complex operation and involves different __analyzer packages__ depending on the type of text data
- The default analyzer package is the _standard analyzer_, which splits text on word boundaries, lowercases it and removes punctuation
- less performant than just filtering by exact values
## Expensive queries
1. Linear scans
- script queries
2. High up-front cost
- fuzzy queries
- regexp queries
- prefix queries without index_prefixes
- wildcard queries
- range queries on text and keyword fields
3. Joining queries
4. Queries on deprecated geo shapes
5. High per-document cost
- script score queries
- percolate queries
The execution of such queries can be prevented by setting the value of the `search.allow_expensive_queries` setting to `false` (defaults to `true`).
Queries behave differently depending on the context: **query context** or **filter context**
| Queries | filters |
| --------------- | -------- |
| Fuzzy, scoring | Boolean |
| Slower | Faster |
| Not cacheable   | Cacheable |
## Scoring queries
By default, Elasticsearch sorts matching search results by **relevance score**, which measures how well each document matches a query. Whether a score is calculated depends on whether the query clause is executed in **query** or **filter** context
## => Query context
“*How well does this document match this query clause?*” The relevance is stored in the **_score** metadata field
Query context is in effect whenever query clause is passed to the query parameter.
## => Filter context
“*Does this document match this query clause?*” The answer is simply true or false. No score is calculated, so the score of all documents is 0.
Mostly used for filtering structured data, e.g.
- Does this timestamp fall in range....
- is the status field set to "text value"
Frequently used filters will be cached
Filter context in effect when filter clause is used
- such as filter or must_not parameters in bool query
- filter parameter in constant_score query
- filter aggregation
Example
```json
GET /_search
{
"query": { // query context
"bool": { // query context, together with the match clauses: scores how well documents match
"must": [
{ "match": { "title": "Search" }},
{ "match": { "content": "Elasticsearch" }}
],
"filter": [ // filter context
{ "term": { "status": "published" }},
{ "range": { "publish_date": { "gte": "2015-01-01" }}}
]
}
}
}
```
---
### Difference term vs match
- match : applies to the search text the same analyzer that was used when the data was stored
- term : does not apply any analyzer, so will look for exactly what is stored in the inverted index
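The difference can be sketched with a toy inverted index in Python. This is an illustration under assumptions, not how Elasticsearch is implemented: `analyze` is a crude stand-in for the standard analyzer, and the function names and documents are made up.

```python
import re


def analyze(text):
    """Crude stand-in for the standard analyzer: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    """Map each analyzed token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """No analysis: look up the value exactly as given."""
    return index.get(value, set())


def match_query(index, text):
    """Analyze the search text first, then look up every token (OR semantics)."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {1: "Quick Brown Fox", 2: "The lazy dog"}
index = build_inverted_index(docs)

print(term_query(index, "Quick"))   # set(): the index only holds "quick"
print(term_query(index, "quick"))   # {1}
print(match_query(index, "Quick"))  # {1}: the search text is lowercased first
```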
## The Query DSL
Elasticsearch queries are comprised of one or many __Leaf query clauses__. Query clauses can be combined to create other query clauses, called __compound query clauses__. All query clauses have either one of these two formats:
```json
{
QUERY_CLAUSE: { // match, match_all, multi_match, term, terms, exists, missing, range, bool
ARGUMENT: VALUE,
ARGUMENT: VALUE,...
}
}
{
QUERY_CLAUSE: {
FIELD_NAME: {
ARGUMENT: VALUE,
ARGUMENT: VALUE,...
}
}
}
```
Query clauses can be __repeatedly nested__ inside other query clauses
```json
{
QUERY_CLAUSE: {
QUERY_CLAUSE: {
QUERY_CLAUSE: {
QUERY_CLAUSE: {
ARGUMENT: VALUE,
ARGUMENT: VALUE,...
}
}
}
}
}
```
## Two types of query clause (leaf and compound)
### Leaf query clause
Looks for a particular value in a particular field. Examples are the **match**, **term** and **range** queries.
These queries can be used by themselves.
### Compound query clause
Wraps other leaf or compound queries, either to combine multiple queries in a logical fashion (**bool** or **dis_max**)
or to alter their behaviour (such as **constant_score**)
- bool => must, must_not, should, filter, minimum_should_match
combines multiple leaf or compound query clauses
**must**, **should** => contribute to the score; the scores are combined
**must_not**, **filter** => executed in filter context
**must** ==> like logical **AND**.
**should** ==> like logical **OR**.
You can use the `minimum_should_match` parameter to specify the number or percentage of `should` clauses returned documents *must* match.
If the `bool` query includes at least one `should` clause and no `must` or `filter` clauses, the default value is `1`. Otherwise, the default value is `0`
```json
POST _search
{
"query": {
"bool" : {
"must" : {
"term" : { "user" : "kimchy" }
},
"filter": {
"term" : { "tag" : "tech" }
},
"must_not" : {
"range" : {
"age" : { "gte" : 10, "lte" : 20 }
}
},
"should" : [
{ "term" : { "tag" : "wow" } },
{ "term" : { "tag" : "elasticsearch" } }
],
"minimum_should_match" : 1,
"boost" : 1.0
}
}
}
```
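The matching rules of the `bool` query above can be sketched in a few lines of Python. This is a simplified model under assumptions: clauses are plain Python predicates standing in for `term` and `range` clauses, scoring is ignored, and `minimum_should_match` is handled only as an integer, not as a percentage. All names are illustrative.

```python
def default_minimum_should_match(bool_query):
    """1 if there is a should clause and no must or filter clause, otherwise 0."""
    has_should = bool(bool_query.get("should"))
    has_must_or_filter = bool(bool_query.get("must")) or bool(bool_query.get("filter"))
    return 1 if has_should and not has_must_or_filter else 0


def bool_matches(doc, bool_query):
    """Decide whether a document matches; every clause is a predicate on the document."""
    must = bool_query.get("must", [])
    filters = bool_query.get("filter", [])
    must_not = bool_query.get("must_not", [])
    should = bool_query.get("should", [])
    minimum = bool_query.get("minimum_should_match", default_minimum_should_match(bool_query))

    if not all(clause(doc) for clause in must + filters):  # logical AND
        return False
    if any(clause(doc) for clause in must_not):            # logical NOT
        return False
    return sum(1 for clause in should if clause(doc)) >= minimum


query = {
    "must": [lambda d: d["user"] == "kimchy"],
    "filter": [lambda d: "tech" in d["tag"]],
    "must_not": [lambda d: 10 <= d["age"] <= 20],
    "should": [lambda d: "wow" in d["tag"], lambda d: "elasticsearch" in d["tag"]],
    "minimum_should_match": 1,
}

print(bool_matches({"user": "kimchy", "tag": ["tech", "wow"], "age": 30}, query))  # True
print(bool_matches({"user": "kimchy", "tag": ["tech"], "age": 30}, query))         # False
```

The second document fails only because no `should` clause matches while `minimum_should_match` is 1.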
- boosting query
- constant_score query
- dis_max query
- function_score query
## Match Query Clause
Match query clause is the most generic and commonly used query clause:
- when run on an analyzed text field, it performs an analyzed search on the text
- when run on an exact value field, it performs a filter
- calculates the score
example:
```json
{ "match": { "description": "Fourier analysis signals processing" }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "visible": true }}
```
## The Match All Query Clause
Returns all documents
```json
{ "match_all": {} }
```
## Term/Terms Query Clause
The term and terms query clauses are used to **filter** exact value fields by a single value or by multiple values, respectively. In the case of multiple values, the logical connection is OR.
```json
{
"query": {
"term": { "tag": "math" }
}
}
{
"query": {
"terms": { "tag": ["math", "second"] }
}
}
```
## Multi Match Query Clause
Is run across multiple fields instead of just one
```json
{ "query": {
"multi_match": {
"query": "probability theory", // value
"fields": ["title^3", "*body"], // fields, with wildcard *
// no fields == *
// title 3* more important
"type": "best_fields"
}
}
}
```
[Other types](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#multi-match-types)
## Exists and Missing Filters Query Clause
- The exists filter checks that documents have a value at a specified field
```json
{
"query": {
"exists": {
"field": "*installCount" // also with wildcards
}
}
}
```
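- The missing filter checks that documents do not have a value at a specified field. It was removed in Elasticsearch 5.0; on newer versions, wrap `exists` in a `bool` `must_not` clause instead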
```json
{
"missing" : {
"field" : "title"
}
}
```
## Range Filter Query Clause
Filters number and date fields by range, using the operators gt, gte, lt and lte
```json
{ "range" : { "age" : { "gt" : 30 } } }
{
"range": {
"born" : {
"gte": "01/01/2012",
"lte": "2013",
"format": "dd/MM/yyyy||yyyy"
}
}
}
```
## Query in filter context
### No scores are calculated: yes or no
The __query__ parameter indicates query context.
The __bool__ and two __match__ clauses are used in query context, which means that they are used to score how well each document matches.
The __filter__ parameter indicates __*filter context*__. Its term and range clauses are used in filter context. They will filter out documents which do not match, but they will __*not affect the score*__ for matching documents.
The __must__ clause is not required; without it, every matching document gets a score of 0.0
```json
GET /.kibana/_search
{
"query": {
"bool": {
"must": [
{"match": {"type" : "ui-metric"}},
{"match": {"ui-metric.count" : "1"}}
],
"filter": [
{"range": {"updated_at": {"gte": "2020-04-01"}}}
]
}
}
}
```
## Bool Query Clause
Query clauses that are built from other query clauses are called compound query clauses. <sup> Note that compound query clauses can also be comprised of other compound query clauses, allowing for multi-layer nesting </sup>.
The three supported boolean operators are __must__ (and) __must_not__ (not) and __should__ (or)
```json
{
"bool": {
"must": { "term": { "tag": "math" }},
"must_not": { "term": { "tag": "probability" }},
"should": [
{ "term": { "favorite": true }},
{ "term": { "unread": true }}
]
}
}
```
## Combining Analyzed Search With Filters
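Example: a query that finds all posts by performing an analyzed search for “Probability Theory”, keeping only posts with more than 20 upvotes and excluding those with the tag “frequentist”. The `filtered` query used here comes from older Elasticsearch versions and was removed in 5.0; on newer versions the same search is written as a `bool` query with `must` and `filter` clauses.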
```json
{
"filtered": {
"query": { "match": { "body": "Probability Theory" }},
"filter": {
"bool": {
"must": {
"range": { "upvotes" : { "gt" : 20 } }
},
"must_not": { "term": { "tag": "frequentist" } }
}
}
}
}
```
[Source: Understanding the Elasticsearch Query DSL](https://medium.com/@User3141592/understanding-the-elasticsearch-query-dsl-ce1d67f1aa5b)

View File

@ -0,0 +1,111 @@
---
title: cUrl_commands
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# cURL commands
```bash
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
```
| Variables | Description |
|:------------- | :-------------|
| VERB | The appropriate HTTP method or verb. For example, GET, POST, PUT, HEAD, or DELETE |
| PROTOCOL | Either http or https. Use the latter if you have an HTTPS proxy in front of Elasticsearch or you use Elasticsearch security features to encrypt HTTP communications |
| HOST | The hostname of any node in your Elasticsearch cluster. Alternatively, use localhost for a node on your local machine |
| PORT | The port running the Elasticsearch HTTP service, which defaults to 9200 |
| PATH | The API endpoint, which can contain multiple components, such as _cluster/stats or _nodes/stats/jvm |
| QUERY_STRING | Any optional query-string parameters. For example, ?pretty will pretty-print the JSON response to make it easier to read |
| BODY | A JSON-encoded request body (if necessary) |
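As a small sketch of how the parts in the table fit together, the following Python assembles a request URL from its components using only the standard library. The function name and the sample values are illustrative assumptions.

```python
from urllib.parse import urlencode, urlunsplit


def build_url(protocol, host, port, path, query=None):
    """Combine PROTOCOL, HOST, PORT, PATH and QUERY_STRING into one URL."""
    netloc = f"{host}:{port}"
    query_string = urlencode(query or {})
    return urlunsplit((protocol, netloc, "/" + path.lstrip("/"), query_string, ""))


print(build_url("http", "localhost", 9200, "_cluster/stats", {"pretty": "true"}))
# http://localhost:9200/_cluster/stats?pretty=true
```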
```bash
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
```
```bash
curl "localhost:9200/_cat/indices?v"
```
## Test connection and ES correctly running
```bash
curl -I -XHEAD localhost:9200
```
## Create Index
```bash
curl -X PUT http://localhost:9200/indexName
```
## Delete Index
``` bash
curl -X DELETE 'http://localhost:9200/indexName'
```
## List all indexes
``` bash
curl -X GET 'http://localhost:9200/_cat/indices?v'
```
## query using URL parameters
### Lucene syntax
```bash
curl -X GET http://localhost:9200/IndexName/_search?q=school:Harvard
```
## Query using JSON
### ElasticSearch DSL syntax
```bash
curl -XGET --header 'Content-Type: application/json' http://localhost:9200/indexName/_search -d '{
"query" : {
"match" : { "school": "Harvard" }
}
}'
```
Lookup on index id
```bash
curl -XGET --header 'Content-Type: application/json' http://localhost:9200/indexName/_search -d '{
"query" : {
"match" : { "_id": "37" }
}
}'
```
## List index mapping
### aka schema; fieldnames and their type
```bash
curl -X GET http://localhost:9200/indexName
```
## Add data
### indexName and doc# = 1
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/indexName/_doc/1 -d '{
"school" : "Harvard"
}'
## Update a document
### In this example, first create a document and then update it
```bash
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/indexName/_doc/2 -d '
{
"school": "Clemson"
}'
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/indexName/_update/2 -d '{
"doc" : {
"students": 50000}
}'
```
### load a dataset
```bash
curl -u elastic -H 'Content-Type: application/x-ndjson' -XPOST '<host>:<port>/bank/_bulk?pretty' --data-binary @accounts.json
```
[Source 1](https://www.bmc.com/blogs/elasticsearch-commands/)

View File

@ -0,0 +1,410 @@
---
title: examples
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# examples
```json
POST /items/_search
{
"query": {
"term": {"age" : "10"}
}
}
GET /items/_search
{
"query": {
"match": {
"name": "jan"
}
}
}
GET /shakespeare/_search
GET /shakespeare/_search
{
"query": {
"match": {
"speech_number": "1"
}
}
}
GET /shakespeare/_search
{
"query": {
"match_phrase": {
"text_entry": "scene I"
}
}
}
GET /shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": "with care"
}
}
}
GET /shakespeare/_search
{
"query": {
"match_all": {}
}
}
GET /shakespeare/_search
{
"query": {
"match_phrase": {
"text_entry":
{"query" : "shaken are", "slop": 2}
}
}
}
GET /shakespeare/_search
{
"query": {
"query_string": {
"fields": ["play_name", "speaker"],
"query": "KING HENRY IV",
"default_operator": "AND"
}
}
}
GET /shakespeare/_search
{
"query": {
"match": {
"line_id": 1
}
}
}
GET /shakespeare/_search
{
"query": {
"terms": {
"speaker": [
"KING HENRY IV",
"HENTY"
]
}
}
}
GET /shakespeare/_search
{
"query": {
"range": {
"line_id": {
"gte": 1,
"lte": 7
}
}
}
}
GET /shakespeare/_search
{
"query": {
"prefix": {
"speaker": {
"value": "KING"
}
}
}
}
GET /shakespeare/_search
{
"query": {
"wildcard": {
"speaker": {
"value": "KING HENR*"
}
}
}
}
GET /shakespeare/_mapping
GET /shakespeare/_search
{
"query": {
"bool": {
"must": [
{"term": {
"speaker": {
"value": "KING HENRY IV"
}
}}
],
"filter": [
{"term": {
"speech_number": "1"
}}
]
}
}
}
GET /shakespeare/_search
{
"query": {
"bool": {
"must": [
{"match": {
"speaker": "KING HENRY IV"
}}
],
"should": [
{"term": {
"line_number": {
"value": "1.1.2"
}
}},
{"term": {
"speech_number": {
"value": "2"
}
}}
],
"minimum_should_match": 1,
"filter": [
{"term": {
"play_name": "Henry IV"
}}
]
}
}
}
GET /shakespeare/_search
{
"query": {
"bool": {
"should": [
{
"wildcard": {
"line_number": {
"value": "1.1.?"
}
}
},
{
"range": {
"line_id": {
"gte": 1,
"lte": 40
}
}
}
],
"minimum_should_match": 2
}
}
}
GET /shakespeare/_search
{
"query": {
"query_string": {
"fields": ["speaker","play_name"],
"query": "KING HENRY IV",
"default_operator": "OR"
}
}
}
PUT /shakespeare/_mapping
{
"properties": {
"spreker_1": {
"type": "keyword"
}
}
}
DELETE /test_analyzer
PUT /test_analyzer
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_words": {
"type" : "pattern",
"pattern": "\\W|_|[a-c]",
"lowercase": true
}
},
"analyzer": {
"rebuild_pattern": {
"tokenizer" : "split_on_words",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"spreker_1": {
"type": "keyword"
}
}
}
}
GET /test_analyzer/_analyze
{
"analyzer": "rebuild_pattern",
"field": "spreker_1",
"text": ["Whsat is_the dd@this 1 builder's"]
}
DELETE /developer
PUT /developer
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"skills": {
"type": "nested",
"properties": {
"language": {
"type": "keyword"
},
"level": {
"type": "keyword"
}
}
}
}
}
}
POST /developer/_doc/101
{
"name": "john Doe",
"skills": [
{
"language": "ruby",
"level": "expert"
},
{
"language": "javascript",
"level": "beginner"
}
]
}
GET /developer/_search
{
"query": {
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"match": {
"skills.language": "ruby"
}
},
{
"match": {
"skills.level": "expert"
}
}
]
}
}
}
}
}
PUT /developer1
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"skills": {
"type": "object",
"properties": {
"language": {
"type": "keyword"
},
"level": {
"type": "keyword"
}
}
}
}
}
}
POST /developer1/_doc/101
{
"name": "john Doe",
"skills": [
{
"language": "ruby",
"level": "expert"
},
{
"language": "javascript",
"level": "beginner"
}
]
}
GET /developer1/_search
{
"query": {
"match": {
"skills.language": "ruby"
}
}
}
```

View File

@ -0,0 +1,126 @@
---
title: http_commands
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# HTTP API commands
### list indexes
```http
GET /_cat/indices?v <!-- index with all columns -->
GET /_cat/indices
GET /<index>/_mapping <!-- get all mappings of index catalog -->
GET /<index>/_doc/<id>
GET /_cat/health?v <!-- info about cluster and nodes -->
```
### Searches
```http
GET /<index>,<index>/<type>/_search <!-- Searches over all indices on types -->
GET _search <!-- Searches every index on all types -->
```
### Most basic query; it returns all the content and with the score of 1.0 for every object.
```http
GET /<index>/_search
{
"query":{
"match_all":{}
}
}
```
### example
```http
GET /customer/_search
{
"query": { <!-- in a search request, "query" is always the top-level key -->
"match" : { "name" : "John Doe" }
}
}
```
### Index API CRUD operations
```http
PUT /catalog/_doc/1
PUT /<index>/<type>/<id> <!-- providing an ID -->
POST /catalog/_doc{....}
POST /<index>/<type> <!-- without providing an ID; ID = generated hash string -->
POST /catalog/_update/1{ doc { ....} }
POST <index>/<type>/<id>/_update
DELETE <index>/<type>/<id>
```
### Creating a **new** index
## Elasticsearch Query DSL
```http
PUT /catalog
{
"settings": {
"index": {
"number_of_shards": <int>,
"number_of_replicas": <int>
}
},
"mappings": {
"properties": {
"speaker": {"type": "keyword"},
"play_name": {"type": "keyword"},
"line_id": {"type": "integer"},
"description": {"type": "text"},
"speech_number": {"type": "integer"}
}
}
}
```
### Adding a type mapping in an existing index
merged into the existing mappings of the _doc type
```http
PUT /<index>/_mapping
{
"properties": {
"name": {
"type": "text"
}
}
}
```
### Formatting the JSON response
```bash
curl -XGET http://localhost:9200/catalog/_doc/1?pretty=true
```
### Standard tokenizer
```http
POST _analyze
{"tokenizer": "standard", "text": "Tokenizer breaks characters into tokens!"
}
```
### analyzer example with english stopwords
```http
PUT index_standard_analyzer_english_stopwords
{ "settings": {
"analysis": {
"analyzer": {
"std": {
"type": "standard",
"stopwords": "_english_" }
}
}
},
"mappings": {
"properties": {
"my_text": { "type": "text", "analyzer": "std"
}
}
}
}
```

View File

@ -0,0 +1,6 @@
---
title: Applications with Python
updated: 2022-08-08 19:51:06Z
created: 2022-08-08 19:50:49Z
---

View File

@ -0,0 +1,53 @@
---
title: Complex Networks
updated: 2021-10-04 17:28:56Z
created: 2021-10-04 15:37:12Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
![193bfaaed851ce4eef6d43f0a20c03a7.png](../../_resources/193bfaaed851ce4eef6d43f0a20c03a7.png)
# Node and Edge Characteristics
![c9a1ce3aec2e82148c02a5d70d096109.png](../../_resources/c9a1ce3aec2e82148c02a5d70d096109.png)
![a546793b2366edc935ce4a1d98db7f1e.png](../../_resources/a546793b2366edc935ce4a1d98db7f1e.png)
http://ks3329888.kimsufi.com/Intro2GraphTheoryAndComplexNetworksAnalysis/
![4d44a731b2dcebdb8e5210d5f6378b45.png](../../_resources/4d44a731b2dcebdb8e5210d5f6378b45.png)
1. Starting from a set of n vertices
2. Given any two vertices, there is a probability P that they are linked together.
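The two steps above can be sketched in plain Python, assuming they describe the classic G(n, p) random graph model (the image that named the model is not reproduced here). The function name and parameters are illustrative.

```python
import itertools
import random


def random_graph(n, p, seed=None):
    """Return the edge list of a random graph on vertices 0..n-1.

    Every pair of vertices is linked independently with probability p.
    """
    rng = random.Random(seed)
    return [
        (u, v)
        for u, v in itertools.combinations(range(n), 2)
        if rng.random() < p
    ]


edges = random_graph(n=6, p=0.5, seed=42)
print(len(edges), "edges out of", 6 * 5 // 2, "possible")
```

The expected number of edges is p * n * (n - 1) / 2, so with p = 0 the graph is empty and with p = 1 it is complete.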
![cf910f508994db7a3d68732db2c383fb.png](../../_resources/cf910f508994db7a3d68732db2c383fb.png)
1. Generate a lattice
2. Nodes are initially linked to k closest neighbours
3. Apply a rewiring probability
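A minimal sketch of these three steps in plain Python, assuming they describe a small-world construction in the style of Watts and Strogatz and that the lattice is a ring. The function name, the `beta` parameter name and the rewiring details are illustrative choices, not taken from the notes.

```python
import random


def small_world(n, k, beta, seed=None):
    """Ring lattice of n nodes, each linked to its k closest neighbours,
    with every lattice edge rewired with probability beta.

    Assumes k is even and n > k. Returns a set of frozenset({u, v}) edges.
    """
    rng = random.Random(seed)

    # Steps 1 and 2: ring lattice, k / 2 neighbours on each side.
    edges = set()
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            edges.add(frozenset((u, (u + offset) % n)))

    # Step 3: rewire the far end of each lattice edge with probability beta.
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            v = (u + offset) % n
            if rng.random() < beta:
                # Avoid self-loops and duplicate edges.
                candidates = [
                    w for w in range(n)
                    if w != u and frozenset((u, w)) not in edges
                ]
                if candidates:
                    edges.discard(frozenset((u, v)))
                    edges.add(frozenset((u, rng.choice(candidates))))
    return edges


lattice = small_world(n=10, k=4, beta=0.0)
rewired = small_world(n=10, k=4, beta=0.3, seed=7)
print(len(lattice), len(rewired))  # rewiring keeps the number of edges constant
```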
![8cf3db7009d5518b37d9b0deb9b88dea.png](../../_resources/8cf3db7009d5518b37d9b0deb9b88dea.png)
![5362730add08cb32ca4de62973425703.png](../../_resources/5362730add08cb32ca4de62973425703.png)
![b685f0b119c20427c202a7143b227b60.png](../../_resources/b685f0b119c20427c202a7143b227b60.png)
![f2a731b0ccf8dc8c852572318d9d6bea.png](../../_resources/f2a731b0ccf8dc8c852572318d9d6bea.png)
![fe7d21e251098349149692c251f53dbc.png](../../_resources/fe7d21e251098349149692c251f53dbc.png)
![536ce7d9e4c324339ba03a53d7eaaec9.png](../../_resources/536ce7d9e4c324339ba03a53d7eaaec9.png)
![eb4e1a22065b0f7b00096573ca4f7cff.png](../../_resources/eb4e1a22065b0f7b00096573ca4f7cff.png)
![418daa8979836e67e544c4b275820715.png](../../_resources/418daa8979836e67e544c4b275820715.png)
![c30291ea3fd814cde29bbc5e2f275c3c.png](../../_resources/c30291ea3fd814cde29bbc5e2f275c3c.png)
![0ceb8df673446ebaaa7f858751a50669.png](../../_resources/0ceb8df673446ebaaa7f858751a50669.png)
[Network Analysis with Python and NetworkX Cheat Sheet](https://cheatography.com/murenei/cheat-sheets/network-analysis-with-python-and-networkx/)
[Network Science Book](http://networksciencebook.com/)
[The Colorado Index of Complex Networks](https://icon.colorado.edu/#!/)
[Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/)

View File

@ -0,0 +1,29 @@
---
title: Docker_neo4j
updated: 2022-07-11 13:00:08Z
created: 2021-05-06 10:33:11Z
---
docker run \
--name neo4jserver \
-p7474:7474 \
-p7687:7687 \
-d \
--rm \
-v $HOME/work/data/neo4j/data:/data \
-v $HOME/work/data/neo4j/logs:/logs \
-v $HOME/work/data/neo4j/import:/var/lib/neo4j/import \
-v $HOME/work/data/neo4j/plugins:/plugins \
--user="$(id -u):$(id -g)" \
--env NEO4J_AUTH=none \
neo4j:latest
```
sudo groupadd -g 7474 neo4jgrp
getent group | grep neo4jgrp
sudo usermod -a -G neo4jgrp user_name
```

View File

@ -26,15 +26,15 @@ The Neo4j components that are used to define the graph data model are:
Here are the steps to create a graph data model:
1. Understand the domain and define specific use cases (questions) for the application.
2. Develop the initial graph data model:
1. Understand the domain and define specific use cases (questions) for the application.
2. Develop the initial graph data model:
a. Model the nodes (entities).
b. Model the relationships between nodes.
3. Test the use cases against the initial data model.
4. Create the graph (instance model) with test data using Cypher.
5. Test the use cases, including performance against the graph.
6. Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.
7. Implement the refactoring on the graph and retest using Cypher.
3. Test the use cases against the initial data model.
4. Create the graph (instance model) with test data using Cypher.
5. Test the use cases, including performance against the graph.
6. Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.
7. Implement the refactoring on the graph and retest using Cypher.
Graph data modeling is an iterative process. Your initial graph data model is a starting point, but as you learn more about the use cases or if the use cases change, the initial graph data model will need to change. In addition, you may find that especially when the graph scales, you will need to modify the graph (refactor) to achieve the best performance for your key use cases.
@ -67,7 +67,7 @@ When performing the graph data modeling process for an application, you will nee
The data model describes the labels, relationships, and properties for the graph. It does not have specific data that will be created in the graph.
Here is an example of a data model:
![0e5c55b7a519831b5ba0393544641782.png](../../images/0e5c55b7a519831b5ba0393544641782.png)
![0e5c55b7a519831b5ba0393544641782.png](../../_resources/0e5c55b7a519831b5ba0393544641782.png)
There is nothing that uniquely identifies a node with a given label. A graph data model, however is important because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.
@ -141,13 +141,11 @@ The main risk about fanout is that it can lead to very dense nodes, or supernode
## Properties for relationships
Properties for a relationship are used to enrich how two nodes are related.
When you define a property for a relationship, it is because your use cases ask a specific question about how two nodes are related, not just that they are related.
Properties for a relationship are used to enrich how two nodes are related. When you define a property for a relationship, it is because your use cases ask a specific question about how two nodes are related, not just that they are related.
# Testing the Model
You use the **use cases** to design the data model:
- includes labels for nodes
- relationship types and direction
- properties for the nodes and relationships.
@ -162,18 +160,15 @@ More data for testing is OK => test **scalability**
The Cypher code used to test the use cases needs to be carefully reviewed for correctness.
# Refactoring the Graph
## Refactoring
Refactoring means changing the data model and the graph.
Three reasons to refactor:
- The graph as modeled does not answer all of the use cases.
- A new use case has come up that you must account for in your data model.
- The Cypher for the use cases does not perform optimally, especially when the graph scales
Required steps for refactoring:
1. Design the new data model.
2. Write Cypher code to transform the existing graph to implement the new data model.
3. Retest all use cases, possibly with updated Cypher code.
@ -185,7 +180,6 @@ Limit the number of labels to 4
The primary reason to add labels to nodes is to reduce the amount of data accessed at runtime.
## Retesting After Refactoring
- After refactoring the graph, revisit all queries for all use cases.
- Rewrite any Cypher queries for use cases that are affected by the refactoring.
@ -199,9 +193,8 @@ What is the primary reason to add labels to nodes is reduce the number of data a
- avoid duplicating data in your graph
- eliminate duplication -> improve query performance
- In order to perform the query, all nodes must be retrieved to match a property.
- example refactoring list property to nodes
```
MATCH (m:Movie)
UNWIND m.languages AS language
@ -218,8 +211,7 @@ SET m.languages = null
## Eliminating Complex Data in Nodes
Storing complex data in the nodes like this may not be beneficial for a couple of reasons:
- Duplicate data.
- Queries related to the information in the nodes require that all nodes be retrieved.
# Using Specific Relationships
@ -245,13 +237,13 @@ It has a **apoc.merge.relationship** procedure that allows you to **dynamically
A hyperedge is a relationship that connects more than two nodes. Mathematics allows this, but it is impossible in Neo4j.
![69b4c46435ed52c1fe5be0ba6a074be5.png](../../images/69b4c46435ed52c1fe5be0ba6a074be5.png)
![69b4c46435ed52c1fe5be0ba6a074be5.png](../../_resources/69b4c46435ed52c1fe5be0ba6a074be5.png)
Email is the new intermediate node.
![ae805ac0f184fdb6cf93d6b038af28a9.png](../../images/ae805ac0f184fdb6cf93d6b038af28a9.png)
![ae805ac0f184fdb6cf93d6b038af28a9.png](../../_resources/ae805ac0f184fdb6cf93d6b038af28a9.png)
- Intermediate nodes deduplicate information.
- Connect more than two nodes in a single context.
- Share data in the graph.
- Relate something to a relationship.

View File

@ -0,0 +1,76 @@
---
title: Graph Theory
updated: 2022-07-18 09:46:19Z
created: 2021-10-02 17:39:39Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
## Euler Path
Uses every edge exactly once; no repeats.
The start and end vertex do not have to be the same.
**0 or 2 odd degree vertices, the rest even**
![0dbbbe727224fc349edc14517e3b93c9.png](../../_resources/0dbbbe727224fc349edc14517e3b93c9.png)
## Euler Circuit
Every edge once; no repeats
Start and end vertex are the same
**All vertices must have an even degree**
Multiple Euler Circuits are possible, depending on the starting direction.
![3f0188b20328cc489d50b321df8d6bc7.png](../../_resources/3f0188b20328cc489d50b321df8d6bc7.png)
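The two degree rules above can be checked mechanically. The following is a minimal sketch, assuming the graph is connected and given as a list of undirected edges; the function name and the square-shaped example graph are illustrative and do not come from the figures.

```python
from collections import defaultdict

def classify_euler(edges):
    """Classify a connected, undirected graph by its odd-degree vertices.

    Assumption: the graph is connected; connectivity is not checked here.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    if odd == 0:
        return "circuit"  # all degrees even: an Euler circuit exists
    if odd == 2:
        return "path"     # exactly two odd vertices: Euler path, no circuit
    return "none"

square = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(classify_euler(square))                 # circuit
print(classify_euler(square + [("A", "C")]))  # path
```

Adding the diagonal A-C gives A and C degree 3, so the path has to start at one of them and end at the other.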
## Fleury's Algorithm
Find an Euler Circuit, starting at vertex A
![2e79a268b8d9ef0e61db6ea008bafd7d.png](../../_resources/2e79a268b8d9ef0e61db6ea008bafd7d.png)
Remove each edge as you take it to go to the next vertex, but never pick an edge whose removal would disconnect the remaining graph unless it is the only edge left. Deleting the edge prevents backtracking.
![a99bbee6299b41776e45d4af26288c05.png](../../_resources/a99bbee6299b41776e45d4af26288c05.png)
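A minimal sketch of Fleury's rule, assuming a connected graph in which every vertex has even degree. The function name and the bowtie-shaped example graph are illustrative and are not taken from the figures. An edge counts as safe when removing it does not shrink the set of vertices reachable from the current vertex.

```python
def fleury_circuit(edges, start):
    """Return an Euler circuit as a list of vertices, using Fleury's rule.

    Assumption: the graph is connected and every vertex has even degree.
    """
    # Adjacency lists; each undirected edge is stored in both directions.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    def reachable(src):
        # Count the vertices reachable from src over the remaining edges.
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen)

    def remove_edge(u, v):
        adj[u].remove(v)
        adj[v].remove(u)

    def is_safe(u, v):
        if len(adj[u]) == 1:
            return True  # the only remaining edge must be taken
        before = reachable(u)
        remove_edge(u, v)
        after = reachable(u)
        adj[u].append(v)  # put the edge back
        adj[v].append(u)
        return after == before  # unchanged reach: (u, v) is not a bridge

    circuit, current = [start], start
    while adj[current]:
        # Iterate over a copy because is_safe temporarily edits the list.
        nxt = next(v for v in list(adj[current]) if is_safe(current, v))
        remove_edge(current, nxt)
        circuit.append(nxt)
        current = nxt
    return circuit

BOWTIE = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E"), ("E", "C")]
print(fleury_circuit(BOWTIE, "A"))
```

The repeated reachability check makes this approach slow on large graphs; it is meant to mirror the manual procedure, not to be efficient.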
## Eulerization
Duplicate existing edges between vertices with odd degree until the graph contains an Euler circuit. No new edges may be added. Minimize the number of duplicated edges. Multiple solutions are possible.
![8cc8d717ee428ffeb48a396d804b91de.png](../../_resources/8cc8d717ee428ffeb48a396d804b91de.png)
## Hamiltonian Circuit
Visit every vertex exactly once, with no repeats, and return to the starting vertex.
![91c43645ccabeee0047757424141fb5d.png](../../_resources/91c43645ccabeee0047757424141fb5d.png)
Minimum-cost Hamiltonian Circuit == Traveling Salesman Problem
Possible method: Brute Force
Optimal, but not efficient
![5815ff14a02f76a5118a6ef8f86ed048.png](../../_resources/5815ff14a02f76a5118a6ef8f86ed048.png)
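As a sketch of the brute-force method, the code below tries every ordering of the remaining vertices and keeps the cheapest circuit. It assumes a complete graph and reuses the edge weights listed in the Sorted Edge example in these notes; the function and variable names are illustrative.

```python
from itertools import permutations

# Edge weights keyed by unordered vertex pair (from the Sorted Edge example).
WEIGHT = {
    frozenset("AD"): 1, frozenset("AC"): 2, frozenset("AB"): 4,
    frozenset("CD"): 8, frozenset("BD"): 9, frozenset("BC"): 13,
}

def brute_force_tsp(vertices):
    """Return (cost, tour) of the cheapest Hamiltonian circuit."""
    start, *rest = vertices
    best_cost, best_tour = None, None
    for order in permutations(rest):
        tour = (start, *order, start)
        total = sum(WEIGHT[frozenset(pair)] for pair in zip(tour, tour[1:]))
        if best_cost is None or total < best_cost:
            best_cost, best_tour = total, tour
    return best_cost, best_tour

print(brute_force_tsp("ABCD"))  # (23, ('A', 'B', 'D', 'C', 'A'))
```

Every circuit is visited twice (once per direction), which is harmless here but shows why the number of candidates grows factorially.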
## Complete Graph
All vertices are connected with all the others.
A complete graph with n vertices has (n-1)!/2 different Hamiltonian circuits.
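A small sketch of how quickly that count grows, assuming n is at least 3 (the smallest complete graph that contains a circuit). The function name is illustrative.

```python
from math import factorial

def hamiltonian_circuit_count(n):
    """Number of distinct Hamiltonian circuits in a complete graph on n vertices."""
    if n < 3:
        raise ValueError("a circuit needs at least 3 vertices")
    # Fix the start vertex: (n-1)! orderings; halve because each circuit
    # is counted once per direction.
    return factorial(n - 1) // 2

for n in (4, 5, 10):
    print(n, hamiltonian_circuit_count(n))  # 4 -> 3, 5 -> 12, 10 -> 181440
```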
Heuristic methods find a short circuit much faster than Brute Force, but are not guaranteed to find the shortest one:
- **Nearest Neighbor Algorithm** (not optimal)
Start at A and take at every vertex the edge with the lowest weight. (Greedy, doesn't look ahead)
![beb7e37bd0e389910f37d2ef559a1900.png](../../_resources/beb7e37bd0e389910f37d2ef559a1900.png)
- **Repeated Nearest Neighbor Algorithm**
Same as the Nearest Neighbor Algorithm, but repeat it for every starting vertex and select the circuit with the lowest cost.
![89ef06b3ace7055254b5a4d1b2db1663.png](../../_resources/89ef06b3ace7055254b5a4d1b2db1663.png)
- **Sorted Edge Algorithm** (not optimal)
Add the cheapest remaining edge, unless it would:
- create a mini circuit (a circuit that doesn't include all vertices)
- give a vertex degree 3
AD = 1
AC = 2
AB = 4 (rejected: degree 3 rule)
CD = 8 (rejected: creates a mini circuit)
BD = 9
BC = 13
Result is ADBCA = 25 (not optimal: with these weights ABDCA = 23)
![b3fabbeba3f22c574c31078c58c5dbe8.png](../../_resources/b3fabbeba3f22c574c31078c58c5dbe8.png)
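The three heuristics above can be sketched as follows, using the edge weights from the Sorted Edge example. This is an illustration under the assumption of a complete graph; the function and variable names are made up for this sketch, and a union-find structure is used here as one possible way to detect mini circuits.

```python
# Edge weights from the Sorted Edge example, keyed by unordered vertex pair.
WEIGHT = {
    frozenset("AD"): 1, frozenset("AC"): 2, frozenset("AB"): 4,
    frozenset("CD"): 8, frozenset("BD"): 9, frozenset("BC"): 13,
}
VERTICES = ["A", "B", "C", "D"]

def nearest_neighbor(start):
    """Greedy tour: always move to the cheapest unvisited vertex."""
    tour, total = [start], 0
    unvisited = set(VERTICES) - {start}
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda v: WEIGHT[frozenset((here, v))])
        total += WEIGHT[frozenset((here, nxt))]
        tour.append(nxt)
        unvisited.remove(nxt)
    total += WEIGHT[frozenset((tour[-1], start))]  # close the circuit
    return total, tour + [start]

def repeated_nearest_neighbor():
    """Run nearest neighbor from every vertex and keep the cheapest tour."""
    return min((nearest_neighbor(v) for v in VERTICES), key=lambda r: r[0])

def sorted_edges():
    """Add edges cheapest first, skipping degree-3 and mini-circuit edges."""
    parent = {v: v for v in VERTICES}  # union-find to detect mini circuits

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    degree = {v: 0 for v in VERTICES}
    chosen = []
    for edge, w in sorted(WEIGHT.items(), key=lambda item: item[1]):
        u, v = sorted(edge)
        if degree[u] == 2 or degree[v] == 2:
            continue  # would give a vertex degree 3
        closes_tour = len(chosen) == len(VERTICES) - 1
        if find(u) == find(v) and not closes_tour:
            continue  # would create a mini circuit
        parent[find(u)] = find(v)
        degree[u] += 1
        degree[v] += 1
        chosen.append((u, v, w))
        if len(chosen) == len(VERTICES):
            break
    return sum(w for _, _, w in chosen), chosen

print(nearest_neighbor("A")[0])        # 26
print(repeated_nearest_neighbor()[0])  # 25
print(sorted_edges()[0])               # 25
```

None of the three reaches the cost of 23 that an exhaustive search finds for these weights, which illustrates the "not optimal" remarks.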
## Hamiltonian Path
Visit every vertex exactly once, with no repeats; the start and end vertex do not have to be the same.
![c34ed20473a52dda3d546240f3b6e52c.png](../../_resources/c34ed20473a52dda3d546240f3b6e52c.png)
## Kruskal's Algorithm (optimal and efficient)
Similar to the Sorted Edge Algorithm, but no circuit is allowed at all.
Minimum-cost spanning tree == the cheapest set of edges that connects every vertex to every other vertex, without circuits
![b55dc1138ff9d80431c97102da97c98c.png](../../_resources/b55dc1138ff9d80431c97102da97c98c.png)
Add the cheapest remaining edge, unless it creates a circuit.
![c1e5b885c1aead4ff5700b0a10cc0302.png](../../_resources/c1e5b885c1aead4ff5700b0a10cc0302.png)
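A minimal sketch of Kruskal's rule, assuming a connected, undirected graph given as (weight, vertex, vertex) tuples. The weights are the ones from the Sorted Edge example in these notes, not from the figures here; the union-find helper is one common way to test whether an edge would create a circuit.

```python
def kruskal(vertices, weighted_edges):
    """Return the edges of a minimum-cost spanning tree."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges):  # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # different components: no circuit is created
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

EDGES = [(1, "A", "D"), (2, "A", "C"), (4, "A", "B"),
         (8, "C", "D"), (9, "B", "D"), (13, "B", "C")]
tree = kruskal("ABCD", EDGES)
print(tree, sum(w for _, _, w in tree))  # total cost 7
```

Unlike the Sorted Edge Algorithm there is no degree limit, so vertex A can end up with three tree edges.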

View File

@ -0,0 +1,590 @@
---
title: Intermediate Cypher Queries
updated: 2022-08-08 19:42:32Z
created: 2022-08-01 13:15:27Z
---
# Filtering Queries
```
CALL db.schema.visualization()
CALL db.schema.nodeTypeProperties()
CALL db.schema.relTypeProperties()
SHOW CONSTRAINTS
:HISTORY
:USE database
```
check multiple labels
```
match (p)
where p:Actor:Director
and p.born.year >= 1950 and p.born.year <= 1959
return count(p)
```
```
MATCH (p:Director)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(p)
WHERE "German" IN m.languages
return p.name, labels(p), m.title
```
```
match (n)-[a]->(m:Movie)
where (n:Actor or n:Director)
and toUpper(a.role) contains 'DOG'
return n.name, m.title, a.role
```
### Difference EXPLAIN vs PROFILE
- EXPLAIN provides estimates of the query steps
- PROFILE provides the exact steps and number of rows retrieved for the query.
Providing you are simply querying the graph and not updating anything, it is fine to execute the query multiple times using **PROFILE**. In fact, as part of query tuning, you should _execute the query at least twice_ as the first execution involves the generation of the execution plan which is then cached. That is, the first PROFILE of a query will always be more expensive than subsequent queries.
Useful use of exists to exclude patterns in the graph
```
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND NOT exists {(p)-[:DIRECTED]->(m)}
RETURN m.title
```
If you profile this query, you will find that it is not performant, but it is the only way to perform this query.
### Multiple MATCH Clauses
```
MATCH (a:Person)-[:ACTED_IN]->(m:Movie),
(m)<-[:DIRECTED]-(d:Person)
WHERE m.year > 2000
RETURN a.name, m.title, d.name
```
In general, using a single MATCH clause will perform better than multiple MATCH clauses. This is because relationship uniqueness is enforced, so fewer relationships are traversed.
Same as above
```
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.year > 2000
RETURN a.name, m.title, d.name
```
### Optionally matching rows
```
MATCH (m:Movie) WHERE m.title = "Kiss Me Deadly"
MATCH (m)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(rec:Movie)
OPTIONAL MATCH (m)<-[:ACTED_IN]-(a:Actor)-[:ACTED_IN]->(rec)
RETURN rec.title, a.name
```
This query returns the movies in the same genre; the pattern where an actor acted in both movies is optional, so a null value is returned for a.name in any row without such an actor. In general, and depending on your graph, an optional match will return more rows.
## Controlling Results Returned
### Ordering Returned Results
```
MATCH (p:Person)
WHERE p.born.year = 1980
RETURN p.name AS name, p.born AS birthDate
ORDER BY birthDate DESC , name ASC
```
### Limiting results; Skipping some results
```
MATCH (p:Person)
WHERE p.born.year = 1980
RETURN p.name as name,
p.born AS birthDate
ORDER BY p.born SKIP 40 LIMIT 10
```
In this query, we return 10 rows representing page 5, where each page contains 10 rows.
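The SKIP value follows from the page number and page size. A small sketch of that arithmetic, assuming pages are numbered from 1; the helper name is hypothetical.

```python
def page_window(page, page_size):
    """Return (skip, limit) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return (page - 1) * page_size, page_size

skip, limit = page_window(5, 10)
print(f"SKIP {skip} LIMIT {limit}")  # SKIP 40 LIMIT 10
```

In an application these two values would normally be passed to the query as parameters instead of being formatted into the Cypher text.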
```
MATCH (p:Person)-[:ACTED_IN| DIRECTED]->(m)
WHERE m.title = 'Toy Story'
MATCH (p)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
RETURN p.name, p2.name
```
Returns the names of the people who acted in or directed the movie Toy Story, and then retrieves all people who acted in a movie together with them.
### Map projections
```
MATCH (p:Person)
WHERE p.name CONTAINS "Thomas"
RETURN p { .* } AS person
ORDER BY p.name ASC
```
```
MATCH (p:Person)
WHERE p.name CONTAINS "Thomas"
RETURN p { .name, .born } AS person
ORDER BY p.name
```
```
MATCH (m:Movie)<-[:DIRECTED]-(d:Director)
WHERE d.name = 'Woody Allen'
RETURN m {.*, favorite: true} AS movie
```
Returning a property of favorite with a value of true for each Movie object returned.
```
MATCH (m:Movie)<-[:ACTED_IN]-(p:Person)
WHERE p.name = 'Henry Fonda'
RETURN m.title AS movie,
CASE
WHEN m.year < 1940 THEN 'oldies'
WHEN 1940 <= m.year < 1950 THEN 'forties'
WHEN 1950 <= m.year < 1960 THEN 'fifties'
WHEN 1960 <= m.year < 1970 THEN 'sixties'
WHEN 1970 <= m.year < 1980 THEN 'seventies'
WHEN 1980 <= m.year < 1990 THEN 'eighties'
WHEN 1990 <= m.year < 2000 THEN 'nineties'
ELSE 'two-thousands'
END
AS timeFrame
```
# Aggregating Data
If an aggregation function like count() is used, all non-aggregated result columns become grouping keys.
_If you specify **count(n)**, the graph engine calculates the number of non-null occurrences of n._
_If you specify **count(\*)**, the graph engine calculates the number of rows retrieved, including those with null values._
### Returning a list
```
MATCH (p:Person)
RETURN p.name, [p.born, p.died] AS lifeTime
LIMIT 10
```
```
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN collect( DISTINCT m.title) AS movies,
collect( a.name) AS actors
```
```
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[2..] AS castMember,
size(collect(a.name)) as castSize
```
return a slice of a collection.
### List comprehension
```
MATCH (m:Movie)
RETURN m.title as movie,
[x IN m.countries WHERE x = 'USA' OR x = 'Germany']
AS country LIMIT 500
```
### Pattern comprehension
```
MATCH (m:Movie)
WHERE m.year = 2015
RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir.name] AS directors,
[(actor:Person)-[:ACTED_IN]->(m) | actor.name] AS actors
```
For pattern comprehension specify the list with the square braces to include the pattern followed by the pipe character to then specify what value will be placed in the list from the pattern.
```
[<pattern> | value]
```
```
MATCH (a:Person {name: 'Tom Hanks'})
RETURN [(a)-->(b:Movie)
WHERE b.title CONTAINS "Toy" | b.title + ": " + b.year]
AS movies
```
### Working with maps
A Cypher map is a list of key/value pairs, where each element has the format key: value.
```
RETURN {Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}['Feb'] AS daysInFeb
```
Dot notation also works: end the expression with `Dec: 31}.Feb AS daysInFeb`.
### Map projections
```
MATCH (m:Movie)
WHERE m.title CONTAINS 'Matrix'
RETURN m { .title, .released } AS movie
```
# Working with Dates and Times
```
RETURN date(), datetime(), time()
```
```
CALL apoc.meta.nodeTypeProperties()
```
List node properties
```
MATCH (x:Test {id: 1})
RETURN x.date.day, x.date.year,
x.datetime.year, x.datetime.hour,
x.datetime.minute
```
Extract date components
```
MATCH (x:Test {id: 1})
SET x.datetime1 = datetime('2022-01-04T10:05:20'),
x.datetime2 = datetime('2022-04-09T18:33:05')
RETURN x
```
`Date property using a <ISO-date> string.`
```
MATCH (x:Test {id: 1})
// A query can have only one RETURN clause, so the expressions are combined
RETURN duration.between(x.date1, x.date2),
duration.inDays(x.datetime1, x.datetime2).days,
x.date1 + duration({months: 6})
```
### APOC to format dates and times
```
MATCH (x:Test {id: 1})
RETURN x.datetime as Datetime,
apoc.temporal.format( x.datetime, 'HH:mm:ss.SSSS')
AS formattedDateTime
```
# Graph Traversal
### Anchor of a query
The execution plan determines the set of nodes that are the starting points for the query. The anchor is mostly based on the MATCH clause.
The anchor is typically determined by meta-data that is stored in the graph or a filter that is provided inline or in a WHERE clause. The anchor for a query will be based upon the fewest number of nodes that need to be retrieved into memory.
# Varying Length Traversal
```
MATCH p = shortestPath((p1:Person)-[*]-(p2:Person))
WHERE p1.name = "Eminem"
AND p2.name = "Charlton Heston"
RETURN p
```
Shortest path, regardless of relationship type or direction
```
MATCH (p:Person {name: 'Eminem'})-[:ACTED_IN*2]-(others:Person)
RETURN others.name
```
Two hops away from Eminem using the ACTED_IN relationship
```
MATCH (p:Person {name: 'Eminem'})-[:ACTED_IN*1..4]-(others:Person)
RETURN others.name
```
1 to 4 hops; all people connected to Eminem through ACTED_IN relationships, up to 4 hops deep
# Pipelining Queries
```
MATCH (n:Movie)
WHERE n.imdbRating IS NOT NULL
AND n.poster IS NOT NULL
WITH n {
.title,
.year,
.languages,
.plot,
.poster,
.imdbRating,
directors: [ (n)<-[:DIRECTED]-(d) | d { tmdbId:d.imdbId, .name } ]
}
ORDER BY n.imdbRating DESC LIMIT 4
RETURN collect(n)
```
```
WITH 'Clint Eastwood' AS a, 'high' AS t
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
with p, m, toLower(m.title) as movieTitle
WHERE p.name = a
AND movieTitle CONTAINS t
RETURN p.name AS actor, m.title AS movie
```
```
WITH 'Tom Hanks' AS theActor
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = theActor
AND m.revenue IS NOT NULL
with m order by m.revenue desc limit 1
// Use WITH here to limit the movie node to 1 and order it by revenue
RETURN m.revenue AS revenue, m.title AS title
```
```
MATCH (n:Movie)
WHERE n.imdbRating IS NOT NULL and n.poster IS NOT NULL
with n {
.title,
.imdbRating,
actors: [(a)-[:ACTED_IN]->(n) | a {name:a.name, .name}],
genre: [(n)-[:IN_GENRE]->(g) | g {name:g.name, .name}]}
ORDER BY n.imdbRating DESC LIMIT 4
with collect(n.actors) as a
unwind a as b
unwind b as listB
return listB.name, count(listB.name)
order by listB.name
```
# Pipelining Queries
### Aggregation and pipelining
```
MATCH (:Movie {title: 'Toy Story'})-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(m)
WHERE m.imdbRating IS NOT NULL
WITH
g.name AS genre,
count(m) AS moviesInCommon,
sum(m.imdbRating) AS total
RETURN
genre, moviesInCommon,
total/moviesInCommon AS score
ORDER By score DESC
```
```
MATCH (u:User {name: "Misty Williams"})-[r:RATED]->(:Movie)
WITH u, avg(r.rating) AS average
MATCH (u)-[r:RATED]->(m:Movie)
WHERE r.rating > average
RETURN
average , m.title AS movie,
r.rating as rating
ORDER BY rating DESC
```
### Using WITH for collecting
```
MATCH (m:Movie)--(a:Actor)
WHERE m.title CONTAINS 'New York'
WITH
m,
collect (a.name) AS actors,
count(*) AS numActors
RETURN
m.title AS movieTitle,
actors
ORDER BY numActors DESC
```
```
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE m.title CONTAINS 'New York'
WITH
m,
collect (a.name) AS actors,
count(*) AS numActors
ORDER BY numActors DESC
RETURN collect(m { .title, actors, numActors }) AS movies
```
### Using LIMIT early
```
MATCH (p:Actor)
WHERE p.born.year = 1980
WITH p LIMIT 3
MATCH (p)-[:ACTED_IN]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WITH
p,
collect(DISTINCT g.name) AS genres
RETURN p.name AS actor, genres
```
```
Match (a:Actor)-[:ACTED_IN]->(m)
where a.name = 'Tom Hanks'
with m
match (m)<-[r:RATED]-(u)
with
m,
avg(r.rating) as rating
return rating, m.title
order by rating desc
limit 1
```
# Unwinding Lists
```
MATCH (m:Movie)
UNWIND m.languages AS lang
WITH
m,
trim(lang) AS language
// this automatically, makes the language distinct because it's a grouping key
WITH
language,
collect(m.title) AS movies
RETURN
language,
movies[0..10]
```
# Reducing Memory (CALL, UNION)
If the memory needed by the MATCH clauses exceeds the memory configured for the VM, the query will fail.
A subquery is a set of Cypher statements that execute within their own scope.
Important things to know about a subquery:
- A subquery returns values referred to by the variables in the RETURN clause.
- A subquery cannot return variables with the same name used in the enclosing query.
- You must explicitly pass in variables from the enclosing query to a subquery.
### CALL
```
MATCH (m:Movie)
CALL {
WITH m
MATCH (m)<-[r:RATED]-(u:User)
WHERE r.rating = 5
RETURN count(u) AS numReviews
}
RETURN m.title, numReviews
ORDER BY numReviews DESC
```
### UNION [ALL]
```
MATCH (p:Person)
WITH p LIMIT 100
CALL {
WITH p
OPTIONAL MATCH (p)-[:ACTED_IN]->(m:Movie)
RETURN m.title + ": " + "Actor" AS work
UNION
WITH p
OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie)
RETURN m.title+ ": " + "Director" AS work
}
RETURN p.name, collect(work)
```
```
MATCH (g:Genre)
call {
with g
match (m:Movie)-[:IN_GENRE]->(g)
where 'France' in m.countries
return count(m) as numMovies
}
RETURN g.name AS genre, numMovies
ORDER BY numMovies DESC
```
# Using Parameters
```
:params {actorName: 'Tom Cruise', movieName: 'Top Gun'}
```
```
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = $actorName
RETURN m.released AS releaseDate,
m.title AS title
ORDER BY m.released DESC
```
```
:params {actorName: 'Tom Cruise', movieName: 'Top Gun', l:2}
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.title = $movieName RETURN p.name LIMIT $l
```
### Setting Integers
:param number: 10 >>>> will be converted to float!!!!!
:param number=> 10 >>>>> remains an integer!!!!
```
:param
```
to view all set parameters
```
:param {}
```
clear all set parameters
# Application Examples Using Parameters
```
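from neo4j import GraphDatabase

# Assumption: URI and credentials are placeholders; replace them with your own.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def get_actors(tx, movieTitle):
    # The $title parameter is filled from the keyword argument below
    result = tx.run("""
        MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
        WHERE m.title = $title
        RETURN p
    """, title=movieTitle)
    # Access the `p` value from each record
    return [record["p"] for record in result]

with driver.session() as session:
    result = session.read_transaction(get_actors, movieTitle="Toy Story")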
```

View File

@ -0,0 +1,87 @@
---
title: 'Preparing for Importing Data '
updated: 2022-07-31 11:38:21Z
created: 2022-07-30 13:45:09Z
---
# What does importing data mean?
Cypher has a built-in clause, **LOAD CSV** for importing CSV files. If you have a JSON or XML file, you must use the **APOC library** to import the data, but you can also import CSV with APOC. And the **Neo4j Data Importer** enables you to import CSV data without writing any Cypher code.
The **types of data** that you can store as properties in Neo4j include:
- String
- Long (integer values)
- Double (decimal values)
- Boolean
- Date/Datetime
- Point (spatial)
- StringArray (comma-separated list of strings)
- LongArray (comma-separated list of integer values)
- DoubleArray (comma-separated list of decimal values)
### Two ways that you can import CSV data:
1. Using the Neo4j Data Importer.
2. Writing Cypher code to perform the import.
### Steps for preparing for importing data
1. Understand the data in the source CSV files.
2. Inspect and clean (if necessary) the data in the source data files.
3. Create or understand the graph data model you will be implementing during the import.
# Understanding the Source Data
For CSV files, you must determine:
- Whether the CSV file will have header information, describing the names of the fields.
- What the delimiter will be for the fields in each row.
Including headers in the CSV file **reduces syncing** issues and is a recommended Neo4j best practice.
A Neo4j best practice is to **use an ID as a unique property value for each node**. If the IDs in your CSV file are not unique for the same entity (node), you will have problems when you load the data and try to create relationships between existing nodes.
### Inspecting the Data for Import
**Important**: By default all of these fields in each row will be read in as string types.
Use **FIELDTERMINATOR** if the delimiter is not the default ','
Test if all rows in the csv file can be read. For example:
```
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing/ratings.csv'
AS row
RETURN count(row)
```
### Is the data clean?
Check:
- Are quotes used correctly?
- If an element has no value will an empty string be used?
- Are UTF-8 prefixes used (for example \uc)?
- Do some fields have trailing spaces?
- Do the fields contain binary zeros?
- Understand how lists are formed (default is to use colon (:) as the separator).
- Any obvious typos?
# Overview of the Neo4j Data Importer
The benefit of the Data Importer is that you need not know Cypher to load the data.
It is useful for loading small to medium CSV files that contain fewer than 1M rows.
Data that is imported into the graph can be interpreted as string, integer, float, or boolean data.
Requirements for using the Data Importer
- You must use CSV files for import.
- CSV files must reside on your local system so you can load them into the graph app.
- CSV data must be clean.
- IDs must be unique for all nodes you will be creating.
- The CSV file must have headers.
- The DBMS must be started.
If you have de-normalized data, you will need to perform a multi-pass import. That is, you cannot create multiple nodes and relationship types from a single CSV file.
The Neo4j Data Importer can import or export mappings to a JSON file or to a ZIP file, if you also want to include the CSV files.

View File

@ -0,0 +1,121 @@
---
title: Refactoring Imported Data
updated: 2022-07-31 20:32:04Z
created: 2022-07-31 17:34:56Z
---
# Transforming String Properties to Dates
### Converting to Date values
date(property)
- correct date format, e.g. "yyyy-mm-dd"
- not empty
[Cypher: Temporal (Date/Time) values](https://neo4j.com/docs/cypher-manual/current/syntax/temporal/)
Test for empty string
```
MATCH (p:Person)
SET p.born = CASE p.born WHEN "" THEN null ELSE date(p.born) END
WITH p
SET p.died = CASE p.died WHEN "" THEN null ELSE date(p.died) END
```
List all stored **node** types in the database:
```
CALL apoc.meta.nodeTypeProperties()
```
List all stored **relation** types in the database:
```
CALL apoc.meta.relTypeProperties()
```
# Transforming Multi-value Properties
### Transform Strings to Lists
- list
- same type
Transform a string with a separator, e.g. "|", to a list
```
MATCH (m:Movie)
SET m.countries = split(coalesce(m.countries,""), "|")
```
Transform a multi-value property to a list of strings => StringArray in database
# Adding labels
Adding labels is a best practice so that key queries will perform better, especially when the graph is large.
```
MATCH (p:Person)-[:ACTED_IN]->()
WITH DISTINCT p SET p:Actor
```
# Refactoring Properties as Nodes
Increase performance
For unique properties, like id's, create indexes.
Best practice is to always have a unique ID for every type of node in the graph.
View defined constraints in database
```
SHOW CONSTRAINTS
```
Before using MERGE, first create a uniqueness constraint
```
CREATE CONSTRAINT Genre_name ON (g:Genre) ASSERT g.name IS UNIQUE
CREATE CONSTRAINT Genre_name IF NOT EXISTS ON (x:Genre) ASSERT x.name IS UNIQUE
```
Creating the new nodes from a node property
```
MATCH (m:Movie)
UNWIND m.genres AS genre
WITH m, genre
MERGE (g:Genre {name:genre})
MERGE (m)-[:IN_GENRE]->(g)
```
For example, UNWIND turns a list into rows:
```
unwind ['aap','olifant'] as a
return a
```
Removing a node property, set it to NULL
```
MATCH (m:Movie)
SET m.genres = null
```
Show the schema of a database
```
CALL db.schema.visualization
```
# Importing Large Datasets with Cypher
Data Importer can be used for small to medium datasets containing less than 1M rows
In Cypher, by default, the execution of your code is a single transaction. Breaking up the execution into multiple transactions reduces the amount of memory needed for the import. In Neo4j:
```
:auto USING PERIODIC COMMIT LOAD CSV ....
```
The advantage of performing the import in multiple passes is that you can check the graph after each import to see if it is getting closer to the data model. If the CSV file were extremely large, you might want to consider a single pass.

View File

@ -0,0 +1,343 @@
---
title: Syntax_examples
updated: 2022-07-18 09:37:40Z
created: 2021-05-04 14:58:11Z
---
## show database schema info
CALL db.schema.visualization()
CALL db.schema.relTypeProperties()
CALL db.schema.nodeTypeProperties()
CALL db.propertyKeys()
## syntax
MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2})
RETURN variable
## relationships
() // a node
()--() // 2 nodes have some type of relationship
()-[]-() // 2 nodes have some type of relationship
()-->() // the first node has a relationship to the second node
()<--() // the second node has a relationship to the first node
MATCH (node1)-[:REL_TYPE]->(node2)
RETURN node1, node2
MATCH (node1)-[:REL_TYPEA | REL_TYPEB]->(node2)
RETURN node1, node2
## show node with name "Tom Hanks"
MATCH (tom {name: "Tom Hanks"}) RETURN tom
## return Person nodes (by name, or all names)
MATCH (a:Person) WHERE a.name = "Tom" RETURN a
MATCH (a:Person) RETURN a.name
## with where clause
match (a:Movie)
where a.released >= 1990 and a.released < 1999
return a.title;
## a list of all properties that match a string
MATCH (n) WITH keys(n) AS p UNWIND p AS x WITH DISTINCT x WHERE x =~ ".*" RETURN collect(x) AS SET;
## delete all nodes and relations
MATCH (n)
DETACH DELETE n
## create
```cypher
create (:Person {name: 'jan', age: 32})
```
match(n:Person {age: 32}) return n
match(n:Person {age: 32})
create (n)-[:RELATIE]->(:Person {name:"klaas"})
MATCH (n:Person)
DETACH DELETE n
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(other:Person)
where toLower(p.name) =~ 'gene.*' and other.born IN [1950,1930]
and exists((other)-[:DIRECTED]->(m))
return m.title, other.name, other.born as YearBorn
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
where p.name = 'Tom Hanks'
with m, datetime().year - m.released as Ago, m.released - p.born as Age
where 20 <= Ago <= 33
return m.title, Ago
MATCH (m:Movie)
WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors
WHERE directors >= 2
OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m)
RETURN m.title, p.name
match (a:Person), (m:Movie), (b:Person)
where a.name = 'Liam Neeson'
and b.name = 'Benjamin Melniker'
and m.title = 'Batman Begins'
create (a)-[:ACTED_IN {roles: ['Rachel','Rachel Dawes']}]->(m)<-[:PRODUCED]-(b)
return a,m,b
MATCH (a:Person),(m:Movie)
WHERE a.name = 'Christian Bale' AND
m.title = 'Batman Begins' AND
NOT exists((a)-[:ACTED_IN]->(m))
CREATE (a)-[rel:ACTED_IN]->(m)
SET rel.roles = ['Bruce Wayne','Batman']
RETURN a, rel, m
MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie)
where m.title = 'Forrest Gump'
set rel.roles = case p.name
when 'Tom Hanks' then ['Forrest Gump']
when 'Robin Wright' then ['Jenny Curran']
when 'Gary Sinise' then ['Lieutenant Dan Taylor']
end
return p,rel,m
MATCH (p:Person)-[rel:HELPED]->(p2:Person)
where p.name = 'Tom Hanks' and p2.name = 'Gary Sinise'
set rel += {research:'war history'}
return p,rel,p2
merge (m:Movie {title:'Forrest Gump'})
on match set m.year = 1994
on match set m.tagline = 'Life is like a box of chocolates…you never know what youre gonna get.'
return m
merge (p:Movie {title:'Forrest Gump'})
on match set p:OlderMovie
return p
match (p:Person {name:'Robert Zemeckis'}), (m:Movie {title:'Forrest Gump'})
merge (p)-[r:DIRECTED]->(m)
return p,r,m
## constrain uniqueness
CREATE CONSTRAINT UniqueMovieTitleConstraint
ON (m:Movie)
ASSERT m.title IS UNIQUE
## constrain uniqueness over two properties
## only enterprise edition
CREATE CONSTRAINT UniqueNameBornConstraint
ON (p:Person)
ASSERT (p.name, p.born) IS NODE KEY
## needs enterprise edition of neo4j
create constraint PersonBornExistsConstraint on (p:Person)
assert exists(p.born)
## existence constraint (possible for node)
CREATE CONSTRAINT ExistsMovieTagline
ON (m:Movie)
ASSERT exists(m.tagline)
DROP CONSTRAINT MovieTitleConstraint
## existence constraint for relationship
## only enterprise edition of neo4j
CREATE CONSTRAINT ExistsREVIEWEDRating
ON ()-[rel:REVIEWED]-()
ASSERT exists(rel.rating)
## drop constraint
DROP CONSTRAINT ExistsREVIEWEDRating
CALL db.constraints() better SHOW CONSTRAINTS
## Indexes
## Single property index
CREATE INDEX MovieReleased FOR (m:Movie) ON (m.released)
## composite index
CREATE INDEX MovieReleasedVideoFormat
FOR (m:Movie)
ON (m.released, m.videoFormat)
## full-text schema index
CALL db.index.fulltext.createNodeIndex(
'MovieTitlePersonName',['Movie', 'Person'], ['title', 'name'])
### To use a full-text schema index, you must call the query procedure that uses the index.
CALL db.index.fulltext.queryNodes(
'MovieTitlePersonName', 'Jerry')
YIELD node, score
RETURN node.title, score
### Searching on a particular property
CALL db.index.fulltext.queryNodes(
'MovieTitlePersonName', 'name: Jerry') YIELD node
RETURN node
## drop index
DROP INDEX MovieReleasedVideoFormat
## dropping full-text schema index
CALL db.index.fulltext.drop('MovieTitlePersonName')
## search a full-text schema index
CALL db.index.fulltext.queryNodes('MovieTaglineFTIndex', 'real OR world')
YIELD node
RETURN node.title, node.tagline
## set parameters
:param year => 2000
:params {actorName: 'Tom Cruise', movieName: 'Top Gun'}
## for statement
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = $actorName AND m.title = $movieName
RETURN p, m
## clear
:params {}
## view
:params
## Analyzing queries
- EXPLAIN provides estimates of the graph engine processing that will occur, but does not execute the Cypher statement.
- PROFILE provides real profiling information for what has occurred in the graph engine during the query and executes the Cypher statement. (run-time performance metrics)
## Monitoring queries
:queries
## exercise
:params {year:2006, ratingValue:65}
match (p:Person)-[r:REVIEWED]->(m:Movie)<-[:ACTED_IN]-(a:Person)
where m.released = $year and r.rating = $ratingValue
return p.name, m.title, m.released, r.rating, collect(a.name)
:auto USING PERIODIC COMMIT LOAD CSV
commit every 1000 rows
Eager operators do not respect periodic commit, e.g.:
collect()
count()
ORDER BY
DISTINCT
LOAD CSV WITH HEADERS FROM
'https://data.neo4j.com/v4.0-intro-neo4j/directors.csv' AS row
MATCH (movie:Movie {id:toInteger(row.movieId)})
MATCH (person:Person {id: toInteger(row.personId)})
MERGE (person)-[:DIRECTED]->(movie)
ON CREATE SET person:Director
LOAD CSV WITH HEADERS
FROM 'http://data.neo4j.com/v4.0-intro-neo4j/actors.csv'
AS line
MERGE (actor:Person {name: line.name})
ON CREATE SET actor.born = toInteger(trim(line.birthYear)), actor.actorId = line.id
ON MATCH SET actor.actorId = line.id
## before load
CREATE CONSTRAINT UniqueMovieIdConstraint ON (m:Movie) ASSERT m.id IS UNIQUE;
## after load
CREATE INDEX MovieTitleIndex FOR (m:Movie) ON (m.title);
// Delete all constraints and indexes
CALL apoc.schema.assert({},{},true);
// Delete all nodes and relationships
CALL apoc.periodic.iterate(
'MATCH (n) RETURN n',
'DETACH DELETE n',
{ batchSize:500 }
)
## test apoc
CALL dbms.procedures()
YIELD name WHERE name STARTS WITH "apoc"
RETURN name
## Graph modelling
How does Neo4j support graph data modeling?
- allows you to create property graphs.
- traversing the graph: traversal means anchoring a query based upon a property value, then traversing the graph to satisfy the query
Nodes and relationships are the key components of a graph.
Nodes must have labels to categorize entities.
A label is used to categorize a set of nodes.
Relationships must have direction and type.
A relationship is only traversed once during a query.
Nodes and relationships can have properties.
Properties are used to provide specific values to a node or relationship.
## Your model must address Nodes:
- Uniqueness of nodes: always have a property (or set of properties) that uniquely identify a node.
- Complex data: balance between number of properties that represent complex data vs. multiple nodes and relationships.
super nodes = (a node with lots of fan-in or fan-out)
- Reduce property duplication (no repeating property values)
- Reduce gather-and-inspect (traversal)
## Best practices for modeling relationships
- Using specific relationship types.
- Reducing symmetric relationships.
- No semantically identical relationships (PARENT_OF and CHILD_OF)
- Not all mutual relationships are semantically symmetric(FOLLOWS)
- Using types vs. properties.
## Property best practices
In the case of property value complexity, it depends on how the property is used. Anchors and traversal paths that use property values need to be parsed at query time.
- Property lookups have a cost.
- Parsing a complex property adds more cost.
- Anchors and properties used for traversal should be as simple as possible.
- Identifiers, outputs, and decoration are OK as complex values.
## Hierarchy of accessibility
1. Anchor node label, indexed anchor node properties (cheap)
2. Relationship types (cheap)
3. Non-indexed anchor node properties
4. Downstream node labels
5. Relationship properties, downstream node properties
Downstream labels and properties are most expensive.
## Common graph structures used in modeling:
1. Intermediate nodes
- (solve hyperedge; n-ary relationships)
- sharing context (share contextual information)
- sharing data (deduplicate information)
- organizing data (avoid density of nodes)
2. Linked lists (useful whenever the sequence of objects matters)
- Interleaved linked list
- Head and tail of linked list (root point to head and tail)
- No double linked-lists (redundant symmetrical relationships)
3. Timeline trees
- use time as either an anchor or a navigational aid
- topmost node in the timeline is an “all time” node
- timeline trees consume a lot of space
4. Multiple structures in a single graph
CREATE (:Airport {code: "ABQ"})<-[:CONNECTED_TO {airline: "WN", flightNumber: 500, date: "2019-1-3", departure: 1445, arrival: 1710}]-(:Airport {code: "LAS"})-[:CONNECTED_TO {airline: "WN", flightNumber: 82, date: "2019-1-3", departure: 1715, arrival: 1820}]->(:Airport {code: "LAX"})
LOAD CSV WITH HEADERS FROM 'file:///flights_2019_1k.csv' AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
airline: row.UniqueCarrier,
flightNumber: row.FlightNum,
date: toInteger(row.Year) + '-' + toInteger(row.Month) + '-' + toInteger(row.DayofMonth)}]->(destination)
ON CREATE SET connection.departure = toInteger(row.CRSDepTime), connection.arrival = toInteger(row.CRSArrTime)


@ -0,0 +1,17 @@
---
title: 'SQL vs NoSQL: 5 Critical Differences'
updated: 2022-05-24 19:33:07Z
created: 2022-05-24 19:31:42Z
---
# SQL vs NoSQL: 5 Critical Differences
The five critical differences between SQL vs NoSQL are:
1. SQL databases are relational, NoSQL databases are non-relational.
2. SQL databases use structured query language and have a predefined schema. NoSQL databases have dynamic schemas for unstructured data.
3. SQL databases are vertically scalable, while NoSQL databases are horizontally scalable.
4. SQL databases are table-based, while NoSQL databases are document, key-value, graph, or wide-column stores.
5. SQL databases are better for multi-row transactions, while NoSQL is better for unstructured data like documents or JSON.
[source](https://www.integrate.io/blog/the-sql-vs-nosql-difference/)


@ -0,0 +1,635 @@
---
title: 'Course 1: Neural Networks and Deep Learning'
updated: 2022-05-22 11:18:21Z
created: 2022-05-16 17:52:57Z
---
# Course 1: Neural Networks and Deep Learning
- [Course 1: Neural Networks and Deep Learning](#course-1-neural-networks-and-deep-learning)
- [Week 1: Introduction to Deep Learning](#week-1-introduction-to-deep-learning)
- [Learning Objectives](#learning-objectives)
- [Introduction to Deep Learning](#introduction-to-deep-learning)
- [What is a neural network](#what-is-a-neural-network)
- [Supervised learning with neural networks](#supervised-learning-with-neural-networks)
- [Why is deep learning taking off](#why-is-deep-learning-taking-off)
- [Week 2: Neural Networks Basics](#week-2-neural-networks-basics)
- [Learning Objectives](#learning-objectives-1)
- [Logistic Regression as a Neural Network](#logistic-regression-as-a-neural-network)
- [Binary Classification](#binary-classification)
- [Logistic Regression](#logistic-regression)
- [Logistic Regression Cost Function](#logistic-regression-cost-function)
- [Gradient Descent](#gradient-descent)
- [Derivatives](#derivatives)
- [Computation Graph](#computation-graph)
- [Derivatives with a Computation Graph](#derivatives-with-a-computation-graph)
- [Logistic Regression Gradient Descent](#logistic-regression-gradient-descent)
- [Gradient Descent on m Examples](#gradient-descent-on-m-examples)
- [Derivation of dL/dz](#derivation-of-dldz)
- [Python and Vectorization](#python-and-vectorization)
- [Vectorization](#vectorization)
- [Vectorizing Logistic Regression](#vectorizing-logistic-regression)
- [Broadcasting in Python](#broadcasting-in-python)
- [A note on python/numpy vectors](#a-note-on-pythonnumpy-vectors)
- [Quick tour of Jupyter/iPython Notebooks](#quick-tour-of-jupyteripython-notebooks)
- [Explanation of logistic regression cost function (optional)](#explanation-of-logistic-regression-cost-function-optional)
- [Week 3: Shallow Neural Networks](#week-3-shallow-neural-networks)
- [Learning Objectives](#learning-objectives-2)
- [Shallow Neural Network](#shallow-neural-network)
- [Neural Networks Overview](#neural-networks-overview)
- [Neural Network Representation](#neural-network-representation)
- [Computing a Neural Network's Output](#computing-a-neural-networks-output)
- [Vectorizing across multiple examples](#vectorizing-across-multiple-examples)
- [Activation functions](#activation-functions)
- [Why do you need non-linear activation functions](#why-do-you-need-non-linear-activation-functions)
- [Derivatives of activation functions](#derivatives-of-activation-functions)
- [Gradient descent for Neural Networks](#gradient-descent-for-neural-networks)
- [Random initialization](#random-initialization)
- [Week 4: Deep Neural Networks](#week-4-deep-neural-networks)
- [Learning Objectives](#learning-objectives-3)
- [Deep Neural Network](#deep-neural-network)
- [Deep L-layer neural network](#deep-l-layer-neural-network)
- [Forward Propagation in a deep network](#forward-propagation-in-a-deep-network)
- [Getting your matrix dimensions right](#getting-your-matrix-dimensions-right)
- [Why deep representations](#why-deep-representations)
- [Building blocks of deep neural networks](#building-blocks-of-deep-neural-networks)
- [Forward and Backward Propagation](#forward-and-backward-propagation)
- [Parameters vs Hyperparameters](#parameters-vs-hyperparameters)
- [What does this have to do with the brain](#what-does-this-have-to-do-with-the-brain)
## Week 1: Introduction to Deep Learning
### Learning Objectives
- Discuss the major trends driving the rise of deep learning
- Explain how deep learning is applied to supervised learning
- List the major categories of models (CNNs, RNNs, etc.), and when they should be applied
- Assess appropriate use cases for deep learning
### Introduction to Deep Learning
#### What is a neural network
It is a powerful learning algorithm inspired by how the brain works. Here is a definition from [mathworks](https://www.mathworks.com/discovery/neural-network.html):
![neural-network](../_resources/neural-network.svg "Neural Network")
*image source: [mathworks](https://www.mathworks.com/discovery/neural-network.html)*
> A neural network (also called an artificial neural network) is an adaptive system that learns by using interconnected nodes or neurons in a layered structure that resembles a human brain. A neural network can learn from data—so it can be trained to recognize patterns, classify data, and forecast future events.
>
> A neural network breaks down the input into layers of abstraction. It can be trained using many examples to recognize patterns in speech or images, for example, just as the human brain does. Its behavior is defined by the way its individual elements are connected and by the strength, or weights, of those connections. These weights are automatically adjusted during training according to a specified learning rule until the artificial neural network performs the desired task correctly.
>
> A neural network combines several processing layers, using simple elements operating in parallel and inspired by biological nervous systems. It consists of an input layer, one or more hidden layers, and an output layer. In each layer there are several nodes, or neurons, with each layer using the output of the previous layer as its input, so neurons interconnect the different layers. Each neuron typically has weights that are adjusted during the learning process, and as the weight decreases or increases, it changes the strength of the signal of that neuron.
#### Supervised learning with neural networks
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
*Examples of supervised learning applications*:
| Input(X) | Output(y) | Application |
| --- | --- | --- |
| Home features | Price | Real Estate |
| Ad, user info | Click on ad? (0/1) | Online Advertising |
| Image | Object (1,...,1000) | Photo tagging |
| Audio | Text transcript | Speech recognition |
| English | Chinese | Machine translation |
| Image, Radar info | Position of other cars | Autonomous driving |
*Structured vs unstructured data*
- Structured data refers to things that have a defined meaning, such as price or age.
- Unstructured data refers to things like pixels, raw audio, and text.
#### Why is deep learning taking off
Deep learning is taking off thanks to the large amount of data made available by the digitization of society, faster computation, and innovation in neural network algorithms.
*Two things have to be considered to get to the high level of performance*:
1. Being able to train a big enough neural network
2. Huge amount of labeled data
## Week 2: Neural Networks Basics
### Learning Objectives
- Build a logistic regression model structured as a shallow neural network
- Build the general architecture of a learning algorithm, including parameter initialization, cost function and gradient calculation, and optimization implementation (gradient descent)
- Implement computationally efficient and highly vectorized versions of models
- Compute derivatives for logistic regression, using a backpropagation mindset
- Use Numpy functions and Numpy matrix/vector operations
- Work with iPython Notebooks
- Implement vectorization across multiple training examples
### Logistic Regression as a Neural Network
#### Binary Classification
Week 2 focuses on the basics of neural network programming, especially some important techniques, such as how to deal with m training examples in the computation and how to implement forward and backward propagation. To illustrate this process step by step, Andrew Ng spends a lot of time explaining how logistic regression is implemented for binary classification, here a Cat vs. Non-Cat classifier, which takes an image as input and outputs a label indicating whether the image is a cat (label 1) or not (label 0).
An image is stored in the computer as three separate matrices corresponding to the red, green, and blue color channels of the image. Each matrix has the same size as the image: for example, if the resolution of the cat image is 64 pixels x 64 pixels, the three matrices (RGB) are 64 by 64 each. To create a feature vector, x, the pixel intensity values of each color are “unrolled” or “reshaped” into a single column. The dimension of the input feature vector x is ![n_x=64\times64\times3=12288](../_resources/n_x_64_64_3_12288.svg).
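The unrolling step can be sketched in NumPy. This is a minimal illustration: the random `image` array is a hypothetical stand-in for a real picture, and the exact order in which pixels are unrolled is an assumption (any order works as long as it is used consistently for every image).

```python
import numpy as np

# Hypothetical stand-in for a 64x64 RGB image: random intensities in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))

# Unroll all pixel values into a single column vector of shape (n_x, 1).
x = image.reshape(-1, 1)
n_x = x.shape[0]
print(x.shape)
```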
#### Logistic Regression
> Logistic regression is useful for situations in which you want to be able to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model. Logistic regression is applicable to a broader range of research situations than discriminant analysis. (from [ibm knowledge center](https://www.ibm.com/support/knowledgecenter/en/SSLVMB_26.0.0/statistics_mainhelp_ddita/spss/regression/idh_lreg.html))
A detailed guide on [Logistic Regression for Machine Learning](https://machinelearningmastery.com/logistic-regression-for-machine-learning/) by Jason Brownlee is the best summary of this topic for data science engineers.
Andrew Ng's course on Logistic Regression here focuses more on LR as the simplest neural network, as its programming implementation is a good starting point for the deep neural networks that will be covered later.
#### Logistic Regression Cost Function
In logistic regression, we want to train the parameters `w` and `b`, so we need to define a cost function.
![lr-eqn](../_resources/lr-eqn.svg), where ![lr-sigmoid](../_resources/lr-sigmoid.svg)
Given ![lr-input](../_resources/lr-input.svg), we want ![lr-target](../_resources/lr-target.svg)
The loss function measures the discrepancy between the prediction (ŷ(i)) and the desired output (y(i)). In other words, the loss function computes the error for a single training example.
![lr-loss-function](../_resources/lr-loss-function.png)
The cost function is the average of the loss function over the entire training set. We are going to find the parameters `w` and `b` that minimize the overall cost function.
![lr-cost-function](../_resources/lr-cost-function.png)
The loss function measures how well the model is doing on the single training example, whereas the cost function measures how well the parameters w and b are doing on the entire training set.
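The difference between loss and cost can be sketched as follows. The formulas in these notes are embedded as images, so this sketch assumes the standard cross-entropy loss, which matches the loop implementation in the "Gradient Descent on m Examples" section; the function names and sample values are illustrative only.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example (elementwise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost: the mean of the per-example losses over the training set."""
    return np.mean(loss(y_hat, y))

# Hypothetical labels and predictions for three training examples.
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])
J = cost(y_hat, y)
print(loss(y_hat, y), J)
```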
#### Gradient Descent
As you go through any course on machine learning or deep learning, gradient descent is the concept that comes up most often. It is used to train a wide range of models and is easy to understand and implement.
The goal of training is to minimize the cost function, usually starting from randomly initialized parameters and using gradient descent with the following main steps. Random initialization of parameters is not necessary in logistic regression (zero initialization is fine), but it is necessary in multilayer neural networks.
1. Calculate the cost and gradient for the given training set of (x,y) with the current parameters w and b.
2. Update parameters w and b with a pre-set learning rate:
w\_new = w\_old - learning\_rate * gradient\_of\_J\_at(w\_old)
Repeat these steps until the cost function reaches its minimum.
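The update rule can be tried on a toy problem. The sketch below is not the course's code: it minimizes a hypothetical one-dimensional cost J(w) = (w - 3)^2, whose gradient is 2(w - 3), and the function name and default values are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, num_iterations=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(num_iterations):
        w = w - learning_rate * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so the iterate approaches 3 quickly.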
The fancy image below comes from [analytics vidhya](https://www.analyticsvidhya.com/blog/2020/10/what-does-gradient-descent-actually-mean/).
![gradient-descent](../_resources/gradient-descent.jpeg)
#### Derivatives
Derivatives are crucial in backpropagation during neural network training, which uses the concept of computational graphs and the chain rule of derivatives to make the computation of thousands of parameters in neural networks more efficient.
#### Computation Graph
A nice illustration by [colah's blog](https://colah.github.io/posts/2015-08-Backprop/) can help better understand.
Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression e=(a+b)(b+1). There are three operations: two additions and one multiplication. To help us talk about this, let's introduce two intermediary variables, c and d, so that every function's output has a variable. We now have:
```
c=a+b
d=b+1
e=cd
```
To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node's value is the input to another node, an arrow goes from one to another.
![tree-def](../_resources/tree-def.png)
#### Derivatives with a Computation Graph
If one wants to understand derivatives in a computational graph, the key is to understand derivatives on the edges. If a directly affects c, then we want to know how it affects c. If a changes a little bit, how does c change? We call this the partial derivative of c with respect to a.
![tree-eval-derivs](../_resources/tree-eval-derivs.png)
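The edge derivatives for the example e=(a+b)(b+1) can be computed by hand in a few lines. This is a minimal sketch with hypothetical input values a=2 and b=1; it applies the chain rule by summing over every path from an input to the output.

```python
# Forward pass through the graph.
a, b = 2.0, 1.0
c = a + b        # c = 3
d = b + 1        # d = 2
e = c * d        # e = 6

# Backward pass: derivatives on the edges into e.
de_dc = d        # e = c * d, so de/dc = d
de_dd = c        # and de/dd = c

# Chain rule: a reaches e only through c; b reaches e through both c and d.
de_da = de_dc * 1.0
de_db = de_dc * 1.0 + de_dd * 1.0
print(e, de_da, de_db)
```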
#### Logistic Regression Gradient Descent
Andrew worked through the logistic regression gradient descent computation using the computation graph to get us familiar with computation graph ideas for neural networks.
#### Gradient Descent on m Examples
The cost function is computed as the average of the `m` individual loss values, so the gradient with respect to each parameter is also calculated as the mean of the `m` per-example gradients.
The calculation can be done in a loop over the `m` examples.
```python
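import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumes x has shape (m, n), y has shape (m,), w has shape (n,) and b is a scalar.
J = 0
dw = np.zeros(n)
db = 0
for i in range(m):
    z_i = np.dot(w, x[i]) + b      # linear part for example i
    a_i = sigmoid(z_i)             # prediction for example i
    J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
    dz_i = a_i - y[i]
    # inner loop for n features, later will be optimized by vectorization
    for j in range(n):
        dw[j] += x[i][j] * dz_i
    db += dz_i
J = J / m
dw = dw / m
db = db / m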
```
After gradient computation, we can update parameters with a learning rate `alpha`.
```
# vectorization should also be applied here
for j in range(n):
    w[j] = w[j] - alpha * dw[j]
b = b - alpha * db
```
As you can see above, to update the parameters by one step, we have to go through all `m` examples. This will be mentioned again in later videos. Stay tuned!
#### Derivation of dL/dz
You may be wondering why `dz=a-y` in the above code is calculated this way and where it comes from. Here is a [detailed derivation process of dl/dz](https://www.coursera.org/learn/neural-networks-deep-learning/discussions/weeks/2/threads/ysF-gYfISSGBfoGHyLkhYg) on discussion forum.
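In short, the result follows from the chain rule: dL/dz = dL/da · da/dz, where dL/da = -y/a + (1-y)/(1-a) and da/dz = a(1-a), and the product simplifies to a - y. The identity can also be checked numerically. The sketch below assumes the standard cross-entropy loss with a sigmoid activation; the helper names and test values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    """Cross-entropy loss written as a function of the pre-activation z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for one training example.
z, y, eps = 0.7, 1.0, 1e-6

# Central finite difference approximation of dL/dz.
numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y      # the claimed closed form: dz = a - y
print(numeric, analytic)
```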
### Python and Vectorization
#### Vectorization
Both GPUs and CPUs have parallelization instructions. They're sometimes called SIMD instructions, which stands for single instruction, multiple data. The rule of thumb to remember is: whenever possible, avoid using explicit for loops.
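A quick way to see the effect is to time a dot product computed both ways. This is a rough sketch, not a benchmark: the array size is arbitrary and the measured times depend on the machine.

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.random(n)
v = rng.random(n)

# Explicit for loop.
start = time.perf_counter()
loop_result = 0.0
for i in range(n):
    loop_result += u[i] * v[i]
loop_seconds = time.perf_counter() - start

# Vectorized version.
start = time.perf_counter()
vec_result = np.dot(u, v)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.6f}s")
```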
#### Vectorizing Logistic Regression
If we stack all `m` examples of `x`, we get an input matrix `X` with each column representing an example. With numpy's built-in vectorization, the gradient descent calculation above reduces to a few lines of code, which makes it considerably more efficient.
```
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dz = A - Y
# in contrast to the inner loop above, vectorization is used here to boost computation
dw = 1/m * np.dot(X, dz.T)
db = 1/m * np.sum(dz)
```
Update parameters:
```
w = w - alpha * dw
b = b - alpha * db
```
#### Broadcasting in Python
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. More detailed examples on [numpy.org](https://numpy.org/doc/stable/user/theory.broadcasting.html#array-broadcasting-in-numpy).
![theory.broadcast_2.gif](../_resources/theory.broadcast_2.gif)
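A small example of the rule: dividing a `(3,4)` matrix by a `(1,4)` row of column sums. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical (3, 4) matrix of values.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])

col_sums = A.sum(axis=0, keepdims=True)   # shape (1, 4)

# (3, 4) / (1, 4): the row of sums is broadcast across all three rows.
percentages = 100 * A / col_sums
print(percentages)
```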
#### A note on python/numpy vectors
Do not use rank 1 arrays:
```
# an example of rank 1 array
a = np.random.randn(5)
a.shape
# (5,)
```
Instead, we should use these:
```
a = np.random.randn(5,1)
a = np.random.randn(1,5)
```
Or, just reshape the first case if necessary:
```
a = a.reshape(5,1)
a.shape
# (5,1)
```
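The reason for the advice is that rank 1 arrays behave neither like row vectors nor like column vectors. A short sketch of the difference:

```python
import numpy as np

a = np.random.randn(5)          # rank 1 array, shape (5,)
inner = np.dot(a, a.T)          # transposing does nothing, so this is a scalar

col = a.reshape(5, 1)           # explicit column vector, shape (5, 1)
outer = np.dot(col, col.T)      # (5, 1) x (1, 5) gives a (5, 5) matrix

print(a.T.shape, np.ndim(inner), outer.shape)
```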
#### Quick tour of Jupyter/iPython Notebooks
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular: write plugins that add new components and integrate with existing ones.
See [jupyter.org](https://jupyter.org/)
```
pip install jupyterlab
```
To open jupyter lab, run
```
jupyter-lab
```
#### Explanation of logistic regression cost function (optional)
To summarize, by minimizing the cost function J(w,b) we are really carrying out maximum likelihood estimation with the logistic regression model, because minimizing the loss corresponds to maximizing the log of the probability.
![prob](../_resources/prob-conditional.svg)
![prob-cost](../_resources/prob-cost.svg)
## Week 3: Shallow Neural Networks
### Learning Objectives
- Describe hidden units and hidden layers
- Use units with a non-linear activation function, such as tanh
- Implement forward and backward propagation
- Apply random initialization to your neural network
- Increase fluency in Deep Learning notations and Neural Network Representations
- Implement a 2-class classification neural network with a single hidden layer
### Shallow Neural Network
#### Neural Networks Overview
This is a simple 2-layer neural network (just one hidden layer)
![neural-network-2-layer](../_resources/neural-network-2-layer.png)
Using a computation graph, the forward computation looks like this.
![neural-network-2-layer-forward](../_resources/neural-network-2-layer-forward.png)
#### Neural Network Representation
A neural network consists of three types of layers: input layer, hidden layer and output layer. Input layer is not counted in the number of layers of one neural network. When we talk about training a neural network, basically we are training parameters associated with the hidden layers and the output layer.
- Input layer: input features (x1, x2, x3, ...) stack up vertically
- Hidden layer(s): values for the nodes are not observed
- Output layer: responsible for generating the predicted value
![nn-representation](../_resources/nn-representation.png)
#### Computing a Neural Network's Output
![nn-computation](../_resources/nn-computation.png)
In the above example, `z[1]` is the result of linear computation of the input values and the parameters of the hidden layer and `a[1]` is the activation as a sigmoid function of `z[1]`.
Generally, in a two-layer neural network, if we have `nx` features of input `x` and `n1` neurons of hidden layer and one output value, we have the following dimensions of each variable. Specifically, we have `nx=3, n1=4` in the above network.
| variable | shape | description |
| --- | --- | --- |
| `x` | `(nx,1)` | input value with `nx` features |
| `W[1]` | `(n1,nx)` | weight matrix of first layer, i.e., hidden layer |
| `b[1]` | `(n1,1)` | bias terms of hidden layer |
| `z[1]` | `(n1,1)` | result of linear computation of hidden layer |
| `a[1]` | `(n1,1)` | activation of hidden layer |
| `W[2]` | `(1,n1)` | weight matrix of second layer, i.e., output layer here |
| `b[2]` | `(1,1)` | bias terms of output layer |
| `z[2]` | `(1,1)` | result of linear computation of output layer |
| `a[2]` | `(1,1)` | activation of output layer, i.e., output value |
We should compute `z[1], a[1], z[2], a[2]` for each example `i` of `m` examples:
```
for i in range(m):
    z[1][i] = np.dot(W[1], x[i]) + b[1]
    a[1][i] = sigmoid(z[1][i])
    z[2][i] = np.dot(W[2], a[1][i]) + b[2]
    a[2][i] = sigmoid(z[2][i])
```
#### Vectorizing across multiple examples
Having become familiar with vectorization and broadcasting in logistic regression, we can apply the same method to neural network training. The computation inevitably has to go through all `m` input examples, so stacking them together is a good idea. The vectorized variables below differ only slightly from the single-example ones.
| variable | shape | description |
| --- | --- | --- |
| `X` | `(nx,m)` | input values with `nx` features |
| `W[1]` | `(n1,nx)` | weight matrix of first layer, i.e., hidden layer |
| `b[1]` | `(n1,1)` | bias terms of hidden layer |
| `Z[1]` | `(n1,m)` | results of linear computation of hidden layer |
| `A[1]` | `(n1,m)` | activations of hidden layer |
| `W[2]` | `(1,n1)` | weight matrix of second layer, i.e., output layer here |
| `b[2]` | `(1,1)` | bias terms of output layer |
| `Z[2]` | `(1,m)` | results of linear computation of output layer |
| `A[2]` | `(1,m)` | activations of output layer, i.e., output values |
Now we can compute `Z[1], A[1], Z[2], A[2]` all at once.
```
Z[1] = np.dot(W[1], X) + b[1]
A[1] = sigmoid(Z[1])
Z[2] = np.dot(W[2], A[1]) + b[2]
A[2] = sigmoid(Z[2])
```
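The vectorized forward pass can be written as runnable NumPy code to check the shapes. This is a sketch under assumptions: the variable names `W1`, `b1`, `W2`, `b2` replace the bracket notation, the input data is random, and the sizes `nx=3, n1=4, m=5` are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nx, n1, m = 3, 4, 5                       # features, hidden units, examples

X = rng.standard_normal((nx, m))          # one example per column
W1 = rng.standard_normal((n1, nx)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((1, n1)) * 0.01
b2 = np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                   # b1 is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```

Each column of `A2` equals the output computed for the corresponding single example, which is what makes the stacked form equivalent to the loop.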
#### Activation functions
So far, we know that a non-linear function is applied in the output step of each layer. There are several commonly used activation functions.
| activation | formula | graph | description |
| --- | --- | --- | --- |
| sigmoid | ![a=1/(1+np.exp(-z))](../_resources/sigmoid-latex.svg) | ![sigmoid.png](../_resources/sigmoid.png) | also called logistic activation function, looks like an S-shape; choose sigmoid if your output value is between 0 and 1 |
| tanh | ![a=(np.exp(z)-np.exp(-z))/(np.exp(z)+np.exp(-z))](../_resources/tanh-latex.svg) | ![tanh.png](../_resources/tanh.png) | tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer |
| ReLU | `a=max(0,z)` | ![relu.png](../_resources/relu.png) | rectified linear unit, the most widely used activation function |
| Leaky ReLU | `a=max(0.01z,z)` | ![leaky-relu.png](../_resources/leaky-relu.png) | an improved version of ReLU, 0.01 can be a parameter |
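The four activations in the table can be written directly in NumPy. This is a minimal sketch; the `slope` parameter name for Leaky ReLU is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # slope is the small factor applied to negative inputs
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```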
#### Why do you need non-linear activation functions
> If we only allow linear activation functions in a neural network, the output will just be a linear transformation of the input, which is not enough to form a universal function approximator. Such a network can just be represented as a matrix multiplication, and you would not be able to obtain very interesting behaviors from such a network.
A good explanation on [Stack Overflow](https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net).
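The collapse into a single matrix multiplication can be verified numerically. The sketch below uses random weights and an identity (linear) activation in the hidden layer; the names `W_eq` and `b_eq` are hypothetical labels for the equivalent one-layer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))
X = rng.standard_normal((3, 5))

# Two layers with a linear activation in between.
two_layer = np.dot(W2, np.dot(W1, X) + b1) + b2

# The same function expressed as a single linear layer.
W_eq = np.dot(W2, W1)
b_eq = np.dot(W2, b1) + b2
one_layer = np.dot(W_eq, X) + b_eq

print(np.allclose(two_layer, one_layer))
```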
#### Derivatives of activation functions
| activation | formula | derivative |
| --- | --- | --- |
| sigmoid | ![a=1/(1+np.exp(-z))](../_resources/sigmoid-latex.svg) | a(1-a) |
| tanh | ![a=(np.exp(z)-np.exp(-z))/(np.exp(z)+np.exp(-z))](../_resources/tanh-latex.svg) | 1-a^2 |
| ReLU | `a=max(0,z)` | 0 if z&lt;0; 1 if z&gt;=0 |
| Leaky ReLU | `a=max(0.01z,z)` | 0.01 if z&lt;0; 1 if z&gt;=0 |
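The derivatives in the table can be sketched as functions of `z` and compared with a finite difference. The function names are illustrative, and the choice of derivative 1 at `z=0` for ReLU follows the table's convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2

def relu_prime(z):
    return np.where(z < 0, 0.0, 1.0)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)

# Compare the sigmoid derivative with a central finite difference.
z0, eps = 0.3, 1e-6
numeric = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2 * eps)
print(numeric, sigmoid_prime(z0))
```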
#### Gradient descent for Neural Networks
Again we have a single hidden layer in our neural network. This section focuses on the equations we need to implement to get backpropagation and gradient descent working. Suppose we have nx input features, n1 hidden units and n2 output units. In the previous vectorization section n2 equals one. Here we use a more general representation to give ourselves a smoother transition to the next week of the course.
*Variables*:
| variable | shape | description |
| --- | --- | --- |
| `X` | `(nx,m)` | input values with `nx` features |
| `Z[1]` | `(n1,m)` | results of linear computation of hidden layer |
| `A[1]` | `(n1,m)` | activations of hidden layer |
| `Z[2]` | `(n2,m)` | results of linear computation of output layer |
| `A[2]` | `(n2,m)` | activations of output layer, i.e., output values |
*Parameters*:
| variable | shape | description |
| --- | --- | --- |
| `W[1]` | `(n1,nx)` | weight matrix of first layer, i.e., hidden layer |
| `b[1]` | `(n1,1)` | bias terms of hidden layer |
| `W[2]` | `(n2,n1)` | weight matrix of second layer, i.e., output layer here |
| `b[2]` | `(n2,1)` | bias terms of output layer |
*Forward propagation* computes all the variable values of each layer, which will also be used in the backpropagation computation.
```
Z[1] = np.dot(W[1], X) + b[1]
A[1] = sigmoid(Z[1])
Z[2] = np.dot(W[2], A[1]) + b[2]
A[2] = sigmoid(Z[2])
```
*Backpropagation* computes the derivatives of parameters by the chain rule.
```
# backpropagation
dZ[2] = A[2] - Y # get this with combination of the derivative of cost function and g'[2]
dW[2] = 1/m * np.matmul(dZ[2], A[1].T)
db[2] = 1/m * np.sum(dZ[2], axis=1, keepdims=True)
dZ[1] = np.multiply(np.matmul(W[2].T, dZ[2]), g'[1](Z[1])) # derivative of activation is used here
dW[1] = 1/m * np.matmul(dZ[1], X.T)
db[1] = 1/m * np.sum(dZ[1], axis=1, keepdims=True)
# update parameters
W[1] = W[1] - learning_rate * dW[1]
b[1] = b[1] - learning_rate * db[1]
W[2] = W[2] - learning_rate * dW[2]
b[2] = b[2] - learning_rate * db[2]
```
*Repeat* forward propagation and backpropagation a lot of times until the parameters look like they're converging.
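Putting the pieces together, the whole loop might look like the sketch below. It is an illustration under assumptions, not the course's reference implementation: the hidden layer uses sigmoid (so `g'[1](Z[1])` becomes `A1 * (1 - A1)`), the toy data set is synthetic, the cost assumes a single output unit, and the function name, learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n1=4, learning_rate=0.5, num_iterations=3000, seed=0):
    """Gradient descent for a network with one sigmoid hidden layer."""
    rng = np.random.default_rng(seed)
    nx, m = X.shape
    n2 = Y.shape[0]
    W1 = rng.standard_normal((n1, nx)) * 0.01
    b1 = np.zeros((n1, 1))
    W2 = rng.standard_normal((n2, n1)) * 0.01
    b2 = np.zeros((n2, 1))
    costs = []
    for _ in range(num_iterations):
        # forward propagation
        Z1 = np.dot(W1, X) + b1
        A1 = sigmoid(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)
        # cross-entropy cost (mean over examples; assumes n2 == 1)
        costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))
        # backpropagation
        dZ2 = A2 - Y
        dW2 = np.dot(dZ2, A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)   # sigmoid derivative
        dW1 = np.dot(dZ1, X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # update parameters
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return (W1, b1, W2, b2), costs

# Synthetic toy data: label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)

params, costs = train(X, Y)
print(f"initial cost: {costs[0]:.4f}, final cost: {costs[-1]:.4f}")
```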
#### Random initialization
Initialization of parameters:
```
W[1] = np.random.randn(n1, nx) * 0.01 # randomized small numbers
b[1] = np.zeros((n1,1)) # zeros is fine for bias terms
W[2] = np.random.randn(n2, n1) * 0.01
b[2] = np.zeros((n2,1))
```
*Why randomized initialization?*
In order to break the symmetry for hidden layers.
> Imagine that you initialize all weights to the same value (e.g. zero or one). In this case, each hidden unit will get exactly the same signal. E.g. if all weights are initialized to 1, each unit gets signal equal to sum of inputs (and outputs sigmoid(sum(inputs))). If all weights are zeros, which is even worse, every hidden unit will get zero signal. No matter what was the input - if all weights are the same, all units in hidden layer will be the same too.
See some interesting discussion on [Stack Overflow](https://stackoverflow.com/questions/20027598/why-should-weights-of-neural-networks-be-initialized-to-random-numbers).
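The symmetry argument can be checked numerically. In the sketch below, all weights start at the same hypothetical value (0.5), the data is random, and sigmoid is assumed for both layers; every hidden unit then receives exactly the same gradient, so the units can never become different from each other.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 10
X = rng.standard_normal((3, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

# Identical (non-random) initialization for every weight.
W1 = np.full((4, 3), 0.5)
b1 = np.zeros((4, 1))
W2 = np.full((1, 4), 0.5)
b2 = np.zeros((1, 1))

# One forward and backward pass.
A1 = sigmoid(np.dot(W1, X) + b1)
A2 = sigmoid(np.dot(W2, A1) + b2)
dZ2 = A2 - Y
dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)
dW1 = np.dot(dZ1, X.T) / m

# Every row of dW1 (one row per hidden unit) is the same.
print(dW1)
```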
*Why small numbers?*
This applies to the sigmoid and tanh activation functions. If the weights are initially large, we are more likely to get large values of `z` from `z=wx+b`. On the graph of the sigmoid (or tanh) function, the slope at large `z` is very close to zero, which slows down learning because the parameters are updated by only a very small amount each time.
## Week 4: Deep Neural Networks
### Learning Objectives
- Describe the successive block structure of a deep neural network
- Build a deep L-layer neural network
- Analyze matrix and vector dimensions to check neural network implementations
- Use a cache to pass information from forward to back propagation
- Explain the role of hyper-parameters in deep learning
### Deep Neural Network
#### Deep L-layer neural network
Technically logistic regression is a 1-layer neural network. Deep neural networks, with more layers, can learn functions that shallower models are often unable to.
Here `L` denotes the number of layers in a deep neural network. Some notations:
| notation | description |
| --- | --- |
| `n[0]` | number of neurons in the input layer |
| `n[l]` | number of neurons in the `lth` layer, `l` from 1 to L |
| `W[l]` | weights of the l-layer of shape `(n[l], n[l-1])` |
| `b[l]` | bias term of the l-layer of shape `(n[l], 1)` |
| `Z[l]` | affine result of the l-layer of shape `(n[l], m)`, `Z[l]=W[l]A[l-1]+b[l]` |
| `g[l]` | activation function of the l-layer |
| `A[l]` | activation output of the l-layer of shape `(n[l], m)`, `A[l]=g[l](Z[l])` |
#### Forward Propagation in a deep network
With `A[0]=X`, forward propagation is generalized as:
```
Z[l] = np.matmul(W[l], A[l-1]) + b[l]
A[l] = g[l](Z[l]) # g[l] is the activation function of layer l
```
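The generalized step above can be sketched as a loop over layers. This is only an illustration: the dictionary keys (`"W1"`, `"b1"`, ...), the function name `forward`, and the choice of ReLU for hidden layers with a sigmoid output are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params, L):
    """Propagate A[0] = X through L layers (ReLU hidden, sigmoid output: an assumption)."""
    A = X
    caches = []
    for l in range(1, L + 1):
        Z = params["W" + str(l)] @ A + params["b" + str(l)]
        caches.append((A, Z))              # A[l-1] and Z[l], needed later by backprop
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

rng = np.random.default_rng(0)
layer_dims = [4, 5, 3, 1]                  # n[0], n[1], n[2], n[3]
L = len(layer_dims) - 1
params = {}
for l in range(1, L + 1):
    params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((4, 7))            # m = 7 examples stored as columns
AL, caches = forward(X, params, L)
print(AL.shape)
```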
*Backpropagation* computes the derivatives of the parameters by the chain rule. The block below shows it for a 2-layer network:
```
# backpropagation
dZ[2] = A[2] - Y # get this with combination of the derivative of cost function and g'[2]
dW[2] = 1/m * np.matmul(dZ[2], A[1].T)
db[2] = 1/m * np.sum(dZ[2], axis=1, keepdims=True)
dZ[1] = np.multiply(np.matmul(W[2].T, dZ[2]), g'[1](Z[1])) # derivative of activation is used here
dW[1] = 1/m * np.matmul(dZ[1], X.T)
db[1] = 1/m * np.sum(dZ[1], axis=1, keepdims=True) # sum over examples only, so db[1] keeps shape (n1, 1)
# update parameters
W[1] = W[1] - learning_rate * dW[1]
b[1] = b[1] - learning_rate * db[1]
W[2] = W[2] - learning_rate * dW[2]
b[2] = b[2] - learning_rate * db[2]
```
#### Getting your matrix dimensions right
| matrix | dimension |
| --- | --- |
| `W[l]` | `(n[l], n[l-1])` |
| `b[l]` | `(n[l], 1)` |
| `Z[l]` | `(n[l], m)` |
| `A[l]` | `(n[l], m)` |
| `dW[l]` | `(n[l], n[l-1])` |
| `db[l]` | `(n[l], 1)` |
| `dZ[l]` | `(n[l], m)` |
| `dA[l]` | `(n[l], m)` |
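One way to use the table is to assert the shapes while debugging. The sketch below assumes parameters and gradients are stored in dictionaries keyed `"W1"`, `"dW1"`, and so on; that layout and the name `check_shapes` are hypothetical.

```python
import numpy as np

def check_shapes(params, grads, layer_dims):
    """Assert the W/b/dW/db shapes from the table for every layer."""
    for l in range(1, len(layer_dims)):
        n_l, n_prev = layer_dims[l], layer_dims[l - 1]
        assert params[f"W{l}"].shape == (n_l, n_prev)
        assert params[f"b{l}"].shape == (n_l, 1)
        assert grads[f"dW{l}"].shape == params[f"W{l}"].shape
        assert grads[f"db{l}"].shape == params[f"b{l}"].shape
    return True

layer_dims = [3, 4, 1]                     # n[0], n[1], n[2]
params = {"W1": np.zeros((4, 3)), "b1": np.zeros((4, 1)),
          "W2": np.zeros((1, 4)), "b2": np.zeros((1, 1))}
grads = {"d" + name: np.zeros_like(value) for name, value in params.items()}
shapes_ok = check_shapes(params, grads, layer_dims)
```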
#### Why deep representations
- A deep neural network with multiple hidden layers can have its earlier layers learn simple, low-level features, and have its later, deeper layers combine those features to detect more complex things, such as specific words, phrases or sentences.
- If there aren't enough hidden layers, a shallower network might require exponentially more hidden units to compute the same function.
#### Building blocks of deep neural networks
![nn framework](../_resources/nn_frame.png)
*Implementation steps*:
1. Initialize parameters / Define hyperparameters
2. Loop for num_iterations:
    1. Forward propagation
    2. Compute cost function
    3. Backward propagation
    4. Update parameters (using parameters, and grads from backprop)
3. Use trained parameters to predict labels
#### Forward and Backward Propagation
In the algorithm implementation, outputting intermediate values as caches (basically `Z` and `A`) of each forward step is crucial for backward computation.
![forward and backward](../_resources/backprop_flow.png)
#### Parameters vs Hyperparameters
*Parameters*:
- weight matrices `W` of each layer
- bias terms `b` of each layer
*Hyperparameters*:
- number of hidden units `n[l]`
- learning rate
- number of iterations
- number of layers `L`
- choice of activation functions
### What does this have to do with the brain
On this topic, I think the following explanation from Andrew Ng is the best summary:
> I do think that maybe the field of computer vision has taken a bit more inspiration from the human brain than other disciplines that also apply deep learning, but I personally use the analogy to the human brain less than I used to.

View File

@ -0,0 +1,677 @@
---
title: >-
Course 2: Improving Deep Neural Networks: Hyperparameter tuning,
Regularization and Optimization
updated: 2022-05-23 11:00:22Z
created: 2022-05-16 17:54:09Z
---
# Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- [Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization](#course-2-improving-deep-neural-networks-hyperparameter-tuning-regularization-and-optimization)
  - [Week 1: Practical aspects of Deep Learning](#week-1-practical-aspects-of-deep-learning)
    - [Learning Objectives](#learning-objectives)
    - [Setting up your Machine Learning Application](#setting-up-your-machine-learning-application)
      - [Train / Dev / Test sets](#train--dev--test-sets)
      - [Bias / Variance](#bias--variance)
      - [Basic Recipe for Machine Learning](#basic-recipe-for-machine-learning)
    - [Regularizing your neural network](#regularizing-your-neural-network)
      - [Regularization](#regularization)
      - [Why regularization reduces over-fitting](#why-regularization-reduces-over-fitting)
      - [Dropout Regularization](#dropout-regularization)
      - [Understanding Dropout ("Inverted Dropout")](#understanding-dropout-inverted-dropout)
      - [Other regularization methods](#other-regularization-methods)
    - [Setting up your optimization problem](#setting-up-your-optimization-problem)
      - [Normalizing inputs](#normalizing-inputs)
      - [Vanishing / Exploding gradients](#vanishing--exploding-gradients)
      - [Weight Initialization for Deep Networks](#weight-initialization-for-deep-networks)
      - [Numerical approximation of gradients](#numerical-approximation-of-gradients)
      - [Gradient checking](#gradient-checking)
      - [Gradient checking implementation notes](#gradient-checking-implementation-notes)
  - [Week 2: Optimization algorithms](#week-2-optimization-algorithms)
    - [Learning Objectives](#learning-objectives-1)
    - [Optimization algorithms](#optimization-algorithms)
      - [Mini-batch gradient descent](#mini-batch-gradient-descent)
      - [Understanding mini-batch gradient descent](#understanding-mini-batch-gradient-descent)
      - [Exponentially Weighted Averages](#exponentially-weighted-averages)
      - [Understanding exponentially weighted averages](#understanding-exponentially-weighted-averages)
      - [Bias correction in exponentially weighted averages](#bias-correction-in-exponentially-weighted-averages)
      - [Gradient descent with momentum](#gradient-descent-with-momentum)
      - [RMSprop](#rmsprop)
      - [Adam optimization algorithm](#adam-optimization-algorithm)
      - [Learning rate decay](#learning-rate-decay)
      - [The problem of local optima](#the-problem-of-local-optima)
      - [Quick notes for optimization algorithms](#quick-notes-for-optimization-algorithms)
  - [Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks](#week-3-hyperparameter-tuning-batch-normalization-and-programming-frameworks)
    - [Learning Objectives](#learning-objectives-2)
    - [Hyperparameter tuning](#hyperparameter-tuning)
      - [Tuning process](#tuning-process)
      - [Using an appropriate scale to pick hyperparameters](#using-an-appropriate-scale-to-pick-hyperparameters)
      - [Hyperparameters tuning in practice: Panda vs. Caviar](#hyperparameters-tuning-in-practice-panda-vs-caviar)
    - [Batch Normalization](#batch-normalization)
      - [Normalizing activations in a network](#normalizing-activations-in-a-network)
      - [Fitting Batch Norm into a neural network](#fitting-batch-norm-into-a-neural-network)
      - [Why does Batch Norm work](#why-does-batch-norm-work)
      - [Batch Norm at test time](#batch-norm-at-test-time)
    - [Multi-class classification](#multi-class-classification)
      - [Softmax Regression](#softmax-regression)
      - [Training a softmax classifier](#training-a-softmax-classifier)
    - [Introduction to programming frameworks](#introduction-to-programming-frameworks)
      - [Deep learning frameworks](#deep-learning-frameworks)
      - [Tensorflow](#tensorflow)
## Week 1: Practical aspects of Deep Learning
### Learning Objectives
- Give examples of how different types of initializations can lead to different results
- Examine the importance of initialization in complex neural networks
- Explain the difference between train/dev/test sets
- Diagnose the bias and variance issues in your model
- Assess the right time and place for using regularization methods such as dropout or L2 regularization
- Explain Vanishing and Exploding gradients and how to deal with them
- Use gradient checking to verify the accuracy of your back-propagation implementation
### Setting up your Machine Learning Application
#### Train / Dev / Test sets
Setting up the training, development (dev, also called validation) and test sets has a huge impact on productivity. It is important to choose the dev and test sets from the same distribution, and they must be sampled randomly from all the data.
![b190151ee52f3e4c4e48f19afe65db7a.png](../_resources/b190151ee52f3e4c4e48f19afe65db7a.png)
With big data (`m` > 1,000,000 examples), the dev and test sets can be a much smaller fraction of the data, for example:
|Training|dev|test|
|-|-|-|
|98%|1%|1%|
|99.5%|0.25%|0.25%|
*Guideline*:
- Choose a dev set and test set to reflect data you expect to get in the future.
- The dev and test sets should be just big enough to represent accurately the performance of the model.
- __Make sure dev and test set come from the same distribution__
- Test set is not always necessary
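A minimal sketch of a random 98/1/1 split, assuming examples are stored as columns of `X` (the course convention). The function name `split_train_dev_test` and its default fractions are illustrative.

```python
import numpy as np

def split_train_dev_test(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples (columns) once, then cut into train/dev/test."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    n_train = m - n_dev - n_test
    train = (X[:, :n_train], Y[:, :n_train])
    dev = (X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev])
    test = (X[:, n_train + n_dev:], Y[:, n_train + n_dev:])
    return train, dev, test

X = np.arange(2 * 1000).reshape(2, 1000).astype(float)
Y = (np.arange(1000) % 2).reshape(1, 1000)
train, dev, test = split_train_dev_test(X, Y)
print(train[0].shape, dev[0].shape, test[0].shape)
```

Because dev and test are cut from the same shuffled pool, they come from the same distribution by construction.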
#### Bias / Variance
![5563a5e13ec8cdd9e44831e7209880a2.png](../_resources/5563a5e13ec8cdd9e44831e7209880a2.png)
| error type | high variance | high bias | high bias, high variance | low bias, low variance |
| --- | --- | --- | --- | --- |
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
> When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". There is a tradeoff between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.
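These percentages are relative to the optimal (Bayes) error. The table above assumes that human error is $\approx$ 0%, so the optimal (Bayes) error is also $\approx$ 0%. If human error were 15%, the same numbers would read differently: a 15% train set error would no longer indicate high bias.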
![46f3e8c2b3038d9c4b19536c9cdd9a21.png](../_resources/46f3e8c2b3038d9c4b19536c9cdd9a21.png)
High bias, because the model is not fitting the green line area.
To understand bias and variance better, read this essay: [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html).
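The reading of the table can be sketched as a small rule of thumb. The function `diagnose` and its threshold `tol` are hypothetical; the 2% cutoff is an arbitrary value picked for illustration, not something the course prescribes.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Label a model from its errors; `tol` is an illustrative threshold."""
    high_bias = (train_err - bayes_err) > tol        # far from the optimal error
    high_variance = (dev_err - train_err) > tol      # large train/dev gap
    if high_bias and high_variance:
        return "high bias, high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias, low variance"

# The four columns of the table above, with Bayes error assumed to be 0%.
for train_err, dev_err in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, diagnose(train_err, dev_err))
```

Passing `bayes_err=0.15` to the second column changes the label, which is the point about human-level error made above.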
#### Basic Recipe for Machine Learning
![bias-variance-tradeoff](../_resources/bias-variance-tradeoff.png)
- For a high bias problem, getting more training data is actually not going to help.
- Back in the pre-deep learning era, we didn't have as many tools that just reduce bias or that just reduce variance without hurting the other one.
- In the modern deep learning, big data era, getting a bigger network and more data almost always just __reduces bias without necessarily hurting your variance__, so long as you regularize appropriately.
- This has been one of the big reasons that deep learning has been so useful for supervised learning.
- The main cost of training a big neural network is just computational time, so long as you're regularizing.
- Useful for reducing variance:
  - L2 regularization
  - Data augmentation
  - Dropout
  - More data
### Regularizing your neural network
#### Regularization
__Regularization for Logistic Regression__:
![reg-cost](../_resources/reg-logistic-cost.svg)
`b` is just one parameter over a very large number of parameters, so no need to include it in the regularization.
| regularization | formula | description |
| --- | --- | --- |
| L2 regularization | ![reg-cost](../_resources/reg-logistic-l2.svg) | most common type of regularization |
| L1 regularization | ![reg-cost](../_resources/reg-logistic-l1.svg) | w vector will have a lot of zeros, so L1 regularization makes your model sparse |
__Regularization for a Neural Network__:
![reg-cost](../_resources/reg-nn-cost.svg)
For the matrix `w`, this norm is called the Frobenius norm. Its definition looks like `L2` norm but is not called the `L2` norm:
![reg-cost](../_resources/reg-nn-fnorm.svg)
Regularization of gradient:
![reg-nn-grad](../_resources/reg-nn-grad.svg)
With regularization the coefficient of `w` is slightly less than `1`, in which case it is called __weight decay__.
![reg-nn-weight-decay](../_resources/reg-nn-wdecay.svg)
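A small numeric sketch of why the L2 term is called weight decay. It assumes the usual scaling of the penalty, `(lambda / 2m) * ||W||_F^2`, whose gradient adds `(lambda / m) * W` to the backprop gradient; the function names are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms of all weight matrices."""
    return lambd / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def update_with_l2(W, dW_backprop, lambd, m, learning_rate):
    """Gradient step where (lambda / m) * W is added to the backprop gradient."""
    dW = dW_backprop + (lambd / m) * W
    return W - learning_rate * dW

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))
lambd, m, learning_rate = 0.7, 50, 0.1

W_new = update_with_l2(W, dW_backprop, lambd, m, learning_rate)
# Same step written as weight decay: shrink W by (1 - alpha * lambda / m), then step.
W_decay = (1 - learning_rate * lambd / m) * W - learning_rate * dW_backprop
print(np.allclose(W_new, W_decay))
```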
#### Why regularization reduces over-fitting
- If we make the regularization parameter lambda very big, the weight matrices are pushed reasonably close to zero, __effectively zeroing out a lot of the impact of the hidden units.__ The network then behaves like a much smaller neural network, eventually almost like a logistic regression, and is therefore less prone to over-fitting.
- Taking activation function `g(Z)=tanh(Z)` as an example: if lambda is large, the weights `W` are small and `Z` ends up taking relatively small values, where `g(Z)` is roughly linear in `Z`. A nearly linear network cannot fit very complicated decision boundaries, i.e., it is less able to over-fit.
![c2e4a299ff489309dab8e6e81864d82b.png](../_resources/c2e4a299ff489309dab8e6e81864d82b.png)
*Implementation tips*:
Without a regularization term, the cost function should decrease monotonically in the plot. With regularization, when debugging gradient descent make sure to plot `J` including the regularization term; if we plot only the first term (the old `J`), we might not see a monotonic decrease.
![35ae293edf8fad30686a64bbe5dcc90a.png](../_resources/35ae293edf8fad30686a64bbe5dcc90a.png)
#### Dropout Regularization
- Dropout is another powerful regularization technique.
- With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in neural network. It's as if on every iteration you're working with a smaller neural network, which has a regularizing effect.
- Inverted dropout technique, `a3 = a3 / keep_prob`, ensures that the expected value of `a3` remains the same, which makes test time easier because you have less of a scaling problem.
![dropout](../_resources/dropout.jpeg)
*(image source: [deepnotes](https://deepnotes.io/dropout))*
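A minimal sketch of inverted dropout on one activation matrix. The function name `inverted_dropout` and the array sizes are illustrative; the division by `keep_prob` is the `a3 = a3 / keep_prob` step mentioned above.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors."""
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)

# The mean stays close to the original mean of 1.0 because of the rescaling.
print(a3.mean(), a3_dropped.mean())
```

At test time the function is simply not called, so no extra scaling is needed.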
#### Understanding Dropout ("Inverted Dropout")
- Can't rely on any one feature, so have to spread out weights, which has an effect of shrinking the squared norm of the weights, similar to what we saw with L2 regularization, helping prevent over-fitting.
- For layers where you're more worried about over-fitting, really the layers with a lot of parameters, you can set `keep_prob` to be smaller to apply a more powerful form of dropout.
- Downside: with a separate `keep_prob` for some layers, there are more hyper-parameters to search for using cross-validation.
- Frequently used in __computer vision__: the input size is so big (all those pixels) that you almost never have enough data, so models are prone to over-fitting.
- With dropout, the cost function `J` is no longer well-defined, so it is harder to double check that `J` is going downhill on every iteration. So first run the code with dropout turned off (`keep_prob = 1`) and make sure `J` is monotonically decreasing, and then turn on dropout, hoping that no bug was introduced with it.
- Do not use dropout at test time:
  - inverted dropout makes test time easier because there is less of a scaling problem;
  - with dropout on, different runs zero out different random units, so predictions would be noisy.
*Note*:
- A __common mistake__ when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.
- Turn off dropout during testing (`keep_prob = 1.0`)
- Deep learning frameworks like [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/api/layers/dropout.html), [keras](https://keras.io/api/layers/regularization_layers/dropout/) or [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks.
#### Other regularization methods
- __Data augmentation__: getting more training data can be expensive and is sometimes impossible, so flipping images horizontally, random cropping, random distortion and translation can produce additional fake training examples.
- __Early stopping__: stopping halfway to get a mid-size `w`.
  - *Disadvantage*: early stopping couples two tasks of machine learning, optimizing the cost function `J` and not over-fitting, which are supposed to be completely separate tasks, and that makes things more complicated.
  - __Orthogonalization__: think about ONE task at a time.
  - *Advantage*: running the gradient descent process just once, you get to try out values of small `w`, mid-size `w`, and large `w`, without needing to try a lot of values of the L2 regularization hyper-parameter lambda.
  - Focus on:
    - first: optimize the cost function `J`
    - then: do not over-fit (L2 regularization etc.)
![f96c721f766385e5fc346faf159dcd20.png](../_resources/f96c721f766385e5fc346faf159dcd20.png)
### Setting up your optimization problem
#### Normalizing inputs
With normalization, cost function will be more round and easier to optimize when features are all on similar scales. This is a very common topic, see more on [Stack Overflow](https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network).
- gradient descent can use a larger learning rate and converges faster
- use the same normalization (the `μ` and `σ` computed on the training set) for the train and test sets
![a157a98c23f8955d06c56f8be279c074.png](../_resources/a157a98c23f8955d06c56f8be279c074.png)
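A short sketch of input normalization, assuming features are rows and examples are columns. The helper names `fit_normalizer` and `normalize`, and the small `eps` guard against division by zero, are choices made for this example.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation, computed on the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
scale = np.array([[10.0], [0.1]])          # two features on very different scales
shift = np.array([[5.0], [-2.0]])
X_train = rng.standard_normal((2, 500)) * scale + shift
X_test = rng.standard_normal((2, 100)) * scale + shift

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)  # reuse the training statistics
```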
#### Vanishing / Exploding gradients
- In a very deep network derivatives or slopes can sometimes get either very big or very small, maybe even exponentially, and this makes training difficult.
- If the weights `W` are all just a little bit bigger than the identity matrix, then in a very deep network the activations can explode. And if `W` is just a little bit less than the identity, the activations decrease exponentially.
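The effect can be reproduced with a toy network. The sketch below assumes linear activations and no bias so that only the scale of `W` matters; the function name `final_activation_norm` and the depth of 50 layers are illustrative.

```python
import numpy as np

def final_activation_norm(scale, num_layers, n=4):
    """Push a vector of ones through num_layers linear layers with W = scale * I."""
    W = scale * np.eye(n)
    a = np.ones((n, 1))
    for _ in range(num_layers):
        a = W @ a          # linear activation, no bias, to isolate the effect of W
    return np.linalg.norm(a)

exploding = final_activation_norm(1.5, 50)   # grows like 1.5^50
vanishing = final_activation_norm(0.5, 50)   # shrinks like 0.5^50
print(exploding, vanishing)
```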
#### Weight Initialization for Deep Networks
A partial solution to the problems of vanishing and exploding gradients is better or more careful choice of the random initialization for neural network.
For a single neuron, suppose we have `n` features for the input layer. We want `Z = W1X1 + W2X2 + ... + WnXn` to neither blow up nor become too small, so the larger `n` is, the smaller we want `Wi` to be.
- It's reasonable to set variance of `Wi` to be equal to `1/n`
- It helps reduce the vanishing and exploding gradients problem, because it's trying to set each of the weight matrices `W` not too much bigger than `1` and not too much less than `1`.
- Generally for layer `l`, set `W[l]=np.random.randn(shape) * np.sqrt(1/n[l-1])`.
  - For `relu` activation, set `Var(W)=2/n` by `W[l]=np.random.randn(shape) * np.sqrt(2/n[l-1])`. (aka He initialization by [Kaiming He](http://kaiminghe.com/))
  - For `tanh` activation, `W[l]=np.random.randn(shape) * np.sqrt(1/n[l-1])`. (Xavier initialization)
  - `W[l]=np.random.randn(shape) * np.sqrt(2/(n[l-1]+n[l]))` (Yoshua Bengio)
- `1` or `2` in variance `Var(W)=1/n or 2/n` can be a hyperparameter, but not as important as other hyperparameters.
*A well chosen initialization can*:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error
*Implementation tips*:
- The weights `W[l]` should be initialized randomly to *break symmetry* and make sure different hidden units can learn different things. Initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing.
- It is however okay to initialize the biases `b[l]` to zeros. Symmetry is still broken so long as `W[l]` is initialized randomly.
- Initializing weights to very large random values does not work well.
- Initializing with small random values does better. The important question is: how small should these random values be? He initialization works well for networks with ReLU activations. In other cases, try other initializations.
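He initialization for a whole network can be sketched as follows. The function name `initialize_he` and the dictionary layout are assumptions for the example; the scaling factor `sqrt(2 / n[l-1])` is the one given above.

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """W[l] ~ N(0, 2 / n[l-1]) and b[l] = 0 for every layer l."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_l, n_prev = layer_dims[l], layer_dims[l - 1]
        params[f"W{l}"] = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_l, 1))
    return params

params = initialize_he([1000, 500, 1])
# The empirical variance of W1 should be close to 2 / 1000.
print(params["W1"].var())
```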
#### Numerical approximation of gradients
Numerically approximating the derivative of a function lets us verify that an analytic derivative is implemented correctly, and hence check whether there is a bug in the back-propagation implementation.
*Two-sided difference formula is much more accurate*:
- In two side case, `f'(𝜃)=lim(f(𝜃+𝜀)-f(𝜃-𝜀))/(2𝜀), error term ~ O(𝜀^2)`
- In one side case, `f'(𝜃)=lim(f(𝜃+𝜀)-f(𝜃))/(𝜀), error term ~ O(𝜀)`
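- For small `𝜀` (`𝜀 < 1`), `𝜀^2 < 𝜀`, so the `O(𝜀^2)` error of the two-sided formula is much smaller than the `O(𝜀)` error of the one-sided one.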
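A quick numeric check of the two error orders, using `f(𝜃) = 𝜃^3` at `𝜃 = 1` with `𝜀 = 0.01` as an illustrative test function (any smooth function would do).

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
exact = 3 * theta ** 2                                    # analytic derivative

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps

err_two = abs(two_sided - exact)   # on the order of eps^2
err_one = abs(one_sided - exact)   # on the order of eps
print(err_two, err_one)
```

The two-sided error is roughly two orders of magnitude smaller here, at the cost of one extra function evaluation.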
#### Gradient checking
*Implementation steps*:
1. Take `W[1],b[1],...,W[L],b[L]` and reshape into a big vector `𝜃`: `J(W[1],b[1],...,W[L],b[L])=J(𝜃)`.
2. Take `dW[1],db[1],...,dW[L],db[L]` and reshape into a big vector `d𝜃`.
3. For each `i`: `d𝜃_approx[i] = (J(𝜃1,𝜃2,...,𝜃i+𝜀,...)-J(𝜃1,𝜃2,...,𝜃i-𝜀,...))/(2𝜀)`. (Should have `d𝜃_approx[i] ≈ d𝜃[i]`)
4. Check `diff_ratio = norm_2(d𝜃_approx-d𝜃) / (norm_2(d𝜃_approx)+norm_2(d𝜃)) ≈ eps`:
    1. `diff_ratio ≈ 10^-7`, great, backprop is very likely correct.
    2. `diff_ratio ≈ 10^-5`, maybe OK, better check no component of this difference is particularly large.
    3. `diff_ratio ≈ 10^-3`, worry, check if there is a bug.
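The steps above can be sketched on a toy cost function. Here `J` and `grad_J` are made-up stand-ins for a network's cost and its backprop gradient, and `gradient_check` is a hypothetical helper; in a real network `theta` would be the flattened `W[l], b[l]`.

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Return norm(approx - dtheta) / (norm(approx) + norm(dtheta))."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)   # two-sided difference
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den

def J(theta):
    return np.sum(theta ** 2) + np.sum(np.sin(theta))

def grad_J(theta):
    return 2 * theta + np.cos(theta)

theta = np.array([0.3, -1.2, 2.0])
ratio_ok = gradient_check(J, theta, grad_J(theta))
ratio_bug = gradient_check(J, theta, 2 * theta)   # simulated bug: cos term forgotten
print(ratio_ok, ratio_bug)
```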
#### Gradient checking implementation notes
- Don't use in training - only to debug
- If algorithm fails grad check, look at components to try to identify bug.
- Remember regularization.
- Doesn't work with dropout. (you can first check grad, then turn on dropout)
- Run at random initialization; perhaps again after some training.
## Week 2: Optimization algorithms
### Learning Objectives
- Apply optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
- Use random minibatches to accelerate convergence and improve optimization
- Describe the benefits of learning rate decay and apply it to your optimization
### Optimization algorithms
#### Mini-batch gradient descent
Vectorization allows you to process all `m` examples relatively quickly, but if `m` is very large it can still be slow. For example, with `m = 5,000,000` (or `m = 50,000,000` or even bigger), we have to process the entire training set of five million training samples before we take one little step of gradient descent.
We can use the mini-batch method to let gradient descent start to make some progress before we finish processing the entire, giant training set of 5 million examples by splitting up the training set into smaller, little baby training sets called mini-batches. In this case, we have 5000 mini-batches with 1000 examples each.
*Notations*:
- `(i)`: the *i*-th training sample
- `[l]`: the *l*-th layer of the neural network
- `{t}`: the *t*-th mini batch
In every step of the iteration loop, we need to loop for `num_batches` and do forward and backward computation for each batch.
1. Forward propagation
2. Compute cost function
3. Backward propagation
4. Update parameters (using parameters, and grads from backprop)
With mini-batch gradient descent, a single pass through the training set is one epoch, which in the above 5 million example, means 5000 gradient descent steps.
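Splitting a training set into shuffled mini-batches can be sketched like this. The function name `random_mini_batches` and the column-per-example layout are assumptions for the example; note the last batch is smaller when `m` is not a multiple of the batch size.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and cut them into consecutive mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.zeros((3, 1000))
Y = np.zeros((1, 1000))
batches = random_mini_batches(X, Y, batch_size=64)

# One epoch = one pass over all mini-batches, i.e. len(batches) gradient steps.
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```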
#### Understanding mini-batch gradient descent
| batch size | method | description | guidelines |
| --- | --- | --- | --- |
| =`m` | batch gradient descent | cost function decreases on every iteration;<br>but too long per iteration. | for a small training set (<2000). |
| =`1` | stochastic gradient descent | cost function oscillates, can be extremely noisy;<br>wander around minimum;<br>lose speedup from vectorization, inefficient. | use a smaller learning rate when it oscillates too much. |
| between `1` and `m` | mini-batch gradient descent | somewhere in between, vectorization advantage, faster;<br>not guaranteed to always head toward the minimum but more consistently in that direction than stochastic descent;<br>not always exactly converge, may oscillate in a very small region, reducing the learning rate slowly may also help. | mini-batch size is a hyperparameter;<br>batch size better in \[64, 128, 256, 512\], a power of 2;<br>make sure that mini-batch fits in CPU/GPU memory. |
#### Exponentially Weighted Averages
Moving averages are favored statistical tools of active traders to measure momentum. There are three MA methods:
| MA methods | calculations |
| --- | --- |
| simple moving average (SMA) | calculated from the average closing prices for a specified period |
| weighted moving average (WMA) | calculated by multiplying the given price by its associated weighting (assign a heavier weighting to more current data points) and totaling the values |
| exponential moving average (EWMA) | also weighted toward the most recent prices, but the rate of decrease is exponential |
For a list of daily temperatures:
![london-temp-example](../_resources/ewa-temp1.svg)
This data looks a little bit noisy (blue dots):
![ewa1](../_resources/ewa-temp-plot1.png)
![ewa-on-temp](../_resources/ewa-temp2.svg)
If we compute the trend by averaging over a larger window, the exponentially weighted average formula above adapts more slowly when the temperature changes, so there's a bit more latency. (See the red curve above)
- When `β=0.98`, it gives a lot of weight to the previous value and a much smaller weight, just 0.02, to whatever you're seeing right now. (See the green curve below)
- When `β=0.5`, it is something like averaging over just two days of temperature, a much shorter window. The result is much noisier and much more susceptible to outliers, but it adapts much more quickly to temperature changes. (See the yellow curve below)
![ewa2](../_resources/ewa-temp-plot2.png)
#### Understanding exponentially weighted averages
This topic is basically related to [gradient descent optimizations](http://people.duke.edu/~ccc14/sta-663-2018/notebooks/S09G_Gradient_Descent_Optimization.html).
![ewa](../_resources/ewa.svg)
The exponentially weighted average adds a fraction `1-β` of the current value to a leaky running sum of past values. Effectively, the contribution from the (*t-n*)th value is scaled by ![ewa-weight](../_resources/ewa-weight.svg).
For example, here are the contributions to the current value after 5 iterations (iteration 5 is the current iteration)
| iteration | contribution |
| --- | --- |
| 1 | `β^4(1-β)` |
| 2 | `β^3(1-β)` |
| 3 | `β^2(1-β)` |
| 4 | `β^1(1-β)` |
| 5 | `(1-β)` |
Since `β<1`, the contribution decreases exponentially with the passage of time. Effectively, this acts as a smoother for a function.
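The contribution table can be checked numerically: running the recursion `V_t = β*V_{t-1} + (1-β)*θ_t` from `V_0 = 0` gives the same result as the weighted sum with weights `β^(5-i)(1-β)`. The sample values and the function name `ewa` are illustrative.

```python
import numpy as np

def ewa(values, beta):
    """Exponentially weighted average, starting from V_0 = 0."""
    v = 0.0
    out = []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

beta = 0.9
theta = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
v = ewa(theta, beta)

# Weight of iteration i in V_5 is beta^(5 - i) * (1 - beta).
weights = np.array([beta ** (5 - i) * (1 - beta) for i in range(1, 6)])
print(v[-1], np.dot(weights, theta))
```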
***e*-folding**:
Andrew Ng also mentioned an interesting concept related to *e*-folding. He said:
- if `β=0.9` it would take about 10 days for `V` to decay to about `1/3` (`1/e ≈ 1/3`) of the peak;
- if `β=0.98` it would be 50 days.
Here 10 or 50 days is called one lifetime (1 *e*-folding). Generally, for an exponentially decaying quantity, after one lifetime (`1/(1-β)` iterations) `1/e ≈ 37%` remains, and after two lifetimes `1/e^2 ≈ 14%` is left.
For more information, check the definition of [*e*-folding](https://en.formulasearchengine.com/wiki/E-folding).
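A small arithmetic check of the rule of thumb: after `1/(1-β)` iterations the weight `β^(1/(1-β))` is close to `1/e`. The variable names are illustrative.

```python
import math

target = 1 / math.e
remaining = {}
for beta in (0.9, 0.98):
    lifetime = 1 / (1 - beta)            # about 10 and 50 iterations
    remaining[beta] = beta ** lifetime   # weight left after one lifetime
    print(beta, round(lifetime), remaining[beta], target)
```

The approximation gets tighter as `β` approaches 1.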
#### Bias correction in exponentially weighted averages
There's one technical detail called bias correction that can make your computation of these averages more accurate. In the temperature example above, when we set `β=0.98`, we won't actually get the green curve; instead, we get the purple curve (see the graph below).
![ewa3](../_resources/ewa-temp-plot3.png)
This is because the exponentially weighted moving average is initialized with `V0=0`, which gives the following results at the beginning of the iteration:
- `V1 = 0.98*V0 + 0.02*θ1 = 0.02 * θ1`
- `V2 = 0.98*V1 + 0.02*θ2 = 0.0196 * θ1 + 0.02 * θ2`
As a result, `V1` and `V2` calculated this way are not very good estimates of the first two temperatures. So we need some modification to make it more accurate, especially during the initial phase of our estimate, to avoid an __initial bias__. This can be corrected by scaling with `1/(1-β^t)` where `t` is the iteration number.
| original | correction |
| --- | --- |
| ![V1](../_resources/bias-c1.svg) | ![V1c](../_resources/bias-c2.svg) |
| ![V2](../_resources/bias-c3.svg) | ![V2c](../_resources/bias-c4.svg) |
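The effect of the correction is easiest to see on a constant signal. In this sketch the temperature is fixed at 40 (an arbitrary value) and `ewa_bias_corrected` is a hypothetical helper that returns both the raw and the corrected averages.

```python
import numpy as np

def ewa_bias_corrected(values, beta):
    """Return the raw EWA (V_0 = 0) and the version scaled by 1 / (1 - beta^t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        corrected.append(v / (1 - beta ** t))
    return np.array(raw), np.array(corrected)

theta = np.full(5, 40.0)   # constant temperature makes the initial bias obvious
raw, corrected = ewa_bias_corrected(theta, beta=0.98)
print(raw)         # starts far below 40 because V_0 = 0
print(corrected)   # recovers the true value from the first step
```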
#### Gradient descent with momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
- Gradient descent with momentum, which computes an EWA of the gradients and uses it to update the weights, almost always works faster than the standard gradient descent algorithm.
- The algorithm has two hyperparameters: `alpha`, the learning rate, and `beta`, which controls the exponentially weighted average. A common value for `beta` is `0.9`.
- In practice, people don't bother with bias correction here.
![momentum-algo](../_resources/momentum-algo.png)
*Implementation tips*:
- If `β = 0`, then this just becomes standard gradient descent without momentum.
- The larger the momentum `β` is, the smoother the update because the more we take the past gradients into account. But if `β` is too big, it could also smooth out the updates too much.
- Common values for `β` range from `0.8` to `0.999`. If you don't feel inclined to tune this, `β = 0.9` is often a reasonable default.
- It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
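A minimal momentum update, written in the same `(1 - beta)` form used for the EWA above. The elongated quadratic cost and the function names `momentum_step` and `grad` are made up for the demonstration.

```python
import numpy as np

def momentum_step(W, dW, v_dW, beta=0.9, learning_rate=0.1):
    """One update: EWA of the gradients, then a step along that average."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - learning_rate * v_dW
    return W, v_dW

def grad(W):
    # Gradient of J(W) = 0.5 * (W[0]^2 + 25 * W[1]^2), an elongated bowl
    return np.array([1.0, 25.0]) * W

W = np.array([5.0, 5.0])
v_dW = np.zeros_like(W)            # velocity starts at zero, no bias correction
for _ in range(300):
    W, v_dW = momentum_step(W, grad(W), v_dW)
print(W)
```

Setting `beta=0` in `momentum_step` reduces it to plain gradient descent, as noted in the tips above.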
#### RMSprop
RMSprop (root mean square propagation), similar to momentum, damps out the oscillations in gradient descent and mini-batch gradient descent, which may allow you to use a larger learning rate alpha.
The algorithm computes the exponentially weighted averages of the squared gradients and divides each update by the square root of that EWA.
```
for iteration t:
    # compute dW, db on mini-batch
    S_dW = (beta * S_dW) + (1 - beta) * dW^2
    S_db = (beta * S_db) + (1 - beta) * db^2
    W = W - alpha * dW / sqrt(S_dW + 𝜀) # 𝜀: small number(10^-8) to avoid dividing by zero
    b = b - alpha * db / sqrt(S_db + 𝜀)
```
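A runnable version of one RMSprop step for a single parameter array, following the pseudocode above (with `𝜀` inside the square root). The function name `rmsprop_step` and the sample numbers are illustrative.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, beta=0.9, learning_rate=0.01, eps=1e-8):
    """One update: EWA of squared gradients, then a step scaled by its square root."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2      # element-wise square
    W = W - learning_rate * dW / np.sqrt(s_dW + eps)
    return W, s_dW

W = np.array([1.0, -2.0])
dW = np.array([0.5, -4.0])          # one small and one large gradient component
s_dW = np.zeros_like(W)
W_new, s_dW = rmsprop_step(W, dW, s_dW)

# Both coordinates move by about the same amount despite the 8x gradient gap.
print(np.abs(W_new - W))
```

Dividing by the root of the squared-gradient average is what evens out the step sizes across directions.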
#### Adam optimization algorithm
- Adam (Adaptive Moment Estimation) optimization algorithm is basically putting momentum and RMSprop together and combines the effect of gradient descent with momentum together with gradient descent with RMSprop.
- This is a commonly used learning algorithm that is proven to be very effective for many different neural networks of a very wide variety of architectures.
- In the typical implementation of Adam, bias correction is on.
```
V_dW = 0
V_db = 0
S_dW = 0
S_db = 0
for iteration t:
    # compute dW, db using mini-batch
    # momentum
    V_dW = (beta1 * V_dW) + (1 - beta1) * dW
    V_db = (beta1 * V_db) + (1 - beta1) * db
    # RMSprop
    S_dW = (beta2 * S_dW) + (1 - beta2) * dW^2
    S_db = (beta2 * S_db) + (1 - beta2) * db^2
    # bias correction
    V_dW_c = V_dW / (1 - beta1^t)
    V_db_c = V_db / (1 - beta1^t)
    S_dW_c = S_dW / (1 - beta2^t)
    S_db_c = S_db / (1 - beta2^t)
    W = W - alpha * V_dW_c / (sqrt(S_dW_c) + 𝜀)
    b = b - alpha * V_db_c / (sqrt(S_db_c) + 𝜀)
```
*Implementation tips*:
1. It calculates an exponentially weighted average of past gradients, and stores it in variables `V_dW,V_db` (before bias correction) and `V_dW_c,V_db_c` (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables `S_dW,S_db` (before bias correction) and `S_dW_c,S_db_c` (with bias correction).
3. It updates parameters in a direction based on combining information from "1" and "2".
| hyperparameter | guideline |
| --- | --- |
| `learning rate` | tune |
| `beta1` (parameter of the momentum, for `dW`) | `0.9` |
| `beta2` (parameter of the RMSprop, for `dW^2`) | `0.999` |
| `𝜀` (avoid dividing by zero) | `10^-8` |
Adam paper: [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
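The Adam pseudocode above, sketched as a runnable step for a single parameter array. The function name `adam_step` and the toy cost `J(W) = sum(W^2)` are assumptions for the example; the default values follow the guideline table.

```python
import numpy as np

def adam_step(W, dW, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step counter."""
    v = beta1 * v + (1 - beta1) * dW               # momentum term
    s = beta2 * s + (1 - beta2) * dW ** 2          # RMSprop term
    v_c = v / (1 - beta1 ** t)                     # bias-corrected estimates
    s_c = s / (1 - beta2 ** t)
    W = W - learning_rate * v_c / (np.sqrt(s_c) + eps)
    return W, v, s

W0 = np.array([1.0, -3.0])
W, v, s = W0.copy(), np.zeros(2), np.zeros(2)
history = []
for t in range(1, 201):
    dW = 2 * W                                     # gradient of J(W) = sum(W^2)
    W, v, s = adam_step(W, dW, v, s, t)
    history.append(W.copy())
print(history[0], history[-1])
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly `learning_rate` regardless of gradient size.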
#### Learning rate decay
The learning algorithm might just end up wandering around, and never really converge, because you're using some fixed value for alpha. Learning rate decay methods can help by making learning rate smaller when optimum is near. There are several decay methods:
| decay factor | description |
| --- | --- |
| `0.95^epoch_num` | exponential decay |
| `k/sqrt(epoch_num)` or `k/sqrt(t)` | polynomial decay |
| discrete staircase | piecewise constant |
| manual decay | -- |
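A minimal sketch of the first three schedules in the table, assuming each decay factor multiplies an initial learning rate `alpha0`. The function names and the staircase settings (`drop`, `every`) are hypothetical choices for illustration.

```python
import math

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = alpha0 * base^epoch_num"""
    return alpha0 * base ** epoch_num

def inverse_sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = alpha0 * k / sqrt(epoch_num); epoch_num must be >= 1."""
    return alpha0 * k / math.sqrt(epoch_num)

def staircase_decay(alpha0, epoch_num, drop=0.5, every=10):
    """Piecewise constant: multiply by `drop` once every `every` epochs."""
    return alpha0 * drop ** (epoch_num // every)

for epoch in (1, 4, 25):
    print(epoch,
          exponential_decay(0.2, epoch),
          inverse_sqrt_decay(0.2, epoch),
          staircase_decay(0.2, epoch))
```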
#### The problem of local optima
- First, you're actually pretty unlikely to get stuck in bad local optima and much more likely to run into a saddle point, so long as you're training a reasonably large neural network (so a lot of parameters) and the cost function J is defined over a __relatively high dimensional space__.
- Second, plateaus are a problem because they can make learning pretty slow. This is where algorithms like __momentum__, __RMSProp__ or __Adam__ can really help your learning algorithm.
This is what a saddle point looks like.
![saddle-point](../_resources/saddle-point.png)
#### Quick notes for optimization algorithms
Recall that in [Course 1](joplin://8b8d24c8270944829c58a2071481e8b7#building-blocks-of-deep-neural-networks) we saw that there are several steps in the neural network implementation:
1. Initialize parameters / Define hyperparameters
2. Loop for num_iterations:
    1. Forward propagation
    2. Compute cost function
    3. Backward propagation
    4. __Update parameters (using parameters, and grads from backprop)__
3. Use trained parameters to predict labels
When we create `momentum`, `RMSprop` or `Adam` optimization methods, what we do is implement algorithms in the __update parameters__ step. A good practice is to wrap them up as options so we can compare them during our alchemy training.
```
if optimizer == "gd":
    parameters = update_parameters_with_gd(parameters, grads, learning_rate)
elif optimizer == "momentum":
    parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
elif optimizer == "adam":
    t = t + 1 # Adam counter
    parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t, learning_rate, beta1, beta2, epsilon)
```
## Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
### Learning Objectives
- Master the process of hyperparameter tuning
### Hyperparameter tuning
#### Tuning process
Importance of hyperparameters (roughly):
| importance level | hyperparameters |
| --- | --- |
| first | learning rate `alpha` |
| second | momentum term `beta`<br>mini-batch size<br>number of hidden units |
| third | number of layers<br>learning rate decay<br>Adam `beta1, beta2, epsilon` |
*Tuning tips*:
- Choose points at random, not in a grid
- Optionally use a coarse to fine search process
#### Using an appropriate scale to pick hyperparameters
Search for hyperparameters on a log scale.
```
import numpy as np
r = -4 * np.random.rand() # r in (-4, 0]
alpha = 10**r             # alpha in (10^-4, 1]
```
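This extends to any range `[10^a, 10^b]`: sample `r` uniformly in `[a,b]` and set the hyperparameter to `10^r`.
As for `beta`, use the same logarithmic scale method for `1-beta`, because results are most sensitive to changes in `beta` when it is close to 1.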
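A short sketch of the general case, under the assumption that we want `alpha` in `[10^-4, 1]` and `beta` in `[0.9, 0.999]`. The helper name `sample_log_uniform` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high, rng):
    """Sample a value between low and high, uniformly on a log10 scale."""
    r = rng.uniform(np.log10(low), np.log10(high))
    return 10 ** r

alpha = sample_log_uniform(1e-4, 1.0, rng)

# For beta, sample 1 - beta on a log scale: 1 - beta in [0.001, 0.1]
beta = 1 - sample_log_uniform(1e-3, 1e-1, rng)

print(alpha, beta)
```

Sampling `1 - beta` this way spends as many trials between 0.99 and 0.999 as between 0.9 and 0.99.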
#### Hyperparameters tuning in practice: Panda vs. Caviar
- __Panda approach__: when computational capacity is limited, babysit one model and adjust it as training progresses.
- __Caviar approach__: when there is enough capacity, train many models in parallel and keep the best one.
### Batch Normalization
#### Normalizing activations in a network
- Batch normalization makes your hyperparameter search problem much easier and makes your neural network much more robust.
- Batch norm applies the normalization process not just to the input layer, but also to values deep in the hidden layers of the neural network. It normalizes the mean and variance of `z[i]` of hidden units.
- One difference between the training input and these hidden unit values is that you might not want your hidden unit values to be forced to have mean 0 and variance 1.
- For example, if you have a sigmoid activation function, you don't want your values to always be clustered around `0`. You might want them to have a larger variance or a mean different from 0, in order to take better advantage of the nonlinearity of the sigmoid function rather than have all your values in its linear region (near `0`).
- What batch norm really does is ensure that your hidden units have a standardized mean and variance, where the mean and variance are controlled by two explicit parameters `gamma` and `beta` which the learning algorithm can set to whatever it wants.
![batch-norm](../_resources/batch-norm.png)
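The normalization described above can be sketched in NumPy as follows. This is a minimal forward pass only; the function name, the layout of `z` as `(n_units, m)` and the sample values are assumptions for illustration.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """z: (n_units, m) pre-activations of one mini-batch; gamma, beta: (n_units, 1)."""
    mu = np.mean(z, axis=1, keepdims=True)      # per-unit mean over the mini-batch
    var = np.var(z, axis=1, keepdims=True)      # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
    z_tilde = gamma * z_norm + beta             # learnable scale and shift
    return z_tilde, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)

z_tilde, mu, var = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=1))   # each entry close to beta = 0.5
print(z_tilde.std(axis=1))    # each entry close to gamma = 2.0
```

With `gamma = sqrt(var + eps)` and `beta = mu` the layer would reproduce `z` unchanged, which is why the learning algorithm is free to pick any mean and variance.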
#### Fitting Batch Norm into a neural network
- `𝛽[1],𝛾[1],𝛽[2],𝛾[2],⋯,𝛽[𝐿],𝛾[𝐿]` can also be updated using gradient descent with momentum (or RMSprop, Adam). `𝛽[l],𝛾[l]` have the same shape as `z[l]` for a single example, i.e. `(n[l], 1)`.
- Similar computation can also be applied to mini-batches.
- With batch normalization, the parameter `b[l]` can be eliminated, because subtracting the mean cancels any constant added to `z[l]`. So only `w[l],𝛽[l],𝛾[l]` need to be trained.
- The parameter `𝛽` here has nothing to do with the `beta` in the momentum, RMSprop or Adam algorithms.
![batch-norm-nn](../_resources/batch-norm-nn.png)
#### Why does Batch Norm work
- Normalizing the input features `X` so that they take on a similar range of values can speed up learning. Batch normalization does a similar thing for the hidden layers.
- It reduces *covariate shift* of the data distribution, which otherwise forces the parameters to change a lot during training. Batch norm limits how much the distribution of the hidden unit values shifts around by keeping the mean and variance of the `z` values the same.
- It allows each layer of the network to learn a little more independently of the other layers, which speeds up learning in the whole network.
- From the perspective of one of the later layers of the neural network, the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance. This makes learning in the later layers easier.
- It has a slight regularization effect.
- The mean and variance are a little noisy because they are estimated from a relatively small sample of data (each mini-batch). So, similar to dropout, batch norm adds some noise to each hidden layer's activations.
- This forces the downstream hidden units not to rely too much on any one hidden unit.
- The noise added is quite small, so the regularization effect is not large. You can use batch norm together with dropout if you want the more powerful regularization effect of dropout.
- Using a bigger mini-batch size reduces the noise and therefore the regularization effect.
- Don't turn to batch norm as a regularizer. This is not the intent of batch norm.
- Just use it as a way to normalize hidden unit activations and thereby speed up learning.
- At test time, when you make predictions and evaluate the neural network, you might not have a mini-batch of examples; you might be processing a single example at a time. So at test time you need to do something slightly different to make sure your predictions make sense.
#### Batch Norm at test time
- Batch norm processes our data one mini-batch at a time, but at test time we may need to process examples one at a time.
- In theory we could run the whole training set through the final network to get `𝜇` and `𝜎^2`.
- In practice, we usually keep an exponentially weighted average (EWA, sometimes called the running average) of the `𝜇` and `𝜎^2` seen across mini-batches during training, and use these rough estimates to normalize at test time.
- `𝜇{1}[l], 𝜇{2}[l], 𝜇{3}[l], ...` -> `𝜇[l]`
- `𝜎^2{1}[l], 𝜎^2{2}[l], 𝜎^2{3}[l], ...` -> `𝜎^2[l]`
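A sketch of the running-average idea, assuming a momentum of `0.9` for the EWA and synthetic mini-batches drawn from a normal distribution. Both function names are hypothetical, and frameworks may implement the details differently.

```python
import numpy as np

def update_running_stats(running_mu, running_var, z, momentum=0.9):
    """Fold one mini-batch's statistics into the running averages. z: (n_units, m)."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_test(z, gamma, beta, running_mu, running_var, eps=1e-8):
    """Normalize with the stored estimates instead of mini-batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
running_mu = np.zeros((3, 1))
running_var = np.ones((3, 1))
for _ in range(200):                                   # stands in for the training loop
    z_batch = rng.normal(loc=2.0, scale=3.0, size=(3, 128))
    running_mu, running_var = update_running_stats(running_mu, running_var, z_batch)

single_example = np.array([[2.0], [5.0], [-1.0]])      # one example at test time
out = batch_norm_test(single_example, np.ones((3, 1)), np.zeros((3, 1)),
                      running_mu, running_var)
print(running_mu.ravel(), running_var.ravel())         # roughly 2 and 9
print(out.ravel())
```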
### Multi-class classification
#### Softmax Regression
Use the softmax activation function in the output layer.
```
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = [1, 0.5, -2, 1, 3]
print(softmax(z))
# approximately [0.0995, 0.0604, 0.0050, 0.0995, 0.7356]
```
#### Training a softmax classifier
Softmax regression is a generalization of logistic regression to more than two classes.
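To make the link to logistic regression concrete, the sketch below checks numerically that a two-class softmax over the logits `[z, 0]` gives the same probability as `sigmoid(z)`. The value `z = 1.7` is arbitrary, and the max-subtraction is a common stability trick that is not part of the notes.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # subtracting the max avoids overflow, result is unchanged
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 1.7
p_softmax = softmax([z, 0.0])[0]  # probability of class 1 with two classes
p_sigmoid = sigmoid(z)
print(p_softmax, p_sigmoid)       # the two values agree
```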
### Introduction to programming frameworks
#### Deep learning frameworks
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- TensorFlow
- Theano
- Torch
*Choosing deep learning frameworks*:
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)
#### Tensorflow
- The two main object classes in TensorFlow are *Tensors* and *Operations*.
- When we code in TensorFlow we have to take the following steps:
    - Create a graph containing Tensors (*Variables*, *Placeholders* ...) and *Operations* (`tf.matmul`, `tf.add`, ...)
    - Create a *session*
    - Initialize the *session*
    - Run the *session* to execute the graph
- We might need to execute the graph multiple times when implementing `model()`
- Backpropagation and optimization are done automatically when running the session on the "optimizer" object.
```
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

# coefficients of w**2 - 10*w + 25
coefficients = np.array([[1.], [-10.], [25.]])

w = tf.Variable([0], dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]    # (w-5)**2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))    # initial value of w

for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
print(session.run(w))    # approaches 5, the minimizer of the cost
```
* * *
Notes by Aaron © 2020

View File

@ -0,0 +1,382 @@
---
title: 'Course 3: Structuring Machine Learning Projects'
updated: 2022-05-17 18:57:41Z
created: 2022-05-16 17:54:31Z
---
# Course 3: Structuring Machine Learning Projects
- [Course 3: Structuring Machine Learning Projects](#course-3-structuring-machine-learning-projects)
    - [Week 1: ML Strategy (1)](#week-1-ml-strategy-1)
        - [Learning Objectives](#learning-objectives)
        - [Introduction to ML Strategy](#introduction-to-ml-strategy)
            - [Why ML Strategy](#why-ml-strategy)
            - [Orthogonalization](#orthogonalization)
        - [Setting up your goal](#setting-up-your-goal)
            - [Single number evaluation metric](#single-number-evaluation-metric)
            - [Satisficing and optimizing metric](#satisficing-and-optimizing-metric)
            - [Train/dev/test distributions](#traindevtest-distributions)
            - [Size of the dev and test sets](#size-of-the-dev-and-test-sets)
            - [When to change dev/test sets and metrics](#when-to-change-devtest-sets-and-metrics)
        - [Comparing to human-level performance](#comparing-to-human-level-performance)
            - [Why human-level performance](#why-human-level-performance)
            - [Avoidable bias](#avoidable-bias)
            - [Understanding human-level performance](#understanding-human-level-performance)
            - [Surpassing human-level performance](#surpassing-human-level-performance)
            - [Improving your model performance](#improving-your-model-performance)
    - [Week 2: ML Strategy (2)](#week-2-ml-strategy-2)
        - [Learning Objectives](#learning-objectives-1)
        - [Error Analysis](#error-analysis)
            - [Carrying out error analysis](#carrying-out-error-analysis)
            - [Cleaning up incorrectly labeled data](#cleaning-up-incorrectly-labeled-data)
            - [Build your first system quickly, then iterate](#build-your-first-system-quickly-then-iterate)
        - [Mismatched training and dev/test set](#mismatched-training-and-devtest-set)
            - [Training and testing on different distributions](#training-and-testing-on-different-distributions)
            - [Bias and Variance with mismatched data distributions](#bias-and-variance-with-mismatched-data-distributions)
            - [Addressing data mismatch](#addressing-data-mismatch)
        - [Learning from multiple tasks](#learning-from-multiple-tasks)
            - [Transfering learning](#transfering-learning)
            - [Multi-task learning](#multi-task-learning)
        - [End-to-end deep learning](#end-to-end-deep-learning)
            - [What is end-to-end deep learning](#what-is-end-to-end-deep-learning)
            - [Whether to use end-to-end deep learning](#whether-to-use-end-to-end-deep-learning)
## Week 1: ML Strategy (1)
### Learning Objectives
- Explain why Machine Learning strategy is important
- Apply satisficing and optimizing metrics to set up your goal for ML projects
- Choose a correct train/dev/test split of your dataset
- Define human-level performance
- Use human-level performance to define key priorities in ML projects
- Take the correct ML Strategic decision based on observations of performances and dataset
### Introduction to ML Strategy
#### Why ML Strategy
*Ideas to improve a machine learning system*:
- Collect more data
- Collect more diverse training set
- Train algorithm longer with gradient descent
- Try Adam instead of gradient descent
- Try bigger network
- Try smaller network
- Try dropout
- Add L2 regularization
- Network architecture
    - Activation functions
    - Number of hidden units
- ...
We need ML strategies to figure out quickly and effectively which of these ideas (and maybe other ideas) are worth pursuing and which ones we can safely discard.
#### Orthogonalization
In the example of TV tuning knobs, orthogonalization means that the TV designers designed the knobs so that each knob does only one thing.
In a car, the steering wheel controls the angle, and the accelerator and brake control the speed. If instead there were two controllers that each affected both angle and speed at the same time, it would be much harder to set the car to the speed and angle we want.
```
0.3 * angle - 0.8 * speed
2 * angle + 0.9 * speed
```
Orthogonal means at 90 degrees to each other. Having orthogonal controls that are ideally aligned with the things we actually want to control makes it much easier to tune the knobs we have: the steering wheel angle, the accelerator and the brake can each be adjusted separately to get the car to do what we want.
| chain of assumptions in ML | tune the *knobs* |
| --- | --- |
| Fit training set well on cost function | bigger network<br>better optimization algorithm, Adam... |
| Fit dev set well on cost function | regularization<br>bigger training set |
| Fit test set well on cost function | bigger dev set |
| Performs well in real world | change dev set or cost function<br>(dev/test set distribution not correct or cost function not right) |
Early stopping, though not a bad technique, is a *knob* that simultaneously affects the training set and dev set performance, and therefore is **less orthogonalized**, so Andrew Ng tends not to use it.
### Setting up your goal
#### Single number evaluation metric
A single-number evaluation metric lets you quickly tell whether classifier A or classifier B is better, so having a dev set plus a single-number evaluation metric tends to speed up iteration.
| metric | calculation | definition |
| --- | --- | --- |
| Precision | `P = TP/(TP+FP)` | fraction of predicted positives that are truly positive |
| Recall | `R = TP/(TP+FN)` | fraction of real positives that are predicted positive |
| F1 score | `F1 = 2PR/(P+R)` or `1/F1 = (1/P+1/R)/2` | harmonic mean of precision and recall |
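The formulas in the table can be checked with a few lines of Python. The confusion counts below are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a classifier cannot score well by maximizing only precision or only recall.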
#### Satisficing and optimizing metric
If we care about the classification accuracy of our cat classifier and also about the running time or some other performance measure, then instead of combining them into an overall evaluation metric through an *artificial* linear weighted sum, we can have one **optimizing metric** and treat the others as **satisficing metrics**.
- In the cat classifier example, we might have accuracy as the optimizing metric and running time as a satisficing metric.
- In a wake word detection system (like Amazon Echo, Apple Siri, ...), maybe accuracy is the optimizing metric and at most `1` false positive every 24 hours is a satisficing metric.
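As a sketch of how this selection rule works in practice: filter by the satisficing constraint first, then maximize the optimizing metric. The candidate models, their numbers and the 100 ms threshold are all hypothetical.

```python
# Hypothetical candidates: (name, accuracy, running time in ms)
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),
]
MAX_RUNTIME_MS = 100  # satisficing threshold, assumed for this example

# Keep only models that satisfy the constraint, then pick the most accurate one.
feasible = [c for c in candidates if c[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda c: c[1])
print(best[0])  # B
```

Model C is the most accurate but is excluded because it fails the running-time constraint.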
#### Train/dev/test distributions
*Guideline*:
- Choose a dev set and test set to reflect data you expect to get in future and consider important to do well on.
- In particular, **the dev set and the test set here, should come from the same distribution**.
#### Size of the dev and test sets
- In the era of big data, the old 70/30 rule of thumb no longer applies. The trend has been to use more data for training and less for dev and test, especially when you have a very large data set.
- Suppose we have a million examples; it might be quite reasonable to set up the data so that we have 98% in the training set, 1% in dev, and 1% in test.
- The guideline is to make your test set big enough to give high confidence in the overall performance of your system.
- When people talk about train/test splits, what they often actually have is a train/dev split and no test set.
- In the history of machine learning, not everyone has been rigorous about this terminology: a set called the "test set" is often really being used as a dev set.
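A minimal sketch of a 98/1/1 split over example indices, assuming the data is shuffled first so that dev and test come from the same distribution. The helper name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(m, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle indices 0..m-1 and split them into train/dev/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000
```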
#### When to change dev/test sets and metrics
In an example of cat classification system, classification error might not be a reasonable metric if two algorithms have the following performance:
| algorithm | classification error | issues | review |
| --- | --- | --- | --- |
| Algorithm A | 3% | lets through lots of pornographic images | showing pornographic images to users is intolerable |
| Algorithm B | 5% | no pornographic images | misclassifies more images, but acceptable |
In this case, the metric should be modified. One way to change the evaluation metric is to add weight terms.
| metric | calculation | notation |
| --- | --- | --- |
| classification error | ![clf-error](../_resources/metric-clf-error.svg) | `L` can be the indicator function, which counts misclassified examples |
| weighted classification error | ![clf-error-weighted](../_resources/metric-clf-error-weighted.svg) | ![weights](../_resources/metric-clf-error-weights.svg) |
So if you find that the evaluation metric is not giving the correct rank order preference for which algorithm is actually better, then it's time to think about defining a new evaluation metric.
This is actually an example of orthogonalization: take a machine learning problem and break it into distinct steps.
- First, figure out how to define a metric that captures what you want to do. (*place the target*)
- Second, think about how to actually do well on this metric. (*shoot the target*)
The overall guideline is: if your current metric and the data you are evaluating on don't correspond to doing well on what you actually care about, then change your metric and/or your dev/test set to better capture what you need your algorithm to actually do well on.
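The weighted classification error discussed above can be sketched as follows. The weight of `10` for pornographic images, the normalization by the sum of weights, and the tiny data set are assumptions for illustration, since the exact formula lives in the figures.

```python
import numpy as np

def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate, normalized by the sum of the weights."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    mistakes = (y_true != y_pred).astype(float)   # 1 where the prediction is wrong
    return np.sum(weights * mistakes) / np.sum(weights)

y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 1])                # mistakes at positions 1 and 4
is_porn = np.array([False, True, False, False, False])
weights = np.where(is_porn, 10.0, 1.0)            # assumed: a porn image counts 10x

print(weighted_error(y_true, y_pred, np.ones(5))) # plain error: 2/5
print(weighted_error(y_true, y_pred, weights))    # weighted error: 11/14
```

Letting one pornographic image through now costs as much as ten ordinary mistakes, so an algorithm like A would rank worse than B under this metric.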
### Comparing to human-level performance
#### Why human-level performance
A lot more machine learning teams have been talking about comparing the machine learning systems to human-level performance.
- First, because of advances in deep learning, machine learning algorithms are suddenly working much better and so it has become much more feasible in a lot of application areas for machine learning algorithms to actually become competitive with human-level performance.
- Second, the workflow of designing and building a machine learning system is much more efficient when we're trying to do something that humans can also do.
The graph below shows the performance of humans and machine learning over time.
![human-performance](../_resources/human-performance.png)
Machine learning progress slows down once it surpasses human-level performance. One of the reasons is that human-level performance can be close to Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass a certain level of accuracy.
Also, when the performance of machine learning is worse than the performance of humans, we can improve it with different tools. These tools are harder to use once it surpasses human-level performance.
*These tools are*:
- Get labelled data from humans
- Gain insight from manual error analysis: Why did a person get this right?
- Better analysis of bias/variance.
#### Avoidable bias
By knowing what human-level performance is, it is possible to tell whether the model is performing well on the training set.
| error (%) | Scenario A | Scenario B |
| --- | --- | --- |
| humans | 1 | 7.5 |
| training error | 8 | 8 |
| development error | 10 | 10 |
In this case, human-level error is used as a proxy for Bayes error, since humans are good at identifying images. You can't do better than Bayes error on the training set unless you are overfitting. Knowing the Bayes error makes it easier to decide whether bias-reduction or variance-reduction tactics will improve the performance of the model.
- *Scenario A*: There is a 7% gap between the training error and the human-level error. It means that the algorithm isn't fitting the training set well, since the target is around 1%. To resolve the issue, we use bias reduction techniques such as training a bigger neural network or training longer.
- *Scenario B*: The model is doing well on the training set, since there is only a 0.5% difference from the human-level error. The difference between the training error and the human-level error is called **avoidable bias**. The focus here is to reduce the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance reduction techniques such as regularization or a bigger training set.
#### Understanding human-level performance
Summary of bias/variance with human-level performance:
- Human-level error is a proxy for Bayes error.
- If the difference between the human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques.
- If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques.
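The rule in this summary is simple arithmetic, sketched here with the numbers from Scenario A and Scenario B in the "Avoidable bias" table. The function name `diagnose` is hypothetical.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable bias, variance, which one to focus on), all in percent."""
    avoidable_bias = train_error - human_error   # human-level error as proxy for Bayes error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(1.0, 8.0, 10.0))   # Scenario A: (7.0, 2.0, 'bias')
print(diagnose(7.5, 8.0, 10.0))   # Scenario B: (0.5, 2.0, 'variance')
```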
#### Surpassing human-level performance
*Classification task performance (classification error, %)*:
| performance | Scenario A | Scenario B |
| --- | --- | --- |
| Team of humans | 0.5 | 0.5 |
| One human | 1.0 | 1.0 |
| Training error | 0.6 | 0.3 |
| Development error | 0.8 | 0.4 |
- Scenario A: In this case, the Bayes error is estimated at 0.5%, therefore the avoidable bias is 0.1% and the variance is 0.2%.
- Scenario B: In this case, the training error is already below the best human-level error, so we no longer have a good estimate of the Bayes error and there is not enough information to know whether bias reduction or variance reduction should be applied. It doesn't mean that the model cannot be improved; it means that the conventional ways of deciding between bias reduction and variance reduction do not work in this case.
There are many problems where machine learning significantly surpasses human-level performance, especially with structured data:
| problem | structured data |
| --- | --- |
| Online advertising | database of what users have clicked on |
| Product recommendations | database of proper support for |
| Logistics (predicting transit time) | database of how long it takes to get from A to B |
| Loan approvals | database of previous loan applications and their outcomes |
These are not **natural perception problems**: they are not *computer vision*, *speech recognition*, or *natural language processing* tasks. Humans tend to be very good at natural perception tasks, so it is possible, but a bit harder, for computers to surpass human-level performance on them.
#### Improving your model performance
*There are two fundamental assumptions of supervised learning.*
- The first one is to have a low avoidable bias, which means that the model fits the training set well.
- The second one is to have a low or acceptable variance, which means that the training set performance generalizes well to the development set and test set.
![improve-model-performance](../_resources/improve-performance.png)
## Week 2: ML Strategy (2)
### Learning Objectives
- Describe multi-task learning and transfer learning
- Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets
### Error Analysis
#### Carrying out error analysis
*To carry out error analysis, you should*:
- find a set of misclassified examples in your dev set (examples your algorithm got wrong).
- look at the misclassified examples for false positives and false negatives.
- count up the number of errors that fall into various different categories.
- you might be inspired to generate new categories of errors.
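The counting step can be as simple as a tally over the tags assigned during manual review. The examples and category names below are made up for illustration.

```python
from collections import Counter

# Hypothetical tags assigned while manually reviewing misclassified dev examples
reviewed = [
    {"id": 1, "tags": ["dog"]},
    {"id": 2, "tags": ["blurry"]},
    {"id": 3, "tags": ["great cat", "blurry"]},
    {"id": 4, "tags": ["blurry"]},
    {"id": 5, "tags": ["dog"]},
]

# An example can fall into several categories, so percentages may sum to more than 100%.
counts = Counter(tag for ex in reviewed for tag in ex["tags"])
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{len(reviewed)} = {n / len(reviewed):.0%}")
```

The share of each category gives a ceiling on how much the dev error could drop if that category were fixed completely, which helps decide what to work on next.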
#### Cleaning up incorrectly labeled data
*Some facts*:
- Deep learning algorithms are quite robust to random errors in the training set.
- The main purpose of the dev set is to help you select between two classifiers A and B.
- It's super important that your dev and test sets come from the same distribution.
*Correcting incorrect dev/test set examples*:
- Apply same process to your dev and test sets to make sure they continue to come from the same distribution.
- Consider examining examples your algorithm got right as well as ones it got wrong.
- Train and dev/test data may now come from slightly different distributions.
#### Build your first system quickly, then iterate
Depending on the area of application, the guideline below will help you prioritize when you build your system.
*Guideline*:
1. Set up development/test set and metrics
    1. Set up a target
2. Build an initial system quickly
    1. Train training set quickly: Fit the parameters
    2. Development set: Tune the parameters
    3. Test set: Assess the performance
3. Use **bias/variance analysis** & **error analysis** to prioritize next steps
### Mismatched training and dev/test set
#### Training and testing on different distributions
In the *Cat vs Non-cat* example, there are two sources of data used to develop **the mobile app**.
- The first source is small: 10,000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not professionally shot, not well framed, and blurrier.
- The second source is the web: 200,000 downloaded pictures in which the cats are professionally framed and in high resolution.
The guideline is to choose a development set and test set that reflect the data you expect to get **in the future** and consider important to do well on.
![data-on-diff-dist](../_resources/data-dist.png)
#### Bias and Variance with mismatched data distributions
Instead of just having bias and variance as two potential problems, you now have a third potential problem, data mismatch.
![bias-variance-mismatched](../_resources/bias-variance-mismatch.png)
![bias-variance-mismatched-1](../_resources/bias-variance-mismatch-1.png)
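One way to separate the three problems numerically is sketched below. It assumes a held-out "training-dev" set drawn from the training distribution, which the text here does not spell out; the function name and the error values (in percent) are hypothetical.

```python
def diagnose_with_mismatch(human, train, train_dev, dev):
    """Split the gap between human-level and dev error into three parts (all in %)."""
    return {
        "avoidable_bias": train - human,      # human-level error as proxy for Bayes error
        "variance": train_dev - train,        # same distribution, unseen examples
        "data_mismatch": dev - train_dev,     # different distribution
    }

gaps = diagnose_with_mismatch(human=0.0, train=1.0, train_dev=1.5, dev=10.0)
print(gaps)
print(max(gaps, key=gaps.get))  # data_mismatch
```

Here the model generalizes well to unseen data from the training distribution but not to the dev distribution, which points to data mismatch rather than variance.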
#### Addressing data mismatch
*This is a general guideline to address data mismatch*:
- Perform manual error analysis to understand the differences between the training set and the development/test sets. To avoid overfitting the test set, do this analysis on the development set, never on the test set.
- Make the training data more similar to the development and test sets, or collect more data similar to them. One way to make the training data more similar to your development set is **artificial data synthesis**. However, you might accidentally be simulating data from only a tiny subset of the space of all possible examples.
### Learning from multiple tasks
#### Transfering learning
Transfer learning refers to reusing the knowledge a neural network has learned on one task for another application.
*When to use transfer learning*:
- Task A and B have the same input 𝑥
- A lot more data for Task A than Task B
- Low level features from Task A could be helpful for Task B
*Example 1: Cat recognition - radiology diagnosis*
The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis. The neural network will learn about the structure and the nature of images. This initial phase of training on image recognition is called **pre-training**, since it will pre-initialize the weights of the neural network. Updating all the weights afterwards is called **fine-tuning**.
*Guideline*:
- Delete last layer of neural network
- Delete weights feeding into the last output layer of the neural network
- Create a new set of randomly initialized weights for the last layer only
- New data set `(𝑥, 𝑦)`
![transfer-learning](../_resources/transfer-learning.png)
#### Multi-task learning
Multi-task learning refers to having one neural network do several tasks simultaneously.
*When to use multi-task learning*:
- Training on a set of tasks that could benefit from having shared lower-level features
- Usually: the amount of data you have for each task is quite similar
- Can train a big enough neural network to do well on all tasks
![multi-task](../_resources/multi-task-learning.png)
### End-to-end deep learning
#### What is end-to-end deep learning
- End-to-end deep learning replaces a multi-stage processing or learning system with a single neural network.
- End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in
    - audio transcription,
    - image captioning,
    - image synthesis,
    - machine translation,
    - steering in self-driving cars, etc.
![end-to-end](../_resources/end-to-end.png)
#### Whether to use end-to-end deep learning
Before applying end-to-end deep learning, you need to ask yourself the following question: Do you have enough data to learn a function of the complexity needed to map x to y?
*Pros*:
- *Let the data speak*. With a pure machine learning approach, the neural network learns the mapping from x to y. It can find whatever statistics are in the data, rather than being forced to reflect human preconceptions.
- *Less hand-designing of components needed*. It simplifies the design workflow.
*Cons*:
- *Large amount of labeled data*. It cannot be used for every problem as it needs a lot of labeled data.
- *Excludes potentially useful hand-designed components*. Data and hand-designed components or features are the 2 main sources of knowledge for a learning algorithm. If the data set is small, then a hand-designed system is a way to inject manual knowledge into the algorithm.
* * *
Notes by Aaron © 2020

View File

@ -0,0 +1,818 @@
---
title: 'Course 4: Convolutional Neural Networks'
updated: 2022-05-23 16:40:37Z
created: 2022-05-16 17:54:54Z
---
# Course 4: Convolutional Neural Networks
- [Course 4: Convolutional Neural Networks](#course-4-convolutional-neural-networks)
- [Week 1: Foundations of Convolutional Neural Networks](#week-1-foundations-of-convolutional-neural-networks)
- [Learning Objectives](#learning-objectives)
- [Convolutional Neural Networks](#convolutional-neural-networks)
- [Computer Vision](#computer-vision)
- [Edge Detection Example](#edge-detection-example)
- [More Edge Detection](#more-edge-detection)
- [Padding](#padding)
- [Strided Convolutions](#strided-convolutions)
- [Convolutions Over Volume](#convolutions-over-volume)
- [One Layer of a Convolutional Network](#one-layer-of-a-convolutional-network)
- [Simple Convolutional Network](#simple-convolutional-network)
- [Pooling Layers](#pooling-layers)
- [CNN Example](#cnn-example)
- [Why Convolutions](#why-convolutions)
- [Week 2: Classic Networks](#week-2-classic-networks)
- [Learning Objectives](#learning-objectives-1)
- [Case Studies](#case-studies)
- [Why look at case studies](#why-look-at-case-studies)
- [Classic Networks](#classic-networks)
- [LeNet-5](#lenet-5)
- [AlexNet](#alexnet)
- [VGG-16](#vgg-16)
- [ResNets](#resnets)
- [Why ResNets](#why-resnets)
- [Networks in Networks and 1x1 Convolutions](#networks-in-networks-and-1x1-convolutions)
- [Inception Network Motivation](#inception-network-motivation)
- [Inception Network](#inception-network)
- [Practical advices for using ConvNets](#practical-advices-for-using-convnets)
- [Using Open-Source Implementation](#using-open-source-implementation)
- [Transfering Learning](#transfering-learning)
- [Data Augmentation](#data-augmentation)
- [State of Computer Vision](#state-of-computer-vision)
- [Tips for Keras](#tips-for-keras)
- [Week 3: Object detection](#week-3-object-detection)
- [Learning Objectives](#learning-objectives-2)
- [Detection algorithms](#detection-algorithms)
- [Object Localization](#object-localization)
- [Landmark Detection](#landmark-detection)
- [Object Detection](#object-detection)
- [Convolutional Implementation of Sliding Windows](#convolutional-implementation-of-sliding-windows)
- [Bounding Box Predictions (YOLO)](#bounding-box-predictions-yolo)
- [Intersection Over Union](#intersection-over-union)
- [Non-max Suppression](#non-max-suppression)
- [Anchor Boxes](#anchor-boxes)
- [YOLO Algorithm](#yolo-algorithm)
- [(Optional) Region Proposals](#optional-region-proposals)
- [Week 4: Special applications: Face recognition & Neural style transfer](#week-4-special-applications-face-recognition--neural-style-transfer)
- [Face Recognition](#face-recognition)
- [What is face recognition](#what-is-face-recognition)
- [One Shot Learning](#one-shot-learning)
- [Siamese network](#siamese-network)
- [Triplet Loss](#triplet-loss)
- [Face Verification and Binary Classification](#face-verification-and-binary-classification)
- [Summary of Face Recognition](#summary-of-face-recognition)
- [Neural Style Transfer](#neural-style-transfer)
- [What is neural style transfer](#what-is-neural-style-transfer)
- [What are deep ConvNets learning](#what-are-deep-convnets-learning)
- [Cost Function](#cost-function)
- [Content Cost Function](#content-cost-function)
- [Style Cost Function](#style-cost-function)
- [1D and 3D Generalizations](#1d-and-3d-generalizations)
## Week 1: Foundations of Convolutional Neural Networks
### Learning Objectives
- Explain the convolution operation
- Apply two different types of pooling operations
- Identify the components used in a convolutional neural network (padding, stride, filter, ...) and their purpose
- Build and train a ConvNet in TensorFlow for a classification problem
### Convolutional Neural Networks
#### Computer Vision
*Deep learning computer vision can now*:
- help self-driving cars figure out where the other cars and pedestrians around them are, so as to avoid them.
- make face recognition work much better than ever before.
- unlock a phone or unlock a door using just your face.
*Deep learning for computer vision is exciting* because:
- First, rapid advances in computer vision are enabling brand new applications that were simply impossible a few years ago.
- Second, even if you don't end up building computer vision systems per se, the computer vision research community has been so creative and inventive in coming up with new neural network architectures and algorithms that its ideas inspire a lot of cross-fertilization into other areas as well.
For computer vision applications, you don't want to be stuck using only tiny little images. You want to use large images. To do that, you need to better implement the **convolution operation**, which is one of the fundamental building blocks of **convolutional neural networks**.
#### Edge Detection Example
- The convolution operation is one of the fundamental building blocks of a convolutional neural network.
- Early layers of the neural network might detect edges, later layers might detect parts of objects, and even later layers may detect complete objects such as people's faces.
- To figure out what objects are in a picture, the first thing a computer might do is detect edges in the image.
The *convolution operation* gives you a convenient way to specify how to find these **vertical edges** in an image.
A `3 by 3` filter or `3 by 3` matrix may look like below, and this is called a vertical edge detector or a vertical edge detection filter. In this matrix, pixels are relatively bright on the left part and relatively dark on the right part.
```
1, 0, -1
1, 0, -1
1, 0, -1
```
Convolving an image with the vertical edge detection filter detects its vertical edges, such as the edge down the middle of the image below.
![edge-detection](../_resources/edge-detect-v.png)
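The vertical edge example can be sketched in a few lines of plain Python. This is only an illustration: the helper name `cross_correlate2d` and the 6x6 test image (bright left half, dark right half) are assumptions made for the example, not part of the course material.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    output = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            total = 0
            for u in range(f_h):
                for v in range(f_w):
                    total += image[i + u][j + v] * kernel[u][v]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 6x6 image: bright pixels (10) on the left half, dark pixels (0) on the right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# The vertical edge detection filter from the notes.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

edges = cross_correlate2d(image, vertical_edge)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]: the strong response marks the edge in the middle
```

The output is `4 x 4`, which matches the `(n-f+1) x (n-f+1)` rule for a `6 x 6` image and a `3 x 3` filter.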
#### More Edge Detection
In the horizontal filter matrix below, pixels are relatively bright on the top part and relatively dark on the bottom part.
```
1, 1, 1
0, 0, 0
-1, -1, -1
```
Different filters allow you to find vertical and horizontal edges. The following filter is called a **Sobel filter**. Its advantage is that it puts a little more weight on the central row (the central pixel), which makes it a bit more robust. [More about Sobel filter](https://fiveko.com/tutorials/image-processing/sobel-filter/).
```
1, 0, -1
2, 0, -2
1, 0, -1
```
Here is another filter called **Scharr filter**:
```
3, 0, -3
10, 0, -10
3, 0, -3
```
More about [**Scharr filter**](https://plantcv.readthedocs.io/en/v3.0.5/scharr_filter/).
```
w1, w2, w3
w4, w5, w6
w7, w8, w9
```
By treating all nine numbers of the filter as parameters and learning them automatically from data, neural networks can learn low-level features such as edges even more robustly than computer vision researchers are generally able to code them up by hand.
#### Padding
In order to fix the following two problems, padding is usually applied in the convolutional operation.
- Every time you apply a convolutional operator the image shrinks.
- A lot of information from the edges of the image is thrown away.
*Notations*:
- image size: `n x n`
- convolution size: `f x f`
- padding size: `p`
*Output size after convolution*:
- without padding: `(n-f+1) x (n-f+1)`
- with padding: `(n+2p-f+1) x (n+2p-f+1)`
*Convention*:
- Valid convolutions: no padding
- Same convolutions: padding is chosen so that the output size is the same as the input size (with stride 1, `p = (f-1)/2`)
- `f` is usually odd
#### Strided Convolutions
*Notation*:
- stride `s`
*Output size after convolution*: `floor((n+2p-f)/s+1) x floor((n+2p-f)/s+1)`
*Conventions*:
- The filter must lie entirely within the image or the image plus the padding region.
- In a typical math or signal processing textbook, a convolution includes a flipping operation before the product-and-sum step: the filter is flipped vertically and horizontally.
- By convention, the deep learning literature does not bother with this flipping operation. Technically the operation is a *cross-correlation*, but it is still called a convolution.
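The output size formulas for padding and stride can be checked with a small helper. This is a minimal sketch; the function names `conv_output_size`, `same_padding` and `flip_kernel` are made up for illustration.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    return (f - 1) // 2


def flip_kernel(kernel):
    """Flip a filter vertically and horizontally, as a textbook convolution would."""
    return [row[::-1] for row in kernel[::-1]]


print(conv_output_size(6, 3))                        # valid convolution: 4
print(conv_output_size(6, 3, p=same_padding(3)))     # same convolution: 6
print(conv_output_size(7, 3, p=0, s=2))              # strided convolution: 3
print(flip_kernel([[1, 0, -1], [1, 0, -1], [1, 0, -1]]))  # [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
```

The `same_padding` helper also shows why `f` is usually odd: only then is `(f-1)/2` a whole number of pixels on each side.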
#### Convolutions Over Volume
For an RGB image, the filter itself has three layers (channels), corresponding to the red, green, and blue channels of the image.
`height x width x channel`
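`n x n x nc` \* `f x f x nc` --\> `(n-f+1) x (n-f+1) x nc'`, where the filter must have the same number of channels `nc` as the image and `nc'` is the number of filters applied.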
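A rough sketch of the shape rule above, using nested lists so it runs without any library. The function name `conv_volume` and the all-ones test image are assumptions made for the example.

```python
def conv_volume(image, filters):
    """Convolve an n x n x n_c image with a list of f x f x n_c filters (no padding, stride 1).

    Each filter produces one output channel, so the result is (n-f+1) x (n-f+1) x len(filters).
    """
    n = len(image)
    f = len(filters[0])
    n_c = len(image[0][0])
    out = n - f + 1
    result = [[[0] * len(filters) for _ in range(out)] for _ in range(out)]
    for k, filt in enumerate(filters):
        for i in range(out):
            for j in range(out):
                # Sum over the whole f x f x n_c window, i.e. over all channels at once.
                result[i][j][k] = sum(
                    image[i + u][j + v][c] * filt[u][v][c]
                    for u in range(f) for v in range(f) for c in range(n_c)
                )
    return result


# Hypothetical 6x6x3 image of ones and two 3x3x3 filters.
image = [[[1] * 3 for _ in range(6)] for _ in range(6)]
ones_filter = [[[1] * 3 for _ in range(3)] for _ in range(3)]
vertical_edge_filter = [[[1] * 3, [0] * 3, [-1] * 3] for _ in range(3)]

volume = conv_volume(image, [ones_filter, vertical_edge_filter])
print(len(volume), len(volume[0]), len(volume[0][0]))  # 4 4 2
print(volume[0][0])                                    # [27, 0]
```

Two filters give two output channels, so the `6 x 6 x 3` input becomes a `4 x 4 x 2` volume.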
#### One Layer of a Convolutional Network
*Notations*:
| size | notation |
| --- | --- |
| filter size | ![f(l)](../_resources/layer_filter_size.svg) |
| padding size | ![p(l)](../_resources/layer_padding_size.svg) |
| stride size | ![s(l)](../_resources/layer_stride_size.svg) |
| number of filters | ![nc(l)](../_resources/layer_num_filters.svg) |
| filter shape | ![filter_shape](../_resources/layer_filter_shape.svg) |
| input shape | ![input_shape](../_resources/layer_input_shape.svg) |
| output shape | ![output_shape](../_resources/layer_output_shape.svg) |
| output height | ![nh(l)](../_resources/layer_output_height.svg) |
| output width | ![nw(l)](../_resources/layer_output_width.svg) |
| activations `a[l]` | ![activations](../_resources/layer_output_shape.svg) |
| activations `A[l]` | ![activations](../_resources/layer_activations.svg) |
| weights | ![weights](../_resources/layer_weights.svg) |
| bias | ![bias](../_resources/layer_bias.svg) |
#### Simple Convolutional Network
Types of layer in a convolutional network:
- Convolution (CONV)
- Pooling (POOL)
- Fully connected (FC)
#### Pooling Layers
- One interesting property of max pooling is that it has a set of hyper-parameters but it has no parameters to learn. There's actually nothing for gradient descent to learn.
- The formulas developed previously for the output size of a conv layer also work for max pooling.
- Max pooling is used much more often than average pooling.
- When you do max pooling, you usually do not use any padding.
- Why pooling:
- to allow a degree of translational invariance on the input.
- to downsample the spatial dimensions, thereby reducing the amount of computation and the number of parameters in later layers.
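The two pooling types can be sketched as follows. This is an illustrative helper (the name `pool2d` and the 4x4 input are assumptions), with filter size `f` and stride `s` as the only hyperparameters and nothing to learn.

```python
def pool2d(image, f=2, s=2, mode="max"):
    """Apply max or average pooling with an f x f window and stride s (no padding)."""
    n_h, n_w = len(image), len(image[0])
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [image[i * s + u][j * s + v] for u in range(f) for v in range(f)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        output.append(row)
    return output


# Hypothetical 4x4 input.
image = [[1, 3, 2, 1],
         [2, 9, 1, 1],
         [1, 3, 2, 3],
         [5, 6, 1, 2]]

print(pool2d(image, mode="max"))      # [[9, 2], [6, 3]]
print(pool2d(image, mode="average"))  # [[3.75, 1.25], [3.75, 2.0]]
```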
#### CNN Example
- Because the pooling layer has no weights and no parameters, only a few hyperparameters, the convention used here is to count `CONV1` and `POOL1` together as one layer.
- As you go deeper, the *height* and *width* usually decrease, whereas the number of *channels* increases.
- Max pooling layers don't have any parameters.
- The conv layers tend to have relatively few parameters; most of the parameters tend to be in the fully connected layers of the neural network.
- The activation size tends to go down *gradually* as you go deeper in the neural network. If it drops too quickly, that is usually not great for performance.
![nn-example](../_resources/nn-example.png)
*Layer shapes of the network*:
| layer | activation shape | activation size | \# parameters |
| --- | --- | --- | --- |
| Input | (32,32,3) | 3072 | 0 |
| CONV1 (f=5,s=1) | (28,28,8) | 6272 | 608 `=(5*5*3+1)*8` |
| *POOL1* | (14,14,8) | 1568 | 0 |
| CONV2 (f=5,s=1) | (10,10,16) | 1600 | 3216 `=(5*5*8+1)*16` |
| *POOL2* | (5,5,16) | 400 | 0 |
| FC3 | (120,1) | 120 | 48120 `=400*120+120` |
| FC4 | (84,1) | 84 | 10164 `=120*84+84` |
| softmax | (10,1) | 10 | 850 `=84*10+10` |
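The parameter counts in the table can be reproduced with two small formulas. This is a sketch; the helper names `conv_params` and `fc_params` are made up for the example.

```python
def conv_params(f, n_c_prev, n_filters):
    """Each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters


def fc_params(n_in, n_out):
    """A fully connected layer has an n_in x n_out weight matrix plus n_out biases."""
    return n_in * n_out + n_out


param_counts = {
    "CONV1": conv_params(f=5, n_c_prev=3, n_filters=8),
    "CONV2": conv_params(f=5, n_c_prev=8, n_filters=16),
    "FC3": fc_params(5 * 5 * 16, 120),  # POOL2 output (5,5,16) flattened to 400 values
    "FC4": fc_params(120, 84),
    "softmax": fc_params(84, 10),
}

for name, count in param_counts.items():
    print(name, count)
print("total", sum(param_counts.values()))  # total 62958
```

Almost all of the parameters (48120 + 10164 of 62958) sit in the two fully connected layers, which is the point made in the list above.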
#### Why Convolutions
There are two main advantages of convolutional layers over just using fully connected layers.
- Parameter sharing: A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.
- Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
Through these two mechanisms, a neural network has a lot fewer parameters, which allows it to be trained with smaller training sets and makes it less prone to overfitting.
- Convolutional structure helps the neural network encode the fact that an image shifted by a few pixels should result in pretty similar features and should probably be assigned the same output label.
- Applying the same filter at all positions of the image, in both the early and the late layers, helps a neural network automatically learn to be more robust and to better capture the desirable property of translation invariance.
## Week 2: Classic Networks
### Learning Objectives
- Discuss multiple foundational papers written about convolutional neural networks
- Analyze the dimensionality reduction of a volume in a very deep network
- Implement the basic building blocks of ResNets in a deep neural network using Keras
- Train a state-of-the-art neural network for image classification
- Implement a skip connection in your network
- Clone a repository from GitHub and use transfer learning
### Case Studies
#### Why look at case studies
It is helpful in taking someone else's neural network architecture and applying that to another problem.
- Classic networks
- LeNet-5
- AlexNet
- VGG
- ResNet
- Inception
#### Classic Networks
##### LeNet-5
![LeNet-5](../_resources/lenet-5.png)
Some difficult points about reading the [LeNet-5 paper](https://pdfs.semanticscholar.org/62d7/9ced441a6c78dfd161fb472c5769791192f6.pdf):
- Back then, people used sigmoid and tanh nonlinearities, not ReLU.
- To save on computation as well as some parameters, the original LeNet-5 had a rather complicated scheme in which different filters would look at different channels of the input block. The paper talks about those details, but a modern implementation wouldn't have that type of complexity.
- One last thing that was done back then but isn't really done now: the original LeNet-5 had a non-linearity after pooling (a sigmoid non-linearity after the pooling layer, as far as the lecture recalls).
- Andrew Ng recommends focusing on section two, which talks about this architecture, and taking a quick look at section three, which has a bunch of experiments and results and is pretty interesting. Later sections talk about the graph transformer network, which isn't widely used today.
##### AlexNet
![AlexNet](../_resources/alexnet.png)
- AlexNet has a lot of similarities to LeNet (60,000 parameters), but it is much bigger (60 million parameters).
- The paper had a complicated way of training on two GPUs, since GPUs were still a little bit slower back then.
- The original AlexNet architecture also had another type of layer called local response normalization, which isn't really used much.
- Before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas, but it was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning and that deep learning really works in computer vision.
##### VGG-16
![VGG-16](../_resources/vgg-16.png)
- Filters are always `3x3` with a stride of `1` and are always `same` convolutions.
- VGG-16 has 16 layers that have weights. A total of about 138 million parameters. Pretty large even by modern standards.
- The simplicity, or uniformity, of the VGG-16 architecture made it quite appealing.
- There are a few conv layers followed by a pooling layer, which reduces the height and width by a factor of `2`.
- Doubling the number of filters in every stack of conv layers is a simple principle used to design the architecture of this network.
- The main downside is that you have to train a large number of parameters.
#### ResNets
Paper: [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
![resnet-network](../_resources/resnet-network.png)
- Deeper neural networks are more difficult to train. The authors present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
- When deeper networks are able to start converging, a degradation problem is exposed: with increasing network depth, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. The paper addresses the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, the authors explicitly let these layers fit a residual mapping.
- The paper authors show that: 1) Their extremely deep residual nets are easy to optimize, but the counterpart "plain" nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Their deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
![resnet](../_resources/resnet.png)
Formally, denoting the desired underlying mapping as `H(x)`, they let the stacked nonlinear layers fit another mapping of `F(x):=H(x)-x`. The original mapping `H(x)` is recast into `F(x)+x`. If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.
![resnet-block](../_resources/resnet-block.png)
#### Why ResNets
- Doing well on the training set is usually a prerequisite to doing well on your hold-out, dev, or test sets. So being able to at least train a ResNet to do well on the training set is a good first step toward that.
- If you make a plain network deeper, it can hurt your ability to train the network to do well on the training set. This is not true, or at least much less true, when training a ResNet.
- Consider `a[l+2]=g(Z[l+2]+a[l])=g(W[l+2]a[l+1]+b[l+2]+a[l])`. If `L2` regularization shrinks the values of `W[l+2]` and `b[l+2]` to zero, then `a[l+2]=g(a[l])=a[l]`, since `g` is the `ReLU` activation and `a[l]` is already non-negative. So we just get back `a[l]`. This shows that the identity function is easy for a residual block to learn.
- It's easy to get `a[l+2]` equal to `a[l]` because of this skip connection. This means that adding these two layers to the neural network doesn't really hurt its ability to do as well as the simpler network without them, because it's quite easy for it to learn the identity function and just copy `a[l]` to `a[l+2]` despite the addition of these two layers.
- So adding two extra layers, or adding this residual block somewhere in the middle or at the end of a big neural network, doesn't hurt performance. Starting from this decent baseline of not hurting performance, gradient descent can only improve the solution from there.
*About dimensions*:
- In `a[l+2]=g(Z[l+2]+a[l])` we're assuming that `Z[l+2]` and `a[l]` have the same dimension. So what we see in ResNet is a lot of use of same convolutions.
- In case the input and output have different dimensions, we can add an extra matrix `W_s` so that `a[l+2] = g(Z[l+2] + W_s * a[l])`. The matrix `W_s` could be a matrix of learned parameters or a fixed matrix that just implements zero padding.
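The identity argument and the role of `W_s` can be illustrated with a toy residual block. This is a simplified sketch that uses small dense layers on plain lists instead of convolutions; the function names and all weight values are assumptions chosen for the example.

```python
def relu(values):
    return [max(0.0, x) for x in values]


def dense(W, a, b):
    """Compute W a + b for a weight matrix W (list of rows), input a and bias b."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2, W_s=None):
    """a[l+2] = g(Z[l+2] + a[l]), or g(Z[l+2] + W_s a[l]) when dimensions differ."""
    a_l1 = relu(dense(W1, a_l, b1))   # a[l+1]
    z_l2 = dense(W2, a_l1, b2)        # Z[l+2]
    shortcut = a_l if W_s is None else dense(W_s, a_l, [0.0] * len(W_s))
    return relu([z + s for z, s in zip(z_l2, shortcut)])


a_l = [1.0, 0.0, 2.5]  # non-negative, as if it came out of a ReLU
W1 = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2], [-0.1, 0.6, 0.7]]
b1 = [0.0, 0.0, 0.0]

# If W[l+2] and b[l+2] shrink to zero, the block simply copies a[l] to a[l+2].
W2_zero = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(a_l, W1, b1, W2_zero, [0.0] * 3)
print(identity_out)  # [1.0, 0.0, 2.5]

# If the block output has a different dimension (here 2), W_s adapts the shortcut.
W2_small = [[0.0] * 3 for _ in range(2)]
W_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projected_out = residual_block(a_l, W1, b1, W2_small, [0.0] * 2, W_s=W_s)
print(projected_out)  # [1.0, 0.0]
```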
*An example from the paper*:
A plain network takes an input image and passes it through a number of `CONV` layers until eventually there is a softmax output at the end.
![resnet-plain-34](../_resources/resnet-plain-34.png)
To turn this into a ResNet, you add the extra skip connections. Most of the convolutions are `3x3` same convolutions, which is why you are adding feature vectors of equal dimension. There are occasionally pooling layers, and in these cases you need to adjust the dimension with the matrix `W_s`.
![resnet-resnet-34](../_resources/resnet-resnet-34.png)
**Practical advice on ResNet**:
- Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.
- The skip connections help to address the vanishing gradient problem. They also make it easy for a ResNet block to learn an identity function.
- There are two main types of blocks: The identity block and the convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
#### Networks in Networks and 1x1 Convolutions
Paper: [Network in Network](https://arxiv.org/abs/1312.4400)
- At first, a 1×1 convolution does not seem to make much sense. After all, a convolution correlates adjacent pixels. A 1×1 convolution obviously does not.
- Because the minimum window is used, the 1×1 convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions. The only computation of the 1×1 convolution occurs on the channel dimension.
- The 1×1 convolutional layer is typically used to *adjust the number of channels* between network layers and to control model complexity.
![conv-1x1](../_resources/conv-1x1.svg)
*(image from [here](https://d2l.ai/chapter_convolutional-neural-networks/channels.html#times-1-convolutional-layer))*
The 1×1 convolutional layer is equivalent to *the fully-connected layer*, when applied on a per pixel basis.
- You can take every pixel as an *example* with `n_c[l]` input values (channels), and the output layer has `n_c[l+1]` nodes. The kernel is nothing but the weights.
- Thus the 1x1 convolutional layer requires `n_c[l+1] x n_c[l]` weights plus the biases.
The 1x1 convolutional layer is actually doing something pretty non-trivial: it adds non-linearity to your neural network and allows you to decrease, keep, or, if you want, increase the number of channels in your volumes.
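The "fully connected layer per pixel" view can be sketched directly. This is an illustration only; the function name `conv1x1`, the ReLU non-linearity and the tiny 2x2x3 volume are assumptions made for the example.

```python
def conv1x1(volume, weights, biases):
    """Apply a 1x1 convolution followed by ReLU.

    volume:  h x w x n_c_in nested lists
    weights: n_c_out x n_c_in (one row of weights per output channel)
    biases:  n_c_out values
    Every pixel is treated as one example and passed through the same small dense layer.
    """
    return [
        [
            [max(0.0, sum(w * x for w, x in zip(w_row, pixel)) + b)
             for w_row, b in zip(weights, biases)]
            for pixel in row
        ]
        for row in volume
    ]


# Hypothetical 2x2 volume with 3 channels, reduced to 2 channels.
volume = [[[1, 2, 3], [0, 1, 0]],
          [[2, 2, 2], [1, 0, 1]]]
weights = [[1, 0, -1],
           [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]

reduced = conv1x1(volume, weights, biases)
print(len(reduced), len(reduced[0]), len(reduced[0][0]))  # 2 2 2
print(reduced[0][0])                                      # [0.0, 3.0]

# Parameter count: n_c[l+1] x n_c[l] weights plus one bias per output channel.
print(len(weights) * len(weights[0]) + len(biases))       # 8
```

Height and width are untouched; only the channel dimension changes from 3 to 2.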
#### Inception Network Motivation
Paper: [Going Deeper with Convolutions](https://arxiv.org/abs/1409.4842)
When designing a layer for a ConvNet, you might have to pick: do you want a 1 by 1 filter, or 3 by 3, or 5 by 5, or do you want a pooling layer?
What the inception network says is: why not do them all? This makes the network architecture more complicated, but it also works remarkably well.
![inception-motivation](../_resources/inception-motivation.png)
The basic idea is that instead of picking one of these filter sizes or pooling and committing to it, you can do them all, concatenate all the outputs, and let the network learn whatever parameters and whatever combinations of these filter sizes it wants to use. It turns out that there is a problem with the inception layer as described here, which is *computational cost*.
*The analysis of computational cost*:
![inception-computational-cost](../_resources/inception-computation.png)
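The cost comparison can be sketched as a count of multiplications. The shapes used here (a `28 x 28 x 192` input, 32 filters of size `5 x 5`, and a `1 x 1` bottleneck with 16 channels) are assumed from the well-known lecture example; the helper name `conv_multiplications` is made up for illustration.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Number of multiplications: one f*f*n_c_in product-sum per output value."""
    return out_h * out_w * n_filters * f * f * n_c_in


# Direct 5x5 same convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# With a 1x1 bottleneck: 28x28x192 -> 28x28x16 -> 28x28x32.
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct)                         # 120422400 (about 120 million)
print(bottleneck)                     # 12443648 (about 12.4 million)
print(round(direct / bottleneck, 1))  # 9.7
```

Under these assumptions the 1x1 bottleneck layer cuts the cost by roughly a factor of ten while producing an output of the same shape.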
*Inception modules*:
![inception](../_resources/inception.png)
#### Inception Network
![inception-module](../_resources/inception-module.png)
- In order to concatenate all of these outputs at the end, the pooling layer uses `same` padding so that its output height and width match those of the other branches.
- What the inception network does is more or less put a lot of these modules together.
![inception-network](../_resources/inception-network.png)
The last few layers of the network are a fully connected layer followed by a softmax layer that makes the prediction. The side branches take some hidden layer and try to use it to make a prediction as well. You can think of this as just another detail of the inception network. What it does is help ensure that the features computed by the hidden units, even at intermediate layers, are not too bad at predicting the output class of an image. This appears to have a regularizing effect on the inception network and helps prevent it from overfitting.
### Practical advices for using ConvNets
#### Using Open-Source Implementation
- Starting with open-source implementations is a better way, or certainly a faster way to get started on a new project.
- Another advantage of doing so is that these networks sometimes take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of them. That allows you to do transfer learning using these networks.
#### Transfering Learning
The computer vision research community has been pretty good at posting lots of data sets on the Internet. ImageNet, MS COCO, and PASCAL are the names of different data sets that people have posted online and that a lot of computer vision researchers have trained their algorithms on.
- [ImageNet](http://image-net.org/): ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds or thousands of images.
- [Microsoft COCO](https://www.microsoft.com/en-us/research/publication/microsoft-coco-common-objects-in-context/): COCO stands for Common Objects in Context. The dataset contains 91 object types with 2.5 million labeled instances across 328,000 images.
- [PASCAL](https://www.cs.stanford.edu/~roozbeh/pascal-context/): The PASCAL-Context dataset is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL semantic segmentation task by providing annotations for the whole scene. The statistics section has a full list of 400+ labels.
Sometimes this training takes several weeks and many GPUs. The fact that someone else has done this and gone through the painful hyperparameter search process means that you can often download open-source weights that took someone else many weeks or months to figure out and use them as a very good initialization for your own neural network.
- If you have a small dataset for your image classification problem, you can download some open source implementation of a neural network and download not just the code but also the weights. And then you get rid of the softmax layer and create your own softmax unit that outputs your classification labels.
- To do this, you just freeze the parameters which you don't want to train. A lot of popular learning frameworks support this mode of operation (for example, in Keras, by setting `layer.trainable = False`).
- Those early frozen layers are some fixed function that doesn't change. So one trick that could speed up training is to pre-compute that layer's activations and save them to disk. The advantage of the save-to-disk or pre-compute method is that you don't need to recompute those activations every time you take an epoch, or a pass, through the training set.
- If you have a larger labeled dataset, one thing you could do is freeze fewer layers. If you have a lot of data, in the extreme case, you could just use the downloaded weights as initialization, so they would replace random initialization.
#### Data Augmentation
Having more data will help all computer vision tasks.
*Some common data augmentation in computer vision*:
- Mirroring
- Random cropping
- Rotation
- Shearing
- Local warping
*Color shifting*: Take different values of R, G and B and use them to *distort the color channels*. In practice, the values R, G and B are drawn from some probability distribution. This makes your learning algorithm more robust to changes in the colors of your images.
- One of the ways to implement color distortion uses an algorithm called PCA. The details of this are actually given in the AlexNet paper, and sometimes called PCA Color Augmentation.
- If your image is mainly purple, that is, it mainly has red and blue tints and very little green, then PCA Color Augmentation will add and subtract a lot to red and blue and relatively little to green, so the overall color of the tint stays the same.
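Three of the augmentations above (mirroring, random cropping and simple color shifting) can be sketched on an image stored as nested lists of RGB tuples. This is only an illustration: the function names, the tiny test image and the fixed shift values are assumptions, and the color shift here is a plain per-channel offset, not PCA Color Augmentation.

```python
import random


def mirror(image):
    """Flip the image horizontally (mirroring on the vertical axis)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def color_shift(image, shift):
    """Add a per-channel offset (dR, dG, dB) and clip the result to [0, 255]."""
    return [
        [tuple(min(255, max(0, c + d)) for c, d in zip(pixel, shift)) for pixel in row]
        for row in image
    ]


# Hypothetical 4x4 RGB image whose pixel values depend on the position.
image = [[(250, 10 * col, 100 + row) for col in range(4)] for row in range(4)]
rng = random.Random(0)  # seeded so the crop position is reproducible

mirrored = mirror(image)
crop = random_crop(image, 3, 3, rng)
shifted = color_shift(image, (20, -30, 0))

print(mirrored[0][0] == image[0][3])  # True
print(len(crop), len(crop[0]))        # 3 3
print(shifted[0][1])                  # (255, 0, 100)
```

In practice the shift values would be drawn from some probability distribution, as described above, instead of being fixed.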
*Implementation tips*:
A pretty common way of implementing data augmentation is to have one thread, or multiple threads, responsible for loading the data and implementing distortions, and then passing the result to some other thread or process that does the training.
- Often the data augmentation and training process can run in parallel.
- Similar to other parts of training a deep neural network, the data augmentation process also has a few hyperparameters, such as how much color shifting do you implement and what parameters you use for random cropping.
![data-augmentation-implementation](../_resources/data-augmentation.png)
#### State of Computer Vision
- Image recognition: the problem of looking at a picture and telling you is this a cat or not.
- Object detection: look at the picture and put bounding boxes around the objects, such as cars, to tell you where in the picture they are. Labeling bounding boxes is more expensive than labeling the objects' classes.
*Data vs. hand-engineering*:
- Having a lot of data: simpler algorithms as well as less hand-engineering. So less needing to carefully design features for the problem, but instead you can have a giant neural network, even a simpler architecture.
- Don't have much data: more hand-engineering ("hacks")
*Two sources of knowledge*:
- Labeled data, (x,y)
- Hand-engineering: features / network architecture / other components
![data vs. hand-engineering](../_resources/data-hand-engineering.png)
Even though data sets are getting bigger and bigger, often we just don't have as much data as we need. This is why computer vision, historically and even today, has relied more on hand-engineering. It is also why the field has developed rather complex network architectures: in the absence of more data, the way to get good performance is to spend more time architecting, or fooling around with, the network architecture.
- Hand-engineering is a very difficult and skillful task that requires a lot of insight. Historically the field of computer vision has used very small datasets, and the computer vision literature has relied on a lot of hand-engineering.
- In the last few years the amount of data for computer vision tasks has increased so dramatically that the amount of hand-engineering has been reduced significantly.
- But there is still a lot of hand-engineering of network architectures in computer vision, which is why you see very complicated hyperparameter choices in computer vision.
- The algorithms for object detection are even more complex and have even more specialized components.
- One thing that helps a lot when you have little data is *transfer learning*.
**Tips for doing well on benchmarks/winning competitions**:
- (1) Ensembling
- Train several networks independently and average their outputs (not weights).
- That may give you a 1% or 2% improvement, which really helps to win a competition.
- To test on each image you might need to run the image through 3 to 15 different networks, so ensembling slows down your running time by a factor of 3 to 15.
- So ensembling is one of those tips that people use for doing well on benchmarks and for winning competitions.
- It is almost never used in production to serve actual customers.
- One big problem: you need to keep all these different networks around, which takes up a lot more computer memory.
- (2) Multi-crop at test time
- Run classifier on multiple versions of test images and average results.
- Used much more for doing well on benchmarks than in actual production systems.
- Keep just one network around, which doesn't suck up as much memory, but it still slows down your run time quite a bit.
![multi-crop](../_resources/multi-crop.png)
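Both tips come down to averaging output vectors. A minimal sketch, assuming each network (or each crop) yields a list of class probabilities; the function name `average_predictions` and the numbers are made up for the example.

```python
def average_predictions(predictions):
    """Average several output vectors elementwise (outputs, not weights)."""
    n = len(predictions)
    return [sum(p[i] for p in predictions) / n for i in range(len(predictions[0]))]


# Hypothetical softmax outputs of three independently trained networks for one image.
network_outputs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]

ensemble = average_predictions(network_outputs)
predicted_class = ensemble.index(max(ensemble))
print(predicted_class)  # 0
```

For multi-crop at test time the same function applies, with the list holding the outputs of one network on several crops of the same image.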
*Use open source code*:
- Use architectures of networks published in the literature
- Use open source implementations if possible
- Use pretrained models and fine-tune on your dataset
#### Tips for Keras
- Keras is a tool for rapid prototyping which allows you to quickly try out different model architectures. Only four steps to build a model using Keras:
- *Create*: define your model architecture, using functions such as `Input()`, `ZeroPadding2D()`, `Conv2D()`, `BatchNormalization()`, `MaxPooling2D()`, ... These Python objects are used as functions. [Know more about "Objects as functions"](https://medium.com/python-pandemonium/function-as-objects-in-python-d5215e6d1b0d).
- *Compile*: `model.compile(optimizer = "...", loss = "...", metrics = ["accuracy"])`. Optimizers include `adam`, `sgd` or others. The loss function can be `binary_crossentropy`, `categorical_crossentropy` or others. See [Keras API Doc](https://keras.io/api/).
- *Fit/Train*: train the model by `model.fit(x = ..., y = ..., epochs = ..., batch_size = ...)`.
- *Evaluate/Test*: test the model by `model.evaluate(x = ..., y = ...)`.
- Model visualization tools:
- *Summarize model*: `model.summary()` prints the details of your layers in a table with the sizes of its inputs/outputs
- *Visualize model*: `plot_model()` plots your graph in a nice layout.
For full guidance, read the latest tutorials in the Keras documentation:
- [Introduction to Keras for Engineers](https://keras.io/getting_started/intro_to_keras_for_engineers/)
- [Introduction to Keras for Researchers](https://keras.io/getting_started/intro_to_keras_for_researchers/)
Implementations of VGG16, ResNet and Inception by Keras can be found in [Francois Chollet's GitHub repository](https://github.com/fchollet/deep-learning-models).
## Week 3: Object detection
### Learning Objectives
- Describe the challenges of Object Localization, Object Detection and Landmark Finding
- Implement non-max suppression to increase accuracy
- Implement intersection over union
- Label a dataset for an object detection application
- Identify the components used for object detection (landmark, anchor, bounding box, grid, ...) and their purpose
### Detection algorithms
#### Object Localization
![object-classification-detection](../_resources/object-clf-detect.png)
- The classification and the classification with localization problems usually have one object.
- In the detection problem there can be multiple objects.
- The ideas you learn about image classification will be useful for classification with localization, and the ideas you learn for localization will be useful for detection.
![object-classification-localization](../_resources/object-clf-local.png)
Given labeled bounding boxes, you can use supervised learning to make your algorithm output not just a class label but also the four parameters that tell you where the bounding box of the detected object is.
![object-classification-localization-y](../_resources/object-clf-local-y.png)
The squared error is used here just to simplify the description. In practice you could use a log-likelihood loss for the softmax outputs `c1, c2, c3`.
#### Landmark Detection
In more general cases, you can have a neural network output just the x and y coordinates of important points in an image, sometimes called landmarks.
![landmark-detection](../_resources/object-landmark.png)
If you are interested in human pose detection, you could also define a few key positions such as the midpoint of the chest, the left shoulder, the left elbow, the wrist, and so on.
The identity of each landmark must be consistent across images: for example, landmark one is always one particular corner of the eye, landmark two is always another particular corner of the eye, and likewise for landmark three, landmark four, and so on.
#### Object Detection
![sliding windows detection](../_resources/object-slide-window.png)
The disadvantage of sliding windows detection is its computational cost. A coarser stride makes it cheaper, but unless you use a very fine granularity or a very small stride, you end up unable to localize the objects accurately within the image.
#### Convolutional Implementation of Sliding Windows
To build up towards the convolutional implementation of sliding windows, let's first see how you can turn the fully connected layers in a neural network into convolutional layers.
![Turn FC into CONV layers](../_resources/object-sliding-conv.png)
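The equivalence between a fully connected layer and a convolutional layer can be checked numerically. The sketch below is only an illustration: the sizes (a 5 x 5 x 16 volume feeding 400 units) and all variable names are assumptions chosen for the example, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: a 5 x 5 x 16 activation volume feeding 400 units
x = rng.standard_normal((5, 5, 16))
W_fc = rng.standard_normal((400, 5 * 5 * 16))

# Fully connected layer: flatten the volume, then one matrix-vector product
fc_out = W_fc @ x.reshape(-1)

# The same weights viewed as 400 filters of size 5 x 5 x 16. On a 5 x 5 input
# each filter fits in exactly one position, so the output is 1 x 1 x 400.
filters = W_fc.reshape(400, 5, 5, 16)
conv_out = np.einsum("hwc,fhwc->f", x, filters).reshape(1, 1, 400)
```

Both computations give the same 400 numbers. On a larger input, the convolutional form would simply produce one such output vector per window position.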
The convolutional implementation of sliding windows allows the *four* passes through the convnet to share a lot of computation. Instead of running them sequentially, you can feed in the entire image, maybe 28 by 28, and convolutionally make all the predictions at the same time.
![convolutional implementation of sliding windows](../_resources/object-sliding-conv2.png)
#### Bounding Box Predictions (YOLO)
The convolutional implementation of sliding windows is more computationally efficient, but it still has the problem of not outputting the most accurate bounding boxes. The perfect bounding box may not even be square; it may be a slightly wider rectangle with a more horizontal aspect ratio.
![YOLO](../_resources/object-yolo-alg.png)
**YOLO algorithm**:
The basic idea is you're going to take the image classification and localization algorithm and apply that to each of the nine grid cells of the image. If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
The advantage of this algorithm is that the neural network outputs precise bounding boxes as follows.
- First, this allows your network to output bounding boxes of any aspect ratio, as well as much more precise coordinates than those dictated by the stride size of a sliding windows classifier.
- Second, this is a convolutional implementation, so you're not running the algorithm nine times on a 3 by 3 grid or 361 times on a 19 by 19 grid.
#### Intersection Over Union
`IoU` is a measure of the overlap between two bounding boxes. If we use `IoU` to assess the output, then the higher the `IoU`, the more accurate the bounding box. `IoU` is also a useful tool for the YOLO algorithm to discard redundant bounding boxes.
![IoU](../_resources/object-iou.png)
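A minimal sketch of the `IoU` computation follows. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` with positive area; that box format and the function name `iou` are choices made for this example.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Width and height are clamped at 0 so non-overlapping boxes give zero area
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    return inter_area / union_area
```

For example, `iou((2, 1, 4, 3), (1, 2, 3, 4))` has an intersection of area 1 and a union of area 7.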
#### Non-max Suppression
One of the problems of object detection as you've learned about it so far is that your algorithm may find multiple detections of the same object. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way to make sure that your algorithm detects each object only once.
- It first picks the box with the largest `Pc`, the probability of a detection.
- Then, the non-max suppression part gets rid of any other boxes that have a high `IoU` (above a chosen threshold) with the box picked in the first step.
![Non-max](../_resources/object-nonmax.png)
If you actually tried to detect three objects, say pedestrians, cars, and motorcycles, then the output vector will have three additional components. It turns out that the right thing to do is to independently carry out non-max suppression three times, once for each of the output classes.
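The two steps of non-max suppression for a single class can be sketched as follows. This is an illustrative version under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, `scores` holds the `Pc` values, and the default threshold of 0.5 is an assumed value.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); overlap area divided by union area
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    # Candidate indices sorted by descending detection probability Pc
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)  # step 1: take the box with the largest Pc
        keep.append(best)
        # step 2: drop every remaining box that overlaps the chosen one too much
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

With several classes, this function would be called once per class on the boxes predicted for that class.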
#### Anchor Boxes
One of the problems with object detection as you have seen it so far is that each grid cell can detect only one object. What if a grid cell needs to detect multiple objects? This is the problem that anchor boxes address.
*Anchor box algorithm*:
| previous box | with two anchor boxes |
| --- | --- |
| Each object in a training image is assigned to the grid cell that contains that object's midpoint. | Each object in a training image is assigned to the grid cell that contains the object's midpoint and to the anchor box for that grid cell with the highest `IoU`. |
| Output `y`: `3x3x8` | Output `y`: `3x3x16` or `3x3x2x8` |
![anchor box](../_resources/object-anchorbox.png)
#### YOLO Algorithm
*YOLO algorithm steps*:
- If you're using two anchor boxes, then for each of the nine grid cells you get two predicted bounding boxes.
- Next, get rid of the low-probability predictions.
- Finally, if you are trying to detect three classes (pedestrians, cars and motorcycles), independently run non-max suppression for each class on the objects that were predicted to come from that class.
![yolo-algorithm](../_resources/object-yolo-algorithm.png)
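The step that discards low-probability predictions could look like the sketch below. It rests on an assumption that is not spelled out in these notes: each box is scored as `Pc` times its best class probability, and boxes below a threshold (0.6 here, an arbitrary choice) are dropped. The function name and array layout are hypothetical.

```python
import numpy as np


def filter_boxes(pc, class_probs, boxes, threshold=0.6):
    """Keep boxes whose best class score reaches the threshold.

    pc:          (n,) detection probabilities
    class_probs: (n, n_classes) class probabilities per box
    boxes:       (n, 4) box coordinates
    """
    scores_all = pc[:, None] * class_probs   # assumed score: Pc * class probability
    classes = scores_all.argmax(axis=1)      # most likely class per box
    scores = scores_all.max(axis=1)          # score of that class
    mask = scores >= threshold               # low-probability boxes are dropped
    return boxes[mask], scores[mask], classes[mask]
```

The surviving boxes would then be grouped by class and passed through non-max suppression, one class at a time.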
#### (Optional) Region Proposals
| algorithm | description |
| --- | --- |
| R-CNN | Propose regions. Classify proposed regions one at a time. Output label + bounding box. The way that they perform the region proposals is to run an algorithm called a segmentation algorithm. One downside of the R-CNN algorithm was that it is actually quite slow. |
| Fast R-CNN | Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions. One problem with the Fast R-CNN algorithm is that the clustering step to propose the regions is still quite slow. |
| Faster R-CNN | Use a convolutional network to propose regions. (Most implementations are usually still quite a bit slower than the YOLO algorithm.) |
## Week 4: Special applications: Face recognition & Neural style transfer
Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces.
### Face Recognition
#### What is face recognition
- Verification
- Input image, name/ID
- Output whether the input image is that of the claimed person
- Recognition
- Has a database of K persons
- Get an input image
- Output ID if the image is any of the K persons (or “not recognized”)
#### One Shot Learning
One-shot learning problem: to recognize a person given just one single image.
- One approach is to feed the image of the person to a ConvNet and have it output a label, y, using a softmax unit with five outputs, corresponding to each of the four persons or none of the above. However, this doesn't work well.
- Instead, you learn a **similarity function** `d(img1,img2) = degree of difference between images`. This function takes a pair of images as input and tells you whether they show the same person or different persons. Then, if someone new joins your team, you can add a fifth person to your database and it still works fine.
#### Siamese network
A good way to implement a *similarity function* `d(img1, img2)` is to use a [Siamese network](https://www.paperswithcode.com/method/siamese-network).
![siamese-network](../_resources/siamese-network.png)
In a Siamese network, instead of making a classification by a softmax unit, we focus on the vector computed by a fully connected layer as an encoding of the input image `x1`.
*Goal of learning*:
- Parameters of NN define an encoding `𝑓(𝑥_𝑖)`
- Learn parameters so that:
- If `𝑥_𝑖,𝑥_𝑗` are the same person, `‖f(𝑥_𝑖) - f(𝑥_𝑗)‖^2` is small.
- If `𝑥_𝑖,𝑥_𝑗` are different persons, `‖f(𝑥_𝑖) - f(𝑥_𝑗)‖^2` is large.
#### Triplet Loss
One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define the triplet loss function and apply gradient descent to it.
In the terminology of the triplet loss, you always look at one anchor image. You want the distance between the anchor and a positive image (a positive example, meaning the same person) to be small, whereas you want the distance between the anchor and a negative example to be much larger. You'll always be looking at three images at a time:
- an anchor image (A)
- a positive image (P)
- a negative image (N)
As before we have `d(A,P)=‖f(A) - f(P)‖^2` and `d(A,N)=‖f(A) - f(N)‖^2`, and the learning objective is to have `d(A,P) ≤ d(A,N)`. But if `f` always equals zero or always outputs the same value, i.e., the encoding for every image is identical, the objective is trivially satisfied, which is not what we want. So we add a margin `𝛼` to the left-hand side, a term also used for support vector machines.
*The learning objective*:
`d(A,P) + 𝛼 ≤ d(A,N)` or `d(A,P) - d(A,N) + 𝛼 ≤ 0`
*Loss function*:
```
Given 3 images A,P,N:
L(A,P,N) = max(d(A,P) - d(A,N) + 𝛼, 0)
J = sum(L(A[i],P[i],N[i]))
```
You need a dataset with multiple pictures of the same person. If you had just one picture of each person, you could not train this system.
- During training, if A, P, N are chosen randomly, `d(A,P) + 𝛼 ≤ d(A,N)` is easily satisfied, so the network learns little from such triplets.
- Choose triplets that are "hard" to train on.
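The loss `L(A,P,N)` and the cost `J` defined above can be written in a few lines of NumPy. In this sketch the encodings are assumed to be already computed and stacked row-wise, and the margin value `alpha=0.2` is an assumption for illustration.

```python
import numpy as np


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Sum of max(d(A,P) - d(A,N) + alpha, 0) over a batch of triplets.

    f_a, f_p, f_n: arrays of shape (m, n_dims) holding the encodings of the
    anchor, positive and negative images.
    """
    d_ap = np.sum((f_a - f_p) ** 2, axis=-1)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2, axis=-1)  # squared distance anchor-negative
    return np.sum(np.maximum(d_ap - d_an + alpha, 0.0))
```

An "easy" triplet, where the negative is already far away, contributes zero to the cost, which is why hard triplets matter for training.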
#### Face Verification and Binary Classification
The triplet loss is a good way to learn the parameters of a ConvNet for face recognition. Face recognition can also be posed as a straight binary classification problem: take a Siamese network, that is, a pair of identical neural networks, have them both compute embeddings (maybe 128-dimensional or even higher-dimensional), and feed the embeddings to a logistic regression unit to make a prediction. The output is one if both images show the same person and zero if they show different persons.
![face-recognition](../_resources/face-recognition.png)
*Implementation tips*:
Instead of computing the encoding of each database image every single time, you can pre-compute it, which can save significant computation.
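To illustrate the pre-computation tip, the sketch below stores one encoding per person in a dictionary and compares a new encoding against it. It uses the squared-distance function `d` from the Siamese network section rather than a logistic regression unit, and the threshold of 0.7, the function names and the database layout are all assumptions made for this example.

```python
import numpy as np


def verify(encoding, claimed_id, database, threshold=0.7):
    """1:1 check: is the new encoding close to the stored one for claimed_id?"""
    dist = np.sum((encoding - database[claimed_id]) ** 2)
    return bool(dist < threshold)


def recognize(encoding, database, threshold=0.7):
    """1:K search: return the closest stored identity, or None if too far."""
    best_id, best_dist = None, float("inf")
    for person_id, stored in database.items():
        dist = np.sum((encoding - stored) ** 2)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist < threshold else None
```

Only the new image has to go through the network at query time; the stored encodings are reused.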
#### Summary of Face Recognition
*Key points to remember*:
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images' encodings allows you to determine whether they are pictures of the same person.
*More references*:
- Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). [FaceNet: A Unified Embedding for Face Recognition and Clustering](https://arxiv.org/pdf/1503.03832.pdf)
- Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf (2014). [DeepFace: Closing the gap to human-level performance in face verification](https://research.fb.com/wp-content/uploads/2016/11/deepface-closing-the-gap-to-human-level-performance-in-face-verification.pdf)
- The pretrained model we use is inspired by Victor Sy Wang's implementation and was loaded using his code: https://github.com/iwantooxxoox/Keras-OpenFace.
- Our implementation also took a lot of inspiration from the official FaceNet github repository: https://github.com/davidsandberg/facenet
### Neural Style Transfer
#### What is neural style transfer
Paper: [A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576)
![neural style transfer](../_resources/neural-style-transfer.png)
In order to implement Neural Style Transfer, you need to look at the features extracted by a ConvNet at various layers, both the shallow and the deeper ones.
#### What are deep ConvNets learning
Paper: [Visualizing and Understanding Convolutional Networks](https://arxiv.org/abs/1311.2901)
![visualizing network](../_resources/visualizing-nn.png)
#### Cost Function
*Neural style transfer cost function*:
```
J(G) = alpha * J_content(C, G) + beta * J_style(S, G)
```
*Find the generated image G*:
1. Initialize G randomly, `G: 100 x 100 x 3`
2. Use gradient descent to minimize `J(G)`
#### Content Cost Function
- Say you use hidden layer 𝑙 to compute content cost. (Usually, choose some layer in the middle, neither too shallow nor too deep)
- Use pre-trained ConvNet. (E.g., VGG network)
- Let `𝑎[𝑙](𝐶)` and `𝑎[𝑙](𝐺)` be the activation of layer 𝑙 on the images
- If `𝑎[𝑙](𝐶)` and `𝑎[𝑙](𝐺)` are similar, both images have similar content
```
J_content(C, G) = 1/2 * ‖𝑎[𝑙](𝐶) - 𝑎[𝑙](𝐺)‖^2
```
#### Style Cost Function
Style is defined as correlation between activations across channels.
![style-cost1](../_resources/style-cost1.png)
![style-cost2](../_resources/style-cost2.png)
![style-cost3](../_resources/style-cost3.png)
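The "correlation between activations across channels" is usually captured with a Gram matrix. The sketch below is an illustration under assumptions: activations have shape `(n_H, n_W, n_C)`, and the normalization constant `1 / (2 * n_H * n_W * n_C)^2` follows a commonly used form that is assumed here rather than read from the figures.

```python
import numpy as np


def gram_matrix(a):
    """Channel-by-channel correlations of an activation volume (n_H, n_W, n_C)."""
    n_h, n_w, n_c = a.shape
    a_flat = a.reshape(n_h * n_w, n_c)  # one row per spatial position
    return a_flat.T @ a_flat            # shape (n_C, n_C)


def layer_style_cost(a_s, a_g):
    """Style cost of one layer between style image S and generated image G."""
    n_h, n_w, n_c = a_s.shape
    diff = gram_matrix(a_s) - gram_matrix(a_g)
    return np.sum(diff ** 2) / (2 * n_h * n_w * n_c) ** 2
```

The overall style cost would combine this quantity over several layers, each with its own weight.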
#### 1D and 3D Generalizations
ConvNets can apply not just to 2D images but also to 1D data as well as to 3D data.
For 1D data, like an ECG (electrocardiogram) signal, you have a time series showing the voltage at each instant in time. Say we have a 14-dimensional input. For many 1D data applications a recurrent neural network is actually used, but a ConvNet can be applied as well.
```
14 x 1 * 5 x 1 --> 10 x 16 (16 filters)
```
For 3D data, we can think of the data as having some height, some width, and also some depth. For example, suppose we want to apply a ConvNet to detect features in a 3D CT scan; for simplicity, we use a 14 x 14 x 14 input here.
```
14 x 14 x 14 x 1 * 5 x 5 x 5 x 1 --> 10 x 10 x 10 x 16 (16 filters)
```
Other 3D data can be movie data where the different slices could be different slices in time through a movie. We could use ConvNets to detect motion or people taking actions in movies.
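The output sizes in the 1D and 3D examples above follow the same rule as in 2D: each spatial dimension shrinks from `n` to `(n + 2p - f) / s + 1`. A small helper, whose name and signature are made up for this sketch, reproduces both shapes.

```python
def conv_output_shape(spatial_shape, filter_size, n_filters, stride=1, padding=0):
    """Output shape of a convolution over 1D, 2D or 3D data.

    spatial_shape lists only the spatial dimensions, e.g. (14,) or (14, 14, 14).
    """
    spatial = tuple((n + 2 * padding - filter_size) // stride + 1
                    for n in spatial_shape)
    return spatial + (n_filters,)  # one output channel per filter


shape_1d = conv_output_shape((14,), 5, 16)          # 14 x 1 input, 5 x 1 filters
shape_3d = conv_output_shape((14, 14, 14), 5, 16)   # 14 x 14 x 14 x 1 input
```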
* * *
Notes by Aaron © 2020

View File

@ -0,0 +1,935 @@
---
title: 'Course 5: Sequence Models'
updated: 2022-05-17 19:02:05Z
created: 2022-05-16 17:55:13Z
---
# Course 5: Sequence Models
- [Course 5: Sequence Models](#course-5-sequence-models)
- [Week 1: Recurrent Neural Networks](#week-1-recurrent-neural-networks)
- [Recurrent Neural Networks](#recurrent-neural-networks)
- [Why sequence models](#why-sequence-models)
- [Notation](#notation)
- [Recurrent Neural Network Model](#recurrent-neural-network-model)
- [Backpropagation through time](#backpropagation-through-time)
- [Different types of RNNs](#different-types-of-rnns)
- [Language model and sequence generation](#language-model-and-sequence-generation)
- [Sampling novel sequences](#sampling-novel-sequences)
- [Vanishing gradients with RNNs](#vanishing-gradients-with-rnns)
- [Gated Recurrent Unit (GRU)](#gated-recurrent-unit-gru)
- [Long Short Term Memory (LSTM)](#long-short-term-memory-lstm)
- [Bidirectional RNN](#bidirectional-rnn)
- [Deep RNNs](#deep-rnns)
- [Week 2: Natural Language Processing & Word Embeddings](#week-2-natural-language-processing--word-embeddings)
- [Introduction to Word Embeddings](#introduction-to-word-embeddings)
- [Word Representation](#word-representation)
- [Using word embeddings](#using-word-embeddings)
- [Properties of word embeddings](#properties-of-word-embeddings)
- [Embedding matrix](#embedding-matrix)
- [Learning Word Embeddings: Word2vec & GloVe](#learning-word-embeddings-word2vec--glove)
- [Learning word embeddings](#learning-word-embeddings)
- [Word2Vec](#word2vec)
- [Negative Sampling](#negative-sampling)
- [GloVe word vectors](#glove-word-vectors)
- [Applications using Word Embeddings](#applications-using-word-embeddings)
- [Sentiment Classification](#sentiment-classification)
- [Debiasing word embeddings](#debiasing-word-embeddings)
- [Week 3: Sequence models & Attention mechanism](#week-3-sequence-models--attention-mechanism)
- [Various sequence to sequence architectures](#various-sequence-to-sequence-architectures)
- [Basic Models](#basic-models)
- [Picking the most likely sentence](#picking-the-most-likely-sentence)
- [Beam Search](#beam-search)
- [Refinements to Beam Search](#refinements-to-beam-search)
- [Error analysis in beam search](#error-analysis-in-beam-search)
- [Bleu Score (optional)](#bleu-score-optional)
- [Attention Model Intuition](#attention-model-intuition)
- [Attention Model](#attention-model)
- [Speech recognition - Audio data](#speech-recognition---audio-data)
- [Speech recognition](#speech-recognition)
- [Trigger Word Detection](#trigger-word-detection)
## Week 1: Recurrent Neural Networks
> Learn about recurrent neural networks. This type of model has been proven to perform extremely well on temporal data. It has several variants including LSTMs, GRUs and Bidirectional RNNs, which you are going to learn about in this section.
### Recurrent Neural Networks
#### Why sequence models
Examples of sequence data:
- Speech recognition
- Music generation
- Sentiment classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Named entity recognition
#### Notation
As a motivating example, consider the problem of Named Entity Recognition (NER), with the following notation:
- `x` is the input sentence, such as: `Harry Potter and Hermione Granger invented a new spell.`
- `y` is the output, in this case: `1 1 0 1 1 0 0 0 0`.
- x<sup>&lt;t&gt;</sup> denotes the word at index `t` and y<sup>&lt;t&gt;</sup> is the corresponding output.
- In the *i*-th input example, x<sup>(i)&lt;t&gt;</sup> is the *t*-th word and T<sub>x</sub><sup>(i)</sup> is the length of the *i*-th example.
- T<sub>y</sub> is the length of the output. In NER, we have T<sub>x</sub> = T<sub>y</sub>.
The word representation introduced in this video is the one-hot representation.
- First, you have a dictionary in which words appear in a certain order.
- Second, for a particular word, we create a vector with `1` in the position of the word in the dictionary and `0` everywhere else.
For a word that is not in your vocabulary, we create a special token, a fake word called the unknown word and denoted by `<UNK>`.
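A one-hot encoding with an `<UNK>` fallback can be sketched as follows. The tiny vocabulary and the function name are made up for the example; a real dictionary would hold many thousands of words.

```python
def one_hot(word, vocab):
    """One-hot vector for a word; unknown words map to the <UNK> position."""
    index = vocab.get(word, vocab["<UNK>"])
    vector = [0] * len(vocab)
    vector[index] = 1
    return vector


# Hypothetical four-entry dictionary mapping each word to its position
vocab = {"a": 0, "harry": 1, "potter": 2, "<UNK>": 3}
```

Here `one_hot("harry", vocab)` puts the `1` at position 1, while a word such as `"hermione"` that is missing from this toy dictionary lands on the `<UNK>` position.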
#### Recurrent Neural Network Model
If we build a neural network to learn the mapping from x to y using the one-hot representation for each word as input, it might not work well. There are two main problems:
- Inputs and outputs can have different lengths in different examples. Not every example has the same input length T<sub>x</sub> or the same output length T<sub>y</sub>. Even with a maximum length, zero-padding every input up to that length doesn't seem like a good representation.
- A naive neural network architecture doesn't share features learned across different positions of the text.
*Recurrent Neural Networks*:
- A recurrent neural network does not have either of these disadvantages.
- At each time step, the recurrent neural network passes its activation on to the next time step for it to use.
- The recurrent neural network scans through the data from left to right. The parameters it uses for each time step are shared.
- One limitation of a unidirectional neural network architecture is that the prediction at a certain time uses information from inputs earlier in the sequence but not from inputs later in the sequence.
- `He said, "Teddy Roosevelt was a great president."`
- `He said, "Teddy bears are on sale!"`
- You can't tell the difference if you look only at the first three words.
![rnn-forward](../_resources/rnn-forward.png)
Instead of carrying around two parameter matrices W<sub>aa</sub> and W<sub>ax</sub>, we can simplify the notation by compressing them into just one parameter matrix W<sub>a</sub>.
![rnn-notation](../_resources/rnn-notation.png)
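The compressed notation can be checked numerically: stacking W<sub>aa</sub> and W<sub>ax</sub> side by side and multiplying by the concatenated vector gives the same activation as using the two matrices separately. The sketch assumes a `tanh` activation and small made-up dimensions; the variable names are chosen for this example.

```python
import numpy as np


def rnn_step(a_prev, x_t, W_a, b_a):
    """One forward step: a<t> = tanh(W_a [a<t-1>, x<t>] + b_a)."""
    return np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)


rng = np.random.default_rng(0)
n_a, n_x = 4, 3  # assumed sizes of the activation and the input
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
b_a = rng.standard_normal(n_a)
a_prev = rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)

# W_a is W_aa and W_ax placed side by side, shape (n_a, n_a + n_x)
W_a = np.hstack([W_aa, W_ax])

two_matrix = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
one_matrix = rnn_step(a_prev, x_t, W_a, b_a)
```

The same `W_a` and `b_a` are reused at every time step, which is how the parameters are shared across positions.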
#### Backpropagation through time
In the backpropagation procedure, the most significant recursive calculation is the one that goes from right to left, that is, backwards through the time steps. This is why it is called backpropagation through time.
#### Different types of RNNs
There are different types of RNN:
- One to One
- One to Many
- Many to One
- Many to Many
![rnn-type](../_resources/rnn-type.png)
See more details about RNN by [Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
#### Language model and sequence generation
A language model tells you the probability of a particular sentence.
For example, suppose a speech recognition application has to choose between two sentences:
| sentence | probability |
| --- | --- |
| The apple and pair salad. | 𝑃(The apple and pair salad)=3.2x10<sup>-13</sup> |
| The apple and pear salad. | 𝑃(The apple and pear salad)=5.7x10<sup>-10</sup> |
For a language model it is useful to represent a sentence as the output `y` rather than the input `x`. The language model estimates the probability of a particular sequence of words `𝑃(y<1>, y<2>, ..., y<T_y>)`.
*How to build a language model*?
`Cats average 15 hours of sleep a day <EOS>` This sentence has 9 tokens in total, including `<EOS>`.
- The first thing you would do is to tokenize this sentence.
- Map each of these words to one-hot vectors or indices in vocabulary.
- You may need to add an extra token for the end of a sentence, `<EOS>`, or for unknown words, `<UNK>`.
- Omit the period. If you want to treat the period or other punctuation as an explicit token, you can add it to your vocabulary as well.
- Set the inputs x<sup>&lt;t&gt;</sup> = y<sup>&lt;t-1&gt;</sup>.
- The first activation `a1` is used to make a softmax prediction of the probability of the first word y<sup>&lt;1&gt;</sup>, that is, the probability of each word in the dictionary. For example, what is the chance that the first word is *Aaron*?
- This continues until the end, where the model predicts the chance of `<EOS>`.
- Define the cost function. The overall loss is just the sum over all time steps of the loss associated with the individual predictions.
![language model](../_resources/rnn-lm.png)
After training this RNN on a large training set, you can:
- Given an initial set of words, use the model to predict the chance of the next word.
- Given a new sentence `y<1>,y<2>,y<3>`, use it to figure out the chance of this sentence: `p(y<1>,y<2>,y<3>) = p(y<1>) * p(y<2>|y<1>) * p(y<3>|y<1>,y<2>)`
#### Sampling novel sequences
After you train a sequence model, one way you can informally get a sense of what is learned is to have it sample novel sequences.
*How to generate a randomly chosen sentence from your RNN language model*:
- In the first time step, sample the first word you want your model to generate: randomly sample according to the softmax distribution.
- The softmax distribution tells you the chance that the first word is 'a', the chance that it is 'Aaron', the chance that it is 'Zulu', or the chance that it is `<UNK>` or `<EOS>`. All these probabilities form a vector.
- Take the vector and use `np.random.choice` to sample according to the distribution defined by these probabilities. That lets you sample the first word.
- In the second time step, remember from the last section that y<sup>&lt;1&gt;</sup> is expected as input. Here, take the ŷ<sup>&lt;1&gt;</sup> you just sampled and pass it as input to the second step. Then use `np.random.choice` to sample ŷ<sup>&lt;2&gt;</sup>. Repeat this process until you generate an `<EOS>` token.
- If you want to make sure that your algorithm never generates `<UNK>`, just reject any sample that comes out as `<UNK>` and keep resampling from the vocabulary until you get a word that's not `<UNK>`.
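The sampling loop described above can be sketched as follows. The trained RNN is replaced by a hypothetical callable `next_word_probs(words)` that returns the softmax distribution for the next word given the words so far; that callable, the function name, and the `max_len` cap are assumptions made for this example. The sketch uses NumPy's `Generator.choice`, which plays the same role as `np.random.choice`.

```python
import numpy as np


def sample_sentence(next_word_probs, vocab, rng, max_len=20):
    """Sample words one at a time until <EOS> (or max_len) is reached."""
    unk, eos = vocab.index("<UNK>"), vocab.index("<EOS>")
    words = []
    while len(words) < max_len:
        probs = next_word_probs(words)           # softmax output of the model
        idx = rng.choice(len(vocab), p=probs)    # sample from the distribution
        while idx == unk:                        # reject <UNK> and resample
            idx = rng.choice(len(vocab), p=probs)
        if idx == eos:
            break
        words.append(vocab[idx])                 # fed back in at the next step
    return words
```

Note that the rejection loop assumes the model never puts all of its probability mass on `<UNK>`; otherwise it would not terminate.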
*Character level language model*:
If you build a character level language model rather than a word level language model, then your sequence y1, y2, y3, would be the individual characters in your training data, rather than the individual words. Using a character level language model has some pros and cons. As computers get faster, there are more and more applications where people are, at least in some special cases, starting to look at character level models.
- Advantages:
- You don't have to worry about `<UNK>`.
- Disadvantages:
- The main disadvantage of the character level language model is that you end up with much longer sequences.
- Because of this, character level language models are not as good as word level language models at capturing long-range dependencies, that is, how the earlier parts of a sentence affect the later parts.
- More computationally expensive to train.
#### Vanishing gradients with RNNs
- One of the problems with a basic RNN algorithm is that it runs into vanishing gradient problems.
- Language can have very long-term dependencies, for example:
- The **cat**, which already ate a bunch of food that was delicious ..., **was** full.
- The **cats**, which already ate a bunch of food that was delicious, and apples, and pears, ..., **were** full.
- The basic RNN we've seen so far is not very good at capturing very long-term dependencies. It's difficult for the output to be strongly influenced by an input that was very early in the sequence.
- When doing backprop, the gradients may not just decrease exponentially; they may also increase exponentially with the number of layers they pass through.
- Exploding gradients are easier to spot because the parameters just blow up and you might often see NaNs ("not a number" values), meaning the results of a numerical overflow in your neural network computation.
- One solution is to apply *gradient clipping*: if the gradient vector is bigger than some threshold, re-scale it so that it is not too big.
- Vanishing gradients are much harder to solve and are the subject of the GRU and LSTM sections.
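Gradient clipping by re-scaling can be sketched in a few lines. This version re-scales the whole gradient vector when its L2 norm exceeds a threshold; clipping each element to a fixed range (for example with `np.clip`) is another common variant. The function name and the use of the L2 norm are assumptions for this example.

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    """Re-scale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)  # same direction, smaller length
    return grad
```

For instance, a gradient `[3.0, 4.0]` has norm 5, so clipping it to `max_norm=1.0` scales it down to `[0.6, 0.8]`.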
#### Gated Recurrent Unit (GRU)
The Gated Recurrent Unit is one of the ideas that has enabled RNNs to become much better at capturing very long-range dependencies and has made RNNs much more effective.
A visualization of the RNN unit of the hidden layer of the RNN in terms of a picture:
![rnn-unit](../_resources/rnn-unit.png)
- The GRU unit is going to have a new variable called `c`, which stands for memory cell.
- c̃<sup>&lt;t&gt;</sup> is a candidate for replacing c<sup>&lt;t&gt;</sup>.
- For intuition, think of Γ<sub>u</sub> as being either zero or one most of the time. In practice gamma won't be exactly zero or one.
- Because Γ<sub>u</sub> can be very close to zero (0.000001 or even smaller), the GRU doesn't suffer much from the vanishing gradient problem.
- When Γ<sub>u</sub> is that close to zero, the update becomes essentially c<sup>&lt;t&gt;</sup> = c<sup>&lt;t-1&gt;</sup>, and the value of c<sup>&lt;t&gt;</sup> is maintained almost exactly even across many, many time steps. This helps significantly with the vanishing gradient problem and therefore allows a neural network to learn even very long-range dependencies.
- In the full version of GRU, there is another gate Γ<sub>r</sub>. You can think of `r` as standing for relevance. So this gate Γ<sub>r</sub> tells you how relevant is c<sup>&lt;t-1&gt;</sup> to computing the next candidate for c<sup>&lt;t&gt;</sup>.
![GRU](../_resources/GRU.png)
*Implementation tips*:
- The asterisks are actually element-wise multiplication.
- If you have a 100-dimensional hidden activation value, then c<sup>&lt;t&gt;</sup>, c̃<sup>&lt;t&gt;</sup> and Γ<sub>u</sub> all have that same dimension.
- If Γ<sub>u</sub> is a 100-dimensional vector, then it is really a 100-dimensional vector of bits, whose values are mostly zero or one.
- These bits tell you which entries of the 100-dimensional memory cell you want to update. The element-wise multiplications tell the GRU unit which entries to update at every time step, so you can choose to keep some entries constant while updating others.
- In practice gamma won't be exactly zero or one.
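One step of the full GRU, with both the update gate Γ<sub>u</sub> and the relevance gate Γ<sub>r</sub>, can be sketched as follows. The equations are written out here in the course's notation as an assumption, since in these notes they only appear in the figure; the parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(c_prev, x_t, p):
    """One GRU step; p holds W_u, W_r, W_c and b_u, b_r, b_c."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])  # update gate
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])  # relevance gate
    # Candidate value: the relevance gate scales how much of c<t-1> is used
    c_tilde = np.tanh(p["W_c"] @ np.concatenate([gamma_r * c_prev, x_t]) + p["b_c"])
    # Element-wise mix: where gamma_u is near 0 the old memory is kept
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
```

When the update gate is driven close to zero, `gru_step` returns `c_prev` almost unchanged, which is the memory-preserving behavior described above.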
#### Long Short Term Memory (LSTM)
Fancy explanation: [Understanding LSTM Network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- For the LSTM we will no longer have the case that a<sup>&lt;t&gt;</sup> is equal to c<sup>&lt;t&gt;</sup>.
- And we're not using relevance gate Γ<sub>r</sub>. Instead, LSTM has update, forget and output gates, Γ<sub>u</sub>, Γ<sub>f</sub> and Γ<sub>o</sub> respectively.
![LSTM-units](../_resources/LSTM-units.png)
One nice property is visible in the red line at the top of the diagram: as long as the forget and update gates are set appropriately, it is relatively easy for the LSTM to take some value c<sup>&lt;0&gt;</sup> and pass it all the way to the right so that, for example, c<sup>&lt;3&gt;</sup> equals c<sup>&lt;0&gt;</sup>. This is why the LSTM, like the GRU, is very good at keeping certain real values stored in the memory cell for many, many time steps.
![LSTM](../_resources/LSTM.png)
*One common variation of LSTM*:
- Peephole connection: instead of just having the gate values be dependent only on a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>, sometimes, people also sneak in there the values c<sup>&lt;t-1&gt;</sup> as well.
*GRU vs. LSTM*:
- The advantage of the GRU is that it is a simpler model with only two gates, so it runs a bit faster computationally. That makes it easier to scale up to somewhat bigger networks.
- The LSTM is more powerful and more flexible since it has three gates instead of two. If you want to pick one to use, the LSTM has been the historically more proven choice. Most people today will still use the LSTM as the default first thing to try.
**Implementation tips**:
- *forget gate Γ<sub>f</sub>*
- The forget gate Γ<sub>f</sub><sup>&lt;t&gt;</sup> has the same dimensions as the previous cell state c<sup>&lt;t-1&gt;</sup>.
- This means that the two can be multiplied together, element-wise.
- Multiplying the tensors Γ<sub>f</sub><sup>&lt;t&gt;</sup> * c<sup>&lt;t-1&gt;</sup> is like applying a mask over the previous cell state.
- If a single value in Γ<sub>f</sub><sup>&lt;t&gt;</sup> is 0 or close to 0, then the product is close to 0.
- This keeps the information stored in the corresponding unit in c<sup>&lt;t-1&gt;</sup> from being remembered for the next time step.
- Similarly, if one value is close to 1, the product is close to the original value in the previous cell state.
- The LSTM will keep the information from the corresponding unit of c<sup>&lt;t-1&gt;</sup>, to be used in the next time step.
- *candidate value c̃<sup>&lt;t&gt;</sup>*
- The candidate value is a tensor containing information from the current time step that **may** be stored in the current cell state c<sup>&lt;t&gt;</sup>.
- Which parts of the candidate value get passed on depends on the update gate.
- The candidate value is a tensor containing values that range from -1 to 1. (tanh function)
- The tilde "~" is used to differentiate the candidate c̃<sup>&lt;t&gt;</sup> from the cell state c<sup>&lt;t&gt;</sup>.
- *update gate Γ<sub>u</sub>*
- The update gate decides what parts of a "candidate" tensor c̃<sup>&lt;t&gt;</sup> are passed onto the cell state c<sup>&lt;t&gt;</sup>.
- The update gate is a tensor containing values between 0 and 1.
- When a unit in the update gate is close to 1, it allows the value of the candidate c̃<sup>&lt;t&gt;</sup> to be passed onto the cell state c<sup>&lt;t&gt;</sup>.
- When a unit in the update gate is close to 0, it prevents the corresponding value in the candidate from being passed onto the cell state.
- *cell state c<sup>&lt;t&gt;</sup>*
- The cell state is the "memory" that gets passed onto future time steps.
- The new cell state c<sup>&lt;t&gt;</sup> is a combination of the previous cell state and the candidate value.
- *output gate Γ<sub>o</sub>*
- The output gate decides what gets sent as the prediction (output) of the time step.
- The output gate is like the other gates. It contains values that range from 0 to 1.
- *hidden state a<sup>&lt;t&gt;</sup>*
- The hidden state gets passed to the LSTM cell's next time step.
- It is used to determine the three gates (Γ<sub>f</sub>, Γ<sub>u</sub>, Γ<sub>o</sub>) of the next time step.
- The hidden state is also used for the prediction y<sup>&lt;t&gt;</sup>.
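The gate descriptions above can be put together into a single time step. The following is a minimal NumPy sketch, not the course's reference implementation: the parameter names (`Wf`, `bf`, `Wu`, `bu`, `Wc`, `bc`, `Wo`, `bo`) and the toy shapes are assumptions chosen for illustration, and the prediction y<sup>&lt;t&gt;</sup> is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, a_prev, c_prev, params):
    """One LSTM time step. Shapes: x_t (n_x, m); a_prev and c_prev (n_a, m)."""
    concat = np.concatenate([a_prev, x_t], axis=0)           # [a<t-1>, x<t>]
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])  # forget gate, values in (0, 1)
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate, values in (0, 1)
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate value, values in (-1, 1)
    c_t = gamma_f * c_prev + gamma_u * c_tilde               # new cell state
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])  # output gate, values in (0, 1)
    a_t = gamma_o * np.tanh(c_t)                             # new hidden state
    return a_t, c_t

# Toy sizes (assumed): 3 input features, 5 hidden units, batch of 2.
rng = np.random.default_rng(0)
n_x, n_a, m = 3, 5, 2
params = {}
for gate in "fuco":
    params["W" + gate] = rng.standard_normal((n_a, n_a + n_x))
    params["b" + gate] = np.zeros((n_a, 1))

a_t, c_t = lstm_cell_step(
    rng.standard_normal((n_x, m)), np.zeros((n_a, m)), np.zeros((n_a, m)), params
)
print(a_t.shape, c_t.shape)
```

Both returned arrays have the same shape as the previous hidden state, so the function can be called in a loop over time steps.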
#### Bidirectional RNN
![RNN-ner](../_resources/BRNN-ner.png)
- A bidirectional RNN (BRNN) lets the prediction at any point in time use information from both earlier and later in the sequence.
- This network defines an acyclic graph.
- The forward prop has part of the computation going from left to right and part of the computation going from right to left in this diagram.
- For example, the prediction ŷ<sup>&lt;3&gt;</sup> takes into account x<sup>&lt;1&gt;</sup>, x<sup>&lt;2&gt;</sup> and x<sup>&lt;3&gt;</sup> through the forward activations, while information from x<sup>&lt;4&gt;</sup> flows through the backward activation at step four to the backward activation at step three and on to ŷ<sup>&lt;3&gt;</sup>. The prediction at time three therefore uses information from the past, from the present (which feeds both the forward and the backward units at this step), and from the future.
- Blocks can be not just the standard RNN block but they can also be GRU blocks or LSTM blocks. In fact, BRNN with LSTM units is commonly used in NLP problems.
![BRNN](../_resources/BRNN.png)
*Disadvantage*:
The disadvantage of the bidirectional RNN is that you need the entire sequence of data before you can make predictions anywhere. For example, if you're building a speech recognition system, the BRNN lets you take into account the entire speech utterance, but with this straightforward implementation you need to wait for the person to stop talking to get the entire utterance before you can process it and make a speech recognition prediction. Real-time speech recognition applications use somewhat more complex modules rather than just the standard bidirectional RNN shown here.
#### Deep RNNs
- For learning very complex functions, it is sometimes useful to stack multiple layers of RNNs together to build even deeper versions of these models.
- The blocks don't have to be the standard, simple RNN model. They can also be GRU blocks or LSTM blocks.
- And you can also build deep versions of the bidirectional RNN.
![DRNN](../_resources/DRNN.png)
## Week 2: Natural Language Processing & Word Embeddings
> Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers you can train recurrent neural networks with outstanding performances in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition and machine translation.
### Introduction to Word Embeddings
#### Word Representation
- One of the weaknesses of one-hot representation is that it treats each word as a thing unto itself, and it doesn't allow an algorithm to easily generalize across words.
- This is because the inner product between any two different one-hot vectors is zero.
- It doesn't know that somehow apple and orange are much more similar than king and orange or queen and orange.
- Instead we can learn a featurized representation.
- A lot of the features of apple and orange are actually the same, or take on very similar values. This increases the odds that a learning algorithm that has figured out that orange juice is a thing will also quickly figure out that apple juice is a thing.
- The features we'll end up learning won't have an easy interpretation, like component one is gender, component two is royal, component three is age and so on. What they're representing will be a bit harder to figure out.
- But nonetheless, the featurized representations we will learn, will allow an algorithm to quickly figure out that apple and orange are more similar than say, king and orange or queen and orange.
| features\\words | Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (456) | Orange (6257) |
| --- | --- | --- | --- | --- | --- | --- |
| Gender | -1 | 1 | -0.95 | 0.97 | 0.00 | 0.01 |
| Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0.00 |
| Age (adult?) | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | -0.02 |
| Food | 0.09 | 0.01 | 0.02 | 0.01 | 0.95 | 0.97 |
| Size | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
- One common algorithm for visualizing word representation is the [t-SNE](http://www.cs.toronto.edu/~hinton/absps/tsne.pdf) algorithm due to [Laurens van der Maaten](https://lvdmaaten.github.io/tsne/) and Geoff Hinton.
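To make the contrast between the two representations concrete, here is a small sketch. It assumes a 10,000-word vocabulary and uses only the four numeric feature rows (gender, royal, age, food) from the table above as toy embeddings; real embeddings are learned and have many more dimensions.

```python
import numpy as np

# One-hot vectors for two different words (toy vocabulary of 10,000 words).
vocab_size = 10000
o_apple = np.zeros(vocab_size)
o_apple[456] = 1.0
o_orange = np.zeros(vocab_size)
o_orange[6257] = 1.0
one_hot_product = o_apple @ o_orange   # always 0 for two different words

# Featurized vectors: the gender, royal, age and food rows of the table.
e = {
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "apple":  np.array([0.00, -0.01, 0.03, 0.95]),
    "orange": np.array([0.01, 0.00, -0.02, 0.97]),
}

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim_apple_orange = cosine_similarity(e["apple"], e["orange"])
sim_king_orange = cosine_similarity(e["king"], e["orange"])
print(one_hot_product, sim_apple_orange, sim_king_orange)
```

With these toy values, apple and orange come out nearly parallel while king and orange are nearly orthogonal, which is the kind of generalization the one-hot representation cannot express.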
#### Using word embeddings
- Learn word embeddings from large text corpus. (1-100B words) (Or download pre-trained embedding online.)
- Transfer embedding to new task with smaller training set. (say, 100k words)
- Optional: Continue to finetune the word embeddings with new data.
- In practice, you would do this only if this task 2 has a pretty big data set.
- If your labeled data set for step 2 is quite small, then usually I would not bother to continue to fine-tune the word embeddings.
Word embeddings tend to make the biggest difference when the task you're trying to carry out has a relatively smaller training set.
- Useful for NLP standard tasks.
- Named entity recognition
- Text summarization
- Co-reference
- Parsing
- Less useful for:
- Language modeling
- Machine translation
*Word embedding vs. face recognition encoding*:
- The words encoding and embedding mean fairly similar things. In the face recognition literature, people also use the term encoding to refer to the vectors, `f(x(i))` and `f(x(j))`. Refer to [Course 4](joplin://1c1155ef678f4b41a1b0aa6fd36eabad#face-verification-and-binary-classification).
- For face recognition, you wanted to train a neural network that can take any face picture as input, even a picture you've never seen before, and have a neural network compute an encoding for that new picture.
- What we'll do for learning word embeddings is that we'll have a fixed vocabulary of, say, 10,000 words. We'll learn a fixed encoding or learn a fixed embedding for each of the words in our vocabulary.
- The terms encoding and embedding are used somewhat interchangeably. So the difference is not represented by the difference in terminologies. It's just a difference in how we need to use these algorithms in face recognition with unlimited pictures and natural language processing with a fixed vocabulary.
#### Properties of word embeddings
- Word embeddings can be used for analogy reasoning, which can help convey a sense of what word embeddings are doing even though analogy reasoning is not by itself the most important NLP application.
- `man --> woman` vs. `king --> queen`:
e<sub>man</sub> \- e<sub>woman</sub> ≈ e<sub>king</sub> \- e<sub>queen</sub>
- To carry out an analogy reasoning, man is to woman as king is to what?
- To find a word so that e<sub>man</sub> \- e<sub>woman</sub> ≈ e<sub>king</sub> \- e<sub>?</sub>.
- Find word `w`: argmax<sub>w</sub> *sim*(e<sub>w</sub>, e<sub>king</sub>-e<sub>man</sub>+e<sub>woman</sub>)
- We can use cosine similarity to calculate this similarity.
- Refer to the paper by [Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rvecs.pdf).
- What t-SNE does is, it takes 300-D data, and it maps it in a very non-linear way to a 2D space. And so the mapping that t-SNE learns, this is a very complicated and very non-linear mapping. So after the t-SNE mapping, you should not expect these types of parallelogram relationships, like the one we saw on the left, to hold true. And many of the parallelogram analogy relationships will be broken by t-SNE.
![word-embedding](../_resources/word-embedding.png)
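The analogy search described in this section can be sketched as follows. This is an illustration only: the embeddings are the four numeric feature rows from the word-representation table earlier in these notes, and the helper name `complete_analogy` is hypothetical. Excluding the three input words from the candidates is a common practical choice rather than something stated in the notes.

```python
import numpy as np

# Toy 4-dimensional embeddings (gender, royal, age, food) taken from the table.
emb = {
    "man":    np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman":  np.array([1.00, 0.02, 0.02, 0.01]),
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen":  np.array([0.97, 0.95, 0.69, 0.01]),
    "apple":  np.array([0.00, -0.01, 0.03, 0.95]),
    "orange": np.array([0.01, 0.00, -0.02, 0.97]),
}

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, emb):
    """Return the word w maximizing sim(e_w, e_c - e_a + e_b): a is to b as c is to w."""
    target = emb[c] - emb[a] + emb[b]
    candidates = [w for w in emb if w not in (a, b, c)]  # skip the input words
    return max(candidates, key=lambda w: cosine_similarity(emb[w], target))

answer = complete_analogy("man", "woman", "king", emb)
print(answer)
```

The search runs in the original embedding space, not in a t-SNE projection, for the reason given above.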
#### Embedding matrix
- When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix.
- E: embedding matrix (300, 10000)
- O<sub>6257</sub> = \[0,......0,1,0,...,0\], (10000, 1)
- E·O<sub>6257</sub> = e<sub>6257</sub>, (300, 1)
| a | aaron | ... | orange (6257) | ... | zulu | `<UNK>` |
| --- | --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
- Our goal will be to learn an embedding matrix E by initializing E randomly and then learning all the parameters of this (300, 10000) dimensional matrix.
- E times the one-hot vector gives you the embedding vector.
- In practice, use a specialized function to look up an embedding, because multiplying by a one-hot vector spends almost all of its work multiplying by zeros.
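A short sketch of the two ways to obtain an embedding vector, assuming a randomly initialized (300, 10000) matrix as in the notes. The column index 6257 is reused from the `orange` example; whether the notes count from 0 or 1 does not matter for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, vocab_size = 300, 10000
E = rng.standard_normal((emb_dim, vocab_size))  # randomly initialized embedding matrix

j = 6257                          # column used for "orange" in this sketch
o_j = np.zeros((vocab_size, 1))
o_j[j] = 1.0                      # one-hot column vector, shape (10000, 1)

e_matmul = E @ o_j                # (300, 1): full matrix-vector product
e_lookup = E[:, [j]]              # (300, 1): direct column lookup, no multiplications

same = np.allclose(e_matmul, e_lookup)
print(e_matmul.shape, e_lookup.shape, same)
```

Both routes give the same vector, but the lookup touches 300 numbers instead of 3 million, which is why deep learning frameworks provide dedicated embedding lookup layers.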
### Learning Word Embeddings: Word2vec & GloVe
#### Learning word embeddings
- In the history of deep learning as applied to learning word embeddings, people actually started off with relatively complex algorithms. And then over time, researchers discovered they can use simpler and simpler and simpler algorithms and still get very good results especially for a large dataset.
- A more complex algorithm: a neural language model, by Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin: [A Neural Probabilistic Language Model](http://www.iro.umontreal.ca/~lisa/pointeurs/BengioDucharmeVincentJauvin_jmlr.pdf).
- Let's start to build a neural network to predict the next word in the sequence below.
```
I want a glass of orange ______.
4343 9665 1 3852 6163 6257
```
![n-lm](../_resources/neural-lm-embedding.png)
- If we have a fixed historical window of 4 words (4 is a hyperparameter), then we take the four embedding vectors and stack them together, and feed them into a neural network, and then feed this neural network output to a softmax, and the softmax classifies among the 10,000 possible outputs in the vocab for the final word we're trying to predict. These two layers have their own parameters W1,b1 and W2, b2.
- This is one of the earlier and pretty successful algorithms for learning word embeddings.
- A more generalized algorithm.
- We have a longer sentence: `I want a glass of orange juice to go along with my cereal`. The task is to predict the word `juice` in the middle.
- If the goal is to build a language model, then it is natural for the context to be a few words right before the target word. But if your goal isn't to learn the language model per se, then you can choose other contexts.
- Contexts:
- Last 4 words: described previously.
- 4 words on left & right: `a glass of orange ___ to go along with`
- Last 1 word: `orange`, a much simpler context.
- Nearby 1 word: `glass`. This is the idea of a **Skip-Gram** model, which works surprisingly well.
- If your main goal is really to learn a word embedding, then you can use all of these other contexts and they will result in very meaningful word embeddings as well.
#### Word2Vec
Paper: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.
*The Skip-Gram model*:
- In the skip-gram model, what we're going to do is come up with a few context-to-target pairs to create our supervised learning problem.
- So rather than having the context always be the last four words or the last n words immediately before the target word, what I'm going to do is, say, randomly pick a word to be the context word. And let's say we chose the word `orange`.
- What we're going to do is randomly pick another word within some window, say plus or minus five words or plus or minus ten words of the context word, and choose that to be the target word.
- maybe by chance pick `juice` to be the target word, which is just one word later.
- maybe `glass`, two words before.
- maybe `my`.
| Context | Target |
| --- | --- |
| orange | juice |
| orange | glass |
| orange | my |
- And so we'll set up a supervised learning problem where given the context word, you're asked to predict what is a randomly chosen word within say, a ±10 word window, or a ±5 word window of the input context word.
- This is not a very easy learning problem, because within ±10 words of the word `orange`, it could be a lot of different words.
- But the goal of setting up this supervised learning problem isn't to do well on the supervised learning problem per se. It is that we want to use this learning problem to learn good word embeddings.
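The pair-generation procedure above can be sketched in a few lines. This is a simplified illustration with a hypothetical helper `skipgram_pairs`: it uses every position in turn as the context word instead of sampling context words at random, and it draws one target per context.

```python
import random

def skipgram_pairs(tokens, window=5, pairs_per_context=1, seed=0):
    """Build (context, target) pairs with the target drawn from a +/- window."""
    rng = random.Random(seed)
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens) - 1, i + window)
        candidates = [j for j in range(lo, hi + 1) if j != i]  # positions near i
        for _ in range(pairs_per_context):
            j = rng.choice(candidates)
            pairs.append((context, tokens[j]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
pairs = skipgram_pairs(sentence, window=5)
for context, target in pairs:
    print(context, "->", target)
```

Each printed pair is one training example for the supervised problem: the context word is the input and the target word is the label.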
*Model details*:
- Context `c`: `orange` and target `t`: `juice`.
- o<sub>c</sub> --> E --> e<sub>c</sub> --> O (softmax) --> ŷ. This is a small neural network that basically looks up the embedding and then applies a softmax unit.
- Softmax: ![p(t|c)](../_resources/softmax.svg), 𝜃<sub>t</sub>: parameter associated with output `t`. (bias term is omitted)
- Loss: L(ŷ,y) = -sum(y<sub>i</sub>logŷ<sub>i</sub>)
- So this is called the skip-gram model because it's taking as input one word like `orange` and then trying to predict some words skipping a few words from the left or the right side.
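The model details above amount to an embedding lookup followed by a softmax over the vocabulary, p(t|c) = exp(𝜃<sub>t</sub><sup>T</sup>e<sub>c</sub>) / Σ<sub>j</sub> exp(𝜃<sub>j</sub><sup>T</sup>e<sub>c</sub>). A minimal sketch of the forward pass and loss follows; the toy vocabulary size, embedding size and word indices are assumptions, and the bias term is omitted as in the notes.

```python
import numpy as np

def skipgram_forward(c, t, E, Theta):
    """Forward pass for one (context, target) pair.

    E:     (emb_dim, vocab_size) embedding matrix
    Theta: (vocab_size, emb_dim) softmax parameters, one row theta_j per output word
    """
    e_c = E[:, c]                         # embedding lookup for the context word
    logits = Theta @ e_c                  # theta_j^T e_c for every word j
    logits = logits - logits.max()        # stabilize the exponentials
    y_hat = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(y_hat[t])              # cross-entropy with a one-hot target
    return y_hat, loss

rng = np.random.default_rng(0)
vocab_size, emb_dim = 50, 8               # toy sizes, not the 10,000 words of the notes
E = rng.standard_normal((emb_dim, vocab_size)) * 0.1
Theta = rng.standard_normal((vocab_size, emb_dim)) * 0.1

y_hat, loss = skipgram_forward(c=12, t=7, E=E, Theta=Theta)
print(y_hat.shape, float(loss))
```

The sum in the denominator runs over every word in the vocabulary, which is the cost discussed next.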
*Model problem*:
- Computational speed: in the softmax step, every time evaluating the probability, you need to carry out a sum over all 10,000, maybe even larger 1,000,000, words in your vocabulary. It gets really slow to do that every time.
*Hierarchical softmax classifier*:
- Hierarchical softmax classifier is one of a few solutions to the computational problem.
- Instead of trying to categorize something into all 10,000 categories in one go, imagine you have one classifier that tells you whether the target word is in the first 5,000 words of the vocabulary or in the second 5,000 words, then another that narrows it down further, and so on, until eventually you classify exactly which word it is at a leaf of this tree.
- The main advantage is that instead of evaluating `W` output nodes in the neural network to obtain the probability distribution, only about `log2(W)` nodes need to be evaluated.
- In practice, the hierarchical softmax classifier doesn't use a perfectly balanced tree or perfectly symmetric tree. The hierarchical softmax classifier can be developed so that the common words tend to be on top, whereas the less common words like durian can be buried much deeper in the tree.
*How to sample context `c`*:
- One thing you could do is just sample uniformly, at random, from your training corpus.
- When we do that, you find that there are some words like `the, of, a, and, to` and so on that appear extremely frequently.
- Your context-to-target pairs would then contain these types of words extremely frequently, whereas other words like `orange`, `apple`, and also `durian` don't appear that often.
- In practice the distribution of words `p(c)` isn't taken just entirely uniformly at random for the training set purpose, but instead there are different heuristics that you could use in order to balance out something from the common words together with the less common words.
*CBOW*:
The other version of the Word2Vec model is CBOW, the continuous bag of words model, which takes the surrounding context of a middle word and uses those surrounding words to try to predict the middle word. This algorithm also works, and it has its own advantages and disadvantages.
#### Negative Sampling
Paper: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546) by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean.
Negative sampling is a modified learning problem to do something similar to the Skip-Gram model with a much more efficient learning algorithm.
- `I want a glass of orange juice to go along with my cereal.`
- To create a new supervised learning problem: given a pair of words like `orange, juice`, we're going to predict whether or not it is a context-target pair.
- First, generate a positive example. Sample a context word, like `orange` and a target word, `juice`, associate them with a label of `1`.
- Then generate negative examples. Take `orange` and pick another random word from the dictionary for `k` times.
- Choose large values of k for smaller data sets, like 5 to 20, and smaller k for large data sets, like 2 to 5.
- In this example, k=4. x=`(context, word)`, y=`target`.
| context | word | target? |
| --- | --- | --- |
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |
- Compared to the original Skip-Gram model: instead of training all 10,000 output units on every iteration, which is very expensive, we're only going to train five, or k+1, of them. Training k+1 binary classification problems is relatively cheap compared with updating a 10,000-way softmax classifier.
- *How to choose the negative examples*?
- One thing you could do is sample it according to the empirical frequency of words in your corpus. The problem is you end up with a very high representation of words like 'the', 'of', 'and', and so on.
- The other extreme would be to use `p(w)=1/|V|` to sample the negative examples uniformly at random. This is also very non-representative of the distribution of English words.
- The paper chooses a method somewhere in between: ![negative-sampling-p](../_resources/negative-sampling-p.svg). f(w<sub>i</sub>) is the observed frequency of word w<sub>i</sub>.
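The in-between heuristic from the paper raises each word frequency to the power 3/4 and renormalizes, P(w<sub>i</sub>) = f(w<sub>i</sub>)<sup>3/4</sup> / Σ<sub>j</sub> f(w<sub>j</sub>)<sup>3/4</sup>. The sketch below builds that distribution and one set of k+1 training examples; the vocabulary and counts are made-up toy values, and the helper names are hypothetical.

```python
import numpy as np

def negative_sampling_distribution(counts):
    """P(w_i) proportional to f(w_i)^(3/4), between empirical and uniform sampling."""
    weights = np.asarray(counts, dtype=float) ** 0.75
    return weights / weights.sum()

def make_examples(context, target, vocab, probs, k, rng):
    """One positive pair (label 1) followed by k sampled negative pairs (label 0)."""
    examples = [(context, target, 1)]
    negatives = rng.choice(len(vocab), size=k, p=probs)
    examples += [(context, vocab[i], 0) for i in negatives]
    return examples

# Toy vocabulary with made-up corpus counts.
vocab = ["the", "of", "orange", "juice", "king", "book"]
counts = [1000, 800, 10, 8, 5, 12]
probs = negative_sampling_distribution(counts)

rng = np.random.default_rng(0)
examples = make_examples("orange", "juice", vocab, probs, k=4, rng=rng)
for row in examples:
    print(row)
```

A sampled negative can occasionally be a word that really does appear near the context word. This sketch does not filter such cases, on the assumption that they are rare enough not to matter.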
#### GloVe word vectors
Paper: [GloVe: Global Vectors for Word Representation](http://nlp.stanford.edu/pubs/glove.pdf)
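- X<sub>ij</sub>: number of times word j appears in the context of word i. (Think of X<sub>ij</sub> as X<sub>ct</sub>.)
- X<sub>ij</sub> = X<sub>ji</sub> when context and target are defined symmetrically, for example as words within ±10 positions of each other.
- If the context is instead always the word immediately before the target word, then X<sub>ij</sub> is not symmetric.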
- For the GloVe algorithm, define context and target as whether or not the two words appear in close proximity, say within ±10 words of each other. So, X<sub>ij</sub> is a count that captures how often do words i and j appear with each other or close to each other.
- Model: ![glove-model](../_resources/glove-model.svg).
- 𝜃<sub>i</sub><sup>T</sup>e<sub>j</sub> plays the role of 𝜃<sub>t</sub><sup>T</sup>e<sub>c</sub> in the previous sections.
- We just want to learn vectors so that their inner product is a good predictor for how often the two words occur together.
- There are various heuristics for choosing the weighting function `f` so that it neither gives very frequent words too much weight nor gives infrequent words too little weight.
- f(X<sub>ij</sub>) = 0 if X<sub>ij</sub> = 0 to make sure 0log0=0
- One way to train the algorithm is to initialize `theta` and `e` both uniformly at random, run gradient descent to minimize the objective, and then, when you're done, take the average of the two for every word.
- For a given word w, you can set e<sup>final</sup><sub>w</sub> = (e<sub>w</sub> + 𝜃<sub>w</sub>) / 2, where both vectors were trained through this gradient descent procedure, because `theta` and `e` in this particular formulation play **symmetric** roles, unlike the earlier models we saw in the previous videos, where theta and e play different roles and couldn't just be averaged like that.
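As a rough sketch of the objective described above, the code below evaluates the weighted squared error Σ<sub>i,j</sub> f(X<sub>ij</sub>)(𝜃<sub>i</sub><sup>T</sup>e<sub>j</sub> + b<sub>i</sub> + b'<sub>j</sub> − log X<sub>ij</sub>)² on a toy co-occurrence matrix and then averages the two sets of vectors. The weighting function with `x_max=100` and `alpha=0.75` follows the GloVe paper; the matrix, sizes and variable names are assumptions for illustration, and no training loop is shown.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): zero when X_ij = 0, capped at 1 for frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, Theta, E, b_theta, b_e):
    """Sum over i, j of f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2."""
    mask = X > 0
    log_X = np.zeros_like(X, dtype=float)
    log_X[mask] = np.log(X[mask])          # leave 0 where X_ij = 0; f(0) = 0 removes the term
    pred = Theta @ E.T + b_theta[:, None] + b_e[None, :]
    return float(np.sum(glove_weight(X) * (pred - log_X) ** 2))

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4                 # toy sizes
counts = rng.integers(0, 5, size=(vocab_size, vocab_size)).astype(float)
X = counts + counts.T                      # symmetric co-occurrence counts

Theta = rng.uniform(-0.5, 0.5, (vocab_size, emb_dim))  # row i is theta_i
E = rng.uniform(-0.5, 0.5, (vocab_size, emb_dim))      # row j is e_j
b_theta = np.zeros(vocab_size)
b_e = np.zeros(vocab_size)

loss = glove_loss(X, Theta, E, b_theta, b_e)
e_final = (E + Theta) / 2.0                # average the two symmetric sets of vectors
print(loss, e_final.shape)
```

In a real run, gradient descent would update `Theta`, `E` and the biases to reduce `loss` before the final averaging step.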
Conclusion:
- The way the inventors ended up with this algorithm was by building on the history of much more complicated algorithms like the neural language model; later came the Word2Vec skip-gram model, and then GloVe.
- But when you learn a word embedding using one of the algorithms that we've seen, such as GloVe, you cannot guarantee that the individual components of the embeddings are interpretable. For any invertible matrix A, replacing 𝜃<sub>i</sub> with A𝜃<sub>i</sub> and e<sub>j</sub> with A<sup>-T</sup>e<sub>j</sub> leaves 𝜃<sub>i</sub><sup>T</sup>e<sub>j</sub> unchanged, so the learned axes can be an arbitrary linear transformation of any human-interpretable axes.
- Despite this type of linear transformation, the parallelogram map that we worked out when describing analogies still works.
### Applications using Word Embeddings
#### Sentiment Classification
| comments | stars |
| --- | --- |
| The dessert is excellent. | 4 |
| Service was quite slow. | 2 |
| Good for a quick meal, but nothing special. | 3 |
| Completely lacking in good taste, good service, and good ambience. | 1 |
*A simple sentiment classification model*:
![sentiment-model](../_resources/sentiment-model-simple.png)
- One of the challenges of sentiment classification is that you might not have a huge labeled data set.
- If the word embedding was trained on a very large data set, like a hundred billion words, then this allows you to take a lot of knowledge even from infrequent words and apply it to your problem, even for words that weren't in your labeled training set.
- Notice that by using the average operation here, this particular algorithm works for reviews that are short or long: even if a review is 100 words long, you can just sum or average the feature vectors for all hundred words, and that gives you a 300-dimensional feature representation that you can then pass into your sentiment classifier.
- One of the problems with this algorithm is it **ignores word order**.
- "Completely *lacking* in *good* taste, *good* service, and *good* ambiance".
- This is a very negative review. But the word good appears a lot.
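The averaging model and its blind spot can be sketched as follows. The embedding matrix and classifier weights here are random and untrained, and the function name is hypothetical; the point is only to show that the averaged representation, and therefore the prediction, does not change when the words are reordered.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def average_embedding_classifier(words, word_to_index, E, W, b):
    """Average the word embeddings, then apply a softmax over the 5 star ratings."""
    idx = [word_to_index[w] for w in words]
    avg = E[:, idx].mean(axis=1)            # (emb_dim,) regardless of review length
    return softmax(W @ avg + b)

rng = np.random.default_rng(0)
review = "completely lacking in good taste good service and good ambience".split()
vocab = sorted(set(review))
word_to_index = {w: i for i, w in enumerate(vocab)}

emb_dim, n_classes = 300, 5
E = rng.standard_normal((emb_dim, len(vocab)))         # untrained toy embeddings
W = rng.standard_normal((n_classes, emb_dim)) * 0.01   # untrained classifier weights
b = np.zeros(n_classes)

p_original = average_embedding_classifier(review, word_to_index, E, W, b)
p_reordered = average_embedding_classifier(sorted(review), word_to_index, E, W, b)
order_is_ignored = np.allclose(p_original, p_reordered)
print(order_is_ignored)
```

Because `good` contributes three of the ten vectors in the average, a trained version of this model could easily be pulled toward a positive rating for this review.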
*A more sophisticated model*:
![sentiment-model-rnn](../_resources/sentiment-model-rnn.png)
- Instead of just summing all of your word embeddings, you can instead use an RNN for sentiment classification.
- In the graph, the one-hot vector representation is skipped.
- This is an example of a many-to-one RNN architecture.
#### Debiasing word embeddings
Paper: [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://arxiv.org/abs/1607.06520)
Word embeddings may carry biases such as gender bias, ethnicity bias and so on, because they reflect the text they were trained on. Just as word embeddings can learn analogies like man is to woman as king is to queen, the paper shows that a learned word embedding might output:
```
Man: Computer_Programmer as Woman: Homemaker
```
Learning algorithms are making very important decisions and so I think it's important that we try to change learning algorithms to diminish as much as is possible, or, ideally, eliminate these types of undesirable biases.
- *Identify bias direction*
- The first thing we're going to do is to identify the direction corresponding to a particular bias we want to reduce or eliminate.
- And take a few of these differences and basically average them. And this will allow you to figure out in this case that what looks like this direction is the gender direction, or the bias direction. Suppose we have a 50-dimensional word embedding.
- g<sub>1</sub> = e<sub>she</sub> \- e<sub>he</sub>
- g<sub>2</sub> = e<sub>girl</sub> \- e<sub>boy</sub>
- g<sub>3</sub> = e<sub>mother</sub> \- e<sub>father</sub>
- g<sub>4</sub> = e<sub>woman</sub> \- e<sub>man</sub>
- g = average(g<sub>1</sub>, g<sub>2</sub>, g<sub>3</sub>, g<sub>4</sub>, ...) is the gender vector.
- Then we have
- `cosine_similarity(sophie, g) = 0.318687898594`
- `cosine_similarity(john, g) = -0.23163356146`
- Female names tend to have a positive similarity with the gender vector (which points from male to female), whereas male names tend to have a negative similarity. This is not surprising.
- But we also have
- `cosine_similarity(computer, g) = -0.103303588739`
- `cosine_similarity(singer, g) = 0.185005181365`
- It is astonishing how these results reflect certain unhealthy gender stereotypes.
- The bias direction can be higher than 1-dimensional. Rather than taking an average, SVD (singular value decomposition) and PCA might help.
- *Neutralize*
- For every word that is not definitional, project to get rid of bias.
![embedding-debiased](../_resources/embedding_debiased.svg)
- *Equalize pairs*
- In the final equalization step, what we'd like to do is to make sure that words like grandmother and grandfather are both exactly the same similarity, or exactly the same distance, from words that should be gender neutral, such as babysitter or such as doctor.
- The key idea behind equalization is to make sure that a particular pair of words are equi-distant from the 49-dimensional g⊥.
![equalize](../_resources/equalize.png)
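The bias-direction and neutralize steps above can be sketched with plain vector operations. The embeddings below are random 50-dimensional stand-ins, not trained vectors, and the helper names are hypothetical. The neutralize step subtracts the projection of a word onto g, e<sup>debiased</sup> = e − (e·g / ‖g‖²) g. The equalize step is not shown.

```python
import numpy as np

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def bias_direction(pairs, emb):
    """Average the difference vectors, e.g. e_she - e_he, e_girl - e_boy, ..."""
    diffs = [emb[a] - emb[b] for a, b in pairs]
    return np.mean(diffs, axis=0)

def neutralize(e, g):
    """Remove the component of e that lies along the bias direction g."""
    e_bias = (e @ g) / (g @ g) * g
    return e - e_bias

# Random stand-in embeddings; a real experiment would load trained word vectors.
rng = np.random.default_rng(0)
words = ["she", "he", "girl", "boy", "mother", "father", "woman", "man", "doctor"]
emb = {w: rng.standard_normal(50) for w in words}

pairs = [("she", "he"), ("girl", "boy"), ("mother", "father"), ("woman", "man")]
g = bias_direction(pairs, emb)

e_doctor = neutralize(emb["doctor"], g)
residual = abs(cosine_similarity(e_doctor, g))
print(residual < 1e-10)
```

After neutralizing, the word vector is orthogonal to g up to rounding error, so its cosine similarity with the bias direction is essentially zero.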
## Week 3: Sequence models & Attention mechanism
> Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs. This week, you will also learn about speech recognition and how to deal with audio data.
### Various sequence to sequence architectures
#### Basic Models
This week covers sequence-to-sequence models, which are useful for everything from machine translation to speech recognition.
- Machine translation
- Papers:
- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) by Ilya Sutskever, Oriol Vinyals, Quoc V. Le.
- [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078) by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio.
- Given an input French sentence, `Jane visite l'Afrique en septembre`, we want to translate it to the English sentence `Jane is visiting Africa in September`.
- First, let's have a network, which we're going to call the encoder network, built as an RNN (this could be a GRU or an LSTM), and feed in the input French words one word at a time. After ingesting the input sequence, the RNN outputs a vector that represents the input sentence.
- After that, you can build a decoder network which takes as input the encoding and then can be trained to output the translation one word at a time until eventually it outputs the end of sequence.
- The model simply uses an encoder network to find an encoding of the input French sentence and then uses a decoder network to generate the corresponding English translation.
![translation-seq-seq](../_resources/seq-seq.png)
- Image Captioning
- This architecture is very similar to the one of machine translation.
- Paper: [Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)](https://arxiv.org/abs/1412.6632) by Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille.
- In Course 4, you've seen how you can input an image into a convolutional network, maybe a pre-trained AlexNet, and have that learn an encoding or learn a set of features of the input image.
- In the AlexNet architecture, if we get rid of the final softmax unit, the pre-trained AlexNet can give you a 4096-dimensional feature vector with which to represent this picture of a cat. This pre-trained network can be the **encoder** network for the image, and you now have a 4096-dimensional vector that represents the image. You can then take this and feed it to an RNN, whose job it is to generate the caption one word at a time.
![image-captioning](../_resources/image-caption.png)
#### Picking the most likely sentence
There are some similarities between the sequence-to-sequence machine translation model and the language models that you worked with in the first week of this course, but there are some significant differences as well.
- The machine translation is very similar to a **conditional** language model.
- You can use a language model to estimate the probability of a sentence.
- The decoder network of the machine translation model looks pretty much identical to the language model, except that instead of always starting with the vector of all zeros, it has an encoder network that figures out some representation for the input sentence.
- Instead of modeling the probability of any sentence, it is now modeling the probability of the output English translation conditioned on some input French sentence. In other words, you're trying to estimate the probability of an English translation.
![mt-as-conditional-lm](../_resources/mt-conditional-lm.png)
- The difference between machine translation and the earlier language model problem is: rather than wanting to generate a sentence at random, you may want to try to find the most likely English translation.
- In developing a machine translation system, one of the things you need to do is come up with an algorithm that can actually find the value of y that maximizes p(y<sup>&lt;1&gt;</sup>,...,y<sup>&lt;T_y&gt;</sup>|x<sup>&lt;1&gt;</sup>,...,x<sup>&lt;T_x&gt;</sup>). The most common algorithm for doing this is called **beam search**.
- The set of all English sentences of a certain length is too large to exhaustively enumerate, because the number of possible word combinations grows exponentially with sentence length. It also turns out that the greedy approach, where you just pick the best first word, then the best second word given the first, then the best third word, and so on, doesn't really work, because the locally most likely word need not lead to the most likely sentence.
- The most common thing to do is use an approximate search algorithm. An approximate search algorithm will try to pick the sentence y that maximizes that conditional probability, although it won't always succeed.
#### Beam Search
Take the example of the French sentence `Jane visite l'Afrique en septembre`:
- Step 1: pick the first word of the English translation.
- Set `beam width B = 3`.
- Choose the most likely **three** possibilities for the first words in the English outputs. Then Beam search will store away in computer memory that it wants to try all of three of these words.
- Run the input French sentence through the encoder network, and then the first step of the decoder network produces a softmax output over all 10,000 possibilities (if we have a vocabulary of 10,000 words). Then you would take those 10,000 possible outputs p(y<sup>&lt;1&gt;</sup>|x) and keep in memory the *top three*.
- For example, after this step, we have the three words as `in, Jane, September`.
![beam-search-step1](../_resources/beam-search-1.png)
- Step 2: consider the next word.
- Find the pair of first and second words that is most likely, not just the second word that is most likely. By the rules of conditional probability, it's p(y<sup>&lt;1&gt;</sup>,y<sup>&lt;2&gt;</sup>|x) = p(y<sup>&lt;1&gt;</sup>|x) * p(y<sup>&lt;2&gt;</sup>|x,y<sup>&lt;1&gt;</sup>).
- After this step, `in september`, `jane is` and `jane visits` are left. Notice that `September` has been rejected as a candidate for the first word.
- Because `beam width` is equal to 3, every step you instantiate three copies of the network to evaluate these partial sentence fragments and the output.
- Repeat this step until terminated by the end of sentence symbol.
![beam-search-step2](../_resources/beam-search-2.png)
- If beam width is 1, this essentially becomes the greedy search algorithm.
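The two steps above generalize to a short loop. The sketch below is a toy illustration, not a production decoder: the lookup table `TABLE` stands in for the decoder softmax p(y<sup>&lt;t&gt;</sup>|x, y<sup>&lt;1&gt;</sup>, ..., y<sup>&lt;t-1&gt;</sup>) and its probabilities are invented so that greedy search and beam search disagree. Scores are sums of log-probabilities, and finished sentences are set aside when they emit `<EOS>`.

```python
import math

# Toy stand-in for the decoder: next-word probabilities for a given prefix.
TABLE = {
    (): {"in": 0.5, "jane": 0.4, "september": 0.1},
    ("in",): {"september": 0.4, "jane": 0.35, "<EOS>": 0.25},
    ("jane",): {"visits": 0.9, "<EOS>": 0.1},
    ("jane", "visits"): {"africa": 0.9, "<EOS>": 0.1},
}

def next_log_probs(prefix):
    """Log-probabilities of the next word; unknown prefixes simply end the sentence."""
    probs = TABLE.get(prefix, {"<EOS>": 1.0})
    return {word: math.log(p) for word, p in probs.items()}

def beam_search(next_log_probs, beam_width=3, max_len=10, eos="<EOS>"):
    beams = [((), 0.0)]                      # (partial sentence, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:          # expand every kept partial sentence
            for word, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (word,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # keep only the top B
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_beam, score_beam = beam_search(next_log_probs, beam_width=3)
best_greedy, score_greedy = beam_search(next_log_probs, beam_width=1)
print(best_beam, math.exp(score_beam))
print(best_greedy, math.exp(score_greedy))
```

With `beam_width=1` the search commits to `in` because it is the single most likely first word, and it ends up with a less likely sentence than the one found with `beam_width=3`.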
#### Refinements to Beam Search
- *Length normalization:*
- Beam search is to maximize the probability:
![beam-search-p](../_resources/beam-search-p.svg)
- But multiplying a lot of numbers less than 1 will result in a very tiny number, which can result in numerical underflow.
- So instead, we maximize a log version:
![beam-search-logp](../_resources/beam-search-logp.svg)
- If you have a very long sentence, the probability of that sentence is going to be low, because you're multiplying many terms less than 1. So the objective function (the original version as well as the log version) has the undesirable effect of unnaturally preferring very short translations.
- A normalized log-likelihood objective:
![beam-search-normalize](../_resources/beam-search-norm.svg)
- 𝛼 is another hyperparameter
- 𝛼=0 no normalizing
- 𝛼=1 full normalization
- *How to choose beam width B?*
- If beam width is large:
- consider a lot of possibilities, so better result
- consuming a lot of different options, so slower and memory requirements higher
- If beam width is small:
- worse result
- faster, memory requirements lower
- choice of beam width is application dependent and domain dependent
- In practice, B=10 is common in a production system, whereas B=100 is uncommon.
- B=1000 or B=3000 is not uncommon for research systems.
- But when B gets very large, there are often diminishing returns.
- Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find exact maximum for argmax<sub>y</sub>𝑃(𝑦|𝑥).
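The effect of length normalization can be checked with a few numbers. The sketch below scores a candidate as (1 / T<sub>y</sub><sup>𝛼</sup>) Σ<sub>t</sub> log p(y<sup>&lt;t&gt;</sup>|x, y<sup>&lt;1&gt;</sup>, ..., y<sup>&lt;t-1&gt;</sup>). The per-word probabilities are invented, and the default `alpha=0.7` is only an illustrative value between the two extremes listed above.

```python
import math

def normalized_score(step_probs, alpha=0.7):
    """Sum of per-word log-probabilities divided by (sentence length ** alpha)."""
    log_likelihood = sum(math.log(p) for p in step_probs)
    return log_likelihood / (len(step_probs) ** alpha)

# Invented per-word probabilities for a short and a long candidate translation.
short = [0.3, 0.3]
long_ = [0.6, 0.6, 0.6, 0.6, 0.6]

for alpha in (0.0, 0.7, 1.0):
    s, l = normalized_score(short, alpha), normalized_score(long_, alpha)
    print(alpha, round(s, 3), round(l, 3), "short wins" if s > l else "long wins")
```

With 𝛼 = 0 the shorter candidate scores higher even though each of its words is less likely. With normalization the longer candidate is preferred.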
#### Error analysis in beam search
- Beam search is an approximate search algorithm, also called a heuristic search algorithm. And so it doesn't always output the most likely sentence.
- In order to know whether it is the beam search algorithm that's causing problems and worth spending time on, or whether it might be the RNN model that's causing problems and worth spending time on, we need to do error analysis with beam search.
- Getting more training data or increasing the beam width might not get you to the level of performance you want.
- You should break the problem down and figure out what's actually a good use of your time.
- *The error analysis process:*
- Problem:
- To translate: `Jane visite l'Afrique en septembre.` (x)
- Human: `Jane visits Africa in September.` (y<sup>*</sup>)
- Algorithm: `Jane visited Africa last September.` (ŷ) which has some error.
- Analysis:
- Case 1:
| Human | Algorithm | p(y<sup>*</sup>\|x) vs p(ŷ\|x) | At fault? |
| --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | p(y<sup>*</sup>\|x) > p(ŷ\|x) | Beam search |
| ... | ... | ... | ... |
- Case 2:
| Human | Algorithm | p(y<sup>*</sup>\|x) vs p(ŷ\|x) | At fault? |
| --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | p(y<sup>*</sup>\|x) ≤ p(ŷ\|x) | RNN |
| ... | ... | ... | ... |
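
The two cases can be turned into a small bookkeeping helper. This is a sketch only: the function names are made up, and the probabilities would come from running the RNN on both the human translation y<sup>*</sup> and the algorithm output ŷ for every dev-set error (the list is assumed to be non-empty).

```python
from collections import Counter


def at_fault(p_y_star, p_y_hat):
    """Attribute one dev-set error using the two model probabilities."""
    # Case 1: the model prefers the human translation, yet beam search missed it.
    if p_y_star > p_y_hat:
        return "Beam search"
    # Case 2: the model scores the worse translation at least as high.
    return "RNN"


def fault_fractions(examples):
    """examples: iterable of (p_y_star, p_y_hat) pairs, one per dev-set error."""
    counts = Counter(at_fault(p_star, p_hat) for p_star, p_hat in examples)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("Beam search", "RNN")}
```

The resulting fractions indicate whether time is better spent on the search (for example a larger beam width) or on the model.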
#### Bleu Score (optional)
Paper: [BLEU: a Method for Automatic Evaluation of Machine Translation](https://www.aclweb.org/anthology/P02-1040.pdf) by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.
BLEU stands for **bilingual evaluation understudy**.
- The BLEU score was revolutionary for machine translation because it gave a pretty good, though by no means perfect, single real number evaluation metric, and that accelerated the progress of the entire field of machine translation.
- The intuition behind the BLEU score is we're going to look at the machine generated output and see if the types of words it generates appear in at least one of the human generated references. And so these human generated references would be provided as part of the dev set or as part of the test set.
- One way to measure how good the machine translation output is, is to look at each of the words in the output and see if it appears in the references.
- An extreme example:
- French: `Le chat est sur le tapis.`
- Reference 1: `The cat is on the mat.`
- Reference 2: `There is a cat on the mat.`
- MT output: `the the the the the the the.`
- Precision: 7/7. This is not a particularly useful measure because it seems to imply that this MT output has very high precision.
- Instead, what we're going to use is a modified precision measure in which we will give each word credit only up to the maximum number of times it appears in the reference sentences.
- Modified precision: 2/7. The numerator is the number of times the word `the` appears in the MT output, clipped at 2, the maximum number of times it appears in any single reference.
- In the BLEU score, you don't want to just look at isolated words. You maybe want to look at pairs of words as well. Let's define a portion of the BLEU score on bigrams.
- MT output: `The cat the cat on the mat.`
| Bigram | Count | Count<sub>clip</sub> |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |
| *sum* | 6 | 4 |
- Modified bigram precision: 4/6
- Generally, Bleu score on n-grams is defined as:
![ngram-precision](../_resources/bleu-ngram.svg)
- **Combined Bleu score** = ![bleu](../_resources/bleu-combined.svg)
- BP stands for **brevity penalty**, which prevents short sentences from scoring too high.
- `BP = 1`, if `MT_output_length > reference_output_length`, or
- `BP = exp(1 - reference_output_length / MT_output_length)`, otherwise.
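
The clipped counts and the brevity penalty can be checked with a short sketch. It assumes simple lower-cased, whitespace-split tokens without punctuation; the helper names are made up, and this is not a full BLEU implementation (it does not combine the n-gram precisions and does not handle an empty MT output).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(mt_len, ref_len):
    """BP = 1 if the MT output is longer than the reference, else exp(1 - ref/mt)."""
    return 1.0 if mt_len > ref_len else math.exp(1 - ref_len / mt_len)


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
unigram_p = modified_precision("the the the the the the the".split(), references, 1)
bigram_p = modified_precision("the cat the cat on the mat".split(), references, 2)
```

Here `unigram_p` reproduces the 2/7 from the extreme example and `bigram_p` the 4/6 from the bigram table.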
#### Attention Model Intuition
Paper: [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
You've been using an Encoder-Decoder architecture for machine translation, where one RNN reads in a sentence and then a different one outputs a sentence. There's a modification to this called the Attention Model that makes all this work much better.
The French sentence:
> Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente d'y aller aussi.
The English translation:
> Jane went to Africa last September, and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too.
- The way a human translator would translate this sentence is not to first read the whole French sentence and then memorize the whole thing and then regurgitate an English sentence from scratch. Instead, what the human translator would do is read the first part of it, maybe generate part of the translation, look at the second part, generate a few more words, look at a few more words, generate a few more words and so on.
![encoder-decoder](../_resources/encoder-decoder.png)
- The Encoder-Decoder architecture above works quite well for short sentences, so we might achieve a relatively high Bleu score, but for very long sentences, maybe longer than 30 or 40 words, the performance comes down. (The blue line)
![bleu-score-drop](../_resources/bleu-score-drop.png)
- The Attention model translates maybe a bit more like humans do, looking at part of the sentence at a time. With an Attention model, machine translation system performance can look like the green line above.
![attention-intuition](../_resources/attention-intuition.png)
- What the Attention Model computes is a set of attention weights. We use 𝛼<sup>&lt;1,1&gt;</sup> to denote, when you're generating the first word, how much attention you should pay to the first piece of input information; 𝛼<sup>&lt;1,2&gt;</sup> tells us, when we're trying to compute the first word, *Jane*, how much attention we're paying to the second word from the input; and likewise 𝛼<sup>&lt;1,3&gt;</sup> and so on.
- Together these determine exactly the context, denoted as `C`, that we should be paying attention to, and that is input to the RNN unit to try to generate the first word.
- In this way the RNN marches forward generating one word at a time, until eventually it generates maybe the `<EOS>`, and at every step there are **attention weights** 𝛼<sup>&lt;t,t'&gt;</sup> that tell it, when you're trying to generate the *t*-th English word, how much you should be paying attention to the *t'*-th French word.
#### Attention Model
- Assume you have an input sentence and you use a bidirectional RNN, or bidirectional GRU, or bidirectional LSTM to compute features on every word. In practice, GRUs and LSTMs are often used for this, with LSTMs maybe being more common. The notation for the Attention model is shown below.
![attention-model](../_resources/attention-model.png)
- Compute attention weights:
![attention-weights-alpha](../_resources/attention-alpha.svg)
- Compute e<sup>&lt;t,t'&gt;</sup> using a small neural network:
- The intuition is: if you want to decide how much attention to pay to the activation at t', what it should depend on the most is your own hidden state activation from the previous time step, s<sup>&lt;t-1&gt;</sup>. And then a<sup>&lt;t'&gt;</sup>, the features from time step t', is the other input.
- So it seems pretty natural that 𝛼<sup>&lt;t,t'&gt;</sup> and e<sup>&lt;t,t'&gt;</sup> should depend on s<sup>&lt;t-1&gt;</sup> and a<sup>&lt;t'&gt;</sup> . But we don't know what the function is. So one thing you could do is just train a very small neural network to learn whatever this function should be. And trust the backpropagation and trust gradient descent to learn the right function.
![attention-weights-alpha-e-nn](../_resources/attention-alpha-e-nn.png)
- One downside to this algorithm is that it takes quadratic time or quadratic cost to run. If you have T<sub>x</sub> words in the input and T<sub>y</sub> words in the output, then the total number of these attention parameters is going to be T<sub>x</sub> \* T<sub>y</sub>.
- Visualize the attention weights 𝛼<sup>&lt;t,t'&gt;</sup>:
![visualize-alpha](../_resources/attention-alpha-vis.png)
*Implementation tips*:
- The diagram on the left shows the attention model.
- The diagram on the right shows what one "attention" step does to calculate the attention variables 𝛼<sup>&lt;t,t'&gt;</sup>.
- The attention variables 𝛼<sup>&lt;t,t'&gt;</sup> are used to compute the context variable context<sup>&lt;t&gt;</sup> for each timestep in the output (t=1, ..., T<sub>y</sub>).
<center>Neural machine translation with attention</center>
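
A minimal sketch of one attention step, assuming the usual softmax form for the weights, 𝛼<sup>&lt;t,t'&gt;</sup> = exp(e<sup>&lt;t,t'&gt;</sup>) / Σ exp(e<sup>&lt;t,t'&gt;</sup>), and a weighted sum of the encoder features for the context. The function names are made up; in the real model the energies e<sup>&lt;t,t'&gt;</sup> come from the small neural network described above, here they are simply passed in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(energies, activations):
    """Compute the attention weights and the context for one output step t.

    energies:    e<t,t'> for t' = 1..Tx (one float per input position)
    activations: a<t'> feature vectors from the encoder (one list per position)
    """
    alphas = softmax(energies)  # attention weights, they sum to 1
    dim = len(activations[0])
    context = [sum(alpha * a[i] for alpha, a in zip(alphas, activations))
               for i in range(dim)]
    return alphas, context
```

Because this runs once per output step over all T<sub>x</sub> input positions, the overall cost is the T<sub>x</sub> \* T<sub>y</sub> mentioned above.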
### Speech recognition - Audio data
#### Speech recognition
- What is the speech recognition problem? You're given an audio clip, x, and your job is to automatically find a text transcript, y.
- One of the most exciting trends in speech recognition is that, once upon a time, speech recognition systems used to be built using *phonemes*, which were hand-engineered basic units of sound.
- Linguists used to hypothesize that writing down audio in terms of these basic units of sound called phonemes would be the best way to do speech recognition.
- But with end-to-end deep learning, we're finding that phoneme representations are no longer necessary. Instead, you can build systems that input an audio clip and directly output a transcript without needing to use hand-engineered representations like these.
- One of the things that made this possible was going to much larger data sets.
- Academic data sets on speech recognition might be 300 hours, and in academia, 3000 hour data sets of transcribed audio would be considered a reasonable size.
- But the best commercial systems are now trained on over 10,000 hours and sometimes over 100,000 hours of audio.
*How to build a speech recognition system?*
- **Attention model for speech recognition**: one thing you could do is use an attention model, where on the horizontal axis you take in different time frames of the audio input, and then the attention model tries to output the transcript like, "the quick brown fox".
![speech-recognition-attention](../_resources/speech-recognition-attention.png)
- **CTC cost for speech recognition**: Connectionist Temporal Classification
- Paper: [Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https://www.cs.toronto.edu/~graves/icml_2006.pdf) by Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber.
![CTC-rnn](../_resources/CTC-rnn.png)
- For simplicity, this is a simple uni-directional RNN, but in practice, this will usually be a bidirectional LSTM or bidirectional GRU, and usually a deeper model. But notice that the number of time steps here is very large, and in speech recognition the number of input time steps is usually much bigger than the number of output time steps.
- For example, if you have 10 seconds of audio and your features come at 100 hertz, so 100 samples per second, then a 10 second audio clip would end up with a thousand inputs. But your output might not have a thousand characters.
- The CTC cost function allows the RNN to generate an output like `ttt_h_eee___[]___qqq__`, here `_` is for "blank", `[]` for "space".
- The basic rule for the CTC cost function is to collapse repeated characters not separated by "blank".
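
The collapsing rule can be sketched as follows. This only illustrates how a raw output string maps to a transcript, not the CTC cost itself; the function name is made up, `_` is the blank, and a literal space character is used here instead of `[]`.

```python
from itertools import groupby


def ctc_collapse(output, blank="_"):
    """Collapse repeated characters not separated by blank, then drop blanks."""
    merged = (char for char, _ in groupby(output))  # "ttt" -> "t", "___" -> "_"
    return "".join(char for char in merged if char != blank)


transcript = ctc_collapse("ttt_h_eee___ ___qqq__")  # "the q"
```

A blank between two identical characters keeps them apart, so `aa_a` collapses to `aa` rather than `a`.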
#### Trigger Word Detection
- With the rise of speech recognition, there have been more and more devices you can wake up with your voice, and those are sometimes called *trigger word detection systems*.
![trigger-word-detection](../_resources/trigger-word-detection.png)
- The literature on trigger word detection algorithms is still evolving, so there isn't wide consensus yet on what's the best algorithm for trigger word detection.
- With an RNN, what we really do is take an audio clip, maybe compute spectrogram features, and that generates audio features x<sup>&lt;1&gt;</sup>, x<sup>&lt;2&gt;</sup>, x<sup>&lt;3&gt;</sup>, that you pass through an RNN. So all that remains to be done is to define the target labels y.
- In the training set, you can set the target labels to be zero for everything before the point where the trigger word has just been said, and right after that, set the target label to one. Then, if a little bit later on the trigger word is said again, you can again set the target label to one.
- This actually works reasonably well. One slight disadvantage is that it creates a very imbalanced training set, with a lot more zeros than ones.
- One other thing you could do, which is a little bit of a hack but could make the model a little bit easier to train, is instead of setting only a single time step to output one, you could make it output ones for several time steps. *Guide to label the positive/negative words*:
- Assume labels y<sup>&lt;t&gt;</sup> represent whether or not someone has just finished saying "activate."
- y<sup>&lt;t&gt;</sup> = 1 when the clip has just finished saying "activate".
- Given a background clip, we can initialize y<sup>&lt;t&gt;</sup> = 0 for all `t`, since the clip doesn't contain any "activates."
- When you insert or overlay an "activate" clip, you will also update labels for y<sup>&lt;t&gt;</sup>.
- Rather than updating the label of a single time step, we will update 50 steps of the output to have target label 1.
- Recall from the lecture on trigger word detection that updating several consecutive time steps can make the training data more balanced.
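
The labelling guide above can be sketched like this. The function name, the plain Python list for the labels and the example sizes are assumptions for illustration; the 50 steps come from the guide above.

```python
def insert_ones(y, segment_end_step, n_ones=50):
    """Set n_ones labels to 1 right after the step where "activate" ends.

    y is a list of 0/1 labels, one per output time step, modified in place.
    Steps that would fall beyond the end of the clip are skipped.
    """
    start = segment_end_step + 1
    for t in range(start, min(start + n_ones, len(y))):
        y[t] = 1
    return y


# Background clip without any "activate": all labels start at 0.
labels = [0] * 200
# Hypothetical overlay of an "activate" clip that ends at output step 100.
insert_ones(labels, 100)
```

After the call, steps 101 to 150 are labelled 1 and everything else stays 0.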
*Implementation tips*:
- Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
- Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
- An end-to-end deep learning approach can be used to build a very effective trigger word detection system.
* * *
Notes by Aaron © 2020

42
DeepLearning_AI/README.md Normal file
View File

@ -0,0 +1,42 @@
---
title: README
updated: 2022-05-16 18:26:58Z
created: 2022-05-16 17:43:56Z
---
# Deep Learning Specialization Course Notes
These are the notes for the [Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning) courses offered by [deeplearning.ai](https://www.deeplearning.ai/) on Coursera.
Introduction from the specialization page:
>In five courses, you will learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about Convolutional networks, RNNs, LSTM, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry. You will practice all these ideas in Python and in TensorFlow, which we will teach.
*The Specialization consists of five courses*:
- [Course 1: Neural Networks and Deep Learning](:/8b8d24c8270944829c58a2071481e8b7)
- [Week 1: Introduction to Deep Learning](:/8b8d24c8270944829c58a2071481e8b7#week-1-introduction-to-deep-learning)
- [Week 2: Neural Networks Basics](:/8b8d24c8270944829c58a2071481e8b7#week-2-neural-networks-basics)
- [Week 3: Shallow Neural Networks](:/8b8d24c8270944829c58a2071481e8b7#week-3-shallow-neural-networks)
- [Week 4: Deep Neural Networks](:/8b8d24c8270944829c58a2071481e8b7#week-4-deep-neural-networks)
- [Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization](:/eae2031754c04c308d51a81b8b0e4b1f)
- [Week 1: Practical aspects of Deep Learning](:/eae2031754c04c308d51a81b8b0e4b1f#week-1-practical-aspects-of-deep-learning)
- [Week 2: Optimization algorithms](:/eae2031754c04c308d51a81b8b0e4b1f#week-2-optimization-algorithms)
- [Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks](:/eae2031754c04c308d51a81b8b0e4b1f#week-3-hyperparameter-tuning-batch-normalization-and-programming-frameworks)
- [Course 3: Structuring Machine Learning Projects](:/955c977bae464e6aa23cd6f94a8461ed)
- [Week 1: ML Strategy (1)](:/955c977bae464e6aa23cd6f94a8461ed#week-1-ml-strategy-1)
- [Week 2: ML Strategy (2)](:/955c977bae464e6aa23cd6f94a8461ed#week-2-ml-strategy-2)
- [Course 4: Convolutional Neural Networks](:/1c1155ef678f4b41a1b0aa6fd36eabad)
- [Week 1: Foundations of Convolutional Neural Networks](:/1c1155ef678f4b41a1b0aa6fd36eabad#week-1-foundations-of-convolutional-neural-networks)
- [Week 2: Classic Networks](:/1c1155ef678f4b41a1b0aa6fd36eabad#week-2-classic-networks)
- [Week 3: Object detection](:/1c1155ef678f4b41a1b0aa6fd36eabad#week-3-object-detection)
- [Week 4: Special applications: Face recognition & Neural style transfer](:/1c1155ef678f4b41a1b0aa6fd36eabad#week-4-special-applications-face-recognition--neural-style-transfer)
- [Course 5: Sequence Models](:/8c3f644598994759a0d60b2a12997e60)
- [Week 1: Recurrent Neural Networks](:/8c3f644598994759a0d60b2a12997e60#week-1-recurrent-neural-networks)
- [Week 2: Natural Language Processing & Word Embeddings](:/8c3f644598994759a0d60b2a12997e60#week-2-natural-language-processing--word-embeddings)
- [Week 3: Sequence models & Attention mechanism](:/8c3f644598994759a0d60b2a12997e60#week-3-sequence-models--attention-mechanism)
[fancy-course-summary]: https://www.slideshare.net/TessFerrandez/notes-from-coursera-deep-learning-courses-by-andrew-ng
[math-html]: https://www.toptal.com/designers/htmlarrows/letters/

47
Markdown/Mermaid.md Normal file
View File

@ -0,0 +1,47 @@
---
title: Mermaid
updated: 2022-04-03 12:20:47Z
created: 2022-04-03 12:17:09Z
---
```mermaid
graph TD;
A-->B;
A-->C;
B-->D;
C-->D;
```
## sequenceDiagram
```mermaid
sequenceDiagram
participant Alice
participant Bob
Alice->John: Hello John, how are you?
loop Healthcheck
John->John: Fight against hypochondria
end
Note right of John: Rational thoughts <br/>prevail...
John-->Alice: Great!
John->Bob: How about you?
Bob-->John: Jolly good!
```
## Gantt Charts
```mermaid
gantt
dateFormat YYYY-MM-DD
title Adding GANTT diagram functionality to mermaid
section A section
Completed task :done, des1, 2014-01-06,2014-01-08
Active task :active, des2, 2014-01-09, 3d
Future task : des3, after des2, 5d
Future task2 : des4, after des3, 5d
section Critical tasks
Completed task in the critical line :crit, done, 2014-01-06,24h
Implement parser and jison :crit, done, after des1, 2d
Create tests for parser :crit, active, 3d
Future task in critical line :crit, 5d
Create tests for renderer :2d
Add to mermaid :1d
```

View File

@ -0,0 +1,11 @@
---
title: Docker vs lxc
updated: 2021-07-26 13:05:45Z
created: 2021-05-04 14:58:11Z
---
# Docker vs lxc
https://archives.flockport.com/lxc-vs-docker/
docker inspect --format '{{ .NetworkSettings.IPAddress }}' container_name_or_id

121
Overig/Docker/commands.md Normal file
View File

@ -0,0 +1,121 @@
---
title: commands
updated: 2021-11-20 17:25:07Z
created: 2021-05-04 14:58:11Z
---
# Docker
## remove all stopped containers
- docker rm $(docker ps -a -q)
### list and remove images
- docker images
- docker rmi $(docker images -a -q)
### Docker lifecycle
1. [docker create \[OPTIONS\] IMAGE \[COMMAND\] \[ARG...\]](https://docs.docker.com/engine/reference/commandline/create/) create but doesn't start
2. [docker rename CONTAINER NEW_NAME](https://docs.docker.com/engine/reference/commandline/rename/)
3. [docker run \[OPTIONS\] IMAGE \[COMMAND\] \[ARG...\]](https://docs.docker.com/engine/reference/commandline/run/) run and create in one statement
4. [docker rm](https://docs.docker.com/engine/reference/commandline/rm/)
docker image rm dd9acebe0b4d
5. [docker update](https://docs.docker.com/engine/reference/commandline/update/) update configuration of one or more containers
- docker logs -f
### docker ipaddress of running container
docker inspect <containernameorid> | grep '"IPAddress"' | head -n 1
### three basic commands
```bash
docker images            # same as: docker image ls
docker container ls -a   # -a also lists stopped containers
docker run <name>
```
### run tensorflow and jupyter at port 8888
docker run --rm -v $(pwd):/tf/convolutional -it -p 8888:8888 tensorflow/tensorflow:latest-jupyter
### run a python program directly
docker run --rm -v $(pwd):/src python:latest python /src/hello-world.py
### run interactive python shell
docker run --rm -it -v $(pwd):/src python:latest python
### run bash inside python container
docker run --rm -it -v $(pwd):/src python:latest /bin/bash
### run a daemon with option -d
```bash
docker run --rm --name my-postgres -e POSTGRES_PASSWORD=qw12aap -d postgres:latest
docker exec -it my-postgres psql -h localhost -U postgres -d postgres
```
### docker files
```dockerfile
FROM python:latest
RUN pip3 install numpy
CMD python3 /src/hello-world.py
```
### docker networks
Usage: docker network COMMAND
Commands:
connect Connect a container to a network
create Create a network
disconnect Disconnect a container from a network
inspect Display detailed information on one or more networks
ls List networks
prune Remove all unused networks
rm Remove one or more networks
```bash
docker network create net_1
docker run --rm -d --net net_1 --name my_py -v $(pwd):/src python:latest python3 /src/run.py
docker run --rm -it --net net_1 alpine:latest /bin/sh
docker network create net_2
docker run --rm --name my-postgres --network net_2 -e POSTGRES_PASSWORD=qw12aap -d postgres:latest
docker run -it --rm --name my_postgre2 --network net_2 postgres:latest /bin/bash
```
inside: psql -U postgres -h my-postgres
### Docker Compose
```docker-compose
version: '3'
services:
  python:
    image: python:latest
    container_name: my_py
    volumes:
      - .:/src
    command: python3 /src/run.py
    restart: always
  postgres:
    image: postgres:latest
    container_name: my_post
    environment:
      - POSTGRES_PASSWORD=qw12aap
    restart: always
  alpine:
    image: alpine:latest
    command: echo "hello from alpine"
    restart: always
```
[How To Remove Docker Images, Containers, and Volumes](https://www.digitalocean.com/community/tutorials/how-to-remove-docker-images-containers-and-volumes)

View File

@ -0,0 +1,9 @@
---
title: Akonadi server
updated: 2022-06-04 22:22:09Z
created: 2022-06-04 22:21:49Z
---
## Akonadi server
[https://userbase.kde.org/Akonadi/nl](https://userbase.kde.org/Akonadi/nl)

72
Overig/Linux/Bash.md Normal file
View File

@ -0,0 +1,72 @@
---
title: Bash
updated: 2022-04-27 17:38:16Z
created: 2021-05-04 14:58:11Z
---
# Bash
PROMPT_COMMAND='echo -n "writing the prompt at " && date'
HISTTIMEFORMAT='I ran this at: %d/%m/%y %T '
## CDPATH
As with the PATH variable, the CDPATH variable is a colon-separated list of paths. When you run a cd command with a relative path (i.e. one without a leading slash), by default the shell looks in your local folder for matching names. With CDPATH set, the shell looks in the paths you give it for the directory you want to change to.
If you set CDPATH up like this:
```bash
CDPATH=/:/lib
```
Then typing in:
```bash
cd /home
cd tmp
```
will always take you to /tmp no matter where you are.
Watch out, though: if you don't put the local (.) folder in the list, then you won't be able to create any other tmp folder and move to it as you normally would:
$ cd /home
$ mkdir tmp
$ cd tmp
$ pwd
/tmp
Oops!
This is similar to the confusion I felt when I realised the dot folder was not included in my more familiar PATH variable… but you should not add it to the PATH variable, because you can get tricked into running a fake command from some downloaded code.
Correct way:
```bash
CDPATH=.:/space:/etc:/var/lib:/usr/share:/opt
```
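
The lookup order behind the `tmp` surprise can be modelled with a small Python sketch. It is a simplification, not how bash is implemented: the function name is made up, the set of existing directories is passed in instead of touching the file system, and special cases such as targets starting with `./` or `../` are ignored.

```python
def resolve_cd(target, cdpath, cwd, existing_dirs):
    """Return the directory that `cd target` would pick, or None if nothing matches."""
    if target.startswith("/"):
        # Absolute paths never consult CDPATH.
        return target if target in existing_dirs else None
    bases = cdpath.split(":") if cdpath else []
    # CDPATH entries are tried in order; the current directory is the fallback.
    for base in bases + [cwd]:
        base = cwd if base in ("", ".") else base
        candidate = base.rstrip("/") + "/" + target
        if candidate in existing_dirs:
            return candidate
    return None


dirs = {"/tmp", "/home/tmp"}
without_dot = resolve_cd("tmp", "/:/lib", "/home", dirs)   # "/tmp"
with_dot = resolve_cd("tmp", ".:/:/lib", "/home", dirs)    # "/home/tmp"
```

Putting `.` first in CDPATH restores the usual behaviour of preferring a matching folder in the current directory.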
### Restart terminal
```bash
exec "$SHELL"
```
### SHLVL
This variable tracks how deeply nested you are in the bash shell.
```bash
echo $SHLVL
```
### LINENO
In an interactive shell, reports the number of commands that have been run in the session so far; inside a script or function, it is the current line number.
### TMOUT
If nothing is typed in for the number of seconds this is set to, then the shell will exit.
[direnv](https://github.com/direnv/direnv)

46
Overig/Linux/Ethernet.md Normal file
View File

@ -0,0 +1,46 @@
---
title: Ethernet
updated: 2022-04-27 17:47:40Z
created: 2021-05-04 14:58:11Z
---
# Ethernet
## Change mac address:
macchanger <device> -r
## local ip address:
hostname -I
## public ip address:
curl ipinfo.io/ip
## scan local network
sudo arp-scan --interface=enp4s0 --localnet
```bash
sudo service network-manager restart
sudo vi /etc/dhcp/dhclient.conf
```
edit "prepend domain-name-servers" for DNS servers
## Monitor wifi
wavemon
## available wifi networks
nmcli device wifi list
## saved connection profiles
nmcli connection show
## which DNS server in use:
( nmcli dev list || nmcli dev show ) 2>/dev/null | grep DNS
nm-tool | grep DNS
[Unbound](https://aacable.wordpress.com/2019/12/10/short-notes-for-unbound-caching-dns-server-under-ubuntu-18/)

View File

@ -0,0 +1,9 @@
---
title: Firewall
updated: 2022-06-03 21:51:40Z
created: 2022-06-03 21:51:30Z
---
https://www.cyberciti.biz/faq/how-to-delete-a-ufw-firewall-rule-on-ubuntu-debian-linux/

15
Overig/Linux/Firewall.md Normal file
View File

@ -0,0 +1,15 @@
---
title: Firewall
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Firewall
sudo ufw status verbose
sudo ufw enable
sudo ufw disable
sudo ufw --help
gufw

View File

@ -0,0 +1,7 @@
---
title: Keychron Keyboard
updated: 2022-04-27 17:37:12Z
created: 2022-04-27 17:36:57Z
---
https://gist.github.com/andrebrait/961cefe730f4a2c41f57911e6195e444

View File

@ -0,0 +1,8 @@
---
title: Linux Kernel
updated: 2022-04-27 17:41:37Z
created: 2022-04-27 17:41:13Z
---
[Linux Kernel Boot Parameters](http://redsymbol.net/linux-kernel-boot-parameters//)

13
Overig/Linux/Locale.md Normal file
View File

@ -0,0 +1,13 @@
---
title: Locale
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Locale
sudo dpkg-reconfigure locales
- select en_US.UTF-8
- select en_US.UTF-8
- type "locale" to check again

106
Overig/Linux/Packages.md Normal file
View File

@ -0,0 +1,106 @@
---
title: Packages
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Remove unused
## install
sudo apt install deborphan
## run terminal version
deborphan
## To remove the orphaned packages, run:
sudo orphaner
## graphical version
sudo apt install gtkorphan
sudo gtkorphan
## Alternative
sudo apt autoclean && sudo apt autoremove
dpkg --remove <name>
## check broken dependencies
sudo apt-get check
## search
apt-cache search [string1 stringn]
## list all available packages and search
apt-cache pkgnames
apt-cache search <packagename>
## get package info
apt-cache show <packagename>
## dependencies
apt-cache showpkg <packagename>
## statistics
apt-cache stats
## distro update
sudo apt-get dist-upgrade
sudo apt-get install <packageName> --only-upgrade
## install specific version
sudo apt-get install vsftpd=2.3.5-3ubuntu1
## remove without removing configurations
sudo apt-get remove <packageName>
## remove configurations
sudo apt-get purge <packageName>
## remove package and configuration
sudo apt-get remove --purge <packageName>
## cleanup diskspace
sudo apt-get clean
## check changelog of a package
apt-get changelog <packageName>
## Get Debian version
lsb_release -a
## Disk info
sudo hdparm -I /dev/sda1
## which kernels are installed?
dpkg --list | grep linux-image
sudo dpkg -i <package name.deb>
sudo dpkg --remove <package name>
sudo add-apt-repository 'deb https://typora.io/linux ./'
## Convert an rpm package to deb and install it
sudo alien packagename.rpm
sudo dpkg -i packagename.deb

View File

@ -0,0 +1,9 @@
---
title: Search find
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Search find
find /. -name 'toBeSearched.file' 2>&1 | grep -v 'Permission denied'

26
Overig/Linux/Services.md Normal file
View File

@ -0,0 +1,26 @@
---
title: Services
updated: 2022-04-27 17:35:25Z
created: 2021-05-04 14:58:11Z
---
# Services
## list all services
service --status-all
## start - stop - restart services
sudo service <service name> <start | stop | restart | status>
## List services
systemctl list-unit-files
## Startup time
systemd-analyze blame
[set numa parameters](https://askubuntu.com/questions/1379119/how-to-set-the-numa-node-for-an-nvidia-gpu-persistently)

View File

@ -0,0 +1,46 @@
---
title: SizeFileDir
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Size File & Dirs
```bash
ls -l filename
```
lists the number of bytes
```bash
hexdump -C filename
```
lists the exact byte count and the non-printing characters as hex values
```bash
du filename
```
lists the number of blocks
The blocksize can be found with:
```bash
stat -f /dev/sda1
```
```bash
du
du ~/work ## size of work dir in blocks
du -a ## all files in blocks
du -d 1 ## totals for dirs up to 1 level deep
du --block-size=1 ## blocksize 1 will give the exact disk usage of dirs and files in bytes
du --block-size=1M ## same as du -m, size in megabytes
du -h ## human readable disk space
du -h -s ## summary
du --apparent-size ## apparent size of the file
du --apparent-size -s ## summary
du --time -d 2 ## creation time or last modification
```
[source](https://www.howtogeek.com/450366/how-to-get-the-size-of-a-file-or-directory-in-linux/)
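
The difference between the byte count of `ls -l` and the block-based number of `du` can also be read from Python. This is a sketch that assumes a Linux/Unix system, where `st_blocks` is counted in 512-byte units; the function name is made up.

```python
import os


def file_sizes(path):
    """Return (apparent size in bytes, disk usage in bytes) for one file."""
    st = os.stat(path)
    apparent = st.st_size          # what `ls -l` and `du --apparent-size` report
    on_disk = st.st_blocks * 512   # allocated space, what plain `du` is based on
    return apparent, on_disk
```

For small files the disk usage is typically larger than the apparent size, because space is allocated in whole file system blocks.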

View File

@ -0,0 +1,12 @@
---
title: StartupDisk
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# Startup disk
ddrescue
fdisk -l # find the right device, i.e. /dev/sdX
ddrescue bionic-desktop-amd64.iso /dev/sdX --force -D # create the startup disk/USB stick

7
Overig/Linux/Vim.md Normal file
View File

@ -0,0 +1,7 @@
---
title: Vim
updated: 2022-04-27 17:40:11Z
created: 2022-04-27 17:39:41Z
---
[Learn Vim](https://learnvim.irian.to/)

View File

@ -0,0 +1,52 @@
---
title: journalctl
updated: 2021-09-11 08:09:53Z
created: 2021-09-11 08:00:22Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
# show timestamps in UTC
`journalctl --utc`
# view emergency system messages
`journalctl -p 0`
# Error codes:
0: emergency
1: alerts
2: critical
3: errors
4: warning
5: notice
6: info
7: debug
# shows all messages with priority 2, 1 and 0
`journalctl -p 2`
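
Why `-p 2` also shows priorities 1 and 0 can be illustrated with a tiny Python sketch; it only models the selection rule (the given level and everything more severe), it does not call journalctl. The names are the ones from the list above, the function name is made up.

```python
# Priority names as listed in the notes above, indexed by their numeric code.
PRIORITIES = ["emergency", "alerts", "critical", "errors",
              "warning", "notice", "info", "debug"]


def shown_with_p(level):
    """Priorities matched by `journalctl -p <level>`: the level itself and all more severe ones."""
    return {code: name for code, name in enumerate(PRIORITIES) if code <= level}
```

For example, `shown_with_p(3)` covers emergency, alerts, critical and errors, which matches `journalctl -p 3`.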
# list boots
`journalctl --list-boots`
# errors with explanation
`journalctl -xb -p 3`
# show log from specific moment in time
`journalctl --since "2020-12-04 06:00:00"`
`journalctl --since yesterday`
# kernel messages
`journalctl -k`
# Network messages
`journalctl -u NetworkManager.service`
# list of all services
`systemctl list-units --type=service`
# messages of an application
`journalctl /usr/bin/docker --since today`
[Source](https://www.debugpoint.com/2020/12/systemd-journalctl/)

33
Overig/Linux/ssh.md Normal file
View File

@ -0,0 +1,33 @@
---
title: ssh
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# SSH keys
## check current
```bash
for keyfile in ~/.ssh/id_*; do ssh-keygen -l -f "${keyfile}"; done | uniq
```
Ed25519 is intended to provide attack resistance comparable to quality 128-bit symmetric ciphers.
```bash
ssh-keygen -o -a 100 -t ed25519
```
result
```bash
~/.ssh/id_ed25519
```
### Change or set a passphrase
```bash
ssh-keygen -f ~/.ssh/id_rsa -p -o -a 100
```
[Source](https://blog.g3rt.nl/upgrade-your-ssh-keys.html)

View File

@ -0,0 +1,30 @@
---
title: youtube_dl
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# youtube_dl
### Get all possible available formats
```bash
youtube-dl -F 'https:// .....'
```
### Download movie in specified quality
```bash
youtube-dl -f <int> 'https://......'
```
### If problems with audio and video
```bash
youtube-dl -f 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/bestvideo+bestaudio' --merge-output-format mp4 'http://....'
```
[More info](https://askubuntu.com/questions/486297/how-to-select-video-quality-from-youtube-dl)
ffmpeg -v 5 -y -i input.m4a -acodec libmp3lame -ac 2 -ab 192k output.mp3

7
Overig/Privacy.md Normal file
View File

@ -0,0 +1,7 @@
---
title: Privacy
updated: 2022-04-27 17:42:36Z
created: 2022-04-27 17:42:29Z
---
https://oisd.nl/

View File

@ -0,0 +1,24 @@
---
title: Remove string pdf
updated: 2021-10-04 13:26:02Z
created: 2021-10-04 13:21:30Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---
```bash
sudo apt install pdftk
```
```bash
pdftk file.pdf output uncompressed.pdf uncompress
```
```bash
sed -i 's/RemoveString//g' uncompressed.pdf
```
```bash
pdftk uncompressed.pdf output changed.pdf compress
```


@ -0,0 +1,49 @@
---
title: CheatSheet K8s
updated: 2022-02-05 17:47:17Z
created: 2022-02-05 15:06:57Z
latitude: 51.86290000
longitude: 4.61910000
altitude: 0.0000
---
## minikube
`minikube start | status | stop | delete | pause | unpause | dashboard`
## Crud commands kubectl
`kubectl create deployment [name] --image=[image]`
`kubectl create deployment nginx-depl --image=nginx`
`kubectl edit deployment [name]`
`kubectl edit deployment nginx-depl`
`kubectl delete deployment [name]`
## Use configuration file for CRUD
`kubectl apply -f [file name]`
`kubectl delete -f [file name]`
## Status of different K8s components
`kubectl get nodes | pod | service | replicaset | deployment | all`
`kubectl get pod -o wide` More columns with info
`kubectl get deployment [name] -o yaml` current in yaml format (useful for debugging)
## Debugging pods
`kubectl logs [pod name]`
`kubectl logs nginx-depl-{pod-name}`
`kubectl exec -it [pod name] -- /bin/bash`
`kubectl describe pod [pod name]`
`kubectl describe pod nginx-depl-{pod name}`
## Configuration file Deployment & Service
Each configuration file has 3 parts
1. metadata
2. specification
3. status (automatically generated)
    - desired state
    - actual state
    - Both have to match. If not, K8s knows something has to be fixed
    - status is updated continuously
    - etcd provides the info for status
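As a rough sketch of the parts listed above, a Deployment configuration can be built as a Python dictionary and written out as JSON, which `kubectl apply -f` accepts alongside YAML. The name `nginx-depl` and the `nginx` image come from the commands earlier in this note; the label `app: nginx`, the replica count and the container port are assumptions chosen for illustration. The `status` part is left out because K8s generates it.

```python
import json

# Minimal Deployment manifest: metadata + specification.
# The status part is not written by hand; K8s adds and updates it.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "nginx-depl",
        "labels": {"app": "nginx"},  # assumed label
    },
    "spec": {
        "replicas": 2,  # assumed desired state: two pods
        "selector": {"matchLabels": {"app": "nginx"}},
        "template": {  # blueprint for the pods
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {
                "containers": [
                    {
                        "name": "nginx",
                        "image": "nginx",
                        "ports": [{"containerPort": 80}],  # assumed port
                    }
                ]
            },
        },
    },
}

manifest_json = json.dumps(deployment, indent=2)
print(manifest_json)
```

Saved to a file, this output could be passed to `kubectl apply -f [file name]` in the same way as a YAML configuration file.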
![99bd030fd518baa2ca26a85b313e23f0.png](../../_resources/99bd030fd518baa2ca26a85b313e23f0.png)
![b138f59fcc154b98b815a2dc79b89003.png](../../_resources/b138f59fcc154b98b815a2dc79b89003.png)


@ -0,0 +1,26 @@
---
title: Ingress
updated: 2022-02-06 17:57:21Z
created: 2022-02-06 17:38:09Z
latitude: 51.86290000
longitude: 4.61910000
altitude: 0.0000
---
# install Ingress addon
Installs an additional pod.
Automatically starts the K8s Nginx implementation of the Ingress controller.
`minikube addons enable ingress`
Then create an Ingress rule
![77e6428cf3d2bff19c7215e2915ca3c4.png](../../_resources/77e6428cf3d2bff19c7215e2915ca3c4.png)
or
![a6f83a73449bec9c309c2cab27c24f5e.png](../../_resources/a6f83a73449bec9c309c2cab27c24f5e.png)
SSL implementation
![47073357d36804a22ad703c90cb541ce.png](../../_resources/47073357d36804a22ad703c90cb541ce.png)
Take care!!!
![b97bd5343f507eb8529bc2c00d30084d.png](../../_resources/b97bd5343f507eb8529bc2c00d30084d.png)


@ -0,0 +1,68 @@
---
title: Intro Basis Components of K8s
updated: 2022-02-05 15:10:02Z
created: 2022-01-29 15:33:56Z
latitude: 52.38660000
longitude: 5.27820000
altitude: 0.0000
---
What is Kubernetes (K8s)
- container orchestration tool
- manages containers (not only Docker)
- different environments
  - physical
  - virtual
  - cloud
What problems does Kubernetes solve:
- manages many containers with independent services, like microservices
Kubernetes features
- High Availability (no downtime)
- Scalability (high performance)
- Disaster recovery (backup and restore)
Kubernetes components (but many more)
- Node (physical or virtual)
  - node contains pods and per pod usually 1 application
  - pod is the smallest unit of K8s
  - pod is an abstraction over a container
  - each pod gets its own IP address
  - new IP address on re-creation
  - ![aca7155a4c7e4f2b982378cbb2d37400.png](../../_resources/aca7155a4c7e4f2b982378cbb2d37400.png)
- Services
  - permanent IP address
  - lifecycle of Pod and Service are not connected
  - ![ad9e649d35df7e77ce73056bbaf2cbb9.png](../../_resources/ad9e649d35df7e77ce73056bbaf2cbb9.png)
  - External Service
    - opens the communication for external sources
  - internal service for e.g. a database. Not accessible from outside
  - External requests go first to Ingress (to route traffic)
  - ![25f981495ea54faccfe5ea58cd1f8b75.png](../../_resources/25f981495ea54faccfe5ea58cd1f8b75.png)
- ConfigMap & Secret
  - external configuration of the application (e.g. URL of a database: DB_URL = mongo-db-service)
  - secret credentials go in a Secret (like ConfigMap, but base64 encoded)
  - Application can read from ConfigMap and Secret
- Volumes
  - for persistent data
  - attach physical storage to a pod
  - local machine or remote storage (e.g. cloud, outside the K8s cluster)
  - K8s does not manage data persistence
- Deployment and Stateful Set
  - Every Pod is replicated as specified in a Blueprint
  - Service has 2 functionalities:
    - Permanent IP (so another pod can still connect if the pod dies)
    - load balancer
  - pods are not created directly; a Blueprint specifies for every pod how many replicas are required
  - Blueprint is an abstraction of Pods. Pods are an abstraction of Containers
  - In practice you just work with Deployments
  - Stateful Set
    - Databases have state and cannot simply be replicated via a Deployment
    - For stateful applications like databases
    - makes sure no database inconsistency can occur
    - is not easy; the advice is therefore to host databases outside the K8s cluster
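The note that a Secret is "base64 encoded" is worth a quick illustration: base64 is an encoding, not encryption, so anyone who can read the value can decode it. A minimal sketch with Python's standard library, where the password is a made-up example value:

```python
import base64

# Hypothetical credential, for illustration only.
plain_password = "example-password"

# This is the form in which a value would appear in the data section of a Secret.
encoded = base64.b64encode(plain_password.encode("utf-8")).decode("ascii")

# Decoding needs no key at all, so base64 gives no confidentiality.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

Because the value decodes without any key, the encoding is only a transport format, and access to Secrets still has to be restricted separately.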


@ -0,0 +1,47 @@
---
title: K8s Architecture
updated: 2022-01-29 16:47:31Z
created: 2022-01-29 16:19:16Z
latitude: 52.38660000
longitude: 5.27820000
altitude: 0.0000
---
Worker Machine in K8s cluster
- Nodes do the actual work
- each node has multiple Pods on it
- 3 processes must be installed on every Node
  - Container Runtime (e.g. Docker)
  - Kubelet
    - interacts with container and node
    - starts the Pod and the container inside
    - assigns resources from Node to the container
  - Kube Proxy
    - communication between Nodes goes through Services
    - forwards the requests
Master Node (usually multiple master nodes)
- 4 processes that must run on every master Node and that control the cluster state and the workers
  - API Server (client interacts with this API)
    - acts as a load balancer for the master nodes
    - master nodes require far fewer resources than worker nodes
    - the more worker nodes, the more master nodes => makes the application more robust
    - cluster gateway
    - gets initial request like update or query
    - acts as a gatekeeper for authentication
    - validates a request and when OK it forwards the request to other processes and eventually to the Pod.
    - One entry point to the cluster
  - Scheduler
    - API Server sends request to Scheduler, which will start a Pod
    - Has the intelligence to decide which Node should run the Pod (how much CPU, RAM etc. it needs)
    - Kubelet gets the request and executes the request
  - Controller manager
    - detects cluster state changes like crashes
    - tries to recover the state by making a request to the Scheduler
  - etcd (key-value store of the cluster state) (cluster brain)
    - other 3 components get the information from etcd
    - does not contain any application data!
    - distributed across all the master nodes
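To make the Scheduler's decision concrete, here is a toy sketch in Python. It is not the real kube-scheduler algorithm; the `Node` class, the `pick_node` function and all numbers are hypothetical, and the only idea taken from the notes is that a Pod's CPU and RAM needs are compared with what each Node has free.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a worker node's free resources."""
    name: str
    free_cpu: float   # free CPU cores
    free_ram_mb: int  # free memory in MB


def pick_node(nodes: List[Node], cpu: float, ram_mb: int) -> Optional[Node]:
    """Return the fitting node with the most free CPU, or None if none fits."""
    candidates = [n for n in nodes if n.free_cpu >= cpu and n.free_ram_mb >= ram_mb]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (n.free_cpu, n.free_ram_mb))


nodes = [
    Node("worker-1", free_cpu=0.5, free_ram_mb=512),
    Node("worker-2", free_cpu=2.0, free_ram_mb=4096),
    Node("worker-3", free_cpu=4.0, free_ram_mb=1024),
]

chosen = pick_node(nodes, cpu=1.0, ram_mb=2048)
print(chosen.name if chosen else "no node fits")
```

In this example only `worker-2` has both enough CPU and enough RAM, so it is chosen; in the real cluster the Kubelet on the chosen Node would then start the Pod.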
![4c5a7cb8f10e6e2cc67ccfb4681679b7.png](../../_resources/4c5a7cb8f10e6e2cc67ccfb4681679b7.png)

11
Python/Algemeen.md Normal file

@ -0,0 +1,11 @@
---
title: Algemeen
updated: 2021-05-04 14:58:11Z
created: 2021-05-04 14:58:11Z
---
# General Python
[Install Python](https://tecadmin.net/install-python-3-8-ubuntu/)
[Use Poetry: Python packaging and dependency management](https://python-poetry.org/)


@ -0,0 +1,93 @@
---
title: Jupyter Built-in magic commands
updated: 2022-04-26 11:13:12Z
created: 2022-04-26 10:31:22Z
---
## references
- `%lsmagic`: list of all magic commands
- `%quickref`: cheatsheet
- `%magic`: information about the magic command system
## Timeit
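`%timeit` and `%%timeit` time repeated runs (for example `%%timeit -n 3`); `%time` times a single run.
The %timeit magic runs the given code many times and then reports timing statistics: current IPython versions show the mean and standard deviation per loop, older versions showed the fastest result.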
```
%timeit sum(range(100000))
```
The %%timeit cell magic can be used to time blocks of code. Start the cell with `%%timeit`.
| Options | Description |
| --- | --- |
| `-n<N>` | Executes the code statement N times in a loop. If N is not given, it is chosen automatically to get good accuracy. |
| `-r<R>` | Sets the number of repeats to R. |
| `-p<P>` | Uses a precision of P digits to show the timing result. |
| `-c` | Use time.clock; the default function on Windows, measures wall time. |
| `-t` | Use time.time; the default function on Unix, measures wall time. |
| `-q` | Quiet; do not display any result. |
| `-o` | Returns a TimeitResult that can be stored in a variable to view more details. |
```
%%timeit -n 3
a = 0
for i in range(100000):
    a += i
```
The %time magic times a single run of a statement.
```
%time sum(range(100000))
```
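Outside a notebook, roughly the same measurements can be made with the standard-library `timeit` module. The sketch below is an approximation of what the magics do, not their implementation; the choice of 3 loops and 5 repeats is arbitrary.

```python
import timeit

# Rough equivalent of `%timeit -n 3 -r 5 sum(range(100000))`:
# each of the 5 repeats runs the statement 3 times and returns the total time.
totals = timeit.repeat("sum(range(100000))", number=3, repeat=5)
per_loop = [total / 3 for total in totals]
print(f"best of {len(per_loop)}: {min(per_loop):.6f} s per loop")

# Rough equivalent of `%time`: time one single run with a high-resolution clock.
start = timeit.default_timer()
result = sum(range(100000))
elapsed = timeit.default_timer() - start
print(f"single run: {elapsed:.6f} s, result = {result}")
```

Taking the minimum of the repeats is a common way to reduce the influence of other processes on the measurement.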
## run scripts
```
%run somescript.py
%run -d myscript.py # debug
```
## reset kernel
`%reset` removes all names defined by the user from the namespace; it is not a kernel restart.
## Matplotlib
```
from matplotlib import pyplot as plt
%matplotlib inline
```
- `%matplotlib`: set matplotlib to work interactively; does not import anything
- `%matplotlib inline`: show plots inline in the notebook
- `%matplotlib qt`: request a specific GUI backend
## debugging
- `%debug`: jump into the Python debugger (pdb)
- `%pdb`: start the debugger on any uncaught exception
- `%cd`: change directory
- `%pwd`: print working directory
- `%env`: OS environment variables
## OS command
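- `!OScommand`: run an OS command, e.g. `!ping www.bbc.co.uk`
- `%alias`: define an alias for a system command
- `!!date`: run a command and return its output as a list, e.g. `['Sat Jan 19 02:53:54 UTC 2019']`
- `%system date`: same as `!!date`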
## Auto reload
- `%load_ext autoreload`: load the autoreload extension
- `%autoreload`: reload all modules now
This helps a lot when you are editing external modules or libraries while working in the notebook. The autoreload extension reloads changed modules for you (with `%autoreload 2`, automatically before every execution), so after even a minor change you do not have to re-run the imports to update the local environment of the notebook.
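What autoreload automates can be done by hand with `importlib.reload`. The following sketch writes a small throwaway module, imports it, changes the file and reloads it. The module name `demo_settings` and its contents are hypothetical, and bytecode caching is switched off only to keep the demo free of stale cache files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

previous_flag = sys.dont_write_bytecode
sys.dont_write_bytecode = True  # keep the demo free of cached .pyc files

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / "demo_settings.py"  # hypothetical module
    module_file.write_text("VALUE = 1\n")
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        demo_settings = importlib.import_module("demo_settings")
        before = demo_settings.VALUE

        # Edit the module on disk, as you would in an editor.
        module_file.write_text("VALUE = 20\n")

        # Without a reload the old value would stay in memory.
        importlib.reload(demo_settings)
        after = demo_settings.VALUE
    finally:
        sys.path.remove(tmp_dir)
        sys.modules.pop("demo_settings", None)
        sys.dont_write_bytecode = previous_flag

print(before, after)
```

With the extension loaded, this reload step happens without an explicit call, which is the convenience the magic provides.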

46
Python/Algemeen/RE.md Normal file

@ -0,0 +1,46 @@
---
title: RE
updated: 2022-04-25 11:42:43Z
created: 2022-04-25 11:29:11Z
---
### Giving a label and looking at the results as a dictionary is pretty useful. For that we use the syntax (?P<name>), where the parenthesis starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>.
```python
import re

# `wiki` is assumed to hold the text being searched
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['edit_link'])
```
## Look-ahead and Look-behind
### One more concept to be familiar with is called "look ahead" and "look behind" matching. In this case, the pattern being given to the regex engine is for text either before or after the text we are trying to isolate. For example, in our headers we want to isolate text which comes before the [edit] rendering, but we actually don't care about the [edit] text itself. Thus far we have been throwing the [edit] away, but if we want to use them to match but don't want to capture them we could put them in a group and use look ahead instead with ?= syntax
```python
for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", wiki):
    # What this regex says is match two groups, the first will be named and called title, will have any amount
    # of whitespace or regular word characters, the second will be the characters [edit] but we don't actually
    # want this edit put in our output match objects
    print(item)
```
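To see the difference between a capturing group and a look-ahead, both patterns can be run on a small sample. The `sample` string below is a made-up stand-in for the `wiki` text used in these notes.

```python
import re

# Hypothetical sample text standing in for the `wiki` variable.
sample = "Overview[edit]\nAccess to Records[edit]\nFees[edit]"

# With a capturing group, "[edit]" is part of the overall match.
with_group = [
    m.group(0)
    for m in re.finditer(r"(?P<title>[\w ]+)(?P<edit_link>\[edit\])", sample)
]

# With a look-ahead, "[edit]" must follow but is not consumed by the match.
with_lookahead = [
    m.group(0)
    for m in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", sample)
]

print(with_group)
print(with_lookahead)
```

The first list contains the titles with `[edit]` attached, the second only the titles, which is the point of using `?=`.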
## I'll actually use this example to show you the verbose mode of python regexes. The verbose mode allows you to write multi-line regexes and increases readability. For this mode, we have to explicitly indicate all whitespace characters, either by prepending them with a \ or by using the \s special value. However, this means we can write our regex a bit more like code, and can even include comments with #.
```python
pattern = r"""
(?P<title>.*) #the university title
(\ located\ in\ ) #an indicator of the location
(?P<city>\w*) #city the university is in
(,\ ) #separator for the state
(?P<state>\w*) #the state the city is located in"""
# Now when we call finditer() we just pass the re.VERBOSE flag as the last parameter, this makes it much
# easier to understand large regexes!
for item in re.finditer(pattern, wiki, re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())
```
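A worked run of the verbose pattern on one line of text shows what `.groupdict()` returns. The sentence in `sample` is invented for illustration; only the named groups end up in the dictionary, the unnamed groups are matched but not listed.

```python
import re

# Hypothetical input line; the real notes search a longer `wiki` text.
sample = "Example State University located in Springfield, Ohio"

pattern = r"""
(?P<title>.*)        # the university title
(\ located\ in\ )    # an indicator of the location
(?P<city>\w*)        # city the university is in
(,\ )                # separator for the state
(?P<state>\w*)       # the state the city is located in
"""

match = re.search(pattern, sample, re.VERBOSE)
fields = match.groupdict()
print(fields)
```

The escaped spaces (`\ `) are required because verbose mode ignores unescaped whitespace inside the pattern.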
### Let's create a pattern. We want to include the hash sign first, then any number of alphanumeric characters. And we end when we see some whitespace.
```python
pattern = r'#[\w\d]*(?=\s)'
# Notice that the ending is a look ahead.
re.findall(pattern, health)
```
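One consequence of ending on a whitespace look-ahead is that a hashtag at the very end of the text is not found. The sketch below shows this on an invented `sample` string and, as one possible variation that is not part of the original notes, a pattern that also accepts the end of the string.

```python
import re

# Hypothetical text standing in for the `health` variable.
sample = "Stay #active and eat #well_balanced meals #health"

# Pattern from the notes: a hashtag must be followed by whitespace.
strict_tags = re.findall(r"#[\w\d]*(?=\s)", sample)

# Variation (an assumption, not from the notes): also allow end of string.
all_tags = re.findall(r"#[\w\d]*(?=\s|$)", sample)

print(strict_tags)
print(all_tags)
```

The first result misses `#health` because nothing follows it; the second result includes it.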


@ -0,0 +1,27 @@
---
title: Time Date
updated: 2022-04-25 10:57:17Z
created: 2022-04-25 10:53:38Z
---
```python
import datetime as dt
import time as tm
```
### seconds since 1 Jan 1970
```python
tm.time()
```
Output:
1650884042.68
```python
dt.datetime.fromtimestamp(tm.time())
```
Output:
datetime.datetime(2022, 4, 25, 12, 54, 44, 69803)
```python
dt.timedelta(days = 100) # create a timedelta of 100 days
```
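A timedelta becomes useful once it is added to or subtracted from a date. A small sketch with a fixed date, so that the result does not depend on when it is run; the date itself is an arbitrary choice, and the UTC conversion is added only to make the epoch visible.

```python
import datetime as dt

start = dt.date(2022, 4, 25)
delta = dt.timedelta(days=100)

later = start + delta    # 100 days after the start date
earlier = start - delta  # 100 days before the start date
print(later, earlier)

# Subtracting two dates gives a timedelta again.
print((later - start).days)

# Timestamp 0 is the epoch; passing a timezone avoids local-time surprises.
epoch = dt.datetime.fromtimestamp(0, tz=dt.timezone.utc)
print(epoch)
```

Without the `tz` argument, `fromtimestamp` returns local time, which is why the output shown earlier in this note depends on the machine's timezone.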

Some files were not shown because too many files have changed in this diff