This commit is contained in:
John 2024-03-25 16:14:50 +01:00
parent 317f0900a3
commit e52fffc17b
12 changed files with 69 additions and 14 deletions


@ -1,8 +1,8 @@
# Generative AI & LLMs
The Transformer model revolutionized the field of natural language processing (NLP) and became the basis for the LLMs we now know, such as GPT, PaLM and others.
Transformer models replace traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with an entirely <mark>attention-based mechanism</mark>.
The Transformer model uses <mark>self-attention</mark> to compute representations of input sequences, which allows it to capture <mark>long-term dependencies and parallelize computation</mark> effectively.
The Transformer architecture consists of an <mark>encoder and a decoder</mark>, each of which is composed of several layers. Each layer consists of two sub-layers: a <mark>multi-head self-attention mechanism and a feed-forward neural network</mark>. The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward network applies a point-wise fully connected layer to each position separately and identically.
The Transformer model also uses <mark>residual connections and layer normalization to facilitate training and prevent overfitting</mark>. In addition, the authors introduce a positional encoding scheme that encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.
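To make this structure concrete, here is a minimal sketch of one encoder layer, assuming PyTorch. The sizes (d_model=512, 8 heads, d_ff=2048) follow the original paper's base configuration; the class and variable names are illustrative, not from any particular library.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self-attention + position-wise
    feed-forward network, each with a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)      # self-attention over the whole sequence
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # position-wise FFN + residual + norm
        return x

# Example: a batch of 2 sequences of 10 (already embedded) tokens.
out = EncoderLayer()(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```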
@ -17,6 +17,8 @@ Generative AI is a subset of traditional machine learning. And the machine learn
<mark>Foundation models</mark> are sometimes called base models. Examples are GPT, BERT, LLaMA, BLOOM, FLAN-T5 and PaLM.
[BLOOM paper](images/BloomPaper.pdf)
The more **parameters** a model has, the more memory it requires and, as it turns out, the more sophisticated the tasks it can perform. Put differently: models with more parameters are able to capture more understanding of language.
![Prompt and completion](images/2024-03-02-17-51-00-image.png)
@ -35,6 +37,8 @@ With in-context learning, you can help LLMs learn more about the task being aske
If the model is not performing well even with 5 or 6 examples, then try <mark>fine-tuning the model</mark>.
Fine-tuning is training the model further on additional data, which makes it more capable of performing the task.
[Zero-Shot Generalization paper](images/Zero-ShotGeneralization.pdf)
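As an illustration of in-context learning, here is a sketch of a zero-shot vs. few-shot prompt for a sentiment task. The review texts are made up, and in both cases no model weights are updated; the examples live entirely inside the prompt.

```python
# Zero-shot: the task description and the input only.
zero_shot = (
    "Classify this review: I loved this movie!\n"
    "Sentiment:"
)

# Few-shot: the same task, preceded by a handful of solved examples
# inside the context window so the model can pick up the pattern.
few_shot = (
    "Classify this review: I loved this movie!\n"
    "Sentiment: Positive\n\n"
    "Classify this review: The plot was dull and predictable.\n"
    "Sentiment: Negative\n\n"
    "Classify this review: The soundtrack was wonderful.\n"
    "Sentiment:"
)
# If ~5-6 examples are still not enough, fine-tuning is the next step.
```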
## Capabilities of LLMs
- next word prediction
@ -101,7 +105,7 @@ One single token will have a score higher than the rest, but there are a number
![example](images/exampleTransformerTranslation.png)
Encoder: Encodes inputs with contextual understanding and produces one vector per input token.
Decoder: Accepts input tokens and generates new tokens.
1. tokenize the input words using this same tokenizer that was used to train the network.
2. These tokens are then added into the input on the encoder side of the network.
@ -115,7 +119,7 @@ This representation is inserted into the middle of the decoder to influence
the decoder's self-attention mechanisms.
1. a start of sequence token is added to the input of the decoder.
2. This triggers the decoder to predict the next token, which it does based on the contextual understanding that it's being provided from the encoder.
3. The output of the decoder's self-attention layers gets passed through the decoder feed-forward network and through a final softmax output layer.
At this point, we have our first token. You'll continue this loop, passing the output token back to the input to trigger the generation of the next token, until the model predicts an end-of-sequence token.
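A minimal sketch of this generation loop, assuming the Hugging Face `transformers` and `torch` packages and a small pretrained encoder-decoder model (`t5-small`); the prompt and the 40-token cap are arbitrary choices for illustration.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# 1. Tokenize the input with the same tokenizer used to train the network.
input_ids = tokenizer("translate English to German: I love machine learning.",
                      return_tensors="pt").input_ids

# 2. The encoder consumes the input tokens (inside the forward pass below);
#    the decoder starts from a start-of-sequence token.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

# 3. Predict one token at a time until the end-of-sequence token appears.
for _ in range(40):
    logits = model(input_ids=input_ids, decoder_input_ids=decoder_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```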
@ -127,7 +131,7 @@ There are multiple ways in which you can use the output from the softmax layer t
## Split decoder and encoder architecture
**Encoder-only models** (autoencoding) also work as sequence-to-sequence models, but without further modification, the input sequence and the output sequence are the same length.
Their use is less common these days, but by adding additional layers to the architecture, you can train encoder-only models to perform <mark>classification tasks</mark> such as sentiment analysis. **BERT** is an example of an encoder-only model.
**Encoder-decoder models**, as you've seen, perform well on sequence-to-sequence tasks such as translation, where the input sequence and the output sequence can be different lengths. Examples are BART and T5.
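As a small illustration of an encoder-only model with a classification head, here is a sketch using the Hugging Face `pipeline` API (assumed installed); by default this task currently downloads a DistilBERT checkpoint fine-tuned for sentiment analysis.

```python
from transformers import pipeline

# Encoder-only model (BERT family) + a classification layer on top.
classifier = pipeline("sentiment-analysis")
print(classifier("This course on transformers is really well structured."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```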
@ -144,8 +148,8 @@ Each model exposes a set of configuration parameters that can influence the mode
The output from the transformer's softmax layer is a probability distribution across
the entire dictionary of words that the model uses.
- greedy sampling: the word/token with the <mark>highest probability</mark> is selected. Works well for short generation, but is susceptible to repeated words or repeated sequences of words (see the sketch after this list).
- random(-weighted) sampling: select a token using a <mark>random-weighted strategy</mark> across the probabilities of all tokens. The generated text is more natural, more creative and avoids repeating words. Note that in some implementations, you may need to disable
greedy and enable random sampling explicitly.
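A tiny NumPy sketch of the difference between the two strategies above, with a made-up softmax output over a four-token vocabulary:

```python
import numpy as np

tokens = ["cake", "donut", "banana", "apple"]
probs = np.array([0.20, 0.10, 0.02, 0.68])       # softmax output, sums to 1

greedy = tokens[int(np.argmax(probs))]           # always picks "apple"
weighted = np.random.choice(tokens, p=probs)     # picks "apple" ~68% of the time,
                                                 # "cake" ~20%, and so on
print(greedy, weighted)
```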
Two settings, **top p** and **top k**, are sampling techniques that we can use to
@ -163,7 +167,6 @@ and more variability in the output compared to a cool temperature setting. This
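Below is an illustrative NumPy sketch (made-up logits, arbitrary thresholds) of how temperature, top k and top p reshape the probability distribution before a token is sampled:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])

# Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
probs = softmax(logits / 0.7)

# Top k: keep only the k most probable tokens, then renormalize.
k = 2
idx_k = np.argsort(probs)[-k:]
top_k = np.zeros_like(probs)
top_k[idx_k] = probs[idx_k]
top_k /= top_k.sum()

# Top p (nucleus): keep the smallest set of tokens whose cumulative
# probability reaches p, then renormalize.
p = 0.9
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
top_p = np.zeros_like(probs)
top_p[order[:cutoff]] = probs[order[:cutoff]]
top_p /= top_p.sum()

print(probs.round(2), top_k.round(2), top_p.round(2))
```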
## Generative AI project lifecycle
![Generative AI project lifecycle](images/Generative%20AI%20project%20lifecycle.png)
### Scope
@ -175,18 +178,18 @@ Possible tasks: essay writing, summarization, translation, information retrieval
Whether to train your own model from scratch or work with an existing base model. In general, you'll start with an existing model, although there are some cases where you may find it necessary to train a model from scratch.
### Adapt and align model (Iterative)
Assess its performance and carry out additional training if needed for your application.
Prompt engineering can sometimes be enough to get your model to perform well, so you'll likely <mark>start by trying in-context learning</mark>. There are still cases, however, where the model may not perform as well as you need, even with one- or few-shot inference, and in that case, you can try <mark>fine-tuning your model: a supervised learning process and reinforcement learning with human feedback</mark>.
Evaluation: some metrics and benchmarks that can be used to determine how well your model is performing or how well aligned it is to your preferences.
### Application Integration
Deploy it into your infrastructure and integrate it with your application. At this stage, an important step is to optimize your model for deployment.
The last but very important step is to consider any additional infrastructure that your application will require to work well. <mark>There are some fundamental limitations of LLMs</mark> that can be difficult to overcome through training alone, like their tendency to invent information when they don't know an answer, or their limited ability to carry out complex reasoning and mathematics.
## Pre-training large language models
@ -240,7 +243,7 @@ Model capability with size has driven the development of larger and larger model
## Computational challenges of training LLMs
<mark>Most common issue: OutOfMemoryError: CUDA out of memory.</mark>
CUDA = Compute Unified Device Architecture
@ -266,3 +269,55 @@ So, for full precision model of 4GB @ 32-bit full precision -> 16 bit quantized
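A rough back-of-the-envelope check of these numbers (weights only, ignoring gradients, optimizer states and activations, which in practice add several times more memory during training):

```python
params = 1_000_000_000   # 1B parameters

for precision, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB just to store the weights")

# FP32:      ~4 GB  -> the "4GB @ 32-bit full precision" figure above
# FP16/BF16: ~2 GB  -> 16-bit quantization halves the footprint
# INT8:      ~1 GB
```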
[Video Computational challenges of training LLMs](images/ComputationalChallengesOfTrainingLLMs.mp4)
[Video Efficient multi-GPU compute strategies](images/Efficientmulti-GPUcomputestrategies.mp4)
## Scaling laws and compute-optimal models
[Scaling Laws for Neural Language Models](images/scalingLaw.pdf)
The goal during pre-training is to maximize the model's performance on its learning objective, which is minimizing the loss when predicting tokens. Two options:
![Scaling Choices Pre-training](images/ScalingChoicesPreTraining.png)
1. increasing the size of the dataset you train your model on
2. increasing the number of parameters in your model
Your compute budget also needs to be taken into consideration.
**Chinchilla paper**
![Compute Optimals](images/Chinchilla1.png)
![Chinchilla scaling](images/Chinchilla2.png)
Important takeaways from the Chinchilla paper:
1. The optimal training dataset size for a given model is about 20 times larger than the number of parameters in the model (a quick check follows this list).
2. The compute-optimal Chinchilla model outperforms non-compute-optimal models such as GPT-3 on a large range of downstream evaluation tasks.
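A quick check of the ~20x rule of thumb, using Chinchilla's own size as the example (illustrative arithmetic only):

```python
params = 70_000_000_000        # Chinchilla has roughly 70B parameters
optimal_tokens = 20 * params   # ~1.4 trillion training tokens
print(f"{optimal_tokens / 1e12:.1f} trillion tokens")
```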
![Model size vs time](images/Chinchilla3.png)
With the results of the Chinchilla paper in hand, teams have recently started to develop smaller models that achieve similar, if not better, results than larger models that were trained in a non-optimal way.
Moving forward, you can probably expect to see a deviation from the "bigger is always better" trend of the last few years as more teams or developers like you start to optimize their model design.
**BloombergGPT** is a really interesting model. It was trained in a compute-optimal way following the Chinchilla scaling laws and so achieves good performance with a size of 50 billion parameters.
## Pre-training for domain adaptation
There's one situation where you may find it necessary to pretrain your own model from scratch: if your target domain uses vocabulary and language structures that are not commonly used in day-to-day language.
- specific words (e.g. legal language) and common words that have a different meaning in your domain
- medical language
**BloombergGPT** is an example of a large decoder-only language model that has been pretrained for a specific domain, in this case, finance. The Bloomberg researchers chose to **combine both finance data and general-purpose text data** to pretrain a model that achieves best-in-class results on financial benchmarks: 51% financial data and 49% other data.
During the training of BloombergGPT, the authors used the Chinchilla Scaling Laws to guide the number of parameters in the model and the volume of training data, measured in tokens. The recommendations of Chinchilla are represented by the lines Chinchilla-1, Chinchilla-2 and Chinchilla-3 in the image, and we can see that BloombergGPT is close to them.
The BloombergGPT project is a good illustration of pre-training a model for increased domain-specificity, and the challenges that may force trade-offs against compute-optimal model and training configurations.
[BloombergGPT paper](images/BloombergGTPpaper.pdf)
## Models
[HuggingFace Tasks](https://huggingface.co/tasks)
[Model Hub](https://huggingface.co/models)
