llm update

This commit is contained in:
John 2024-03-03 18:23:06 +01:00
parent c26604e416
commit 418154dfd5
11 changed files with 121 additions and 13 deletions


@ -3,7 +3,7 @@
The Transformer model revolutionized the field of natural language processing (NLP) and became the basis for the LLMs we know today, such as GPT, PaLM and others.
Transformer models replace traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with an entirely attention-based mechanism.
The Transformer model uses self-attention to compute representations of input sequences, which allows it to capture long-term dependencies and parallelize computation effectively.
The Transformer architecture consists of an encoder and a decoder, each of which is composed of several layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward network applies a point-wise fully connected layer to each position separately and identically.
The Transformer model also uses residual connections and layer normalization to facilitate training and prevent overfitting. In addition, the authors introduce a positional encoding scheme that encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.
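As a concrete illustration of the sinusoidal positional encoding described in the original paper, here is a minimal numpy sketch; the sequence length and model dimension are arbitrary example values, not taken from any particular model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return encoding

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16): one position vector per token position
```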
@ -17,8 +17,7 @@ Generative AI is a subset of traditional machine learning. And the machine learn
Foundation models are sometimes called base models. Examples are GPT, BERT, LLaMA, BLOOM, FLAN-T5 and PaLM.
The more **parameters** a model has, the more memory it requires and, as it turns out, the more sophisticated the tasks it can perform. Put differently: models with more parameters are able to capture more understanding of language.
![Prompt and completion](images/2024-03-02-17-51-00-image.png)
@ -26,24 +25,31 @@ The text that you pass to an LLM is known as a **prompt**.
The space or memory that is available to the prompt is called the **context window**,
and this is typically large enough for a few thousand words, but
differs from model to model. The output of the model is called a **completion**, and the act of using the model to generate text is known as **inference**.
Rewriting the prompt several times to get the model to behave in the way that you want is called **prompt engineering**. One powerful strategy to get the model to produce better outcomes is to include examples of the task that you want the model to carry out inside the prompt. Providing examples inside the context window is called **in-context learning (ICL)**.
With in-context learning, you can help LLMs learn more about the task being asked by including examples or additional data in the prompt.
- Zero-shot inference: only your input data is included within the prompt, with no examples (large models are good at this)
- One-shot inference: a single example is included in the prompt, so the model can learn what to do.
- Few-shot inference: multiple examples are included in the prompt. Even smaller models can then infer the correct answer.
![Few Shot Inference](images/FewShotInference.png)
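To make the difference between zero-, one- and few-shot inference concrete, here is a minimal sketch of how such prompts can be assembled for a sentiment-classification task. The template, example reviews and labels are purely illustrative assumptions, not a prescribed format.

```python
# Illustrative labeled examples for in-context learning.
EXAMPLES = [
    ("I loved this movie, the acting was superb.", "Positive"),
    ("The plot made no sense and I left halfway through.", "Negative"),
]

def build_prompt(review: str, n_shots: int = 0) -> str:
    """Build a prompt with n_shots worked examples followed by the new input."""
    parts = []
    for text, label in EXAMPLES[:n_shots]:
        parts.append(f"Classify this review: {text}\nSentiment: {label}\n")
    parts.append(f"Classify this review: {review}\nSentiment:")
    return "\n".join(parts)

print(build_prompt("A beautiful, moving film.", n_shots=0))  # zero-shot
print(build_prompt("A beautiful, moving film.", n_shots=1))  # one-shot
print(build_prompt("A beautiful, moving film.", n_shots=2))  # few-shot
```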
If the model is not performing well even with five or six examples, then try fine-tuning the model.
Fine-tuning means training the model with additional data, which makes it more capable at the task.
## Capabilities of LLMs
- next word prediction
- translation tasks
- program code generation
- information retrieval: ask the model to identify all of the people and places mentioned in a news article => **named entity recognition**, a word classification task.
## Transformer architecture
This novel approach unlocked the progress in generative AI that we see today. It can be **scaled efficiently** to use multi-core GPUs, it can **parallel process input data**, making use of much larger training datasets, and crucially, it's able to learn **to pay attention to the meaning of the words it's processing**.
The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence, and to apply attention weights to those relationships so that the model learns the relevance of each word to every other word, no matter where they are in the input.
An attention map can be useful to illustrate the attention weights between words.
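The self-attention computation behind these weights can be sketched as scaled dot-product attention. Below is a toy, single-head numpy version with random example vectors; real models use learned query/key/value projections and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax -> attention map
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attention_map = scaled_dot_product_attention(Q, K, V)
print(attention_map.round(2))   # each row sums to 1: how strongly a token attends to every token
```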
@ -67,9 +73,7 @@ first tokenize the words
Multiple tokenization methods, for example:
- token IDs matching two complete words,
- using token IDs to represent parts of words.
_Importantly, once you've selected a tokenizer to train the model, you must use the same tokenizer when you generate text._
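A small sketch of this tokenizer round trip, assuming the Hugging Face `transformers` library and `bert-base-uncased` as an arbitrary example checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Transformers use self-attention."
token_ids = tokenizer.encode(text)                  # text -> token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)                                       # sub-word pieces (exact split depends on the vocabulary)
print(tokenizer.decode(token_ids))                  # IDs -> text; only meaningful with the SAME tokenizer
```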
@ -111,7 +115,6 @@ One single token will have a score higher than the rest, but there are a number
Encoder: encodes inputs with contextual understanding and produces one vector per input token.
Decoder: Accepts input tokens and generates new tokens.
1. Tokenize the input words using the same tokenizer that was used to train the network.
2. These tokens are then added into the input on the encoder side of the network.
3. Passed through the embedding layer.
4. Fed into the multi-headed attention layers.
@ -134,13 +137,118 @@ There are multiple ways in which you can use the output from the softmax layer t
## Split decoder and encoder architecture
**Encoder-only models** (autoencoding) also work as sequence-to-sequence models, but without further modification, the input sequence and the output sequence are the same length.
Their use is less common these days, but by adding additional layers to the architecture, you can train encoder-only models to perform classification tasks such as sentiment analysis. **BERT** is an example of an encoder-only model.
**Encoder-decoder models**, as you've seen, perform well on sequence-to-sequence tasks such as translation, where the input sequence and the output sequence can be different lengths. Examples are BART and T5.
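As a hedged sketch of such a sequence-to-sequence task, the snippet below runs a translation with T5, assuming the `transformers` library and the small `t5-small` checkpoint purely as an example:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 uses a task prefix; the input and output sequences may differ in length.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```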
**Decoder-only models** (autoregressive) are some of the most commonly used today. Examples are the GPT family of models, BLOOM, Jurassic, LLaMA, and many more.
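A minimal sketch of text generation with a decoder-only model, assuming the `transformers` library and `gpt2` purely as an example checkpoint:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# do_sample=True enables random-weighted sampling instead of greedy decoding.
result = generator("Large language models are", max_new_tokens=20, do_sample=True)
print(result[0]["generated_text"])
```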
[Video Generating text with transformers](images/VideoGeneratingTextWithTransformers.mp4)
## Generative Configuration
Each model exposes a set of configuration parameters that can influence the model's output during inference. These are different from the training parameters.
![Generative Configuration](images/GenerativeConfiguration.png)
The output from the transformer's softmax layer is a probability distribution across
the entire dictionary of words that the model uses.
- greedy sampling: the word/token with the highest probability is selected. Works well for short generation, but is susceptible to repeated words or repeated sequences of words.
- random(-weighted) sampling: select a token using a random-weighted strategy across the probabilities of all tokens. The generated text is more natural, more creative and avoids repeating words. Note that in some implementations, you may need to disable greedy and enable random sampling explicitly.
Two settings, **top p** and **top k**, are sampling techniques that we can use to help limit the random sampling and increase the chance that the output will be sensible.
- top k: specify a top k value which instructs the model to choose from only the k tokens with the highest probability. This method can help the model have some randomness while preventing the selection of highly improbable completion words.
- top p: limit the random sampling to the predictions whose combined (cumulative) probability does not exceed p
- temperature: is a scaling factor that's applied within the final softmax layer of the model that impacts the shape of the probability distribution of the next token. The higher the temperature, the higher the randomness, and the lower the temperature, the lower the randomness.
![Generative Configuration Temperature](images/GenerativeConfogTemperature.png)
Notice that in contrast to the blue bars, the probability is more evenly spread across the tokens. This leads the model to generate text with a higher degree of randomness
and more variability in the output compared to a cool temperature setting. This can help you generate text that sounds more creative. If you leave the temperature value equal to one, this will leave the softmax function as default and the unaltered probability distribution will be used.
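To make these decoding controls concrete, here is a toy numpy sketch of temperature, top-k and top-p selection over a small logit vector. It illustrates the idea only; it is not how any particular library implements these settings.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Toy sketch of temperature / top-k / top-p sampling over one logit vector."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                                   # softmax with temperature
    if top_k is not None:                                  # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                                  # keep the smallest set with cumulative prob <= p
        order = np.argsort(probs)[::-1]
        keep = order[np.cumsum(probs[order]) <= top_p]
        if len(keep) == 0:
            keep = order[:1]                               # always keep at least the most probable token
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, 0.1, -1.0]                        # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.7, top_k=3))  # index of the sampled token
```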
## Generative AI project lifecycle
![Generative AI project lifecycle](images/Generative%20AI%20project%20lifecycle.png)
### Scope
define the scope as accurately and narrowly as you can. LLMs are capable of carrying out many tasks, but their abilities depend strongly on the size and architecture of the model.
Possible tasks: essay writing, summarization, translation, information retrieval, invoke APIs and actions.
### Select
Whether to train your own model from scratch or work with an existing base model. In general, you'll start with an existing model, although there are some cases where you may find it necessary to train a model from scratch.
### Adapt and align model
Assess its performance and carry out additional training if needed for your application.
Prompt engineering can sometimes be enough to get your model to perform well, so you'll likely start by trying in-context learning. There are still cases, however, where the model may not perform as well as you need, even with one-shot or few-shot inference, and in that case, you can try fine-tuning your model: a supervised learning process and reinforcement learning with human feedback.
Evaluation: metrics and benchmarks can be used to determine how well your model is performing or how well it is aligned to your preferences. This is an iterative process.
### Application Integration
Deploy it into your infrastructure and integrate it with your application. At this stage, an important step is to optimize your model for deployment.
The last but very important step is to consider any additional infrastructure that
your application will require to work well. There are some fundamental limitations of
LLMs that can be difficult to overcome through training alone like their tendency to invent information when they don't know an answer, or their limited ability to carry out complex reasoning and mathematics.
## Pre-training large language models
The developers of some of the major frameworks for building generative AI applications like Hugging Face and PyTorch, have curated hubs where you can browse these models.
A really useful feature of these hubs is the inclusion of **model cards**, that describe important details including the best use cases for each model, how it was trained, and known limitations
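As a sketch, model cards can also be fetched programmatically; this assumes the `huggingface_hub` library is installed and uses `gpt2` only as an example repository id.

```python
from huggingface_hub import ModelCard

card = ModelCard.load("gpt2")   # downloads the card for the given repository id
print(card.data)                # structured metadata (license, tags, ...)
print(card.text[:500])          # free-form description, including known limitations
```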
Variants of the transformer model architecture are suited to different language tasks, largely because of differences in how the models are trained.
### Model Architectures and pre-training objectives
1. Pre-training: a self-supervised learning step in which the model internalizes the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the architecture of the model.
The encoder generates an embedding or vector representation for each token.
![LLM pre-training](images/LLMPre-traing.png)
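A small sketch of this "one vector per token" idea, assuming the `transformers` library and `bert-base-uncased` as an example encoder-only model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The teacher teaches the student.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, hidden_size): one vector per token
```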
Encoder-only models are also known as Autoencoding models, and they are pre-trained using masked language modeling.
![AutoencodingModels](images/AutoencodingModels.png)
a. **Autoencoding models** build **bi-directional representations** of the input sequence, meaning that the model has an understanding of the full context of a token and not just of the words that come before. Encoder-only models are ideally suited to tasks that benefit from this bi-directional context. You can use them to carry out **sentence classification tasks**, for example, **sentiment analysis**, or *token-level tasks* like **named entity recognition** or **word classification**. Some well-known examples of autoencoding models are **BERT** and **RoBERTa**.
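A minimal sketch of masked language modeling at inference time, assuming the `transformers` library and `bert-base-uncased` purely as an example:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# The model predicts the masked token using context on BOTH sides of it.
for prediction in fill_mask("The teacher [MASK] the student."):
    print(prediction["token_str"], round(prediction["score"], 3))
```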
**Decoder-only or autoregressive models**
the training objective is to predict the next token based on the previous sequence of tokens. Predicting the next token is sometimes called full language modeling.
![autoregressive models](images/autoregressiveModels.png)
b. **Decoder-based autoregressive models** mask the input sequence and can only see the input tokens leading up to the token in question. The model has no knowledge of the end of the sentence. The model then iterates over the input sequence one by one to predict the following token. In contrast to the encoder architecture, this means that the context is **unidirectional**. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language. Models of this type make use of the decoder component of the original architecture without the encoder. Decoder-only models are often used for **text generation**, although larger decoder-only models show **strong zero-shot inference abilities**, and can often perform a range of tasks well. Well-known examples of decoder-based autoregressive models are **GPT** and **BLOOM**.
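To illustrate this next-token (full/causal language modeling) objective, the sketch below inspects a model's next-token distribution, assuming the `transformers` library and `gpt2` as an example checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)        # the 5 most probable next tokens
for prob, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(token_id))), round(float(prob), 3))
```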
c. A **sequence-to-sequence model** uses both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model.
![sequence-to-sequenceModel](images/sequence-to-sequenceModel.png)
A popular sequence-to-sequence model, **T5**, pre-trains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token, shown here as x. Sentinel tokens are special tokens added to the vocabulary, but do not correspond to any actual word from the input text. The decoder is then tasked with reconstructing the masked token sequences auto-regressively. The output is the sentinel token followed by the predicted tokens. You can use sequence-to-sequence models for **translation**, **summarization**, and **question-answering**. Well-known encoder-decoder models are **BART** and **T5**.
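An illustrative sketch of how a span-corruption training pair can be laid out with T5-style sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...). The sentence and the chosen spans are arbitrary, and this only builds the strings; it is not the actual T5 preprocessing code.

```python
tokens = "The teacher teaches the student with the book".split()
# Mask the two-token span starting at index 2 ("teaches the") and the single
# token at index 6 ("the"); the choice of spans is arbitrary for illustration.
spans = {2: 2, 6: 1}   # start index -> span length

inputs, targets, sentinel, i = [], [], 0, 0
while i < len(tokens):
    if i in spans:
        span = tokens[i:i + spans[i]]
        inputs.append(f"<extra_id_{sentinel}>")                      # sentinel replaces the span
        targets.append(f"<extra_id_{sentinel}> " + " ".join(span))   # decoder must reconstruct it
        i += spans[i]
        sentinel += 1
    else:
        inputs.append(tokens[i])
        i += 1

print("input: ", " ".join(inputs))    # The teacher <extra_id_0> student with <extra_id_1> book
print("target:", " ".join(targets))   # <extra_id_0> teaches the <extra_id_1> the
```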
### Summary: a comparison
![Summary model archi and pretraning](images/SummaryModelArchiandPretraning.png)
A comparison of the different model architectures and the targets of the pre-training objectives:
- Autoencoding models are pre-trained using masked language modeling. They correspond to the encoder part of the original transformer architecture, and are often used with sentence classification or token classification.
- Autoregressive models are pre-trained using causal language modeling. Models of this type make use of the decoder component of the original transformer architecture, and are often used for text generation.
- Sequence-to-sequence models use both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model. The T5 model is pre-trained using span corruption. Sequence-to-sequence models are often used for translation, summarization, and question-answering.
The growth of model capability with model size has driven the development of larger and larger models in recent years. This growth has been fueled by inflection points in research, such as the introduction of the highly scalable transformer architecture, access to massive amounts of data for training, and the development of more powerful compute resources.
![modeSize](images/ModelSize.png)
## Computational challenges of training LLMs
