Generative AI in Context of LLMs


A summary of recent developments in Generative AI in the light of Large Language Models

Karunamay Pathak

Table of Contents

1. Generative AI
   1.1 Introduction
   1.2 Use Cases
   1.3 Pros and Cons
   1.4 Challenges
2. Transformers
   2.1 Introduction
   2.2 Attention
   2.3 Detailed Transformer Architecture
   2.4 Foundation Models
   2.5 Pre-trained Language Models (PLMs)
   2.6 Challenges
3. Generative Models
   3.1 Introduction
   3.2 Decoder Models
   3.3 Encoder-Decoder Models
   3.4 Generative Adversarial Networks
   3.5 Variational Autoencoders (VAE)
   3.6 Flow
   3.7 Diffusion
4. Large Language Models
   4.1 Introduction
   4.2 Key Techniques
   4.3 Fine-Tuning
   4.4 Data Sources
   4.5 Validation
   4.6 Challenges
5. Prompt Engineering and In-Context Learning
   5.1 Introduction
   5.2 Basic Elements of Prompts
   5.3 Basic Tips for Prompting
   5.4 In-Context Learning
6. AI Agents
   6.1 Introduction
   6.2 How It Works
   6.3 Key Techniques
   6.4 Limitations
7. Responsible AI
8. References

1. Generative AI

1.1 Introduction

Generative models have a long history in artificial intelligence, dating back to the 1950s with the development of Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). These models generated sequential data such as speech and time series. However, it wasn't until the advent of deep learning that generative models saw significant improvements in performance. In 2014, the introduction of generative adversarial networks (GANs), a type of machine learning algorithm, meant that generative AI could create convincingly authentic images, videos and audio of real people. On the one hand, this newfound capability opened up opportunities such as better movie dubbing and rich educational content. On the other, it raised concerns about deepfakes, i.e. digitally forged images or videos, and harmful cybersecurity attacks on businesses, including nefarious requests that realistically mimic a trusted person in order to extract important information. While these developments were happening, generative models in different domains, e.g. text, image and vision, were following separate paths; eventually, in 2017, transformers arrived, and they are now a common deep learning framework across domains. With these generalisation capabilities, generative AI models are powerful enough to quickly generate new content based on a variety of inputs. Inputs and outputs to these models can include text, images, sound and video. The buzz around generative AI has been driven mainly by the ease of access to open-source interfaces and APIs for creating high-quality text, graphics and videos in a matter of seconds.

1.2 Use Cases

1. Code generation, documentation, and QA: For software developers and programmers, generative AI use cases include writing, completing, and vetting sets of software code. Quality assurance is perhaps the most important emerging use case in this area, with generative AI models handling bug fixes, test generation, and various types of documentation.

2. Product and app development: Generative AI is now being used to code various kinds of apps and write product documentation for these apps. While apps are probably the most common type of product development for generative AI today, generative AI support is also going into projects like semiconductor chip development and design.

3. Blog and social media content writing: With the right prompts and inputs, large language models are capable of creating appropriate and creative content for blogs, social media accounts, product pages, and business websites. Many of these models enable users to give instructions on article tone and voice, input past written content from the brand, and add other specifications so content is written in a way that sounds human and relevant to the brand's audience.

4. Inbound and outbound marketing communication workflows: Inbound and outbound marketing frequently require contextualized email and chat threads to be sent to prospective and current customers on a daily basis. Generative AI solutions can create and send the content for these communications, and in some cases, they can also automate the process of moving these people to the next stage of the customer lifecycle in a CRM.

5. Graphic design and video marketing: Generative AI is capable of generating realistic images, animation, and audio that can be used for graphic design and video marketing projects. Some generative AI vendors offer voice synthesis and AI avatars so you can create marketing videos without actors, video equipment, or video editing expertise.

6. Entertainment media generation: As AI-generated imagery, animation, and audio become more and more realistic, this type of technology is being used to create the graphics for movies and video games, the audio for music and podcast generation, and the characters for virtual storytelling and virtual reality experiences. Some tech experts predict that generative AI will constitute the majority of future film content and script writing, though creatives are understandably pushing back on that assumption.

7. Performance management: Generative AI use cases include several business and employee coaching scenarios. As an example, contact center call documentation and summarization, when combined with sentiment analysis, gives managers the information they need to assess current customer service rep performance and coach employees on ways to improve.

8. Business performance reporting: Because generative AI can work through massive amounts of text and data to quickly summarize the main points, it is becoming an important piece of business performance reporting. It is especially useful for unstructured and qualitative data that usually require more processing before insights can be drawn.

9. Customer support and customer experience: For many of the most straightforward customer service engagements, generative AI chatbots and virtual assistants can handle customer service questions at all hours of the day. Chatbots have been used for customer service for many years, but generative AI advancements are giving them additional resources to provide comprehensive and more human answers without the help of a human customer support representative.

10. Optimized enterprise search and knowledge base: Both internal and external search benefit from generative AI technology. For internal employee users, generative AI models can be used to scour, identify, and/or summarize enterprise resources when users are searching for certain information about their job.

Similarly, generative AI models can be embedded into company websites and other customer-facing properties, giving customers a self-service way to find answers to their brand questions.

11. Pharmaceutical drug discovery and design: Generative AI technology is being used to make drug discovery and design processes more efficient for new drugs. AI-driven drug discovery is one of the areas of generative AI that is receiving the most funding right now, so expect this particular enterprise use case to grow significantly in the coming months and years.

12. Medical diagnostics: Generative AI in medicine is still nascent, but that is changing quickly. Image generation and editing tools are increasingly being used to optimize and zoom into medical images, allowing medical professionals to get a better and more realistic look at certain areas of the human body. Some tools even perform medical image analysis and basic diagnostics on their own.

13. Consumer-friendly data analysis: Although generative AI poses some crucial security concerns, it can also be used to heighten data and consumer privacy. For example, generative AI can be used to create synthetic data copies of actual sensitive data, allowing analysts to analyze and derive insights from the copies without compromising data privacy or compliance.

14. Smart manufacturing and predictive maintenance: Generative AI is quickly becoming a staple in modern manufacturing, helping workers create more innovative designs and meet other production goals. In the realm of predictive maintenance, generative models can generate to-do lists and timelines, make workflow and repair suggestions, and simplify the process of assessing complex data from sensors and other parts of the assembly line.

15. Inventory and supply chain management: Several components of supply chain management can be enhanced with generative AI. Route optimization, demand forecasting, supplier risk management, and inventory management can all be made smarter and more accurate with generative AI suggestions.

16. Fraud detection and risk management: This type of technology can analyze large amounts of transaction or claims data, quickly summarizing and identifying any patterns or anomalies in that data. With these capabilities, generative AI is well suited to fraud detection and risk management in finance and insurance scenarios.

1.3 Pros and Cons

Pros

1. Storytelling: One of the most exciting applications of generative AI is in storytelling. AI models can generate narratives, characters, and even entire storylines, providing a wellspring of inspiration for writers and filmmakers. These models can analyze vast amounts of existing literature and media, learning the patterns and structures of storytelling, and then generate new narratives that adhere to these principles while introducing novel and unexpected elements. This opens up new avenues for storytelling, blurring the lines between human and machine creativity, and challenging traditional notions of authorship.

2. Ability to Learn: Another advantage of generative AI is its ability to learn the underlying patterns and distributions of a dataset. This allows it to generate outputs that are similar to, but not identical to, the input data. This can be used for a variety of tasks, such as image and video synthesis, text generation, and music composition.

3. Data Augmentation: Generative AI can also be used for data augmentation, a technique where the model creates new data from the given data. This can be used to increase the amount of data available for training a machine learning model, which can improve its performance.

Cons

1. Limited creativity: While generative AI can create new data based on existing patterns, it is limited in terms of creativity and originality. It can only generate new data based on what it has learned from existing data and cannot think beyond that.

2. Bias: Generative AI can also be biased if the data it was trained on is biased. For example, if the data used to train a generative AI model is biased against a particular group of people, the generated data may also reflect that bias.

3. Limited Application: Generative AI is best suited for applications where there is a large amount of existing data to train on. In cases where there is limited data available, or the data is highly complex, generative AI may not be effective.

4. Resource-intensive: Generative AI requires significant computing resources and training time. It can be expensive to train and deploy generative AI models, which may limit its widespread adoption.

5. Ethical Concerns: Generative AI can be used for malicious purposes, such as generating fake news, deepfakes, or other types of false information. This raises ethical concerns about the potential misuse of the technology.

1.4 Challenges

1. Quality of Generated Outputs: Generative AI systems may not always produce high-quality outputs, and the generated outputs may contain errors or artifacts. This can be due to a variety of factors, such as a lack of data, poor training, or an overly complex model.

2. Control Over the Generated Outputs: Generative AI systems are typically trained on a dataset and can generate new outputs that are similar to, but not identical to, the input data. However, it can be difficult to control the specific characteristics of the generated outputs.

3. Explainability and Interpretability: Generative AI models can be complex and opaque, making it difficult to understand how they are making their predictions. This can be a challenge when trying to ensure that the model is making fair and unbiased decisions.

4. Safety and Security: Generative AI systems can be used to generate realistic and convincing fake images, videos, and text, which can be used to spread misinformation or propaganda. This highlights the importance of developing safety and security measures to prevent the malicious use of generative AI.

Although generative AI has multiple challenges and limitations, which can be addressed, if not completely then substantially, through regulations and tools, it still has the potential to influence how we do our jobs and what kind of content we consume over the next few years.

2. Transformers

2.1 Introduction

Transformers were developed to solve the problem of sequence-to-sequence neural machine translation, that is, any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc.

For models to perform sequence transduction, it is necessary to have some sort of memory. For example, let's say that we are translating the following sentence to another language (French):

"The Transformers" are a Japanese hardcore punk band. The band was formed in 1968, during the height of Japanese music history."

In the above example, the phrase "the band" in the second sentence refers to the band "The Transformers" introduced in the first sentence. When you read about the band in the second sentence, you know that it is referencing "The Transformers" band. That may be important for translation. There are many examples where words in a sentence refer to words in previous sentences. For translating sentences like that, a model needs to figure out these sorts of dependencies and connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been used to deal with this problem because of their properties (for details, refer to How Transformers Work). But there were still some challenges:

1. Sequential computation inhibits parallelization.
2. There is no explicit modeling of long- and short-range dependencies.
3. The "distance" between positions is linear.

The transformer, with its architecture and attention mechanism, is able to solve most of the above problems. Attention is a technique used in a neural network for paying attention to specific words. For RNNs, instead of only encoding the whole sentence in a hidden state, each word has a corresponding hidden state that is passed all the way to the decoding stage. The hidden states are then used at each step of the RNN to decode, which is hard to parallelize and makes it hard to model long-term dependencies, though GRUs (see Understanding GRU Networks) solve this to a certain extent. Convolutional Neural Networks help with some of these problems: with them we can parallelize (per layer), exploit local dependencies, and the distance between positions is logarithmic. But they still lag in figuring out the problem of dependencies when translating sentences (see How Transformers Work). That is where the transformer came in.

The transformer model is a neural network that learns context, and thus meaning, by tracking relationships in sequential data, like the words in a sentence. Transformer models apply attention, or self-attention, to detect subtle ways in which even distant data elements in a series influence and depend on each other. First described in a 2017 paper from Google, transformers are among the newest and most powerful classes of models invented to date. They are driving a wave of advances in machine learning and deep learning.

2.2 Attention

When we talk about attention, it is generally in the sense of self-attention. Say the following sentence is an input sentence we want to translate:

"The animal didn't cross the street because it was too tired."

What does "it" in this sentence refer to? Is it referring to the street or to the animal? It is a simple question for a human, but not as simple for an algorithm. When the model is processing the word "it", self-attention allows it to associate "it" with "animal". As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word (for more details, see The Illustrated Transformer).

In the Transformer, attention is used in three places [Attention1]:

1. Self-attention in the Encoder: the input sequence pays attention to itself.
2. Self-attention in the Decoder: the target sequence pays attention to itself.
3. Encoder-Decoder attention in the Decoder: the target sequence pays attention to the input sequence.

The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that are learned during the training process. These new vectors are smaller in dimension than the embedding vector. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the query and key vectors.
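As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention. The shapes and names are illustrative, not any particular library's API; the optional causal flag implements the decoder-side masking trick discussed later in section 2.3.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    if causal:                                        # hide future positions from each query
        scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    weights = softmax(scores)                         # each row of weights sums to 1
    return weights @ V                                # weighted sum of the value vectors

# Toy self-attention: Q = K = V = the token embeddings
x = np.random.default_rng(0).normal(size=(3, 4))     # 3 tokens, d_k = d_v = 4
print(scaled_dot_product_attention(x, x, x).shape)   # (3, 4)
print(scaled_dot_product_attention(x, x, x, causal=True).shape)  # (3, 4)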

2.3 Detailed Transformer Architecture

The transformer is an encoder-decoder architecture. In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder. The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

Step-by-step details of the transformer

Input (Tokenization and Embedding)

Input text is first split into pieces. These can be characters, words, or "tokens":

"The detective investigated" -> [The_] [detective_] [invest] [igat] [ed_]

Tokens are indices into the "vocabulary":

[The_] [detective_] [invest] [igat] [ed_] -> [3 721 68 1337 42]

Each vocab entry corresponds to a learned d_model-dimensional (e.g. 512-dimensional) vector:

[3 721 68 1337 42] -> [[0.123, -5.234, ...], [...], [...], [...], [...]]

Positional Encoding

Remember that attention is permutation invariant, but language is not ("The mouse ate the cat" vs "The cat ate the mouse"). We need to encode the position of each word. The simplest idea is to just add something, maybe a number; think [The_] + 10, [detective_] + 20, [invest] + 30 ... but there are smarter ways to do this. There are several reasons why a single number, such as the index value, is not used to represent an item's position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it creates problems for variable-length sequences, as they would be normalized differently. Transformers use a smart positional encoding scheme, where each position/index is mapped to a vector. Hence, the output of the positional encoding layer is a matrix, where each row represents an encoded object of the sequence summed with its positional information. In the paper Attention Is All You Need, sine and cosine functions of different frequencies are used. Suppose you have an input sequence of length L and require the position of the kth object within this sequence. The positional encoding is given by sine and cosine functions of varying frequencies:

P(k, 2i) = sin(k / n^(2i/d))
P(k, 2i+1) = cos(k / n^(2i/d))

Where:

k: position of an object in the input sequence, 0 ≤ k < L.
d: dimension of the output embedding space.
P(k, j): position function mapping a position k in the input sequence to index (k, j) of the positional matrix.
n: user-defined scalar, set to 10,000 by the authors.

i: used for mapping to column indices, 0 ≤ i < d/2; a single value of i maps to both a sine and a cosine function.

Let's take the example of the phrase "I am a robot," with n=100 and d=4. The positional encoding matrix for this phrase (rows are positions k, columns are the four encoding dimensions) is:

k=0: [0.000,  1.000, 0.000, 1.000]
k=1: [0.841,  0.540, 0.100, 0.995]
k=2: [0.909, -0.416, 0.199, 0.980]
k=3: [0.141, -0.990, 0.296, 0.955]

In fact, the positional encoding matrix would be the same for any four-token phrase with n=100 and d=4 (source: positional encoding). Though results change little between fixed (learned) embeddings and the sinusoidal version, the authors chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
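The following NumPy sketch (illustrative code, not from any library) computes exactly the matrix above from the two formulas:

import numpy as np

def positional_encoding(L, d, n=10000):
    P = np.zeros((L, d))
    for k in range(L):                        # position in the sequence
        for i in range(d // 2):               # each i fills one sine and one cosine column
            angle = k / n ** (2 * i / d)
            P[k, 2 * i] = np.sin(angle)
            P[k, 2 * i + 1] = np.cos(angle)
    return P

print(positional_encoding(L=4, d=4, n=100).round(3))
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.1    0.995]
#  [ 0.909 -0.416  0.199  0.98 ]
#  [ 0.141 -0.99   0.296  0.955]]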

Multi-Head Attention

Multi-head attention is nothing but running multiple self-attention computations in parallel. Instead of performing a single attention function with d_model-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values, the attention function is then performed in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values (a sketch of this appears at the end of this subsection). Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this. Attention Is All You Need employs h = 8 parallel attention layers, or heads. For each of these, d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

Position-wise Feed-Forward Networks

A simple fully connected feed-forward network is applied to each token individually:

z = W₂ GELU(W₁x + b₁) + b₂

where GELU is the Gaussian Error Linear Unit, a high-performing neural network activation function. Think of it as each token pondering for itself about what it has observed previously. There is some weak evidence that this is where "world knowledge" is stored, too. It contains the bulk of the parameters: when people make giant sparse/MoE models, this is the part that becomes giant. Some people like to call it a 1x1 convolution.
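Here is a NumPy sketch of the multi-head mechanism described above, reusing the scaled_dot_product_attention helper from the sketch in section 2.2. The random weight matrices are stand-ins for learned projections.

import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (seq, d_model); all weight matrices: (d_model, d_model)."""
    d_model = x.shape[-1]
    d_head = d_model // h                        # paper values: d_k = d_v = d_model / h
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for i in range(h):                           # each head attends in its own subspace
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o  # concatenate heads, project once more

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 512))                    # 5 tokens, d_model = 512
W_q, W_k, W_v, W_o = (rng.normal(size=(512, 512)) * 0.02 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, h=8).shape)  # (5, 512)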

LayerNorm and Residual Connections

"Skip connection" == "residual block". Each module's output has the exact same shape as its input. Following ResNets, the module computes a "residual" instead of a new value: z = Module(x) + x. This was shown to dramatically improve trainability. Normalization also dramatically improves trainability. There is post-norm (original): z = LN(Module(x) + x), and pre-norm (modern): z = Module(LN(x)) + x.

Encoder and Decoder Blocks

Encoder and decoder architecture blocks are the same, except that the decoder has two multi-head attention steps, one with respect to the output generated so far in the sequence and a second with respect to the encoder output, plus a linear layer with a softmax to predict probabilities. Since input and output shapes are identical, we can stack N such blocks. Typically N=6 ("base"), N=12 ("large") or more. The encoder output is a "heavily processed" (think: high-level, contextualized) version of the input tokens, i.e. a sequence.

At training time: masked self-attention. This is regular self-attention as in the encoder, used to process what has been decoded so far, e.g. Z₂, Z₁ in p(Z₃|Z₂,Z₁,X), but with a trick. If we had to train on a single p(Z₃|Z₂,Z₁,X) at a time, it would be slow. Instead, we train on all p(Zᵢ|Z₁..ᵢ₋₁,X) simultaneously. How? In the attention weights for Zᵢ, set all entries from i to N to 0 (in practice, the corresponding scores are set to -∞ before the softmax, as in the attention sketch in section 2.2). This way, each token only sees the already generated ones.

At generation time: there is no such trick. We need to generate one Z at a time. This is why autoregressive decoding is extremely slow.

2.4 Foundation Models

Transformer is the backbone architecture for many state-of-the-art models, such as GPT-3, DALL-E 2, Codex, Gopher and many more. It was first proposed to solve the limitations of traditional models such as RNNs in handling variable-length sequences and context awareness. The transformer architecture is mainly based on a self-attention mechanism that allows the model to attend to different parts of an input sequence. Another advantage of the transformer is that its architecture makes it highly parallelizable and allows data to trump inductive biases. This property makes the transformer well suited for large-scale pre-training, enabling transformer-based models to become adaptable to different downstream tasks. This class of large-scale models is known as foundation models.

Foundation models are trained on massive amounts of data and are capable of performing a wide range of tasks. With a simple natural language prompt like "describe a scene of the sun rising over the beach," generative AI models can output a detailed description or produce an image based on the generated description, which can then be animated or even turned into video. Many recent language models are not only good at generating text but also at generating, explaining, and debugging code.

2.5 Pre-trained Language Models

Generally, transformer-based pre-trained language models can be classified into two types based on their training tasks: autoregressive language modeling and masked language modeling. Given a sentence composed of several tokens, the objective of masked language modeling, e.g. BERT and RoBERTa, is to predict the probability of a masked token given the context. The most notable example of masked language modeling is BERT, which combines masked language modeling with a next-sentence-prediction task. RoBERTa, which uses the same architecture as BERT, improves on its performance by increasing the amount of pre-training data and incorporating more challenging pre-training objectives. XLNet, which is also based on BERT, incorporates permutation operations to change the prediction order for each training iteration, allowing the model to learn more information across tokens. The objective of autoregressive language models, e.g. GPT-3 and OPT, is to model the probability of the next token given the previous tokens, hence left-to-right language modeling. Unlike masked language models, autoregressive models are more suitable for generative tasks.

History of PLMs (source: A Survey of Large Language Models). OpenAI model timeline (source: A Survey of Large Language Models).
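The two objectives are easy to contrast with the Hugging Face transformers library. This is a sketch; the checkpoints named here are just examples, and any compatible ones would work (requires transformers and torch installed).

from transformers import pipeline

# Masked language modeling: predict a hidden token using context from both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The detective [MASK] the crime scene.")[0]["token_str"])

# Autoregressive language modeling: predict the next tokens left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The detective investigated", max_new_tokens=10)[0]["generated_text"])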

2.6 Challenges

Despite their impressive performance on a wide range of NLP tasks, transformers also have some drawbacks. Some of the most notable ones are:

1. Computational complexity: Transformers can be computationally expensive to train and use, especially for large models like GPT-3. This can be a major barrier to their deployment in real-world applications, especially for smaller organizations and individuals who may not have access to the powerful computing resources required.

2. Overfitting: Transformers can easily overfit to the training data, especially if the data is small or not diverse enough. This can lead to poor generalization performance on unseen data.

3. Long-term dependencies: Transformers can have difficulty modeling long-term dependencies in sequences, especially when the dependencies span many tokens.

4. Attention bias: Transformers rely on attention mechanisms to determine which parts of the input sequence are most relevant. However, these mechanisms can sometimes be biased, leading to suboptimal results.

5. Interpretability: Transformers can be difficult to interpret and understand, as they are trained end-to-end and do not have clear, interpretable intermediate representations like those produced by some other neural network architectures.

3. Generative Language Models

3.1 Introduction

Generative language models (GLMs) are a type of NLP model trained to generate readable human language based on patterns and structures in the input data they have been exposed to. These models can be used for a wide range of NLP tasks such as dialogue systems, translation, question answering and many more. Recently, the use of pre-trained language models has emerged as the prevailing technique in NLP. Generally, current state-of-the-art pre-trained language models can be categorized as masked language models (encoders), autoregressive language models (decoders) and encoder-decoder language models. Decoder models are widely used for text generation, while encoder models are mainly applied to classification tasks. By combining the strengths of both structures, encoder-decoder models can leverage both context information and autoregressive properties to improve performance across a variety of tasks. The primary focus of this work is on generative models, so we will focus only on recent advancements in decoder and encoder-decoder architectures.

3.2 Decoder Models

One of the most prominent examples of autoregressive decoder-based language models is GPT, a transformer-based model that utilizes self-attention mechanisms to process all words in a sequence simultaneously. GPT is trained on a next-word-prediction task based on previous words, allowing it to generate coherent text. Subsequently, GPT-2 and GPT-3 maintain the autoregressive left-to-right training method while scaling up model parameters and leveraging diverse datasets beyond basic web text, achieving state-of-the-art results on numerous datasets. Gopher uses a GPT-like structure but replaces LayerNorm with RMSNorm, where a residual connection is added to the original layernorm structure to maintain information. In addition to enhancing the normalization function, several other studies have concentrated on optimizing the attention mechanism. BLOOM shares the same structure as GPT-3, but instead of using sparse attention, BLOOM uses a full attention network, which is better suited for modeling long dependencies. Megatron extends commonly used architectures like GPT-3, BERT and T5 with distributed training objectives to process large amounts of data; this method was later also adopted by MT-NLG and OPT. Beyond advancements in model architecture and pre-training tasks, significant effort has also gone into improving the fine-tuning process for language models. For example, InstructGPT takes advantage of pre-trained GPT-3 and uses RLHF for fine-tuning, allowing the model to learn preferences from ranking feedback labeled by humans.

3.3 Encoder-Decoder Models

One of the main encoder-decoder methods is the Text-to-Text Transfer Transformer (T5), which combines transformer-based encoders and decoders for pre-training. T5 employs a "text-to-text" approach, which means that it transforms both the input and output data into a standardized text format. This allows T5 to be trained on a wide range of NLP tasks, such as machine translation, question answering, summarization, and more, using the same model architecture. Switch Transformer, as stated in its name, utilizes "switching", a simplified MoE routing algorithm, for parallelized training on T5. This model successfully obtained larger scale and better performance with the same computational resources compared to the base model. Another widely used method that improves upon T5 is ExT5, proposed by Google in 2021, which extends the scale of the previous T5 model. Compared to T5, ExT5 continues pre-training on C4 and ExMix, a combination of 107 supervised NLP tasks across diverse domains. Another widely used encoder-decoder method is BART, which blends the bidirectional encoder from BERT and the autoregressive decoder from GPT, allowing it to leverage the bidirectional modeling abilities of the encoder while retaining the autoregressive properties for generation tasks. HTLM leverages BART's denoising objectives for modeling hyper-text language, which contains valuable information regarding document-level structure. This model also achieves state-of-the-art zero-shot performance on various generation tasks.

3.4 Generative Adversarial Networks

Generative Adversarial Networks (GANs) have gained popularity in the field of image generation research. GANs consist of two parts, a generator and a discriminator. The generator attempts to learn the distribution of real examples in order to generate new data, while the discriminator determines whether the input is from the real data space or not.
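The adversarial game is easiest to see on toy data. Below is a minimal PyTorch sketch of the generator/discriminator training loop on one-dimensional "data" (an illustration under simplified assumptions, not a production setup):

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(64, 1) * 0.5 + 2.0     # "real" data drawn from N(2, 0.5)
    fake = G(torch.randn(64, 8))              # generator maps noise to samples

    # Discriminator step: push real toward 1, fake toward 0
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to fool the discriminator (fake toward 1)
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward the real mean of 2.0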

3.5 Variational Autoencoders

Following variational Bayes inference [97], Variational Autoencoders (VAEs) are generative models that attempt to map data to a probabilistic distribution and learn a reconstruction that is close to the original input.

3.6 Flow

A normalizing flow transforms a simple distribution into a complex one through a sequence of invertible and differentiable mappings, using coupling and autoregressive flows as well as convolutional and residual flows.

3.7 Diffusion

The generative diffusion model (GDM) is a cutting-edge class of probability-based generative models that demonstrates state-of-the-art results in computer vision. It works by progressively corrupting data with multiple levels of noise perturbation and then learning to reverse this process for sample generation. Diffusion models are mainly formulated in three categories:

1. DDPMs apply two Markov chains, one to progressively corrupt data with Gaussian noise and one to reverse the forward diffusion process by learning Markov transition kernels.

2. Score-based generative models (SGMs) work directly on the gradient of the log density of the data, a.k.a. the score function. They perturb data with multi-scale, intensifying noise and jointly estimate the score function of all such noisy data distributions with a neural network conditioned on all noise levels. SGMs enjoy flexible sampling because the training and inference steps are completely decoupled.

3. Score SDEs generalize the previous two formulations to continuous time, perturbing data with a stochastic differential equation and learning the score of the noisy distribution at every noise level.
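As an illustration of the corruption half of this process, here is a NumPy sketch of the DDPM forward (noising) step using its closed form q(x_t | x_0). The linear beta schedule is illustrative, not any specific paper's hyperparameters:

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = np.cumprod(1.0 - betas)          # cumulative product: a_bar_t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))                    # stand-in for a data sample
print(q_sample(x0, t=10, rng=rng))            # mildly corrupted, still close to x0
print(q_sample(x0, t=999, rng=rng))           # nearly pure Gaussian noise

The learned model is the reverse of this: a network trained to undo the corruption one step at a time, which is the part the sketch leaves out.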

4. Large Language Models (LLMs)

4.1 Introduction

Scaling pre-trained language models (PLMs), e.g. scaling model size or data size, often leads to improved model capacity on downstream tasks, following scaling laws. A number of studies have explored this performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., the 330M-parameter BERT and the 1.5B-parameter GPT-2) and show surprising abilities in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do this well. Thus, the research community has coined the term "large language models (LLMs)" for large-sized PLMs (source: A Survey of Large Language Models). A remarkable application of LLMs is ChatGPT, which adapts LLMs from the GPT series for dialogue and presents an amazing ability to converse with humans. LLMs are revolutionizing the way humans develop and use AI algorithms. Unlike small PLMs, the major way of accessing LLMs is through a prompting interface (e.g., the GPT-4 API).

4.2 Key Techniques for LLMs

LLMs have come a long way to evolve into their current state: general and capable learners. Along the way, a number of important techniques were proposed that largely improved the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs.

Scaling

There is an evident scaling effect in transformer language models: larger model/data sizes and more training compute typically lead to improved model capacity. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B parameters, respectively.

Training

Due to their huge size, it is very challenging to successfully train capable LLMs. Distributed training algorithms are needed to learn the network parameters, and various parallelism strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed and Megatron-LM. Optimization tricks are also important for training stability and model performance, e.g. restarting training to overcome loss spikes, and mixed-precision training. More recently, GPT-4 proposed developing special infrastructure and optimization methods that reliably predict the performance of large models from much smaller ones.

Ability eliciting

After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps.

Alignment tuning

Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content. It is necessary to align LLMs with human values, e.g. helpful, honest, and harmless. For this purpose, InstructGPT designed an effective tuning approach that enables LLMs to follow expected instructions, utilizing reinforcement learning with human feedback. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is developed on a similar technique to InstructGPT and shows a strong alignment capacity in producing high-quality, harmless responses, e.g. declining to answer insulting questions.

Tools manipulation

LLMs are trained as text generators over massive plain-text corpora, so they perform less well on tasks that are not best expressed in the form of text (e.g. numerical computation). In addition, their capacities are limited to the pre-training data, e.g. they cannot capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs. For example, LLMs can use a calculator for accurate computation and employ search engines to retrieve unknown information. More recently, ChatGPT enabled a mechanism for using external plugins (existing or newly created apps), which are by analogy the "eyes and ears" of LLMs. Such a mechanism can broadly expand the scope of LLM capacities. In addition, many other factors (e.g. upgrades in hardware) also contribute to the success of LLMs.

4.3 Fine-Tuning

For many NLP applications involving transformer models, you can simply take a pretrained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand. Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results. However, there are a few cases where you will want to first fine-tune the language model on your data before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once! This process of fine-tuning a pretrained language model on in-domain data is usually called domain adaptation.

Most large language models (LLMs) are too big to be fine-tuned on consumer hardware. For instance, to fully fine-tune a 65-billion-parameter model we need more than 780 GB of GPU memory, equivalent to ten A100 80 GB GPUs. In other words, you would need cloud compute to fine-tune your models. Fine-tuning of an LLM can involve up to three stages, not all of which are necessary, depending on the data and use case: pre-training, instruction fine-tuning, and reinforcement learning from human feedback (RLHF). (Source: borealisai)
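To see where a number like 780 GB comes from, here is a back-of-the-envelope calculation. The byte breakdown is an assumed one (fp16 weights and gradients plus fp32 Adam optimizer states) and ignores activation memory, so treat it as a rough sketch rather than an exact accounting:

params = 65e9                  # 65B-parameter model
bytes_per_param = 2 + 2 + 8    # fp16 weights + fp16 gradients + two fp32 Adam moments
print(params * bytes_per_param / 1e9, "GB")   # 780.0 GB, before activations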

Pre-training

Pre-training involves leveraging unlabeled text to learn a universal language representation that embodies this knowledge. It follows that pre-training requires a substantial amount of data; for example, PaLM was trained on 780 billion tokens. The objective is to come up with a new model that has a deep understanding of your data. Done the standard way, which updates all the weights of the foundation model, this requires so much memory and compute that it is hard for an individual to do.

Parameter-efficient LLM fine-tuning

Low-Rank Adaptation of Large Language Models (LoRA) is a training method that accelerates the training of large models while consuming less memory. It adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains those newly added weights, which significantly reduces the memory footprint. This has several advantages:

● The previous pretrained weights are kept frozen, so the model is not as prone to catastrophic forgetting.
● Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
● LoRA matrices are generally added to the attention layers of the original model, though they can be added to other layers as well.
● The greater memory efficiency allows you to run fine-tuning on consumer GPUs like the Tesla T4, RTX 3080 or even the RTX 2080 Ti! GPUs like the T4 are readily accessible for free in Kaggle or Google Colab notebooks.

Quantized LLMs with Low-Rank Adapters (QLoRA) goes three steps further than LoRA:

● 4-bit NormalFloat (NF4) quantization, a new data type that is information-theoretically optimal for normally distributed weights.
● Double quantization, which reduces the average memory footprint by quantizing the quantization constants.
● Paged optimizers to manage memory spikes.

(Source: QLoRA)

As more and more people adopt LoRA/QLoRA, fine-tuning large language models has become easy and new, more efficient models keep appearing. This has greatly helped the open-source community explore and try new models, since they can fine-tune them as needed with their own compute resources.
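As a sketch of what this looks like in practice with the Hugging Face peft library: the base model here (gpt2) is just a small stand-in, and target_modules is architecture-dependent, so both are illustrative choices rather than a recipe.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in base model
config = LoraConfig(
    r=8,                        # rank of the update matrices
    lora_alpha=32,              # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection; varies by model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)   # wraps the model and freezes the base weights
model.print_trainable_parameters()      # only the small LoRA matrices are trainable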

Some sources for further exploration of fine-tuning: Alpaca-Lora, huggingface-Q-LoRA, huggingface-LoRA.

Instruction fine-tuning

To enhance the model's ability to follow instructions and generate responses aligned with the ground truth, instruction fine-tuning, or instruction tuning for short, was introduced. This involves fine-tuning a pre-trained LLM using labeled demonstration data, for example teaching an LLM to generate the list of things you need for planning a party. Some researchers also refer to this process as supervised fine-tuning. There are three main methods for collecting the demonstration data required for instruction fine-tuning:

● Leveraging existing labeled datasets created for language-related tasks.
● Having human annotators manually prepare it.
● Self-instruct, i.e. using the pre-trained LLM to generate its own training data.

Once the demonstration data has been prepared by one of these three methods, instruction fine-tuning typically follows a training process similar to pre-training. The objective is to optimize the model to correctly predict each sequential token of the response. Depending on the situation, the model may have access to both the prompt and the response, or only the response, during fine-tuning. Instruction fine-tuning provides clear performance improvements, and it generally requires a much smaller dataset. For training InstructGPT, OpenAI sampled 13,000 prompts from previous GPT-3 API submissions and recruited 40 labelers to write the desired responses (Instruction fine tuning with human feedback). This allows the model to learn from real questions posed by individuals during interactions with LLMs and to generate responses that align with the demonstrated behaviours. However, collecting this type of data is time-consuming and expensive.

Reinforcement learning from human feedback (RLHF)

Often, LLMs' behaviours are not well aligned with human goals. For example, they sometimes make up facts or generate biased and toxic text. One of the main sources of misalignment is the training objective itself: both pre-training and instruction fine-tuning focus only on maximizing the likelihood of the next words, not on the quality of the entire response. To bridge this alignment gap, Instruction fine tuning with human feedback applied Reinforcement Learning from Human Feedback (RLHF) as an additional fine-tuning stage. It is considered by many to be one of the critical factors in the success of models like ChatGPT and GPT-4.

Overall fine-tuning of the GPT-3 175B model for InstructGPT (source: Instruction fine tuning with human feedback).

One of the key LLMs for open-source development is Meta's open-source LLaMA (Large Language Model Meta AI), a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. Smaller, more performant models such as LLaMA enable others in the research community who do not have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field (for more details, see LLaMA). A snapshot of the variants of models trained using LLaMA as the foundation model is given in A Survey of Large Language Models.

4.4 Data Sources

A brief snapshot of the data sources used for training LLMs:

Books

BookCorpus is a dataset commonly used for previous small-scale models (e.g. GPT and GPT-2), consisting of over 11,000 books covering a wide range of topics and genres (e.g. novels and biographies).

CommonCrawl

Common Crawl is one of the largest open-source web crawling databases, containing a petabyte-scale data volume, and has been widely used as training data for existing LLMs.

As the whole dataset is very large, existing studies mainly extract subsets of web pages from it within a specific period.

Reddit Links

Reddit is a social media platform that enables users to submit links and text posts, which can be voted on by others through "upvotes" or "downvotes". Highly upvoted posts are often considered useful and can be utilized to create high-quality datasets. WebText is a well-known corpus composed of highly upvoted links from Reddit, but it is not publicly available. As a surrogate, there is a readily accessible open-source alternative called OpenWebText.

Wikipedia

Wikipedia is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository style of writing (with supporting references), covering a wide range of languages and fields. Typically, the English-only filtered versions of Wikipedia are used in most LLMs (e.g. GPT-3, LaMDA, and LLaMA). Since Wikipedia is available in multiple languages, it can also be used in multilingual settings.

Code

To collect code data, existing work mainly crawls open-source licensed code from the Internet. Two major sources are public code repositories under open-source licenses (e.g. GitHub) and code-related question-answering platforms (e.g. StackOverflow). Google has publicly released the BigQuery dataset, which includes a substantial number of open-source licensed code snippets in various programming languages, serving as a representative code dataset.

Others

The Pile is a large-scale, diverse, open-source text dataset consisting of over 800 GB of data from multiple sources, including books, websites, code, scientific papers, and social media platforms. It is constructed from 22 diverse, high-quality subsets.

4.5 Validation

While large language models (LLMs) have been touted for their ability to generate natural-sounding text, there are concerns about potential negative effects of LLMs such as data memorization, bias, and inappropriate language. Thankfully, there are now open-source tools and data available to test foundation models on these different criteria.

Evaluating zero-shot classification tasks with the Hugging Face model evaluator

Zero-shot evaluation is a popular way for researchers to measure the performance of large language models, as LLMs have been shown to learn capabilities during training without explicitly being shown labeled examples. The Inverse Scaling Prize is an example of a recent community effort to conduct large-scale zero-shot evaluation across model sizes and families to discover tasks on which larger models may perform worse than their smaller counterparts. Evaluation on the Hub helps you evaluate any model on the Hub without writing code, and is powered by AutoTrain. Now, any causal language model on the Hub can be evaluated in a zero-shot fashion. Zero-shot evaluation measures the likelihood of a trained model producing a given set of tokens and does not require any labelled training data, which allows researchers to skip expensive labelling efforts. (Source: Very Large Language Models and How to Evaluate Them)
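Here is a simplified sketch of likelihood-based zero-shot classification with the transformers library: score each candidate label by the log-likelihood the model assigns and pick the most probable. Real evaluation harnesses typically score only the completion tokens rather than the whole sequence, and the model and labels below are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes the model return the mean next-token cross-entropy
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)   # total log-likelihood of the sequence

prompt = "Review: I loved this movie. Sentiment:"
scores = {c: log_likelihood(f"{prompt} {c}") for c in ["positive", "negative"]}
print(max(scores, key=scores.get))   # the label whose full sequence is most likely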

4.6 Challenges

1. LLMs hallucinate

ChatGPT can deliver some mind-blowingly accurate responses to questions. It can understand long, complex requests and then generate succinctly summarized responses that perfectly address the question. That is the real magic of ChatGPT, and it performs this feat a large majority of the time. The challenge is that the remaining fraction of the time, the responses are not just "a little off"; they can be factually inaccurate and in some cases completely made up. This is an interesting dilemma, not just for ChatGPT but for LLMs in general. In fact, there is a term for when LLMs generate factually incorrect statements: hallucination.

2. LLMs lack controllability

LLMs know a lot, and they can do a lot. One of the most amazing things about ChatGPT is that you do not need to be a machine learning expert to get the LLM it is built on top of to do magical things. Anyone can type in a prompt and get an immediate response. This ability is a result of the fact that LLMs are general models designed to perform a wide range of tasks and adapt to new environments, not a narrow set of tasks. As such, they are the result of fusing layers and layers of algorithms, and that is where it gets complex. While this approach of fusing layers of models together dramatically shortens the time required to build and train complex systems, it offers limited ability to control the model's responses. Controllability refers to the ability of a system to be directed or brought to a specific state using a specific input.

3. LLMs have associative memory that gets stale

To be able to deliver such an amazing experience, LLMs are trained on vast amounts of data. GPT-3, for example, was trained on a whopping 45 terabytes of text data from different datasets. However, training data are typically drawn from a specific time period and may not accurately reflect the current state of the world or the latest developments. To go deeper: LLMs learn both reasoning capabilities and an associative memory of the knowledge they are trained on. Reasoning and memory are inseparable and are both required to perform a given task. The associative memory is stuck in the time period in which

it was trained, and there is no easy overriding mechanism to update it short of spending millions of dollars on retraining, though there are now ways around this.

5. Prompt Engineering and In-Context Learning

5.1 Introduction

Prompt learning is a relatively new concept that has been proposed in recent years in the context of pre-trained large language models. Previously, to make a prediction y given input x, the goal of traditional supervised learning was to find a model that predicts the probability P(y|x). With prompt learning, the goal becomes finding a template x′ such that the model directly predicts the probability P(y|x′). Hence, the objective of using a language model becomes encouraging a pre-trained model to make predictions by providing a prompt specifying the task to be done. Normally, prompt learning freezes the language model and directly performs few-shot or zero-shot learning on it. This enables language models to be pre-trained on large amounts of raw text data and be adapted to new domains without tuning them again. Hence, prompt learning can help save much time and effort.

There are several foundational techniques to remember in the process of prompt optimization. First, providing explicit instructions at the beginning of the prompt helps set the context and define the task for the model. Specifying the format or type of the expected answer is also beneficial. Additionally, you can enhance the interaction by incorporating system messages or role-playing techniques in the prompt. Below is an example prompt using these techniques:

I would like you to generate 10 quick-prep dinner meal ideas for recipe blogs, with each idea including a title and a one-sentence description of the meal. These blogs will be written for an audience of parents looking for easy-to-prepare family meals. Output the results as a bulleted list.

Compare that prompt with the following:

Write 10 recipe blogs.

As we cover more and more examples and applications of prompt engineering, you will notice that certain elements make up a prompt.

5.2 Basic Elements of Prompts

<Instruction> - A specific task or instruction you want the model to perform.
<Context> - External information or additional context that can steer the model to better responses.
<Input Data> - The input or question that we are interested in finding a response for.
<Output Indicator> - The type or format of the output.

You do not always need all four elements in a prompt, and the format depends on the task at hand. We will touch on more concrete examples in upcoming guides.
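As an illustration, here is a plain Python sketch that assembles the four elements above into a single prompt string. The element values are made-up examples:

# Hypothetical values for each of the four prompt elements
instruction = "Classify the customer review below as positive, negative or neutral."
context = "Reviews come from an e-commerce site selling kitchen appliances."
input_data = "Review: The blender broke after two days."
output_indicator = "Answer with a single word."

# Join the elements into one prompt, one element per line
prompt = "\n".join([instruction, context, input_data, output_indicator])
print(prompt)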

5.3 Basic Tips for Prompting

Here are a few important tips for prompting:

1. Prompt Wording

A prompt's wording is paramount, as it guides the LLM in generating the desired output. It is important to phrase the question or statement in a way that the LLM understands and can respond to accurately. For example, if a user is not an expert in an area and does not know the right terms to phrase a question, the LLM may be limited in the answers it provides. It is similar to searching the web without knowing the correct keyword.

2. Succinctness

Succinctness in a prompt is important for clarity and precision. A well-crafted prompt should be concise and to the point, providing enough information for the LLM to understand the user's intent without being overly verbose. However, it is vital to ensure the prompt is not too brief, which may lead to ambiguity or misunderstanding. This balance between not enough and too much can be tricky to strike; practice is probably the best way to master this skill. Wording and succinctness in a prompt matter because they drive specificity.

3. Roles and Goals

In prompt engineering, roles are personas assigned to the LLM and the intended audience. For example, if one is interested in having an LLM write an outline for a blog post on machine learning classification metrics, explicitly stating that the LLM is to act as an expert machine learning practitioner and that its intended audience is data science newcomers would certainly help provide a fruitful response. Whether this should be stated in conversational language ("You are to act as a real estate agent with 10 years' experience in the Phoenix area") or in a more formal manner ("Author: expert Phoenix real estate agent; Audience: inexperienced home buyers") can be experimented with in a given scenario.

Goals are intimately connected to roles. Explicitly stating the goal of a prompt-guided interaction is not only a good idea but also necessary. Without it, how would the LLM have any inkling of what output to generate?

4. Positive and Negative Prompting

Positive and negative prompting is another set of framing methods to guide the model's output. Positive prompts ("do this") encourage the model to include specific types of output and generate certain types of responses. Negative prompts ("don't do this"), on the other hand, discourage the model from including specific types of output and generating certain types of responses. Using positive and negative prompts can greatly influence the direction and quality of the model's output.

5.4 In-Context Learning

Recently, in-context learning has received significant attention as an effective method for improving language models' performance. This approach is a subset of prompt learning and involves using a pre-trained language model as the backbone, along with adding a few input-label demonstration pairs and instructions to the prompt. In-context learning has been shown to be highly effective in guiding language models to produce answers that are more closely aligned with the given prompt. Some recent studies have also suggested that in-context learning can be viewed as a form of implicit fine-tuning, as it enables the model to learn how to generate answers more accurately based on the input prompt.

Zero-Shot Prompting
Large LLMs today, such as GPT-3, are tuned to follow instructions and are trained on large amounts of data, so they are capable of performing some tasks "zero-shot". Here is an example of zero-shot prompting:

Prompt:
Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:

Output:
Neutral

Note that in the prompt above we did not provide the model with any examples of text alongside their classifications; the LLM already understands "sentiment". That is the zero-shot capability at work.

One-Shot Prompting
The one-shot strategy involves the LLM generating an answer based on a single example or piece of context provided by the user. This can guide the model's response and ensure it aligns with the user's intent; the idea is that one example provides more guidance than none. Here is an example of one-shot prompting:

Prompt:
Generate 10 possible names for my new dog. A dog name that I like is Banana.

Few-Shot Prompting
Few-shot prompting can be used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response. Here is an example of few-shot prompting, where the task is to correctly use a new word in a sentence:

Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:

Output:
When we won the game, we all started to farduddle in celebration.
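A few-shot prompt can also be assembled programmatically from demonstration pairs. The sketch below is a minimal illustration; the helper name and the sentiment demonstrations are our own.

# A minimal sketch: building a few-shot prompt from demonstration pairs.

def few_shot_prompt(demonstrations, query):
    """Format (text, label) pairs as conditioning, then append the query."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in demonstrations]
    blocks.append(f"Text: {query}\nSentiment:")  # the model completes this label
    return "\n\n".join(blocks)

demos = [
    ("This movie was fantastic!", "Positive"),
    ("I wasted two hours of my life.", "Negative"),
]
print(few_shot_prompt(demos, "I think the vacation is okay."))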

Chain-of-Thought Prompting (CoT)
The chain-of-thought strategy involves providing the LLM with a few examples that help refine the original question and ensure a more accurate and comprehensive answer. Chain-of-thought prompts include a few worked reasoning examples in the prompt. Chain-of-thought differs from few-shot prompting in that its examples are structured to encourage step-by-step reasoning, and are designed to help uncover insights or approaches that the LLM may not otherwise have considered; the technique also encourages the LLM to output its reasoning. The calling card of chain-of-thought prompting is the phrase "Let's think step by step", generally appended to the end of a prompt, which research suggests is enough on its own to improve generated results. Here is an example of chain-of-thought prompting:

Prompt:
Q: Joe has 20 eggs. He buys 2 more cartons of eggs. Each carton contains 12 eggs. How many eggs does Joe have now? Let's think step by step.
A: Joe started with 20 eggs. 2 cartons of 12 eggs is 24 eggs. 20 + 24 = 44. Therefore, Joe has 44 eggs, and the answer is 44.
Q: John had 93 chickens. If he sold 20 to Bill and bought twice that number more, how many chickens does John have now? Let's think step by step.

Response:
A: John sold 20 chickens to Bill, so he was left with 93 - 20 = 73 chickens. He then bought twice the number of chickens he sold, so he bought 2 * 20 = 40 chickens. Adding these newly bought chickens to his remaining flock, John now has 73 + 40 = 113 chickens.

Self-Criticism Prompting
The self-criticism strategy involves prompting the LLM to assess its own output for potential inaccuracies or areas of improvement. This can help ensure the information provided by the LLM is as accurate as possible, and it can aid users in debugging their prompts and determining why they are not getting results that meet expectations. Here is an example of self-criticism prompting:

Please re-read your above response. Do you see any issues or mistakes with your response? If so, please identify these issues or mistakes and make the necessary edits.

Another example coaches the LLM to use self-criticism to debug its own buggy code:

Look at the code you have just generated. Currently it does not run. Are you able to see any syntax errors or flow control mistakes that you are able to rectify? If so, please identify the section of problematic code and re-generate it.

Iterative Prompting
The iterative prompting strategy involves prompting the LLM with follow-up prompts based on the output of an initial prompt. This involves iterating on the results by asking further

questions or making additional requests from each successive response, so that the interaction expands step by step. For example, consider having ChatGPT assist in creating an outline for a book you are writing. The first prompt could look like this:

I am writing a book on time travel theories. I have not settled on a specific topic. Generate 5 specific topic suggestions for such a book. For each suggestion, provide a title and one paragraph of description of what the book would cover. The book will be aimed at casual readers.

Now, suppose one of the suggested topics is as follows:

Title: "Temporal Paradoxes: Navigating the Complexities of Time Travel"
Description: "Temporal Paradoxes" delves into the mind-bending intricacies of time travel paradoxes, exploring the fascinating conundrums they present. This book explores various paradoxes such as the Bootstrap Paradox, the Predestination Paradox, and the Information Paradox, among others. …

You could then iterate on this with a follow-up prompt:

I will write the book you suggested, "Temporal Paradoxes: Navigating the Complexities of Time Travel". Generate a chapter outline for this book, including sections and subsections.

A possible output excerpt is below:

Introduction
– Defining temporal paradoxes
– Significance of temporal paradoxes in time travel narratives
– Overview of the chapter structure
Chapter 1: The Bootstrap Paradox
1.1 The essence of the Bootstrap Paradox
– Explaining the concept and its origin
– Key examples from literature, film, and real-life anecdotes
…

This iteration can then continue. A further prompt could be:

Let's flesh out the introduction. Generate a two-paragraph summary of what each of the chapter's sections will cover. The chapter outline for the introduction is below.
–––
Introduction
– Defining temporal paradoxes
– Significance of temporal paradoxes in time travel narratives
– Overview of the chapter structure
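An exchange like the one above can be scripted as a multi-turn conversation in which each follow-up prompt is sent together with the accumulated message history, so the model always sees its earlier outputs. A minimal sketch, again assuming the pre-1.0 openai Python SDK; the ask helper and the prompts are illustrative.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key
history = []  # running conversation, so each follow-up sees earlier outputs

def ask(prompt):
    """Send a follow-up prompt in the context of the conversation so far."""
    history.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

topics = ask("I am writing a book on time travel theories. Generate 5 specific "
             "topic suggestions, each with a title and a one-paragraph description.")
outline = ask("I will write the first book you suggested. Generate a chapter "
              "outline for it, including sections and subsections.")
print(outline)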

Model-Guided Prompting
Model-guided prompting involves instructing the LLM to prompt you for the information it needs to complete a requested task. This is similar to telling someone, "ask me what you need to know." For example:

I would like you to write a Python program to manage my client information, which is stored in a Google Sheet. Please ask me whatever questions you need answers to in order to undertake this assignment.

The questions from the LLM might then look like this:

1. Are you familiar with using the Google Sheets API and OAuth2 authentication in Python? If not, I can guide you through the setup process.
2. What specific operations do you need to perform on the client information? For example, do you need to retrieve client details, update existing records, add new clients, or delete clients?
3. Could you share the structure of your Google Sheet? For instance, what are the column headers, and how is the client information organized in rows and columns?
4. Do you have any preferences for how the program should interact with the Google Sheet, such as using a specific library or method?
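Model-guided prompting can also be scripted as a staged exchange: first collect the model's questions, then supply answers and request the final result. A minimal sketch; the chat argument is a hypothetical callable, supplied by the caller, that sends a message list to an LLM and returns the reply text.

# A minimal sketch of model-guided prompting as a staged exchange.

def model_guided(task, chat):
    """chat: a caller-supplied callable mapping a message list to reply text."""
    messages = [{"role": "user",
                 "content": task + "\n\nBefore answering, please ask me whatever "
                                   "questions you need answers to."}]
    questions = chat(messages)                       # stage 1: the model asks
    messages.append({"role": "assistant", "content": questions})
    answers = input(questions + "\nYour answers: ")  # stage 2: the user answers
    messages.append({"role": "user",
                     "content": answers + "\n\nNow please complete the task."})
    return chat(messages)                            # stage 3: the final output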

6. AI Agents

6.1 Introduction

In classical terms, an "agent" is an automated entity that interacts with its surroundings: it senses its environment using sensors and then acts on it with actuators or effectors. Actuators are the elements that translate energy into motion; they direct and move a system. A few examples include gears, motors, and rails. In the context of artificial intelligence, the study of agents is largely the study of rational agents. Anything that makes decisions, whether a person, business, computer, or piece of software, can be a rational agent. These rational agents always operate in an environment that might contain other agents. Since these agents have the power to learn and process things automatically, they are known as AI agents. Here we will specifically discuss agents in the context of large language models, which use a pre-trained LLM as their knowledge base and perform tasks based on user queries.

AI agents are designed to think and act independently. The only thing you have to provide is a goal (query) through a prompt, be that researching competitors or buying a pizza. They will generate a task list and get to work, relying on feedback from the environment and their own internal monologue. It is as if the AI agents can prompt themselves, constantly evolving and adapting to achieve their objective in the best way possible.

AI agents can use computers extremely well. They can browse the web, use apps, read and write files, make payments with your credit card, and even control your laptop as they do so. This marks one more step toward AGI (artificial general intelligence): the moment when a machine is able to carry out the same kinds of tasks that humans can across any topic or area of specialization, with complete flexibility and superior performance.

6.2 How it works

Goal initialization: When you input your objective, the AI agent passes your prompt to the core LLM (currently often GPT-3.5 or GPT-4) and returns the first output of its internal monologue, showing that it understands what it needs to do.

Creating a task list: Based on the goal, it generates a set of tasks and determines the order in which to complete them. Once it decides that it has a viable plan, it starts searching for information.

Execution: Since the agent can use a computer the same way you do, it can gather information from the internet. Agents can connect with other AI models to outsource tasks and decisions, giving them access to, for example, geographical data processing or computer vision features. All data is stored and managed by the agent, both so it can relay it back to you and so it can improve its strategy as it moves forward.

Feedback: As tasks are crossed off the list, the agent assesses how far it still is from the goal by gathering feedback, both from external sources and from its internal monologue. Until the goal is met, the agent keeps iterating: creating more tasks, gathering more information and feedback, and moving forward without pause. (A simplified sketch of this loop follows below.)

Many studies have explored the possibility of combining LLMs' verbal reasoning with interactive decision making in autonomous systems. Properly prompted large language models have demonstrated emergent capabilities to carry out several steps of reasoning to derive answers from questions in arithmetic, commonsense, and symbolic reasoning tasks.
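The four stages above can be condensed into a simple loop. The sketch below is a deliberately simplified, hypothetical skeleton (all helper names are ours, not from AutoGPT or any other framework); real agents add memory, tool selection, and much richer feedback handling.

# A deliberately simplified agent loop; the helper names are hypothetical.
# `llm` maps a prompt string to text; `execute` maps a task to an observation
# (e.g. a web search). Both are supplied by the caller.

def plan_tasks(goal, llm):
    """Goal initialization: ask the LLM for an initial task list."""
    plan = llm(f"Break this goal into a short list of tasks, one per line:\n{goal}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def run_agent(goal, llm, execute, max_steps=10):
    tasks = plan_tasks(goal, llm)                 # goal initialization + task list
    observations = []
    for _ in range(max_steps):
        if not tasks:
            break                                 # nothing left to try
        task = tasks.pop(0)
        observations.append(execute(task))        # execution step
        # Feedback: let the LLM revise the remaining plan given what happened.
        revised = llm(f"Goal: {goal}\nDone so far: {observations}\n"
                      f"Remaining tasks: {tasks}\nRevise the remaining task list, "
                      f"one task per line, or reply DONE if the goal is met.")
        if revised.strip() == "DONE":
            break
        tasks = [line.strip() for line in revised.splitlines() if line.strip()]
    return observations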

6.3 Key Techniques

Chain-of-thought (CoT) reasoning is a static black box: the model uses its own internal representations to generate thoughts and is not grounded in the external world, which limits its ability to reason reactively or update its knowledge. This can lead to issues like fact hallucination and error propagation over the reasoning process.

ReAct (Reason + Act) is a general paradigm that combines reasoning and acting with language models for solving diverse language reasoning and decision-making tasks. ReAct prompts LLMs to generate both verbal reasoning traces and task-specific actions in an interleaved manner. This allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with external environments (e.g. Wikipedia) to incorporate additional information into its reasoning (act to reason).

[Figure: a comparison of different prompting approaches. Source: ReAct: Synergizing Reasoning and Acting in Language Models]

For more details on multi-task agents and tools, see ChatGPT plugins, Superpower LLMs with Conversational Agents | Pinecone, and 4 Autonomous AI Agents you need to know | by Sophia Yang, Ph.D. | Towards Data Science.
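The interleaving that ReAct describes can be implemented as a loop that alternates model-generated Thought/Action steps with Observations returned by tools. A minimal sketch: the Thought/Action/Observation format follows the ReAct paper, but the parsing, stopping rule, and single search tool here are our own simplifications.

import re

# A minimal ReAct-style loop. `llm` maps a prompt string to a completion;
# `tools` maps action names to functions. Both are supplied by the caller.

def react(question, llm, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")   # model emits a thought, maybe an action
        transcript += "Thought:" + step + "\n"
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
        if action is None:                    # no action means a final answer
            return step
        name, argument = action.group(1), action.group(2)
        observation = tools[name](argument)   # act on the environment (e.g. search)
        transcript += f"Observation: {observation}\n"  # feed the result back in
    return transcript

# Example tool table: a single Wikipedia-style lookup, stubbed for illustration.
tools = {"search": lambda q: f"(stub) top result for '{q}'"}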

6.4 Limitations

Though AI agents offer many advantages in terms of saving costs and increasing productivity over the long term, here are a few immediate challenges that can arise:

1. Lack of Human Touch and Personalization: While AI agents offer numerous benefits, they often lack the human touch and personalized approach that some customers desire. In certain contexts, customers may prefer human interactions that provide empathy, emotional intelligence, and a deeper level of understanding. Balancing the use of AI agents with the human touch is crucial to meeting diverse customer needs.

2. Job Displacement and Workforce Changes: The implementation of AI agents can lead to job displacement and changes in the workforce landscape. As AI technology automates tasks previously performed by humans, certain roles may become redundant. However, this also presents an opportunity for businesses to reskill and upskill their workforce to adapt to new roles that complement AI agents. Proactive strategies and investment in human capital are essential to ensure a smooth transition.

3. Ethical Concerns and Bias: The rise of AI agents has raised ethical concerns related to privacy, data security, and potential biases in decision-making algorithms. It is crucial for organizations to develop responsible AI practices, ensuring the protection of user data and mitigating the risks of bias. Transparency, accountability, and ongoing monitoring are necessary to address the ethical considerations associated with AI implementation.

4. Technical Limitations and Errors: AI agents, while powerful, are not without their limitations. Current AI technologies may struggle with complex or ambiguous situations that require human intuition and contextual understanding. There is also the risk of errors and misinterpretations in AI algorithms, which can have significant consequences. Ongoing research and development efforts aim to overcome these technical limitations and enhance the accuracy and reliability of AI agents.

5. Implementation Challenges and Costs: Implementing AI agents can present challenges for businesses. It requires robust infrastructure, effective data management systems, and specialized expertise. The process of integrating AI agents into existing workflows and systems can be complex and time-consuming. Additionally, there are associated costs, including investment in hardware, software, and training. Organizations need to carefully plan and allocate resources to overcome implementation challenges and ensure a successful deployment of AI agents.

7. Responsible AI

AI is transforming industries and solving important, real-world challenges at scale. This vast opportunity carries with it a deep responsibility to build AI that works for everyone, without discrimination. Responsible AI is a set of practices that ensures AI systems are designed, deployed and used in an ethical and legal way. When companies implement responsible AI, they minimize the potential for artificial intelligence to cause harm and make sure it benefits individuals, communities and society.

Responsible AI and ethical AI are closely related concepts, but they are not the same and should not be used interchangeably. Responsible AI focuses on the development and use of artificial intelligence in a way that considers its potential impact on individuals, communities and society as a whole. This involves not just ethics, but also fairness, transparency and accountability as ways to minimize harm. Ethical AI, by contrast, focuses specifically on the moral implications and considerations of artificial intelligence, including bias, discrimination and impact on human rights, and on ensuring that AI is used in responsible ways. Ethical AI can be considered a subset of responsible AI.

Responsible AI is meant to address data privacy, bias and lack of explainability, the "big three" concerns of ethical AI. Data, which AI models rely on, is sometimes scraped from the internet with no permission or attribution; other times it is the proprietary information of a specific company. Either way, it is important that AI systems gather, store and use this data in a way that is both compliant with existing data privacy laws and safe from cybersecurity threats.

Then there is the issue of bias. AI models are built on a foundation of data, and if that foundation contains prejudiced, distorted or incomplete information, the outputs generated will reflect it and even magnify it. And there may not be a clear explanation of how or why an AI model works the way it does: these algorithms operate on immensely complex mathematical patterns, too complex for even experts to understand (the "black box" problem), which can make it difficult to understand why a model generated a particular output.

As automation continues to disrupt virtually every business across all industries, affecting the way we live, work and create, the stakes are even higher. If an AI recruiting tool is consistently biased against women, people of color or people with disabilities, it could affect the livelihoods of thousands or even millions of people. And if a company violates a data privacy law, people's personal information is put in danger, not to mention the fines the company will have to deal with. "The kind of damages that can happen societally are really extensive. And they can happen inadvertently, which is why it's really important for everyone who's involved with AI to be careful."

Responsible AI can help to mitigate those damages. It provides a framework on which companies can build and use safer, more trustworthy and fairer AI products, allowing them to take advantage of all the benefits of artificial intelligence, responsibly. The following guiding principles can be followed when implementing responsible AI:

1. Fairness: AI systems should be built to avoid bias and discrimination, and they should not perpetuate or exacerbate existing equity issues in the world. Instead, they should treat users of all demographics fairly, regardless of race, gender, socioeconomic background or any other factor. Accomplishing this requires AI developers to make certain that all the data used to train algorithms is diverse and representative of the real-world population. It also requires developers to remove any discriminatory patterns or outliers that may negatively impact an AI model's performance, and to regularly test and audit their AI products to make sure they remain fair after their initial deployment.

2. Transparency: AI systems should be understandable and explainable both to the people who make them and to the people who are affected by them. The inner workings of how and why a system came to a particular decision or generated a particular output should be transparent, including how the data used to train it is collected, stored and used. Of course, this is not always possible; sometimes AI models are simply too big and complex for even experts to fully understand. But companies can choose to work with models that are inherently more transparent and explainable, such as decision trees or linear regression, which provide clear rules or logic that humans can easily follow.

3. Privacy and Security: Protecting the privacy of individuals is good practice, and in many cases it is the law. Companies should handle any personal data they use to train their models appropriately, respecting existing privacy regulations and ensuring that it is safe from theft or misuse. This typically requires a data governance framework: a set of internal standards an organization follows to ensure its data is accurate, usable, secure and available to the right people under the right circumstances. Companies can also anonymize or aggregate sensitive data to better protect it, which involves removing or encrypting personally identifiable information from the datasets used for training.

4. Inclusive Collaboration: Every AI system should be designed with the oversight of a team of humans that is as diverse as the general population, with varied perspectives, backgrounds and experiences. Business leaders and experts in ethics, social sciences and other subject matters should be included in the process just as much as data scientists and AI engineers, to ensure the product is inclusive and responsive to the needs of everyone.

5. Accountability: Organizations developing and deploying AI systems should take responsibility for their actions, and they should have mechanisms in place to address and rectify any negative consequences or harms caused by AI products they either made or used. As of now, there are very few formal avenues for accountability when an AI system goes wrong. Violating data privacy legislation like the California Consumer Privacy Act and the EU's General Data Protection Regulation can lead to some hefty fines, and there are several

anti-discriminatory laws already on the books that the U.S. Federal Trade Commission has said apply to AI. But there are no regulations pertaining specifically to artificial intelligence yet, though there has been some progress in this direction since ChatGPT launched, as people and governments realize the need for oversight.

There are also Responsible AI tools that can be an effective way to inspect and understand AI models. Resources like Explainable AI, Model Cards, and the TensorFlow open-source toolkit provide model transparency in a structured, accessible way. Responsible AI can help improve the compliance, quality and brand reputation of AI tools, which will increase confidence in, and acceptance of, them across society.

8. References

1. Attention Is All You Need
2. On the Opportunities and Risks of Foundation Models
3. A History of Generative AI from GAN to ChatGPT
4. Attention Is All You Need (PPT)
5. Generative AI: Advantages, Disadvantages, Limitations, and Challenges
6. Generative AI Use Cases
7. How Transformers Work
8. Understanding GRU Networks
9. NVIDIA: What Is a Transformer Model
10. The Illustrated Transformer
11. Gaussian Error Linear Units
12. NVIDIA Megatron-LM
13. BigScience Large Open-science Open-access Multilingual Language Model
14. Megatron-Turing Natural Language Generation
15. OPT-LM
16. T5-LM
17. Switch Transformers
18. Hugging Face BART
19. Denoising Diffusion Probabilistic Models
20. Google PaLM Models
21. Meta LLaMA Model
22. OpenAI ChatGPT Prompt Engineering
23. Prompting Guide: Basic Elements
24. Types of AI Agents
25. ReAct: Synergizing Reasoning and Acting in Language Models
26. All Readings: Introduction to Generative AI (Google)
27. Google Responsible AI
28. Responsible AI
29. Advantages and Disadvantages of Using AI Agents
30. Drawbacks of Transformers
31. Validating Large Language Models with ReLM
32. The Strengths and Weaknesses of Large Language Models
33. A Gentle Introduction to Positional Encoding in Transformer Models
34. Fine-tuning a Masked Language Model
35. QLoRA: Fine-tune a Large Language Model on Your GPU
36. Stanford Alpaca-LoRA
37. RLHF
38. Diffusion Models
39. PaLM
40. GPT-3
41. ChatGPT
42. GPT-4
43. InstructGPT
44. Plugins
45. borealisai

