Juicing Tokens: LLM Prompting
October 22 2023
machine-learning

LLMs have taken the world by storm. With the rise of ChatGPT came an army of LLM-driven applications and an influx of new research into the field. We live in a post-LLM world, and it's awesome (for now). The diversity of use cases for LLMs and the prohibitive costs of training them have made their flexibility of the utmost importance.

Using an LLM

Many people don't have the time and resources of companies like Facebook, Google, OpenAI, and other developers of LLMs. Thus, as a user you basically shop around for a model you want - Llama, Claude, GPT-4 - add it to your basket, and bring it to YourLLMWrapper, where you feed it whatever strings you want it to process. The point is that your use case at YourLLMWrapper for the model you choose is completely divorced from its training.

Context Window

Before a string of text reaches an LLM, it's broken up into a sequence of tokens through a process called tokenization. There are many ways tokens can be defined, but they're often derived from the model's training data.

You can imagine something like

Input:  What is the last digit in pi?
---------------------------------------------
Tokens: [What, is, the, last, digit, in, pi?]

The maximum number of tokens you can pass into an LLM is referred to as its context window.
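
To make this concrete, here's a minimal sketch of tokenizing a prompt and counting its tokens, using OpenAI's tiktoken library as one example of a tokenizer. Other models use different tokenizers, so the exact tokens and counts will differ.

# Minimal tokenization sketch using tiktoken (one tokenizer among many).
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

prompt = "What is the last digit in pi?"
tokens = encoding.encode(prompt)               # a list of integer token IDs

print(len(tokens))                             # how many tokens the prompt consumes
print([encoding.decode([t]) for t in tokens])  # the text each token maps back to

Whatever this count is, it has to fit (along with everything else in your prompt, and room for the response) inside the model's context window.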

For example, as of October 20th 2023, OpenAI's GPT-4 has a maximum context window of 32,768 tokens and Anthropic's Claude has a context window of 100,000 tokens.

As an LLM consumer, the only way you can communicate what you want from the model is through the N tokens you put into its context window. If the model is a house and the context window is the door frame, then your tokens are the furniture that needs to fit through the door.

Because all you can control is the tokens, substantial effort has gone into ensuring your tokens elicit the best possible responses from the model.

Juicing the Fixed Context

There are two broad categories of techniques used to improve results: prompt-based and retrieval-based. Prompt-based strategies focus on communicating intent to the model and encouraging coherent reasoning, whereas retrieval-based strategies focus on providing the right information to the model.

Prompting Techniques

The practice of developing prompts and prompt structures is called prompt engineering. Briefly, here are some examples of prompting techniques.

Prompt

Out of the options, determine which is bigger.
Options: Squirrel Tiger
Bigger:

Output

Tiger

Zero-shot prompting is when the model is given a prompt it hasn't seen in training, without examples or additional text to guide its reasoning. The model is expected to infer, based only on its training, the right way to answer the query.
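
In code, a zero-shot prompt is just the task itself sent to the model. The sketch below assumes the pre-1.0 openai Python client (the API shape differs in newer versions); any chat-completion API would work the same way.

# Zero-shot sketch: the prompt contains only the task, no demonstrations.
import openai

openai.api_key = "sk-..."  # placeholder

prompt = (
    "Out of the options, determine which is bigger.\n"
    "Options: Squirrel Tiger\n"
    "Bigger:"
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "Tiger"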

Prompt

a(5, 6) = 11
a(1, 3) = 4
a(7, 9) = 16

a(10, 1) = ?

Output

a(10, 1) = 11

Few-shot prompting is when examples of expected behavior are provided to the model before presenting a question or task. It's an example of in-context learning whereby the model learns how to respond based on demonstrations provided in the prompt.
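
Assembling a few-shot prompt is mostly string building: demonstrations go first, then the actual query. The sketch below reproduces the a(x, y) example above; the examples list and formatting are illustrative, not a fixed API.

# Few-shot sketch: concatenate demonstrations ahead of the query so the
# model can infer the pattern in-context.
examples = [
    ((5, 6), 11),
    ((1, 3), 4),
    ((7, 9), 16),
]

lines = [f"a({x}, {y}) = {answer}" for (x, y), answer in examples]
lines.append("")              # blank line separating demonstrations from the query
lines.append("a(10, 1) = ?")

prompt = "\n".join(lines)
print(prompt)                 # same text as the few-shot prompt above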

Prompt

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
   tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria has 23 apples. If they used 20 to make lunch and bought
   6 more, how many apples do they have?
A:

Output

The cafeteria has 23 apples originally. They used 20 to make lunch. So they
had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer
is 9.

Example source

Chain-of-thought (CoT) reasoning guides the model toward explaining its thought process while answering a question. As an extension of few-shot prompting, this approach has enabled models to engage in complex reasoning. It also illustrates how models arrive at their answers.
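
Structurally, a chain-of-thought prompt is a few-shot prompt whose demonstrations include the worked-out reasoning, not just the final answer. A minimal sketch, mirroring the example above:

# Chain-of-thought sketch: the demonstration shows its reasoning, nudging
# the model to reason step by step before answering the new question.
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11."
)

question = (
    "Q: The cafeteria has 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?\n"
    "A:"
)

prompt = cot_example + "\n\n" + question  # demonstrations first, then the query
print(prompt)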

There are several other prompting techniques, including self-consistency, generated knowledge prompting, tree of thoughts, active prompting, and more.

Prompt Engineering Guide is a great resource if you want to learn more.

Retrieval Techniques

Suppose you want to implement a system for semantic querying of text documents. Something that allows you to answer questions like: "What is Alice's middle name?", as found in MassiveNameList.txt.

The system handles huge documents and therefore the documents cannot fit inside your model's N token context window. So what do you do? You can't give all the context to the model, and yet you need to answer questions about the contents of the documents.

This is where retrieval techniques come in. The premise is simple: rather than feeding the entire document into the model's context, just pass in the (much smaller) relevant parts of the document. For the names example, that might be all names that start with "Alice".

Retrieval techniques function as follows. You have a large corpus of text and a query you want to run over the text. You look at the query and extract the relevant pieces of text, and then insert them before your query inside of your final prompt.

Prompt

<retrieved text>
<query>

Retrieval Augmented Generation (RAG) is a technique that does exactly this.

Content is put through an embedding model in chunks, turning each chunk of text into a list of numbers called a vector embedding. The embeddings are stored in a vector database.

For each query, the embedding model turns the query into a vector embedding. The query's vector embedding is then used to search the vector database for relevant content.

"What is Alice's middle name?" => EmbeddingModel => [0.4, 0.9, 0.2] =>

[0.4, 0.9, 0.2] => VectorDatabase => ["Alice Smith Jacobs", "Amy Brown",
"Allison Groovy"]

The relevant context, alongside the query, is then passed into the model.

Final prompt

Alice Smith Jacobs
Amy Brown
Allison Groovy

Q: What is Alice's middle name?

Using RAG, and similar retrieval techniques, you can perform queries over volumes of data that greatly exceed your model's context window!

For those interested: a new retrieval approach was recently outlined in MemGPT, which adds a feedback loop. Relevant data is retrieved, and if the model finds that it needs more context, it requests more from the system, effectively managing its own virtual memory!
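
As a toy illustration of that feedback-loop idea (this is not MemGPT's actual interface), the system can watch for a special signal in the model's reply and retrieve more context before asking again. The retrieve and ask_model helpers and the NEED_MORE_CONTEXT marker below are all hypothetical.

# Toy feedback loop: retrieve, ask, and widen the retrieval if the model
# signals that it needs more context.
def answer(query, retrieve, ask_model, max_rounds=3):
    # retrieve(query, k) and ask_model(prompt) are hypothetical helpers.
    k = 3
    reply = ""
    for _ in range(max_rounds):
        context = "\n".join(retrieve(query, k))
        reply = ask_model(context + "\n\nQ: " + query)
        if "NEED_MORE_CONTEXT" not in reply:   # assumed signal, for illustration
            return reply
        k *= 2                                 # fetch more context and try again
    return reply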

Writing effective prompts

To recap, creating LLMs is expensive, time-consuming, and technically challenging, so developers typically use existing models. Consequently, the only thing developers control when working with a model is the data passed into its fixed-size context window. Hence, to make the most of one's tokens, smartly structuring prompts is important.

The objective of a prompt is typically to communicate two things:

  • What do you want the model to do?
  • What do you want the model to know, when answering?

Prompt engineering is a new field dedicated to finding the best ways to communicate intent and promote intelligent reasoning from models. It's the source of prompting techniques such as zero-shot, few-shot, chain of thought, tree of thought, and others.

Retrieval techniques, on the other hand, aim to provide the model with the relevant context it needs to answer queries. If a model's purpose is to answer questions about a large set of data, retrieval techniques like RAG are the mechanism by which you can give the model access to this data, despite the amount of data vastly exceeding the model's context window.

Taken together, prompt and retrieval techniques are some of the best ways to make the most of an LLM's context window, enabling one to mobilize large amounts of data and get correct, well-reasoned responses that match your intent.