Large Language Models (LLMs) exhibit 'intelligent-like' behaviour when responding to a prompt. Different LLMs have been trained for different use cases, and the most recent, such as Google's Gemini, can accept a task type that they use to optimise their approach to the use case. Gemini is a family of LLMs trained on multimodal data. Nevertheless, an LLM is a 'single inference' mechanism: a prompt or query is input and the LLM responds with generated output (usually text). It does not in itself retain context or session history, and this is where agents come in.
An agent has a core loop in which it observes the world state, then decides upon and enacts an action. It then waits for and observes the reaction to that action, and continues until its end goal is achieved.
Within this loop, the agent uses its profile, goals and instructions, along with its memory and planning abilities, to decide upon the next action. This could simply be predicting the next token, or the agent may decide it needs to act on the environment and observe the resulting change. That involves calling a tool, obtaining the response, and adding it to the context. Tools are configured with metadata that describes what the tool does along with its input and output parameters.
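As an illustration, tool metadata is often expressed as an OpenAPI-style function declaration like the one below (a minimal sketch; the `search_flights` name and its parameters are invented for this example). The model never runs this; it only reads the metadata to decide when to call the tool and with which arguments.

```python
# Hypothetical tool declaration in the JSON-schema style many LLM APIs accept.
flight_search_tool = {
    "name": "search_flights",
    "description": "Search for available flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code of the departure airport, e.g. LHR"},
            "destination": {"type": "string", "description": "IATA code of the arrival airport, e.g. JFK"},
            "date": {"type": "string", "description": "Departure date in YYYY-MM-DD format"},
        },
        "required": ["origin", "destination", "date"],
    },
}
```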
The latest LLMs are capable of following instruction-based reasoning frameworks such as ReAct, Chain-of-Thought and Tree-of-Thoughts. An agent uses these, and whereas foundational LLMs are unable to interact with the real world, an agent can interact with external data and services. These tools often align with web APIs that change the world state.
The orchestration layer in an agent manages its state (or memory) and maintains its reasoning and planning, using one or more prompt engineering frameworks to guide that reasoning and planning:
ReAct is a prompt engineering framework that provides a thought process strategy for language models to reason and take action on a user query.
Chain-of-Thought is a prompt engineering framework that enables reasoning through intermediate steps.
Tree-of-Thoughts is a prompt engineering framework suited to exploration and strategic planning tasks. Agents can use these approaches to choose the next best action for a given user request; a sketch of a ReAct-style prompt follows.
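To make this concrete, a minimal ReAct-style prompt scaffold might look like the following (a sketch only; the tool names and format are illustrative rather than any particular library's template):

```python
# A bare-bones ReAct prompt template. The model interleaves free-text
# reasoning ("Thought") with structured tool calls ("Action"), and the
# orchestration layer injects each tool result as an "Observation".
REACT_PROMPT = """Answer the question using the tools available.

Tools: flight_search, weather_lookup

Use this format:
Question: the user's question
Thought: reason about what to do next
Action: tool_name[tool input]
Observation: the tool's result
... (Thought/Action/Observation can repeat)
Thought: I now know the answer
Final Answer: the answer to the question

Question: {question}
"""
```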
Tools bridge the agent's internal knowledge to the real world. They allow the agent to interact with APIs, databases, etc.
Typically an agent will use its LLM to determine whether additional information is needed and, if so, select an appropriate tool from its configured set (e.g. a weather API, calculator or database lookup). It will construct a function call with the necessary parameters, invoke the tool, observe the output, perform any additional processing, and finally integrate the retrieved information with its current context to generate a complete response. A sketch of this loop follows.
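The sketch below assumes a hypothetical `call_llm` helper that returns either a final answer or a structured tool call; the message format is invented for illustration:

```python
# Sketch of the observe/decide/act loop. `call_llm` and the tool registry
# are hypothetical stand-ins for whichever model API and tools the agent
# is configured with.
def run_agent(question, tools, call_llm, max_steps=5):
    context = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_llm(context)            # model decides: final answer or tool call?
        if reply.get("tool") is None:
            return reply["content"]          # no further information needed
        observation = tools[reply["tool"]](**reply["args"])  # invoke the selected tool
        # Feed the observation back so the next step can reason over it.
        context.append({"role": "assistant", "content": f"Called {reply['tool']}({reply['args']})"})
        context.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("No final answer within max_steps")
```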
Consider a flight booking agent. A user makes a request: 'I want to book a flight to X'. The agent articulates its thoughts on what it should do, which might include 'I should search for flights to X'. As a result it might decide to use its Flight Search Tool. The results of executing this tool can either be presented to the user or used as an intermediate step for further reasoning. Currently there are three types of tools that an agent can use:
Extensions allow the agent to directly call an external API. The agent is taught how to use the API through examples, so that it knows what arguments will successfully call the endpoint. At runtime the agent uses the model and those examples to decide which extension, if any, to use.
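For instance, on Vertex AI a pre-built extension can be attached along these lines (a sketch based on the preview `vertexai` SDK; the API surface is in preview and may have changed, and the project id is a placeholder):

```python
# Sketch: invoking Google's pre-built code-interpreter extension on Vertex AI.
import vertexai
from vertexai.preview.extensions import Extension

vertexai.init(project="my-project", location="us-central1")  # placeholder project
code_interpreter = Extension.from_hub("code_interpreter")
response = code_interpreter.execute(
    operation_id="generate_and_execute",
    operation_params={"query": "find the first 10 prime numbers"},
)
print(response)
```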
Functions are similar to extensions, but the model does not execute the function; it selects one and fills in its arguments as required. The function call is returned to the client, where it can be executed:
We define a function in the usual way (a minimal sketch; `search_flights` and its signature are illustrative, and the docstring and type hints are what the model uses to fill in arguments):
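```python
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights.

    Args:
        origin: IATA code of the departure airport, e.g. "LHR".
        destination: IATA code of the arrival airport, e.g. "JFK".
        date: Departure date in YYYY-MM-DD format.
    """
    # Client-side implementation: call your flight-search API of choice here.
    return {"flights": []}  # placeholder result
```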
And then set up the agent along the following lines (a sketch using the google-generativeai SDK; the model name and API-key handling are illustrative):
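```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Passing the Python function as a tool lets the SDK derive the function
# declaration from its signature and docstring.
model = genai.GenerativeModel("gemini-1.5-flash", tools=[search_flights])

chat = model.start_chat()
response = chat.send_message("Find me a flight from London to New York on 2025-03-01")

# The model does not execute the function; it returns the selected function
# and its filled-in arguments for the client to execute.
for part in response.parts:
    if fn := part.function_call:
        print(fn.name, dict(fn.args))
```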
A Data Store is an embedding store for additional data, typically implemented as a vector database. A common example is in the implementation of Retrieval Augmented Generation (RAG) applications.
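As a toy illustration of the idea (the `embed` function below is a stand-in for a real embedding model, and the documents are invented):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model (e.g. a hosted text-embedding API)."""
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    return vec

# A toy in-memory "data store": documents embedded as vectors, retrieved by
# cosine similarity. A real system would use a vector database instead.
documents = [
    "Refunds are processed within 5 days of cancellation.",
    "Flights can be changed up to 24 hours before departure.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("What is the refund policy?"))
```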
Google shows how RAG can be combined with ReAct: the agent's Thought/Action steps retrieve supporting passages from the data store, and the observations ground the final answer.
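An illustrative trace of this pattern (the question and retrieved snippet are invented for this sketch):

```
Question: What is the refund window for cancelled flights?
Thought: I should look this up in the policy data store.
Action: retrieve["refund window for cancelled flights"]
Observation: "Refunds are processed within 5 days of cancellation."
Thought: The retrieved passage answers the question.
Final Answer: Refunds are processed within 5 days of cancellation.
```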
The following code uses LangChain to link a Google Search tool and the Google Places API. The agent can then be used to answer queries such as "Who did the Texas Longhorns play in football last week? What is the address of the other team's stadium?"
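A sketch of such an agent, assuming SerpAPI and Google Places API keys and the langchain-community, langchain-google-vertexai and langgraph packages (the tool wiring and model name are illustrative):

```python
import os
from langchain_core.tools import tool
from langchain_community.utilities import SerpAPIWrapper
from langchain_community.tools import GooglePlacesTool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent

os.environ["SERPAPI_API_KEY"] = "..."   # SerpAPI key for Google Search
os.environ["GPLACES_API_KEY"] = "..."   # Google Places API key

@tool
def search(query: str) -> str:
    """Run a Google Search via SerpAPI."""
    return SerpAPIWrapper().run(query)

@tool
def places(query: str) -> str:
    """Query the Google Places API."""
    return GooglePlacesTool().run(query)

model = ChatVertexAI(model="gemini-1.5-flash")
agent = create_react_agent(model, [search, places])

query = ("Who did the Texas Longhorns play in football last week? "
         "What is the address of the other team's stadium?")
# Stream the agent's intermediate steps (thoughts, tool calls, observations).
for step in agent.stream({"messages": [("human", query)]}, stream_mode="values"):
    step["messages"][-1].pretty_print()
```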
The following shows the actual output from running the code above: