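Explore how to build a retrieval-augmented conversational agent.

In earlier chapters we saw how powerful retrieval augmentation and conversational agents can be. They become even more compelling when used together.

Conversational agents can struggle with data freshness, domain-specific knowledge, or access to internal documents. Pairing an agent with a retrieval augmentation tool addresses these problems.

On the other hand, "naive" retrieval augmentation without an agent retrieves context on every query. That is not always ideal either, because not every query needs access to external knowledge.

Combining the two approaches gives us the best of both. In this notebook we will learn how to do that.

Before we start, we need to install the libraries used in this notebook.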

pip install -qU \
    openai==1.6.1 \
    pinecone-client==3.1.0 \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    tiktoken==0.5.2 \
    datasets==2.12.0

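Building the knowledge base

We start by building the knowledge base. We will use a mostly pre-prepared dataset, the Stanford Question-Answering Dataset (SQuAD), hosted on Hugging Face Datasets. We download it as follows: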

from datasets import load_dataset

data = load_dataset('squad', split='train')
data

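The dataset contains duplicate contexts, because several questions can share the same passage. We can remove the duplicates like this: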

data = data.to_pandas()
data.head()

data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()
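
To make the effect of keep='first' concrete, here is a minimal plain-Python sketch of the same idea: for each distinct context, only the first record seen is kept. The records and the helper name drop_duplicate_contexts are made up for illustration and are not part of the notebook or of pandas.

```python
# Toy records standing in for SQuAD rows (made-up values, not real data).
records = [
    {"id": "a1", "title": "T1", "context": "Paragraph one."},
    {"id": "a2", "title": "T1", "context": "Paragraph one."},
    {"id": "b1", "title": "T2", "context": "Paragraph two."},
]


def drop_duplicate_contexts(rows):
    """Keep only the first row seen for each distinct 'context' value."""
    seen = set()
    unique = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            unique.append(row)
    return unique


unique_records = drop_duplicate_contexts(records)
print([r["id"] for r in unique_records])  # ['a1', 'b1']
```

The second record is dropped because its context was already seen, which is why the ID of the first question for each passage ends up as the vector ID later on.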

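Initializing the embedding model and vector DB

We will use OpenAI's text-embedding-ada-002 model, initialized via LangChain, together with the Pinecone vector database. We initialize the embedding model first; for this we need an OpenAI API key.

(Note that OpenAI is a paid service, so running the rest of this notebook will incur a small cost.)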

import os
from getpass import getpass
from langchain.embeddings.openai import OpenAIEmbeddings

# get API key from top-right dropdown on OpenAI website
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or getpass("Enter your OpenAI API key: ")
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Next we create the vector DB that will store our vectors. For this we need a free Pinecone API key, which can be found under "API keys" in the left navigation bar of the Pinecone console.

from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass("Enter your Pinecone API key: ")

# configure client
pc = Pinecone(api_key=api_key)

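Now we can set up the index specification, which defines the cloud provider and region where the index will be deployed. A list of all available providers and regions can be found in the Pinecone documentation.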

from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

When creating the index, we set dimension to the dimensionality of Ada-002 (1536) and use a metric that is compatible with Ada-002 (either cosine or dotproduct). We also pass our spec to the index initialization.

import time

index_name = "langchain-retrieval-agent"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We can see that the new Pinecone index has a total_vector_count of 0, because we have not added any vectors yet.
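
On the choice of metric when creating the index: to my understanding, OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the dot product and cosine similarity are the same number. The sketch below checks this with two small made-up vectors; the helper names are illustrative only and nothing here calls OpenAI or Pinecone.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Two made-up 2-d vectors, normalized to unit length.
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

# For unit vectors both scores agree (up to floating-point error).
print(math.isclose(dot(u, v), cosine(u, v)))
```

For vectors that are not normalized the two metrics can rank results differently, so the equivalence only holds under the unit-length assumption.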

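Indexing

We could perform the indexing with a LangChain vector store object, but doing it directly through the Pinecone Python client is faster. We will embed and upsert the records in batches of 100.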

from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    # get end of batch
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    # build the metadata dict for each record in the batch
    metadatas = [{
        'title': record['title'],
        'text': record['context']
    } for _, record in batch.iterrows()]
    # get the list of contexts / documents as a plain Python list
    documents = batch['context'].tolist()
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # get IDs as a plain Python list
    ids = batch['id'].tolist()
    # add everything to pinecone as (id, values, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadatas)))
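
The batching arithmetic in the loop above can be checked in isolation. The sketch below reproduces only the start/end slicing and the (id, values, metadata) tuple shape, using made-up IDs and two-dimensional vectors instead of real embeddings; batch_ranges is a hypothetical helper, not part of the Pinecone client. The total of 18891 matches the record count left after deduplication.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


ranges = list(batch_ranges(18891, 100))
print(len(ranges), ranges[-1])  # 189 (18800, 18891)

# Shape of one upsert payload: a list of (id, values, metadata) tuples.
ids = ["id-0", "id-1"]                      # made-up IDs
embeds = [[0.1, 0.2], [0.3, 0.4]]           # made-up 2-d "embeddings"
metadatas = [{"title": "A", "text": "a"}, {"title": "B", "text": "b"}]
vectors = list(zip(ids, embeds, metadatas))
print(vectors[0])
```

The final batch holds the 91 leftover records, which is why the end index is clamped with min rather than always adding batch_size.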

With everything indexed, we can check the number of vectors in the index:

index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}

Creating a vector store and querying

Now that the index is built, we can return to LangChain. We initialize a vector store on top of the same index, like so:

# alias the import so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object; pass the Embeddings object itself,
# since passing the embed_query callable is deprecated
vectorstore = PineconeVectorStore(
    index, embed, text_field
)

With this vector store we can already run a semantic search using the similarity_search method (retrieval only, with no generation component).

query = "when was the college of engineering in the University of Notre Dame established?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
[Document(page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president.", metadata={'title': 'University_of_Notre_Dame'}),
 Document(page_content='The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively.', metadata={'title': 'University_of_Notre_Dame'}),
 Document(page_content='Since 2005, Notre Dame has been led by John I. Jenkins, C.S.C., the 17th president of the university. Jenkins took over the position from Malloy on July 1, 2005. In his inaugural address, Jenkins described his goals of making the university a leader in research that recognizes ethics and building the connection between faith and studies. During his tenure, Notre Dame has increased its endowment, enlarged its student body, and undergone many construction projects on campus, including Compton Family Ice Arena, a new architecture hall, additional residence halls, and the Campus Crossroads, a $400m enhancement and expansion of Notre Dame Stadium.', metadata={'title': 'University_of_Notre_Dame'})]

The second result contains the answer we are looking for. Let's see how to integrate this into a conversational agent.

Initializing the conversational agent

To initialize the conversational agent we need a chat LLM, conversational memory, and a RetrievalQA chain. We create them as follows:

from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

# chat completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

With these in place, we can already get an answer using the run method:

qa.run(query)
'The College of Engineering at the University of Notre Dame was established in 1920.'
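
The chain_type="stuff" setting used above means, roughly, that all retrieved documents are placed into a single prompt together with the question. The sketch below illustrates that idea only; build_stuff_prompt and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt (illustrative only)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up snippets standing in for retrieved documents.
docs = [
    "The College of Engineering was established in 1920.",
    "Notre Dame continued to grow over the years.",
]
prompt = build_stuff_prompt("When was the College of Engineering established?", docs)
print(prompt)
```

Because every document goes into one prompt, this approach only works while the retrieved text fits in the model's context window.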

But this is not yet ready for our conversational agent. For that, we need to turn the retrieval chain into a tool, which we do as follows:

from langchain.agents import Tool

tools = [
    Tool(
        name='Knowledge Base',
        func=qa.run,
        description=(
            'use this tool when answering general knowledge queries to get '
            'more information about the topic'
        )
    )
]

Now we can initialize the agent like so:

from langchain.agents import initialize_agent

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversational_memory
)

With that, our retrieval-augmented conversational agent is ready and we can start using it.
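
For intuition about what the agent does with its tools, here is a heavily simplified sketch of a tool-dispatch loop: a decision step emits a JSON action, the loop either calls the named tool or returns the final answer, and max_iterations caps the number of steps. Everything here (scripted_llm, knowledge_base, run_agent) is a hypothetical stand-in; it does not call an LLM and is not how LangChain implements its agent executor. In particular, when the limit is hit this sketch returns None, whereas early_stopping_method='generate' asks the model for one last answer.

```python
import json


def knowledge_base(query):
    """Hypothetical stand-in for qa.run: returns a canned answer."""
    return "The College of Engineering was established in 1920."


def scripted_llm(question, observations):
    """Hypothetical stand-in for the chat model's decision step."""
    if observations:
        # A tool result is available, so answer with it.
        return json.dumps({"action": "Final Answer", "action_input": observations[-1]})
    if "notre dame" in question.lower():
        # Looks like a knowledge query, so ask for the tool.
        return json.dumps({"action": "Knowledge Base", "action_input": question})
    return json.dumps({"action": "Final Answer", "action_input": "No lookup needed."})


def run_agent(question, llm, tools, max_iterations=3):
    """Dispatch JSON actions to tools until a final answer or the step limit."""
    observations = []
    for _ in range(max_iterations):
        step = json.loads(llm(question, observations))
        if step["action"] == "Final Answer":
            return step["action_input"], observations
        observations.append(tools[step["action"]](step["action_input"]))
    return None, observations


tools = {"Knowledge Base": knowledge_base}
print(run_agent("When was the Notre Dame College of Engineering established?", scripted_llm, tools))
print(run_agent("what is 2 * 7?", scripted_llm, tools))
```

The second call never touches the tool, which mirrors the benefit described at the start: retrieval happens only when the decision step asks for it.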

Using the conversational agent

To make a query, we simply call the agent directly:

agent(query)
> Entering new AgentExecutor chain...
{
    "action": "Knowledge Base",
    "action_input": "When was the College of Engineering in the University of Notre Dame established?"
}
Observation: The College of Engineering at the University of Notre Dame was established in 1920.
Thought:{
    "action": "Final Answer",
    "action_input": "The College of Engineering at the University of Notre Dame was established in 1920."
}

> Finished chain.
{'input': 'when was the college of engineering in the University of Notre Dame established?',
 'chat_history': [],
 'output': 'The College of Engineering at the University of Notre Dame was established in 1920.'}

Looks great. What happens if we ask a question that is not a general knowledge question?

agent("what is 2 * 7?")
> Entering new AgentExecutor chain...
{
    "action": "Final Answer",
    "action_input": "The product of 2 multiplied by 7 is 14."
}

> Finished chain.
{'input': 'what is 2 * 7?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.')],
 'output': 'The product of 2 multiplied by 7 is 14.'}

Perfect, the agent recognized that it does not need its Knowledge Base tool for this question. Let's try a few more questions.

agent("can you tell me some facts about the University of Notre Dame?")
> Entering new AgentExecutor chain...
{
    "action": "Knowledge Base",
    "action_input": "University of Notre Dame"
}
Observation: The University of Notre Dame is a Catholic research university located in South Bend, Indiana, in the United States. It is known for its strong academic programs, including undergraduate colleges in Arts and Letters, Science, Engineering, Business, and the Architecture School. The university also has a graduate program with over 50 master's, doctoral, and professional degree programs. Notre Dame is recognized as one of the top universities in the United States and has a strong alumni network. It is also known for its iconic landmarks, such as the Golden Dome and the Basilica. The university is committed to research and has various institutes dedicated to different fields of study. Notre Dame is also home to the Notre Dame Global Adaptation Index, which ranks countries based on their vulnerability to climate change.
Thought:{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It offers strong academic programs in various fields, including Arts and Letters, Science, Engineering, Business, and Architecture. Notre Dame is known for its academic excellence, iconic landmarks like the Golden Dome and the Basilica, and its commitment to research. It is also home to the Notre Dame Global Adaptation Index, which ranks countries based on their vulnerability to climate change."
}

> Finished chain.
{'input': 'can you tell me some facts about the University of Notre Dame?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='what is 2 * 7?'),
  AIMessage(content='The product of 2 multiplied by 7 is 14.')],
 'output': 'The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It offers strong academic programs in various fields, including Arts and Letters, Science, Engineering, Business, and Architecture. Notre Dame is known for its academic excellence, iconic landmarks like the Golden Dome and the Basilica, and its commitment to research. It is also home to the Notre Dame Global Adaptation Index, which ranks countries based on their vulnerability to climate change.'}

agent("can you summarize these facts in two short sentences")
> Entering new AgentExecutor chain...
{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It offers strong academic programs and is known for its iconic landmarks and commitment to research."
}

> Finished chain.
{'input': 'can you summarize these facts in two short sentences',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='what is 2 * 7?'),
  AIMessage(content='The product of 2 multiplied by 7 is 14.'),
  HumanMessage(content='can you tell me some facts about the University of Notre Dame?'),
  AIMessage(content='The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It offers strong academic programs in various fields, including Arts and Letters, Science, Engineering, Business, and Architecture. Notre Dame is known for its academic excellence, iconic landmarks like the Golden Dome and the Basilica, and its commitment to research. It is also home to the Notre Dame Global Adaptation Index, which ranks countries based on their vulnerability to climate change.')],
 'output': 'The University of Notre Dame is a Catholic research university located in South Bend, Indiana. It offers strong academic programs and is known for its iconic landmarks and commitment to research.'}

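Great! We can also ask questions that refer to earlier turns in the conversation, and the agent can use the conversation history as a source of information.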
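
The chat_history in the outputs above grows with each turn, but the memory was configured with k=5, so only the most recent five exchanges are kept. A minimal sketch of that windowing behavior, using a hypothetical WindowMemory class rather than LangChain's ConversationBufferWindowMemory:

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (human, ai) exchanges; illustrative only."""

    def __init__(self, k):
        # A deque with maxlen drops the oldest exchange automatically.
        self.turns = deque(maxlen=k)

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        """Flatten the stored exchanges into an ordered list of messages."""
        messages = []
        for human, ai in self.turns:
            messages.append(("human", human))
            messages.append(("ai", ai))
        return messages


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")  # pushes the first exchange out of the window
print(memory.history())
```

With a window of two, the first exchange is no longer visible to the model after the third turn, so a follow-up that depends on it could not be answered from history alone.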

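That is it for this example of building a retrieval-augmented conversational agent with OpenAI, Pinecone, and LangChain. Once finished, we delete the Pinecone index to save resources: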

pc.delete_index(index_name)

https://colab.research.google/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb
