LLM RAG in Practice (32) | Evaluating RAG with RAGAs and LlamaIndex

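Previous articles in this series covered the basic RAG pipeline and several ways to optimize it (query rewriting, semantic chunking, re-ranking, and so on). But if an existing RAG system does not seem to work well, how do we measure how effective it actually is?

This article introduces the RAG evaluation framework RAGAs [1] and walks through a complete RAG evaluation using RAGAs together with LlamaIndex.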

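1. RAG Evaluation Metrics

Put simply, a RAG run involves three main parts: the input query, the retrieved context, and the response generated by the LLM. These three elements form the central triad of the RAG process, and they depend on one another.

The effectiveness of RAG can therefore be evaluated by measuring the relevance between the members of this triad, as shown in Figure 1:

The paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation" [1] proposes three RAG evaluation metrics: (1) Faithfulness, (2) Answer Relevance, and (3) Context Relevance. None of them requires a human-annotated dataset or reference answers.

In addition, the RAGAs website [2] introduces two more metrics: Context Precision and Context Recall.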

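1.1 Faithfulness

Faithfulness measures whether the answer is grounded in the given context. This matters both for avoiding hallucinations and for making sure the retrieved context can actually justify the generated answer. A low score means the LLM's response does not follow from the retrieved knowledge, so the answer is more likely to be hallucinated.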

To evaluate faithfulness, we first use an LLM to extract a set of statements S(a(q)) from the answer, using the following prompt:

Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]

Once S(a(q)) has been generated, the LLM determines whether each statement si can be inferred from the context c(q). This verification step uses the following prompt:

Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]

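The final faithfulness score is F = |V| / |S|, where |V| is the number of statements the LLM judged to be supported and |S| is the total number of statements.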
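
As a minimal sketch of the ratio F = |V| / |S|, the snippet below computes the score from a list of per-statement verdicts. The function name `faithfulness_score` and the sample verdicts are hypothetical and only serve as illustration; in a real run the verdicts would be parsed from the LLM's Yes/No output.

```python
def faithfulness_score(verdicts):
    """Return F = |V| / |S| for a list of per-statement verdicts.

    `verdicts` holds one boolean per extracted statement: True when the
    judging LLM said the statement is supported by the context.
    """
    if not verdicts:
        raise ValueError("at least one statement is required")
    supported = sum(1 for v in verdicts if v)  # |V|
    return supported / len(verdicts)           # |V| / |S|


# Hypothetical verdicts for an answer that was split into four statements.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(score)  # 0.75
```

Three of the four statements are supported, so the score is 0.75; a single unsupported statement is enough to pull the score below 1.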

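1.2 Answer Relevance

Answer relevance measures how relevant the generated answer is to the query. A higher score means better relevance.

To estimate answer relevance, we prompt the LLM to generate n potential questions qi from the given answer a(q), as follows:

Generate a question for the given answer.
answer: [answer]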

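We then use a text embedding model to embed all of the questions. For each qi, we compute its similarity sim(q, qi) to the original question q, for example as the cosine similarity between the two embeddings. The answer relevance score AR for question q is the average of these similarities:

$$AR = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}(q, q_i)$$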
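
The averaging step can be sketched in a few lines of plain Python, assuming AR is the mean of the cosine similarities between q and the generated questions, as in the RAGAS paper. The two-dimensional "embeddings" below are hand-picked toy vectors, not the output of any real embedding model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(question_emb, generated_embs):
    """Mean similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_emb, e) for e in generated_embs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches q exactly, one is orthogonal.
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
ar = answer_relevance(q, generated)
print(ar)  # 0.5
```

An answer that leads the LLM back to questions close to the original one yields a score near 1, while an off-topic or evasive answer produces unrelated questions and a lower average.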

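1.3 Context Relevance

Context relevance is a measure of retrieval quality: it evaluates how well the retrieved context supports the query. A low score means a large amount of irrelevant content was retrieved, which may degrade the final answer generated by the LLM.

To estimate context relevance, an LLM extracts from the context c(q) a set of key sentences Sext that are essential for answering the question. The prompt is: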

Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.

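RAGAs then computes sentence-level relevance as the fraction of sentences in the context that were extracted:

$$CR = \frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}$$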
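
The sentence-level ratio can be sketched as follows, assuming the score is the number of extracted sentences divided by the number of sentences in the context, as described in the RAGAS paper. The helper names, the one-sentence-per-line output format, and the filler sentences in the sample context are all hypothetical.

```python
def parse_extraction(llm_output):
    """Turn the extraction output into a list of sentences.

    Assumes the LLM returns one sentence per line, or the fallback phrase
    from the prompt when nothing relevant was found.
    """
    text = llm_output.strip()
    if text == "Insufficient Information":
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]


def context_relevance(extracted, context_sentences):
    """Fraction of context sentences that were extracted as relevant."""
    if not context_sentences:
        raise ValueError("context must contain at least one sentence")
    return len(extracted) / len(context_sentences)


# Sample context: the last two sentences are irrelevant filler.
context_sentences = [
    "TinyLlama is a compact 1.1B language model.",
    "It was pretrained on around 1 trillion tokens.",
    "The weather was sunny that day.",
    "Lunch was served at noon.",
]
llm_output = "TinyLlama is a compact 1.1B language model."
score = context_relevance(parse_extraction(llm_output), context_sentences)
print(score)  # 0.25
```

When the LLM answers "Insufficient Information", the extracted list is empty and the score drops to 0, which is how a completely off-target retrieval shows up in this metric.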

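1.4 Context Recall

This metric measures how well the retrieved context agrees with the annotated answer. It is computed from the ground truth and the retrieved context, and higher values indicate better performance.

This metric requires annotated data.

It is computed as:

$$\text{context recall} = \frac{|\text{ground-truth sentences that can be attributed to the context}|}{|\text{sentences in the ground truth}|}$$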

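1.5 Context Precision

This metric is somewhat more involved. It measures whether all of the retrieved contexts that are relevant to the ground truth are ranked near the top. A higher score means better precision.

It is computed as:

$$\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total number of relevant items in the top } K}, \qquad \text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$$

Here K is the number of retrieved contexts and v_k ∈ {0, 1} indicates whether the context at rank k is relevant.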

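The advantage of context precision is that it is sensitive to ranking. Its drawback is that if only a few relevant contexts are retrieved but all of them are ranked highly, the score will still be high. It therefore needs to be read together with the other metrics to judge overall quality.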
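
Both the rank sensitivity and the drawback can be seen in a small sketch. It assumes the Context Precision@K definition from the RAGAs documentation: the sum of Precision@k over the ranks k that hold a relevant context, divided by the number of relevant contexts in the top K. The function name and the relevance flags are hypothetical.

```python
def context_precision_at_k(relevance):
    """Context Precision@K for a ranked list of 0/1 relevance flags.

    `relevance[k-1]` is 1 when the context retrieved at rank k is relevant.
    """
    hits = 0          # relevant contexts seen so far (true positives@k)
    weighted = 0.0    # running sum of Precision@k * v_k
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # Precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


# The same single relevant context, ranked first versus ranked last.
print(context_precision_at_k([1, 0, 0, 0]))  # 1.0
print(context_precision_at_k([0, 0, 0, 1]))  # 0.25
```

The first call returns 1.0 even though only one relevant context was retrieved at all, which is exactly the case where context recall is needed to complete the picture.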

2. RAG Evaluation with RAGAs + LlamaIndex

The main workflow is shown in Figure 6:

2.1 Environment Setup

Install ragas with pip and check the installed version.

(py) Florian:~ Florian$ pip list | grep ragas
ragas                        0.0.22

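If you install the latest version (v0.1.0rc1) with pip install git+https://github.com/explodinggradients/ragas.git, be aware that this version does not support LlamaIndex.

Next, import the required libraries and set the environment and global variables: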

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

The directory contains the PDF of the paper "TinyLlama: An Open-Source Small Language Model" [3].

(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf

2.2 Building a Simple RAG Query Engine with LlamaIndex

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

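LlamaIndex uses OpenAI models by default; the LLM and the embedding model can both be configured easily through ServiceContext.

Building the evaluation dataset

Because some metrics require a manually annotated dataset, here are some sample questions with their corresponding answers: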

eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
# ragas expects a list of ground-truth answers per question
eval_answers = [[a] for a in eval_answers]

Selecting metrics and running the RAGAs evaluation

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')

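Note that RAGAs uses OpenAI models by default.

As for using a different LLM (such as Gemini) with RAGAs when evaluating through LlamaIndex: even after stepping through the RAGAs source code, I did not find a workable way to do this in version 0.0.22.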

2.3 Final Code

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

# Build a simple RAG query engine over the documents in dir_path
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
# ragas expects a list of ground-truth answers per question
eval_answers = [[a] for a in eval_answers]

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

# Run the evaluation and export the per-question scores
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')

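Note that when the program is run in a terminal, the pandas DataFrame may not be displayed in full. To inspect it, export it to a CSV file, as shown in Figure 7: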

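Questions are numbered from 0 here, matching the row index in the CSV. Figure 7 shows that every score for question 4, "Tell me about starcoderdata dataset", is 0, because the LLM was unable to provide an answer. Questions 2 and 3 have a context precision of 0, which means the relevant contexts among those retrieved were not ranked at the top. Question 2 has a context recall of 0, which means the retrieved context does not match the annotated answer.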
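
Reading the table by eye gets tedious as the number of questions grows, so the exported CSV can also be summarized with a short script. The sketch below uses only the standard library. It assumes the CSV has one column per metric, named after the metric objects used above, and that every cell holds a number; the sample rows are invented placeholder values, not the results from Figure 7.

```python
import csv
import io

# Assumed column names, taken from the metric names used in this article.
METRICS = [
    "faithfulness",
    "answer_relevancy",
    "context_relevancy",
    "context_precision",
    "context_recall",
]


def summarize(csv_file):
    """Return per-metric means and the indices of rows where every metric is 0."""
    rows = list(csv.DictReader(csv_file))
    means = {m: sum(float(r[m]) for r in rows) / len(rows) for m in METRICS}
    all_zero = [
        i for i, r in enumerate(rows)
        if all(float(r[m]) == 0.0 for m in METRICS)
    ]
    return means, all_zero


# Placeholder data standing in for the exported file (values are made up).
sample = io.StringIO(
    "question,faithfulness,answer_relevancy,context_relevancy,"
    "context_precision,context_recall\n"
    "q0,1.0,0.9,0.1,1.0,1.0\n"
    "q1,0.0,0.0,0.0,0.0,0.0\n"
)
means, all_zero = summarize(sample)
print(all_zero)  # [1]
```

For a real run, `sample` would be replaced by `open('YOUR_CSV_PATH', newline='')`. Rows reported in `all_zero` correspond to questions the system could not answer at all, like the starcoderdata question above.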

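Now let us look at questions 0 to 3. Their answer relevance scores are high, indicating a strong correlation between the answers and the questions. Their faithfulness scores are also not low, which suggests the answers were mostly derived or summarized from the context rather than produced by LLM hallucination.

We also found that, even though the context relevance scores are low, gpt-3.5-turbo-16k (the default model in RAGAs) was still able to infer the answers from the context.

These results make it clear that this basic RAG system still has considerable room for improvement.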

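3. Conclusion

Overall, RAGAs offers a comprehensive set of metrics for evaluating RAG and is convenient to call.

Having debugged the internal source code of RAGAs, I found that it is still at an early stage of development. We are optimistic about its future updates and improvements.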

References:

[1] https://arxiv.org/pdf/2309.15217.pdf

[2] https://docs.ragas.io/en/latest/concepts/metrics/index.html

[3] https://arxiv.org/pdf/2401.02385.pdf

