
LLMs:《Orca: Progressive Learning from Complex Explanation Traces of GPT-4》翻译与解读

导读:2023年6月5日,微软发布了Orca论文。微软与OpenAI合作,致力于将AI功能整合到产品和服务中,并专注于构建更小、面向特定用例的模型。Orca通过模仿大型语言模型的推理过程来学习,展现了在保持较小规模的前提下逼近大型基础模型性能的巧妙方法,标志着微软在通过特定模型优化来应用AI方面的持续探索。
Orca是一个尺寸更小但能模仿大型基础模型的新型AI模型,拥有130亿(13B)参数。它的目标是通过模仿GPT-4这类大型基础模型的推理过程来克服较小模型的限制:它可以从GPT-4等大模型中学习解释、分步思维过程和复杂指令。
Orca的竞争力包括:可以利用大规模、多样化的模仿数据进行渐进式学习;在BBH等复杂零样本推理基准上比Vicuna-13B等最先进的指令调优模型高出100%以上,在AGIEval评测上高出42%。借助人类或更高级语言模型提供的分步解释,Orca有望获得更强的推理技能。Orca体现了微软将更小、面向特定用例的模型带入产品和服务的趋势,可降低计算资源消耗。

目录

《Orca: Progressive Learning from Complex Explanation Traces of GPT-4》翻译与解读

Abstract

1、Introduction引言

1.1、Challenges with Existing Methods现有方法的挑战

1.2、Key Contributions主要贡献

7、Limitations限制

8、Conclusions结论


《Orca: Progressive Learning from Complex Explanation Traces of GPT-4》翻译与解读

地址

论文:https://arxiv.org/abs/2306.02707

时间

2023年6月5日

作者

微软

Abstract

Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model’s capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction to improve model capabilities and skills.

最近的研究集中在通过模仿学习增强较小模型的能力,利用大型基础模型(LFM)生成的输出。这些模型的质量受到一系列问题的影响,包括来自浅层 LFM 输出的有限模仿信号、规模较小且同质化的训练数据,尤其是缺乏严格的评估导致高估了小模型的能力,因为它们往往只学会模仿 LFM 的风格,而非其推理过程。为了解决这些挑战,我们开发了 Orca,一个拥有 130 亿参数的模型,它学习模仿 LFM 的推理过程。Orca 从 GPT-4 获得丰富的信号,包括解释轨迹、逐步思考过程和其他复杂指令,并由 ChatGPT 提供教师辅助。为了促进这种渐进式学习,我们通过审慎的采样和选择来利用大规模且多样化的模仿数据。在 Big-Bench Hard(BBH)等复杂零样本推理基准上,Orca 比 Vicuna-13B 等常规最先进的指令调优模型高出 100% 以上,在 AGIEval 上高出 42%。此外,Orca 在 BBH 基准上与 ChatGPT 达到同等水平,并在 SAT、LSAT、GRE 和 GMAT 等专业和学术考试中展现出有竞争力的性能(在优化系统消息下相差 4 分),以上均为不使用 CoT 的零样本设置;但仍落后于 GPT-4。我们的研究表明,从逐步解释中学习(无论这些解释是由人类还是更先进的 AI 模型生成)都是提升模型能力和技能的一个有希望的方向。

1、Introduction引言

Large Foundation Models (LFMs) such as ChatGPT and GPT-4 [2] exhibit remarkable zero-shot performances across a broad spectrum of tasks. Alongside academic benchmarks like Human Eval [3] and Big Bench [4], GPT-4 has also demonstrated human-level performance on various professional exams, including the bar exam, SAT, GRE, and USMLE. These advancements can be credited to the scaling of both model and dataset sizes, as well as the incorporation of a second layer of training to better align the models with user intent. This alignment is accomplished by fine-tuning the models via supervised learning on demonstrations of prompts and desired model behavior, and through reinforcement learning from human preferences [5].

As these models continue to evolve and become more powerful, an intriguing question arises: Can we use the model itself to supervise its own behavior or that of other AI models?  Bai et al. [6] have shown that by sampling output from an initial model, generating revisions, and then fine-tuning the original model based on these revised responses, model behavior can be controlled more effectively and can be made more harmless, with significantly fewer human labels.

Recently, there has been an influx of studies using LFMs like ChatGPT and GPT-4 as teachers to generate large datasets, for instruction  tuning,  and  to  train  smaller  models, such as Alpaca [7], WizardLM [8] and Vicuna [9]. While these models can produce content that matches the style of their teachers, they often fall short in terms of the reasoning and comprehension skills displayed by the larger foundation models.

Take, for example, the 13-billion parameter instruction-tuned model, Vicuna [9] (with LLAMA-13B [10] as the base), which is widely regarded as one of the best models in its family, as evidenced by its performance on leaderboards like OpenLLM and ChatArena.

ChatGPT 和 GPT-4 这样的大型基础模型(LFM)在各种任务上展示出了非凡的零样本性能。除了 Human Eval 和 Big Bench 等学术基准测试之外,GPT-4 还在包括律师资格考试、SAT、GRE 和 USMLE 在内的各种专业考试上展现出人类水平的表现。这些进展归功于模型和数据集规模的扩大,以及通过第二层训练使模型与用户意图更好地保持一致。这种对齐通过两种方式实现:在提示与期望模型行为的演示上进行监督微调,以及基于人类偏好进行强化学习 [5]。

随着这些模型的不断发展和增强,一个有趣的问题出现了:我们能否利用模型本身来监督自己或其他 AI 模型的行为?Bai 等人 [6] 已经证明,通过对初始模型的输出进行采样、生成修订版本,然后基于这些修订后的响应对原始模型进行微调,可以更有效地控制模型行为,使其更加无害,同时所需的人工标注显著减少。
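To make the sample-revise-finetune recipe above concrete, here is a minimal Python sketch of that loop. It is only an illustration of the idea attributed to Bai et al. [6]; `generate`, `critique_and_revise`, and `finetune` are hypothetical stand-ins for real model calls, not an actual API.

```python
from typing import Callable, List, Tuple

def self_revise_loop(
    prompts: List[str],
    generate: Callable[[str], str],                   # initial model draws a response
    critique_and_revise: Callable[[str, str], str],   # model revises its own output
    finetune: Callable[[List[Tuple[str, str]]], None],
) -> None:
    revised_pairs = []
    for prompt in prompts:
        draft = generate(prompt)                       # 1. sample from the initial model
        revision = critique_and_revise(prompt, draft)  # 2. generate a revision
        revised_pairs.append((prompt, revision))
    finetune(revised_pairs)                            # 3. fine-tune on the revised responses
```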

最近,出现了一系列研究,利用 ChatGPT 和 GPT-4 等 LFM 作为教师生成大型数据集,用于指令调优并训练 Alpaca [7]、WizardLM [8] 和 Vicuna [9] 等较小的模型。虽然这些模型可以生成与其教师风格相匹配的内容,但与更大的基础模型相比,它们在推理和理解能力方面往往有所欠缺。

以 130 亿参数的指令调优模型 Vicuna [9](以 LLAMA-13B [10] 为基础)为例,它被广泛认为是同类模型中最好的模型之一,其在 OpenLLM 和 ChatArena 等排行榜上的表现证明了这一点。

As illustrated in Figure 1, the widely-used evaluation method of using GPT-4 as the judge suggests that Vicuna retains 92% of ChatGPT’s quality. However, a more meticulous evaluation on reasoning benchmarks against human labels finds Vicuna to retain only 64% of ChatGPT’s quality on professional and academic exams (see Figure 2), and only 48% of ChatGPT’s quality on complex benchmarks like BigBench-hard [11] (see Figure 3). This discrepancy not only underscores the limitations of existing evaluation protocols with smaller LLMs, but it also reveals their significant lag in reasoning and comprehension capabilities. In essence, these models may be articulate, but they may not necessarily possess robust reasoning skills. In this study, we discuss some of the reasons behind these gaps and propose strategies for addressing them.

正如图 1 所示,广泛使用的以 GPT-4 为评判者的评估方法表明,Vicuna 保留了 ChatGPT 92% 的质量。然而,基于人工标注在推理基准上进行更仔细的评估后发现,Vicuna 在专业和学术考试上仅保留了 ChatGPT 质量的 64%(见图 2),在 BigBench-hard [11] 这样的复杂基准上仅保留了 48%(见图 3)。这种差异不仅凸显了现有评估协议在较小 LLM 上的局限性,也揭示了它们在推理和理解能力方面的显著滞后。实质上,这些模型可能表达流畅,但未必具备强大的推理能力。在本研究中,我们讨论了这些差距背后的一些原因,并提出了应对策略。
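As a small aid to reading the percentages above, the "quality retained" figures are simple ratios of aggregate benchmark scores. The sketch below uses made-up placeholder numbers purely to illustrate the arithmetic; the actual scores are those behind Figures 2 and 3 of the paper.

```python
# Illustrative arithmetic only: the scores below are placeholders,
# not values reported in the paper.
def retention(small_model_score: float, chatgpt_score: float) -> float:
    """Fraction of ChatGPT's aggregate score retained by the smaller model."""
    return small_model_score / chatgpt_score

print(f"{retention(32.0, 50.0):.0%}")  # hypothetical scores -> '64%'
```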

1.1、Challenges with Existing Methods现有方法的挑战

Current research on instruction-tuning to mimic the output of LFMs like ChatGPT exhibits notable limitations in task diversity, query complexity, and data scaling. These observations are corroborated in a recent study by Gudibande et al. [12], where the authors assert that “model imitation is a false promise” since “broadly matching ChatGPT using purely imitation would require (1) a concerted effort to collect enormous imitation datasets and (2) far more diverse and higher quality imitation data than is currently available.” Contrary to this assertion, we demonstrate that both conditions (1) and (2) are attainable and that it is possible to reduce the gap with proprietary LLMs on multiple zero-shot benchmarks that require sophisticated reasoning. We elaborate on these challenges below:

目前通过指令调优来模仿 ChatGPT 等 LFM 输出的研究,在任务多样性、查询复杂性和数据规模方面存在明显的局限性。Gudibande 等人 [12] 在最近的一项研究中印证了这些观察,作者断言“模型模仿是一个虚假的承诺”,因为“要通过纯粹的模仿来广泛匹配 ChatGPT,需要(1)集中精力收集海量的模仿数据集,以及(2)比当前可用数据更加多样化、更高质量的模仿数据。”与这一论断相反,我们证明条件(1)和(2)都是可以实现的,并且可以在多个需要复杂推理的零样本基准测试上缩小与专有 LLM 之间的差距。我们在下面详细阐述这些挑战。

Simple instructions with limited diversity. The Self-Instruct [13] process involves using an initial set of prompts to incite the LFM to produce new instructions. Any low-quality or overly similar responses are then removed, and the remaining instructions are reintegrated into the task pool for further iterations. Nonetheless, the resulting queries generated through Self-Instruct, such as “what are the three primary colors?", “what is the capital of France?", etc., can exhibit limitations in diversity and complexity. Both Alpaca [7] and WizardLM [8] employ a variant of self-instruct. WizardLM introduces the concept of Evol-Instruct, which gradually rewrites the initial set of instructions into more complex versions, attempting to overcome some of the method’s inherent shortcomings. On the other hand, recent works like Vicuna [9] and Koala [14] demonstrate remarkable performance due to more human-like conversations and natural instructions in community-contributed conversations like those in ShareGPT that provided a forum for users to share their conversations with ChatGPT.
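The following is a compact sketch of a Self-Instruct-style loop as summarized above: seed prompts elicit new instructions from the LFM, low-quality or near-duplicate generations are filtered out, and the survivors are folded back into the task pool. `ask_lfm` and `too_similar` are hypothetical helpers standing in for the LFM call and the similarity filter.

```python
import random
from typing import Callable, List

def self_instruct(
    seed_tasks: List[str],
    ask_lfm: Callable[[List[str]], List[str]],      # LFM proposes new instructions from examples
    too_similar: Callable[[str, List[str]], bool],  # similarity filter against the pool
    rounds: int = 3,
    prompts_per_round: int = 8,
) -> List[str]:
    task_pool = list(seed_tasks)
    for _ in range(rounds):
        examples = random.sample(task_pool, min(prompts_per_round, len(task_pool)))
        for candidate in ask_lfm(examples):
            # drop empty or overly similar generations, keep the rest in the pool
            if candidate.strip() and not too_similar(candidate, task_pool):
                task_pool.append(candidate)
    return task_pool
```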

Task diversity and data scaling. Human-contributed conversations in ShareGPT are a valuable source of data, but they also have some limitations. They tend to favor creative content generation and information-seeking queries over other types of tasks. Therefore, models trained on such natural conversations may capture the style but not the reasoning process of the LFMs – demonstrated in the performance of Vicuna in Figures 2 and 3. Additionally, such mode of data collection is also limited in scale. Table 1 shows an overview of the size of data and tuning methods employed in recent popular instruction tuning works.

Limited imitation signals. Existing methods rely on imitation learning from 〈query, response〉 pairs generated by the teacher model. However, this provides limited signals to trace the reasoning process of the teacher. Prior works [15, 16] on open-box models show that richer signals such as logits, intermediate representations and attention states can significantly improve distillation performance. While they are not accessible for closed-box LFMs, recent work [17] demonstrates that richer signals like LFM rationales can help close the gap for task-specific distillation.

Evaluation: Previous studies on instruction tuning of small models with LFMs are severely limited in their evaluation protocol. They often rely on GPT-4 for auto-evaluation by asking it to compare the outputs of two systems with a prompt like “given responses from system 1 (reference) and system 2 (target), which one is better?”. However, this approach has several drawbacks, such as the small size of test sets (e.g., 80 instructions in Vicuna and 218 instructions in WizardLM) and the biases of GPT-4 as the judge [18]. For example, we notice that models that are instruction-tuned with GPT-4 responses tend to generate longer texts, which GPT-4 prefers over shorter ones; GPT-4 also exhibits a bias with respect to the order of the candidate responses. We will show that such auto-evaluation measures overestimate the abilities of smaller models compared to LFMs, as the former are much weaker in comprehension and reasoning skills.
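The snippet below sketches the pairwise auto-evaluation pattern described above, with a simple order swap that exposes the position bias mentioned in the text: if the judge's preference flips when the candidates are swapped, the comparison is not trustworthy. `judge` is a hypothetical wrapper around a GPT-4 call, not a real client.

```python
from typing import Callable

PROMPT = ("Given responses from system 1 (reference) and system 2 (target), "
          "which one is better? Answer '1' or '2'.\n\n"
          "System 1: {a}\n\nSystem 2: {b}")

def judged_better(judge: Callable[[str], str], reference: str, target: str) -> bool:
    """Target 'wins' only if it is preferred in both presentation orders."""
    first = judge(PROMPT.format(a=reference, b=target)).strip() == "2"
    second = judge(PROMPT.format(a=target, b=reference)).strip() == "1"
    return first and second  # disagreement across orders signals position bias
```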

1.2、Key Contributions主要贡献

In this research, our focus is on addressing the challenges mentioned above, specifically with:

Explanation tuning: We augment 〈query, response〉 pairs with detailed responses from GPT-4 that explain the reasoning process of the teacher as it generates the response. These provide the student with additional signals for learning. We leverage system instructions (e.g., explain like I’m five, think step-by-step and justify your response, etc.) to elicit such explanations. This is in contrast to vanilla instruction tuning, which only uses the prompt and the LFM response for learning, providing little opportunity for mimicking the LFM’s “thought” process.

在这项研究中,我们的重点是解决上述挑战,具体包括:
解释调优(Explanation tuning):我们用来自 GPT-4 的详细响应来扩充〈查询,响应〉对,这些响应解释了教师在生成回答时的推理过程,为学生模型提供了额外的学习信号。我们利用系统指令(例如“像对五岁小孩那样解释”、“一步一步思考并论证你的回答”等)来引出这些解释。这与传统的指令调优不同,后者只使用提示和 LFM 的响应进行学习,几乎没有机会模仿 LFM 的“思考”过程。
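Below is a minimal sketch of how an explanation-tuning sample could be assembled, assuming a hypothetical `call_teacher` function that wraps the ChatGPT/GPT-4 API: the 〈query, response〉 pair is augmented with a system instruction that asks the teacher to reveal its reasoning. The system messages shown are examples taken from the text; the full set used for Orca is larger.

```python
from typing import Callable, Dict

SYSTEM_INSTRUCTIONS = [
    "You are a helpful assistant. Think step-by-step and justify your response.",
    "Explain like I'm five.",
]

def build_explanation_sample(
    query: str,
    system_instruction: str,
    call_teacher: Callable[[str, str], str],  # hypothetical (system, query) -> response wrapper
) -> Dict[str, str]:
    response = call_teacher(system_instruction, query)  # response now carries a reasoning trace
    return {
        "system": system_instruction,  # kept so the student sees what kind of trace to imitate
        "query": query,
        "response": response,
    }

# e.g. build_explanation_sample("Why is the sky blue?", SYSTEM_INSTRUCTIONS[0], call_teacher)
```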

Scaling tasks and instructions: We utilize the Flan 2022 Collection [19] as it provides an extensive public assortment of tasks and instructions. Particularly, we use FLAN-v2, supplemented with high-quality templates, advanced formatting patterns, and data augmentations. Even though FLAN holds tens of millions of instructions, we selectively sample from the task collection to form a diverse mixture of tasks, which we then further sub-sample to generate complex prompts. These prompts are used to query LFMs like ChatGPT and GPT-4, thus creating a rich and diverse training set. We collect 5 million ChatGPT responses, from which 1 million is further sampled to acquire GPT-4 responses. We demonstrate how ChatGPT as a teacher assistant helps in progressive learning.

Evaluation: We assess the generative, reasoning, and comprehension abilities of Orca, under a range of settings: (i) AutoEvaluation with GPT-4 on existing evaluation sets from Vicuna, WizardLM and the awesome prompts collection; (ii) Academic benchmarks like Big-Bench Hard [4] and TruthfulQA [20]; (iii) Professional and Academic exams like SAT, LSAT, GRE, GMAT from AGIEval [1]; (iv) Safety evaluation with ToxiGen [21] to test toxic language generation and hate speech detection across different minority groups. Finally, we provide case-studies to compare the generation and reasoning abilities of Orca against OpenAI LFMs like ChatGPT and GPT-4, and instruction-tuned smaller models like Vicuna.

任务和指令的扩展:我们利用 Flan 2022 数据集 [19],它提供了大量的任务和指令。特别是,我们使用 FLAN-v2,补充了高质量的模板、高级格式化模式和数据增强。虽然 FLAN 包含数千万条指令,但我们有选择地从任务集合中进行采样,形成一个多样化的任务组合,然后进一步对其进行子采样以生成复杂的提示。这些提示用于查询像 ChatGPT 和 GPT-4 这样的 LFMs,从而创建一个丰富多样的训练集。我们收集了 500 万条 ChatGPT 的响应,其中 100 万条被进一步采样以获取 GPT-4 的响应。我们展示了 ChatGPT 作为教师助手在渐进学习中的帮助。
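The sketch below illustrates the two-stage collection schedule described above: sample diverse FLAN-v2 prompts, gather ChatGPT (teacher-assistant) responses for all of them, then sub-sample a smaller slice for GPT-4. The function names and the 20% fraction are illustrative placeholders for the paper's 5M/1M split, not the actual pipeline.

```python
import random
from typing import Callable, Dict, List

def collect_imitation_data(
    flan_prompts: List[str],
    ask_chatgpt: Callable[[str], str],  # hypothetical ChatGPT call
    ask_gpt4: Callable[[str], str],     # hypothetical GPT-4 call
    gpt4_fraction: float = 0.2,         # roughly 1M out of 5M in the paper
) -> List[Dict[str, str]]:
    # stage 1: teacher-assistant (ChatGPT) responses for every sampled prompt
    dataset = [{"prompt": p, "teacher": "chatgpt", "response": ask_chatgpt(p)}
               for p in flan_prompts]
    # stage 2: a smaller sub-sample is answered by the stronger teacher (GPT-4)
    gpt4_slice = random.sample(flan_prompts, int(len(flan_prompts) * gpt4_fraction))
    dataset += [{"prompt": p, "teacher": "gpt4", "response": ask_gpt4(p)}
                for p in gpt4_slice]
    return dataset
```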

评估:我们在多个设置下评估 Orca 的生成、推理和理解能力:(i)使用 GPT-4,在来自 Vicuna、WizardLM 和 awesome prompts 集合的现有评估集上进行自动评估;(ii)Big-Bench Hard 和 TruthfulQA 等学术基准;(iii)来自 AGIEval 的 SAT、LSAT、GRE、GMAT 等专业和学术考试;(iv)使用 ToxiGen 进行安全评估,测试针对不同少数群体的有害语言生成和仇恨言论检测。最后,我们提供案例研究,比较 Orca 与 OpenAI 的 LFM(ChatGPT 和 GPT-4)以及指令调优的较小模型 Vicuna 在生成和推理能力上的差异。

7、Limitations限制

Orca, built upon the LLaMA model family, retains many of its constraints, as well as the common limitations of other large language models, including:

Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.

Lack of Contextual Understanding: Despite their impressive capabilities in language un- derstanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.

Lack of Transparency: Due to their complexity and size, large language models can act as ‘black boxes,’ making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information.

Orca 基于 LLaMA 模型系列构建,保留了后者的许多限制,以及其他大型语言模型的常见局限,包括:

数据偏见:基于大量数据训练的大型语言模型可能会无意中携带源数据中存在的偏见。因此,模型可能会生成潜在偏见或不公平的输出。

缺乏上下文理解:尽管这些模型在语言理解和生成方面具有令人印象深刻的能力,但它们在现实世界理解方面存在限制,可能导致不准确或荒谬的回应。

缺乏透明性:由于复杂性和规模,大型语言模型可能会像“黑匣子”一样,难以理解特定输出或决策的理由。建议阅读 Azure 的透明性说明以获取更多信息。

Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction.

Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact, as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether smaller models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller size and hence reduced memorization capacity. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic.

Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content.

Additionally, Orca’s performance is influenced by the data used for explanation tuning:

内容伤害:大型语言模型可能会引发各种类型的内容伤害。在使用这些模型时,重要的是要意识到这些伤害,并采取措施防止它们发生。建议利用不同公司和机构提供的各种内容审核服务。重要的是,我们希望未来政府和技术领导者能对AI技术中的内容伤害制定更好的法规和标准。我们重视并承认研究和开源社区在这方面发挥的重要作用。

幻觉:需要注意和谨慎,不要完全依赖给定的语言模型进行关键决策或具有深远影响的信息,因为如何防止这些模型生成虚假内容并不明显。此外,目前尚不清楚小模型在没有明确基础的生成用例中是否更容易产生幻觉,这可能是由于其较小的规模和较低的记忆容量。这是一个活跃的研究课题,我们希望在这个课题上能有更严格的测量、理解和缓解方法。

滥用的潜在可能性:在没有适当的保护措施的情况下,存在滥用这些模型生成虚假信息或有害内容的风险。

此外,Orca 的性能受到用于解释调优的数据的影响:

Zero-Shot Settings: Orca has been trained on data that simulate zero-shot settings with standard prompts. The model’s performance in other contexts such as multi-turn conversations, in-context learning and few-shot learning, or advanced prompting techniques like chain-of-thought prompting remains untested.

Data Distribution: Orca’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning.

System messages: Orca is trained with diverse system instructions to elicit different kinds of response. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions.

GPT-4 Behavior: As Orca is trained to imitate GPT-4, it could inherit both the advantages and shortcomings of the teacher model. We posit that Orca benefits from the safety measures incorporated during GPT-4 training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification for risks.

This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application.

零样本设置:Orca 是在使用标准提示、模拟零样本设置的数据上训练的。该模型在其他情境下的性能,如多轮对话、上下文学习和少样本学习,或思维链提示等高级提示技术,尚未经过测试。

数据分布:Orca 的性能很可能与调优数据的分布密切相关。这种相关性可能会限制其在训练数据集中代表性不足的领域(如数学、编码和推理)的准确性。

系统消息:Orca 使用多样的系统指令进行训练,以引出不同类型的回应。此外,模型规模引入的随机性可能导致对不同系统指令生成非确定性回应。

GPT-4 行为:由于 Orca 被训练来模仿 GPT-4,它可能同时继承教师模型的优点和缺点。我们认为 Orca 受益于 GPT-4 训练过程中引入的安全措施,以及 Azure OpenAI API 中的安全护栏(例如内容过滤器)。然而,仍需要更详细的研究来更好地量化风险。

该模型仅为研究环境而设计,其测试也仅在此类环境中进行。不应将其用于下游应用,因为还需要额外的分析来评估其在目标应用中可能造成的伤害或偏见。

8、Conclusions结论

This paper offers insights into the current state of training smaller language models to mimic the behavior of Large Foundation Models (LFMs) such as GPT-4. Our research suggests that smaller models’ abilities are frequently overstated when compared to advanced models like ChatGPT and GPT-4. Evaluation benchmarks like AGIEval, which relies on standardized tests such as GRE, SAT, LSAT, etc., offer more robust evaluation frameworks.

The study also underscores the significance of data and imitation techniques, highlighting Explanation Tuning as an effective method for aligning smaller models to GPT-4. However, there remains a distinct need and potential for the development of more refined methods. We emphasize the crucial role of data size and coverage when it comes to aligning smaller models to their more powerful counterparts, like GPT-4. In addition, the quality of the base model is a key factor that influences model performance.

本文深入探讨了训练较小语言模型以模仿大型基础模型(如 ChatGPT 和 GPT-4)行为的当前状态。我们的研究表明,与 ChatGPT 和 GPT-4 等先进模型相比,较小模型的能力经常被夸大。像 AGIEval 这样依赖于 GRE、SAT、LSAT 等标准化测试的评估基准提供了更可靠的评估框架。

该研究还强调了数据和模仿技术的重要性,突出了解释调优作为将较小模型与 GPT-4 对齐的有效方法。然而,开发更精细的方法仍有明确的需求和潜力。我们强调,在将较小模型与 GPT-4 这类更强大的模型对齐时,数据规模和覆盖范围起着关键作用。此外,基础模型的质量也是影响模型性能的关键因素。

Our findings indicate that Orca significantly outperforms other open-source smaller models. Moreover, in some settings, it can match or even surpass the quality of ChatGPT, although a substantial gap with GPT-4 still remains. This suggests smaller models can be trained to be more focused and adaptable in constrained settings without substantial loss in quality. It also suggests that learning from step-by-step explanations (generated by humans or more powerful AI models) could significantly improve the quality of models regardless of their size.

We hope these insights will inform future research and development in this field, especially in the design of more robust evaluation methods, advancement of alignment and post-training techniques, and more effective use of powerful models like GPT-4 as teachers.

我们的发现表明,Orca 在性能上显著优于其他开源的较小模型。此外,在某些情境下,它可以达到甚至超过 ChatGPT 的质量,尽管与 GPT-4 之间仍存在明显差距。这表明,在受限的情境中,较小模型可以被训练得更加专注和适应,而不会显著损失质量。这也表明,从逐步解释中学习(无论这些解释是由人类还是更强大的 AI 模型生成)都可以显著提高模型的质量,而与模型大小无关。

我们希望这些见解能够为该领域未来的研究和发展提供参考,特别是在设计更稳健的评估方法、推进对齐与后训练技术,以及更有效地使用 GPT-4 这类强大模型作为教师等方面。


 
