LLMs | Koala: Translation and Commentary on "Koala: A Dialogue Model for Academic Research"

Editor's note: On April 3, 2023, UC Berkeley released Koala. This post introduces the Koala dialogue model, summarized as follows:

Background: Large language models offer powerful capabilities, but most require massive compute and closed, proprietary datasets, which makes them hard to use in academic research. Smaller models do not match large ones, but they can run under limited compute, and the performance of recent open-source models has been improving steadily.

Problem: How can an open-source dialogue model that approaches the performance of large models be trained with limited resources?

Solution

>> Collect dialogue data from the web, including users' conversations with ChatGPT, and curate it into a high-quality dataset.

>> Use this dataset to fine-tune a Koala model based on LLaMA.

Core features

>> The dataset focuses on conversations with large models such as ChatGPT, leveraging the outputs of large models to improve a smaller one.

>> In human evaluations on the Alpaca test set and on a set of real user queries, Koala's responses are often preferred over Alpaca's and are at least tied with ChatGPT's in over half the cases.

>> Trained only on high-quality dialogue data, a small model can approach the performance of much larger models.

Advantages

>> Koala is relatively small and can be used flexibly in academic settings.

>> The results suggest that curating high-quality dialogue data may matter more than sheer model scale.

>> The authors hope Koala will serve as an open-source platform for research on the safety and alignment of dialogue models.

Contents

"Koala: A Dialogue Model for Academic Research": Translation and Commentary

Abstract

System Overview

Datasets and Training

ChatGPT Distillation Data

Open Source Data

Preliminary Evaluation

Limitations and Safety

Release

License

Future Work


"Koala: A Dialogue Model for Academic Research": Translation and Commentary

Source

Koala: A Dialogue Model for Academic Research – The Berkeley Artificial Intelligence Research Blog

Date

April 3, 2023

Authors

UC Berkeley (BAIR)

Xinyang Geng∗, Arnav Gudibande∗, Hao Liu∗, Eric Wallace∗, Pieter Abbeel⋄, Sergey Levine⋄ and Dawn Song⋄

Abstract

In this post, we introduce Koala, a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web. We describe the dataset curation and training process of our model, and also present the results of a user study that compares our model to ChatGPT and Stanford’s Alpaca. Our results show that Koala can effectively respond to a variety of user queries, generating responses that are often preferred over Alpaca, and at least tied with ChatGPT in over half of the cases.

We hope that these results contribute further to the discourse around the relative performance of large closed-source models to smaller public models. In particular, it suggests that models that are small enough to be run locally can capture much of the performance of their larger cousins if trained on carefully sourced data. This might imply, for example, that the community should put more effort into curating high-quality datasets, as this might do more to enable safer, more factual, and more capable models than simply increasing the size of existing systems. We emphasize that Koala is a research prototype, and while we hope that its release will provide a valuable community resource, it still has major shortcomings in terms of content, safety, and reliability, and should not be used outside of research.

Online interactive demo: https://koala.lmsys.org/

EasyLM (training and serving framework): https://github.com/young-geng/EasyLM

Koala model weights diff against base LLaMA: https://drive.google.com/drive/folders/10f7wrlAFoPIy-TECHsx9DKIvbQYunCfl?usp=sharing

System Overview

Large language models (LLMs) have enabled increasingly powerful virtual assistants and chat bots, with systems such as ChatGPT, Bard, Bing Chat, and Claude able to respond to a breadth of user queries, provide sample code, and even write poetry. Many of the most capable LLMs require huge computational resources to train, and oftentimes use large and proprietary datasets. This suggests that in the future, highly capable LLMs will be largely controlled by a small number of organizations, and both users and researchers will pay to interact with these models without direct access to modify and improve them on their own. On the other hand, recent months have also seen the release of increasingly capable freely available or (partially) open-source models, such as LLaMA. These systems typically fall short of the most capable closed models, but their capabilities have been rapidly improving. This presents the community with an important question: will the future see increasingly more consolidation around a handful of closed-source models, or the growth of open models with smaller architectures that approach the performance of their larger but closed-source cousins?

While the open models are unlikely to match the scale of closed-source models, perhaps the use of carefully selected training data can enable them to approach their performance. In fact, efforts such as Stanford’s Alpaca, which fine-tunes LLaMA on data from OpenAI’s GPT model, suggest that the right data can improve smaller open source models significantly.

We introduce a new model, Koala, which provides an additional piece of evidence toward this discussion. Koala is fine-tuned on freely available interaction data scraped from the web, but with a specific focus on data that includes interaction with highly capable closed-source models such as ChatGPT. We fine-tune a LLaMA base model on dialogue data scraped from the web and public datasets, which includes high-quality responses to user queries from other large language models, as well as question answering datasets and human feedback datasets. The resulting model, Koala-13B, shows competitive performance to existing models as suggested by our human evaluation on real-world user prompts.

Our results suggest that learning from high-quality datasets can mitigate some of the shortcomings of smaller models, maybe even matching the capabilities of large closed-source models in the future. This might imply, for example, that the community should put more effort into curating high-quality datasets, as this might do more to enable safer, more factual, and more capable models than simply increasing the size of existing systems.

By encouraging researchers to engage with our system demo, we hope to uncover any unexpected features or deficiencies that will help us evaluate the models in the future. We ask researchers to report any alarming actions they observe in our web demo to help us comprehend and address any issues. As with any release, there are risks, and we will detail our reasoning for this public release later in this blog post. We emphasize that Koala is a research prototype, and while we hope that its release will provide a valuable community resource, it still has major shortcomings in terms of content, safety, and reliability, and should not be used outside of research. Below we provide an overview of the differences between Koala and notable existing models.

Datasets and Training

A primary obstacle in building dialogue models is curating training data. Prominent chat models, including ChatGPT, Bard, Bing Chat and Claude use proprietary datasets built using significant amounts of human annotation. To construct Koala, we curated our training set by gathering dialogue data from the web and public datasets. Part of this data includes dialogues with large language models (e.g., ChatGPT) which users have posted online.

Rather than maximizing quantity by scraping as much web data as possible, we focus on collecting a small high-quality dataset. We use public datasets for question answering, human feedback (responses rated both positively and negatively), and dialogues with existing language models. We provide the specific details of the dataset composition below.

ChatGPT Distillation Data

Public User-Shared Dialogues with ChatGPT (ShareGPT). Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples.
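
The post does not publish its filtering code, so the following is only a minimal sketch of the ShareGPT cleaning step described above. The dialogue record format, the choice of the first user query as the deduplication key, and the crude ASCII-ratio check (a stand-in for a real language-identification model) are all illustrative assumptions.

```python
# Sketch of user-query-level deduplication plus an English-only filter.
import hashlib

def first_user_query(dialogue):
    """Return the first user turn's text, used here as the deduplication key."""
    for turn in dialogue:
        if turn["from"] == "user":
            return turn["text"].strip().lower()
    return ""

def looks_english(text, threshold=0.9):
    """Crude stand-in for language ID: fraction of ASCII characters."""
    return bool(text) and sum(ord(c) < 128 for c in text) / len(text) >= threshold

def filter_sharegpt(dialogues):
    seen, kept = set(), []
    for dialogue in dialogues:
        query = first_user_query(dialogue)
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        full_text = " ".join(turn["text"] for turn in dialogue)
        if query and key not in seen and looks_english(full_text):
            seen.add(key)
            kept.append(dialogue)
    return kept
```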

Human ChatGPT Comparison Corpus (HC3). We use both the human and ChatGPT responses from the HC3 English dataset, which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in a total number of around 87K question-answer examples.

Open Source Data

Open Instruction Generalist (OIG). We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.

Stanford Alpaca. We include the dataset used to train the Stanford Alpaca model. The dataset contains around 52K examples, which is generated by OpenAI’s text-davinci-003 following the self-instruct process. It is worth noting that HC3, OIG, and Alpaca datasets are single-turn question answering while ShareGPT dataset is dialogue conversations.

Anthropic HH. The Anthropic HH dataset contains human ratings of harmfulness and helpfulness of model outputs. The dataset contains ~160K human-rated examples, where each example in this dataset consists of a pair of responses from a chatbot, one of which is preferred by humans. This dataset provides both capabilities and additional safety protections for our model.

OpenAI WebGPT. The OpenAI WebGPT dataset includes a total of around 20K comparisons where each example comprises a question, a pair of model answers, and metadata. The answers are rated by humans with a preference score.

OpenAI Summarization. The OpenAI summarization dataset contains ~93K examples, each example consists of feedback from humans regarding the summarizations generated by a model. Human evaluators chose the superior summary from two options.

When using the open-source datasets, some of the datasets have two responses, corresponding to responses rated as good or bad (Anthropic HH, WebGPT, OpenAI Summarization). We build on prior research by Keskar et al., Liu et al., and Korbak et al., who demonstrate the effectiveness of conditioning language models on human preference markers (such as “a helpful answer” and “an unhelpful answer”) for improved performance. We condition the model on either a positive or negative marker depending on the preference label. We use positive markers for the datasets without human feedback. For evaluation, we prompt models with positive markers.
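
A minimal sketch of the marker conditioning described above. The exact marker strings and the prompt template are assumptions for illustration; the post only states that examples are conditioned on a positive or negative marker and that evaluation always uses the positive one.

```python
# Sketch of conditioning training examples on human-preference markers.
POSITIVE_MARKER = "a helpful answer"
NEGATIVE_MARKER = "an unhelpful answer"

def build_training_example(prompt, response, preferred=None):
    """Format one training example.

    preferred is True/False for human-feedback datasets (chosen vs. rejected
    response) and None for datasets without feedback, which receive the
    positive marker, matching the description above.
    """
    marker = NEGATIVE_MARKER if preferred is False else POSITIVE_MARKER
    return f"USER: {prompt}\nASSISTANT ({marker}): {response}"

def build_eval_prompt(prompt):
    """At evaluation time the model is always prompted with the positive marker."""
    return f"USER: {prompt}\nASSISTANT ({POSITIVE_MARKER}):"
```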

The Koala model is implemented with JAX/Flax in EasyLM, our open source framework that makes it easy to pre-train, fine-tune, serve, and evaluate various large language models. We train our Koala model on a single Nvidia DGX server with 8 A100 GPUs. It takes 6 hours to complete the training for 2 epochs. On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances.
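
As a rough sanity check on that cost figure (an illustrative back-of-the-envelope calculation, not from the post): 8 A100s for 6 hours is 48 A100-hours, and at roughly $1.5-2 per preemptible A100-hour, typical public-cloud spot pricing in 2023, the run comes to about $70-100, in line with the quoted "less than $100".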

Preliminary Evaluation

In our experiments, we evaluated two models: Koala-Distill, which solely employs distillation data, and Koala-All, which employs all of the data, including both distillation and open-source data. Our aim is to compare the performance of these models and evaluate the influence of distillation and open-source datasets on final performance. We ran a human evaluation to compare Koala-All with Koala-Distill, Alpaca, and ChatGPT. We present our results in the figure above. We evaluate on two different sets, one consisting of 180 test queries used by Stanford’s Alpaca (“Alpaca Test Set”), and our own test set (“Koala Test Set”).

The Alpaca test set consists of user prompts sampled from the self-instruct dataset, and represents in-distribution data for the Alpaca model. To provide a second more realistic evaluation protocol, we also introduce our own (Koala) test set, which consists of 180 real user queries that were posted online. These user queries span various topics, are generally conversational in style, and are likely more representative of the real-world use cases of chat-based systems. To mitigate possible test-set leakage, we filtered out queries that have a BLEU score greater than 20% with any example from our training set. Additionally, we removed non-English and coding-related prompts, since responses to these queries cannot be reliably reviewed by our pool of raters (crowd workers). We release our test set for academic use and future benchmarking.
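
A minimal sketch of the leakage filter described above, using NLTK's sentence-level BLEU as the overlap measure. The authors do not specify the exact BLEU configuration, so the tokenization and smoothing choices below are assumptions.

```python
# Sketch of dropping candidate test queries that overlap the training set (BLEU > 20%).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def max_bleu_against_training(query, training_prompts):
    """Highest BLEU (0..1) of a candidate query against any training prompt."""
    smooth = SmoothingFunction().method1
    hypothesis = query.lower().split()
    return max(
        (sentence_bleu([prompt.lower().split()], hypothesis, smoothing_function=smooth)
         for prompt in training_prompts),
        default=0.0,
    )

def filter_test_queries(candidates, training_prompts, threshold=0.20):
    """Keep only queries whose BLEU against every training example is at most 20%."""
    return [q for q in candidates
            if max_bleu_against_training(q, training_prompts) <= threshold]
```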

With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.
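
A minimal sketch of turning the raw pairwise judgments into the win/tie/loss fractions reported in the post's figure; the record schema (a "winner" field taking "A", "B", or "tie") is an illustrative assumption, not the authors' actual data format.

```python
# Sketch of aggregating blind pairwise ratings into outcome fractions.
from collections import Counter

def summarize_pairwise(ratings):
    """ratings: iterable of dicts like {"prompt": ..., "winner": "A" | "B" | "tie"}."""
    counts = Counter(r["winner"] for r in ratings)
    total = sum(counts.values())
    return {outcome: counts[outcome] / total for outcome in ("A", "tie", "B")}
```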

On the Alpaca test set, Koala-All exhibited comparable performance to Alpaca. However, on our proposed test set, which consists of real user queries, Koala-All was rated as better than Alpaca in nearly half the cases, and either exceeded or tied Alpaca in 70% of the cases. Of course, the more conversational prompts in the Koala test set more closely resemble the Koala training set, so this is perhaps not surprising, but insofar as such prompts more closely resemble likely downstream use cases for such models, this suggests that Koala would be expected to perform better in assistant-like applications. This suggests that data of LLM interactions sourced from examples posted by users on the web is an effective strategy for endowing such models with effective instruction execution capabilities.

Perhaps more surprisingly, we found that training on open-source data in addition to the distillation data (Koala-All) performs slightly worse than training on just ChatGPT distillation data (Koala-Distill), as shown by the comparison to Koala-Distill on both datasets. Though the difference might not be significant, this result suggests that the ChatGPT dialogues are of such high quality that incorporating even twice as much open-source data did not lead to a significant improvement. Our initial hypothesis was that Koala-All should perform at least somewhat better, hence we used it as our primary model in all evaluations, but a potential takeaway from these experiments is that effective instruction and assistant models could be finetuned from LLM backbones such as LLaMA entirely using data from larger and more powerful models, so long as the prompts for these responses are representative of the kinds of prompts that users will provide at test-time. This also further supports the notion that the key to building strong dialogue models may lie more in curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets as questions and answers.

Limitations and Safety

Like other language models, Koala has limitations and can be harmful when misused. We observe that Koala can hallucinate and generate non-factual responses with a highly confident tone, which is likely a result of the dialogue fine-tuning. Perhaps an unfortunate implication of this is that smaller models inherit the confident style of larger language models before they inherit the same level of factuality—if true, this is a limitation that is important to study in future work. When misused, the hallucinated responses from Koala can potentially facilitate the spread of misinformation, spam, and other content.

Koalas can hallucinate inaccurate information in a confident and convincing tone.

Beyond hallucinations, Koala shares deficiencies from other chatbot language models, some of which include:

Biases and Stereotypes: Our model will inherit biases from the dialogue data it was trained on, possibly perpetuating harmful stereotypes, discrimination, and other harms.

Lack of Common Sense: While large language models can generate text that appears to be coherent and grammatically correct, they often lack common sense knowledge that humans take for granted. This can lead to nonsensical or inappropriate responses.

Limited Understanding: Large language models can struggle to understand the context and nuances of a dialogue. They can also have difficulty identifying sarcasm or irony, which can lead to misunderstandings.

To address the safety implications of Koala, we included adversarial prompts in the dataset from ShareGPT and Anthropic HH to make the model more robust and harmless. To further mitigate potential misuse, we deploy OpenAI’s content moderation filter in our online demo to flag and remove unsafe content. We will be cautious about the safety of Koala, and we are committed to perform further safety evaluations of it while also monitoring our interactive demo. Overall, we decided to release Koala because we think its benefits outweigh its risks.
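
The following is a minimal sketch of gating demo outputs with OpenAI's moderation endpoint, written against the current openai Python SDK (version 1.x). How the filter is actually wired into the Koala demo is not published, so treat this as illustrative only.

```python
# Sketch of flagging and removing unsafe content with the OpenAI moderation endpoint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_safe(text: str) -> bool:
    """Return False when the moderation endpoint flags the text as unsafe."""
    result = client.moderations.create(input=text).results[0]
    return not result.flagged

def serve_response(model_output: str) -> str:
    """Replace flagged model outputs before showing them to demo users."""
    return model_output if is_safe(model_output) else "[response removed by content filter]"
```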

Release

We are releasing the following artifacts:

An online interactive demo of Koala

EasyLM: our open source framework we used to train Koala

The code for preprocessing our training data

Our test set of queries

Koala model weights diff against the base LLaMA model
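
The last item above is a weight diff rather than the full checkpoint, so users must add it to the base LLaMA weights. Below is a minimal sketch of that reconstruction, assuming both checkpoints have been loaded as dicts of NumPy arrays with identical keys; the real artifact uses EasyLM's checkpoint format, so the loading and saving details are omitted here.

```python
# Sketch of recovering fine-tuned Koala weights from a released diff.
import numpy as np

def apply_weight_diff(base_params: dict, diff_params: dict) -> dict:
    """Recover fine-tuned weights as base + diff, parameter by parameter."""
    if base_params.keys() != diff_params.keys():
        raise ValueError("base and diff checkpoints must share parameter names")
    return {
        name: np.asarray(base_params[name]) + np.asarray(diff_params[name])
        for name in base_params
    }
```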

License

The online demo is a research preview intended for academic research only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Any other usage of the online demo, including but not limited to commercial usage, is strictly prohibited. Please contact us if you find any potential violations. Our training and inference code is released under the Apache License 2.0.

Future Work

We hope that the Koala model will serve as a useful platform for future academic research on large language models: the model is capable enough to exhibit many of the capabilities that we associate with modern LLMs, while being small enough to be finetuned or utilized with more limited compute. Potentially promising directions might include:

Safety and alignment: Koala allows further study of language model safety and better alignment with human intentions.

Model bias: Koala enables us to better understand the biases of large language models, the presence of spurious correlations and quality issues in dialogue datasets, and methods to mitigate such biases.

Understanding large language models: because Koala inference can be performed on relatively inexpensive commodity GPUs, it enables us to better inspect and understand the internals of dialogue language models, making (previously black-box) language models more interpretable.
