
Abstract

Large language models (LLMs) have shown the potential to revolutionize natural language processing tasks in diverse domains, sparking great interest in finance. Accessing high-quality financial data is the first challenge for financial LLMs (FinLLMs). While proprietary models like BloombergGPT have taken advantage of their unique data accumulation, such privileged access calls for an open-source alternative to democratize Internet-scale financial data. In this paper, we present an open-source large language model, FinGPT, for the finance sector. Unlike proprietary models, FinGPT takes a data-centric approach, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. We highlight the importance of an automatic data curation pipeline and the lightweight low-rank adaptation technique in building FinGPT. Furthermore, we showcase several potential applications as stepping stones for users, such as robo-advising, algorithmic trading, and low-code development. Through collaborative efforts within the open-source AI4Finance community, FinGPT aims to stimulate innovation, democratize FinLLMs, and unlock new opportunities in open finance.

Two associated code repositories are https://github.com/AI4Finance-Foundation/FinGPT and https://github.com/AI4Finance-Foundation/FinNLP.

1 Introduction

The continual expansion and evolution of artificial intelligence have provided fertile ground for the proliferation of large language models [Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018; Ethayarajh, 2019; Lewis et al., 2019; Lewis et al., 2020; Brown et al., 2020; Thoppilan et al., 2022], thereby effecting a transformative shift in the landscape of natural language processing across diverse domains. This sweeping change has engendered keen interest in applying these models in the financial realm. It is evident, however, that acquiring high-quality, relevant, and up-to-date data is a critical factor in developing an effective and efficient open-source financial language model.

Utilizing language models in the financial arena reveals intricate hurdles. These range from difficulties in obtaining data, handling diverse data formats and types, and managing inconsistent data quality to the essential requirement of up-to-date information. In particular, extracting historical or specialized financial data is complex because the data reside in varying media such as web platforms, APIs, PDF documents, and images.

In the proprietary sphere, models like BloombergGPT [Wu et al., 2023] have capitalized on their exclusive access to specialized data to train finance-specific language models. However, the restricted accessibility and transparency of their data collections and training protocols have accentuated the demand for a more open and inclusive alternative. In response to this demand, we are witnessing a shift toward democratizing Internet-scale financial data in the open-source domain.

In this paper, we address the aforementioned challenges associated with financial data and introduce FinGPT, an end-to-end open-source framework for financial large language models (FinLLMs). Adopting a data-centric approach, FinGPT underscores the crucial role of data acquisition, cleaning, and preprocessing in developing open-source FinLLMs. By championing data accessibility, FinGPT aspires to enhance research, collaboration, and innovation in finance, paving the way for open finance practices. Our contributions are summarized as follows:

• Democratization: FinGPT, as an open-source framework, aims to democratize financial data and FinLLMs, uncovering untapped potential in open finance.

• Data-centric approach: Recognizing the significance of data curation, FinGPT adopts a data-centric approach and implements rigorous cleaning and preprocessing methods for handling varied data formats and types, thereby ensuring high-quality data.

• End-to-end framework: FinGPT embraces a full-stack framework for FinLLMs with four layers:

  – Data source layer: This layer assures comprehensive market coverage, addressing the temporal sensitivity of financial data through real-time information capture.

  – Data engineering layer: Primed for real-time NLP data processing, this layer tackles the inherent challenges of high temporal sensitivity and low signal-to-noise ratio in financial data.

  – LLMs layer: Focusing on a range of fine-tuning methodologies, this layer mitigates the highly dynamic nature of financial data, ensuring the model’s relevance and accuracy.

  – Application layer: Showcasing practical applications and demos, this layer highlights the potential capability of FinGPT in the financial sector.
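To make the four-layer structure concrete, the following is a minimal sketch of how data could flow from one layer to the next. It is an illustration under stated assumptions, not FinGPT's actual code: every name (`Record`, `data_source_layer`, `data_engineering_layer`, `llm_layer`, `application_layer`) is hypothetical, the sample records are invented toy data about a fictional company, and the LLMs layer is replaced by a trivial keyword rule standing in for a fine-tuned model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    source: str     # hypothetical label such as "news" or "social"
    timestamp: str  # ISO 8601 string
    text: str


def data_source_layer() -> List[Record]:
    """Stand-in for real-time capture: returns invented sample records."""
    return [
        Record("news", "2023-06-01T09:30:00", "  Acme beats earnings expectations  "),
        Record("social", "2023-06-01T09:31:00", ""),
        Record("news", "2023-06-01T09:30:00", "Acme beats earnings expectations"),
        Record("news", "2023-06-01T09:45:00", "Regulator opens probe into Acme"),
    ]


def data_engineering_layer(records: List[Record]) -> List[Record]:
    """Normalize whitespace, drop empty texts, and drop repeated texts."""
    seen = set()
    cleaned = []
    for rec in records:
        text = " ".join(rec.text.split())
        if not text or text in seen:
            continue  # noise: empty or duplicate item
        seen.add(text)
        cleaned.append(Record(rec.source, rec.timestamp, text))
    return cleaned


def llm_layer(record: Record) -> int:
    """Placeholder for a fine-tuned FinLLM: a keyword rule returning -1, 0, or 1."""
    text = record.text.lower()
    if "beats" in text:
        return 1
    if "probe" in text:
        return -1
    return 0


def application_layer(scores: List[int]) -> str:
    """Aggregate per-record scores into one coarse sentiment label."""
    total = sum(scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


def run_pipeline() -> Tuple[List[Record], List[int], str]:
    """Chain the four layers in order and return the intermediate results."""
    cleaned = data_engineering_layer(data_source_layer())
    scores = [llm_layer(rec) for rec in cleaned]
    return cleaned, scores, application_layer(scores)
```

Calling `run_pipeline()` returns the cleaned records, the per-record scores, and the aggregate label. Keeping each layer behind a plain function boundary is one possible design choice: it would let the keyword rule be swapped for a real model without touching the data layers.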

Our vision for FinGPT is to serve as a catalyst for stimulating innovation within the finance domain. FinGPT is not limited to technical contributions; it also cultivates an open-source ecosystem for FinLLMs, promoting real-time processing and customized adaptation for users. By nurturing a robust collaboration ecosystem within the open-source AI4Finance community, FinGPT is positioned to reshape our understanding and application of FinLLMs.
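The lightweight low-rank adaptation technique named in the abstract is one way such customized adaptation can be kept cheap. In its general form, a pretrained weight matrix W is frozen and only a low-rank update B·A is trained, so the adapted layer computes W·x + (alpha / r)·B·(A·x). The sketch below illustrates that idea in plain Python. It is not taken from the FinGPT code base: the function names are hypothetical, and the dimensions and rank used in the note that follows are assumptions chosen only to show the arithmetic.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A).

    W has shape d_out x d_in and stays frozen.
    A has shape r x d_in and B has shape d_out x r; only these are trained.
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # never forms the full d_out x d_in update
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Number of trainable entries in A and B combined."""
    return r * (d_in + d_out)
```

For an assumed 4096 x 4096 weight matrix and rank r = 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values, against 16,777,216 entries in the full matrix. When B is initialized to zeros, `lora_forward` returns exactly the frozen model's output, so adaptation starts from the pretrained behavior.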

2 Related Work

2.1 LLMs and ChatGPT

Large language models (LLMs) such as GPT-3 and GPT-4 [Brown et al., 2020] have been recognized as a technological breakthrough in natural language processing. They use transformer-based architectures and demonstrate impressive performance across various generative tasks.

As an offshoot of the GPT family developed by OpenAI, ChatGPT was designed to produce human-like text based on input prompts. It has shown significant utility in diverse applications, from drafting emails to writing code and creating written content.

2.2 LLMs in Finance

LLMs have been applied to various tasks within the financial sector [Dredze et al., 2016; Araci, 2019; Bao et al., 2021; DeLucia et al., 2022], from predictive modeling to generating insightful narratives from raw financial data. Recent literature has focused on using these models for financial text analysis, given the abundance of text data in this field, such as news articles, earnings call transcripts, and social media posts.

The first example of financial LLMs is BloombergGPT [Wu et al., 2023], which was trained on a mixed dataset of financial and general sources. Despite its impressive capabilities, access limitations exist, and the prohibitive training cost has motivated the need for low-cost domain adaptation.

Our FinGPT responds to these challenges by presenting an open-source financial LLM. It employs reinforcement learning from human feedback (RLHF) to understand and adapt to individual preferences, paving the way for personalized financial assistants. We aim to combine the strengths of general LLMs such as ChatGPT with financial adaptation, exploiting LLMs' capabilities in finance.
