
Table of Contents

  • 1.Fact Checking
    • 1.1.Task Overview
    • 1.2.FEVER
    • 1.3.Climate-FEVER
    • 1.4.SciFact
  • 2.Citation Prediction
    • 2.1.Task Overview
    • 2.2.SciDocs
  • 3.Duplicate Question Retrieval
    • 3.1.Task Overview
    • 3.2.CQADupStack (community QA)
    • 3.3.Quora
  • 4.Argument Retrieval
    • 4.1.Task Overview
    • 4.2.ArguAna
    • 4.3.Touché-2020
  • 5.News Retrieval
    • 5.1.Task Overview
    • 5.2.TREC News
  • 6.Question Answering
    • 6.1.Task Overview
    • 6.2.Natural Questions (open-domain)
    • 6.3.HotpotQA
    • 6.4.FiQA-2018
  • 7.Tweet Retrieval
    • 7.1.Task Overview
    • 7.2.Signal-1M
  • 8.Biomedical IR
    • 8.1.Task Overview
    • 8.2.NFCorpus
    • 8.3.BioASQ
    • 8.4.TREC-COVID
  • 9.Entity Retrieval
    • 9.1.Task Overview
    • 9.2.DBpedia
  • 10.Dataset Links in BEIR
  • 11.Example Data in BEIR
  • 12.Ad-hoc IR
  • 13.CQA
  • 14.NLI
  • 15.Paraphrase Identification
  • 16.Response retrieval

1.Fact Checking

1.1.Task Overview

Verify a claim against a large body of evidence. The claim is the input, and the relevant document passages that substantiate the claim are the output.

1.2.FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia, which were subsequently verified without knowledge of the sentences they were derived from. Claims are labeled as Supported, Refuted, or NotEnoughInfo, and the data is in JSON format.

  • Link: Fact Extraction and VERification (fever.ai)

1.3.Climate-FEVER

CLIMATE-FEVER is a dataset that adopts the FEVER methodology and consists of 1,535 real-world claims regarding climate change. Each claim is accompanied by five manually annotated evidence sentences retrieved from Wikipedia that support, refute, or do not give enough information to validate the claim. In total, the dataset contains 7,675 claim-evidence pairs. It also includes challenging claims that relate to multiple facets, as well as disputed cases where both supporting and refuting evidence are present.

  • Link: UZH - Center of Competence for Sustainable Finance - CLIMATE-FEVER

1.4.SciFact

The claims are split into claims_train.jsonl, claims_dev.jsonl, and claims_test.jsonl, with one claim per line. The evidence documents are in corpus.jsonl, with one document per line.

  • Link: allenai/scifact: Data and models for the SciFact verification task. (github)
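
Since both files are JSON Lines, they can be loaded line by line with the standard library. A minimal sketch, with illustrative field names ("doc_id", "title", "claim", "cited_doc_ids") that are assumptions here; check the repository for the exact schema:

```python
import json

# Tiny in-memory stand-ins for corpus.jsonl and claims_train.jsonl
# (one JSON object per line; field names are assumptions for illustration).
corpus_jsonl = '{"doc_id": 4983, "title": "Microstructural development study", "abstract": ["..."]}'
claims_jsonl = '{"id": 1, "claim": "0-dimensional biomaterials lack inductive properties.", "cited_doc_ids": [4983]}'

# Index evidence documents by id, then resolve each claim's cited documents.
corpus = {rec["doc_id"]: rec for rec in (json.loads(l) for l in corpus_jsonl.splitlines())}
claims = [json.loads(l) for l in claims_jsonl.splitlines()]

cited_titles = {c["id"]: [corpus[d]["title"] for d in c["cited_doc_ids"] if d in corpus]
                for c in claims}
print(cited_titles)  # {1: ['Microstructural development study']}
```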

2.Citation Prediction

2.1.Task Overview

Citations are a key signal of relatedness between scientific articles. In this task, the model tries to retrieve cited papers (output) for a given query paper title (input).

2.2.SciDocs

SciDocs is a benchmark comprising seven document-level tasks, ranging from citation prediction to document classification and recommendation.

  • Link: SciDocs Dataset — Allen Institute for AI (allenai)

3.Duplicate Question Retrieval

3.1.Task Overview

Duplicate question retrieval is the task of identifying duplicate questions in community question-answering forums. A given query is the input, and its duplicate questions are the output.

3.2.CQADupStack(community QA)

It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Predefined train and test splits are provided for retrieval and classification experiments, to ensure maximum comparability across studies that use the collection. It also ships with a script for manipulating the data in various ways.

  • Link: CQADupStack (unimelb.edu.au)

3.3.Quora

The released dataset covers a variety of Quora-related questions and is made available to researchers across many fields. It contains over 400,000 lines of potential duplicate question pairs.

  • Link: First Quora Dataset Release: Question Pairs - Data @ Quora

4.Argument Retrieval

4.1.Task Overview

Argument retrieval is the task of ranking the argument texts in a focused collection of arguments (output) according to their relevance to textual queries on a variety of topics (input); that is, topic-oriented relevance ranking.

Personal take: given a controversial (debatable) topic, find related evidence, topics, and argumentative statements.

4.2.ArguAna

The ArguAna Counterargs Corpus: an English dataset for learning to retrieve the best counterargument to a given argument. It contains 6,753 pairs of arguments and their best counterarguments, sourced from the debate portal idebate, along with files for different experiments containing up to millions of candidate pairs.

  • Link: Data – ArguAna (argumentation.bplaced)

4.3.Touché-2020

Given a question on a controversial topic, retrieve relevant arguments from a crawl of online debate portals.

<topic>
<number>1</number>
<title>Is climate change real?</title>
<description>You read an opinion piece on how climate change is a hoax and disagree. Now you are looking for arguments supporting the claim that climate change is in fact real.</description>
<narrative>Relevant arguments will support the given stance that climate change is real or attack a hoax side's argument.</narrative>
</topic>
  • Link: Touché at CLEF 2020 - Argument Retrieval for Controversial Questions (webis.de)
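
Topic entries like the one shown above can be parsed with Python's standard library. A minimal sketch (a single <topic> element is already well-formed XML, so no wrapper is needed; description and narrative are shortened here):

```python
import xml.etree.ElementTree as ET

# The <topic> entry shown above, with long fields abbreviated.
topic_xml = """<topic>
<number>1</number>
<title>Is climate change real?</title>
<description>You read an opinion piece on how climate change is a hoax and disagree.</description>
<narrative>Relevant arguments will support the given stance that climate change is real.</narrative>
</topic>"""

topic = ET.fromstring(topic_xml)
number = int(topic.findtext("number"))  # 1
title = topic.findtext("title")         # "Is climate change real?"
print(number, title)
```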

5.News Retrieval

5.1.Task Overview

Given a news headline, we retrieve relevant news articles that provide important context or background information.

5.2.TREC News

The TREC News Track features modern search tasks in the news domain. The TREC Washington Post Corpus contains 728,626 news articles and blog posts from January 2012 through December 2020. Each document includes:

  • title
  • byline
  • date of publication
  • kicker (a section header)
  • article text broken into paragraphs
  • links to embedded images and multimedia (for 2012-2017 documents)
  • Link: TREC News Track (trec-news)

6.Question Answering

6.1.Task Overview

Open-domain question answering is the task of retrieving the correct answer to a question without a predefined location for the answer. In the open-domain setting, the model must search over an entire knowledge base (e.g., Wikipedia). The question is the input, and a passage containing the answer is the output.

6.2.Natural Questions(open-domain)

Natural Questions is a QA dataset with 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example consists of a Google query and a corresponding Wikipedia page. Each Wikipedia page is annotated with a passage (or long answer) that answers the question, as well as one or more short spans within it that contain the actual answer. Both the long-answer and short-answer annotations can be empty. If both are empty, there is no answer on the page. If the long-answer annotation is non-empty but the short-answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found. Finally, about 1% of documents are annotated with a short answer of "yes" or "no" on a passage, instead of a list of short spans.

  • Link: Google's Natural Questions
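
The annotation rules above can be summarized as a small decision function. A sketch only: the field names and return labels here are illustrative, not the official NQ schema:

```python
def classify_nq_annotation(long_answer, short_answers, yes_no_answer=None):
    """Interpret an NQ annotation following the rules above (illustrative names)."""
    if yes_no_answer in ("YES", "NO"):
        return "yes/no short answer"      # ~1% of documents
    if long_answer is None:
        return "no answer on this page"   # both annotations empty
    if not short_answers:
        return "long answer only"         # passage answers, no explicit short span
    return "short answer span(s)"

print(classify_nq_annotation(None, []))                 # no answer on this page
print(classify_nq_annotation("passage", []))            # long answer only
print(classify_nq_annotation("passage", [(5, 9)]))      # short answer span(s)
```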

6.3.HotpotQA

A QA dataset collected from English Wikipedia, containing about 113K crowd-sourced questions that require the introductory paragraphs of two Wikipedia articles to answer. Each question in the dataset comes with its two gold paragraphs, as well as a list of sentences in those paragraphs that crowd workers identified as the supporting facts needed to answer the question.

  • Link: HotpotQA Homepage

6.4.FiQA-2018

Question answering over financial data. There are two tasks:

  • Aspect-based financial sentiment analysis: given an English text instance from the financial domain (a microblog message, news statement, or headline), detect the target aspects mentioned in the text (from a predefined list of aspect classes) and predict the sentiment score for each mentioned target.

  • Opinion-based QA over financial data: given a corpus of structured and unstructured text documents from different English financial data sources (microblogs, reports, news), build a question-answering system that answers natural-language questions.

    {
      "question": "Why are big companies like Apple or Google not included in the Dow Jones Industrial Average (DJIA) index?",
      "answers": {
        "290156": {
          "text": "That is a pretty exclusive club and for the most part they are not interested in highly volatile companies like Apple and Google. Sure, IBM is part of the DJIA, but that is about as stalwart as you can get these days. The typical profile for a DJIA stock would be one that pays fairly predictable dividends, has been around since money was invented, and are not going anywhere unless the apocalypse really happens this year. In summary, DJIA is the boring reliable company index.",
          "timestamp": "Sep 11 '12 at 0:53"
        }
      }
    }

  • Link: FiQA - 2018 (google)
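
A record in this shape can be consumed directly with the json module. A minimal sketch, with the structure inferred from the snippet above and the answer text shortened:

```python
import json

# The example record above, cleaned into valid JSON (structure inferred from the
# snippet; the answer text is abbreviated here).
record = json.loads("""
{
  "question": "Why are big companies like Apple or Google not included in the DJIA index?",
  "answers": {
    "290156": {
      "text": "That is a pretty exclusive club ...",
      "timestamp": "Sep 11 '12 at 0:53"
    }
  }
}
""")

# Answers are keyed by an answer-id string; collect (id, text) pairs.
answer_pairs = [(aid, a["text"]) for aid, a in record["answers"].items()]
print(answer_pairs[0][0])  # 290156
```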

7.Tweet Retrieval

7.1.Task Overview

Twitter is a microblogging site where people post real-time messages about topics of their choice and discuss current issues. A news headline is the input, and relevant tweets are retrieved as the output.

7.2.Signal-1M

The dataset was released by Signal AI to facilitate research on news articles. It contains 1 million articles, mostly in English, but also includes non-English and multilingual articles. Besides local news sources and blogs, the sources include major outlets such as Reuters. Each article has the following fields:

  • id: a unique identifier for the article
  • title: the title of the article
  • content: the textual content of the article (may occasionally contain HTML and JavaScript content)
  • source: the name of the article source (e.g. Reuters)
  • published: the publication date of the article
  • media-type: either “News” or “Blog”
  • Link: [Signal 1 Million News Articles Dataset | Signal Research (signal-ai)](https://github.com/signal-ai/Signal-1M-Tools)

8.Biomedical IR

8.1.Task Overview

Biomedical information retrieval searches for relevant scientific documents, such as research papers or blog posts, for a given scientific query in the biomedical domain. The scientific query is the input, and retrieved biomedical documents are the output.

8.2.NFCorpus

NFCorpus is a full-text English retrieval dataset for medical information retrieval. It contains 3,244 natural-language queries, with 169,756 automatically extracted relevance judgments over 9,964 medical documents written in complex, terminology-heavy language.

  • Link: NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval - StatNLP Heidelberg (uni-heidelberg.de)

8.3.BioASQ

BioASQ is a question-answering dataset. Instances in BioASQ consist of a question (Q), human-annotated answers (A), and relevant contexts (C), also called snippets.

  • Link: BioASQ Participants Area | BioASQ

8.4.TREC-COVID

The task was run over five rounds. Each round used a snapshot of the CORD-19 collection and a set of information-need statements (topics). After each round's submission deadline, NIST pooled the submitted runs to produce a set of documents per topic, which human annotators then assessed for relevance to the topic.

  • Link: TREC-COVID Data (nist.gov)

9.Entity Retrieval

9.1.Task Overview

Entity retrieval requires retrieving the unique Wikipedia page for an entity mentioned in the query (i.e., retrieving the page that describes the entity). This is important for tasks that involve entity linking. Entity-bearing queries are the input, and entity abstracts and titles are retrieved as the output.

9.2.DBpedia

DBpedia is dedicated to extracting structured content from the information created in the Wikipedia project. It allows users to semantically query the relationships and properties of Wikipedia resources, including links to other related datasets.

  • Link: Home - DBpedia Association

10.Dataset Links in BEIR

Dataset | Website (Link)
MSMARCO | https://microsoft.github.io/msmarco/
TREC-COVID | https://ir.nist.gov/covidSubmit/index.html
NFCorpus | https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/
BioASQ | http://bioasq.org
NQ | https://ai.google/research/NaturalQuestions
HotpotQA | https://hotpotqa.github.io
FiQA-2018 | https://sites.google.com/view/fiqa/
Signal-1M (RT) | https://research.signal-ai.com/datasets/signal1m-tweetir.html
TREC-NEWS | https://trec.nist.gov/data/news2019.html
ArguAna | http://argumentation.bplaced.net/arguana/data
Touché-2020 | https://webis.de/events/touche-20/shared-task-1.html
CQADupStack | http://nlp.cis.unimelb.edu.au/resources/cqadupstack/
Quora | https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
DBPedia-Entity | https://github.com/iai-group/DBpedia-Entity/
SCIDOCS | https://allenai.org/data/scidocs
FEVER | http://fever.ai
Climate-FEVER | http://climatefever.ai
SciFact | https://github.com/allenai/scifact

Table 6: Original dataset websites (links) for all datasets present in BEIR.

11.Example Data in BEIR

  • MSMARCO
    Query: what fruit is native to australia
    Relevant document: Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible…
  • TREC-COVID
    Query: what is the origin of COVID-19
    Relevant document: Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence Origin of the COVID-19 virus has been intensely debated in the community…
  • BioASQ
    Query: What is the effect of HMGB2 loss on CTCF clustering
    Relevant document: HMGB2 Loss upon Senescence Entry Disrupts Genomic Organization and Induces CTCF Clustering across Cell Types. Processes like cellular senescence are characterized by complex events giving rise to heterogeneous cell populations. However, the early molecular events driving this cascade remain elusive…
  • NFCorpus
    Query: Titanium Dioxide & Inflammatory Bowel Disease
    Relevant document: Titanium Dioxide Nanoparticles in Food and Personal Care Products Titanium dioxide is a common additive in many food, personal care, and other consumer products used by people, which after use can enter the sewage system, and subsequently enter the environment as treated effluent discharged to surface waters or biosolids applied to agricultural land, or incinerated wastes…
  • NQ
    Query: when did they stop cigarette advertising on television?
    Relevant document: Tobacco advertising The first calls to restrict advertising came in 1962 from the Royal College of Physicians, who highlighted the health problems and recommended stricter laws…
  • HotpotQA
    Query: Stockely Webster has paintings hanging in what home (that serves as the residence for the Mayor of New York)?
    Relevant document: Stokely Webster Stokely Webster (1912 – 2001) was best known as an American impressionist painter who studied in Paris. His paintings can be found in the permanent collections of many museums, including the Metropolitan Museum of Art in New York, the National Museum…
  • FiQA-2018
    Query: What is the PEG ratio? How is the PEG ratio calculated? How is the PEG ratio useful for stock investing?
    Relevant document: PEG is Price/Earnings to Growth. It is calculated as Price/Earnings/Annual EPS Growth. It represents how good a stock is to buy, factoring in growth of earnings, which P/E does not. Obviously when PEG is lower, a stock is more undervalued, which means that it is a better buy, and more likely…
  • Signal-1M (RT)
    Query: Genvoya, a Gentler Anti-HIV Cocktail, Okayed by EU Regulators
    Relevant document: All people with #HIV should get anti-retroviral drugs: @WHO, by @kkelland via @Reuters_Health #AIDS #TasP
  • TREC-NEWS
    Query: Websites where children are prostituted are immune from prosecution. But why?
    Relevant document: Senate launches bill to remove immunity for websites hosting illegal content, spurred by Backpage The legislation, along with a similar bill in the House, sets the stage for a battle between Congress and some of the Internet’s most powerful players, including Google and various free-speech advocates, who believe that Congress shouldn’t regulate Web content or try to force websites to police themselves more rigorously…
  • ArguAna
    Query: Sexist advertising is subjective so would be too difficult to codify. Effective advertising appeals to the social, cultural, and personal values of consumers. Through the connection of values to products, services and ideas, advertising is able to accomplish its goal of adoption…
    Relevant document: media modern culture television gender house would ban sexist advertising Although there is a claim that sexist advertising is to difficult to codify, such codes have and are being developed to guide the advertising industry. These standards speak to advertising which demeans the status of women, objectifies them, and plays upon stereotypes about women which harm women and society in general. Earlier the Council of Europe was mentioned, Denmark, Norway and Australia as specific examples of codes or standards for evaluating sexist advertising which have been developed.
  • Touché-2020
    Query: Should the government allow illegal immigrants to become citizens?
    Relevant document: America should support blanket amnesty for illegal immigrants. Undocumented workers do not receive full Social Security benefits because they are not United States citizens " nor should they be until they seek citizenship legally. Illegal immigrants are legally obligated to pay taxes…
  • CQADupStack
    Query: Command to display first few and last few lines of a file
    Relevant document: Combing head and tail in a single call via pipe On a regular basis, I am piping the output of some program to either ‘head‘ or ‘tail‘. Now, suppose that I want to see the first AND last 10 lines of piped output, such that I could do something like ./lotsofoutput | headtail…
  • Quora
    Query: How long does it take to methamphetamine out of your blood?
    Relevant document: How long does it take the body to get rid of methamphetamine?
  • DBPedia
    Query: Paul Auster novels
    Relevant document: The New York Trilogy The New York Trilogy is a series of novels by Paul Auster. Originally published sequentially as City of Glass (1985), Ghosts (1986) and The Locked Room (1986), it has since been collected into a single volume.
  • SCIDOCS
    Query: CFD Analysis of Convective Heat Transfer Coefficient on External Surfaces of Buildings
    Relevant document: Application of CFD in building performance simulation for the outdoor environment: an overview This paper provides an overview of the application of CFD in building performance simulation for the outdoor environment, focused on four topics…
  • FEVER
    Query: DodgeBall: A True Underdog Story is an American movie from 2004
    Relevant document: DodgeBall: A True Underdog Story DodgeBall: A True Underdog Story is a 2004 American sports comedy film written and directed by Rawson Marshall Thurber and starring Vince Vaughn and Ben Stiller. The film follows friends who enter a dodgeball tournament…
  • Climate-FEVER
    Query: Sea level rise is now increasing faster than predicted due to unexpectedly rapid ice melting.
    Relevant document: Sea level rise A sea level rise is an increase in the volume of water in the world ’s oceans, resulting in an increase in global mean sea level. The rise is usually attributed to global climate change by thermal expansion of the water in the oceans and by melting of Ice sheets and glaciers…

Table 7: Examples of queries and relevant documents for all datasets included in BEIR. In the original table, special markers separated a document's title from its paragraph; those tokens were not passed to the respective models and are omitted here.

12.Ad-hoc IR

Ad-hoc information retrieval refers specifically to text-based retrieval in which the documents in the collection remain relatively static while new queries are continually submitted to the system.

Dataset | Genre | #Queries | #Documents
Robust04 | news | 250 | 0.5M
ClueWeb09-Cat-B | web | 150 | 50M
Gov2 | .gov pages | 150 | 25M
MS MARCO (Document Ranking) | web pages | 367,013 | 3.2M
MQ2007 | .gov pages | 1692 | 25M
MQ2008 | .gov pages | 794 | 25M
  • Robust04: contains 0.5 million documents with 250 queries in total, from the TREC Robust Track 2004.
  • ClueWeb09-Cat-B: a large web collection containing 50 million documents, with 150 queries in total from the TREC Web Tracks 2009, 2010, and 2011.
  • Gov2: a large web collection of pages crawled from the .gov domain, containing 25 million documents, with 150 queries in total from the TREC Terabyte Tracks 2004, 2005, and 2006.
  • MS MARCO: provides a large number of informational, question-like queries from Bing's search logs. The passages are annotated by humans with relevant/non-relevant labels. There are 8,841,822 documents in total, with 808,731, 6,980, and 48,598 queries for training, validation, and testing, respectively.
  • Million Query TREC 2007 (MQ2007): a LETOR benchmark dataset built on the Gov2 web collection, with 1,692 queries and 65,323 annotated documents.
  • Million Query TREC 2008 (MQ2008): another LETOR benchmark dataset that also uses the Gov2 web collection, with 784 queries and 14,383 annotated documents.

13.CQA

Community question answering (CQA) is the task of automatically searching for relevant answers among the many answers provided for a given question (answer selection), and of searching for relevant questions to reuse their existing answers (question retrieval).

Dataset | Domain | #Questions | #Answers
TRECQA | Open-domain | 1,229 | 53,417
WikiQA | Open-domain | 3,047 | 29,258
InsuranceQA | Insurance | 12,889 | 21,325
FiQA | Financial | 6,648 | 57,641
Yahoo! Answers | Open-domain | 50,112 | 253,440
SemEval-2015 Task 3 | Open-domain | 2,600 | 16,541
SemEval-2016 Task 3 | Open-domain | 4,879 | 36,198
SemEval-2017 Task 3 | Open-domain | 4,879 | 36,198
  • The TRECQA dataset was created by Wang et al. from TREC QA track 8-13 data; candidate answers were automatically selected from each question's document pool using a combination of overlapping non-stopword counts and pattern matching.
  • WikiQA is a publicly available set of question and sentence pairs, collected and annotated by Microsoft Research for research on open-domain question answering.
  • InsuranceQA is a non-factoid QA dataset from the insurance domain. Questions may have multiple correct answers, and questions are generally much shorter than answers. For each question in the development and test sets, there are 500 candidate answers.
  • FiQA is a non-factoid QA dataset from the financial domain, released for the WWW 2018 challenge (introduced earlier).
  • Yahoo! Answers is a website where people post questions and answers, all of which are public to any web user willing to browse or download them. In this dataset, answers are relatively longer than in TrecQA and WikiQA.
  • SemEval-2015 Task 3 contains two subtasks.
    • In subtask A, given a question (short title + extended description) and several community answers, classify each answer as relevant, potentially useful, bad, or irrelevant.
    • In subtask B, given a YES/NO question (short title + extended description) and a list of community answers, decide whether the global answer to the question should be YES, NO, or UNSURE.
  • SemEval-2016 Task 3 comprises two subtasks: question-comment similarity and question-question similarity.
    • In the question-comment similarity task, given a question from a question-comment thread, rank the comments according to their relevance to the question.
    • In the question-question similarity task, given a new question, rerank all similar questions retrieved by a search engine.
  • SemEval-2017 Task 3 contains two subtasks: question similarity and relevance classification. Given a new question and a set of related questions from the collection, the question similarity task ranks the similar questions according to their similarity to the original question, while relevance classification ranks the answer posts in the related question-answer threads according to their relevance to the question.

14.NLI

Natural language inference is the task of determining whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise.

Dataset | #Sentence pairs
SNLI | 570K
MultiNLI | 433K
SciTail | 27K
  • SNLI (Stanford Natural Language Inference) has 570k human-annotated sentence pairs. The premises are captions from the Flickr30k corpus, and the hypotheses were manually composed.
  • MultiNLI (Multi-Genre NLI) has 433k sentence pairs, collected with a process and task design closely modeled on SNLI. Premises are drawn from a broad range of American English genres, such as non-fiction, spoken language, less formal written genres (fiction, letters), and a specialized 9/11 genre.
  • The SciTail entailment dataset consists of 27k pairs. Unlike SNLI and MultiNLI, it was not crowdsourced but created from sentences that already exist "in the wild": hypotheses were created from science questions and their corresponding answer candidates, while relevant web sentences from a large corpus serve as premises.

15.Paraphrase Identification

Paraphrase identification is the task of deciding whether two sentences have the same meaning.

Dataset | #Sentence pairs
MRPC | 5,800
STS | 1,750
SICK-R | 9,840
SICK-E | 9,840
Quora Question Pairs | 404,290
  • MRPC (Microsoft Research Paraphrase Corpus) contains 5,800 pairs of sentences extracted from news sources on the web, annotated to indicate whether each pair captures a paraphrase/semantic-equivalence relationship.
  • SentEval includes semantic relatedness datasets, among them SICK and the STS Benchmark. The SICK dataset comprises two subtasks, SICK-R and SICK-E. For STS and SICK-R, the model learns to predict a relatedness score between two sentences; SICK-E uses the same sentence pairs as SICK-R but can be treated as a three-way classification problem (the classes are entailment, contradiction, and neutral).
  • Quora Question Pairs is a task released by Quora for identifying duplicate questions. It consists of over 400,000 question pairs from Quora, each annotated with a binary label indicating whether the two questions are paraphrases of each other (mentioned earlier).
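
As a point of reference, the task can be sketched with a trivial lexical-overlap baseline: Jaccard similarity over tokens with an arbitrary threshold. This is a crude stand-in, not how systems trained on corpora like MRPC actually decide:

```python
def jaccard(s1: str, s2: str) -> float:
    """Token-overlap similarity between two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def is_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    """Crude stand-in for a learned paraphrase classifier; threshold is arbitrary."""
    return jaccard(s1, s2) >= threshold

print(is_paraphrase("the cat sat on the mat", "the cat sat on a mat"))  # True
print(is_paraphrase("hello world", "completely different sentence"))    # False
```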

16.Response retrieval

Response retrieval/selection aims to rank/select a proper response from a dialogue repository. Automatic conversation (AC) aims to create an automatic human-computer dialogue process for the purposes of question answering, task completion, and social chat. In general, AC can be formulated either as an IR problem, which aims to rank/select a proper response from a dialogue repository, or as a generation problem, which aims to generate an appropriate response for the input utterance. Here, response retrieval refers to the IR-based way of conducting conversations.

Dataset | Partition | #Context-response pairs | #Candidates per context | Positive:Negative | Avg. #turns per context
UDC | train/validation/test | 1M/500k/500k | 2/10/10 | 1:1/1:9/1:9 | 10.13/10.11/10.11
Douban | train/validation/test | 1M/50k/10k | 2/2/10 | 1:1/1:1/1.18:8.82 | 6.69/6.75/6.45
MSDialog | train/validation/test | 173k/37k/35k | 10/10/10 | 1:9/1:9/1:9 | 5.0/4.9/4.4
EDC | train/validation/test | 1M/10k/10k | 2/2/10 | 1:1/1:1/1:9 | 5.51/5.48/5.64
Persona-Chat | train/validation/test | 8939/1000/968 | 20/20/20 | 1:19/1:19/1:19 | 7.35/7.80/7.76
CMUDoG | train/validation/test | 2881/196/537 | 20/20/20 | 1:19/1:19/1:19 | 12.55/12.37/12.36
  • The Ubuntu Dialog Corpus (UDC) contains multi-turn dialogues collected from chat logs of the Ubuntu forum. The dataset contains 1 million context-response pairs for training, 500k for validation, and 500k for testing. Positive responses are true responses from humans, while negative responses are randomly sampled. The positive-to-negative ratio is 1:1 in training and 1:9 in validation and testing.
    • Link: Ubuntu Dialogue Corpus | Kaggle
  • The Douban Conversation Corpus is an open-domain dataset built from Douban groups. It consists of 1 million context-response pairs for training, 50k for validation, and 10k for testing, with 2, 2, and 10 candidate responses per context, respectively. Candidate responses in the test set were retrieved from Sina Weibo and labeled by human judges.
  • MSDialog is a labeled set of question-answering (QA) dialogues between information seekers and answer providers from an online forum on Microsoft products (Microsoft Community). The dataset contains over 2,000 multi-turn information-seeking conversations with 10,000 utterances annotated with user intent at the utterance level.
  • The E-commerce Dialogue Corpus contains more than five types of conversations (e.g., commodity consultation, logistics, recommendation) about more than 20 commodities. The positive-to-negative ratio is 1:1 in training and validation, and 1:9 in testing.
  • Persona-Chat dataset
    • Link: Download and load persona-chat json dataset (github)
  • CMUDoG dataset: a "document-grounded conversation" is defined as a conversation about the contents of a specified document; in this dataset, the specified documents are Wikipedia articles about popular movies. The dataset contains 4,112 conversations with an average of 21.43 turns per conversation. This means the dataset not only provides relevant chat history for generating a response, but also a source of information the models can draw on.
    • Link: festvox/datasets-CMU_DoG: CMU Document Grounded Conversation Dataset (github)
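
Evaluation on these corpora is typically reported as Recall@k over a fixed candidate set, e.g. one positive among 9 or 19 negatives, matching the ratios in the table above. A minimal sketch of the metric for a single context:

```python
def recall_at_k(scored_candidates, k):
    """scored_candidates: (score, is_positive) pairs for ONE dialogue context.
    Returns 1.0 if a true response is ranked within the top k, else 0.0."""
    ranked = sorted(scored_candidates, key=lambda c: c[0], reverse=True)
    return 1.0 if any(is_pos for _, is_pos in ranked[:k]) else 0.0

# Toy example: 10 candidates (1 positive, 9 negatives) scored by some matching model.
candidates = [(0.95, True)] + [(0.1 * i, False) for i in range(9)]
print(recall_at_k(candidates, 1))                    # 1.0
print(recall_at_k([(0.9, False), (0.5, True)], 1))   # 0.0
```

Averaging this over all test contexts gives the R@k numbers commonly reported for UDC-style benchmarks.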
