Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such “in-the-wild” data collected by AutoRT is significantly more diverse, and that AutoRT’s use of LLMs allows for instruction-following data collection robots that can align to human preferences.



1 Introduction

One of the central goals of autonomous robotics research is to enable independent and broadly capable robotic agents: systems that can be tasked with some high-level goals (“keep the kitchen clean”), formulate plans for addressing these goals, and then carry out those plans with the skills and resources available to them. While current robotic learning methods offer appealing solutions for acquiring individual robotic skills, and large language models (LLMs), vision-language models (VLMs) and large multimodal models offer the ability to reason over such abstract tasks (Ahn et al., 2022; Rana et al., 2023), truly open-ended tasks still present major challenges. Performing an innumerable number of tasks in diverse settings requires a grounded and generalist agent that can robustly adapt to scenarios outside those in which robots are trained. The bottleneck for achieving these goals, however, is the need for large amounts of robotic experience in the real world – much larger than robot datasets collected in lab settings with well-defined environments.


In this paper, we study how we can design agents to gather robotic experience for themselves at scale. Central to our work is leveraging knowledge contained in foundation models to drive real-world robots. We specifically focus on diverse robotic data acquisition: when a robot is placed in a new environment, potentially with a user command to collect data around some theme (e.g. office tasks), the robot should determine which tasks can be performed, which of its own skills to trigger to attempt them, and when it should rely on human teleoperators. We view this from the perspective of controlling a fleet of robots, spread across multiple locations, where there are many more robots than human supervisors, necessitating mixing expert demonstrations with suboptimal autonomous policies in a safe and appropriate way. Our system for large-scale orchestration of robotic agents, which we call AutoRT, tackles this problem.


At the core of AutoRT is a large foundation model that acts as a robot orchestrator, prescribing appropriate tasks to one or more robots in an environment based on the user’s prompt and environmental affordances (“task proposals”) discovered from visual observations. The scene description step perceives objects in the environment, the task proposal step suggests possible things the robot could do with them, and then the affordance filtering step decides which tasks to attempt, and how, based on these observations and the prompt. This process takes into account constraints specified via “constitutional prompting”, where rules about robot behaviour can be defined by the user. It additionally accounts for the availability of human teleoperators, and handles working around the capabilities of the robot (e.g., the robot can pick up a cup but not a table, it can approach the sink but can’t sit in a chair, etc.).


We describe the AutoRT system, instantiate it with a fleet of real-world mobile manipulators, and present the results of an extensive real-world evaluation over 7 months, 4 different office buildings, and a fleet of over 20 robots, which resulted in the collection of 77,000 real-world robotic trials with both teleoperation and autonomous execution. AutoRT is, to the best of our knowledge, the first system where LLM-controlled robots are allowed to drive autonomously in real world settings, propose their own goals, and take actions toward those goals. We show that AutoRT scales robot deployment by allowing 1 human to supervise 3-5 mobile manipulators. Our evaluation studies how AutoRT can collect highly diverse data and be instructed to collect task-appropriate data, and shows that such data can be used to improve state-of-the-art robot learning models. AutoRT also introduces aligning robot behavior to human preferences using prompting and critiquing with a robot constitution.



2 Related Work

Real robot data collection. Large scale real robot data collection for robotic manipulation falls into mainly two categories: autonomous data collection and human assisted demonstrations. Autonomous data collection in prior works is often conducted in constrained robot lab environments, on tasks like grasping (Pinto & Gupta, 2015; Levine et al., 2016; Kalashnikov et al., 2018; Platt, 2022), pushing (Yu et al., 2016; Ebert et al., 2018; Dasari et al., 2020), or pick and place (Kalashnikov et al., 2021; Bousmalis et al., 2023). Our work focuses on tackling more varied environments, similar to Gupta et al. (2018), and tackling a wider set of tasks. Human demonstrated data collection can be done in varied environments (Sharma et al., 2018; Mandlekar et al., 2019; Jang et al., 2021; Brohan et al., 2022), and teleoperated data can be far more diverse and valuable for skill learning than autonomously collected data, but is bottlenecked by availability of humans when scaling to many robots. This motivates hybrid approaches that mix teleoperation and autonomous policies, such as DAgger style methods (Ross et al., 2011; Kelly et al., 2019; Hoque et al., 2022). AutoRT is such a hybrid approach, collecting both teleoperated and autonomous episodes based on supply of human supervision, with a focus on collecting data on novel tasks in novel environments.


Large language models. Many recent works have studied using LLMs to generate agent-like behavior (Shinn et al., 2023; Yao et al., 2022; Park et al., 2023), improve embodied reasoning (Driess et al., 2023), and write robotics code (Vemprala et al., 2023; Liang et al., 2022). Works like Ahn et al. (2022) and Rana et al. (2023) use LLMs to generate language plans for robots to solve an instruction given by a user. Our work self-generates instructions for the robot to perform, which was proposed in Xian et al. (2023). Most similar is Voyager (Wang et al., 2023), an LLM-driven agent that autonomously explores a Minecraft environment. AutoRT runs on a real-world robot for extended periods of time, introducing challenges like reliability and safety that are less present in simulated environments.



3 Problem Statement

In this work, our goal is to build a system that enables large-scale, “in-the-wild” data collection to generate diverse, real-world robot data on new skills in new environments.


To do so, we assume access to a large fleet of N robots, capable of navigating across multiple buildings, and manipulating objects. The buildings are populated, where both robots and people are free to move around the space. We do not make any assumptions about the layout of the buildings, or the objects available for manipulation. We assume a limited bandwidth of human supervision, meaning there are more robots than human supervisors – that is, we cannot expect that a human will always be in charge of teleoperating a single robot.


Our goal is to have a single system that can handle any state $s \in S$ observed by a robot, and generate tasks $t$ executable by one of $k$ different collect policies $\pi \in \{\pi^1, \dots, \pi^k\} = \Pi$. For instance, $\pi^i$ can be an autonomous policy $\pi^{auto}_i$, either hand-designed or learned a priori, or a policy executed by querying a human teleoperator, i.e., $\pi^{teleop}_i$. The goal of such a system, $S \to \Pi$, is to guide the data collection of the fleet of $N$ robots by observing the state $s$ and using this information to identify a set of feasible language-specified tasks $t$ that correspond to specific policies $\pi$. In addition, the system needs to take into account other factors that impact the throughput of data collection and safety. These include tradeoffs between autonomous and teleoperated policy primitives, and the generation of diverse and novel task proposals while at the same time considering guardrails and safety criteria.



4 AutoRT: Exploring and Executing in the Wild

In this section, we describe each component of AutoRT, which is visualized in Fig. 1. At a high level, AutoRT gathers data via an open vocabulary object detector to first understand and describe the scene, then an LLM parses this description and generates sensible and safe language goals given high-level objectives, and finally an LLM is used to determine how to execute these goals.


Figure 1: System diagram for AutoRT. Each robot explores the environment, sampling a random navigation target close to objects. The scene and objects in it are described by a VLM to give text to an LLM, which generates manipulation tasks for the robot. Valid tasks are run by the robot, the episodes are scored, and the process repeats. No part of this requires advance knowledge of the layout of the environment or objects it contains, making it easy to run on a fleet of 20+ robots that are each in novel settings. Green sections are contributions of this work.


The robot platform used in AutoRT is a mobile manipulator with a camera, robot arm, and mobile base. Herein, we only consider manipulation data collection, so navigation is only used to gather diverse manipulation settings – however, we note that the system is general to other robotic embodiments and modes of collection. Further details on the robot platform and the implementation are in Appendix A.



4.1 Exploration: Navigating to the Target

The first stage of AutoRT is to explore the space and find interesting scenes for manipulation. To map the environment, we use the natural language map approach proposed by Chen et al. (2023), which is built using a VLM to encode object detections into visual-language embeddings $\phi_i$, with corresponding position $(x_i, y_i, z_i)$ determined by the robot’s depth sensor and SLAM. Thus, given a textual target $q$ like “sponge”, we can direct the robot towards a sponge by querying for a $\phi_i$ that is close to the text embedding for $q$. To determine navigation goals, we sample this map for regions of interest by sampling states with probability proportional to their latent distance to an average embedding of previously seen objects (see Appendix B for more details). For each environment, this map is generated once, then copied to all robots collecting in the space and loaded from cache to save time in future episodes.
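The two map operations described above, querying by text and sampling a navigation goal, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MapEntry` type, the function names, and the toy 2-d embeddings are hypothetical, cosine similarity is assumed for "close to the text embedding", and Euclidean distance to the mean embedding is assumed for "latent distance" (Appendix B describes the actual procedure).

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MapEntry:
    """One detected object in the map (hypothetical structure)."""
    label: str                            # human-readable tag, used only in this example
    embedding: List[float]                # visual-language embedding phi_i
    position: Tuple[float, float, float]  # (x_i, y_i, z_i) from depth sensing and SLAM


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def query_map(entries: List[MapEntry], text_embedding: Sequence[float]) -> MapEntry:
    """Return the entry whose embedding is most similar to the text embedding of q."""
    return max(entries, key=lambda e: cosine(e.embedding, text_embedding))


def sample_navigation_target(
    entries: List[MapEntry],
    seen_embeddings: List[Sequence[float]],
    rng: random.Random,
) -> MapEntry:
    """Sample an entry with probability proportional to its distance from the
    average embedding of previously seen objects (assumed Euclidean)."""
    dim = len(seen_embeddings[0])
    mean = [sum(e[d] for e in seen_embeddings) / len(seen_embeddings) for d in range(dim)]
    weights = [math.dist(e.embedding, mean) for e in entries]
    return rng.choices(entries, weights=weights, k=1)[0]


entries = [
    MapEntry("sponge", [1.0, 0.0], (1.0, 2.0, 0.8)),
    MapEntry("cup", [0.0, 1.0], (3.0, 0.5, 0.9)),
]
target = query_map(entries, [0.9, 0.1])  # toy stand-in for the text embedding of "sponge"
print(target.label, target.position)
```

Weighting by distance from the running mean biases the robot toward objects unlike those it has already visited, which matches the stated aim of finding diverse manipulation scenes.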



4.2 Robot Constitution

Key to safe robot operation is breaking down high level objectives relevant to humans into tasks a robot may perform. We specify this to robots using what we call a Robot Constitution, a list of rules an LLM is instructed to follow, inspired by methods like Constitutional AI (Bai et al., 2022). These rules are divided into three categories:


• Foundational rules inspired by Asimov’s three laws (Asimov, 1942) that govern robotics in general and govern interactions with humans. We modify the exact text of these laws as described in Appendix D.

• Safety rules describing what tasks are considered unsafe or undesired based on current capabilities in deployment. These discourage the collect policies from interacting with humans or animals. They also discourage handling sharp and fragile objects or electrical equipment.

• Embodiment rules describing limitations of the robot’s embodiment, such as its maximum payload and its unimanual nature, to discourage attempting tasks with heavier objects or tasks that require two arms (e.g. “opening a fridge and picking up a drink”).




A fourth category, the guidance rules, provides an input for an optional high-level human command: “The human command, which the robot should follow if given: {guidance}”. The way the robot constitution is used in task generation and affordance is explained below.



4.3 Task Generation

Once a robot is in front of a manipulation scene $s_i$, it needs to generate a list of manipulation tasks to attempt. This is done via two steps:


• Scene description: Given an image from the robot camera, a VLM outputs text describing the scene the robot observes, and 5 objects that exist in that scene. For example, as shown in Fig. 1, the VLM lists soap, napkin, snack, cloth, sponge in the given scene.


• Task proposal: In this step, AutoRT is prompted to generate a list of tasks. This prompt begins with a system prompt, such as: “I am a robot operating in an office environment”, which describes the role the LLM should play. It continues with a list of rules that should be followed for task generation, codified by the robot constitution. The prompt ends with a section where we can inject the scene and object description from the prior VLM call. Given this prompt, an LLM generates a list of potential manipulation tasks (see Fig. 1). We note that the LLM is not fine-tuned to our specific use case, to maintain the generality of the underlying model.


An important detail of AutoRT is that we use multiple collect policies $\{\pi^1, \pi^2, \dots, \pi^k\}$, sampling one each episode. When the collect policy is sampled, task generation must be modified to match the capabilities of that policy. Thus, for each policy $\pi^j$, we append a $\pi^j$-specific suffix to the end of the task generation prompt. See Appendix D for full text of the prompts.
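The prompt layout described in this section (system prompt, constitution rules, scene and object description, then a policy-specific suffix) can be sketched as below. This is an illustrative sketch only: `build_task_prompt`, the section labels, and the placeholder rule strings are hypothetical, and the real prompt text is in Appendix D. The system prompt and guidance rule are quoted from the text above; the teleop suffix line is one quoted in Section 4.5.

```python
from typing import List, Optional

# Guidance rule text quoted from Section 4.2.
GUIDANCE_RULE = "The human command, which the robot should follow if given: {guidance}"


def build_task_prompt(
    system_prompt: str,
    constitution: List[str],
    scene: str,
    objects: List[str],
    policy_suffix: str,
    guidance: Optional[str] = None,
) -> str:
    """Assemble the prompt: role, numbered rules, scene and objects, policy suffix."""
    rules = list(constitution)
    if guidance is not None:
        # The optional fourth rule category carries the high-level human command.
        rules.append(GUIDANCE_RULE.format(guidance=guidance))
    lines = [system_prompt, "", "Rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    lines += ["", f"Scene: {scene}", "Objects: " + ", ".join(objects), "", policy_suffix]
    return "\n".join(lines)


prompt = build_task_prompt(
    system_prompt="I am a robot operating in an office environment",
    constitution=["(placeholder foundational rule)", "(placeholder safety rule)"],
    scene="(placeholder scene description from the VLM)",
    objects=["soap", "napkin", "snack", "cloth", "sponge"],
    policy_suffix="none of these tasks should be simple pick and place",
    guidance="office tasks",
)
print(prompt)
```

Keeping the suffix as the last element means the same base prompt can be reused across collect policies, with only the final lines changing per episode.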



4.4 Affordance

Tasks generated by the LLM on the first pass may not fully follow the provided prompt and thus AutoRT uses an extra step of task filtering. This is done using another prompted LLM; one can view this as a self-reflection step where an LLM is prompted to critique its own output, inspired by approaches such as Reflexion (Shinn et al., 2023), ReAct (Yao et al., 2022), and Constitutional AI (Bai et al., 2022).


During the affordance step, in addition to the robot constitution, the LLM is further prompted with the list of collect policies available and text summaries of what each collect policy can do. For each generated task, the LLM is asked to either output a collect policy or a reason to reject that task. A few examples are provided to guide the LLM output into the desired format. This can be viewed as a classifier between the $k$ collect policies, with an extra category for unknown tasks. The final task is then selected by randomly sampling from the accepted tasks. For instance, as shown in Fig. 1, the originally sampled policy is $\pi^{teleop}$. The first two tasks proposed by the LLM are classified as $\pi^{teleop}$, the second two tasks are classified as $\pi^{rt2}$, an autonomous policy from Brohan et al. (2023), and the last task is rejected because the embodiment of the robot does not allow for a bimanual task. The final task is sampled from the first two tasks. We found that classifying between all collect policies worked fine, even though for filtering it would be sufficient to classify between $\pi^i$ and not-$\pi^i$ per episode.
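The selection logic after classification can be sketched as follows, mirroring the example in which two tasks are assigned to teleop, two to RT-2, and one is rejected. The dictionary representation, the policy names as strings, and the placeholder task names are assumptions made for illustration; the LLM call that produces the classification is not shown.

```python
import random
from typing import Dict, List, Optional

REJECT = None  # marker for a task the LLM declined to assign to any collect policy


def accepted_tasks(classified: Dict[str, Optional[str]], sampled_policy: str) -> List[str]:
    """Keep only tasks assigned to the collect policy sampled for this episode."""
    return [task for task, policy in classified.items() if policy == sampled_policy]


def select_task(
    classified: Dict[str, Optional[str]],
    sampled_policy: str,
    rng: random.Random,
) -> Optional[str]:
    """Randomly pick one accepted task, or None if nothing matches the policy."""
    candidates = accepted_tasks(classified, sampled_policy)
    return rng.choice(candidates) if candidates else None


# Placeholder task names; the policy labels follow the example in the text.
classified = {
    "task A": "teleop",
    "task B": "teleop",
    "task C": "rt2",
    "task D": "rt2",
    "task E": REJECT,  # e.g. rejected because it needs two arms
}
print(select_task(classified, "teleop", random.Random(0)))
```

Returning `None` when no task survives leaves it to the caller to decide whether to regenerate tasks or move to another scene; the paper does not specify this case.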



4.5 Data Collection

Any number of collect policies could be used, but our instance of AutoRT uses three: teleoperation, a scripted pick policy, and RT-2 (Brohan et al., 2023). The scripted pick policy pseudocode is provided in Appendix H. Each $\pi^i$ has a different sampling probability $p_i$ that is adjusted during collect primarily based on the number of robots supervised per person. For example, if 1 person is supervising 3 robots, then the human teleoperation collect policy is sampled $p < \frac{1}{3}$ of the time to respect available supervision. After manipulation, the episode’s diversity is scored (see Section 5.1 for how), and the robot resets to start again. The human supervisor may occasionally reset the environment by hand.
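One way to picture the sampling constraint is the sketch below, which caps the teleoperation probability strictly below 1/(robots per human) and splits the remainder across autonomous policies. The `margin` factor, the relative weights, and the function name are hypothetical; the paper states only that the probabilities are adjusted based mainly on the number of robots per supervisor.

```python
from typing import Dict


def collect_policy_probs(
    robots_per_human: int,
    autonomous_weights: Dict[str, float],
    margin: float = 0.9,
) -> Dict[str, float]:
    """Return sampling probabilities with p_teleop < 1 / robots_per_human.

    `margin` (assumed, in (0, 1)) keeps teleop strictly under the cap;
    the remaining mass is split across autonomous policies by weight.
    """
    if robots_per_human < 1:
        raise ValueError("robots_per_human must be at least 1")
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must lie strictly between 0 and 1")
    p_teleop = margin / robots_per_human
    total = sum(autonomous_weights.values())
    probs = {"teleop": p_teleop}
    for name, weight in autonomous_weights.items():
        probs[name] = (1.0 - p_teleop) * weight / total
    return probs


# One person supervising three robots; weights are illustrative only.
probs = collect_policy_probs(3, {"scripted_pick": 3.0, "rt2": 1.0})
print(probs)
```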


Recent works like Brohan et al. (2023) suggest Internet-scale visual-language data can drive generalization in downstream robotic models. Assuming these trends continue, the upcoming bottleneck will be action diversity: collecting useful, diverse motions that make progress towards new tasks in novel environments. Teleoperated data is the most action-diverse, so we focus on keeping throughput of teleoperation high (no worse than a “1 human 1 robot” setup), potentially at the cost of assisting autonomous robots less frequently. We additionally prompt task generation for teleop to collect varied tasks by including lines like “none of these tasks should be simple pick and place”. For a breakdown of throughput by collect policy, or visualization of action trajectories, see Appendix I.



4.6 Guardrails

AutoRT deploys foundation models in “in the wild” settings, but foundation models, even if prompted correctly and with instruction finetuning, have no guarantees on safety. We complement these with traditional robot environment controls as an additional layer of safety. These measures are detailed in Appendix C.



5 Experimental Evaluation

Our experimental evaluation studies the deployment of AutoRT in a variety of real-world environments, covering about 7 months, 4 different buildings, simultaneous operation of over 20 robots, and about 77,000 real-world robotic trials. We aim to evaluate the diversity of the data collected by AutoRT, the degree to which we can steer the tasks that AutoRT attempts by modifying the prompt, the semantic and functional appropriateness of the automatically generated task proposals, and an initial evaluation showing an example application of the AutoRT-collected data to improve the RT-1 (Brohan et al., 2022) model.


AutoRT Environment Scaling: Our collection environments for the robots include offices, kitchens, and cafeterias. The same code is used in every environment, with the only per-environment change being the difference in driving bounds, allowing AutoRT to start collecting in a new environment in < 1 day without too much setup. Some of these environments are shown in Fig. 2.


Figure 2: Examples of robot collect environments used. These environments have a variety of surfaces and semantically different objects to practice manipulation on, along with freedom for the robot to move between manipulation scenes.


AutoRT Robot Deployment Scaling: Each human supervised between 3 and 5 robots at once, allowing mobile manipulator deployment to scale faster than the number of humans employed. Some of AutoRT was run using stationary robots that skipped navigation, only running task generation and manipulation in a loop. These robots were easier to supervise due to their smaller range of motion, and were run with 1 human watching up to 8 robots. Human availability dictated the sampling ratios for collect policies.


Data statistics: In total, 53 robots were used to collect 77,000 new episodes over the course of 7 months, with a peak load of over 20 simultaneous robots. Over 6,650 unique instructions appear in the dataset. More details can be found in Fig. 3, Fig. 4 and Table 1. Interestingly, we find that RT-2 success rate is quite low during collection, because the complex environments, objects and requirement for navigation differed significantly from RT-2’s training set and inference capabilities. This influenced our decision to run RT-2 less frequently.


Figure 3: On the left is AutoRT robot usage and on the right is t-SNE visualization of tasks, colored by collect policy used. Each point corresponds to a different task string.


Figure 4: AutoRT episodes collected and unique tasks over time




5.1 Diversity Scoring

Given a fixed budget of human oversight and a fleet of robots, we aim to collect as much useful data as possible. Evaluating this is challenging, because downstream methods for utilizing such data are still imperfect – despite considerable recent progress, RL methods present scalability challenges to such diverse environments (Cobbe et al., 2020), while imitation learning methods require near-optimal data. Thus, our measure of success for AutoRT is the diversity of the collected data. We consider two different axes of diversity: visual diversity (how diverse are the collected trajectories visually), and language diversity (how diverse are the natural language instructions proposed by our system). We additionally present an evaluation of the RT-1 model via filtered BC in Section 5.4; however, we note our evaluation is preliminary, and we hope that future advances in low-level robotic learning algorithms (e.g., RL and IL) will lead to better approaches for utilizing such data.


Language diversity: To measure language diversity, we use the L2 distance in a language embedding space – specifically that of Universal Sentence Encoder (Cer et al., 2018), which produces normalized 512-d embeddings. We compare AutoRT’s task generation approach with the hand-designed tasks from three previous works: tasks from Language Table (Lynch et al., 2023), tasks from BC-Z (Jang et al., 2021), and tasks from RT-1 (Brohan et al., 2022). Table 2 shows AutoRT has higher average distance between language embeddings and generates more diverse language than all other approaches.
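As a rough illustration of this metric, the sketch below computes the mean pairwise L2 distance between unit-normalized embeddings. It assumes the "average distance" is taken over all unordered pairs, which the text does not state explicitly, and it uses toy 3-d vectors in place of real 512-d Universal Sentence Encoder outputs; the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


def mean_pairwise_l2(embeddings: List[Sequence[float]]) -> float:
    """Average L2 distance over all unordered pairs of normalized embeddings.

    Requires at least two embeddings. For unit vectors each distance lies
    in [0, 2], so larger values mean the instruction set is more spread out.
    """
    unit = [l2_normalize(e) for e in embeddings]
    dists = [math.dist(a, b) for a, b in combinations(unit, 2)]
    return sum(dists) / len(dists)


# Toy embeddings standing in for sentence-encoder outputs.
similar = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(mean_pairwise_l2(similar) < mean_pairwise_l2(varied))
```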


Table 2: Diversity of language embeddings from task generators. AutoRT generates language embeddings that are further apart.


We additionally use the language diversity score to compare two VLMs for scene description without generating large amounts of robot data. We compare PaLI (Chen et al., 2022) and FlexCap (Review, 2023). Keeping the LLM prompts fixed, we first sample 70 random scenes the robots saw so far. Each scene was described by each VLM, and their descriptions were passed to task generation. The diversity of language embeddings after affordance filtering was then used to score the VLMs. We found both VLMs led to better scores than our baselines. Qualitative examples of sampled tasks from the two VLMs are in Appendix G.


Visual diversity: To measure visual diversity, we utilize a clustering method similar to a diversity measure used in Tirumala et al. (2023). Robot episodes are first embedded by a visual encoder, then k-means unsupervised clustering is done in the space. New episodes are scored based on the distance from that episode’s embedding to the nearest k-means centroid. This distance is the diversity score, with larger distances indicating more novel data. We utilize a CLIP model as our embedder, finetuned to contrast {first image, goal image} embeddings with natural language captions (Xiao et al., 2023), and cluster with k = 1000. We found this was better at capturing semantic differences, although it does ignore intermediate images.
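The scoring step can be sketched as a nearest-centroid distance. The example below assumes the k-means centroids have already been fit on prior episode embeddings and uses Euclidean distance on toy 2-d vectors; the function names and data are hypothetical, and the fine-tuned CLIP embedder is not shown.

```python
import math
from typing import Dict, List, Sequence


def diversity_score(embedding: Sequence[float], centroids: List[Sequence[float]]) -> float:
    """Distance from an episode embedding to its nearest k-means centroid."""
    return min(math.dist(embedding, c) for c in centroids)


def rank_by_novelty(
    episodes: Dict[str, Sequence[float]],
    centroids: List[Sequence[float]],
) -> List[str]:
    """Order episode names from most novel (largest score) to least novel."""
    return sorted(
        episodes,
        key=lambda name: diversity_score(episodes[name], centroids),
        reverse=True,
    )


# Two toy centroids; the paper clusters with k = 1000.
centroids = [[0.0, 0.0], [10.0, 0.0]]
episodes = {"near_prior_data": [0.1, 0.0], "novel_scene": [3.0, 4.0]}
print(rank_by_novelty(episodes, centroids))
```

Because only the nearest centroid matters, an episode scores low whenever it resembles any cluster of prior data, however different it is from the rest.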


Fig. 5 shows the visual diversity across each of AutoRT’s data collection policies, along with the RT-1 dataset as a baseline. We find that the visual diversity is larger for each type of AutoRT data, with higher diversity in teleop than the scripted policy. Notably, RT-1’s dataset is only teleop, yet AutoRT is more diverse across all categories. Sample images are shown in Fig. 6. We also did an experiment where human supervisors directly optimized the visual diversity at collect time based on robot feedback. Further details are in Appendix E.


Figure 5: Visual diversity visualizations for AutoRT, as scored by distance to closest k-means centroid. Left: Histogram of 1000 random successes per collect policy (or all successes from RT-2 collect). Right: CDF of distributions, median of distribution annotated. Higher distances (more weight on the right) are further from prior data, and thus better. We find all AutoRT data is more diverse due to running in more varied environments, with teleop data from AutoRT scoring best.


Figure 6: Example last-frame images (color corrected) from RT-1 (left) and AutoRT (right)



5.2 Task Generation

In this section we study the quality of task generation prior to filtering based on feasibility (is the task possible) and relevance (does the task follow high-level guidance) and compare against two baselines. First, a simple templated language approach that matches a random verb from a hardcoded list with an object seen by the VLM. This mirrors the language instruction process used in RT-1. Second, to ablate how well AutoRT can be steered towards useful tasks, we consider an AutoRT (unguided) variant that removes the guidance rule from the prompt.

To evaluate, the robot is placed in front of 5 scenes. We generate 75 tasks in total, using guidance like “collect gardening tasks” or “how would you clean this mess?” for AutoRT (guided). Results are shown in Table 3. We find that AutoRT’s tasks (guided and unguided) are 1.5x more likely to be feasible than templated language. The large increase in feasibility is because naively mix-and-matching verbs is likely to generate nonsense language like “open keyboard”, whereas LLMs will tend to generate sensible language. We further find that we can guide task generation towards gardening, cleaning, etc., which is promising for allowing end-users to tell robots what data they would like them to collect. Qualitative outputs are in Appendix G.
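The templated baseline is simple enough to sketch directly, which also shows why it produces infeasible instructions. The verb list and function name below are hypothetical; only the pairing of a random verb with a detected object follows the description above.

```python
import random
from typing import List

# Hypothetical hardcoded verb list; the paper does not enumerate the verbs used.
VERBS = ["pick up", "open", "wipe", "push"]


def templated_task(objects: List[str], rng: random.Random) -> str:
    """Pair a random verb with a random detected object, ignoring plausibility."""
    return f"{rng.choice(VERBS)} {rng.choice(objects)}"


objects = ["keyboard", "sponge", "drawer"]
rng = random.Random(0)
print([templated_task(objects, rng) for _ in range(5)])

# Every verb-object pair is equally likely, so nonsense such as
# "open keyboard" is generated as readily as "open drawer".
all_pairs = {f"{verb} {obj}" for verb in VERBS for obj in objects}
print("open keyboard" in all_pairs)
```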


Table 3: Comparison of task generation methods at generating completable tasks and relevant tasks. Injecting the high-level guidance into the LLM prompt improves the relevance of generated tasks. Using an LLM at all improves both feasibility and relevance thanks to common-sense inherited from Internet-scale data.



5.3 Affordance and Robot Constitution

In this section we study the effect of constitutional prompting and LLM self-critiquing on identifying safe and feasible tasks. Task generation and filtering are evaluated via two metrics: % Safe, the fraction of safe and feasible tasks proposed by AutoRT, and Recall, how often the self-critiquing step correctly rejects unsuitable tasks generated during the task proposal step.


Accuracy of AutoRT Task Generation: Across a sample of 64 scenes, we consider all 259 tasks generated and label whether each task is safe and feasible to collect. In this sample, we found 31 tasks that ought to have been rejected, giving a base rate of 228/259 = 88% acceptable tasks. After the LLM affordance filtering step we see the rate of acceptable tasks increase to 200/214 = 93%.


When evaluating affordance, over-rejecting tasks is better than under-rejecting them, so we further evaluate the recall of rejected tasks. How often does the LLM reject (or fail to reject) tasks that should be rejected? Of the 31 unsuitable tasks, the LLM rejected 17/31 = 55% of them. Additionally, we find that all 14 errors occurred during teleop task sampling, attributable to forcing teleop task generation to remain highly diverse. These tasks were rejected by the teleoperator during collect, indicating the importance of human-in-the-loop supervision, both as a safety mechanism and as a source of intervention data to improve affordance of task generation.
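The reported rates follow from four counts given in the text, and the short check below reproduces them. The variable names are ours; the numbers are those stated above.

```python
total_tasks = 259        # tasks generated across the 64 sampled scenes
unsuitable = 31          # tasks labeled as ones that should be rejected
kept_after_filter = 214  # tasks remaining after LLM affordance filtering
kept_acceptable = 200    # of those remaining, tasks labeled acceptable

base_rate = (total_tasks - unsuitable) / total_tasks  # 228 / 259
filtered_rate = kept_acceptable / kept_after_filter   # 200 / 214
missed = kept_after_filter - kept_acceptable          # unsuitable tasks not rejected
recall = (unsuitable - missed) / unsuitable           # 17 / 31

print(f"base rate {base_rate:.0%}, after filtering {filtered_rate:.0%}, recall {recall:.0%}")
```

The same counts imply that the filter removed 259 - 214 = 45 tasks in total, of which 17 were unsuitable and 28 had been labeled acceptable, consistent with a filter that leans toward over-rejection.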




Adversarial Testing of Constitutional Prompting: To measure the effect of constitutional prompting, we set up deliberately adversarial scenes and ablate our rules from the task generation prompt and the affordance prompt. First, 5 test scenes were set up with objects that the robot should not interact with, including lifelike toy animals, sharp items, and people. Three task generation prompts are used: an unsafe prompt (designed to propose unsafe tasks), a minimal prompt (describing task generation without rules or constitution), and the constitutional prompt. These tasks are then filtered via two affordance prompts: a minimal one (describing affordance classification) and a constitutional one. Full prompt texts are in Appendix D.1. We show in Table 4 that the rate of safe tasks is significantly increased when the robot constitution is included at task generation time or affordance filtering time, with best results when it is included at both steps. Additionally, constitutional prompting is able to achieve high recall when given unsafe tasks.


Table 4: Effect of constitutional prompting on safety of proposed tasks



5.4 Model Training

The data generated by AutoRT covers a significantly wider range of language and visuals than datasets such as RT-1 (Brohan et al., 2022). As a sanity check on the usefulness of the data, we run a training comparison with the RT-1 model. A pretrained RT-1 model is co-fine-tuned on a 50-50 mixture of the pretraining dataset described in Brohan et al. (2022) and AutoRT’s dataset. RT-1 is used instead of RT-2 because it trains more quickly and cheaply.
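A 50-50 mixture can be realized in several ways; the text does not say whether mixing happens per example or per batch. The sketch below assumes per-example sampling, and `mixture_stream`, `autort_weight`, and the toy datasets are hypothetical names used only for illustration.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def mixture_stream(
    pretrain: Sequence[T],
    autort: Sequence[T],
    autort_weight: float = 0.5,
    seed: int = 0,
) -> Iterator[T]:
    """Yield training examples, drawing from `autort` with probability
    `autort_weight` and from `pretrain` otherwise (assumed per-example mixing)."""
    rng = random.Random(seed)
    while True:
        source = autort if rng.random() < autort_weight else pretrain
        yield rng.choice(source)


# Toy stand-ins for the two datasets; real episodes would be trajectories.
pretrain_data = [("pretrain", i) for i in range(100)]
autort_data = [("autort", i) for i in range(20)]

stream = mixture_stream(pretrain_data, autort_data)
draws = [next(stream) for _ in range(10_000)]
autort_fraction = sum(1 for name, _ in draws if name == "autort") / len(draws)
print(f"fraction drawn from AutoRT data: {autort_fraction:.3f}")
```

Sampling by source rather than by concatenating the datasets keeps the ratio fixed even when the two datasets differ greatly in size.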


The co-fine-tuned model is evaluated on two tasks we find RT-1 generalizes poorly to: picking from different heights, and wiping. Exact evaluation instructions and details are in Appendix F. When co-fine-tuned, RT-1’s performance increases from 0% to 12.5% on picking from different heights, and from 10% to 30% on wiping. We additionally include an ablation where we train on only the teleoperated segment of AutoRT data. We find this model is no longer able to pick from different heights, indicating that non-teleoperated AutoRT data can be useful. These increases are modest, but we note that the focus of AutoRT was on collecting diverse data, not on achieving high success rates. RT-1 training was done to verify that the data could improve the model, but the high diversity of tasks and scenarios leads to a challenging learning problem that is hard to perform well at.


Table 5: Results from co-finetuning RT-1 on AutoRT data



6 Conclusion, Limitations, and Future Work

We presented AutoRT, an approach for directing fleets of robots to collect data in the real world, autonomously and with human help, supervised by large-scale vision and language models. We demonstrated that this approach results in useful, diverse, and large-scale data – leading to 77k real-world demonstrations collected by over 20 robots in 7 months in 4 buildings. We further introduced a robot constitution – which defined foundational rules, outlined safety constraints, and detailed the robot’s embodiment – and ablated the system design to show its usefulness. Finally, by training a model on this collected data we demonstrated novel capabilities and improved generalization over state-of-the-art models. We believe this work is a step towards scaling robot data collection to the breadth of foundation models as well as embodying foundation models into robotic systems.


Despite the promise of AutoRT, the current approach comes with a number of limitations.


  1. AutoRT relies in large part on scripted and learned policies to scale collection for a fixed teleoperation budget. If these policies only handle simpler tasks or have lower success rates in unseen settings, the throughput of successful episodes drops. Scaling the generation of higher-quality data requires more robust and diverse autonomous collect policies, as in Arenas et al. (2023).

  2. Communication bandwidth between scene description and language model can introduce an information bottleneck in AutoRT. Failures of perception such as hallucination of objects, lack of generalization to novel environments, and motion blur can introduce and propagate failures in the system. As noted by prior work (Ahn et al., 2022; Mees et al., 2023; Gao et al., 2023), foundation models also face challenges in reasoning about task- and embodiment-specific information, such as the physics of objects and the capabilities of the robot. We ignored this for simplicity, but expect future efforts to require more accurate real-world reasoning.

  3. The type of data collected by AutoRT tends to be highly diverse, leading to fewer samples per task and lots of variety in scenes and object configurations. This “sparse” data presents a harder learning problem than the datasets used in existing state-of-the-art robot learning methods like Brohan et al. (2022) and Brohan et al. (2023). AutoRT assumes data collection is decoupled from the control policy, but achieving the best policy improvement would likely require the two to evolve in tandem with each other.

  4. Though constitutional prompting improves the safety of generated tasks, prompting an LLM does not guarantee that the prompt’s instructions will be followed, and a small percentage of unsafe tasks generated by the LLM will pass the affordance filtering. This necessitates some degree of human supervision.

As we explore future directions, a chief question is how a robot should autonomously act in the world. What we call a robot constitution has historically been a topic reserved for science fiction (Asimov, 1942), but this work concretizes a real application where such rules could be helpful. We also see future work in treating model improvement and data collection as a single goal, rather than two separate areas, with an eye on identifying proximal skills and improving sample efficiency via directed data collection.



We thank Celeste Barajas, Joseph Dabis, Gavin Gonzalez, Tomas Jackson, Alex Luong, Utsav Malla, Emily Perez, Elio Prado, Jornell Quiambao, Sangeetha Ramesh, Jaspiar Singh, Clayton Tan, Jodexty Therlonge, Eric Tran, Steven Vega, and Samuel Wan for assistance on data collection, model evaluation, and AutoRT supervision. We thank Anthony Brohan and Noah Brown for assistance on data analysis. We thank David DoVo, Regine Firmeza, Tad Koch, Gus Kouretas, Jessica Lam, Thien Nguyen, and Eric Zankiewicz for robot setup and maintenance. We thank Nicolas Heess, Jacky Liang, Vincent Vanhoucke, and Andy Zeng for providing feedback on paper drafts.





A Robot and System Setup

Each robot is a 7 DoF robot arm attached to a mobile base, with a camera mounted on the head of the robot. The robot is capable of both navigation and manipulation. At collection time, the robot is driven to a location which could be either a natural environment, such as an office area, a kitchen area, a lounge, or an artificially set up room with objects on different surfaces. The robots are given the bounding box of the region they should stay within for safety purposes, but are not given any information on object locations ahead of time, and must explore the area to find objects for themselves.


The code is structured in a form we call the policy graph. Each node v ∈ V of the policy graph is a subpolicy π(a|s,data), where s is the robot state, a is the robot action, and data is information that accumulates as we go through the graph. The collect policies {π_1, …, π_k} are themselves subpolicies in the policy graph, but the policy graph also includes subpolicies for navigation, and subpolicies whose only focus is querying the LLM. Subpolicies that do not move the robot simply output a no-op action a.


After every timestep, we check the transition conditions β defined for each node. Transition conditions β : S × Data → {0,1} × V are functions that take the current state and accumulated data, and decide if a subpolicy should yield control to the next node, and if so, which one. These conditions are similar to those in a finite-state machine. A given node can have multiple incoming and outgoing transition conditions. When there are multiple outgoing conditions, only one should be true at a time. For example, in Fig. 5 the AffordanceFilter has k outgoing transition conditions, one for each of the collect policies π_i ∈ {π_1, …, π_k}, and the DiversityScoring node has k incoming transition conditions, one from each collect policy.
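The policy graph described above can be sketched as a small finite-state machine. This is an illustrative sketch only: the class names `Node` and `Transition`, the `step` and `run` helpers, and the toy policies are assumptions, not the AutoRT implementation. Only the node names AffordanceFilter and DiversityScoring come from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Data = Dict[str, Any]
Action = Optional[str]  # None stands in for the no-op action


@dataclass
class Transition:
    """Outgoing edge: fires when condition(state, data) is true."""
    target: str
    condition: Callable[[State, Data], bool]


@dataclass
class Node:
    """A subpolicy pi(a | s, data) plus its outgoing transition conditions."""
    name: str
    policy: Callable[[State, Data], Action]
    transitions: List[Transition] = field(default_factory=list)


def step(graph: Dict[str, Node], current: str, state: State, data: Data) -> Tuple[Action, str]:
    """Run one timestep of the current node, then check its transitions."""
    node = graph[current]
    action = node.policy(state, data)
    fired = [t.target for t in node.transitions if t.condition(state, data)]
    if len(fired) > 1:
        raise RuntimeError(f"{current}: more than one outgoing condition is true")
    return action, (fired[0] if fired else current)


def affordance_filter(state: State, data: Data) -> Action:
    data["selected"] = "pi_2"  # hypothetical stand-in for the LLM's choice
    return None                # does not move the robot: no-op


def make_collect(name: str) -> Callable[[State, Data], Action]:
    def policy(state: State, data: Data) -> Action:
        data["steps"] = data.get("steps", 0) + 1  # data accumulates over time
        return f"{name}_action"
    return policy


collect_names = ("pi_1", "pi_2")
graph: Dict[str, Node] = {
    "AffordanceFilter": Node(
        "AffordanceFilter",
        affordance_filter,
        [Transition(p, lambda s, d, p=p: d.get("selected") == p) for p in collect_names],
    ),
    "DiversityScoring": Node("DiversityScoring", lambda s, d: None),
}
for p in collect_names:
    done = Transition("DiversityScoring", lambda s, d: d.get("steps", 0) >= 2)
    graph[p] = Node(p, make_collect(p), [done])


def run(start: str, max_steps: int) -> List[str]:
    state: State = {}
    data: Data = {}
    current, visited = start, [start]
    for _ in range(max_steps):
        _, current = step(graph, current, state, data)
        visited.append(current)
    return visited


visited = run("AffordanceFilter", 4)
print(visited)
```

Raising an error when two outgoing conditions fire at once enforces the rule that only one should be true at a time.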


One property of AutoRT is that it only generates tasks based on what the robot sees, which can bias task generation. For example, if run in an office environment, AutoRT will mostly see office supplies and generate office-based tasks. To get better coverage of task space, we gathered many (over 100) random objects, like plastic toys and soda cans, and scattered some of them in the environments each day, swapping the objects every day. This provides a greater variety of objects for AutoRT’s task generation.



B Navigation Sampling

We first define a fixed query embedding with the goal of biasing sampling towards easier tasks. A short list of object names from previous works was gathered.

apple, basket, blue can, bottled tea, bowl, box of tea, brown chip bag, can, cereal, chip bag, clipboard, coffee machine, coffee_machine, compost, compost bin, cup, drawer, drinking machine, empty bottle, energy bar, espresso machine, ficus, first aid station, fridge, fruit, green bag of chips, green can, green plant, green soda can, human, jar of white candy, landfill, light switch, microwave oven, mini fridge, multigrain chip, napkin box, orange, paper bowl, paper cup, pepsi, plastic bottle, poster, potted plant, red can, silver spoon, sink, slippery sign, snack jar, snack jar of almonds, snack jar of dried fruits, snack jar of gums, snack jar of nuts, socket, sponge, table, tap, trash can, tv, up side down mug, upside down paper cup, water bottle, water machine, water_bottle, white bowl, white chair, white jar, white mug, white sign, woven basket, yellow sign


This list was gathered once, and not changed or ablated during the project.


We defined φ_q as the normalized average text embedding for these object names. Each navigation target φ_i was then scored from 0 to 1 by:


and sampled proportionally to score_i^β, where β is a hyperparameter deciding the temperature of sampling. We use β = 1 in data collection to maintain higher variation during collection, but recommend using larger β when doing more targeted data collection.
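The scoring formula itself is not reproduced in this text, so the sketch below has to assume one. It uses a min-max normalized dot product between each target embedding and φ_q, which yields scores in [0, 1] but may differ from the authors' definition. All function names and the toy 2-D embeddings are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def query_embedding(name_embeddings: Sequence[Vector]) -> List[float]:
    """phi_q: normalized average of the object-name text embeddings."""
    dim = len(name_embeddings[0])
    mean = [sum(e[i] for e in name_embeddings) / len(name_embeddings) for i in range(dim)]
    return normalize(mean)


def scores_0_to_1(targets: Sequence[Vector], query: Vector) -> List[float]:
    """ASSUMED scoring: min-max normalized dot product with the query."""
    sims = [sum(a * b for a, b in zip(t, query)) for t in targets]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        return [1.0] * len(sims)
    return [(s - lo) / (hi - lo) for s in sims]


def sampling_probs(scores: Sequence[float], beta: float = 1.0) -> List[float]:
    """Probabilities proportional to score ** beta."""
    weights = [s ** beta for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_target(scores: Sequence[float], beta: float, rng: random.Random) -> int:
    """Draw one navigation target index with weight score ** beta."""
    weights = [s ** beta for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


# Toy 2-D embeddings standing in for real text and target embeddings.
phi_q = query_embedding([[1.0, 0.0], [0.8, 0.6]])
scores = scores_0_to_1([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], phi_q)
probs_beta_1 = sampling_probs(scores, beta=1.0)
probs_beta_4 = sampling_probs(scores, beta=4.0)
choice = sample_target(scores, beta=1.0, rng=random.Random(0))
print(scores, probs_beta_1, probs_beta_4, choice)
```

Under this assumed normalization the lowest-scoring target always receives score 0 and is never sampled. Increasing β concentrates probability on the highest-scoring targets, in line with the recommendation to use larger β for targeted collection.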



C Safety Guardrails

The following guardrails are put in place to ensure operational safety.


• All robots will pause motion if detected force on joints exceeds a threshold. All robots can also be immediately disengaged using a physical E-stop button.
• Unless the robot workspace is barricaded, at least one human must supervise the robots in such a way that all robots are within line of sight.
• During regular operation, we proactively remove objects from the environment that are unsafe for a robot to handle. This is in addition to prompting the LLM to not interact with them.
• Whenever we collect a human demonstration, the human expert sanity checks the generated task, since they are already available to provide human feedback to the model.


Many of these controls are standard practice in robot learning. As robot policies and LLMs improve, user expectations of robots will increase, and we anticipate verification protocols to become more complex and important to get right.



D Prompts

All prompts are based on Python string formatting. When doing teleop task generation, we use num_tasks=10. Task generation guidance is set to “N/A” unless specified otherwise.
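As a rough illustration of prompts built with Python string formatting, the sketch below fills a template with a scene description, an object list, the number of tasks, and optional guidance. The template wording, the `build_prompt` helper, and the field names are hypothetical; the actual AutoRT prompt text is not reproduced here.

```python
from typing import Sequence

# Hypothetical template: NOT the AutoRT prompt, only a stand-in that
# demonstrates str.format with named fields.
TASK_GENERATION_TEMPLATE = (
    "Scene: {scene}\n"
    "Objects: {objects}\n"
    "Guidance: {guidance}\n"
    "Propose {num_tasks} tasks for the robot."
)


def build_prompt(
    scene: str,
    objects: Sequence[str],
    num_tasks: int = 10,      # value the text reports for teleop task generation
    guidance: str = "N/A",    # default the text reports when no guidance is given
) -> str:
    """Fill the template using Python string formatting."""
    return TASK_GENERATION_TEMPLATE.format(
        scene=scene,
        objects=", ".join(objects),
        guidance=guidance,
        num_tasks=num_tasks,
    )


prompt = build_prompt("an office kitchen", ["sponge", "paper cup", "apple"])
print(prompt)
```

Named fields make it straightforward to swap in different guidance strings or task counts per collect policy without editing the template itself.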


Robot constitution:

Asimov’s three laws of robotics are modified in two ways. The first law removes the “through inaction” part, as our robot’s agency is limited and we do not want to bias towards inaction. The order of the second and third laws is swapped, since our robots are currently more in need of protection from humans asking for tasks which could endanger the robots, rather than the other way around.


Task generation prompt for teleop policy:


Task generation prompts for RT-2:


Task generation prompts for scripted pick:


Affordance LLM prompt:



D.1 Prompts for Adversarial Experiments

Minimal task generation prompt for teleop. This is identical to the default prompt, without the inclusion of robot constitution rules.


Unsafe task generation prompt for teleop. This both removes the constitutional rules and modifies the prompt to oversample tasks we want the affordance filter to capture.



Minimal affordance LLM prompt used for affordance filtering ablation. This is identical to the default one, without the inclusion of the robot constitution rules.



E Optimizing for Visual Diversity

Since our robot agents can calculate visual diversity scores after every episode, we can use this as a metric to optimize. We performed a pilot study where the robot speaks out loud the diversity score of the episode it has collected. The human supervising the data collection paid attention to this score, and changed the scene between episodes to try to maximize the spoken score. The resulting scenes in Fig. 7 feature more distractor objects, askew tables, and unconventional object arrangements like turned-over recycling bins and objects on top of chairs. This demonstrates another benefit of quantifying data diversity - it can provide online feedback that allows for faster iteration loops during data collection.


Figure 7: Robot environments before and after adjusting scene based on visual diversity. Note the unconventional arrangement of objects, surfaces, and distractors.



F Model Improvement Evaluation Tasks

For picking from different heights, pick attempts were done against 3 different heights: a desk, a shorter table, and the floor. For each height, we sampled 4 candidate tasks, giving 12 tasks in total. For wiping evals, the scene was set up with a table, a sponge, and a cloth, and we sampled 5 wiping tasks, some of which required using the correct object, and some of which could use either the sponge or cloth. All tasks were attempted 2 times each. Exact task strings are in Table 6.
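The protocol above fixes the number of trials behind each reported success rate. The short sketch below works through that arithmetic. The implied success counts are an inference from the percentages reported in Section 5.4 and are not stated in the text.

```python
from fractions import Fraction


def num_trials(num_tasks: int, attempts_per_task: int) -> int:
    return num_tasks * attempts_per_task


# 3 heights x 4 tasks per height, each attempted twice.
picking_trials = num_trials(3 * 4, 2)
# 5 wiping tasks, each attempted twice.
wiping_trials = num_trials(5, 2)

# Inferred success counts, assuming every trial is weighted equally.
picking_successes = Fraction(125, 1000) * picking_trials   # 12.5% after co-fine-tuning
wiping_before = Fraction(10, 100) * wiping_trials          # 10% before co-fine-tuning
wiping_after = Fraction(30, 100) * wiping_trials           # 30% after co-fine-tuning

print(picking_trials, wiping_trials, picking_successes, wiping_before, wiping_after)
```

With trial counts this small, each individual success shifts the picking rate by about 4 percentage points and the wiping rate by 10, which is worth keeping in mind when reading the reported gains.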


Table 6: Tasks used to evaluate training ablations



G Qualitative Examples

We collect qualitative examples of LLM generations here. Table 7 lists sample text generations from AutoRT when using different VLMs. Table 8 lists tasks from Section 5.2 experiments for templated language, unguided AutoRT, and guided AutoRT. Table 9 lists tasks from adversarial testing of constitutional prompting.


Table 7: Example generated tasks with AutoRT using the teleoperated prompt, comparing two different VLMs for describing the scene and nearby objects. We found FlexCap to be more descriptive in its object description, particularly with regards to color.


Table 8: Examples from Section 5.2 experiments testing relevance and feasibility


Table 9: Tasks generated in Section 5.3 experiments. We present an image the robot sees, tasks generated by the unsafe task generation prompt, and the reply of both the minimal affordance and constitutional affordance.



H Scripted Pick

Below is pseudocode for the scripted picking policy used in data collection. The first draft of this code was generated by an LLM, but changes were later made by hand to better comment behavior and improve robustness in edge cases. Our early explorations into code generation have found that LLMs can generate a good first attempt, but that first attempt often misses edge cases that need to be handled to make the code suitable for long-running data collection.



Figure 8: Robot trajectories from scripted motion (left) and teleop motion (right). Note that teleop is on the whole a lot more diverse from a trajectory perspective.

Figure 9: Hours of data collected per policy per day. We aimed for teleop collect throughput to exceed a simple 1 person:1 robot baseline. We found a small increase in teleop throughput from AutoRT since AutoRT used fewer manual resets than typical collection (a robot can navigate to a new scene instead of waiting for a reset).


