[晓理紫]每日论文推送(有中文摘要或代码或项目地址)
每日更新论文,请转发给有需要的同学
[晓理紫]

专属领域论文订阅

VX关注晓理紫,获取每日新论文
VX关注晓理紫,并留下邮箱可免费获取每日论文推送服务

{晓理紫}喜分享,也很需要你的支持,喜欢留下痕迹哦!

分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉导航
  • 具身智能,机器人
  • 强化学习
  • 开放词汇,检测分割

== Visual Navigation ==

标题: Exploring Vulnerabilities of No-Reference Image Quality Assessment
Models: A Query-Based Black-Box Method

作者: Chenxi Yang, Yujia Liu, Dingquan Li

摘要: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black-box attack against NR-IQA methods. We propose the concept of \emph{score boundary} and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our attack method outperforms all compared state-of-the-art methods and is far ahead of previous black-box methods. The effective DBCNN model suffers a Spearman rank-order correlation coefficient (SROCC) decline of 0.6972 when attacked by our method, revealing the vulnerability of NR-IQA to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.

中文摘要: 无参考图像质量评估(NR-IQA)旨在预测与人类感知一致的图像质量分数,而不依赖于原始参考图像,这是各种视觉任务的关键组成部分。确保NR-IQA方法的稳健性对于不同图像处理技术的可靠比较和推荐中一致的用户体验至关重要。NR-IQA的攻击方法为测试NR-IQA的鲁棒性提供了一个强大的工具。然而,当前NR-IQA的攻击方法严重依赖于NR-IQA模型的梯度,导致在梯度信息不可用时受到限制。在本文中,我们提出了一种针对NR-IQA方法的开创性的基于查询的黑盒攻击。我们提出了\emph{分数边界}的概念,并利用了一种具有多个分数边界的自适应迭代方法。同时,初始攻击方向也被设计为利用人类视觉系统(HVS)的特性。实验表明,我们的攻击方法优于所有对比的最先进方法,并且远远领先于以前的黑盒方法。有效的DBCNN模型在受到我们的方法攻击时,Spearman等级相关系数(SROCC)下降了0.6972,揭示了NR-IQA对黑盒攻击的脆弱性。所提出的攻击方法也为进一步探索NR-IQA的鲁棒性提供了有力的工具

[Downlink:]http://arxiv.org/abs/2401.05217v1
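
下面是一个简要示意(并非论文作者的代码),说明摘要中报告的SROCC下降可以如何度量:分别计算模型在干净图像与对抗图像上的预测分数和人类主观评分(MOS)之间的Spearman等级相关,取其差值即为SROCC下降。其中 nr_iqa_model、图像列表等均为示意性假设的接口。

```python
# 示意代码:度量攻击前后NR-IQA模型的SROCC变化(接口均为假设,仅供说明)
from scipy.stats import spearmanr

def srocc(pred_scores, human_mos):
    """Spearman rank-order correlation between model scores and human MOS."""
    rho, _ = spearmanr(pred_scores, human_mos)
    return rho

def srocc_decline(nr_iqa_model, clean_images, attacked_images, human_mos):
    clean_scores = [nr_iqa_model(img) for img in clean_images]
    attacked_scores = [nr_iqa_model(img) for img in attacked_images]
    # 攻击造成的SROCC下降 = 干净图像上的SROCC - 对抗图像上的SROCC
    return srocc(clean_scores, human_mos) - srocc(attacked_scores, human_mos)
```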


标题: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和机器人技术在过去十年中取得了显著进步,改变了各个领域的工作模式和机会。这些技术的应用将社会推向了一个人与机器共生的时代。为了促进人类与智能机器人之间的高效通信,我们提出了“阿凡达”系统,这是一个沉浸式低延迟全景人机交互平台。我们设计并测试了一个坚固的移动平台原型,该平台集成了边缘计算单元、全景视频捕获设备、动力电池、机械臂和网络通信设备。在良好的网络条件下,我们实现了延迟357ms的低延迟高清全景视觉体验。操作员可以利用VR头显和控制器对机器人和设备进行实时沉浸式控制。该系统能够实现跨越校园、省份、国家甚至大洲(纽约到深圳)的远距离远程控制。此外,该系统结合了用于地图和轨迹记录的视觉SLAM技术,提供了自主导航功能。我们相信,这个直观的系统平台可以提高人机协作的效率和情景体验,随着相关技术的进一步进步,它将成为人工智能与人类高效共生合作的通用工具

[Downlink:]http://arxiv.org/abs/2401.03398v2


标题: Autonomous robotic re-alignment for face-to-face underwater human-robot
interaction

作者: Demetrious T. Kutzke, Ashwin Wariar, Junaed Sattar

摘要: The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component to support UHRI is establishing a system for safe robotic-diver approach to establish face-to-face communication that considers non-standard human body pose. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

中文摘要: 由于传感、导航、操纵和机载计算技术的进步,自动水下航行器(AUV)用于完成传统上具有挑战性和危险性的任务的使用激增。在水下人机交互(UHRI)中使用AUV的增长水平相对较小,这是由于双向通信的局限性和弥合陆地交互策略与水下领域可能的策略之间的差距的重大技术障碍。支持UHRI的一个必要组成部分是建立一个让机器人能够安全接近潜水员的系统,以便在考虑非标准人体姿态的情况下建立面对面交流。在这项工作中,我们介绍了一种用于增强UHRI的立体视觉系统,该系统利用立体图像对的三维重建和机器学习来定位人类关节估计。然后,我们为坐标系建立了一个约定,该约定对人类相对于相机坐标系所面对的方向进行编码。这允许自动计算设定点,该设定点保持人体比例,并且可以用作基于图像的视觉伺服控制方案的输入。我们表明,我们的设定点计算往往在数量和质量上与实验设定点基线一致。所介绍的方法有望通过改善机器人对水下人类方位的感知来增强UHRI

[Downlink:]http://arxiv.org/abs/2401.04320v1


标题: A Visual Analytics Design for Connecting Healthcare Team Communication
to Patient Outcomes

作者: Hsiao-Ying Lu, Yiran Li, Kwan-Liu Ma

摘要: Communication among healthcare professionals (HCPs) is crucial for the quality of patient treatment. Surrounding each patient’s treatment, communication among HCPs can be examined as temporal networks, constructed from Electronic Health Record (EHR) access logs. This paper introduces a visual analytics system designed to study the effectiveness and efficiency of temporal communication networks mediated by the EHR system. We present a method that associates network measures with patient survival outcomes and devises effectiveness metrics based on these associations. To analyze communication efficiency, we extract the latencies and frequencies of EHR accesses. Our visual analytics system is designed to assist in inspecting and understanding the composed communication effectiveness metrics and to enable the exploration of communication efficiency by encoding latencies and frequencies in an information flow diagram. We demonstrate and evaluate our system through multiple case studies and an expert review.

中文摘要: 医疗保健专业人员(HCP)之间的沟通对患者治疗的质量至关重要。围绕每个患者的治疗,HCP之间的通信可以作为时间网络进行检查,该网络由电子健康记录(EHR)访问日志构建。本文介绍了一个可视化分析系统,旨在研究EHR系统所介导的时间通信网络的有效性和效率。我们提出了一种将网络测量与患者生存结果相关联的方法,并基于这些关联设计有效性指标。为了分析通信效率,我们提取了EHR接入的延迟和频率。我们的可视化分析系统旨在帮助检查和理解组合的通信有效性指标,并通过在信息流图中编码延迟和频率来探索通信效率。我们通过多个案例研究和专家评审来展示和评估我们的系统

[Downlink:]http://arxiv.org/abs/2401.03700v1


标题: Amirkabir campus dataset: Real-world challenges and scenarios of Visual
Inertial Odometry (VIO) for visually impaired people

作者: Ali Samadzadeh, Mohammad Hassan Mojab, Heydar Soudani

摘要: Visual Inertial Odometry (VIO) algorithms estimate the accurate camera trajectory by using camera and Inertial Measurement Unit (IMU) sensors. The applications of VIO span a diverse range, including augmented reality and indoor navigation. VIO algorithms hold the potential to facilitate navigation for visually impaired individuals in both indoor and outdoor settings. Nevertheless, state-of-the-art VIO algorithms encounter substantial challenges in dynamic environments, particularly in densely populated corridors. Existing VIO datasets, e.g., ADVIO, typically fail to effectively exploit these challenges. In this paper, we introduce the Amirkabir campus dataset (AUT-VI) to address the mentioned problem and improve the navigation systems. AUT-VI is a novel and super-challenging dataset with 126 diverse sequences in 17 different locations. This dataset contains dynamic objects, challenging loop-closure/map-reuse, different lighting conditions, reflections, and sudden camera movements to cover all extreme navigation scenarios. Moreover, in support of ongoing development efforts, we have released the Android application for data capture to the public. This allows fellow researchers to easily capture their customized VIO dataset variations. In addition, we evaluate state-of-the-art Visual Inertial Odometry (VIO) and Visual Odometry (VO) methods on our dataset, emphasizing the essential need for this challenging dataset.

中文摘要: 视觉惯性里程计(VIO)算法通过使用相机和惯性测量单元(IMU)传感器来估计精确的相机轨迹。VIO的应用范围广泛,包括增强现实和室内导航。VIO算法有可能促进视障人士在室内和室外环境中的导航。然而,最先进的VIO算法在动态环境中,特别是在人口稠密的走廊中,遇到了巨大的挑战。现有的VIO数据集,例如ADVIO,通常无法有效利用这些挑战。在本文中,我们引入了Amirkabir校园数据集(AUT-VI)来解决上述问题并改进导航系统。AUT-VI是一个新颖且极具挑战性的数据集,包含17个不同位置的126个不同序列。该数据集包含动态对象、具有挑战性的回路闭合/地图重用、不同的照明条件、反射和相机突然移动,以覆盖所有极端导航场景。此外,为了支持正在进行的开发工作,我们向公众发布了用于数据捕获的Android应用程序。这使得其他研究人员能够轻松地捕捉他们定制的VIO数据集变体。此外,我们在数据集上评估了最先进的视觉惯性里程计(VIO)和视觉里程计(VO)方法,强调了对这一具有挑战性的数据集的必要性

[Downlink:]http://arxiv.org/abs/2401.03604v1


== 具身智能,机器人 ==

标题: Unified Learning from Demonstrations, Corrections, and Preferences
during Physical Human-Robot Interaction

作者: Shaunak A. Mehta, Dylan P. Losey

摘要: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human’s intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU

中文摘要: 人类可以利用物理交互来教授机器人手臂。这种物理交互有多种形式,具体取决于任务、用户以及机器人迄今为止所学的知识。现有技术的方法侧重于从单一模态学习,或者通过假设机器人具有关于人类预期任务的先验信息来组合多种交互类型。相比之下,在本文中,我们引入了一种算法形式主义,它将从演示、更正和偏好中学习结合起来。我们的方法不对人类想要教机器人的任务进行假设;相反,我们通过将人类的输入与附近的替代品进行比较,从头开始学习奖励模型。我们首先推导出一个损失函数,该函数训练一组奖励模型,以匹配人类的演示、校正和偏好。反馈的类型和顺序取决于人类老师:我们使机器人能够被动或主动地收集反馈。然后,我们应用约束优化将我们学到的奖励转化为所需的机器人轨迹。通过模拟和用户研究,我们证明了我们提出的方法比现有的基线更准确地从物理人类交互中学习操纵任务,特别是当机器人面临新的或意想不到的目标时。我们的用户研究视频可在以下网站获取:https://youtu.be/FSUJsTYvEKU

[Downlink:]http://arxiv.org/abs/2207.03395v2

[Project:]https://youtu.be/FSUJsTYvEKU
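
下面给出一个概念性示意(并非论文的精确公式):按照摘要中的思路,把人类输入(演示/纠正/偏好选择)与附近的替代轨迹进行比较,训练奖励模型使人类输入得分更高。这里采用通用的Bradley-Terry式比较损失作为说明,reward_net、轨迹特征等均为假设的占位接口。

```python
# 示意代码:通过"人类输入 vs. 附近替代轨迹"的比较来训练奖励模型(假设的接口)
import torch
import torch.nn as nn

def comparison_loss(reward_net: nn.Module,
                    human_traj: torch.Tensor,
                    alternative_trajs: torch.Tensor) -> torch.Tensor:
    """human_traj: (D,) 特征向量;alternative_trajs: (K, D) 附近的替代轨迹特征。"""
    r_human = reward_net(human_traj)                    # 人类输入的标量奖励,形状 (1,)
    r_alts = reward_net(alternative_trajs).squeeze(-1)  # 各替代轨迹的奖励,形状 (K,)
    logits = torch.cat([r_human.view(1), r_alts])       # 把人类输入作为"第0个选项"
    # 最大化"人类的选择被排在首位"的概率(softmax/交叉熵形式)
    return nn.functional.cross_entropy(logits.unsqueeze(0),
                                       torch.zeros(1, dtype=torch.long))
```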


标题: Theory of Mind abilities of Large Language Models in Human-Robot
Interaction : An Illusion?

作者: Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

摘要: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models especially on Theory of Mind (ToM) abilities in Large Language Models. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot’s generated behavior in a manner similar to a human observer. We focus on four behavior types, namely - explicable, legible, predictable, and obfuscatory behavior which have been extensively used to synthesize interpretable robot behaviors. The LLM’s goal is, therefore, to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example “Given a robot’s behavior X, would the human observer find it explicable?”. We conduct a human subject study to verify that the users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results inflating one’s expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests which breaks this illusion, i.e. Inconsistent Belief, Uninformative Context and Conviction Test. We conclude that the high score of LLMs on vanilla prompts showcases their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.

中文摘要: 大型语言模型在各种自然语言和生成任务中表现出了非凡的生成能力。然而,可能的拟人化和对失败案例的宽容推动了对大语言模型涌现能力的讨论,尤其是对大语言模型中心理理论(ToM)能力的讨论。虽然存在一些错误信念测试来验证推断和维护另一个实体的心理模型的能力,但我们研究了ToM能力的一个特殊应用,它具有更高的风险和可能不可逆转的后果:人机交互。在这项工作中,我们探索了感知行为识别的任务,其中机器人采用大型语言模型(LLM)以类似于人类观察者的方式评估机器人生成的行为。我们关注四种行为类型,即可解释(explicable)、易辨识(legible)、可预测(predictable)和模糊(obfuscatory)行为,这些行为已被广泛用于合成可解释的机器人行为。因此,LLM的目标是充当环路中人类观察者的替身,并回答某个智能体行为将如何被人类感知,例如“给定机器人的行为X,人类观察者会发现它是可解释的吗?”。我们进行了一项人类受试者研究,以验证用户能够在五个领域的精心策划的情况下(机器人设置和计划)正确回答这样的问题。对信念测试的初步分析产生了非常积极的结果,夸大了人们对LLM拥有ToM能力的期望。然后,我们提出并执行了一套打破这种错觉的扰动测试,即不一致信念测试、无信息上下文测试和信念坚定性(Conviction)测试。我们得出的结论是,LLM在普通(vanilla)提示上的高分显示了它在HRI设置中的潜在用途;然而,要真正具备ToM,就需要对上下文中琐碎或无关的扰动保持不变性,而这正是LLM所缺乏的

[Downlink:]http://arxiv.org/abs/2401.05302v1


标题: Evaluating Gesture Recognition in Virtual Reality

作者: Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta

摘要: Human-Robot Interaction (HRI) has become increasingly important as robots are being integrated into various aspects of daily life. One key aspect of HRI is gesture recognition, which allows robots to interpret and respond to human gestures in real-time. Gesture recognition plays an important role in non-verbal communication in HRI. To this aim, there is ongoing research on how such non-verbal communication can strengthen verbal communication and improve the system’s overall efficiency, thereby enhancing the user experience with the robot. However, several challenges need to be addressed in gesture recognition systems, which include data generation, transferability, scalability, generalizability, standardization, and lack of benchmarking of the gestural systems. In this preliminary paper, we want to address the challenges of data generation using virtual reality simulations and standardization issues by presenting gestures to some commands that can be used as a standard in ground robots.

中文摘要: 随着机器人融入日常生活的各个方面,人机交互变得越来越重要。HRI的一个关键方面是手势识别,它允许机器人实时解释和响应人类手势。手势识别在HRI的非言语交际中起着重要作用。为此,正在进行的研究是,这种非语言交流如何加强语言交流,提高系统的整体效率,从而增强机器人的用户体验。然而,手势识别系统需要解决几个挑战,包括数据生成、可传输性、可扩展性、可推广性、标准化以及缺乏手势系统的基准测试。在这篇初步论文中,我们希望通过向一些可以用作地面机器人标准的命令提供手势,来解决使用虚拟现实模拟生成数据的挑战和标准化问题

[Downlink:]http://arxiv.org/abs/2401.04545v1


标题: Testing Human-Robot Interaction in Virtual Reality: Experience from a
Study on Speech Act Classification

作者: Sara Kaszuba, Sandeep Reddy Sabbella, Francesco Leotta

摘要: In recent years, an increasing number of Human-Robot Interaction (HRI) approaches have been implemented and evaluated in Virtual Reality (VR), as it allows to speed-up design iterations and makes it safer for the final user to evaluate and master the HRI primitives. However, identifying the most suitable VR experience is not straightforward. In this work, we evaluate how, in a smart agriculture scenario, immersive and non-immersive VR are perceived by users with respect to a speech act understanding task. In particular, we collect opinions and suggestions from the 81 participants involved in both experiments to highlight the strengths and weaknesses of these different experiences.

中文摘要: 近年来,越来越多的人机交互(HRI)方法在虚拟现实(VR)中得到了实施和评估,因为它可以加快设计迭代,并使最终用户更安全地评估和掌握HRI原语。然而,确定最合适的VR体验并不简单。在这项工作中,我们评估了在智能农业场景中,用户如何在言语行为(speech act)理解任务中感知沉浸式和非沉浸式VR。特别是,我们收集了参与这两个实验的81名参与者的意见和建议,以突出这些不同体验的优势和劣势

[Downlink:]http://arxiv.org/abs/2401.04534v1


标题: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和机器人技术在过去十年中取得了显著进步,改变了各个领域的工作模式和机会。这些技术的应用将社会推向了一个人与机器共生的时代。为了促进人类与智能机器人之间的高效通信,我们提出了“阿凡达”系统,这是一个沉浸式低延迟全景人机交互平台。我们设计并测试了一个坚固的移动平台原型,该平台集成了边缘计算单元、全景视频捕获设备、动力电池、机械臂和网络通信设备。在良好的网络条件下,我们实现了延迟357ms的低延迟高清全景视觉体验。操作员可以利用VR头显和控制器对机器人和设备进行实时沉浸式控制。该系统能够实现跨越校园、省份、国家甚至大洲(纽约到深圳)的远距离远程控制。此外,该系统结合了用于地图和轨迹记录的视觉SLAM技术,提供了自主导航功能。我们相信,这个直观的系统平台可以提高人机协作的效率和情景体验,随着相关技术的进一步进步,它将成为人工智能与人类高效共生合作的通用工具

[Downlink:]http://arxiv.org/abs/2401.03398v2


标题: Large Language Models for Robotics: Opportunities, Challenges, and
Perspectives

作者: Jiaqi Wang, Zihao Wu, Yiwei Li

摘要: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.

中文摘要: 大型语言模型(LLM)经历了显著的扩展,并越来越多地跨各个领域进行集成。值得注意的是,在机器人任务规划领域,LLM利用其先进的推理和语言理解能力,根据自然语言指令制定精确高效的行动计划。然而,对于机器人与复杂环境交互的具身任务,由于与机器人视觉感知缺乏兼容性,纯文本LLM往往面临挑战。这项研究全面概述了LLM和多模态LLM在各种机器人任务中的新兴集成。此外,我们提出了一个框架,该框架利用多模态GPT-4V,通过自然语言指令和机器人视觉感知的组合来增强具身任务规划。我们基于不同数据集的结果表明,GPT-4V有效地提高了机器人在具身任务中的性能。这项针对各种机器人任务的LLM和多模态LLM的广泛调查和评估丰富了对以LLM为中心的具身智能的理解,并为弥合人-机器人-环境交互中的差距提供了前瞻性见解

[Downlink:]http://arxiv.org/abs/2401.04334v1


== Reinforcement Learning ==

标题: HomeRobot: Open-Vocabulary Mobile Manipulation

作者: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav

摘要: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways future research can improve performance. See videos on our website: https://ovmm.github.io/.

中文摘要: 家庭机器人(名词):一种价格合理的柔顺(compliant)机器人,可在家中导航并操纵各种物体以完成日常任务。开放词汇移动操作(OVMM)是指在任何未见过的环境中拾取任何对象,并将其放置在命令位置的问题。这是机器人在人类环境中成为有用助手的一个基本挑战,因为它涉及到解决机器人的子问题:感知、语言理解、导航和操作都是OVMM的关键。此外,这些子问题的解决方案的一体化也带来了自身的重大挑战。为了推动这一领域的研究,我们引入了HomeRobot OVMM基准,在该基准中,智能体在家庭环境中导航,以抓取新物体并将其放置在目标容器上。HomeRobot有两个组件:一个模拟组件,在新的、高质量的多房间家庭环境中使用大型和多样化的策划对象集;和一个真实世界的组件,为低成本的Hello Robot Stretch提供了一个软件堆栈,以鼓励在实验室中复制真实世界的实验。我们实现了强化学习和启发式(基于模型的)基线,并展示了模拟到真实转移的证据。我们的基线在现实世界中实现了20%的成功率;我们的实验确定了未来研究工作提高性能的方法。查看我们网站上的视频:https://ovmm.github.io/.

[Downlink:]http://arxiv.org/abs/2306.11565v2

[Project:]https://ovmm.github.io/
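
作为对上文OVMM任务定义的一个直观说明,下面用一个假设性的数据结构概括“在未见过的家庭环境中拾取某个开放词汇物体并放到指定容器上”这一回合设定。字段名与成功判定均为示意,并非基准的真实接口。

```python
# 示意代码:OVMM回合设定的一个假设性描述(非HomeRobot基准的实际API)
from dataclasses import dataclass

@dataclass
class OVMMEpisode:
    scene_id: str            # 未见过的多房间家庭环境
    object_category: str     # 需要拾取的开放词汇物体,例如 "toy truck"
    start_receptacle: str    # 物体初始所在的容器,例如 "sofa"
    goal_receptacle: str     # 指令要求放置的目标容器,例如 "kitchen table"

def success(placed_on: str, episode: OVMMEpisode) -> bool:
    """物体最终被放到目标容器上,即视为该回合成功。"""
    return placed_on == episode.goal_receptacle
```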


标题: Yes, this is what I was looking for! Towards Multi-modal Medical
Consultation Concern Summary Generation

作者: Abhisek Tiwari, Shreyangshu Bera, Sriparna Saha

摘要: Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the internet as our initial source of support, choosing this over discussing our feelings with others due to the associated social stigma. In this paper, we propose a new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of patients’ major concerns brought up during the consultation. Nonverbal cues, such as patients’ gestures and facial expressions, aid in accurately identifying patients’ concerns. Doctors also consider patients’ personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients’ personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition, and medical concern summary generation (IR-MMCSG) system. Furthermore, we propose a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor’s recommendations, and keywords. Our experiments and analysis demonstrate (a) the significant role of patients’ expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients’ medical concern summary generation. The dataset and source code are available at https://github.com/NLP-RL/MMCSG.

中文摘要: 在过去几年中,互联网在医疗保健相关任务中的使用突飞猛进,这对有效管理和处理信息以确保其高效利用提出了挑战。在情绪动荡和心理挑战的时刻,我们经常求助于互联网作为我们最初的支持来源,由于相关的社会污名,我们选择了互联网而不是与他人讨论我们的感受。在本文中,我们提出了一项新的任务,即生成多模态医疗问题摘要(MMCS),该任务对患者在咨询过程中提出的主要问题进行了简短而准确的总结。非语言提示,如患者的手势和面部表情,有助于准确识别患者的担忧。医生还会考虑患者的个人信息,如年龄和性别,以便适当地描述医疗状况。受患者个人背景和视觉手势的潜在功效的启发,我们提出了一个基于Transformer的多任务、多模态意图识别和医疗问题摘要生成(IR-MMCSG)系统。此外,我们提出了一个多任务框架,用于医患会诊的意图识别和医疗问题摘要生成。我们构建了第一个多模态医疗问题摘要生成(MM-MediConSummation)语料库,其中包括用医疗问题摘要、意图、患者个人信息、医生建议和关键词注释的医患咨询。我们的实验和分析证明了(a)患者的表情/手势及其个人信息在意图识别和医疗问题摘要生成中的重要作用,以及(b)意图识别和患者医疗问题摘要生成之间的强相关性。数据集和源代码可在https://github.com/NLP-RL/MMCSG 获取

[Downlink:]http://arxiv.org/abs/2401.05134v1

[GitHub:]https://github.com/NLP-RL/MMCSG


标题: Human as AI Mentor: Enhanced Human-in-the-loop Reinforcement Learning
for Safe and Efficient Autonomous Driving

作者: Zilin Huang, Zihao Sheng, Chengyuan Ma

摘要: Despite significant progress in autonomous vehicles (AVs), the development of driving policies that ensure both the safety of AVs and traffic flow efficiency has not yet been fully explored. In this paper, we propose an enhanced human-in-the-loop reinforcement learning method, termed the Human as AI mentor-based deep reinforcement learning (HAIM-DRL) framework, which facilitates safe and efficient autonomous driving in mixed traffic platoon. Drawing inspiration from the human learning process, we first introduce an innovative learning paradigm that effectively injects human intelligence into AI, termed Human as AI mentor (HAIM). In this paradigm, the human expert serves as a mentor to the AI agent. While allowing the agent to sufficiently explore uncertain environments, the human expert can take control in dangerous situations and demonstrate correct actions to avoid potential accidents. On the other hand, the agent could be guided to minimize traffic flow disturbance, thereby optimizing traffic flow efficiency. In detail, HAIM-DRL leverages data collected from free exploration and partial human demonstrations as its two training sources. Remarkably, we circumvent the intricate process of manually designing reward functions; instead, we directly derive proxy state-action values from partial human demonstrations to guide the agents’ policy learning. Additionally, we employ a minimal intervention technique to reduce the human mentor’s cognitive load. Comparative results show that HAIM-DRL outperforms traditional methods in driving safety, sampling efficiency, mitigation of traffic flow disturbance, and generalizability to unseen traffic scenarios. The code and demo videos for this paper can be accessed at: https://zilin-huang.github.io/HAIM-DRL-website/

中文摘要: 尽管自动驾驶汽车取得了重大进展,但确保自动驾驶汽车安全和交通流效率的驾驶政策的制定尚未得到充分探索。在本文中,我们提出了一种增强的人在环强化学习方法,称为基于人工智能导师的深度强化学习(HAIM-DRL)框架,该框架有助于混合交通车队中安全高效的自动驾驶。从人类学习过程中汲取灵感,我们首先引入了一种创新的学习范式,将人类智能有效地注入人工智能,称为“人类即人工智能导师”(HAIM)。在这种范式中,人类专家充当人工智能代理的导师。在允许智能体充分探索不确定环境的同时,人类专家可以在危险情况下进行控制,并展示正确的行动以避免潜在的事故。另一方面,可以引导代理最小化交通流干扰,从而优化交通流效率。详细地说,HAIM-DRL利用从自由探索和部分人类演示中收集的数据作为其两个训练来源。值得注意的是,我们避开了手动设计奖励函数的复杂过程;相反,我们直接从部分人类演示中导出代理状态动作值,以指导代理的策略学习。此外,我们采用最小干预技术来减少人类导师的认知负荷。比较结果表明,HAIM-DRL在驾驶安全性、采样效率、交通流干扰的缓解以及对未知交通场景的可推广性方面优于传统方法。本文的代码和演示视频可访问:https://zilin-huang.github.io/HAIM-DRL-website/

[Downlink:]http://arxiv.org/abs/2401.03160v2

[Project:]https://zilin-huang.github.io/HAIM-DRL-website/
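
下面是一个高度简化的示意(并非HAIM-DRL的实现):智能体自由探索,人类导师在危险情形下接管并给出正确动作,两类转移数据都被记录下来供后续学习使用。env、agent、human_mentor、buffer均为假设的接口。

```python
# 示意代码:人在环数据采集——危险时由人类导师接管(接口均为假设)
def collect_episode(env, agent, human_mentor, buffer):
    obs = env.reset()
    done = False
    while not done:
        agent_action = agent.act(obs)
        if human_mentor.wants_control(obs, agent_action):   # 例如即将发生碰撞
            action, from_human = human_mentor.act(obs), True
        else:
            action, from_human = agent_action, False
        next_obs, reward, done, info = env.step(action)
        # 同时记录"来自智能体探索"和"来自人类演示"两类转移
        buffer.add(obs, action, reward, next_obs, done, from_human)
        obs = next_obs
```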


标题: Two-Stage Constrained Actor-Critic for Short Video Recommendation

作者: Qingpeng Cai, Zhenghai Xue, Chi Zhang

摘要: The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platforms aim at optimizing the users’ cumulative watch time (main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also need to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as like, follow, share, etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

中文摘要: 短视频在社交媒体上的广泛流行为优化视频共享平台上的推荐系统带来了新的机遇和挑战。用户顺序地与系统交互,并提供复杂和多方面的响应,包括观看时间和与多个视频的各种类型的交互。一方面,平台旨在长期优化用户的累计观看时间(主要目标),强化学习可以有效地优化用户的累积观看时间。另一方面,平台还需要满足容纳多个用户交互(辅助目标)(如关注、分享等)的响应的约束。在本文中,我们将短视频推荐问题公式化为约束马尔可夫决策过程(CMDP)。我们发现传统的约束强化学习算法在这种情况下不能很好地工作。我们提出了一种新的两阶段约束行动者-批评家方法:在第一阶段,我们学习单个策略来优化每个辅助信号。在第二阶段,我们学习了一种策略,以(i)优化主信号,(ii)保持与第一阶段学习的策略接近,这有效地保证了该主策略在辅助设备上的性能。通过广泛的离线评估,我们证明了我们的方法在优化主要目标和平衡其他目标方面的有效性。我们在短视频推荐的现场实验中进一步展示了我们的方法的优势,在观看时间和互动方面,它明显优于其他基线。我们的方法已在生产系统中全面推出,以优化平台上的用户体验

[Downlink:]http://arxiv.org/abs/2302.01680v3

[GitHub:]https://github.com/AIDefender/TSCAC
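
下面用一个示意性的损失函数说明第二阶段的思想(并非论文的精确约束形式):在优化主信号(如观看时长)的同时,惩罚主策略与第一阶段学到的各辅助策略之间的KL散度,从而“保持接近”。所有网络与权重均为占位符。

```python
# 示意代码:第二阶段目标 = 主信号的策略梯度项 + 对各辅助策略的KL惩罚(假设的接口)
import torch

def stage_two_loss(main_advantage, log_prob_main_policy,
                   main_policy_probs, aux_policy_probs_list, kl_weights):
    """main_policy_probs: (B, A);aux_policy_probs_list: 若干个 (B, A) 的辅助策略分布。"""
    # 最大化主信号回报(这里写成最小化其负值)
    policy_gradient_term = -(main_advantage * log_prob_main_policy).mean()
    kl_term = 0.0
    for w, aux_probs in zip(kl_weights, aux_policy_probs_list):
        kl = (main_policy_probs *
              (main_policy_probs.clamp_min(1e-8).log() -
               aux_probs.clamp_min(1e-8).log())).sum(dim=-1).mean()
        kl_term = kl_term + w * kl        # 软约束:离辅助策略越远,惩罚越大
    return policy_gradient_term + kl_term
```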


标题: StarCraftImage: A Dataset For Prototyping Spatial Reasoning Methods For
Multi-Agent Environments

作者: Sean Kulinski, Nicholas R. Waytowich, James Z. Hare

摘要: Spatial reasoning tasks in multi-agent environments such as event prediction, agent type identification, or missing data imputation are important for multiple applications (e.g., autonomous surveillance over sensor networks and subtasks for reinforcement learning (RL)). StarCraft II game replays encode intelligent (and adversarial) multi-agent behavior and could provide a testbed for these tasks; however, extracting simple and standardized representations for prototyping these tasks is laborious and hinders reproducibility. In contrast, MNIST and CIFAR10, despite their extreme simplicity, have enabled rapid prototyping and reproducibility of ML methods. Following the simplicity of these datasets, we construct a benchmark spatial reasoning dataset based on StarCraft II replays that exhibit complex multi-agent behaviors, while still being as easy to use as MNIST and CIFAR10. Specifically, we carefully summarize a window of 255 consecutive game states to create 3.6 million summary images from 60,000 replays, including all relevant metadata such as game outcome and player races. We develop three formats of decreasing complexity: Hyperspectral images that include one channel for every unit type (similar to multispectral geospatial images), RGB images that mimic CIFAR10, and grayscale images that mimic MNIST. We show how this dataset can be used for prototyping spatial reasoning methods. All datasets, code for extraction, and code for dataset loading can be found at https://starcraftdata.davidinouye

中文摘要: 多智能体环境中的空间推理任务,如事件预测、智能体类型识别或缺失数据插补,对于多个应用程序非常重要(例如,传感器网络上的自主监控和强化学习(RL)的子任务)。《星际争霸II》游戏回放对智能(和对抗性)多智能体行为进行编码,并可以为这些任务提供测试平台;然而,为这些任务的原型设计提取简单和标准化的表示是费力的,并且阻碍了再现性。相比之下,MNIST和CIFAR10尽管极其简单,但已经实现了ML方法的快速原型设计和再现性。遵循这些数据集的简单性,我们构建了一个基于星际争霸II回放的基准空间推理数据集,该数据集表现出复杂的多智能体行为,同时仍然像MNIST和CIFAR10一样易于使用。具体来说,我们仔细总结了255个连续游戏状态的窗口,从60000次回放中创建了360万个摘要图像,包括所有相关的元数据,如游戏结果和玩家种族。我们开发了三种降低复杂性的格式:每种单位类型都有一个通道的高光谱图像(类似于多光谱地理空间图像)、模拟CIFAR10的RGB图像和模拟MNIST的灰度图像。我们展示了如何将该数据集用于空间推理方法的原型设计。所有数据集、用于提取的代码和用于加载数据集的代码都可以在https://starcraftdata.davidinouye

[Downlink:]http://arxiv.org/abs/2401.04290v1

[Project:]https://starcraftdata.davidinouye
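
下面的小例子示意摘要中描述的数据表示方式:把一个时间窗口内观察到的单位位置栅格化为“每种单位类型一个通道”的图像(即所谓高光谱格式)。字段名与尺寸均为示意性假设,并非数据集的真实格式。

```python
# 示意代码:把窗口内的单位事件栅格化为 (单位类型数, H, W) 的摘要图像(格式为假设)
import numpy as np

def summarize_window(unit_events, num_unit_types, height=64, width=64):
    """unit_events: 可迭代的 (unit_type_id, x, y) 三元组,来自一个时间窗口内的游戏状态。"""
    image = np.zeros((num_unit_types, height, width), dtype=np.float32)
    for unit_type, x, y in unit_events:
        if 0 <= x < width and 0 <= y < height:
            image[unit_type, y, x] += 1.0   # 累计该类型单位在该位置出现的次数
    return image

# 类似MNIST的灰度视图可由 image.sum(axis=0) 得到;
# 类似CIFAR10的RGB视图则可把各单位类型归并到三个通道。
```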


标题: A Reinforcement Learning Approach to Sensing Design in
Resource-Constrained Wireless Networked Control Systems

作者: Luca Ballotta, Giovanni Peserico, Francesco Zanini

摘要: In this paper, we consider a wireless network of smart sensors (agents) that monitor a dynamical process and send measurements to a base station that performs global monitoring and decision-making. Smart sensors are equipped with both sensing and computation, and can either send raw measurements or process them prior to transmission. Constrained agent resources raise a fundamental latency-accuracy trade-off. On the one hand, raw measurements are inaccurate but fast to produce. On the other hand, data processing on resource-constrained platforms generates accurate measurements at the cost of non-negligible computation latency. Further, if processed data are also compressed, latency caused by wireless communication might be higher for raw measurements. Hence, it is challenging to decide when and where sensors in the network should transmit raw measurements or leverage time-consuming local processing. To tackle this design problem, we propose a Reinforcement Learning approach to learn an efficient policy that dynamically decides when measurements are to be processed at each sensor. Effectiveness of our proposed approach is validated through a numerical simulation with case study on smart sensing motivated by the Internet of Drones.

中文摘要: 在本文中,我们考虑一个智能传感器(代理)的无线网络,该网络监测动态过程,并将测量结果发送到执行全局监测和决策的基站。智能传感器同时具备传感和计算功能,可以发送原始测量值,也可以在传输前进行处理。受约束的代理资源提出了一个基本的延迟-准确性权衡。一方面,原始测量不准确,但生产速度很快。另一方面,资源受限平台上的数据处理以不可忽略的计算延迟为代价生成准确的测量结果。此外,如果处理后的数据也被压缩,则由无线通信引起的延迟对于原始测量可能更高。因此,决定网络中的传感器何时何地传输原始测量值或利用耗时的本地处理是一项挑战。为了解决这个设计问题,我们提出了一种强化学习方法来学习一种有效的策略,该策略动态地决定何时在每个传感器处处理测量。通过数值模拟和无人机互联网驱动的智能传感案例研究,验证了我们提出的方法的有效性

[Downlink:]http://arxiv.org/abs/2204.00703v5


== Open vocabulary detection ==

标题: LinK3D: Linear Keypoints Representation for 3D LiDAR Point Cloud

作者: Yunge Cui, Yinlong Zhang, Jiahua Dong

摘要: Feature extraction and matching are the basic parts of many robotic vision tasks, such as 2D or 3D object detection, recognition, and registration. As is known, 2D feature extraction and matching have already achieved great success. Unfortunately, in the field of 3D, the current methods may fail to support the extensive application of 3D LiDAR sensors in robotic vision tasks due to their poor descriptiveness and inefficiency. To address this limitation, we propose a novel 3D feature representation method: Linear Keypoints representation for 3D LiDAR point cloud, called LinK3D. The novelty of LinK3D lies in that it fully considers the characteristics (such as the sparsity and complexity) of LiDAR point clouds and represents the keypoint with its robust neighbor keypoints, which provide strong constraints in the description of the keypoint. The proposed LinK3D has been evaluated on three public datasets, and the experimental results show that our method achieves great matching performance. More importantly, LinK3D also shows excellent real-time performance, faster than the sensor frame rate at 10 Hz of a typical rotating LiDAR sensor. LinK3D only takes an average of 30 milliseconds to extract features from the point cloud collected by a 64-beam LiDAR and takes merely about 20 milliseconds to match two LiDAR scans when executed on a computer with an Intel Core i7 processor. Moreover, our method can be extended to LiDAR odometry task, and shows good scalability. We release the implementation of our method at https://github.com/YungeCui/LinK3D.

中文摘要: 特征提取和匹配是许多机器人视觉任务的基本部分,如二维或三维物体检测、识别和配准。众所周知,二维特征提取和匹配已经取得了巨大的成功。不幸的是,在3D领域,由于现有方法描述性差、效率低,它们可能无法支持3D激光雷达传感器在机器人视觉任务中的广泛应用。为了解决这一限制,我们提出了一种新的3D特征表示方法:3D激光雷达点云的线性关键点表示,称为LinK3D。LinK3D的新颖之处在于,它充分考虑了激光雷达点云的特性(如稀疏性和复杂性),并用其鲁棒的邻居关键点来表示关键点,这在关键点的描述中提供了强大的约束。在三个公共数据集上对所提出的LinK3D进行了评估,实验结果表明,我们的方法具有很好的匹配性能。更重要的是,LinK3D还显示出出色的实时性能,比典型旋转激光雷达传感器在10Hz下的传感器帧速率更快。LinK3D从64束激光雷达收集的点云中提取特征平均只需30毫秒,在配备英特尔酷睿i7处理器的计算机上执行时,匹配两次激光雷达扫描仅需约20毫秒。此外,我们的方法可以扩展到激光雷达里程计任务,并显示出良好的可扩展性。我们在https://github.com/YungeCui/LinK3D 发布了我们方法的实现

[Downlink:]http://arxiv.org/abs/2206.05927v3

[GitHub:]https://github.com/YungeCui/LinK3D


标题: DC-Net: Divide-and-Conquer for Salient Object Detection

作者: Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik

摘要: In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge that is for predicting the saliency map. We design a novel network, Divide-and-Conquer Network (DC-Net) which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, here is to predict the edge maps with width 4 and location maps of salient objects and then aggregate the feature maps with different semantic information into the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP^2) modules, which have the ability to capture a large number of different scale features with a small number of convolution operations and have the advantages of maintaining high resolution all the time and being able to obtain a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer’s parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets under high efficiency (60 FPS and 55 FPS). Codes and results are available: https://github.com/PiggyJerry/DC-Net.

中文摘要: 在本文中,我们将分而治之(Divide-and-Conquer)思想引入显著对象检测(SOD)任务,以使模型能够学习用于预测显著图的先验知识。我们设计了一种新的网络,即分治网络(DC-Net),它使用两个编码器来解决有助于预测最终显著性图的不同子任务,这里是预测宽度为4的边缘图和显著对象的位置图,然后将具有不同语义信息的特征图聚合到解码器中,以预测最终的显著性图。DC-Net的解码器由我们新设计的两级残差嵌套ASPP(ResASPP^2)模块组成,该模块能够用少量卷积运算捕获大量不同尺度的特征,并具有始终保持高分辨率和能够获得大而紧凑的有效感受野(ERF)的优点。基于分而治之并行计算的优势,我们使用并行加速来加速DC-Net,使其能够在6个LR-SOD和5个HR-SOD数据集上以高效率(60 FPS和55 FPS)获得有竞争力的性能。代码和结果可用:https://github.com/PiggyJerry/DC-Net

[Downlink:]http://arxiv.org/abs/2305.14955v3

[GitHub:]https://github.com/PiggyJerry/DC-Net


标题: Actor-agnostic Multi-label Action Recognition with Multi-modal Query

作者: Anindya Mondal, Sauradip Nag, Joaquin M Prada

摘要: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called ‘actor-agnostic multi-modal multi-label action recognition,’ which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

中文摘要: 由于参与者之间固有的拓扑和明显的差异,现有的动作识别方法通常是特定于参与者的。这需要特定于演员的姿势估计(例如,人类与动物),导致繁琐的模型设计复杂性和高昂的维护成本。此外,他们通常专注于单独学习视觉模态和单标签分类,而忽略了其他可用的信息来源(例如,类名文本)和多个动作的同时发生。为了克服这些限制,我们提出了一种新的方法,称为“行动者不可知的多模态多标签动作识别”,它为包括人类和动物在内的各种行动者提供了统一的解决方案。我们在基于Transformer的对象检测框架(例如,DETR)中进一步提出了一种新的多模态语义查询网络(MSQNet)模型,其特征是利用视觉和文本模态更好地表示动作类。消除特定于演员的模型设计是一个关键优势,因为它完全消除了对演员姿势估计的需要。在五个公开可用的基准上进行的广泛实验表明,我们的MSQNet在人类和动物的单标签和多标签动作识别任务上始终优于现有的演员特定替代方法,优势高达50%。代码可在https://github.com/mondalanindya/MSQNet 获取

[Downlink:]http://arxiv.org/abs/2307.10763v3

[GitHub:]https://github.com/mondalanindya/MSQNet


标题: Generalizing Medical Image Representations via Quaternion Wavelet
Networks

作者: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini

摘要: Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations being provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be involved with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be involved as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets, diverse image analysis, and synthesis tasks including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework that improves network performance while being flexible to be adopted in manifold scenarios and robust to domain shifts. The full code is available at: https://github.com/ispamm/QWT.

中文摘要: 由于来自不同来源和用于各种任务的数据集的可用性不断增加,神经网络的可推广性正成为一个广泛的研究领域。在处理医学数据时,这个问题更为广泛,因为缺乏方法标准导致不同成像中心提供的或使用各种设备和辅因子获取的数据存在很大差异。为了克服这些限制,我们引入了一种新颖的、可推广的、数据和任务不可知的框架,能够从医学图像中提取显著特征。所提出的四元数小波网络(QUAVE)可以很容易地与任何预先存在的医学图像分析或合成任务集成,并且它可以涉及实数、四元数或超复值模型,将其应用推广到单通道数据。QUAVE首先通过四元数小波变换提取不同的子带,得到低频/近似带和高频/细粒度特征。然后,它对要涉及的最具代表性的子带集进行加权,作为用于图像处理的任何其他神经模型的输入,取代标准数据样本。我们进行了广泛的实验评估,包括不同的数据集、不同的图像分析和合成任务,包括重建、分割和模态转换。我们还结合实数和四元数值模型来评估QUAVE。结果证明了所提出的框架的有效性和可推广性,该框架提高了网络性能,同时在多种场景中灵活采用,并对域转移具有鲁棒性。完整代码位于:https://github.com/ispamm/QWT

[Downlink:]http://arxiv.org/abs/2310.10224v2

[GitHub:]https://github.com/ispamm/QWT
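
作为类比示意(QUAVE实际使用的是四元数小波变换,标准库中没有现成实现):下面用PyWavelets的普通二维离散小波变换演示“提取低频近似子带与高频细节子带,再交给下游网络”这一思路,仅说明子带的概念,并非论文方法本身。

```python
# 类比示意:用普通2D DWT提取子带(QUAVE使用四元数小波,此处只演示子带思想)
import numpy as np
import pywt

def wavelet_subbands(image, wavelet="haar"):
    """image: 二维numpy数组。返回低频近似子带与三个高频细节子带。"""
    approx, (horiz, vert, diag) = pywt.dwt2(image, wavelet)
    # 低频子带保留整体结构,高频子带保留边缘等细粒度信息,可作为下游模型的输入
    return {"low": approx, "high": np.stack([horiz, vert, diag])}
```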


标题: WidthFormer: Toward Efficient Transformer-based BEV View Transformation

作者: Chenhongyi Yang, Tianwei Lin, Lichao Huang

摘要: In this work, we present WidthFormer, a novel transformer-based Bird’s-Eye-View (BEV) 3D detection method tailored for real-time autonomous-driving applications. WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy. In this work, we propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information, which enables our model to generate high-quality BEV representations with only a single transformer decoder layer. This mechanism is also beneficial for existing sparse 3D object detectors. Inspired by the recently-proposed works, we further improve our model’s efficiency by vertically compressing the image features when serving as attention keys and values. We also introduce two modules to compensate for potential information loss due to feature compression. Experimental evaluation on the widely-used nuScenes 3D object detection benchmark demonstrates that our method outperforms previous approaches across different 3D detection architectures. More importantly, our model is highly efficient. For example, when using 256×704 input images, it achieves 1.5 ms and 2.8 ms latency on NVIDIA 3090 GPU and Horizon Journey-5 edge computing chips, respectively. Furthermore, WidthFormer also exhibits strong robustness to different degrees of camera perturbations. Our study offers valuable insights into the deployment of BEV transformation methods in real-world, complex road environments. Code is available at https://github.com/ChenhongyiYang/WidthFormer.

中文摘要: 在这项工作中,我们提出了WidthFormer,这是一种新的基于Transformer的鸟瞰图(BEV)3D检测方法,专为实时自动驾驶应用而设计。WidthFormer在计算上高效、稳健,不需要任何特殊的工程部署。在这项工作中,我们提出了一种新的3D位置编码机制,该机制能够准确封装3D几何信息,使我们的模型能够仅用单个Transformer解码器层生成高质量的BEV表示。这种机制对于现有的稀疏3D对象检测器也是有益的。受最近提出的工作的启发,我们通过在充当注意力键和值时垂直压缩图像特征,进一步提高了模型的效率。我们还介绍了两个模块来补偿由于特征压缩而造成的潜在信息损失。对广泛使用的nuScenes 3D对象检测基准的实验评估表明,我们的方法在不同的3D检测架构中优于以前的方法。更重要的是,我们的模型非常高效。例如,当使用256×704输入图像时,它在NVIDIA 3090 GPU和Horizon Journey-5边缘计算芯片上分别实现了1.5毫秒和2.8毫秒的延迟。此外,WidthFormer对不同程度的相机扰动也表现出较强的鲁棒性。我们的研究为在现实世界复杂的道路环境中部署BEV视图变换方法提供了宝贵的见解。代码位于https://github.com/ChenhongyiYang/WidthFormer

[Downlink:]http://arxiv.org/abs/2401.03836v3

[GitHub:]https://github.com/ChenhongyiYang/WidthFormer


标题: IODeep: an IOD for the introduction of deep learning in the DICOM
standard

作者: Salvatore Contino, Luca Cruciata, Orazio Gambino

摘要: Background and Objective: In recent years, Artificial Intelligence (AI) and in particular Deep Neural Networks (DNN) became a relevant research topic in biomedical image segmentation due to the availability of more and more data sets along with the establishment of well known competitions. Despite the popularity of DNN based segmentation on the research side, these techniques are almost unused in the daily clinical practice even if they could support effectively the physician during the diagnostic process. Apart from the issues related to the explainability of the predictions of a neural model, such systems are not integrated in the diagnostic workflow, and a standardization of their use is needed to achieve this goal. Methods: This paper presents IODeep, a new DICOM Information Object Definition (IOD) aimed at storing both the weights and the architecture of a DNN already trained on a particular image dataset that is labeled as regards the acquisition modality, the anatomical region, and the disease under investigation. Results: The IOD architecture is presented along with a DNN selection algorithm from the PACS server based on the labels outlined above, and a simple PACS viewer purposely designed for demonstrating the effectiveness of the DICOM integration, while no modifications are required on the PACS server side. Also a service based architecture in support of the entire workflow has been implemented. Conclusion: IODeep ensures full integration of a trained AI model in a DICOM infrastructure, and it also enables a scenario where a trained model can be either fine-tuned with hospital data or trained in a federated learning scheme shared by different hospitals. In this way AI models can be tailored to the real data produced by a Radiology ward thus improving the physician decision making process. Source code is freely available at https://github.com/CHILab1/IODeep.git

中文摘要: 背景和目的:近年来,随着越来越多的数据集的可用性和众所周知的竞争的建立,人工智能(AI),特别是深度神经网络(DNN)成为生物医学图像分割的相关研究课题。尽管基于DNN的分割在研究方面很受欢迎,但这些技术在日常临床实践中几乎没有使用过,即使它们可以在诊断过程中有效地支持医生。除了与神经模型预测的可解释性相关的问题外,这些系统没有集成在诊断工作流程中,需要对其使用进行标准化以实现这一目标。方法:本文提出了IODeep,一种新的DICOM信息对象定义(IOD),旨在存储已经在特定图像数据集上训练的DNN的权重和架构,该图像数据集被标记为采集模式、解剖区域和正在研究的疾病。结果:IOD体系结构以及基于上述标签的PACS服务器的DNN选择算法,以及一个专门设计用于演示DICOM集成有效性的简单PACS查看器,而不需要在PACS服务器端进行修改。此外,还实现了支持整个工作流的基于服务的体系结构。结论:IODeep确保了训练后的人工智能模型在DICOM基础设施中的完全集成,它还实现了一种场景,即训练后的模型可以根据医院数据进行微调,也可以在不同医院共享的联合学习方案中进行训练。通过这种方式,人工智能模型可以根据放射科病房产生的真实数据进行定制,从而改进医生的决策过程。源代码免费提供于https://github.com/CHILab1/IODeep.git

[Downlink:]http://arxiv.org/abs/2311.16163v2

[GitHub:]https://github.com/CHILab1/IODeep.git

