Abstract
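We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. The separately trained interaction module converts user interactions into an object mask, which our propagation module then propagates temporally, using a novel top-k filtering strategy when reading the space-time memory. To take the user's intent into account effectively, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction; these masks are aligned with the target frames by means of the space-time memory. We evaluate our method qualitatively and quantitatively on DAVIS with different forms of user interaction (e.g., scribbles, clicks) and show that it outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interaction. We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames to accompany our source code and facilitate future research.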
Introduction
interactive VOS(iVOS)
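Characteristics: interactive VOS methods take user interactions (e.g., scribbles or clicks) as input, and the user can iteratively refine the result until satisfied.
It comprises two tasks: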
- interaction understanding
- temporal propagation
Existing Problem
(1) The strong coupling of interaction understanding and temporal propagation limits the form of user interaction (e.g., scribbles only) and makes training difficult. Attempts to decouple the two tasks fail to reach state-of-the-art accuracy because the user's intent cannot be adequately taken into account in the propagation process.
(2) Naive decoupling may lose the user's intent, because the original interaction is no longer available in the propagation stage.
Solution
We present a decoupled modular framework to address the iVOS problem.
Contributions
- We innovate on the decoupled interaction-propagation framework and show that this approach is simple, effective, and generalizable.
- We propose a novel lightweight top-k filtering scheme for the attention-based memory read operation in mask generation during propagation.
- We propose a novel difference-aware fusion module that faithfully captures the user's intent, which improves iVOS accuracy and reduces the amount of user interaction.
- We contribute a large-scale synthetic VOS dataset with 4.8M frames to accompany our source code and facilitate future research.
Related Work
Progress in iVOS is shown below:
Semi-Supervised Video Object Segmentation
Definition: segment a specific object throughout a video given only a fully annotated mask in the first frame.
Interactive Video Object Segmentation (iVOS)
Focus:
(1) scribble interaction
(2) click interaction
Interactive Image Segmentation
Method
Initial Work
Initially, the user selects and interactively annotates one frame (e.g., using scribbles or clicks) to produce a mask.
MiVOS Overview
Notation
(1) $r$ denotes the current interaction round.
(2) The index of the frame the user interacts with in the $r$-th round is $t_r$.
(3) The mask results of the $r$-th round are $M^r$.
(4) The mask of the individual $j$-th frame is denoted $M^r_j$.
Core Component
Interaction-to-mask: lets the user obtain real-time feedback and reach a satisfactory result on a single frame.
Mask propagation: the corrected mask is propagated bidirectionally.
Difference-aware fusion: uses the two mask sequences (before and after the interaction) while avoiding possible decay or loss of the user's intent.
How the user's intent is captured: by the difference in the selected mask before and after the user interaction.
Figure
Interaction-to-Mask
Scribble-to-Mask(S2M)
Goal: produce a single-image segmentation in real time given input scribbles
backbone: DeepLabV3+ semantic segmentation network
Local Control
Previous state-of-the-art approach (f-BRS): it may harm the global result when only local fine adjustment is needed toward the end of the segmentation process.
Source of the previous state-of-the-art approach:
Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin. f-BRS: Rethinking backpropagating refinement for interactive segmentation. In CVPR, 2020.
Our approach: it is straightforward to assert local control by restricting the interactive algorithm to a user-specified region.
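The idea of local control can be illustrated with a minimal sketch. It assumes the region is an axis-aligned box and that the interactive algorithm is available as a callable; `refine_in_region`, `box`, and `refine_fn` are hypothetical names used for illustration only and are not taken from the paper or its code.

```python
import numpy as np

def refine_in_region(mask, box, refine_fn):
    """Apply refine_fn only inside a user-specified box (hypothetical helper).

    mask:      (H, W) array, the current segmentation mask.
    box:       (top, left, bottom, right), bottom/right exclusive.
    refine_fn: callable mapping a cropped mask to a refined crop of the
               same shape; stands in for the interactive algorithm.
    """
    top, left, bottom, right = box
    out = mask.copy()  # pixels outside the box are left untouched
    out[top:bottom, left:right] = refine_fn(mask[top:bottom, left:right])
    return out
```

Because everything outside the box is copied through unchanged, a late local adjustment cannot disturb the global result.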
Comparison of the two approaches above:
Temporal Propagation
Goal: track the object and produce the corresponding masks in subsequent frames.
Memory Read with Top-k Filtering
(1) Compute the affinity.
$F \in \mathbb{R}^{THW \times HW}$ represents the affinity between a query position and a memory position.
(2) Filter the affinities so that only the top-k entries are kept.
Effect: effectively removes noise regardless of the sequence length.
Advantage: increases robustness and overcomes the overhead of top-k.
(3) For query position $j$, the feature $m_j$ is read from memory as a sum of the memory value features, weighted by the normalized top-k affinities.
(4) Concatenate the read features with $v^Q$.
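Steps (1) to (4) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a dot-product affinity between key features and a softmax restricted to the k retained entries, and the names `topk_memory_read`, `mem_key`, `mem_val`, and `qry_key` are made up for this example.

```python
import numpy as np

def topk_memory_read(mem_key, mem_val, qry_key, k):
    """Read memory features with top-k filtered attention (sketch).

    mem_key: (Ck, N) memory keys, N = T*H*W memory positions.
    mem_val: (Cv, N) memory values.
    qry_key: (Ck, M) query keys, M = H*W query positions.
    Returns the read features (Cv, M) and the weights W (N, M).
    """
    # (1) Affinity F[i, j] between memory position i and query position j
    #     (dot product is an assumption of this sketch).
    affinity = mem_key.T @ qry_key                              # (N, M)

    # (2) For every query position keep only the k largest affinities.
    top_idx = np.argpartition(affinity, -k, axis=0)[-k:]       # (k, M)
    top_val = np.take_along_axis(affinity, top_idx, axis=0)    # (k, M)

    # Softmax over the kept entries only; all other weights stay zero.
    top_val = np.exp(top_val - top_val.max(axis=0, keepdims=True))
    top_val /= top_val.sum(axis=0, keepdims=True)
    weights = np.zeros_like(affinity)
    np.put_along_axis(weights, top_idx, top_val, axis=0)

    # (3) m_j = sum_i mem_val[:, i] * W[i, j]
    read = mem_val @ weights                                    # (Cv, M)
    return read, weights

def concat_with_query_value(read, qry_val):
    # (4) Stack the read features with the query value v^Q along channels.
    return np.concatenate([read, qry_val], axis=0)
```

Each column of `weights` has exactly k non-zero entries that sum to 1, so the number of memory positions contributing to a query position stays fixed however many frames are in memory. With k equal to the number of memory positions, the sketch reduces to an ordinary softmax read.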
the process is shown below:
Propagation strategy
our propagation scheme:
Difference-Aware Fusion
(1) Compute the positive and negative changes separately as two masks $D^+$ and $D^-$.
Note: $(\cdot)_+$ denotes $\max(\cdot, 0)$.
(2) Compute the aligned masks.
Note: $W$ comes from step (2) of Memory Read with Top-k Filtering.
(3) Feed these features into a simple five-layer residual network that ends in a sigmoid and outputs the final fused mask.
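Steps (1) and (2) can be sketched as below; the fusion network of step (3) is omitted. The sketch assumes that the positive change is "after minus before" on the interacted frame and that the weights are indexed as `weights[i, j]` (interacted-frame position i, target-frame position j). Both conventions, and the function names, are assumptions of this example and not confirmed by the notes above.

```python
import numpy as np

def difference_masks(mask_before, mask_after):
    """Split the change on the interacted frame into D+ and D-.

    mask_before, mask_after: (H, W) float masks of the interacted frame
    before and after the user interaction.
    """
    d_pos = np.maximum(mask_after - mask_before, 0.0)  # newly added area
    d_neg = np.maximum(mask_before - mask_after, 0.0)  # removed area
    return d_pos, d_neg

def align_to_target(diff, weights):
    """Transfer a difference mask to the target frame.

    diff:    (H, W) difference mask on the interacted frame.
    weights: (H*W, H*W) attention weights, weights[i, j] linking
             interacted-frame position i to target-frame position j
             (indexing is an assumption of this sketch).
    """
    h, w = diff.shape
    # aligned[j] = sum_i weights[i, j] * diff[i]
    return (weights.T @ diff.reshape(-1)).reshape(h, w)
```

With identity weights the aligned mask equals the input difference mask, which is a quick sanity check for the indexing convention.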
Mechanism of the difference-aware fusion module:
Experiment
Performance on the DAVIS interactive validation set:
Conclusion
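We propose MiVOS, a novel decoupled approach consisting of three modules: Interaction-to-Mask, Propagation, and Difference-Aware Fusion. By decoupling interaction from propagation, MiVOS is versatile and not limited by the type of interaction. In turn, the proposed fusion module reconciles interaction and propagation by faithfully capturing the user's intent and mitigating the information lost in the decoupling process, which makes MiVOS both accurate and efficient. We hope that MiVOS can inspire and spark future research in iVOS.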