Abstract

Interactive segmentation aims to generate high-quality pixel-level predictions from a few user-provided clicks, and is gaining attention for its convenience in segmentation data annotation. Users iteratively refine the prediction by adding clicks until the result is satisfactory. Existing interactive methods usually transform the clicks into a set of localization maps via Euclidean distance computation or RGB texture extraction to guide the segmentation, making click transformation a core module in interactive segmentation networks. However, on human images, where large poses, occlusions, and poor illumination are prevalent, prior transformation methods tend to cause uncorrectable overlap across localization maps, i.e., one click corresponds to multiple transformed values at the same position in different localization map channels. Such overlap makes it difficult to form a good match among human parts and limits interaction efficiency. Furthermore, inappropriately transformed information is hard to refine under a static transformation scheme, i.e., one based on fixed formulas or RGB textures, which is out of tune with the dynamically refined interaction process. Hence, we design a dynamic transformation scheme for interactive human parsing (IHP) named Dynamic Interaction Dilation Net (DID-Net), which serves as an initial attempt to break the limitations of static transformation while capturing long-range dependencies of clicks within each human part. Specifically, we construct a Dynamic Dilation Module (DD-Module) that dilates clicks radially in several directions, assisted by human body edge detection. The continually refined edges guide the dilation in each interaction iteration, improving its quality and thereby better fitting user intention. Furthermore, we propose an Adaptive Interaction Excitation Block (AIE-Block) that exploits the potential semantic clues buried in the dilated clicks and emphasizes the semantic expression of each human part through feature recalibration. Our DID-Net achieves state-of-the-art performance on three public human parsing benchmarks.
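For context, below is a minimal sketch of the static click transformation that the abstract critiques: each part's clicks are converted into a localization map channel by a fixed Euclidean distance formula. The function and variable names (make_distance_maps, clicks, truncate) are hypothetical, chosen for illustration; this is not the paper's DID-Net, only the conventional fixed-formula encoding whose channel overlap motivates it.

```python
import numpy as np

def make_distance_maps(clicks, height, width, truncate=255.0):
    """Static click transformation: one Euclidean distance map per human part.

    clicks: dict mapping a part label to a list of (row, col) click positions.
    Returns an array of shape (num_parts, height, width), where each channel
    holds the distance from every pixel to its nearest click of that part.
    Because the formula is fixed, clicks on nearby parts (large poses,
    occlusions) yield overlapping responses: one position can receive small
    values in several channels at once, and later interactions cannot
    correct the earlier encoding.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    maps = []
    for part, positions in clicks.items():
        if not positions:
            # No clicks for this part yet: a flat, maximally distant channel.
            maps.append(np.full((height, width), truncate))
            continue
        # Distance from every pixel to each click, then keep the minimum.
        dists = [np.sqrt((rows - r) ** 2 + (cols - c) ** 2) for r, c in positions]
        maps.append(np.minimum(np.min(dists, axis=0), truncate))
    return np.stack(maps)

# Example: two clicks on "arm" and one on "torso" in a 64x64 image.
maps = make_distance_maps({"arm": [(10, 12), (20, 40)], "torso": [(32, 32)]}, 64, 64)
print(maps.shape)  # (2, 64, 64)
```

The dynamic scheme proposed in the paper replaces this fixed formula with an edge-guided radial dilation of clicks that is re-estimated at every interaction iteration, so the encoding itself improves as the user refines the result.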
