Multi-task architecture learning has achieved significant success by learning optimal sharing architectures across tasks. However, previous works that learn branched architectures for different tasks can yield unsatisfying multi-task performance, as not all branches are relevant to a specific task. Task-relevant architectures can be sparse, covering only a subset of the channels or layers in the full architecture (i.e., a sub-network). In addition, most previous works rely on a heuristic architecture selection procedure that cannot support continuous architecture optimization. To this end, in this paper, we propose dual-mask, a progressively sparse multi-task architecture learning method. Starting from a task-free architecture, it identifies the informative features at two levels, channels and layers, for each task, while suppressing conflicting or noisy parts in a differentiable manner, so that better task-specific sub-networks are captured. Specifically, the channel and layer selection modules produce hybrid masks of binary and real values, designed to pick salient channels and layers for each task. To jointly optimize the masks with the model parameters, we propose an importance-guided relaxation method for solving the stochastic binary optimization problem, after which the interfering or noisy parts can be pruned by the masks. Additionally, a progressive training strategy with continuation is provided that gradually sparsifies the task-specific sub-networks. Experiments show that dual-mask achieves superior performance over SOTA multi-task methods.
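The abstract does not specify the exact form of the importance-guided relaxation, so the sketch below only illustrates one common way to make a binary channel mask differentiable: a straight-through sigmoid gate in PyTorch, where hard 0/1 values are used in the forward pass and gradients flow through the real-valued logits. All names (`BinaryMask`, `temperature`) are illustrative assumptions, not the paper's API; annealing `temperature` downward over training is one plausible reading of "progressive training with continuation".

```python
import torch
import torch.nn as nn

class BinaryMask(nn.Module):
    """Relaxed binary mask over channels (or layers) for one task.

    Forward pass applies hard 0/1 gates; the backward pass routes
    gradients through the underlying real-valued logits
    (straight-through estimator). Illustrative only, not the
    paper's importance-guided relaxation.
    """
    def __init__(self, num_units, init_logit=2.0):
        super().__init__()
        # Real-valued logits; sigmoid(logit) acts as a keep probability.
        self.logits = nn.Parameter(torch.full((num_units,), init_logit))

    def forward(self, x, temperature=1.0):
        probs = torch.sigmoid(self.logits / temperature)  # soft mask in (0, 1)
        hard = (probs > 0.5).float()                      # binarized mask
        # Straight-through trick: hard values forward, soft gradients backward.
        mask = hard + probs - probs.detach()
        # Broadcast over (batch, channels, H, W) feature maps.
        return x * mask.view(1, -1, 1, 1)

# Usage: gate the channels of a conv feature map for one task.
feat = torch.randn(8, 64, 32, 32)         # (batch, channels, H, W)
channel_mask = BinaryMask(num_units=64)
gated = channel_mask(feat)
sparsity = 1.0 - (torch.sigmoid(channel_mask.logits) > 0.5).float().mean()
```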