The past few years have witnessed considerable effort devoted to translating images from one domain to another, mainly aiming at editing global style. Here, we focus on a more general case, selective image translation (SLIT), under an unsupervised setting. SLIT essentially operates as a shunt mechanism: it learns gates that manipulate only the contents of interest (CoIs), which can be either local or global, while leaving the irrelevant parts unchanged. Existing methods typically rely on the flawed implicit assumption that CoIs are separable at arbitrary levels, ignoring the entangled nature of DNN representations, which leads to unwanted changes and learning inefficiency. In this work, we revisit SLIT from an information-theoretical perspective and introduce a novel framework equipped with two opposing forces that disentangle the visual features. One force encourages independence between spatial locations on the features, while the other unites multiple locations into a "block" that jointly characterizes an instance or attribute that no single location can characterize alone. Importantly, this disentanglement paradigm can be applied to the visual features of any layer, enabling shunting at arbitrary feature levels, a significant advantage not explored in existing works. Extensive evaluation and analysis confirm that our approach significantly outperforms state-of-the-art baselines.
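The abstract does not specify the concrete objectives, but as a minimal illustrative sketch the two opposing forces could be realized as a decorrelation term over spatial locations plus a cohesion term that binds gated locations into one block. Everything below (the function names, the soft gating mask, and the particular loss forms) is our assumption, not the paper's actual implementation:

```python
# Hypothetical sketch of the two opposing disentanglement forces described
# in the abstract; loss forms and names are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def independence_loss(feat: torch.Tensor) -> torch.Tensor:
    """Encourage independence between spatial locations of a feature map.

    feat: (B, C, H, W) features from any layer. Independence is approximated
    here by penalizing off-diagonal cross-correlation between the
    location-wise feature vectors.
    """
    b, c, h, w = feat.shape
    x = feat.flatten(2).transpose(1, 2)                    # (B, HW, C)
    x = F.normalize(x - x.mean(dim=1, keepdim=True), dim=-1)
    corr = torch.bmm(x, x.transpose(1, 2))                 # (B, HW, HW)
    off_diag = corr - torch.diag_embed(torch.diagonal(corr, dim1=1, dim2=2))
    return off_diag.pow(2).mean()

def block_cohesion_loss(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Unite locations inside a soft 'block' covering one instance/attribute.

    mask: (B, 1, H, W) soft gate in [0, 1] selecting the contents of interest.
    Gated locations are pulled toward the block's mean feature so that the
    block jointly characterizes what no single location can alone.
    """
    denom = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    block_mean = (feat * mask).sum(dim=(2, 3), keepdim=True) / denom
    return (mask * (feat - block_mean).pow(2)).mean()

def disentangle_loss(feat, mask, w_ind=1.0, w_blk=1.0):
    # The two terms oppose each other: independence pushes locations apart,
    # while cohesion binds the gated locations into a single block.
    return w_ind * independence_loss(feat) + w_blk * block_cohesion_loss(feat, mask)
```

In practice such terms would be balanced against the translation objectives themselves; the weights `w_ind` and `w_blk` are placeholders for that trade-off.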