Visual Semantic Embedding (VSE) is a primary model for cross-modal retrieval, and the global feature aggregator is a crucial component of it. In recent research, the General Pooling Operator (GPO) aggregator, which aggregates by weighting features reconstructed from the local feature set, has helped related models achieve good retrieval performance. However, the reason for its effectiveness remains to be explored. To put aggregator design on a more principled footing, we analyze this reason from the perspective of feature space. For each data sample, the local feature set forms a hypercube containing abundant information about the sample, and the feature learned by GPO measures the hypercube, thereby representing the sample. The geometric structure of the hypercube implies that the set of all points within it is convex, so a feature learned by weighted aggregation is an interior point of the hypercube. However, using an interior point to measure the hypercube causes problems in feature representation and model optimization, and the weight computation reduces retrieval efficiency. For example, the features of a related pair may be far apart, while those of an unrelated pair may be close. To measure the hypercube more clearly and alleviate these problems, we propose the Hypercube Pooling (HCP) aggregator. Specifically, HCP concatenates the Max and Min Pooling features to form the global feature. This aggregation method has multiple advantages; e.g., the learned global feature represents all hyperplanes of the hypercube, which contain critical information and the hypercube's geometric structure. Moreover, HCP applies normalization before concatenation and halves the usual margin in the loss function to avoid the gradient loss caused by differences in feature values and by the doubling of dimensionality. 
Experimental results on the Flickr30K and MSCOCO datasets show that the HCP model achieves excellent retrieval performance with high efficiency, confirming the correctness of the spatial analysis and the effectiveness of the HCP aggregator.
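
The aggregation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `l2_normalize`, `hypercube_pool`, and `weighted_pool` are hypothetical, L2 normalization is assumed as the normalization applied before concatenation, and fixed non-negative weights stand in for the weights a GPO-style aggregator would learn. The sketch only shows that a convex weighted combination stays inside the per-dimension bounds of the local feature set, whereas the Max/Min concatenation records those bounds directly.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed normalization; eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def hypercube_pool(local_feats):
    """Hypothetical HCP-style aggregator.

    local_feats: array of shape (n_local, dim), one row per local feature.
    Returns a vector of shape (2 * dim,): the normalized Max Pooling feature
    followed by the normalized Min Pooling feature.
    """
    max_feat = local_feats.max(axis=0)  # upper bound of the hypercube per dimension
    min_feat = local_feats.min(axis=0)  # lower bound of the hypercube per dimension
    # Normalize each half before concatenating, as the abstract describes.
    return np.concatenate([l2_normalize(max_feat), l2_normalize(min_feat)])


def weighted_pool(local_feats, weights):
    """Stand-in for weighted aggregation: a convex combination of local features."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # non-negative weights that sum to one
    return weights @ local_feats


if __name__ == "__main__":
    feats = np.array([[1.0, -2.0, 0.5],
                      [0.0, 4.0, -1.0],
                      [3.0, 1.0, 2.0]])
    print("weighted (interior) feature:", weighted_pool(feats, [0.2, 0.3, 0.5]))
    print("hypercube-pooled feature:  ", hypercube_pool(feats))
```

Because each normalized half has unit norm, the concatenated feature has twice the dimensionality of a single pooled feature, which is the setting in which the abstract's halved margin applies.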