Abstract

Automatic architecture search is an efficient way to discover novel neural networks, but it has mostly been applied to pure vision or natural language tasks. Cross-modality tasks, however, depend heavily on the associative mechanisms between visual and language models, rather than merely on the best-performing convolutional neural network (CNN) or recurrent neural network (RNN). In this work, the intermediary associative connection is approximated by the inner topological structure of an RNN cell, which is then evolved by an evolutionary algorithm using image captioning as the proxy task. On the MSCOCO dataset, the proposed algorithm, starting from scratch, discovers more than 100 RNN variants that all score above 100 on CIDEr and above 31 on BLEU4; the best variant reaches 101.4 and 32.6, respectively. Additionally, several previously unknown, interesting patterns, as well as many existing powerful structures, are found in the generated RNNs. The operation and connection patterns in the generated architectures are analyzed to understand how language modeling in the cross-modality setting differs from that in general RNNs.
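
To make the idea of evolving a cell's inner topology concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cell is encoded as a list of nodes, each holding an operation and the indices of the earlier values it reads, and it uses tournament selection with single-node mutation. The operation set, the encoding, the function names, and the `toy_fitness` stand-in are all hypothetical; in the setting described above, fitness would instead come from training a captioning model with the candidate cell and scoring it (for example with CIDEr).

```python
import random

# Hypothetical operation set; the abstract does not list the actual operations.
OPS = ["tanh", "sigmoid", "relu", "identity", "add", "mul"]
UNARY = {"tanh", "sigmoid", "relu", "identity"}


def random_node(index, rng):
    """Create node `index`. Value ids 0 and 1 are the cell inputs
    (x_t and h_prev); node k produces value id k + 2. A node may only
    read ids below its own, which keeps the graph acyclic."""
    op = rng.choice(OPS)
    n_inputs = 1 if op in UNARY else 2
    inputs = tuple(rng.randrange(index + 2) for _ in range(n_inputs))
    return (op, inputs)


def random_cell(n_nodes, rng):
    return [random_node(i, rng) for i in range(n_nodes)]


def mutate(cell, rng):
    """Resample one randomly chosen node (its operation and its inputs)."""
    child = list(cell)
    i = rng.randrange(len(child))
    child[i] = random_node(i, rng)
    return child


def is_valid(cell):
    """Check that every node reads only from earlier value ids."""
    return all(src < i + 2 for i, (_, inputs) in enumerate(cell) for src in inputs)


def evolve(fitness, n_nodes=6, population_size=20, generations=50,
           tournament=5, seed=0):
    """Tournament-selection evolution over cell encodings."""
    rng = random.Random(seed)
    population = [random_cell(n_nodes, rng) for _ in range(population_size)]
    scores = [fitness(cell) for cell in population]
    for _ in range(generations):
        # Pick the fittest of a random subset as the parent.
        contenders = rng.sample(range(population_size), tournament)
        parent = max(contenders, key=lambda k: scores[k])
        child = mutate(population[parent], rng)
        child_score = fitness(child)
        # Replace the current worst individual if the child is no worse.
        worst = min(range(population_size), key=lambda k: scores[k])
        if child_score >= scores[worst]:
            population[worst], scores[worst] = child, child_score
    best = max(range(population_size), key=lambda k: scores[k])
    return population[best], scores[best]


def toy_fitness(cell):
    """Placeholder score (assumption): counts gating-like operations.
    A real run would train and evaluate a captioning model instead."""
    return sum(op in ("sigmoid", "mul") for op, _ in cell)


if __name__ == "__main__":
    best_cell, best_score = evolve(toy_fitness)
    print(best_score)
    for node_id, (op, inputs) in enumerate(best_cell, start=2):
        print(node_id, op, inputs)
```

Restricting each node to read only from earlier value ids is one simple way to guarantee that every mutated cell remains a valid feed-forward graph within a time step; other encodings are possible.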
