Abstract

Neural-network-based image and video captioning can be substantially improved by architectures that exploit specialized features derived from the scene context, objects, and locations. Accuracy is further improved by a novel discriminatively trained evaluator network that selects the best caption from among those produced by an ensemble of caption generator networks.
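To make the selection step concrete, the sketch below shows one way an evaluator network could score candidate captions from an ensemble and pick the highest-scoring one. This is a minimal illustrative example, not the authors' implementation: the module names, feature dimensions, and scoring scheme are all assumptions.

```python
# Minimal sketch (assumed, not the paper's code): a discriminatively trained
# evaluator scores (image, caption) pairs and the best ensemble caption wins.
import torch
import torch.nn as nn

class CaptionEvaluator(nn.Module):
    """Scores how well a candidate caption matches the image features."""
    def __init__(self, image_dim=2048, caption_dim=512, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(image_dim + caption_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar compatibility score
        )

    def forward(self, image_feat, caption_feat):
        # image_feat: (B, image_dim), caption_feat: (B, caption_dim)
        joint = torch.cat([image_feat, caption_feat], dim=-1)
        return self.scorer(joint).squeeze(-1)

def select_best_caption(evaluator, image_feat, candidate_feats, candidate_texts):
    """Pick the ensemble caption the evaluator scores highest for one image."""
    with torch.no_grad():
        scores = evaluator(image_feat.expand(len(candidate_texts), -1), candidate_feats)
    return candidate_texts[int(scores.argmax())]

# Toy usage with random tensors standing in for real encoder outputs.
evaluator = CaptionEvaluator()
image_feat = torch.randn(1, 2048)
candidates = ["a dog runs on the beach", "a person walks a dog", "a dog plays in sand"]
candidate_feats = torch.randn(len(candidates), 512)
print(select_best_caption(evaluator, image_feat, candidate_feats, candidates))
```

In practice the evaluator would be trained discriminatively, e.g. to rank ground-truth captions above generated or mismatched ones, so that its scores reflect caption quality rather than the random initialization used in this toy example.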
