The past few years have witnessed significant advances in generative artificial intelligence (AI) led by large language models (LLMs), whose applications demonstrate capabilities on tasks that were previously unattainable. Numerous efforts now explore an even more exciting prospect: employing LLMs not merely as language processors but as a starting point toward AI agents that can adapt to diverse tasks and complex scenarios. This paper surveys state-of-the-art strategies for deploying such models to generate both text-based domain-specific content and multimodal outputs embodied in interactions with web applications, industrial software, and ultimately the physical world. Two approaches to implementing multimodality are delineated: direct embedding of multimodal data and conversion of multimodal data to text. Both have seen extensive use in active research areas such as image processing, embodied action, and software automation. Representative cases from these categories are reviewed with a focus on their input/output modalities, methods of processing multimodal data, and output quality.
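To make the distinction between the two multimodality strategies concrete, the following is a minimal conceptual sketch, not drawn from the surveyed systems themselves: one path converts an image to text before prompting a text-only LLM, the other passes the raw image to a natively multimodal model. The interfaces `Captioner`, `TextLLM`, and `MultimodalLLM` are hypothetical stand-ins for whatever captioner or model a given deployment uses.

```python
# Illustrative sketch only; all interfaces below are hypothetical placeholders.
from typing import Protocol


class Captioner(Protocol):
    def caption(self, image_bytes: bytes) -> str: ...


class TextLLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class MultimodalLLM(Protocol):
    def complete(self, prompt: str, images: list[bytes]) -> str: ...


def answer_via_text_conversion(question: str, image: bytes,
                               captioner: Captioner, llm: TextLLM) -> str:
    """Approach 1: convert multimodal data to text, then prompt a text-only LLM."""
    caption = captioner.caption(image)
    prompt = f"Image description: {caption}\nQuestion: {question}"
    return llm.complete(prompt)


def answer_via_direct_embedding(question: str, image: bytes,
                                llm: MultimodalLLM) -> str:
    """Approach 2: embed the multimodal data directly via a multimodal model."""
    return llm.complete(question, images=[image])
```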