Abstract: This paper introduces the Multimodal-driven Computer Interface, a framework that enables multimodal models to interact with and control a computer. The framework receives user input through various modalities (e.g., speech, text, gestures), combines it through a multimodal fusion algorithm, and generates appropriate actions using a decision-making module; these actions are then executed on the computer through platform-specific APIs. The framework is currently integrated with a multimodal LLM and operates on Windows, macOS, and Linux systems. While it demonstrates promising results, challenges remain in improving the accuracy of mouse-click location prediction and in adapting to diverse user needs and preferences. The Multimodal-driven Computer Interface has the potential to revolutionize human-computer interaction, opening up new possibilities for accessibility, productivity, and entertainment.