The importance of proper hydration is often overlooked. Water is indispensable to the human body’s function, including maintaining normal temperature, eliminating waste and preventing kidney damage. When fluid intake is lower than the amount the body consumes, the body has difficulty metabolizing waste. Insufficient fluid intake can also cause headaches, dizziness and fatigue. Fluid intake monitoring therefore plays an important role in preventing dehydration. In this study, we propose a multimodal approach to drinking activity identification to improve fluid intake monitoring. Movement signals from the wrist and the container, as well as acoustic signals of swallowing, are acquired. After pre-processing and feature extraction, typical machine learning algorithms are used to classify each sliding window as a drinking or non-drinking activity. The recognition performance of the single-modal and multimodal methods is then compared using event-based and sample-based evaluation. In the sample-based evaluation, the proposed multi-sensor fusion approach performs better with the support vector machine and extreme gradient boosting classifiers, achieving F1-scores of 83.7% and 83.9%, respectively. Similarly, in the event-based evaluation, the proposed method achieves its best F1-score of 96.5% with the support vector machine. The results demonstrate that the multimodal approach outperforms the single-modal approaches in drinking activity identification.
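The difference between the two evaluation schemes can be illustrated with a minimal sketch. It assumes binary per-window labels (1 = drinking) and, for the event-based score, that a true event counts as detected when any predicted event overlaps it. That overlap rule, the function names and the toy labels are assumptions made for illustration; they are not taken from the study and do not reproduce its results.

```python
def f1(tp, fp, fn):
    """F1-score from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sample_based_f1(y_true, y_pred):
    """Score every sliding window independently."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return f1(tp, fp, fn)


def find_events(labels):
    """Group consecutive windows labelled 1 into (start, end) intervals, end exclusive."""
    events, start = [], None
    for i, value in enumerate(labels):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events


def overlaps(a, b):
    """True if two half-open intervals share at least one window."""
    return a[0] < b[1] and b[0] < a[1]


def event_based_f1(y_true, y_pred):
    """Score whole events. Assumed rule: any overlap counts as a detection."""
    true_events = find_events(y_true)
    pred_events = find_events(y_pred)
    tp = sum(1 for t in true_events if any(overlaps(t, p) for p in pred_events))
    fn = len(true_events) - tp
    fp = sum(1 for p in pred_events if not any(overlaps(p, t) for t in true_events))
    return f1(tp, fp, fn)


# Hypothetical toy labels: two true drinking events, one spurious prediction.
y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

print(f"sample-based F1: {sample_based_f1(y_true, y_pred):.3f}")
print(f"event-based F1:  {event_based_f1(y_true, y_pred):.3f}")
```

On these toy labels both true events are partly detected, so the event-based score is higher than the sample-based score, which also penalizes the windows missed at the event boundaries. This is one plausible reason event-based figures can exceed sample-based ones.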