Staying hydrated is important, particularly for older adults, who are at greater risk of dehydration and may forget to drink. Monitoring fluid intake and receiving reminders to drink throughout the day can help increase hydration levels. The objective of this paper is to automatically detect drinking events from multiple containers in a simulated home environment using a vision-based approach. We compare depth and RGB (red, green, blue) cameras for this task, using both 2D and 3D Convolutional Neural Networks (CNNs). We collected data from nine participants performing drinking, eating and other Activities of Daily Living (ADLs) in a simulated home environment. For the 3D models, the RGB and depth camera inputs yielded very similar F1-scores under both 10-fold cross-validation (94.3% vs 93.9%, respectively) and Leave-One-Subject-Out (LOSO) cross-validation (84.2% vs 86.2%, respectively). This is a promising result, as depth cameras also mitigate the privacy concerns raised by RGB-based models. The 3D CNN models outperformed the 2D models, yielding a more robust system. Depth cameras are therefore a useful alternative to RGB cameras, with comparable performance in identifying drinking events.
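The two evaluation protocols above differ in how clips are assigned to folds, which is one common explanation for why LOSO scores are lower than 10-fold scores. The sketch below is not the authors' pipeline; it is a minimal illustration, assuming each recorded clip is tagged with a participant ID. The function names `k_fold_splits`, `loso_splits` and `f1_score` are hypothetical, and the 9-participant, 4-clip layout is toy data, not the study's dataset. In plain k-fold, clips from the same person can appear in both training and test folds; in LOSO, every test fold contains only a participant the model has never seen.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k=10, seed=0):
    """Plain k-fold CV (hypothetical helper): clips are shuffled and split
    without regard to participant, so one person's clips may end up in
    both the training and the test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def loso_splits(subject_ids):
    """Leave-One-Subject-Out CV (hypothetical helper): each fold holds out
    every clip of one participant, so testing is always on an unseen person."""
    by_subject = defaultdict(list)
    for i, s in enumerate(subject_ids):
        by_subject[s].append(i)
    for held_out in sorted(by_subject):
        test = by_subject[held_out]
        train = [i for s, idxs in by_subject.items() if s != held_out for i in idxs]
        yield held_out, train, test


def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy layout (assumption): 9 participants, 4 clips each.
    subjects = [f"P{p}" for p in range(1, 10) for _ in range(4)]
    print("LOSO folds:", len(list(loso_splits(subjects))))
    leaky = sum(
        bool({subjects[i] for i in tr} & {subjects[i] for i in te})
        for tr, te in k_fold_splits(len(subjects), k=10)
    )
    print("10-fold test folds sharing a participant with training:", leaky)
```

With the toy layout, LOSO produces one fold per participant, while most or all of the 10-fold test folds share a participant with the training data. This overlap can make k-fold estimates more optimistic about performance on new users.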