In this paper, we present DeepVehicleSense, a new transportation mode recognition system for smartphones that is widely applicable to mobile context-aware services. DeepVehicleSense targets three performance objectives at once: high accuracy, low latency, and low power consumption. It does so by exploiting the characteristics of sound captured by the built-in microphone while the user is riding a candidate transportation mode. To attain high energy efficiency, DeepVehicleSense adopts hierarchical accelerometer-based triggers that minimize how often the smartphone's microphone is activated. Further, to achieve high accuracy and low latency, DeepVehicleSense uses non-linear filters that best extract the transportation sound samples. To recognize five different transportation modes, we design a deep-learning-based sound classifier built on a novel deep neural network architecture with multiple branches. Our staged inference technique significantly reduces runtime and energy consumption while maintaining high accuracy for the majority of samples. Using 263 hours of data collected with seven different Android phone models, we demonstrate that DeepVehicleSense achieves a recognition accuracy of 97.44\% using sound samples of only 2 seconds, at an average power consumption of 35.08 mW for all-day monitoring.
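To make the staged inference idea concrete, the following is a minimal Python sketch rather than the actual DeepVehicleSense implementation. It assumes that each branch of the multi-branch network emits class logits, and that inference stops at the first branch whose top softmax probability reaches a confidence threshold. The exit criterion, the threshold value of 0.9, and the names `softmax`, `staged_inference`, and `stages` are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A stage takes the current feature state and returns (new_state, class_logits).
# Illustrative interface only; the paper's real branch design is not shown here.
Stage = Callable[[Sequence[float]], Tuple[Sequence[float], List[float]]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def staged_inference(
    features: Sequence[float],
    stages: Sequence[Stage],
    threshold: float = 0.9,  # assumed exit criterion, not taken from the paper
) -> Tuple[int, float, int]:
    """Run stages in order and exit early once a branch is confident enough.

    Returns (predicted_class, confidence, number_of_stages_executed).
    """
    if not stages:
        raise ValueError("at least one stage is required")

    state = features
    label, confidence = -1, 0.0
    for used, stage in enumerate(stages, start=1):
        state, logits = stage(state)
        probs = softmax(logits)
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            # Early exit: later, more expensive stages are skipped.
            return label, confidence, used
    # No branch was confident enough; fall back to the last branch's prediction.
    return label, confidence, len(stages)
```

Under these assumptions, an easy sample leaves the network at the first branch and pays only for the cheapest computation, while a harder sample continues to deeper branches. This is one way to reduce runtime and energy for the majority of samples without giving up accuracy on the rest.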