Abstract

AI inference services receive requests, classify data, and respond quickly. These services underlie AI-driven Internet of Things, recommendation engines, and video analytics. Neural networks are widely used because they provide accurate results and fast inference, but their classifications are hard to explain. Tree-based deep learning models can provide comparable accuracy and are innately explainable. However, it is hard to achieve high inference rates with them because branch mispredictions and cache misses make execution inefficient. My research seeks to produce low-latency inference services based on tree-based models. I will exploit the emergence of large L3 caches to convert tree-based model inference from sequential branching into fast, in-cache lookups. Our approach begins with fully trained, accurate tree-based models, compiles them for inference on target processors, and executes inference efficiently. If successful, our approach will enable qualitative advances in AI services. Tree-based models can report the most significant features in a classification in a single pass; in contrast, neural networks require iterative approaches to explain their results. Consider interactive AI recommendation services in which users explicitly rank their current preferences to steer the content they receive: tree-based models can provide feedback to such users much more quickly than neural networks can. Tree-based models also have lower prediction variance than neural networks. Given the same training data, neural networks require many inferences to quantify the variance of borderline classifications, whereas fast tree-based inference can explain variance in seconds rather than minutes. Our approach shows that competing machine learning approaches can provide comparable accuracy but demand wholly different architectural and platform support.
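To make the branching-to-lookup idea concrete, the following is a minimal C++ sketch, not the abstract's actual compiler: it assumes a hypothetical FlatTree layout with made-up features and thresholds, and shows how a decision tree stored as one contiguous array can be traversed with arithmetic index updates (in-cache loads) instead of data-dependent jumps.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical flattened complete binary tree: node i has children 2i+1 and 2i+2.
// For a tree of depth D the whole model is one contiguous array that can sit in L3.
struct FlatTree {
    std::vector<int>   feature;    // feature index tested at each internal node
    std::vector<float> threshold;  // split threshold at each internal node
    std::vector<float> leaf_value; // prediction stored at each leaf position
    int depth;

    float predict(const std::vector<float>& x) const {
        std::size_t node = 0;
        for (int level = 0; level < depth; ++level) {
            // The comparison result (0 or 1) selects the child directly,
            // so each step is an in-cache load plus an index update,
            // not a hard-to-predict conditional jump.
            std::size_t go_right =
                static_cast<std::size_t>(x[feature[node]] > threshold[node]);
            node = 2 * node + 1 + go_right;
        }
        return leaf_value[node];
    }
};

int main() {
    // Tiny depth-2 tree over two features; all values are illustrative only.
    FlatTree t;
    t.depth      = 2;
    t.feature    = {0, 1, 1, 0, 0, 0, 0};                 // internal nodes 0..2, leaves 3..6
    t.threshold  = {0.5f, 0.25f, 0.75f, 0.f, 0.f, 0.f, 0.f};
    t.leaf_value = {0.f, 0.f, 0.f, 0.0f, 1.0f, 2.0f, 3.0f};

    std::vector<float> x = {0.7f, 0.9f};
    std::cout << "prediction: " << t.predict(x) << "\n";   // goes right, right -> leaf 6
    return 0;
}
```

The design point the sketch illustrates is that keeping the whole model contiguous is what lets a large L3 cache hold it, and folding the comparison outcome into the child index keeps mispredicted branches out of the hot loop.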
