Human action recognition in video analytics has been widely studied in recent years. Yet, most of these methods assign a single action label to video after either analyzing a complete video or using classifier for each frame. But when compared to human vision strategy, it can be deduced that we (human) require just an instance of visual data for recognition of scene. It turns out that small group of frames or even single frame from the video are enough for precise recognition. In this paper, we present an approach to detect, localize and recognize actions of interest in almost real-time from frames obtained by a continuous stream of video data that can be captured from a surveillance camera. The model takes input frames after a specified period and is able to give action label based on a single frame. Combining results over specific time we predicted the action label for the stream of video. We demonstrate that YOLO is effective method and comparatively fast for recognition and localization in Liris Human Activities dataset.