Abstract

Crowd scene understanding is a challenging task of particular importance in computer vision. Crowd scene categories are often defined by multi-level information, which leads to large intra-class variation, and crowd dynamics take different forms across different crowd systems. Large-scale crowd scene datasets and quantified generic properties for crowd representation are therefore key issues for this topic. This paper proposes a Two-Stream Residual Network (TSRN), a deep model that jointly learns and aggregates appearance and motion features for crowd understanding. The appearance stream is generated from a static frame through a Residual Network. The motion stream is generated from three scene-independent motion maps (collectiveness, stability, and conflict) as a complement to the appearance stream. Experiments are conducted on a large-scale crowd video dataset, Who do What at some Where (WWW), devised for understanding crowded scenes. The results show excellent accuracy compared with prior hand-crafted and deep learning methods, attaining 88% and 74.9% accuracy for the appearance and motion streams respectively, and 89% accuracy for the combined two-stream ResNet.
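
The abstract describes a two-stream design: a ResNet over the static frame and a ResNet over the stacked motion maps, with the two feature sets aggregated for prediction. Below is a minimal sketch of that idea in PyTorch. The backbone depth (ResNet-50), late fusion by feature concatenation, the attribute count, and all class and parameter names (`TwoStreamResNet`, `num_attributes`) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal two-stream ResNet sketch for crowd scene attribute prediction.
# Assumptions: ResNet-50 backbones, late fusion by concatenation, and a
# 3-channel motion input (collectiveness, stability, conflict maps).
import torch
import torch.nn as nn
from torchvision.models import resnet50


class TwoStreamResNet(nn.Module):
    def __init__(self, num_attributes: int = 94):
        super().__init__()
        # Appearance stream: a ResNet over the static RGB frame.
        self.appearance = resnet50(weights=None)
        self.appearance.fc = nn.Identity()
        # Motion stream: a ResNet over the three stacked motion maps,
        # one map per input channel.
        self.motion = resnet50(weights=None)
        self.motion.fc = nn.Identity()
        # Late fusion: concatenate the two 2048-d feature vectors and
        # predict per-attribute scores.
        self.classifier = nn.Linear(2048 * 2, num_attributes)

    def forward(self, frame: torch.Tensor, motion_maps: torch.Tensor) -> torch.Tensor:
        appearance_feat = self.appearance(frame)     # (B, 2048)
        motion_feat = self.motion(motion_maps)       # (B, 2048)
        fused = torch.cat([appearance_feat, motion_feat], dim=1)
        return self.classifier(fused)                # (B, num_attributes)


# Usage: a single 224x224 frame and its three motion maps.
model = TwoStreamResNet()
frame = torch.randn(1, 3, 224, 224)
motion_maps = torch.randn(1, 3, 224, 224)
logits = model(frame, motion_maps)
print(logits.shape)  # torch.Size([1, 94])
```

Concatenation followed by a linear classifier is only one possible aggregation strategy; score averaging or weighted fusion of per-stream predictions would fit the same two-stream framework.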
