This article describes the development and implementation of a 3D lidar perception framework intended to provide accurate perception of the surrounding environment for urban autonomous driving. The proposed framework consists of two detection modules operating in parallel: a deep learning-based method and a geometric model-free, cluster-based method. The first module uses a convolutional gated recurrent unit (ConvGRU)-based residual network (CGRN), which re-predicts 3D objects from the successive outputs of a single-frame detection network. A vision-fusion method based on 2D projection is adopted for postprocessing in this module. The second module uses geometric model-free area (GMFA) cluster detection and is designed to cope with false negatives, that is, objects left unclassified by the first module. In the second module, cluster variance-based ground removal is performed to prevent false positives. A kinematic model-based particle filter (PF) is then applied to estimate the dynamic states of the detected objects. The framework has been developed with real-time operation in mind, for deployment in autonomous vehicles equipped with automotive lidars and low-cost cameras. Test results show that the framework combining CGRN and GMFA improved surrounding-object detection and state estimation accuracy in urban autonomous driving.
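
To illustrate the idea behind cluster variance-based ground removal, the following minimal sketch discards clusters whose points are nearly flat in height. The abstract does not specify the criterion, so the use of the z-coordinate variance, the function name `remove_ground_clusters`, and the threshold `z_var_threshold` are assumptions made for illustration only, not the authors' implementation.

```python
from statistics import pvariance


def remove_ground_clusters(clusters, z_var_threshold=0.01):
    """Keep only clusters whose height (z) variance exceeds a threshold.

    clusters: list of clusters, each a list of (x, y, z) points in metres.
    z_var_threshold: hypothetical tuning value in m^2 (not from the article).
    """
    kept = []
    for points in clusters:
        if len(points) < 2:
            # Too few points to estimate a variance; skipped in this sketch.
            continue
        z_var = pvariance([p[2] for p in points])
        # Assumption: a nearly flat cluster is road surface, not an object.
        if z_var > z_var_threshold:
            kept.append(points)
    return kept


# Illustrative data: a flat road patch and a tall, pedestrian-like cluster.
road_patch = [(1.0, 0.0, 0.02), (1.2, 0.1, 0.01), (1.4, -0.1, 0.03), (1.6, 0.0, 0.02)]
pedestrian = [(5.0, 1.0, 0.1), (5.0, 1.1, 0.6), (5.1, 1.0, 1.1), (5.0, 1.0, 1.7)]

objects = remove_ground_clusters([road_patch, pedestrian])
```

With this illustrative data, only the tall cluster remains in `objects`. A fixed threshold is the simplest choice and would need tuning for a real sensor.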