Endobronchial intervention is increasingly used as a minimally invasive treatment for pulmonary diseases. To localize the bronchoscope during such procedures, vision-based approaches are clinically preferable but are sensitive to visual variations. Moreover, the static nature of pre-operative planning makes it challenging for learning-based methods that rely on visual features alone to map intraoperative anatomical features. In this work, we propose a robust navigation framework based on Vision Kinematic Interaction (VKI) for monocular bronchoscopic videos. To address the visual imbalance between virtual and real bronchoscopic views, a Visual Similarity Network (VSN) is proposed to extract domain-invariant features that represent the lumen structure in endoscopic views, as well as domain-specific features that characterize surface texture and visual artefacts. To improve the robustness of online camera pose estimation, we also introduce a Kinematic Refinement Network (KRN) that progressively refines the estimated camera pose by combining network predictions with robot control signals. Camera localization accuracy is validated on phantom and porcine lung datasets acquired with a robotically controlled endobronchial intervention system, with both quantitative and qualitative results demonstrating the performance of the proposed techniques. The results show that the features extracted by the proposed method preserve the structural information of small airways in the presence of large visual variations, and that camera localization accuracy is substantially improved. The absolute trajectory errors (ATE) on the phantom and porcine data are 8.01 mm and 8.62 mm, respectively.
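For reference, the absolute trajectory error quoted above is conventionally reported as a root-mean-square error over rigidly aligned trajectories; the formulation below follows the standard definition used in SLAM benchmarking, and the notation ($P_i$, $Q_i$, $S$) is illustrative rather than drawn from this work:
\[
\mathrm{ATE} = \left( \frac{1}{n} \sum_{i=1}^{n} \left\lVert \operatorname{trans}\!\left( Q_i^{-1}\, S\, P_i \right) \right\rVert^{2} \right)^{1/2},
\]
where $P_i$ and $Q_i$ denote the estimated and ground-truth camera poses at frame $i$, $S$ is the rigid-body transform aligning the estimated trajectory to the ground truth, and $\operatorname{trans}(\cdot)$ extracts the translational component of a pose.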