Current multi-object tracking and segmentation (MOTS) methods follow the tracking-by-detection paradigm and adopt 2D or 3D convolutions to extract instance embeddings for instance association. However, due to the large receptive field of deep convolutional neural networks, the foreground area of the current instance and the surrounding areas containing nearby instances or the environment are usually mixed in the learned instance embeddings, resulting in ambiguities in tracking. In this paper, we propose a highly effective method for learning instance embeddings from segments by converting the compact image representation into an unordered 2D point cloud representation. In this way, the non-overlapping nature of instance segments can be fully exploited by strictly separating the foreground point cloud from the background point cloud. Moreover, multiple informative data modalities are formulated as point-wise representations to enrich point-wise features. For each instance, the embedding is learned on the foreground 2D point cloud, the environment 2D point cloud, and the smallest circumscribed bounding box. Similarities between instance embeddings are then measured for inter-frame association. In addition, to enable the practical utility of MOTS, we modify the one-stage instance segmentation method SpatialEmbedding to produce instance segments. The resulting efficient and effective framework, named PointTrackV2, outperforms all state-of-the-art methods, including 3D tracking methods, by large margins (4.8 percent higher sMOTSA for pedestrians than MOTSFusion) at near real-time speed (20 FPS on a single 2080Ti). Extensive evaluations on three datasets demonstrate both the effectiveness and efficiency of our method. Furthermore, as crowded scenes for cars are scarce in current MOTS datasets, we present a more challenging dataset, named APOLLO MOTS, with a much higher instance density.
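To make the segment-to-point-cloud conversion concrete, the following is a minimal sketch and not the authors' released implementation: the helper name `mask_to_point_clouds`, the sampling sizes, the environment margin, and the choice of point-wise features (offset from the box center plus RGB color) are illustrative assumptions; the paper formulates additional data modalities as point-wise features as well.

```python
import numpy as np

def mask_to_point_clouds(image, mask, num_fg=1000, num_bg=500, margin=0.2, rng=None):
    """Convert one instance segment into foreground/environment 2D point clouds.

    Each point carries (dx, dy, r, g, b): its (x, y) offset from the
    bounding-box center, normalized by the box size, plus RGB color.
    """
    rng = np.random.default_rng() if rng is None else rng
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1              # smallest circumscribed bounding box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0

    # Enlarge the box to define the surrounding environment region.
    ex0 = max(int(x0 - margin * w), 0)
    ex1 = min(int(x1 + margin * w), mask.shape[1])
    ey0 = max(int(y0 - margin * h), 0)
    ey1 = min(int(y1 + margin * h), mask.shape[0])
    env = np.zeros_like(mask, dtype=bool)
    env[ey0:ey1, ex0:ex1] = True
    env &= ~mask.astype(bool)                    # strictly exclude foreground pixels
    eys, exs = np.nonzero(env)

    def sample(px, py, n):
        # Sample a fixed-size unordered point set; reuse pixels if too few.
        idx = rng.choice(len(px), size=n, replace=len(px) < n)
        sx, sy = px[idx], py[idx]
        offsets = np.stack([(sx - cx) / w, (sy - cy) / h], axis=1)
        colors = image[sy, sx] / 255.0
        return np.concatenate([offsets, colors], axis=1)  # (n, 5) point-wise features

    fg_points = sample(xs, ys, num_fg)
    env_points = sample(exs, eys, num_bg)
    box = np.array([x0, y0, x1, y1], dtype=np.float32)
    return fg_points, env_points, box
```

The two unordered point sets, together with the box, can then be fed to a PointNet-style encoder (a shared MLP over points followed by max-pooling) to produce the per-instance embedding, and distances between embeddings drive the inter-frame association described above.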