Point cloud accumulation is a crucial technique in point cloud analysis, facilitating downstream tasks such as surface reconstruction. Existing methods rely solely on raw LiDAR points and yield unsatisfactory performance due to limited geometric information, particularly in complex scenarios characterized by intricate motions, diverse objects, and larger numbers of frames. In this paper, we introduce camera data, which is usually acquired alongside LiDAR at minimal additional expense, as a complementary modality. To exploit it, we present the Multimodal Spatiotemporal Aggregation solution (termed MSA), which thoroughly explores and aggregates the two distinct modalities (sparse 3D points and multi-view 2D images). Concretely, we propose a multimodal spatial aggregation module that bridges the data gap between the modalities in Bird's-Eye-View (BEV) space and fuses them via learnable, adaptive channel-wise weights. By combining the strengths of both modalities, this module produces a reliable and consistent scene representation. We further design a temporal aggregation module that captures continuous motion information across consecutive sequences, which helps identify the motion state of the foreground and enables the model to generalize robustly to longer sequences. Experiments demonstrate that MSA outperforms state-of-the-art (SoTA) point cloud accumulation methods across all evaluation metrics on the public benchmark, especially with more frames.
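
To make the fusion step concrete, the following is a minimal PyTorch sketch of channel-wise adaptive fusion between LiDAR and camera BEV features. It assumes both modalities have already been encoded into BEV feature maps on the same grid; the module name `AdaptiveBEVFusion` and its specific layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaptiveBEVFusion(nn.Module):
    """Hypothetical sketch: fuse LiDAR and camera BEV feature maps with
    learnable, adaptive channel-wise weights (layer choices are assumptions)."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict one fusion weight per channel from the concatenated features.
        self.weight_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # (B, 2C, H, W) -> (B, 2C, 1, 1)
            nn.Conv2d(2 * channels, channels, 1),  # squeeze to C channel weights
            nn.Sigmoid(),                          # constrain weights to (0, 1)
        )

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W) BEV feature maps on the same spatial grid.
        w = self.weight_net(torch.cat([lidar_bev, cam_bev], dim=1))  # (B, C, 1, 1)
        # Channel-wise convex combination of the two modalities.
        return w * lidar_bev + (1.0 - w) * cam_bev


# Usage on dummy features: a 128-channel, 200x200 BEV grid.
fusion = AdaptiveBEVFusion(channels=128)
lidar_feat = torch.randn(2, 128, 200, 200)
cam_feat = torch.randn(2, 128, 200, 200)
fused = fusion(lidar_feat, cam_feat)  # (2, 128, 200, 200)
```

Predicting a sigmoid weight per channel lets the network lean on camera features where LiDAR returns are sparse and vice versa, in line with the stated goal of combining the strengths of both modalities.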