Coded aperture compressive temporal imaging (CACTI) uses compressive sensing (CS) theory to compress three-dimensional (3D) signals into a 2D measurement captured in a single snapshot, thereby acquiring high-dimensional (HD) visual signals. Deep learning has become the mainstream approach to signal reconstruction because it addresses the low quality and slow runtime often encountered in reconstruction, and it has shown superior performance. Currently, however, the best-performing networks are typically supervised, have large models, and require vast training sets that can be difficult or expensive to obtain. This limits their application in real optical imaging systems. In this paper, we propose a lightweight reconstruction network that recovers HD signals from noisy compressed measurements alone. We design a convolutional block that extracts and fuses local and global features, and stack multiple features to form a lightweight architecture. In addition, we derive unsupervised loss functions based on the geometric characteristics of the signal to ensure strong generalization capability, so that the network can approximate the reconstruction process of real optical systems. Experimental results show that the proposed network significantly reduces model size while achieving high performance in recovering dynamic scenes, and that the unsupervised video reconstruction network approaches its supervised version in reconstruction performance.
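
To make the compression of a 3D signal into a single 2D snapshot concrete, the following is a minimal sketch of the forward model commonly used for snapshot compressive video imaging: each frame is modulated element-wise by a coding mask and the modulated frames are summed over time, with optional additive noise. The function name `cacti_measure`, the random binary masks, the Gaussian noise model, and the array sizes are all illustrative assumptions and are not taken from this paper.

```python
import numpy as np


def cacti_measure(video, masks, noise_std=0.0, rng=None):
    """Compress a (T, H, W) video into one (H, W) snapshot.

    Assumed model (illustrative): Y = sum_t C_t * X_t + N,
    where * is element-wise and N is zero-mean Gaussian noise.
    """
    if video.shape != masks.shape:
        raise ValueError("video and masks must share the shape (T, H, W)")
    # Modulate every frame by its mask, then integrate over the time axis.
    measurement = (masks * video).sum(axis=0)
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        measurement = measurement + rng.normal(0.0, noise_std, measurement.shape)
    return measurement


# Hypothetical sizes: 8 frames of 4 x 4 pixels.
rng = np.random.default_rng(0)
T, H, W = 8, 4, 4
video = rng.random((T, H, W))
masks = rng.integers(0, 2, size=(T, H, W)).astype(video.dtype)  # assumed binary masks
snapshot = cacti_measure(video, masks)
print(snapshot.shape)
```

In this sketch the reconstruction task is the inverse problem: recovering all `T` frames of `video` from `snapshot` and the known `masks`.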