Recognizing composite indoor human activities is important for elderly health monitoring and is more difficult than identifying individual human movements. This article proposes a sensor-based indoor activity recognition method that integrates indoor positioning. Convolutional neural networks extract spatial information from geomagnetic and ambient light sensor readings, while transformer encoders extract temporal motion features from gyroscope and accelerometer signals. The two sets of features are combined in an indoor activity recognition model with a multimodal feature fusion structure. To explore whether these tasks can be completed using only a smartphone, we collected and built a multisensor indoor activity dataset. Extensive experiments verify the effectiveness of the proposed method. Compared with algorithms that do not consider location information, our method improves recognition accuracy by 13.65%.
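To make the data flow of the two-branch fusion structure concrete, the following is a minimal, untrained sketch in plain Python. It is not the model evaluated in this article. The channel counts (a 3-axis magnetometer plus one light channel, and 3-axis accelerometer plus 3-axis gyroscope), the window length, the smoothing kernel, the number of activity classes, and all function names are assumptions chosen for illustration. The location branch is reduced to a per-channel convolution with pooling, and the motion branch to a single self-attention step, which is the core operation of a transformer encoder without its projection, feed-forward, and normalization layers.

```python
import math
import random


def depthwise_conv_pool(channels, kernel):
    """Location branch (assumed form): per-channel valid 1-D convolution,
    ReLU, then global max pooling, giving one feature per channel."""
    k = len(kernel)
    features = []
    for series in channels:
        responses = [
            max(0.0, sum(w * series[t + i] for i, w in enumerate(kernel)))
            for t in range(len(series) - k + 1)
        ]
        features.append(max(responses))
    return features


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(steps):
    """Motion branch (assumed form): single-head scaled dot-product
    self-attention with identity projections, then mean pooling over time."""
    dim = len(steps[0])
    attended = []
    for query in steps:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in steps
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, steps))
             for d in range(dim)]
        )
    return [sum(row[d] for row in attended) / len(attended) for d in range(dim)]


def classify(features, weight_rows):
    """Linear layer followed by softmax over the activity classes."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weight_rows]
    return softmax(logits)


rng = random.Random(0)
T = 20  # assumed window length in samples

# Synthetic stand-ins for sensor windows (random values, not real data).
location_window = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(4)]  # 3 magnetometer axes + 1 light channel
motion_window = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(T)]    # T steps x (3 accelerometer + 3 gyroscope axes)

location_features = depthwise_conv_pool(location_window, kernel=[0.25, 0.5, 0.25])
motion_features = self_attention_pool(motion_window)

# Multimodal fusion by concatenation (one simple choice among several).
fused = location_features + motion_features

n_classes = 5  # assumed number of activity classes
weights = [[rng.gauss(0, 0.1) for _ in fused] for _ in range(n_classes)]  # random, untrained
probabilities = classify(fused, weights)
```

Concatenation is used here only because it is the simplest fusion operator. Because the weights are random, the output probabilities carry no meaning; the sketch only shows how location and motion features can be extracted separately and merged before classification.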