Temporal action localization remains a significant challenge in computer vision, and an efficient method for the task has yet to be established. The objective is to identify human activities in untrimmed videos, determining both which actions occur in each video and when. Working with trimmed videos would sidestep the localization problem and improve classification accuracy, but it is impractical for real-world applications because trimming itself requires human intervention. This highlights the importance of temporal localization. Because several successful approaches exist for action recognition in trimmed videos, conventional multi-stage methods for untrimmed videos commonly employ one network to generate activity proposals, followed by a separate network for classification. These disjoint networks are optimized individually and therefore usually deviate from the global optimum, leading to less precise candidate action proposals. To address this problem, we propose a novel end-to-end neural network that uses error estimation for precise action localization and recognition in untrimmed videos. The proposed method localizes and classifies action instances simultaneously, so the corresponding networks are optimized jointly. To increase the precision of the action proposal boundaries, a Regression module is included in the proposed end-to-end network alongside the Evaluation and Classification modules. The Regression module estimates the potential error in the proposal time boundaries and thereby improves the accuracy of the results. We conducted experiments on THUMOS 14 and ActivityNet-1.3, which are considered among the most challenging datasets for temporal action localization. The proposed network, novel yet fairly simple, achieves a remarkable performance improvement over other state-of-the-art methods.
This improvement, which is more pronounced at high temporal intersection with the ground truth, is achieved without extra data or a complicated architecture. Incorporating error estimation improves the mean Average Precision (mAP). The proposed approach performs particularly well in localizing challenging activities in the complex and diverse ActivityNet-1.3 dataset. For instance, for the “drinking coffee” activity, the mAP is five times higher than the best previously reported result.
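The effect of boundary error estimation on temporal overlap can be illustrated with a minimal sketch. It is not the proposed network: the function names `temporal_iou` and `refine_proposal`, the sign convention for the errors (estimated error = proposal boundary minus true boundary), and the numbers are all assumptions chosen for illustration. The sketch only shows how subtracting an estimated boundary error from a proposal can raise its temporal intersection over union (tIoU) with the ground truth.

```python
def temporal_iou(proposal, ground_truth):
    """Temporal IoU between two (start, end) segments given in seconds."""
    p_start, p_end = proposal
    g_start, g_end = ground_truth
    # Length of the overlapping interval, or 0 if the segments are disjoint.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def refine_proposal(proposal, start_error, end_error):
    """Shift proposal boundaries by estimated errors.

    Assumed convention (hypothetical): error = proposal boundary - true boundary,
    so the correction subtracts the estimate from each boundary.
    """
    start, end = proposal
    return (start - start_error, end - end_error)


if __name__ == "__main__":
    ground_truth = (10.0, 20.0)   # illustrative annotated action segment
    proposal = (12.0, 23.0)       # illustrative coarse proposal
    # Hypothetical regression outputs; imperfect on purpose.
    refined = refine_proposal(proposal, start_error=1.5, end_error=2.5)

    print("tIoU before refinement:", round(temporal_iou(proposal, ground_truth), 3))
    print("tIoU after refinement: ", round(temporal_iou(refined, ground_truth), 3))
```

In this toy case the refined segment overlaps the ground truth more tightly than the original proposal, which is the kind of gain that matters most when evaluation uses high tIoU thresholds.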