Abstract
In this paper, we present a cost-effective memory protection and reliability evaluation methodology for machine learning systems based on machine error-tolerance. We show that by carefully exploiting this inherent error-tolerability, the cost of typical memory protection methods, including ECC (Error Correction Code) and TMR (Triple Modular Redundancy), can be greatly reduced, making these methods attractive for implementing a reliable machine learning system. As a case study, we target YOLOv4, a recent and powerful object detection model. To the best of our knowledge, no prior work in the literature addresses reliability evaluation and enhancement for YOLOv4. By identifying the set of error-sensitive (critical) memory blocks and protecting only this set, we develop more efficient ECC and TMR methods. Our ECC method requires no additional memory, while our TMR method reduces the area cost from 200% to only 47.5%. The reliability evaluation results show that the proposed ECC method extends the Mean Time to Failure (MTTF) of the YOLOv4 object detection system by about 12 times, while the proposed TMR method achieves an MTTF 108 times longer than that obtained with ECC. We also present a generic methodology for exploiting machine error-tolerance to develop cost-effective memory protection methods.
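The area-cost figures above can be illustrated with a small sketch. It assumes that full TMR stores two extra copies of every memory block (200% overhead) and that selective TMR triplicates only the critical blocks, so overhead scales linearly with the protected fraction. Under that assumption, 47.5% would correspond to protecting roughly 23.75% of the memory; this fraction is inferred here for illustration and is not stated in the abstract. The function names `tmr_area_overhead` and `majority_vote` are hypothetical and are not taken from the paper.

```python
def tmr_area_overhead(critical_fraction: float) -> float:
    """Extra memory, as a fraction of the unprotected memory size,
    when only `critical_fraction` of the blocks is triplicated.

    Assumption: each protected block needs two additional copies and
    voter logic is ignored, so overhead = 2 * critical_fraction.
    """
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must lie in [0, 1]")
    return 2.0 * critical_fraction


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three stored copies of the same word.

    Each output bit takes the value held by at least two of the copies,
    which masks any error confined to a single copy.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    # Full TMR: every block is protected.
    print(f"full TMR overhead:      {tmr_area_overhead(1.0):.1%}")
    # Selective TMR: inferred critical fraction (an assumption, see above).
    print(f"selective TMR overhead: {tmr_area_overhead(0.2375):.1%}")
    # One copy has a flipped bit; the vote recovers the original word.
    print(bin(majority_vote(0b1010, 0b1010, 0b0010)))
```

The vote is written bitwise so that a single corrupted copy is masked regardless of which bits were flipped; errors that hit the same bit in two copies are not corrected by this scheme.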