Abstract

Parallel computation is essential for machine learning systems that handle large amounts of training data. One popular form of parallel machine learning is data-parallel training, in which a large number of computational workers are coordinated through parameter servers. Fault tolerance is a crucial issue for large-scale computation systems in general, and parameter server based machine learning systems are no exception. However, while fault tolerance has been widely discussed for large-scale computation systems in general, parallel machine learning systems have received little attention, despite their unique characteristics. In this paper, we discuss the fault tolerance of parallel machine learning systems that use parameter servers, where the parameter servers provide extra redundancy to the system and can double as checkpoint servers. We also quantitatively evaluate several fault tolerance methods using the parallel environment simulator SimGrid, and demonstrate the effectiveness of the proposed method.
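
As a rough illustration of the idea described above, the following minimal Python sketch shows a parameter server that also acts as the checkpoint server: workers pull the current parameters and push gradient updates, and the server periodically persists its own state so training can resume after a failure. The class and function names (CheckpointingParameterServer, push, pull, checkpoint, restore) and the toy gradient computation are illustrative assumptions, not the paper's actual implementation or its SimGrid evaluation setup.

    # Minimal sketch (illustrative only, not the paper's implementation) of a
    # data-parallel setup in which the parameter server doubles as the
    # checkpoint server: it applies workers' gradient updates and periodically
    # persists its own state so that training can resume after a failure.
    import pickle
    import random

    class CheckpointingParameterServer:
        def __init__(self, num_params, checkpoint_path, checkpoint_every):
            self.params = [0.0] * num_params      # parameters held by the server
            self.step = 0
            self.checkpoint_path = checkpoint_path
            self.checkpoint_every = checkpoint_every

        def pull(self):
            """Return a copy of the current parameters to a worker."""
            return list(self.params)

        def push(self, gradients, lr=0.1):
            """Apply a worker's gradient update, then checkpoint if due."""
            self.params = [p - lr * g for p, g in zip(self.params, gradients)]
            self.step += 1
            if self.step % self.checkpoint_every == 0:
                self.checkpoint()

        def checkpoint(self):
            # The server doubles as the checkpoint server: it persists its state.
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump({"step": self.step, "params": self.params}, f)

        def restore(self):
            # On recovery, reload the most recent checkpoint.
            with open(self.checkpoint_path, "rb") as f:
                state = pickle.load(f)
            self.step, self.params = state["step"], state["params"]


    def worker(server, data_shard):
        """One data-parallel worker: pull parameters, compute a toy gradient, push it."""
        params = server.pull()
        # Gradient of a toy squared-error loss pulling the parameters toward the shard.
        gradients = [p - x for p, x in zip(params, data_shard)]
        server.push(gradients)


    if __name__ == "__main__":
        server = CheckpointingParameterServer(num_params=4,
                                              checkpoint_path="ps_checkpoint.pkl",
                                              checkpoint_every=10)
        shards = [[random.random() for _ in range(4)] for _ in range(100)]
        for shard in shards:          # simulate workers pushing updates in turn
            worker(server, shard)
        server.restore()              # after a simulated failure, recover state
        print("restored at step", server.step)

Because the parameter server already holds the authoritative copy of the model, checkpointing from the server avoids gathering state from every worker on failure, which appears to be the redundancy argument the abstract alludes to.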
