A good task policy should explicitly interpret the preconditions of actions and the compositional structure of the task. We aim to learn such a task policy automatically from videos, which remains challenging at present. The challenge is further aggravated when videos involve task-irrelevant components, such as unoperated objects and minor actions: task-irrelevant objects may introduce disruptive visual relations, and task-irrelevant actions can lead to misleading or even failed task planning. Solving both issues simultaneously is beyond the scope of existing methods. To this end, we propose the Task Parse Tree (TPT), a novel task policy representation that distinguishes task-relevant actions with definite preconditions and a clear execution order. The automatic generation of the TPT relies on two core designs: a spatio-temporal graph (STG), which captures the vital changes in the visual relations of objects and their attributes both spatially and temporally, and a conjugate action graph (CAG), which models the execution logic of actions in a graph. We collect a dataset of a real-world task, Make Tea, and experimental results on this dataset show that TPT achieves both accurate and interpretable task planning in two different scenarios.
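As a rough illustration of how these three representations could relate, the following Python sketch encodes one STG time step, one CAG edge, and a toy fragment of a TPT. This is a minimal sketch under our own assumptions, not the authors' implementation; every class, field, and action name here is hypothetical.

```python
# Hypothetical data structures for the STG, CAG, and TPT described above.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class STGFrame:
    """One time step of the spatio-temporal graph: objects with attributes
    (nodes) and their pairwise visual relations (edges)."""
    objects: List[str]                      # e.g. ["kettle", "cup"]
    attributes: Dict[str, List[str]]        # e.g. {"kettle": ["open"]}
    relations: List[Tuple[str, str, str]]   # e.g. [("kettle", "above", "cup")]

@dataclass
class CAGEdge:
    """One edge of the conjugate action graph: action `dst` may follow
    action `src` once the STG change named in `precondition` is observed."""
    src: str            # preceding action, e.g. "boil_water"
    dst: str            # following action, e.g. "pour_water"
    precondition: str   # e.g. "water.state == boiled"

@dataclass
class TPTNode:
    """A node of the Task Parse Tree: a sub-task grouping children, or a
    leaf holding one task-relevant action with its precondition."""
    label: str
    precondition: str = ""
    children: List["TPTNode"] = field(default_factory=list)

# A toy fragment of a Make-Tea tree (hypothetical action names):
tpt = TPTNode("make_tea", children=[
    TPTNode("boil_water", precondition="kettle.filled"),
    TPTNode("pour_water", precondition="water.boiled"),
])
```

In this sketch, the ordered leaves of the TPT give the execution order, while each leaf's `precondition` ties it back to an observable STG change, mirroring the division of labor between the STG and CAG described above.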