This paper presents a methodology for evaluating task-oriented personal assistants, in which users perform a complex task with objective success criteria, independent of personal preferences, and the agent provides suggestions to help users perform the task repeatedly, consistently and accurately. We develop a systematic approach to evaluating task-oriented personal assistants in normal contexts of use by extending the methodology of empirical software engineering to assess effectiveness, efficiency and satisfaction. The approach allows the evaluation of both the human-agent system as a whole and the personal assistant itself, using data obtained by observing user and system behaviour. A key element of our approach is the definition of empirically observable conditions that separate the learning period, when users and the agent are learning to perform the task, from the evaluation period, when performance benefits are measured. The methodology is illustrated with the example of a system that supports users in extracting, annotating and coding events from news articles.