Demand prediction in support of production decisions is an active area of research. Most prediction models are trained to minimize the prediction error, measured as the difference between the predicted and ground-truth demand. However, these models ignore the effect of the prediction error on downstream production decisions. This motivates our study, which focuses on demand prediction models for two-stage uncapacitated lot-sizing problems. In this paper, we present a prediction model that minimizes the decision error, which is measured by the optimization objective of the lot-sizing problem. Our model mitigates the impact of prediction errors by leveraging the structure of the lot-sizing problem. To better accommodate imperfect data, such as data based on inaccurate information, we then extend the prediction models to distributionally robust versions, which consider the worst-case formulation in the feature space. Numerical experiments demonstrate that the proposed prediction models significantly outperform traditional prediction methods when the model being trained is misspecified. In addition, the robust extension enables the models to train well on imperfect datasets while requiring less training data.
<italic>Note to Practitioners</italic>: This study is motivated by the two-stage lot-sizing problem in manufacturing systems, in which the setup decisions (when to produce) are made before the demand is revealed, and the production quantity decisions are adjusted during production. Predicting the demand properly when making the setup decisions helps to reduce the production cost. This paper proposes a novel method that trains demand prediction models to minimize the total production cost.
Practitioners can use the proposed method to train popular prediction models, including linear prediction, decision tree, and neural network models. The production cost of decisions based on our models is shown to be lower than that of decisions based on traditional prediction models. Furthermore, our method improves the ability of the model to handle model misspecification and imperfect data. Practitioners can apply our method to overcome problems caused by imperfect data, such as data containing contaminated or inaccurate measurements.
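To make the distinction between prediction error and decision error concrete, the following is a minimal Python sketch on a toy instance. It is not the formulation used in this paper: the cost parameters `f` (setup cost), `h` (holding cost), and `p` (a penalty for demand that arrives before any setup) are hypothetical, setup plans are found by brute-force enumeration, and the second stage is simplified so that each period's demand is produced at the most recent setup period.

```python
from itertools import product

def cost(setup, demand, f=10.0, h=1.0, p=50.0):
    """Total cost of a fixed setup plan once the demand is known.

    Assumed toy model: f per setup, h per unit per period held,
    p per unit of demand arriving before the first setup.
    """
    total = f * sum(setup)
    last = None  # most recent period with a setup
    for t, d in enumerate(demand):
        if setup[t]:
            last = t
        if last is None:
            total += p * d               # no setup yet: demand unmet
        else:
            total += h * (t - last) * d  # produced at `last`, held until t
    return total

def best_setup(demand):
    """First-stage decision: the cheapest setup plan for a given demand."""
    plans = product((0, 1), repeat=len(demand))
    return min(plans, key=lambda s: cost(s, demand))

def decision_error(pred, true):
    """Extra cost from planning on `pred` instead of the true demand."""
    return cost(best_setup(pred), true) - cost(best_setup(true), true)

def abs_error(pred, true):
    """Total absolute prediction error."""
    return sum(abs(a - b) for a, b in zip(pred, true))

true_demand = [5, 0, 20]
pred_a = [5, 0, 4]    # close to the truth, but misses the late peak
pred_b = [25, 0, 40]  # far from the truth, but same demand pattern

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, abs_error(pred, true_demand), decision_error(pred, true_demand))
```

In this toy instance, prediction A has the smaller absolute error (16 versus 40) but leads to a setup plan that costs 30 more than the optimum, whereas prediction B yields the optimal setup plan despite its larger prediction error. A model trained on the decision error would therefore prefer B over A.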