We introduce an automated wavefunction-identification program that generates large-scale datasets for machine-learning-based active-region (AR) design of quantum cascade lasers (QCLs). Conventional QCL design methods rely on expert knowledge and extensive iterative testing, which makes AR design inefficient. Our automated approach identifies the crucial wavefunctions within a QCL band diagram rapidly and with high accuracy. Key wavefunctions in the optical-transition, injector, and extractor regions, including the upper and lower laser levels, injecting levels, extractor levels, and high-energy leakage-path levels, are identified using a refined k-means clustering algorithm and tailored probability formulas. The identification program achieves an accuracy of >95%. Using it, we generated approximately 430 000 QCL structures, identified their wavefunctions, and computed basic metrics such as the energy differences between various levels. The analysis was based on a nominally 8 μm-emitting QCL structure in which each stage comprises 24 layers, at a fixed applied electric field of 40 kV/cm, which is expected to be close to the field at laser threshold. The compositions of the InGaAs wells and AlInAs barriers were lattice-matched to InP, and only the layer thicknesses were varied, within empirically derived ranges. Using this dataset, we trained neural networks to map QCL structures to energy-level differences and tested their performance. The promising results, with coefficient of determination (R²) values of around 90%, validate both the efficacy of our automated program for generating substantial, usable training data and the capability of our networks for QCL AR design.
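As a rough illustration of the clustering step, the sketch below groups wavefunctions by the probability-weighted mean position of each state and then applies plain one-dimensional k-means (Lloyd's algorithm). It is not the refined algorithm or the probability formulas used in this work, which are not specified here. The choice of the mean position as the clustering feature, the evenly spaced initial centers, and the names `centroid` and `kmeans_1d` are all assumptions made for illustration.

```python
from typing import List, Sequence


def centroid(z: Sequence[float], psi: Sequence[float]) -> float:
    """Probability-weighted mean position of one wavefunction.

    Assumption: psi holds real amplitudes sampled on the grid z, so the
    weight at each grid point is psi**2.
    """
    weights = [p * p for p in psi]
    total = sum(weights)
    return sum(zi * w for zi, w in zip(z, weights)) / total


def kmeans_1d(values: Sequence[float], k: int, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means on scalars; returns one cluster index per value."""
    lo, hi = min(values), max(values)
    # Deterministic start (an assumption): centers spread evenly over the range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Hypothetical mean positions (arbitrary units) of seven states in one stage.
positions = [1.0, 1.5, 2.0, 10.0, 10.5, 20.0, 21.0]
region_labels = kmeans_1d(positions, k=3)
```

With three clusters, the labels could then be read as three spatial regions of one stage; assigning specific roles such as upper or lower laser level would need further criteria beyond this sketch.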