Poor psychometrics, particularly low test-retest reliability, pose a major challenge for the use of behavioral tasks in individual differences research. Here, we demonstrate that full generative modeling of the Iowa Gambling Task (IGT) substantially improves test-retest reliability and may also enhance the IGT's validity for characterizing internalizing pathology, compared to the traditional analytic approach. IGT data were collected across two sessions, one month apart. Our full generative model incorporated (1) the Outcome Representation Learning (ORL) computational model at the person level and (2) a group-level model that explicitly modeled test-retest reliability, along with other group-level effects. Compared to the traditional 'summary score' (proportion of good decks selected), the ORL model provides a theoretically rich set of performance metrics (Reward Learning Rate, Punishment Learning Rate, Win Frequency Sensitivity, Perseveration Tendency, Memory Decay), each capturing a distinct psychological process. Whereas test-retest reliability for the traditional summary score was only moderate (BCa 95% CI [.04, .63]), test-retest correlations for the ORL performance metrics produced by the full generative model were substantially higher across all five ORL parameters. Further, while summary scores showed no substantial associations with internalizing symptoms, ORL parameters were significantly associated with internalizing symptoms: Punishment Learning Rate was associated with higher self-reported depression, and Perseveration Tendency with lower self-reported anhedonia. Generative modeling thus offers promise for advancing individual differences research using the IGT, and behavioral tasks more generally, by enhancing task psychometrics.
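
To make the five ORL performance metrics concrete, the sketch below illustrates one trial of ORL-style updating and choice. It is a minimal, hedged paraphrase of how the ORL model is commonly specified in the literature, not the exact implementation used here: the reward/punishment learning rates (`A_plus`, `A_minus`) drive a delta-rule update of expected value, win-frequency expectations are weighted by `beta_f`, perseveration by `beta_p`, and `K` governs memory decay of perseveration. Function and variable names are illustrative, and conventions for the fictive win-frequency update of unchosen decks vary across implementations.

```python
import numpy as np

def orl_update(ev, ef, ps, choice, outcome, A_plus, A_minus, K):
    """One trial of ORL-style updates (illustrative sketch).

    ev, ef, ps: length-4 arrays of expected value, expected win
    frequency, and perseveration for the four IGT decks.
    """
    # Asymmetric learning: reward vs. punishment learning rate
    A = A_plus if outcome >= 0 else A_minus
    ev, ef, ps = ev.copy(), ef.copy(), ps.copy()
    # Expected value: delta-rule update for the chosen deck only
    ev[choice] += A * (outcome - ev[choice])
    # Expected win frequency: learn from the sign of the outcome;
    # unchosen decks move toward a fictive, opposite-signed outcome
    # (exact fictive-update conventions vary; this is one common form)
    sgn = float(np.sign(outcome))
    ef[choice] += A * (sgn - ef[choice])
    for d in range(4):
        if d != choice:
            ef[d] += A * (-sgn / 3.0 - ef[d])
    # Perseveration: chosen deck reset to 1, then all decay via K
    ps[choice] = 1.0
    ps = ps / (1.0 + K)
    return ev, ef, ps

def orl_choice_probs(ev, ef, ps, beta_f, beta_p):
    """Softmax over deck utilities V = EV + beta_f*EF + beta_p*PS."""
    v = ev + beta_f * ef + beta_p * ps
    expv = np.exp(v - v.max())          # numerically stable softmax
    return expv / expv.sum()
```

In the full generative model described above, person-level parameters of this kind are themselves given group-level distributions, which is what allows test-retest correlations to be estimated directly rather than computed post hoc from point estimates.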