Objective: Interpretation of neuropsychological (NP) tests depends on the quality of the normative standards available for the tests. Co-norming across tests is necessary when interpreting differences between scores on different tests. The relevance of specific norms for an individual examinee further depends on multiple design features of the standardization studies, including when the studies were conducted, sampling strategy, inclusion/exclusion criteria, age, sex/gender, education, race and ethnicity, socioeconomic status, and region. This paper examines the standardization studies of the most widely used NP tests, identifies their strengths and weaknesses, and makes recommendations for interpretive caveats based on these analyses.
Participants and Methods: We reviewed the standardization strategies and coded information about the sampling frames, inclusion/exclusion criteria, stratification methods, demographic characteristics, and sample sizes, both overall and within each stratum where relevant. These methods were applied to the WAIS-IV, WMS-IV, CVLT3, D-KEFS, Pearson Advanced Clinical Solutions (ACS), Rey Complex Figure Test, WCST, Symbol Digit Modalities Test, RBANS, BVMT-R, HVLT, Halstead-Reitan (“Heaton et al.”) norms (Boston Naming, Finger Tapping, Grooved Pegboard), MOANS, and MOAANS (Boston Naming, Trail Making Test, Judgement of Line Orientation). We calculated multiple indexes for each test, including standard errors and confidence intervals for scaled scores.
Results: Most tests used only age as a stratification factor, providing “age corrected” scores for selected age bands. The sample sizes for the age strata ranged from 1 to ∼200 but were usually fewer than 100 participants per stratum. Sex differences were rarely reported, and some studies had markedly uneven distributions of sex. Education was not used as a stratification factor in any study, and few norms attempted corrections for education.
The possible interactions of age and education on test scores are seldom reported, and cell sizes for combinations of age and education may be too small to enable robust estimates of scores, especially at lower levels of education and older ages. The possible impact of race and ethnicity is rarely examined except in the ACS, Heaton, and MOAANS norms, which all focus on “African American” participants. Discrepancies in scores across the ACS, Heaton, and MOAANS norms suggest marked sampling differences.
Conclusions: Existing norms have major limitations that may affect the clinical assessment of individuals, result in inappropriate treatment recommendations, and lead to inappropriate classification in clinical trials, which may include score “cutoffs” based on widely used normative standards. Most norms use only age as a stratification factor, despite robust impacts of education on scores. Race and ethnicity are poorly represented, and the samples fail to reflect current demographic characteristics of the United States. For African American groups, existing norms conflict markedly: the same raw score can yield normed scores that differ by a full standard deviation depending only on the source of normative data. Sex differences are examined infrequently, and it remains unclear to what extent sex or gender differences may affect some scores. There is an urgent need for new, preferably “dynamic,” normative standards that include sampling by socially and demographically meaningful metrics. Such standards would provide greater precision in the assessment of neuropsychological scores and score discrepancies, and would support the evaluation of inclusion/exclusion criteria and efficacy criteria in clinical trials that use neurocognitive endpoints.
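To illustrate why the stratum sizes reported in the Results matter, the following is a minimal sketch of how the standard error and 95% confidence interval of a stratum's normative mean shrink with sample size. It is not the calculation used in this study, whose exact indexes are not specified here. It assumes Wechsler-style scaled scores (mean 10, SD 3), a known population SD, and a normal approximation; the function names `se_of_stratum_mean` and `ci_half_width` are hypothetical.

```python
import math

SCALED_SD = 3.0  # assumption: Wechsler-style scaled scores (mean 10, SD 3)
Z_95 = 1.96      # normal-approximation critical value for a 95% interval


def se_of_stratum_mean(n, sd=SCALED_SD):
    """Standard error of a stratum's normative mean, sd / sqrt(n).

    Treats sd as a known population value (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("stratum size must be at least 1")
    return sd / math.sqrt(n)


def ci_half_width(n, sd=SCALED_SD, z=Z_95):
    """Half-width of the normal-approximation confidence interval."""
    return z * se_of_stratum_mean(n, sd)


if __name__ == "__main__":
    # Stratum sizes spanning the range described in the Results.
    for n in (1, 10, 50, 100, 200):
        se = se_of_stratum_mean(n)
        hw = ci_half_width(n)
        print(f"n={n:>3}  SE={se:.2f}  95% CI half-width={hw:.2f}")
```

Under these assumptions, a stratum of 100 participants gives a standard error of 0.30 scaled-score points, whereas a stratum of 1 gives a standard error equal to the full population SD of 3.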