ABSTRACT

This study was conceived in response to criticisms of the current TOEFL listening comprehension test‐item format. Major areas of criticism have included speculation that listening as tested places too much burden on short‐term memory as opposed to comprehension, that a knowledge of reading is required in order to respond successfully, and that many items appear to require mere recall and matching of details rather than higher‐order processing skills. To address these criticisms in turn, a study was designed with 120 ESL learners and three listening tests (comprising 144 total real and adapted TOEFL test items) to examine the characteristics of item functioning under conditions of stimulus repetition versus nonrepetition, variation of length of aural stimulus passage and of associated numbers of items, shorter versus longer reading response options, and higher versus lower level of processing skills required. Those item types and stimulus conditions that were found to associate with superior item functioning as indicated by estimates of item difficulty, item discriminability, internal consistency reliability, fit to a latent trait model, and convergent and discriminant validity were identified.

Results suggested that, while repetition of the stimulus passage predictably tended to reduce item difficulty when concomitant influences were controlled, there was no consistent effect of stimulus passage repetition on item discrimination, Rasch model fit, or discriminant validity across difficulty level.
However, there was a tendency for items in the no‐repetition condition to exhibit greater convergent and discriminant validity than items in the one‐repetition condition.

Although passage length was confounded with numbers of items per passage and with comprehension hierarchy level, the test with passages of three‐sentence length tended to be more reliable than the test with passages of two‐sentence length, and the test with passages of two‐sentence length tended to be more reliable than the test with passages of one‐sentence length. Also, the test with the longest passages tended predictably to be slightly more difficult than the test with the shortest passages.

Item response‐option length was significantly related to item difficulty and Rasch model fit: items with options shortened to about half the current TOEFL response‐option length tended to be easier and to exhibit better fit than items with the current longer options. Also, items with shortened options showed greater convergent and discriminant validity across levels of difficulty than did items with unshortened options. There was also a near‐significant tendency for items with shortened options to exhibit better discrimination than items with unshortened options when concomitant influences were controlled.

Comprehension hierarchy level of items, as defined by the length of passage required to respond correctly, was not significantly related to item difficulty except through a complex option‐length‐by‐hierarchy‐level interaction. However, hierarchy level was related to discrimination and Rasch model fit: items with a lower level of processing (i.e., those that required comprehension of less stimulus text) showed better fit and discrimination than higher‐level items after concomitant influences were removed.
Also, greater convergent and discriminant validity across difficulty levels was exhibited by lower‐level comprehension items than higher‐level items.

It was concluded that tasks like those employed in TOEFL Listening Comprehension Section A would benefit from a shortening of current response‐option length, but that it was not beneficial to repeat stimulus passages, nor was it desirable to increase the proportion of items that depended on comprehension of greater rather than lesser amounts of text.