An adaptive testing item selection strategy via a deep reinforcement learning approach.
Computerized adaptive testing (CAT) aims to present items that statistically optimize the assessment process by considering the examinee's responses and estimated trait levels. Recent developments in reinforcement learning and deep neural networks give CAT the potential to select items by drawing on information across all items remaining in the test, rather than focusing only on the next few items to be selected. In this study, we reformulate CAT within the reinforcement learning framework and propose a new item selection strategy based on the deep Q-network (DQN) method. Through simulated and empirical studies, we demonstrate how to monitor the training process to obtain optimal Q-networks, and we compare the accuracy of the DQN-based item selection strategy with that of five traditional strategies (maximum Fisher information, Fisher information weighted by likelihood, Kullback-Leibler information weighted by likelihood, maximum posterior weighted information, and maximum expected information) on both simulated and real item banks and responses. We further investigate how the sample size and the trait-level distribution of the examinees used in training affect DQN performance. The results show that DQN achieves lower RMSE and MAE values than the traditional strategies under most conditions, for both simulated and real banks and responses. Suggestions for using DQN-based strategies are provided, along with their code.
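The reformulation described in this abstract can be made concrete with a toy environment. The sketch below casts CAT as an MDP in the familiar reset/step style, with hypothetical 2PL item parameters and a deliberately crude final-score proxy in place of a real ability estimator; a DQN agent would be trained against an interface like this, mapping the state to one Q-value per item.

```python
import math
import random

class ToyCATEnv:
    """Toy MDP view of CAT (all parameters hypothetical).

    state  : (frozenset of administered item indices, number correct)
    action : index of the next item to administer
    reward : 0 per step; at termination, the negative squared error of a
             crude ability proxy, so better estimates earn higher reward.
    """

    BANK = [(1.2, -1.5), (0.8, -0.5), (1.0, 0.0), (1.5, 0.5), (0.9, 1.5)]  # (a, b)

    def __init__(self, test_len=3, seed=0):
        self.test_len = test_len
        self.rng = random.Random(seed)

    def reset(self):
        self.theta = self.rng.gauss(0.0, 1.0)   # true trait, hidden from the agent
        self.administered, self.correct = [], 0
        return (frozenset(), 0)

    def step(self, item):
        a, b = self.BANK[item]
        p = 1.0 / (1.0 + math.exp(-a * (self.theta - b)))   # 2PL response model
        self.correct += int(self.rng.random() < p)
        self.administered.append(item)
        state = (frozenset(self.administered), self.correct)
        if len(self.administered) < self.test_len:
            return state, 0.0, False
        # terminal reward: crude proxy estimator stands in for EAP/MLE scoring
        theta_hat = 2.0 * self.correct / self.test_len - 1.0
        return state, -(theta_hat - self.theta) ** 2, True
```

Because the reward arrives only at the end of the test, a value-based agent trained on this environment must account for all remaining items when choosing the next one, which is exactly the look-ahead property claimed for the DQN strategy.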
- Research Article
1
- 10.3724/sp.j.1041.2012.00400
- Apr 12, 2013
- Acta Psychologica Sinica
The item selection strategy (ISS) is a core component of Computerized Adaptive Testing (CAT). Polytomous items provide more information about an examinee than dichotomous items, and adopting polytomously scored items in tests is an active research direction for CAT. The most widely used ISS is the maximum Fisher information (MFI) criterion, which raises concerns about the cost-efficiency of pool utilization and poses security risks for CAT programs. Chang and Ying (1999) and Chang, Qian, and Ying (2001) proposed two alternative item selection procedures based on the dichotomous model, the a-stratified method (a-STR) and the a-stratified with b-blocking method (b-STR), with the goal of remedying the item overexposure and underexposure produced by MFI. However, a-STR and b-STR are static techniques because the items are stratified according to the information available at the beginning of the test. Based on the graded response model (GRM), a technique that reduces the dimensionality of the difficulty (or step) parameters has recently been used to construct several ISSs; its limitation is that it loses a great deal of information. To improve on MFI, two new item selection methods are therefore proposed under the GRM: (1) the dimensionality-reduction technique for the difficulty (or step) parameters is modified by integrating interval estimation; and (2) dynamic a-STR and dynamic b-STR methods are implemented during the testing process.
On the one hand, these new ISSs avoid the limitations of MFI while retaining the advantages of the Fisher information function (FIF); the FIF compresses all item and ability parameters into a single quantity, making it a comprehensive tool for all parameters. On the other hand, the new ISSs exploit the fact that the FIF represents the inverse of the variance of the ability estimate: let e be the square root of the reciprocal of the Fisher information, and let d be the absolute deviation between the estimated ability and a function of an item's parameters (a function that may be chosen and changed during the course of the CAT). The inequality d ≤ e then has the form of an interval estimate, and its effect can be pictured as a more flexible shadow item pool. A simulation study based on the GRM was conducted. Four item pools with different structures were simulated, and 1000 examinees were generated with abilities drawn randomly from the standard normal distribution N(0,1). Each pool consists of 1000 polytomous items, and the maximum score of each item was randomly selected from the set {3, 4, 5, 6}. The prior distribution of ability is assumed to be standard normal, and the Bayesian expected a posteriori (EAP) method is employed to estimate the ability parameter. The CAT stopped when the accumulated information reached the predetermined value M (M = 16) or the test reached the pre-assigned maximum length of 30 items. The results show that the new item selection methods required shorter test lengths and achieved lower average exposure rates than the other methods, while maintaining the accuracy of ability estimation. More specifically, the new ISSs that applied interval estimation outperformed their corresponding ISSs in terms of the chi-square value, and the same effect appeared when comparing the dynamic a-STR and dynamic b-STR ISSs with MFI. Several noteworthy results also emerged from comparing item pools with different structures.
The accuracy of ability estimation and the item exposure rate were both related to the distribution of the difficulty parameter b: ability estimation was more accurate when b was sampled from N(0,1) than when b was sampled from a uniform distribution, whereas the opposite held for item exposure rate. The test length was related to the distribution of the discrimination parameter a: tests were shorter when a was sampled from a uniform distribution than when the logarithm of a was sampled from N(0,1). In short, in terms of controlling and balancing item exposure, the new ISSs have an advantage over their corresponding predecessors.
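The d ≤ e eligibility rule described above can be sketched for the simpler dichotomous 2PL case (the paper itself works with the GRM, and the exact definition of e is paraphrased from the abstract): d is the distance from the current ability estimate to an item's difficulty, e is the standard-error-like quantity 1/√(accumulated test information), and only items inside the band are eligible, so the eligible set tightens as the test progresses. All numeric values are hypothetical.

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def eligible_items(theta_hat, bank, administered):
    """Items with d = |theta_hat - b| inside the band e = 1 / sqrt(test info).

    With nothing administered the band is infinite (everything eligible);
    as information accumulates, e shrinks and the eligible set narrows,
    acting like a flexible shadow pool around the ability estimate.
    """
    test_info = sum(info_2pl(theta_hat, *bank[i]) for i in administered)
    e = 1.0 / math.sqrt(test_info) if test_info > 0 else float("inf")
    return [i for i, (a, b) in enumerate(bank)
            if i not in administered and abs(theta_hat - b) <= e]

bank = [(1.2, -1.0), (1.0, 0.1), (0.8, 0.4), (1.5, 2.5)]  # (a, b), hypothetical
```

After two items have been administered near theta = 0, the band has shrunk enough to exclude the very hard fourth item, which is the shadow-pool behavior the abstract describes.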
- Research Article
2
- 10.3724/sp.j.1041.2008.01212
- Jun 5, 2009
- Acta Psychologica Sinica
Computerized adaptive testing (CAT) is one of the most advanced application areas of item response theory (IRT). Many high-stakes tests, such as the GRE and TOEFL, have CAT versions. The item selection strategy is the core of CAT, and the item information function (IIF) is the standard index for item selection. Although the item information of dichotomously scored items has been studied extensively, the item information of polytomously scored items has received much less attention. Because of the advantages inherent in CAT with polytomously scored items, however, it is now attracting growing interest, yet the item selection strategies used in this setting have not been systematically shown to be efficient. Many researchers use the closeness between the trait level and the average of the item category parameters as the item selection index, or alternatives such as the closeness between the trait level and the median of the item category parameters. To date, little research has systematically examined the inherent relationship between trait level and item category parameters for polytomously scored item types, or its effect on item information. The primary purpose of this research is to systematically investigate how item information relates to item category parameters and examinee trait levels. In this study, we simulated 121 trait values distributed uniformly from -3 to 3. We also simulated 504 sets of item parameters, with 4 sets of discrimination parameters each paired with the 126 sets of difficulty parameters. Each item is graded in 5 categories with differing degrees of difficulty.
Based on the item information of the simulated data, we find that the trait value corresponding to the maximum item information matches the group of difficulty parameters with high-frequency item categories. We call this principle the "item category parameter priority rule." It differs markedly from the item selection strategies previously used in computerized adaptive testing. The results of this research should be useful for constructing computerized adaptive tests with polytomously scored items.
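The dependence of polytomous item information on the category parameters that this study examines can be computed directly under the GRM. The sketch below uses the standard GRM formulas: cumulative curves P*_k = logistic(a(θ − b_k)), category probabilities as successive differences, and item information summed over categories. The parameter values are hypothetical.

```python
import math

def grm_category_probs(theta, a, bs):
    """GRM category probabilities at theta for thresholds bs = [b_1..b_m].

    P*_k = logistic(a * (theta - b_k)) are the cumulative curves; the
    probability of scoring in category k is P*_k - P*_{k+1}, with the
    boundary conventions P*_0 = 1 and P*_{m+1} = 0.
    """
    pstar = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [pstar[k] - pstar[k + 1] for k in range(len(bs) + 1)]

def grm_item_info(theta, a, bs):
    """Fisher information of a GRM item: sum_k (P_k')^2 / P_k,
    where P*' = a * P* * (1 - P*) and P_k' = P*_k' - P*_{k+1}'."""
    pstar = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    dpstar = [a * p * (1.0 - p) for p in pstar]
    info = 0.0
    for k in range(len(bs) + 1):
        pk = pstar[k] - pstar[k + 1]
        if pk > 1e-12:
            info += (dpstar[k] - dpstar[k + 1]) ** 2 / pk
    return info
```

Scanning `grm_item_info` over a grid of theta values locates the trait level at which a given set of category parameters is most informative, which is the quantity the selection rules above compare against the examinee's trait level.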
- Conference Article
- 10.2991/icecee-15.2015.221
- Jan 1, 2015
This paper reviews item response theory (IRT) and computerized adaptive test systems, analyzes the problems in existing adaptive test systems, and proposes a system design. Taking the construction of an adaptive test system for C-language programming as an example, it describes the architecture of an IRT-based computerized adaptive test system and the detailed design and implementation of its main functional modules. Adaptive testing is organized around the examinee's ability: all items are selected automatically from the item bank according to the examinee's estimated ability, and item difficulty is adjusted throughout the test according to response accuracy. The test is therefore highly targeted, foregrounds the examinee's individual needs, and enhances test validity, reliability, and efficiency. Computerized adaptive testing (CAT), with item response theory as its basic theory, has attracted wide attention around the world and has gradually been adopted in many settings: the GRE and GMAT graduate entrance examinations in education; the Nurse National Committee License Test (NNCLT) in vocational certification; and, in industry, the certification tests organized by Novell, the CCNA certification test organized by Cisco, and the MCSE adaptive test organized by Microsoft. In China, the research and application of IRT has likewise attracted extensive attention from the education authorities. IRT-based CAT is built on modern test theory: from item bank construction, through item selection and test assembly, to administration and final evaluation, every stage is conducted under the guidance of IRT.
CAT is therefore regarded as modern test theory's greatest contribution to testing practice. For item response theory, Hambleton and Swaminathan offered the following characterization: in the testing scenario, define an examinee characteristic (a trait or ability), estimate the examinee's score on that characteristic (the ability score), and use that score to predict or explain the examinee's performance on the items. Trait (ability) and item are the two core concepts of IRT, and the relationship between them is its central concern. The main research content of IRT, and of computerized adaptive testing under its guidance, includes parameter estimation, test equating, item selection strategies, and termination rules.
- Research Article
3
- 10.3724/sp.j.1041.2011.00203
- Mar 29, 2012
- Acta Psychologica Sinica
In Computerized Adaptive Testing (CAT), the item selection strategy has received considerable attention because of its vital role. Two typical strategies are the Maximum Information Criterion (MIC) and a-Stratification (a-STR), and each has advantages and drawbacks. The MIC method achieves high efficiency and accurate ability estimation, but its uneven item selection can threaten test security. Conversely, although a-STR improves test security by controlling item exposure rates, it can reduce test efficiency and cannot adjust discrimination within strata. Developing item selection strategies that are both effective and secure has therefore been a persistent goal of CAT research. Previous studies show that balancing item exposure rates enhances test security and increases item pool utilization. Accordingly, for dichotomously (0-1) scored CAT, this paper proposes two new item selection strategies that improve on MIC and a-STR by introducing an exposure factor, automatically adjusting discrimination by stage, and increasing the accuracy of item selection. One of the new strategies has three prominent characteristics. First, a function of item information (FII), rather than the item information function itself, is constructed to combine the advantages of MIC and a-STR. Second, the effect of discrimination at different stages of CAT is taken into account, and a function of item discrimination is built into the FII to compensate for a-STR's inability to control item discrimination within strata. Third, an online exposure control mechanism is adopted.
Under this mechanism, items that have been exposed more frequently than others in past administrations become less likely to be selected in future tests, while items exposed less frequently become more likely to be selected. The overall exposure rate across the item pool is thereby evened out, and pool utilization increases. To close the gap between each item's exposure rate and the pool's mean exposure rate, this paper treats the item exposure rate directly as part of the selection criterion, reducing the exposure of over-used items and increasing the use of under-used ones. This differs from approaches, such as the Sympson-Hetter (SH) method, that control only highly exposed items. Monte Carlo simulations show that, compared with other approaches, the proposed approach is more effective at exposure control while remaining satisfactory on the other evaluation indices.
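The idea of building the exposure rate directly into the selection expression can be sketched as follows. The specific weighting used here (an exponential of the gap between the pool's mean exposure rate and the item's own rate) is a hypothetical stand-in for the paper's criterion, shown with a dichotomous 2PL bank of invented parameters.

```python
import math

def select_item(theta_hat, bank, exposure_counts, tests_given):
    """Pick the item maximizing information times an exposure weight.

    The weight exp(mean_rate - item_rate) exceeds 1 for underused items
    and falls below 1 for overused ones, pulling every item's exposure
    toward the pool mean (hypothetical form of the paper's criterion).
    """
    n = max(tests_given, 1)
    mean_rate = sum(exposure_counts) / (len(bank) * n)
    best, best_score = None, float("-inf")
    for i, (a, b) in enumerate(bank):
        p = 1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
        info = a * a * p * (1.0 - p)            # 2PL Fisher information
        rate = exposure_counts[i] / n
        score = info * math.exp(mean_rate - rate)
        if score > best_score:
            best, best_score = i, score
    return best

bank = [(1.0, 0.0), (1.0, 0.1), (1.0, 2.0)]     # (a, b), hypothetical
```

With a fresh pool the criterion reduces to plain maximum information; once one item has been administered far more often than the others, its weight drops below 1 and a slightly less informative but underused item is selected instead, which is the equalizing behavior described above.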
- Research Article
8
- 10.1016/j.caeai.2022.100083
- Jan 1, 2022
- Computers and Education: Artificial Intelligence
The development and implementation of a computer adaptive progress test across European countries
- Research Article
4
- 10.21449/ijate.1105769
- Sep 30, 2022
- International Journal of Assessment Tools in Education
Recently, adaptive test approaches have become a viable alternative to traditional fixed-item tests. The main advantage of adaptive tests is that they reach the desired measurement precision with fewer items. However, fewer items mean that each item has a greater effect on ability estimation, so these tests are more vulnerable to any flaw in an item. Items exhibiting differential item functioning (DIF) may therefore play an important role in examinees' test scores. This study investigated the effect of DIF items on the performance of computerized adaptive and multi-stage tests. Different test designs were examined under different test lengths and ratios of DIF items using Monte Carlo simulation. The computerized adaptive test (CAT) designs had the best measurement precision across all conditions. Among the multi-stage test (MST) panel designs, the 1-3-3 design had higher measurement precision in most conditions, although the findings were not sufficient to conclude that it outperformed the 1-2-4 design. Furthermore, CAT was the design least affected by an increase in the ratio of DIF items, whereas the MST designs were affected by that increase, especially in the 10-item test.
- Research Article
- 10.4108/eetinis.v12i4.10461
- Nov 4, 2025
- EAI Endorsed Transactions on Industrial Networks and Intelligent Systems
This paper investigates deep reinforcement learning (DRL) approaches designed to counter jammers that maximize disruption by employing unequal sweeping probabilities. We first propose a model and defense action based on a Markov Decision Process (MDP) under non-uniform attacks. A key drawback of the standard MDP model, however, is its assumption that the defending agent can acquire sufficient information about the jamming patterns to determine the transition probability matrix. In a dynamic environment, the attacker's patterns and models are often unknown or difficult to obtain. To overcome this limitation, RL techniques such as Q-learning, deep Q-network (DQN), and double deep Q-network (DDQN) have been considered effective defense strategies that operate without an explicit jamming model. Q-learning defense strategies can nonetheless be computationally expensive and require a long time to learn the optimal policy, because a large state space or a substantial number of actions causes the Q-table to grow exponentially. Leveraging the flexibility, adaptability, and scalability of RL, we propose a DQN framework designed to handle large-scale action spaces across expanded channels and jammers. Furthermore, to overcome the inherent overestimation bias of Q-learning and DQN, we investigate a DDQN framework. Assuming the estimation error of the action value in DQN follows a zero-mean Gaussian distribution, we analytically derive the expected loss. Numerical examples characterize the performance of the proposed algorithms and the superiority of DDQN over DQN and Q-learning.
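The overestimation bias that motivates the DDQN here comes from the max operator in the one-step DQN target. The sketch below contrasts the two targets with Q-functions passed as plain callables (toy values, no neural networks): because max_a Q_target(s′, a) ≥ Q_target(s′, argmax_a Q_online(s′, a)), the double-DQN target can never exceed the DQN target computed with the same target network.

```python
def dqn_target(q_target, next_state, actions, reward, gamma):
    """Standard DQN target: the target network both selects and evaluates
    the next action, which is the source of the overestimation bias."""
    return reward + gamma * max(q_target(next_state, a) for a in actions)

def ddqn_target(q_online, q_target, next_state, actions, reward, gamma):
    """Double DQN target: the online network selects the action, the target
    network evaluates it, decoupling selection from evaluation."""
    a_star = max(actions, key=lambda a: q_online(next_state, a))
    return reward + gamma * q_target(next_state, a_star)

# toy Q-functions that disagree on the best action (hypothetical values)
q_tgt = lambda s, a: [1.0, 2.0][a]
q_onl = lambda s, a: [3.0, 0.5][a]
```

With reward 0 and gamma 0.9, `dqn_target` backs up 0.9 × max(1.0, 2.0) = 1.8, while `ddqn_target` evaluates the online network's pick (action 0) under the target network and backs up 0.9 × 1.0 = 0.9, illustrating how decoupling damps inflated value estimates.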
- Research Article
25
- 10.1109/te.2004.837035
- May 1, 2005
- IEEE Transactions on Education
This paper designs and evaluates a platform-independent computerized adaptive testing (CAT) system intended to expand the diversity of CAT-administering platforms. By using extensible markup language (XML) to describe the item bank, CAT can be implemented more conveniently on different platforms, such as a personal computer (PC), a personal digital assistant (PDA), and other handheld devices. An experiment examined the effects of the administration platform on precision and efficiency. Fifty senior high school students took an English vocabulary CAT on both PC and PDA, enabling a firsthand comparison of the advantages and disadvantages of the two platforms. Both tests used the same well-calibrated item bank, ability estimation algorithm, and item selection strategy. The results indicate that the administration platform does not affect CAT performance. Responses to a questionnaire on the testing environment also show that most examinees preferred taking the test on the PDA. It is concluded that administering CAT on a PDA is as precise and efficient as on a PC, and more enjoyable and convenient.
- Research Article
3
- 10.3724/sp.j.1041.2008.00618
- Oct 28, 2008
- Acta Psychologica Sinica
The objective of computerized adaptive testing (CAT) is to construct an optimal test for each examinee. The item selection strategy (ISS) is an important part of CAT research, and its quality directly affects the reliability, efficiency, and security of the test. Much CAT research and application is based on dichotomously scored models, yet more information can be obtained from examinees with a polytomously scored model than with a dichotomous one, so it is worthwhile to extend CAT research to polytomous models. Both the Generalized Partial Credit Model (GPCM) and the Graded Response Model (GRM) are polytomously scored models, but they differ: in the GRM, the item grade difficulties ascend monotonically as the grades increase, whereas the GPCM represents the performance of an item as a sequence of steps. In the GPCM, each item contains several step parameters with no specific ordering rules; a later step cannot be attempted until the earlier step has been completed, yet a later step parameter may be lower than an earlier one. Considerable research has been conducted on CAT under the GRM; in our country, however, there are few reports on CAT research under the GPCM. This study compared four types of ISS for CAT under various conditions, using the GPCM in computer-simulated programs. The ISSs were implemented in four item pools, each with a capacity of 1000 items. Each item has five step parameters, and the discrimination and step parameters of the four pools are distributed as follows: (1) b ~ N(0,1), ln a ~ N(0,1); (2) b ~ N(0,1), a ~ U(0.2, 2.5); (3) b ~ U(-3,3), ln a ~ N(0,1); (4) b ~ U(-3,3), a ~ U(0.2, 2.5). Item parameters were generated by the Monte Carlo simulation method. Responses were generated according to the GPCM for a sample of 3000 simulatees with θ ~ N(0,1), whose trait levels were likewise generated by Monte Carlo simulation.
During the test, each simulatee's ability was estimated from the responses obtained so far. In addition, after the four item pools were sorted by the discrimination parameter to complete an a-stratified design, the above process was repeated. Thirty-two simulated CATs were administered, and the output was evaluated on the following measures: precision, ISS stability, evenness of item usage, average number of items used per examinee, χ², efficiency, and item overlap. Tables 1 and 2 report the index values obtained from the CAT process using the four types of ISS, both without stratification and with the a-stratified design, together with values calculated by summing the weights of every index. We draw the following conclusions: all ability estimates are highly accurate, with small differences among strategies. Comparing the weighted sums, we find that the distribution of the item step parameters greatly influences the choice of ISS. When the examinees' trait levels follow a normal distribution, the performance of an ISS is closely related to the step parameter distribution: (1) if an item's step parameters follow a normal distribution, the ISS that matches a random step parameter to the trait level is more efficient than the others; (2) if the step parameters follow a uniform distribution, the ISS that matches the item's average step parameter to the trait level is more efficient.
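The GPCM property noted above, that step parameters need not be ordered, falls directly out of its category-probability formula. A minimal sketch with hypothetical parameter values:

```python
import math

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities for step parameters [b_1..b_m].

    The numerator for category k is exp(sum_{j<=k} a * (theta - b_j)),
    with an empty sum (= 0) for category 0; normalizing over categories
    gives the probabilities. Unlike GRM thresholds, the b_j need not be
    ordered, matching the 'no specific ordering rules' property.
    """
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + a * (theta - b))
    m = max(cum)                                  # stabilize the softmax
    exps = [math.exp(c - m) for c in cum]
    total = sum(exps)
    return [e / total for e in exps]

gpcm_probs(0.0, 1.0, [0.5, -0.5])   # reversed steps are still a valid item
```

Passing reversed step parameters (a later step easier than an earlier one) still yields a proper probability distribution over the categories, which is exactly the flexibility, and the modeling subtlety, that distinguishes the GPCM from the GRM in the comparison above.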
- Research Article
- 10.1145/3750053
- Aug 8, 2025
- ACM Transactions on Evolutionary Learning and Optimization
Evolutionary Reinforcement Learning (EvoRL) has emerged as a promising approach to overcoming the limitations of traditional reinforcement learning (RL) by integrating the Evolutionary Computation (EC) paradigm with RL. However, the population-based nature of EC significantly increases computational costs, thereby restricting the exploration of algorithmic design choices and scalability in large-scale settings. To address this challenge, we introduce EvoRL, the first end-to-end EvoRL framework optimized for GPU acceleration. The framework executes the entire training pipeline on accelerators, including environment simulations and EC processes, leveraging hierarchical parallelism through vectorization and compilation techniques to achieve superior speed and scalability. This design enables the efficient training of large populations on a single machine. In addition to its performance-oriented design, EvoRL offers a comprehensive platform for EvoRL research, encompassing implementations of traditional RL algorithms (e.g., A2C, PPO, DDPG, TD3, SAC), Evolutionary Algorithms (e.g., CMA-ES, OpenES, ARS), and hybrid EvoRL paradigms such as Evolutionary-guided RL (e.g., ERL, CEM-RL) and Population-Based AutoRL (e.g., PBT). The framework's modular architecture and user-friendly interface allow researchers to seamlessly integrate new components, customize algorithms, and conduct fair benchmarking and ablation studies. The project is open-source and available at https://github.com/EMI-Group/evorl.
- Research Article
18
- 10.1016/j.trc.2023.104281
- Aug 4, 2023
- Transportation Research Part C: Emerging Technologies
Decentralized signal control for multi-modal traffic network: A deep reinforcement learning approach
- Conference Article
1
- 10.1109/aiiot54504.2022.9817159
- Jun 6, 2022
In this work we present an ensemble reinforcement learning (ERL) framework comprising deep Q-networks (DQNs). The aim is to optimize the sum rate of a non-orthogonal multiple access unmanned aerial vehicle (NOMA-UAV) network. Downlink (DL) power and bandwidth allotment for a NOMA cluster are managed over a fixed UAV trajectory. The environment is dynamic, and quality of service (QoS) requirements vary for each ground node. A comparative analysis between a conventional reinforcement learning (CRL) framework and the proposed ERL ensemble shows performance gains in the following metrics: the ERL achieves a 20 percent gain in average sum rate and a 2 percent gain in spectral efficiency over a conventional reinforcement learning framework with a single DQN. It also performs well across different UAV speeds in cumulative sum rate and device coverage.
- Research Article
6
- 10.3390/a16050227
- Apr 27, 2023
- Algorithms
Several approaches have applied Deep Reinforcement Learning (DRL) to Unmanned Aerial Vehicles (UAVs) for autonomous object tracking. These methods, however, are resource intensive and require prior knowledge of the environment, making them difficult to use in real-world applications. In this paper, we propose a Lightweight Deep Vision Reinforcement Learning (LDVRL) framework for dynamic object tracking that uses the camera as the only input source. Our framework employs several techniques, such as stacks of frames, segmentation maps from the simulation, and depth images, to reduce the overall computational cost. We conducted experiments with a non-sparse Deep Q-Network (DQN) (value-based) and a Deep Deterministic Policy Gradient (DDPG) (actor-critic) agent to test the adaptability of our framework to different methods and to identify which DRL method is most suitable for this task. In the end, the DQN was chosen for several reasons. First, a DQN has fewer networks than a DDPG, reducing the computational load on physical UAVs. Second, although the DQN is smaller in model size than the DDPG, it surprisingly still performs better on this specific task. Finally, a DQN is practical for this task because it can operate in a continuous state space. Using a high-fidelity simulation environment, our proposed approach is verified to be effective.
- Book Chapter
- 10.1002/9781118445112.stat06405
- Sep 29, 2014
This entry presents an overview of computer‐adaptive testing (CAT). The basic CAT algorithm is introduced, followed by a discussion of item response theory (IRT) test information functions and their role in adaptive testing. Issues such as test efficiency and security risks under CAT are also discussed. The entry concludes with a description of four promising CAT variants, including CAT shadow tests, a‐stratified CAT, testlet‐based CAT, and computer‐adaptive multistage testing.
- Research Article
122
- 10.1002/acr.20581
- Nov 1, 2011
- Arthritis Care & Research
The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS®) Roadmap initiative (www.nihpromis.org) is a cooperative research program designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes (PROs) across different medical conditions as well as the US population (1). The goal of PROMIS is to develop reliable and valid item banks using item response theory (IRT) that can be administered in a variety of formats, including short forms and computerized adaptive tests (CATs) (1-3). IRT is often referred to as "modern psychometric theory," in contrast to "classical test theory," or CTT. The basic idea behind both IRT and CTT is that there is some latent construct, or "trait," underlying an illness experience. This construct cannot be measured directly, but can be measured indirectly by creating items that are scaled and scored. For example, "fatigue," "pain," "disability," or even "happiness" are latent constructs, i.e., subjective feelings: we cannot take a picture of them, view them on an X-ray, or run a blood test to check for them. However, we know they exist. People can experience more or less of these constructs, so it is helpful to translate that experience into several levels represented by scores. IRT models the associations between items and the latent construct. Specifically, IRT models describe relationships between a respondent's underlying level on a construct and the probability of particular item responses. Tests developed with CTT (such as the Health Assessment Questionnaire-Disability Index (4) and the Scleroderma Gastrointestinal Tract instrument (5)) require administering all items, even though only some are appropriate for the person's trait level. Some items are too high for those with low trait levels (e.g., "can you walk 100 yards?" asked of a patient in a wheelchair) and some are too low for those with high trait levels (e.g., "can you get up from the chair?" asked of a runner).
In contrast, IRT methods make it possible to estimate a person's trait level with any subset of items in an item pool appropriate for that trait level. As such, any set of items from the pool could be administered as a fixed form or, for greatest efficiency, administered as a CAT. CAT is an approach to administering the subset of items in an item bank that is most informative for measuring the health construct, in order to achieve a target standard error of measurement. A good item bank will have items that represent a range of content and difficulty, provide a high level of information, and perform equivalently in different subgroups of the target population. How does CAT work? Without prior information, the first item administered in a CAT is typically one of medium trait level. For example, "In the past 7 days I was grouchy," with multi-level response options from "never" to "always." After each response, the person's trait level and associated standard error are estimated. The next item administered to someone not endorsing the first item is an "easier" item; if the person endorses the first item, the next item administered is a "harder" item. CAT is terminated when the standard error falls below an acceptable value. This provides an estimate of one's score with the minimal number of questions and no loss of measurement precision. In addition, scores from different studies using different items can be compared on a common scale. IRT models estimate the underlying scale score (theta) from the items. All items are calibrated on the same metric and, independently and collectively, provide an estimate of theta. Hence, it is possible to estimate the score using any subset of items and to estimate the standard error of the estimated score.
This allows assessment of health outcomes across patients with differing medical conditions (for example, comparing the score of someone with arthritis with that of someone with heart disease) at various degrees of physical and other impairment, at both the lowest and highest ends of the trait continuum.
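The walk-through above (start near the middle, re-estimate after each response, stop when the standard error is small enough) can be condensed into a short simulation. The sketch below uses a dichotomous 2PL bank with hypothetical parameters and a quadrature EAP estimator under a N(0,1) prior; PROMIS CATs use polytomous items and calibrated banks, so this illustrates the loop, not the PROMIS algorithm itself.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap(responses, items):
    """EAP estimate of theta and its posterior SD under a N(0,1) prior,
    computed on a simple quadrature grid."""
    grid = [g / 10.0 for g in range(-40, 41)]
    post = []
    for t in grid:
        w = math.exp(-0.5 * t * t)                # prior kernel
        for u, (a, b) in zip(responses, items):
            p = p_2pl(t, a, b)
            w *= p if u else (1.0 - p)
        post.append(w)
    z = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / z
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / z
    return mean, math.sqrt(var)

def run_cat(answer, bank, se_target=0.4, max_len=8):
    """Minimal CAT loop: max-information selection, EAP scoring, stop on
    the SE target or maximum length. `answer(i)` plays the examinee
    (hypothetical interface)."""
    responses, used = [], []
    theta, se = 0.0, 1.0                          # start at a medium trait level
    while len(used) < max_len and se > se_target:
        avail = (i for i in range(len(bank)) if i not in used)
        pick = max(avail, key=lambda i: bank[i][0] ** 2
                   * p_2pl(theta, *bank[i]) * (1 - p_2pl(theta, *bank[i])))
        used.append(pick)
        responses.append(answer(pick))
        theta, se = eap(responses, [bank[i] for i in used])
    return theta, se, used
```

Each iteration mirrors the description above: administer the most informative item at the current estimate, update the estimate and its standard error, and stop once precision is adequate or the maximum length is reached.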