Large language model-based multiagent collaboration for abstract screening toward automated systematic reviews.

Abstract

Systematic reviews (SRs) are essential for evidence-based practice but remain labor-intensive, especially during abstract screening. This study evaluates whether collaboration among multiple large language models (multi-LLM collaboration) can improve efficiency and reduce the cost of abstract screening. Abstract screening was framed as a question-answering (QA) task using cost-effective LLMs. Three multi-LLM collaboration strategies were evaluated: majority voting by averaging opinions of peers, multi-agent debate for answer refinement, and LLM-based adjudication against the answers of individual QA baselines. These strategies were evaluated on 28 SRs of the CLEF eHealth 2019 technology-assisted review benchmark using standard performance metrics such as mean average precision (MAP) and work saved over sampling at 95% recall (WSS@95%). Multi-LLM collaboration significantly outperformed the QA baselines. Majority voting was overall the best strategy, achieving the highest MAP of 0.462 and 0.341 on subsets of SRs about clinical intervention and diagnostic technology assessment, respectively, with WSS@95% of 0.606 and 0.680, in theory enabling up to a 68% workload reduction at 95% recall of all relevant studies. Multi-agent debate improved weaker models most. Our adjudicator-as-a-ranker method was the second strongest approach, surpassing adjudicator-as-a-judge, but at a significantly higher cost than majority voting and debating. Multi-LLM collaboration substantially improves abstract screening efficiency, and its success lies in model diversity. Making the best use of diversity, majority voting stands out for both excellent performance and low cost compared to adjudication. Despite context-dependent gains and diminishing model diversity, multi-agent debate is still a cost-effective strategy and a potential direction for further research.
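The majority-voting strategy described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the yes/no answer format and the tie-breaking rule favoring inclusion are assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate yes/no screening answers from several LLM peers.

    `answers` is a list of "yes"/"no" strings, one per model.
    Ties are broken toward "yes" (include the abstract), favoring
    recall over precision as abstract screening requires.
    """
    counts = Counter(a.lower() for a in answers)
    return "yes" if counts["yes"] >= counts["no"] else "no"

# Three cheap models screen one abstract against the review question.
votes = ["yes", "no", "yes"]
print(majority_vote(votes))  # two of three models deem it relevant
```

Because each model errs on different abstracts, pooling diverse opinions cancels out individual mistakes, which is why the paper attributes the strategy's success to model diversity.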

Similar Papers
  • Research Article
  • Citations: 10
  • 10.1109/tase.2020.3035291
Parameter Identification for Bernoulli Serial Production Line Model
  • Nov 25, 2020
  • IEEE Transactions on Automation Science and Engineering
  • Yuting Sun + 3 more

Model-based analysis of production systems is one of the main areas in manufacturing research. The foundation of the successful application of these theoretical studies is the availability of valid and high-fidelity mathematical models that are capable of capturing the behavior of job flow in production systems. The modeling process of a production system, however, may require a significant amount of nonstandardized work that can only be done properly by someone with solid training in the area and extensive experience through real case studies. This poses a critical challenge in the effective implementation of these valuable theoretical results in the Industry 4.0 era. To overcome this, we propose a new production systems modeling paradigm inspired by system identification: calculate production system model parameters that best match the standard system performance metrics measured on the factory floor. Specifically, in this article, we consider production lines characterized by the Bernoulli serial line model and develop algorithms that identify model parameters to fit the system throughput and work-in-process. Analytical algorithms are derived to solve this problem in a two-machine line case and then extended to multi-machine lines. The accuracy and computational efficiency of the algorithms are demonstrated through extensive numerical experiments. Note to Practitioners: A high-fidelity mathematical model is of critical importance to the implementation of any model-based production system analysis method. Currently, the construction of such models is carried out in an ad hoc manner. The quality of the resulting models may heavily depend on the training, experience, intuition, and personal preference of the modeler. The proposed model parameter identification method focuses on standard key performance indices commonly measured on the factory floor.
The advantage is twofold. First, these standard performance metrics are consistently defined regardless of industry, thus avoiding any data-ambiguity issue that may occur when using complex machine/equipment status data. Second, measuring these performance metrics in real time is typically convenient and cost effective, even for manufacturing plants without high-end IT infrastructure, thus making the technology accessible to not only large but also small- and mid-sized manufacturers. Using the algorithms developed in this article, a practitioner can quickly construct a serial production line model and then utilize it to access the rich library of production analysis, design, and control methods available in the literature.

  • Research Article
  • Citations: 32
  • 10.1007/s00034-018-0880-y
An Efficient QRS Complex Detection Using Optimally Designed Digital Differentiator
  • Jun 21, 2018
  • Circuits, Systems, and Signal Processing
  • Chandan Nayak + 3 more

Heart rate variability (HRV) analysis is a preliminary diagnostic method for assessing cardiac health. The reliability of an HRV analysis system depends solely on the accuracy of its QRS complex detector. Hence, in this paper, an optimally designed digital differentiator (DD) for precise detection of the QRS complex is proposed. The proposed DD is designed using an efficient evolutionary optimization technique, the gases Brownian motion optimization (GBMO) algorithm, and is used in the preprocessing stage of the QRS detector. In the GBMO algorithm, a balanced trade-off is maintained between the exploration and exploitation phases to find the global optimum solution. The electrocardiogram signal is preprocessed using the proposed DD to generate feature signals corresponding to the R-peaks only. The detection technique utilizes the Hilbert transform and zero-crossing detection. The proposed approach is verified against all first-channel records of the MIT/BIH arrhythmia database using the standard QRS detection performance metrics, producing a sensitivity (Se) of 99.92%, positive predictivity (+P) of 99.92%, detection error rate (DER) of 0.1562%, QRS detection rate of 99.92%, accuracy (Acc) of 99.84%, and F1 score of 0.9992. With respect to the standard performance metrics, the proposed QRS detector outperforms all recently reported QRS detection techniques.
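The QRS detection metrics quoted above follow directly from beat-level true positive, false positive, and false negative counts. A minimal sketch (the counts used below are illustrative, not from the paper, and the DER denominator of total annotated beats is a common convention, not stated in the abstract):

```python
def qrs_metrics(tp, fp, fn):
    """Standard QRS detector metrics from beat-level counts."""
    se = tp / (tp + fn)             # sensitivity (Se)
    pp = tp / (tp + fp)             # positive predictivity (+P)
    der = (fp + fn) / (tp + fn)     # detection error rate vs. annotated beats
    f1 = 2 * se * pp / (se + pp)    # F1 score
    return se, pp, der, f1

# Illustrative counts for a record with 2,000 annotated beats.
se, pp, der, f1 = qrs_metrics(tp=1998, fp=2, fn=2)
print(f"Se={se:.4%} +P={pp:.4%} DER={der:.4%} F1={f1:.4f}")
```

Note that Se and +P alone can look excellent even on a record with few beats; DER summarizes both error types relative to the annotated beat count.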

  • Preprint Article
  • Citations: 6
  • 10.7287/peerj.preprints.2838v1
The impact of using large training data set KDD99 on classification accuracy
  • Mar 1, 2017
  • Atilla Özgür + 1 more

This study investigates the effects of using a large data set on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms have been applied: (1) Adaboost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark dataset, KDD99, has been used to train and test the classifiers. The full KDD99 training set contains 4.9 million instances, while the full test set contains 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% of the data for training and 1.2%–100% for testing, this study uses the full training and test datasets. The Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of the classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1 score. To show the effects of dataset size, classifier performance has also been evaluated using the following hardware metrics: Training Time, Working Memory, and Model Size. Test results show improvements over previous studies on the standard performance metrics.

  • Research Article
  • 10.22214/ijraset.2026.77153
Coral Health Monitoring for Sustainable Reef Conservation using YOLOv8-Based CNN Model
  • Jan 31, 2026
  • International Journal for Research in Applied Science and Engineering Technology
  • M S Naveen Roy

Coral reefs are among the most biologically diverse ecosystems on Earth and play a crucial role in maintaining marine ecological balance. However, climate change–induced ocean warming, acidification, and human activities have accelerated coral bleaching and reef degradation. Continuous and accurate monitoring of coral health is therefore essential, yet manual assessment methods are labor-intensive, time-consuming, and prone to subjectivity. This paper presents an IEEE-style research study derived strictly from the project report titled Deep Diving into YOLOv8 CNN Model Driven Coral Health Monitoring for Sustainable Reef Conservation. The proposed work introduces an automated deep learning–based framework using the YOLOv8 convolutional neural network for real-time coral health detection and classification. A dataset consisting of 923 labeled underwater coral images representing healthy, partially bleached, and fully bleached corals is utilized. The system employs image preprocessing and augmentation techniques to handle underwater distortions, followed by transfer learning–based fine-tuning of the YOLOv8 model. The trained model is evaluated using standard performance metrics including precision, recall, mean Average Precision (mAP), and confusion matrix analysis. Experimental results demonstrate that the proposed approach achieves reliable detection accuracy and robust generalization, validating its suitability for scalable and real-time reef monitoring applications. The framework provides a practical and intelligent solution to support sustainable coral reef conservation efforts.

  • Research Article
  • Citations: 13
  • 10.1007/s11548-011-0643-8
Case-based fracture image retrieval
  • Jul 29, 2011
  • International Journal of Computer Assisted Radiology and Surgery
  • Xin Zhou + 2 more

Case-based fracture image retrieval can assist surgeons in decisions regarding new cases by supplying visually similar past cases. This tool may guide fracture fixation and management through comparison of long-term outcomes in similar cases. A fracture image database collected over 10 years at the orthopedic service of the University Hospitals of Geneva was used. This database contains 2,690 fracture cases associated with 43 classes (based on the AO/OTA classification). A case-based retrieval engine was developed and evaluated using retrieval precision as a performance metric. Only cases in the same class as the query case are considered as relevant. The scale-invariant feature transform (SIFT) is used for image analysis. Performance evaluation was computed in terms of mean average precision (MAP) and early precision (P10, P30). Retrieval results produced with the GNU image finding tool (GIFT) were used as a baseline. Two sampling strategies were evaluated. One used a dense 40 × 40 pixel grid sampling, and the second one used the standard SIFT features. Based on dense pixel grid sampling, three unsupervised feature selection strategies were introduced to further improve retrieval performance. With dense pixel grid sampling, the image is divided into 1,600 (40 × 40) square blocks. The goal is to emphasize the salient regions (blocks) and ignore irrelevant regions. Regions are considered as important when a high variance of the visual features is found. The first strategy is to calculate the variance of all descriptors on the global database. The second strategy is to calculate the variance of all descriptors for each case. A third strategy is to perform a thumbnail image clustering in a first step and then to calculate the variance for each cluster. Finally, a fusion between a SIFT-based system and GIFT is performed. 
A first comparison on the selection of sampling strategies using SIFT features shows that dense sampling using a pixel grid (MAP = 0.18) outperformed the SIFT detector-based sampling approach (MAP = 0.10). In a second step, three unsupervised feature selection strategies were evaluated. A grid parameter search is applied to optimize parameters for feature selection and clustering. Results show that using half of the regions (700 or 800) obtains the best performance for all three strategies. Increasing the number of clusters in clustering can also improve the retrieval performance. The SIFT descriptor variance in each case gave the best indication of saliency for the regions (MAP = 0.23), better than the other two strategies (MAP = 0.20 and 0.21). Combining GIFT (MAP = 0.23) and the best SIFT strategy (MAP = 0.23) produced significantly better results (MAP = 0.27) than each system alone. A case-based fracture retrieval engine was developed and is available for online demonstration. SIFT is used to extract local features, and three feature selection strategies were introduced and evaluated. A baseline using the GIFT system was used to evaluate the salient point-based approaches. Without supervised learning, SIFT-based systems with optimized parameters slightly outperformed the GIFT system. A fusion of the two approaches shows that the information contained in the two approaches is complementary. Supervised learning on the feature space is foreseen as the next step of this study.

  • Conference Article
  • Citations: 7
  • 10.1109/icime.2009.44
The Factors Affecting the Performance of Data Fusion Algorithms
  • Jan 1, 2009
  • Mohammad Othman Nassar + 1 more

The enormous amount of data distributed on the World Wide Web can be very useful if users are able to access it in an easy and appropriate way; search engines help users find what they need in this enormous amount of data. Meta-search is the application of data fusion to document retrieval: a metasearch engine takes as input the N ranked lists output by each of N search engines in response to a given query and, as output, computes a single ranked list, which is hopefully an improvement over any input list as measured by standard information retrieval performance metrics such as mean average precision (MAP). Our goal in this paper is to answer the following question: what are the factors affecting the performance of data fusion algorithms? The reason for introducing these factors is the absence of a single source in the literature that presents all of them in an organized and complete manner. This work is needed to integrate all data fusion performance research findings. This paper contributes to the data fusion literature in two ways. First, it delivers all factors affecting the performance of data fusion algorithms in an organized and complete manner. Second, it delivers recommendations on how and when to deal with the factors that affect performance.
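Mean average precision, the metric used above to compare fused lists, can be computed from binary relevance judgments as follows. The document identifiers and queries below are illustrative only:

```python
def average_precision(ranked, relevant):
    """AP of one ranked list: mean of precision at each relevant hit,
    normalized by the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, relevants):
    """MAP over several queries: mean of the per-query AP values."""
    return sum(average_precision(r, q) for r, q in zip(runs, relevants)) / len(runs)

# Two toy queries with known relevant documents.
runs = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
rels = [{"d1", "d3"}, {"d5"}]
print(mean_average_precision(runs, rels))
```

Because AP rewards placing relevant documents early, a fused list that promotes documents many input engines agree on tends to raise MAP over any single input list.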

  • Research Article
  • Citations: 44
  • 10.1016/j.eswa.2024.124922
Unsupervised anomaly detection in time-series: An extensive evaluation and analysis of state-of-the-art methods
  • Jul 30, 2024
  • Expert Systems With Applications
  • Nesryne Mejri + 5 more

Unsupervised anomaly detection in time-series has been extensively investigated in the literature. Notwithstanding the relevance of this topic in numerous application fields, a comprehensive and extensive evaluation of recent state-of-the-art techniques taking into account real-world constraints is still needed. Some efforts have been made to compare existing unsupervised time-series anomaly detection methods rigorously. However, only standard performance metrics, namely precision, recall, and F1-score are usually considered. Essential aspects for assessing their practical relevance are therefore neglected. This paper proposes an in-depth evaluation study of recent unsupervised anomaly detection techniques in time-series. Instead of relying solely on standard performance metrics, additional yet informative metrics and protocols are taken into account. In particular, (i) more elaborate performance metrics specifically tailored for time-series are used; (ii) the model size and the model stability are studied; (iii) an analysis of the tested approaches with respect to the anomaly type is provided; and (iv) a clear and unique protocol is followed for all experiments. Overall, this extensive analysis aims to assess the maturity of state-of-the-art time-series anomaly detection, give insights regarding their applicability under real-world setups and provide to the community a more complete evaluation protocol.

  • Conference Article
  • Citations: 1
  • 10.2118/189811-ms
Evaluating Human-Machine Interaction for Automated Drilling Systems
  • Mar 13, 2018
  • A Farhangfar + 2 more

The efficient utilization of automation systems necessitates a clear understanding of the interaction of the human operator, the automation system, and any automated routines being run. If automated routines perform actions not desirable to the human operator, time is lost as the routine is interrupted and human control re-engaged. In addition, automatic handoff back to the human operator, both due to human intervention and due to exit conditions or anomalies, must also be managed. Activity data from rigs across North America is analyzed to understand automation process utilization and interrupt timing. Real-time and historical data are tagged, either automatically, semi-automatically using machine learning, or manually, to create a minute-by-minute timeline of rig operations. Operations are then classified both by operation – steering, reaming, making hole, etc. – and by well plan to understand how operational demands change automation system utilization. This results in a new set of metrics which can be used to precisely quantify the performance of both the human and automated drilling systems. Performance of the automation system is found to be a strong function of hole deviation, with the system outperforming during simple operations and in the vertical hole, but with reduced performance in the curve and horizontal, due to high interruption of certain tasks. It is found that standard performance metrics, such as slip to slip or weight to weight, are affected by standard practices, and if these are used to grade system performance, these practices must be accounted for. This paper presents a detailed investigation of the interaction of the driller with an automated drilling system and lays out the utilization of the automation system as a function of rig operations and well path. In particular, it is noted that standard performance metrics must consider standard practices, which may differ between operations.

  • Conference Article
  • Citations: 413
  • 10.1145/1148170.1148176
User performance versus precision measures for simple search tasks
  • Aug 6, 2006
  • Andrew Turpin + 1 more

Several recent studies have demonstrated that the type of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and, a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and 95%. Our results show that there is no significant relationship between system effectiveness measured by MAP and the precision-based task. A significant, but weak relationship is present for the precision at one document returned metric. A weak relationship is present between MAP and the simple recall-based task.

  • Research Article
  • 10.55592/cilamce.v6i06.10334
Automatic Detection of Seafloor Bedforms for 3D Bathymetric Data
  • Dec 2, 2024
  • Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE)
  • Larissa Marques Freguete + 3 more

Seafloor bedforms are sedimentary structures that can reveal local hydrodynamic conditions, as they result from the bottom sediments' response to the dominant flow. The study of these dynamic bottom shapes is important because they can provide auxiliary information for mapping benthic habitats and can present a risk to navigation and marine structures. Mapping of the sea bottom is often done with dense and spatially extensive 3D bathymetry data (point clouds), resulting in a more precise representation of the targeted area. However, the most common practice in this field is to rasterize the original data, for ease of computation and because of the lack of a well-established methodology for 3D bathymetric data processing. As a consequence, the rasterization process causes information loss and increases data processing time. The advantage of point clouds is that they comprise a larger volume of information in the same file, e.g. depth, intensity, RGB, and point classes, enabling a closer representation of reality and the simultaneous generation of multiple products such as potential habitat maps, a Landscape Information Model (LIM), and a denser Digital Bathymetry Model (DBM). Therefore, the purpose of this work is to apply a modified U-Net convolutional neural network to detect and classify bedform types in the bathymetry point cloud. The methodology will be applied to two datasets collected on the Espirito Santo Continental Shelf: Recifes Esquecidos (RE) and Doce River (DR). The following methodological steps will be carried out: (1) data collection, (2) generation of bathymetry derivatives such as slope, curvature, geomorphons, and aspect, plus data tiling for augmentation, (3) image labeling, (4) model implementation including the training, validation, and testing steps, (5) calculation of the model's performance metrics: Intersection over Union (IoU), mean Average Precision (mAP), recall, and precision.
This study will also present a performance comparison between the modified U-Net and a Random Forest (RF) classifier. The selected areas are very rich in sedimentary features, so the expected final product is a point cloud classified into transversal, parallel, transitional, and artifact (errors related to the surveying method) classes. The non-convolutional model, RF, is also expected to perform better.

  • Conference Article
  • Citations: 3
  • 10.1109/iccitechn.2015.7488040
An effective approach for relevant paragraph retrieval in Question Answering systems
  • Dec 1, 2015
  • Md Moinul Hoque + 1 more

Paragraph retrieval is a substantial task in Question Answering (QA) systems. It represents the extraction, from a huge data collection, of the passages that are likely to contain an answer to a question, and is an important intermediary step between a user of a QA system and the answers. It is nearly impossible to analyse a large collection of data exhaustively in a short time, and because of the vast nature of the background data, it is very important to narrow down the search space from which an answer can be sought. In this paper, we address the information extraction step and present an effectively designed model for relevant paragraph retrieval which is also efficient in terms of execution time. The model deals with the structure and organization of information, performs an in-depth analysis of the user's question, and presents priority-based search for retrieving paragraphs as needed, which is both effective and time-efficient. Experiments were carried out and compared against similar systems based on the data of the document retrieval task of the Text REtrieval Conference (TREC) 2005. We also tested our methodology against the data set from TREC 2007. System performance was measured in terms of various parameters such as R-precision, Recall, and Mean Average Precision. The satisfactory results achieved by our approach establish its suitability for integration into real-time QA systems.

  • Research Article
  • 10.1118/1.3476208
Sci-Sat AM(1): Planning - 08: Estimating Planning Target Volume Margins for Fractionated Stereotactic Radiotherapy on Perfexion
  • Jul 1, 2010
  • Medical Physics
  • Mark Ruschin + 7 more

The purpose of this study was to estimate planning target volume (PTV) margins for frame‐based Perfexion (PFX) SRT using the eXtend™ system's relocatable head frame (RHF). Patients with large brain metastases are currently undergoing hypofractionated (3 fractions) SRT on PFX, enrolled in a phase 1 dose‐escalation clinical trial. In a prior investigation, the performance of the RHF was quantified using cone‐beam CT (CBCT) in fourteen patients undergoing linac‐based SRT (median: 30 treatment fractions). Standard performance metrics — group mean (μ), systematic (Σ) and random (σ) uncertainties — were determined for frame‐guided positioning and intra‐fraction motion. A published margin‐determination formula (2.5Σ + 0.7σ) was used to estimate the PTV margin. An additional factor of (σ/√3) was added to the systematic component of the formula when initially designing the PTV for 3 fractions in PFX‐SRT. To more accurately account for PFX dose distributions and only 3 treatment fractions, a population‐based stochastic modeling approach is being developed to refine the PTV margin for hypofractionated PFX‐SRT. For frame‐guided SRT (30 fractions), the post‐correction positioning performance estimates were μ(position) = {0.1;−0.2;−0.6}mm, Σ(position) = {0.2;0.8;0.6}mm, and σ(position) = {0.3;0.6;0.4}mm in {Right; Superior; Anterior}. For intra‐fraction motion, μ(motion) = {−0.1;−0.1;0.0}mm, Σ(motion) = {0.2;0.2;0.1}mm, and σ(motion) = {0.2;0.4;0.2}mm. The margin formula indicated an expansion of {1.0;2.6;1.8}mm and {1.6;3.1;2.3}mm for 30 fractions and 3 fractions, respectively. For three patients treated to date on PFX, μ(position) = {0.2;−0.9;−0.8}mm. To ensure that the GTV receives the prescription dose, PTV margins have been calculated to account for the geometric uncertainties present in PFX‐SRT. The margins will be reviewed as more data are collected, RHF refinements are made, and stochastic‐based modeling is used.
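The margin recipe quoted above, 2.5Σ + 0.7σ, reproduces the reported 30-fraction expansion when the positioning and intra-fraction components are combined. A sketch follows; note the quadrature combination of the two uncertainty sources is an assumption about the authors' bookkeeping, not stated in the abstract:

```python
import math

def ptv_margin(sigma_sys, sigma_rand):
    """van Herk-style PTV margin in mm: 2.5*Sigma + 0.7*sigma."""
    return 2.5 * sigma_sys + 0.7 * sigma_rand

# Reported per-axis uncertainties {Right; Superior; Anterior}, in mm.
sys_pos, rand_pos = [0.2, 0.8, 0.6], [0.3, 0.6, 0.4]   # positioning
sys_mot, rand_mot = [0.2, 0.2, 0.1], [0.2, 0.4, 0.2]   # intra-fraction motion

margins = []
for sp, rp, sm, rm in zip(sys_pos, rand_pos, sys_mot, rand_mot):
    sigma_sys = math.hypot(sp, sm)    # combine components in quadrature
    sigma_rand = math.hypot(rp, rm)
    margins.append(round(ptv_margin(sigma_sys, sigma_rand), 1))

print(margins)  # → [1.0, 2.6, 1.8], the reported 30-fraction expansion
```

The 3-fraction margins additionally fold a σ/√3 term into the systematic component, as the abstract notes.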

  • Research Article
  • Citations: 3
  • 10.3390/s24248059
Deep FS: A Deep Learning Approach for Surface Solar Radiation.
  • Dec 18, 2024
  • Sensors (Basel, Switzerland)
  • Fatih Kihtir + 1 more

Contemporary environmental challenges are increasingly significant, driven primarily by drastic climate change. The prediction of solar radiation is a crucial aspect of solar energy applications and meteorological forecasting. The amount of solar radiation reaching Earth's surface (Global Horizontal Irradiance, GHI) varies with atmospheric conditions, geographical location, and temporal factors. This paper presents a novel methodology for estimating surface sun exposure using advanced deep learning techniques. The proposed method is tested and validated using data obtained from NASA's Goddard Earth Sciences Data and Information Services Center (GES DISC), the SORCE (Solar Radiation and Climate Experiment) dataset. Features are extracted using a deep learning method, Deep-FS, which selects the features most appropriate for predicting surface exposure. Time series analysis was conducted using Convolutional Neural Networks (CNNs). The proposed Deep-FS model is validated against traditional approaches and models through standard performance metrics, and the experimental results show that it outperforms the traditional models.

  • Research Article
  • Citations: 83
  • 10.1186/1471-2288-12-102
Assessment of performance of survival prediction models for cancer prognosis
  • Jul 23, 2012
  • BMC Medical Research Methodology
  • Hung-Chia Chen + 3 more

Background: Cancer survival studies are commonly analyzed using survival-time prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survival-time threshold, and binary classifiers are applied to predict each patient’s class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments. Methods: We compare the survival-time prediction and survival-time threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and cross-validation methods to enable unambiguous statistical inferences for several performance metrics used with the survival-time prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models from combinations of clinical and gene expression models. Results: A public breast cancer dataset was used to compare several performance metrics using five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the two-group comparison was insignificant, and vice versa. 2) The randomization test and cross-validation were generally consistent with the p-values obtained from the standard performance metrics. 3) Binary classifiers highly depended on how the risk groups were defined; a slight change of the survival threshold for assignment of classes led to very different prediction results. Conclusions: 1) Different performance metrics for evaluation of a survival prediction model may give different conclusions in its discriminatory ability.
2) Evaluation using a high-risk versus low-risk group comparison depends on the selected risk-score threshold; a plot of p-values from all possible thresholds can show the sensitivity of the threshold selection. 3) A randomization test of the significance of Somers’ rank correlation can be used for further evaluation of performance of a prediction model. 4) The cross-validated power of survival prediction models decreases as the training and test sets become less balanced.

  • Book Chapter
  • Citations: 7
  • 10.1007/978-3-030-43887-6_62
UNCC Biomedical Semantic Question Answering Systems. BioASQ: Task-7B, Phase-B
  • Jan 1, 2020
  • Sai Krishna Telukuntla + 2 more

In this paper, we detail our submission to the 7th BioASQ competition. We present our approach for Task-7b, Phase B, the Exact Answering Task. These Question Answering (QA) tasks include Factoid, Yes/No, and List-type question answering. Our system is based on a contextual word embedding model: a Bidirectional Encoder Representations from Transformers (BERT) system, fine-tuned for the biomedical question answering task using BioBERT. In the third test batch set, our system achieved the highest MRR score for the Factoid Question Answering task. For the List-type question answering task, our system achieved the highest recall score in the fourth test batch set. Along with our detailed approach, we present the results of our submissions and highlight identified shortcomings of our current approach and ways to address them in future experiments.
