Language Models Accurately Infer Correlations Between Psychological Items and Scales From Text Alone
Many behavioral scientists do not agree on core constructs and how they should be measured. Different literatures measure related constructs, but the connections are not always obvious to readers and meta-analysts. Many measures in behavioral science are based on agreement with survey items. Because these items are sentences, computerized language models can make connections between disparate measures and constructs and help researchers regain an overview of the rapidly growing, fragmented literature. Our fine-tuned language model, the SurveyBot3000, accurately predicts the correlations between survey items, the reliability of aggregated measurement scales, and the intercorrelations between scales from the items' positions in semantic vector space. We measured the model's performance as the convergence between its synthetic estimates and the empirical coefficients observed in human data. In our pilot study, the out-of-sample accuracy was .71 for item correlations, .89 for reliabilities, and .89 for scale correlations. In our preregistered validation study using novel items, the out-of-sample accuracy was slightly reduced to .59 for item correlations, .84 for reliabilities, and .84 for scale correlations. The synthetic item correlations showed an average prediction error of .17, with larger errors for mid-range correlations. Predictions generalized beyond the training data and across various domains, with some variability in accuracy. Our work shows that language models can reliably predict psychometric relationships between survey items, enabling researchers to evaluate new measures against existing scales, reduce redundancy in measurement, and work toward a more unified behavioral-science taxonomy.
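The core mechanic lends itself to a compact illustration. The sketch below embeds items with a generic sentence encoder and treats cosine similarities as synthetic inter-item correlations, then aggregates them into a scale reliability with the Spearman-Brown (standardized alpha) formula. The encoder name and example items are placeholders, not the paper's fine-tuned SurveyBot3000.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder off-the-shelf encoder; the paper uses a fine-tuned model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def predicted_item_correlations(items):
    """Cosine similarities between item embeddings, used as synthetic
    stand-ins for inter-item correlations."""
    emb = model.encode(items, normalize_embeddings=True)
    return emb @ emb.T

def standardized_alpha(r_matrix):
    """Reliability of a scale that averages k items, computed from the
    mean off-diagonal (inter-item) correlation."""
    k = r_matrix.shape[0]
    r_bar = r_matrix[~np.eye(k, dtype=bool)].mean()
    return k * r_bar / (1 + (k - 1) * r_bar)

items = ["I am the life of the party.",
         "I feel comfortable around people.",
         "I start conversations."]
R = predicted_item_correlations(items)
print(np.round(R, 2))         # synthetic item correlation matrix
print(standardized_alpha(R))  # synthetic scale reliability
```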
- Research Article
345
- 10.1109/5.880084
- Aug 1, 2000
- Proceedings of the IEEE
Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance.
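A minimal LSA construction, for orientation: factor a word-by-document count matrix with a truncated SVD so that words (and documents) become points in a shared low-rank semantic space, where proximity reflects co-occurrence structure. The toy corpus and rank are illustrative; the paper's full pipeline (entropy weighting, pseudo-document projection of the history, and the n-gram integration) is not reproduced here.

```python
import numpy as np

docs = ["stocks fell sharply",
        "market rallied stocks rose",
        "fresh basil pasta recipe",
        "simmer sauce basil garlic"]
vocab = sorted({w for d in docs for w in d.split()})
W = np.zeros((len(vocab), len(docs)))  # word-by-document counts
for j, d in enumerate(docs):
    for w in d.split():
        W[vocab.index(w), j] += 1

# Truncated SVD: rows of U scaled by s are word positions in a
# k-dimensional semantic space; documents live in the same space.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

w = {v: i for i, v in enumerate(vocab)}
print(cos(word_vecs[w["stocks"]], word_vecs[w["market"]]))  # same topic: high
print(cos(word_vecs[w["stocks"]], word_vecs[w["basil"]]))   # near zero
```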
- Research Article
2
- 10.28932/jutisi.v6i2.2684
- Aug 10, 2020
- Jurnal Teknik Informatika dan Sistem Informasi
This paper presents the development of an acoustic and language model for Bahasa Indonesia. A low Word Error Rate is an early sign of a good language and acoustic model. Although there are parameters other than Word Error Rate, our work focused on building a Bahasa Indonesia model with approximately 2,000 common words and achieved the minimum threshold of 25% Word Error Rate. Several experiments were conducted, consisting of different cases, training data, and testing data, with Word Error Rate and Testing Ratio as the main points of comparison. The language and acoustic models were built using Sphinx4 from Carnegie Mellon University, with a Hidden Markov Model for the acoustic model and an ARPA model for the language model. The model configurations, Beam Width and Force Alignment, directly correlate with Word Error Rate; they were set to 1e-80 and 1e-60, respectively, to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition in Bahasa Indonesia with a low Word Error Rate and to determine the optimal amounts of training and testing data that minimize the Word Error Rate.
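For reference, the Word Error Rate used throughout this abstract is the word-level Levenshtein distance (substitutions, insertions, and deletions) between a reference transcript and the recognizer's hypothesis, normalized by the reference length. A small self-contained implementation; the Indonesian example sentence is made up:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("saya makan nasi goreng", "saya makan nasi"))  # 0.25, i.e. 25%
```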
- Research Article
44
- 10.1111/epi.17570
- Mar 13, 2023
- Epilepsia
Epilepsy is a neurological disorder characterized by recurrent seizures, which can significantly impact the quality of life of affected individuals. Fortunately, advances in artificial intelligence (AI) are providing new opportunities to improve the diagnosis and treatment of epilepsy. Briefly, examples of ongoing epilepsy-related AI research include (1) algorithms that can analyze large amounts of electroencephalography (EEG) time-series data to label interictal epileptiform discharges both independently and with human supervision,1, 2 (2) diagnostic biomedical imaging with automated magnetic resonance imaging (MRI)–based lesion detection, surgical decision-making support, and outcome prediction,3, 4 and (3) Clinical Decision Support Systems (CDSS) that use patient data to provide physicians with recommendations based on up-to-date evidence and guidelines for overall improved diagnostic and therapeutic accuracy.5, 6 Language models are often used in chatbots and other conversational systems to generate context-aware, human-like text in response to an input prompt from a user. Such models are trained on large data sets of human conversations using machine learning (ML) techniques to learn the patterns and structure of natural language. Various AI language models have been developed since the 1950s, but significant advances have only been made in recent years due to improved ML models paired with an increased availability of large amounts of data and computational resources. Some of the earliest examples of such models include ELIZA, developed in the 1960s (one of the first programs to simulate a patient-doctor relationship), and SHRDLU from the 1970s (a program able to emulate dialogue around a simplified world with a limited number of objects, the "blocks world").7, 8 However, these early language models were inherently limited in their capabilities and could perform only a narrow range of tasks. In recent years, more complex, large language models have led to significant progress in natural language processing. Several of these AI language models can be used for dialogue, for example, (1) GPT-3 (Generative Pre-trained Transformer 3), a state-of-the-art language model developed by OpenAI that can generate contextual human-like text for a wide range of applications, including dialogues9; (2) DialoGPT, a language model developed by Microsoft that is trained on a large data set of social media comment chains and can generate responses in single-turn conversations10; (3) Meena, a sensible and specific language model developed by Google that is trained on human–human conversations from public-domain social media and can generate responses that are coherent and contextually appropriate11; and (4) XLNet, a language model developed by Google and Carnegie Mellon University that is capable of several language modeling tasks, including question answering, natural language inference, sentiment analysis, and document ranking; and many others.12 Such algorithms mainly enable the analysis of free-text electronic medical records and other written materials (e.g., test results and treatment plans) that are otherwise inaccessible without preprocessing and standardization. By analyzing large amounts of free-text medical records, language models can learn to identify and summarize relevant patterns.
Possible outcomes are information on identified hierarchical patient subgroups based on seizure patterns, documented treatment options, and outcome parameters.13-15 This structured information could be queried to provide personalized treatment recommendations based on medical history and other relevant factors. For example, by identifying early candidates for epilepsy surgery, language models can help minimize treatment delays and improve patient outcomes.16, 17 Another example of how language models can improve health care is Clinical Decision Support Systems (CDSS) trained to understand and offer natural responses to queries from health care providers. CDSS can provide medical or surgical treatment recommendations, suggest relevant clinical guidelines or protocols, and alert health care providers to potential errors or risks. Similar methods may be used to create virtual assistants for individuals with epilepsy to answer questions and provide easy access to information about their condition, treatment options, and other related topics, including driving, causes of premature death (including sudden unexpected death in epilepsy [SUDEP]), and status epilepticus.18, 19 Overall, AI language models have the future potential to significantly improve the care and management of individuals with epilepsy by providing natural conversational interfaces to both patients and physicians, allowing for easy access to structured information. We tested ChatGPT (ChatGPT Dec 15 Version, available at chat.openai.com, last accessed 01/07/2023 at 9:30 p.m.) for some of the use cases outlined above and provided the prompts used and model responses in Figure 1. First, we assumed the role of an individual with epilepsy taking levetiracetam. The model correctly responded that aggression is a possible side effect and recommended follow-up with the prescribing physician (Figure 1A).20 We then requested an Acute Seizure Action Plan (ASAP), a structured treatment plan used to guide patients and caregivers in the event of an epileptic seizure. The model provided a reasonable first draft in line with expert recommendations (Figure 1B).21 We found this useful to quickly generate general patient-facing informational content, but note that each ASAP should be subject to human review to screen for misinformation and to personalize the draft to include additional information from the individual's medical history and seizure types. We proceeded to present the model with a short, simplified case study of an individual with treatment-resistant left mesial temporal lobe epilepsy. Of interest, the model correctly integrated the medical history and diagnostic findings, noting that hippocampal sclerosis represents an epileptogenic lesion, before proceeding to recommend epilepsy surgery. Although this assessment represents a simplification of phase I presurgical evaluation findings and surgical strategies, the overall recommendation is sound.22 However, limitations became apparent when we informed the model that the previously discussed patient now had additional evidence of right temporal lobe seizure onset. Although the initial response is still appropriate, the advice that follows is actively harmful (Figure 1D). The model confidently states that the patient's health care team may consider bilateral temporal lobectomy or removal of both temporal lobes and the adjacent frontal and parietal lobes (a procedure incorrectly defined as "hemispherotomy" by the model).
Finally, even simple queries for structured information may fail if they concern particularly specialized or disputed areas of knowledge. In Figure 1E, we queried whether there is a relationship between variants in SCN9A and autosomal dominant epilepsy. The positive response was incorrect, likely due to misinformation in the academic literature present in the model's training data. Any relationship between variants in SCN9A and epilepsy has been refuted.23, 24 Previous research, as outlined above, has focused on language models trained on large amounts of public-domain data of general human conversations, commonly involving text messages from social media sites (Twitter, Reddit, Facebook, etc.) and some additional training data from books or academic literature. Indeed, the use cases shown above do not accurately represent the limits of this tool, as it was likely not trained on a sufficiently extensive, high-quality, domain-specific data set. It is important to note that language models cannot easily deal with disputed areas of knowledge and may not provide correct answers when contradictions are present in the input data. In light of these general considerations and the specific use cases outlined above, we argue that oversight from medical professionals will be needed to distill training information, and that all current AI applications need to be utilized in combination with human expertise. This is made immediately relevant by the fact that the broad ethical and legal implications of generative models are subjects of ongoing debate, with developers denying liability that may then fall onto the clinician user. Another important limitation of language models is an issue termed "hallucination," which describes confidently formulated answers with incorrect or nonsensical content.25 This misinformation is a result of biased training data or mismatches between token encoding and concept representation, and it is particularly difficult to identify. Finally, users should be aware that language models show bias against individuals based on gender, race, or disability.26 This issue is particularly sensitive in epilepsy, where stigma is still prevalent.27 Extraction of structured information from electronic medical records and assistance with simple human-supervised tasks are feasible use-case scenarios. However, these systems will need to be thoroughly tested and rigorously validated before they can be used in clinical care, in line with existing regulations on Software as a Medical Device or AI/ML-Enabled Medical Devices.28 Ultimately, AI language models in epilepsy care will depend on developing robust and reliable systems as per the Ethics Guidelines for Trustworthy Artificial Intelligence,29 driven by community-based data sharing and epilepsy-specific AI research. Outside of the clinical care of patients, several successful applications of language models (e.g., smart data processing, content generation, and sentiment analysis) provide a promising perspective on an AI-augmented future clinical practice. To achieve similar success stories with AI language models in epilepsy and general clinical practice, we will need to develop protocols for applying decentralized language learning models (i.e., using federated learning) to distributed identifiable patient data from multiple institutions. These coordinated decentralized language models will take advantage of the collective knowledge and insights of multiple sources, including specialty fields like epilepsy, while protecting patient privacy.
- Conference Article
5
- 10.3115/1075168.1075172
- Jan 1, 2003
This tutorial will cover the state of the art in language modeling. Language models give the probability of word sequences, i.e., "recognize speech" is much more probable than "wreck a nice beach". While most widely known for their use in speech recognition, language models are useful in a large number of areas, including information retrieval, machine translation, handwriting recognition, context-sensitive spelling correction, and text entry for Chinese and Japanese or on small input devices. Many language modeling techniques can be applied to other areas or to modeling any discrete sequence. This tutorial should be accessible to anyone with a basic knowledge of probability.
The most basic language models -- n-gram models -- essentially just count occurrences of words in training data. I will describe five relatively simple improvements over this baseline: smoothing, caching, skipping, sentence-mixture models, and clustering. I will talk a bit about the applications of language modeling and then I will quickly describe other recent promising work, and available tools and resources. I will begin by describing conventional-style language modeling techniques.
• Smoothing addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model. Smoothing gives a way to combine less specific, more accurate information with more specific, but noisier data. I will describe two classic techniques -- deleted interpolation and Katz (or Good-Turing) smoothing -- and one recent technique, Modified Kneser-Ney smoothing, which is the best known.
• Caching is a widely used technique that uses the observation that recently observed words are likely to occur again. Models from recently observed data can be combined with more general models to improve performance.
• Skipping models use the observation that even words that are not directly adjacent to the target word contain useful information.
• Sentence-mixture models use the observation that there are many different kinds of sentences. By modeling each sentence type separately, performance is improved.
• Clustering is one of the most useful language modeling techniques. Words can be grouped together into clusters through various automatic techniques; then the probability of a cluster can be predicted instead of the probability of the word. Clustering can be used to make smaller models or better performing ones. I will talk briefly about clustering issues specific to the huge amounts of data used in language modeling (hundreds of millions of words) to form thousands of clusters.
I will then talk about other language modeling applications, with an emphasis on information retrieval, but also mentioning spelling correction, machine translation, and entering text in Chinese or Japanese. I will briefly describe some recent successful techniques, including Bellegarda's work using latent semantic analysis and Wang's SuperARV language models. Finally, I will also talk about some practical aspects of language modeling. I will describe how freely available, off-the-shelf tools can be used to easily build language models, where to get data to train a language model, and how to use methods such as count cutoffs or relative-entropy techniques to prune language models.
Those who attend the tutorial should walk away with a broad understanding of current language modeling techniques, the background needed to build their own language models, and the ability to choose the right language modeling techniques for their applications.
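Of the five improvements listed, smoothing is the easiest to show concretely. Below is a toy version of deleted interpolation on a bigram model, with an illustrative corpus and a fixed mixing weight; in practice the weight lambda is tuned on held-out data:

```python
from collections import Counter

corpus = "recognize speech using common sense to wreck a nice beach".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
lam = 0.7  # interpolation weight; normally estimated on held-out data

def p_interp(w, prev):
    """Interpolated bigram probability: mix the sparse bigram estimate
    with the more robust unigram estimate."""
    p_uni = unigrams[w] / N
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("speech", "recognize"))  # 0.73: boosted by the seen bigram
print(p_interp("beach", "recognize"))   # 0.03: falls back to the unigram
```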
- Conference Article
58
- 10.1145/3366423.3380144
- Apr 20, 2020
Online services are interested in solutions to opinion mining, the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models, which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data. In this paper, we study the problem of how to significantly reduce the amount of labeled training data required to fine-tune language models for opinion mining. We describe Snippext, an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data. A novelty of Snippext is its clever use of a two-prong approach to achieve state-of-the-art (SOTA) performance with little labeled training data through: (1) data augmentation to automatically generate more labeled training data from existing examples, and (2) a semi-supervised learning technique that leverages the massive amount of unlabeled data in addition to the (limited amount of) labeled data. We show with extensive experiments that Snippext performs comparably to and can even exceed previous SOTA results on several opinion mining tasks with only half the training data required. Furthermore, it achieves new SOTA results when all training data are leveraged. In comparison to a baseline pipeline, we found that Snippext extracts significantly more fine-grained opinions, which enables new opportunities for downstream applications.
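The two-prong recipe can be summarized in one loss function. The following is a schematic, not Snippext's actual implementation: supervised cross-entropy on the labeled batch plus a consistency term that ties predictions on an unlabeled example to predictions on its augmented copy. The function names and the weight `mu` are invented for illustration:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, x_unlab_aug, mu=1.0):
    """Sketch of augmented semi-supervised fine-tuning: `model` is any
    classifier returning logits; x_unlab_aug is an augmented copy of
    x_unlab produced by a data-augmentation operator."""
    # Supervised signal from the (small) labeled batch.
    sup = F.cross_entropy(model(x_lab), y_lab)
    # Pseudo-target from the unlabeled batch; no gradient flows through it.
    with torch.no_grad():
        target = model(x_unlab).softmax(dim=-1)
    # Consistency: the augmented copy should predict the same distribution.
    pred = model(x_unlab_aug).log_softmax(dim=-1)
    consistency = F.kl_div(pred, target, reduction="batchmean")
    return sup + mu * consistency
```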
- Book Chapter
- 10.4324/9781315782379-223
- Apr 24, 2019
When solving problems, people often use a wide array of different strategies. Effective teaching often requires isolating what strategies students are using in order to more effectively structure the instructional intervention. Latent semantic analysis (LSA) is a computational tool that extracts the co-occurrence of words in a corpus. Through high-dimensional matrix decomposition, LSA is able to produce a "semantic space" allowing all experienced words, phrases, and sentences to be represented as vectors within that space. Free text responses to military scenarios were collected from officers in training as well as experienced military officers. The novice database is used as a descriptive reference, while the expert database provides the normative references.
- Conference Article
31
- 10.1145/3543507.3583199
- Apr 30, 2023
Past literature has illustrated that language models (LMs) often memorize parts of training instances and reproduce them in natural language generation (NLG) processes. However, it is unclear to what extent LMs "reuse" a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, therefore, we study three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2 generated texts, in comparison to its training data, and further analyze the plagiarism patterns of LMs fine-tuned with domain-specific corpora, which are extensively used in practice. Our results suggest that (1) the three types of plagiarism widely exist in LMs beyond memorization, (2) both the size and the decoding methods of LMs are strongly associated with the degrees of plagiarism they exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus similarity and homogeneity. Given that a majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. These patterns are likely to intensify as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. These findings overall cast doubt on the practicality of current LMs in mission-critical writing tasks and urge more discussion of the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.
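Of the three plagiarism types, verbatim reuse is the simplest to operationalize. Below is a toy detector that flags shared long word n-grams between a generated text and a training corpus; paraphrase and idea plagiarism require semantic matching and are outside this sketch, and the threshold of eight words is an arbitrary choice:

```python
def ngrams(text, n=8):
    """All word n-grams of a text, lowercased, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(generated, training_docs, n=8):
    """Return the n-grams a generated text shares with any training
    document; a non-empty result means a copied span of >= n words."""
    gen = ngrams(generated, n)
    train = set().union(*(ngrams(d, n) for d in training_docs))
    return gen & train
```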
- Research Article
22
- 10.1145/1034780.1034781
- Jun 1, 2004
- ACM Transactions on Asian Language Information Processing
Introduction to the special issue on statistical language modeling. Jianfeng Gao (Microsoft Research Asia, Beijing, China) and Chin-Yew Lin (Information Sciences Institute, University of Southern California). ACM Transactions on Asian Language Information Processing, Volume 3, Issue 2, June 2004, pp. 87–93.
- Video Transcripts
- 10.48448/6dax-5c76
- Aug 1, 2021
We present a targeted, scaled-up comparison of incremental processing in humans and neural language models by collecting by-word reaction time data for sixteen different syntactic test suites across a range of structural phenomena. Human reaction time data comes from a novel online experimental paradigm called the Interpolated Maze task. We compare human reaction times to by-word probabilities for four contemporary language models, with different architectures and trained on a range of data set sizes. We find that across many phenomena, both humans and language models show increased processing difficulty in ungrammatical sentence regions, with human and model 'accuracy' scores à la Marvin and Linzen (2018) about equal. However, although language model outputs match humans in direction, we show that models systematically under-predict the difference in magnitude of incremental processing difficulty between grammatical and ungrammatical sentences. Specifically, when models encounter syntactic violations they fail to accurately predict the longer reading times observed in the human data. These results call into question whether contemporary language models are approaching human-like performance for sensitivity to syntactic violations.
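By-word difficulty scores of the kind compared here are standardly derived from a causal language model's surprisal, -log2 p(word | context). A sketch using the Hugging Face transformers GPT-2 as a stand-in for the four models in the study:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisals(sentence):
    """Per-token surprisal in bits: -log2 p(token_t | tokens_<t)."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    return [(tok.decode(ids[0, t]),
             -logprobs[0, t - 1, ids[0, t]].item() / math.log(2))
            for t in range(1, ids.size(1))]

# Ungrammatical regions typically show elevated surprisal at the violation.
for token, bits in surprisals("The keys to the cabinet is on the table."):
    print(f"{token!r}: {bits:.1f} bits")
```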
- Research Article
1
- 10.2196/76773
- Jul 8, 2025
- JMIR Medical Informatics
Background: Disease name recognition is a fundamental task in clinical natural language processing, enabling the extraction of critical patient information from electronic health records. While recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models can perform reliably on previously unseen in-hospital data, which differs from training data in writing styles and clinical contexts, has not been thoroughly investigated.
Objective: This study evaluated the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data that was not included during training.
Methods: We used two corpora for this study: (1) a publicly available set of Japanese case reports, denoted as CR, and (2) a newly constructed corpus of progress notes, denoted as PN, written by ten physicians to capture stylistic variations of in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR. Specifically, we compared an LLM with a baseline masked language model (MLM). These models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting with the same document type as the training data, and (2) on PN, representing the out-of-domain (OOD) setting with a different document type. Robustness was assessed by calculating the performance gap (ie, the performance drop from the in-domain to the out-of-domain setting).
Results: The LLM demonstrated greater robustness, with a smaller performance gap in F1-scores (ID–OOD = −8.6) compared to the MLM baseline (ID–OOD = −13.9). This indicated more stable performance across ID and OOD settings, highlighting the effectiveness of fine-tuned LLMs for reliable use in diverse clinical settings.
Conclusions: Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes, with a smaller performance gap. These findings highlight the potential of LLMs as reliable tools for clinical natural language processing in low-resource language settings and support their deployment in real-world health care applications, where diversity in documentation is inevitable.
- Conference Article
8
- 10.1109/ase51524.2021.9678871
- Nov 1, 2021
Neural network models are having a significant impact on many real-world applications. Unfortunately, the increasing popularity and complexity of these models also amplify their security and privacy challenges, with privacy leakage from training data being one of the most prominent issues. In this context, prior studies proposed to analyze the abstraction behavior of neural network models, e.g., RNNs, to understand their robustness. However, the existing research rarely addresses privacy breaches caused by memorization in neural language models. To fill this gap, we propose a novel approach, DeepMemory, that analyzes the memorization behavior of a neural language model. We first construct a memorization-analysis-oriented model, taking both training data and a neural language model as input. We then build a semantic first-order Markov model to bind the constructed memorization-analysis-oriented model to the training data to analyze the memorization distribution. Finally, we apply our approach to address data leakage issues associated with memorization and to assist in dememorization. We evaluate our approach on one of the most popular neural language models, the LSTM-based language model, with three public datasets, namely WikiText-103, WMT2017, and IWSLT2016. We find that sentences in the studied datasets with low perplexity are more likely to be memorized. Our approach achieves an average AUC of 0.73 in automatically identifying data leakage issues during assessment. We also show that, with the assistance of DeepMemory, data breaches due to memorization in neural language models can be successfully mitigated by mutating training data without reducing the performance of neural language models.
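The headline AUC can be understood through a miniature version of the evaluation: if memorized sentences are the low-perplexity ones, then ranking sentences by perplexity should separate leaked from non-leaked samples. The scores and labels below are fabricated for illustration; DeepMemory's Markov-model analysis is considerably more involved:

```python
from sklearn.metrics import roc_auc_score

perplexities = [12.1, 250.3, 8.7, 310.9, 15.4, 190.2]  # one per sentence
leaked =       [1,    0,     1,   0,     1,    0]      # 1 = memorized/leaked

# Negate so that lower perplexity ranks as "more likely leaked";
# AUC then measures how well the ranking separates the two classes.
print(roc_auc_score(leaked, [-p for p in perplexities]))  # 1.0 on this toy data
```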
- Research Article
11
- 10.1109/tcbb.2022.3165592
- Jan 1, 2023
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
DNA-binding proteins (DBPs) play vital roles in the regulation of biological systems. Although there are already many deep learning methods for predicting the sequence specificities of DBPs, they face two challenges. Classic deep learning methods for DBP prediction usually fail to capture the dependencies between genomic sequences, since their commonly used one-hot codes are mutually orthogonal. In addition, these methods usually perform poorly when samples are inadequate. To address these two challenges, we developed a novel language model for mining DBPs using human genomic data and ChIP-seq datasets with decaying learning rates, named the DNA Fine-tuned Language Model (DFLM). It captures the dependencies between genome sequences based on the context of human genomic data and then fine-tunes the features for DBP tasks using different ChIP-seq datasets. First, we compared DFLM with existing widely used methods on 69 datasets and achieved excellent performance. Moreover, we conducted comparative experiments on complex DBPs and small datasets, and the results show that DFLM still achieved a significant improvement. Finally, through visualization analysis of one-hot encoding and DFLM, we found that one-hot encoding completely cuts off the dependencies within DNA sequences, whereas DFLM, using a language model, represents these dependencies well. Source code is available at: https://github.com/Deep-Bioinfo/DFLM.
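The one-hot observation is easy to verify directly: one-hot nucleotide codes are mutually orthogonal, so their inner products carry no similarity information, whereas any learned embedding can express graded relatedness. A short numpy demonstration; the embedding here is random, standing in for a trained one:

```python
import numpy as np

one_hot = np.eye(4)          # rows encode A, C, G, T
print(one_hot @ one_hot.T)   # identity matrix: every pair is orthogonal

rng = np.random.default_rng(0)
embed = rng.normal(size=(4, 8))        # placeholder learned embedding
print((embed @ embed.T).round(2))      # off-diagonals can express similarity
```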
- Research Article
12
- 10.1097/pcc.0000000000003468
- Feb 8, 2024
- Pediatric critical care medicine : a journal of the Society of Critical Care Medicine and the World Federation of Pediatric Intensive and Critical Care Societies
Objectives: Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients.
Design: Single-center retrospective cohort study.
Setting: Quaternary 40-bed PICU.
Patients: Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation.
Interventions: None.
Measurements and Main Results: Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses: 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 unit [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 unit [95% CI, 0.69-1.0], p < 0.001). The differential diagnoses generated by clinicians and fine-tuned LLaMa-7B were ranked as the highest quality in 144 (55%) and 74 (29%) cases, respectively.
Conclusions: A smaller LM fine-tuned using notes of PICU patients outperformed much larger models trained on general-domain data. Currently, LMs remain inferior but may serve as an adjunct to human clinicians in real-world tasks using real-world data.
- Research Article
6
- 10.1353/csd.2022.0017
- Mar 1, 2022
- Journal of College Student Development
Evaluating Mentorship Programs: Survey Items for Improving Student Affairs Practice. Frank Fernandez, Sarah Mason, Carrie L. Saetermoe, and Gabriela Chavira.
Student affairs professionals are increasingly expected to assess and evaluate programs that support student success (Fallucca, 2018). Beyond satisfying accountability pressures, assessment and evaluation work is important for gathering data to improve practice and support students. The two leading student affairs associations, ACPA and NASPA (2015), have called upon student affairs professionals to use assessment and evaluation practices in ways that are culturally relevant and that support the ethics and values of the profession. In this Research in Brief, we draw on our experiences evaluating a program that uses critical race theory to improve faculty–student mentoring. We share survey items from the quantitative portion of the evaluation, which examines the extent to which race is part of mentoring relationships. Then we provide preliminary findings to show that the survey items predict sense of belonging when they are used as a summative scale. We discuss implications for professionals who work with student affairs-based mentoring programs and who use assessment and evaluation in their work, and for undergraduate research mentors.
LITERATURE REVIEW AND STUDY CONTEXT
Positive mentorship perceptions relate to higher intent to persist (Baier et al., 2016) and sense of belonging (e.g., Apriceno et al., 2020). Apriceno and colleagues (2020) used multiple survey items to examine student engagement with mentors, but they were unable to consider how mentors incorporate Black, Indigenous, and People of Color (BIPOC) students' minoritized statuses as part of the mentoring relationship. Other scholars have captured multiple factors in mentoring relationships but overlooked the importance of race in those relationships (Docherty et al., 2018). For instance, while Strayhorn and Terrell's (2007) study did not examine mentor demographics, it showed that mentoring relationships were more influential when they moved beyond personal, informal interactions and became research-focused. Black students who experienced research-focused relationships had higher college satisfaction (Strayhorn & Terrell, 2007). Prior literature has suggested that student affairs professionals must consider how mentoring relationships can support racially minoritized students. Rendón (1994) concluded that colleges and universities should "orient faculty … to the needs and strengths of culturally diverse student populations" because those faculty may then serve "as validating mentors for students who find the transition to college difficult" (p. 46). Recent qualitative studies have supported Rendón's early work. In a study of a single campus, Rodriguez (2020) highlighted the importance of using mentoring to promote students' racial efficacy. Additionally, a multicampus case study of students in STEM found that some mentors do, in fact, validate racially minoritized backgrounds (McCoy et al., 2015). Conversely, when faculty take a colorblind approach to mentoring, they tend to frame "their mentoring relationships in culturally racist ways" and, perhaps unintentionally, assume "a condescending, paternalistic attitude toward Students of Color" (McCoy et al., 2015, p. 236). This paper adds to prior literature by presenting a set of items that may be used to evaluate mentoring programs that incorporate discussion of race and ethnicity.
BUILD PODER
The study is part of a program evaluation of an NIH-funded project, the Building Infrastructure Leading to Diversity (BUILD) Promoting Opportunities for Diversity in Education and Research (PODER) program at California State University, Northridge, a four-year comprehensive university. BUILD PODER (BP) is an established program that focuses on increasing diversity in biomedical and biomedically related fields. The program is designed to explicitly train faculty mentors and students from four colleges within the university (i.e., Health and Human Development, Social and Behavioral Sciences, Science and Mathematics, Engineering and Computer Science) about critical race theory (CRT). CRT is generally thought to include five tenets: (a) racism is "ordinary, not aberrational"; (b) racism serves the interest of "white-over-color ascendancy"; (c) race is socially constructed; (d) the dominant group alters notions of racial groups over time to suit needs; and (e) People of Color have unique insights and stories to tell (Delgado & Stefancic, 2017, pp. 8–9). BP faculty mentor training in the first year consisted of a 16-hour face-to-face workshop related to the more familiar aspects of CRT: microaggressions and microaffirmations...
- Research Article
1
- 10.1089/genbio.2023.29086.hth
- Apr 1, 2023
- GEN Biotechnology
Learning to Read and Write in the Language of Proteins