Automatic fine-grained semantic classification for domain adaptation
This paper presents an automated method for deriving fine-grained, domain-specific semantic classes of verb arguments and clustering verbs into meaningful groups, enabling effective semantic typing; in a pilot study, the approach achieves near-perfect recall and high precision in classifying verb arguments within the same domain.
Assigning arguments of verbs to different semantic classes ('semantic typing'), or alternatively, checking the 'selectional restrictions' of predicates, is a fundamental component of many natural language processing tasks. However, a common experience has been that general purpose semantic classes, such as those encoded in resources like WordNet, or handcrafted subject-specific ontologies, are seldom quite right when it comes to analysing texts from a particular domain. In this paper we describe a method of automatically deriving fine-grained, domain-specific semantic classes of arguments while simultaneously clustering verbs into semantically meaningful groups: the first step in verb sense induction. We show that in a small pilot study on new examples from the same domain we are able to achieve almost perfect recall and reasonably high precision in the semantic typing of verb arguments in these texts.
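The recall and precision figures mentioned above can be illustrated with a minimal sketch. It assumes that semantic typing output is represented as a set of (argument, class) pairs; the arguments and class names below are hypothetical and do not come from the paper's data.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (argument, class) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard typings and system output (illustration only).
gold = {("aspirin", "DRUG"), ("ibuprofen", "DRUG"), ("fever", "SYMPTOM")}
predicted = gold | {("headache", "DRUG")}  # one spurious typing

precision, recall = precision_recall(predicted, gold)
```

In this toy case every gold typing is recovered (perfect recall) while one extra typing lowers precision, which is the same shape of result the abstract describes.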
- Research Article
7
- 10.3414/me17-01-0120
- Feb 1, 2018
- Methods of information in medicine
The UMLS assigns semantic types to all its integrated concepts. The semantic types are widely used in various natural language processing tasks in the biomedical domain, such as named entity recognition, semantic disambiguation, and semantic annotation. Due to the size of the UMLS, erroneous semantic type assignments are hard to detect. It is imperative to devise automated techniques to identify errors and inconsistencies in semantic type assignments. Our objectives are to design a methodology that performs programmatic checks to detect semantic type assignment errors for UMLS concepts with one or more SNOMED CT terms, and to evaluate concepts in a selected set of SNOMED CT hierarchies to verify our hypothesis that UMLS semantic type assignment errors may exist in concepts residing in semantically inconsistent groups. Our methodology is a four-stage process: 1) partitioning concepts in a SNOMED CT hierarchy into semantically uniform groups based on their assigned semantic tags; 2) partitioning concepts in each group from 1) into disjoint sub-groups based on their semantic type assignments; 3) mapping each SNOMED CT semantic tag to one or more semantic types in the UMLS; 4) identifying semantically inconsistent groups, whose semantic tags and semantic types disagree according to the mapping from 3), and providing the concepts in such groups to domain experts for review. We applied our method to the UMLS 2013AA release. Concepts in the semantically inconsistent groups of the PHYSICAL FORCE and RECORD ARTIFACT hierarchies have error rates of 33% and 62.5% respectively, far higher than the error rates of 0.6% and 1% in the semantically consistent groups of the two hierarchies. Concepts in semantically inconsistent groups are more likely to contain semantic type assignment errors. Our methodology can make auditing more efficient by focusing auditing resources on concepts in semantically inconsistent groups.
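The four-stage process can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the concept identifiers, the type names `TypeA` to `TypeC`, the tag-to-type mapping, and the rule that a group is inconsistent when its types are not all allowed by the mapping are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: (concept id, SNOMED CT semantic tag, UMLS semantic types).
CONCEPTS = [
    ("C1", "physical force", {"TypeA"}),
    ("C2", "physical force", {"TypeA"}),
    ("C3", "physical force", {"TypeC"}),
    ("C4", "record artifact", {"TypeB"}),
]

# Stage 3 (assumed mapping): each semantic tag maps to its acceptable semantic types.
TAG_TO_TYPES = {
    "physical force": {"TypeA"},
    "record artifact": {"TypeB"},
}


def find_inconsistent_groups(concepts, tag_to_types):
    """Return groups whose semantic types disagree with the tag-to-type mapping."""
    # Stages 1 and 2: group by semantic tag, then by semantic type assignment.
    groups = defaultdict(list)
    for concept_id, tag, types in concepts:
        groups[(tag, frozenset(types))].append(concept_id)

    # Stage 4: flag groups whose types are not all allowed for their tag.
    flagged = {}
    for (tag, types), members in groups.items():
        if not types <= tag_to_types.get(tag, set()):
            flagged[(tag, types)] = sorted(members)
    return flagged


flagged = find_inconsistent_groups(CONCEPTS, TAG_TO_TYPES)
```

Only the flagged groups would be passed to domain experts, which is how the approach narrows the auditing effort.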
- Research Article
- 10.1186/s13326-025-00334-5
- Jul 28, 2025
- Journal of Biomedical Semantics
Purpose: Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient’s intent and route them towards the relevant information.
Methods: This paper proposes a novel two-step approach to address the challenge of semantic type classification in Indonesian consumer health questions. We acknowledge the scarcity of Indonesian health domain data, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model’s predictions, we employ an explainable model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model’s decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of “semantic bias”, where words with no inherent association with a specific semantic type disproportionately influence the model’s predictions.
Results: The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type. This suggests a potential bias in the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words “kanker” (cancer) and “depresi” (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.
Conclusion: We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance. These words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in model performance between the two scenarios, suggesting that the models might possess an ability to mitigate the influence of such bias during the learning process.
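For readers unfamiliar with the metric, the weighted average F1 score mentioned in the results can be sketched in plain Python. This is an illustration of the standard definition (per-class F1 weighted by class support), not the study's evaluation code; the `TREATMENT` label and the toy predictions are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Hypothetical toy labels: one TREATMENT question is misclassified as DIAGNOSIS.
y_true = ["DIAGNOSIS", "DIAGNOSIS", "TREATMENT", "TREATMENT"]
y_pred = ["DIAGNOSIS", "DIAGNOSIS", "DIAGNOSIS", "TREATMENT"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means frequent question types dominate the score, which is one reason class balancing (such as the Borderline SMOTE step above) can change model rankings.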
- Research Article
83
- 10.1136/jamia.2000.0070066
- Jan 1, 2000
- Journal of the American Medical Informatics Association
The Unified Medical Language System (UMLS) combines many well-established authoritative medical informatics terminologies in one knowledge representation system. Such a resource is very valuable to the health care community and industry. However, the UMLS is very large and complex and poses serious comprehension problems for users and maintenance personnel. The authors present a representation to support the user's comprehension and navigation of the UMLS. An object-oriented database (OODB) representation is used to represent the two major components of the UMLS (the Metathesaurus and the Semantic Network) as a unified system. The semantic types of the Semantic Network are modeled as semantic type classes. Intersection classes are defined to model concepts of multiple semantic types, which are removed from the semantic type classes. The authors provide examples of how the intersection classes help expose omissions of concepts, highlight errors of semantic type classification, and uncover ambiguities of concepts in the UMLS. The resulting UMLS OODB schema is deeper and more refined than the Semantic Network, since intersection classes are introduced. The Metathesaurus is classified into more mutually exclusive, uniform sets of concepts. The UMLS OODB schema supports the user's comprehension and navigation of the Metathesaurus. It also helps expose and resolve modeling problems in the UMLS.
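The idea of intersection classes can be sketched as a simple partition: concepts with exactly one semantic type stay in that type's class, and concepts with several types move to a class keyed by the whole combination. The snippet below is a rough illustration with hypothetical concept and type names, not the OODB schema itself.

```python
from collections import defaultdict

# Hypothetical concept -> semantic types assignments (illustration only).
CONCEPT_TYPES = {
    "C1": {"TypeA"},
    "C2": {"TypeA"},
    "C3": {"TypeA", "TypeB"},
    "C4": {"TypeB"},
}


def build_classes(concept_types):
    """Partition concepts by their exact set of semantic types.

    Keys of size 1 play the role of semantic type classes; keys of size > 1
    play the role of intersection classes.
    """
    classes = defaultdict(set)
    for concept, types in concept_types.items():
        classes[frozenset(types)].add(concept)
    return dict(classes)


classes = build_classes(CONCEPT_TYPES)
intersection_classes = {k: v for k, v in classes.items() if len(k) > 1}
```

Because every concept lands in exactly one class, the resulting sets are mutually exclusive, and small or unexpected intersection classes are natural candidates for review.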
- Research Article
13
- 10.1016/j.jbi.2011.08.021
- Sep 8, 2011
- Journal of Biomedical Informatics
Overcoming an obstacle in expanding a UMLS semantic type extent
- Conference Article
51
- 10.3115/976973.976990
- Jan 1, 1995
In this paper we present further work on learning selectional restrictions (SRs) from on-line corpora. The technique relies on a wide-coverage noun taxonomy and a statistical measure of co-occurrence to generalize from words to semantic classes. We analyze some experimental results, identify some unsolved problems and outline possible lines of research. We argue for the need for objective evaluation measures for the SR learning task, and present and discuss some of them. Some variations on the basic technique, affecting the statistical association measure and thresholding, are presented and discussed, and experimental results on these variations are reported. Some of these variations seem to improve performance. In conclusion, we summarize the future lines of research that we think can lead to further improvements.
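A rough sketch of generalizing from words to semantic classes with a co-occurrence statistic is given below. The abstract does not specify the association measure, so pointwise mutual information is used here purely as an assumed stand-in; the one-level taxonomy and the verb-object pairs are hypothetical toy data.

```python
import math
from collections import Counter

# Hypothetical one-level noun taxonomy and observed (verb, object noun) pairs.
TAXONOMY = {"wine": "beverage", "beer": "beverage", "water": "beverage",
            "bread": "food", "book": "artifact"}
PAIRS = [("drink", "wine"), ("drink", "beer"), ("drink", "water"),
         ("eat", "bread"), ("eat", "bread"), ("read", "book")]


def class_association(pairs, taxonomy):
    """Score each (verb, noun class) pair with pointwise mutual information."""
    total = len(pairs)
    joint = Counter((verb, taxonomy[noun]) for verb, noun in pairs)
    verbs = Counter(verb for verb, _ in pairs)
    classes = Counter(taxonomy[noun] for _, noun in pairs)
    return {
        (verb, cls): math.log2(
            (count / total) / ((verbs[verb] / total) * (classes[cls] / total))
        )
        for (verb, cls), count in joint.items()
    }


def select_restrictions(scores, threshold):
    """Keep only the (verb, class) pairs whose score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


scores = class_association(PAIRS, TAXONOMY)
restrictions = select_restrictions(scores, threshold=1.0)
```

Changing the scoring function or the threshold corresponds to the kinds of variation on the basic technique that the abstract mentions.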
- Research Article
16
- 10.3389/fpsyg.2020.574353
- Nov 20, 2020
- Frontiers in Psychology
We present a case study of grammatical constructions and how their function in a single language (Russian) can be captured through semantic and syntactic classification. Since 2016 an on-going joint project of UiT The Arctic University of Norway and the National Research University Higher School of Economics in Moscow has been collecting and analyzing multiword grammatical constructions of Russian. The main product is the Russian Constructicon (https://site.uit.no/russian-constructicon/), which, with over two thousand two hundred constructions (and more being continuously added), is arguably the largest openly available constructicon resource for any language. The combination of this large size with depth of analysis, containing both syntactic and semantic tags, makes it possible to view the interrelation of constructions as families and to discover trends in their behavior. Our annotation includes 53 semantic tags of varying frequency, with three tags that are by far more frequent than all the rest, accounting for 30% of the entire inventory of the Russian Constructicon. These three semantic types are Assessment, Attitude, and Intensity, all of which convey a speaker’s evaluation of a topic, in contrast to most of the other tags (such as Time, Manner, and Comparison). Assessment and Attitude constructions are investigated in greater detail in this article. Secondary semantic tags reveal that negative evaluation among these two semantic types is more than twice as frequent as positive evaluation. Examples of negative evaluations are: for Assessment VP tak sebe, as in Na pianino ja igraju tak sebe “I play the piano so-so [lit. thus self]”; for Attitude s PronPers-Gen xvatit/xvatilo (NP-Gen), as in S menja xvatit “I’m fed up [lit. from me enough].” In terms of syntax, the most frequent syntactic types of constructions in the Russian Constructicon are clausal constructions [constituting an independent clause like s PronPers-Gen xvatit/xvatilo (NP-Gen)] and constructions with the anchor in the role of adverbial modifier (like VP tak sebe). Our semantic and syntactic classification of this large body of Russian constructions makes it possible to postulate patterns of grammatical constructions constituting a radial category with central and peripheral types. Classification of large numbers of constructions reveals systematic relations that structure the grammar of a language.
- Research Article
- 10.14569/ijacsa.2025.0161056
- Jan 1, 2025
- International Journal of Advanced Computer Science and Applications
Sentence embedding is a very important technique in most natural language processing (NLP) tasks, such as answer generation, semantic similarity detection, text classification and information retrieval. This technique aims to transform the semantic meaning of a sentence into a fixed-dimensional vector, allowing machines to understand human language. Sentence embedding has moved in recent years from simple word vector averaging methods to more sophisticated models, particularly those based on transformer architectures such as the BERT model and its variants. However, systematic reviews that critically analyze and compare the performance of these models are still limited, particularly regarding the selection of the appropriate embedding model for a specific NLP task. This study aims to address this gap through a comprehensive review of sentence embedding models and a systematic evaluation of their performance on NLP tasks such as semantic similarity, clustering, and retrieval. The study enabled us to identify the appropriate embedding model for each task, identify the main challenges faced by embedding models, and propose effective solutions to improve the performance and efficiency of sentence embedding.
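The simple word-vector-averaging baseline mentioned above can be sketched as follows, together with cosine similarity for comparing the resulting fixed-dimensional vectors. The three-dimensional word vectors are hypothetical toy values, not taken from any trained model.

```python
import math

# Hypothetical toy word vectors (a real system would load pretrained embeddings).
WORD_VECTORS = {
    "cats": [1.0, 0.0, 0.0],
    "dogs": [0.9, 0.1, 0.0],
    "sleep": [0.0, 1.0, 0.0],
    "stocks": [0.0, 0.0, 1.0],
    "fall": [0.0, 0.2, 0.8],
}


def embed(sentence):
    """Average the vectors of known words into one fixed-dimensional vector."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similar = cosine(embed("cats sleep"), embed("dogs sleep"))
dissimilar = cosine(embed("cats sleep"), embed("stocks fall"))
```

Averaging ignores word order, which is one of the limitations that transformer-based sentence encoders are meant to address.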
- Conference Article
- 10.1109/ialp.2014.6973471
- Oct 1, 2014
Interest in extracting and analyzing evaluations and opinions of services or products from large bodies of text has been increasing in recent years. It is important to classify predicates according to sense because whether or not a statement includes the speaker's opinion depends strongly on its predicate. It is generally assumed that Japanese part-of-speech (POS) categories for predicates are classified according to sense; however, the POS classifications differ from their semantic classification. To address this, semantic types, which aim to classify predicates, have been proposed. In this paper, we describe semantic types and present our construction of a disambiguator for Japanese verbs. Specifically, we constructed this disambiguator using a support vector machine, with feature vectors built from semantic categories of nouns and the results of morphological analysis. We achieved a disambiguation accuracy of 69.9% on newspaper articles using 10-fold cross-validation.
- Book Chapter
6
- 10.4324/9781315782379-97
- Apr 24, 2019
Philosophers and linguists have claimed that verb meanings are divided into semantic types or superordinate categories that differ in internal conceptual structure. In particular, eventive verbs, which have internal causal structure, are distinguished from stative verbs, which have no internal causal structure. In this paper, we explore the processing consequences of assuming that the lexical representations of verb meanings differ in the complexity of their internal representations. We conducted two experiments, a lexical decision task and a self-paced reading study, that investigated how verb types of different complexity are processed. We predicted that the conceptually more complex eventive verbs would take longer to process than stative verbs. In both experiments, this prediction was confirmed. This lends support to theories of verb concepts that propose classifications based on internal representations and shows that there are discrete and abstract conceptual categories in the domain of events.
- Video Transcripts
- 10.48448/tnzr-at19
- Aug 30, 2020
- Underline Science Inc.
Recent advances in language model (LM) pre-training from large-scale corpora have been shown to improve various natural language processing tasks. These models achieve performance comparable to non-expert humans on the GLUE benchmark for natural language understanding (NLU). While the improvement of the different contextualized representations comes from (i) the usage of more and more data, (ii) changing the types of lexical pre-training tasks or (iii) increasing the model size, NLU is more than memorizing word co-occurrences. But how much world knowledge and common sense can those language models capture? How much can those models infer from just the lexical information? To overcome this problem, some approaches include semantic information in the training process. We highlight existing approaches to combine different types of semantics with language models during the pre-training or fine-tuning phase.
- Research Article
14
- 10.1038/s41598-024-57408-0
- Mar 27, 2024
- Scientific Reports
The imbalance of land cover categories is a common problem. Some categories appear rarely in an image, while others occupy the vast majority of it. This imbalance can bias the classifier toward predicting the more frequent categories, while recognition of minority categories is poor. In view of the difficulty of multi-target semantic classification of land cover remote sensing images, a semantic classification method for land cover remote sensing images based on a depth deconvolution neural network is proposed. In this method, a semantic segmentation algorithm based on the depth deconvolution neural network is used to perform multi-target semantic segmentation of the land cover remote sensing image. Four semantic features of the image (color, texture, shape and size) are then extracted using a semantic feature extraction method based on an improved sequential clustering algorithm. Finally, a classification and recognition method based on the random forest algorithm is used to classify and identify the four semantic feature types, realizing the semantic classification of land cover remote sensing images. The experimental results show that after this method classifies the multi-target semantic types of land cover remote sensing images, the average values of the Dice similarity coefficient and the Hausdorff distance are 0.9877 and 0.9911 respectively, indicating that the method can accurately classify the multi-target semantic types of land cover remote sensing images.
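As background for the reported Dice similarity coefficient, the per-class computation can be sketched in a few lines. The flattened label masks and the class names below are hypothetical toy data, and treating two empty masks as a perfect match is a convention assumed for this example.

```python
def dice(pred, truth, label):
    """Dice similarity coefficient for one class over flattened label masks."""
    pred_pixels = {i for i, value in enumerate(pred) if value == label}
    true_pixels = {i for i, value in enumerate(truth) if value == label}
    if not pred_pixels and not true_pixels:
        return 1.0  # assumed convention: class absent in both masks
    overlap = len(pred_pixels & true_pixels)
    return 2 * overlap / (len(pred_pixels) + len(true_pixels))


# Hypothetical flattened masks: one forest pixel is mislabeled as urban.
truth = ["water", "water", "forest", "forest", "urban", "urban"]
pred = ["water", "water", "forest", "urban", "urban", "urban"]

per_class = {label: dice(pred, truth, label) for label in ("water", "forest", "urban")}
```

Reporting the score per class makes it easier to see whether minority categories are being recognized, which is the imbalance concern raised at the start of the abstract.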
- Research Article
- 10.3233/ida-240083
- Mar 1, 2025
- Intelligent Data Analysis: An International Journal
Relation extraction is one of the core tasks of natural language processing; it aims to identify entities in unstructured text and determine the semantic relationships between them. Traditional methods are inadequate at extracting rich features and judging complex semantic relations. Therefore, in this paper, we propose a relation extraction model, HAGCN, based on a heterogeneous graph convolutional neural network and a graph attention mechanism. We construct two different types of nodes, words and relations, in the heterogeneous graph convolutional neural network, which are used to extract different semantic types and attributes and to further extract contextual semantic representations. By incorporating the graph attention mechanism to distinguish the importance of different information, the model gains stronger representation ability. In addition, an information update mechanism is designed in the model. Relation extraction is performed after iteratively fusing the node semantic information to obtain a more comprehensive node representation. The experimental results show that the HAGCN model achieves good relation extraction performance, with an F1 value of 91.51% on the SemEval-2010 Task 8 dataset. The HAGCN model also achieves good results on the WebNLG dataset, verifying the generalization ability of the model.
- Book Chapter
- 10.1007/978-3-031-63536-6_9
- Jan 1, 2024
We study to what extent the task of predicting the quality of student essays can be supported by computing “flows” of semantic types of argumentative units. Specifically, we use tagsets for claim and premise types that were recently applied to the Argument Annotated Essays corpus (AAE; Stab/Gurevych 2017) by Schaefer et al. (2023). We train argument component and semantic type classification models on AAE and then use them to label the essays in two corpora that have numeric essay ratings, viz. FEEDBACK/PERSUADE and ICLE. We train linear classification models on flow features and find that flows of our semantic types are a better predictor of essay quality (in a simplified, good/bad dichotomy) than flows of coarse argument components (major claim, claim, premise). Finally, we calculate feature impact and perform a qualitative inspection, which shows some tendencies for pattern occurrence in the two essay classes.
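One plausible way to turn a sequence of argumentative unit labels into flow features is to count label n-grams. The abstract does not define the exact feature construction, so the n-gram representation below is an assumption, and the example essay is hypothetical; it uses the coarse component labels named above for simplicity.

```python
from collections import Counter


def flow_features(labels, n=2):
    """Count the label n-grams ('flows') in a sequence of unit labels."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


# Hypothetical essay, represented as the sequence of its argument component labels.
essay = ["major claim", "claim", "premise", "premise", "claim", "premise"]
features = flow_features(essay, n=2)
```

The resulting counts could then be fed to a linear classifier as a sparse feature vector, with finer-grained semantic type labels substituted for the coarse ones.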
- Conference Article
2
- 10.18653/v1/w15-3811
- Jan 1, 2015
Complex noun phrases are pervasive in biomedical texts, but are largely underexplored in entity discovery and information extraction. Such expressions often contain a mix of highly specific names (diseases, drugs, etc.) and common words such as “condition”, “degree”, “process”, etc. These words can have different semantic types depending on their context in noun phrases. In this paper, we address the task of classifying these common words into fine-grained semantic types: for instance, “condition” can be typed as “symptom and finding” or “configuration and setting”. For information extraction tasks, it is crucial to consider common nouns only when they really carry biomedical meaning; hence the classifier must also detect the negative case when nouns are merely used in a generic, uninformative sense. Our solution harnesses a small number of labeled seeds and employs label propagation, a semi-supervised learning method on graphs. Experiments on 50 frequent nouns show that our method computes semantic labels with a micro-averaged accuracy of 91.34%.
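Label propagation from a few seeds can be sketched as repeated neighbor averaging with the seed labels held fixed. This is a generic, simplified variant on an unweighted graph, not the paper's exact formulation; the node names (standing for occurrences of a noun in different contexts), the edges, and the `generic` label for the negative case are assumptions for illustration.

```python
def propagate_labels(edges, seeds, labels, iterations=20):
    """Spread seed labels over an undirected graph by neighbor averaging."""
    nodes = sorted({node for edge in edges for node in edge})
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each node holds one score per label; seeds start with full confidence.
    scores = {node: {label: 0.0 for label in labels} for node in nodes}
    for node, label in seeds.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in seeds:
                updated[node] = scores[node]  # seed labels stay clamped
            else:
                updated[node] = {
                    label: sum(scores[m][label] for m in neighbors[node])
                    / len(neighbors[node])
                    for label in labels
                }
        scores = updated

    return {node: max(labels, key=lambda label: scores[node][label]) for node in nodes}


# Hypothetical chain of similar contexts with one labeled seed at each end.
LABELS = ["symptom and finding", "generic"]
EDGES = [("ctx1", "ctx2"), ("ctx2", "ctx3"), ("ctx3", "ctx4")]
SEEDS = {"ctx1": "symptom and finding", "ctx4": "generic"}

assigned = propagate_labels(EDGES, SEEDS, LABELS)
```

Unlabeled nodes end up with the label of the seed they are most strongly connected to, so a handful of seeds can label many occurrences.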
- Research Article
19
- 10.4204/eptcs.119.17
- Jul 16, 2013
- Electronic Proceedings in Theoretical Computer Science
Model-checking the alternating-time temporal logics ATL and ATL* with incomplete information is undecidable for perfect recall semantics. However, when restricting to memoryless strategies, the model-checking problem becomes decidable. In this paper we consider two other types of semantics based on finite-memory strategies: one where the memory size allowed is bounded, and one where the memory size is unbounded (but must be finite). This is motivated by the high complexity of model-checking with perfect recall semantics and the severe limitations of memoryless strategies. We show that both types of semantics introduced are different from perfect recall and memoryless semantics, and then focus on the decidability and complexity of model-checking in both complete and incomplete information games for ATL/ATL*. In particular, we show that the complexity of model-checking with bounded-memory semantics is Δ_2^p-complete for ATL and PSPACE-complete for ATL* in incomplete information games, just as in the memoryless case. We also present a proof that ATL and ATL* model-checking is undecidable for n >= 3 players with finite-memory semantics in incomplete information games.