Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases

Abstract

Equipping machines with comprehensive knowledge of the world’s entities and their relationships has been a longstanding goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics. This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.

Similar Papers
  • Research Article
  • Citations: 13
  • 10.14778/3476311.3476393
Knowledge graphs 2021
  • Jul 1, 2021
  • Proceedings of the VLDB Endowment
  • Gerhard Weikum

Providing machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing vision and challenge for AI. Over the last 15 years, huge knowledge bases, also known as knowledge graphs, have been automatically constructed from web data, and have become a key asset for search engines and other use cases. Machine knowledge can be harnessed to semantically interpret texts in news, social media and web tables, contributing to question answering, natural language processing and data analytics. This position paper reviews these advances and discusses lessons learned. It highlights the role of "DB thinking" in building and maintaining high-quality knowledge bases from web contents. Moreover, the paper identifies open challenges and new research opportunities. In particular, extracting quantitative measures of entities (e.g., height of buildings or energy efficiency of cars), from text and web tables, presents an opportunity to further enhance the scope and value of knowledge bases.

  • Book Chapter
  • Citations: 17
  • 10.1007/978-3-642-37450-0_8
Mapping Entity-Attribute Web Tables to Web-Scale Knowledge Bases
  • Jan 1, 2013
  • Xiaolu Zhang + 4 more

There are many entity-attribute tables on the Web that can be utilized for enriching the entities of knowledge bases (KBs). This requires the schema mapping (matching) between the Web tables and the huge KBs. Existing solutions on schema mapping are inadequate for mapping a Web table and a KB, because of many reasons such as (1) there are many duplicates of entities and their types in a KB; (2) the schema of KB is often implicit, informal, and evolving over time; (3) the KB is typically very large in volume. In this paper, we propose a pure instance-based schema mapping solution to statistically find the effective mapping between a Web table and a KB via the matched data examples. Besides, we propose efficient solutions on finding the matched data examples as well as the overall mapping of a table and a KB. Experiments over real data sets show that our solution is much more accurate than the two baselines of existing solutions. Results also show that our solution is feasible for the mapping of Web tables to large scale KBs.
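The instance-based matching idea described above can be sketched as follows: score each candidate KB property by how many of a web-table column's values overlap with that property's known values (the "matched data examples"), and pick the best-scoring property. This is a minimal toy illustration, not the paper's actual statistical method; all names and data are hypothetical.

```python
def match_column_to_property(column_values, kb_property_values):
    """Score how well a web-table column matches one KB property
    by counting exact value overlaps (the 'matched data examples')."""
    kb_set = set(kb_property_values)
    return sum(1 for v in column_values if v in kb_set) / max(len(column_values), 1)

def map_table_column(column_values, kb_properties):
    """Pick the KB property whose instances overlap the column most."""
    scores = {prop: match_column_to_property(column_values, vals)
              for prop, vals in kb_properties.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy data: a column of capital cities vs. two KB properties
kb = {
    "capital": ["Paris", "Berlin", "Tokyo", "Ottawa"],
    "largestCity": ["Paris", "Berlin", "Tokyo", "Toronto"],
}
col = ["Paris", "Ottawa", "Tokyo"]
print(map_table_column(col, kb))  # → ('capital', 1.0)
```

Real systems must additionally handle the issues the abstract lists (duplicate entities, implicit and evolving schemas, KB scale), which a plain overlap count ignores.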

  • Conference Article
  • 10.1145/2740908.2741984
Knowledge Bases for Web Content Analytics
  • May 18, 2015
  • Johannes Hoffart + 3 more

No abstract available.

  • Conference Article
  • 10.1109/ictai.2015.54
An Approach to Construct Semantic Networks with Confidence Scores Based on Data Analysis - Case Study in Osaka Wholesale Market
  • Nov 1, 2015
  • Takahiro Kawamura + 1 more

In recent years, several large-scale knowledge bases (KBs) have been constructed, such as YAGO, DBpedia, and Google Knowledge Graph. Although automatic extraction techniques that extract facts and rules from the Web are necessary for constructing such large-scale KBs, the incorporation of noisy, unreliable knowledge is unavoidable. Thus, Google Knowledge Vault assigns extracted knowledge confidence scores based on consistency with existing KBs. In this paper, we propose a new approach for associating confidence scores with knowledge based on a large amount of raw data for domains where there is no existing KB. We first construct knowledge in a specific domain as a semantic network, and then design a probabilistic network that corresponds to the semantic network. To associate the confidence scores with the semantic network, we train the probabilistic network with a large amount of open data provided by the Osaka central wholesale market in Japan. We also confirm the validity of the confidence scores via the accuracy of reasoning on the probabilistic network. A semantic network associated with confidence scores, that is, a weighted labeled graph, is advantageous not only for filtering out noisy, unreliable knowledge with low confidence, but also for ranking retrieval results on the KB. In the future, probabilistic reasoning on semantic networks may also be possible.

  • Research Article
  • 10.2196/53424
Application of a Language Model Tool for COVID-19 Vaccine Adverse Event Monitoring Using Web and Social Media Content: Algorithm Development and Validation Study
  • Dec 20, 2024
  • JMIR Infodemiology
  • Chathuri Daluwatte + 7 more

Background: Spontaneous pharmacovigilance reporting systems are the main data source for signal detection for vaccines. However, there is a large time lag between the occurrence of an adverse event (AE) and the availability for analysis. With global mass COVID-19 vaccination campaigns, social media, and web content, there is an opportunity for real-time, faster monitoring of AEs potentially related to COVID-19 vaccine use. Our work aims to detect AEs from social media to augment those from spontaneous reporting systems. Objective: This study aims to monitor AEs shared in social media and online support groups using medical context-aware natural language processing language models. Methods: We developed a language model–based web app to analyze social media, patient blogs, and forums (from 190 countries in 61 languages) around COVID-19 vaccine–related keywords. Following machine translation to English, lay language safety terms (ie, AEs) were observed using the PubmedBERT-based named-entity recognition model (precision=0.76 and recall=0.82) and mapped to Medical Dictionary for Regulatory Activities (MedDRA) terms using knowledge graphs (MedDRA terminology is an internationally used set of terms relating to medical conditions, medicines, and medical devices that are developed and registered under the auspices of the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use). Weekly and cumulative aggregated AE counts, proportions, and ratios were displayed via visual analytics, such as word clouds. Results: Most AEs were identified in 2021, with fewer in 2022. AEs observed using the web app were consistent with AEs communicated by health authorities shortly before or within the same period. Conclusions: Monitoring the web and social media provides opportunities to observe AEs that may be related to the use of COVID-19 vaccines.
The presented analysis demonstrates the ability to use web content and social media as a data source that could contribute to the early observation of AEs and enhance postmarketing surveillance. It could help to adjust signal detection strategies and communication with external stakeholders, contributing to increased confidence in vaccine safety monitoring.

  • Preprint Article
  • 10.2196/preprints.53424
Application of a Language Model Tool for COVID-19 Vaccine Adverse Event Monitoring Using Web and Social Media Content: Algorithm Development and Validation Study (Preprint)
  • Oct 6, 2023
  • Chathuri Daluwatte + 7 more

BACKGROUND Spontaneous pharmacovigilance reporting systems are the main data source for signal detection for vaccines. However, there is a large time lag between the occurrence of an adverse event (AE) and the availability for analysis. With global mass COVID-19 vaccination campaigns, social media, and web content, there is an opportunity for real-time, faster monitoring of AEs potentially related to COVID-19 vaccine use. Our work aims to detect AEs from social media to augment those from spontaneous reporting systems. OBJECTIVE This study aims to monitor AEs shared in social media and online support groups using medical context-aware natural language processing language models. METHODS We developed a language model–based web app to analyze social media, patient blogs, and forums (from 190 countries in 61 languages) around COVID-19 vaccine–related keywords. Following machine translation to English, lay language safety terms (ie, AEs) were observed using the PubmedBERT-based named-entity recognition model (precision=0.76 and recall=0.82) and mapped to Medical Dictionary for Regulatory Activities (MedDRA) terms using knowledge graphs (MedDRA terminology is an internationally used set of terms relating to medical conditions, medicines, and medical devices that are developed and registered under the auspices of the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use). Weekly and cumulative aggregated AE counts, proportions, and ratios were displayed via visual analytics, such as word clouds. RESULTS Most AEs were identified in 2021, with fewer in 2022. AEs observed using the web app were consistent with AEs communicated by health authorities shortly before or within the same period. CONCLUSIONS Monitoring the web and social media provides opportunities to observe AEs that may be related to the use of COVID-19 vaccines. 
The presented analysis demonstrates the ability to use web content and social media as a data source that could contribute to the early observation of AEs and enhance postmarketing surveillance. It could help to adjust signal detection strategies and communication with external stakeholders, contributing to increased confidence in vaccine safety monitoring.

  • Conference Article
  • Citations: 68
  • 10.1145/2872427.2883017
Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases
  • Apr 11, 2016
  • Dominique Ritze + 3 more

Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update and extend such knowledge bases. A source of Web data that potentially provides a very wide coverage are millions of relational HTML tables that are found on the Web. The existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers. The actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about classes, properties, and instances for which missing values can be filled using Web table data as evidence. In order to estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.
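The data fusion strategies the abstract compares can be sketched in miniature: plain voting picks the value asserted by the most tables, while knowledge-based-trust-style fusion weights each source's vote by an estimated trust score. This is a simplified sketch under hypothetical data; the trust scores and values below are invented for illustration, and the paper's actual strategies are more elaborate.

```python
from collections import Counter

def vote(values):
    """Plain voting: the value asserted by the most web tables wins."""
    return Counter(values).most_common(1)[0][0]

def weighted_vote(values_with_trust):
    """Trust-weighted voting: each source's vote counts in proportion
    to its (hypothetical) trustworthiness score."""
    scores = {}
    for value, trust in values_with_trust:
        scores[value] = scores.get(value, 0.0) + trust
    return max(scores, key=scores.get)

# Three tables assert a population value; two agree, one disagrees
print(vote(["83M", "83M", "80M"]))  # → 83M
# If the dissenting source is far more trustworthy, its value can win
print(weighted_vote([("83M", 0.2), ("83M", 0.2), ("80M", 0.9)]))  # → 80M
```

Under the Local Closed World Assumption mentioned above, a fused value would then be checked only against facts the KB actually asserts, treating absent facts as unknown rather than false.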

  • Conference Article
  • 10.1109/skg.2011.34
A Matrix Approach to Implicit Relationship Finding in Large-Scale Knowledge Bases
  • Oct 1, 2011
  • Yan Wang + 3 more

Relationships between entities in a Knowledge Base (KB) are not always explicitly expressed. In addition, entities may implicitly exist within explicit ones. These phenomena are very common when it comes to large-scale KBs. Finding implicit relationships in a KB can make the original KB more meaningful and enhance its potential in real world applications. In this paper, we focus on the problem of finding implicit-relationship networks in large-scale KBs. Since a network can be mathematically expressed as a matrix, the process of reasoning for implicit relationship finding can be transformed to matrix computation. Considering that there are many advantages for matrix computation instead of logic based and graph based reasoning (such as scalability for storing and processing relationships), by realizing the mathematical nature of KBs, we use matrix transformation and computation to investigate the problem of implicit relationship finding. We give several illustrative real world examples using large-scale KBs to validate this framework. In addition, we also investigate the potential problems of scalability on matrix storage, as well as the cost for computation and time. Based on the proposed approach and the consideration on the scalability issue, we develop the MIRF and MIRF-L algorithms which can efficiently process this kind of problem if the rules in concrete cases can be clearly expressed.
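The core idea of expressing a relation network as a matrix can be illustrated very simply: with an adjacency matrix A for an explicit relation, the product A·A has a nonzero entry (i, j) exactly when a 2-hop path exists, i.e. when an implicit relationship can be inferred by composing the relation with itself. This is a toy sketch of the matrix-computation idea, not the paper's MIRF or MIRF-L algorithms; the entities are hypothetical.

```python
import numpy as np

# Hypothetical 4-entity KB with one explicit relation as an adjacency matrix:
# A[i][j] = 1 means entity i relates to entity j (e.g., "locatedIn")
A = np.array([
    [0, 1, 0, 0],   # 0 -> 1
    [0, 0, 1, 0],   # 1 -> 2
    [0, 0, 0, 1],   # 2 -> 3
    [0, 0, 0, 0],
])

# A @ A counts 2-hop paths: a nonzero entry (i, j) marks an implicit
# relationship obtained by composing the explicit relation twice.
two_hop = A @ A
implicit = np.argwhere(two_hop > 0)
print(implicit.tolist())  # → [[0, 2], [1, 3]]
```

Higher powers of A (or products of different relation matrices) compose longer or mixed relation chains, which is why matrix computation scales better than per-path graph traversal for this task.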

  • Research Article
  • Citations: 5
  • 10.1016/0306-4379(90)90012-e
KBMS: Aspects, theory and implementation
  • Jan 1, 1990
  • Information Systems
  • D.E Altenkrueger

No abstract available.

  • Conference Instance
  • 10.1145/2509558
Proceedings of the 2013 workshop on Automated knowledge base construction
  • Oct 27, 2013

Extracting knowledge from Web pages, and integrating it into a coherent knowledge base (KB), is a task that spans the areas of natural language processing, information extraction, information integration, databases, search and machine learning. Recent years have seen significant advances on the creation of large-scale KBs. Examples include Wikipedia-based KBs (e.g., YAGO, DBpedia, and Freebase), KBs generated from Web documents (e.g., NELL, PROSPERA), and open information extraction approaches (e.g., TextRunner, PRISMATIC, Rexa). Most prominently, all major search engine providers (Yahoo!, Microsoft Bing, and Google) nowadays experiment with semantic KBs (e.g., the Google Knowledge Graph). The workshop on automated knowledge base construction (AKBC) is the venue for sharing the latest research in the area of knowledge extraction. Unlike many other workshops, its focus is on keynotes by high profile researchers in the field. This year, we are proud to welcome talks by Bonnie Dorr (DARPA, USA); Evgeniy Gabrilovich (Google Research, USA); Alon Halevy (Google Research, USA); Chris Manning (Stanford University, USA); James Mayfield (Johns Hopkins University, USA); Andrew McCallum (University of Massachusetts Amherst, USA); Tom Mitchell (Carnegie Mellon University, USA); Dan Weld (University of Washington, USA); and Haixun Wang (Microsoft Research Asia). Our invited speakers will share their visions on knowledge extraction with the audience. In addition, the workshop invites regular paper submissions. We focus exclusively on short, visionary papers, even if the experimentation is still rudimentary. This way, we aim to attract the latest cutting-edge research that has not yet been presented at conferences. The main means of presentation will be through posters. By focusing on interactive presentations rather than talks, we hope to stimulate discussion, improve understanding of the work, and sow the ideas for future research. This year, the AKBC workshop accepted 19 papers.
All of them will be presented as posters. In addition, 9 of these papers were selected for short oral presentations. We will have presentations that encompass many interesting topics, including confidence estimation in KBs, using computers to pass an elementary science test, and mining history from newspaper archives. A best paper award will be also given to the submission with the best reviews. We hope that, like in 2010 and 2012, our workshop will again prove to be an inspiring venue for researchers in the area of knowledge management. We are looking forward to welcoming you at the AKBC 2013!

  • Research Article
  • Citations: 11
  • 10.2196/39888
Deciphering the Diversity of Mental Models in Neurodevelopmental Disorders: Knowledge Graph Representation of Public Data Using Natural Language Processing
  • Aug 5, 2022
  • Journal of Medical Internet Research
  • Manpreet Kaur + 5 more

Background: Understanding how individuals think about a topic, known as the mental model, can significantly improve communication, especially in the medical domain where emotions and implications are high. Neurodevelopmental disorders (NDDs) represent a group of diagnoses, affecting up to 18% of the global population, involving differences in the development of cognitive or social functions. In this study, we focus on 2 NDDs, attention deficit hyperactivity disorder (ADHD) and autism spectrum disorder (ASD), which involve multiple symptoms and interventions requiring interactions between 2 important stakeholders: parents and health professionals. There is a gap in our understanding of differences between mental models for each stakeholder, making communication between stakeholders more difficult than it could be. Objective: We aim to build knowledge graphs (KGs) from web-based information relevant to each stakeholder as proxies of mental models. These KGs will accelerate the identification of shared and divergent concerns between stakeholders. The developed KGs can help improve knowledge mobilization, communication, and care for individuals with ADHD and ASD. Methods: We created 2 data sets by collecting the posts from web-based forums and PubMed abstracts related to ADHD and ASD. We utilized the Unified Medical Language System (UMLS) to detect biomedical concepts and applied Positive Pointwise Mutual Information followed by truncated Singular Value Decomposition to obtain corpus-based concept embeddings for each data set. Each data set is represented as a KG using a property graph model. Semantic relatedness between concepts is calculated to rank the relation strength of concepts and stored in the KG as relation weights. UMLS disorder-relevant semantic types are used to provide additional categorical information about each concept’s domain. Results: The developed KGs contain concepts from both data sets, with node sizes representing the co-occurrence frequency of concepts and edge sizes representing relevance between concepts. ADHD- and ASD-related concepts from different semantic types show diverse areas of concern and the complex needs of the conditions. The KG identifies converging and diverging concepts between health professionals’ literature (PubMed) and parental concerns (web-based forums), which may correspond to the differences between mental models for each stakeholder. Conclusions: We show for the first time that generating KGs from web-based data can capture the complex needs of families dealing with ADHD or ASD. Moreover, we showed points of convergence between families’ and health professionals’ KGs. Natural language processing–based KGs provide access to a large sample size, which is often a limiting factor for traditional in-person mental model mapping. Our work offers high-throughput access to mental model maps, which could be used for further in-person validation, knowledge mobilization projects, and as a basis for communication about potential blind spots from stakeholders in interactions about NDDs. Future research will be needed to identify how concepts could interact together differently for each stakeholder.
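The embedding pipeline named in the methods (Positive Pointwise Mutual Information followed by truncated SVD, with semantic relatedness as edge weights) can be sketched in a few lines. This is a minimal toy version on an invented 3-concept co-occurrence matrix, not the study's actual pipeline or data.

```python
import numpy as np

def ppmi(cooc):
    """Positive Pointwise Mutual Information over a co-occurrence matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0      # zero counts -> no association
    return np.maximum(pmi, 0.0)       # keep only positive PMI

def embed(cooc, dim=2):
    """Truncated SVD of the PPMI matrix -> dense concept embeddings."""
    u, s, _ = np.linalg.svd(ppmi(cooc))
    return u[:, :dim] * s[:dim]

def relatedness(e, i, j):
    """Cosine similarity of two concept embeddings (a KG edge weight)."""
    a, b = e[i], e[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical co-occurrence counts for 3 concepts (e.g., from forum posts)
cooc = np.array([[0., 8., 1.],
                 [8., 0., 1.],
                 [1., 1., 0.]])
emb = embed(cooc)
print(relatedness(emb, 0, 1), relatedness(emb, 0, 2))
```

On real corpora the co-occurrence matrix is large and sparse, and the relatedness scores become the relation weights stored on KG edges.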

  • Research Article
  • Citations: 4
  • 10.1145/3582496
Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods
  • Jun 17, 2023
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Jawad Shafi + 2 more

Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.
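Treating semantic tagging as multi-target classification, as the abstract describes, means each token can receive several semantic field tags at once. The sketch below is only a frequency baseline on invented data (the tokens and USAS-style tag codes are hypothetical), far simpler than the seven supervised classifiers the paper evaluates.

```python
from collections import Counter, defaultdict

def train_tagger(annotated):
    """Baseline multi-target tagger: for each token, count every
    semantic field tag it received in the annotated corpus."""
    tags = defaultdict(Counter)
    for token, token_tags in annotated:
        for t in token_tags:
            tags[token][t] += 1
    return tags

def tag(model, token, k=2):
    """Predict up to k semantic field tags for a token (multi-target output).
    "Z99" stands in for an unmatched/unknown token."""
    if token not in model:
        return ["Z99"]
    return [t for t, _ in model[token].most_common(k)]

# Hypothetical toy corpus: (token, list of semantic field tags)
corpus = [("khushi", ["E4.1"]), ("khabar", ["Q1.2", "A"]), ("khabar", ["Q1.2"])]
model = train_tagger(corpus)
print(tag(model, "khabar"))  # → ['Q1.2', 'A']
```

A real system replaces the frequency lookup with classifiers over local, topical, and semantic features, as in the paper's experiments.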

  • Book Chapter
  • Citations: 11
  • 10.1007/978-3-319-50112-3_18
Entity Linking in Web Tables with Multiple Linked Knowledge Bases
  • Jan 1, 2016
  • Tianxing Wu + 5 more

The World-Wide Web contains a large scale of valuable relational data, which are embedded in HTML tables (i.e. Web tables). To extract machine-readable knowledge from Web tables, some work tries to annotate the contents of Web tables as RDF triples. One critical step of the annotation is entity linking (EL), which aims to map the string mentions in table cells to their referent entities in a knowledge base (KB). In this paper, we present a new approach for EL in Web tables. Different from previous work, the proposed approach replaces a single KB with multiple linked KBs as the sources of entities to improve the quality of EL. In our approach, we first apply a general graph-based algorithm to EL in Web tables with each single KB. Then, we leverage the existing and newly learned “sameAs” relations between the entities from different KBs to help improve the results of EL in the first step. We conduct experiments on the sampled Web tables with Zhishi.me, which consists of three linked encyclopedic KBs. The experimental results show that our approach outperforms the state-of-the-art table’s EL methods in different evaluation metrics.
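The step of leveraging "sameAs" relations across multiple KBs can be sketched as clustering candidate entities with a union-find structure and aggregating their per-KB linking scores. This is a simplified illustration with invented entities and scores, not the paper's graph-based EL algorithm.

```python
class UnionFind:
    """Merge entities connected by 'sameAs' links across KBs."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical per-KB entity-linking scores for one table cell
candidates = {"kb1:Beijing": 0.9, "kb2:Peking": 0.8, "kb1:BeijingOpera": 0.4}
same_as = [("kb1:Beijing", "kb2:Peking")]

uf = UnionFind()
for a, b in same_as:
    uf.union(a, b)

# Aggregate scores within each sameAs cluster, then pick the best cluster
cluster_scores = {}
for ent, score in candidates.items():
    root = uf.find(ent)
    cluster_scores[root] = cluster_scores.get(root, 0.0) + score
best = max(cluster_scores, key=cluster_scores.get)
members = sorted(e for e in candidates if uf.find(e) == best)
print(members)  # → ['kb1:Beijing', 'kb2:Peking']
```

Pooling evidence across linked KBs this way is what lets agreement between KBs outweigh a single KB's ambiguous candidates.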

  • Conference Article
  • Citations: 4
  • 10.1145/2932194.2932197
Fusing time-dependent web table data
  • Jun 26, 2016
  • Yaser Oulabi + 2 more

A subset of the HTML tables on the Web contains relational data. The data in these tables covers a multitude of topics and is thus very useful for complementing or validating cross-domain knowledge bases, such as DBpedia, YAGO, or the Google Knowledge Graph. A large fraction of the data in these knowledge bases is time-dependent, meaning that the correctness of an attribute value depends on a point in time. Fusing data from web tables in order to determine correct values for time-dependent attributes is challenging as most web tables do not contain timestamp information. A possibility to deal with this sparsity is to exploit timestamps which appear in different locations on the web page around the table. But as these timestamps might not apply to the web table value in question, this approach introduces noise. This paper investigates the extent to which the performance of data fusion strategies that rely on voting, PageRank, and Knowledge-Based-Trust can be improved by incorporating noisy and sparse timestamp information. For this, we present a machine-learning-based approach which considers different types of noisy timestamps in the data fusion process, and experiment with propagating timestamp information between web tables in order to overcome sparsity. We evaluate the data fusion strategies using a large public corpus of web tables and a public gold standard of time-dependent attribute values. We find that our methods effectively choose and weigh timestamp information per attribute and reduce sparsity using propagation. By incorporating timestamp information into data fusion strategies that previously did not exploit temporal meta information, we are able to increase F1-measure on average by 5%.

  • Conference Article
  • Citations: 17
  • 10.1109/icde.2013.6544916
Knowledge harvesting from text and Web sources
  • Apr 1, 2013
  • F Suchanek + 1 more

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources has enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.
