Lucene Research Articles

Cordra is a digital object server that can function as a key infrastructural piece in FAIR DO (findable, accessible, interoperable and reusable digital object) implementations. Cordra manages JSON records and payloads as typed digital objects identified by handles. Cordra is neither a database nor an indexer, but it integrates the two and provides a unified interface. Cordra is intended to support both quick prototyping and production systems.

For prototyping, Cordra makes it easy to get a digital object server up and running rapidly. A potential Cordra administrator can download Cordra and very quickly have a server which supports creation, search, and retrieval of digital objects with resolvable identifiers. The server supports Digital Object Interface Protocol (DOIP) and HTTP APIs out of the box, as well as an immediately usable prototype user interface. Cordra saves substantial development time as it comes with ready-made functionality ranging from user authentication and access control to information validation, enrichment, storing, and indexing. By default, Cordra is configured to store objects on the local file system of the machine and use embedded Apache Lucene for indexing. Simply by editing type definitions in Cordra's user interface, the administrator can start changing the behavior of the APIs and user interface in real time for experimentation, including adding custom operations.

For production use, Cordra allows intensive extension and customization of the processes underlying the digital object server: how digital objects are stored and indexed, how they are validated and enriched, how users authenticate, when and to whom to give access to objects, and what custom operations can be performed. In production Cordra is run at scale, supporting high reliability and performance; among other options Cordra supports MongoDB and Amazon S3 for storage, and Elasticsearch and Apache Solr for indexing. Through the definition of the underlying types and operations, Cordra is intended to serve directly as the API backend for a production application.

This talk will cover basic Cordra features as well as customization and configuration basics. Examples of current use will be shown, including the use of the Digital Object Interface Protocol (DOIP), for which Cordra serves as a reference implementation. Current users of Cordra include the Derivatives Service Bureau (DSB), which uses Cordra as part of its backend to manage the automated generation of International Securities Identification Numbers (ISINs) for OTC derivatives in the financial services sector; and the British Standards Institution (BSI), whose Identify service for construction product manufacturers aims to assign a Universal Persistent Identification Number (UPIN) "for every product that is specified and incorporated in a building structure". The DSB, a Cordra user since 2017, has a production system with over 80 million identified digital objects which receives millions of searches each month. BSI.Identify has a system where Cordra's DOIP interface is directly accessible as the service's public API.

Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required.

Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight core) computation settings.

Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records.

Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures.
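
The abstract describes linkage as an index search followed by scoring, evaluated by sensitivity and positive predictive value. The Python sketch below illustrates that general pattern only. It is not the CIDACS-RL implementation and does not use Apache Lucene: the token-based inverted index, the `difflib` similarity, the field names (`name`, `dob`), the sample records, and the 0.9 cut-off are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def tokens(record):
    """Lower-cased name tokens used as index keys (a stand-in for index terms)."""
    return set(record["name"].lower().split())


def build_index(records):
    """Build an inverted index: token -> ids of records containing that token."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for tok in tokens(rec):
            index[tok].add(rec_id)
    return index


def similarity(a, b):
    """Mean string similarity over the assumed fields, in the range [0, 1]."""
    fields = ("name", "dob")
    ratios = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(ratios) / len(fields)


def link(query, records, index, cutoff=0.9):
    """Search the index for candidates, score them, and keep the best one.

    Returns (record_id, score) if the best score reaches the cutoff,
    otherwise (None, best_score).
    """
    candidates = set()
    for tok in tokens(query):
        candidates |= index.get(tok, set())
    best_id, best_score = None, 0.0
    for rec_id in sorted(candidates):  # sorted for deterministic tie-breaking
        score = similarity(query, records[rec_id])
        if score > best_score:
            best_id, best_score = rec_id, score
    if best_score >= cutoff:
        return best_id, best_score
    return None, best_score


def sensitivity_and_ppv(predicted, gold):
    """Compare predicted links with a gold standard (query id -> record id or None)."""
    true_pos = sum(1 for q, r in predicted.items() if r is not None and gold.get(q) == r)
    gold_links = sum(1 for r in gold.values() if r is not None)
    predicted_links = sum(1 for r in predicted.values() if r is not None)
    sensitivity = true_pos / gold_links if gold_links else 0.0
    ppv = true_pos / predicted_links if predicted_links else 0.0
    return sensitivity, ppv


# Hypothetical sample data, invented for this sketch.
records = {
    "r1": {"name": "Maria Silva Santos", "dob": "1990-05-17"},
    "r2": {"name": "Joao Pereira Lima", "dob": "1985-11-02"},
}
queries = {
    "q1": {"name": "Maria Silva Santo", "dob": "1990-05-17"},  # typo in surname
    "q2": {"name": "Carlos Oliveira", "dob": "1970-01-01"},    # no true match
}
gold = {"q1": "r1", "q2": None}

index = build_index(records)
predicted = {q: link(rec, records, index)[0] for q, rec in queries.items()}
print(predicted)
print(sensitivity_and_ppv(predicted, gold))
```

The index lookup restricts scoring to records that share at least one name token with the query, which is what keeps this style of linkage tractable on large datasets. A real system would replace the toy index and similarity with a search engine's own retrieval and scoring.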

Articles published on Lucene

Building a searchable online corpus of Australian and New Zealand aligned speech

Innovative Approaches to Full-Text Search with Solr and Lucene

SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems

Document Retrieval System for Biomedical Question Answering

SSD In-Storage Computing for Search Engines

IMPLEMENTATION OF TEXT INDEXING SYSTEM IN WEB-BASED DOCUMENT SEARCH APPLICATION USING MONGODB

Towards "Biodiversity PMC"

RECOMMENDING JAVA API METHODS BASED ON PROGRAMMING TASK DESCRIPTIONS BY NOVICE PROGRAMMERS

The Analysis of Open Source Search Engines

An Introduction to Cordra

Vovel metrics-novel coupling metrics for improved software fault prediction.

Two-Way Refinement Approach For Extra Corrupted Shard Removal In Elastic Search With Lucene And Translog

CLUSTERING AND INDEXING OF MULTIPLE DOCUMENTS USING FEATURE EXTRACTION THROUGH APACHE HADOOP ON BIG DATA

CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

Search Engine for Halal Linked Open Data Using Entity Ranking Approach

PiNET: a versatile web platform for downstream analysis and visualization of proteomics data.

Indexing documents with reliable indexing techniques using Apache Lucene in Hadoop

Lucene-P2: A Distributed Platform for Privacy-Preserving Text-based Search.

Design and Develop CMS for Sindhi E-News Papers

Medical Terminology Server for the Hospital of Clinics of Paraguay using Apache Lucene and the UMLS Metathesaurus
