Comparative Analysis of Indexing Techniques for Table Search in Data Lakes

  • Abstract
  • Similar Papers
Abstract

Data lakes store vast amounts of datasets in various forms, collected from diverse sources. In this context, efficient table search is essential for identifying and integrating data to support business intelligence and machine learning pipelines. This paper explores effective methods for finding related tables using advanced table representation learning. Representation learning generates dense vector representations for tables at different levels (row, column, cell), enabling the use of advanced indexing techniques such as LSH, HNSW, and DiskANN, which speed up the core operation of approximate k-NN search within vector spaces. However, while several indexing techniques have been proposed, a thorough study and comparison of their effectiveness versus performance trade-offs is still missing. In this paper, we aim to shed light on this gap. We begin by reviewing advanced vector-search techniques for table search in data lakes, followed by a detailed analysis of k-ANN indexes. Next, we compare the HNSW and DiskANN indexing techniques in terms of their internal structure, effectiveness, efficiency, and scalability. Additionally, we explore the impact of model accuracy on index performance. Our experiments include four datasets of varying size and complexity. This study allows us to explore indexing design options, revealing the strengths and weaknesses of each, and to identify potentially interesting directions for future research.
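To make the core operation concrete, the following is a minimal sketch (Python with NumPy, on synthetic toy embeddings) of the exact cosine k-NN linear scan that indexes such as LSH, HNSW, and DiskANN are designed to approximate at far lower query cost. The dataset size, dimensionality, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for table/column embeddings produced by a
# representation-learning model (size and dimension are illustrative).
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def knn_exact(query, index, k=5):
    """Exact k-NN by cosine similarity: the linear-scan baseline
    that approximate indexes trade a little recall to avoid."""
    q = query / np.linalg.norm(query)
    sims = index @ q                 # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:k]      # indices of the k most similar tables
    return top, sims[top]

ids, scores = knn_exact(embeddings[0], embeddings, k=5)
```

Approximate indexes such as HNSW and DiskANN replace the full scan above with graph traversal over a small fraction of the vectors, which is exactly the effectiveness-versus-performance trade-off the paper studies.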

Similar Papers
  • Research Article
  • Citations: 3
  • 10.1017/s175173112000155x
Storing, combining and analysing turkey experimental data in the Big Data era
  • Jan 1, 2020
  • Animal
  • D Schokker + 4 more

With the increasing availability of large amounts of data in the livestock domain, we face the challenge of storing, combining and analysing these data efficiently. In this study, we explored the use of a data lake for storing and analysing data to improve scalability and interoperability. Data originated from a 2-day animal experiment in which the gait score of approximately 200 turkeys was determined through visual inspection by an expert. Additionally, inertial measurement units (IMUs), a 3D-video camera and a force plate (FP) were installed to explore the effectiveness of these sensors in automating the visual gait scoring. We deployed a data lake using the IMU and FP data of a single day of that animal experiment. This encompasses data from 84 turkeys, which we preprocessed by performing an ‘extract, transform and load’ (ETL) procedure. To test the scalability of the ETL procedure, we simulated increasing volumes of the available data from this animal experiment and computed the ‘wall time’ (elapsed real time) for converting FP data into comma-separated files and storing these files. With a simulated data set of 30 000 turkeys, the wall time was reduced from 1 h to less than 15 min when 12 cores were used instead of 1, demonstrating that the ETL procedure is scalable. Subsequently, a machine learning (ML) pipeline was developed to test the potential of a data lake to automatically distinguish between two classes, that is, very bad gait scores v. other scores. In conclusion, we have set up a dedicated customized data lake, loaded data and developed a prediction model via the creation of an ML pipeline. A data lake appears to be a useful tool to face the challenge of storing, combining and analysing increasing volumes of data of varying nature in an effective manner.

  • Conference Article
  • Citations: 14
  • 10.1145/3555041.3589409
Table Discovery in Data Lakes: State-of-the-art and Future Directions
  • Jun 4, 2023
  • Grace Fan + 3 more

Data discovery refers to a set of tasks that enable users and downstream applications to explore and gain insights from massive collections of data sources such as data lakes. In this tutorial, we will provide a comprehensive overview of the most recent table discovery techniques developed by the data management community. We will cover table understanding tasks such as domain discovery, table annotation, and table representation learning which help data lake systems capture semantics of tables. We will also cover techniques enabling various query-driven discovery and table exploration tasks, as well as how table discovery can support key data science applications such as machine learning and knowledge base construction. Finally, we will discuss future research directions on developing new table discovery paradigms by combining structured knowledge and dense table representations, as well as improving the efficiency of discovery using state-of-the-art indexing techniques, and more.

  • Book Chapter
  • Citations: 9
  • 10.1007/978-3-642-22351-8_2
Location-Based Instant Search
  • Jan 1, 2011
  • Shengyue Ji + 1 more

Location-based keyword search has become an important part of our daily life. Such a query asks for records satisfying both a spatial condition and a keyword condition. State-of-the-art techniques extend a spatial tree structure by adding keyword information. In this paper we study location-based instant search, where a system searches based on a partial query a user has typed in. We first develop a new indexing technique, called filtering-effective hybrid index (FEH), that judiciously uses two types of keyword filters based on their selectiveness to do powerful pruning. Then, we develop indexing and search techniques that store prefix information on the FEH index and efficiently answer partial queries. Our experiments show a high efficiency and scalability of these techniques.

  • Supplementary Content
  • Citations: 14
  • 10.3389/fdata.2022.945720
Toward data lakes as central building blocks for data management and analysis
  • Aug 19, 2022
  • Frontiers in Big Data
  • Philipp Wieder + 1 more

Data lakes are a fundamental building block for many industrial data analysis solutions and becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to the actual storage to fit a predefined schema. Storing such massive amounts of raw data, however, has its very own challenges, spanning from the general data modeling, and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last, but not least, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. With that, potential research topics are determined, which have to be tackled toward the applicability of data lakes as central building blocks for research data management.

  • Research Article
  • Citations: 24
  • 10.1145/3588689
SANTOS: Relationship-based Semantic Table Union Search
  • May 26, 2023
  • Proceedings of the ACM on Management of Data
  • Aamod Khatiwada + 6 more

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.

  • Conference Article
  • Citations: 5
  • 10.1145/2390068.2390078
Indexing methods for efficient protein 3D surface search
  • Oct 29, 2012
  • Sungchul Kim + 2 more

This paper exploits efficient indexing techniques for protein structure search, where protein structures are represented as vectors by the 3D-Zernike Descriptor (3DZD). The 3DZD compactly represents the surface shape of a protein tertiary structure as a vector, and this simplified representation accelerates the structural search. However, a further speed-up is needed in scenarios where multiple users access the database simultaneously. We address this need by exploiting two indexing techniques, namely iDistance and iKernel, on the 3DZDs. The results show that both iDistance and iKernel significantly enhance the search speed. In addition, we introduce an extended approach for protein structure search based on indexing techniques that exploit the characteristics of the 3DZD. In the extended approach, the index structure is constructed using only the first few numbers of the 3DZDs. To find the top-k similar structures, the top-(10 × k) similar structures are first selected using the reduced index structure, and then the top-k structures are selected using the similarity measure on the full 3DZDs of the selected structures. Using these indexing techniques, search time was reduced by 69.6% with iDistance, 77% with iKernel, 77.4% with extended iDistance, and 87.9% with the extended iKernel method.

  • Research Article
  • Citations: 17
  • 10.1145/2000486.2000490
Selecting vantage objects for similarity indexing
  • Aug 1, 2011
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Reinier H Van Leuken + 1 more

Indexing has become a key element in the pipeline of a multimedia retrieval system, due to continuous increases in database size, data complexity, and complexity of similarity measures. The primary goal of any indexing algorithm is to overcome high computational costs involved with comparing the query to every object in the database. This is achieved by efficient pruning in order to select only a small set of candidate matches. Vantage indexing is an indexing technique that belongs to the category of embedding or mapping approaches, because it maps a dissimilarity space onto a vector space such that traditional access methods can be used for querying. Each object is represented by a vector of dissimilarities to a small set of m reference objects, called vantage objects. Querying takes place within this vector space. The retrieval performance of a system based on this technique can be improved significantly through a proper choice of vantage objects. We propose a new technique for selecting vantage objects that addresses the retrieval performance directly, and present extensive experimental results based on three data sets of different size and modality, including a comparison with other selection strategies. The results clearly demonstrate both the efficacy and scalability of the proposed approach.
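The vantage-object mapping described above can be sketched in a few lines of NumPy. This is a toy illustration on random data with an arbitrary choice of vantage objects (choosing them well is precisely the paper's contribution); the sizes and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
objects = rng.normal(size=(200, 16))   # toy database objects
vantage = objects[:3]                  # m = 3 vantage objects (naive selection)

def vantage_embed(x, vantage_objs):
    """Map an object to its vector of dissimilarities to the vantage objects."""
    return np.linalg.norm(vantage_objs - x, axis=1)

db_vecs = np.array([vantage_embed(o, vantage) for o in objects])
q_vec = vantage_embed(objects[7], vantage)

# Pruning step: nearest neighbors in the vantage space become the
# small set of candidate matches passed on for full comparison.
candidates = np.argsort(np.linalg.norm(db_vecs - q_vec, axis=1))[:5]
```

Querying now happens in the low-dimensional vantage space, so conventional access methods apply; only the few candidates are compared with the expensive original dissimilarity measure.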

  • Research Article
  • Citations: 8
  • 10.1007/s11517-021-02392-0
Text-based multi-dimensional medical images retrieval according to the features-usage correlation
  • Jan 1, 2021
  • Medical & Biological Engineering & Computing
  • Aliasghar Safaei

With emerging medical imaging applications in healthcare, the number and volume of medical images are growing dramatically. The information needs of users in such circumstances, whether for clinical or research activities, make the role of powerful medical image search engines more significant. In this paper, a text-based multi-dimensional medical image indexing technique is proposed in which the correlation of feature usages (according to users' queries) is considered, providing off-the-content indexing while taking users' interests into account. Assuming that each medical image has some extracted features (e.g., based on the DICOM standard), correlations of the features are discovered by applying data mining techniques (i.e., quantitative association pattern discovery) to the history of users' queries as a data set. Then, based on the pairwise correlation of the features of medical images (a.k.a. affinity), the set of all features is fragmented into subsets (using a method similar to the vertical fragmentation of tables in distributed relational DBs). After that, each of these feature subsets is turned into a hierarchy of features (by applying a hierarchical clustering algorithm to that subset); all of these distinct hierarchies together make up a multi-dimensional structure of the features of medical images, which is in fact the proposed text-based (feature-based) multi-dimensional index structure. By constructing and using such a text-based multi-dimensional index structure via its specific operations, the medical image retrieval process is improved in the underlying medical image search engine. Generally, an indexing technique provides a logical representation of documents in order to optimize the retrieval process. The proposed indexing technique is designed so that it can improve the retrieval of medical images in a medical image search engine in terms of both effectiveness and efficiency.
Considering the correlation of image features semantically improves the precision (effectiveness) of the retrieval process, while traversing them through the hierarchy in one dimension helps to optimize (i.e., minimize) resource usage for better efficiency. The proposed text-based multi-dimensional indexing technique is implemented using the open-source search engine Lucene, and compared with the built-in indexing technique available in Lucene, with the Terrier platform (available for benchmarking information retrieval systems), and with other closely related indexing techniques. Evaluation results on memory usage and time complexity, together with experimental evaluations of efficiency and effectiveness measures, show that the proposed multi-dimensional indexing technique significantly improves both efficiency and effectiveness for a medical image search engine.

  • Research Article
  • Citations: 8
  • 10.14569/ijacsa.2021.0120864
Analysis of Big Data Storage Tools for Data Lakes based on Apache Hadoop Platform
  • Jan 1, 2021
  • International Journal of Advanced Computer Science and Applications
  • Vladimir Belov + 1 more

When developing large data processing systems, the question of data storage arises. One of the modern tools for solving this problem is the so-called data lake. Many implementations of data lakes use Apache Hadoop as a basic platform. Hadoop does not have a default data storage format, which leads to the task of choosing a data format when designing a data processing system. To solve this problem, it is necessary to proceed from the results of an assessment according to several criteria. In turn, experimental evaluation does not always give a complete understanding of the possibilities of working with a particular data storage format. In this case, it is necessary to study the features of the format, its internal structure, recommendations for use, etc. The article describes the features of both widely used data storage formats and formats that are currently gaining popularity.

  • Conference Article
  • Citations: 93
  • 10.1145/3299869.3300065
JOSIE
  • Jun 25, 2019
  • Erkang Zhu + 3 more

We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as the intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), those solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE clearly outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and in some cases even better, on real data lakes.
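The columns-as-sets formulation can be illustrated with a toy example in plain Python. The column names and values are invented, and JOSIE itself layers inverted-index probes and cost-based early termination on top of this idea rather than doing the naive exact scan shown here.

```python
# Toy data lake: each column is modeled as a set of its distinct values.
query_col = {"US", "DE", "FR", "JP", "BR"}
lake_cols = {
    "t1.country": {"US", "DE", "FR", "IT"},
    "t2.code":    {"JP", "BR"},
    "t3.city":    {"Rome", "Paris"},
}

def topk_joinable(query, columns, k=2):
    """Rank candidate join columns by exact intersection size
    (the overlap that JOSIE estimates and prunes against)."""
    scored = [(len(query & vals), name) for name, vals in columns.items()]
    scored.sort(reverse=True)
    return scored[:k]

best = topk_joinable(query_col, lake_cols, k=2)
```

At data-lake scale the sets are far too large to intersect exhaustively, which is why minimizing set reads and index probes becomes the central cost question.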

  • Research Article
  • Citations: 32
  • 10.14778/3587136.3587146
Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning
  • Mar 1, 2023
  • Proceedings of the VLDB Endowment
  • Grace Fan + 4 more

Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
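The column-cosine unionability idea can be sketched in a few lines of NumPy. The embeddings below are made-up numbers, and the best-match-average aggregation is just one of the design choices the filter-and-verification framework admits, not necessarily Starmie's exact scoring.

```python
import numpy as np

# Toy column embeddings for a query table and a candidate table
# (two columns each, approximately unit-normalized; values invented).
q_cols = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
c_cols = np.array([[0.9, 0.436],
                   [0.1, 0.995]])

# Column unionability = cosine similarity between column embeddings;
# a table score aggregates each query column's best match.
sims = q_cols @ c_cols.T
table_score = sims.max(axis=1).mean()
```

An index such as HNSW enters exactly at the `sims` step: instead of comparing the query columns against every column in the lake, it retrieves only the likely high-similarity candidates.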

  • Book Chapter
  • 10.1007/978-3-642-32498-7_20
Indexing and Search for Fast Music Identification
  • Jan 1, 2012
  • Guang-Ho Cha

In this paper, we present a new technique for indexing and search in a database that stores songs. A song is represented by a high dimensional binary vector using the audio fingerprinting technique. Audio fingerprinting extracts from a song a fingerprint which is a content-based compact signature that summarizes an audio recording. A song can be recognized by matching an extracted fingerprint to a database of known audio fingerprints. In this paper, we are given a high dimensional binary fingerprint database of songs and focus our attention on the problem of effective and efficient database search. However, the nature of high dimensionality and binary space makes many modern search algorithms inapplicable. The high dimensionality of fingerprints suffers from the curse of dimensionality, i.e., as the dimension increases, the search performance decreases exponentially. In order to tackle this problem, we propose a new search algorithm based on inverted indexing, the multiple sub-fingerprint match principle, the offset match principle, and the early termination strategy. We evaluate our technique using a database of 2,000 songs containing approximately 4,000,000 sub-fingerprints and the experimental result shows encouraging performance.
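The inverted-indexing and sub-fingerprint match ideas can be shown with a tiny toy example in plain Python. The song table and hash values are invented, and a real system would also use the offset-match principle and early termination described above.

```python
# Toy inverted index over sub-fingerprints (hash value -> song ids),
# a bare-bones version of the multiple sub-fingerprint match principle.
songs = {
    0: [0x1A2B, 0x3C4D, 0x5E6F],
    1: [0x1A2B, 0x7788, 0x99AA],
}
inverted = {}
for sid, subs in songs.items():
    for s in subs:
        inverted.setdefault(s, set()).add(sid)

def identify(query_subs):
    """Vote for songs that share sub-fingerprints with the query clip."""
    votes = {}
    for s in query_subs:
        for sid in inverted.get(s, ()):
            votes[sid] = votes.get(sid, 0) + 1
    return max(votes, key=votes.get) if votes else None

match = identify([0x1A2B, 0x5E6F])
```

Because lookups go through exact hash matches rather than distances in the high-dimensional binary space, the approach sidesteps the curse of dimensionality that defeats tree-based indexes here.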

  • Conference Article
  • Citations: 1
  • 10.1145/3430984.3431968
Exploring State-of-the-Art Nearest Neighbor (NN) Search Techniques
  • Jan 2, 2021
  • Parth Nagarkar + 2 more

Finding nearest neighbors (NN) is a fundamental operation in many diverse domains such as databases, machine learning, data mining, information retrieval, multimedia retrieval, etc. Due to the data deluge and the application of nearest neighbor queries in many applications where fast performance is necessary, efficient index structures are required to speed up finding nearest neighbors. Different application domains have different data characteristics and, therefore, require different types of indexing techniques. While the internal indexing and searching mechanism is generally hidden from the top-level application, it is beneficial for a data scientist to understand these fundamental operations and choose a correct indexing technique to improve the performance of the overall end-to-end workflow. Choosing the correct searching mechanism to solve a nearest neighbor query can be a daunting task, however. A wrong choice can potentially lead to low accuracy, slower execution time, or in the worst case, both. The objective of this tutorial is to present the audience with the knowledge to choose the correct index structure for specific applications. We present the state-of-the-art Nearest Neighbor (NN) indexing techniques for different data characteristics. We also present the effect, in terms of time and accuracy, of choosing the wrong index structure for different application needs. We conclude the tutorial with a discussion on the future challenges in the Nearest Neighbor search domain.

  • Research Article
  • Citations: 10
  • 10.14778/3494124.3494149
Ember
  • Nov 1, 2021
  • Proceedings of the VLDB Endowment
  • Sahaana Suri + 3 more

Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys. Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an index populated with task-specific embeddings. Ember learns these embeddings by leveraging Transformer-based representation learning techniques. We describe our architectural principles and operators when developing Ember, and empirically demonstrate that Ember allows users to develop no-code context enrichment pipelines for five domains, including search, recommendation, and question answering, and can exceed alternatives by up to 39% recall, with as little as a single-line configuration change.

  • Book Chapter
  • Citations: 6
  • 10.1007/978-3-319-68474-1_20
DS-Prox: Dataset Proximity Mining for Governing the Data Lake
  • Jan 1, 2017
  • Ayman Alserafi + 3 more

With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early-pruning unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which shows significant efficiency gains above 25% compared to matching without early-pruning, and recall rates reaching higher than 90% under certain scenarios.

More from: International Journal of Semantic Computing
  • Research Article
  • 10.1142/s1793351x25020040
Guest Editors’ Introduction for AIxSET and AIxHEART 2024
  • Oct 24, 2025
  • International Journal of Semantic Computing
  • Gary Glesener + 2 more

  • Research Article
  • 10.1142/s1793351x25020039
Guest Editorial: Special Issue on Artificial Intelligence for Medicine, Health and Care
  • Oct 17, 2025
  • International Journal of Semantic Computing
  • Bryan Chou + 2 more

  • Research Article
  • 10.1142/s1793351x25430032
A Multi-Modal Robotic System for Indoor Assistance and Fall Detection in Elderly and Mobility-Impaired Individuals
  • Jul 1, 2025
  • International Journal of Semantic Computing
  • Savin Seneviratne + 1 more

  • Research Article
  • 10.1142/s1793351x25440027
Bridging Linguistics and Artificial Intelligence: A Phoneme-Centric Method for Assessing Synthetic Speech
  • Jun 18, 2025
  • International Journal of Semantic Computing
  • Sarah Reynolds + 1 more

  • Research Article
  • 10.1142/s1793351x25430020
Detection of Large Vessel Occlusion in Ischemic Stroke Patients Using Deep Residual Distilled Convolutional Networks
  • Jun 12, 2025
  • International Journal of Semantic Computing
  • Rohan Chatterjee + 7 more

  • Research Article
  • 10.1142/s1793351x25440015
Translative Research Assistant: A Retrieval-Augmented Generation Pipeline Refinement with Keyword Extraction Using Extended Scalable Betweenness Centrality
  • Jun 12, 2025
  • International Journal of Semantic Computing
  • Chung-Hsien Chou + 1 more

  • Research Article
  • 10.1142/s1793351x25500011
Explainable ICD Code Assignment Using Knowledge-Based Sentence Extraction and Deep Learning
  • May 20, 2025
  • International Journal of Semantic Computing
  • Joshua Carberry + 1 more

  • Research Article
  • 10.1142/s1793351x25420024
Comparative Analysis of Indexing Techniques for Table Search in Data Lakes
  • May 14, 2025
  • International Journal of Semantic Computing
  • Ibraheem Taha + 3 more

  • Research Article
  • 10.1142/s1793351x25420061
Using the Hurwicz Criterion to Optimize Selection Queries Under Partial Ignorance
  • May 7, 2025
  • International Journal of Semantic Computing
  • Sven Helmer + 2 more

  • Research Article
  • 10.1142/s1793351x25420073
FinCaKG: A Framework to Construct Financial Causality Knowledge Graph from Text
  • Apr 24, 2025
  • International Journal of Semantic Computing
  • Ziwei Xu + 2 more
