Comparative Analysis of Indexing Techniques for Table Search in Data Lakes
Data lakes store vast numbers of datasets in various forms, collected from various sources. In this context, efficient table search is essential for identifying and integrating data to support business intelligence and machine learning pipelines. This paper explores effective methods for finding related tables using advanced table representation learning. Representation learning generates dense vector representations for tables at different levels (row, column, cell), enabling the use of advanced indexing techniques such as LSH, HNSW, and DiskANN, which speed up the core operation of approximate k-NN search within vector spaces. However, while several indexing techniques have been proposed so far, a thorough study and comparison of their effectiveness versus performance trade-offs is still missing. In this paper, we aim to fill this gap. We begin by reviewing advanced vector-search techniques for table search in data lakes, followed by a detailed analysis of k-ANN indexes. Next, we compare the HNSW and DiskANN indexing techniques in terms of their internal structure, effectiveness, efficiency, and scalability. Additionally, we explore the impact of model accuracy on index performance. Our experiments include four datasets of various sizes and complexity. This study allows us to explore indexing design options, revealing the strengths and weaknesses of each, and also to identify potentially interesting future research directions.
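To make the effectiveness side of this trade-off concrete, the sketch below shows the exhaustive (exact) k-NN scan that indexes such as LSH, HNSW, and DiskANN approximate, together with a recall@k measure for comparing an approximate result against the exact one. It is a minimal illustration, not the paper's evaluation code: the function names, the toy column embeddings, and the choice of cosine similarity are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def exact_knn(query, vectors, k):
    """Return the ids of the k most similar vectors, found by scanning everything."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [vector_id for vector_id, _ in ranked[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate index also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical 3-dimensional column embeddings, invented for illustration.
columns = {
    "city": [0.9, 0.1, 0.0],
    "town": [0.8, 0.2, 0.1],
    "price": [0.0, 0.1, 0.9],
    "cost": [0.1, 0.0, 0.8],
}
query = [1.0, 0.0, 0.0]
ground_truth = exact_knn(query, columns, k=2)
```

An approximate index is usually judged by how close its recall@k comes to 1.0 at a given query latency; the exact scan supplies the ground truth for that measurement.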
- Research Article
3
- 10.1017/s175173112000155x
- Jan 1, 2020
- Animal
With the increasing availability of large amounts of data in the livestock domain, we face the challenge of storing, combining and analysing these data efficiently. With this study, we explored the use of a data lake for storing and analysing data to improve scalability and interoperability. Data originated from a 2-day animal experiment in which the gait score of approximately 200 turkeys was determined through visual inspection by an expert. Additionally, inertial measurement units (IMUs), a 3D-video camera and a force plate (FP) were installed to explore the effectiveness of these sensors in automating the visual gait scoring. We deployed a data lake using the IMU and FP data of a single day of that animal experiment. This encompasses data from 84 turkeys, which we preprocessed by performing an ‘extract, transform and load’ (ETL) procedure. To test the scalability of the ETL procedure, we simulated increasing volumes of the available data from this animal experiment and computed the ‘wall time’ (elapsed real time) for converting FP data into comma-separated files and storing these files. With a simulated data set of 30 000 turkeys, the wall time was reduced from 1 h to less than 15 min when 12 cores were used instead of 1 core. This demonstrated the ETL procedure to be scalable. Subsequently, a machine learning (ML) pipeline was developed to test the potential of a data lake to automatically distinguish between two classes, that is, very bad gait scores v. other scores. In conclusion, we have set up a dedicated customized data lake, loaded data and developed a prediction model via the creation of an ML pipeline. A data lake appears to be a useful tool to face the challenge of storing, combining and analysing increasing volumes of data of varying nature in an effective manner.
- Conference Article
14
- 10.1145/3555041.3589409
- Jun 4, 2023
Data discovery refers to a set of tasks that enable users and downstream applications to explore and gain insights from massive collections of data sources such as data lakes. In this tutorial, we will provide a comprehensive overview of the most recent table discovery techniques developed by the data management community. We will cover table understanding tasks such as domain discovery, table annotation, and table representation learning which help data lake systems capture semantics of tables. We will also cover techniques enabling various query-driven discovery and table exploration tasks, as well as how table discovery can support key data science applications such as machine learning and knowledge base construction. Finally, we will discuss future research directions on developing new table discovery paradigms by combining structured knowledge and dense table representations, as well as improving the efficiency of discovery using state-of-the-art indexing techniques, and more.
- Book Chapter
9
- 10.1007/978-3-642-22351-8_2
- Jan 1, 2011
Location-based keyword search has become an important part of our daily life. Such a query asks for records satisfying both a spatial condition and a keyword condition. State-of-the-art techniques extend a spatial tree structure by adding keyword information. In this paper we study location-based instant search, where a system searches based on a partial query a user has typed in. We first develop a new indexing technique, called filtering-effective hybrid index (FEH), that judiciously uses two types of keyword filters based on their selectiveness to do powerful pruning. Then, we develop indexing and search techniques that store prefix information on the FEH index and efficiently answer partial queries. Our experiments show a high efficiency and scalability of these techniques.
- Supplementary Content
14
- 10.3389/fdata.2022.945720
- Aug 19, 2022
- Frontiers in Big Data
Data lakes are a fundamental building block for many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to the actual storage to fit a predefined schema. Storing such massive amounts of raw data, however, has its very own challenges, ranging from general data modeling and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last, but not least, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. With that, potential research topics are determined, which have to be tackled toward the applicability of data lakes as central building blocks for research data management.
- Research Article
24
- 10.1145/3588689
- May 26, 2023
- Proceedings of the ACM on Management of Data
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
- Conference Article
5
- 10.1145/2390068.2390078
- Oct 29, 2012
This paper exploits efficient indexing techniques for protein structure search where protein structures are represented as vectors by the 3D-Zernike Descriptor (3DZD). The 3DZD compactly represents the surface shape of a protein tertiary structure as a vector, and the simplified representation accelerates the structural search. However, further speed-up is needed to address scenarios where multiple users access the database simultaneously. We address this need in protein structural search by exploiting two indexing techniques, i.e., iDistance and iKernel, on the 3DZDs. The results show that both iDistance and iKernel significantly enhance the search speed. In addition, we introduce an extended approach for protein structure search based on indexing techniques that use the 3DZD characteristic. In the extended approach, the index structure is constructed using only the first few of the numbers in the 3DZDs. To find the top-k similar structures, the top 10 × k similar structures are first selected using the reduced index structure, and then the top-k structures are selected using the similarity measure on the full 3DZDs of the selected structures. Using the indexing techniques, the search time was reduced by 69.6% using iDistance, 77% using iKernel, 77.4% using extended iDistance, and 87.9% using the extended iKernel method.
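The two-stage search described in this abstract (shortlist 10 × k candidates on a reduced descriptor, then rerank them on the full descriptor) can be sketched as follows. This is a hedged illustration only: the first stage here is a plain linear scan standing in for the iDistance or iKernel index, Euclidean distance is assumed as the similarity measure, and the names `filter_and_refine`, `prefix_len`, and `expansion` are hypothetical.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filter_and_refine(query, database, k, prefix_len, expansion=10):
    """Two-stage top-k search.

    Stage 1 keeps the expansion * k entries closest to the query when only
    the first prefix_len dimensions are compared (a linear scan here, in
    place of a real index over the reduced vectors).
    Stage 2 reranks those candidates using the full vectors.
    """
    candidates = heapq.nsmallest(
        expansion * k,
        database.items(),
        key=lambda item: euclidean(query[:prefix_len], item[1][:prefix_len]),
    )
    refined = heapq.nsmallest(
        k, candidates, key=lambda item: euclidean(query, item[1])
    )
    return [name for name, _ in refined]


# Toy 4-dimensional descriptors, invented for illustration.
database = {
    "a": [0, 0, 0, 0],
    "b": [0, 0, 5, 5],
    "c": [1, 1, 0, 0],
    "d": [3, 3, 3, 3],
}
query = [0, 0, 0, 0]

generous = filter_and_refine(query, database, k=2, prefix_len=2, expansion=10)
tight = filter_and_refine(query, database, k=2, prefix_len=2, expansion=1)
```

With `expansion=1` the shortlist is too small: entry "b" looks identical to the query on the first two dimensions and crowds out the true neighbour "c". A larger expansion factor trades extra reranking work for a lower chance of such misses.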
- Research Article
17
- 10.1145/2000486.2000490
- Aug 1, 2011
- ACM Transactions on Multimedia Computing, Communications, and Applications
Indexing has become a key element in the pipeline of a multimedia retrieval system, due to continuous increases in database size, data complexity, and complexity of similarity measures. The primary goal of any indexing algorithm is to overcome high computational costs involved with comparing the query to every object in the database. This is achieved by efficient pruning in order to select only a small set of candidate matches. Vantage indexing is an indexing technique that belongs to the category of embedding or mapping approaches, because it maps a dissimilarity space onto a vector space such that traditional access methods can be used for querying. Each object is represented by a vector of dissimilarities to a small set of m reference objects, called vantage objects. Querying takes place within this vector space. The retrieval performance of a system based on this technique can be improved significantly through a proper choice of vantage objects. We propose a new technique for selecting vantage objects that addresses the retrieval performance directly, and present extensive experimental results based on three data sets of different size and modality, including a comparison with other selection strategies. The results clearly demonstrate both the efficacy and scalability of the proposed approach.
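The embedding step of vantage indexing can be illustrated with a short sketch: each object is stored as its vector of distances to the vantage objects, and candidates are pruned before the expensive comparison. The example assumes the dissimilarity is a true metric (Euclidean distance on 2D points), so that the triangle inequality gives a safe lower bound. The vantage objects are picked by hand, so the selection strategy proposed in the paper is not shown, and all names are hypothetical.

```python
import math


def build_index(objects, vantage_objects, dist):
    """Map every object to its vector of distances to the vantage objects."""
    return {
        name: [dist(obj, v) for v in vantage_objects]
        for name, obj in objects.items()
    }


def range_query(query, radius, objects, index, vantage_objects, dist):
    """Return (hits, verified): objects within radius, and how many needed a real distance."""
    query_vec = [dist(query, v) for v in vantage_objects]
    hits, verified = [], 0
    for name, object_vec in index.items():
        # Triangle inequality: |d(q, v) - d(o, v)| <= d(q, o) for every vantage v,
        # so the largest coordinate gap is a lower bound on the true distance.
        lower_bound = max(abs(a - b) for a, b in zip(query_vec, object_vec))
        if lower_bound > radius:
            continue  # pruned without computing the real distance
        verified += 1
        if dist(query, objects[name]) <= radius:
            hits.append(name)
    return hits, verified


# Toy 2D points and hand-picked vantage objects, invented for illustration.
objects = {"a": (0, 0), "b": (1, 0), "c": (10, 0), "d": (0, 10)}
vantage_objects = [(0, 0), (10, 10)]
index = build_index(objects, vantage_objects, math.dist)

hits, verified = range_query((0.5, 0), 1.0, objects, index, vantage_objects, math.dist)
```

Here two of the four objects are discarded using only the precomputed vectors. How many can be discarded in practice depends on how well the vantage objects separate the data, which is the choice the abstract argues should be optimized directly.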
- Research Article
8
- 10.1007/s11517-021-02392-0
- Jan 1, 2021
- Medical & Biological Engineering & Computing
With emerging medical imaging applications in healthcare, the number and volume of medical images are growing dramatically. The information needs of users in such circumstances, either for clinical or research activities, make the role of powerful medical image search engines more significant. In this paper, a text-based multi-dimensional medical image indexing technique is proposed in which the correlation of feature usages (according to the users’ queries) is considered to provide an off-the content indexing while taking users’ interestingness into account. Assuming that each medical image has some extracted features (e.g., based on the DICOM standard), correlations of the features are discovered by applying data mining techniques (i.e., quantitative association pattern discovery) to the history of users’ queries as a data set. Then, based on the pairwise correlation of the features of medical images (a.k.a. affinity), the set of all features is fragmented into subsets (using a method like the vertical fragmentation of tables in the distribution of relational DBs). After that, each of these subsets of features is turned into a hierarchy of features (by applying a hierarchical clustering algorithm to that subset); all of these distinct hierarchies together then make up a multi-dimensional structure of the features of medical images, which is in fact the proposed text-based (feature-based) multi-dimensional index structure. By constructing and using such a text-based multi-dimensional index structure via its specific required operations, the medical image retrieval process would be improved in the underlying medical image search engine. Generally, an indexing technique provides a logical representation of documents in order to optimize the retrieval process. The proposed indexing technique is designed such that it can improve the retrieval of medical images in a medical image search engine in terms of its effectiveness and efficiency.
Considering the correlation of the features of the image would semantically improve the precision (effectiveness) of the retrieval process, while traversing them through the hierarchy in one dimension would try to optimize (i.e., minimize) the resources used, for better efficiency. The proposed text-based multi-dimensional indexing technique is implemented using the open source search engine Lucene, and compared with the built-in indexing technique available in the Lucene search engine, with the Terrier platform (available for the benchmarking of information retrieval systems), and with the other most closely related indexing techniques. Evaluation results of memory usage and time complexity analysis, alongside the experimental evaluation of efficiency and effectiveness measures, show that the proposed multi-dimensional indexing technique significantly improves both efficiency and effectiveness for a medical image search engine.
- Research Article
8
- 10.14569/ijacsa.2021.0120864
- Jan 1, 2021
- International Journal of Advanced Computer Science and Applications
When developing large data processing systems, the question of data storage arises. One of the modern tools for solving this problem is the so-called data lake. Many implementations of data lakes use Apache Hadoop as a basic platform. Hadoop does not have a default data storage format, which leads to the task of choosing a data format when designing a data processing system. To solve this problem, it is necessary to proceed from the results of an assessment according to several criteria. In turn, experimental evaluation does not always give a complete understanding of the possibilities for working with a particular data storage format. In this case, it is necessary to study the features of the format, its internal structure, recommendations for use, etc. The article describes the features of both widely used data storage formats and those currently gaining popularity.
- Conference Article
93
- 10.1145/3299869.3300065
- Jun 25, 2019
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as the intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and even in some cases better, on real data lakes.
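The problem formulation in this abstract (columns as sets, joinability as the number of shared distinct values) can be illustrated with a naive inverted-index baseline. This is explicitly not JOSIE: it probes every posting list and does none of the cost-based pruning the abstract describes. The table and column names are invented for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(columns):
    """Map each distinct value to the set of columns that contain it."""
    index = defaultdict(set)
    for name, values in columns.items():
        for value in set(values):
            index[value].add(name)
    return index


def top_k_overlap(query_values, index, k):
    """Rank columns by how many distinct query values they share with the query."""
    overlap = Counter()
    for value in set(query_values):
        for name in index.get(value, ()):
            overlap[name] += 1
    # Sort by overlap (descending), then by name so ties are deterministic.
    ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical data lake columns, invented for illustration.
columns = {
    "customers.city": ["Paris", "Rome", "Oslo"],
    "offices.location": ["Paris", "Rome", "Lima", "Cairo"],
    "products.name": ["Desk", "Lamp"],
}
index = build_inverted_index(columns)

query_column = ["Paris", "Rome", "Lima", "Cairo", "Oslo"]
best = top_k_overlap(query_column, index, k=2)
```

The cost of this baseline grows with the total length of the posting lists touched, which is the cost that becomes prohibitive at the set and dictionary sizes the abstract reports for real data lakes.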
- Research Article
32
- 10.14778/3587136.3587146
- Mar 1, 2023
- Proceedings of the VLDB Endowment
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
- Book Chapter
- 10.1007/978-3-642-32498-7_20
- Jan 1, 2012
In this paper, we present a new technique for indexing and search in a database that stores songs. A song is represented by a high dimensional binary vector using the audio fingerprinting technique. Audio fingerprinting extracts from a song a fingerprint which is a content-based compact signature that summarizes an audio recording. A song can be recognized by matching an extracted fingerprint to a database of known audio fingerprints. In this paper, we are given a high dimensional binary fingerprint database of songs and focus our attention on the problem of effective and efficient database search. However, the nature of high dimensionality and binary space makes many modern search algorithms inapplicable. The high dimensionality of fingerprints suffers from the curse of dimensionality, i.e., as the dimension increases, the search performance decreases exponentially. In order to tackle this problem, we propose a new search algorithm based on inverted indexing, the multiple sub-fingerprint match principle, the offset match principle, and the early termination strategy. We evaluate our technique using a database of 2,000 songs containing approximately 4,000,000 sub-fingerprints and the experimental result shows encouraging performance.
- Conference Article
1
- 10.1145/3430984.3431968
- Jan 2, 2021
Finding nearest neighbors (NN) is a fundamental operation in many diverse domains such as databases, machine learning, data mining, information retrieval, multimedia retrieval, etc. Due to the data deluge and the application of nearest neighbor queries in many applications where fast performance is necessary, efficient index structures are required to speed up finding nearest neighbors. Different application domains have different data characteristics and, therefore, require different types of indexing techniques. While the internal indexing and searching mechanism is generally hidden from the top-level application, it is beneficial for a data scientist to understand these fundamental operations and choose a correct indexing technique to improve the performance of the overall end-to-end workflow. Choosing the correct searching mechanism to solve a nearest neighbor query can be a daunting task, however. A wrong choice can potentially lead to low accuracy, slower execution time, or in the worst case, both. The objective of this tutorial is to present the audience with the knowledge to choose the correct index structure for specific applications. We present the state-of-the-art Nearest Neighbor (NN) indexing techniques for different data characteristics. We also present the effect, in terms of time and accuracy, of choosing the wrong index structure for different application needs. We conclude the tutorial with a discussion on the future challenges in the Nearest Neighbor search domain.
- Research Article
10
- 10.14778/3494124.3494149
- Nov 1, 2021
- Proceedings of the VLDB Endowment
Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys. Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an index populated with task-specific embeddings. Ember learns these embeddings by leveraging Transformer-based representation learning techniques. We describe our architectural principles and operators when developing Ember, and empirically demonstrate that Ember allows users to develop no-code context enrichment pipelines for five domains, including search, recommendation and question answering, and can exceed alternatives by up to 39% recall, with as little as a single line configuration change.
- Book Chapter
6
- 10.1007/978-3-319-68474-1_20
- Jan 1, 2017
With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early-pruning unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which shows significant efficiency gains above 25% compared to matching without early-pruning, and recall rates reaching higher than 90% under certain scenarios.