  • Open Access
  • Research Article
  • Cite Count: 260
  • 10.1109/tbdata.2025.3618474
The Faiss Library
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Matthijs Douze + 8 more

Vector databases typically manage large collections of embedding vectors. As AI applications are growing rapidly, the number of embeddings that need to be stored and indexed is increasing. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-offs in vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected use cases to highlight its broad applicability.
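To make "vector similarity search" concrete, the following is a minimal sketch of exact brute-force k-nearest-neighbor search in plain Python. It is not the Faiss API; the function names `l2_distance` and `knn_search` and the toy data are assumptions made for illustration. It shows the exact baseline against which indexing methods trade accuracy for speed and memory.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(database, query, k):
    """Return the k nearest database vectors as (distance, index), closest first.

    Every database vector is compared with the query, so the cost grows
    linearly with the collection size. Hypothetical helper, not a Faiss call.
    """
    scored = ((l2_distance(vec, query), idx) for idx, vec in enumerate(database))
    return heapq.nsmallest(k, scored)


# Toy collection of 2-D embeddings (illustrative values only).
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
result = knn_search(database, [1.0, 1.0], k=2)
nearest_ids = [idx for _, idx in result]
```

Because the scan touches every stored vector, it becomes expensive for large collections, which is the motivation for the compressed and approximate indexes the abstract refers to.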

  • Research Article
  • 10.1109/tbdata.2025.3639973
Knockoff-Guided Feature Selection via a Single Pre-Trained Reinforced Agent
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Xinyuan Wang + 5 more

Modern data-driven applications generate vast, complex data that often contain irrelevant, redundant, or noisy features. Feature selection addresses this by choosing the optimal subset of features from a dataset, removing redundant and irrelevant ones to enhance downstream performance. Reinforcement learning (RL) is well suited to feature selection in data-centric tasks because its interactive nature adapts dynamically to evolving data environments. However, high initial exploration variability, reliance on downstream tasks, and the need to control the false discovery rate pose challenges. To address these issues, we introduce a knockoff-guided RL framework, which uses pseudo-features to control early-stage randomness and matrix reconstruction for unsupervised reward. Our approach is a single-agent RL approach with unsupervised rewards that utilize knockoff information and matrix reconstruction to enhance feature selection without relying on labeled data. The framework integrates three components: the knockoff information to control exploration variability, decision network pre-training to guide the RL policy, and matrix reconstruction to guide unsupervised rewards. Extensive experiments are conducted on various datasets across different task types, including classification and regression, demonstrating the superiority of our framework. Code is publicly available.

  • Research Article
  • 10.1109/tbdata.2025.3639917
A Fast Linearithmic Graph Clustering Approach for Big Data Using Gravitational Attraction Principle
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Mohammad Maksood Akhter + 3 more

With the exponential growth of Big Data in domains such as healthcare, genomics, and sensor networks, computationally efficient and effective clustering techniques have become essential for uncovering meaningful patterns. Traditional clustering methods face fundamental limitations in Big Data analysis. K-means is among the fastest known approaches, but it fails to capture non-spherical clusters. Hierarchical clustering can detect arbitrary shapes but suffers from sub-cubic complexity, while many state-of-the-art methods still incur quadratic complexity. Moreover, most existing approaches fail to capture the intrinsic structure of data. In this context, graph-based clustering has emerged as a powerful alternative due to its ability to model geometric relationships and reveal underlying structures. However, existing graph-based techniques typically incur quadratic complexity, limiting their scalability. The objective of this work is to develop a scalable graph-based clustering framework that reduces complexity while preserving clustering quality in large, noisy, and high-dimensional datasets. To achieve this, we propose a fast graph clustering framework with overall complexity $\mathcal{O}(N \lg N)$, where $N$ denotes the number of data points. The method employs a two-stage dispersion-based partitioning to generate cohesive sub-clusters, followed by the construction of a sparse graph on sub-cluster centers to efficiently capture adjacency. Sub-clusters are then merged iteratively using a gravitational-force-inspired attraction model, enabling the discovery of coherent structures with reduced computation.
Extensive experiments on 41 multi-scale datasets demonstrate that our method consistently outperforms traditional and state-of-the-art approaches, achieving 27.33% higher clustering accuracy on average while reducing runtime by more than 86.64%. These results highlight both the innovation and the effectiveness of the proposed approach, making it highly suitable for Big Data analytics.

  • Research Article
  • Cite Count: 3
  • 10.1109/tbdata.2025.3627488
GraphLLM: Boosting Graph Reasoning Ability of Large Language Model
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Ziwei Chai + 6 more

The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability in understanding diverse types of information, including but not limited to images and audio. Despite this progress, a critical gap remains in empowering LLMs to proficiently understand and reason on graph data, which is ubiquitous in Big Data applications such as social networks, knowledge graphs, and molecular databases. Recent studies underscore LLMs' underwhelming performance on fundamental graph reasoning tasks. In this paper, we endeavor to unearth the obstacles that impede LLMs in graph reasoning, pinpointing the common practice of converting graphs into natural language descriptions (Graph2Text) as a fundamental bottleneck. To overcome this impediment, we introduce GraphLLM, a pioneering end-to-end approach that synergistically integrates graph learning models with LLMs through a novel Dynamic Task Configuration System. This system employs a Hierarchical Graph Processing Pipeline that combines Local Structure Analyzers for node-level features with Global Pattern Synthesizers for graph-level understanding, enabling scalable processing of large-scale graph data. Our empirical evaluations across four fundamental graph reasoning tasks validate the effectiveness of GraphLLM. The results exhibit a substantial average accuracy enhancement of 54.44%, alongside a noteworthy context reduction of 96.45% across various graph reasoning tasks, demonstrating significant potential for Big Data graph analytics.

  • Open Access
  • Research Article
  • 10.1109/tbdata.2025.3639968
SARF: Sparsity-Aware Reconstruction Framework for Large-Scale Datasets
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Agustin Zuniga + 5 more

Large-scale datasets, particularly those collected from smart devices and Internet of Things sensors, usually exhibit significant temporal and spatial sparsity, resulting in high amounts of missing data. Unless addressed in the analysis, this sparsity can result in substantial gaps and biases as well as limit the generalizability of conclusions drawn from such data. To address this challenge in data quality, we contribute the Sparsity-Aware Reconstruction Framework (SARF) as a novel and unified data fusion and reconstruction framework that enhances data quality and addresses sparsity. SARF analyzes datasets, partitioning the data into segments with similar characteristics, and reconstructs the data in each segment individually by selecting a reconstruction technique that is tailored to the internal temporal-spatial characteristics of the dataset. Through extensive experiments on two representative datasets - mobile application measurements and IoT sensor data from low-cost air quality sensors - we demonstrate that the targeted adaptation of reconstruction strategies employed by SARF significantly enhances the quality of reconstructed data. Our results show the robustness of SARF's performance across spatiotemporal variations, outperforming current state-of-the-art methods by margins up to 68% on average (74% for compressive sensing, 53% for convolutional sparse coding, 78% for deep learning). These findings underscore SARF's potential to enhance data-driven insights across multiple domains, paving the way for more robust analyses of sparsity-affected datasets.

  • Research Article
  • 10.1109/tbdata.2025.3624982
Taylor-Sensus Network: Embracing Noise to Enlighten Uncertainty for Scientific Data
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Guangxuan Song + 4 more

Uncertainty estimation is vital for machine learning with scientific data. Although existing methods effectively address inherent model uncertainty, they frequently overlook explicit modeling of complex noise in data devoid of temporal or spatial dependencies. This gap is especially challenging in structured scientific data, where such dependencies are commonly lacking. To address these challenges in scientific research, we propose the Taylor-Sensus Network (TSNet). TSNet uses a Taylor series expansion to model complex, heteroscedastic noise and proposes a deep Taylor block that is aware of the noise distribution. TSNet includes a noise-aware contrastive learning module for aleatoric uncertainty and a data density perception module for epistemic uncertainty. Additionally, an uncertainty combination operator is used to integrate these uncertainties, and the network is trained using a novel heteroscedastic mean square error loss. TSNet demonstrates superior performance over mainstream and state-of-the-art methods in experiments, highlighting its potential in scientific research and noise resistance.

  • Research Article
  • 10.1109/tbdata.2025.3618478
ACJoin: A Low-Latency Multi-Table Join Order Selection Model With Minimum Cost Using Asynchronous Advantage Actor-Critic
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Shaojie Qiao + 5 more

Join order selection is one of the most challenging problems in query optimization and plays an essential role in providing high query performance in Big Data management. Researchers have applied deep reinforcement learning methods, for example, Rejoin and DQ, to join order selection in order to obtain high query performance. However, Rejoin and DQ cannot capture the structural characteristics of the join tree, which may lead to similar encodings for different execution plans. To tackle these challenges, we propose a new learned optimizer called ACJoin (asynchronous advantage Actor-Critic for multi-table Join order selection). ACJoin employs a new encoding method that captures the structural characteristics of the join tree by integrating a GRU (Gated Recurrent Unit). In particular, ACJoin can distinguish different execution plans. It uses A3C (Asynchronous Advantage Actor-Critic) to guide the join order selection and reduce the time taken to find the best query plan with the minimum cost. Compared with existing search strategies, ACJoin can find the globally optimal solution with efficient and stable query performance. Extensive experiments are conducted on the real JOB and the synthetic TPC-H datasets. The results show that ACJoin outperforms the state-of-the-art join order selection methods and DRL (Deep Reinforcement Learning)-based methods in cost and latency.

  • Research Article
  • Cite Count: 2
  • 10.1109/tbdata.2025.3624955
NTFormer: A Composite Node Tokenized Graph Transformer for Node Classification
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Jinsong Chen + 2 more

Tokenized graph Transformers have advanced node classification by transforming graphs into token sequences, but existing methods suffer from limited flexibility due to single-type token generation, which captures only partial graph information and requires tailored modifications. To address this, we propose NTFormer, a novel graph Transformer with a dedicated token generator called Node2Par. Node2Par constructs diverse token sequences for each node using multiple token elements (i.e., neighborhood tokens and node tokens) from both the topology view and the attribute view, enabling comprehensive expression of graph features from multiple perspectives. Leveraging the outputs of Node2Par, NTFormer adopts a standard Transformer backbone without additional graph-aware modules and a learnable information fusion strategy to adaptively learn expressive node representations from the different generated token sequences, eliminating the need for tailored encoding strategies. Extensive experiments on benchmark datasets including homophily and heterophily graphs show that NTFormer outperforms representative graph Transformers and GNNs in node classification.

  • Research Article
  • 10.1109/tbdata.2025.3640011
Lafa: Unlocking Superior Memory Efficiency via Adaptive Metadata Strategy for Scalable Large-Scale Dataset Loading
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Cong Wang + 8 more

The rapid growth of deep learning models and the increasing demand for large-scale datasets have posed unprecedented challenges for data loading and memory management. Existing frameworks (e.g., PyTorch, TensorFlow) often encounter performance bottlenecks when handling large datasets, resulting in inefficiencies and excessive memory usage. To address these issues, we propose Lafa, a dynamic metadata loading mechanism optimized for efficient large-scale dataset processing. Lafa introduces the .Lafa format and an adaptive loading strategy with three modes to balance memory usage and loading performance, along with a local shuffle approach that reduces memory overhead and computational complexity while preserving data randomness. Experimental results on GPU (RTX 3090) and Ascend (910A) platforms demonstrate that Lafa significantly improves memory efficiency compared to existing frameworks. Specifically, for every 10 million samples loaded, Lafa reduces additional memory consumption by a factor of 1.33× to 31.34× across various dataset types, relative to the most memory-efficient baseline among PyTorch, TensorFlow, and MindSpore.
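The idea behind a local shuffle can be sketched as follows. The abstract does not describe Lafa's actual algorithm, so this is only one common interpretation: shuffle the order of fixed-size blocks, then shuffle indices inside each block, so that only one block of indices is materialized at a time. The function name `local_shuffle_indices` and its parameters are hypothetical.

```python
import random


def local_shuffle_indices(num_samples, block_size, seed=0):
    """Yield every index in [0, num_samples) exactly once, shuffled block by block.

    Only the list of block start offsets and one block of indices are held in
    memory, instead of a full permutation of all sample indices.
    """
    rng = random.Random(seed)
    block_starts = list(range(0, num_samples, block_size))
    rng.shuffle(block_starts)  # randomize the order in which blocks are visited
    for start in block_starts:
        block = list(range(start, min(start + block_size, num_samples)))
        rng.shuffle(block)  # randomize the order inside the current block
        yield from block


order = list(local_shuffle_indices(10, block_size=4))
```

The trade-off in this sketch is that samples from the same block always appear close together, so a larger `block_size` gives better randomness at the price of more memory.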

  • Research Article
  • 10.1109/tbdata.2025.3639992
A High-Throughput Method for Fabric in Scenarios With Multiple Aborted Transactions
  • Apr 1, 2026
  • IEEE Transactions on Big Data
  • Yan Wang + 4 more

Hyperledger Fabric is one of the most popular consortium blockchains and is widely used in many fields, such as healthcare, government, and education. However, in practical application scenarios, Fabric faces the challenge of handling a large number of aborted transactions. These transaction failures are primarily caused by read-write conflicts resulting from read-write locks during the execution and validation phases, as well as transaction ordering dependencies in the ordering phase. The frequent occurrence of aborted transactions significantly reduces system throughput, as each failed transaction not only wastes computational resources and network bandwidth but also requires reprocessing. Therefore, this paper proposes a solution called Fabric*, which includes a transaction reordering mechanism, a cache queue mechanism, and a lock-free mechanism. The first two mechanisms are used to abort conflicting transactions early in the ordering phase and to avoid aborting conflicting transactions again. The third mechanism is used to abort transactions that read obsolete data during the simulation phase. Fabric* reduces aborted transactions and increases the throughput of successful transactions in the system. Experimental results show that Fabric* improves throughput by up to about 23.9% and 7.4% over the original Fabric and Fabric++, respectively.
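A minimal sketch of how read-write conflicts lead to aborted transactions at validation time is given below. It is a simplified model rather than Fabric's or Fabric*'s code: versions are plain integer counters (an assumption made for brevity), and `validate_block` is a hypothetical helper. A transaction is valid only if every key it read during simulation still has the version it observed.

```python
def validate_block(state_versions, transactions):
    """Validate transactions in block order; return one boolean per transaction.

    state_versions: dict mapping key -> committed version (updated in place).
    transactions: list of (read_set, write_keys), where read_set maps each key
    to the version seen during simulation and write_keys is a set of keys.
    """
    results = []
    for read_set, write_keys in transactions:
        # A read is stale if an earlier valid transaction bumped the version.
        valid = all(state_versions.get(key, 0) == version
                    for key, version in read_set.items())
        if valid:
            for key in write_keys:
                state_versions[key] = state_versions.get(key, 0) + 1
        results.append(valid)
    return results


state = {"x": 1, "y": 1}
block = [
    ({"x": 1}, {"x"}),  # reads x at version 1 and writes x: valid
    ({"x": 1}, {"y"}),  # its read of x is now stale: aborted
    ({"y": 1}, {"y"}),  # y has not changed yet: valid
]
outcome = validate_block(state, block)
```

In this example the second transaction is aborted only because of where it sits in the block, which illustrates why reordering transactions or aborting conflicting ones early can raise the throughput of successful transactions.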