Integrating Embedding and LSHiForest in English Text Anomaly Detection

Qingquan Tong, Rongju Yao

doi:10.1002/cpe.8370

Abstract

In natural language processing (NLP), anomaly detection plays a critical role in identifying irregularities and outliers within textual data. Traditional methods often struggle with the high-dimensional, sparse nature of text data, which makes them inefficient at detecting meaningful anomalies, especially in big data applications. To address these challenges, this paper proposes integrating LSHiForest (Locality-Sensitive Hashing Isolation Forest) into English text anomaly detection. LSHiForest combines the dimensionality reduction capabilities of locality-sensitive hashing (LSH) with the robust outlier detection of Isolation Forest, offering a novel way to handle the complexities of textual data. The proposed approach transforms English text into feature vectors and then applies LSHiForest to detect anomalies across various text datasets. Its effectiveness is evaluated through comparative experiments against traditional anomaly detection methods, using several performance metrics. The experimental results demonstrate that LSHiForest significantly improves the efficiency and accuracy of outlier identification in English text, particularly for large-scale, high-dimensional datasets.
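The two-stage pipeline described above (text to feature vectors, then an isolation forest whose splits come from hashing) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the embedding or the LSH family, so this sketch assumes a hashed bag-of-words vector in place of the embedding and a random-projection split with a random threshold in place of the LSH function. All names (`embed`, `build_tree`, `anomaly_scores`) and parameter values are hypothetical.

```python
import math
import random
import re
import zlib


def embed(text, dim=128):
    """Assumed stand-in for the embedding: hashed word counts, L2-normalised."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z']+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def _c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Split points recursively with a random projection and a random cut."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    proj = [sum(d * x for d, x in zip(direction, p)) for p in points]
    lo, hi = min(proj), max(proj)
    if lo == hi:  # all points project to the same value; cannot split further
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p, v in zip(points, proj) if v < cut]
    right = [p for p, v in zip(points, proj) if v >= cut]
    return {
        "direction": direction,
        "cut": cut,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(node, point):
    """Depth at which the point lands, plus the usual leaf-size adjustment."""
    depth = 0
    while "cut" in node:
        value = sum(d * x for d, x in zip(node["direction"], point))
        node = node["left"] if value < node["cut"] else node["right"]
        depth += 1
    return depth + _c(node["size"])


def anomaly_scores(texts, n_trees=100, sample_size=256, seed=0):
    """Return one score per text in (0, 1]; higher means easier to isolate."""
    rng = random.Random(seed)
    points = [embed(t) for t in texts]
    m = min(sample_size, len(points))
    max_depth = max(1, math.ceil(math.log2(m)))
    trees = [build_tree(rng.sample(points, m), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = _c(m) or 1.0
    return [2.0 ** (-sum(path_length(t, p) for t in trees) / n_trees / norm)
            for p in points]


# Toy corpus: twelve near-identical sentences and one off-topic sentence.
endings = ["today", "again", "now", "daily", "often", "twice",
           "once", "slowly", "early", "late", "nightly", "weekly"]
docs = ["the quick brown fox jumps over the lazy dog " + e for e in endings]
docs.append("quarterly revenue exceeded analyst forecasts significantly")
scores = anomaly_scores(docs)
```

In this toy setting the off-topic sentence shares no vocabulary with the others, so random cuts tend to separate it after fewer splits and it receives a higher score. A real system would replace `embed` with a learned text embedding and would likely subsample far larger corpora per tree.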
