Abstract
In natural language processing (NLP), anomaly detection identifies irregularities and outliers in textual data. Traditional methods often struggle with the high‐dimensional, sparse nature of text, which makes it hard to detect meaningful anomalies efficiently, especially in big data applications. To address these challenges, this paper proposes integrating LSHiForest (Locality‐Sensitive Hashing Isolation Forest) into English text anomaly detection. LSHiForest combines the dimensionality reduction of locality‐sensitive hashing (LSH) with the robust outlier detection of Isolation Forest, offering a novel way to handle the complexities of textual data. The proposed approach transforms English text into feature vectors and then applies LSHiForest to detect anomalies across various text datasets. Its effectiveness is evaluated in comparative experiments against traditional anomaly detection methods, using several performance metrics. The experimental results show that LSHiForest significantly improves the efficiency and accuracy of outlier identification in English text, particularly on large‐scale, high‐dimensional datasets.
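The two-stage pipeline described above (text to feature vectors, then LSH-based isolation) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain bag-of-words count vectors, uses binary splits by random hyperplanes (the sign-random-projection LSH family for angular distance) in place of the paper's LSHiForest trees, and skips subsampling. All function names and the toy corpus are hypothetical. Documents with a shorter average path length are isolated sooner and are therefore treated as more anomalous.

```python
import math
import random
from collections import Counter


def vectorize(docs):
    """Turn each document into a dense bag-of-words count vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vectors


def expected_depth(n):
    """Approximate average path length for n unresolved points (iForest's c(n))."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def path_lengths(vectors, ids, depth, max_depth, rng, out):
    """Split `ids` recursively with random hyperplanes and record path lengths."""
    if len(ids) <= 1 or depth >= max_depth:
        for i in ids:
            out[i] = depth + expected_depth(len(ids))
        return
    # One LSH bit: the side of a random hyperplane through the origin.
    plane = [rng.gauss(0.0, 1.0) for _ in range(len(vectors[0]))]
    left, right = [], []
    for i in ids:
        dot = sum(p * x for p, x in zip(plane, vectors[i]))
        (left if dot >= 0.0 else right).append(i)
    path_lengths(vectors, left, depth + 1, max_depth, rng, out)
    path_lengths(vectors, right, depth + 1, max_depth, rng, out)


def average_path_lengths(docs, n_trees=100, seed=0):
    """Average each document's path length over `n_trees` random trees."""
    vectors = vectorize(docs)
    rng = random.Random(seed)
    n = len(vectors)
    max_depth = math.ceil(math.log2(max(n, 2)))
    totals = [0.0] * n
    for _ in range(n_trees):
        out = {}
        path_lengths(vectors, list(range(n)), 0, max_depth, rng, out)
        for i, length in out.items():
            totals[i] += length
    return [total / n_trees for total in totals]


# Toy corpus (made up for illustration): eight similar sentences and one outlier.
regions = ["north", "south", "east", "west", "central", "coastal", "inland", "urban"]
docs = [
    f"the quarterly report shows steady growth in sales across the {r} region"
    for r in regions
]
docs.append("buy cheap pills online now click here")

lengths = average_path_lengths(docs)
most_anomalous = min(range(len(docs)), key=lengths.__getitem__)
print(docs[most_anomalous])
```

In this sketch, texts that point in similar directions in feature space tend to fall on the same side of a random hyperplane and stay grouped for many splits, while a document with unrelated vocabulary is separated after only a few. A real pipeline would likely use weighted features (for example TF-IDF) and subsample the data for each tree.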