Abstract. The Internet produces a vast number of text documents every day, offering rich information but making it hard to extract key points quickly. Manual reading is too slow to keep pace with this volume, which makes automatic text summarization a crucial tool. This paper reviews the application and bias analysis of automatic text summarization across various fields, focusing on two shortcomings of traditional methods: inaccurate selection of initial cluster centers and high summary redundancy. These problems arise especially with long texts, where the resulting summaries fail to reflect the core content accurately. To address them, the paper introduces an automatic text summarization method based on an improved TextRank algorithm and K-Means clustering, which optimizes initial cluster center selection and the cluster refinement strategy to improve summary accuracy and readability. In addition, the widespread use of pre-trained language models introduces potential biases that can propagate to downstream tasks and affect the accuracy and impartiality of summaries. The paper therefore discusses name-nationality bias in pre-trained language models and its propagation to text summarization tasks, offering a theoretical foundation and practical guidance for developing fairer and more reliable Natural Language Processing (NLP) systems.