AlphaLexChinese: Measuring lexical complexity in Chinese texts and its predictive validity for L2 writing scores
- Research Article
- 10.1145/3744250
- Aug 7, 2025
- ACM Transactions on Knowledge Discovery from Data
Text style transfer plays a vital role in online entertainment and social media. However, existing models struggle to handle the complexity of Chinese long texts, such as rhetoric, structure, and culture, which restricts their broader application. To bridge this gap, we propose a Chinese Article-style Transfer (CAT-LLM) framework, which addresses the challenges of style transfer in complex Chinese long texts. At its core, CAT-LLM features a bespoke pluggable Text Style Definition (TSD) module that integrates machine learning algorithms to analyze and model article styles at both word and sentence levels. This module acts as a bridge, enabling large language models (LLMs) to better understand and adapt to the complexities of Chinese article styles. Furthermore, it supports the dynamic expansion of internal style trees, enabling the framework to seamlessly incorporate new and diverse style definitions, enhancing adaptability and scalability for future research and applications. Additionally, to facilitate robust evaluation, we created 10 parallel datasets using a combination of ChatGPT and various Chinese texts, each corresponding to distinct writing styles, significantly improving the accuracy of the model evaluation and establishing a novel paradigm for text style transfer research. Extensive experimental results demonstrate that CAT-LLM, combined with GPT-3.5-Turbo, achieves state-of-the-art performance, with a transfer accuracy F1 score of 79.36% and a content preservation F1 score of 96.47% on the “Fortress Besieged” dataset. These results highlight CAT-LLM’s innovative contributions to style transfer research, including its ability to preserve content integrity while achieving precise and flexible style transfer across diverse Chinese text domains. Building on these contributions, CAT-LLM presents significant potential for advancing Chinese digital media and facilitating automated content creation. Source code is available at GitHub ( https://github.com/TaoZhen1110/CAT-LLM ).
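As a rough illustration only (the paper's TSD module and prompts are not reproduced here), the sketch below derives a few word- and sentence-level style statistics from a reference article and folds them into a style-transfer prompt for an LLM such as GPT-3.5-Turbo. The feature choices, function names, and prompt wording are all assumptions.

```python
# Minimal sketch, not the CAT-LLM TSD module: compute crude word- and
# sentence-level style statistics from a reference text and build a prompt.
import re
from collections import Counter

def style_profile(text: str) -> dict:
    """Crude word/sentence-level style statistics for a Chinese article."""
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    return {
        "avg_sentence_len": sum(len(s) for s in sentences) / max(len(sentences), 1),
        "num_sentences": len(sentences),
        "top_chars": Counter(chars).most_common(5),
    }

def build_prompt(profile: dict, source_text: str) -> str:
    # Illustrative prompt wording; a real system would send this to an LLM.
    return (
        "Rewrite the following passage so that it matches this style profile:\n"
        f"- average sentence length: {profile['avg_sentence_len']:.1f} characters\n"
        f"- frequent characters: {profile['top_chars']}\n\n"
        f"Passage:\n{source_text}"
    )

reference = "围城外面的人想冲进去。围城里面的人想逃出来。"
profile = style_profile(reference)
print(build_prompt(profile, "今天天气很好，大家都出去散步了。"))
```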
- Research Article
- 10.1145/3625390
- Apr 10, 2024
- ACM Transactions on Asian and Low-Resource Language Information Processing
Because of the complexity of Chinese and its differences from English, applying Chinese text in digital settings poses particular challenges. Taking Chinese text in Open Relation Extraction (ORE) as the research object, this work analyzes that complexity and builds a word-vector extraction system based on construction grammar theory and Deep Learning (DL) to extract relations smoothly from Chinese text. The work covers the following aspects. First, to study the application of DL to the complexity analysis of Chinese text grounded in construction grammar, the content of construction grammar and its role in Chinese text analysis are explored. Second, from the perspective of ORE over word vectors in language analysis, an ORE model based on word vectors is implemented, and an extraction method based on the distance between word vectors is proposed. Test results show that the proposed algorithm reaches an F1 value of 67% on the public WEB-500 and NYT-500 datasets, outperforming comparable text extraction algorithms. When the recall rate exceeds 30%, the accuracy of the proposed method is higher than that of several recent language analysis systems. This indicates that the proposed Chinese text extraction system, based on the DL algorithm and construction grammar theory, has advantages in complexity analysis and offers a new direction for Chinese text analysis research.
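The abstract only states that extraction relies on the distance between word vectors, so the snippet below is a hedged guess at the general idea: cosine distance in an embedding space ranks candidate relation words for an entity pair. The random vectors, the midpoint anchor, and the toy vocabulary are invented for illustration and are not the paper's method.

```python
# Illustrative sketch: rank candidate relation words for an (entity, entity)
# pair by cosine distance to an assumed anchor vector (average of the entities).
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in ["公司", "收购", "成立", "位于"]}

head, tail = vocab["公司"], vocab["位于"]
anchor = (head + tail) / 2  # assumption: average of the two entity vectors

candidates = ["收购", "成立"]
ranked = sorted(candidates, key=lambda w: cosine_distance(vocab[w], anchor))
print("candidate relation words, nearest first:", ranked)
```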
- Conference Article
- 10.1109/cisce55963.2022.9851162
- May 27, 2022
A large number of text images come from natural scenes. However, most are taken with non-professional cameras such as mobile phones, and their low resolution greatly reduces the readability of the text in the images. Text-image super-resolution methods aim to enhance a low-resolution image to obtain a clearer text image. With the emergence of convolutional neural networks, image super-resolution technology has developed rapidly, but traditional super-resolution methods are designed mainly for natural-scene images and are usually not suitable for text. Many scholars have proposed super-resolution networks for text images, yet most target English text; owing to the complexity and diversity of Chinese, these networks are not suitable for Chinese text. To address these shortcomings, a Chinese text dataset is generated and a lightweight text-image super-resolution network based on MobileViT is proposed. Experiments show that the proposed CTW dataset is meaningful for the study of Chinese text-image super-resolution, and the new method performs better on PSNR/SSIM metrics, effectively restoring blurred low-resolution images to clear high-resolution ones.
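A small, self-contained example of the PSNR/SSIM metrics the paper reports (not its MobileViT network): the random image pair stands in for a ground-truth/restored pair, and scikit-image is assumed to be installed.

```python
# Sketch of the evaluation metrics only: PSNR in NumPy, SSIM via scikit-image.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)                  # ground-truth stand-in
sr = np.clip(hr + rng.normal(0, 5, hr.shape), 0, 255).astype(np.uint8)    # "restored" stand-in

print("PSNR:", psnr(hr, sr))
print("SSIM:", structural_similarity(hr, sr, data_range=255))
```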
- Research Article
- 10.1155/2021/5933652
- Nov 24, 2021
- Scientific Programming
The medical information carried in electronic medical records has high clinical research value, and medical named entity recognition is the key to extracting valuable information from large-scale medical texts. At present, most studies on Chinese medical named entity recognition are based on either a character-vector model or a word-vector model; owing to the complexity and specificity of Chinese text, such methods may fail to achieve good performance. In this study, we propose a Chinese medical named entity recognition method that fuses character and word vectors: the text is represented as character vectors and word vectors separately, and the two are fused within the model as features. The proposed model can effectively avoid the problems of missing character-level information and inaccurate word segmentation. On the CCKS 2019 dataset for named entity recognition in Chinese electronic medical records, the proposed model performs well and improves the accuracy of Chinese medical named entity recognition compared with other baseline models.
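A hedged sketch of the general character-word fusion idea, not the authors' exact model: each character position is represented by its character embedding concatenated with the embedding of the word containing it, then tagged by a BiLSTM. Vocabulary sizes, dimensions, tag count, and the toy batch are all assumptions.

```python
# Minimal character + word vector fusion tagger (illustrative dimensions).
import torch
import torch.nn as nn

class CharWordNER(nn.Module):
    def __init__(self, n_chars=3000, n_words=20000, char_dim=64, word_dim=128,
                 hidden=128, n_tags=9):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.encoder = nn.LSTM(char_dim + word_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.tagger = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids, word_ids):
        # word_ids repeats a word's id for every character inside that word.
        fused = torch.cat([self.char_emb(char_ids), self.word_emb(word_ids)], dim=-1)
        out, _ = self.encoder(fused)
        return self.tagger(out)          # (batch, seq_len, n_tags) emission scores

model = CharWordNER()
chars = torch.randint(0, 3000, (2, 20))   # toy batch: 2 sentences, 20 characters each
words = torch.randint(0, 20000, (2, 20))
print(model(chars, words).shape)          # torch.Size([2, 20, 9])
```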
- Research Article
- 10.1186/s40535-015-0011-9
- Jul 3, 2015
- Applied Informatics
This study proposes our extended method for assessing structural complexity of symbol-free sequences such as literal texts, DNA sequences, rhythm, and musical input. The method is based on the L-system and on topological entropy for context-free grammars. Inputs are represented as binary trees, with different input features captured separately in the tree structure and in the node contents. Our method infers the tree-generating grammar and estimates its complexity. The study reviews our previous results on texts and DNA sequences and provides new information about them. We also present new results measuring the complexity of Chinese classical texts and of music samples with rhythm and melody components. Our method is sensitive enough to extract quasi-regular structured fragments of Chinese texts and to detect irregularly styled music samples. To our knowledge, no other method can detect such quasi-regular patterns.
- Conference Article
- 10.1109/iske47853.2019.9170461
- Nov 1, 2019
Relation classification is a fundamental ingredient in various information extraction systems. To extract personal entity relations from Chinese text, a novel deep neural network architecture is proposed in this paper, which employs a bidirectional Gated Recurrent Unit (Bi-GRU) with an attention mechanism to capture important semantic information in a sentence without hand-crafted features. Considering the complexity of Chinese text, the word representation is obtained by concatenating word embeddings and character embeddings. In addition, the relative distances of the current word to the entities are added to the word representation to improve relation classification performance. Finally, experimental results demonstrate that the proposed model is more effective than state-of-the-art methods.
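The sketch below assembles the ingredients named in the abstract under assumed dimensions: word embeddings, character-level features, two relative-position embeddings (distance to each entity), a bidirectional GRU, and an attention layer that pools the sequence for relation classification. It is an illustration, not the paper's released code.

```python
# Attention-pooled Bi-GRU relation classifier (illustrative dimensions).
import torch
import torch.nn as nn

class AttBiGRURE(nn.Module):
    def __init__(self, n_words=20000, word_dim=100, char_feat_dim=50,
                 max_dist=100, pos_dim=10, hidden=100, n_relations=10):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.pos1_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distance to entity 1
        self.pos2_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distance to entity 2
        in_dim = word_dim + char_feat_dim + 2 * pos_dim
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_relations)

    def forward(self, words, char_feats, d1, d2):
        x = torch.cat([self.word_emb(words), char_feats,
                       self.pos1_emb(d1), self.pos2_emb(d2)], dim=-1)
        h, _ = self.gru(x)                                    # (B, T, 2*hidden)
        weights = torch.softmax(self.att(h).squeeze(-1), dim=1)
        pooled = torch.bmm(weights.unsqueeze(1), h).squeeze(1)
        return self.out(pooled)

model = AttBiGRURE()
B, T = 2, 30
logits = model(torch.randint(0, 20000, (B, T)),
               torch.randn(B, T, 50),                         # stand-in char-level features
               torch.randint(0, 201, (B, T)),
               torch.randint(0, 201, (B, T)))
print(logits.shape)   # torch.Size([2, 10])
```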
- Conference Article
- 10.1109/ccis48116.2019.9073727
- Dec 1, 2019
Sentiment classification has remained one of the most popular research directions over the past decade. Moreover, the information contained in Chinese text is by no means limited to positive or negative emotions. In view of the complexity of emotional texts, this paper combines multiple tags to increase the accuracy of sentiment analysis. Because traditional methods are mostly based on single-direction propagation, which weakens semantic correlations, this paper applies word2vec and a bidirectional LSTM to train a model that captures the semantic correlations between forward and backward text. Experimental results indicate that the accuracy of emotion prediction is significantly improved.
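A minimal sketch of that pipeline under assumed hyperparameters: word2vec vectors trained with gensim are loaded into an embedding layer and fed to a bidirectional LSTM classifier. The two-sentence toy corpus and the three-class output head are invented, not the paper's data or label set.

```python
# Illustrative word2vec + BiLSTM sentiment pipeline (toy corpus, assumed sizes).
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = [["电影", "很", "好看"], ["剧情", "太", "差", "了"]]      # pre-segmented comments
w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)

emb = nn.Embedding.from_pretrained(torch.tensor(w2v.wv.vectors), freeze=False)
lstm = nn.LSTM(50, 64, batch_first=True, bidirectional=True)
head = nn.Linear(2 * 64, 3)                      # e.g. negative / neutral / positive

ids = torch.tensor([[w2v.wv.key_to_index[w] for w in corpus[0]]])
h, _ = lstm(emb(ids))
logits = head(h[:, -1, :])                       # last-step representation
print(logits.shape)                              # torch.Size([1, 3])
```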
- Conference Article
- 10.1109/ict4da56482.2022.9971306
- Nov 28, 2022
Text complexity is the level of difficulty a document presents to its target readers. One common type is lexical complexity, which causes comprehensibility problems for second language learners and children and is also challenging for NLP applications. To reduce this complexity for Amharic, a low-resource and morphologically rich language, we designed a complexity detection and lexical simplification model using a machine learning approach. For these tasks we developed three successive models. The first classifies text complexity and is trained on 19k sentences. The second detects specific complex terms and is built from 1,002 unique complex terms. Lastly, we trained a word2vec (CBOW) model on 57.6k sentences (containing 9,756 unique tokens) for simplification generation and ranking. The classification model achieves an accuracy of 88% (LSTM), 88% (BiLSTM), and 91% (BERT). Simplification generation for the identified complex terms using cosine similarity yields 92% for the top-ranked and 61% for the lowest-ranked simplest equivalents among the five top generated words.
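To make the substitution-generation step concrete, the sketch below trains a tiny CBOW word2vec model with gensim and ranks candidate replacements for a flagged term by cosine similarity. The English toy corpus and the target word are placeholders, not the paper's Amharic data or its classifiers.

```python
# Candidate generation and ranking by cosine similarity from a CBOW model.
from gensim.models import Word2Vec

corpus = [
    ["the", "physician", "treated", "the", "patient"],
    ["the", "doctor", "treated", "the", "patient"],
    ["the", "doctor", "helped", "the", "patient"],
] * 50                                        # repeat so the toy model has some signal

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=20)

complex_term = "physician"
candidates = cbow.wv.most_similar(complex_term, topn=3)   # (word, cosine similarity)
for word, score in candidates:
    print(f"{word}\t{score:.3f}")
```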
- Research Article
- 10.1080/10888438.2023.2244620
- Aug 11, 2023
- Scientific Studies of Reading
Purpose: This study sought to 1) identify linguistic features important for Chinese text complexity with a theory-based and systematic approach, and 2) address how feature sets and algorithms affect the performance of Chinese text complexity models. Method: Texts from Chinese language arts textbooks for Grades 1 to 6 (N = 1,478) in Mainland China were analyzed. The predictor variables were 265 linguistic features of the texts: 154 lexical features and 111 sentence and discourse features. The outcome variable was the complexity level of each text, measured on a one-semester scale and thus comprising 12 levels in total (two semesters per grade). Results: Features in the categories of character and word frequency, character and word semantics, lexical diversity, part-of-speech syntactic categories, and referential cohesion were found to be the most important. With the important features identified, text complexity models using features at all levels outperformed those using features at only one level, and models using the two machine learning algorithms (Random Forest Regression and Support Vector Regression) outperformed those using Linear Regression. Conclusion: This work clarifies important linguistic features for Chinese text complexity and points to the necessity of considering features across levels and of using machine learning algorithms in future text complexity research.
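A sketch of the modelling comparison only (the study's 265-feature matrix is not reproduced): regress a complexity level on a linguistic-feature matrix and compare Linear Regression, Support Vector Regression, and Random Forest Regression by cross-validated R². All data below are randomly generated stand-ins.

```python
# Compare regression algorithms for text complexity prediction (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                    # 300 texts x 40 linguistic features
y = rng.integers(1, 13, size=300).astype(float)   # complexity level: 12 semester steps

for name, model in [("Linear", LinearRegression()),
                    ("SVR", SVR(kernel="rbf")),
                    ("RandomForest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>12}: mean R^2 = {r2.mean():.3f}")
```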
- Supplementary Content
- 10.1016/0016-0032(75)90075-7
- Aug 1, 1975
- Journal of the Franklin Institute
The social and intellectual value of large projects
- Research Article
- 10.1016/j.asw.2005.02.001
- Jan 1, 2005
- Assessing Writing
Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL
- Research Article
- 10.62051/ijcsit.v5n1.09
- Jan 23, 2025
- International Journal of Computer Science and Information Technology
Sentiment analysis of COVID-19-related content on Weibo is of significant importance for studying public sentiment during the pandemic and economic recovery. Due to the lack of well-annotated Chinese Weibo COVID-19 data (such as the Weibo NCOV dataset), as well as the emotional complexity and ambiguity of Chinese Weibo texts, this paper proposes an innovative sentiment analysis model for Chinese Weibo COVID-19 data, namely BERT-BiLSTM-Attention. The model first encodes Weibo comment data using BERT to enhance the semantic feature representation of the text and improve its contextual understanding. Next, BiLSTM is used to enrich the contextual information of the Weibo text, helping to extract important and effective information from the text sequences. Finally, an Attention mechanism is employed to quickly capture the most relevant information. Experimental results show that the model is effective in sentiment analysis of Weibo COVID-19 data, achieving an accuracy of 88.2%. It can be concluded that the proposed model significantly improves the performance of Weibo text classification and demonstrates strong generalizability, making it suitable for sentiment analysis in various fields.
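A rough sketch of a BERT-BiLSTM-Attention classifier under assumed sizes; the "bert-base-chinese" checkpoint, hidden dimensions, and the single toy comment are stand-ins rather than the paper's training setup.

```python
# BERT encoder -> BiLSTM -> attention pooling -> sentiment classes (illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTMAttention(nn.Module):
    def __init__(self, n_classes=3, hidden=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        tokens = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(tokens)                                 # (B, T, 2*hidden)
        scores = self.att(h).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding
        weights = torch.softmax(scores, dim=1)
        pooled = torch.bmm(weights.unsqueeze(1), h).squeeze(1)
        return self.cls(pooled)

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["疫情期间大家都很辛苦"], return_tensors="pt", padding=True)
model = BertBiLSTMAttention()
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([1, 3])
```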
- Research Article
- 10.17507/tpls.1509.08
- Sep 3, 2025
- Theory and Practice in Language Studies
This study investigates the performance of artificial intelligence (AI) in the translation of legislative texts, focusing on the quality of translations produced by ChatGPT 4o and DeepL Pro. Using TAALED and NeoSCA, we evaluated and compared a range of lexical diversity and syntactic complexity indices in AI-generated and human translations of twenty Chinese legislative texts, and we used JASP to calculate Bayes Factors when comparing the human and AI translations. Our findings indicate that while the AI models demonstrate notable strengths in function-word diversity and coordinate syntactic structures, they still lag behind human translators in overall lexical diversity and syntactic complexity. The study underscores the potential and limitations of AI in legal translation, highlighting the necessity of human-AI collaboration to achieve high-quality translations in this specialized field.
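Not the TAALED/NeoSCA toolchain itself, but two of the simpler lexical diversity indices such tools report, shown here to make the comparison concrete: type-token ratio (TTR) and a moving-average TTR over a fixed window. The token lists are placeholders for tokenised translations.

```python
# Two basic lexical diversity indices: TTR and moving-average TTR (MATTR).
def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    if len(tokens) <= window:
        return ttr(tokens)
    windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(w) for w in windows) / len(windows)

human = "the party shall perform its obligations under this contract".split()
machine = "the party shall perform the obligations under the contract".split()

print("TTR   human vs machine:", ttr(human), ttr(machine))
print("MATTR human vs machine:", mattr(human, window=5), mattr(machine, window=5))
```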
- Research Article
- 10.1080/15434303.2017.1405421
- Jan 2, 2018
- Language Assessment Quarterly
The main objective of this article is to demonstrate with the help of learner corpus data the practical relevance of the phraseological dimension of language for writing assessment in higher education. Phraseological competence is now widely recognized as an important part of fluent and idiomatic language use, but its development has not received the attention it deserves in the CEFR. The study investigates the development of linguistic correlates of syntactic, lexical, and phraseological complexity in learner texts at B2, C1, and C2 and shows that while no measure of syntactic or lexical complexity seems to have an impact on human raters’ overall judgement of writing quality, two measures of phraseological complexity explain 25% of the variance in the data set. Results suggest that incorporating phraseological competence into the scoring rubrics of university entrance language tests would help language test developers add construct validity to language assessment in higher education. More generally, this study also shows the crucial role that Language for Specific Purposes learner corpora could play in language assessment.
- Research Article
- 10.1515/iral-2022-0236
- Aug 14, 2023
- International Review of Applied Linguistics in Language Teaching
Lexical complexity has been a key consideration in teaching preparation when determining the grade appropriateness of teaching materials. However, the lack of quantified and defined standards for benchmarking lexical complexity has made it difficult for teachers to adapt source texts to target learners. This study assessed quantitative differences in the lexical complexity of exemplar texts at different points of schooling using a range of lexical diversity and sophistication features. The data consist of 2,372 texts from popular curriculum packages adopted across Grades 1 to 12 of the English curriculum in China. One-way ANOVAs revealed significant differences in 16 out of 17 lexical complexity indices across grades, and subsequent post hoc tests identified three lexical diversity features and four sophistication features that help to differentiate exemplar texts across the 12 grades. These findings on the nature and role of lexical complexity yield new insights into the establishment of grade-level benchmarks for material preparation.
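A sketch of the statistical step only: a one-way ANOVA on a single lexical complexity index across grade bands with scipy, using synthetic values in place of the study's 2,372 curriculum texts.

```python
# One-way ANOVA on a lexical complexity index across grade bands (synthetic data).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Pretend index values (e.g. lexical sophistication) for three grade bands.
grade_1_4 = rng.normal(loc=0.30, scale=0.05, size=80)
grade_5_8 = rng.normal(loc=0.35, scale=0.05, size=80)
grade_9_12 = rng.normal(loc=0.42, scale=0.05, size=80)

f_stat, p_value = f_oneway(grade_1_4, grade_5_8, grade_9_12)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
# A significant p-value would motivate post hoc tests (e.g. Tukey HSD) to see
# which grade bands differ, mirroring the study's follow-up analysis.
```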