Adversarial training flat-lattice transformer for named entity recognition of Chinese legal texts
- 169 · 10.3115/1119176.1119196 · Jan 1, 2003
- 956 · 10.1109/tkde.2020.2981314 · Jan 1, 2022 · IEEE Transactions on Knowledge and Data Engineering
- 2 · 10.1109/icaica52286.2021.9498036 · Jun 28, 2021
- 11 · 10.1088/1742-6596/1592/1/012040 · Aug 1, 2020 · Journal of Physics: Conference Series
- 2432 · 10.18653/v1/p19-1285 · Jan 1, 2019
- 178 · 10.18653/v1/d18-1017 · Jan 1, 2018
- 1 · 10.1088/1742-6596/1616/1/012108 · Aug 1, 2020 · Journal of Physics: Conference Series
- 140 · 10.5121/ijnlc.2012.1402 · Dec 31, 2012 · International Journal on Natural Language Computing
- 36 · 10.1016/j.ins.2019.10.065 · Nov 1, 2019 · Information Sciences
- 48 · 10.1186/s13640-020-0490-z · Jan 13, 2020 · EURASIP Journal on Image and Video Processing
- Conference Article · 3 · 10.1109/icpr.2016.7900156 · Dec 1, 2016
Scene text information extraction plays an important role in many computer vision applications. Unlike most existing text extraction algorithms, which target English texts, in this paper we focus on Chinese texts, which are more complex in stroke and structure. To tackle this challenging problem, we propose a novel convolutional neural network (CNN) based text structure feature extractor for Chinese texts. Each Chinese character contains its own specific types and combination of text structure components, which are rarely seen in backgrounds. Thus, unlike features applicable to only one text extraction stage (text detection or text recognition), the text structure component feature is suitable for both Chinese text detection and recognition. A text structure component detector (TSCD) layer is designed to detect the large number of component types, which is the most challenging part of extracting text structure component features. Through statistical classification, the various types of text structure components are detected by specially designed convolutional units in the TSCD layer. With the TSCD layer, the CNN improves the accuracy and uniqueness of the text feature description. In the evaluation, text detection and recognition algorithms based on the proposed text structure feature extractor both achieve state-of-the-art results on two datasets.
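The TSCD idea sketched in this abstract, a layer whose convolutional units each specialise in one type of text structure component, can be illustrated with a minimal multi-branch module. This is a hedged sketch only: the branch count, kernel size, and channel widths below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TSCDLayer(nn.Module):
    """Illustrative text-structure-component detector layer: each branch is a
    small convolutional unit specialised for one component type, and the branch
    outputs are concatenated into a shared feature map."""
    def __init__(self, in_channels: int = 64, num_component_types: int = 8,
                 branch_channels: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_component_types)  # one unit per component type (assumed count)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate per-component responses along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)   # dummy backbone feature map
    print(TSCDLayer()(feats).shape)      # torch.Size([2, 128, 32, 32])
```

Because the concatenated map carries one channel group per component type, the same features can feed either a detection head or a recognition head, which is the property the abstract emphasises.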
- Research Article · 54 · 10.1109/access.2017.2676158 · Jan 1, 2017 · IEEE Access
Scene text information extraction plays an important role in many computer vision applications. Most features in existing text extraction algorithms are applicable to only one text extraction stage (text detection or recognition), which significantly weakens the consistency of an end-to-end system, especially for complex Chinese texts. To tackle this challenging problem, we propose a novel text structure feature extractor for Chinese texts based on a text structure component detector (TSCD) layer and a residual network. Inspired by the three-layer Chinese text cognition model of humans, we combine the TSCD layer and the residual network to extract features suitable for both text extraction stages. The specialized modeling of Chinese characters in the TSCD layer simulates the key structure component cognition layer in the psychological model, and the residual mechanism in the residual network simulates the key bidirectional connections among the layers in the psychological model. Through the organic combination of the TSCD layer and the residual network, the extracted features are applicable to both text detection and recognition, as they are for humans. In the evaluation, both text detection and recognition models based on the proposed text structure feature extractor achieve large improvements over baseline CNN models, and an end-to-end Chinese text information extraction system is experimentally designed and evaluated, showing the advantage of the proposed extractor as a unified feature extractor.
- Conference Article · 10.1145/3446999.3447027 · Dec 25, 2020
To address the problems of polysemy and overlapping relations in Chinese tea texts, we present a joint model, BERT-LCM-Tea, for entity and relation extraction, which combines Bidirectional Encoder Representations from Transformers (BERT) with a last character matching (LCM) algorithm. The model uses BERT to fine-tune character embeddings with contextual information, which resolves the polysemy problem and improves entity recognition performance on Chinese tea texts. In addition, the model uses the last character matching algorithm, which resolves the overlapping-relation problem and improves the accuracy of relation extraction on Chinese tea texts. The experimental results show that BERT-LCM-Tea achieves an F1 score of 86.8% on the entity recognition task and 77.1% on the relation extraction task, higher than the currently popular Bi-RNN-CRF, Bi-LSTM-CRF, and Bi-GRU-CRF. Thus, BERT-LCM-Tea is better suited to entity recognition and relation extraction on Chinese tea texts and provides a basis for future research on the construction of a tea knowledge graph.
- Research Article · 16 · 10.1007/s12559-015-9346-8 · Jul 2, 2015 · Cognitive Computation
Processing and understanding Chinese texts is difficult for machine learning methods, because the basic unit of Chinese text is not the character but the phrase, and there is no natural delimiter in Chinese text to separate phrases. Processing the large volume of Chinese Web texts is even more difficult, because such texts are often less topic-focused, short, irregular, sparse, and lacking in context. This poses a challenge for the mining, clustering, and classification of Chinese Web texts, and the recognition accuracy of the real meaning of such texts is typically low. In this paper, we propose a method that recognizes stable and abstract semantic topics expressing the highly hierarchical relationships behind Chinese texts from BaiduBaike. Based on these semantic topics, a discrete distribution model is then established to convert the analysis into a convex optimization problem via geometric programming. Our experiments demonstrate that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, on small training data and short Chinese Web texts.
- Conference Article · 3 · 10.1109/robio.2018.8665259 · Dec 1, 2018
In this paper, we present a UAV-based system for text (mainly English and Chinese) detection and recognition. By combining an unmanned aerial vehicle with scene text recognition, the system can detect and recognize text in long-range aerial images, providing a foundation for unmanned navigation and fast understanding of text information. Robust text detection and accurate text recognition are achieved through two contributions. First, a scalable engine is proposed to synthesize text images by overlaying English or Chinese text onto existing images in a natural way. Second, a trainable, end-to-end framework combining a convolutional neural network and a recurrent neural network is adapted to recognize variable-length text with high accuracy. Field experiments are performed with videos shot against various backgrounds and outdoors, showing that the proposed system can detect and recognise text information in UAV imagery robustly and effectively.
- Conference Article · 1 · 10.1109/icaml57167.2022.00029 · Jul 1, 2022
With the deepening of national judicial reform, how to combine artificial intelligence with judicial work has become a research emphasis in judicial intelligence research. To address the polysemy problem in Chinese and the characteristics of Chinese legal text (complicated context, professional terminology, and diverse entity types), we design a named entity recognition method for Chinese legal text based on BERT. First, we build a Chinese legal text corpus and use it to domain-pre-train the pre-trained BERT model so that it performs better on the named entity recognition task for Chinese legal text. Second, we design a clue-word addition method and a synonym replacement method for training data augmentation to increase the diversity of the training data. Finally, we combine the domain-pre-trained BERT model with a CRF layer and train it on the augmented datasets, applying adversarial training during training to improve the generalization ability of the model. To verify the performance of our model, we conduct experiments on the competition dataset of the information extraction track of the China Legal Intelligence Technology Evaluation Competition, which demonstrates the feasibility and effectiveness of the method.
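A common way to realise the adversarial training step mentioned above in BERT-based NER is the Fast Gradient Method (FGM): perturb the word-embedding weights along the loss gradient and back-propagate a second time on the perturbed input. The PyTorch sketch below illustrates that general pattern under stated assumptions (a model whose forward call returns an object with a `.loss`, as HuggingFace token-classification models do, a dict-shaped batch, an external optimizer, and an illustrative epsilon); it is not the paper's exact procedure.

```python
import torch

def fgm_adversarial_step(model, batch, optimizer, epsilon=1.0,
                         emb_name="word_embeddings"):
    """One training step with an FGM-style perturbation of the embedding matrix.
    `batch` is assumed to be a dict of tensors (input_ids, attention_mask, labels)."""
    # 1) ordinary forward/backward pass
    loss = model(**batch).loss
    loss.backward()

    # 2) perturb the embedding weights along the gradient direction
    backup = {}
    for name, param in model.named_parameters():
        if emb_name in name and param.grad is not None:
            backup[name] = param.data.clone()
            norm = torch.norm(param.grad)
            if norm != 0:
                param.data.add_(epsilon * param.grad / norm)

    # 3) forward/backward on the perturbed embeddings (gradients accumulate)
    adv_loss = model(**batch).loss
    adv_loss.backward()

    # 4) restore the embeddings, then update all parameters
    for name, param in model.named_parameters():
        if name in backup:
            param.data = backup[name]
    optimizer.step()
    optimizer.zero_grad()
    return loss.item(), adv_loss.item()
```

Only the embedding matrix is perturbed, so the encoder and CRF weights receive gradients from both the clean and the adversarial version of each batch, which is the regularising effect the abstract refers to.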
- Conference Article · 102 · 10.1109/icdar.2019.00253 · Sep 1, 2019
Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Unlike English text, Chinese has more than 6,000 commonly used characters, and Chinese characters can be arranged in various layouts with numerous fonts. Chinese signboards in street view are a good source of Chinese scene text images since they have varied backgrounds, fonts, and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboards. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all text lines and characters are annotated with locations and transcriptions, was released. Four tasks were set up: character recognition, text line recognition, text line detection, and end-to-end recognition. In addition, considering the ambiguity of Chinese text, we proposed a multi-ground-truth (multi-GT) evaluation method to make the evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams were received. Most participants came from universities, research institutes, and tech companies in China, with further participants from the United States, Australia, Singapore, and Korea. 21 teams submitted results for Task 1, 23 for Task 2, 24 for Task 3, and 13 for Task 4. The official website of the competition is http://rrc.cvc.uab.es/?ch=12.
- Book Chapter · 11 · 10.1007/978-3-030-64452-9_3 · Jan 1, 2020
In recent decades, a huge number of documents have been digitised and then run through optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and making the resulting collections accessible. However, indexing documents through their OCRed content poses a number of problems, due to the varying performance of OCR methods over time. Indeed, OCR quality has a considerable impact on indexing and therefore on the accessibility of digital documents. Named entities are among the most suitable information for indexing documents, in particular in digital libraries, for which log analysis studies have shown that around 80% of user queries include a named entity. Taking full advantage of the computational power of modern natural language processing (NLP) systems, named entity recognition (NER) can be run efficiently over enormous OCR corpora. Despite progress in OCR, the resulting text files still contain misrecognised words (noise, for short) that harm NER performance. To handle this challenge, we apply a spelling correction method to noisy versions of a corpus with variable OCR error rates in order to quantitatively estimate the contribution of post-OCR correction to NER. Our main finding is that we can consistently improve NER performance when the OCR quality is reasonable (character error rates (CER) between 2% and 10% and word error rates (WER) between 10% and 25%). The noise correction algorithm we propose is both language-independent and of low complexity.
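The CER and WER thresholds quoted above are normalised edit distances computed at the character and word level respectively. A minimal, self-contained sketch of how these metrics are commonly computed (plain dynamic-programming Levenshtein distance; the example strings are invented):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("named entity", "named entiity"))                      # ~0.083
print(wer("named entity recognition", "named entity recogntion"))  # ~0.333
```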
- Research Article · 8 · 10.1109/access.2020.3026535 · Jan 1, 2020 · IEEE Access
Named Entity Recognition (NER) systems have been greatly advanced by deep neural networks in the past decade. However, state-of-the-art NER methods have seldom been applied to Chinese historical texts, due to the lack of standard corpora in Chinese historical domains and the difficulty of accessing a quality ancient corpus. This paper addresses these issues and proposes an efficient automatic processing solution for NER on ancient Chinese data, including a data-driven tagging implementation and an innovative end-to-end network named "MoGCN" (Mixture of Gated Convolutional Neural Network). A corpus consisting of three genres of Chinese historical classics is generated by our tagging approach and used in experiments to assess the generalization ability of the proposed model. The empirical analysis demonstrates that our model achieves the best results on this dataset, with an F1-score improvement of more than 1.5% over other sophisticated models, and that performance depends positively on corpus quality. Furthermore, our model performs much better on shorter entities, especially 2-character ones, while our auxiliary attribute analysis shows that many long-range entities are identified only by our model. This work serves as a preliminary exploration of NER for historical data, providing unique insights and reference values for similar tasks. Future work should focus on further NER optimization on massive traditional Chinese texts with linguistic features and learning strategies.
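The gated convolution underlying a "Gated Convolutional Neural Network" is typically a gated linear unit: one convolution produces features while a second, sigmoid-activated convolution gates them. The sketch below shows that generic building block over a sequence of character embeddings; the channel width, kernel size, and input shape are assumptions, and this is not MoGCN's actual architecture.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated 1-D convolution (GLU-style): output = conv(x) * sigmoid(gate(x)).
    Operates on a sequence of character embeddings shaped (batch, channels, length)."""
    def __init__(self, channels: int = 128, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2                 # keep the sequence length unchanged
        self.feature = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.feature(x) * torch.sigmoid(self.gate(x))

if __name__ == "__main__":
    chars = torch.randn(4, 128, 50)                # 4 sentences, 128-dim, 50 characters
    print(GatedConv1d()(chars).shape)              # torch.Size([4, 128, 50])
```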
- Research Article · 10.1007/s44443-025-00101-7 · Jul 1, 2025 · Journal of King Saud University Computer and Information Sciences
A text clarification and deep relational reasoning method for Mongolian-Chinese bilingual arbitrary-shaped scene text detection
- Conference Article · 2 · 10.1109/icdarw.2019.40087 · Sep 1, 2019
Chinese characters are often arranged in arbitrary orders and forms in scene images, which are difficult to handle for text-line-based methods. Recent approaches adopt irregular text-line reading to solve this problem, but a text line in Chinese is often ambiguous. In this paper, we propose a character-based framework for spotting Chinese text from another perspective. Specifically, the framework consists of three components for character detection, character recognition, and character grouping, respectively. Notably, a novel Conditional Random Field (CRF) based character grouping algorithm is presented to handle arbitrary arrangements of Chinese text. Experiments on the ReCTS-ARB549 dataset demonstrate that our framework achieves superior performance compared with state-of-the-art text-line-based approaches.
- Research Article · 57 · 10.1007/s00607-019-00766-9 · Nov 25, 2019 · Computing
Owing to the uneven distribution of key features in Chinese texts, key features play different roles in text recognition in Chinese text classification tasks. We propose an attention-based feature-enhanced fusion model for Chinese text classification, comprising a long short-term memory (LSTM) network, a convolutional neural network (CNN), and a feature-difference enhancement attention algorithm. In preprocessing, the Chinese text is digitized into vectors carrying semantic context information and fed into the embedding layer to train and test the neural network. The feature-enhanced fusion model is implemented with double-layer LSTM and CNN modules to enhance the fusion of the text features extracted by the attention mechanism before classification. The feature-difference enhancement attention algorithm not only gives more weight to important text features but also strengthens the differences between them and other text features, which further improves the effect of important features on Chinese text recognition. Both models use a softmax function for classification. Text classification experiments are conducted on a Chinese text corpus, and the results show that, compared with the contrast models, the proposed algorithm significantly improves the recognition of Chinese text features.
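The fusion scheme described above, running LSTM and CNN extractors over the same embedded text and combining their features with attention before a softmax classifier, can be sketched as follows. The vocabulary size, hidden dimensions, layer counts, and the attention-pooling choice are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Illustrative LSTM + CNN fusion with attention pooling for text classification."""
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=64, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(emb_dim, 2 * hidden, kernel_size=3, padding=1)
        self.attn = nn.Linear(4 * hidden, 1)       # scores each fused time step
        self.out = nn.Linear(4 * hidden, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                                     # (B, T, E)
        lstm_feats, _ = self.lstm(x)                               # (B, T, 2H)
        cnn_feats = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 2H)
        fused = torch.cat([lstm_feats, cnn_feats], dim=-1)         # (B, T, 4H)
        weights = torch.softmax(self.attn(fused).squeeze(-1), dim=1)   # (B, T)
        pooled = (weights.unsqueeze(-1) * fused).sum(dim=1)        # (B, 4H)
        return self.out(pooled)                    # logits; softmax applied at loss time

if __name__ == "__main__":
    batch = torch.randint(0, 5000, (8, 40))        # 8 sentences of 40 token ids
    print(FusionClassifier()(batch).shape)         # torch.Size([8, 10])
```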
- Research Article · 49 · 10.1109/tmm.2016.2625259 · Mar 1, 2017 · IEEE Transactions on Multimedia
Text detection in natural environments plays an important role in many computer vision applications. While existing text detection methods focus on English characters, there is strong application demand for text detection in other languages, such as Chinese. In this paper, we present a novel text detection algorithm for Chinese characters based on a specifically designed convolutional neural network (CNN). The CNN contains a text structure component detector layer, a spatial pyramid layer, and a multi-input-layer deep belief network (DBN). The CNN is pre-trained via a convolutional sparse auto-encoder specifically designed for extracting complex features from Chinese characters. In particular, the text structure component detectors enhance the accuracy and uniqueness of the feature descriptors by extracting multiple text structure components in various ways. The spatial pyramid layer enhances the scale invariance of the CNN for detecting texts at multiple scales. Finally, the multi-input-layer DBN replaces the fully connected layers of the CNN to ensure that features from multiple scales are comparable. A multilingual text detection dataset, in which texts in Chinese, English, and digits are labeled separately, is set up to evaluate the proposed text detection algorithm. The proposed algorithm shows a significant performance improvement over the baseline CNN algorithms. In addition, the proposed algorithm is evaluated on a public multilingual benchmark and achieves state-of-the-art results across multiple languages. Furthermore, a simplified version of the proposed algorithm with only general components is evaluated on the ICDAR 2011 and 2013 datasets, showing detection performance comparable to existing general text detection algorithms.
- Research Article · 9 · 10.1109/access.2019.2919994 · Jan 1, 2019 · IEEE Access
Text detection in natural scene images is challenging due to variation in text size, orientation, and color, as well as complex backgrounds, contrast, and resolution. In this paper, we focus on long text detection against complex backgrounds. To deal with multi-scale text variation and to exploit the recognition result to enhance detection performance, we propose a detection and verification model based on SSD and an encoder-decoder network for scene text detection. First, we present a text localization neural network based on SSD, which incorporates a text detection layer into the standard SSD model and can more effectively detect horizontal texts, especially long and dense Chinese texts in natural scenes. Second, a text verification model based on the encoder-decoder network is designed to recognize and verify the initial detection results, in order to eliminate non-text areas falsely detected as text. A series of experiments has been conducted on our constructed horizontal text detection dataset, which is composed of the horizontal text images in the ICDAR 2017 Competition on Reading Chinese Text in the Wild (RCTW 2017) and some scene images taken with cameras. Compared with previous approaches, experimental results show that our method achieves the highest recall rate of 0.784 and a competitive precision rate in text detection, indicating the effectiveness of the proposed method.
- Research Article · 8 · 10.3897/rio.6.e55789 · Jul 3, 2020 · Research Ideas and Outcomes
We describe an effective approach to automated text digitisation of natural history specimen labels. These labels contain much useful data about the specimen, including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline, and recommendations are made for the pipeline's component parts based on some state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing, as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed significant advances in this area. Google's Cloud Vision, which is based on deep learning and trained on large-scale datasets, is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research, and our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrate that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library, to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html). We have highlighted the main recommendations for potential pipeline components, and the document also provides guidance on selecting appropriate software solutions, including automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
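The pipeline described above (OCR on segmented label images, followed by NER for collector and location names) can be approximated with off-the-shelf tools. The sketch below assumes a local Tesseract installation and uses spaCy's small English model as a stand-in NER component; the image path is a placeholder.

```python
# pip install pytesseract pillow spacy && python -m spacy download en_core_web_sm
import pytesseract
from PIL import Image
import spacy

def extract_label_entities(image_path: str):
    """OCR a pre-segmented specimen-label image, then pull person and
    location names out of the recognised text."""
    text = pytesseract.image_to_string(Image.open(image_path))  # Tesseract OCR
    nlp = spacy.load("en_core_web_sm")                          # generic NER model
    doc = nlp(text)
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
    return text, people, places

if __name__ == "__main__":
    # "label_crop.png" is a placeholder path for a segmented label image.
    raw_text, collectors, locations = extract_label_entities("label_crop.png")
    print(collectors, locations)
```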