Adversarial training flat-lattice transformer for named entity recognition of chinese legal texts


Similar Papers
  • Conference Article
  • Cite Count Icon 3
  • 10.1109/icpr.2016.7900156
A novel text structure feature extractor for Chinese scene text detection and recognition
  • Dec 1, 2016
  • Xiaohang Ren + 5 more

Scene text information extraction plays an important role in many computer vision applications. Unlike most existing text extraction algorithms, which target English texts, this paper focuses on Chinese texts, which are more complex in stroke and structure. To tackle this challenging problem, we propose a novel convolutional neural network (CNN) based text structure feature extractor for Chinese texts. Each Chinese character contains its own specific types and combinations of text structure components, which are rarely seen in backgrounds. Thus, unlike features that are applicable to only one text extraction stage (text detection or text recognition), the text structure component feature is suitable for both Chinese text detection and recognition. A text structure component detector (TSCD) layer is designed to detect the large number of component types, which is the most challenging part of extracting text structure component features. Through statistical classification, the various types of text structure components are detected by their specially designed convolutional units in the TSCD layer. With the TSCD layer, the CNN improves the accuracy and uniqueness of text feature description. In the evaluation, both text detection and recognition algorithms based on the proposed text structure feature extractor achieve state-of-the-art results on two datasets.

  • Research Article
  • Cite Count Icon 54
  • 10.1109/access.2017.2676158
A Novel Text Structure Feature Extractor for Chinese Scene Text Detection and Recognition
  • Jan 1, 2017
  • IEEE Access
  • Xiaohang Ren + 5 more

Scene text information extraction plays an important role in many computer vision applications. Most features in existing text extraction algorithms are only applicable to one text extraction stage (text detection or recognition), which significantly weakens the consistency in an end-to-end system, especially for the complex Chinese texts. To tackle this challenging problem, we propose a novel text structure feature extractor based on a text structure component detector (TSCD) layer and residual network for Chinese texts. Inspired by the three-layer Chinese text cognition model of a human, we combine the TSCD layer and the residual network to extract features suitable for both text extraction stages. The specialized modeling for Chinese characters in the TSCD layer simulates the key structure component cognition layer in the psychological model. And the residual mechanism in the residual network simulates the key bidirectional connection among the layers in the psychological model. Through the organic combination of the TSCD layer and the residual network, the extracted features are applicable to both text detection and recognition, as humans do. In evaluation, both text detection and recognition models based on our proposed text structure feature extractor achieve great improvements over baseline CNN models. And an end-to-end Chinese text information extraction system is experimentally designed and evaluated, showing the advantage of the proposed feature extractor as a unified feature extractor.

  • Conference Article
  • 10.1145/3446999.3447027
Joint Extraction of Entities and Relations for Chinese Text of Tea
  • Dec 25, 2020
  • Zihao Zhou + 4 more

To address the problems of polysemy and overlapping relations in Chinese tea text, we present BERT-LCM-Tea, a joint model for the extraction of entities and relations that combines Bidirectional Encoder Representations from Transformers (BERT) with the last character matching (LCM) algorithm. The model uses BERT to fine-tune character embeddings through contextual information, which solves the problem of polysemy and improves the performance of entity recognition on Chinese tea text. In addition, the model uses the last character matching algorithm, which solves the problem of overlapping relations and improves the accuracy of relation extraction on Chinese tea text. The experimental results show that BERT-LCM-Tea achieves an F1 score of 86.8% on the entity recognition task and 77.1% on the relation extraction task, higher than the currently popular Bi-RNN-CRF, Bi-LSTM-CRF and Bi-GRU-CRF. Thus, BERT-LCM-Tea is more suitable for entity recognition and relation extraction of Chinese tea text, and provides a basis for future research on the construction of a tea knowledge graph.

  • Research Article
  • Cite Count Icon 16
  • 10.1007/s12559-015-9346-8
Classification of Chinese Texts Based on Recognition of Semantic Topics
  • Jul 2, 2015
  • Cognitive Computation
  • Ye-wang Chen + 3 more

For machine learning methods, processing and understanding Chinese texts is difficult because the basic unit of Chinese text is not the character but the phrase, and there is no natural delimiter in Chinese texts to separate phrases. Processing large numbers of Chinese Web texts is more difficult still, because such texts are often less topic focused, short, irregular, sparse, and lacking in context. This poses a challenge for mining, clustering, and classification of Chinese Web texts. Typically, the recognition accuracy of the real meaning of such texts is low. In this paper, we propose a method that recognizes stable and abstract semantic topics that express the highly hierarchical relationship behind the Chinese texts from BaiduBaike. Then, based on these semantic topics, a discrete distribution model is established to convert analysis to a convex optimization problem by geometric programming. Our experiments demonstrated that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, regarding the recognition of mini training data and short Chinese Web texts.

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/robio.2018.8665259
A Text Detection and Recognition System Based on an End-to-End Trainable Framework from UAV Imagery
  • Dec 1, 2018
  • Qingtian Wu + 2 more

In this paper, we present a UAV-based system for text (mainly English and Chinese) detection and recognition. By combining an unmanned aerial vehicle with scene text recognition, the system can perform text detection and recognition in long-range aerial images, providing a foundation for unmanned navigation and fast text information understanding. Robust text detection and accurate text recognition are achieved through two contributions. First, a scalable engine is proposed to synthesize text images by overlaying English or Chinese text onto existing images in a natural way. Second, a trainable end-to-end framework combining a Convolutional Neural Network and a Recurrent Neural Network is adapted to recognize variable-length text with high accuracy. Field experiments are performed with different videos shot outdoors against various backgrounds to show that the proposed system can detect and recognise text information in UAV imagery robustly and effectively.

  • Conference Article
  • Cite Count Icon 1
  • 10.1109/icaml57167.2022.00029
Named Entity Recognition of Chinese Legal Text Based on BERT
  • Jul 1, 2022
  • Huawei Lu + 1 more

With the deepening of national judicial reform, how to combine artificial intelligence with judicial work has become a research emphasis in judicial intelligence research. Aiming at the polysemy problem in Chinese and at the characteristics of Chinese legal text, namely complicated context, specialized terminology, and diverse entity types, we design a named entity recognition method for Chinese legal text based on BERT. Firstly, we build a Chinese legal text corpus and use it for domain pre-training of the pre-trained BERT model so that it performs better on the named entity recognition task in Chinese legal text. Secondly, we design a method of adding clue words and a method of replacing synonyms for training data augmentation to increase the diversity of the training data. Finally, we combine the domain pre-trained BERT model with a CRF layer and train it on the augmented datasets. Meanwhile, we use adversarial training in the training process to improve the generalization ability of the model. To verify the performance of our model, we conduct experiments on the competition data set of the information extraction track of the China Legal Intelligence Technology Evaluation Competition, which demonstrate the feasibility and effectiveness of the method.

  • Conference Article
  • Cite Count Icon 102
  • 10.1109/icdar.2019.00253
ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard
  • Sep 1, 2019
  • Rui Zhang + 14 more

Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters, and Chinese characters can be arranged in various layouts with numerous fonts. The Chinese signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboards. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all the text lines and characters are annotated with locations and transcriptions, was released. Four tasks, namely character recognition, text line recognition, text line detection and end-to-end recognition, were set up. Besides, considering the Chinese text ambiguity issue, we proposed a multi ground truth (multi-GT) evaluation method to make evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams were received. Most of the participants came from universities, research institutes, and tech companies in China. There were also some participants from the United States, Australia, Singapore, and Korea. 21 teams submitted results for Task 1, 23 teams for Task 2, 24 teams for Task 3, and 13 teams for Task 4. The official website for the competition is http://rrc.cvc.uab.es/?ch=12.

  • Book Chapter
  • Cite Count Icon 11
  • 10.1007/978-3-030-64452-9_3
When to Use OCR Post-correction for Named Entity Recognition?
  • Jan 1, 2020
  • Vinh-Nam Huynh + 2 more

In the last decades, a huge number of documents has been digitised, before undergoing optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and to make the resulting collections accessible. However, the fact that documents are indexed through their OCRed content is posing a number of problems, due to the varying performance of OCR methods over time. Indeed, OCR quality has a considerable impact on the indexing and therefore the accessibility of digital documents. Named entities are among the most adequate information to index documents, in particular in the case of digital libraries, for which log analysis studies have shown that around 80% of user queries include a named entity. Taking full advantage of the computational power of modern natural language processing (NLP) systems, named entity recognition (NER) can be operated over enormous OCR corpora efficiently. Despite progress in OCR, resulting text files still have misrecognised words (or noise for short) which are harming NER performance. In this paper, to handle this challenge, we apply a spelling correction method to noisy versions of a corpus with variable OCR error rates in order to quantitatively estimate the contribution of post-OCR correction to NER. Our main finding is that we can indeed consistently improve the performance of NER when the OCR quality is reasonable (error rates respectively between 2% and 10% for characters (CER) and between 10% and 25% for words (WER)). The noise correction algorithm we propose is both language-independent and with low complexity.

  • Research Article
  • Cite Count Icon 8
  • 10.1109/access.2020.3026535
MoGCN: Mixture of Gated Convolutional Neural Network for Named Entity Recognition of Chinese Historical Texts
  • Jan 1, 2020
  • IEEE Access
  • Chengxi Yan + 2 more

Named Entity Recognition (NER) systems have been largely advanced by deep neural networks in the recent decade. However, state-of-the-art NER methods have seldom been applied to Chinese historical texts due to the lack of standard corpora in Chinese historical domains and the difficulty of accessing a quality ancient corpus. This paper addresses these issues and proposes an efficient automatic processing solution for tackling NER of ancient Chinese data, including the implementation of data-driven tagging and an innovative end-to-end network named “MoGCN” (Mixture of Gated Convolutional Neural Network). A corpus consisting of three genres of Chinese historical classics is generated by our tagging approach and is used in experiments to assess the generalization ability of the proposed model. The empirical analysis demonstrates that our proposed model achieves the best results, with above 1.5% F1-score improvement over other sophisticated models on this dataset, where the experimental performance shows positive dependence on the quality of the corpus. Furthermore, our model performs much better on shorter entities, especially 2-character ones, while many long-range entities can only be identified by our model, based on our auxiliary attribute analysis. This work serves as a preliminary exploitation of NER for historical data, providing unique insights and reference values for similar tasks. Future work should focus on further exploration of NER optimization on massive Chinese traditional texts with linguistic features and learning strategies.

  • Research Article
  • 10.1007/s44443-025-00101-7
A text clarification and deep relational reasoning method for Mongolian-Chinese bilingual arbitrary-shaped scene text detection
  • Jul 1, 2025
  • Journal of King Saud University Computer and Information Sciences
  • Yuefeng Liu + 2 more


  • Conference Article
  • Cite Count Icon 2
  • 10.1109/icdarw.2019.40087
Reading Chinese Scene Text with Arbitrary Arrangement Based on Character Spotting
  • Sep 1, 2019
  • Qi Song + 6 more

Chinese characters are often arranged in arbitrary orders and forms in scene images, which are difficult to handle for text line based methods. Recent approaches adopt irregular text line reading to solve this problem, but a text line in Chinese is often ambiguous. In this paper, we propose a character based framework for spotting Chinese text from another view. Specifically, the framework consists of three components for character detection, character recognition and character grouping respectively. It is worth mentioning that a novel Conditional Random Field (CRF) based character grouping algorithm is presented for arbitrary arrangement in Chinese text. Experiments on the ReCTS-ARB549 dataset demonstrate that our framework achieves superior performance compared with state-of-the-art text line based approaches.

  • Research Article
  • Cite Count Icon 57
  • 10.1007/s00607-019-00766-9
Chinese text classification based on attention mechanism and feature-enhanced fusion neural network
  • Nov 25, 2019
  • Computing
  • Jinbao Xie + 6 more

Owing to the uneven distribution of key features in Chinese texts, key features play different roles in text recognition in Chinese text classification tasks. We propose, for Chinese text classification, a feature-enhanced fusion model based on an attention mechanism, a long short-term memory (LSTM) network, and a convolutional neural network (CNN), together with a feature-difference enhancement attention algorithm model. Through preprocessing, the Chinese text is converted into vectors containing semantic context information and fed into the embedding layer to train and test the neural network. The feature-enhanced fusion model is implemented by double-layer LSTM and CNN modules to enhance the fusion of text features extracted from the attention mechanism for classification. The feature-difference enhancement attention algorithm model not only adds more weight to important text features but also strengthens the differences between them and other text features. This operation can further improve the effect of important features on Chinese text recognition. The outputs of the two models are classified by the softmax function. The text classification experiments are conducted on a Chinese text corpus. The experimental results show that, compared with the contrast model, the proposed algorithm can significantly improve the recognition ability of Chinese text features.

  • Research Article
  • Cite Count Icon 49
  • 10.1109/tmm.2016.2625259
A Convolutional Neural Network-Based Chinese Text Detection Algorithm via Text Structure Modeling
  • Mar 1, 2017
  • IEEE Transactions on Multimedia
  • Xiaohang Ren + 5 more

Text detection in a natural environment plays an important role in many computer vision applications. While existing text detection methods are focused on English characters, there are strong application demands on text detection in other languages, such as Chinese. In this paper, we present a novel text detection algorithm for Chinese characters based on a specifically designed convolutional neural network (CNN). The CNN contains a text structure component detector layer, a spatial pyramid layer, and a multi-input-layer deep belief network (DBN). The CNN is pre-trained via a convolutional sparse auto-encoder, specifically designed for extracting complex features from Chinese characters. In particular, the text structure component detectors enhance the accuracy and uniqueness of feature descriptors by extracting multiple text structure components in various ways. The spatial pyramid layer enhances the scale invariability of the CNN for detecting texts in multiple scales. Finally, the multi-input-layer DBN replaces the fully connected layers in the CNN to ensure features from multiple scales are comparable. A multilingual text detection dataset, in which texts in Chinese, English, and digits are labeled separately, is set up to evaluate the proposed text detection algorithm. The proposed algorithm shows a significant performance improvement over the baseline CNN algorithms. In addition, the proposed algorithm is evaluated on a public multilingual benchmark and achieves state-of-the-art results across multiple languages. Furthermore, a simplified version of the proposed algorithm with only general components is evaluated on the ICDAR 2011 and 2013 datasets, showing comparable detection performance to the existing general text detection algorithms.

  • Research Article
  • Cite Count Icon 9
  • 10.1109/access.2019.2919994
A Detection and Verification Model Based on SSD and Encoder-Decoder Network for Scene Text Detection
  • Jan 1, 2019
  • IEEE Access
  • Xue Gao + 2 more

Text detection in natural scene images is challenging due to variation in text size, orientation, and color, as well as complex backgrounds, contrast, and resolution. In this paper, we focus on long text detection in complex backgrounds. In order to deal with multi-scale text variation and exploit the recognition result to enhance the detection performance, we propose a detection and verification model based on SSD and an encoder-decoder network for scene text detection. First, we present a text localization neural network based on SSD, which incorporates a text detection layer into the standard SSD model and can detect horizontal texts, especially long and dense Chinese texts in natural scenes, more effectively. Second, a text verification model based on the encoder-decoder network is designed to recognize and verify the initial detection results, in order to eliminate non-text areas that are falsely detected as text areas. A series of experiments have been conducted on our constructed horizontal text detection dataset, which is composed of the horizontal text images in the ICDAR 2017 Competition on Reading Chinese Text in the Wild (RCTW 2017) and some scene images taken by cameras. Compared with previous approaches, experimental results show that our method achieves the highest recall rate of 0.784 and a competitive precision rate in text detection, indicating the effectiveness of our proposed method.

  • Research Article
  • Cite Count Icon 8
  • 10.3897/rio.6.e55789
Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections
  • Jul 3, 2020
  • Research Ideas and Outcomes
  • David Owen + 7 more

We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html). We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.

More from: Artificial Intelligence and Law
  • Research Article
  • 10.1007/s10506-025-09488-0
GPT vs human legal texts annotations: A comparative study with privacy policies
  • Oct 17, 2025
  • Artificial Intelligence and Law
  • David Cevallos-Salas + 4 more

  • Research Article
  • 10.1007/s10506-025-09480-8
LLMES: an LLMs-based expert system for quality management system audits
  • Oct 14, 2025
  • Artificial Intelligence and Law
  • Yunhan Li + 5 more

  • Research Article
  • 10.1007/s10506-025-09479-1
ARDI: a new dataset for automatic advocate recommendation in the Indian Legal System
  • Oct 9, 2025
  • Artificial Intelligence and Law
  • Upal Bhattacharya + 7 more

  • Research Article
  • 10.1007/s10506-025-09482-6
LegisSearch: navigating legislation with graphs and large language models
  • Oct 6, 2025
  • Artificial Intelligence and Law
  • Andrea Colombo + 5 more

  • Research Article
  • 10.1007/s10506-025-09483-5
Automated neural patent landscaping in the small data regime using citations and CPC codes
  • Oct 4, 2025
  • Artificial Intelligence and Law
  • Tisa Islam Erana + 1 more

  • Research Article
  • 10.1007/s10506-025-09471-9
Combining topic modelling and citation network analysis to study case law from the European Court of Human Rights on the right to respect for private and family life
  • Aug 26, 2025
  • Artificial Intelligence and Law
  • Mohammad Mohammadi + 3 more

  • Research Article
  • 10.1007/s10506-025-09466-6
Using GPT-4o as a factor extractor for Brazilian consumer law judgments*
  • Aug 12, 2025
  • Artificial Intelligence and Law
  • Lucas De Castro Rodrigues Pereira + 11 more

  • Research Article
  • 10.1007/s10506-025-09475-5
Correction: An interpretable approach to detect case law on housing and eviction issues within the HUDOC database
  • Aug 7, 2025
  • Artificial Intelligence and Law
  • Mohammad Mohammadi + 2 more

  • Research Article
  • 10.1007/s10506-025-09476-4
Adversarial training flat-lattice transformer for named entity recognition of chinese legal texts
  • Jul 25, 2025
  • Artificial Intelligence and Law
  • Jiabao Wang + 3 more

  • Research Article
  • 10.1007/s10506-025-09472-8
TRACS-LLM: LLM-based traffic accident criminal sentencing prediction focusing on imprisonment, probation, and fines
  • Jul 23, 2025
  • Artificial Intelligence and Law
  • Hyunsik Min + 1 more
