Articles published on Word Boundary Information
33 Search results
- Research Article
8
- 10.1109/tnnls.2025.3528416
- Jun 1, 2025
- IEEE Transactions on Neural Networks and Learning Systems
- Chengyu Wang + 6 more
Recently, character-word lattice structures have achieved promising results for Chinese named entity recognition (NER), reducing word segmentation errors and increasing word boundary information for character sequences. However, constructing the lattice structure is complex and time-consuming, so these lattice-based models usually suffer from low inference speed. Moreover, the quality of the lexicon affects the accuracy of the NER model. Since noise words can potentially confuse NER, limited coverage of the lexicon can cause lattice-based models to degenerate into partial character-based models. In this article, we propose a hierarchical label-enhanced contrastive learning (HLCL) method for Chinese NER. Instead of relying on the lattice structure, HLCL offers an alternative solution to robustly integrate entity boundary and type information with the help of both label semantics and contrastive learning. HLCL is empowered by two techniques: 1) sentence-level contrastive learning (SCL) to model global mutual information between two different modalities (e.g., labels and sentences) and 2) token-level contrastive learning (TCL) to close the gap between representations of different characters (e.g., label-enhanced characters and original characters), resulting in local mutual information. With the well-designed contrastive learning scheme and the concise model during inference, HLCL can fully leverage the transferable label semantics and has a superb speed of inference. Experiments on four Chinese NER datasets show that HLCL obtains excellent efficiency as well as performance compared with existing lattice-based approaches.
- Research Article
- 10.47852/bonviewjdsis42024432
- Dec 17, 2024
- Journal of Data Science and Intelligent Systems
- Tao Wu + 5 more
Named entity recognition (NER) is a fundamental subtask of information extraction that aims to locate and classify named entities in unstructured text into predefined categories. Recently, large-scale language models (LLMs) have achieved SOTA performance on a variety of natural language processing tasks. However, because NER is a sequence labeling task in nature while LLMs are text-generation models, the performance of LLMs on NER is still significantly below supervised baselines, and NER remains a difficult task. Meanwhile, the word boundary and semantic information of Chinese words are usually quite vague, as words contained in Chinese texts are not separated by spaces. Thus, the NER task still requires a supervised learning paradigm and relies heavily on large amounts of labeled data, such as entity type and boundary information. However, the cost of labeling data can be prohibitively large, and purely supervised approaches usually suffer from poor generalization capability. In this article, we propose a multitask learning-based bidirectional iterated dilated convolution model, BCNN-CWS, for low-resource NER via leveraging word boundary information of the Chinese word segmentation (CWS) task. Specifically, to efficiently recognize named entities, an iterated dilated convolutional model with a limited number of layers is implemented. In addition, a bidirectional causal convolution mechanism is presented for contextual information extraction. Results of extensive experiments on public Chinese datasets demonstrate that BCNN-CWS achieves superior performance over state-of-the-art models, and it yields up to about 50% speed improvement over existing methods. It is worth noting that BCNN-CWS can be further improved by combining with a pretrained model.
Received: 25 September 2024 | Revised: 4 November 2024 | Accepted: 28 November 2024
Conflicts of Interest: The authors declare that they have no conflicts of interest to this work.
Data Availability Statement: The data that support the findings of this study are openly available on GitHub at https://github.com/jiangfeng13/BCNN-CWS
Author Contribution Statement: Tao Wu: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Visualization, Supervision. Xinwen Cao: Resources, Data curation. Feng Jiang: Software, Validation, Formal analysis, Investigation, Writing – original draft. Canyixing Cui: Data curation, Writing – review & editing. Xuehao Li: Resources. Xingping Xian: Supervision, Project administration, Funding acquisition.
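As a rough illustration of why a dilated convolutional model can cover wide context with a limited number of layers, the sketch below computes the receptive field of a stack of 1-D dilated convolutions. The function name, the kernel size of 3, and the dilation schedule are assumptions chosen for illustration; they are not taken from the BCNN-CWS paper.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (in tokens) of stacked 1-D dilated convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1  # a single token before any convolution is applied
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Hypothetical settings: three layers, kernel size 3.
print(receptive_field(3, [1, 2, 4]))  # dilated stack
print(receptive_field(3, [1, 1, 1]))  # ordinary (undilated) stack
```

With these assumed settings, the dilated stack sees 15 tokens while the undilated stack of the same depth sees only 7, which is the general motivation for dilation in sequence labeling.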
- Research Article
1
- 10.1016/j.learninstruc.2024.102034
- Oct 18, 2024
- Learning and Instruction
- Weiyan Liao + 1 more
Does word boundary information facilitate Chinese sentence reading in children as beginning readers?
- Research Article
- 10.1109/access.2024.3507382
- Jan 1, 2024
- IEEE Access
- Wazir Ali + 5 more
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It’s cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
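Treating word segmentation as sequence labeling usually means assigning each character a tag that marks its position in a word. The sketch below uses the common B/M/E/S scheme (begin, middle, end, single); the abstract does not say which tag set SGNWS uses, so the scheme and both function names are assumptions.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from an unsegmented string and its predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


tags = words_to_tags(["ab", "c", "def"])
print(tags)
print(tags_to_words("abcdef", tags))
```

A tagger such as a BiLSTM with a conditional random field would predict the tag sequence; the second function then recovers the word boundaries.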
- Research Article
6
- 10.1016/j.specom.2023.102970
- Aug 14, 2023
- Speech Communication
- Ijazul Haq + 3 more
Correction of whitespace and word segmentation in noisy Pashto text using CRF
- Research Article
8
- 10.1145/3604811
- Aug 11, 2023
- ACM Transactions on Intelligent Systems and Technology
- Qibin Li + 4 more
Joint entity and relation extraction (RE) constructs a framework for unifying entity recognition and relationship extraction, and the approach can exploit the dependencies between the two tasks to improve the performance of the task. However, the existing tasks still have the following two problems. First, when the model extracts entity information, the boundary is blurred. Second, there are mostly implicit interactions between modules, that is, the interactive information is hidden inside the model, and the implicit interactions are often insufficient in the degree of interaction and lack interpretability. To this end, this study proposes a joint entity and relation extraction model (ESEI) based on Efficient Sampling and Explicit Interaction. We innovatively divide negative samples into sentences based on whether they overlap with positive samples, which improves the model’s ability to extract entity word boundary information by controlling the sampling ratio. In order to increase the explicit interaction ability between the models, we introduce a heterogeneous graph neural network (GNN) into the model, which will serve as a bridge linking the entity recognition module and the relation extraction module, and enhance the interaction between the modules through information transfer. Our method substantially improves the model’s discriminative power on entity extraction tasks and enhances the interaction between relation extraction tasks and entity extraction tasks. Experiments show that the method is effective: we validate our method on four datasets, and for joint entity and relation extraction, our model improves the F1 score on multiple datasets.
- Research Article
9
- 10.1145/3603626
- Jul 20, 2023
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Yibo Yan + 4 more
Named entity recognition (NER) is a fundamental task for information extraction applications. NER is challenging because of semantic ambiguities in academic literature, especially for non-Latin languages. Besides word semantic information, recognizing Chinese named entities needs to consider word boundary information, as words contained in Chinese texts are not separated with spaces. Leveraging word boundary information could help to determine entity boundaries and thus improve entity recognition performance. In this article, we propose to combine word boundary information and semantic information for named entity recognition based on multi-task adversarial learning. Specifically, we learn commonly shared boundary information of entities from multiple kinds of tasks, including Chinese word segmentation (CWS), part-of-speech (POS) tagging, and entity recognition, with adversarial learning. We learn task-specific semantic information of words from these tasks and combine the learned boundary information with the semantic information to improve entity recognition with multi-task learning. We then propose a compression method based on improved clustering to accelerate the proposed model. We conduct extensive experiments on four public benchmark datasets and two private datasets, compared with state-of-the-art baseline models, and the experimental results demonstrate that our model achieves considerable performance improvements on various evaluation datasets.
- Research Article
30
- 10.1109/tnnls.2021.3114378
- Jul 1, 2023
- IEEE Transactions on Neural Networks and Learning Systems
- Shan Zhao + 5 more
Word-character lattice models have been proved to be effective for some Chinese natural language processing (NLP) tasks, in which word boundary information is fused into character sequences. However, due to the inherently unidirectional sequential nature, prior approaches have only learned sequential interactions of character-word instances but fail to capture fine-grained correlations in word-character spaces. In this article, we propose a lattice-aligned attention network (LAN) that aims to model dense interactions over word-character lattice structure for enhancing character representations. By carefully combining cross-lattice module, gated word-character semantic fusion unit, and self-lattice attention module, the network can explicitly capture fine-grained correlations across different spaces (e.g., word-to-character and character-to-character), thus significantly improving model performance. Experimental results on three Chinese NLP benchmark tasks demonstrate that LAN obtains state-of-the-art results compared to several competitive approaches.
- Research Article
8
- 10.1145/3570328
- Mar 23, 2023
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Kaifang Long + 7 more
Chinese Named Entity Recognition (NER) is an essential task in natural language processing, and its performance directly impacts the downstream tasks. The main challenges in Chinese NER are the high dependence of named entities on context and the lack of word boundary information. Therefore, how to integrate relevant knowledge into the corresponding entity has become the primary task for Chinese NER. Both the lattice LSTM model and the WC-LSTM model did not make excellent use of contextual information. Additionally, the lattice LSTM model had a complex structure and did not exploit the word information well. To address the preceding problems, we propose a Chinese NER method based on the deep neural network with multiple ways of embedding fusion. First, we use a convolutional neural network to combine the contextual information of the input sequence and apply a self-attention mechanism to integrate lexicon knowledge, compensating for the lack of word boundaries. The word feature, context feature, bigram feature, and bigram context feature are obtained for each character. Second, four different features are used to fuse information at the embedding layer. As a result, four different word embeddings are obtained through cascading. Last, the fused feature information is input to the encoding and decoding layer. Experiments on three datasets show that our model can effectively improve the performance of Chinese NER.
- Research Article
13
- 10.3389/fpsyg.2023.783960
- Mar 13, 2023
- Frontiers in Psychology
- Yaqiong Cui
Unlike English, Chinese does not have interword spacing in written texts, which poses difficulties for Chinese-as-a-second-language (CSL) learners’ identification of word boundaries and affects their reading comprehension and vocabulary acquisition. The eye-movement literature has suggested that interword spacing is important in alphabetic languages; examining languages that lack interword spaces such as Chinese, thus, may help to inform theoretical accounts of eye-movement control and word identification during reading. Research investigating the interword spacing effect in reading Chinese showed that adding spacing facilitated CSL learners’ reading comprehension and speed as well as vocabulary learning. However, the bulk of this research mainly looked at the learning outcomes (off-line measures), with few studies focusing on L2 learners’ reading processes. Building on this background, this study seeks to provide a descriptive perspective of the eye movements of CSL learners. In this study, 24 CSL learners with intermediate Chinese proficiency were recruited as the experimental group, and 20 Chinese native speakers were recruited as the control group. The EyeLink 1000 eye tracker was used to record their reading of four segmentation conditions of Chinese texts, namely, no space condition, word-spaced condition, non-word-spaced condition, and pinyin-spaced condition. Results show that: (1) CSL learners with intermediate Chinese proficiency generally spent less time reading Chinese texts with spaces between words, and they showed more gazes and regressions when reading texts without spaces; (2) Non-word-spaced texts and Pinyin-spaced texts interfere with CSL learners’ reading process; and (3) Intermediate CSL learners show consistent eye movement patterns in the normal no-space condition and word-spaced condition. I conclude that word boundary information can effectively guide CSL learners’ eye movement behaviors and eye saccade planning, thus improving reading efficiency.
- Research Article
6
- 10.1038/s41598-022-25759-1
- Jan 6, 2023
- Scientific Reports
- Danhui Wang + 7 more
Interword spaces exist in the texts of many languages that use alphabetic writing systems. In most cases, interword spaces, as a kind of word boundary information, play an important role in the reading process of readers. Tibetan also uses alphabetic writing, but its text has no spaces between words as word boundary markers. Instead, there are intersyllable tshegs (“་”), which are superscript dots. Interword spaces play an important role in reading as word boundary information. Therefore, it is interesting to investigate the role of tshegs and what effect replacing tshegs with spaces will have on Tibetan reading. To answer these questions, Experiment 1 was conducted in which 72 Tibetan undergraduates read three-syllable-boundary conditions (normal, spaced, and untsheged). However, in Experiment 1, because we performed the experimental operations of deleting tshegs and replacing tshegs, the spatial information distribution of Tibetan sentences under different operating conditions was different, which may have a certain potential impact on the experimental results. To rule out the underlying confounding factor, in Experiment 2, 58 undergraduates read sentences for both untsheged and alternating-color conditions. Overall, the global and local analyses revealed that tshegs, spaces, and alternating-color markers as syllable boundaries can help readers segment syllables in Tibetan reading. In Tibetan reading, both spaces and tshegs are effective visual syllable segmentation cues, and spaces are more effective visual syllable segmentation cues than tshegs.
- Research Article
4
- 10.16910/jemr.14.1.6
- May 31, 2021
- Journal of Eye Movement Research
- Ehab W Hermena
Persian is an Indo-Iranian language that features a derivation of Arabic cursive script, where most letters within words are connectable to adjacent letters with ligatures. Two experiments are reported where the properties of Persian script were utilized to investigate the effects of reducing interword spacing and increasing the interletter distance (ligature) within a word. Experiment 1 revealed that decreasing interword spacing while extending interletter ligature by the same amount was detrimental to reading speed. Experiment 2 largely replicated these findings. The experiments show that providing the readers with inaccurate word boundary information is detrimental to reading rate. This was achieved by reducing the interword space that follows letters that do not connect to the next letter in Experiment 1, and replacing the interword space with ligature that connected the words in Experiment 2. In both experiments, readers were able to comprehend the text read, despite the considerable costs to reading rates in the experimental conditions.
- Research Article
27
- 10.1609/aaai.v35i16.17706
- May 18, 2021
- Proceedings of the AAAI Conference on Artificial Intelligence
- Shan Zhao + 4 more
Word-character lattice models have been proved to be effective for Chinese named entity recognition (NER), in which word boundary information is fused into character sequences for enhancing character representations. However, prior approaches have only used simple methods such as feature concatenation or position encoding to integrate word-character lattice information, but fail to capture fine-grained correlations in word-character spaces. In this paper, we propose DCSAN, a Dynamic Cross- and Self-lattice Attention Network that aims to model dense interactions over word-character lattice structure for Chinese NER. By carefully combining cross-lattice and self-lattice attention modules with gated word-character semantic fusion unit, the network can explicitly capture fine-grained correlations across different spaces (e.g., word-to-character and character-to-character), thus significantly improving model performance. Experiments on four Chinese NER datasets show that DCSAN obtains state-of-the-art results as well as efficiency compared to several competitive approaches.
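A word-character lattice is typically built by matching every substring of the character sequence against a lexicon, so that each matched word contributes boundary information to the characters it spans. The sketch below shows only that matching step under assumed names (`match_lexicon`, `max_len`) and a toy lexicon; it is not the DCSAN implementation, and the attention modules described above are not modeled.

```python
from typing import List, Set, Tuple


def match_lexicon(
    chars: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every lexicon word found in chars.

    Spans are half-open [start, end). Only multi-character words are
    matched, since single characters are already in the sequence.
    """
    matches: List[Tuple[int, int, str]] = []
    for start in range(len(chars)):
        last_end = min(len(chars), start + max_len)
        for end in range(start + 2, last_end + 1):
            word = chars[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# Toy lexicon, chosen for illustration only.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
for span in match_lexicon("南京市长江大桥", lexicon):
    print(span)
```

The overlapping matches (for example 南京市 and 市长) show why a lattice carries ambiguous boundary candidates that the model must then weigh.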
- Research Article
13
- 10.1007/s11145-021-10164-3
- May 6, 2021
- Reading and Writing
- Ziming Song + 3 more
There is no obvious boundary information in Chinese reading. It has been shown that the introduction of word boundary information presented with alternating colors without changing the text distribution could significantly improve the reading speed of Chinese children in grade 2 (Perea and Wang in Mem Cognit 45(7):1160−1170, 2017. https://doi.org/10.3758/s13421-017-0717-0 ). However, few studies have examined how the effect of word boundary information on children's oral reading develops and changes as children’s grade increases. The present study asked Chinese children in grades 2–5 to read alternating-color and mono-color text orally and used eye-tracking technology to explore the developmental trajectory of the influence of word boundary information on oral reading. The results indicated that children in grade 2 and grade 3 showed faster reading speeds in the alternating-color condition than in the mono-color condition. In contrast, there was no difference between the two conditions in children in grade 4 and grade 5. We discuss the mechanisms of the findings and the implications for education.
- Research Article
29
- 10.1007/s11145-020-10067-9
- Jul 21, 2020
- Reading and Writing
- Jinger Pan + 3 more
Word boundary information is not marked explicitly in Chinese sentences and word ambiguity happens in Chinese texts. This introduces difficulty to parse characters into words when reading Chinese sentences, especially for beginning readers. In an eye-tracking study, we tested whether explicit word boundary information as provided by alternating text-colors affects reading performance of Chinese children and how such an effect is influenced by individual differences in word segmentation ability. Results showed that across a number of eye-movement measures, grade three children overall benefited from explicit marking of word boundary. Additionally, children with highest word segmentation ability showed the largest benefits in reading speed. We discuss possible implications for education.
- Research Article
41
- 10.1017/s0142716420000211
- May 1, 2020
- Applied Psycholinguistics
- Wei Zhou + 2 more
The present study investigated whether word-boundary information, provided by alternating colors (consistent or inconsistent with word-boundary information) in a Chinese sentence would facilitate the reading of second-language (L2) learners. Thirty-three Korean students were recruited in the eye-movement experiment. Relative to a baseline (i.e., mono-colors) condition, incorrect word segmentation produced closer fixation location toward the beginning of words, longer fixation duration, higher refixation rate, and slower reading speed. In contrast, word segmentation with alternating colors produced further fixation location toward the center of words, shorter fixation duration, lower refixation rate, and faster reading speed. These results indicate that L2 readers are capable of making use of word-boundary knowledge for saccade generation, which can result in a facilitation of reading efficiency.
- Research Article
13
- 10.1016/j.bandl.2019.104663
- Aug 9, 2019
- Brain and Language
- Wei Zhou + 4 more
Alternating-color words influence Chinese sentence reading: Evidence from neural connectivity
- Research Article
37
- 10.3758/s13421-018-0797-5
- Feb 12, 2018
- Memory & Cognition
- Wei Zhou + 4 more
During sentence reading, low spatial frequency information afforded by spaces between words is the primary factor for eye guidance in spaced writing systems, whereas saccade generation for unspaced writing systems is less clear and under debate. In the present study, we investigated whether word-boundary information, provided by alternating colors (consistent or inconsistent with word-boundary information) influences saccade-target selection in Chinese. In Experiment 1, as compared to a baseline (i.e., uniform color) condition, word segmentation with alternating color shifted fixation location towards the center of words. In contrast, incorrect word segmentation shifted fixation location towards the beginning of words. In Experiment 2, we used a gaze-contingent paradigm to restrict the color manipulation only to the upcoming parafoveal words and replicated the results, including fixation location effects, as observed in Experiment 1. These results indicate that Chinese readers are capable of making use of parafoveal word-boundary knowledge for saccade generation, even if such information is unfamiliar to them. The present study provides novel support for the hypothesis that word segmentation is involved in the decision about where to fixate next during Chinese reading.
- Research Article
28
- 10.1037/xhp0000425
- Sep 1, 2017
- Journal of Experimental Psychology: Human Perception and Performance
- Aaron Veldre + 2 more
We examined the effect of individual differences in written language proficiency on unspaced text reading in a large sample of skilled adult readers who were assessed on reading comprehension and spelling ability. Participants' eye movements were recorded as they read sentences containing a low or high frequency target word, presented with standard interword spacing, or in one of three unsegmented text conditions that either preserved or eliminated word boundary information. The average data replicated previous studies: unspaced text reading was associated with increased fixation durations, a higher number of fixations, more regressions, reduced saccade length, and an inflation of the word frequency effect. The individual differences results provided insight into the mechanisms contributing to these effects. Higher reading ability was associated with greater overall reading speed and fluency in all conditions. In contrast, spelling ability selectively modulated the effect of interword spacing with poorer spelling ability predicting greater difficulty across the majority of sentence- and word-level measures. These results suggest that high quality lexical representations allowed better spellers to extract lexical units from unfamiliar text forms, inoculating them against the disruptive effects of being deprived of spacing information.
- Research Article
- 10.1007/s10579-016-9354-7
- May 21, 2016
- Language Resources and Evaluation
- Shinsuke Mori + 1 more
In this paper, we investigate the relative effect of two strategies for language resource addition for Japanese morphological analysis, a joint task of word segmentation and part-of-speech tagging. The first strategy is adding entries to the dictionary and the second is adding annotated sentences to the training corpus. The experimental results showed that addition of annotated sentences to the training corpus is better than the addition of entries to the dictionary. In particular, adding annotated sentences is especially efficient when we add new words with contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some words are annotated with word boundary information. According to this knowledge, we performed real annotation experiments on invention disclosure texts and observed word segmentation accuracy. Finally we investigated various language resource addition cases and introduced the notion of non-maleficence, asymmetricity, and additivity of language resources for a task. In the WS case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, NLP tool providers, to distribute only one general-domain model trained from all the language resources we have.