Quantifying Syntagmatic Patterning in Translated and Native Chinese: An R-Motif Approach Based on POS Sequences

Abstract

Research on translation universals has traditionally focused on isolated linguistic features along paradigmatic dimensions because they are easy to interpret. Syntagmatic approaches, which examine how linguistic elements combine sequentially, remain underexplored. This corpus-based study addresses this gap by analysing R-motifs, defined as recurring sequences of part-of-speech (POS) tags, across four genres in translated and native Chinese texts. We investigate the rank-frequency distributions of both R-motif types and motif lengths as potential indicators of translation universals. Our analysis shows that R-motif frequencies in both text types follow the right-truncated Zeta distribution, whereas motif length distributions conform to the Pólya model. A Random Forest classifier is then trained on texts represented by the parameters and attributes of their POS R-motif distributions. The experiments show that combining features from the distribution parameters and attributes detects translationese efficiently. Future research may extend this approach by exploring more granular features beyond part-of-speech sequences.
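The pipeline described in the abstract (segment POS tags into R-motifs, tabulate rank-frequency and length distributions, fit a right-truncated Zeta) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the segmentation rule, so the sketch assumes the definition common in quantitative linguistics, where an R-motif is a maximal run of tags containing no repeated tag. The function names (`r_motifs`, `rank_frequency`, `zeta_rt_pmf`, `fit_zeta_rt`) and the toy tag sequence are hypothetical, and the grid search is a simple stand-in for whatever fitting procedure the study used. The Pólya fit for motif lengths is not covered.

```python
from collections import Counter
import math


def r_motifs(tags):
    """Split a tag sequence into maximal runs that contain no repeated tag.

    Assumed definition of an R-motif; the abstract does not give the rule.
    """
    motifs, current = [], []
    for tag in tags:
        if tag in current:              # a repetition closes the current motif
            motifs.append(tuple(current))
            current = []
        current.append(tag)
    if current:                         # flush the final, possibly short, motif
        motifs.append(tuple(current))
    return motifs


def rank_frequency(items):
    """Return frequencies sorted from most to least frequent (rank 1 first)."""
    return [count for _, count in Counter(items).most_common()]


def zeta_rt_pmf(rank, a, R):
    """Right-truncated Zeta: P(rank) = rank**-a / sum_{j=1..R} j**-a."""
    norm = sum(j ** -a for j in range(1, R + 1))
    return rank ** -a / norm


def fit_zeta_rt(freqs):
    """Grid-search the exponent a that maximises the log-likelihood."""
    R = len(freqs)

    def log_lik(a):
        return sum(f * math.log(zeta_rt_pmf(r, a, R))
                   for r, f in enumerate(freqs, start=1))

    grid = [i / 100 for i in range(1, 501)]   # candidate a in 0.01 .. 5.00
    return max(grid, key=log_lik)


# Toy POS sequence (illustrative tags, not taken from the study's corpora).
tags = ["n", "v", "n", "u", "a", "n", "v", "d", "v", "n"]
motifs = r_motifs(tags)
type_freqs = rank_frequency(motifs)               # motif-type rank-frequency
length_freqs = Counter(len(m) for m in motifs)    # motif-length distribution
print(motifs)
print(type_freqs, dict(length_freqs))
```

On a real corpus, the fitted exponent and summary attributes of these distributions would form one feature vector per text, which could then be passed to a Random Forest classifier.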

Similar Papers
  • Research Article
  • Cite Count Icon 113
  • 10.1075/ijcl.15.1.01xia
How different is translated Chinese from native Chinese?
  • Mar 22, 2010
  • International Journal of Corpus Linguistics
  • Richard Xiao

Corpus-based translation studies focus on translation as a product by comparing comparable corpora of translational and non-translational texts. A number of distinctive features of translational English in relation to native English have been uncovered. Nevertheless, research of this area has so far been confined largely to translational English translated from closely related European languages. If the features of translational language that have been reported on the basis of translated English are to be generalized as ‘translation universals’, it is of vital importance to find supporting evidence from non-European languages. Clearly, evidence from “genetically” distinct language pairs such as English and Chinese is arguably more convincing, if not indispensable. This article explores potential features of translational Chinese on the basis of two balanced monolingual comparable corpora of translated and native Mandarin Chinese. The implications of the study for translation universal hypotheses are also discussed.

  • Book Chapter
  • Cite Count Icon 4
  • 10.1007/978-3-642-41363-6_8
The Features of Translational Chinese and Translation Universals
  • Jan 1, 2015
  • Richard Xiao + 1 more

We have so far analysed and compared translational and non-translational or native Chinese as represented by our corpora LCMC and ZCTC in terms of their macro-statistic features in Chap. 5 and the lexical and grammatical characteristics in Chaps. 6 and 7, while the present chapter is an interface between the empirical findings and theoretical hypotheses, that is, it is a combination of descriptive translation studies with the “pure translation theory” (Holmes 1972/1988). It is important to find these connections for the reason that without a higher level of generalisation, empirical and quantitative discoveries can be meaningless or aimless. We will first of all summarise the discriminatory features of translational Chinese at different levels and then discuss the implications, if any, of these translation-specific features to translation universals hypotheses reviewed in Chap. 3. Due to the fact that the translated corpus (ZCTC) used as the basis of this research consists mostly of translated texts from English and that the parallel corpus (GCEPC) which is used whenever necessary is a corpus of English and Chinese translation, our generalisation for the sake of translation universals should be limited within the particular realm of English-to-Chinese translation.

  • Research Article
  • Cite Count Icon 4
  • 10.3233/jifs-211295
A comparative analysis of euphemistic sentences in news using feature weight scheme and intelligent techniques
  • Feb 2, 2022
  • Journal of Intelligent & Fuzzy Systems
  • K Seethappan + 1 more

Although there has been considerable research on detecting different kinds of figurative language, no prior work addresses the automatic classification of euphemisms. Our primary aim is to present a system for the automatic classification of euphemistic phrases in a document. In this research, a large dataset of 100,000 sentences is collected from different resources for identifying euphemistic and non-euphemistic utterances. Three approaches are explored to improve euphemism classification: (1) a combination of lexical n-gram features, (2) three feature-weighting schemes, and (3) deep learning classification algorithms. Four machine learning algorithms (J48, Random Forest, Multinomial Naïve Bayes, and SVM) and three deep learning algorithms (Multilayer Perceptron, Convolutional Neural Network, and Long Short-Term Memory) are investigated with various combinations of features and feature-weighting schemes to classify the sentences. In our experiments, the Convolutional Neural Network (CNN) achieves a precision of 95.43%, recall of 95.06%, F-score of 95.25%, accuracy of 95.26%, and Kappa of 0.905 using a combination of unigram and bigram features with the TF-IDF feature-weighting scheme. These results show that the CNN with combined unigram and bigram features and TF-IDF weighting outperforms the other six classification algorithms in detecting euphemisms in our dataset.

  • Conference Article
  • Cite Count Icon 10
  • 10.1109/icosst.2014.7029340
Arabic speaker identification system using combination of DWT and LPC features
  • Dec 1, 2014
  • Shahid Munir Shah + 1 more

Speaker recognition plays a significant role in the field of human-computer interaction. In recent years, several researchers have contributed to this field and have successfully built machine learning models for automatic speaker recognition systems. In this paper, we propose an automatic speaker identification system for qaries (Quran reciters) of the Arabic language. Discrete Wavelet Transform (DWT) and Linear Predictive Coding (LPC) were used for feature extraction. Classification was performed by Random Forest (RF). To improve identification accuracy, DWT and LPC features were used both singly (one at a time) and in combination to train the RF. Our system showed the best performance when the RF was trained with the combined features, achieving 90.90% recognition accuracy.

  • Research Article
  • Cite Count Icon 28
  • 10.3390/diagnostics13152538
Analysis of WSI Images by Hybrid Systems with Fusion Features for Early Diagnosis of Cervical Cancer.
  • Jul 31, 2023
  • Diagnostics
  • Mohammed Hamdi + 5 more

Cervical cancer is one of the most common types of malignant tumors in women. In addition, it causes death in the latter stages. Squamous cell carcinoma is the most common and aggressive form of cervical cancer and must be diagnosed early before it progresses to a dangerous stage. Liquid-based cytology (LBC) swabs are best and most commonly used for cervical cancer screening and are converted from glass slides to whole-slide images (WSIs) for computer-assisted analysis. Manual diagnosis by microscopes is limited and prone to manual errors, and tracking all cells is difficult. Therefore, the development of computational techniques is important as diagnosing many samples can be done automatically, quickly, and efficiently, which is beneficial for medical laboratories and medical professionals. This study aims to develop automated WSI image analysis models for early diagnosis of a cervical squamous cell dataset. Several systems have been designed to analyze WSI images and accurately distinguish cervical cancer progression. For all proposed systems, the WSI images were optimized to show the contrast of edges of the low-contrast cells. Then, the cells to be analyzed were segmented and isolated from the rest of the image using the Active Contour Algorithm (ACA). WSI images were diagnosed by a hybrid method between deep learning (ResNet50, VGG19 and GoogLeNet), Random Forest (RF), and Support Vector Machine (SVM) algorithms based on the ACA algorithm. Another hybrid method for diagnosing WSI images by RF and SVM algorithms is based on fused features of deep-learning (DL) models (ResNet50-VGG19, VGG19-GoogLeNet, and ResNet50-GoogLeNet). It is concluded from the systems' performance that the DL models' combined features help significantly improve the performance of the RF and SVM networks. 
The novelty of this research is the hybrid method that combines the features extracted from deep-learning models (ResNet50-VGG19, VGG19-GoogLeNet, and ResNet50-GoogLeNet) with RF and SVM algorithms for diagnosing WSI images. The results demonstrate that the combined features from deep-learning models significantly improve the performance of RF and SVM. The RF network with fused features of ResNet50-VGG19 achieved an AUC of 98.75%, a sensitivity of 97.4%, an accuracy of 99%, a precision of 99.6%, and a specificity of 99.2%.

  • Abstract
  • 10.1016/j.ijrobp.2021.07.081
The Prediction of Mandibular Osteoradionecrosis (ORN) in Head and Neck Radiotherapy Using CT-Derived Radiomic Features
  • Oct 22, 2021
  • International Journal of Radiation Oncology*Biology*Physics
  • R Reiazi + 5 more


  • Research Article
  • Cite Count Icon 3
  • 10.1080/10106049.2024.2380372
Comparison of machine learning and parametric methods for the discrimination of urban land cover types
  • Jan 1, 2024
  • Geocarto International
  • Enkhmanlai Amarsaikhan + 3 more

The aim of this study is to compare the performances of different machine learning and parametric techniques for differentiating highly mixed urban land cover classes in Ulaanbaatar, the capital city of Mongolia, using multisource data sets. For data sources, 17 features are chosen, including the original 10 spectral bands of the Sentinel-2 data; VV, VH, average of HH & HV, and simple ratio of Sentinel-1 data; and normalized difference vegetation index (NDVI), Bare soil index (BSI), and modified soil adjusted vegetation index (MSAVI). Six different feature combinations are used to identify available urban land cover classes. To discriminate existing classes, a support vector machine (SVM), artificial neural network (ANN), random forest (RF), and a statistical maximum likelihood classifier (MLC) are employed and compared. In all six feature combinations, the RF method outperforms the others with an overall accuracy ranging from 81.72% to 90.71%. The SVM has an overall accuracy ranging from 77.77%-83.50%, with the second-highest performance in four combinations and the lowest in two. The ANN has an overall accuracy ranging from 74.83%-83.31%, with poorer performance than the SVM. The MLC's performance varies across feature combinations, with an overall accuracy ranging from 70.96%-85.92%, and the second-highest performance in two feature combinations. Overall, the study shows that multisource information along with additional features/indices can significantly improve the classification of mixed urban land cover types, and for the given test site, the RF technique is the best option for producing a dependable land cover map.

  • Research Article
  • 10.1515/geo-2025-0838
Hybrid methods for land use and land cover classification using remote sensing and combined spectral feature extraction: A case study of Najran City, KSA
  • Sep 13, 2025
  • Open Geosciences
  • Mohammed Alshahrani + 4 more

In recent years, the classification of land change has revolutionized the ability to monitor and understand dynamic changes occurring on the Earth’s surface. Artificial intelligence (AI) techniques must improve the performance and accuracy of land change detection by extracting spectral features from several Convolutional Neural Networks (CNNs) and integrating them. In this study, AI techniques were applied to classify the land use and land cover (LULC) of the Najran city map in Saudi Arabia based on 2020 Landsat 8 satellite imagery. This was achieved using several hybrid models combining CNN and random forest (RF) models, namely AlexNet-RF and GoogLeNet-RF, as well as the combined spectral features of AlexNet-GoogLeNet with RF. The results showed that LULC classification using a hybrid system was superior to CNN and proved that the proposed hybrid system of combined spectral features extracted from AlexNet-GoogLeNet with RF provided better results than using the hybrid system proposed by AlexNet with RF and GoogLeNet with RF. The proposed hybrid system of combined spectral features extracted from AlexNet-GoogLeNet with RF achieved an accuracy of 96.95%, a Kappa coefficient of 0.9638, sensitivity of 96.95%, AUC of 98.4%, and specificity of 99.83%. The proposed hybrid methods aim to enhance the classification accuracy and increase the robustness of the system, ensuring consistent performance across diverse earth-change scenarios. It substantially impacts various domains, including environmental monitoring, disaster management, and sustainable urban planning.

  • Research Article
  • Cite Count Icon 7
  • 10.1590/s0100-67622004000600019
Momentos-L: teoria e aplicação em hidrologia (L-moments: theory and application in hydrology)
  • Dec 1, 2004
  • Revista Árvore
  • Ana Esmeria Lacerda Valverde + 3 more

This technical note was written to present the L-moments method, which has been proposed for estimating the parameters of the main probability distributions used in hydrological studies. It also aimed to infer which type of statistical distribution is most often employed in specific applications. Based on the review, it was concluded that, when analysing extreme-event data, it is advisable to test the goodness of fit of at least the following three-parameter distributions: Generalized Logistic, Generalized Extreme Value, Generalized Normal, Pearson type III, and Generalized Pareto. It was also concluded that the parameters of these distributions, and their quantiles, should be estimated using L-moments derived from probability-weighted moments.

  • Research Article
  • Cite Count Icon 1
  • 10.52436/1.jutif.2025.6.5.5128
Comparative Analysis of CNN, SVM, Decision Tree, Random Forest, and KNN for Maize Leaf Disease Detection Using Color and Texture Feature Extraction
  • Oct 21, 2025
  • Jurnal Teknik Informatika (Jutif)
  • Nurhikma Arifin + 1 more

Corn (Zea mays L.) is an important agricultural commodity in Indonesia, serving as the second staple food after rice and playing a crucial role in supporting national food security. However, corn production is frequently threatened by sudden outbreaks of pests and diseases, making accurate early detection essential to maintaining yield stability. This study aims to detect maize leaf diseases using five classification algorithms: Support Vector Machine (SVM), Decision Tree, K-Nearest Neighbors (KNN), Random Forest, and Convolutional Neural Network (CNN). These algorithms were tested using a combination of texture and color features, including Gray Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), Hue-Saturation-Value (HSV), and L*a*b*. The dataset used consists of 2,048 maize leaf images classified into four categories: Blight, Common Rust, Gray Leaf Spot, and Healthy, with 512 images per class. Each class was divided into training and testing sets to train and evaluate the classification models. The results show that CNN achieved the highest accuracy of 93.93% when using a complete combination of color and texture features. Meanwhile, SVM also demonstrated high performance, achieving the same accuracy (93.93%) using only the combination of color features (HSV and L*a*b*). Random Forest and Decision Tree performed best when using color features alone, with accuracies of 89.81% and 87.14%, respectively. These findings indicate that color features have a dominant influence on classification accuracy, and that combining color and texture features can significantly enhance model performance, particularly in CNN architectures. This study contributes to the development of early disease detection systems in precision agriculture.

  • Research Article
  • Cite Count Icon 41
  • 10.1016/j.bspc.2011.10.001
Evaluating and comparing performance of feature combinations of heart rate variability measures for cardiac rhythm classification
  • Oct 21, 2011
  • Biomedical Signal Processing and Control
  • Alan Jovic + 1 more


  • Research Article
  • Cite Count Icon 37
  • 10.3390/rs14071608
Forest Above-Ground Biomass Inversion Using Optical and SAR Images Based on a Multi-Step Feature Optimized Inversion Model
  • Mar 27, 2022
  • Remote Sensing
  • Wangfei Zhang + 5 more

Forest biomass change monitoring is essential for climate change. Synthetic aperture radar (SAR) and optical remote sensing (RS) data are two very helpful data sources for forest biomass monitoring and estimation. When estimating biomass with RS techniques, optimal feature selection and the choice of estimation model are two critical steps. This paper therefore focuses on building an operational and robust method of biomass retrieval using optical and SAR RS data. First, random forest (RF) algorithms are used to reduce computation time and computational burden; then, an iterative procedure is embedded in K-nearest neighbor (KNN) algorithms to select and combine the best features; last, the best feature combinations and KNN models are applied for forest biomass estimation. Forest type effects and RS feature source effects were also considered. The results showed that the feature combination of two optical images and the SAR image gave the highest estimation accuracy with the proposed algorithm (R² = 0.70 for Forest-1, R² = 0.72 for Forest-2, and R² = 0.71 for Forest-3; RMSE = 16.18 Mg/ha for Forest-1, RMSE = 17.66 Mg/ha for Forest-2, and RMSE = 18.67 Mg/ha for Forest-3, where Forest-1 is natural pure forests of Yunnan Pines, Forest-2 is natural mixed coniferous forests, and Forest-3 is the combination of Forest-1 and Forest-2). In a comparison of the proposed algorithm with different non-parametric algorithms, traditional non-parametric algorithms performed better in Forest-1 but worse in Forest-2 and Forest-3, while the proposed algorithm showed no obvious difference in performance across the three forest types and five feature groups. The results revealed that the proposed algorithm was robust in biomass estimation, with almost no dependence on feature source or forest structure.

  • Research Article
  • Cite Count Icon 10
  • 10.1016/j.biosystemseng.2021.11.021
Estimating the total nitrogen content of Aquilaria sinensis leaves based on a hybrid feature selection algorithm and image data from a modified digital camera
  • Dec 9, 2021
  • Biosystems Engineering
  • Zhulin Chen + 2 more


  • Research Article
  • Cite Count Icon 506
  • 10.1093/nar/gkm368
MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features.
  • May 8, 2007
  • Nucleic Acids Research
  • P Jiang + 5 more

To distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (pseudo pre-miRNAs), a hybrid feature which consists of local contiguous structure-sequence composition, minimum of free energy (MFE) of the secondary structure and P-value of randomization test is used. Besides, a novel machine-learning algorithm, random forest (RF), is introduced. The results suggest that our method predicts at 98.21% specificity and 95.09% sensitivity. When compared with the previous study, Triplet-SVM-classifier, our RF method was nearly 10% greater in total accuracy. Further analysis indicated that the improvement was due to both the combined features and the RF algorithm. The MiPred web server is available at http://www.bioinf.seu.edu.cn/miRNA/. Given a sequence, MiPred decides whether it is a pre-miRNA-like hairpin sequence or not. If the sequence is a pre-miRNA-like hairpin, the RF classifier will predict whether it is a real pre-miRNA or a pseudo one.

  • Research Article
  • Cite Count Icon 130
  • 10.1016/j.artmed.2010.09.005
Electrocardiogram analysis using a combination of statistical, geometric, and nonlinear heart rate variability features
  • Oct 25, 2010
  • Artificial Intelligence in Medicine
  • Alan Jovic + 1 more

