Phraseological units in academic English texts have been a central focus of recent corpus linguistic research. This paper describes a special category of clause-level phraseological units, namely Characteristic Sentence Stems (CSSs), and sets out the criteria for identifying them and a method for extracting them. CSSs are contiguous lexico-grammatical sequences that contain a subject-predicate structure and serve as frame expressions characteristic of academic writing. The extraction method consists of six steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range calculation, and overlapping sequence reduction. The significance of occurrence calculation is the crux of the method. It computes both the internal association and the boundary independence of a CSS, and thus tests the significance of its occurrence from both an internal and an external perspective. Our methods and results suggest that CSSs can be statistically defined and extracted from corpora, and that they can be employed in large-scale studies to account more fully for the phraseological features of non-native English academic writing.
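As a rough illustration of three of the six steps (boundary independence, text range, and overlapping sequence reduction), the following minimal Python sketch may help. It is not the authors' implementation: the abstract does not name the statistics used, so neighbour entropy as the boundary measure, the equal-frequency rule for dropping overlaps, and all function names are assumptions made for this example. POS tagging, structure identification, and internal association are omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring tokens."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def boundary_independence(texts, gram):
    """Assumed measure: the lower of left- and right-neighbour entropy.

    A high value means the sequence occurs in varied contexts, so its
    edges are unlikely to cut through a longer fixed expression.
    """
    n = len(gram)
    left, right = Counter(), Counter()
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == gram:
                # "<s>" and "</s>" are hypothetical text-boundary markers.
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + n] if i + n < len(tokens) else "</s>"] += 1
    return min(entropy(left), entropy(right))


def text_range(texts, gram):
    """Number of texts in which the sequence occurs at least once."""
    return sum(1 for tokens in texts if gram in ngrams(tokens, len(gram)))


def is_subsequence(short, long):
    """True if `short` is a contiguous sub-sequence of `long`."""
    return any(long[i:i + len(short)] == short
               for i in range(len(long) - len(short) + 1))


def reduce_overlaps(freqs):
    """Assumed rule: drop a sequence if a longer kept one contains it
    and occurs exactly as often."""
    kept = []
    for gram in sorted(freqs, key=len, reverse=True):
        covered = any(len(k) > len(gram) and is_subsequence(gram, k)
                      and freqs[k] == freqs[gram] for k in kept)
        if not covered:
            kept.append(gram)
    return kept
```

Under these assumptions, a sequence such as "it is clear that" scores high on boundary independence when its neighbouring words vary, whereas the fragment "is clear" scores zero if it is always preceded by "it" and followed by "that".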