(ProQuest: ... denotes formulae omitted.)

1 Introduction

Intrinsic plagiarism detection means identifying the parts of a document whose writing style differs from that of the document's author. Those parts are then used as input for verification with external plagiarism detection tools. If a document is written by a single author, its passages are expected to be similar, in accordance with that author's unique writing style.

For papers written by multiple authors, comparing the writing style of each part of the text and applying unsupervised automatic classification techniques groups those parts into clusters according to the author they belong to. Plagiarism detection based on this type of analysis requires extracting the unique writing style of each author, a method also called stylometry analysis. Given a set of characteristics that best and uniquely describes an author's writing style, a metric is created that expresses, as a percentage, the membership of documents to authors. The research in [3], [4], [5], [6], [7], [8] and [10] discusses problems and methods related to intrinsic plagiarism, together with stylometry: the writing style of a specific author across his research history or within a single document.

Regardless of the type of plagiarism evaluation, intrinsic or external, it is very important to determine the set of characteristics that must be taken into account in order to obtain results that are as accurate as possible. Those characteristics depend on the set of analysed documents, the language in which the documents are written and the type of documents. The present paper addresses literary documents written in English by native English or European authors.
Multidimensional analysis is used to extract, from the initial set of documents, the semantic information that describes the stylometry. Chapter 2 presents the relation between semantic analysis and the main vocabulary richness metrics, which are used to derive a value indicator from the words found in the analysed authors' documents, transformed into tokens, and from the semantic distances between them. The notions of word, token and frequency of appearance are presented along with the main set of writing style features. The pre-processing phase, a step needed to convert words into WordNet tokens, is also presented.

In chapter 3, the improved semantic vocabulary richness metric is presented and defined, along with an example of applying it to a given phrase. The time evolution analysis is carried out in chapter 4, where the values for 13 years are arranged into a time series. The trend indicator is evaluated using three methods: absolute mean change, average index and linear regression. After comparing the sum of squared errors of the three methods, the linear regression method is chosen for the forecast. The conclusions are drawn in chapter 5, along with directions for future work.

2 Vocabulary Richness Metrics In Stylometry Analysis

When analysing an author's writing style, in the context of either external or intrinsic plagiarism analysis, vocabulary richness is defined as the characteristic of the author that expresses the degree to which the author draws words from a wider or narrower vocabulary.
Works such as [1] and [2] have shown this feature to be closely related to the author, so it can be included in the optimal set of writing style features.

Table 1 lists the metrics used to assess vocabulary richness within the set of writing style features, together with the variables used in the formulas of the metrics presented in this paper [1], where:
N - total number of words in the analysed document;
V - total number of concepts identified in the set of words;
V_i - total number of concepts that appear i times in the document;
p_v - the relative frequency of the v-th most frequent concept in the document. …
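As an illustration of the variables N, V, V_i and p_v defined above, the following is a minimal sketch of how they might be counted for a short text. It makes two assumptions that are not taken from the paper: lowercase alphabetic tokens stand in for the WordNet concepts, and the function name `vocabulary_counts` is hypothetical. The type-token ratio V/N at the end is a commonly used richness measure shown only as an example, since the formulas of Table 1 are omitted in this excerpt.

```python
import re
from collections import Counter


def vocabulary_counts(text):
    """Return N, V, the V_i spectrum and the ranked relative frequencies p_v."""
    # Assumption: lowercase alphabetic tokens stand in for the paper's concepts.
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(tokens)

    n = len(tokens)  # N: total number of words
    v = len(freq)    # V: number of distinct concepts
    # V_i: number of concepts that occur exactly i times
    v_i = dict(Counter(freq.values()))
    # p_v: relative frequency of the v-th most frequent concept (list index v - 1)
    p = [count / n for _, count in freq.most_common()] if n else []
    return n, v, v_i, p


sample = "the cat saw the dog and the dog saw the bird"
n, v, v_i, p = vocabulary_counts(sample)

# Type-token ratio: an illustrative measure, not necessarily one of Table 1.
type_token_ratio = v / n
print(n, v, v_i, p[0], type_token_ratio)
```

Two identities are useful as sanity checks on the counts: the V_i values sum to V, and the sum of i * V_i over all i equals N.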