A LITERARY PROBLEM which promises to be with us for yet some time is the objective characterization of style. Of the numerous attacks which have been made on this problem, the one most amenable to formulation as a computer problem is the study of word frequency data. This approach has been especially popular in studying authorship disputes, e.g., in the case of the Federalist papers (Mosteller and Wallace, 1964) or Diderot's authorship in the Encyclopédie (Frautschi, 1970).

If one is going to count words, it is necessary to decide which words to count. Even counting all words amounts to a decision, and in fact would quite likely be a mistake. Since an author's style is presumed to be independent of the content of his writings (with the possible exception that style may be influenced by genre), those words which are dependent on content should probably not be counted. What one is after are those personal idiosyncrasies which may well be unknown to the writer himself, what Paisley (1964, p. 220) calls "minor encoding habits." This notion arises in the study of paintings, and Paisley (op. cit., p. 225), drawing on Berenson, Morelli, and Mosteller and Wallace, defines good markers of personal style by four properties: 1) lack of prominence, to guard against imitation; 2) mechanical execution on the part of the creator, thereby exhibiting low variance from work to work of a single creator; 3) use not wholly dictated by convention and exhibiting high variance from the work of others; and 4) occurrence that is not overly rare, with the frequency of occurrence high relative to sampling error. (Somewhat analogous considerations on what to count occur in many of the social sciences. Examples are discussed in Webb et al., 1973.)

In the context of word frequencies, the most likely candidates for meeting the criteria above are the so-called "function words," basically prepositions, common adverbs, pronouns, articles, and the like. On this point, Mosteller and Wallace (op. cit., p. 265) are quite definite: "Context is a source of risk. We need variables that depend on authors and nothing else. Some function words come close to this ideal, but most other words do not. So many words and other variables depend on topics that their exploration for differences between authors