Abstract

Conventional approaches to processing records of linguistic origin for storage and retrieval tend to regard the data as immutable. The data generally exhibit great variety and disparate frequency distributions, which are largely ignored and which entail either the storage of extensive lists of items or the use of complex numerical algorithms such as hash coding. The results in each case are far from ideal.
 The variety-generator approach seeks to reflect the microstructure of data elements in their description for storage and search, and takes advantage of the consistency of statistical characteristics of data elements in homogeneous data bases.
 In this paper, the application of the variety-generator approach to the description of personal author names from the INSPEC data base by means of small sets of keys is detailed. It is shown that high degrees of partitioning of names can be obtained by key-sets generated from the initial characters of surnames, from the terminal characters of surnames, and from the initials.
 The implications of the findings for computer-based bibliographical information systems are discussed.

Highlights

  • The application of computer technology to the storage of bibliographic data bases and to the selection of items from them on the basis of the content of specified data elements poses considerable problems

  • Among the most important of these, from the viewpoint of the efficiency of computer use, is the fact that many of the individual data elements exhibit great variety, and show relatively disparate distributions. This behavior is encountered in different degrees in regard to items such as words in the titles of monograph or periodical ar

  • Conventional approaches to processing records comprising linguistic data tend to disregard the statistical properties of the items, and attempt to overcome the resultant problems either by storage of extensive lists of items or by using complex numerical algorithms

Summary

INTRODUCTION

The application of computer technology to the storage of bibliographic data bases, and to the selection of items from them on the basis of the content of specified data elements, poses considerable problems. Application of the procedure to the surnames of the 50,000-name file (the name records had a maximum of eighteen characters, left-justified and space-filled if shorter), with a threshold frequency of 300 (i.e., a probability of 0.006), gave a key-set of eighty-seven keys, including all the alphabetic characters. As the size of the key-set increases, the range of probabilities represented among the keys narrows and the relative entropy of the distribution increases, eventually approaching a value of one asymptotically. This is illustrated for the surnames in a file of 50,000 entries. This key-set consists of the twenty-six characters, seventy-eight digrams,
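The relationship between a frequency threshold, the resulting key-set, and the relative entropy of the key distribution can be made concrete with a small sketch. The code below is an illustrative simplification and not the procedure used in the paper: it assumes keys are surname prefixes of up to three characters, keeps every single character that occurs, keeps a longer prefix only if its count reaches the threshold, and assigns each name to its longest matching key. The function names (select_keys, assign_key, relative_entropy) and the sample surnames are hypothetical. Relative entropy is taken here in the usual sense of the Shannon entropy of the key distribution divided by its maximum, log2 of the number of keys.

```python
import math
from collections import Counter


def select_keys(surnames, threshold, max_len=3):
    """Hypothetical simplification: choose variable-length prefix keys."""
    counts = Counter()
    for name in surnames:
        name = name.strip()  # records may be space-filled to a fixed width
        for n in range(1, min(max_len, len(name)) + 1):
            counts[name[:n]] += 1
    # Keep all single characters so every name matches some key;
    # keep longer prefixes only when they are frequent enough.
    return {k for k, c in counts.items() if len(k) == 1 or c >= threshold}


def assign_key(name, keys, max_len=3):
    """Return the longest key that is a prefix of the name."""
    name = name.strip()
    for n in range(min(max_len, len(name)), 0, -1):
        if name[:n] in keys:
            return name[:n]
    raise ValueError(f"no key matches {name!r}")


def relative_entropy(counts):
    """Shannon entropy of the counts divided by log2(number of keys)."""
    total = sum(counts)
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(len(counts)) if len(counts) > 1 else 1.0


# Tiny made-up sample; the paper's file held 50,000 names.
sample = ["SMITH", "SMYTH", "SMALL", "SNOW", "JONES", "JOHNSON", "JOHNS", "BROWN"]
keys = select_keys(sample, threshold=3)
usage = Counter(assign_key(name, keys) for name in sample)
# Include keys that no name was assigned to, with a count of zero.
print(sorted(keys), relative_entropy([usage[k] for k in sorted(keys)]))

# The threshold quoted in the text: 300 occurrences in 50,000 names.
print(300 / 50_000)
```

A value near one means the keys are used almost equally often, which is the property the text describes as the key-set grows. On a sample this small the figure is only indicative.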

H CH ICH