Abstract

The number of parameters required by a word N‐gram model equals the N‐th power of the vocabulary size. Compressing the parameter space is therefore vital, depending on the application. In this research, singular value decomposition (SVD) is applied to an N‐pair word co‐occurrence matrix, and word and phrase states are treated as vectors in a K‐dimensional space. The authors then attempt to compress the N‐gram probability parameter space using a lower‐dimensional approximation of the original matrix. The results show that, in the vector space, the Trigram model can be represented using roughly 17.5% fewer parameters. In addition, clustering is performed based on distance in the defined space to investigate whether words are positioned appropriately in the linear space. A comparison using the same number of parameters confirms that the entropy is lower than that of the class model obtained by maximizing mutual information, and that the positioning is good. © 2003 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 86(8): 61–70, 2003; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.10106
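The low-rank idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' construction: it assumes a toy corpus, a plain bigram (N = 2) count matrix in place of the N‐pair co‐occurrence matrix, and hypothetical helper names (`bigram_counts`, `truncated_svd`). It only shows how keeping K singular values yields K‐dimensional word vectors and a smaller parameter count.

```python
import numpy as np


def bigram_counts(tokens, vocab):
    """Build a |V| x |V| matrix whose (i, j) entry counts word i followed by word j."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[index[prev], index[nxt]] += 1.0
    return counts


def truncated_svd(matrix, k):
    """Return the factors (U_k, s_k, Vt_k) for the k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


# Toy corpus (an assumption for illustration only).
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
counts = bigram_counts(tokens, vocab)

k = 2  # number of retained dimensions (the K of the abstract)
u_k, s_k, vt_k = truncated_svd(counts, k)

# Each row is a K-dimensional vector for one history word.
word_vectors = u_k * s_k

# Rank-k approximation of the original count matrix.
approx = word_vectors @ vt_k

# Parameters stored: the full matrix versus the three truncated factors.
full_params = counts.size
compressed_params = u_k.size + s_k.size + vt_k.size

print("full parameters:      ", full_params)
print("compressed parameters:", compressed_params)
print("reconstruction error: ", np.linalg.norm(counts - approx))
```

With a realistic vocabulary the gap between the two parameter counts grows quickly, because the full matrix scales with the square of the vocabulary size while the factors scale linearly in it for fixed K. Turning the approximated counts into valid probabilities (non-negative, normalized rows) is a separate step that this sketch omits.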
