Abstract

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance, but also, and quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.

Highlights

  • arXiv:2007.03834v4 [cs.CL], 27 Nov 2021. Their interrelations? How do logic and propositional entailment arise? Even so, today’s state-of-the-art statistical language models are quite impressive for being built only from correlations in unstructured text data. This observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We propose that enriched category theory provides a natural home for the answer.

  • Category theory gives a principled means of organizing “what goes with what” in a corpus of text along with the statistics of the resulting expressions, which is precisely the information used as input to today’s statistical language models. We turn to another fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? In other words, what is a representation of this mathematical structure? We propose the answer lies in a functor from our enriched category of text to a particular enriched category of linear operators.

  • The statistics of language observed in corpora of text resemble those observed in one-dimensional quantum critical systems, whose ground states are known to be well approximated by low-rank tensor factorizations; see [LT17, and references therein] as well as [KMH+20, GO19, EV11, PTV17, PV17].
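
To make the phrase "a category enriched over probabilities" from the highlights concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's construction: the toy corpus, the restriction to prefixes, and the names CORPUS, count_with_prefix, hom, prefixes and check_enrichment are all hypothetical. The hom value between two sequences x and y is taken to be the empirical probability that y extends x, and the check verifies the two conditions one expects of an enrichment over the unit interval with multiplication: hom(x, x) = 1 and hom(x, y) · hom(y, z) ≤ hom(x, z).

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy corpus: each entry is one sequence of symbols (words).
CORPUS = [
    ("red", "ruby"),
    ("red", "ruby"),
    ("red", "rose"),
    ("blue", "sky"),
]


def count_with_prefix(prefix):
    """Number of corpus sequences that begin with `prefix`."""
    return sum(1 for seq in CORPUS if seq[:len(prefix)] == prefix)


def hom(x, y):
    """Empirical probability that y extends x; 0 if x is not a prefix of y.

    Assumption: prefixes stand in for the more general notion of one
    expression containing another.
    """
    if y[:len(x)] != x:
        return Fraction(0)
    count_x = count_with_prefix(x)
    if count_x == 0:
        return Fraction(0)
    return Fraction(count_with_prefix(y), count_x)


def prefixes():
    """All non-empty prefixes of corpus sequences: the objects of the toy category."""
    found = set()
    for seq in CORPUS:
        for k in range(1, len(seq) + 1):
            found.add(seq[:k])
    return sorted(found)


def check_enrichment(objects):
    """Check hom(x, x) == 1 and hom(x, y) * hom(y, z) <= hom(x, z) for all triples."""
    if any(hom(x, x) != 1 for x in objects):
        return False
    return all(
        hom(x, y) * hom(y, z) <= hom(x, z)
        for x, y, z in product(objects, repeat=3)
    )
```

In this toy corpus, hom(("red",), ("red", "ruby")) is 2/3 because two of the three sequences starting with "red" continue with "ruby", and check_enrichment(prefixes()) confirms the composition inequality for every triple of prefixes.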

Summary

Related Work

Tensor network language models have been explored previously [ZSZ+18, ZZM+19, GO19], though to our knowledge these efforts do not seek to identify the mathematical structure in unstructured text data, nor do they ask for a faithful representation of such structure or work under the hypothesis that tensor networks are candidate representations of it. This foundational perspective is absent from quantum language models such as [BT17, SNB13, LZSH16, CPD20, ZNS+18, LWM19]. The primary role of the Loewner order in our work is that it instantiates the existence of an enriched functor that preserves the mathematical structure present in corpora of text. This perspective is key to the work below and is absent from the approaches listed above. For an introduction to these visual representations, see [Sto19, BB17, Orú14, Eve19] or [Bra20, Section 2.2.2].
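
For readers unfamiliar with the Loewner order mentioned above, the following sketch may help. The order says A ≤ B exactly when B − A is positive semidefinite. The code is a minimal illustration restricted to real symmetric 2×2 matrices, where positive semidefiniteness reduces to nonnegative diagonal entries and a nonnegative determinant. The function names is_psd_2x2 and loewner_leq and the example matrices are assumptions made for illustration; they are not the reduced density operators constructed in the paper.

```python
def is_psd_2x2(m, tol=1e-12):
    """Return True if the real symmetric 2x2 matrix `m` is positive semidefinite.

    For a symmetric 2x2 matrix this holds exactly when both diagonal
    entries and the determinant are nonnegative.
    """
    (a, b), (c, d) = m
    if abs(b - c) > tol:
        raise ValueError("matrix must be symmetric")
    return a >= -tol and d >= -tol and a * d - b * c >= -tol


def loewner_leq(A, B):
    """Return True if A <= B in the Loewner order, i.e. B - A is PSD."""
    diff = [[B[i][j] - A[i][j] for j in range(2)] for i in range(2)]
    return is_psd_2x2(diff)


# Hypothetical example matrices (not taken from the paper).
A = [[2, 1], [1, 1]]
B = [[5, 1], [1, 3]]   # B - A = diag(3, 2), which is PSD, so A <= B
P = [[1, 0], [0, 0]]
Q = [[0, 0], [0, 1]]   # P and Q are incomparable: neither difference is PSD
```

Here loewner_leq(A, B) holds while loewner_leq(B, A) does not, and P and Q show that the Loewner order is only a partial order: some pairs of operators are not comparable at all.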

Modeling Probability Distributions with Density Operators
Understanding Reduced Densities
Why Densities for Language?
Assigning Reduced Densities to Words
Preserving a Preorder Structure
Language as a Preordered Set
Language as a Category Enriched Over Probabilities
Conclusion
