Abstract
Text similarity calculation with text embeddings requires fine-tuning the language model on a large amount of labeled data, which may not be available for small text collections in specific knowledge domains, particularly in public organizations. As an alternative to machine learning, this research proposes pairwise term co-occurrence within plain-text matching, i.e., the query and the document sharing co-occurrences of two terms within a text span. Across the entire document, these co-occurrences form a context that affects each term. This is analogous to a contextual word embedding, except that our context affects the importance, not the meaning, of the term. Pairwise term co-occurrence was applied in three text similarity calculation methods: term-pair-based text similarity, BM25 with term weights enhanced by pairwise term co-occurrence, and likewise enhanced cosine similarity. The three methods were evaluated for retrieval of four text types – email messages, web articles, fill-in forms, and brochures from a public organization – with the first three serving as queries. Pairwise term co-occurrence performed on par with or better than BERT sentence embeddings without fine-tuning the BERT language model. With some text types, pairwise term co-occurrence outperformed bag-of-words matching by as much as 29.44 (MAP) and 31.71 (P@1) percentage points. Pairwise term co-occurrence can fill a niche by improving text similarity calculation where supervised machine learning is difficult to carry out.
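To make the core idea concrete, here is a minimal sketch of pairwise term co-occurrence counting and a co-occurrence-based term weight. It is not the authors' implementation: the window size, the unordered-pair representation, and the log-dampened weighting formula are illustrative assumptions.

```python
from collections import Counter
from math import log


def pair_cooccurrences(tokens, window=5):
    """Count unordered term pairs that co-occur within a text span (window)."""
    pairs = Counter()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if t != u:
                pairs[tuple(sorted((t, u)))] += 1
    return pairs


def cooccurrence_weight(term, doc_pairs, query_pairs):
    """Weight a term by the pairwise co-occurrences it takes part in that are
    shared between the query and the document (hypothetical dampened weight)."""
    shared = sum(
        count
        for pair, count in doc_pairs.items()
        if term in pair and pair in query_pairs
    )
    return 1.0 + log(1.0 + shared)


# Toy usage: "tax" joins two shared pairs (tax, form) and (deadline, tax),
# so it receives a higher weight than a term with no shared co-occurrences.
query = "tax form deadline".split()
doc = "the deadline for the tax form is in april".split()
q_pairs = pair_cooccurrences(query)
d_pairs = pair_cooccurrences(doc)
print(cooccurrence_weight("tax", d_pairs, q_pairs))
```

A weight of this kind could then multiply a term's contribution in BM25 or in a cosine-similarity vector, in the spirit of the enhanced methods named in the abstract.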