We apply a paradigm of transfer learning to build a taxonomy of entities intended to improve search engine relevance in a vertical domain. The taxonomy construction process starts from the seed entities and mines available source domains for new entities associated with these seed entities. New entities are formed by applying the machine learning of syntactic parse trees (their generalizations) to the search results for existing entities to form commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration. To match natural language expressions between source and target domains, we use syntactic generalization, an operation which finds a set of maximal common sub-trees of constituency parse trees of these expressions.Taxonomy and syntactic generalization are applied to relevance improvement in search and text similarity assessment. We conduct an evaluation of the search relevance improvement in vertical and horizontal domains and observe significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter domain. We also perform industrial evaluation of taxonomy and syntactic generalization-based text relevance assessment and conclude that a proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The proposed algorithm is implemented as a component of Apache OpenNLP project.
Read full abstract