Abstract

We present research conducted at Onet, the largest Polish news portal, aimed at constructing meaningful user profiles that best describe users' interests in the context of the media content they browse. We used two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We trained our models on corpora of articles in Polish and compared them with a baseline model built on a general-language corpus. We compared the performance of the algorithms on two distinct tasks: similar-article retrieval and user gender classification. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the user profiling task, the best performance was obtained with a combination of features: topics from the article text and word embeddings from the title.
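
The combination of features described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy word vectors, the function names, the averaging of word vectors into a title embedding, and the averaging of article features into a user profile are not taken from the paper, which does not specify these details in the abstract. In practice the word vectors would come from a trained Word2Vec model and the topic distributions from a trained LDA model.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "election": [0.0, 0.9, 0.3],
    "vote": [0.1, 0.8, 0.4],
}


def title_embedding(title):
    """Average the vectors of the known words in a title (assumed aggregation)."""
    vectors = [WORD_VECTORS[w] for w in title.lower().split() if w in WORD_VECTORS]
    dim = len(next(iter(WORD_VECTORS.values())))
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def article_features(topic_distribution, title):
    """Concatenate topic weights from the article text with the title embedding."""
    return list(topic_distribution) + title_embedding(title)


def user_profile(articles):
    """Mean feature vector over (topic_distribution, title) pairs a user viewed."""
    feats = [article_features(topics, title) for topics, title in articles]
    return [sum(col) / len(feats) for col in zip(*feats)]


if __name__ == "__main__":
    sports = title_embedding("Football match")
    politics = title_embedding("Election vote")
    print(cosine(sports, title_embedding("football")) > cosine(sports, politics))
    print(len(user_profile([([0.7, 0.3], "Football match"),
                            ([0.1, 0.9], "Election vote")])))
```

A profile vector built this way could then be passed to any standard classifier for the gender classification task, while `cosine` on title embeddings corresponds to the similar-article retrieval setting.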
