Abstract

We present research conducted at Onet, the largest Polish news portal, aimed at constructing meaningful user profiles that describe users' interests in the context of the media content they browse. We used two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We trained our models on corpora of articles in Polish and compared them with a baseline model built on a general-language corpus. We compared the performance of the algorithms on two distinct tasks: similar-article retrieval and user gender classification. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the user profiling task, the best performance was obtained with a combination of features: topics from the article text and word embeddings from the title.
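
As a rough illustration of the two tasks, the sketch below compares titles by the cosine similarity of their averaged word vectors and builds a combined feature vector by concatenating a topic distribution with a title embedding. It is a minimal sketch under stated assumptions, not the authors' pipeline: the three-dimensional vectors in `TOY_VECTORS`, the example topic distributions, and the helper names (`mean_embedding`, `combined_features`, `user_profile`) are all hypothetical. Averaging word vectors per title and averaging article features per user are assumed aggregation choices that the abstract does not specify.

```python
import math

# Hand-made toy word vectors (assumption): stand-ins for trained Word2Vec output.
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "match":    [0.8, 0.2, 0.1],
    "goal":     [0.7, 0.0, 0.2],
    "election": [0.0, 0.9, 0.1],
    "vote":     [0.1, 0.8, 0.0],
}


def mean_embedding(tokens, vectors):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_features(topic_distribution, title_tokens, vectors):
    """Concatenate article-text topic weights with the title's mean embedding."""
    return list(topic_distribution) + mean_embedding(title_tokens, vectors)


def user_profile(article_features):
    """Assumed aggregation: element-wise mean over the articles a user viewed."""
    count = len(article_features)
    return [sum(column) / count for column in zip(*article_features)]


# Similar-article retrieval on titles: the sports titles should be closer to
# each other than to the politics title.
title_a = mean_embedding(["football", "match"], TOY_VECTORS)
title_b = mean_embedding(["goal", "match"], TOY_VECTORS)
title_c = mean_embedding(["election", "vote"], TOY_VECTORS)
sim_sports = cosine(title_a, title_b)
sim_mixed = cosine(title_a, title_c)

# User profiling: two viewed articles, each with a made-up 2-topic distribution.
viewed = [
    combined_features([0.7, 0.3], ["football", "match"], TOY_VECTORS),
    combined_features([0.2, 0.8], ["election", "vote"], TOY_VECTORS),
]
profile = user_profile(viewed)
```

The resulting `profile` vector could then be passed to any standard classifier for a task such as gender classification; which classifier was used is not stated in the abstract.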

