Homepage2Vec: Language-Agnostic Website Embedding and Classification

Sylvain Lugeon, Tiziano Piccardi, Robert West

doi:10.1609/icwsm.v16i1.19380

Abstract

Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries: https://github.com/epfl-dlab/homepage2vec.
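For readers unfamiliar with the metric reported above, the following is a minimal sketch of a macro-averaged F1-score: an F1-score is computed separately for each category, and the per-category scores are then averaged with equal weight, so rare categories count as much as frequent ones. The function name `macro_f1`, the three category names, and the toy labels are assumptions made for illustration only; they are not taken from the paper or from the Homepage2Vec library, and the authors' exact evaluation protocol may differ.

```python
from typing import List, Set


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
             categories: List[str]) -> float:
    """Unweighted mean of per-category F1 scores (hypothetical helper)."""
    scores = []
    for cat in categories:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if cat in t and cat in p)      # true positives
        fp = sum(1 for t, p in pairs if cat not in t and cat in p)  # false positives
        fn = sum(1 for t, p in pairs if cat in t and cat not in p)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the category never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: four websites, each with a set of category labels (invented data).
categories = ["Arts", "Business", "Science"]
y_true = [{"Arts"}, {"Business"}, {"Science", "Business"}, {"Arts"}]
y_pred = [{"Arts"}, {"Business"}, {"Science"}, {"Business"}]

score = macro_f1(y_true, y_pred, categories)
print(f"macro-averaged F1 = {score:.3f}")
```

In this toy data the per-category F1 scores are 2/3 (Arts), 1/2 (Business), and 1 (Science), and the macro average is their plain mean.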

Journal: Proceedings of the International AAAI Conference on Web and Social Media
Publication Date: May 31, 2022
Citations: 3
