Abstract

Several recent studies have presented different approaches for clustering and classifying machine-generated mail based on email headers. We propose to expand these approaches by considering email message bodies. We argue that our approach can help increase coverage and precision in several tasks, and is especially critical for mail extraction. We remind that mail extraction supports a variety of mail mining applications such as ad re-targeting, mail search, and mail summarization. We introduce new structural clustering methods that leverage the HTML structure that is common to messages generated by a same mass-sender script. We discuss how such structural clustering can be conducted at different levels of granularity, using either strict or flexible matching constraints, depending on the use cases. We present large scale experiments carried over real Yahoo mail traffic. For our first use case of automatic mail extraction, we describe novel flexible-matching clustering methods that meet the key requirements of high intra-cluster similarity, adequate clusters size, and relatively small overall number of clusters. We identify the precise level of flexibility that is needed in order to achieve extremely high extraction precision (close to 100%), while producing relatively small number of clusters. For our second use case, namely, mail classification, we show that strict structural matching is more adequate, achieving precision and recall rates between 85%-90%, while converging to a stable classification after a short learning cycle. This represents an increase of 10%-20% compared to the sender-based method described in previous work, when run over the same period length. Our work has been deployed in production in Yahoo mail backend.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.