This paper proposes a supervised machine learning method for automatic keyphrase extraction on Social Question Answering (SQA) sites. The method is developed by: 1) analyzing the structural and activity characteristics of typical SQA sites, 2) developing and categorizing four types of calculated features that describe those characteristics, and 3) developing a customized logistic regression model trained on real data from six popular SQA sites, in both English and Chinese. Experimental results show that the influence of the proposed SQA-related features varies: some help keyphrase extraction on SQA sites in both languages, while others are useful only for a specific site. The results also demonstrate generally better performance than previously published keyphrase extraction algorithms such as KEA.