Abstract
PDF HTML阅读 XML下载 导出引用 引用提醒 基于流信息距离的多文本流热点挖掘 DOI: 10.3724/SP.J.1001.2011.03893 作者: 作者单位: 作者简介: 通讯作者: 中图分类号: 基金项目: 国家自然科学基金(600773169); 国家科技支撑计划(2006BAI05A01) Mining Hotspots from Multiple Text Streams Based on Stream Information Distance Author: Affiliation: Fund Project: 摘要 | 图/表 | 访问统计 | 参考文献 | 相似文献 | 引证文献 | 资源附件 | 文章评论 摘要:把文本流中的热点区分为局部热点和全局热点,分析了二者的相关性,并将Kolmogorov 复杂度应用于多文本流中的热点挖掘.首先,定义了基于Kolmogorov 复杂度的冗余信息的概念,并论证了文本流存在局部热点的必要条件是冗余信息超过某个阈值;其次,基于条件Kolmogorov 复杂度提出了一个相似性度量指标——流信息距离(stream information distance,简称SID),以衡量不同文本流之间的相似度;并借鉴计算生物学领域中的种系发生树的思想,提出了一种基于层次聚类的多文本流全局热点挖掘启发式算法.在合成和真实数据集的实验,验证了算法的收敛性、有效性和规模可伸缩性. Abstract:This paper characterizes the local and global hotspots in text streams and elaborates their correlation. The paper then applies Kolmogorov complexity to mining the hotspots in multiple text streams. The Redundant Information is defined based on Kolmogorov complexity, and it has been demonstrated that the Redundant Information exceeding a threshold is necessary for the local hotspots. Secondly, a similarity metric, termed as Stream Information Distance (SID), is suggested based on the conditional Kolmogorov complexity to quantify the similarity between different text streams. Borrowing ideas of Phylogeny originated from Computational Biology, a heuristic algorithm based on hierarchical clustering is proposed to mine the global hostspots from multiple text streams. Finally, the convergency, effectiveness, and scalability of this algorithm are validated by the extensive experiments over synthetic and real data set. 参考文献 相似文献 引证文献
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have