Abstract
Dealing with big data in computational social networks may require powerful machines, big storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (Microblog Processing Toolkit) for handling big volume of microblog posts with commodity computers, which can handle tens of millions of micro posts a day. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and phrase search for returning search results with statistical analysis. We then present different indexing mechanisms and compare them on the microblog posts we collected from popular online social network sites in mainland China.
Highlights
Dealing with big data in computational social networks may require big machines with big storage
By focusing on a specific type of data, it is possible to carry out big-data research on online social networks (OSNs) using commodity computers in a small lab environment
We present a system for handling big volume of microblog posts (MBPs)
Summary
Dealing with big data in computational social networks may require big machines with big storage. To make use of these data, MPT is required to support, among other things, phrase search that will quickly return the set of MBPs that contain a set of words entered by the user and the statistical results of these posts displayed in various graphs. Using the built-in regular expressions of Mongo DB, we would only need to traverse all the MBPs the database returned and carry out the statistical analysis Since the nextword index file contains the position of each word pair and the required statistical features, the system can carry out statistical analysis by traversing the position list for the given input phrase to be searched for without querying the entire database. This method is good for building a fast search engine but may not meet the requirement of high accuracy
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.