Abstract

Technological advances allow sensing-capable physical objects to connect to the Internet and share their information through webpages. Because webpages are a predominant source of public sensor data, automatically discovering these data is critical for general Internet of Things search services, which underpin many intelligent applications. However, discovering sensor data from webpages is challenging: the data are presented in diverse modes, and webpage layouts and structures are complicated and heterogeneous. To this end, we explore webpages to discover and collect sensor data using a hierarchical mechanism. We first devise novel textual features (TFs) to recognize potential webpages that may contain sensor data; specifically, we construct a sensing information corpus to provide keyword references for the features. Then, to locate the sensor data, we develop a granularity adaptive page segmentation (GAPS) algorithm that segments potential webpages into a set of informative blocks, and we extract several visual features (VFs) of the blocks so that sensor data can be identified by a block classifier. Using a newly created sensing information pages dataset, which consists of webpages and manual annotations of sensing data, we conduct extensive experiments to evaluate our exploration methods. Results demonstrate that the TFs outperform state-of-the-art approaches in sensing data recognition, and that GAPS, in cooperation with the VFs, locates multimodal sensor data efficiently.
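To make the idea of corpus-backed textual features concrete, the following is a minimal sketch, not the paper's actual TFs, which the abstract does not specify. The keyword set `SENSING_KEYWORDS`, the function `keyword_features`, and the three feature names are hypothetical and chosen only for illustration; it assumes a page has already been reduced to plain text.

```python
import re

# Hypothetical stand-in for a sensing information corpus; the real corpus
# described in the abstract is not reproduced here.
SENSING_KEYWORDS = {"temperature", "humidity", "pressure", "sensor", "wind"}


def keyword_features(page_text, keywords=SENSING_KEYWORDS):
    """Compute simple keyword-based features for one page's plain text."""
    # Lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", page_text.lower())
    hits = [t for t in tokens if t in keywords]
    total = len(tokens)
    return {
        # Number of tokens that match a corpus keyword.
        "hit_count": len(hits),
        # Fraction of all tokens that are corpus keywords.
        "density": len(hits) / total if total else 0.0,
        # Fraction of the corpus keywords that appear at least once.
        "coverage": len(set(hits)) / len(keywords) if keywords else 0.0,
    }


if __name__ == "__main__":
    sample = "Temperature: 21 C. Humidity: 40 %. Sensor online."
    print(keyword_features(sample))
```

Features of this kind could feed any standard classifier that separates candidate sensing pages from other pages; which features and classifier the authors actually use is described in the full paper, not here.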
