Keyword-based access to structured data has been gaining traction in both research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put considerable effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for end users accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although efficiency and effectiveness problems still prevent its large-scale adoption. In this work, we focus on graph data and propose TSA+BM25 and TSA+VDP, two keyword search systems over RDF datasets based on the “virtual documents” approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Text retrieval techniques scale well to large datasets, but they need to be adapted to the complexity of structured data. The approaches we propose are more efficient and effective than state-of-the-art systems. In particular, we show that our systems scale to RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.
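The general idea behind the “virtual documents” approach can be illustrated with a minimal sketch: an off-line step turns RDF triples into bags of words, and an on-line step ranks those bags against a keyword query with a standard BM25 formula. This is not the TSA+BM25 or TSA+VDP system. Grouping triples by subject, the `build_virtual_documents` and `BM25Index` names, the parameter values `k1=1.2` and `b=0.75`, and the sample triples are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
EXAMPLE_TRIPLES = [
    ("http://ex.org/Alice", "http://ex.org/worksAt", "University of Padua"),
    ("http://ex.org/Alice", "http://ex.org/name", "Alice Rossi"),
    ("http://ex.org/Bob", "http://ex.org/worksAt", "University of Bologna"),
    ("http://ex.org/Bob", "http://ex.org/name", "Bob Bianchi"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def local_name(term):
    """Keep only the last URI segment; literals are returned unchanged."""
    if term.startswith(("http://", "https://")):
        return re.split(r"[#/]", term)[-1]
    return term


def build_virtual_documents(triples):
    """Off-line phase (assumed grouping): one bag of words per subject."""
    docs = defaultdict(list)
    for s, p, o in triples:
        for term in (s, p, o):
            docs[s].extend(tokenize(local_name(term)))
    return dict(docs)


class BM25Index:
    """On-line phase: rank virtual documents with standard BM25."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tf = {d: Counter(tokens) for d, tokens in docs.items()}
        self.length = {d: len(tokens) for d, tokens in docs.items()}
        self.n = len(docs)
        self.avg_len = sum(self.length.values()) / self.n
        # Document frequency: number of documents containing each token.
        self.df = Counter(t for counts in self.tf.values() for t in counts)

    def score(self, query, doc_id):
        total = 0.0
        for t in tokenize(query):
            f = self.tf[doc_id].get(t, 0)
            if f == 0:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = f + self.k1 * (
                1 - self.b + self.b * self.length[doc_id] / self.avg_len
            )
            total += idf * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        """Return up to top_k (doc_id, score) pairs with a positive score."""
        ranked = sorted(((self.score(query, d), d) for d in self.tf), reverse=True)
        return [(d, s) for s, d in ranked[:top_k] if s > 0]
```

Calling `BM25Index(build_virtual_documents(EXAMPLE_TRIPLES)).search("Padua university")` ranks the virtual document for `http://ex.org/Alice` first, since it matches both query terms. The index is built once, so each query only needs term lookups and arithmetic, which is the property that lets this family of approaches push cost off-line.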