Manually extracting data from unstructured sources such as websites is labour intensive and becomes almost infeasible at large scale. Recent state-of-the-art information extraction techniques show encouraging results. In this work, we attempt to extract professional details such as name, email, address, contact number, and specialization from the home pages of doctors. The work covers two scenarios: a website that contains the details of a single doctor, and a website that contains the details of multiple doctors or other professionals at the same time. We treat the problem as a relation extraction task within Information Extraction. The proposed solution is built on top of DeepDive, a tool developed at Stanford. In both scenarios, DeepDive takes pre-processed sentences as input and constructs entity-relations. For each entity-relation, DeepDive computes the probability that the relationship is a correct match, using distant supervision and user-defined heuristic rules. In experiment 1, our system achieves 69.14% accuracy for name, 88.67% for location, and 100% for email, contact number, and specialization. In experiment 2, the observed probabilities are less conclusive, mostly in the range 0.5-0.7, and we present some solutions for future work. The techniques presented here can readily be extended to other types of professionals, not just doctors.
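To make the idea of distant supervision combined with heuristic rules concrete, the following is a minimal sketch in plain Python. It is not DeepDive code and not the rule set used in this work: the `Candidate` type, the `KNOWN_PAIRS` knowledge base, the 200-character distance threshold, and the rules themselves are all hypothetical and chosen only for illustration. Each candidate (name, email) pair found in a sentence receives a label of +1, -1, or no label at all; a system such as DeepDive would then use labels of this kind as noisy training evidence when estimating the probability of each relation.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """A candidate (name, email) relation mentioned in one sentence."""
    sentence: str
    name: str
    email: str


# Hypothetical knowledge base of pairs already known to be correct.
KNOWN_PAIRS = {("Jane Roe", "jane.roe@example.org")}

# Hypothetical threshold: mentions further apart than this are unlikely to be related.
MAX_CHAR_DISTANCE = 200


def distant_label(c: Candidate) -> Optional[int]:
    """Return +1 or -1 when a rule fires, or None when no rule applies."""
    # Distant supervision: the pair appears in the knowledge base.
    if (c.name, c.email) in KNOWN_PAIRS:
        return 1

    # Heuristic rule: the email's local part contains the surname.
    surname = c.name.split()[-1].lower()
    local_part = c.email.split("@")[0].lower()
    if surname in local_part:
        return 1

    # Heuristic rule: name and email are far apart in the sentence.
    distance = abs(c.sentence.find(c.name) - c.sentence.find(c.email))
    if distance > MAX_CHAR_DISTANCE:
        return -1

    # No rule fired: leave the candidate unlabeled.
    return None
```

Candidates that no rule labels are left for the probabilistic model to score, which is one plausible reason that a page listing many doctors could yield probabilities closer to 0.5 than a page describing a single doctor.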