The abundance of information on almost any topic makes the Internet a valuable resource. Although searching the Internet is easy, automating the extraction of information from online sources remains difficult because the content lacks structure and is shared in diverse ways. Information is often stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. In this paper, we present an information harvesting tool (heteroHarvest) that addresses these problems by filtering the useful information and then normalizing it into a single non-hypertext format. Finally, we describe the results of an experimental evaluation. The results are promising, with an overall error rate of 6.5% across heterogeneous formats.
IEEE Intelligence and Security Informatics Conference 2011 (IEEE ISI 2011), 2011, p. 231