WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE

Vidya. V.L

PG Scholar, Department of Computer Science, Mohandas College of Engineering and Technology, Anad

Aarathy Gandhi

Assistant Professor, Department of IT, Mohandas College of Engineering and Technology, Anad

Pages: 99-107

Vol: 5, Issue: 4, 2015

Receiving Date: 2015-08-26

Acceptance Date: 2015-09-22

Publication Date: 2015-10-25

Abstract

The Web is a vast and rapidly growing information repository in which data are usually presented in human-friendly formats, which makes it difficult to extract relevant data from various sources automatically. Web data extractors are therefore used to extract data from web pages in order to feed automated processes. Web data extraction techniques are usually based on extraction rules that require maintenance whenever the web sources change. This paper introduces a featured ternary tree based approach for extracting data from web pages that share a common pattern. A regular expression is generated from this tree and can later be used to extract data from similar web documents.
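
The abstract does not spell out how the tree is built, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a character-level search for the longest fragment shared by all pages, splits each page into a prefix, a separator and a suffix around that fragment, recurses on the three groups, and finally reads a regular expression off the tree. The names `Node`, `shared_pattern`, `build`, `to_regex` and the `min_len` threshold are hypothetical and do not come from the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    texts: List[str]                     # text fragments gathered at this node
    pattern: str = ""                    # longest fragment shared by all texts
    prefix: Optional["Node"] = None      # text before the first occurrence
    separator: Optional["Node"] = None   # text between first and last occurrence
    suffix: Optional["Node"] = None      # text after the last occurrence


def shared_pattern(texts: List[str], min_len: int) -> str:
    """Longest substring (at least min_len characters) found in every text."""
    if not texts:
        return ""
    shortest = min(texts, key=len)
    for size in range(len(shortest), min_len - 1, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in t for t in texts):
                return candidate
    return ""


def build(texts: List[str], min_len: int = 3) -> Node:
    """Recursively split the texts around their shared pattern."""
    node = Node(texts)
    node.pattern = shared_pattern(texts, min_len)
    if not node.pattern:
        return node                      # leaf: nothing shared, data lives here
    n = len(node.pattern)
    prefixes, separators, suffixes = [], [], []
    for t in texts:
        first, last = t.find(node.pattern), t.rfind(node.pattern)
        prefixes.append(t[:first])
        if last >= first + n:            # a second, non-overlapping occurrence
            separators.append(t[first + n:last])
            suffixes.append(t[last + n:])
        else:
            suffixes.append(t[first + n:])
    node.prefix = build(prefixes, min_len)
    node.separator = build(separators, min_len)
    node.suffix = build(suffixes, min_len)
    return node


def to_regex(node: Node) -> str:
    """Shared patterns become literals; varying leaves become capture groups."""
    if not node.pattern:
        if len(set(node.texts)) <= 1:    # empty or identical fragments
            return re.escape(node.texts[0]) if node.texts else ""
        return "(.*?)"
    literal = re.escape(node.pattern)
    rx = to_regex(node.prefix) + literal
    if node.separator.texts:
        repeat = to_regex(node.separator) + literal
        # Optional when only some texts contain a second occurrence.
        optional = len(node.separator.texts) < len(node.texts)
        rx += f"(?:{repeat})?" if optional else repeat
    return rx + to_regex(node.suffix)


# Two made-up pages that share one template.
pages = ["<h1>Ann</h1><p>30</p>", "<h1>Bob</h1><p>41</p>"]
rule = to_regex(build(pages))
match = re.fullmatch(rule, "<h1>Eve</h1><p>27</p>", re.DOTALL)
print(match.groups() if match else None)
```

A practical extractor would more likely work on HTML tokens than on single characters and would handle every repetition of a pattern rather than only the first and last occurrence. The sketch keeps those simplifications so that the link between the three-way split and the generated expression stays visible.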

Keywords: data extraction, wrapper induction, data alignment, pattern mining


Disclaimer: All papers published in IJRST will be indexed on Google Search Engine as per their policy.