WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE
Vidya. V.L
PG Scholar, Department of Computer Science, Mohandas College of Engineering and Technology, Anad
Aarathy Gandhi
Assistant Professor, Department of IT, Mohandas College of Engineering and Technology, Anad
Receiving Date:
2015-08-26
Acceptance Date:
2015-09-22
Publication Date:
2015-10-25
Abstract
The Web is a vast and rapidly growing information repository in which data are usually presented
in human-friendly formats, which makes it difficult for automated processes to extract relevant data
from various sources. Web data extractors are therefore used to extract data from web pages in order
to feed automated processes. Web data extraction techniques are usually based on extraction rules that
require maintenance when web sources change. This paper introduces a featured ternary tree based
approach for extracting data from web pages that share a common pattern. A regular expression is
generated from this tree and can later be used to extract data from similar web documents.
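The idea of turning a tree built from shared patterns into a regular expression can be illustrated with a small sketch. It is not the algorithm of the paper, whose details are not given in this abstract. It is a simplified illustration under stated assumptions: pages are plain strings, the shared pattern is found by brute force, and every name (Node, longest_shared, build, to_regex, induce) is hypothetical. Each node splits its texts around a shared pattern into a prefix child, an optional separator child and a suffix child; leaves whose texts differ become capturing groups.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: texts plus up to three children."""
    texts: List[str]
    pattern: Optional[str] = None
    prefix: Optional["Node"] = None
    separator: Optional["Node"] = None
    suffix: Optional["Node"] = None


def longest_shared(texts: List[str], min_len: int) -> Optional[str]:
    """Longest substring found in every text (brute force), or None."""
    base = min(texts, key=len)
    for size in range(len(base), min_len - 1, -1):
        for start in range(len(base) - size + 1):
            cand = base[start:start + size]
            if all(cand in t for t in texts):
                return cand
    return None


def build(texts: List[str], min_len: int = 3) -> Node:
    """Recursively split the texts around their longest shared pattern."""
    node = Node(texts)
    if len(set(texts)) <= 1:          # nothing varies: constant leaf
        return node
    pat = longest_shared(texts, min_len)
    if pat is None:                   # nothing shared: variable leaf
        return node
    node.pattern = pat
    n = len(pat)
    firsts = [t.find(pat) for t in texts]
    lasts = [t.rfind(pat) for t in texts]
    node.prefix = build([t[:i] for t, i in zip(texts, firsts)], min_len)
    # Use a separator child only if every text has two non-overlapping
    # occurrences of the pattern; otherwise split at the first one only.
    if all(j >= i + n for i, j in zip(firsts, lasts)):
        node.separator = build(
            [t[i + n:j] for t, i, j in zip(texts, firsts, lasts)], min_len)
        node.suffix = build([t[j + n:] for t, j in zip(texts, lasts)], min_len)
    else:
        node.suffix = build([t[i + n:] for t, i in zip(texts, firsts)], min_len)
    return node


def to_regex(node: Node) -> str:
    """Shared patterns become literals, varying leaves become groups."""
    if node.pattern is None:
        if len(set(node.texts)) <= 1:
            return re.escape(node.texts[0])
        return "(.*?)"
    parts = [to_regex(node.prefix), re.escape(node.pattern)]
    if node.separator is not None:
        parts += [to_regex(node.separator), re.escape(node.pattern)]
    parts.append(to_regex(node.suffix))
    return "".join(parts)


def induce(pages: List[str], min_len: int = 3) -> "re.Pattern[str]":
    """Learn an extraction regex from pages that share a template."""
    return re.compile(to_regex(build(pages, min_len)), re.DOTALL)


# Two made-up sample pages that share one template.
samples = [
    "<li><b>Dune</b> by Herbert</li>",
    "<li><b>Emma</b> by Austen</li>",
]
rule = induce(samples)
match = rule.fullmatch("<li><b>Ulysses</b> by Joyce</li>")
fields = match.groups() if match else ()
```

Here the rule learned from the two sample pages is applied to a third, unseen page with the same template, and fields holds the text captured in the variable positions. Real pages would likely need tokenisation and a more efficient pattern search than this brute-force sketch.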
Keywords:
data extraction, wrapper induction, data alignment, pattern mining