Searching Web Page Using Entropy Estimation

Vijay R. Sonawane, P. P. Halkarnikar

Abstract


Explosive growth of the web has made searching for and extracting information from it harder. Users need to search product-based web pages automatically to locate product descriptions within huge volumes of data. In this paper, we propose a simple technique for locating products in a retrieved web page of an e-commerce web site. The technique takes advantage of the hierarchical structure of HTML. First, it discovers a set of product descriptions based on a measure of entropy at each node in the HTML tag tree of the retrieved web page. Afterward, a set of association rules based on heuristic features is applied to improve the accuracy of product extraction.
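The entropy step described above can be illustrated with a minimal sketch. The abstract does not give the exact entropy formula, so the code below makes an assumption: it takes the Shannon entropy of the distribution of child-subtree tag signatures at each node, and treats a node with several children and low entropy as a likely container of repeated product records. The names `Node`, `TreeBuilder`, `signature`, `child_entropy`, `candidate_regions`, and the thresholds `min_children` and `max_entropy` are hypothetical and chosen only for illustration; the association-rule filtering step is not shown.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element in the HTML tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Structural fingerprint of a subtree: its nested tag names."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def child_entropy(node):
    """Shannon entropy (bits) of the child-signature distribution.

    Assumption: this stands in for the paper's per-node entropy measure.
    """
    counts = Counter(signature(c) for c in node.children)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def candidate_regions(html, min_children=3, max_entropy=0.5):
    """Return nodes whose children are numerous and structurally uniform."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [
        n for n in walk(builder.root)
        if len(n.children) >= min_children and child_entropy(n) <= max_entropy
    ]


page = (
    "<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li>"
    "<li><b>C</b><i>3</i></li></ul>"
    "<div><p>x</p><span>y</span><a>z</a></div>"
)
regions = candidate_regions(page)
print([n.tag for n in regions])
```

In this toy page the `ul` has three structurally identical `li` children, so its child entropy is zero, while the `div` has three different children and a higher entropy. Real product pages are noisier, so the thresholds would need tuning, which is where a second filtering stage such as the heuristic rules mentioned in the abstract would come in.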


Keywords


Entropy, representative value, association rule, filter, product description.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.