
Layout of Simultaneous Parser: Hypertext Language with Ontology Based Information Element

A. Daison Raj, P. Vidhya, P. Sathiyalakshmi, N. Narmatha

Abstract


HTML (Hypertext Markup Language) is a widely used markup language for building web pages that hold static information. A simultaneous (parallel) HTML parser would bring a considerable performance improvement and a better user experience. However, parallelizing the HTML parser is challenging because of a strong cyclic dependency in the parser model. This paper proposes an ontology-based simultaneous HTML parser design that splits the input HTML document at "div" tags and processes the independent partial inputs with multiple parser threads. The simultaneous parser scans the input and converts it into tokens, where a token is a sequence of characters, and the partial inputs can be processed in parallel. The proposed simultaneous HTML parser was evaluated with benchmarks selected from the top 500 web pages and achieved a maximum speedup of 1.49x.
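The splitting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_on_top_level_divs`, `tokenize`, `parse_concurrently`), the regex-based tokenizer, and the choice of a thread pool are all assumptions made for illustration, and the ontology component is not modelled. The sketch also assumes balanced "div" tags.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical token pattern: either a whole tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
# Matches opening and closing div tags; group 1 is "/" for a closing tag.
DIV_TAG_RE = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def split_on_top_level_divs(html):
    """Cut the input into chunks, one per top-level <div>...</div> block.

    Markup outside any div becomes its own chunk. Assumes balanced div tags.
    """
    chunks, depth, start = [], 0, 0
    for match in DIV_TAG_RE.finditer(html):
        if match.group(1) == "":            # opening <div>
            if depth == 0:
                if match.start() > start:   # flush markup before this div
                    chunks.append(html[start:match.start()])
                start = match.start()
            depth += 1
        elif depth > 0:                     # closing </div>
            depth -= 1
            if depth == 0:
                chunks.append(html[start:match.end()])
                start = match.end()
    if start < len(html):                   # trailing markup after the last div
        chunks.append(html[start:])
    return chunks


def tokenize(chunk):
    """Turn one chunk into a list of tokens, dropping whitespace-only text."""
    return [tok for tok in TOKEN_RE.findall(chunk) if tok.strip()]


def parse_concurrently(html, workers=4):
    """Tokenize each chunk on its own worker thread, preserving input order."""
    chunks = split_on_top_level_divs(html)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        token_lists = list(pool.map(tokenize, chunks))
    return [tok for toks in token_lists for tok in toks]


if __name__ == "__main__":
    page = "<p>a</p><div><div>b</div></div><div>c</div>tail"
    print(split_on_top_level_divs(page))
    print(parse_concurrently(page))
```

Because every chunk boundary falls on a tag boundary, concatenating the per-chunk token lists gives the same sequence as tokenizing the whole document in one pass. In CPython, threads running pure-Python code are limited by the global interpreter lock, so this sketch shows the structure of the approach and should not be expected to reproduce the reported speedup.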


Keywords


HTML5, Ontology Parser, Syntactic, DOM Tree, Tokenization, Hypothesis




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.