Separation of Tamil and Devanagari Script Words in Printed Bilingual Document Images

R. Rathinapriya; S. Abirami; B. Manjula

doi:10.36039/AA042009005

Abstract

Script identification in a bi-script document is an important step in the design of an OCR system for successful analysis and recognition. Most optical character recognition (OCR) systems can recognize at most a few scripts, so large archives of document images containing different scripts need a way to categorize documents automatically before the proper OCR is applied. Much work has already been reported in this area. In the Indian context, although some results have been reported, the task is still in its infancy. This paper presents research on identifying Tamil and Devanagari scripts at the word level, irrespective of font face and size. The proposed technique uses a document vectorization method that generates vectors from nine zones segmented over the characters, based on their shape, density and transition features. The script is then determined by a rule-based classifier containing a set of classification rules derived from the vectors. The proposed technique identifies scripts with minimal pre-processing and high accuracy, and it can also be extended to other scripts. Results from experiments, simulations, and human vision indicate that the system can act as a plug-in that is embedded in an OCR system prior to the recognition stage.
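
The zoning idea in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the actual features or rules, so the 3x3 equal split, the function names (`split_bounds`, `zone_features`, `classify`), the horizontal transition count, and the threshold `headline_density` are all assumptions. The single rule relies on the headline stroke that runs along the top of Devanagari words, which Tamil words lack.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary word image: 1 = ink, 0 = background


def split_bounds(length: int, parts: int = 3) -> List[Tuple[int, int]]:
    """Split range(length) into `parts` nearly equal [start, end) spans."""
    return [(i * length // parts, (i + 1) * length // parts) for i in range(parts)]


def zone_features(img: Image) -> List[Tuple[float, int]]:
    """Return (density, transitions) for each of the nine zones, row-major.

    Assumes a non-empty rectangular image. The equal 3x3 split is an
    assumption; the paper may segment zones differently.
    """
    rows, cols = len(img), len(img[0])
    features = []
    for r0, r1 in split_bounds(rows):
        for c0, c1 in split_bounds(cols):
            zone = [row[c0:c1] for row in img[r0:r1]]
            area = max(1, (r1 - r0) * (c1 - c0))
            ink = sum(sum(row) for row in zone)
            # Count horizontal ink/background changes inside the zone.
            transitions = sum(
                1 for row in zone for a, b in zip(row, row[1:]) if a != b
            )
            features.append((ink / area, transitions))
    return features


def classify(features: List[Tuple[float, int]], headline_density: float = 0.5) -> str:
    """Hypothetical single rule: a dense top band suggests a Devanagari headline.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    top_density = sum(density for density, _ in features[:3]) / 3
    return "Devanagari" if top_density >= headline_density else "Tamil"
```

A real rule set would combine several zone features and would need thresholds tuned on labelled word images; the single rule here only shows how classification rules can be read off the nine-zone vector.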

Keywords

Bilingual Document, Script Identification, Rule-Based Classification, Optical Character Recognition (OCR)

This work is licensed under a Creative Commons Attribution 3.0 License.
