
A Survey on Pre-processing Techniques for Text Mining

Manthan J. Vyas, Sanjay D. Bhanderi


Text mining is the process of extracting interesting patterns or knowledge from unstructured text documents. Text is the most commonly used type of data on the WWW. Pre-processing is a very important phase of the text mining process. A text mining framework includes two components: text refining and knowledge distillation. This paper surveys pre-processing for text mining in English and Gujarati. Very little work has been done on text mining in Gujarati. It is a challenging task because Gujarati is morphologically rich, which gives rise to a very large number of word forms and large feature spaces. Some pre-processing techniques for Gujarati are introduced in this paper.


Pre-Processing, Stop-Words, Stemming, Text Mining
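
As a rough illustration of the pre-processing steps named in the abstract and keywords (tokenization, stop-word removal, and stemming), the following minimal Python sketch may help. It is not the method surveyed in the paper: the stop-word list, the suffix list, and the function names are hypothetical and chosen only for illustration, and the tokenizer assumes plain English text. Gujarati would need a Unicode-aware tokenizer and language-specific stop-word and suffix lists.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
# Real systems use much larger, language-specific lists.
STOP_WORDS = {"the", "is", "of", "a", "an", "in", "and", "to", "from"}
SUFFIXES = ("ing", "es", "s")  # tried in this order, longest first


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters (English only)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def light_stem(token, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop-words, then apply the light suffix-stripping stemmer."""
    return [light_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Extracting patterns from the documents"))
```

Mapping several surface forms to one stem in this way is how stemming reduces the number of distinct features, which matters most for morphologically rich languages.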







Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.