Improved Fuzzy C-Means Clustering of Web Usage Data with Genetic Algorithm
Abstract
Clustering is one of the important functions in web usage mining, which applies data mining techniques to discover usage patterns from web data. Cluster analysis aims to identify groups of similar objects and thereby helps to discover the distribution of patterns and interesting correlations in large data sets. Clustering methods are not only major tools for uncovering the underlying structure of a given data set, but also promising tools for uncovering local input-output relations of a complex system. Fuzzy c-means (FCM) is one of the most widely used fuzzy clustering algorithms in real-world applications. However, the method has two major limitations. First, the number of clusters must be specified in advance. Second, FCM can get stuck in sub-optimal solutions. In this paper, we propose a new framework that uses a genetic algorithm (GA) to improve the quality of the web session clusters produced by fuzzy c-means clustering. In the first step, the fuzzy c-means algorithm is used to cluster the user sessions. In the second step, a GA-based refinement algorithm improves the cluster quality. The proposed algorithm is tested on web access logs collected from the Internet Traffic Archive (ITA), and the results show that refined initial starting points and post-processing refinement of clusters indeed lead to improved solutions.
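For readers unfamiliar with the first step, the following is a minimal sketch of standard fuzzy c-means in plain Python, not the implementation used in the paper. The function names `fcm` and `objective`, the fuzzifier default `m=2.0`, the stopping tolerance, and the toy session vectors are all assumptions made for illustration. The `objective` function computes the usual FCM criterion (the sum of squared distances weighted by memberships raised to the power `m`), which is one plausible fitness measure a GA-based refinement could minimise; the abstract does not state which fitness function the authors use.

```python
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / total
                            for k in range(dim)])
        # Membership update from squared distances to the new centres.
        shift = 0.0
        for i in range(n):
            d2 = [sum((points[i][k] - centers[j][k]) ** 2 for k in range(dim))
                  for j in range(c)]
            if min(d2) == 0.0:
                # The point coincides with a centre: assign crisp membership.
                hits = [j for j in range(c) if d2[j] == 0.0]
                new = [1.0 / len(hits) if j in hits else 0.0 for j in range(c)]
            else:
                inv = [dist ** (-1.0 / (m - 1.0)) for dist in d2]
                s = sum(inv)
                new = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:  # memberships have stabilised
            break
    return centers, u


def objective(points, centers, u, m=2.0):
    """FCM criterion: sum over points and clusters of u^m * squared distance."""
    total = 0.0
    for i, p in enumerate(points):
        for j, ctr in enumerate(centers):
            d2 = sum((a - b) ** 2 for a, b in zip(p, ctr))
            total += (u[i][j] ** m) * d2
    return total


# Toy stand-in for session feature vectors (hypothetical data).
sessions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, memberships = fcm(sessions, c=2)
labels = [row.index(max(row)) for row in memberships]
print(labels, round(objective(sessions, centers, memberships), 4))
```

Both limitations named in the abstract are visible in this sketch: `c` must be passed in by the caller, and the result depends on the random initial memberships controlled by `seed`, so a different start can end in a different local optimum.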
This work is licensed under a Creative Commons Attribution 3.0 License.