
Enhancement of Discriminative Embedded Clustering for Clustering High Dimensional Data using Hub Concept

Ghatage Trupti Babasaheb, Patil Deepali Eknath, B. Takmare Sachin

Abstract


Very high dimensional data arise in many real applications. Many of the dimensions are unhelpful and may even degrade the performance of subsequent clustering algorithms. One way to deal with this problem is to first reduce the dimensionality and then cluster the data. However, clustering performance can be improved if the dimensionality reduction step takes the requirements of clustering into account and vice versa. Discriminative Embedded Clustering (DEC) combines clustering and subspace learning in this way: its objective function has two main parts, one for dimensionality reduction and one for clustering.
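To make the coupling concrete, here is a minimal Python sketch that alternates between a discriminative projection step and a clustering step. It is only an illustration of the idea, not the actual DEC objective of Hou et al.: LDA stands in for the subspace-learning term, k-means for the clustering term, and the function name and stopping rule are our own assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def alternating_subspace_clustering(X, n_clusters, n_iter=10):
    # Initial clustering in the full-dimensional space.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    for _ in range(n_iter):
        # Subspace step: learn at most (n_clusters - 1) axes that
        # separate the current clusters.
        lda = LinearDiscriminantAnalysis(n_components=n_clusters - 1)
        Z = lda.fit_transform(X, labels)
        # Clustering step: re-cluster in the learned subspace.
        new_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
        if np.array_equal(new_labels, labels):
            break  # the two steps agree; stop alternating
        labels = new_labels
    return labels

DEC itself optimizes the two terms jointly within a single objective function rather than by this naive alternation, which is what lets each step anticipate the requirements of the other.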

In high dimensional data, some points appear in many more k-nearest-neighbor lists than others. Such points are called hubs, and the tendency of high dimensional data to contain hubs is called hubness. Because hubs are situated near cluster centers, they tend to be close to many other points, and it has been shown that major hubs can be used effectively as cluster prototypes. Using hubness for clustering therefore yields an improvement over centroid-based approaches. The aim of this paper is to design a system for clustering high dimensional data that combines the discriminative embedding method with hub-based clustering.
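As a rough illustration, the sketch below scores each point by the number of k-nearest-neighbor lists it appears in (its hubness, N_k) and uses the strongest hubs as cluster prototypes. Euclidean distance, the function names, and the simple top-score prototype choice are our own assumptions; published hub-based methods refine the prototypes iteratively rather than picking them once.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hubness_scores(X, k=5):
    # N_k(x): how many k-nearest-neighbor lists each point appears in.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)  # column 0 of idx is the point itself
    return np.bincount(idx[:, 1:].ravel(), minlength=len(X))

def hub_prototype_clustering(X, n_clusters, k=5):
    # Use the points with the highest hubness as cluster prototypes,
    # then assign every point to its nearest prototype.
    hubs = np.argsort(hubness_scores(X, k))[-n_clusters:]
    dists = np.linalg.norm(X[:, None, :] - X[hubs][None, :, :], axis=2)
    return dists.argmin(axis=1), hubs

Note that two strong hubs may fall inside the same cluster, which is one reason the iterative variants exist.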


Keywords


Clustering; High Dimensional Data; Subspace Learning; Hubs; Discriminative Embedded Clustering (DEC)


References


J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, 2nd ed., Morgan Kaufmann, 2006.

K. Kailing, P. Kroger, H. P. Kriegel, and S. Wanka, “Ranking Interesting Subspaces for Clustering High Dimensional Data”, Proc. Seventh European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 241-252, 2003.

C.C. Aggarwal and P.S. Yu, “Finding Generalized Projected Clusters in High Dimensional Spaces,” Proc. 26th ACM SIGMOD Int’l Conf. Management of Data, pp. 70-81, June 2000.

J. Li and D. Tao, “Simple exponential family PCA”, IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 485-497, Mar. 2013.

H. Park, M. Jeon, and J. B. Rosen, “Lower dimensional representation of text data based on centroids and least squares”, BIT Numer. Math., vol. 43, no. 2, pp. 427-448, 2003.

J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction”, Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.

S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding”, Science, vol. 290, pp. 2323-2326, Dec. 2000.

Chenping Hou, Dongyun Yi, Feiping Nie, and Dacheng Tao, “Discriminative Embedded Clustering: A Framework for Grouping High Dimensional Data”, IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 6, pp. 1287-1299, June 2015.

Nenad Tomasev, Milos Radovanovic, Dunja Mladenic, and Mirjana Ivanović, “The Role of Hubness in Clustering High Dimensional Data”, IEEE Trans. Knowledge and Data Eng., vol. 26, no. 3, pp. 739-751, March 2014.

J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888-905, Aug. 2000.

Pasi Fränti, Olli Virmajoki, and Ville Hautamäki, “Fast Agglomerative Clustering Using a k-Nearest Neighbor Graph”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1875-1881, Nov. 2006.

K. Kailing, H.-P. Kriegel, and P. Kroger, “Density-Connected Subspace Clustering for High-Dimensional Data”, Proc. Fourth SIAM Int’l Conf. Data Mining (SDM), pp. 246-257, 2004.

E. Muller, S. Gunnemann, I. Assent, and T. Seidl, “Evaluating Clustering in Subspace Projections of High Dimensional Data”, Proc. VLDB Endowment, vol. 2, pp. 1270-1281, 2009.

F. Nie, S. Xiang, Y. Liu, C. Hou, and C. Zhang, “Orthogonal vs. uncorrelated least squares discriminant analysis for feature extraction,” Pattern Recognit. Lett., vol. 33, no. 5, pp. 485-491, 2012.

M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation”, Neural Comput., vol. 15, no. 6, pp. 1373-1396, 2003.

S. Xiang, F. Nie, C. Zhang, and C. Zhang, “Nonlinear dimensionality reduction with local spline embedding”, IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1285-1298, Sep. 2009.

R.J. Durrant and A. Kabán, “When Is ‘Nearest Neighbour’ Meaningful: A Converse Theorem and Implications,” J. Complexity, vol. 25, no. 4, pp. 385-397, 2009.

A. Kabán, “Non-Parametric Detection of Meaningless Distances in High Dimensional Data,” Statistics and Computing, vol. 22, no. 2, pp. 375-385, 2012.

D. François, V. Wertz, and M. Verleysen, “The Concentration of Fractional Distances”, IEEE Trans. Knowledge and Data Eng., vol. 19, no. 7, pp. 873-886, July 2007.

C.C. Aggarwal, A. Hinneburg, and D.A. Keim, “On the Surprising Behavior of Distance Metrics in High Dimensional Spaces,” Proc. Eighth Int’l Conf. Database Theory (ICDT), pp. 420-434, 2001.

M. Radovanovic, A. Nanopoulos, and M. Ivanovic, “Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data”, J. Machine Learning Research, vol. 11, pp. 2487-2531, 2010.

F. De La Torre and T. Kanade, “Discriminative cluster analysis”, in Proc. ICML, 2006, pp. 241-248.

L. Parsons, E. Haque, and H. Liu, “Subspace clustering for high dimensional data: A review”, ACM SIGKDD Explorations Newslett., vol. 6, no. 1, pp. 90-105, 2004.

X. R. Li, T. Jiang, and K. Zhang, “Efficient and robust feature extraction by maximum margin criterion,” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 157-165, Jan. 2006.




This work is licensed under a Creative Commons Attribution 3.0 License.