
Map-Reduce Based High Performance Clustering On Large Scale Dataset Using Parallel Data Processing

Anusha Vasudevan, M. Swetha, H. Hyba, G. Rajiv Suresh Kumar

Abstract


The amount of data in our world has been exploding, and analyzing large data sets (so-called big data) will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. Big data refers to datasets that have grown too large to be handled by traditional methods of capture, storage, and processing within a tolerable amount of time. Apache Hadoop is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware. It works with the Map-Reduce software framework, which makes it easy to write applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Cluster analysis is an unsupervised learning task that consists of grouping objects so that the objects in one group share similar features and differ from the objects in other groups. This paper shows that a K-means clustering algorithm implemented on the Map-Reduce framework can achieve higher performance when handling large-scale automatic document classification in a multi-node environment.
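As an illustration of how K-means fits the Map-Reduce model described above, the following is a minimal sketch in plain Python running in a single process. It is not the authors' implementation and does not use Hadoop; the function names (`map_phase`, `reduce_phase`, `kmeans`) and the sample data are hypothetical. The map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point."""
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        yield nearest, point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points that were assigned to each centroid."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    updated = list(centroids)  # a centroid with no assigned points keeps its position
    for key, members in groups.items():
        updated[key] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centroids, max_iterations=20):
    """Repeat the map and reduce steps until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical sample data: two well-separated groups of 2-D points.
    sample = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(sample, [(0, 0), (10, 10)]))
```

On a real cluster, the map step would run in parallel over separate blocks of the input and the framework would group the emitted pairs by key before the reduce step; here both steps run sequentially only to show the data flow.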


Keywords


Big data, Hadoop, Map-Reduce, Clustering, HDFS.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.