### Decision Trees for Uncertain Data

#### Abstract

Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms such as federated and peer-to-peer databases are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of high-dimensional databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the vision that these databases will soon contain large amounts of genomic data, characterized by its high dimensionality. Current decision tree algorithms would require high communication bandwidth when executed on such data, bandwidth that is not likely to exist in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive the exact same decision tree a sequential learner would build from all the data in the network.

Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by a single value but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as the mean or the median), we find that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item, i.e., its probability density function (pdf), is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments show that the resulting classifiers are more accurate than those built on value averages.
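One way to picture the extension: when an attribute value is a pdf rather than a point, a tuple is distributed *fractionally* to both branches of a split, and impurity is computed over the resulting fractional class masses. The sketch below illustrates this under assumed conventions; the tuple format (a list of `(value, probability)` samples approximating the pdf) and the function names are invented for illustration and are not the paper's implementation.

```python
import math
from collections import defaultdict

def entropy(class_mass):
    """Entropy of fractional class counts ({label: probability mass})."""
    total = sum(class_mass.values())
    if total == 0:
        return 0.0
    return -sum((m / total) * math.log2(m / total)
                for m in class_mass.values() if m > 0)

def split_entropy(tuples, split):
    """Expected entropy after splitting at `split`.

    Each tuple is (samples, label), where `samples` is a list of
    (value, probability) pairs approximating the tuple's pdf.
    Tuples are distributed fractionally to both branches in
    proportion to the probability mass on each side of the split.
    """
    left, right = defaultdict(float), defaultdict(float)
    for samples, label in tuples:
        p_left = sum(p for v, p in samples if v <= split)
        left[label] += p_left
        right[label] += 1.0 - p_left
    n = float(len(tuples))
    return (sum(left.values()) / n) * entropy(left) + \
           (sum(right.values()) / n) * entropy(right)

# Two tuples whose pdfs straddle the candidate split point 0.5:
data = [
    ([(0.2, 0.7), (0.8, 0.3)], "A"),  # 70% of its mass on the left
    ([(0.3, 0.4), (0.9, 0.6)], "B"),  # 60% of its mass on the right
]
print(round(split_entropy(data, 0.5), 3))  # 0.933
```

Collapsing each pdf to its mean before splitting discards exactly the cross-branch mass that this computation accounts for, which is where the accuracy gain reported above comes from.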

Since processing pdf's is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is much more CPU-demanding than on point-valued data; we therefore devise a series of pruning techniques that greatly improve construction efficiency.
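The cost concern is what motivates pruning the set of candidate split points rather than scoring every possible threshold. A minimal sketch of that idea, under assumptions: a toy two-sample pdf representation, the Gini index as the impurity measure, and invented function names; restricting candidates to the end points of each tuple's pdf support stands in for the paper's actual pruning machinery.

```python
from collections import defaultdict

def gini(class_mass):
    """Gini impurity of fractional class counts ({label: mass})."""
    total = sum(class_mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in class_mass.values())

def fractional_gini(tuples, split):
    """Weighted Gini after fractionally splitting pdf-valued tuples."""
    left, right = defaultdict(float), defaultdict(float)
    for samples, label in tuples:
        p_left = sum(p for v, p in samples if v <= split)
        left[label] += p_left
        right[label] += 1.0 - p_left
    n = len(tuples)
    return (sum(left.values()) / n) * gini(left) + \
           (sum(right.values()) / n) * gini(right)

def pruned_candidates(tuples):
    """Candidate thresholds restricted to each tuple's pdf end points
    (in this two-sample toy encoding, these are all the sample values,
    so no split quality is lost by the restriction)."""
    points = set()
    for samples, _label in tuples:
        values = [v for v, _p in samples]
        points.update((min(values), max(values)))
    return sorted(points)

def best_split(tuples):
    """Exhaustive search over the pruned candidate set."""
    return min(pruned_candidates(tuples),
               key=lambda s: fractional_gini(tuples, s))

data = [
    ([(0.1, 0.5), (0.4, 0.5)], "A"),
    ([(0.6, 0.5), (0.9, 0.5)], "B"),
]
print(best_split(data))  # 0.4 cleanly separates the two classes
```

The payoff is that the number of impurity evaluations grows with the number of retained end points instead of with the full resolution of the pdfs.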

#### Keywords

#### Full Text:

PDF

#### References

R. Agrawal, T. Imielinski, and A. N. Swami, "Database mining: A performance perspective," IEEE Trans. Knowl. Data Eng., 1993.

J. R. Quinlan, "Induction of decision trees," Machine Learning, 1986.

J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

C. L. Tsien, I. S. Kohane, and N. McIntosh, "Multiple signal integration by decision tree induction to detect artifacts in the neonatal intensive care unit," Artificial Intelligence in Medicine, vol. 19, no. 3, 2000.

M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in PAKDD, 2006, pp. 199–204.

W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip, "Efficient clustering of uncertain data," in ICDM, 2006, pp. 436–445.

S. D. Lee, B. Kao, and R. Cheng, "Reducing UK-means to K-means," in 1st Workshop on Data Mining of Uncertain Data, in ICDM, 2007.

Y. Yuan and M. J. Shaw, "Induction of fuzzy decision trees," Fuzzy Sets Syst., vol. 69, no. 2, pp. 125–139, 1995.

T. Elomaa and J. Rousu, "General and efficient multisplitting of numerical attributes," Machine Learning, vol. 36, no. 3, pp. 201–244, 1999.

U. M. Fayyad and K. B. Irani, "On the handling of continuous-valued attributes in decision tree generation," Machine Learning, 1992.

T. Elomaa and J. Rousu, "Efficient multisplitting revisited: Optima-preserving elimination of partition candidates," Data Mining and Knowledge Discovery, vol. 8, no. 2, pp. 97–126, 2004.

T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.

A. Asuncion and D. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html

G. Greenspan and D. Geiger, "Model-based inference of haplotype block variation," in RECOMB, 2003, pp. 131–137.

R. R. Hudson, "Generating samples under a Wright-Fisher neutral model of genetic variation," Bioinformatics, vol. 18, pp. 337–338, 2002.

E. B. Hunt, J. Marin, and P. T. Stone, Experiments in Induction. Academic Press, 1966.

R. Jin and G. Agrawal, "Communication and memory efficient parallel decision tree construction," in Proc. of the 3rd SDM, 2003.

M. V. Joshi, G. Karypis, and V. Kumar, "A new scalable and efficient parallel classification algorithm for mining large datasets," in Proc. of the International Parallel Processing Symposium, 1998.

M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A fast scalable classifier for data mining," in Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.

J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81–106, 1986.

R. Rastogi and K. Shim, "PUBLIC: A decision tree classifier that integrates building and pruning," Data Mining and Knowledge Discovery, vol. 4, pp. 315–344, 2000.

N. J. Risch, "Searching for genetic determinants in the new millennium," Nature, vol. 405, pp. 847–856, 2000.

J. C. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," in Proc. of the 22nd VLDB Conf., 1996.

A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, "Parallel formulations of decision-tree classification algorithms," Data Mining and Knowledge Discovery, vol. 3, pp. 237–261, 1999.

W. Stühlinger, O. Hogl, H. Stoyan, and M. Müller, "Intelligent data mining for medical quality management," in the Fifth Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP-2000), Workshop Notes of the 14th European Conference on Artificial Intelligence (ECAI-2000), 2000, pp. 55–67.


This work is licensed under a Creative Commons Attribution 3.0 License.