Analyzing the Effect of Adding Noise on Compressed Textual Data
Abstract
Compression is one of the techniques for better utilization of storage devices, resulting in saving of storage space.This paper addresses compression by using the technique called Normalized Compression Distance (NCD). The Normalized Compression Distance is based on algorithmic complexity developed by Kolmogorov, called Normalized Information Distance.Normalized Compression Distance can be used to cluster objects of any kind, such as music, texts, or gene sequences (microarray classification). The NCD between two binary strings is defined in terms of compressed sizes of the two strings and of their concatenation; it is designed to be an effective approximation of the non computable but universal Kolmogorov distance between two strings. This paper studies the influence of noise on the normalized compression distance, a measure based on the use of compressors to compute the degree of similarity of two files. This influence is approximated by a first order differential equation which gives rise to a complex effect, which explains the fact that the NCD may give values greater than 1. Finally, the analyzing the effect of adding noise on compressed textual data and findings are that NCD performs well even in the presence of quite high noise levels by using CompLearn Toolkit.
Keywords
Full Text:
PDFReferences
Paul M. B. Vitanyi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li,“Normalized Information Distance”, September 2008.
Lance Fortnow, “Kolmogorov Complexity”, January 2000.
M. Li, P. M. B. Vitanyi, “An Introduction to Kolmogorov Complexity and Its Applications”, 2nd Ed., Springer-Verlag, New York1997.
Cilibrasi, R.L., Vitanyi, P.M.B. “Clustering by compression” IEEE Transactions on Information Theory, Vol. 51, No.4, 1523–1545 2005.
Chen, S. Kwong, and M. Li. “A compression algorithm for DNA sequences and its applications in genome comparison”, In Genome Informatics: Proceedings of the 10th Workshop on Genome Informatics,pages 51–61, 1999.
Kiril Simov and Petya Osenova, “Experiments with Normalized Compression Metric”, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences 5-6 December 2005.
Manuel Cebrián, Manuel Alfonseca, Alfonso Ortega, “The Normalized Compression Distance Is Resistant to Noise”, 0018-9448, 2007.
R. Cilibrasi, A. L. Cruz, and S. de Rooijet, The CompLearn Toolkit [Online]. Available: http://www.complearn.org.
M. Li, X. Chen, X. Li, B. Ma, P. Vitanyi, “The similarity metric”, IEEE Transactions of Information Theory, Vol. 50, No.12, 3250-3264, 2004.
Bennett, C., Gacs, P., Li, M., Zurek, W., Vitányi, P., “Information Distance”, IEEE Transactions on Information Theory, Vol. 44, No. 4,1407-1423,1998.
Volker Nannen, “A Short Introduction to Kolmogorov Complexity”,April, 2003.
Khalid Sayood, “Introduction to Data Compression”, Third Edition Morgan Kaufmann Publishers, ISBN 13:978-0-12-620862-7, December15, 2005.
Data Compression, URL: http://en.wikipedia.org/wiki/Data_compression.
S. Braunstein, C. Fuchs, D. Gottesman, and H. Lo. “A quantum analog of huffman coding”. IEEE Transactions of Information Theory, Vol.46,1644–1649, 2000.
Sameh Ghwanmeh, Riyad Al-Shalabi and Ghassan Kanaan , “Efficient Data Compression Scheme using Dynamic Huffman Code Applied on Arabic Language” Journal of Computer Science, Vol. 2, No.12, 885- 888, 2006..
Refbacks
- There are currently no refbacks.
This work is licensed under a Creative Commons Attribution 3.0 License.