Biological Sequence Compression Based On Properties of Unique and Repeated Similarities of Sequences Using Variable Length LUT

Rajendra Kumar Bharti; Archana Verma; R.K. Singh

Abstract

A genome may contain several copies of the same gene. Although the human genome contains about 3 billion base pairs, only 3% of it encodes protein. The human genome has only about 25,000 genes, which encode about 100,000 proteins through alternative splicing. Biological sequences are commonly of two types: unique and repeated. We exploit these properties of the sequences. Earlier algorithms work on either unique or repeated sequences, but not both. We merge the two methodologies into a new algorithm that compresses both types of sequence, so the same compression algorithm can be applied to all types of sequences. This reduces the effort of developing separate algorithms, and a single algorithm is easier to apply than several different ones. In this paper, a biological sequence compression algorithm is proposed that compresses both unique sequences, which are repeated in one area, and repeated sequences, which are interspersed throughout the genome. The algorithm is also compared with existing ones and is found to achieve a better compression ratio than the others.

Keywords

Genome, Sequence, Uniqueness, Compression ratio, DNA Compress, Gen Compress, LUT, Base Pair

This work is licensed under a Creative Commons Attribution 3.0 License.
