06 July 2012

The LZW compression algorithm as a measure of the short-reads complexity

The LZW algorithm, is a dictionary-based universal lossless data compression algorithm. The algorithm is easy to implement, here is a pseudocode (copied from there):

string s;
char ch;
...

s = empty string;
while (there is still data to be read)
{
    ch = read a character;
    if (dictionary contains s+ch)
    {
 s = s+ch;
    }
    else
    {
 encode s to output file;
 add s+ch to dictionary;
 s = ch;
    }
}
encode s to output file;
And here is my C++ implementation: the size of the dictionary reflects the complexity of the sequence: http://code.google.com/p/variationtoolkit/source/browse/trunk/src/lzw.h.

I've used this complexity to plot the number-of-reads=f(size-LZW);

Exome data

#complexity mapped unmapped sample
9 23274 21 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
10 1676 31379 CTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
11 2365 455 CCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
12 1523 5118 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGAAAAAA
13 1941 2827 GTTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
14 2253 2495 CCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCACCCCCCACCCACACC
15 2774 2908 ATAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAG
16 3965 3149 AAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCCCCTCCT
17 6944 4020 CTTTTTTTTTTTCTTTTCTTTTTTTTTTCCCTCTTTTTTTTTTTTTTTTTTTTC
18 11607 5143 TTGGTTTTTTTTTTTTTTTTTTTTTTTGGTTTGTTTTTTTTTTTTTTTTACCCT
19 19659 6724 GGGGGGGGGGGGGGGGGGGGGAGGAGGAAGGGGAGGAAGGGAGGAGGAAAGAGA
20 32504 9412 ACCCTAACCCTACCCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAA
21 50824 13984 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
22 77399 19651 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACTCTAACCCTAACCCTAACC
23 114774 28966 GATCTCCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACC
24 176229 43729 TCCGATCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
25 316878 67402 TTCCGATCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
26 721852 104378 TTCCGATCTGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
27 2028152 164968 CCGATCTAGGGTTAGGGTTAGGGTTGGGGTTAGGGTTAGGGTTAGGGTTAGGGT
28 6108817 284769 GCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCA
29 16907095 553236 GGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCCGCTGG
30 37103260 1130111 GTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGAT
31 55419720 1911772 GACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAG
32 47163550 2041580 GCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTT
33 17867328 1073476 CTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACT
34 2014482 212089 TTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTNNACCNGGCCTTTGAGAG
35 72637 30914 ACATCAANCTCAGGCACNTGGCCCAGGTCTGGCACTTAGAAGTAGTTCTCTGGG
36 8496 8247 AGGATATCTGGGNTGCNNCCGGAGTCGCAGTGTCTTGGGCCGCCTGAAGGTGAG
37 905 1506 AAGCATTACTGGAAACATCCTCATTGTGTTNTCTGNGACCANTNACCCTCACTN
38 58 102 TCGAGCNNCGTTGACTTCAGGNGGTCTNCTACCAGCAGCTCGNAATAGTTGCAC
39 1 0 AANTTCNAACGACTGTANNTCATNNGGCNNTGCNGGNCCNANAAACTGGCTGAG

Whole Genome data

#complexity mapped unmapped sample
13 1 2728 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
14 0 608 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGG
15 0 2181 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
16 7 2095 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGG
17 27 1924 AAAAAAAAAAAAAAAACTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT
18 41 2558 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAATAAAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
19 66 2961 GGGGGGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGTGTGGGCG
20 127 3391 AGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGCGGGGGGGGGAGGAGGGGGGGG
21 181 4244 NNNNNNNNNNNNNNNNNNNNNNNATTANNNNNNNNNNNNNNNNNNNTAANNNNNNNNNNNNNNNNNNNANNNNNNNNNNNANNNNNNNNNNNNNNNNNNN
22 371 5242 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGAAGAAAAAAAAAAA
23 627 6308 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGCGCGCGGCCGGGGGCGCGGGT
24 1398 8204 GGTATAATGCTAGGTATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
25 3990 10923 GAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGA
26 8469 15085 CATCAGAATACAGCTAACAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAA
27 14494 20162 GCGGTGGCGGGGGCCCGCGGGCCCCCCGCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGGCGG
28 24273 26976 TTCTTTCTTTCTTTCTTTCTTTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTCCTCCTTTTCTTTCCTTTTCTTTCT
29 37918 35975 ACCAACACCACACCCCCACCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCAACCCCTAACCCTAACCCTAACC
30 58261 48523 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
31 81164 62111 CTAACCCTAACCCTAACCCTAACCCTAACCCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
32 112886 79802 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCTAACCCCAAC
33 154666 101551 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGACCCCTAACCCGA
34 204241 130861 CCCTAGCCCCTCCCTATCCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAAAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCAACCCTAA
35 267951 166738 TCACCCTCACCCTCACCCTCACCCTAACCCTCACCCTCACCCTCACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACC
36 347051 210144 TAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTAACCCCATCACTAACCTGTAACCCTCACCCTAACCCTA
37 447808 259191 CCCCACCCCCATCCCTAACCCGACCCTCAACCCAACCCCGAACCCAAACCCCAACCCCAACCCAAACCCAAACCCTAACCCTAACCCAAACCCTAACCCA
38 581941 320139 CTAACCCTAACCCTAACCCTAACCCTAACCCTTACCCTTACCCTTAACCCTCAACCCAACCCTAACACTAACCCTAACCCTAACCCCAAACCCAAGCCCA
39 770244 392189 CCCTACCCCTANCCCTACCCCTACCCCTAACCCTAACCCTAACCCTAACNCTAACCCAACCCCTCACACTACCCATCACCCCCACACCCTACCCCTACCC
40 1040306 485156 CCTTCACTCTACGCTTATCTCCCTACCTACCCCTAACCCTATCCCTAACCCTAACCCTATCCCTAACCCTAACCCTACCCCTAACCCTTACCCTAACCCA
41 1472994 592142 CAACCCGAGTACAATGGAAACGAATGGAATGGAATGAAATGGAATGGAATGGAATGGAATAGAATGGAATGGAATGGAATGGAATCAACCCGAGTGCAAT
42 2167116 711100 CCATCACCCCACCCCTACCCCTAACGCCACCCCTACCCCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCAGATCGGA
43 3307945 822818 CTTCCCCTACCCCCAACCCCGATCCCGAACCCAACCCCTAGCCCTACCCTTAACCCATCCCCATCCCTACCCCTAACCCTAACCCTAACCCTAAGCCAAC
44 5436093 924140 AGGGTTAGGGTAAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAGGGTTCGGGATTGGAAAGAGCGGCGGGTTGGGGGAGGGTTATGGGATCTGATGAAAT
45 10232300 1030747 CTAACCCTAACCCTAACCCTAACCTAACCCATCCCCCAGCCAACCTTTACCCTCACCCCTCCTCTGACCCTAACCCTCAACCTTCCCCTGCCTCGGAATC
46 21775389 1204827 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCC
47 47888153 1605181 CTAACCCTAACCCTAACCCTAACCCTAACCTAAACCTTATCGCTATGCTTACCAGTAGCCTGAACCTGACCAATACACTAACCCTCACCCGGAAAATAAA
48 98581306 2536484 TAACCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTA
49 175000036 4206340 GTGGTTTTTGTCTGCCAGTTCATGGTAATCACAGTGATTTCAAGGGGGGGTAAAAAAAGGAGGTGTGAGAGGGGCCCCCGGTTTCCACACAGACACCACA
50 246244603 6207853 CCTAACCCTAACCCTACCCATATACCTAACCCTAAAATTAAAAGTAATCATAACCCTAACCTTAGTTCTGCAACTACGGCTACACACACGTGCAGACCTA
51 247156482 7048827 CCTCTGGTGGCCCTGTCCGGGCATGACAGAAGGCGCGCACCCTTGACTTCTGTTCACTTCTCACTATGTCCCCTCAGCCCCTATCTCTGAATGGCCTGGC
52 155610209 5305604 GCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCAACGAAATCTGTGCAGAGGACAACGCAGCT
53 53933711 2288949 AGCGTCGCAACTCAAATGCAGCATTCCTAATGCACACATGACACCCAAAATATAACAGACATATTACTCATGGAGGGGGAGGGTGAGTGTGAGGGTGAGG
54 9490136 496289 TTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCGGGCGCTGTGCCCTACCTTTGCTCTGCCCGCTGGAGACGGGGTTTGTCATGGG
55 991089 67105 AATTTCTGGAATGGATTATTAAACAGAGAGTCTGTAAGCACTTAGAAAAGGCCGCGGTGAGCCCCAGGGGCCAGCACTGCTCGAAATGTACAGCATTTCT
56 105830 11519 GGAAAATTTCTGGAATGGATTATTACAGAGTCTGTAAGCACTTAGAAAAGGCCGCGGTGAGTCCCAGGGGCCAGCACTGCTCGAAATGTACAGCATTTCT
57 21081 3128 TNACNGANGNNTNNNGTNTATTGNTCCAANAATCGNAGANNGAGAGGTTAAANTNNNNNNCNNNGATTNTGGGTTGTCTATTGATGTTTTTGGTCTATTC
58 6263 964 ATCNAGAGGCCAAGCCCAGCCTGTCNGCTTTNGTGTATAAAGNTCTCATGGAACAGAGCTGTGAGCCTGCCGNNTGTNGTCNNNNNNTNCGCCTGGNNAN
59 1360 246 ATTNGCCGGATGTGGTGGTGGGCGCCTGTAGTCCCAACTACTCAGGAGGCTGAAGCAGGAGAATGGCNAGAACNCGNNAGATGGNNGNTGNNGTNAGCCG
60 198 26 TCGGTCAACAAAATGGGTGACAGAGACCTACGCACGGATTATAATNNANCNGGCNCCANCCCGAGTGNTNNNCGGGGATTGGATGGNNCANNNTCCATAG
61 8 3 TGGAAAATNACTAGCNNGGAAGCAGACTNCGGGCCANANANATANNCAGTCACTTTANGCCCNGNANGGTGGNTCACANCTGTNATCCTANGNCNTTGGN
62 1 0 AGTTACGTGCTTACAGAATACTTTNTTTTGAGGTCAATANNANNANTAAGTNANGNATNCNNGATATCCTAGNGGGAATTCTCCGNCCTTCTGGAAGCTG


I'm sure there's must be something to say about this, but I just don't have time :-)

That's it,

Pierre


No comments: