We BLATted the Internet! The DNA sequences from 40 billion webpages mapped to hg19 and other species: http://t.co/5XAsFCguE2
/ UCSC Genome Browser (@GenomeBrowser) January 23, 2014
"We're pleased to announce the release of the Web Sequences track on the UCSC Genome Browser. This track, produced in collaboration with Microsoft Research, contains the results of a 30-day scan for DNA sequences from over 40 billion different webpages. The sequences were then mapped with Blat to the human genome (...) The data were extracted from a variety of sources including patents, online textbooks, help forums, and any other webpages that contain DNA sequence. In essence, this track displays the Blat alignments of nearly every DNA sequence on the internet!"
I've mapped each genomic location from this track to a country and generated the following (unreadable) picture:
How this picture was generated
- I've downloaded the data from UCSC using the Table Browser. The data look like this:
#bin chrom chromStart chromEnd name score strand thickStart thickEnd reserved blockCount blockSizes chromStarts tSeqTypes seqIds seqRanges publisher pmid doi issn journal title firstAuthor year impact classes locus
585 chr1 14789 15004 3500336380 75 14789 15004 8421504 2 40,35 0,180 g 350033638000000000 0-75 Tophat, Cufflinks and replicates - Page 2 - SEQanswers seqanswers.com 0 0 WASH2P,WASH7P
585 chr1 15017 15590 3500327042 381 15017 15590 8421504 2 326,55 0,518 g 350032704200000008 0-747 Research Technologies at Indiana University biomedapp.iu.edu 0 0 WASH7P
585 chr1 68858 68895 3500020489 37 68858 68895 8421504 1 37 0 g 350002048900000000,350002048900000001 0-36,0-36 Genome mapability - Musings from a PhD candidate davetang.org 0 0 OR4F5
585 chr1 69170 69479 3500359797 142 69170 69479 8421504 2 76,66 0,243 c 350035979700000000,350035979700000002 0-76,10-76 CRAM compression and TLEN SAM's field - SEQanswers seqanswers.com 0 0 OR4F5
585 chr1 70013 70230 3500427570 150 70013 70230 8421504 2 75,75 0,142 g 350042757000000000,350042757000000001 0-75,0-75 Inconsistency with SAM flag output? - SEQanswers seqanswers.com 0 0 OR4F5
585 chr1 98860 98888 3500207083 26 98860 98888 8421504 3 5,7,14 0,6,14 g 350020708300000108,350020708300000060,350020708300000239 0-24,0-21,0-21 Method For The Simultaneous Determination Of Blood Group And Platelet Antigen Genotypes .freshpatents.com 0 0 OR4F5
586 chr1 137603 138008 3500170315 405 137603 138008 8421504 1 405 0 p 350017031500015076,350017031500015074 0-135,0-270 Balding D. (2007) Handbook of Statistical Genetics www.scribd.com 0 0 OR4F5
586 chr1 139485 143008 3500419332 1794 139485 143008 8421504 2 65,1729 0,1794 g 350041933200000004,350041933200000000,350041933200000001,350041933200000002,350041933200000003 0-1263,0-1859,0-1852,0-1860,0-576 PPT Evolution by Genome Duplication PowerPoint presentation | free to view www.powershow.com 0 0 OR4F5
586 chr1 141535 143008 3500270480 1372 141535 143008 8421504 24 57,60,58,59,61,59,60,59,59,62,61,58,60,58,16,59,59,59,59,57,57,59,58,58 0,61,125,187,250,314,377,441,503,566,631,695,756,819,881,919,981,1044,1107,1170,1230,1291,1353,1415 g 350027048000000003,350027048000000002 0-902,0-525 Chen-Kung Chou 3-22-2004 www.dls.ym.edu.tw 0 0 OR4F5
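- I want to generate a BED file: 'chrom/start/end/country'. The 23rd column contains the URL of the web sequence, and I use the domain of that URL to try to guess the country. Generic domains (.com, .org, .net...) and numeric addresses are skipped because they say nothing about the country. The following awk script was used to generate the file: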
BEGIN {
    FS="[\t]";
}
{
    country=$23;
    # keep only what precedes the first '/'
    for(;;) {
        slash=index(country,"/");
        if(slash==0) break;
        country=substr(country,1,slash-1);
    }
    # keep only what precedes the first ':' (port number)
    for(;;) {
        colon=index(country,":");
        if(colon==0) break;
        country=substr(country,1,colon-1);
    }
    # skip generic domains, numeric addresses and malformed hosts
    if( country ~ /\.$/ ) next;
    if( country ~ /\.com$/ ) next;
    if( country ~ /\.org$/ ) next;
    if( country ~ /\.cat$/ ) next;
    if( country ~ /\.net$/ ) next;
    if( country ~ /\.gov$/ ) next;
    if( country ~ /\.edu$/ ) next;
    if( country ~ /\.name$/ ) next;
    if( country ~ /\.info$/ ) next;
    if( country ~ /\.biz$/ ) next;
    if( country ~ /\.[0-9]+$/ ) next;
    if( index(country,".")==0) next;
    if( index(country," ")!=0) next;
    # keep only what follows the last '.'
    for(;;) {
        dot=index(country,".");
        if(dot==0) break;
        country=substr(country,dot+1);
    }
    # convert the two-letter code to a country name
    if( country== "af") {country="afghanistan";}
    else if( country== "ax") {country="Ålandislands";}
    else if( country== "al") {country="albania";}
    else if( country== "dz") {country="algeria";}
    else if( country== "as") {country="americansamoa";}
    else if( country== "ad") {country="andorra";}
    else if( country== "ao") {country="angola";}
    else if( country== "ai") {country="anguilla";}
    else if( country== "aq") {country="antarctica";}
    else if( country== "ag") {country="antiguaandbarbuda";}
    else if( country== "ar") {country="argentina";}
    else if( country== "am") {country="armenia";}
    else if( country== "aw") {country="aruba";}
    else if( country== "au") {country="australia";}
    else if( country== "at") {country="austria";}
    else if( country== "az") {country="azerbaijan";}
    else if( country== "bs") {country="bahamas";}
    else if( country== "bh") {country="bahrain";}
    else if( country== "bd") {country="bangladesh";}
    else if( country== "bb") {country="barbados";}
    else if( country== "by") {country="belarus";}
    else if( country== "be") {country="belgium";}
    else if( country== "bz") {country="belize";}
    else if( country== "bj") {country="benin";}
    else if( country== "bm") {country="bermuda";}
    else if( country== "bt") {country="bhutan";}
    else if( country== "bo") {country="bolivia,plurinationalstateof";}
    else if( country== "bq") {country="bonaire,sinteustatiusandsaba";}
    else if( country== "ba") {country="bosniaandherzegovina";}
    else if( country== "bw") {country="botswana";}
    else if( country== "bv") {country="bouvetisland";}
    else if( country== "br") {country="brazil";}
    else if( country== "io") {country="britishindianoceanterritory";}
    else if( country== "bn") {country="bruneidarussalam";}
    else if( country== "bg") {country="bulgaria";}
    else if( country== "bf") {country="burkinafaso";}
    else if( country== "bi") {country="burundi";}
    else if( country== "kh") {country="cambodia";}
    else if( country== "cm") {country="cameroon";}
    else if( country== "ca") {country="canada";}
    else if( country== "cv") {country="capeverde";}
    else if( country== "ky") {country="caymanislands";}
    else if( country== "cf") {country="centralafricanrepublic";}
    else if( country== "td") {country="chad";}
    else if( country== "cl") {country="chile";}
    else if( country== "cn") {country="china";}
    else if( country== "cx") {country="christmasisland";}
    else if( country== "cc") {country="cocos(keeling)islands";}
    else if( country== "co") {country="colombia";}
    else if( country== "km") {country="comoros";}
    else if( country== "cg") {country="congo";}
    else if( country== "cd") {country="congo,thedemocraticrepublicofthe";}
    else if( country== "ck") {country="cookislands";}
    else if( country== "cr") {country="costarica";}
    else if( country== "ci") {country="cÔted'ivoire";}
    else if( country== "hr") {country="croatia";}
    else if( country== "cu") {country="cuba";}
    else if( country== "cw") {country="curaÇao";}
    else if( country== "cy") {country="cyprus";}
    else if( country== "cz") {country="czechrepublic";}
    else if( country== "dk") {country="denmark";}
    else if( country== "dj") {country="djibouti";}
    else if( country== "dm") {country="dominica";}
    else if( country== "do") {country="dominicanrepublic";}
    else if( country== "ec") {country="ecuador";}
    else if( country== "eg") {country="egypt";}
    else if( country== "sv") {country="elsalvador";}
    else if( country== "gq") {country="equatorialguinea";}
    else if( country== "er") {country="eritrea";}
    else if( country== "ee") {country="estonia";}
    else if( country== "et") {country="ethiopia";}
    else if( country== "fk") {country="falklandislands(malvinas)";}
    else if( country== "fo") {country="faroeislands";}
    else if( country== "fj") {country="fiji";}
    else if( country== "fi") {country="finland";}
    else if( country== "fr") {country="france";}
    else if( country== "gf") {country="frenchguiana";}
    else if( country== "pf") {country="frenchpolynesia";}
    else if( country== "tf") {country="frenchsouthernterritories";}
    else if( country== "ga") {country="gabon";}
    else if( country== "gm") {country="gambia";}
    else if( country== "ge") {country="georgia";}
    else if( country== "de") {country="germany";}
    else if( country== "gh") {country="ghana";}
    else if( country== "gi") {country="gibraltar";}
    else if( country== "gr") {country="greece";}
    else if( country== "gl") {country="greenland";}
    else if( country== "gd") {country="grenada";}
    else if( country== "gp") {country="guadeloupe";}
    else if( country== "gu") {country="guam";}
    else if( country== "gt") {country="guatemala";}
    else if( country== "gg") {country="guernsey";}
    else if( country== "gn") {country="guinea";}
    else if( country== "gw") {country="guinea-bissau";}
    else if( country== "gy") {country="guyana";}
    else if( country== "ht") {country="haiti";}
    else if( country== "hm") {country="heardislandandmcdonaldislands";}
    else if( country== "va") {country="holysee(vaticancitystate)";}
    else if( country== "hn") {country="honduras";}
    else if( country== "hk") {country="china";}
    else if( country== "hu") {country="hungary";}
    else if( country== "is") {country="iceland";}
    else if( country== "in") {country="india";}
    else if( country== "id") {country="indonesia";}
    else if( country== "ir") {country="iran";}
    else if( country== "iq") {country="iraq";}
    else if( country== "ie") {country="ireland";}
    else if( country== "im") {country="isleofman";}
    else if( country== "il") {country="israel";}
    else if( country== "it") {country="italy";}
    else if( country== "jm") {country="jamaica";}
    else if( country== "jp") {country="japan";}
    else if( country== "je") {country="jersey";}
    else if( country== "jo") {country="jordan";}
    else if( country== "kz") {country="kazakhstan";}
    else if( country== "ke") {country="kenya";}
    else if( country== "ki") {country="kiribati";}
    else if( country== "kp") {country="northkorea";}
    else if( country== "kr") {country="southkorea";}
    else if( country== "kw") {country="kuwait";}
    else if( country== "kg") {country="kyrgyzstan";}
    else if( country== "la") {country="laopeople'sdemocraticrepublic";}
    else if( country== "lv") {country="latvia";}
    else if( country== "lb") {country="lebanon";}
    else if( country== "ls") {country="lesotho";}
    else if( country== "lr") {country="liberia";}
    else if( country== "ly") {country="libya";}
    else if( country== "li") {country="liechtenstein";}
    else if( country== "lt") {country="lithuania";}
    else if( country== "lu") {country="luxembourg";}
    else if( country== "mo") {country="macao";}
    else if( country== "mk") {country="macedonia,theformeryugoslavrepublicof";}
    else if( country== "mg") {country="madagascar";}
    else if( country== "mw") {country="malawi";}
    else if( country== "my") {country="malaysia";}
    else if( country== "mv") {country="maldives";}
    else if( country== "ml") {country="mali";}
    else if( country== "mt") {country="malta";}
    else if( country== "mh") {country="marshallislands";}
    else if( country== "mq") {country="martinique";}
    else if( country== "mr") {country="mauritania";}
    else if( country== "mu") {country="mauritius";}
    else if( country== "yt") {country="mayotte";}
    else if( country== "mx") {country="mexico";}
    else if( country== "fm") {country="micronesia,federatedstatesof";}
    else if( country== "md") {country="moldova,republicof";}
    else if( country== "mc") {country="monaco";}
    else if( country== "mn") {country="mongolia";}
    else if( country== "me") {country="montenegro";}
    else if( country== "ms") {country="montserrat";}
    else if( country== "ma") {country="morocco";}
    else if( country== "mz") {country="mozambique";}
    else if( country== "mm") {country="myanmar";}
    else if( country== "na") {country="namibia";}
    else if( country== "nr") {country="nauru";}
    else if( country== "np") {country="nepal";}
    else if( country== "nl") {country="netherlands";}
    else if( country== "nc") {country="newcaledonia";}
    else if( country== "nz") {country="newzealand";}
    else if( country== "ni") {country="nicaragua";}
    else if( country== "ne") {country="niger";}
    else if( country== "ng") {country="nigeria";}
    else if( country== "nu") {country="niue";}
    else if( country== "nf") {country="norfolkisland";}
    else if( country== "mp") {country="northernmarianaislands";}
    else if( country== "no") {country="norway";}
    else if( country== "om") {country="oman";}
    else if( country== "pk") {country="pakistan";}
    else if( country== "pw") {country="palau";}
    else if( country== "ps") {country="palestine,stateof";}
    else if( country== "pa") {country="panama";}
    else if( country== "pg") {country="papuanewguinea";}
    else if( country== "py") {country="paraguay";}
    else if( country== "pe") {country="peru";}
    else if( country== "ph") {country="philippines";}
    else if( country== "pn") {country="pitcairn";}
    else if( country== "pl") {country="poland";}
    else if( country== "pt") {country="portugal";}
    else if( country== "pr") {country="puertorico";}
    else if( country== "qa") {country="qatar";}
    else if( country== "re") {country="france";}
    else if( country== "ro") {country="romania";}
    else if( country== "ru") {country="russia";}
    else if( country== "rw") {country="rwanda";}
    else if( country== "bl") {country="saintbarthÉlemy";}
    else if( country== "sh") {country="sainthelena,ascensionandtristandacunha";}
    else if( country== "kn") {country="saintkittsandnevis";}
    else if( country== "lc") {country="saintlucia";}
    else if( country== "mf") {country="saintmartin(frenchpart)";}
    else if( country== "pm") {country="saintpierreandmiquelon";}
    else if( country== "vc") {country="saintvincentandthegrenadines";}
    else if( country== "ws") {country="samoa";}
    else if( country== "sm") {country="sanmarino";}
    else if( country== "st") {country="saotomeandprincipe";}
    else if( country== "sa") {country="saudiarabia";}
    else if( country== "sn") {country="senegal";}
    else if( country== "rs") {country="serbia";}
    else if( country== "sc") {country="seychelles";}
    else if( country== "sl") {country="sierraleone";}
    else if( country== "sg") {country="singapore";}
    else if( country== "sx") {country="sintmaarten(dutchpart)";}
    else if( country== "sk") {country="slovakia";}
    else if( country== "si") {country="slovenia";}
    else if( country== "sb") {country="solomonislands";}
    else if( country== "so") {country="somalia";}
    else if( country== "za") {country="southafrica";}
    else if( country== "gs") {country="southgeorgiaandthesouthsandwichislands";}
    else if( country== "ss") {country="southsudan";}
    else if( country== "es") {country="spain";}
    else if( country== "lk") {country="srilanka";}
    else if( country== "sd") {country="sudan";}
    else if( country== "sr") {country="suriname";}
    else if( country== "sj") {country="svalbardandjanmayen";}
    else if( country== "sz") {country="swaziland";}
    else if( country== "se") {country="sweden";}
    else if( country== "ch") {country="switzerland";}
    else if( country== "sy") {country="syrianarabrepublic";}
    else if( country== "tw") {country="taiwan";}
    else if( country== "tj") {country="tajikistan";}
    else if( country== "tz") {country="tanzania";}
    else if( country== "th") {country="thailand";}
    else if( country== "tl") {country="timor-leste";}
    else if( country== "tg") {country="togo";}
    else if( country== "tk") {country="tokelau";}
    else if( country== "to") {country="tonga";}
    else if( country== "tt") {country="trinidadandtobago";}
    else if( country== "tn") {country="tunisia";}
    else if( country== "tr") {country="turkey";}
    else if( country== "tm") {country="turkmenistan";}
    else if( country== "tc") {country="turksandcaicosislands";}
    else if( country== "tv") {country="tuvalu";}
    else if( country== "ug") {country="uganda";}
    else if( country== "ua") {country="ukraine";}
    else if( country== "ae") {country="unitedarabemirates";}
    else if( country== "gb") {country="unitedkingdom";}
    else if( country== "uk") {country="unitedkingdom";}
    else if( country== "us") {country="USA";}
    else if( country== "um") {country="unitedstatesminoroutlyingislands";}
    else if( country== "uy") {country="uruguay";}
    else if( country== "uz") {country="uzbekistan";}
    else if( country== "vu") {country="vanuatu";}
    else if( country== "ve") {country="venezuela";}
    else if( country== "vn") {country="vietnam";}
    else if( country== "vg") {country="virginislands,british";}
    else if( country== "vi") {country="virginislands,u.s.";}
    else if( country== "wf") {country="wallisandfutuna";}
    else if( country== "eh") {country="westernsahara";}
    else if( country== "ye") {country="yemen";}
    else if( country== "zm") {country="zambia";}
    else if( country== "zw") {country="zimbabwe";}
    else { next;}
    # emit the BED record: chrom, start, end, country
    printf("%s\t%s\t%s\t%s\n",$2,$3,$4,country);
}
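For readers who prefer a lookup table to a long if/else chain, here is a minimal Python sketch of the same idea. It is not the script used for this post: the names `guess_country`, `bed_line` and `TLD_TO_COUNTRY` are hypothetical, and the table is only a small excerpt of the codes listed in the awk script.

```python
import re
from typing import Optional

# Generic suffixes skipped by the awk script (they carry no country information).
SKIPPED_SUFFIXES = {"com", "org", "cat", "net", "gov", "edu", "name", "info", "biz"}

# Hypothetical excerpt of the country-code table; the awk script lists many more.
TLD_TO_COUNTRY = {
    "fr": "france",
    "de": "germany",
    "tw": "taiwan",
    "gb": "unitedkingdom",
    "uk": "unitedkingdom",
    "us": "USA",
    "hk": "china",
}


def guess_country(url_field: str) -> Optional[str]:
    """Guess a country from a host such as 'www.dls.ym.edu.tw'; None if unknown."""
    # Keep what precedes the first '/' and then the first ':'.
    host = url_field.split("/", 1)[0].split(":", 1)[0]
    if "." not in host or " " in host or host.endswith("."):
        return None
    suffix = host.rsplit(".", 1)[1]
    if re.fullmatch(r"[0-9]+", suffix) or suffix in SKIPPED_SUFFIXES:
        return None
    return TLD_TO_COUNTRY.get(suffix)


def bed_line(row: str) -> Optional[str]:
    """Turn one tab-delimited table row into 'chrom<TAB>start<TAB>end<TAB>country'."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) < 23:
        return None
    country = guess_country(fields[22])  # 23rd column, as in the awk script
    if country is None:
        return None
    return "\t".join([fields[1], fields[2], fields[3], country])
```

A dictionary keeps the code-to-country mapping in one place, so adding or correcting a code does not touch the parsing logic.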
- For the world map, I've used an SVG vector map from Wikimedia Commons: https://commons.wikimedia.org/wiki/File:World_V2.0.svg.
The coordinates of the boundaries of each country are defined in an SVG 'path' element:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
  "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/
<svg xmlns="http://www.w3.org/2000/svg" width="8.88889in" height="4.44444in" viewBox="0 0 800 400">
<path id="Taiwan" fill="none" stroke="black" stroke-width="1" d="M 668.85,151.22 C 668.98,150.71 ...
<path id="Estonia" fill="none" stroke="black" stroke-width="1" d="M 460.75,68.26 C 459.95,68.11 4 ...
<path id="Latvia" fill="none" stroke="black" stroke-width="1" d="M 461.23,72.27 C 460.75,72.20 46 ...
<path id="Lithuania" fill="none" stroke="black" stroke-width="1" d="M 452.39,79.42 C 452.67,79.72 ...
<path id="Byelarus" fill="none" stroke="black" stroke-width="1" d="M 453.57,81.92 C 453.87,82.37 ...
<path id="Ukraine" fill="none" stroke="black" stroke-width="1" d="M 453.09,85.95 C 453.43,86.30 4 ...
<path id="Moldova" fill="none" stroke="black" stroke-width="1" d="M 460.57,93.00 C 461.00,93.70 4 ...
<path id="Syria" fill="none" stroke="black" stroke-width="1" d="M 480.33,127.61 C 481.03,127.90 4 ...
<path id="Turkey" fill="none" stroke="black" stroke-width="1" d="M 499.47,116.91 C 499.31,116.32 ...
<path id="Kuwait" fill="none" stroke="black" stroke-width="1" d="M 505.26,133.84 C 504.97,134.56 ...
<path id="Saudi Arabia" fill="none" stroke="black" stroke-width="1" d="M 495.83,163.75 C 496.31,1 ...
<path id="United Arab Emirates" fill="none" stroke="black" stroke-width="1" d="M 516.03,150.12 C ...
<path id="Yemen" fill="none" stroke="black" stroke-width="1" d="M 517.68,161.79 C 517.01,160.50 5 ...
<path id="Slovenia" fill="none" stroke="black" stroke-width="1" d="M 430.55,97.07 C 429.62,97.36 ...
<path id="Croatia" fill="none" stroke="black" stroke-width="1" d="M 439.09,103.97 C 439.01,103.46 ...
<path id="Bosnia and Herzegovina" fill="none" stroke="black" stroke-width="1" d="M 440.44,105.02 ...
(...)
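To give an idea of how the country outlines can be pulled out of such a file, here is a small Python sketch using the standard library XML parser. The function name `country_paths` and the sample document are assumptions made for illustration: only the first coordinates of each sample path come from the excerpt above, the rest are invented to make the paths complete.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this XML namespace (see the xmlns attribute above).
SVG_NS = "{http://www.w3.org/2000/svg}"


def country_paths(svg_text: str) -> dict:
    """Map each path id (the country name) to its 'd' path data."""
    root = ET.fromstring(svg_text)
    paths = {}
    for element in root.iter(SVG_NS + "path"):
        name = element.get("id")
        data = element.get("d")
        if name and data:
            paths[name] = data
    return paths


# Hypothetical two-country document; coordinates after the first curve
# control point are made up.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 400">'
    '<path id="Taiwan" fill="none" stroke="black" '
    'd="M 668.85,151.22 C 668.98,150.71 669.10,150.20 669.30,149.80 Z"/>'
    '<path id="Estonia" fill="none" stroke="black" '
    'd="M 460.75,68.26 C 459.95,68.11 459.50,68.00 459.00,67.90 Z"/>'
    "</svg>"
)

if __name__ == "__main__":
    for name, data in country_paths(SAMPLE_SVG).items():
        print(name, data[:30])
```

Note that the ids in the map (for example "Byelarus" or "Saudi Arabia") do not always match the lowercase names produced by the awk script, so some normalization of the names is needed before the two can be joined.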
- I've joined the data using a custom Java program (available on GitHub at: https://github.com/lindenb/jvarkit/wiki/WorldMapGenome ). The program transforms the 'path' elements to a GeneralPath.
$ cat map.bed |\
  java -jar dist/worldmapgenome.jar \
  -u World_V2.0.svg \
  -w 2000 -o ~/ouput.jpg \
  -R hg19.fasta
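The conversion of a 'path' element into a drawable shape can be illustrated with a short Python sketch. This is not the Java code of the program: it is a simplified parser, under the assumption that the path data only uses the absolute commands M, L, C and Z, and the name `parse_path` is hypothetical. Each returned segment corresponds roughly to one moveTo, lineTo, curveTo or closePath call on a GeneralPath.

```python
import re
from typing import List, Tuple

Segment = Tuple[str, Tuple[float, ...]]

# Number of coordinates expected by each supported absolute command.
ARITY = {"M": 2, "L": 2, "C": 6, "Z": 0}
TOKEN = re.compile(r"[MLCZ]|-?\d+(?:\.\d+)?")


def parse_path(d: str) -> List[Segment]:
    """Split SVG path data into (command, coordinates) segments."""
    # Reject anything outside the simplified syntax (e.g. relative commands).
    if re.search(r"[^MLCZ0-9\s,.\-]", d):
        raise ValueError("unsupported path syntax")
    tokens = TOKEN.findall(d)
    segments: List[Segment] = []
    command = None
    i = 0
    while i < len(tokens):
        if tokens[i] in ARITY:
            command = tokens[i]
            i += 1
            if command == "Z":
                segments.append(("Z", ()))
                continue
        if command is None or command == "Z":
            raise ValueError("coordinates without a command")
        n = ARITY[command]
        chunk = tokens[i:i + n]
        if len(chunk) != n or any(t in ARITY for t in chunk):
            raise ValueError("truncated path data")
        segments.append((command, tuple(float(t) for t in chunk)))
        i += n
        # In SVG, extra coordinate pairs after M are implicit line-to commands.
        if command == "M":
            command = "L"
    return segments
```

For example, `parse_path("M 0,0 C 1,1 2,2 3,3 Z")` yields one move, one cubic curve and one close segment.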
Pierre