I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.
The following command is used to build the list of BAMs:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\ tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \ wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \ wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \ wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \ wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \ (...)
This list is inserted as a list named SAMPLES a Makefile.
For each BAM, we use samtools to retrieve the reads in the region(s)) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to only keep the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read, computes the genomic interval skipped when the cigar operator is a deletion or a skipped region/intron. The read is printed if it describes the new intron-exon junction.
All in one:
That's it,
Pierre
No comments:
Post a Comment