I suggested creating a new magic file for the Linux file command. As an example, a BAM file starts (position=0) with the magic bytes BAM\1. A magic file definition 'bam' for BAM would be:

    0 string BAM\1 BAM file v1.0

Compile the new magic file:

    file -C -m bam

Now use this magic:

    file -z -m bam.mgc file.bam
    file.bam: BAM file v1.0 (data)
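The -z option matters here because a BAM file is BGZF-compressed (gzip-compatible), so the BAM\1 bytes only appear once the stream is decompressed. As a rough cross-check that does not rely on file at all, the same test can be sketched in shell. The function name is_bam is made up for this illustration, and the sketch assumes gzip, head, od and tr are available:

```shell
# Hypothetical helper (not part of file(1)): succeed if the decompressed
# stream of "$1" starts with the 4 magic bytes B A M \1.
is_bam() {
    local magic
    # Decompress to stdout, keep the first 4 bytes, hex-dump them,
    # then strip the spaces and newlines that od adds.
    magic=$(gzip -dc "$1" 2>/dev/null | head -c 4 | od -An -tx1 | tr -d ' \n')
    # "BAM\1" is 0x42 0x41 0x4d 0x01.
    [ "$magic" = "42414d01" ]
}
```

Usage would be something like: is_bam file.bam && echo "looks like BAM". A file that is not gzip-compressed simply fails the test, because gzip produces no output for it.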
Now, I've started a GitHub repo containing some 'magic' patterns for bioinformatics (Fastq, blast, bam, bigwig, etc.).
(My current problem is prioritizing some results, such as distinguishing a 'Nucleotide' from a 'Protein' Fasta sequence (http://unix.stackexchange.com/questions/154096/file1-and-magic5-prioritizing-a-result).)
That's it,
Pierre
Great idea. As you noticed (protein versus nucleotide), you'll run into the limitations of the file utility pretty soon.
I suggest that you wrap this into a (bash) script -- biofile? -- that extends this project's capabilities to scripting necessary functions. Bio* projects have plenty of code examples illustrating practical algorithms to choose from.
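To make the wrapper idea a bit more concrete, here is a minimal sketch of what such a script might look like. Everything in it is an assumption made for illustration: the names biofile and fasta_kind, the 90% threshold, and the choice to look only at the first 100 sequence lines. It guesses nucleotide versus protein from the letter composition, and hands anything that is not FASTA to the compiled bam.mgc from the post (assumed to sit in the current directory).

```shell
# Hypothetical heuristic: classify a FASTA file by residue composition.
fasta_kind() {
    local seq total nuc
    # Keep sequence lines only (drop '>' headers), look at the first 100,
    # remove all whitespace and uppercase the rest.
    seq=$(grep -v '^>' "$1" | head -n 100 | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]')
    total=${#seq}
    if [ "$total" -eq 0 ]; then
        echo "unknown"
        return
    fi
    # Count the letters that are plausible nucleotide codes.
    nuc=$(printf '%s' "$seq" | tr -cd 'ACGTUN' | wc -c)
    # Assumed cut-off: at least 90% nucleotide letters means nucleotide.
    if [ $((nuc * 100 / total)) -ge 90 ]; then
        echo "nucleotide"
    else
        echo "protein"
    fi
}

# Hypothetical wrapper: handle FASTA here, delegate everything else to file(1).
biofile() {
    local first
    first=$(head -c 1 < "$1" 2>/dev/null)
    case "$first" in
        '>') echo "$1: FASTA ($(fasta_kind "$1"))" ;;
        *)   file -z -m bam.mgc "$1" ;;  # assumes bam.mgc was compiled as in the post
    esac
}
```

A composition threshold like this is only a heuristic: a short protein rich in A, C, G and T could be misclassified, which is the sort of case the Bio* projects handle with more careful alphabet checks.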