Reshaping data with AWK

AWK is really useful for reshaping data. Here is an example of a problem I run into quite frequently. I have a file with a list of contigs and some other information: the first column lists chromosomes from one species, the second lists contigs from another species that share synteny with them, and the third gives their relative orientation:

3	contig_GL873650	1
3	contig_GL873641	1
3	contig_GL873608	-1
4	contig_GL873649	-1
4	contig_GL873710	-1
4	contig_GL873558	-1
4	contig_GL873656	1
4	contig_GL873622	1
4	contig_GL873648	1
4	contig_GL873683	-1
5	contig_GL873679	1
5	contig_GL873522	-1
5	contig_GL873780	1
6	contig_GL873668	1

I want a list of the contigs from the second column that match chromosome 4 in the first column. We can do this quite easily with:

awk '$1 == "4" {print $2}' file.txt > list.txt

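This should give:

contig_GL873649
contig_GL873710
contig_GL873558
contig_GL873656
contig_GL873622
contig_GL873648
contig_GL873683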

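If the chromosome of interest changes from run to run, one option is to pass it in as an awk variable rather than editing the program each time. The sketch below assumes the input is tab-separated, as in the listing above; the function name contigs_for_chr and the variable name chr are my own, chosen for illustration.

```shell
# contigs_for_chr CHR FILE
# Print column 2 (the contig name) for every row whose first column equals CHR.
# Assumes FILE is tab-separated; "chr" is an illustrative awk variable name.
contigs_for_chr() {
    awk -F'\t' -v chr="$1" '$1 == chr {print $2}' "$2"
}

# Example: contigs_for_chr 4 file.txt > list.txt
```

Called with 4 and file.txt, this should produce the same list as the one-liner above.
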
For this dataset I then had an additional problem where these contigs were named differently to another file that I was working with. Thankfully there was a pattern; the names of the contigs in the other file were numeric, and these numbers were always 873519 less than the number in the corresponding contig name from the list I was working with. So to convert the list I had above to the same format, I could do:

cat list.txt | cut -c10-16 | awk '{print ($1 - 873519) " " $2}' | awk -F: '{ printf "%05i %s\n", $1,$2 }'

Where cut -c10-16 removes the leading text, awk '{print ($1 - 873519) " " $2}' subtracts 873519 from the remaining number, and awk -F: '{ printf "%05i %s\n", $1,$2 }' adds enough zeros to the front of the number to make it five digits. This should give:

00130
00191
00039
00137
00103
00129
00164

This is now a list of the contigs I was interested in, written in the naming format used by the other file!
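
The same conversion can probably be collapsed into a single awk call. This is only a sketch: it assumes every name starts with the nine-character prefix contig_GL and that five-digit zero padding is what the other file expects. The function name convert_names is made up for illustration.

```shell
# convert_names FILE
# Read contig names (one per line) and print the zero-padded numeric IDs.
# Assumes each name is "contig_GL" followed by a number.
convert_names() {
    awk '{
        n = substr($1, 10) - 873519   # drop the 9-character prefix, then apply the offset
        printf "%05d\n", n            # pad with leading zeros to five digits
    }' "$1"
}

# Example: convert_names list.txt
```

Unlike the three-stage pipeline, this version never prints the empty second field, so the output lines carry no trailing space.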


