Investigating mapping coverage with awk

When you’re mapping reads to a reference genome, sometimes you’re adjusting mapping parameters and it’s good to get a rough idea of how much of your reference is getting reads mapped to it. Say you have the following data:

mCoord	chr	coord	samplo5	samplo10
1	X	1	41	0
2	X	2	41	0
3	X	3	41	0
4	X	4	41	0
5	X	5	41	0
6	X	6	41	0
7	X	7	41	5
8	X	8	40	5
9	X	9	40	5
10	X	10	39	5
11	X	11	38	5

where samplo5 and samplo10 are the number of reads mapped at each coordinate for two different samples. You can get a sense of how much of the reference has reads mapped to it by counting the number of non-zero entries for the column in awk; reads from samplo10 for example could be investigated with:

awk '($5!="0"){++count} END {print count}' file.txt

This should output 6

You can then compare this to the total length of the reference to get the percentage of the reference that has reads mapped to it.

We might also be interested in knowing if more reads have mapped overall since changing some of the mapping parameters, in this case we can just sum the entire samplo10 column in awk:

awk '{ sum+=$5} END {print sum}' file.txt

This should output 25


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s