Summarising data with awk

AWK can be really useful for generating a summary of big data. I found this nice script on a stack exchange question. With a bit of modification, you can give this script more general applicability:

#!/bin/sh
input=$1
col=$2
sort -n $input | awk '
  BEGIN {
    c = 0;
    sum = 0;
    OFS="\t";
    print "sum", "count", "mean", "median", "min", "max";
  }
  $'''$col''' ~ /^[0-9]*(\.[0-9]*)?$/ {
    a[c++] = $'''$col''';
    sum += $'''$col''';
  }
  END {
    #comment out the below line if data includes 0s
    ave = sum / c;
    if( (c % 2) == 1 ) {
      median = a[ int(c/2) ];
    } else {
      median = ( a[c/2] + a[c/2-1] ) / 2;
    }
    OFS="\t";
    print sum, c, ave, median, a[0], a[c-1];
  }
'

Usage

Say we have the following data:

mCoord	chr	coord	samplo5	samplo10	corrlo5	corrlo10
1	X	1	41	4	7.42585e-07	7.5852e-08
2	X	2	41	4	7.42585e-07	7.5852e-08
3	X	3	41	5	7.42585e-07	9.48149e-08
4	X	4	41	5	7.42585e-07	9.48149e-08
5	X	5	41	5	7.42585e-07	9.48149e-08
6	X	6	41	5	7.42585e-07	9.48149e-08
7	X	7	41	5	7.42585e-07	9.48149e-08
8	X	8	40	5	7.24473e-07	9.48149e-08
9	X	9	40	5	7.24473e-07	9.48149e-08
10	X	10	39	5	7.06362e-07	9.48149e-08
11	X	11	38	5	6.8825e-07	9.48149e-08

We can get the sum, count, mean, median, minimum and maximum for column 5 by running the following in a linux terminal (assuming script and file.txt are in the current directory):

./datacheck file.txt 5

This should return something like:

sum	count	mean	median	min	max
53	11	4.81818	5	4	5
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s