AWK is really useful for generating quick summary statistics from large data files. I found this handy script in a Stack Exchange answer; with a bit of modification, it can be made more generally applicable:

    #!/bin/sh
    input=$1
    col=$2
    sort -n "$input" | awk '
    BEGIN {
        c = 0;
        sum = 0;
        OFS = "\t";
        print "sum", "count", "mean", "median", "min", "max";
    }
    # keep only rows whose target column is a plain number
    $'''$col''' ~ /^[0-9]*(\.[0-9]*)?$/ {
        a[c++] = $'''$col''';
        sum += $'''$col''';
    }
    END {
        # comment out the below line if data includes 0s
        ave = sum / c;
        if ( (c % 2) == 1 ) {
            median = a[ int(c/2) ];
        } else {
            median = ( a[c/2] + a[c/2-1] ) / 2;
        }
        print sum, c, ave, median, a[0], a[c-1];
    }
    '
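The `$'''$col'''` idiom is what makes the script general: it closes the single-quoted awk program, splices in the shell variable `col`, and reopens the quote, so awk sees a literal field number like `$5`. A minimal sketch of the same quoting trick (the sample input here is made up):

    col=2
    echo "a b c" | awk '{ print $'''$col''' }'
    # prints "b"

Note that if `col` came from untrusted input, this splicing would be an injection risk; passing it via `awk -v` would be safer, but the quoting approach keeps the `$n` field syntax simple.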

#### Usage

Say we have the following data:

    mCoord  chr  coord  samplo5  samplo10  corrlo5      corrlo10
    1       X    1      41       4         7.42585e-07  7.5852e-08
    2       X    2      41       4         7.42585e-07  7.5852e-08
    3       X    3      41       5         7.42585e-07  9.48149e-08
    4       X    4      41       5         7.42585e-07  9.48149e-08
    5       X    5      41       5         7.42585e-07  9.48149e-08
    6       X    6      41       5         7.42585e-07  9.48149e-08
    7       X    7      41       5         7.42585e-07  9.48149e-08
    8       X    8      40       5         7.24473e-07  9.48149e-08
    9       X    9      40       5         7.24473e-07  9.48149e-08
    10      X    10     39       5         7.06362e-07  9.48149e-08
    11      X    11     38       5         6.8825e-07   9.48149e-08

We can get the sum, count, mean, median, minimum, and maximum of column 5 by running the following in a Linux terminal (assuming the script, saved as datacheck and made executable, and file.txt are both in the current directory):

    ./datacheck file.txt 5

This should return something like:

    sum  count  mean     median  min  max
    53   11     4.81818  5       4    5
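As a quick sanity check, the sum, count, and mean can be reproduced by piping the eleven column-5 values from the table above straight into awk:

    printf '%s\n' 4 4 5 5 5 5 5 5 5 5 5 | awk '{ s += $1; n++ } END { printf "%g %d %.5f\n", s, n, s/n }'
    # prints "53 11 4.81818"

This matches the script's output, confirming that the header row and the other columns are being filtered out correctly by the numeric regex.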
