How to install ‘vanilla’ TeXLive on Ubuntu

Most of the TeXLive updates that come through software package managers are quite delayed. Luckily, I found the best Stack Exchange answer ever, which addresses this problem with full installation instructions for ‘vanilla’ TeXLive. After following them, everything works fine on Ubuntu 16.04.
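
For reference, the core steps boil down to something like the following. This is only a rough sketch; the linked answer has the full details, and the exact PATH depends on the TeXLive year and architecture you end up with:

cd /tmp
wget http://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar -xzf install-tl-unx.tar.gz
cd install-tl-*
sudo ./install-tl
# afterwards, add the new binaries to your PATH (year/architecture here are illustrative)
export PATH=/usr/local/texlive/2016/bin/x86_64-linux:$PATH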

Generating a consensus sequence with samtools/bcftools/vcfutils.pl

Say you have reads mapped to a reference genome in BAM format. If you would like a consensus sequence of the alignment, run the following (you will probably need to install/locate samtools, bcftools and the vcfutils.pl script that ships with them):

samtools mpileup -AuDS -Q 10 -q 10 -d 10000 -f mito_eulemur.fasta AKF23_sorted.bam | bcftools view -cg - | /usr/share/samtools/vcfutils.pl vcf2fq | awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' | sed -n '/+/q;p' > AKF23MitoCons.fasta

Convert fastq to fasta format with awk

If you need to convert Illumina reads in fastq format to fasta format, try using awk. If the reads are compressed, you can extract them with zcat, then pipe the output to an awk one-liner:

zcat myReads.fastq.gz | awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' > myReads.fasta

Otherwise, if they aren’t compressed, you can just use awk directly:

awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' myReads.fastq > myReads.fasta

Summarising data with awk

AWK can be really useful for generating summary statistics from big data. I found this nice script in a Stack Exchange question. With a bit of modification, you can give it more general applicability:

#!/bin/sh
# Usage: ./datacheck <file> <column number>
input=$1
col=$2
# sort numerically on the chosen column so the median, min and max come out right
sort -n -k "$col,$col" "$input" | awk -v col="$col" '
  BEGIN {
    c = 0;
    sum = 0;
    OFS = "\t";
    print "sum", "count", "mean", "median", "min", "max";
  }
  # only keep rows where the chosen column looks numeric (this also skips any header)
  $col ~ /^[0-9]*(\.[0-9]*)?$/ {
    a[c++] = $col;
    sum += $col;
  }
  END {
    # note: this will fail with a division-by-zero error if no rows matched
    ave = sum / c;
    if ( (c % 2) == 1 ) {
      median = a[ int(c/2) ];
    } else {
      median = ( a[c/2] + a[c/2-1] ) / 2;
    }
    print sum, c, ave, median, a[0], a[c-1];
  }
'

Usage

Say we have the following data:

mCoord	chr	coord	samplo5	samplo10	corrlo5	corrlo10
1	X	1	41	4	7.42585e-07	7.5852e-08
2	X	2	41	4	7.42585e-07	7.5852e-08
3	X	3	41	5	7.42585e-07	9.48149e-08
4	X	4	41	5	7.42585e-07	9.48149e-08
5	X	5	41	5	7.42585e-07	9.48149e-08
6	X	6	41	5	7.42585e-07	9.48149e-08
7	X	7	41	5	7.42585e-07	9.48149e-08
8	X	8	40	5	7.24473e-07	9.48149e-08
9	X	9	40	5	7.24473e-07	9.48149e-08
10	X	10	39	5	7.06362e-07	9.48149e-08
11	X	11	38	5	6.8825e-07	9.48149e-08

We can get the sum, count, mean, median, minimum and maximum for column 5 by running the following in a Linux terminal (assuming the script is saved as datacheck, made executable with chmod +x datacheck, and both it and file.txt are in the current directory):

./datacheck file.txt 5

This should return something like:

sum	count	mean	median	min	max
53	11	4.81818	5	4	5

Investigating mapping coverage with awk

When you’re mapping reads to a reference genome and adjusting mapping parameters, it’s good to get a rough idea of how much of your reference is actually getting reads mapped to it. Say you have the following data:

mCoord	chr	coord	samplo5	samplo10
1	X	1	41	0
2	X	2	41	0
3	X	3	41	0
4	X	4	41	0
5	X	5	41	0
6	X	6	41	0
7	X	7	41	5
8	X	8	40	5
9	X	9	40	5
10	X	10	39	5
11	X	11	38	5

where samplo5 and samplo10 are the number of reads mapped at each coordinate for two different samples. You can get a sense of how much of the reference has reads mapped to it by counting the number of non-zero entries in a column with awk; samplo10 (column 5), for example, can be checked with:

awk '($5!="0"){++count} END {print count}' file.txt

This should output 6

You can then compare this to the total length of the reference to get the percentage of the reference that has reads mapped to it.
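
As a minimal sketch of that last step, assuming the reference is 16000 bp long (a made-up number here), you can let awk do the division as well:

awk -v reflen=16000 '($5!="0"){++count} END {printf "%.2f%% of the reference has reads mapped\n", 100*count/reflen}' file.txt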

We might also be interested in knowing whether more reads have mapped overall since changing some of the mapping parameters. In this case we can just sum the entire samplo10 column with awk:

awk '{ sum+=$5} END {print sum}' file.txt

This should output 25

Reshaping data with AWK

AWK is really useful for reshaping data; here is an example of something I run into quite frequently. I have a file with a list of contigs and some other information. The first column is a list of chromosomes from one species, the second is a list of contigs from another species that have shared synteny, and the third column is their relative orientation:

#CHR1	CHR2	ORIENTATION
3	contig_GL873650	1
3	contig_GL873641	1
3	contig_GL873608	-1
4	contig_GL873649	-1
4	contig_GL873710	-1
4	contig_GL873558	-1
4	contig_GL873656	1
4	contig_GL873622	1
4	contig_GL873648	1
4	contig_GL873683	-1
5	contig_GL873679	1
5	contig_GL873522	-1
5	contig_GL873780	1
6	contig_GL873668	1

I want a list of the contigs from the second column that match chromosome 4 in the first column. We can do this quite easily with:

awk '$1 == "4" {print $2}' file.txt > list.txt

This should give:

contig_GL873649
contig_GL873710
contig_GL873558
contig_GL873656
contig_GL873622
contig_GL873648
contig_GL873683

For this dataset I then had an additional problem: the contigs were named differently from those in another file I was working with. Thankfully there was a pattern; the contig names in the other file were numeric, and these numbers were always 873519 less than the number in the corresponding contig name from the list above. So to convert my list to the same format, I could do:

cat list.txt | cut -c10-16 | awk '{ printf "%05i\n", $1 - 873519 }'

Where cut -c10-16 removes the leading text and awk subtracts 873519 from the remaining number, zero-padding the result to five digits with printf "%05i". This should give:

00130 
00191 
00039 
00137 
00103 
00129 
00164

This is now a list of the contigs I was interested in, in the naming format used by the other file I was working on.

Sliding window in awk

AWK is good for running a sliding window/moving average on your data because it’s fast and you can pipe the output of samtools or most other CLI programs straight into it (see the sketch at the end of this section). Say you have the following data:

mCoord	chr	coord	samplo5	samplo10	corrlo5	corrlo10
1	X	1	41	4	7.42585e-07	7.5852e-08
2	X	2	41	4	7.42585e-07	7.5852e-08
3	X	3	41	5	7.42585e-07	9.48149e-08
4	X	4	41	5	7.42585e-07	9.48149e-08
5	X	5	41	5	7.42585e-07	9.48149e-08
6	X	6	41	5	7.42585e-07	9.48149e-08
7	X	7	41	5	7.42585e-07	9.48149e-08
8	X	8	40	5	7.24473e-07	9.48149e-08
9	X	9	40	5	7.24473e-07	9.48149e-08
10	X	10	39	5	7.06362e-07	9.48149e-08
11	X	11	38	5	6.8825e-07	9.48149e-08

where corrlo5 and corrlo10 are depth of coverage values at each coordinate (corrected by the number of reads for each sample) for two different samples. You can run a sliding window over these columns with the following awk one-liner:

awk -v OFS="\t" 'BEGIN{window=4;slide=2} { if(NR==1) print "coord",$6,$7 } {mod=NR%window; if(NR<=window){count++}else{sum-=array[mod];sum2-=array2[mod]}sum+=$6;sum2+=$7;array[mod]=$6;array2[mod]=$7;} (NR%slide)==0{print NR,sum/count,sum2/count}' file.txt

Change the values in BEGIN{window=4;slide=2} to adjust the window size and the number of bases to slide. This should output:

coord	corrlo5	corrlo10
2	3.71293e-07	3.7926e-08
4	5.56939e-07	6.16297e-08
6	7.42585e-07	9.00742e-08
8	7.42585e-07	9.48149e-08
10	7.33529e-07	9.48149e-08
12	7.10889e-07	9.48149e-08
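
As mentioned above, the same idea works on a stream instead of a file. A minimal sketch piping samtools depth (which prints chromosome, position and depth per covered site) straight into a single-column version of the window; sample.bam is just a placeholder name:

samtools depth sample.bam | awk -v OFS="\t" 'BEGIN{window=4;slide=2}{mod=NR%window; if(NR<=window){count++}else{sum-=array[mod]}sum+=$3;array[mod]=$3} (NR%slide)==0{print $1,$2,sum/count}'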

Summarize disk usage in Unix

If you run out of space on a remote machine, or even your own Unix-based PC, you can investigate where your disk usage problems are coming from with du and sort. This command will summarise disk usage for everything in your working directory:

du -sh * | sort -rhb

where:

  • du
    -s shows a single total for each item instead of listing every subdirectory
    -h prints human-readable sizes (K, M, G etc.)
  • sort
    -r reverses the order (largest usage first)
    -h sorts by those human-readable sizes
    -b ignores leading whitespace

Backup or move your Zotero library to another PC

Zotero is a reference management system that runs on most platforms and integrates well with most word processors. It’s a great tool for keeping track of PDF files, grouping citations, and citing easily while writing. Several times I’ve had to shift my Zotero library to another work PC; this post describes how to do it:

On Linux systems, just copy the contents of

~/.zotero/zotero

from the old PC to the same directory on the new PC. When you start Zotero on the new PC, your library should be exactly the same, including linked PDFs, tags etc. If this directory doesn’t exist on your machine, the library might not be in exactly the same place, but you can find it by following the instructions here
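
For example, a minimal sketch of the copy using rsync over SSH (assuming the old machine is reachable as oldpc, which is just a placeholder, and that Zotero is closed on both machines):

rsync -av oldpc:~/.zotero/zotero/ ~/.zotero/zotero/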

pass: Unix password management

Pass is a password management system for Unix. You only need to remember one password to access a personal database of passwords for whatever you want. What made me shift toward it (as opposed to KeePass, for example) was its simplicity, the tab completion when accessing passwords, and the ability to put the database under version control with Git.

Installation

sudo apt-get install pass

Setup

There are quite a few guides available, but the most comprehensive is here. That guide starts from scratch, including setting up GPG keys, and also shows how to use pass and how to use the same database on multiple systems. The only issue is that if you want to store your database in Dropbox, for example, to sync it across devices, you need to change the directory the database is stored in. So after installation, make sure you set the following environment variable before following the guide above (changing the directory to wherever you want the database to be):

export PASSWORD_STORE_DIR=~/Dropbox/.password-store

Note that when you restart, you will need to set this variable again. To set it permanently, make a .sh script in your /etc/profile.d directory (any *.sh script in this directory is sourced at login):

sudo vi /etc/profile.d/envvars.sh

Enter your password, then paste the following (changing the directory to your own):

# put your personal env variables to load at startup here
export PASSWORD_STORE_DIR=~/Dropbox/.password-store

Also, to make it easier to initialise pass with your GPG key, you can set the pass init command as an alias so you don’t have to remember your key ID. Make a .bash_aliases file in your home directory (if it doesn’t already exist) with:

vi ~/.bash_aliases

then paste the following:

alias passinit='pass init <your key>'

Now whenever you start up your machine, you should be able to open a terminal and initialise pass straight away by entering passinit.
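
From there, everyday use is just the standard pass commands, for example (the entry name email/gmail is only a placeholder):

pass insert email/gmail     # add a new password to the store
pass show email/gmail       # print it to the terminal
pass -c email/gmail         # copy it to the clipboard instead
pass git init               # put the password store under Git version control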

Sources

Info on using an alternate directory came from here