Sometimes your sequence data is too big to experiment with all at once. When trying to optimize new analysis workflows involving paired end reads in fastq format, make a subset of your data with HTSeq in Python.
On a local machine
Installation
sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib
if you have another os there are instructions here
Usage
Paste the following python code into subsetfastq.py (taken from Simon, here)
import sys, random, itertools import HTSeq fraction = float( sys.argv[1] ) in1 = iter( HTSeq.FastqReader( sys.argv[2] ) ) in2 = iter( HTSeq.FastqReader( sys.argv[3] ) ) out1 = open( sys.argv[4], "w" ) out2 = open( sys.argv[5], "w" ) for read1, read2 in itertools.izip( in1, in2 ): if random.random() < fraction: read1.write_to_fastq_file( out1 ) read2.write_to_fastq_file( out2 ) out1.close() out2.close()
We can run this with:
python /path/to/subsetfastq.py fraction forward.fastq reverse.fastq forward.sub.fastq reverse.sub.fastq
with fraction being the fraction of the fastq reads that you would like to keep
On a cluster
Installation
To implement this on the computing cluster I have access to, I installed HTSeq with:
wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz#md5=c44d7b256281a8a53b6fe5beaeddd31c tar -zxvf HTSeq-0.6.1pcd HTSeq-0.6.1p1/1.tar.gz python setup.py install --user
Usage
Copy subsetfastq.py (above) somewhere on your cluster account. Now we can execute this as a job on the cluster using a PBS script for Torque and Maui to handle. Paste the following into subsetfastq.pbs:
#/!bin/bash # --------------------------------------------------------------------------- # subsetfastq.pbs - takes a random subset of paired reads from files # Copyright 2016, Rylan Shearn # Usage: qsub -v forward=</path/to/file/forwardreads.fastq.gz>,reverse=</path/to/file/reversereads.fastq.gz>,fraction=<number between 0 and 1> subsetfastq.pbs # Revision history: # 2016-02-23 Created by newscript ver. 3.3 # --------------------------------------------------------------------------- #PBS -l nodes=1:ppn=1,walltime=12:00:00 #PBS -o /path/to/output/log/ #PBS -e /path/to/error/log/ python /panhome/shearn/bin/subsetfastq.py $fraction $forward $reverse "$forward".sub "$reverse".sub
As noted in the usage notes, you can schedule this for example with:
qsub -v forward=/path/to/file/forwardreads.fastq.gz,reverse=/path/to/file/reversereads.fastq.gz,fraction=0.1 subsetfastq.pbs
which should output two new fastq.gz files, each with a .sub suffix and each having 0.1 times the reads of the original files.