Subset paired end .fastq illumina reads

Sometimes your sequence data is too big to experiment with all at once. When trying to optimize new analysis workflows involving paired end reads in fastq format, make a subset of your data with HTSeq in Python.

On a local machine


sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib

if you have another os there are instructions here


Paste the following python code into (taken from Simon, here)

import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )

We can run this with:

python /path/to/ fraction forward.fastq reverse.fastq forward.sub.fastq reverse.sub.fastq

with fraction being the fraction of the fastq reads that you would like to keep

On a cluster


To implement this on the computing cluster I have access to, I installed HTSeq with:

tar -zxvf HTSeq-0.6.1pcd
python install --user

Copy (above) somewhere on your cluster account. Now we can execute this as a job on the cluster using a PBS script for Torque and Maui to handle. Paste the following into subsetfastq.pbs:

# ---------------------------------------------------------------------------
# subsetfastq.pbs - takes a random subset of paired reads from files
# Copyright 2016, Rylan Shearn
# Usage: qsub -v forward=</path/to/file/forwardreads.fastq.gz>,reverse=</path/to/file/reversereads.fastq.gz>,fraction=<number between 0 and 1> subsetfastq.pbs
# Revision history:
# 2016-02-23 Created by newscript ver. 3.3
# ---------------------------------------------------------------------------

#PBS -l nodes=1:ppn=1,walltime=12:00:00
#PBS -o /path/to/output/log/
#PBS -e /path/to/error/log/

python /panhome/shearn/bin/ $fraction $forward $reverse "$forward".sub "$reverse".sub

As noted in the usage notes, you can schedule this for example with:

qsub -v forward=/path/to/file/forwardreads.fastq.gz,reverse=/path/to/file/reversereads.fastq.gz,fraction=0.1 subsetfastq.pbs

which should output two new fastq.gz files, each with a .sub suffix and each having 0.1 times the reads of the original files.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s