Subset paired end .fastq illumina reads

Sometimes your sequence data is too big to experiment with all at once. When trying to optimize new analysis workflows involving paired end reads in fastq format, make a subset of your data with HTSeq in Python.

On a local machine

Installation

sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib

if you have another os there are instructions here

Usage

Paste the following python code into subsetfastq.py (taken from Simon, here)

import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()

We can run this with:

python /path/to/subsetfastq.py fraction forward.fastq reverse.fastq forward.sub.fastq reverse.sub.fastq

with fraction being the fraction of the fastq reads that you would like to keep

On a cluster

Installation

To implement this on the computing cluster I have access to, I installed HTSeq with:

wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz#md5=c44d7b256281a8a53b6fe5beaeddd31c
tar -zxvf HTSeq-0.6.1pcd
HTSeq-0.6.1p1/1.tar.gz
python setup.py install --user
Usage

Copy subsetfastq.py (above) somewhere on your cluster account. Now we can execute this as a job on the cluster using a PBS script for Torque and Maui to handle. Paste the following into subsetfastq.pbs:

#/!bin/bash
# ---------------------------------------------------------------------------
# subsetfastq.pbs - takes a random subset of paired reads from files
# Copyright 2016, Rylan Shearn
# Usage: qsub -v forward=</path/to/file/forwardreads.fastq.gz>,reverse=</path/to/file/reversereads.fastq.gz>,fraction=<number between 0 and 1> subsetfastq.pbs
# Revision history:
# 2016-02-23 Created by newscript ver. 3.3
# ---------------------------------------------------------------------------

#PBS -l nodes=1:ppn=1,walltime=12:00:00
#PBS -o /path/to/output/log/
#PBS -e /path/to/error/log/

python /panhome/shearn/bin/subsetfastq.py $fraction $forward $reverse "$forward".sub "$reverse".sub

As noted in the usage notes, you can schedule this for example with:

qsub -v forward=/path/to/file/forwardreads.fastq.gz,reverse=/path/to/file/reversereads.fastq.gz,fraction=0.1 subsetfastq.pbs

which should output two new fastq.gz files, each with a .sub suffix and each having 0.1 times the reads of the original files.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s