Trimming bad data

There are two ways of filtering data: trimming ends that may have very low quality, or removing reads that are low quality. In general, short-read sequence aligners take quality information into account, and so conservative trimming and filtering is not necessary. However, if you have a run with very low quality ends, trimming those ends can help your analysis, especially if you are assembly a de novo transcriptome.

There are a number of tools designed to help you control read quality, each with their own benefits. For today, we will use a program called ’Trimmomatic’ because it does a great job of explicitly handling paired-end data like these. To call Trimmomatic, we will use java, and simply pass the arguments we want to use. For more detail on each option, go to the website: http://www.usadellab.org/cms/?page=trimmomatic.

One Note: paired-end data requires two outputs for each file, one for those that match the opposite direction read, and one for those that don’t. The code below is an example that may be a helpful starting point; note that the ‘\’ at the end of each line means ‘put this all on one line; don’t hit return yet’ and can either be copied in directly (and interpreted by the console), or omitted to put everything on one line (interpreted by you).

Let's again check our input files to trim:

In [ ]:
cd /home/gea_user/data/raw_data/fastq
ls

According to the trimmomatic website this is what a command for running Trimmomatic looks like:

trimmomatic SE input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Let's break this down.

First, we need to 'invoke' (tell the computer what program we are using) by writing its name:

trimmomatic

Everything else that comes is going to be an option; specific ways we can customize what trimmomatic will do to work with out data. The first option we write is:

SE

this tells Trimmomatic that we are using single-end sequence data. Next, we need to specify the name of the input file we want to use and the name of the new ouput file we are going to create. In the example command above these are called input.fq.gz and output.fq.gz - when we write our command, we will substitute those terms for our own choice of file names.

Finally, there are several trimming options that we can use. In our case the following ones will be important:

  • -threads - Number of CPUs (8) to use to complete this analysis (we can adjust this number up or down depending on how powerful our computer is - a higher number will allow the analysis to go faster - if you don't know how many you have don't worry- the program will just use what is available)
  • SLIDINGWINDOW:i:j - Take the average quality of i (a number we choose) reads and trim if the average is below j (another number we choose)
  • MINLEN:i - Drop any read if less than i (a number we choose) nucleotides after SLIDINGWINDOW trimming

To trim our small.fastq.gz file the command is as follows:

trimmomatic SE small.fastq.gz small_trimmed -threads 8 SLIDINGWINDOW:4:20 MINLEN:75

Translating this into English it would read something like:

Use the trimmomatic program and let it know that we are working with single-end sequence data. I have 8 CPUs you can use, so use all 8 if possible. For each read, start at the begining of the read and take the average Phred quality score of the first 4 bases. If that average is 20 or more, move down and take the average of the next four; repeat this process until the end of the read, but if you hit a spot where the average is less than 20, trim the read there. Finally, after the SLIDINGWINDOW trim is done check to make sure the trimmed read is 75 nucleotides or longer, if it is not, delete the read from my data (I'm only interested in longer reads).

Let's run the command

In [ ]:
trimmomatic SE small.fastq.gz small_trimmed.fastq.gz -threads 8 SLIDINGWINDOW:4:20 MINLEN:75

Let's look for our file:

In [ ]:
ls *_trimmed.fastq.gz

let's move it to an appropriate directory

In [ ]:
mkdir /home/gea_user/rna-seq-project/trimmed-reads
mv *_trimmed.fastq.gz /home/gea_user/rna-seq-project/trimmed-reads

To see the results of trimming, we need to use fastqc so let's change into the directory with the trimmed read and run fastqc as in the pevious notebook

In [ ]:
cd /home/gea_user/rna-seq-project/trimmed-reads
In [ ]:
fastqc small_trimmed.fastq.gz

Navigate to /home/gea_user/rna-seq-project/trimmed-reads and open the small_trimmed_fastqc.html file. How does it compare with the small_fastqc.html (untrimmed) file?


Exercise 3 Trimming with trimmomatic


Using your assigned file at home/gea_user/data/raw_data/fastq run trimmomatic and then run fastqc on the trimmed file

  1. Complete the code below to change directories into home/gea_user/data/raw_data/fastq
In [ ]:
cd
  1. Complete the code below to trim your reads. Make sure that you change INPUT and OUTPUT to reflect your file(s). If you are working with more than one file, see the bonus exercise.
  • INPUT - should be the name of your file
  • OUTPUT - should be changed so that the prefix of the file (e.g. SRR5017XXX) is followed by _trimmed.fastq.gz
In [ ]:
trimmomatic SE INPUT OUTPUT -threads 8 SLIDINGWINDOW:4:20 MINLEN:75
  1. Move your trimmed files to our previously created output folder (no need to change this code)
In [ ]:
mv *_trimmed.fastq.gz /home/gea_user/rna-seq-project/trimmed-reads
  1. Complete the cells below to change directories into the trimmed-reads folder and run fastqc
In [ ]:
cd
In [ ]:
fastqc

Using the Jupyter Labs files browser you can navigate to the trimmed-reads folder in rna-seq-project to view your fastqc results


Bonus: For loops


Often, working with data on the command line, you want to run a program on multiple files. Rather than running a command over and over again and changing the command slighly to reflect the file you are working with, you can use a for loop. Using that link you can learn more about 'For loops' which unfortunately we don't have time to go into in this exercise. If it is useful to you however, here is a for loop that will run the trimming on every compressed (.gz) fastq file in home/gea_user/data/raw_data/fastq:

In [ ]:
for infile in /home/gea_user/data/raw_data/fastq/*.fastq.gz
 do
 base=$(basename --suffix=.fastq.gz $infile)
 trimmomatic \
  SE \
  -threads 32 \
  ${infile} ${base}_trimmed.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:75
 done

We now have 6 trimmed fastq files

In [ ]:
ls *_trimmed.fastq.gz

Let's run another round of fastqc on the trimmed results to compare

In [ ]:
for trimmed_file in /home/gea_user/rna-seq-project/trimmed-reads/*_trimmed.fastq.gz
 do
 fastqc $trimmed_file
 done

Let's move all the fastqc results from the trimmed reads to an appropriate locations

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results
mv /home/gea_user/rna-seq-project/trimmed-reads/*.html /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results
mv /home/gea_user/rna-seq-project/trimmed-reads/*.zip /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results

Navigate to these results to see the fastqc reports