There are two ways of filtering data: trimming ends that may have very low quality, or removing reads that are low quality. In general, short-read sequence aligners take quality information into account, and so conservative trimming and filtering is not necessary. However, if you have a run with very low quality ends, trimming those ends can help your analysis, especially if you are assembly a de novo transcriptome.
There are a number of tools designed to help you control read quality, each with their own benefits. For today, we will use a program called ’Trimmomatic’ because it does a great job of explicitly handling paired-end data like these. To call Trimmomatic, we will use java, and simply pass the arguments we want to use. For more detail on each option, go to the website: http://www.usadellab.org/cms/?page=trimmomatic.
One Note: paired-end data requires two outputs for each file, one for those that match the opposite direction read, and one for those that don’t. The code below is an example that may be a helpful starting point; note that the ‘\’ at the end of each line means ‘put this all on one line; don’t hit return yet’ and can either be copied in directly (and interpreted by the console), or omitted to put everything on one line (interpreted by you).
Let's again check our input files to trim:
cd /home/gea_user/data/raw_data/fastq
ls
According to the trimmomatic website this is what a command for running Trimmomatic
looks like:
trimmomatic SE input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Let's break this down.
First, we need to 'invoke' (tell the computer what program we are using) by writing its name:
trimmomatic
Everything else that comes is going to be an option; specific ways we can customize what trimmomatic will do to work with out data. The first option we write is:
SE
this tells Trimmomatic
that we are using single-end sequence data.
Next, we need to specify the name of the input file we want to use and the name of the new ouput file we are going to create. In the example command above these are called input.fq.gz
and output.fq.gz
- when we write our command, we will substitute those terms for our own choice of file names.
Finally, there are several trimming options that we can use. In our case the following ones will be important:
-threads
- Number of CPUs (8) to use to complete this analysis (we can adjust this number up or down depending on how powerful our computer is - a higher number will allow the analysis to go faster - if you don't know how many you have don't worry- the program will just use what is available)SLIDINGWINDOW:i:j
- Take the average quality of i (a number we choose) reads and trim if the average is below j (another number we choose)MINLEN:i
- Drop any read if less than i (a number we choose) nucleotides after SLIDINGWINDOW trimmingTo trim our small.fastq.gz
file the command is as follows:
trimmomatic SE small.fastq.gz small_trimmed -threads 8 SLIDINGWINDOW:4:20 MINLEN:75
Translating this into English it would read something like:
Use the trimmomatic program and let it know that we are working with single-end sequence data. I have 8 CPUs you can use, so use all 8 if possible. For each read, start at the begining of the read and take the average Phred quality score of the first 4 bases. If that average is 20 or more, move down and take the average of the next four; repeat this process until the end of the read, but if you hit a spot where the average is less than 20, trim the read there. Finally, after the SLIDINGWINDOW trim is done check to make sure the trimmed read is 75 nucleotides or longer, if it is not, delete the read from my data (I'm only interested in longer reads).
Let's run the command
trimmomatic SE small.fastq.gz small_trimmed.fastq.gz -threads 8 SLIDINGWINDOW:4:20 MINLEN:75
Let's look for our file:
ls *_trimmed.fastq.gz
let's move it to an appropriate directory
mkdir /home/gea_user/rna-seq-project/trimmed-reads
mv *_trimmed.fastq.gz /home/gea_user/rna-seq-project/trimmed-reads
To see the results of trimming, we need to use fastqc
so let's change into the directory with the trimmed read and run fastqc as in the pevious notebook
cd /home/gea_user/rna-seq-project/trimmed-reads
fastqc small_trimmed.fastq.gz
Navigate to /home/gea_user/rna-seq-project/trimmed-reads
and open the small_trimmed_fastqc.html
file. How does it compare with the small_fastqc.html
(untrimmed) file?
Using your assigned file at home/gea_user/data/raw_data/fastq
run trimmomatic and then run fastqc on the trimmed file
home/gea_user/data/raw_data/fastq
cd
INPUT
and OUTPUT
to reflect your file(s). If you are working with more than one file, see the bonus exercise. _trimmed.fastq.gz
trimmomatic SE INPUT OUTPUT -threads 8 SLIDINGWINDOW:4:20 MINLEN:75
mv *_trimmed.fastq.gz /home/gea_user/rna-seq-project/trimmed-reads
fastqc
cd
fastqc
Using the Jupyter Labs files browser you can navigate to the trimmed-reads
folder in rna-seq-project
to view your fastqc results
Often, working with data on the command line, you want to run a program on multiple files. Rather than running a command over and over again and changing the command slighly to reflect the file you are working with, you can use a for loop. Using that link you can learn more about 'For loops' which unfortunately we don't have time to go into in this exercise. If it is useful to you however, here is a for loop that will run the trimming on every compressed (.gz) fastq file in home/gea_user/data/raw_data/fastq
:
for infile in /home/gea_user/data/raw_data/fastq/*.fastq.gz
do
base=$(basename --suffix=.fastq.gz $infile)
trimmomatic \
SE \
-threads 32 \
${infile} ${base}_trimmed.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:75
done
We now have 6 trimmed fastq files
ls *_trimmed.fastq.gz
Let's run another round of fastqc on the trimmed results to compare
for trimmed_file in /home/gea_user/rna-seq-project/trimmed-reads/*_trimmed.fastq.gz
do
fastqc $trimmed_file
done
Let's move all the fastqc results from the trimmed reads to an appropriate locations
mkdir -p /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results
mv /home/gea_user/rna-seq-project/trimmed-reads/*.html /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results
mv /home/gea_user/rna-seq-project/trimmed-reads/*.zip /home/gea_user/rna-seq-project/trimmed-reads/fastqc-trimmed-results
Navigate to these results to see the fastqc reports