RNA-Seq from scratch - Kallisto

In this notebook we will align (pseudoalign) reads to the mouse transcriptome using a software package called kallisto.

Obtain reference data

There are a few files we need to analyze our data with Kallisto

  • Reference transcriptome: A file of all the known trasncripts of the mouse genome
  • Reference annotations: A file with information on the location and structure of the genes in the mouse genome

These data (for mouse, and many other organisims) are available from public databases

Import transcriptome data

First, we will download the mouse transcriptome data from Ensemble. The mouse transcriptome is available on this page. The wget command will allow us to download the transcriptome from Ensembl. This may take a minute or two - make sure the download completes before moving on - you shoul get a message saying ‘Mus_musculus.GRCm38.cdna.all.fa.gz’ saved [51982200].

In [ ]:
wget ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

Let's also organize our downloaded data; we will make a new directory (transcriptome) in our rna-seq-project folder and use the mv command to move the transcriptome data we downloaded in the step above.

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/transcriptome && mv Mus_musculus.GRCm38.cdna.all.fa.gz /home/gea_user/rna-seq-project/transcriptome

Index transcriptome

We will now use Kallisto's indexing function to prepare the transcriptome for analysis. The "Index" is a lookup table for the transcriptome that allows it to be more easily searched by Kallisto. First let's organize our files by creating a new directory to hold our kallisto work.

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/kallisto

Next run the indexing command. This prepares the transcriptome so that we can peudoalign reads to it.

In [ ]:
kallisto index --index="Mus_musculus.GRCm38_index" /home/gea_user/rna-seq-project/transcriptome/Mus_musculus.GRCm38.cdna.all.fa.gz

We now have a transcriptome index which can now be used for pseudoalignment, we'll move it into the transcriptome folder:

In [ ]:
mv Mus_musculus.GRCm38_index /home/gea_user/rna-seq-project/transcriptome/

We also will need the mouse GTF file, a file containing coordinates and descriptions for all gene names and locations - we will also download this from Ensembl.

In [ ]:
wget ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.chr.gtf.gz

We will make a folder called annotations and save that there as well.

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/kallisto/annotations
In [ ]:
mv Mus_musculus.GRCm38.97.chr.gtf.gz /home/gea_user/rna-seq-project/kallisto/annotations

Finally we also need a textfile that has the name of each mouse chromosome (e.g. 1, 2, 3, ... X, Y, MT) and the length of each chromosome, This will be used for visualization. The information will be downloaded and then edited using bash commands.

In [ ]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_assembly_report.txt
In [ ]:
head -n64 GCF_000001635.26_GRCm38.p6_assembly_report.txt|tail -n21|cut -f1,9 > /home/gea_user/rna-seq-project/kallisto/annotations/mouse_chromosomes.tsv

We can see this edited list of chromosome numbers and lengths here:

In [ ]:
cat /home/gea_user/rna-seq-project/kallisto/annotations/mouse_chromosomes.tsv

Quantify reads

In this final step, we will run Kallisto on all of our files to quantify the reads. We will write a for loop to do this. Let's see once again our trimmed reads

Exercise 3: Pseudoaligning reads with Kallisto

Working with data you trimmed yourself

If you used trimmomatic to trim your reads, run the next cell (otherwise skip to the cell for fastp reads)

In [ ]:
cd /home/gea_user/rna-seq-project/trimmed-reads

We don't want the small_trimmed.fastq.gz file to be part of our analysis, so we will delete it

In [ ]:
rm small_trimmed.fastq.gz

Now we should have just the trimmed file(s) we want to analyze.

In [ ]:

Working with data pre-trimmed data

All instructions for the commands we are using are in the Kallisto manual: https://pachterlab.github.io/kallisto/manual. Since we are using single read data, we need to provide information on the fragment length used for the library (200) and an estimate of the standard deviation for this value - here we will have to guess (20). We need to run Kallisto separately on each of our 6 files so we will use a for loop.

In [ ]:
for file in *_trimmed.fastq.gz; do output=$(basename --suffix=.fastq.gz_trimmed.fastq.gz $file)_quant; kallisto quant\
 $file; done

Lets move our results to an appropriate directory and view the directories containing our analyses:

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/kallisto/analyzed
mv *_quant /home/gea_user/rna-seq-project/kallisto/analyzed
cd /home/gea_user/rna-seq-project/kallisto/analyzed

Each of these directories containes several files, lets remind ourselves what each of these directories contain:

SRA Sample Sample Name Folder Name
SRS1794108 High-Fat Diet Control 1 SRR5017135_trimmed.fastq.gz_quant
SRS1794110 High-Fat Diet Control 2 SRR5017137_trimmed.fastq.gz_quant
SRS1794106 High-Fat Diet Control 3 SRR5017133_trimmed.fastq.gz_quant
SRS1794105 High-Fat Diet Tumor 1 SRR5017132_trimmed.fastq.gz_quant
SRS1794101 High-Fat Diet Tumor 2 SRR5017128_trimmed.fastq.gz_quant
SRS1794111 High-Fat Diet Tumor 3 SRR5017138_trimmed.fastq.gz_quant

Let's look at the files in the High-Fat Diet Control 1

In [ ]:
cd /home/gea_user/rna-seq-project/kallisto/analyzed/SRR5017135_trimmed.fastq.gz_quant

This file contains a list of abundances (counts) for the Regular Diet Control sequences from SRR5017139

In [ ]:
head -n 100 abundance.tsv