Pseudoalignment of Data Using Kallisto¶

learning-objectives

Know how counting RNA-Seq reads allows you to make inferences about gene expression

Know the types of data need to us Kallisto

Use Kallisto to align RNA-Seq reads to a transcriptome

The Ideal RNA-Seq experiment¶

In a perfect world, RNA-Seq might work something like this:

We know the location of every single gene in the genome (even genes that are identical duplicates of each other).

At any given time, we can get an exact readout of how many mRNAs a gene has produced (transcription) and for protein coding mRNA transcripts we know how much protein was yielded.

Given 1 and 2, we can clearly tell if gene transcription is different between two different samples (e.g. normal tissue and cancerous).

Real-world RNA-Seq¶

Unfortunately, here are just some of the ways a real world experiment fails to meet our ideal expectations:

We don’t know the location of every gene in the genome (often we think we know most of them though!).

Genomes contain gene duplications (identical or near identical copies). Since our DNA sequencing reads are often around 100 nucleotides we may have difficulty telling where a read came from (and thus how to count it).

Even if we have an idea of how much RNA is being produced, we don’t know how much of it ends up making a protein.

RNA-Seq with Kallisto¶

Despite these challenges, RNA-Seq is a very useful experiment and researchers have worked through many of the challenges above to develop software that can help us infer what is happening in the transcriptome.

Kallisto is a quick, highly-efficient software for quantifying transcript abundances in an RNA-Seq experiment. Even on a typical laptop, Kallisto can quantify 30 million reads in less than 3 minutes.

In order to analyze data with Kallisto we need several inputs:

We need the FastQ files from the RNA-Seq experiment; we usually start with reads that have already been trimmed/filtered

We need a reference transcriptome. This is a file that has the sequences for all the known expressed genes. Reference transcriptomes are usually available from repositories like Ensembl and NCBI. We will be using the mouse reference transcriptome (available on the linked page “Download sequences in FASTA format for transcript”). Unlike a genome, the transcriptome is only coding genes.

Optional

We also will get the annotations and chromosome information for our subsequent visualization steps:

We need a reference annotations. This is a file that has the names and locations for all the genes in the transcriptome. Annotations have information like how many introns/exons a gene contains, and perhaps other information about the gene (such as a predicted function). This information also comes from the Ensembl page

Because of the way we sequenced (single-end sequencing) we also need to know the chromosome lengths and this information from the NCBI mouse reference page.

Basic Kallisto Steps

Analysis with Kallisto has two main steps (see the manual on the Kallisto home page):

Step Description Command Command explanation Input Output
1

Genome indexing
kallisto index <--index=> <transcriptome.fa.gz>
kallisto index: Command to begin the indexing

–index=: The name your want your index file to be called

transcriptome.fa.gz: A transcriptome in fasta format (may becompressed “.gz”)

Reference transcriptome

Kallisto index (no file extension)
2

Pseudoalignment*
kallisto quant
--single
--threads=<t>
--index=<index name>
--bootstrap-samples=<b>
--fragment-length=<f>
--sd=<s>
--output-dir=<dir>
--genomebam
--gtf=<annotations.gff>
--chromosomes=<chromosome_list.tsv>
<fastq_input.fastq.gz>
kallisto quant: Command to start Kallisto

single: Indicates this is single-end sequence data

threads=[t]: Optional - indicates how many CPU threads to use

bootstrap-samples=<b>: Conduct <b> (integer) of bootstrap samplings

fragment-length=<f>: Estimated average fragment length the

sequencing library used DNA fragments of length f (integer)

sd=<s>: Estimated standard deviation of fragment length

(default: -l, -s values are estimated from paired end data, but are required when using –single)

output-dir=<dir>: Name of directory where results should be output

genomebam: Create a pseudobam file (will be used in visualization)

to see where individual reads map on the genome

gtf=<annotations.gff>:GTF file for transcriptome annotation information

(required for –genomebam)

chromosomes=<chromosome_list.tsv>: Tab separated file with

chromosome names and lengths (optional for –genomebam, but recommended)

<fastq_input.fastq.gz>: FastQ file for analysis

FastQ file (one per run of Kallisto)

Kallisto index (from indexing step)

abundances.h5: HDF5 binary file containing run info,

abundance estimates, bootstrap estimates, and transcript length information length

abundances.tsv: Plaintext file of the abundance estimates. It does

not contains bootstrap estimates. When plaintext mode is selected; output plaintext abundance estimates. Alternatively, kallisto h5dump will output an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, effective length

run_info.json: a json file containing information about the run

*Note: The Psuedoalignment step in the JupyterNotbook is written in a “For Loop” since we want to run it for one or more fastq files. If you have only one file, a for loop is overkill, but for more files, the For Loop is efficient since it would automatically run analysis on one file right after another. See Linux/Bash Command Line Primer for more information.

Pseudoalignment and genomics word search¶

Alignment of reads is an expansive topic. Several reviews cover some of the important topics including Stark et. al. 2019. This blog post and the kallisto paper are required readings to get a deep understanding of the subject.

To try and reduce this problem to its most basic, let’s use an analogy. In the traditional case, when software does alignment, it tries to match a read to the genome.

Genome: ACTACGTAGCCGTCAAATATCCCGGGTATCGTACGATCGACGT

Read:   AAATTTCCCGGG

If we move things around we can find the match:

Genome: ACTACGTAGCCGTCAAATATCCCGGGTATCGTACGATCGACGT
                      |||| |||||||
Read:                 AAATTTCCCGGG

In this example a “|” appears when there is a match, and we have one mismatch. Although finding this match was simple, the genome is far more complex.

In the word search below is the word “DNA”. Can you find it? It may take you a while. Searching for words is a process similar to taking sequencing reads and trying to match them to the genome. Computers are fast, but just as matching a small word (3 letters) in a large (1639 letter) word puzzle is time-intensive, it take a long time to match millions of short reads against genomes of billions of nucleotides.

CUSVFVMAASJFHUTMNCCQMBVXOLBEETYHSRBWOSEY
MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH
LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO
ZAUSKOGLCYOIKIXZSHOXHLYGZJLRLZMRGHRFRJWN
HBEOJQLIXUYAIAWMJASRBOVSBHMAHJFPUOXTQIYZ
LSSCVOGPJCIIMUILZCFKLASLFLQFIZVSYXWJYNMJ
QPWLYGLRTSTQHYUVFVGGPDMICLGDUMZOVPHFPLOD
YFZAYPVONRRCPHJPNJVROZEZPBLZNYBQHOVUPOGJ
SAVCNZWIYQKHJOLKQBHZYGJZKUEHKJLCLXOLMUJF
YJAGKRFQOQRNAXXXZAZUMMWMLKWNREHDFOLVVYWK
ATOXWLBGQECNAMWJIMONGSAVKGSHUVUOROVJRJGT
AJLVEYWLSFPUIYABELQZLYLUGZJGPGWEVPXYCHGO
IPXOCIPREEDCYTFZHBDTTOIJRWLEYEMMTRIDSMGJ
BNEZKUNVOXZFIXAAGGYUWAYZSNVUHXJZMMPXBPTE
QMVXMUJAFUGBBLAVVCGVBDIGDBIDBSHTEVPCJTUB
ELJIPXDVTMPNCROAPIHZMROJXZFCDHRYWIEUSZST
ERRCPCLAPIJROAAUKZNDYYKNWKNUNFDKSTLYTXKA
UAMNKGZVKWMOYGLLWUJOECUNBRUGRPWWUNNVEXYF
MQYPHTEPNKPLWECXVORINUEHHVLVBNFUJJXNEZYT
MJDQTZMIRJXMLYUCXYCVFQHHQBERSMGQLRWNYAJP
IELYQQDAIXJDTUONASHBTLCZISHQWCMJZFAFJOXE
UGZWMJGDLLRKIZVVLPHYBCULVCAIRMMEQPRRHFHQ
FFCWSFSQZJTBCVFHUAEECJPXKHMKRKJFBZVQNOMZ
ZZZEMQQNVWUXDOYBDHSDFLHMAQEWFKWPMIGJVZUT
ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH
TTMIXMVUMETGMJWFEWCRITKZWZGVMCNQRAPDCJZD
AXPTCGYNKVFDPALHRQNPKQUBSEGOOQCVUDDHTIWR
GEMJXKJGSSRAROGDWETVVTZMWUSCYXIKTUFLUIXO
BDZFSFRPFXHALCKUGTOHGMEHYPRPTFXUSWSKDBWC
UGGJXCDKYYZZFLDAWLDYJLCRSTDSMLOGPCVAUDZZ
AHGKHDAGOHYQDKGWDTOGFVOSRKJCPRRENRQRBBPQ
ABHWPOFZYTGHXPREABCTWABQLZLYBOQVHOMUHGRS
FPNWCEQGTYJEDRWAVDYKUAZMLVVNTYNRZHPFHYGR
UAJNXZWZSBLEVYIGLRPFNZLOGRULAHIAHDLXTJXF
DQEKRQNVJACCBVGABSKTWRPYEFXNEHSICVLHWOIV
GVOGZWUSLGRGHMVENEYNSBCZRLOZYCMXCYGWUMSG
LFXDWDWHJYXAXDVWVMTPKQVTAYHNCADYJCNKNDZM
ASJBGHYSPRUIIVQDAVUQBEEDNQKBKQPMKYQKRIKV
MWFHPISMEIUIVZVBEUKTFOUADMUDXAJSGYHLXUSP
OSIVVLZMDTLDMCLGZLEGWLPVPWOLNERAINTAESFR

Pseudoalignment is one approach to this computational “word search”. It takes advantage of a trick to speed up performance without loosing accuracy. Take the second line in our “DNA” word search puzzle:

MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH

There is no “D” in this line. True, We don’t know that until we read the entire line, but once we realize this line can’t possible be a match without a “D” We can ignore this line. Word search puzzles don’t have to be read in just one direction (words might be on diagonals, backwards, etc.), but now consider what happens in the pseudoalignment “word search”. In this case we are searching not the entire genome, but linear transcripts:

Transcript 1: CUSVFVMAASJFHUTMNCCQMBVXOLBEETYHSRBWOSEY
Transcript 2: MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH
Transcript 3: LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO
Transcript 4: ZAUSKOGLCYOIKIXZSHOXHLYGZJLRLZMRGHRFRJWN
Transcript 5: ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH

We can immediately eliminate transcripts that don’t contain the letter D:

Transcript 3: LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO
Transcript 4:
Transcript 5: ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH

Immediately, the problem is made easier by throwing away transcripts that could not contain the answer. This is not an exact analogy but basically, rather than trying to match every read to every position in the genome, Kallisto is faster because 1) we are only matching to the transcriptome (a subset of the genome) and 2) we focus only on transcripts that could have generated a particular read.

Note

Pseudoalignment is just one approach to aligning RNA-Seq reads. Other software will do full alignments of the read to a transcriptome or genome. These methods have different advantages and requirements - the Stark et. al. 2019 is a suggested reading.

Use Kallisto to Pseudoalign RNA-Seq Reads¶

Using the Jupyter Notebook, you will run use Kallisto to analyze trimmed RNA-Seq reads. The notebook covers importing reference data (transcriptome, annotations, chromosome coordinates) from Ensembl and NCBI.

Access the JupyterLab Lesson on CyVerse and complete Notebook 3: RNA-Seq from scratch - Kallisto

Notebook Preview

This is a preview of the notebook in this lesson. Go back to JupyterLab Lesson on CyVerse to launch and use the interactive notebook.

Questions¶

Question

What are some of the constraints we have to work within when doing an RNA-Seq experiment (e.g. the difference between the ideal experiment and the real-world experiment?)
In addition to FastQ reads, what are some other datasets needed to do RNA-Seq using Kallisto?
Use an analogy to explain Kallisto pseudoalignment

Bonus

Using the referenced papers and other research materials you can find, what are some other software tools that could be used as an alternative to Kallisto in RNA-Seq? Do they require the same datasets as Kallisto does in order to run?