Import file from NCBI Sequence Read Archive

Note: This notebook is optional. The data we need to begin our analysis have already been imported into CyVerse. However, this notebook may be helpful in guiding you to import a dataset of your choice from the SRA.

The data we need are available on the NCBI SRA. They come from an experiment with mice, described in the study below.

High-fat diet induced leptin and Wnt expression: RNA-sequencing and pathway analysis of mouse colonic tissue and tumors

Obesity, an immense epidemic affecting approximately half a billion adults, has doubled in prevalence in the last several decades. Epidemiological data support that obesity due to intake of a high-fat, western diet increases the risk of colon cancer; however, the mechanisms underlying this risk remain unclear. Here, utilizing next generation RNA sequencing, we aimed to determine the high-fat diet mediated gene expression profile in mouse colon and the AOM/DSS model of colon cancer.

First we need to get the list of accessions (sequencing runs) for this study. We are looking for the SraRunTable.txt file (if you were downloading this on your own, you would click the RunInfo Table button to download this file). We have provided the file for you.

In [ ]:
head -n4 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt

This is quite hard to read, but all we need is the Run column, which tells us which read data to download. This is column 10 in our file. We use the Unix cut command with the -f (field) option to get the 10th tab-separated field (column) in our SraRunTable.txt file.

In [ ]:
cut -f10 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt

Each of the above lines (e.g. SRS1784103) corresponds to a file on the SRA with the read data we need for our experiment. Let's also look at the Sample_Name column (column 11) so we can see what these data sets are:

In [ ]:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt
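Column numbers such as 10 and 11 are specific to this particular SraRunTable.txt. If you work with a table from another study, the columns may sit in different positions. The sketch below shows one way to look up a column's position by its header name rather than counting by hand. It runs on a tiny made-up table (the accessions `ACC0001` and `ACC0002` and the three-column layout are placeholders, not real SRA data), and it assumes the header you want is spelled exactly `Run`.

```shell
# Build a tiny tab-separated table that stands in for SraRunTable.txt
# (made-up contents; a real table has many more columns).
workdir=$(mktemp -d)
printf 'Assay_Type\tRun\tSample_Name\n' > "$workdir/table.txt"
printf 'RNA-Seq\tACC0001\tExample Control 1\n' >> "$workdir/table.txt"
printf 'RNA-Seq\tACC0002\tExample Tumor 1\n' >> "$workdir/table.txt"

# Put each header field on its own line, then ask grep for the line number
# of the field that is exactly "Run". That line number is the column index.
run_col=$(head -n1 "$workdir/table.txt" | tr '\t' '\n' | grep -n -x 'Run' | cut -d: -f1)
echo "Run is column $run_col"

# Use the looked-up index with cut, and drop the header row with tail.
runs=$(cut -f"$run_col" "$workdir/table.txt" | tail -n +2)
echo "$runs"
```

To try this on the real file, you would replace `$workdir/table.txt` with the path to your own run table.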

There are many possible combinations of data we could study. To keep things simple, we will focus on just the following samples:

SRA_Sample Sample_Name
SRS1794108 High-Fat Diet Control 1
SRS1794110 High-Fat Diet Control 2
SRS1794106 High-Fat Diet Control 3
SRS1794105 High-Fat Diet Tumor 1
SRS1794101 High-Fat Diet Tumor 2
SRS1794111 High-Fat Diet Tumor 3

We will look at 3 replicates of RNA-Seq data from normal colon samples from mice on a high-fat diet, and 3 replicates of RNA-Seq data from colon tumor samples from mice on a high-fat diet. During the following exercises students will focus on one replicate from each of the two sample groups.

With this command we will get all of the "High-Fat Diet Control" samples and write them to a new text file called finalsamples.txt. We use > here so that re-running the cell starts the file fresh instead of adding duplicate lines.

In [ ]:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt | grep "High-Fat Diet Control" | cut -f1 > finalsamples.txt

Next we will get all of the "High-Fat Diet Tumor" samples and append them (using >>) to the same finalsamples.txt file.

In [ ]:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt|grep "High-Fat Diet Tumor"|cut -f1 >> finalsamples.txt
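The two-step selection above can also be written as a single pass over the table. The sketch below uses awk to keep rows whose sample name starts with either label and to print only the accession. It runs on a made-up two-column table, so the field numbers `$1` and `$2` (and the accessions and the "Regular Diet" row) are placeholders; on the real SraRunTable.txt you would use the fields that match your file.

```shell
# A made-up two-column table: accession, then sample name.
workdir=$(mktemp -d)
printf 'ACC0001\tHigh-Fat Diet Control 1\n'  > "$workdir/two_columns.txt"
printf 'ACC0002\tRegular Diet Control 1\n'  >> "$workdir/two_columns.txt"
printf 'ACC0003\tHigh-Fat Diet Tumor 1\n'   >> "$workdir/two_columns.txt"

# -F'\t' splits on tabs so sample names containing spaces stay in one field.
# Rows whose second field matches either label have their first field printed.
awk -F'\t' '$2 ~ /^High-Fat Diet (Control|Tumor)/ {print $1}' \
    "$workdir/two_columns.txt" > "$workdir/finalsamples.txt"

cat "$workdir/finalsamples.txt"
```

One difference from the two separate grep commands: awk keeps the rows in the order they appear in the table, rather than listing all control samples before all tumor samples.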

Let's verify that our finalsamples.txt file has the samples (SRA run numbers) we are looking for. In order to download the samples in the next steps, each run number must be on its own line in the file.

In [ ]:
cat finalsamples.txt
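Reading the file by eye works for six lines, but a longer list is easier to check automatically. The sketch below is one possible check, run on a deliberately messy made-up list (placeholder accessions, one repeated, plus a blank line). It counts lines that are empty or contain whitespace, reports repeated accessions, and writes a cleaned copy. The file name `finalsamples.clean.txt` is just an illustration.

```shell
# A deliberately messy list: one duplicate and one blank line.
workdir=$(mktemp -d)
printf 'ACC0001\nACC0002\nACC0002\n\n' > "$workdir/finalsamples.txt"

# Count lines that are empty or contain any whitespace.
bad_lines=$(grep -c -E '^$|[[:space:]]' "$workdir/finalsamples.txt")
echo "lines that are blank or contain spaces: $bad_lines"

# List accessions that appear more than once (uniq -d needs sorted input).
duplicates=$(sort "$workdir/finalsamples.txt" | uniq -d | grep -v '^$')
echo "repeated accessions: $duplicates"

# Keep only lines with exactly one field, and only their first occurrence.
awk 'NF == 1 && !seen[$0]++' "$workdir/finalsamples.txt" > "$workdir/finalsamples.clean.txt"
cat "$workdir/finalsamples.clean.txt"
```

Duplicates are worth checking for in particular, because any command that appends with >> adds the same lines again each time its cell is re-run.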

Option 1 (Quick) Working with pre-imported data

Once we have a list of SRA accessions to import, the next two steps to complete the import would be:

  1. Use the prefetch program from SRA Tools to import the data from NCBI
  2. Use the parallel-fastq-dump program (GitHub link) to transform files from the SRA format to the fastq format.

Both of these steps take a long time to complete, so we have made their output available to you here. The directory contains our six fastq files as well as small.fastq.gz, a sample file we will use in subsequent notebooks.

In [ ]:
ls /home/gea_user/data/raw_data/fastq

You can skip the rest of this notebook, unless you want to proceed with the optional steps, which could take an hour or more.

Now, let's do two things: import the files we need from the SRA, and then convert them to fastq format. For the import we are going to use the SRA Toolkit. Rather than run each download one by one, we can take our list of accessions and use a "while loop" to do the import.

A while loop is a bit of code that repeats the same command for each line of input. If you are not familiar with Linux or the command line, it's ok to ignore the details for now.

We will use a while loop to read the list of run names and import them from NCBI. There are some additional options we can use to import the data more quickly, but for now we will just use the simplest options.

(Warning: These steps can take from 30 minutes to several hours to complete. These data are pre-downloaded on CyVerse, but we provide the code here for advanced learners who want to modify it, for example to download other SRA data.)

Option 2 (Slow) SRA import and conversion to fastq format

If you want to do the SRA import and conversion to fastq, the cells below describe the process using our mouse data. If you were using a different dataset, you could use these commands, being careful to substitute in your own SRA accessions and paying attention to file name changes.

SRA Import

In [ ]:
# You could create your own `finalsamples.txt` file 
# This file would be a list of SRA sample accessions 
# with one accession (e.g. SRS179109) per line 
# running this cell will allow you to do the SRA import

# read -r takes each line literally; quoting "$line" passes it to prefetch as one argument
while read -r line; do prefetch "$line"; done < finalsamples.txt
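If the while loop is new to you, it can help to run it in a "dry run" form first, where the loop only prints the command it would run. The sketch below does that on a made-up list of placeholder accessions, so nothing is downloaded. It is an illustration of the loop structure only, not a replacement for the import cell.

```shell
# A made-up accession list, one per line, standing in for finalsamples.txt.
workdir=$(mktemp -d)
printf 'ACC0001\nACC0002\nACC0003\n' > "$workdir/finalsamples.txt"

count=0
# "read -r line" takes one line of the file per pass through the loop
# and stores it in the variable "line". The loop ends when the file does.
while read -r line; do
    # Skip any blank lines instead of passing an empty name along.
    if [ -z "$line" ]; then
        continue
    fi
    # Print the command instead of running it.
    echo "would run: prefetch $line"
    count=$((count + 1))
done < "$workdir/finalsamples.txt"

echo "$count accessions listed"
```

Once the printed commands look right, replacing the echo line with the real command turns the dry run into the actual import.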

Your files from the SRA import will be in the following location:

In [ ]:
ls /home/jovyan/ncbi/public/sra

Let's move these files into a more convenient location.

In [ ]:
mkdir -p /home/gea_user/data/raw_data && mv /home/jovyan/ncbi/public/sra/*.sra /home/gea_user/data/raw_data

We now have our 6 .sra files in the raw_data directory.

In [ ]:
ls /home/gea_user/data/raw_data

We now need to use another tool to convert these files into fastq format. We will convert them to a compressed (fastq.gz) format which can be used directly by Kallisto. This will take several minutes per file.

In [ ]:
mkdir -p /home/gea_user/data/raw_data/fastq && cd /home/gea_user/data/raw_data/ && for file in /home/gea_user/data/raw_data/*.sra; do parallel-fastq-dump --gzip --threads 8 --outdir ./fastq/ -s "$file"; done
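Because the conversion is slow, it can be useful to see which files still need converting before starting, for example after an interrupted run. The sketch below is a dry run on empty placeholder files in a temporary directory. It assumes each converted file is named after its accession with a `.fastq.gz` suffix; that naming is an assumption here, and paired-end data split into separate files would be named differently.

```shell
# Empty placeholder files standing in for real .sra downloads.
workdir=$(mktemp -d)
mkdir -p "$workdir/fastq"
touch "$workdir/ACC0001.sra" "$workdir/ACC0002.sra"
# Pretend the first file was already converted in an earlier run.
touch "$workdir/fastq/ACC0001.fastq.gz"

todo=""
for file in "$workdir"/*.sra; do
    # basename strips the directory and the .sra suffix, leaving the accession.
    name=$(basename "$file" .sra)
    if [ -e "$workdir/fastq/$name.fastq.gz" ]; then
        echo "skip $name (output already exists)"
    else
        echo "would convert $name"
        todo="$todo $name"
    fi
done

echo "still to convert:$todo"
```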

We should now have 6 gzipped fastq files in our raw_data/fastq directory.

In [ ]:
ls /home/gea_user/data/raw_data/fastq

We can delete the .sra files now that we have fastq files.

In [ ]:
rm /home/gea_user/data/raw_data/*.sra
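Since re-downloading takes a long time, a more cautious variant deletes each .sra file only after confirming that a matching compressed fastq file exists and is intact. The sketch below shows the idea on placeholder files in a temporary directory. As before, the `.fastq.gz` naming per accession is an assumption, and the accessions are made up.

```shell
# Two placeholder .sra files; only the first has a converted output.
workdir=$(mktemp -d)
mkdir -p "$workdir/fastq"
touch "$workdir/ACC0001.sra" "$workdir/ACC0002.sra"
printf 'placeholder\n' | gzip > "$workdir/fastq/ACC0001.fastq.gz"

for file in "$workdir"/*.sra; do
    name=$(basename "$file" .sra)
    out="$workdir/fastq/$name.fastq.gz"
    # [ -s ] is true only for a file that exists and is not empty;
    # gzip -t reads the whole archive and fails if it is corrupt or truncated.
    if [ -s "$out" ] && gzip -t "$out"; then
        rm "$file"
    else
        echo "keeping $file (no valid fastq.gz found)"
    fi
done

ls "$workdir"
```

Note that `gzip -t` only confirms the file is a complete gzip archive. It does not check that the reads inside are what you expect.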