Note: This notebook is optional. The data we need to begin our analysis are already imported into CyVerse. However, this notebook may be helpful in guiding you to import a dataset of your choice from the SRA.
The data we need are available on the NCBI SRA. They come from a mouse experiment described in the following study:
High-fat diet induced leptin and Wnt expression: RNA-sequencing and pathway analysis of mouse colonic tissue and tumors
Obesity, an immense epidemic affecting approximately half a billion adults, has doubled in prevalence in the last several decades. Epidemiological data support that obesity due to intake of a high-fat, western diet increases the risk of colon cancer; however, the mechanisms underlying this risk remain unclear. Here, utilizing next generation RNA sequencing, we aimed to determine the high-fat diet mediated gene expression profile in mouse colon and the AOM/DSS model of colon cancer.
First we need to get the list of accessions (sequencing runs), which is available here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP093363. We are looking for the SraRunTable.txt file (if you were downloading this on your own, you would click the RunInfo Table button on that page to download it). We have provided the file for you.
head -n4 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt
This is quite hard to read, but we need the Run column to download read data. This is column 10 in our file. We use the Unix cut command with the -f (field) option to get the 10th field (column) in our SraRunTable.txt file:
cut -f10 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt
Each of the above lines (e.g. SRS1784103) corresponds to a file on the SRA with the read data we need for our experiment. Let's also look at the Sample_Name column (column 11) so we can see what these data sets are:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt
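The cut commands above rely on fixed column numbers (10 and 11), and the column order of a run table can differ between studies. As a minimal sketch, the column number can instead be looked up from the header row. The small table below is made-up demonstration data (the accessions and the three-column layout are hypothetical), not the real SraRunTable.txt.

```shell
# Build a tiny tab-separated table to stand in for SraRunTable.txt
# (hypothetical contents, for demonstration only)
demo_table=$(mktemp)
printf 'Assay_Type\tRun\tSample_Name\n' > "$demo_table"
printf 'RNA-Seq\tDEMO001\tControl 1\n' >> "$demo_table"
printf 'RNA-Seq\tDEMO002\tTumor 1\n' >> "$demo_table"

# Split the header into one column name per line, then ask grep for the
# line number (-n) of the line that exactly matches (-x) "Run"
run_col=$(head -n1 "$demo_table" | tr '\t' '\n' | grep -nx 'Run' | cut -d: -f1)
echo "Run is column $run_col"

# Use the looked-up number instead of a hard-coded one
run_values=$(cut -f"$run_col" "$demo_table")
echo "$run_values"

rm -f "$demo_table"
```

To try this on the real file, point the head and cut commands at SraRunTable.txt instead of the demo table.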
There are many possible combinations of data we could study; to keep things simple, we will focus on just the following samples:
|Run|Sample_Name|
|---|---|
|SRS1794108|High-Fat Diet Control 1|
|SRS1794110|High-Fat Diet Control 2|
|SRS1794106|High-Fat Diet Control 3|
|SRS1794105|High-Fat Diet Tumor 1|
|SRS1794101|High-Fat Diet Tumor 2|
|SRS1794111|High-Fat Diet Tumor 3|
We will look at 3 replicates of RNA-Seq data from normal colon samples from mice on a high-fat diet, and 3 replicates of RNA-Seq data from colon tumor samples from mice on a high-fat diet. During the following exercises, students will focus on one replicate from each of these two groups.
With this command we will get all of the "High-Fat Diet Control" samples and place them in a text file called finalsamples.txt:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt|grep "High-Fat Diet Control"|cut -f1 >> finalsamples.txt
Next we will get all of the "High-Fat Diet Tumor" samples and add them to the same text file, finalsamples.txt:
cut -f10,11 /home/gea_user/data/pre-imported/sra-files/SraRunTable.txt|grep "High-Fat Diet Tumor"|cut -f1 >> finalsamples.txt
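The two commands above can also be written as a single pass with awk, which tests only the sample-name column and writes the output file fresh each time, so re-running the cell does not append duplicate lines. This is an alternative sketch, not the notebook's method; the demo table and its two-column layout are made up for illustration.

```shell
# Demo stand-ins for the run table and the output list (made-up data)
demo_table=$(mktemp)
demo_out=$(mktemp)
printf 'Run\tSample_Name\n' > "$demo_table"
printf 'DEMO001\tHigh-Fat Diet Control 1\n' >> "$demo_table"
printf 'DEMO002\tNormal Diet Control 1\n' >> "$demo_table"
printf 'DEMO003\tHigh-Fat Diet Tumor 1\n' >> "$demo_table"

# -F'\t' splits each row on tabs; print column 1 when column 2 starts
# with "High-Fat Diet Control" or "High-Fat Diet Tumor".
# ">" overwrites the output file rather than appending to it.
awk -F'\t' '$2 ~ /^High-Fat Diet (Control|Tumor)/ { print $1 }' "$demo_table" > "$demo_out"

selected=$(cat "$demo_out")
echo "$selected"

rm -f "$demo_table" "$demo_out"
```

For the real SraRunTable.txt, the same idea would use `$11` for the sample name and `$10` for the run, matching the column numbers used with cut.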
Let's verify that our finalsamples.txt file has the samples (SRA run numbers) we are looking for. To download the samples in the next steps, each run number must be on its own line in the file.
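One way to do this check, sketched here on a throwaway demo file with made-up accessions: print the file, count its lines, and count any line that is empty or contains whitespace, which would mean the one-accession-per-line rule is broken.

```shell
# Demo stand-in for finalsamples.txt (made-up accessions)
demo_list=$(mktemp)
printf 'DEMO001\nDEMO002\nDEMO003\n' > "$demo_list"

# Show the contents
cat "$demo_list"

# Count the lines; tr strips the padding some wc versions add
n_lines=$(wc -l < "$demo_list" | tr -d ' ')
echo "lines: $n_lines"

# Count lines that are empty or contain a space or tab; this should be 0.
# grep -c exits non-zero when the count is 0, hence the "|| true".
n_bad=$(grep -c -e '^$' -e '[[:space:]]' "$demo_list" || true)
echo "problem lines: $n_bad"

rm -f "$demo_list"
```

Run against the real finalsamples.txt, the line count should be 6 for the mouse samples chosen here.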
Once we have a list of SRA accessions to import, the next two steps would be to run:
the prefetch program from SRA Tools to import the data from NCBI
the parallel-fastq-dump program ([GitHub link](https://github.com/rvalieris/parallel-fastq-dump)) to transform the files from the SRA format to the fastq format.
Both of these steps take a long time to complete, so we have made their output available to you here. These are our six fastq files as well as small.fastq.gz, a sample file we will use in subsequent notebooks.
Now, let's do two things: import the files we need from the SRA with the SRA Toolkit, and then convert them to fastq. Rather than run the downloads one by one, we can take this list of accessions and use a "while loop" to do the import.
A while loop is a bit of code that repeats a command; if you are not familiar with the Linux command line, it's OK to ignore this for now.
We will use a while loop to read the list of run names and import them from NCBI. There are some additional options we can use to import the data more quickly, but for now we will just use the simplest options.
(Warning: These steps can take from 30 minutes to several hours. The data are pre-downloaded on CyVerse, but we provide the code here for advanced learners who want to modify it, for example to download other SRA data.)
If you want to do the SRA import and conversion to fastq, the cells below describe the process using our mouse data. If you were using a different dataset, you could use these commands, being careful to substitute in your own SRA accessions and paying attention to file name changes.
# You could create your own `finalsamples.txt` file
# This file would be a list of SRA sample accessions
# with one accession (e.g. SRS179109) per line
# running this cell will allow you to do the SRA import
while read -r line; do prefetch "$line"; done < finalsamples.txt
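For readers new to while loops, the sketch below shows the same pattern as a dry run: it reads a list line by line and prints the command it would run instead of calling prefetch. The list contents are made-up placeholders, and echo stands in for the real download.

```shell
# Demo accession list (made-up placeholders, not real SRA accessions);
# it includes one blank line to show how that case is handled
demo_list=$(mktemp)
printf 'DEMO001\nDEMO002\n\nDEMO003\n' > "$demo_list"

count=0
# IFS= and -r make read return each line exactly as written
while IFS= read -r accession; do
    # Skip blank lines so no command runs with an empty argument
    if [ -z "$accession" ]; then
        continue
    fi
    # Dry run: print the command rather than running it
    echo "would run: prefetch $accession"
    count=$((count + 1))
done < "$demo_list"

echo "accessions processed: $count"
rm -f "$demo_list"
```

Printing the commands first is a cheap way to confirm the list is read as intended before starting downloads that can take hours.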
Your files from the SRA import will be in the following location:
Let's move these files into a more convenient location:
mkdir -p /home/gea_user/data/raw_data && mv /home/jovyan/ncbi/public/sra/*.sra /home/gea_user/data/raw_data
We now have our 6 sra files in the /home/gea_user/data/raw_data directory.
We now need to use another tool to convert these files into fastq format. We will convert them to a compressed (fastq.gz) format, which can be used directly by Kallisto. This will take several minutes per file.
mkdir -p /home/gea_user/data/raw_data/fastq && cd /home/gea_user/data/raw_data/ && for file in /home/gea_user/data/raw_data/*.sra; do parallel-fastq-dump --gzip --threads 8 --outdir ./fastq/ -s "$file"; done
We should now have 6 gzipped fastq files in our /home/gea_user/data/raw_data/fastq directory.
We can delete the .sra files now that we have the fastq files.
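The delete step can be made a little safer by removing an .sra file only when a matching fastq.gz file exists. The sketch below runs in a temporary directory with empty placeholder files. The file-name pattern (the accession followed by anything ending in .fastq.gz) is an assumption about how the conversion names its output, so check it against your own fastq directory first.

```shell
# Work in a throwaway directory with empty placeholder files
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/fastq"
touch "$demo_dir/DEMO001.sra" "$demo_dir/DEMO002.sra"
touch "$demo_dir/fastq/DEMO001.fastq.gz"   # DEMO002 has no fastq yet

for sra in "$demo_dir"/*.sra; do
    # Strip the directory and the .sra suffix to get the accession
    accession=$(basename "$sra" .sra)
    # ls succeeds only if at least one matching fastq.gz file exists
    if ls "$demo_dir/fastq/$accession"*.fastq.gz > /dev/null 2>&1; then
        rm -f "$sra"
        echo "deleted $accession.sra"
    else
        echo "kept $accession.sra (no fastq found)"
    fi
done

# Count the .sra files that were left in place
remaining=$(ls "$demo_dir" | grep -c '\.sra$' || true)
echo "sra files remaining: $remaining"

rm -rf "$demo_dir"
```

In the demo, only DEMO001.sra is removed, because DEMO002 has no converted file yet.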