A Note on Sample Data


For the remainder of the notebook, you will be assigned one or more of the following real files from our dataset:

| SRA Sample | Sample Name | File Name |
|---|---|---|
| SRS1794108 | High-Fat Diet Control 1 | SRR5017135.fastq.gz |
| SRS1794110 | High-Fat Diet Control 2 | SRR5017137.fastq.gz |
| SRS1794106 | High-Fat Diet Control 3 | SRR5017133.fastq.gz |
| SRS1794105 | High-Fat Diet Tumor 1 | SRR5017132.fastq.gz |
| SRS1794101 | High-Fat Diet Tumor 2 | SRR5017128.fastq.gz |
| SRS1794111 | High-Fat Diet Tumor 3 | SRR5017138.fastq.gz |
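
If you want to look up your file name from the shell rather than by eye, the table above can be written as a small lookup. This is only a sketch: `sample_to_file` is a hypothetical helper name introduced here for illustration, and the sample names and file names are copied from the table.

```shell
# sample_to_file is a hypothetical helper (not part of FastQC or any other tool).
# It prints the FASTQ file name for a sample name listed in the table above.
sample_to_file() {
  case "$1" in
    "High-Fat Diet Control 1") echo "SRR5017135.fastq.gz" ;;
    "High-Fat Diet Control 2") echo "SRR5017137.fastq.gz" ;;
    "High-Fat Diet Control 3") echo "SRR5017133.fastq.gz" ;;
    "High-Fat Diet Tumor 1")   echo "SRR5017132.fastq.gz" ;;
    "High-Fat Diet Tumor 2")   echo "SRR5017128.fastq.gz" ;;
    "High-Fat Diet Tumor 3")   echo "SRR5017138.fastq.gz" ;;
    *) echo "unknown sample: $1" >&2; return 1 ;;
  esac
}

# Example: print the file name for the first control sample
sample_to_file "High-Fat Diet Control 1"
```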

In the explanations of the steps, we will first use the file small.fastq.gz. Then, in the exercises portion of the notebook, you will run the analysis on your assigned file(s).


Running FastQC


Now we are ready to analyze our data. Move into the ‘seqData’ directory in the project folder; this keeps all of our results together, instead of each of us running the analysis in our own home folder.

Now let's check the content of our fastq folder - these are the pre-imported files we want to do quality checks on.

In [ ]:
ls /home/gea_user/data/raw_data/fastq

As you can see, we have 6 fastq files to analyze. To check the quality of the data, we will use a program called fastqc. To run the program, we type the command fastqc followed by the name of the file we wish to analyze. First, we will move into the directory where our data are stored.

In [ ]:
cd /home/gea_user/data/raw_data/fastq

We will analyze one file to get familiar with the fastqc output (this file is almost 5 GB, so this may take a few minutes to complete):

In [ ]:
fastqc small.fastq.gz

Our output is returned in two files:

  1. An .html file (that can be viewed as a webpage)
  2. A .zip file that contains individual images and reports

In [ ]:
ls *.html && ls *.zip
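
FastQC names its reports after the input file: for small.fastq.gz, the reports are small_fastqc.html and small_fastqc.zip. The sketch below shows that naming rule for files ending in .fastq.gz, so you know which names to expect for your assigned file. `expected_fastqc_outputs` is a hypothetical helper written for this illustration, not a FastQC command, and it only handles the .fastq and .gz suffixes used in this notebook.

```shell
# expected_fastqc_outputs is a hypothetical helper, not part of FastQC.
# It prints the report names we expect for a given .fastq.gz input file.
expected_fastqc_outputs() {
  name=$(basename "$1")   # drop any leading directory
  name=${name%.gz}        # drop the compression suffix, if present
  name=${name%.fastq}     # drop the FASTQ suffix, if present
  echo "${name}_fastqc.html"
  echo "${name}_fastqc.zip"
}

# Example: the report names for the practice file
expected_fastqc_outputs small.fastq.gz
```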

We can make a new directory to hold these results:

In [ ]:
mkdir -p /home/gea_user/rna-seq-project/fastqc-untrimmed-results

Now let's move these results to the new directory:

In [ ]:
mv *.zip /home/gea_user/rna-seq-project/fastqc-untrimmed-results
mv *.html /home/gea_user/rna-seq-project/fastqc-untrimmed-results
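
A plain `mv *.zip ...` reports an error when no file matches the pattern, for example if the command is run a second time. The sketch below is one way to make the move step safe to repeat and to confirm the result. `move_reports` is a hypothetical helper, and the demo uses a throwaway temporary directory with empty stand-in files rather than real FastQC output.

```shell
# move_reports is a hypothetical helper: it moves FastQC reports from a source
# directory to a destination directory and then lists the destination.
move_reports() {
  src=$1
  dest=$2
  mkdir -p "$dest"                # create the destination if it is missing
  for f in "$src"/*_fastqc.zip "$src"/*_fastqc.html; do
    [ -e "$f" ] || continue       # skip a pattern that matched nothing
    mv "$f" "$dest"/
  done
  ls "$dest"                      # show what the destination now contains
}

# Demo in a temporary directory, using empty stand-in report files
demo=$(mktemp -d)
touch "$demo/small_fastqc.zip" "$demo/small_fastqc.html"
move_reports "$demo" "$demo/fastqc-untrimmed-results"
```

In the notebook itself, the equivalent call would pass the fastq directory as the source and /home/gea_user/rna-seq-project/fastqc-untrimmed-results as the destination.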

You can browse your HTML (webpage) results in the file browser on the left (rna-seq-project > fastqc-untrimmed-results). Click the top-most folder icon to navigate to the home directory for this JupyterLab session.


[Image: folder navigation]

We will run fastqc on all our files and examine the output in the next notebook.
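
As a preview, running fastqc over several files follows the same pattern as the single-file command, repeated once per file. The sketch below is a dry run: it only prints the commands rather than executing them, so it works even where fastqc is not installed. `print_fastqc_commands` is a hypothetical helper, and the file names are the six listed in the sample table.

```shell
# print_fastqc_commands is a hypothetical helper for a dry run.
# It prints one fastqc command per file name and does not run FastQC.
print_fastqc_commands() {
  for f in "$@"; do
    echo "fastqc $f"
  done
}

# Dry run over the six files from the sample table
print_fastqc_commands \
  SRR5017135.fastq.gz SRR5017137.fastq.gz SRR5017133.fastq.gz \
  SRR5017132.fastq.gz SRR5017128.fastq.gz SRR5017138.fastq.gz
```

To actually run the analysis, each printed line can be entered in a cell from inside the fastq directory.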


Exercise 2: FASTQ


As a reminder - in this laboratory you will be assigned one or more of the 6 FASTQ files to follow through the rest of the analysis. The data files from the leptin experiment are:

| SRA Sample | Sample Name | File Name |
|---|---|---|
| SRS1794108 | High-Fat Diet Control 1 | SRR5017135.fastq.gz |
| SRS1794110 | High-Fat Diet Control 2 | SRR5017137.fastq.gz |
| SRS1794106 | High-Fat Diet Control 3 | SRR5017133.fastq.gz |
| SRS1794105 | High-Fat Diet Tumor 1 | SRR5017132.fastq.gz |
| SRS1794101 | High-Fat Diet Tumor 2 | SRR5017128.fastq.gz |
| SRS1794111 | High-Fat Diet Tumor 3 | SRR5017138.fastq.gz |

  1. In the blank cell below, run FastQC on your assigned sample (e.g. if you are assigned High-Fat Diet Control 1, your filename is SRR5017135.fastq.gz). To run the FastQC program, type fastqc, then a space, then the name of the file:

Example:

fastqc fastqfile.fastq.gz

In [ ]:

  2. Move the output of the fastqc analysis to the previously created fastqc-untrimmed-results folder using the mv (move) commands below:

In [ ]:
mv *.zip /home/gea_user/rna-seq-project/fastqc-untrimmed-results
mv *.html /home/gea_user/rna-seq-project/fastqc-untrimmed-results
  3. You can view the results in the file browser on the left of the JupyterLab screen: navigate to the rna-seq-project folder and then the fastqc-untrimmed-results folder.


[Image: folder navigation]