IDOM/DRC FGC & PSOM NGSC Cores: RNA-Seq

Summary

RNA-Seq is a technique for measuring RNA expression levels using high-throughput sequencing. There are many variations, but the common theme is to create a library that contains as little rRNA as possible but otherwise contains fragments of RNA. The RNA fragments are sequenced and aligned to the genome and/or transcripts, then the number of reads hitting a transcript is counted to get the expression level of that transcript. Information about exon usage and splice forms can be extracted as well.
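The counting step can be sketched in a few lines. This is a minimal illustration only, not the Core's pipeline: the function name count_reads, the coordinates, and the gene names are all hypothetical, and a real analysis also has to handle spliced alignments, strand, and reads that map to more than one place.

```python
def count_reads(alignments, transcripts):
    """Count aligned reads that overlap each transcript.

    alignments:  iterable of (chrom, start, end), half-open coordinates
    transcripts: dict of name -> (chrom, start, end), half-open coordinates
    """
    counts = {name: 0 for name in transcripts}
    for chrom, start, end in alignments:
        for name, (t_chrom, t_start, t_end) in transcripts.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if chrom == t_chrom and start < t_end and end > t_start:
                counts[name] += 1
    return counts


# Hypothetical toy data: two transcripts and five aligned reads.
transcripts = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}
alignments = [
    ("chr1", 120, 170),   # inside geneA
    ("chr1", 480, 530),   # overlaps the end of geneA
    ("chr1", 900, 950),   # inside geneB
    ("chr2", 120, 170),   # different chromosome, not counted
    ("chr1", 600, 650),   # between the two transcripts, not counted
]
counts = count_reads(alignments, transcripts)
print(counts)
```

The raw count per transcript is the starting point for the expression level; normalizing for sequencing depth and transcript length is a separate step.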

Typical Recommendations

Use multiplexed libraries.
We charge for sequencing by the lane, which will yield about 200 million single reads or read pairs.
You can pool libraries and sequence over multiple lanes to get the necessary depth.
If splicing is important to you, then use 100bp paired-end sequencing (100PE).
These recommendations are for 'complex' organisms. Shorter reads and/or fewer reads may be sufficient for 'simpler' organisms.

A Basic Well-Designed Experiment

3 or more replicates per condition

may not be enough if effect is small or variability is high.

50 million read (pairs) per sample

A Quick Survey

2 replicates
30 million read (pairs)

this is similar to a microarray experiment.
sequence deeper to get more detail on low-expressing transcripts.

Deep Sequencing Experiment

200 million read (pairs)
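To turn these designs into a number of lanes, a rough calculation like the following can help. It is only a sketch: it assumes the lane yield of about 200 million reads quoted above, a hypothetical two-condition experiment, and (for the deep sequencing case) a replicate count that we have made up for illustration. Actual yields vary from run to run.

```python
import math

READS_PER_LANE = 200_000_000  # approximate lane yield quoted above


def lanes_needed(samples, reads_per_sample, reads_per_lane=READS_PER_LANE):
    """Whole lanes needed so that a pool of samples reaches its target depth."""
    total_reads = samples * reads_per_sample
    return math.ceil(total_reads / reads_per_lane)


# Hypothetical two-condition experiments: (number of samples, reads per sample).
designs = {
    "well-designed (2 conditions x 3 replicates, 50M each)": (6, 50_000_000),
    "quick survey (2 conditions x 2 replicates, 30M each)": (4, 30_000_000),
    "deep (2 conditions x 3 replicates, 200M each)": (6, 200_000_000),
}

for name, (samples, depth) in designs.items():
    print(f"{name}: {lanes_needed(samples, depth)} lane(s)")
```

Because the libraries are multiplexed, the whole pool can be spread across the lanes rather than assigning individual samples to individual lanes.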

Counting

Since the fundamental aspect of RNA-Seq is counting, it is important to get enough reads to adequately determine expression levels. Common wisdom is that 10-30 million reads is roughly equivalent to a microarray. However, many applications done at the core routinely use 50 to 100 or even 200 million reads to reach deeper into the transcriptome. Also, if you are interested in quantifying exon or splice junction usage, you will need more reads to adequately quantify these smaller features.
A characteristic of RNA-Seq is that, at a given sequencing depth, the longer and/or more highly expressed transcripts are more accurately quantified. Shorter or lower-expressed genes take more sequence to quantify. Since the range of expression values is much wider than the range of mRNA sizes, expression level is the dominant force, but sequence length should not be discounted. It is especially important when quantifying exons or splice junctions, as each of these features will be captured by only a small portion of the reads from a gene.
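The effect of length and expression level on precision can be illustrated with a simple model. The sketch below makes two assumptions that are ours, not measurements: reads are sampled in proportion to transcript length times abundance, and counts follow a Poisson distribution, so the relative error of a count is about 1/sqrt(count). The four-transcript "transcriptome" is a hypothetical toy, so the absolute counts are not realistic; only the ordering matters.

```python
import math

# Hypothetical toy transcriptome: name -> (length in bp, relative abundance).
transcriptome = {
    "long_high": (4000, 1000.0),
    "short_high": (500, 1000.0),
    "long_low": (4000, 1.0),
    "short_low": (500, 1.0),
}


def expected_counts(total_reads, transcriptome):
    """Expected reads per transcript, assuming sampling proportional to
    length * abundance."""
    total_weight = sum(length * abundance
                       for length, abundance in transcriptome.values())
    return {
        name: total_reads * length * abundance / total_weight
        for name, (length, abundance) in transcriptome.items()
    }


def relative_error(count):
    """Coefficient of variation of a Poisson count: 1 / sqrt(count)."""
    return 1.0 / math.sqrt(count)


for depth in (30_000_000, 200_000_000):
    counts = expected_counts(depth, transcriptome)
    for name, count in counts.items():
        print(f"{depth:>11,} reads  {name:<10}  "
              f"~{count:,.0f} reads  CV ~{relative_error(count):.4f}")
```

In this model the 1000-fold difference in abundance moves the counts far more than the 8-fold difference in length, which matches the point that expression level dominates while length still matters.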

Splice Forms

There are two main lines of evidence for identifying splice forms. The first is explicit identification of a splice junction in the alignment of a read. The number of such detected junctions will be increased by using paired-end sequencing, as the second read gives the opportunity for more junctions to be discovered from a given fragment. The second is implicit deduction of the presence or absence of an exon from the length distribution of fragments and the distance between the two ends of a paired-end sequenced fragment. For example, in a gene with three exons where the two ends of a paired read align to exons 1 and 3, we can infer that exon 2 is unlikely to be in the transcript the reads came from if exon 2 is large and the average RNA insert length is small.
Assembling full-length splice forms is difficult with short single-read sequencing.
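The three-exon example above can be worked through numerically. This is a sketch under stated assumptions: the exon coordinates and fragment ends are hypothetical, and the insert sizes are modeled as a normal distribution with a mean of 200 bp and standard deviation of 30 bp, which is an illustrative choice rather than a property of any particular library.

```python
from statistics import NormalDist


def fragment_length(frag_start, frag_end, exons):
    """Exonic bases between the two ends of a fragment.

    Coordinates are half-open genomic positions; `exons` lists the exons
    assumed to be present in the transcript.
    """
    total = 0
    for start, end in exons:
        overlap = min(end, frag_end) - max(start, frag_start)
        if overlap > 0:
            total += overlap
    return total


# Hypothetical gene: exon 2 is large (800 bp) compared with the insert size.
exon1, exon2, exon3 = (1000, 1200), (2000, 2800), (4000, 4200)

# Outer ends of a read pair: one read aligns in exon 1, its mate in exon 3.
frag_start, frag_end = 1100, 4100

with_exon2 = fragment_length(frag_start, frag_end, [exon1, exon2, exon3])
without_exon2 = fragment_length(frag_start, frag_end, [exon1, exon3])

# Assumed insert size distribution for the library (illustrative values).
inserts = NormalDist(mu=200, sigma=30)
likelihood_with = inserts.pdf(with_exon2)
likelihood_without = inserts.pdf(without_exon2)

print("implied fragment length with exon 2:   ", with_exon2)
print("implied fragment length without exon 2:", without_exon2)
print("exon 2 skipped is more likely:", likelihood_without > likelihood_with)
```

If exon 2 were short relative to the spread of insert sizes, the two implied lengths would both be plausible and a single read pair would say very little, which is why this line of evidence depends on the fragment length distribution.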

Replicates

RNA-Seq is not fundamentally different from a microarray experiment in that replicates are essential to understand the inherent biological and technical variability in an experiment. It is possible to get p-values from single-replicate comparisons, but this is not recommended.

Library Prep

rRNA Reduction

rRNA can be reduced in a few different ways. First, it can be depleted by using sequences complementary to rRNA attached to beads. The sequences grab the rRNA, which is then removed by extracting the beads. Second, poly-T sequences attached to beads can be used to pull the poly-A mRNA out of the total RNA, leaving the rRNA behind. Third, there is a dsDNase which can be used to remove double-stranded DNA from a library. Since the rRNA fragments form the major portion of the library, they will re-anneal before the other fragments and be digested. Finally, some kits, e.g., one from Nugen, use special primers in the RNA-to-cDNA process to avoid reverse transcribing rRNA.

Costs

The Illumina TruSeq kit costs about $75 per sample, but at the moment (2012/05/22) it is back-ordered.

Multiplexing

We strongly recommend that RNA-Seq libraries be multiplexed so that test runs can be done. Additionally, this allows many libraries to be sequenced in a pool over many lanes which allows us to tune coverage to the needs of the experiment and to mitigate any lane-

Data Analysis

At the moment the Core uses RUM to analyze RNA-Seq data. RUM is run on the cloud (AWS) and the charges reflect AWS usage charges and labor.

Wednesday, March 7, 2012
