IDOM/DRC FGC & PSOM NGSC Cores: Methods


Monday, April 9, 2012

Bacterial Contamination

The Problem

A ChIP-seq, RNA-Seq, or other library type looks good and sequences well, but has a very low alignment percentage, i.e., much less than 50%. Sometimes this is due to poor library construction that results in artefactual sequences, but once in a while the library is good but contains a large amount of bacterial sequence and relatively little of the intended species.

Confirming the Diagnosis

In the blastn section of the NCBI BLAST website there is a 'Whole-genome shotgun contigs (wgs)' database, which contains sequence from a wide range of organisms. If you convert 20 or so of your reads to FASTA format, you can paste them into the search window and see what species they match. If you get a bunch of excellent matches to a species other than the one you had hoped for, then that's the problem. The exact species can depend on the source of the contamination, e.g., local water, reagents, or the bacteria in the host species (gut flora/fauna, dirt, etc.).
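The FASTQ-to-FASTA conversion mentioned above can be sketched in a few lines of Python. This is a minimal sketch, assuming standard four-line FASTQ records; the function name `fastq_to_fasta` and the example reads are hypothetical and only for illustration.

```python
from itertools import islice


def fastq_to_fasta(fastq_lines, max_reads=20):
    """Convert the first max_reads four-line FASTQ records to FASTA text."""
    records = []
    lines = iter(fastq_lines)
    while len(records) < max_reads:
        record = list(islice(lines, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header, sequence = record[0].strip(), record[1].strip()
        if not header.startswith("@"):
            raise ValueError(f"Not a FASTQ header: {header!r}")
        # FASTA keeps the read name and sequence, and drops the qualities
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records)


# Hypothetical reads, only to show the call
example = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "TTGACCGGTA", "+", "IIIIIIIIII",
]
print(fastq_to_fasta(example))
```

For a real file, pass an open file handle instead of the list, e.g. `fastq_to_fasta(open("reads.fastq"))`, and paste the printed text into the BLAST search window.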

Tracking Down the Source

For bacterial contamination the easiest way to find the source is to do PCR on water samples to see if you can find ribosomal sequence. The primers below hit the 16S rDNA gene in a wide variety of bacterial species but do not match mammals.

The forward primer, 5'-TCCTACGGGAGGCAGCAGT-3'
the reverse primer, 5'-GGACTACCAGGGTATCTAATCCTGTT-3'
and the probe, (6-FAM)-5'-CGTATTACCGCGGCTGCTGGCAC-3'

Something's Fishy!

Another possible problem is the use of salmon DNA as a blocking agent in ChIP-Seq experiments. This is rare these days, as we warn against it and the issue is more widely reported. The BLAST search above should identify this problem as well.

Friday, March 9, 2012

HITS-CLIP

Summary

HITS-CLIP is roughly equivalent to ChIP-seq, except that it identifies the RNA bound by a protein rather than the DNA. Additionally, when the target protein is Argonaute (Ago), the libraries can contain the miRNAs bound to the mRNA as well as the mRNA itself.

Library Prep

Data Analysis

Wednesday, March 7, 2012

Exome Capture - Targeted Resequencing

Summary

Library Prep

Data Analysis

RNA-Seq

Summary

RNA-Seq is a technique for measuring RNA expression levels using high-throughput sequencing. There are many variations, but the common theme is to create a library that contains as little rRNA as possible, but otherwise contains fragments of RNA. The RNA fragments are sequenced, aligned to the genome and/or transcripts, then the number of reads hitting a transcript are counted to get the expression level of the transcript. Additionally, information about exon usage and splice forms can be extracted as well.

Typical Recommendations

Use multiplexed libraries.
We charge for sequencing by the lane; one lane yields about 200 million single reads or read pairs.
You can pool libraries and sequence over multiple lanes to get the necessary depth.
If splicing is important to you then use 100bp paired-end sequencing (100PE).
These recommendations are for 'complex' organisms. Shorter reads and/or fewer reads may be necessary for 'simpler' organisms.

A Basic Well-Designed Experiment

3 or more replicates per condition

may not be enough if effect is small or variability is high.

50 million read (pairs) per sample

A Quick Survey

2 replicates
30 million read (pairs)

this is similar to a microarray experiment.
sequence deeper to get more detail on low-expressing transcripts.

Deep Sequencing Experiment

200 million read (pairs)
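The three designs above translate into lane counts with simple arithmetic. The following is a rough sketch that uses the approximate figure of 200 million reads (or read pairs) per lane from the recommendations above; the function name `lanes_needed` and the assumption of two conditions are illustrative only, and real yields vary from run to run.

```python
import math

# Approximate yield per lane quoted in the recommendations above
READS_PER_LANE = 200_000_000


def lanes_needed(n_samples, reads_per_sample, reads_per_lane=READS_PER_LANE):
    """Whole lanes required to give every pooled sample its target depth."""
    total_reads = n_samples * reads_per_sample
    return math.ceil(total_reads / reads_per_lane)


# Assumed layout: two conditions (hypothetical) for the first two designs
basic = lanes_needed(2 * 3, 50_000_000)   # 300 million reads -> 2 lanes
survey = lanes_needed(2 * 2, 30_000_000)  # 120 million reads -> 1 lane
deep = lanes_needed(1, 200_000_000)       # 200 million reads -> 1 lane

print(basic, survey, deep)
```

Because the libraries are multiplexed, the six samples of the basic design can be pooled and the pool spread over both lanes rather than assigning samples to individual lanes.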

Counting

Since the fundamental aspect of RNA-Seq is counting, it is important to get enough reads to adequately determine expression levels. Common wisdom is that about 10-30 million reads is roughly equivalent to a microarray. However, many applications done at the core routinely use 50 to 100 or even 200 million reads to reach deeper into the transcriptome. Also, if you are interested in quantifying exon or splice junction usage you will need more reads to adequately quantify these smaller features.
A characteristic of RNA-Seq is that at a given sequencing depth, the longer and/or more highly-expressed transcripts are more accurately quantified. Shorter or lower expressed genes take more sequence to quantify. Since the range of expression values is much higher than the range of mRNA sizes, expression level is the dominant force, but sequence length should not be discounted. It is especially important when quantifying exons or splice junctions as each of these features will be captured by only a small portion of the reads from a gene.
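To make the interplay of length and expression concrete, here is a toy calculation. It is a sketch under two stated assumptions that are not from the text above: reads are sampled in proportion to abundance times transcript length, and counting noise is Poisson, so the relative error of a count is about 1/sqrt(count). The transcript names and numbers are made up.

```python
import math


def expected_counts(transcripts, total_reads):
    """Expected reads per transcript, assuming reads ~ abundance * length.

    transcripts maps a name to (relative_abundance, length_bp).
    """
    weights = {
        name: abundance * length
        for name, (abundance, length) in transcripts.items()
    }
    total_weight = sum(weights.values())
    return {name: total_reads * w / total_weight for name, w in weights.items()}


def poisson_cv(count):
    """Relative sampling error of a count under the Poisson assumption."""
    return 1.0 / math.sqrt(count)


# Hypothetical transcripts: a 100-fold expression range, an 8-fold length range
toy = {
    "long_high": (100.0, 4000),
    "short_high": (100.0, 500),
    "long_low": (1.0, 4000),
    "short_low": (1.0, 500),
}
counts = expected_counts(toy, total_reads=1_000_000)
for name, count in counts.items():
    print(f"{name}: ~{count:.0f} reads, relative error ~{poisson_cv(count):.3f}")
```

In this toy setup the short, low-expressed transcript receives the fewest reads and therefore has the largest relative error, which is why such features need more sequencing depth.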

Splice Forms

There are two main lines of evidence for identifying splice forms. The first is explicit identification of a splice junction in the alignment of a read. The number of such detected junctions will be increased by using paired-end sequencing, as the second read gives the opportunity for more junctions to be discovered from a given fragment. The second is implicit deduction of the presence or absence of an exon from the length distribution of fragments and the distance between the two ends of a paired-end sequenced fragment. For example, in a gene with three exons and the two ends of a paired read aligning to exons 1 and 3, we can infer that exon 2 is unlikely to be in the transcript the reads came from if exon 2 is large and the average RNA insert length is small.
Assembling full-length splice forms is difficult with short single-read sequencing.
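The three-exon example above can be worked through numerically. This is a minimal sketch, assuming half-open genomic intervals and an insert-length distribution summarized by a mean and standard deviation; the coordinates, insert statistics, and function names are hypothetical, and this is not the method of any particular splice-detection tool.

```python
def implied_fragment_lengths(frag_start, frag_end, exon1, exon2, exon3):
    """Return (length_with_exon2, length_without_exon2) for one read pair.

    Exons are (start, end) half-open genomic intervals. frag_start lies in
    exon 1 and frag_end (exclusive) lies in exon 3.
    """
    part_in_exon1 = exon1[1] - frag_start
    part_in_exon3 = frag_end - exon3[0]
    exon2_length = exon2[1] - exon2[0]
    with_exon2 = part_in_exon1 + exon2_length + part_in_exon3
    without_exon2 = part_in_exon1 + part_in_exon3
    return with_exon2, without_exon2


def z_score(length, mean_insert, sd_insert):
    """How many standard deviations a length is from the mean insert length."""
    return (length - mean_insert) / sd_insert


# Hypothetical gene: exon 2 is 600 bp, inserts average 200 bp (sd 30 bp)
exon1, exon2, exon3 = (1000, 1200), (2000, 2600), (4000, 4300)
with_e2, without_e2 = implied_fragment_lengths(1100, 4080, exon1, exon2, exon3)
z_with = z_score(with_e2, mean_insert=200, sd_insert=30)
z_without = z_score(without_e2, mean_insert=200, sd_insert=30)

print(with_e2, without_e2)  # 780 180
print(abs(z_with) > abs(z_without))  # True: skipping exon 2 fits better
```

A 780 bp fragment would be far outside a library whose inserts average 200 bp, so the pair is much better explained by a transcript that lacks exon 2.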

Replicates

RNA-Seq is not fundamentally different from a microarray experiment in that replicates are essential to understand the inherent biological and technical variability in an experiment. However, though not recommended, it is possible to get p-values from single replicate comparisons.

Library Prep

rRNA Reduction

rRNA can be reduced in a few different ways. First, it can be depleted by using rRNA (complementary) sequences attached to beads. The sequences grab the rRNA, which is then removed by extracting the beads. Second, it can be depleted by using poly-T sequences attached to beads, which are used to pull out the poly-A mRNA from the total RNA. Third, there is a dsDNase which can be used to remove double-stranded DNA from a library. Since the rRNA-derived fragments form the major portion of the library, they will re-anneal before the other fragments and be digested. Finally, some kits, e.g., one from NuGEN, use special primers in the RNA to cDNA process to avoid reverse transcribing rRNA.

Costs

The Illumina TruSeq kit costs about $75 per sample, but at the moment (2012/05/22) it is back-ordered.

Multiplexing

We strongly recommend that RNA-Seq libraries be multiplexed so that test runs can be done. Additionally, this allows many libraries to be sequenced in a pool over many lanes which allows us to tune coverage to the needs of the experiment and to mitigate any lane-

Data Analysis

At the moment the Core uses RUM to analyze RNA-Seq data. RUM is run on the cloud (AWS) and the charges reflect AWS usage charges and labor.

miR-Seq

Summary

The purpose of miR-Seq is to quantify the free or total amount of miRNAs in a sample.

Library Prep

Data Analysis

Trim reads to remove adapter
Align unique sequences to precursor hairpins, RefSeq RNAs, and the genome.
When there are 3 or more replicates we can do quantile normalization and differential expression analysis.
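For readers unfamiliar with quantile normalization, the idea can be sketched in plain Python: each sample's values are replaced by the mean, across samples, of the values at the same rank. This is an illustrative sketch, not the Core's pipeline; the function name and the toy counts are hypothetical, and ties are broken by row order rather than averaged.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a list of rows (miRNAs) by columns (samples)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across samples
    sorted_columns = [sorted(col) for col in columns]
    reference = [
        sum(col[k] for col in sorted_columns) / n_cols for k in range(n_rows)
    ]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered from smallest to largest value in this sample
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Hypothetical counts: 3 miRNAs (rows) in 3 replicate libraries (columns)
counts = [
    [10, 40, 3],
    [20, 10, 9],
    [30, 25, 6],
]
normalized = quantile_normalize(counts)
for row in normalized:
    print([round(value, 2) for value in row])
```

After normalization every sample has the same set of values, and only the ranking of the miRNAs within each sample differs.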

ChIP-Seq

Sample Prep

Test your antibody and sample prep before making a library.
We recommend an enrichment ratio (C+/I+)/(C-/I-) ≥ 10, where:

C+ is ChIP at positive control
I+ is input at positive control
C- is ChIP at negative control
I- is input at negative control

To do this, you need two primer pairs, a positive control (+) and a negative control (-), which you measure on the ChIP and input samples using quantitative real-time PCR. See the figure to the right.
We strongly recommend that you sequence an input library for each condition. Note that in the figure, the input track has a strong peak at the same place as the ChIP peak. The strength of input peaks or bias is dependent on the state of the cells as well as chromatin preparation conditions.
Chromatin fragmentation is accomplished either via sonication or (for histones) DNase treatment.
The sonication conditions can dramatically affect the results, so be consistent.
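The enrichment ratio (C+/I+)/(C-/I-) recommended above can be computed as follows. This is a sketch under assumptions that are not from the text: the four measurements arrive as qPCR Ct values, and amplification is perfectly efficient (a doubling per cycle), so a relative quantity is 2^(-Ct). The Ct values and function names are hypothetical.

```python
def relative_quantity(ct, efficiency=2.0):
    """Convert a qPCR Ct to a relative quantity (assumes ideal doubling)."""
    return efficiency ** (-ct)


def enrichment_ratio(chip_pos, input_pos, chip_neg, input_neg):
    """(C+/I+)/(C-/I-) from relative quantities."""
    return (chip_pos / input_pos) / (chip_neg / input_neg)


# Hypothetical Ct values at the positive (+) and negative (-) control regions
ct = {"C+": 24.0, "I+": 22.0, "C-": 30.0, "I-": 22.0}
ratio = enrichment_ratio(
    relative_quantity(ct["C+"]),
    relative_quantity(ct["I+"]),
    relative_quantity(ct["C-"]),
    relative_quantity(ct["I-"]),
)
print(f"enrichment ratio = {ratio:.1f}, passes threshold: {ratio >= 10}")
```

With these made-up values the ChIP signal is 2 cycles behind input at the positive control and 8 cycles behind at the negative control, a 6-cycle difference that corresponds to a 64-fold enrichment.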

Sonication

Sonication needs to be tuned to the sample.
Successful sonication is a balancing act between a few opposing trends listed below.
Some complexes are fragile, so use as little sonication as possible.
We only sequence the fragments with lengths of about 150bp, so enrichment in long fragments does not help.
More sonication reaches deeper into dense chromatin.
For low cell counts, you need to be as efficient as possible, so you need to leave as little material outside the fragment lengths that actually get sequenced.
With the proper primers, you can verify enrichment in both the ChIPed chromatin and the library to ensure that the enrichment is still present in the library.

Sequencing

ChIP-seq libraries are normally sequenced with 50bp SR runs on the HiSeq.
ChIP-seq is primarily a counting process; the reads only have to be long enough to place most of them uniquely in the genome.
40 million reads are usually sufficient in a mammal.
We have a multiplexed library protocol which can be used to put about 6 libraries in one lane.
You may need more reads if the target protein is spread across large sections of the genome.

Initial Analysis

ChIP-seq libraries are prone to PCR bias when the amount of starting material is small so we check the read redundancy of ChIP-Seq libraries.
We align the reads to the genome with ELAND and keep the ones with a best alignment to just one position.
We usually use HOMER to identify areas of enrichment, but can use other tools such as MACS or GLITR.
We use the standard annotation pipeline to annotate the enriched regions.
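A read redundancy check of the kind mentioned in the first step can be sketched as below. The definition used here, the fraction of uniquely aligned reads that share chromosome, strand, and start position with another read, is one common choice and an assumption on our part; the function name and the example alignments are hypothetical.

```python
from collections import Counter


def read_redundancy(alignments):
    """Fraction of reads that duplicate another read's alignment position.

    alignments is an iterable of (chromosome, strand, start) tuples, one per
    uniquely aligned read.
    """
    position_counts = Counter(alignments)
    total_reads = sum(position_counts.values())
    if total_reads == 0:
        return 0.0
    distinct_positions = len(position_counts)
    return 1.0 - distinct_positions / total_reads


# Hypothetical alignments: 6 reads covering 4 distinct positions
example_alignments = [
    ("chr1", "+", 1000),
    ("chr1", "+", 1000),
    ("chr1", "+", 1000),
    ("chr1", "-", 1000),
    ("chr2", "+", 500),
    ("chr3", "-", 42),
]
redundancy = read_redundancy(example_alignments)
print(f"redundancy = {redundancy:.2f}")
```

A high value on a library made from little starting material suggests that PCR duplicates, rather than independent fragments, account for much of the data.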
