IDOM/DRC FGC & PSOM NGSC Cores: March 2012

Thursday, March 29, 2012

Scheduling Runs

Policy

Since hiSeq2000 flow cells contain 8 lanes, we need to collect 8 lanes' worth of libraries before we begin a run. This is more than most investigators will need to get an experiment started and, often, more than they will have prepared at a time, especially with multiplexing. Therefore we schedule flow cells from the libraries in the sequencing queue in a first-come, first-served fashion.

Because we do this scheduling it is actually counter-productive for clients to fill a flow cell and insist that the samples be sequenced together. There is rarely a technical reason to have everything sequenced on the same flow cell, so we ask that you let us schedule the flow cells to get maximum throughput.

Sequencing Queue

We have sequencing queues for each combination of sequencing parameters:

length (50 or 100bp)
paired-end or single-read
multiplexed or plain

At the moment 50bp PE sequencing is rare, so this is the one case where bringing 8 lanes of libraries would be helpful.

If you can't wait for a 50bp PE flowcell to fill, then we will put the library in the 100bp PE queue.
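The queueing policy above can be sketched in a few lines of Python. This is only an illustration, not the Core's actual scheduling software; the class and function names are hypothetical, and it assumes each library requests a whole number of lanes.

```python
from collections import defaultdict, deque

LANES_PER_FLOW_CELL = 8  # a hiSeq2000 flow cell has 8 lanes


def queue_key(length_bp, paired_end, multiplexed):
    """One queue exists per combination of sequencing parameters."""
    return (
        length_bp,
        "PE" if paired_end else "SR",
        "multiplexed" if multiplexed else "plain",
    )


class SequencingQueues:
    """Hypothetical first-come, first-served lane queues."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def submit(self, library, lanes, length_bp, paired_end, multiplexed):
        key = queue_key(length_bp, paired_end, multiplexed)
        # One queue entry per requested lane, kept in arrival order.
        for _ in range(lanes):
            self.queues[key].append(library)
        return key

    def next_flow_cell(self, key):
        """Return 8 lanes' worth of libraries, or None if the queue is not full."""
        queue = self.queues[key]
        if len(queue) < LANES_PER_FLOW_CELL:
            return None
        return [queue.popleft() for _ in range(LANES_PER_FLOW_CELL)]
```

In this sketch a library that asks for more lanes than are left on the current flow cell simply spills over onto the next one, which is why insisting that all samples run together works against throughput.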

Thursday, March 22, 2012

How to Find Us

We are located in the Translational Research Center on the 12th floor.

12-156 Translational Research Center

3400 Civic Center Blvd Bldg 421

Philadelphia, PA 19104-5156

Floor Plan

Friday, March 9, 2012

HITS-CLIP

Summary

HITS-CLIP is roughly equivalent to ChIP-seq, except that it identifies the RNA bound by a protein rather than the DNA. Additionally, when the target protein is Argonaute (Ago), the libraries can contain the miRNAs bound to the mRNA as well as the mRNA itself.

Library Prep

Data Analysis

Thursday, March 8, 2012

Fixed Bugs and Implemented Features

Summary

Here are the things we've fixed or added.

Pending Bugs and Feature Requests

Summary

We will post known bugs and desired features here for comment and prioritization. As we fix bugs they will be moved to the Fixed Bugs post.

Errors and Omissions

Here's a list of the issues we are aware of and working to fix.

Track scrooming is not centered.
I can't upload new tracks from my files.
Background condition information for experiments.

Wednesday, March 7, 2012

Initial Data Analysis

Summary

We offer a complete set of initial analyses for most techniques.
The initial analysis usually covers cleaning and trimming, alignment, and quantification.
In many cases differential analyses are available.
Charges may include the cost of computing, as many of the techniques may require up to 1000 hours of CPU time.
We do not offer advanced analysis, but will consider developing pipelines that will be useful to many investigations, e.g., for new techniques.

UCSC Files

Summary

The UCSC genome browser can load and generate a number of different file types. Their website has a good description of the formats, but we can add a few comments here.

BED Files

BED files are the standard method for delivering locations of 'small' sets of features, e.g., 40,000 transcription factor binding sites.
The files do not contain any sequence information.
BED files can be uploaded to UCSC and many other genome browsers to view results.
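As a rough illustration of what a BED file holds, here is a minimal Python reader. It assumes tab-separated lines whose first three columns are chromosome, start, and end, with a 0-based start and an exclusive end as in the UCSC format description. The function names are hypothetical.

```python
def parse_bed_line(line):
    """Parse one tab-separated BED line into a small dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None  # column 4 is optional
    # Start is 0-based and end is exclusive, so the length is end - start.
    return {"chrom": chrom, "start": start, "end": end,
            "name": name, "length": end - start}


def read_bed(lines):
    """Yield features, skipping blank, comment, 'track' and 'browser' lines."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        yield parse_bed_line(line)
```

Passing an open file handle as `lines` reads a file one feature at a time.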

Fastq Files

Summary

Fastq files are the standard format for delivering 'raw' sequencing results.
Fastq files contain a unique ID for each read, the read sequence, and base qualities.
The files do not contain any alignment information.
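A minimal sketch of reading the four-line Fastq records described above. The quality offset of 33 is an assumption (older Illumina pipelines used 64), and the function names are hypothetical.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, qualities) from four-line Fastq records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # the '+' separator line
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed Fastq record: " + header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(c) - offset for c in qual]
```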

Multiplexed Libraries

Summary

Because a single lane of a modern sequencer may have a capacity that exceeds what is necessary for a sample, Illumina (and other manufacturers) offer ways to put more than one library in a lane. Illumina does this by using modified adapters that include a barcode in one end or both. The barcodes are sequenced in distinct phases of the sequencing procedure. The reads from the lane are put into groups during the data processing phase.
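The grouping step can be pictured with the following Python sketch. It is a simplification that assumes exact barcode matches and a single barcode per read; production demultiplexers usually also tolerate sequencing errors. The barcode sequences and sample names used with it would be your own.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (read_id, barcode, sequence) tuples by the sample their barcode maps to."""
    groups = defaultdict(list)
    for read_id, barcode, sequence in reads:
        # Reads with an unrecognized barcode are kept apart, not discarded.
        sample = barcode_to_sample.get(barcode, "undetermined")
        groups[sample].append((read_id, sequence))
    return dict(groups)
```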

Limitations

Illumina offers recommendations for sets of barcodes to use when pooling small numbers of libraries. Failure to follow these recommendations may result in the failure of the barcode sequencing and thus an inability to separate the libraries.

Sample Drop Off

When you bring samples in for sequencing, make sure you know what the barcodes are so that we can properly associate sample info with the barcode.

Exome Capture - Targeted Resequencing

Summary

Library Prep

Data Analysis

RNA-Seq

Summary

RNA-Seq is a technique for measuring RNA expression levels using high-throughput sequencing. There are many variations, but the common theme is to create a library that contains as little rRNA as possible, but otherwise contains fragments of RNA. The RNA fragments are sequenced and aligned to the genome and/or transcripts, then the number of reads hitting a transcript is counted to get the expression level of the transcript. Information about exon usage and splice forms can be extracted as well.

Typical Recommendations

Use multiplexed libraries.
We charge for sequencing by the lane which will yield about 200 million single reads or read pairs.
You can pool libraries and sequence over multiple lanes to get the necessary depth.
If splicing is important to you then use 100bp paired-end sequencing (100PE).
These recommendations are for 'complex' organisms. Shorter reads and/or fewer reads may be necessary for 'simpler' organisms.

A Basic Well-Designed Experiment

3 or more replicates per condition

may not be enough if effect is small or variability is high.

50 million read (pairs) per sample

A Quick Survey

2 replicates
30 million read (pairs)

this is similar to a microarray experiment.
sequence deeper to get more detail on low-expressing transcripts.

Deep Sequencing Experiment

200 million read (pairs)
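To turn these designs into a lane count, a small calculation helps. The sketch below uses the figure of about 200 million reads (or read pairs) per lane quoted above; the two-condition experiments in the usage note are hypothetical examples.

```python
READS_PER_LANE = 200_000_000  # approximate yield of one lane


def lanes_needed(n_conditions, replicates, reads_per_sample):
    """Lanes required when all libraries are pooled and spread over lanes."""
    total_reads = n_conditions * replicates * reads_per_sample
    # Integer ceiling division: a partly used lane still counts as a lane.
    return (total_reads + READS_PER_LANE - 1) // READS_PER_LANE
```

For two conditions, the basic design (3 replicates at 50 million reads) gives `lanes_needed(2, 3, 50_000_000)`, which is 2 lanes, while the quick survey (2 replicates at 30 million reads) fits in 1 lane.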

Counting

Since the fundamental aspect of RNA-Seq is counting, it is important to get enough reads to adequately determine expression levels. Common wisdom is that about 10-30 million reads is roughly equivalent to a microarray. However, many applications done at the core routinely use 50 to 100 or even 200 million reads to reach deeper into the transcriptome. Also, if you are interested in quantifying exon or splice junction usage you will need more reads to adequately quantify these smaller features.
A characteristic of RNA-Seq is that at a given sequencing depth, the longer and/or more highly-expressed transcripts are more accurately quantified. Shorter or lower expressed genes take more sequence to quantify. Since the range of expression values is much higher than the range of mRNA sizes, expression level is the dominant force, but sequence length should not be discounted. It is especially important when quantifying exons or splice junctions as each of these features will be captured by only a small portion of the reads from a gene.
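The effect of length and expression level on counts can be illustrated with a deliberately simplified model in which fragments are sampled uniformly, so a transcript's expected read count is proportional to its abundance times its length. This is an assumption for illustration only, not the model used by any particular quantification tool.

```python
def expected_read_counts(transcripts, total_reads):
    """transcripts maps name -> (relative_abundance, length_bp)."""
    weights = {name: abundance * length
               for name, (abundance, length) in transcripts.items()}
    total_weight = sum(weights.values())
    return {name: total_reads * w / total_weight
            for name, w in weights.items()}
```

With two equally abundant transcripts of 1,000bp and 4,000bp and one million reads, the shorter transcript is expected to receive about 200,000 reads and the longer about 800,000, so the shorter one is quantified from fewer observations.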

Splice Forms

There are two main lines of evidence for identifying splice forms. The first is explicit identification of a splice junction in the alignment of a read. The number of such detected junctions will be increased by using paired-end sequencing, as the second read gives the opportunity for more junctions to be discovered from a given fragment. The second is implicit deduction of the presence or absence of an exon from the length distribution of fragments and the distance between the two ends of a paired-end sequenced fragment. For example, in a gene with three exons and the two ends of a paired read aligning to exons 1 and 3, we can infer that exon 2 is unlikely to be in the transcript the reads came from if exon 2 is large and the average RNA insert length is small.
Assembling full-length splice forms is difficult with short single-read sequencing.
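The three-exon example can be made concrete with a short sketch. It compares the fragment length implied by including or skipping exon 2 with the library's insert-size distribution. The tolerance of three standard deviations and all function and parameter names are assumptions chosen for illustration.

```python
def implied_fragment_lengths(exon1_len, exon2_len,
                             start_in_exon1, bases_in_exon3):
    """Fragment length if exon 2 is included versus skipped.

    start_in_exon1: 0-based offset of the fragment's first base in exon 1.
    bases_in_exon3: number of exon 3 bases covered by the fragment.
    """
    skipped = (exon1_len - start_in_exon1) + bases_in_exon3
    included = skipped + exon2_len
    return included, skipped


def exon2_likely_skipped(included, skipped, mean_insert, sd_insert, n_sd=3):
    """True when only the skipped form fits the insert-size distribution."""
    def plausible(length):
        return abs(length - mean_insert) <= n_sd * sd_insert
    return plausible(skipped) and not plausible(included)
```

For instance, with a 300bp exon 1, a 2,000bp exon 2, and a library whose inserts average 200bp, a fragment implying 2,200bp with exon 2 and 200bp without it points to a transcript that lacks exon 2.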

Replicates

RNA-Seq is not fundamentally different from a microarray experiment in that replicates are essential to understand the inherent biological and technical variability in an experiment. However, though not recommended, it is possible to get p-values from single replicate comparisons.

Library Prep

rRNA Reduction

rRNA can be reduced in a few different ways. First, it can be depleted by using rRNA (complementary) sequences attached to beads. The sequences grab the rRNA, which is then removed by extracting the beads. Second, it can be depleted by using poly-T sequences attached to beads, which are used to pull out the poly-A mRNA from the total RNA. Third, there is a dsDNase which can be used to remove double-stranded DNA from a library. Since the rRNA fragments form the major portion, they will re-anneal before the other RNA and be digested. Finally, some kits, e.g., one from Nugen, use special primers in the RNA to cDNA process to avoid reverse transcribing rRNA.

Costs

The Illumina tru-Seq kit costs about $75 per sample, but at the moment (2012/05/22) it is back-ordered.

Multiplexing

We strongly recommend that RNA-Seq libraries be multiplexed so that test runs can be done. Additionally, this allows many libraries to be sequenced in a pool over many lanes which allows us to tune coverage to the needs of the experiment and to mitigate any lane-

Data Analysis

At the moment the Core uses RUM to analyze RNA-Seq data. RUM is run on the cloud (AWS) and the charges reflect AWS usage charges and labor.

miR-Seq

Summary

The purpose of miR-Seq is to quantify the free or total amount of miRNAs in a sample.

Library Prep

Data Analysis

Trim reads to remove adapter
Align unique sequences to precursor hairpins, RefSeq RNAs, and the genome.
When there are 3 or more replicates we can do quantile normalization and differential expression analysis.
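For readers unfamiliar with quantile normalization, here is a minimal pure-Python sketch of the general technique. It is not the Core's pipeline. It assumes every sample has a count for the same set of miRNAs, and it breaks ties by input order rather than averaging them.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one list of values per sample."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the k-th smallest value across all samples.
    rank_means = [sum(s[k] for s in sorted_samples) / len(samples)
                  for k in range(n)]
    normalized = []
    for sample in samples:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After normalization every sample has the same distribution of values, so differences between samples reflect changes in rank rather than differences in sequencing depth.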

ChIP-Seq

Sample Prep

Test your antibody and sample prep before making a library.
We recommend that the enrichment ratio (C+/I+)/(C-/I-) ≥ 10.

C+ is ChIP at positive control
I+ is input at positive control
C- is ChIP at negative control
I- is input at negative control

To do this, you need two primer pairs, a positive control (+) and a negative control (-) which you measure on the ChIP and input samples using Q RT-PCR. See the figure to the right.
We strongly recommend that you sequence an input library for each condition. Note that in the figure, the input track has a strong peak at the same place as the ChIP peak. The strength of input peaks or bias is dependent on the state of the cells as well as chromatin preparation conditions.
Chromatin fragmentation is accomplished either via sonication or (for histones) DNase treatment.
The sonication conditions can dramatically affect the results, so be consistent.
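The enrichment ratio (C+/I+)/(C-/I-) defined above is simple to compute once you have the four qPCR measurements. A minimal sketch, with hypothetical function names and the recommended threshold of 10 as the default:

```python
def enrichment_ratio(chip_pos, input_pos, chip_neg, input_neg):
    """(C+/I+) / (C-/I-) from four relative qPCR quantities."""
    return (chip_pos / input_pos) / (chip_neg / input_neg)


def passes_enrichment(ratio, threshold=10.0):
    """True when the ratio meets the recommended minimum."""
    return ratio >= threshold
```

All four values must be on the same linear scale (relative quantities, not raw Ct values) for the ratio to be meaningful.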

Sonication

Sonication needs to be tuned to the sample.
Successful sonication is a balancing act between a few opposing trends listed below.
Some complexes are fragile so use as little sonication as possible.
We only sequence the fragments with lengths of about 150bp, so enrichment in long fragments does not help.
More sonication reaches deeper into dense chromatin.
For low cell counts, you need to be as efficient as possible, so you need to leave as little material outside the fragment lengths that actually get sequenced.
With the proper primers, you can verify enrichment in both the ChIPed chromatin and the library to ensure that the enrichment is still present in the library.

Sequencing

ChIP-seq libraries are normally sequenced with 50bp SR runs on the hiSeq.
ChIP-seq is primarily a counting process; the reads only have to be long enough to place most of them uniquely in the genome.
40 million reads are usually sufficient in a mammal.
We have a multiplexed library protocol which can be used to put about 6 libraries in one lane.
You may need more reads if the target protein is spread across large sections of the genome.

Initial Analysis

ChIP-seq libraries are prone to PCR bias when the amount of starting material is small so we check the read redundancy of ChIP-Seq libraries.
We align the reads to the genome with ELAND and keep the ones with a best alignment to just one position.
We usually use HOMER to identify areas of enrichment, but can use other tools such as MACS or GLITR.
We use the standard annotation pipeline to annotate the enriched regions.
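One simple way to express the read redundancy mentioned above is the fraction of uniquely aligned reads that duplicate the chromosome, strand, and position of another read. The definition below is an assumption for illustration and may differ from the exact measure used in the Core's pipeline.

```python
from collections import Counter


def read_redundancy(alignments):
    """alignments: iterable of (chrom, strand, position) for uniquely aligned reads."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Distinct positions over total reads is the non-redundant fraction.
    return 1.0 - len(counts) / total
```

A value near 0 means almost every read starts at a distinct position; a high value suggests the library was amplified from too little starting material.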

Pippin Prep

Summary

The NGSC uses the Pippin Prep to do size selection of libraries.

Using Sage Science’s Pippin Prep helps standardize library size selection and most importantly vastly increases yield over the normal gel extraction technique.

The machine does not catch your mistakes, so carefully follow the directions and be aware of what you are doing.
CAUTION: some gels contain Ethidium Bromide, so wear gloves at all times.

We recommend using 2% EF (Ethidium Free) Agarose Gel Cassette (Sage Science CEF-2010 or Life Technologies 4472171) for size selecting ChIP-Seq and RNA-Seq libraries.

3% EF Agarose Gel Cassettes (Sage Science CEF3010) can be used for size selection of smRNA-Seq samples.

1.5% EF Agarose Gel Cassettes (Sage Science CEF1510) can be used for size selection of gDNA libraries.

QuBit

Summary

The NGSC uses the Life Technologies QuBit to assess the concentration of libraries. The Qubit can measure protein, single- and double-stranded DNA, and RNA. The single-stranded DNA mode is especially helpful to quantify RNA-seq libraries with the 'high junk'.

This high junk is good library; it is an indication of over-amplification by PCR or of too few primers. Since the ends of the library have sequence homology, they will substitute for the primers and thus cause daisy chaining or a big clump of ssDNA. This is the high junk. Therefore, denaturing the libraries and using the ssDNA Qubit kit allows us to resolve the actual concentration of these complicated libraries.

Agilent BioAnalyzer

Summary
The NGSC uses the Agilent bioAnalyzer to assess the quality and quantity of samples and libraries.

Illumina hiSeq2000

Summary

The NGSC has two hiSeq2000 sequencers. These are high volume sequencers.

Sequencing Capacity

Each sequencer can sequence two flowcells at the same time.
Each flowcell has eight lanes.
Each lane can hold about 250 million clusters.
Each cluster can be sequenced from one end (single read) or both ends (paired-end).
Sequencing is normally done to a length of 50bp or 100bp.
A 100bp PE lane can generate 250 × 10⁶ clusters × 200 bp/cluster = 5 × 10¹⁰ bp
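The same arithmetic, extended to a whole flow cell and a whole sequencer, can be checked with a few lines of Python. The cluster count is the approximate figure given above, and the function name is hypothetical.

```python
CLUSTERS_PER_LANE = 250_000_000  # approximate, per hiSeq2000 lane
LANES_PER_FLOW_CELL = 8
FLOW_CELLS_PER_SEQUENCER = 2


def lane_yield_bp(read_length_bp, paired_end, clusters=CLUSTERS_PER_LANE):
    """Bases from one lane: clusters x read length x number of ends."""
    ends = 2 if paired_end else 1
    return clusters * read_length_bp * ends


lane_bp = lane_yield_bp(100, paired_end=True)            # 5 x 10^10 bp
flow_cell_bp = lane_bp * LANES_PER_FLOW_CELL             # 4 x 10^11 bp
sequencer_bp = flow_cell_bp * FLOW_CELLS_PER_SEQUENCER   # 8 x 10^11 bp
```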

Run Times

Flow Cells	Ends	Bp	Days
1	SR	50	2
1	SR	100	4
1	PE	50	5
1	PE	100	11
2	SR	50	2
2	SR	100	4
2	PE	50	5
2	PE	100	11

The length of time it takes to do a run depends on the number of ends sequenced, the number of bases sequenced, what kind of multiplexing, as well as whether there are one or two flowcells running on the same machine.
Multiplexing adds a day to the run. These are the minimum times; allow additional time for libraries to be quality checked, for scheduling, and for data to be processed after the run.
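For planning purposes, the run-time table and the multiplexing rule can be captured as a small lookup. This sketch only restates the numbers above; the function name is hypothetical, and the result is a minimum that excludes quality checks, scheduling, and data processing.

```python
# Minimum run times in days, keyed by (ends, read length in bp).
# The table lists the same values for one and for two flow cells.
RUN_DAYS = {
    ("SR", 50): 2,
    ("SR", 100): 4,
    ("PE", 50): 5,
    ("PE", 100): 11,
}


def minimum_run_days(ends, length_bp, multiplexed=False):
    """Look up the minimum run time; multiplexing adds one day."""
    days = RUN_DAYS[(ends, length_bp)]
    if multiplexed:
        days += 1
    return days
```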

Illumina miSeq

Summary
The NGSC has ordered a miSeq sequencer. It has a smaller capacity than a hiSeq, but is a much faster machine.

Availability
The miSeq is available now.

Sequencing Capacity

Each miSeq has one flowcell.
The flowcell has 1 lane.
Each lane can generate about 15 million clusters.
Each cluster can be sequenced from one end, single read (SR), or both ends, paired-end (PE).
Sequencing is normally done to a length between 50bp and 250bp.
A 250bp PE run can generate 7.5 × 10⁶ clusters × 500 bp/cluster = 3.75 × 10⁹ bp

So the miSeq lane capacity is close to the total base pair output of a hiSeq lane, but finishes in a day. However, to reach this capacity the miSeq is sequencing fewer but longer reads.

Using the miSeq

For now the miSeq will be operated by the NGSC staff.

However, we anticipate that after training investigators will be able to use the machine on their own.

Monday, March 5, 2012

How the New Website Works

The new website has been dramatically redesigned to integrate all aspects of genomics experiments into one site.

Here's a brief guide.

Services

This section contains a guide to the services we offer (Home) and a price list.

Getting Started

This section contains links to make new accounts, start a new experiment, and request an appointment with us.

Status and Results

Links in this section will let you search through your data, view it on the
genome, and download the data. Essentially all of the TessLA browser
functionality has been pushed into this section.

A few capabilities are missing but they should appear soon, along with new
ones to see the status of your samples and the core.

Technical Details

Links for protocol downloads, contact and shipping info, and our pictures.

Changes in TessLA

There are no more portals. Once you log in (using your old TessLA login), you can see all the data that you are allowed to see. This includes

all of the investigations that you are either PI or investigator on
all of the investigations that your PI is a PI on
all investigations that you are listed as a collaborator on
all public investigations.

To find data, use 'Search Tracks' or 'Search Studies'. These links let you do Google-style searching in studies or tracks to find the data you want.

We will be adjusting this search over the next few weeks, but for now, to find tracks of enriched areas called by HOMER for MyoD in heart, you would enter something like

*HOMER*+*myod*+*heart*

The studies search works in a similar way, but you start at the level of an investigation or study, then drill down to the tracks. This process usually means you can just start with something like

*myod*+*heart*
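To make the query syntax more concrete, here is one way such a query could be interpreted. This is a guess at the semantics, not the website's actual search code: it assumes `+` means "and", `*` matches any run of characters, and matching ignores case. The track descriptions in the usage note are made up.

```python
from fnmatch import fnmatchcase


def matches(query, description):
    """True when every '+'-separated wildcard term matches the description."""
    text = description.lower()
    return all(fnmatchcase(text, term.lower()) for term in query.split("+"))
```

Under these assumptions, `matches("*HOMER*+*myod*+*heart*", "HOMER peaks, MyoD ChIP-seq, heart E14.5")` is true, while the same query against a liver track is false.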

Once you've found the track you want, click the checkbox next to the genome
release to make it visible.

Not all tracks are linked into a study. You may have to find a track via Search Tracks rather than via Search Study.

We've replaced portals with preference tags. Once you log in, you can make tags (upper left of the screen). Then as you make tracks visible, set colors etc., these preferences are associated with the tag you have selected in the popup menu in the upper left. To change 'projects', just pick a new tag from the menu.

To get you started, we've imported all of your portal preferences from TessLA.

Most of the old TessLA options are now popups in the browser.

The ability to upload files is missing. We will do this very shortly.

Experimental Plans

What is an experimental plan?

An experimental plan is a description of your samples, what you plan to measure about them, and the controls and experimental variables you are using.

We structure an experimental plan as follows. First there is an investigation which will hold the whole plan. Connected to the investigation are one or more studies. Each study contains one or more conditions and one or more assays.

A condition is a summary of the process of going from the organism or cell line to the extract you use in performing an assay. It includes treatments, tissue, organ, genetic background, time of day, age etc.

An assay is what technique you are going to apply to the sample including possible variations. In a typical ChIP-seq experiment, you will have an assay for the ChIP libraries, and another for the input library. If you ChIP for multiple targets, then each target will have its own assay.

A study is a combination of conditions and assays that are typically focussed on a particular part or stage of the experiment. The website presents a study as a table of conditions versus assays. Usually all of the cells in the table will be filled in by the end of the experiment. This gives a nice visual representation of the progress of the experiment.

Multiple studies can be put together under one investigation as needed.
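The structure described above (an investigation holding studies, each study a table of conditions versus assays) can be sketched with Python dataclasses. The class and field names are hypothetical and do not reflect the website's actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    name: str
    conditions: list  # e.g. ["WT", "KO"]
    assays: list      # e.g. ["ChIP", "input"]
    completed: set = field(default_factory=set)  # finished (condition, assay) cells

    def mark_done(self, condition, assay):
        if condition not in self.conditions or assay not in self.assays:
            raise ValueError("unknown condition or assay")
        self.completed.add((condition, assay))

    def table(self):
        """Rows are conditions, columns are assays; True marks a filled cell."""
        return [[(c, a) in self.completed for a in self.assays]
                for c in self.conditions]

    def progress(self):
        """Fraction of the table's cells that are filled in."""
        return len(self.completed) / (len(self.conditions) * len(self.assays))


@dataclass
class Investigation:
    title: str
    studies: list = field(default_factory=list)
```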

Why do I need to make one?

For a few reasons.

First, although we have long collected information such as organism, tissue, developmental stage, and condition, we sometimes get samples where this information does not capture aspects of the sample that are important to the experiment. So, we needed to make our sample description capture more information.

Second, the Core is frequently called upon to submit data to public repositories as part of getting your publication accepted. Having these experimental plans, and gathering detailed data up front will help make the publication process easier.

Third, as sequencing capacity has increased experiments have become more complex. Also, with time, investigators have accumulated a large number of samples, and these need a more coherent system to organize them and the subsequent analyses. We will use the experimental plans as a way to help organize, find, and share data.

Is it hard to make a plan?

No, it's very easy.

To make a plan submit a description to the Core using either the on-line form or email. We will write up a 'formal' plan then confer with you to finalize the plan.

Our goal is to capture the technical details of the experiment, but to still present the results using the terms or jargon that you normally use when discussing the experiment. For example, if you are knocking out Myod in heart muscle at E14.5 we can capture that technical detail, but give that condition a tag or nickname of 'KO'. Similarly, what you call the 'WT' in the experiment is probably not a wild mouse gathered from a field in the Poconos, but rather may be a BL6 with a loxP flanked gene. We will record this, but still give such a mouse the nickname 'WT' (or whatever you normally call it).

What about my old experiments?

We have used the PI, investigator, and sample information we have to estimate the design of all existing experiments. These estimated designs have been loaded into the database and are being used in the new website.

We know that the estimated designs are flawed and ugly. Also you may have several separate designs that really belong together as one.

Please feel free to set up a time to meet to correct earlier experimental plans - use the appointment calendar link to find a time we can meet.

What are Source IDs for?

The purpose of the source ID is to allow us to know which libraries come from the same biological source. The source ID can include

anonymized donor IDs
your local mouse IDs
sample prep IDs

Tracking source IDs will allow us to distinguish biological from technical replicates and help guide analyses.

Here are some examples of what we have in mind.

In an experiment on WT and KO mouse livers, source IDs such as WT1, WT2, WT3, WT4, KO1, KO2, KO3, KO4 are fine. They let us know that each sample came from a different mouse.

If you are testing a library or sample prep method on the same sample, and we have included the testing in the experiment design, you should use the same ID for all of the samples.
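To show how source IDs separate biological from technical replicates, here is a small sketch. The library names are invented and the functions are hypothetical; the idea is simply that libraries sharing a source ID are technical replicates, while different source IDs indicate different biological sources.

```python
from collections import defaultdict


def group_by_source(libraries):
    """libraries: iterable of (library_name, source_id) pairs."""
    groups = defaultdict(list)
    for name, source_id in libraries:
        groups[source_id].append(name)
    return dict(groups)


def technical_replicates(groups):
    """Source IDs that were used for more than one library."""
    return {source: libs for source, libs in groups.items() if len(libs) > 1}
```

For example, two prep methods tested on mouse WT1 would both carry the source ID `WT1` and be reported as technical replicates, while a library from WT2 would count as a separate biological replicate.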