Monday, December 17, 2012

Analysis - Tools - RUM

RUM

We use the RUM package from Grant et al to do the basic processing of RNA-Seq data.  RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and for other down-stream analyses.

Files

Here is a typical set of files produced for a RUM analysis:

26642352304 RUM.sam - all alignments
 9094911272 RUM_NU - non-unique alignments
   48715025 RUM_NU.bedGraph.gz - bedGraph format for display
    1002296 RUM_NU.bedGraph.gz.tbi - index of bedGraph format for display
  213897294 RUM_NU.cov - coverage data for non-uniquely mapping reads
 3687201046 RUM_Unique - unique alignments
   52517756 RUM_Unique.bedGraph.gz  - bedGraph format for display
     830599 RUM_Unique.bedGraph.gz.tbi - index of bedGraph format for display
  228549781 RUM_Unique.cov - coverage data for uniquely mapping reads
  122977785 feature_quantifications-max.tab
  122977785 feature_quantifications-max.tab-sorted
  122977785 feature_quantifications-min.tab
  122977785 feature_quantifications-min.tab-sorted
  111522316 feature_quantifications_RLB-GENOME-TAG - expression levels of transcript, exons, and introns.
    5417514 inferred_internal_exons.bed
    3126115 inferred_internal_exons.txt
   30203897 junctions_all.bed
   30203808 junctions_all.bed-sorted
   18500890 junctions_all.rum
    9149089 junctions_high-quality.bed
    9148994 junctions_high-quality.bed-sorted
      16384 log
       4142 mapping_stats.txt - summary of how many reads mapped to genome or transcripts
     289688 novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG
      16384 postproc
 5438152292 quals.fa - read qualities
 5438152292 reads.fa - read sequences
        449 rum_RLB-GENOME-TAG_preproc.sh
        848 rumRLB-GENOME-TAG_proc.sh
       1837 rum_job_config
       3275 rum_job_report.txt
       1384 rum_runner.log
        125 rum_sge_job_ids


Analysis - Tools - Comparison

We run this tool to do basic differential analysis.  It is best used for RNA-Seq data, but can be used for other data as well.

Files

By default the files are created in Analysis/DiffExp.  In this directory, you may find multiple analyses which use different data and/or parameters.  Looking inside one of these directories, you will see 3 to 4 files called Compare.*, the most useful of which is Compare.tab.xls.

Compare.tab.xls

This file contains the comparison data.  The contents are somewhat flexible, but will follow this outline.

Each row is a transcript.  The first few columns contain the gene, transcript, and 'Best' (an indicator which guides you to the best transcript for each gene).

The next set of columns are various comparisons.  Which comparisons are done depends on the experiment.  For each comparison there are 6 columns.
  1. MVA:M:Test:Control - log2 test/control fold change 
  2. MVA:A:Test:Control - log2 average expression
  3. EDGE:A:Test:Control - log2 average expression
  4. EDGE:M:Test:Control - log2 test/control fold change
  5. EDGE:pv:Test:Control  - 0-1 p-value
  6. EDGE:FDR:Test:Control - 0-1 FDR from p-value using Benjamini-Hochberg correction
The first word in each column title indicates the tool that is used to produce the data in the column.
  1. MVA is a simple MvA comparison with no statistical significance.
  2. EDGE is the EdgeR package which performs differential gene expression on RNA-Seq data.
The data that is passed to the analysis programs has been quantile normalized.
M values are the log2(Test/Control), so M=1 indicates 2-fold increase in expression.
A values are log2 of the average expression between two conditions.  MvA and EdgeR use different units: MVA is usually reads, whereas EdgeR values have been normalized to counts per million.
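
As a rough illustration of the M and A definitions above, here is a minimal Python sketch. It assumes both expression values are greater than zero and follows the wording above (A is the log2 of the average); the function name and the example numbers are made up for illustration and are not taken from the Comparison tool itself.

```python
import math

def m_and_a(test, control):
    """Return (M, A) for one transcript, following the definitions above.

    Assumes test and control are both greater than zero.
    """
    m = math.log2(test / control)          # log2 fold change, test over control
    a = math.log2((test + control) / 2.0)  # log2 of the average expression
    return m, a

# A transcript with 200 reads in the test and 100 in the control
# has M = 1, i.e., a 2-fold increase.
m, a = m_and_a(test=200.0, control=100.0)
```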

The next set of columns of the file are quantile normalized log2 versions of the 'raw' data for the individual samples.

The last set of columns are the 'raw' data which is usually reads.

Looking Deeper
Within each Comparison folder is another called 'Heatmap'.  See http://fgc-ngsc-cores.blogspot.com/2012/09/analysis-tools-rum-multiplecomparisons.html for details about the files in this folder.

Thursday, November 8, 2012

How to submit samples


How to submit your samples
  • Minimum volume of 15ul
  • Minimum concentration of 10nM
  • Please submit your samples in 1.5-2.0ml tubes
  • List of your samples by name on tube
  • Index number per sample
  • Source ID (Mouse ear tag #, cell line, etc.- something that ties this library back to its source) 
  • Sequence length and type (50SR, 100SR, 100PE)
  • Bring any custom primers if applicable
Please keep in mind that your samples will be handled often. In order to keep them from getting mixed up, please submit samples in tubes that are clearly labeled and intact (no unattached caps). 




Wednesday, October 24, 2012

FAQ What Does It Cost

FAQ - What Does It Cost?

The FGC and NGSC cores each have a price sheet posted under the 'Pricing' option in the getting started menu.

The FGC only serves clients in the IDOM/DRC, except for microarrays which are
open to UPenn and other academic institutions.

In both cases, academic institutions outside of Penn pay a higher amount to make up for the lack of grant overhead.

For maximum flexibility there are separate charges for 
  1. sample and library quantification 
  2. library preparation
  3. sequencing, charged by the lane
  4. standard analyses, charged by the sample
  5. advanced analyses, only from the FGC.
The exact cost of an experiment depends on many factors, so it is best estimated with the help of the FGC or NGSC technical director after submitting an experiment request.

FAQ Old TessLA Browser

Never fear, it's here.

We also have a button on the front page and a link in the Results menu at the top of the page.

Users should begin to use only the portals that are named for the PI, e.g., Eric-Thepi-Lab.

FAQ Where Is My Old Data

  1. The cores keep tape back ups of every run.
  2. Most runs are kept on our sequencing server for a few weeks.
  3. The fastq files and subsequent analyses are stored on the PGFI cluster.
    1. The website can be used to download any of these files.
    2. Except for very old microarray data that we have archived, almost all data is available this way.
    3. People with PGFI accounts can access these files directly.
  4. Many analyses are attached to tracks in the browser and can be downloaded using our genome browser.

What the Future Holds

The volume of data is quite large and is now a major part of our operating expenses.  Therefore in the future we will be moving to a more aggressive purging of on-line data storage.  We will advise you as plans move forward for this.

FAQ Downloading Data

See this entry.

We have also added a big Data Download button on the home page to take you to your (PI's) download area.

FAQ How Are My Samples Doing?

We will be extending the website and our notification system to provide more information but for now, here's what to expect.
  1. Sample Quality Checks
    1. takes a few days
    2. core staff will usually send an email with the results
  2. Scheduling a Run
    1. you may have to wait for a flow cell to be full before it can be sequenced
    2. this time can vary widely depending on the type of run and our workload.
  3. Sequencing
    1. once the flow cell is scheduled here are the approximate times
      1. clustering - 0.5 days
      2. sequencing
        1. 50SR - 2 days
        2. 100SR - 4 days
        3. 50PE - 4 days
        4. 100PE - 11 to 14 days
  4. Fastq generation
    1. usually done the weekday the run finishes
  5. Basic Analysis
    1. ChIP-Seq or HITS-CLIP alignments and peak calling
      1. takes 1 to 2 days
    2. RNA-Seq
      1. takes about a week

FAQ When Can I Bring My Samples In?

  1. Do not bring samples in until we have told you that the investigation is ready for samples.
  2. We accept samples Tue-Thurs from 11-12 and 3-4. On Mondays we only accept samples from 3-4 and on Fridays we only accept samples from 11-12. Also, check the Lab Calendar (found on our website) before coming in to see if there are any events scheduled which would prevent us from accepting samples.
  3. If you already have an investigation and want to extend it to include new conditions or assays, then submit a new experiment request.  We will process the request the next morning and once you hear from us you can bring the samples in.

FAQ Does the Core Make Libraries?

We offer library preparation services for:

  1. RNA-Seq using
    1. Illumina truSeq kits 200ng+ of total RNA

  2. Agilent SureSelect Exome Capture
We can train you to make other libraries, but do not offer these services.

Why You Should Make Your Own Libraries

We are planning on purchasing a robot to automate this process, but for now library prep is a time-consuming task that can take a while for us to complete.

FAQ How to Make Libraries

Wow that's a big question!  It's too much to handle in one FAQ, so here are links to individual pages for various library types.


FAQ Getting Started

FAQ - How do I Get Started?

The basic things you need to do are:

  1. Read the other FAQs about experiment design and library prep
  2. If you still have questions, check the Consultation Calendar and propose a time to meet to discuss your questions.
  3. Have your PI create an account (this only needs to be done once.)
  4. Create an account for yourself (this only needs to be done once.)
  5. Submit an experiment request.
  6. The next morning we will review experiment requests and contact you to resolve any questions.
  7. We will notify you that we can accept samples.
  8. Bring samples by at either 11-12 or 2-3 Monday to Friday.

FAQ Which Core To Use


There are a few Cores at Penn that do DNA sequencing.  Here is what we offer.


NGSC

The NGSC has 3 Illumina hiSeq2000s and a miSeq. Here is what these machines are good for:


hiSeq2000

The hiSeq2000 is good for these techniques (and their many variations): RNA-Seq, ChIP-Seq, miR-Seq, HITS-CLIP, exome capture, BIS-Seq, and whole genome sequencing in mammals.
There are two aspects of ultra-high throughput sequencing that are important: counts and coverage.  Counts are important for RNA-Seq, ChIP-Seq, miR-Seq, and HITS-CLIP.  Coverage is important for exome capture, BIS-Seq, and whole genome sequencing.   The hiSeq2000 generates sequence for about 200 million fragments per lane.  For each fragment the hiSeq can produce single or paired-end 50bp or 100bp sequences.  Using 100bp paired-end sequencing, you can get up to 40Gb per lane.
In many cases a single lane can generate more counts or coverage than a sample needs.  In that case, it is important to use multiplexed adapters so we can sequence multiple samples per lane.  Multiplexed adapters are generally a good idea as they allow samples to be test sequenced for quality, then sequenced deeper as needed.
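
The arithmetic behind the per-lane figures above can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation that takes the approximate 200 million fragments per lane at face value; the function names are hypothetical and real yields vary from run to run.

```python
FRAGMENTS_PER_LANE = 200_000_000  # approximate hiSeq2000 yield quoted above

def bases_per_lane(read_length_bp, paired_end, fragments=FRAGMENTS_PER_LANE):
    """Total bases sequenced in one lane."""
    reads_per_fragment = 2 if paired_end else 1
    return fragments * reads_per_fragment * read_length_bp

def samples_per_lane(reads_needed_per_sample, fragments=FRAGMENTS_PER_LANE):
    """How many multiplexed samples fit in one lane at a given depth."""
    return fragments // reads_needed_per_sample

# 100bp paired-end: 200 million fragments x 2 reads x 100bp = 40Gb
gb_100pe = bases_per_lane(100, paired_end=True) / 1e9

# Samples per lane at 30 million reads each
n_at_30m = samples_per_lane(30_000_000)
```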

Technique           Typical Volume             Samples per Lane
RNA-Seq             30 to 200 million reads    1 to 6
ChIP-Seq            30 to 100 million reads    2 to 6
miR-Seq             10 million reads           20
HITS-CLIP           30 million reads           6
Exome capture       20-30x coverage            5 to 20
BIS-Seq             20-30x coverage            1/3
Genome Sequencing   20-30x coverage            1/3

miSeq

The miSeq uses the same libraries as the hiSeq2000. It generates only about 15 million fragments per lane, but runs very quickly, and can generate reads as long as 150bp (or longer).
It is good for sample testing, miR-Seq, amplicon sequencing, or the techniques above applied to small, e.g., bacterial genomes.

Thursday, October 4, 2012

Analysis - ChIP-Seq - Regional Enrichment

Introduction

This tool is used to identify statistically significant enrichment on regions that are defined relative to annotated regions of the genome, rather than to regions defined by the ChIP-Seq data itself.  It is typically used when the pattern of the ChIP-Seq target is so diffuse that standard peak callers have a difficult time identifying regions of enrichment.  In this case we use regions that are of a priori interest, such as promoters, gene bodies, CpG islands, etc., that are likely to contain enrichment for the ChIP target.  The ngsc-chipseq-RegionalEnrichment tool counts reads on these pre-defined regions of interest for both a ChIP and a control (usually input) sample, then uses a Fisher exact test and Benjamini-Hochberg correction to assess the enrichment of the ChIP signal on the region.  A new track is loaded with the enrichment ratio as the score, along with the p-value and FDR.

Details

A pseudocount of 1 is added to each region when computing the Fisher test.
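
To make the statistics concrete, here is a minimal Python sketch of a Fisher exact test per region followed by a Benjamini-Hochberg correction. It is not the ngsc-chipseq-RegionalEnrichment code: the function names are hypothetical, the test is assumed to be one-sided (enrichment of ChIP over control), and the pseudocount is assumed to be added to both the ChIP and the control count for the region. The example counts are made up.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_enrichment_p(chip_in, chip_total, ctrl_in, ctrl_total, pseudocount=1):
    """One-sided Fisher exact p-value for ChIP enrichment on one region."""
    a = chip_in + pseudocount       # ChIP reads on the region
    c = ctrl_in + pseudocount       # control reads on the region
    b = chip_total - chip_in        # ChIP reads elsewhere
    d = ctrl_total - ctrl_in        # control reads elsewhere
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denom = log_choose(n, col1)
    p = 0.0
    # Sum the hypergeometric probabilities of tables at least as extreme.
    for x in range(a, min(row1, col1) + 1):
        p += math.exp(log_choose(row1, x) + log_choose(n - row1, col1 - x) - log_denom)
    return min(p, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR for each p-value, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):    # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr

# Made-up counts: (ChIP reads on region, control reads on region)
regions = [(40, 10), (12, 11), (5, 9)]
chip_total, ctrl_total = 1000, 1000
p_values = [fisher_enrichment_p(ci, chip_total, ki, ctrl_total) for ci, ki in regions]
fdr_values = benjamini_hochberg(p_values)
```

Working in log space with lgamma keeps the calculation stable when the total read counts are in the millions.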


Friday, September 7, 2012

FGC or NGSC - which core to use?

The Functional Genomics Core (FGC) and the Next-Generation Sequencing Core (NGSC) provide similar services, but with some important differences.
Here is a summary to help you decide which core is right for your project.

FGC

  • high-throughput sequencing for IDOM/DRC members
  • downstream data analysis for IDOM/DRC members as capacity allows
  • Agilent microarrays for IDOM, UPenn, and academic clients
  • limited RNA-Seq library prep for IDOM/DRC members

NGSC

  • high-throughput sequencing for UPenn and academic clients
  • standardized basic preliminary data analysis for UPenn and academic clients
  • limited RNA-Seq library prep for UPenn, and academic clients

For the NGSC and FGC, prices are higher for external clients.

The good news is that you talk to the same people no matter which core you use.

Saturday, September 1, 2012

Analysis - Tools - RUM-MultipleComparisons

Introduction

We routinely run the pipeline RUM-MultipleComparisons to assess RNA-Seq data.  Although the tool includes the word 'RUM' in the title, it can work with gene expression values from a variety of RNA-Seq tools.

We are still expanding the analyses RUM-MultipleComparisons performs, but at the moment it includes these basic steps.
  1. Assembles a table of the raw data
  2. Filters to consider just transcripts
  3. Performs quantile normalization of the values
  4. Does a series of k-means clusterings of the data and displays the results as heatmaps
  5. Generates MvA plots of averages for all conditions
  6. Generates MvA plots of replicates within a condition
  7. Tabulates fold-changes between average values for all conditions
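
The quantile normalization in step 3 can be illustrated with a short Python sketch. This is a generic version of the technique, not the pipeline's own code: each value is replaced by the mean of the values that share its rank across samples, and ties are broken by input order, which is a simplifying assumption.

```python
def quantile_normalize(columns):
    """Quantile normalize a list of equal-length columns, one per sample."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share each rank across the samples.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        result.append(normalized)
    return result

# Two samples, three transcripts (made-up values)
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization every sample has the same distribution of values, so differences between samples reflect changes in rank rather than differences in overall depth.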

What Files Should I Look At?

First, take a look at the plots in the Replicates-mva.pdf and Kmeans-heatmap.pdf files so that you can see if the samples have good intra-condition consistency.  In addition, the heatmap file will help you see if the changes between conditions are consistent across samples, and roughly how many sets of expression patterns there are in the set.

Once you can see that the data is ok, turn to the Averages.tab file or the appropriate Kmeans-*-clusters.tab file to see gene IDs.  All of the tab files can be opened from within Excel which can be used to further filter the genes.  Gene lists can also be created for use with functional analysis.

How Do we Usually Run It?

 We usually focus on well-characterized RefSeqs, i.e., those with IDs like NM_* or NR_*.

What Does the Output Look Like?

 Plots

  • AllPairs-mva.png - a comparison of all samples in the data set.
  • Kmeans-heatmap.pdf - series of heatmaps using different numbers of clusters.  Yellow/white is high expression, red is low.
  • Pairs.pdf - MvA plots of all condition comparisons
  • Replicates-mva.pdf - MvA plots of replicates within a condition

Tables of Data

  • AllTranscriptReadCounts-sql.tab - initial raw data
  • AllTranscriptReadCounts.tab - data filtered to just transcripts
  • Averages.tab - averages over conditions with fold-changes for all comparisons
  • Details-Lg2-Qn.tab - quantile normalized values for individual samples
  • Kmeans-04-clusters.tab - details of genes in each cluster.
  • Kmeans-05-clusters.tab
  • Kmeans-06-clusters.tab
  • ...
  • Kmeans-28-clusters.tab
  • Kmeans-29-clusters.tab
  • Kmeans-30-clusters.tab

Thursday, August 30, 2012

Analysis - Tools - ConnectSpanToGene

Introduction

We routinely run the ConnectSpanToGene tool to associate ChIP-Seq peaks with genes, but it can be run to associate any set of regions with genes.

ConnectSpanToGene takes a gene track, a span track (which has the peaks), and parameters (MaxDistBp, ToleranceBp, TolerancePct) controlling how far away from a span we will look for a gene.  It outputs a tab-delimited file containing the results, which we usually convert to an Excel file and load into the database (attached to the region track).

How Does it Work?

The program considers each span in the span track.  It first reports any gene that overlaps with the span.  It then puts the nearby genes in order by the distance from the span to the transcription start site (TSS), with the closest TSS first.  If the distance (D) to the first gene is less than the distance threshold (MaxDistBp), then the gene is reported.  Any other genes that are closer to the span than D * (1 + TolerancePct/100) and (D + ToleranceBp) are also reported.  This process is then repeated for each span in the span track, with the value of D recomputed for each span.

 How do We Usually Run it?

We usually run with the following settings:

MaxDistBp    = 100,000
ToleranceBp  =  50,000
TolerancePct =      50

So the first TSS must be within 100kb.  Note that if the span overlaps a gene, then more distant TSSs may be reported as well.
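
The selection rule can be sketched in Python as follows. This is a reading of the description above, not the ConnectSpanToGene source: the names are hypothetical, all genes are assumed to be on the same chromosome as the span, the distance is taken as zero when the TSS falls inside the span, and "closer than D * (1 + TolerancePct/100) and (D + ToleranceBp)" is taken to mean that both limits must hold.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    begin: int
    end: int
    strand: str

def tss(gene):
    """Transcription start site: the begin on '+' strand, the end on '-'."""
    return gene.begin if gene.strand == "+" else gene.end

def distance_to_tss(span_begin, span_end, gene):
    t = tss(gene)
    if span_begin <= t <= span_end:
        return 0
    return min(abs(t - span_begin), abs(t - span_end))

def overlaps(span_begin, span_end, gene):
    return gene.begin <= span_end and span_begin <= gene.end

def connect_span(span_begin, span_end, genes, max_dist_bp=100_000,
                 tolerance_bp=50_000, tolerance_pct=50):
    """Return the genes reported for one span."""
    # Genes overlapping the span are always reported.
    reported = [g for g in genes if overlaps(span_begin, span_end, g)]
    ranked = sorted(genes, key=lambda g: distance_to_tss(span_begin, span_end, g))
    if ranked:
        d = distance_to_tss(span_begin, span_end, ranked[0])
        if d < max_dist_bp:
            # Both tolerance limits must hold, so use the smaller one.
            limit = min(d * (1 + tolerance_pct / 100), d + tolerance_bp)
            for g in ranked:
                near = g is ranked[0] or distance_to_tss(span_begin, span_end, g) < limit
                if near and g not in reported:
                    reported.append(g)
    return reported

# Made-up example: the closest TSS is 10,000bp away, so the limit is 15,000bp.
example_genes = [
    Gene("A", 110_500, 120_000, "+"),
    Gene("B", 114_000, 130_000, "+"),
    Gene("C", 120_000, 140_000, "+"),
]
example_hits = [g.name for g in connect_span(100_000, 100_500, example_genes)]
```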


What Does the Output Look Like?

 Here are the columns in the output file.
  1. Span-GenomeRelease - genome of the spans (and the genes)
  2. Span-Chromosome - chromosome of spans
  3. Span-BeginBp - begin of span
  4. Span-EndBp - end of span
  5. Span-Strand - '+' or '-' of span
  6. Span-Score - score of span - the meaning of the value depends on the span table.
  7. Span-Name - name of span
  8. GeneI - empty, or 1, 2, 3 etc. Is empty if there are no nearby genes
  9. AbsDistanceBp - absolute value of distance from span to gene TSS
  10. DistanceBp - distance from span to gene TSS.  Positive values are downstream of the TSS.
  11. Overlaps - yes/no, does the span overlap the gene.
  12. Gene-GenomeRelease - genome of the gene
  13. Gene-Chromosome - chromosome of the gene
  14. Gene-BeginBp - beginning of the gene
  15. Gene-EndBp - end of the gene
  16. Gene-Strand - '+' or '-' of gene
  17. Gene-Score - score of gene. Usually meaningless, but could be a differential expression value.
  18. Gene-Name - name of gene
  19. Span-Pvalue - may be empty, p-value for detection of the span
  20. Span-FDR - may be empty, FDR for detection of the span
  21. Span-ContentTag - extra stuff about the span.  Varies with the span table.
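
If you prefer to filter the tab-delimited output outside of Excel, a sketch like the following may help. It assumes the file has a header row using the column names listed above, which should be checked against your own file; the function name and the sample rows are made up, and only a few of the columns are shown.

```python
import csv
import io

def genes_near_spans(handle, max_abs_distance_bp):
    """Return (span name, gene name, signed distance) for genes within a cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if not row["GeneI"]:        # empty when the span has no nearby genes
            continue
        if int(row["AbsDistanceBp"]) <= max_abs_distance_bp:
            hits.append((row["Span-Name"], row["Gene-Name"], int(row["DistanceBp"])))
    return hits

# A tiny made-up table with a subset of the columns
sample = (
    "Span-Name\tGeneI\tAbsDistanceBp\tDistanceBp\tGene-Name\n"
    "peak1\t1\t1200\t-1200\tGeneA\n"
    "peak1\t2\t45000\t45000\tGeneB\n"
    "peak2\t\t\t\t\n"
)
close_hits = genes_near_spans(io.StringIO(sample), max_abs_distance_bp=10_000)
```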


Monday, August 27, 2012

FAQ-GettingStarted

Getting Started

Here is what you need to do to get started with the core.
  1. Lab PI makes an account.
  2. Experiment investigators make accounts under the PI.
  3. The next weekday morning the core staff activate the accounts.
  4. Create a New Experiment.
  5. The Core staff read your description then either load the experiment or contact you to ask for clarification.
  6. Once the core has loaded the investigation, you can bring your samples.
  7. We do quality checks on your samples and let you know how they look.
  8. We sequence the samples.
  9. We do the data analysis you requested.
 For subsequent experiments, pick up at step 4.


Thursday, August 9, 2012

How the Core Works

Overview

Here are the major steps involved in doing an experiment with the NGSC.

  1. Create accounts at the NGSC or FGC website using the Create New Account link.
    • The PI of the lab should create the first account.
    • The investigator(s) who will actually do the experiment should then create their account(s).
    • Creating a PI's account works best on Safari on a Mac - we are fixing this!
    • Investigators should make sure they pick their PI from the list.
    • We will activate the accounts within a day or so.
  2. If necessary meet with the technical director to help design the experiment.
    • Use the Appointment Calendar link to identify a time to meet.
    • Send him an email to propose the time.
  3. Create an experiment using the Create New Experiment form.
    • This is a basic description of the experiment you want to do.
    • The web page indicates the info we are looking for.
    • Please include your billing information.
  4. We will formalize the experiment description and load it into the database.
    • This process takes a few days.
    • We may send emails to clarify details of the experiment.
  5. Once we have loaded the experiment, bring your samples to the core.
    • Please do not bring the samples until the experiment has been loaded.
    • Be prepared to provide a source ID and a sample name for each sample.
  6. We will check the quality of the samples and report any problems.
  7. We will schedule the libraries for sequencing and report our progress.
  8. Data analysis will follow sequencing as soon as possible.

Wednesday, May 30, 2012

On-Line Tools

Here is a short list of on-line tools for genome data analysis and/or visualization:


  • DAVID/EASE
    • functional analysis
    • http://david.abcc.ncifcrf.gov/
  • HOMER
    • ChIP-Seq and other data analysis
    • http://biowhat.ucsd.edu/homer/ngs/index.html
  • CISTROME
    • ChIP-Seq analysis
    • http://cistrome.org/Cistrome/Cistrome_Project.html
  • GALAXY
    • ChIP-Seq and other analysis and visualization
    • https://main.g2.bx.psu.edu/
  • UCSC Genome Browser
    • http://genome.ucsc.edu/

Saturday, April 28, 2012

Downloading Data

The Basics

We have configured the website to use your account usernames and passwords to access all files by PI (if you are a PI or lab member) or by investigation if you have collaborator status.

We have updated the download area to provide all data at a URL like this:

  https://fgc.genomics.upenn.edu/Experiments/PI

You will be prompted for a user name and password; enter the credentials you use to log in to the rest of the site.

If you don't find the data you are looking for, check the deprecated links described below and let us know of the omission.

How the Files are Organized

Under each PI are a number of folders that correspond to experiments (the old style) or investigations and (under that) studies (which is the new style.)

Within each experiment or investigation, there are a few common places to look for data.

'Raw' Data

This data is not really raw, but has not undergone any real analysis beyond alignment.

  • basic/Fastq - files of read sequence in FASTQ format
  • basic/Fasta - files of read sequence in FASTA format
  • basic/Solexa - older style sequence or ELAND output files
  • basic/Export - alignment information split into unique and not-usable (repeat) 
  • basic/BedFiles - BED file format of uniquely aligning reads and perhaps SHP output files

Analysis Results

These are typically placed under the Analysis folder.  Different types of analysis are organized in subfolders under that.

Tips and Tricks

WIG files for Profiles

At the moment, we generate profile data at full resolution, i.e., including the changes resulting from each read.   Sometimes, these files can be found in basic/BedFiles with names in the form FGCNNNN_s_L-ucsc.ushp.tab.gz, e.g., FGC0138_s_7-ucsc.ushp.tab.gz.  These files are not in WIG format - they are just chromosome, begin, end, score.  They can be converted to WIG format with relatively simple programs.
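
As an example of such a conversion, here is a minimal Python sketch that turns rows of chromosome, begin, end, score into WIG variableStep lines. It assumes the begin and end values are 0-based, half-open coordinates as in BED files (an assumption that should be checked against your data), that rows are sorted by position within each chromosome, and that a new declaration line is acceptable whenever the interval width changes. The function name is hypothetical.

```python
def profile_rows_to_wig(rows):
    """Convert (chrom, begin, end, score) rows into WIG variableStep lines."""
    lines = []
    current = None  # (chrom, span) of the last declaration line written
    for chrom, begin, end, score in rows:
        span = int(end) - int(begin)
        if (chrom, span) != current:
            lines.append(f"variableStep chrom={chrom} span={span}")
            current = (chrom, span)
        # WIG positions are 1-based, so shift the 0-based begin by one.
        lines.append(f"{int(begin) + 1} {score}")
    return lines

# Rows as they would look after splitting each line of the file on tabs
example_rows = [
    ("chr1", "0", "25", "3"),
    ("chr1", "25", "50", "5"),
    ("chr1", "50", "60", "1"),
]
wig_lines = profile_rows_to_wig(example_rows)
```

For a gzipped file, the rows can be produced by opening it with gzip.open in text mode and splitting each line on tabs.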

WIG format data can be obtained using the download feature in the TessLA browser.  However, because modern data sets are so large this approach is not very effective.

Deprecated URLs

Because the files sit on two different filesystems, you have to look at two slightly different URLs:

  https://fgc.genomics.upenn.edu/Experiments-1/PI

or


  https://fgc.genomics.upenn.edu/Experiments-2/PI

where, of course, you replace PI with your PI's last name.  Note the https, that's important.


Monday, April 9, 2012

Bacterial Contamination

The Problem

A ChIP-seq, RNA-Seq, or other library type looks good and sequences well, but has a very low alignment percentage, i.e., much less than 50%.  Sometimes this is due to poor library construction that results in artefactual sequences, but once in a while the library is good but contains a large amount of bacterial sequence and relatively little of the intended species.

Confirming the Diagnosis

In the blastn section of the NCBI BLAST website there is a 'Whole-genome shotgun contigs (wgs)' database which contains sequence from a wide range of organisms.  If you convert 20 or so of your reads to FASTA format, you can paste them into the search window and see what species they match.  If you get a bunch of excellent matches to a species (other than the one you had hoped for) then that's the problem. The exact species can depend on the source of the contamination, i.e., local water, reagents, or from the bacteria in the host species, e.g., gut flora/fauna or dirt etc.
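
Converting a handful of reads to FASTA format is simple enough to do with a short script. The sketch below assumes the usual four-line Fastq records (ID, sequence, '+', qualities) with no wrapped lines; the function name is hypothetical.

```python
import itertools

def fastq_to_fasta(lines, max_reads=20):
    """Return the first max_reads Fastq records as FASTA text."""
    out = []
    it = iter(lines)
    # zip(it, it, it, it) walks the input four lines (one record) at a time.
    for header, seq, _plus, _qual in itertools.islice(zip(it, it, it, it), max_reads):
        out.append(">" + header.rstrip("\n")[1:])  # swap the leading '@' for '>'
        out.append(seq.rstrip("\n"))
    return "\n".join(out) + "\n"

example_fastq = [
    "@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read2\n", "GGCCTTAA\n", "+\n", "IIIIIIII\n",
]
fasta_text = fastq_to_fasta(example_fastq, max_reads=1)
```

The same function works on an open file handle, since a file iterates line by line.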

Tracking Down the Source

For bacterial contamination the easiest way to find the source is to do PCR on water samples to see if you can find ribosomal sequence.  The primers below hit the 16S rDNA gene in a wide variety of species but do not match mammals.

The forward primer, 5'-TCCTACGGGAGGCAGCAGT-3'
the reverse primer, 5'-GGACTACCAGGGTATCTAATCCTGTT-3'
and the probe, (6-FAM)-5'-CGTATTACCGCGGCTGCTGGCAC-3'

Something's Fishy!

Another possible problem is the use of salmon DNA as a blocking agent in ChIP-Seq experiments.  This is rare these days as we warn against it and the issue is more widely reported.  The BLAST search above should identify this problem as well.

Thursday, March 29, 2012

Scheduling Runs

Policy

Since hiSeq2000 flow cells contain 8 lanes, we need to collect 8 lanes-worth of libraries before we begin a run.  This is more than most investigators will need to get an experiment started and, often, more than they will have prepared at a time, especially with multiplexing.  Therefore we schedule flow cells from the libraries in the sequencing queue in a first-come, first-served fashion.

Because we do this scheduling it is actually counter-productive for clients to fill a flow cell and insist that the samples be sequenced together. There is rarely a technical reason to have everything sequenced on the same flow cell, so we ask that you let us schedule the flow cells to get maximum throughput.

Sequencing Queue

We have sequencing queues for each combination of sequencing parameters:

  • length (50 or 100bp)
  • paired-end or single-read
  • multiplexed or plain

At the moment 50bp PE sequencing is rare, so this is the one case where bringing 8 lanes of libraries would be helpful.

If you can't wait for a 50bp PE flowcell to fill, then we will put the library in the 100bp PE queue.

Thursday, March 22, 2012

How to Find Us

We are located in the Translational Research Center on the 12th floor.

12-156 Translational Research Center
3400 Civic Center Blvd Bldg 421
Philadelphia, PA 19104-5156

Floor Plan


Friday, March 9, 2012

HITS-CLIP

Summary

HITS-CLIP is roughly equivalent to ChIP-seq, except that it identifies the RNA bound by a protein rather than the DNA.  Additionally, when the target protein is Argonaute (Ago), the libraries can contain the miRNAs bound to the mRNA as well as the mRNA itself.

Library Prep



Data Analysis

Thursday, March 8, 2012

Fixed Bugs and Implemented Features

Summary

Here are the things we've fixed or added.


Pending Bugs and Feature Requests

Summary

We will post known bugs and desired features here for comment and prioritization.   As we fix bugs they will be moved to the Fixed Bugs post.

Errors and Omissions

Here's a list of the issues we are aware of and working to fix.
  1. Track scrooming is not centered.
  2. I can't upload new tracks from my files.
  3. Background condition information for experiments.

Wednesday, March 7, 2012

Initial Data Analysis

Summary

  • We offer a complete set of initial analyses for most techniques.
  • The initial analysis usually covers cleaning and trimming, alignment, and quantification.
  • In many cases differential analyses are available.
  • Charges may include the cost of computing, as many of the techniques may require up to 1000 hours of CPU time.
  • We do not offer advanced analysis, but will consider developing pipelines that will be useful to many investigations, e.g., for new techniques.

UCSC Files

Summary

The UCSC genome browser can load and generate a number of different file types. Their website has a good description of the formats, but we can add a few comments here.

BED Files

  • BED files are the standard method for delivering locations of 'small' sets of features, e.g., 40,000 transcription factor binding sites.
  • The files do not contain any sequence information.
  • BED files can be uploaded to UCSC and many other genome browsers to view results.

Fastq Files

Summary

  • Fastq files are the standard format for delivering 'raw' sequencing results.
  • Fastq files contain a unique ID for each read, the read sequence, and base qualities.
  • The files do not contain any alignment information.

Multiplexed Libraries

Summary

Because a single lane of a modern sequencer may have a capacity that exceeds what is necessary for a sample, Illumina (and other manufacturers) offer ways to put more than one library in a lane.  Illumina does this by using modified adapters that include a barcode in one end or both. The barcodes are sequenced in distinct phases of the sequencing procedure.  The reads from the lane are put into groups during the data processing phase.
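
The grouping step can be pictured with a small Python sketch. This is not the software used at the core: it is a simplified illustration that assigns each read to a sample when its barcode read matches exactly one expected barcode with at most one mismatch, and sets it aside otherwise. The function names, barcode sequences, and sample names are made up.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, barcode_to_sample, max_mismatches=1):
    """Group read IDs by sample. reads is an iterable of (read_id, barcode_read)."""
    groups = defaultdict(list)
    for read_id, index_seq in reads:
        matches = [sample for barcode, sample in barcode_to_sample.items()
                   if len(barcode) == len(index_seq)
                   and hamming(barcode, index_seq) <= max_mismatches]
        # Ambiguous or unmatched barcode reads are set aside.
        sample = matches[0] if len(matches) == 1 else "undetermined"
        groups[sample].append(read_id)
    return dict(groups)

barcodes = {"ATCACG": "sampleA", "CGATGT": "sampleB"}
lane_reads = [("r1", "ATCACG"), ("r2", "CGATGA"), ("r3", "TTTTTT")]
by_sample = demultiplex(lane_reads, barcodes)
```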

Limitations

Illumina offers recommendations for sets of barcodes to use when pooling small numbers of libraries.  Failure to follow these recommendations may result in the failure of the barcode sequencing and thus an inability to separate the libraries.

Sample Drop Off

When you bring samples in for sequencing, make sure you know what the barcodes are so that we can properly associate sample info with the barcode.

Exome Capture - Targeted Resequencing

Summary

Library Prep

Data Analysis

RNA-Seq


Summary

RNA-Seq is a technique for measuring RNA expression levels using high-throughput sequencing.  There are many variations, but the common theme is to create a library that contains as little rRNA as possible, but otherwise contains fragments of RNA.  The RNA fragments are sequenced, aligned to the genome and/or transcripts, then the number of reads hitting a transcript is counted to get the expression level of the transcript.  Additionally, information about exon usage and splice forms can be extracted as well.

Typical Recommendations

  • Use multiplexed libraries.
  • We charge for sequencing by the lane which will yield about 200 million single reads or read pairs.
  • You can pool libraries and sequence over multiple lanes to get the necessary depth.
  • If splicing is important to you then use 100bp paired-end sequencing (100PE).
  • These recommendations are for 'complex' organisms.  Shorter reads and/or fewer reads may be necessary for 'simpler' organisms.

A Basic Well-Designed Experiment

  • 3 or more replicates per condition
    • may not be enough if effect is small or variability is high.
  • 50 million reads (or read pairs) per sample

 A Quick Survey

  • 2 replicates
  • 30 million reads (or read pairs)
    • this is similar to a microarray experiment.
    • sequence deeper to get more detail on low-expressing transcripts.

 Deep Sequencing Experiment

  • 200 million reads (or read pairs)
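
To turn these designs into a number of lanes, a rough calculation like the one below can be used. It assumes about 200 million reads (or read pairs) per lane as quoted above and that all samples are multiplexed into one pool; the number of conditions in each design is made up for illustration, and the function name is hypothetical.

```python
READS_PER_LANE = 200_000_000  # approximate yield per lane quoted above

def lanes_needed(n_samples, reads_per_sample, reads_per_lane=READS_PER_LANE):
    """Lanes required to sequence a multiplexed pool, rounded up."""
    total = n_samples * reads_per_sample
    return -(-total // reads_per_lane)  # integer ceiling division

designs = {
    "basic: 2 conditions x 3 replicates at 50 million": lanes_needed(6, 50_000_000),
    "survey: 2 conditions x 2 replicates at 30 million": lanes_needed(4, 30_000_000),
    "deep: 1 sample at 200 million": lanes_needed(1, 200_000_000),
}
```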



Counting

Since the fundamental aspect of RNA-Seq is counting, it is important to get enough reads to adequately determine expression levels.  Common wisdom is that about 10-30 million reads is roughly equivalent to a microarray. However, many applications done at the core routinely use 50 to 100 or even 200 million reads to reach deeper into the transcriptome.  Also, if you are interested in quantifying exon or splice junction usage you will need more reads to adequately quantify these smaller features.
A characteristic of RNA-Seq is that at a given sequencing depth, the longer and/or more highly-expressed transcripts are more accurately quantified.  Shorter or lower expressed genes take more sequence to quantify.  Since the range of expression values is much higher than the range of mRNA sizes, expression level is the dominant force, but sequence length should not be discounted.  It is especially important when quantifying exons or splice junctions as each of these features will be captured by only a small portion of the reads from a gene.

Splice Forms

There are two main lines of evidence for identifying splice forms.  The first is explicit identification of a splice junction in the alignment of a read.  The number of such detected junctions will be increased by using paired-end sequencing as the second read gives the opportunity for more junctions to be discovered from a given fragment. The second is implicit deduction of the presence or absence of an exon from the length distribution of fragments and the distance between two ends of a paired-end sequenced fragment.  For example, in a gene with three exons and the two ends of a paired read aligning to exons 1 and 3, we can infer that exon 2 is unlikely to be in the transcript the reads came from if exon 2 is large and the average RNA insert length is small.
Assembling full-length splice forms is difficult with short single-read sequencing.
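
The three-exon example can be worked through numerically. The sketch below computes the fragment length implied by a read pair with and without exon 2, then compares each to an assumed insert length distribution. The coordinates, the mean insert length of 200bp, and the standard deviation of 30bp are all made-up values, and the function names are hypothetical.

```python
def implied_fragment_length(left_start, right_end, exons, skipped=()):
    """Transcript-space distance between the outer ends of a read pair.

    exons: sorted list of (begin, end) genomic intervals, 0-based half-open.
    skipped: indices of exons assumed to be absent from the transcript.
    """
    length = 0
    for i, (begin, end) in enumerate(exons):
        if i in skipped:
            continue
        lo, hi = max(begin, left_start), min(end, right_end)
        if hi > lo:
            length += hi - lo
    return length

def z_score(fragment_length, mean_insert, sd_insert):
    """How far a fragment length sits from the library's typical insert length."""
    return (fragment_length - mean_insert) / sd_insert

exons = [(1000, 1200), (2000, 2800), (4000, 4300)]  # exon 2 is 800bp long
left_start, right_end = 1100, 4100                  # pair ends in exons 1 and 3

with_exon2 = implied_fragment_length(left_start, right_end, exons)
without_exon2 = implied_fragment_length(left_start, right_end, exons, skipped=(1,))

z_with = z_score(with_exon2, mean_insert=200, sd_insert=30)
z_without = z_score(without_exon2, mean_insert=200, sd_insert=30)
```

With exon 2 included the implied fragment is 1000bp, far outside the assumed insert distribution, while skipping exon 2 gives 200bp, so the pair is much better explained by a transcript that lacks exon 2.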

Replicates

RNA-Seq is not fundamentally different from a microarray experiment in that replicates are essential to understand the inherent biological and technical variability in an experiment. However, though not recommended, it is possible to get p-values from single replicate comparisons.

Library Prep

rRNA Reduction

rRNA can be reduced in a few different ways.  First, it can be depleted by using rRNA (complementary) sequences attached to beads.  The sequences grab the rRNA, which is then removed by extracting the beads.  Second, it can be depleted by using poly-T sequences attached to beads which are used to pull out the poly-A mRNA from the total RNA.  Third, there is a dsDNase which can be used to remove double-stranded DNA from a library.  Since the rRNA fragments form the major portion, they will re-anneal before the other RNA and be digested.  Finally, some kits, e.g., one from Nugen, use special primers in the RNA to cDNA process to avoid reverse transcribing rRNA.

Costs

The Illumina tru-Seq kit costs about $75 per sample, but at the moment (2012/05/22) it is back-ordered.

Multiplexing

We strongly recommend that RNA-Seq libraries be multiplexed so that test runs can be done.  Additionally, this allows many libraries to be sequenced in a pool over many lanes which allows us to tune coverage to the needs of the experiment and to mitigate any lane-

Data Analysis

At the moment the Core uses RUM to analyze RNA-Seq data.  RUM is run on the cloud (AWS) and the charges reflect AWS usage charges and labor.