Friday, March 29, 2013

Analysis - Tools - ngsc-hitsclip-RLB

Our data processing for a single HITS-CLIP or miRNA-Seq library follows the same steps.

Trim adapter sequence from reads

Since the miRNA sequences are shorter than the 36 or 50 nt that we sequence, we trim off the 3' adapter sequence from the 3' end of the reads.  We allow up to 1 mismatch per 4nt of adapter sequence.
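
As a rough illustration of the trimming rule, the Python sketch below scans a read for the 3' adapter and accepts a match when there is at most 1 mismatch per 4 nt of adapter compared. The function name, the example adapter sequence, the minimum overlap, and the handling of a partial adapter at the end of the read are all assumptions made for illustration; this is not the core's actual trimming code.

```python
def trim_adapter(read, adapter, min_overlap=4):
    """Return the read with the 3' adapter (and anything after it) removed.

    Hypothetical helper: allows 1 mismatch per 4 nt of the adapter that
    overlaps the read, and accepts a partial adapter at the read's 3' end.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        allowed = overlap // 4  # 1 mismatch per 4 nt of adapter compared
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter) if a != b
        )
        if mismatches <= allowed:
            return read[:start]
    return read  # no adapter found; leave the read unchanged


# Example with a 22 nt insert and a made-up adapter, read out to 36 nt.
insert = "TGAGGTAGTAGGTTGTATAGTT"
adapter = "TCGTATGCCGTCTTCTGCTTG"
read = (insert + adapter)[:36]
trimmed = trim_adapter(read, adapter)
```

With very short overlaps the mismatch allowance makes spurious matches more likely, so a real trimmer would probably need a stricter rule near the end of the read.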

Align Trimmed Reads

The trimmed reads are aligned using bowtie to (1) miRNA hairpins (miRBase), (2) RefSeq transcripts, and (3) the whole genome.  We allow up to 3 mismatches and require a unique best match. The alignment counts and percentages are tallied to assess the quality of the library, i.e., how much is coming from miRNAs versus degraded mRNA.

Quantitate Mature Forms

We count the number of aligned trimmed reads that overlap the annotated mature forms. If no mature form is annotated for either arm of a precursor hairpin, then we do a naive inference and take the mature form as half of the hairpin.
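
The naive inference and the overlap count can be sketched in Python as follows. This assumes the hairpin is split at its midpoint into a 5' half and a 3' half, and that reads are given as half-open (start, end) intervals on the hairpin; the function names, the "5p"/"3p" labels, and the example coordinates are hypothetical.

```python
def infer_mature_arms(hairpin_length):
    """Split a hairpin into 5' and 3' halves as half-open (start, end) intervals."""
    mid = hairpin_length // 2
    return {"5p": (0, mid), "3p": (mid, hairpin_length)}


def count_overlapping(read_intervals, mature):
    """Count reads whose (start, end) interval overlaps the mature interval."""
    m_start, m_end = mature
    return sum(1 for start, end in read_intervals if start < m_end and end > m_start)


# Hypothetical 80 nt hairpin with no annotated mature forms.
arms = infer_mature_arms(80)
reads = [(5, 27), (6, 28), (50, 72), (38, 60)]  # made-up alignments
counts = {arm: count_overlapping(reads, span) for arm, span in arms.items()}
```

Note that a read spanning the midpoint, such as (38, 60) here, is counted for both halves under this simple overlap rule.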

Differential Analysis

Differential expression or differential loading analysis is done using the read counts which are converted to RPM, quantile normalized, then analyzed by SamR (older data sets) or EdgeR (newer data sets, when there are at least 2 replicates) to generate p-values and FDRs.
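
The two normalization steps before the statistical test can be sketched in plain Python. This is only an illustration of RPM scaling followed by quantile normalization, with made-up counts; the function names are hypothetical, ties are broken by position as a simplification, and the SamR or EdgeR step is not shown.

```python
def to_rpm(counts):
    """Scale one library's raw read counts to reads per million (RPM)."""
    total = sum(counts)
    return [1e6 * c / total for c in counts]


def quantile_normalize(columns):
    """Quantile normalize equal-length columns (one list per library).

    Each value is replaced by the mean, across libraries, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Made-up counts for three miRNAs in two libraries.
lib_a = [500, 300, 200]
lib_b = [100, 700, 200]
rpm = [to_rpm(lib_a), to_rpm(lib_b)]
norm = quantile_normalize(rpm)
```

After normalization every library has the same distribution of values; only the assignment of values to miRNAs differs.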

Footprint




Wednesday, February 20, 2013

WebSite - Charges

The Charges tab on the front page of the website lists the charges that have been incurred for the investigations the core is conducting.

Scope

You only see charges for investigations that you have access to.  PIs will therefore see charges for all of their investigations.  Post-docs and grad students will only see charges for their own projects, not for others in the lab, unless they are part of the investigation.

Timing

After an initial break-in period, we will load charges fairly soon after the service has been provided. Charges may be loaded before it has been decided whether the service is billable, i.e., whether a run or lane has failed or not.  Charges will be adjusted on an ongoing basis until, or even after, an invoice has been created.

What do we charge for?

The cores charge for quality assessments, making libraries, microarrays, sequencing, basic data processing (RLB), more advanced informatics, as well as CPU charges from the PGFI cluster.

What's in a charge?

The core, FGC or NGSC, is set depending on the PI's affiliation and the service, e.g., microarrays are always through FGC, but other services may be FGC or NGSC.

The invoice number is set once an invoice is prepared for billing.

The investigation should be clear as well as the Service code (what we did) and the service type.

A service is free if it was performed, but failed due to a problem with the sequencers or something we did wrong.

The % Billed is used to (1) handle the fractional charges from PGFI CPU usage and (2) split services between invoices.

The '$' is the list price.  '$/Item' is the charge after the Free and % Billed values are considered. This is what you will be charged.
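
One way to read the relationship between these fields is sketched below in Python. The formula (zero if the service is free, otherwise list price times % Billed) is our interpretation of the description above, and the function name, rounding, and prices are made up for illustration; the billing system itself may compute this differently.

```python
def charge_per_item(list_price, free, percent_billed):
    """Return the '$/Item' amount under the assumed rule.

    Assumption: a free service costs nothing; otherwise the charge is
    the list price scaled by the % Billed.
    """
    if free:
        return 0.0
    return round(list_price * percent_billed / 100.0, 2)


# Made-up prices for illustration only.
full = charge_per_item(1200.00, free=False, percent_billed=100)
split = charge_per_item(1200.00, free=False, percent_billed=50)
failed = charge_per_item(1200.00, free=True, percent_billed=100)
```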

The Instance Key is the name of the thing that the service was performed on or with, e.g.,
FGC0396/1 for a lane in a sequencing run
7890/7890 for bioA of sample 7890
msrx_FGC0350_3_hg18_rd for aligning and loading data from FGC0350/3 to hg18.

The description will usually be fairly cryptic unless we edit it by hand.

Started On and Ended On are the dates over which the service was performed.




Friday, February 8, 2013

FAQ Sample Queue

Introduction 

The sample queue currently covers only the steps from sample drop off to sequencing.  It does not yet automatically handle resequencing, but we are working on that. There is a diagram of our process at the bottom of the page.

This queue contains samples for both the NGSC and the FGC cores. Due to the time it takes to complete the various steps and the amount of manual review required, we will only update the queue once or twice a day.

QC

Quality control involves performing at least an Agilent bioanalyzer run and a qubit run. Additional runs may be needed if these fail or give contradictory results.  We are evaluating the use of the Kapa system for QC but this is still preliminary and is only being applied to samples selectively.

We track the various evaluations in a spreadsheet, then upload the values once all checks have been done and we are confident in the results.  Thus in many cases work is being done on a sample that may not yet be reflected in the queue.

This queue covers RNA and DNA samples, libraries submitted only for sequencing, and libraries that need to be resequenced due to bad sequencing results.

After the bioanalyzer step, you will receive an email indicating the results, but further processing may be ongoing to refine the concentration.

After QC, samples are either marked BAD and processed no further, or are marked GOOD and move to library prep, pooling, or ready for sequencing, depending on the sample type.

Extraction

Extraction of RNA or DNA from cells is rare.  Any RNA or DNA extracted will become a new sample and move to the QC queue.

Library Prep

Library prep covers making libraries from RNA or genomic DNA. This is typically a time-consuming process that takes a day or two for 8 samples.

The libraries produced are new samples that move to the QC queue.  Usually a bioA is done immediately, so the QC is to get the precise molarities.

Microarrays

Microarrays are performed by the FGC core. These typically take about 2 to 3 person-days to perform. Currently Agilent is having trouble with their manufacturing and is not shipping any arrays.

Pooling

Pooling covers the dilution and mixing steps necessary to sequence one or more libraries.  We are very careful with this step as it is essential to achieving maximum read counts and even distribution across all libraries in a pool.  We qubit the dilutions and redilute if necessary.  Entries in this stage represent individual libraries which will usually be pooled to take up just a lane or two.

Waiting to be Sequenced

This queue is the set of pools or individual samples that are ready to go. When a sample or pool will be sequenced in more than one lane, the queue entry will indicate this.

Sequencing Now

These samples are in runs that are going on 'now'.  This includes runs that are just finishing or have recently finished.  The end of a run may also be defined manually when a run is having trouble.

Recently Sequenced

Recently sequenced is just a list of the recent runs, in case you missed your sample going through the queue.

Outline of sample flow through the NGSC or FGC Cores.


Wednesday, January 23, 2013

Survey - NGSC/FGC Sample Queue Privacy

This survey was conducted from Jan 22 to Jan 25 to determine what information NGSC/FGC users wanted displayed about samples in the sample queue, a feature that we are adding to the website.  By Jan 23 we had 86 respondents (which is fantastic!) which allows us to get a very clear picture of what users want.

To summarize,
  1. most people want to see the principal investigator's and investigator's names.
  2. a significant percentage do not want investigation name or experiment or assay names visible.
  3. a significant percentage do not want the sample name visible.
  4. most people do want to see the date the sample was submitted.
Just a few people used the comment box and none of the comments altered the results above.  The details are below. Despite the relatively large percentage of Don't Care responses, there were very few respondents that put Don't Care for each question.

So we will be including just these four sample identification columns in the queue.  Other data, such as time in the queue or estimated time until finishing will be added once we can estimate them with reasonable accuracy.
  1. PI
  2. investigator
  3. submission date
  4. anonymous sample ID
With these answers in hand, we will implement the sample queue as quickly as we can.

Thanks for your feedback!

Question         | NO!      | no       | eh       | yes      | YES!
1. PI            |  2.3%  2 |  5.8%  5 | 25.6% 22 | 47.7% 41 | 18.6% 16
2. Investigator  |  2.3%  2 |  5.8%  5 | 29.1% 25 | 40.7% 35 | 22.1% 19
3. Investigation | 14.0% 12 | 25.6% 22 | 25.6% 22 | 26.7% 23 |  8.1%  7
4. Cond & Assay  | 19.8% 17 | 22.1% 19 | 26.7% 23 | 24.4% 21 |  7.0%  6
5. Sample Name   | 10.6%  9 | 21.2% 18 | 25.9% 22 | 29.4% 25 | 12.9% 11
6. Submit Date   |  0.0%  0 |  4.7%  4 | 22.1% 19 | 33.7% 29 | 39.5% 34


Monday, January 21, 2013

FAQ Barcodes

When multiplexing libraries, it is essential to pick barcodes that work together.

Check the instructions in your library kit carefully to find the barcode selection guide.  It may not be obvious at first.

If you are using the FGC/NGSC's multiplexed ChIP-seq protocol, you will find the legal combinations on page 3.

Monday, December 17, 2012

Analysis - Tools - RUM

RUM

We use the RUM package from Grant et al. to do the basic processing of RNA-Seq data.  RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and usable for other downstream analyses.

Files

Here is a typical set of files produced for a RUM analysis:

26642352304 RUM.sam - all alignments
 9094911272 RUM_NU - non-unique alignments
   48715025 RUM_NU.bedGraph.gz - bedGraph format for display
    1002296 RUM_NU.bedGraph.gz.tbi - index of bedGraph format for display
  213897294 RUM_NU.cov - coverage data for non-uniquely mapping reads
 3687201046 RUM_Unique - unique alignments
   52517756 RUM_Unique.bedGraph.gz  - bedGraph format for display
     830599 RUM_Unique.bedGraph.gz.tbi - index of bedGraph format for display
  228549781 RUM_Unique.cov - coverage data for uniquely mapping reads
  122977785 feature_quantifications-max.tab
  122977785 feature_quantifications-max.tab-sorted
  122977785 feature_quantifications-min.tab
  122977785 feature_quantifications-min.tab-sorted
  111522316 feature_quantifications_RLB-GENOME-TAG - expression levels of transcript, exons, and introns.
    5417514 inferred_internal_exons.bed
    3126115 inferred_internal_exons.txt
   30203897 junctions_all.bed
   30203808 junctions_all.bed-sorted
   18500890 junctions_all.rum
    9149089 junctions_high-quality.bed
    9148994 junctions_high-quality.bed-sorted
      16384 log
       4142 mapping_stats.txt - summary of how many reads mapped to genome or transcripts
     289688 novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG
      16384 postproc
 5438152292 quals.fa - read qualities
 5438152292 reads.fa - read sequences
        449 rum_RLB-GENOME-TAG_preproc.sh
        848 rumRLB-GENOME-TAG_proc.sh
       1837 rum_job_config
       3275 rum_job_report.txt
       1384 rum_runner.log
        125 rum_sge_job_ids


Analysis - Tools - Comparison

We run this tool to do basic differential analysis.  It is best used for RNA-Seq data, but can be used for other data as well.

Files

By default the files are created in Analysis/DiffExp.  In this directory, you may find multiple analyses which use different data and/or parameters.  Looking inside one of these directories, you will see 3 to 4 files called Compare.*, the most useful of which is Compare.tab.xls.

Compare.tab.xls

This file contains the comparison data.  The contents are somewhat flexible, but will follow this outline.

Each row is a transcript.  The first few columns contain the gene, transcript, and 'Best' (an indicator which guides you to the best transcript for each gene).

The next set of columns are various comparisons.  Which comparisons are done depends on the experiment.  For each comparison there are 6 columns.
  1. MVA:M:Test:Control - log2 test/control fold change 
  2. MVA:A:Test:Control - log2 average expression
  3. EDGE:A:Test:Control - log2 average expression
  4. EDGE:M:Test:Control - log2 test/control fold change
  5. EDGE:pv:Test:Control  - 0-1 p-value
  6. EDGE:FDR:Test:Control - 0-1 FDR from p-value using Benjamini-Hochberg correction
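
For readers unfamiliar with the Benjamini-Hochberg correction named in the FDR column, here is a minimal Python sketch of the standard procedure. It is a generic illustration with made-up p-values, not the code that produces the file, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDRs) in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the FDRs stay monotone in rank.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])  # made-up p-values
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller p-value never receives a larger FDR than a bigger one.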
The first word in each column title indicates the tool that is used to produce the data in the column.
  1. MVA is a simple MvA comparison with no statistical significance.
  2. EDGE is the EdgeR package which performs differential gene expression on RNA-Seq data.
The data that is passed to the analysis programs has been quantile normalized.
M values are the log2(Test/Control), so M=1 indicates 2-fold increase in expression.
A values are log2 of the average expression between two conditions.  MvA and EdgeR use different units: MVA is usually reads, whereas EdgeR values have been normalized to counts per million.
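
The M and A definitions above can be written out as a short Python sketch. It follows the wording here (A as log2 of the average of the two expression values) and uses made-up numbers; the function name is hypothetical, and zero expression values would need a pseudocount or similar handling, which is not shown.

```python
import math


def m_and_a(test, control):
    """Return (M, A): log2 fold change and log2 of the mean expression."""
    m = math.log2(test / control)
    a = math.log2((test + control) / 2.0)
    return m, a


# Made-up expression values: test is twice the control, so M is 1.
m, a = m_and_a(test=200.0, control=100.0)
```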

The next set of columns of the file are quantile normalized log2 versions of the 'raw' data for the individual samples.

The last set of columns are the 'raw' data which is usually reads.

Looking Deeper

Within each Comparison folder is another called 'Heatmap'.  See http://fgc-ngsc-cores.blogspot.com/2012/09/analysis-tools-rum-multiplecomparisons.html for details about the files in this folder.