IDOM/DRC FGC & PSOM NGSC Cores
Information on equipment, experimental techniques, and informatics for the IDOM/DRC Functional Genomics Core (FGC) and Next-Generation Sequencing Core (NGSC) for the Perelman School of Medicine at the University of Pennsylvania.
The blog lists recent posts here; use the Search Box to find posts of interest, e.g., ChIP-Seq. Please post comments and questions!
Friday, March 29, 2013
Analysis - Tools - ngsc-hitsclip-RLB
Our data processing for a single HITS-CLIP or miRNA-Seq library follows the same steps.
Wednesday, February 20, 2013
WebSite - Charges
The Charges tab on the front page of the website lists the charges that have been incurred for the investigations the core is conducting.
Scope
You only see charges for investigations that you have access to. PIs will therefore see charges for all of their investigations. Post-docs and grad students will only see charges for their own projects, not for others in the lab, unless they are part of the investigation.
Timing
After an initial break-in period, we will load charges fairly soon after the service has been provided. Charges may be loaded before it has been decided whether the service is billable, i.e., whether a run or lane has failed or not. This will be adjusted on an ongoing basis until, or even after, an invoice has been created.
What do we charge for?
The cores charge for quality assessments, making libraries, microarrays, sequencing, basic data processing (RLB), more advanced informatics, as well as CPU charges from the PGFI cluster.
What's in a charge?
The core, FGC or NGSC, is set depending on the PI's affiliation and the service, e.g., microarrays are always through FGC, but other services may be FGC or NGSC.
The invoice number is set once an invoice is prepared for billing.
The investigation should be clear, as should the Service code (what we did) and the service type.
A service is free if it was performed, but failed due to a problem with the sequencers or something we did wrong.
The % Billed is used to (1) handle the fractional charges from PGFI CPU usage and (2) split services between invoices.
The '$' is the list price. '$/Item' is the charge after the Free flag and % Billed are considered. This is what you will be charged.
The Instance Key is the name of the thing that the service was performed on or with, e.g.,
FGC0396/1 for a lane in a sequencing run
7890/7890 for bioA of sample 7890
msrx_FGC0350_3_hg18_rd for aligning and loading data from FGC0350/3 to hg18.
The description will usually be fairly cryptic unless we edit it by hand.
Started On and Ended On are the dates over which the service was performed.
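As a rough illustration of how the Free flag and % Billed might combine into '$/Item', here is a minimal Python sketch. The function name `charge_per_item`, the formula (list price times % Billed, zeroed when the service is free), and the prices used are assumptions inferred from the description above, not the core's actual billing code or price list.

```python
def charge_per_item(list_price, percent_billed=100.0, free=False):
    """Return a '$/Item' amount for one charge line (hypothetical formula)."""
    if not 0.0 <= percent_billed <= 100.0:
        raise ValueError("percent_billed must be between 0 and 100")
    if free:
        # A service that failed because of the sequencer or a core mistake
        # is marked Free and is not charged.
        return 0.0
    # % Billed scales the list price, e.g. when a service is split
    # between invoices.
    return round(list_price * percent_billed / 100.0, 2)


# Made-up list price, for illustration only.
split_charge = charge_per_item(1500.00, percent_billed=50.0)
failed_lane = charge_per_item(1500.00, free=True)
```

Under these assumptions, a service split evenly between two invoices shows half the list price on each, and a failed lane shows zero.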
Friday, February 8, 2013
FAQ Sample Queue
Introduction
The sample queue currently covers only the steps from sample drop off to sequencing. It does not yet automatically handle resequencing, but we are working on that. There is a diagram of our process at the bottom of the page.
This queue contains samples for both the NGSC and the FGC cores. Due to the time it takes to complete the various steps and the amount of manual review required, we will only update the queue once or twice a day.
QC
Quality control involves performing at least an Agilent bioanalyzer run and a qubit run. Additional runs may be needed if these fail or give contradictory results. We are evaluating the use of the Kapa system for QC, but this is still preliminary and is only being applied to selected samples.
We track the various evaluations in a spreadsheet, then upload the values once all checks have been done and we are confident in the results. Thus in many cases work is being done on a sample that may not yet be reflected in the queue.
This queue covers RNA and DNA, libraries just being sequenced, as well as libraries that need to be resequenced due to bad sequencing results.
After the bioanalyzer step, you will receive an email indicating the results, but further processing may be ongoing to refine the concentration.
After QC, samples are either marked BAD and processed no further, or are marked GOOD and move to library prep, pooling, or ready for sequencing, depending on the sample type.
Extraction
Extraction of RNA or DNA from cells is rare. Any RNA or DNA extracted will become a new sample and move to the QC queue.
Library Prep
Library prep covers making libraries from RNA or genomic DNA. These are typically time-consuming processes that take a day or two to process 8 samples.
The libraries produced are new samples that move to the QC queue. Usually a bioA is done immediately, so the QC is to get the precise molarities.
Microarrays
Microarrays are performed by the FGC core. These typically take about 2 to 3 person-days to perform. Currently Agilent is having trouble with their manufacturing and is not shipping any arrays.
Pooling
Pooling covers the dilution and mixing steps necessary to sequence one or more libraries. We are very careful with this step as it is essential to achieving maximum read counts and even distribution across all libraries in a pool. We qubit the dilutions and redilute if necessary. Entries in this stage represent individual libraries which will usually be pooled to take up just a lane or two.
Waiting to be Sequenced
This queue is the set of pools or individual samples that are ready to go. When a sample or pool will be sequenced in more than one lane, the queue entry will indicate this.
Sequencing Now
These samples are in runs that are going on 'now'. This includes runs that are just finishing or have recently finished. The end of a run may also be defined manually when a run is having trouble.
Recently Sequenced
Recently sequenced is just a list of the recent runs, in case you missed your sample going through the queue.
[Figure: Outline of sample flow through the NGSC or FGC Cores.]
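The stages described in this FAQ can be sketched as a small table of allowed moves. This is only an illustration: the dictionary `NEXT_STAGES` and the helper `can_move` are hypothetical names, the Pooling to Waiting to be Sequenced step is inferred from the stage order rather than stated outright, and microarrays are left out because the text does not say how they connect to the other stages.

```python
# Allowed moves between queue stages, as read from the FAQ text.
# Stage names mirror the section headings; this is not the core's software.
NEXT_STAGES = {
    # Extracted RNA or DNA becomes a new sample in the QC queue.
    "Extraction": {"QC"},
    # After QC a sample is BAD (stops) or GOOD (moves on by sample type).
    "QC": {"BAD", "Library Prep", "Pooling", "Waiting to be Sequenced"},
    # Libraries produced are new samples that go back through QC.
    "Library Prep": {"QC"},
    # Assumed: pooled libraries then wait for a sequencing run.
    "Pooling": {"Waiting to be Sequenced"},
    "Waiting to be Sequenced": {"Sequencing Now"},
    "Sequencing Now": {"Recently Sequenced"},
    "BAD": set(),
    "Recently Sequenced": set(),
}


def can_move(current, target):
    """Return True if a sample may go directly from current to target."""
    return target in NEXT_STAGES.get(current, set())


library_needs_qc = can_move("Library Prep", "QC")
bad_sample_continues = can_move("BAD", "Pooling")
```

Keeping the moves in one dictionary makes it easy to see that a BAD sample has nowhere to go, while a library made in the core loops back through QC before pooling.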
Wednesday, January 23, 2013
Survey - NGSC/FGC Sample Queue Privacy
This survey was conducted from Jan 22 to Jan 25 to determine what information NGSC/FGC users wanted displayed about samples in the sample queue, a feature that we are adding to the website. By Jan 23 we had 86 respondents (which is fantastic!), which gives us a very clear picture of what users want.
To summarize,
- most people want to see the principal investigator's and investigator's names.
- a significant percentage do not want investigation name or experiment or assay names visible
- a significant percentage do not want the sample name visible
- most people do want to see the date the sample was submitted.
So we will be including just these four sample identification columns in the queue:
- PI
- investigator
- submission date
- anonymous sample ID
Other data, such as time in the queue or estimated time until finishing, will be added once we can estimate them with reasonable accuracy.
Thanks for your feedback!
Responses per question, as percent of respondents (count):
Question | NO! | no | eh | yes | YES!
1. PI | 2.3% (2) | 5.8% (5) | 25.6% (22) | 47.7% (41) | 18.6% (16)
2. Investigator | 2.3% (2) | 5.8% (5) | 29.1% (25) | 40.7% (35) | 22.1% (19)
3. Investigation | 14.0% (12) | 25.6% (22) | 25.6% (22) | 26.7% (23) | 8.1% (7)
4. Cond & Assay | 19.8% (17) | 22.1% (19) | 26.7% (23) | 24.4% (21) | 7.0% (6)
5. Sample Name | 10.6% (9) | 21.2% (18) | 25.9% (22) | 29.4% (25) | 12.9% (11)
6. Submit Date | 0.0% (0) | 4.7% (4) | 22.1% (19) | 33.7% (29) | 39.5% (34)
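To see how the summary follows from the table, the counts can be grouped into "against" (NO! plus no) and "in favor" (yes plus YES!). The short Python sketch below does that tally. The counts are copied from the table; the grouping and the names `COUNTS` and `summarize` are illustrative choices, not part of the original survey analysis.

```python
# Response counts from the survey table, ordered NO!, no, eh, yes, YES!
COUNTS = {
    "PI": (2, 5, 22, 41, 16),
    "Investigator": (2, 5, 25, 35, 19),
    "Investigation": (12, 22, 22, 23, 7),
    "Cond & Assay": (17, 19, 23, 21, 6),
    "Sample Name": (9, 18, 22, 25, 11),
    "Submit Date": (0, 4, 19, 29, 34),
}


def summarize(counts):
    """Return (percent against, percent in favor) for one question."""
    total = sum(counts)
    against = 100.0 * (counts[0] + counts[1]) / total
    in_favor = 100.0 * (counts[3] + counts[4]) / total
    return round(against, 1), round(in_favor, 1)


for question, counts in COUNTS.items():
    against, in_favor = summarize(counts)
    print(f"{question:14s} against {against:5.1f}%  in favor {in_favor:5.1f}%")
```

Percentages are taken over the respondents who answered each question, so a question that someone skipped has a slightly smaller denominator.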
Monday, January 21, 2013
FAQ Barcodes
When multiplexing libraries, it is essential to pick barcodes that work together.
Check the instructions in your library kit carefully to find the barcode selection guide. It may not be obvious at first.
If you are using the FGC/NGSC's multiplexed ChIP-seq protocol, you will find the legal combinations on page 3.
Monday, December 17, 2012
Analysis - Tools - RUM
RUM
We use the RUM package from Grant et al to do the basic processing of RNA-Seq data. RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and for other down-stream analyses.
Files
Here is a typical set of files produced for a RUM analysis:
26642352304 RUM.sam - all alignments
9094911272 RUM_NU - non-unique alignments
48715025 RUM_NU.bedGraph.gz - bedGraph format for display
1002296 RUM_NU.bedGraph.gz.tbi - index of bedGraph format for display
213897294 RUM_NU.cov - coverage data for non-uniquely mapping reads
3687201046 RUM_Unique - unique alignments
52517756 RUM_Unique.bedGraph.gz - bedGraph format for display
830599 RUM_Unique.bedGraph.gz.tbi - index of bedGraph format for display
228549781 RUM_Unique.cov - coverage data for uniquely mapping reads
122977785 feature_quantifications-max.tab
122977785 feature_quantifications-max.tab-sorted
122977785 feature_quantifications-min.tab
122977785 feature_quantifications-min.tab-sorted
111522316 feature_quantifications_RLB-GENOME-TAG - expression levels of transcripts, exons, and introns.
5417514 inferred_internal_exons.bed
3126115 inferred_internal_exons.txt
30203897 junctions_all.bed
30203808 junctions_all.bed-sorted
18500890 junctions_all.rum
9149089 junctions_high-quality.bed
9148994 junctions_high-quality.bed-sorted
16384 log
4142 mapping_stats.txt - summary of how many reads mapped to genome or transcripts
289688 novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG
16384 postproc
5438152292 quals.fa - read qualities
5438152292 reads.fa - read sequences
449 rum_RLB-GENOME-TAG_preproc.sh
848 rumRLB-GENOME-TAG_proc.sh
1837 rum_job_config
3275 rum_job_report.txt
1384 rum_runner.log
125 rum_sge_job_ids
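If you want to look inside one of the bedGraph.gz files from the listing above without a genome browser, a minimal Python sketch follows. It assumes the files use the standard four-column bedGraph layout (chromosome, start, end, value) and that the compression is gzip-compatible, which bgzip output is. The function names `read_bedgraph` and `total_signal` are made up for this example.

```python
import gzip


def read_bedgraph(path):
    """Yield (chrom, start, end, value) from a gzip-compressed bedGraph file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional header or comment lines.
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end, value = line.split()[:4]
            yield chrom, int(start), int(end), float(value)


def total_signal(path):
    """Sum value times interval length over all intervals in the file."""
    return sum((end - start) * value
               for _, start, end, value in read_bedgraph(path))
```

For example, `total_signal("RUM_Unique.bedGraph.gz")` would give the summed coverage over all intervals, provided the file really follows the assumed layout.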
Analysis - Tools - Comparison
We run this tool to do basic differential analysis. It is best used for RNA-Seq data, but can be used for other data as well.
Files
By default the files are created in Analysis/DiffExp. In this directory, you may find multiple analyses which use different data and/or parameters. Looking inside one of these directories, you will see 3 to 4 files called Compare.*, the most useful of which is Compare.tab.xls.
Compare.tab.xls
This file contains the comparison data. The contents are somewhat flexible, but will follow this outline.
Each row is a transcript. The first few columns contain the gene, transcript, and 'Best' (an indicator which guides you to the best transcript for each gene).
The next set of columns are various comparisons. Which comparisons are done depends on the experiment. For each comparison there are 6 columns.
- MVA:M:Test:Control - log2 test/control fold change
- MVA:A:Test:Control - log2 average expression
- EDGE:A:Test:Control - log2 average expression
- EDGE:M:Test:Control - log2 test/control fold change
- EDGE:pv:Test:Control - 0-1 p-value
- EDGE:FDR:Test:Control - 0-1 FDR from p-value using Benjamini-Hochberg correction
- MVA is a simple MvA comparison with no statistical significance.
- EDGE is the EdgeR package which performs differential gene expression on RNA-Seq data.
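For readers curious about what the Benjamini-Hochberg correction in the FDR column does to a list of p-values, here is a minimal Python sketch of the standard step-up adjustment. It only illustrates the arithmetic; the values in Compare.tab.xls come from the analysis pipeline itself, and the function name `benjamini_hochberg` is just a label chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (FDR) in the same order as the input."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by the number of tests divided by its rank, which is why the FDR for a transcript is never smaller than its p-value.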
M values are log2(Test/Control), so M=1 indicates a 2-fold increase in expression.
A values are log2 of the average expression between two conditions. MvA and EdgeR use different units: MVA is usually reads, whereas EdgeR values have been normalized to counts per million.
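As a small worked example of the M and A definitions above, the Python sketch below computes both for a single transcript. The pseudocount, which avoids taking the log of zero, and the function name `m_and_a` are assumptions made for this illustration; the actual tool may handle zeros differently.

```python
import math


def m_and_a(test, control, pseudocount=1.0):
    """Return (M, A) for one transcript from two expression values."""
    t = test + pseudocount
    c = control + pseudocount
    m = math.log2(t / c)          # log2 fold change, Test over Control
    a = math.log2((t + c) / 2.0)  # log2 of the average expression
    return m, a


# With no pseudocount, 200 reads in Test versus 100 in Control gives M = 1.
m, a = m_and_a(200, 100, pseudocount=0.0)
```

A negative M means expression is lower in Test than in Control, so M=-1 corresponds to a 2-fold decrease.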
The next set of columns of the file are quantile-normalized log2 versions of the 'raw' data for the individual samples.
The last set of columns are the 'raw' data, which is usually reads.
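One common way to use this file is to pull out the transcripts that pass an FDR and fold-change cutoff for a given comparison. The Python sketch below shows the idea. It assumes Compare.tab.xls is tab-delimited text with a single header row and that the column names follow the EDGE:FDR:Test:Control pattern described above, with Test and Control replaced by your condition names. The function name `significant_rows` and the default thresholds are arbitrary choices for illustration.

```python
import csv


def significant_rows(path, test, control, max_fdr=0.05, min_abs_m=1.0):
    """Return rows whose EdgeR FDR and |M| pass the given cutoffs."""
    fdr_col = f"EDGE:FDR:{test}:{control}"
    m_col = f"EDGE:M:{test}:{control}"
    hits = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                fdr = float(row[fdr_col])
                m = float(row[m_col])
            except (ValueError, TypeError):
                # Skip rows with empty or non-numeric cells.
                continue
            if fdr <= max_fdr and abs(m) >= min_abs_m:
                hits.append(row)
    return hits
```

A call such as `significant_rows("Compare.tab.xls", "KO", "WT")` would then return the rows with at least a 2-fold change at an FDR of 5% or better, where KO and WT stand in for your own condition names. A misspelled condition name raises a `KeyError` rather than silently returning nothing.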
Looking Deeper
Within each Comparison folder is another called 'Heatmap'. See http://fgc-ngsc-cores.blogspot.com/2012/09/analysis-tools-rum-multiplecomparisons.html for details about the files in this folder.