Saturday, April 28, 2012

Downloading Data

The Basics

We have configured the website to use your account username and password to control access to files: by PI if you are a PI or lab member, or by investigation if you have collaborator status.

We have updated the download area to provide all data at a URL like this, where PI is replaced by your PI's last name:

  https://fgc.genomics.upenn.edu/Experiments/PI

You will be prompted for a user name and password; enter the credentials you use to log in to the rest of the site.
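For scripted downloads, the same credentials can in principle be supplied from a program. The following is a minimal Python sketch, not a tested recipe for this site: it assumes the server accepts HTTP Basic authentication (not confirmed here), and the function names `pi_url` and `fetch` are hypothetical helpers introduced only for illustration.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://fgc.genomics.upenn.edu/Experiments"  # base URL from this post


def pi_url(pi_name, relative_path=""):
    """Build the download URL for a PI folder, optionally with a sub-path."""
    url = f"{BASE_URL}/{quote(pi_name)}"
    if relative_path:
        url += "/" + quote(relative_path.strip("/"))
    return url


def fetch(url, username, password, dest):
    """Download url to dest. ASSUMPTION: the site uses HTTP Basic auth."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(manager)
    )
    with opener.open(url) as response, open(dest, "wb") as out:
        # Copy in 1 MiB chunks so large sequence files are not held in memory.
        while True:
            chunk = response.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)
```

A call would look like `fetch(pi_url("Smith", "basic/Fastq"), user, password, "listing.html")`, where "Smith" and the sub-path are placeholders rather than real folders.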

If you don't find the data you are looking for, check the deprecated links described below and let us know of the omission.

How the Files are Organized

Under each PI are a number of folders that correspond either to experiments (the old style) or to investigations with studies beneath them (the new style).

Within each experiment or investigation, there are a few common places to look for data.

'Raw' Data

This data is not really raw, but it has not undergone any real analysis beyond alignment.

  • basic/Fastq - files of read sequence in FASTQ format
  • basic/Fasta - files of read sequence in FASTA format
  • basic/Solexa - older style sequence or ELAND output files
  • basic/Export - alignment information split into unique and not-usable (repeat) alignments
  • basic/BedFiles - BED file format of uniquely aligning reads and perhaps SHP output files

Analysis Results

These are typically placed under the Analysis folder. Different types of analysis are organized in subfolders under that.

Tips and Tricks

WIG files for Profiles

At the moment, we generate profile data at full resolution, i.e., including the changes resulting from each read. Sometimes, these files can be found in basic/BedFiles with names of the form FGCNNNN_s_L-ucsc.ushp.tab.gz, e.g., FGC0138_s_7-ucsc.ushp.tab.gz. These files are not in WIG format; each line is simply chromosome, begin, end, score. They can be converted to WIG format with relatively simple programs.
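As one illustration of such a conversion, here is a minimal Python sketch that writes variableStep WIG. It rests on assumptions that should be checked against a real file: tab-separated columns, no header line, and 0-based half-open coordinates as in BED. The function name `tab_to_wig` is hypothetical.

```python
import gzip


def tab_to_wig(in_path, out_path):
    """Convert a gzipped chrom/begin/end/score table to variableStep WIG.

    ASSUMPTIONS: tab-separated columns, no header line, and 0-based
    half-open coordinates (as in BED). If the file is 1-based, adjust
    the start and span calculations accordingly.
    """
    current = None  # (chrom, span) of the declaration line last written
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            chrom, begin, end, score = line.rstrip("\n").split("\t")[:4]
            begin, end = int(begin), int(end)
            span = end - begin
            if span <= 0:
                continue  # skip empty or malformed intervals
            if (chrom, span) != current:
                # A new declaration line is needed when chrom or span changes.
                dst.write(f"variableStep chrom={chrom} span={span}\n")
                current = (chrom, span)
            dst.write(f"{begin + 1} {score}\n")  # WIG positions are 1-based
```

Because full-resolution profiles have intervals of many different lengths, the output may contain a large number of declaration lines. The four-column layout already resembles bedGraph, so that format may be a more natural target if your tools accept it.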

WIG format data can be obtained using the download feature in the TessLA browser. However, because modern data sets are so large, this approach is not very practical.

Deprecated URLs

Because the files sit on two different filesystems, you have to look at two slightly different URLs:

  https://fgc.genomics.upenn.edu/Experiments-1/PI

or


  https://fgc.genomics.upenn.edu/Experiments-2/PI

where, of course, you replace PI with your PI's last name. Note the https; that's important.


Monday, April 9, 2012

Bacterial Contamination

The Problem

A ChIP-Seq, RNA-Seq, or other library type looks good and sequences well, but has a very low alignment percentage, i.e., much less than 50%. Sometimes this is due to poor library construction that results in artefactual sequences, but once in a while the library is good but contains a large amount of bacterial sequence and relatively little of the intended species.

Confirming the Diagnosis

In the blastn section of the NCBI BLAST website there is a 'Whole-genome shotgun contigs (wgs)' database, which contains sequence from a wide range of organisms. If you convert 20 or so of your reads to FASTA format, you can paste them into the search window and see what species they match. If you get a bunch of excellent matches to a species other than the one you had hoped for, then that's the problem. The exact species can depend on the source of the contamination, such as local water, reagents, bacteria in the host species (e.g., gut flora), or dirt.
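Converting a handful of reads to FASTA takes only a few lines of code. The sketch below is a minimal Python example that assumes each FASTQ record is exactly four lines (no wrapped sequences); the function name `fastq_to_fasta` is hypothetical.

```python
import gzip
from itertools import islice


def fastq_to_fasta(fastq_path, n_reads=20):
    """Return the first n_reads records of a FASTQ file as FASTA text.

    ASSUMPTION: each record is exactly four lines (no wrapped sequences).
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    records = []
    with opener(fastq_path, "rt") as handle:
        while len(records) < n_reads:
            block = list(islice(handle, 4))
            if len(block) < 4:
                break  # end of file or truncated record
            header, sequence = block[0].strip(), block[1].strip()
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {header!r}")
            # FASTA keeps the read name and sequence and drops the qualities.
            records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")
```

Printing the returned string gives text that can be pasted straight into the search window. Taking the first reads in the file is the simplest choice; a random sample may be more representative if the start of the file is atypical.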

Tracking Down the Source

For bacterial contamination, the easiest way to find the source is to do PCR on water samples to see if you can find ribosomal sequence. The primers below hit the 16S rDNA gene in a wide variety of species but do not match mammals.

The forward primer, 5'-TCCTACGGGAGGCAGCAGT-3'
the reverse primer, 5'-GGACTACCAGGGTATCTAATCCTGTT-3'
and the probe, (6-FAM)-5'-CGTATTACCGCGGCTGCTGGCAC-3'

Something's Fishy!

Another possible problem is the use of salmon DNA as a blocking agent in ChIP-Seq experiments. This is rare these days, as we warn against it and the issue is more widely reported. The BLAST search described above should identify this problem as well.