
Analysis Guide for Illumina De Novo Assembly Data

3141 Olive St. Saint Louis, MO 63103 T 314-436-9120 F 815-301-3459 sales@cofactorgenomics.com cofactorgenomics.com

Welcome to your data


Thanks for choosing to sequence with Cofactor Genomics. At Cofactor, we understand that you want to get the most out of your data. This packet is intended to guide you through your data so that you get exactly what you need. In the following pages, you'll find an explanation of the directory structure as well as the file format specifications used. We have included screen shots of a generic project for you to use as a guide. If you still have any questions about your data, please do not hesitate to contact us.

Directory Structure
When you first open your data, you will see that a Cofactor de novo project disk is typically organized into 2 main directories:

The Analysis folder is where you will find the final analysis of your data. This is data that has reached the end of our pipelines, and it is typically the best place to start looking at your data. Your directory will contain a combination of the folders below, based on what was requested for your specific project.

The Raw Data folder is where you will find the raw reads in your platform-specific format. These can be found in the Samples folder. This is the first directory to consult if you decide to use a different aligner or take a different approach from the one taken by Cofactor's analysis pipeline.

The Reference folder contains the reference genome that was used for alignments or, in the case of a de novo assembly project, for gene annotation (if gene annotation was performed).


Analysis Folder
Assembly folder
There are several ways to represent de novo assemblies of gDNA or RNA. One way is to provide all contigs produced by the assembly program; another is to provide only contigs greater than 100 bp. At Cofactor, we believe in providing our clients with all of their data, so our Assembly folder contains files that show the assembly stats and contig FASTA sequences for both the whole assembly and the ≥100 bp assembly. Any of these files can be opened in a simple text editor. When prompted to choose a program, use TextEdit on a Mac or WordPad on a PC. If you have trouble opening your files, please contact Cofactor Genomics for help.

The xxx.stats.txt file reports the assembly metrics taking all assembled contigs into account. This file is generated automatically by our analysis pipeline and can be thought of as a snapshot of the whole assembly. The file listed as xxx.stats.txt.100 reports the assembly metrics when contigs smaller than 100 bp are removed. Many of these short contigs could contain low-quality reads or systematic noise.

The xxx.contig.fa.txt file provides the sequence of each contig in the assembly in FASTA format. This format is shown below:

>a5100;56 K: 25 length: 153
TCCACCAACTAAGAACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGG
AAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGGAATACCG
GGTGCTGTAGGCTCCCCGACCCAGAAGCAGGTCGTCT

The xxx.contig.fa.100 file follows the same format, except that it contains only the contigs from the ≥100 bp assembly.
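If you would like to recompute summary numbers from a contig FASTA file yourself, the following Python sketch shows one way to do it. It parses FASTA records, drops contigs shorter than 100 bp, and reports a few common metrics. The function names, the second (short) contig, and the choice of metrics (contig count, total bases, longest contig, N50) are assumptions made for illustration; they are not necessarily the metrics reported in xxx.stats.txt.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records, min_length: int = 100):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


def n50(lengths: List[int]) -> int:
    """Length of the contig at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(records) -> Dict[str, int]:
    """Illustrative metrics only; the pipeline's stats file may report others."""
    lengths = [len(s) for _, s in records]
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# The first contig is the example from this guide; the second is a
# hypothetical short contig added to show the effect of the filter.
EXAMPLE = """>a5100;56 K: 25 length: 153
TCCACCAACTAAGAACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGG
AAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCTGGGAATACCG
GGTGCTGTAGGCTCCCCGACCCAGAAGCAGGTCGTCT
>hypothetical_short_contig
ACGTACGTACGT
"""

all_contigs = list(parse_fasta(EXAMPLE.splitlines()))
long_contigs = filter_min_length(all_contigs, 100)
print(summarize(all_contigs))
print(summarize(long_contigs))
```

To run this on your own data, replace EXAMPLE.splitlines() with an open file handle for your contig file.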

Raw Data Folder


The Raw Data folder holds all of the raw data from the sequencer. Many researchers do not use any of this data; however, Cofactor still provides it because they may choose to use it in conjunction with other experiments at a later time. Annotations for de novo projects will be found under Raw Data in the Annotations folder.

Reference Folder

This folder contains a gunzipped file holding the reference sequence, in FASTA format, that was used during the annotation step (if annotations were performed for your assembly).

Samples Folder

This folder contains the raw reads in your platform-specific format. The raw data is compressed using bzip2. Most compression programs (e.g. StuffIt) can unzip these files. Please contact Cofactor if you have trouble unzipping them. Each file contains:

FASTQs (.gz)
Each FASTQ sequence entry has a header line that begins with @, followed by a line of nucleotide sequence data. The third line is a header that precedes the encoded Phred quality score values. The syntax of the Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q:

$Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10);

Aligners that use this format benefit from the per-base qualities and have more information available than the FASTA format provides (assuming they use the quality information in their alignment).
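For readers who prefer Python, the conversion above can be sketched as follows. The block reads four-line FASTQ records and converts each quality character to a Phred value using the same formula and the same offset of 64. The function names and the example read are hypothetical, and the sketch assumes each record occupies exactly four lines.

```python
import math
from typing import Iterable, Iterator, List, Tuple

SOLEXA_OFFSET = 64  # character offset used in the formula in this guide


def solexa_char_to_phred(ch: str) -> float:
    """Convert one Solexa-scaled quality character to a Phred quality."""
    solexa_q = ord(ch) - SOLEXA_OFFSET
    return 10 * math.log(1 + 10 ** (solexa_q / 10.0)) / math.log(10)


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(it, None)
        separator = next(it, None)  # third line, the second header
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        yield header[1:], sequence.strip(), quality.strip()


def phred_scores(quality: str) -> List[float]:
    """Phred qualities for every base of a read."""
    return [solexa_char_to_phred(ch) for ch in quality]


# Hypothetical read, for illustration only.
example = ["@read1", "ACGT", "+", "hh@;"]
for name, sequence, quality in read_fastq(example):
    rounded = [round(q, 2) for q in phred_scores(quality)]
    print(name, sequence, rounded)
```

For high qualities the Solexa and Phred values are nearly equal; they diverge mainly for low-quality bases, which is why the scaling matters when filtering reads.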


FASTAs (.fa)
The definition line of each read begins with a >. The second line of each entry contains the nucleotide sequence. We usually use the raw read data in FASTA format when we have a specific gene or region that we want to query out of the sequence data. We often use short read aligners for this, and we choose the FASTA format because it carries less information than FASTQ (it contains no quality information), so the alignment process proceeds quickly.
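If you only have the FASTQ version of a read file and want a FASTA version for a quick query, a conversion can be sketched in Python as shown here. The sketch opens the file according to its extension (bzip2, gzip, or plain text), keeps the header and sequence of each record, and drops the quality lines. The function names and file names are hypothetical, and four-line FASTQ records are assumed.

```python
import bz2
import gzip
from typing import IO, Iterable, Iterator


def open_reads(path: str) -> IO[str]:
    """Open a read file as text, choosing the decompressor by extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (without newlines) for each four-line FASTQ record."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(it, None)
        separator = next(it, None)  # third line, not needed for FASTA
        quality = next(it, None)    # quality line, dropped
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        yield ">" + header[1:]
        yield sequence.strip()


def convert_file(src: str, dst: str) -> None:
    """Write a FASTA copy of the FASTQ file at src to dst."""
    with open_reads(src) as reads, open(dst, "w") as out:
        for line in fastq_to_fasta(reads):
            out.write(line + "\n")
```

A call such as convert_file("sample_reads.fastq.bz2", "sample_reads.fa") would then produce the FASTA file; both file names here are placeholders for your own.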


Contact
If you have any questions or concerns, please contact us at:

3141 Olive St. Saint Louis, MO 63103
T 314-436-9120
F 815-301-3459
1-888-8Cofactor (1-888-826-3228)
sales@cofactorgenomics.com
cofactorgenomics.com

Thank You
A sincere thank you for trusting us with your project. It is our goal to continue to push these new technologies to their limits and ultimately to have your project benefit from our efforts. If you have specific suggestions on how we may achieve this goal, please don't hesitate to contact us.

