Genome sequencing, genome annotation and expression profiling
Rapid and reliable determination of DNA sequences became possible with the introduction of the
sequencing techniques of Sanger and Coulson and of Maxam and Gilbert.
Allan Maxam and Walter Gilbert developed a method for sequencing single-stranded DNA
by taking advantage of a two-step catalytic process involving piperidine and two chemicals that
selectively attack purines and pyrimidines. Fred Sanger, meanwhile, was developing an alternative
method. Rather than using chemical cleavage reactions, Sanger opted for a method involving a third
form of the ribose sugar, dideoxyribose, which lacks the 3'-hydroxyl group and therefore terminates
chain elongation. The basis for good computer integration, however, was automated fluorescence
sequencing. In 1986, Leroy Hood introduced a DNA sequencing method in which radioactive labels,
autoradiography and manual base calling were replaced by fluorescent labels, laser-induced
fluorescence detection and computerized base calling.
Genome sequencing tools are therefore a critical area of bioinformatics research. Multiple
bioinformatic tools can be used to handle, analyze, compare, relate, and visualize
DNA sequences:
BLAST (Basic Local Alignment Search Tool): algorithm for comparing primary
biological sequence information, such as the amino-acid sequences of proteins or the
nucleotides of DNA sequences;
BLAT (BLAST-Like Alignment Tool): pairwise sequence alignment algorithm; BLAT
can be used either as a web-based server-client program or as a stand-alone program.
UCSC: the UCSC Genome Browser is a graphical viewer for genomic data now in its
13th year. Since the early days of the Human Genome Project, it has presented an
integrated view of genomic data of many kinds. Now home to assemblies for 58
organisms, the Browser presents visualization of annotations mapped to genomic
coordinates. The ability to juxtapose annotations of many types facilitates inquiry-
driven data mining. Gene predictions, mRNA alignments, epigenomic data from the
ENCODE project, conservation scores from vertebrate whole-genome alignments and
variation data may be viewed at any scale from a single base to an entire chromosome.
Ensembl: Ensembl is a genome browser for vertebrate genomes that supports research
in comparative genomics, evolution, sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments, predicts regulatory function
and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the
Variant Effect Predictor (VEP) for all supported species.
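Both BLAST and BLAT begin by looking for short exact word matches ("seeds") between a query and a target sequence before attempting a fuller alignment. The following is a minimal sketch of that seeding idea only, not the actual implementation of either tool; the function names `build_kmer_index` and `find_seeds`, the word length of 4 and the example sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer in `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, target_pos) pairs where a k-mer of the query occurs in the target."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get() avoids adding empty entries to the defaultdict
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds


# Hypothetical toy data: a short target and an even shorter query.
target = "ACGTACGTGGA"
query = "TACGTG"
index = build_kmer_index(target, 4)
seeds = find_seeds(query, index, 4)
print(seeds)
```

Real tools then extend and score each seed; this sketch stops at listing candidate positions.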
Genome annotation is a key process for identifying the coding and non-coding regions of
a genome, gene locations and functions. Analysis of DNA sequences with genome annotation
software tools allows finding and mapping genes, exons and introns, regulatory elements, repeats
and mutations. Genome databases are essential for retrieving information on gene names, protein
products and DNA sequence functions. Annotating and predicting genes from sequenced genomes
requires computerized approaches because genomes are too large to be annotated manually.
Bioinformatics-based gene finding and annotation, including the search for protein-coding genes,
RNA transcripts, and other functional sequences within a genome, is possible because start/stop
regions, introns, exons, motifs, repeats, and other regulatory, sensory and signaling regions
follow recognizable patterns, with some variation between genes and among organisms. This is done
using suitable databases (mostly open source) or genome analysis tools.
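As a small illustration of recognizing start/stop patterns, the sketch below scans the three forward reading frames of a DNA string for open reading frames (an ATG followed in frame by a stop codon). It is a simplified assumption-laden example: it ignores the reverse strand, introns and alternative start codons, and the function name `find_orfs`, the `min_codons` parameter and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the index of the ATG and `end` is one past the stop codon,
    so dna[start:end] is the full ORF including the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # count codons from the ATG up to, but not including, the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical example sequence with a single ORF in frame 2.
example = "CCATGAAATGGTAGCC"
orfs = find_orfs(example)
for start, end, frame in orfs:
    print(frame, example[start:end])
```

Production gene finders add statistical models on top of this kind of pattern scan, which is why the result of a simple ORF search should be treated only as a list of candidates.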
Bioinformatics tools are very important for analyzing gene and protein expression profiles.
Large-scale sequencing of cDNA libraries has led to the following techniques:
Serial Analysis of Gene Expression (SAGE): an approach that allows rapid and
detailed analysis of overall gene expression patterns. Steps:
Structural bioinformatics: molecular folding, modeling, and design
Databases
Software and programming tools
Software and programming tools, along with bioinformatics services and workflows, have
been the main fields and core targets of bioinformatics since its emergence. Through the
contributions of various bioinformatics companies and public institutions, bioinformatics software
started out as simple command-line tools and later evolved into more complex graphical
standalone programs. The main driving forces for the current and future development of
bioinformatics software and tools emerged over the past decade: advances in genome decoding
technologies, the accumulation of large volumes of biological data and the consequent need
for their analysis, as well as advances in computer technologies, graphics, visualization,
molecular modeling and networking techniques. Moreover, the availability of various open-source
libraries, shared object models, and community-supported plug-ins has facilitated gathering
innovative ideas from the community and performing innovative in silico experiments on existing
“Big Data”.
The most used software packages are:
1. UGENE
Creating, editing and annotating nucleic acid and protein sequences
Fast search in a sequence
Multiple sequence alignment: ClustalW, ClustalO, MUSCLE, Kalign, MAFFT, T-Coffee
PCR in silico
Search through online databases: NCBI, PDB, UniProtKB/Swiss-Prot,
UniProtKB/TrEMBL, DAS servers
Local and NCBI Genbank BLAST search
Open reading frames finder
Restriction enzyme finder with integrated REBASE restriction enzymes list
Integrated Primer3 package for PCR primer design
Plasmid construction and annotation
Cloning in silico by designing cloning vectors
Mapping short reads to a genome with Bowtie, BWA and UGENE Genome Aligner
Raw NGS data processing
Visualization of next generation sequencing data (BAM files) using UGENE Assembly
Browser
Variant calling with SAMtools
RNA-seq data analysis with Tuxedo pipeline (TopHat, Cufflinks, etc.)
ChIP-seq data analysis with Cistrome pipeline (MACS, CEAS, etc.)
SPAdes de novo assembler
HMMER2 and HMMER3 packages integration
Chromatogram viewer
Search for transcription factor binding sites (TFBS) with weight matrix and SITECON
algorithms
Search for direct, inverted and tandem repeats in DNA sequences
Local sequence alignment with optimized Smith-Waterman algorithm
Building (using integrated PHYLIP Neighbor Joining, MrBayes or PhyML Maximum
Likelihood) and editing phylogenetic trees
Combining various algorithms into custom workflows with UGENE Workflow Designer
Contigs assembly with CAP3
3D Structure viewer for files in PDB and MMDB formats, anaglyph view support
Protein secondary structure prediction with GOR IV and PSIPRED algorithms
Constructing dotplots for nucleic acid sequences
mRNA alignment with Spidey
Creating and using a shared storage (e.g. for a lab)
Search for a pattern of various algorithms’ results in a nucleic acid sequence with UGENE
Query Designer
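One item in the feature list above is local sequence alignment with the Smith-Waterman algorithm. The sketch below shows the basic dynamic-programming recurrence in its plain, unoptimized form with a linear gap penalty; it is not UGENE's optimized implementation, and the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor at 0 is what makes the alignment local
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared core "ACG"
```

Recovering the aligned substrings themselves would require a traceback from the highest-scoring cell, which is omitted here to keep the sketch short.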
2. EMBOSS
Sequence alignment
Rapid database searching with sequence patterns
Protein motif identification, including domain analysis
Nucleotide sequence pattern analysis -- for example to identify CpG islands or repeats
Codon usage analysis for small genomes
Rapid identification of sequence patterns in large scale sequence sets
Presentation tools for publication
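As an illustration of nucleotide pattern analysis such as CpG island detection, the sketch below computes two statistics commonly used for that purpose: GC content and the observed/expected CpG ratio, (number of CpG x length) / (number of C x number of G). This is not EMBOSS code; the function names and the thresholds (GC content above 0.5, ratio above 0.6) are assumptions based on commonly cited criteria, and the usual minimum-length requirement is left out.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on this strand
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a single window of sequence."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content > min_gc and obs_exp > min_ratio


print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("ATATATAT"))
```

In practice these statistics are evaluated over a sliding window along the genome rather than over one whole sequence.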
3. GenGIS
GenGIS is a bioinformatics application that allows users to combine digital map data with
information about biological sequences collected from the environment. GenGIS provides a 3D
graphical interface in which the user can navigate and explore the data, as well as a Python
interface that allows easy scripting of statistical analyses using the Rpy libraries.
4. MOTHUR
Mothur is an open-source, community-supported computer program that enables
investigators to describe and compare microbial communities based on molecular data. Its
developers aim for mothur to be a one-stop shop for analyzing microbial ecology data. Mothur is
freely available as C++ source code and as a Windows executable.
5. BioJava
BioJava is an open-source software project dedicated to providing Java tools for processing
biological data. It is a set of library functions written in the Java programming language for
manipulating sequences and protein structures, and it also covers file parsers, Common Object
Request Broker Architecture (CORBA) interoperability, the Distributed Annotation System (DAS),
access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a wide
range of data, from DNA and protein sequences to 3D protein structures. The BioJava
libraries are useful for automating many routine bioinformatics tasks, such as parsing
a Protein Data Bank (PDB) file or interacting with Jmol. This application
programming interface (API) provides various file parsers, data models and algorithms that
facilitate working with the standard data formats and enable rapid application development and
analysis.
6. BioPerl
BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for
bioinformatics applications. It has played an integral role in the Human Genome Project.
9. Perl (Practical Extraction and Report Language)
Scripting language
Compatible with UNIX, Windows, Mac and other operating systems
Convenient facilities for file management
Uploading and downloading of files on the WWW can be done very easily
Used for the analysis of DNA or protein sequences
10. XML (Extensible Markup Language)
Markup language like HTML
Describes the files in terms of the types of data they contain
Uses separate database management programs for displaying and
processing data
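To make the idea of describing data by type concrete, here is a minimal sketch that parses a small XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`sequences`, `sequence`, `id`, `type`) are invented for this example and do not correspond to any particular bioinformatics schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: each element states what kind of data it holds.
record_xml = """
<sequences>
  <sequence id="seq1" type="DNA">ATGGCGTAA</sequence>
  <sequence id="seq2" type="protein">MA</sequence>
</sequences>
""".strip()

root = ET.fromstring(record_xml)

# Collect (id, type, sequence) for every <sequence> child of the root.
records = [
    (el.get("id"), el.get("type"), el.text.strip())
    for el in root.findall("sequence")
]
print(records)
```

The markup itself carries no display logic; a separate program, here a few lines of Python, decides how to process or present the data.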
11. Python
Has a rich and versatile standard library that is immediately available,
without the user having to download separate packages
Read and write XML and JSON files
Extract files from a zip archive
Open a URL as if it were a file
Genome analysis
PDF generation
Image processing
Interfacing with popular databases
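Two of the standard-library capabilities listed above, JSON handling and zip archives, can be sketched together using only modules that ship with Python. The record contents and the file name `seq1.json` are made up for illustration, and the archive is kept in memory so that the example does not touch the disk.

```python
import io
import json
import zipfile

# Hypothetical sequence record to store.
record = {"id": "seq1", "sequence": "ATGGCGTAA"}

# Write the record as a JSON file inside an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("seq1.json", json.dumps(record))

# Re-open the archive, list its contents and read the JSON back.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    restored = json.loads(archive.read("seq1.json"))

print(names, restored)
```

The same pattern works with a file path in place of the `io.BytesIO` buffer when the archive lives on disk.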