About bioinformatics

Bioinformatics is an interdisciplinary field of the life sciences. Bioinformatics
research and applications span many areas, such as:
 genome annotation, gene/protein prediction, and expression profiling;
 molecular folding, modeling, and design;
 development of databases and data management systems;
 development of software and analysis tools;
 bioinformatics education and training (e.g., medical education applications).

As an interdisciplinary branch of the life sciences, bioinformatics aims to develop
methodology and analysis tools to explore large volumes of biological data, helping to store,
organize, systematize, annotate, visualize, query, mine, understand, and interpret complex
data sets. It draws on conventional and modern computer science, cloud computing, statistics,
and mathematics, as well as pattern recognition, reconstruction, machine learning, simulation
and iterative approaches, and molecular modeling/folding algorithms. The emergence and advance
of the field, however, are tightly associated with the programming and software development
needed for the handling and the structural and functional analysis of large volumes of
molecular data on DNA, RNA, proteins, and metabolites.
Bioinformatics should also be differentiated from related scientific fields such as biological
computation and computational biology. Biological computation aims to develop biological
computers using advances in bioengineering, cybernetics, robotics, and molecular cell biology.
In contrast, bioinformatics develops and uses computational algorithms to understand and
interpret biological processes based on genome-derived molecular sequences and their
interactions. In many respects, its objectives therefore resemble those of computational
biology. The difference is one of emphasis: computational biology concentrates on building and
developing theoretical models for biological analyses, whereas bioinformatics focuses on
providing practical tools to organize and analyze basic genomic, proteomic, and other “omics”
data, including sequence analysis and its visualization. Bioinformatics aims to handle, analyze,
and interpret genome-derived molecular sequence data and its organizational principles across
a broad spectrum of comparative, simulative, and evolutionary/phylogenetic perspectives.
For instance, bioinformatics tools such as comparative analysis of genomic and genetic
data and signal processing help to interpret and understand molecular and evolutionary
processes and interactions from the large volumes of raw data produced by wet-bench
experimental molecular biology. In the “omics” fields, bioinformatics helps to sequence and
annotate genomes; to identify distinct patterns, mutation profiles, genetic epistasis,
gene/protein expression and regulation, and gene ontologies; and to mine and query biological
data and the biomedical literature.

Genome sequencing, genome annotation and expression profiling

Rapid and reliable determination of DNA sequences became possible with the introduction of the
sequencing techniques of Sanger and Coulson and of Maxam and Gilbert.
Allan Maxam and Walter Gilbert developed a method for sequencing single-stranded DNA
by taking advantage of a two-step catalytic process involving piperidine and two chemicals that
selectively attack purines and pyrimidines. Fred Sanger was developing an alternative method.
Rather than using chemical cleavage reactions, Sanger opted for a method involving a third form
of the ribose sugars (dideoxynucleotides, which terminate the growing DNA chain). The basis for
effective computer integration, however, was automated fluorescence sequencing. In 1986, Leroy
Hood and colleagues introduced a DNA sequencing method in which radioactive labels,
autoradiography, and manual base calling were replaced by fluorescent labels, laser-induced
fluorescence detection, and computerized base calling.

Genome sequencing tools are therefore a critical area of bioinformatics research. Multiple
bioinformatics tools and platforms can be used to handle, analyze, compare, relate, and
visualize DNA sequences:

 BLAST (Basic Local Alignment Search Tool): algorithm for comparing primary
biological sequence information, such as the amino-acid sequences of proteins or the
nucleotides of DNA sequences;

 BLAT (BLAST-Like Alignment Tool): pairwise sequence alignment algorithm; BLAT
can be used either as a web-based server-client program or as a stand-alone program.

 NGS (Next Generation Sequencing) platforms: DNA samples are randomly
fragmented, and platform-specific adapters are then added to the flanking ends to
produce a “library”. The library is then amplified through PCR (platform-specific
amplification, e.g. on beads or glass).

 Biochips: a DNA microarray is a collection of microscopic DNA spots attached to a
solid surface. Scientists use DNA microarrays to measure the expression levels of
large numbers of genes simultaneously or to genotype multiple regions of a genome.

 UCSC: the UCSC Genome Browser is a graphical viewer for genomic data now in its
13th year. Since the early days of the Human Genome Project, it has presented an
integrated view of genomic data of many kinds. Now home to assemblies for 58
organisms, the Browser presents visualization of annotations mapped to genomic
coordinates. The ability to juxtapose annotations of many types facilitates inquiry-
driven data mining. Gene predictions, mRNA alignments, epigenomic data from the
ENCODE project, conservation scores from vertebrate whole-genome alignments and
variation data may be viewed at any scale from a single base to an entire chromosome.

 Ensembl: Ensembl is a genome browser for vertebrate genomes that supports research
in comparative genomics, evolution, sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments, predicts regulatory function
and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the
Variant Effect Predictor (VEP) for all supported species.

Genome annotation is a key process for identifying the coding and non-coding regions of
a genome, as well as gene locations and functions. Analysis of DNA sequences with genome
annotation software allows genes, exons and introns, regulatory elements, repeats, and
mutations to be found and mapped. Genome databases are essential for retrieving information on
gene names, protein products, and DNA sequence functions. Annotating and predicting genes in
sequenced genomes requires computerized approaches because genomes are too large to be
annotated manually. Bioinformatics-based gene finding and annotation, including the search for
protein-coding genes, RNA transcripts, and other functional sequences within a genome, is
possible because start/stop regions, introns, exons, motifs, repeats, and other regulatory,
sensory, and signaling regions follow recognizable patterns, with some variation between genes
and among organisms. This work relies on suitable databases (mostly open source) and genome
analysis tools.
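
To illustrate how start/stop patterns can be recognized computationally, the following is a minimal Python sketch of a naive open reading frame (ORF) scan. It is not the method of any particular annotation tool: it assumes the standard start codon ATG and the stop codons TAA, TAG, and TGA, searches only the forward strand, and ignores introns. The function name `find_orfs` and the `min_codons` parameter are hypothetical names chosen for this example.

```python
# Illustrative sketch only: a naive ORF scan on the forward strand.
# Assumes the standard start codon (ATG) and stop codons (TAA, TAG, TGA).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons from the start codon up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

Real gene finders also scan the reverse strand and combine such signals with statistical models, so this sketch only conveys the pattern-matching idea.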

Bioinformatics tools are very important for analyzing gene and protein expression profiles.
Large-scale sequencing of cDNA libraries has led to the following techniques:
 Serial Analysis of Gene Expression (SAGE): an approach that allows rapid and
detailed analysis of overall gene expression patterns. Steps:

1. Isolate the mRNA of an input sample
2. Extract a small piece of sequence from a defined position of each mRNA
molecule
3. Link these small pieces of sequence together to form a long chain
4. Clone these chains into a vector that can be taken up by bacteria
5. Sequence these chains using modern DNA sequencers
6. Process the data with a computer to count the small sequence tags

 Expressed sequence tags (ESTs): short sub-sequences of a complementary
DNA sequence. ESTs are used to identify gene transcripts and are instrumental in
gene discovery and in gene-sequence determination.
 Massively parallel signature sequencing (MPSS): a procedure used to
identify and quantify mRNA transcripts, yielding data similar to serial analysis
of gene expression (SAGE), although it employs a different series of biochemical steps.
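
The final, computational step of the SAGE procedure (counting the tags in a sequenced chain) can be sketched in Python as follows. This is a simplified illustration and not a full SAGE pipeline: it assumes that every tag directly follows an anchoring site, and it ignores the ditag structure of real SAGE concatemers. The anchor `CATG`, the tag length of 10 bases, and the function name `count_tags` are assumptions made for this example.

```python
from collections import Counter


def count_tags(concatemer, anchor="CATG", tag_length=10):
    """Count the fixed-length tags that directly follow each anchor site.

    Simplified model (assumption): the chain looks like
    anchor + tag + anchor + tag + ..., with no ditags.
    """
    counts = Counter()
    step = len(anchor) + tag_length
    pos = concatemer.find(anchor)
    while pos != -1:
        tag = concatemer[pos + len(anchor):pos + step]
        if len(tag) == tag_length:          # skip a truncated tag at the end
            counts[tag] += 1
        # Resume the search after the tag so an anchor-like motif
        # inside a tag is not mistaken for a new anchor site.
        pos = concatemer.find(anchor, pos + step)
    return counts


if __name__ == "__main__":
    chain = "CATG" + "A" * 10 + "CATG" + "C" * 10 + "CATG" + "A" * 10
    for tag, n in count_tags(chain).most_common():
        print(tag, n)
```

The resulting counts approximate relative transcript abundance: a tag seen more often corresponds to a more highly expressed mRNA in the input sample.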

Structural bioinformatics: molecular folding, modeling, and design

One of the most widely used applications of bioinformatics is the determination of
three-dimensional protein structures. This encompasses molecular modeling and folding to predict
the possible function of proteins or other molecular structures, modeling the behavior of
molecules, and designing drugs for many complex human diseases. It also supports de novo protein
design, enzyme design, protein-ligand/drug docking, protein-peptide interaction studies, and
structure prediction of biological macromolecules and macromolecular complexes.
A leading, freely available web server and stand-alone software tool for
automated protein structure prediction and structure-based functional annotation is
“Iterative Threading ASSEmbly Refinement” (I-TASSER), which “first
generates full-length atomic structural models from multiple threading alignments and iterative
structural assembly simulations followed by atomic-level structure refinement”. With
I-TASSER, the functional and structural characteristics of proteins, including
ligand-binding sites, enzyme commission numbers, and gene ontology terms, can be explored on a
comparative scale.

Databases

The development of databases depends significantly on bioinformatics tools, advances,
research, and applications. A large number of databases of different types are available,
covering all aspects of biological data storage and organization:
 GenBank: the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences.
 ENA (European Nucleotide Archive): provides a comprehensive record of the
world's nucleotide sequencing information, covering raw sequencing data,
sequence assembly information and functional annotation.
 DDBJ (DNA Data Bank of Japan).
 mGen: fast and simple gene loading that helps automate BioPerl processes.
 Metascape: a free gene annotation and analysis resource that helps biologists
make sense of one or multiple gene lists.
 TAIR (The Arabidopsis Information Resource): maintains a database of genetic and
molecular biology data for the model higher plant Arabidopsis thaliana.

Software and programming tools

Software and programming tools, along with bioinformatics services and workflows, have
been the main fields and core targets of bioinformatics since its emergence. Through the
contributions of various bioinformatics companies and public institutions, bioinformatics
software began as simple command-line tools and later evolved into more complex graphical
standalone programs. The main driving forces for the current and future development of
bioinformatics software and tools emerged over the past decade: advances in genome decoding
technologies, the accumulation of large volumes of biological data and the consequent need
for their analysis, and advances in computer technologies, graphics, visualization, and
molecular modeling and networking techniques. Moreover, the availability of various open-source
libraries, shared object models, and community-supported plug-ins has made it easier to gather
innovative ideas from the community and to perform in silico experiments on existing
“Big Data”.
The most widely used software packages are:
1. UGENE
 Creating, editing and annotating nucleic acid and protein sequences
 Fast search in a sequence
 Multiple sequence alignment: ClustalW, ClustalO, MUSCLE, Kalign, MAFFT, T-Coffee
 PCR in silico
 Search through online databases: NCBI, PDB, UniProtKB/Swiss-Prot,
UniProtKB/TrEMBL, DAS servers
 Local and NCBI Genbank BLAST search
 Open reading frames finder
 Restriction enzyme finder with integrated REBASE restriction enzymes list
 Integrated Primer3 package for PCR primer design
 Plasmid construction and annotation
 Cloning in silico by designing of cloning vectors
 Genome mapping short reads with Bowtie, BWA and UGENE Genome Aligner
 Raw NGS data processing
 Visualization of next generation sequencing data (BAM files) using UGENE Assembly
Browser
 Variant calling with SAMtools
 RNA-seq data analysis with Tuxedo pipeline (TopHat, Cufflinks, etc.)
 ChIP-seq data analysis with Cistrome pipeline (MACS, CEAS, etc.)
 SPAdes de novo assembler
 HMMER2 and HMMER3 packages integration
 Chromatogram viewer

 Search for transcription factor binding sites (TFBS) with weight matrix and SITECON
algorithms
 Search for direct, inverted and tandem repeats in DNA sequences
 Local sequence alignment with optimized Smith-Waterman algorithm
 Building (using integrated PHYLIP Neighbor Joining, MrBayes or PhyML Maximum
Likelihood) and editing phylogenetic trees
 Combining various algorithms into custom workflows with UGENE Workflow Designer
 Contigs assembly with CAP3
 3D Structure viewer for files in PDB and MMDB formats, anaglyph view support
 Protein secondary structure prediction with GOR IV and PSIPRED algorithms
 Constructing dotplots for nucleic acid sequences
 mRNA alignment with Spidey
 Creating and using a shared storage (e.g. for a lab)
 Search for a pattern of various algorithms’ results in a nucleic acid sequence with UGENE
Query Designer
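
The local sequence alignment named in the list above can be illustrated with a plain Smith-Waterman implementation in Python. This is a textbook sketch, not UGENE's optimized version: it uses a simple linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name `smith_waterman` are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Textbook dynamic programming with a linear gap penalty (assumed scores).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]       # score matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero: that is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

The algorithm needs time and memory proportional to the product of the two sequence lengths, which is why database-scale searches rely on faster heuristic tools such as BLAST.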

2. EMBOSS

 Sequence alignment
 Rapid database searching with sequence patterns
 Protein motif identification, including domain analysis
 Nucleotide sequence pattern analysis -- for example to identify CpG islands or repeats
 Codon usage analysis for small genomes
 Rapid identification of sequence patterns in large scale sequence sets
 Presentation tools for publication
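
As an illustration of nucleotide pattern analysis such as CpG island detection, the Python sketch below computes the two statistics usually inspected for a sequence window: the GC fraction and the observed/expected CpG ratio, (number of CpG × window length) / (number of C × number of G). It does not reproduce any EMBOSS program, and the function name `cpg_stats` is hypothetical.

```python
def cpg_stats(seq):
    """Return (gc_fraction, cpg_observed_over_expected) for one sequence window."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")                      # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected ratio; defined as 0.0 here when C or G is absent.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(cpg_stats("ATATATAT"))
```

Commonly cited criteria treat a region of at least 200 bases with a GC fraction above 0.5 and an observed/expected ratio above 0.6 as a candidate CpG island. In practice the function would be applied to sliding windows along a longer sequence.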

3. GenGIS
GenGIS is a bioinformatics application that allows users to combine digital map data with
information about biological sequences collected from the environment. GenGIS provides a 3D
graphical interface in which the user can navigate and explore the data, as well as a Python
interface that allows easy scripting of statistical analyses using the Rpy libraries.

4. MOTHUR
Mothur is an open-source, community-supported computer program that enables
investigators to describe and compare microbial communities based on molecular data. Its
developers aim for mothur to be a one-stop shop for analyzing microbial ecology data. Mothur is
freely available as C++ source code and as a Windows executable.

5. BioJava
BioJava is an open-source software project dedicated to providing Java tools for processing
biological data. BioJava is a set of library functions written in the Java programming language
for manipulating sequences and protein structures, and it includes file parsers, Common Object
Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS)
support, access to AceDB, dynamic programming, and simple statistical routines. BioJava supports
a wide range of data, from DNA and protein sequences to 3D protein structures. The BioJava
libraries are useful for automating many routine bioinformatics tasks, such as parsing
a Protein Data Bank (PDB) file or interacting with Jmol. This application
programming interface (API) provides various file parsers, data models, and algorithms that
facilitate working with standard data formats and enable rapid application development and
analysis.

6. BioPerl
BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for
bioinformatics applications. It has played an integral role in the Human Genome Project.

The most used programming languages in bioinformatics are:


1. R
 command line interface
 open source and freely distributed
 runs on Windows, Macs, and Linux machines
 it is continually updated, monitored, and improved upon by a dedicated
team of computer scientists, statisticians, and researchers
 the underlying algorithms in the written functions are quite adaptable,
somewhat forgiving, and remarkably efficient
 the syntax of the language is very intuitive
 there is a vast array of user-contributed packages that allow for many
specialized forms of data analysis
2. Java
 versatile language that is designed to run on any computer platform
 Java applications are packages of small objects, each carrying out a single
function (e.g., the Java Molecular Biology Workbench)
 Bioinformatics programmers have set up free / open-source projects such as
EMBOSS, BioJava, BioPerl, BioPython, BioRuby
3. JavaScript
 scripting language
 supports interactive features such as animation and pop-up windows
 used for interacting with databases
4. Perl (Practical Extraction and Report Language)
 scripting language
 compatible with UNIX, Windows, Mac, and other operating systems
 strong file-management facilities
 uploading and downloading of files on the WWW can be done very easily
 used for analysis of DNA or protein sequences
5. XML (Extensible Markup Language)
 markup language like HTML
 describes files in terms of the types of data they contain
 relies on separate database management programs for displaying and
processing data
6. Python
 Has a rich and versatile standard library that is immediately available,
without the user having to download separate packages
 Read and write XML and JSON files
 Extract files from a zip archive
 Open a URL as if it were a file
 Genome analysis
 PDF generation
 Image processing
 Interfacing with popular databases
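
As a small illustration of reading and writing XML and JSON with the standard library alone, the sketch below serializes a sequence record to both formats and reads it back using the `json` and `xml.etree.ElementTree` modules. The record layout (the `id`, `organism`, and `sequence` fields) and the function names are hypothetical and chosen only for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical example record; the field names are assumptions for illustration.
record = {"id": "seq1", "organism": "Arabidopsis thaliana", "sequence": "ATGGCC"}


def record_to_json(rec):
    """Serialize a record dictionary to a JSON string."""
    return json.dumps(rec, sort_keys=True)


def record_to_xml(rec):
    """Serialize a record to XML: the id becomes an attribute, other fields elements."""
    root = ET.Element("record", id=rec["id"])
    for key in ("organism", "sequence"):
        ET.SubElement(root, key).text = rec[key]
    return ET.tostring(root, encoding="unicode")


def xml_to_record(text):
    """Parse the XML produced by record_to_xml back into a dictionary."""
    root = ET.fromstring(text)
    rec = {"id": root.get("id")}
    for child in root:
        rec[child.tag] = child.text
    return rec


if __name__ == "__main__":
    print(record_to_json(record))
    print(record_to_xml(record))
    print(xml_to_record(record_to_xml(record)) == record)
```

No third-party package is needed for this round trip, which is the practical meaning of a standard library that is immediately available.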
