CS262 Lecture 1 Notes

Computational Genomics / Biology for CS262 / Sequence Alignment

Scribed by Hong T. Lam

January 6, 2004

The Goal of Genomics

Study Organisms at the DNA Level

• Read a complete genome such as the human DNA. (DNA Sequencing & Assembly)

• Identify parts such as the genes encoded by the DNA sequence. (Gene Finding)

• Figure out the connections between parts such as how genes interact with each other.

• Gene Expression: The process by which genetic code is translated into structures present and
functioning in the cell. Expressed genes are transcribed into different types of RNA, of which
mRNA is the only type that is translated into proteins. Gene expression provides information
about how a gene functions and how it is different from other genes. DNA microarrays can be
used to compare gene expression in different populations of cells. Cells have different gene
expression patterns and levels. (Microarrays & Regulation)

Study Evolution at the DNA Level

• Compare whole genomes from multiple organisms. (Large-Scale Comparative Genomics)

• Quantify the evolution of biological sequence. (Phylogeny & Evolution)

• Uncover the evolutionary tree.

The Role of CS in Biology

Computer science plays an essential role in biology. Biology is becoming an information science: the
shift to high-throughput technologies has led to an explosion of genomic data, and computational
methods are needed to analyze it.

Basic Computational Methods for Analysis of Biological Sequences

• Sequence Alignment Algorithms

• Dynamic Programming

• Hidden Markov Models

Hong T. Lam Page 1 1/10/2004


Genomics Applications Using Basic Computational Methods

• DNA Sequencing: The process of determining the exact order of a long string of bases (A, T, C,
G) that makes up the DNA of an organism. The genomes of several organisms, including human,
have been completely sequenced.

• Comparison of DNA and proteins across organisms

• Discovery of genes, promoters, and regulatory sites

Paradigms in Biology

There are two paradigms in biology.

• Molecular Paradigm (Central Dogma)

DNA → RNA → polypeptide

DNA is transcribed into RNA (tRNA, rRNA, snRNA, mRNA) through a process known as
transcription. mRNA is then translated into polypeptides through a process known as
translation, and the polypeptides fold into 3-D protein structures. An organism consists of
many different types of proteins.

• Evolution Paradigm: All organisms originate from a common ancestor and are connected by an
evolutionary tree.

Basic Biology for CS262

Structures of Biomolecules

• The cell is composed of DNA in the nucleus and proteins in the cytoplasm, all of which is
encapsulated in a lipid membrane.

• The nucleic acids (DNA and RNA) form the genetic material of all living organisms. They are
found mainly in the nucleus of the cell.

• A nucleotide has three components.

   Sugar (ribose in RNA, deoxyribose in DNA)
   Phosphoric acid
   Nitrogen base
      Adenine (A)
      Guanine (G)
      Cytosine (C)
      Thymine (T) in DNA or Uracil (U) in RNA

Two nucleotides are linked together by attaching the phosphate group on the 5’ carbon of one
nucleotide’s sugar to the 3’ carbon atom of the sugar of the other nucleotide.

• Nucleic acids are linear, unbranched polymers of nucleotides. While RNA is single-stranded,
DNA consists of two strands that run in opposite directions to each other (anti-parallel). The
strands are joined together by pairing the nitrogenous bases (Watson & Crick base pairs). By
convention, DNA and RNA sequences are read from the 5’ end to the 3’ end. These names refer to
the numbering of the carbon atoms on the sugar ring.

DNA        RNA

A = T      A
C = G      C
G = C      G
A = T      A
C = G      C
T = A      U
G = C      G

Base pairs in DNA: A=T and G=C. In RNA, T is replaced by U (T→U).

• Three nucleotides of an mRNA strand form a codon that specifies one amino acid. This makes
sense because a codon made from only one or two nucleotides would not produce enough
combinations (codons) to code for all 20 of the standard amino acids.

1 nucleotide = 4 possible codons
2 nucleotides = 4 * 4 = 16 possible codons
3 nucleotides = 4 * 4 * 4 = 64 possible codons for 20 amino acids

Since a three-nucleotide codon produces 64 possible combinations and there are only 20 standard
amino acids, the genetic code is redundant (degenerate): several different codons specify the
same amino acid. The parsimony principle, that the simplest solution is often right, rules out a
four-nucleotide codon.

• Two amino acids form a dipeptide.


     R                R                 R  O      R
     |                |                 |  ||     |
H2N--C--COOH  +  H2N--C--COOH  →   H2N--C--C--NH--C--COOH
     |                |                 |         |
     H                H                 H         H

• A linear sequence of amino acids forms a polypeptide, which folds to form a complex 3-D protein
structure. The structure of a protein is intimately connected to its function.
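
The base-pairing rules and the codon-counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function names (`reverse_complement`, `to_rna`, `count_codons`) and the example strand are assumptions made here, not part of the lecture.

```python
from itertools import product

# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite (anti-parallel) strand, written 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))


def to_rna(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


def count_codons(length: int) -> int:
    """Count the distinct words of the given length over the 4-letter alphabet."""
    return sum(1 for _ in product("ACGU", repeat=length))


if __name__ == "__main__":
    strand = "ACGACTG"  # hypothetical example strand
    print(reverse_complement(strand))
    print(to_rna(strand))
    for k in (1, 2, 3):
        print(k, "nucleotide(s):", count_codons(k), "possible codons")
```

Enumerating the words with `itertools.product` rather than computing 4**k directly keeps the counting argument explicit: only length 3 gives at least 20 combinations.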

How does DNA function?

• In the cell, DNA provides all the information the cell needs to function. Two questions arise
about DNA as the carrier of genetic information.

Q: How is the information stored in DNA?
A: It is stored as nucleotide sequences.

Q: How is the stored information used?
A: It is used in protein synthesis.

• Ribosomes are the sites of protein synthesis. Since DNA is mainly found in the nucleus and
ribosomes are found in the cytoplasm, how does information flow from DNA to protein? There
is a need for an intermediary: ribonucleic acid (RNA). Three types of RNA take part in protein
synthesis (mRNA, tRNA, rRNA).

• Messenger RNA (mRNA) is synthesized on a DNA template by a process called transcription,
during which information is copied from one strand of DNA to mRNA. mRNA serves as the
messenger that tells the ribosomes what proteins to make. So, how is the information carried by
the mRNA interpreted? Think of an mRNA sequence as a sequence of “triplets”; for example,
AUGCCGGGAGUAUAG is read as AUG-CCG-GGA-GUA-UAG. Each triplet (codon) maps to an
amino acid or to a stop signal. Hence, the sequence of triplets (codons) is translated to a
sequence of amino acids according to the genetic code.

• In 1968, Nirenberg and Khorana received the Nobel Prize in Physiology or Medicine (shared with
Robert Holley) for cracking the universal genetic code, which maps each triplet (codon) to an
amino acid. The code shows how the nucleotide language of mRNA is translated into the amino acid
language of proteins.

• Transfer RNA (tRNA) floats freely in the cytoplasm. It is the molecule that carries amino acids
to the ribosome when a specific amino acid is called for by the information on the mRNA to be
put into the protein that is being synthesized. Every amino acid has its own specific tRNA that
binds to it alone.

• In 1965, Robert Holley published the structure of tRNA. Although tRNA is a single-stranded
molecule, stretches of complementary nucleotides hydrogen bond to form short double-stranded
regions, which bend the tRNA into a cloverleaf shape. All tRNAs have a similar cloverleaf
structure. At a position on one of the leaves, a sequence of three nucleotides forms an
anti-codon, which base pairs with a specific mRNA codon. This anti-codon/codon binding is
crucial. There is a different tRNA molecule corresponding to each mRNA codon.

• rRNA serves as part of the structure of the ribosome, the protein/RNA complex that synthesizes
proteins according to the information carried by the mRNA.

• So, to put this all together: The DNA code is transcribed into a complementary mRNA molecule
within the nucleus. The mRNA enters the cytoplasm, where it associates with a ribosome. The
mRNA code is then translated into a polypeptide chain. The codon AUG signals the start of
translation. An activated tRNA ferries the first amino acid, methionine, to the ribosome. The
tRNA anti-codon binds to the AUG codon on the mRNA. The whole complex shifts and the next
codon is read by another tRNA. As the two amino acids are held in position, a peptide bond is
formed between them. The second tRNA accepts the growing protein chain and the methionine
tRNA is released. The process continues until a stop codon is encountered. When the stop codon
is reached, translation is finished. The ribosome disassembles to be reused for translating another
mRNA and one complete peptide chain is released.
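
The reading of codons described above can be sketched in Python. The sketch below is a simplification: it starts at the first AUG it finds and stops at the first in-frame stop codon, and it ignores tRNAs and ribosomes entirely. The names `CODON_TABLE` and `translate` are illustrative assumptions. The table is the standard genetic code, written as one-letter amino acid codes with "*" for stop.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons enumerated in UCAG order
# (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # The example sequence from the notes: AUG-CCG-GGA-GUA-UAG
    print(translate("AUGCCGGGAGUAUAG"))  # MPGV (Met-Pro-Gly-Val, then stop)
```

The table also makes the redundancy of the code visible: its 64 entries cover only 20 distinct amino acids plus three stop codons.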

What is a gene?

• A genome is the entire DNA content of an organism: the set of all its genes plus non-coding
(“junk”) DNA.

• A gene is a sequence of nucleotides on the DNA that encodes a polypeptide.

• Central Dogma of Molecular Biology

DNA → RNA → Protein → Phenotype

Zoom in:

DNA → (transcription) → tRNA, rRNA, snRNA, mRNA
mRNA → (translation) → polypeptide

DNA is transcribed into different types of RNA (tRNA, rRNA, snRNA, mRNA). Transcription
consists of three key steps: initiation, elongation, and termination. The transcripts (mRNA
molecules) contain the information to be translated into polypeptides that form proteins.

• Each gene has its own promoter(s). Promoters are sequences in the DNA just upstream of the
transcribed region that define the sites of initiation. The role of the promoter is to attract RNA
polymerase to the correct start site so transcription can be initiated.

• The mRNA transcripts are sometimes processed before they serve as a blueprint for a protein.
The processing involves the removal of intervening, non-coding sequences (introns) from the
transcript. The remaining segments (exons), whose codons will be expressed, are spliced together
to form the mature mRNA.
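
As a rough illustration of splicing, the sketch below joins exon segments and drops what lies between them. The function name `splice`, the coordinate convention (0-based, end-exclusive), and the toy transcript with its made-up intron are all assumptions for illustration; real splice sites are determined by the cell's splicing machinery, not by given coordinates.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments in order, dropping the introns between them.

    Each exon is a (start, end) pair, 0-based and end-exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


if __name__ == "__main__":
    # Toy transcript: exon "AUGCCG", made-up intron "GUAAGU", exon "GGAGUAUAG".
    pre_mrna = "AUGCCG" + "GUAAGU" + "GGAGUAUAG"
    exons = [(0, 6), (12, 21)]  # hypothetical exon coordinates
    print(splice(pre_mrna, exons))  # AUGCCGGGAGUAUAG
```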

How are genes regulated?

• An adult multi-cellular organism has a wide variety of cell types, such as muscle, nerve, and
blood cells, yet the different cell types contain the same DNA. This differentiation arises
because different cell types express different genes. Hence, genes can be switched on and off.

• There are some questions about the regulation of genes.

Q: What turns genes on and off?
Q: When is a gene turned on or off?
Q: Where (in which cells) is a gene turned on?
Q: How many copies of the gene product are produced?

• Regulatory sequences are binding sites for proteins. They are often short stretches of DNA (~25
nucleotides), consisting of inexactly repeating patterns called motifs. Motifs stand out as highly
conserved regions in a multiple sequence alignment.

Complete Genomes & Evolution

There has been an explosion of genomic data. Complete genomes of some organisms have been
sequenced (human, pig, dog, rat, mouse, etc.). DNA in these different organisms has been compared to
study evolution occurring at the DNA level, resulting from sequence edits (insertion, deletion, mutation)
and rearrangements (inversion, translocation, duplication). Similarity between DNA sequences has
suggested that all organisms come from a common ancestor, connected by an evolutionary tree (evolution
paradigm).

The evolutionary process occurs at different rates. If DNA mutations occur in non-critical regions, they
are incorporated into the next generation. If the mutations occur in critical regions, they are unlikely to be
propagated onward. However, some mutations have positive effects, and thus are conserved in
subsequent generations, such as in the case of the highly conserved Interleukin regions found in human
and mouse. Sequence conservation implies functionality. The fact that evolution did not modify a region
of the sequence suggests that it is functionally important to the organism.

[Figure: Interleukin regions in human and mouse]

Pairwise sequence alignment can be used to find sequences conserved between organisms. It can reveal if
sequences are related or not. This information can help to determine their functional and structural roles
and provide clues to the common ancestor.

Sequence Alignment

Given two strings, x = x1x2…xM and y = y1y2…yN, and a scoring function that rewards matched letters
and penalizes mismatches and gaps, an alignment is an assignment of gaps to positions 0,…,M in x and
0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other
sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
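
The definition can be turned into a small check. The sketch below is an assumption-laden illustration (the function name `is_valid_alignment` and the use of "-" as the gap character are choices made here): it tests that two gapped rows have equal length, that removing the gaps gives back x and y, and that no column pairs a gap with a gap.

```python
def is_valid_alignment(x: str, y: str, row_x: str, row_y: str, gap: str = "-") -> bool:
    """Check that two gapped rows form an alignment of x and y."""
    same_length = len(row_x) == len(row_y)
    # Removing the gaps must give back the original sequences, in order.
    keeps_letters = row_x.replace(gap, "") == x and row_y.replace(gap, "") == y
    # Every column lines a letter up with a letter or a gap, never gap with gap.
    no_double_gap = all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))
    return same_length and keeps_letters and no_double_gap


if __name__ == "__main__":
    x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
    y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
    row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
    row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
    print(is_valid_alignment(x, y, row_x, row_y))  # True
```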

Optimal Alignment

What is a good alignment? It is the “best” way to match the letters of one sequence with those of the
other. The problem is: how do we define “best”? If an alignment is a hypothesis that two sequences
come from a common ancestor through sequence edits, then optimal alignment means finding the
least-cost transformation of one sequence into another using a set of operations (sequence edits,
inversions, translocations, duplications). A simple measure of this cost is the edit distance
between two sequences, which is defined as the minimum number of edit operations (insertions,
deletions, substitutions) needed to transform the first string into the other. Since most DNA
changes during evolution are due to insertion, deletion, and substitution, the edit distance can be
used to roughly measure the number of DNA replications that occurred between two sequences.
Although the edit distance is not an accurate model of the underlying evolutionary process, it
serves as an approximation that is easy to optimize algorithmically.
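
To illustrate why the edit distance is easy to optimize, here is a minimal sketch of the standard dynamic-programming computation for unit-cost insertions, deletions, and substitutions. It is a textbook method shown for illustration, not necessarily the formulation used later in the course, and the function name `edit_distance` is an assumption.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    # prev[j] holds the distance between the prefix of x seen so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # keep a match, or substitute a with b
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("AGGCTA", "AGCTA"))  # 1 (delete one G)
```

Keeping only two rows of the table uses memory proportional to the length of y, while the running time is proportional to the product of the two lengths.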

Likewise, optimal alignment is the pairing of sequences that retains the order of letters in each sequence,
introducing gaps if necessary, such that the scoring function returns an optimal score.

Scoring Function

Match: +m
Mismatch: –s
Gap: –d

Score F = (# matches) * m – (# mismatches) * s – (# gaps) * d

The optimality of an alignment is measured by the value of the scoring function. The total
score of an alignment is the sum of terms for each pair of aligned letters and terms for each gap. A match
contributes a positive score of +m, a mismatch contributes a penalty of –s, and a gap contributes a
penalty of –d.
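
The scoring function can be applied directly to a pair of gapped rows. In the sketch below, the names `count_columns` and `alignment_score` and the default values m = s = d = 1 are assumptions chosen for illustration; the notes leave m, s, and d unspecified.

```python
def count_columns(row_x: str, row_y: str, gap: str = "-") -> tuple[int, int, int]:
    """Return (# matches, # mismatches, # gaps) over the columns of an alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(row_x: str, row_y: str, m: float = 1, s: float = 1, d: float = 1) -> float:
    """F = (# matches) * m - (# mismatches) * s - (# gaps) * d."""
    matches, mismatches, gaps = count_columns(row_x, row_y)
    return matches * m - mismatches * s - gaps * d


if __name__ == "__main__":
    row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
    row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
    print(count_columns(row_x, row_y))    # (24, 2, 11)
    print(alignment_score(row_x, row_y))  # 24 - 2 - 11 = 11
```

With these assumed unit values, the example alignment from the notes has 24 matches, 2 mismatches, and 11 gap columns, for a score of 11. Other choices of m, s, and d would rank alignments differently.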
