
Lecture : 10

Introduction to Genomics

Course Instructor:
Dr. Anum Masood

Genome
Genome Size
• The complete DNA sequence defines what we call a genome.
• The genome is therefore the total genetic information that is carried
within the cell.
• That includes the DNA in the nucleus and DNA in any of the
organelles.
• This is new: it turns out that some organelles also contain DNA.
• In animals, this organelle is the mitochondrion; in plants, the
chloroplasts carry DNA as well (in addition to the mitochondria).
Genome
• Despite this, when we refer to the genome of a eukaryote, we will
usually mean the DNA in the nucleus, and we will refer to the genetic
material in the mitochondria as the mitochondrial genome.

• DNA in the nucleus is most often not a single molecule, but rather
broken into pieces and organized within the chromosomes.

• Humans have 23 pairs of chromosomes.

• But do not worry about chromosomes at this point (at least not for the
next few lectures).
Genome
Genome Size
• So what is the total length of the DNA sequence?
• It depends on the organism.
• Prokaryotes (e.g., bacteria) generally have the shortest genomes among
cellular organisms.
• The length of a DNA sequence is expressed in base pairs (bp); a base
pair is a unit consisting of two nucleobases bound to each other by
hydrogen bonds.
• Simply put, one base pair is one nucleotide on each strand.
• The total number of nucleotides in one of the strands is the size (length)
of the genome.
Genome
Genome Size
• Bacteria have genomes of length ranging from 0.5 to 13 Mbp.

• The unit Mbp means mega base pairs, or 10^6 base pairs.

• This is usually written simply as Mb (megabases).

• Genomes of eukaryotes are larger and range from 8 Mb to 670 Gb.

• Viruses have much smaller genomes, ranging in size from 5 to 50 kb.
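To compare the sizes quoted above, it helps to put them all in plain base pairs. The following is a minimal sketch; the helper name `to_base_pairs` and the unit table are illustrative assumptions, not part of the lecture.

```python
# Multipliers for the genome-size units used on these slides.
UNIT_TO_BP = {"bp": 1, "kb": 10**3, "Mb": 10**6, "Gb": 10**9}


def to_base_pairs(value, unit):
    """Convert a genome size such as (0.5, "Mb") to a whole number of base pairs."""
    return int(round(value * UNIT_TO_BP[unit]))


# Ranges quoted on the slide: bacteria 0.5-13 Mb, eukaryotes 8 Mb-670 Gb, viruses 5-50 kb.
smallest_bacterial = to_base_pairs(0.5, "Mb")
largest_eukaryotic = to_base_pairs(670, "Gb")
print(smallest_bacterial)                        # size in bp
print(largest_eukaryotic // smallest_bacterial)  # fold difference between the two
```

Working in integers of base pairs avoids mixing units when genomes of very different sizes are compared.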


Genome
Genome Size
What should be cheaper and faster: DNA/RNA sequencing or
protein sequencing?

DNA/RNA sequencing is faster and cheaper, simply because of the
smaller alphabet: four nucleotides vs. twenty amino acids.
The two strands of DNA are complementary!
Why complementarity?

T always faces A, while G always faces C, in a one-to-one
reciprocal relationship.
If we know the sequence of one strand, we can derive the
sequence of the other strand.
quiz

There is 20% adenine in the genome of a newly sequenced bacterium
from the Antarctic. What is the percentage of G, C, and T in the
genome?

Acknowledgements: ©Arshan Nasir, PhD


quiz
Answer: because A always pairs with T, if A = 20% then T is also 20%.
G and C therefore make up the remaining 60%, and because G always
pairs with C, G = 30% and C = 30%.
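The quiz reasoning can be written as a short calculation. This is a sketch assuming double-stranded DNA with strict A-T and G-C pairing; the function name `base_percentages` is made up for illustration.

```python
def base_percentages(percent_a):
    """Given the percentage of adenine, derive T, G and C for double-stranded DNA."""
    if not 0 <= percent_a <= 50:
        raise ValueError("adenine must be between 0% and 50% of a double-stranded genome")
    percent_t = percent_a                  # A pairs with T, so T = A
    percent_gc = 100 - 2 * percent_a       # whatever is left belongs to G and C together
    percent_g = percent_c = percent_gc / 2 # G pairs with C, so they split it equally
    return {"A": percent_a, "T": percent_t, "G": percent_g, "C": percent_c}


quiz = base_percentages(20)
print(quiz)
```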



Example
• 5’-ATGCTGA-3’

• What is the complementary sequence?
• 5’-ATGCTGA-3’
• 3’-TACGACT-5’

• How is this reported?
• By convention both strands are written 5’ to 3’:
5’-ATGCTGA-3’ and 5’-TCAGCAT-3’

• What does it mean?
• The two sequences correspond to facing strands of the same DNA
molecule
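The example above can be reproduced in a few lines. This is a minimal sketch assuming an uppercase or lowercase sequence containing only A, T, G and C; the function names are illustrative.

```python
# Translation table implementing the pairing rules A<->T and G<->C.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(seq):
    """Base-by-base complement, read in the same left-to-right order (so 3' to 5')."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Complementary strand written in the conventional 5' to 3' direction."""
    return complement(seq)[::-1]


strand = "ATGCTGA"
print(complement(strand))          # the facing strand, 3' to 5'
print(reverse_complement(strand))  # the same strand reported 5' to 3'
```

Reversing the complement is what turns the 3’-to-5’ reading of the facing strand into the 5’-to-3’ form in which sequences are normally reported.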
Palindromes
• A fascinating property of DNA complementarity is that sometimes the
two strands read identically, each in its own 5' to 3' direction

• Such sequences are known as palindromes and are very important
• Recognized by restriction enzymes
• Important binding sites

• A palindromic sequence is a nucleic acid sequence (DNA or RNA) that
is the same whether read 5' to 3' on one strand or 5' to 3' on the
complementary strand with which it forms a double helix.
Palindromes
• A mirror-like palindrome reads the same forward and backward on a
single strand of DNA, as in GTAATG

• An inverted-repeat palindrome is also a sequence that reads the same
forward and backward, but the forward and backward sequences are
found on complementary DNA strands (GAATTC being complementary
to CTTAAG)
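The two kinds of palindrome can be told apart with two simple checks. The sketch below assumes sequences over A, T, G and C only; the function names are made up for illustration.

```python
# Pairing rules A<->T and G<->C as a translation table.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def is_mirror_palindrome(seq):
    """True if the sequence reads the same forward and backward on one strand."""
    seq = seq.upper()
    return seq == seq[::-1]


def is_inverted_repeat(seq):
    """True if the sequence equals its own reverse complement (both strands read alike 5' to 3')."""
    seq = seq.upper()
    return seq == seq.translate(COMPLEMENT)[::-1]


for candidate in ("GTAATG", "GAATTC"):
    print(candidate, is_mirror_palindrome(candidate), is_inverted_repeat(candidate))
```

With the two slide examples, GTAATG passes only the mirror check and GAATTC passes only the inverted-repeat check.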
One more stop: the genetic code
• Proteins and RNA are encoded by DNA

• Sequencing proteins is difficult (isolation, funding restrictions, and
larger alphabets)

• So what should we do?

• Can we predict the protein and RNA sequences simply from DNA?
• Yes. This is a very useful activity in bioinformatics
DNA Coding Regions: Pretending
to Work with Protein Sequences

• Determining the sequence of a protein is much more difficult than
sequencing DNA, but all the proteins that a given organism (whether
microbe or human being) can synthesize are encoded in the DNA
sequence of its genome.
• Thus, the smart shortcut that molecular biologists have been using is to
read protein sequences directly at the information source: in the DNA
sequence!
• This way, we can pretend to know the amino-acid sequence of a protein.
Turning DNA into proteins: The genetic code
• When you know a DNA sequence, you can translate it into the
corresponding protein sequence by using the genetic code, in the very
same way the cell itself generates a protein sequence.
• The genetic code is universal (with some exceptions; otherwise life
would be too simple!), and it is nature’s solution to the problem of how
one uniquely relates a sequence over a 4-nucleotide alphabet
(A, T, G, C) to a suite of 20 amino acids.
• We’re using symbols (rather than actual chemicals) to do the same.
• Understanding how the cell does this was one of the most brilliant
achievements of the biologists of the 1960s.
• Yet the final answer can be contained in a (miraculously small) table.
Turning DNA into proteins: The genetic code
From a given starting point in your DNA sequence,
start reading the sequence 3 nucleotides (one triplet)
at a time. Then consult the genetic code table to read
which amino acid corresponds to the current triplet
(technically referred to as a codon).
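The triplet-by-triplet lookup described above can be sketched as follows. The table is the standard genetic code written in the compact T, C, A, G ordering, with `*` marking stop codons; the function name `translate` and the input sequence are illustrative assumptions, and the sketch ignores the organism-specific exceptions mentioned earlier.

```python
from itertools import product

# Standard genetic code: codons enumerated with bases in the order T, C, A, G
# (third position varying fastest), paired with one-letter amino-acid codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, start=0):
    """Read dna three nucleotides at a time from `start` and look up each codon."""
    dna = dna.upper()
    protein = []
    for i in range(start, len(dna) - 2, 3):  # a trailing incomplete triplet is ignored
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


print(translate("ATGGCCTAA"))  # hypothetical sequence: ATG, GCC, then a stop codon
```

The whole table fits in a 64-entry dictionary, which is the "miraculously small" table the slides refer to.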
Turning DNA into proteins: The genetic code
• If your DNA sequence is correctly listed in the 5' to 3' orientation, you
generate the protein sequence in the conventional N- to C-terminus
orientation as well.
Turning DNA into proteins: The genetic code
• Thus, if you know where a protein-coding region starts in a DNA
sequence, your computer can pretend to be a cell and generate the
corresponding amino-acid sequence!
• This simple computer translation exercise is at the origin of most of the
so-called protein sequences that you can find in databases.
Turning DNA into proteins: The genetic code
• The resulting protein sequence depends entirely on the way you
converted your DNA sequence into triplets before using the genetic
code.
• For instance, using the second position as the starting point leads to a
different set of triplets and therefore a different translation.
• Beginning with the third position (GGA-AGT- . . .) again leads to an
entirely different translation.
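The effect of the starting position is easy to see by splitting one sequence into triplets from each of the three possible offsets. This is a sketch only: the sequence below is a hypothetical example (not the one from the slide figure), and `split_codons` is an illustrative name.

```python
def split_codons(seq, offset):
    """Split seq into complete triplets, starting at the 0-based position `offset`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


dna = "ATGGCCAAGTAA"  # hypothetical sequence, chosen only for illustration
for offset in range(3):
    # Offset 0 is the first position, offset 1 the second, offset 2 the third.
    print(offset + 1, "-".join(split_codons(dna, offset)))
```

Each starting position yields a different list of codons, so each would be translated into a different amino-acid sequence.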
Turning DNA into proteins: Reading Frames
• Because of the triplet-based genetic code, a given DNA interval, on a given strand,
can theoretically be translated in three different ways.
• These three perspectives are known in the field as reading frames.

• Because either strand of the DNA can be read, there are six possible reading
frames in total for translating a DNA sequence into proteins.

• With very few exceptions (found in exotic viruses), only one of these six frames is
used for any given DNA coding region.
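The six reading frames can be enumerated by combining the three offsets with the two strands. The sketch below assumes the frame labels +1 to +3 (given strand) and -1 to -3 (complementary strand, read 5' to 3'), a common labelling but not one stated in the lecture; function names are illustrative.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(seq):
    """Complementary strand written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_codons(seq, offset):
    """Complete triplets of seq starting at 0-based position `offset`."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the codons of all six reading frames, keyed by frame label."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = split_codons(forward, offset)
        frames[f"-{offset + 1}"] = split_codons(reverse, offset)
    return frames


for label, codons in sorted(six_frames("ATGCTGA").items()):
    print(label, codons)
```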
Open Reading Frame (ORF)
• An interval of DNA sequence that begins at a start codon (ATG, which codes for
M = methionine) and remains free of stop codons (TAA, TGA, or TAG) is called an
open reading frame (ORF)
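A simple ORF scan following this definition might look like the sketch below. It makes several assumptions that the lecture does not state: only the given strand is scanned, an ORF is reported from the first ATG in a frame up to (not including) the next in-frame stop codon, and an ORF with no stop before the end of the sequence is kept open to the last complete codon. The function name and the test sequence are illustrative.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(seq):
    """Return (start, end, orf_sequence) tuples for the three frames of one strand.

    Positions are 0-based; `end` is exclusive and the stop codon is not included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                          # an ORF opens at the first ATG
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i, seq[start:i]))
                start = None                       # closed; look for the next ATG
        if start is not None:
            # No in-frame stop was found: keep the ORF up to the last complete codon.
            end = start + ((len(seq) - start) // 3) * 3
            orfs.append((start, end, seq[start:end]))
    return sorted(orfs)


print(find_orfs("CCATGGCCAAGTAAGG"))  # hypothetical sequence with one ORF in the third frame
```

To cover all six frames, the same scan can be run again on the reverse complement of the sequence.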
[Figures: Six ORFs Example-1; Six ORFs Example-2]
Turning DNA into proteins: The genetic code

• Some DNA sequences do not encode proteins at all, and higher
organisms have large pieces of noncoding DNA inserted within
their genes.

• A large part of bioinformatics is devoted to the development of
methods to locate protein-coding regions in DNA sequences, to
delineate precisely where genes start and end, or where they are
interrupted by the noncoding intervals (called introns).
Questions
