com/naturegenetics contents
editorial
Cover art by Darryl Leja
Spreading the word
Alan Packer
foreword
Power to the people
Andreas D Baxevanis & Francis S Collins
perspective
Genomic empowerment: the importance of public databases
Harold Varmus
user’s guide
A user’s guide to the human genome
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins
& Andreas D Baxevanis
Question 1
How does one find a gene of interest and determine that gene’s structure? Once the
gene has been located on the map, how does one easily examine other genes in that
same region?
Question 2
How can sequence-tagged sites within a DNA sequence be identified?
Question 3
During a positional cloning project aimed at finding a human disease gene, linkage
data have been obtained suggesting that the gene of interest lies between two
sequence-tagged site markers. How can all the known and predicted candidate genes
in this interval be identified? What BAC clones cover that particular region?
Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two
sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within
the coding region of a gene? Where can any additional information about the
function of these genes be found?
Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA
mapped in the human genome? Once its position has been determined, how would
one find alternatively spliced transcripts?
40
Question 7
How would an investigator easily find compiled information describing the structure
of a gene of interest? Is it possible to obtain the sequence of any putative promoter
regions?
Question 8
How can one find all the members of a human gene family?
Question 9
© 2002 Nature Publishing Group http://www.nature.com/naturegenetics
Are there ways to customize displays and designate preferences? Can tracks or
features be added to displays by users on the basis of their own research?
Question 10
For a given protein, how can one determine whether it contains any functional
domains of interest? What other proteins contain the same functional domains as
this protein? How can one determine whether there is a similarity to other proteins,
not only at the sequence level, but also at the structural level?
Question 11
An investigator has identified and cloned a human gene, but no corresponding
mouse ortholog has yet been identified. How can a mouse genomic sequence with
similarity to the human gene sequence be retrieved?
Question 12
How does a user find characterized mouse mutants corresponding to human genes?
Question 13
A user has identified an interesting phenotype in a mouse model and has been able
to narrow down the critical region for the responsible gene to approximately 0.5 cM.
How does one find the mouse genes in this region?
Acknowledgments
References
There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance. This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.

And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.

There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data…how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?

It is with these questions in mind that we present A User’s Guide to the Human Genome. Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.

As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.

Alan Packer
Nature Genetics
The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.

One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis still are terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.

The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries. As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.

Andreas D. Baxevanis and Francis S. Collins
National Human Genome Research Institute
Over the past twenty-five years, a mere sliver of recorded time, the world of biology (and indeed the world in general) has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the roundworm, the fruit fly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.

Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally importantly, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences. The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.

The operative principle most prominently involved in transmitting the fruits of genomics (the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology) has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the US National Library of Medicine. The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.

The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science. This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.

Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences, from basic genomics to biotechnology, throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.

From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.

The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades.

In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.

Harold Varmus
Memorial Sloan-Kettering Cancer Center
The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.
Authors
Tyra G. Wolfsberg
Kris A. Wetterstrand
Mark S. Guyer
Francis S. Collins
Andreas D. Baxevanis
National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
e-mail: andy@nhgri.nih.gov
In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence¹. The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated). The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward².

Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance. We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis. We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.

Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.

Current status of human genome sequencing
Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix³. As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.

Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome¹. The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality. As the data are initially made available in raw form, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.

Determining the human sequence: a brief overview
As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp¹,⁴. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing⁴. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere⁴.

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy¹. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.

1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.

2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).

3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly
of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank⁵, where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and assure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.

4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.

stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often in just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).

6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.

7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).

Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing)

length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions.

Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html.
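The accession conventions described in the RefSeq box and in the clone-update steps (a fixed accession whose version number rises by one with each resubmission) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names, the regular expression and the example identifier NM_123456 are hypothetical and are not part of any NCBI tool, and the prefix table covers only the five prefixes named in the box.

```python
import re

# Prefix meanings as listed in the RefSeq box; the labels are this sketch's own wording.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NT_": "genomic contig",
    "XM_": "model mRNA",
    "XP_": "model protein",
}

# Assumed shape: one or two letters, an optional underscore, digits,
# and an optional ".<version>" suffix (e.g. AC108475.2).
_ACCESSION_RE = re.compile(r"^([A-Z]{1,2}_?\d+)(?:\.(\d+))?$")


def split_accession(accession):
    """Return (base, version); version is None when there is no '.N' suffix."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    base, version = match.groups()
    return base, (int(version) if version is not None else None)


def refseq_type(accession):
    """Classify an accession by the RefSeq prefixes named in the box."""
    base, _ = split_accession(accession)
    return REFSEQ_PREFIXES.get(base[:3], "not a RefSeq prefix listed in the box")


def next_version(accession):
    """Same accession, version number increased by one, as for an updated clone record."""
    base, version = split_accession(accession)
    if version is None:
        raise ValueError("accession has no version suffix")
    return f"{base}.{version + 1}"
```

For example, `next_version("AC108475.2")` gives `"AC108475.3"`, matching the clone update described in the text, while `refseq_type("NM_123456")` (a made-up identifier) reports an mRNA reference sequence.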
simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database6 and the assembly using a fast protein-to-DNA sequence matcher7. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program8. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.

Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein9 and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.

The data, and sometimes the tools, change every day
The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps”2. A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide, but also, when referring back to it over time. Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user.

the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data
Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes7. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.

The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The genome can be viewed at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm8, which UCSC uses for RNA and comparative genomic alignments.

Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools, for example Entrez, the integrated information retrieval system that provides access to numerous component databases.

Although we have chosen to illustrate each example using resources available at a single site, almost all the questions in this guide can be answered using any of the three browsers. The

gate these public databases. The readers are encouraged to explore the alternative methods for answering the questions.

who seek to gain access to these three genome portals will see better performance with Internet Explorer.
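The transcript-placement rule quoted earlier in this section (at least 95% identity, with the aligned region covering 50% or more of the transcript length or at least 1,000 bases, and ties between equally good alignments all being kept) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alignment` record, the function names and the use of (identity, aligned bases) as the measure of alignment "quality" are hypothetical choices made here, not the annotation pipeline's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one mRNA-to-contig alignment."""
    contig: str
    identity: float          # fraction of identical bases, 0.0 to 1.0
    aligned_bases: int       # number of transcript bases in the alignment
    transcript_length: int   # full length of the mRNA


def passes_placement_rule(aln: Alignment) -> bool:
    """At least 95% identical AND (>= 50% of the length OR >= 1,000 bases)."""
    coverage = aln.aligned_bases / aln.transcript_length
    return aln.identity >= 0.95 and (coverage >= 0.50 or aln.aligned_bases >= 1000)


def select_placements(alignments: List[Alignment]) -> List[Alignment]:
    """Keep the best passing alignment; keep all of them if they tie.

    Assumption: 'quality' is compared as (identity, aligned_bases).
    """
    candidates = [a for a in alignments if passes_placement_rule(a)]
    if not candidates:
        return []
    best = max((a.identity, a.aligned_bases) for a in candidates)
    return [a for a in candidates if (a.identity, a.aligned_bases) == best]


if __name__ == "__main__":
    hits = [
        Alignment("contig_A", 0.99, 1500, 2000),
        Alignment("contig_B", 0.99, 1500, 2000),   # same quality: both kept
        Alignment("contig_C", 0.90, 2000, 2000),   # fails the identity test
    ]
    for hit in select_placements(hits):
        print(hit.contig)
```

Treating the two length conditions as alternatives is one reading of the rule as worded; a stricter pipeline could require both.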
Question 1
How does one find a gene of interest and determine that gene’s struc-
ture? Once the gene has been located on the map, how does one easily
examine other genes in that same region?
doi:10.1038/ng966
This question serves as a basic introduction to the three major genome viewers. One gene, ADAM2, will be examined using all three sites so that the reader can gain an appreciation of the subtle differences in information presented at each of these sites.

National Center for Biotechnology Information Map Viewer
The NCBI Human Map Viewer can be accessed from the NCBI’s home page, at http://www.ncbi.nlm.nih.gov. Follow the hyperlink in the right-hand column labeled Human map viewer to go to the Map Viewer home page. The notation at the top of the page indicates that this is Build 29, or the NCBI’s 29th assembly of the human genome. Build 29 is based on sequence data from 5 April 2002. The previous genome assembly, Build 28, was based on sequence data from 24 December 2001. To search for any mapped element, such as a gene symbol, GenBank accession number, marker name or disease name, enter that term in the Search for box and then press Find. For this example, enter ‘ADAM2’ and then press Find. The on chromosome(s) box may be left blank for text-based searches such as this one.

The resulting overview page shows a schematic of all of the human chromosomes, pinpointing the position of ADAM2 to the p arm of chromosome 8 (Fig. 1.1). The search results section shows that the gene exists on two NCBI maps, Genes_cyto and Genes_seq. Genes_cyto refers to the cytogenetic map, whereas Genes_seq refers to the sequence map. Clicking on either of those two links opens a view of just that map.

Detailed descriptions of these and other NCBI maps are available at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/humansearch.html. To get the most general overview of the genomic context of ADAM2, including all available maps, click on the item in the Map element column (in this case, ADAM2). This view shows ADAM2 and a bit of flanking sequence on chromosome 8p11.2 (Fig. 1.2). Three maps are displayed in this view, each of which will be discussed below. Additional maps, discussed in other examples in this guide, can be added to this view using the Maps & Options link.

The rightmost map is the master map, the map providing the most detail. The master map in this case is the Genes_seq map, which depicts the intron/exon organization of ADAM2 and is created by aligning the ADAM2 mRNA to the genome. The gene appears to have 14 exons. The vertical arrow next to the ADAM2 gene symbol (within the pink box) shows the direction in which the gene is transcribed. The gene symbol itself is linked to LocusLink, an NCBI resource that provides comprehensive information about the gene, including aliases, nucleotide and protein sequences, and links to other resources10 (see Question 10). The links to the right of the gene symbol point to additional information about the gene.
• sv, or sequence view, shows the position of the gene in the context of the genomic contig, including the nucleotide and encoded protein sequences.
• ev brings the user to the evidence viewer, a view that displays the biological evidence supporting a particular gene model. This view shows all RefSeq models, GenBank mRNAs, transcripts (whether annotated, known or potential) and expressed sequence tags (ESTs) aligning to this genomic contig. More information on the evidence viewer can be found on the NCBI web site by clicking Evidence Viewer Help on any ev report page.
• hm is a link to the NCBI’s Human–Mouse Homology Map, showing genome sequences with predicted orthology between mouse and human (Fig. 12.2).
• seq allows the user to retrieve the genomic sequence of the region in text format. The region of sequence displayed can easily be changed.
• mm is a link to the Model Maker, which shows the exons that result when GenBank mRNAs, ESTs and gene predictions are aligned to the genomic sequence. The user can then select individual exons to create a customized model of the gene. More information on the Model Maker can be found on the NCBI web site by clicking help on any mm report page.

The UniG_Hs map shows human UniGene clusters that have been aligned to the genome. The gray histogram depicts the number of aligning ESTs and the blue lines show the mapping of UniGene clusters to the genome. The thick blue bars are regions of alignment (that is, exons) and the thin blue lines indicate potential introns. In this example, the mapping of UniGene cluster Hs.177959 to the genome follows that of ADAM2, and all the exons align.

The Genes_cyto map shows genes that have been mapped cytogenetically; the orange bar shows the position of the gene. Although ADAM2 has been finely mapped and is represented by a short line, other genes, such as the group below it on a longer line, have been cytogenetically mapped to broader regions of chromosome 8.

Clicking on the zoom control in the blue sidebar allows the user to zoom out to view a larger region of chromosome 8. Zooming out one level shows 1/100th of the chromosome. There are 20 genes in the region, and all 20 are labeled (displayed) in this view (Fig. 1.3). The region of ADAM2 is highlighted in red on all maps. On the basis of the Genes_seq map, ADAM2 is located between ADAM18 and LOC206849.

University of California, Santa Cruz Genome Browser
The home page for the UCSC Genome Browser is http://genome.ucsc.edu/. At present, UCSC provides browsers not only for the most recent version of the mouse and human genome data, but also for several earlier assemblies. To use the Genome Browser, select the appropriate organism from the pull-down menu at the top of the blue sidebar (Human, in this case) and then click the link labeled Browser. On the resulting page, select the version of the human assembly to view. The genome browser from August 2001 is based on an assembly of the human genome done by UCSC using sequence data available on that date. The Dec. 2001
(Fig. 1.5). The section marked Known Genes shows the mapping of the NCBI Reference mRNA sequences to the genome. The mRNA Associated Search Results represent the mapping of other GenBank mRNA sequences to the genome. Click on the Known Genes link for ADAM2 (arrow, Fig. 1.5) to see the genomic context of the ADAM2 mRNA Reference Sequence (NM_001464).

The resulting zoomed-in view shows a region of chromosome 8 from base pair 36234934 to 36280132, located within 8p12 (Fig. 1.6). The blue track entitled Known Genes (from RefSeq) shows the intron–exon structure of known genes. The vertical boxes indicate exons and the horizontal lines introns. The ADAM2 gene seems to have 14 exons. The direction of transcription is indicated by the arrowheads on the introns. The tracks labeled Acembly Gene Predictions, Ensembl Gene Predictions and Fgenesh++ Gene Predictions are the results of gene predictions (see Question 7). Alignments of other database nucleotide sequences are shown in the Human mRNAs from GenBank, spliced EST, UniGene and Nonhuman mRNAs from GenBank tracks. Translated alignments of mouse and Tetraodon genomic sequence are in the mouse and fish BLAT tracks. Tracks displaying single-nucleotide polymorphisms (SNPs), repetitive elements and microarray data are shown at the bottom. Additional details about each track are available by selecting the track name in the Track Controls at the bottom.

To view the genomic context of ADAM2, zoom out 10× by clicking on the zoom out 10× box in the upper right corner. ADAM2 is located between TEM5 and ADAM18 (Fig. 1.7).

Ensembl
The Ensembl7 project, http://www.ensembl.org/, provides genome browsers for four species: human, mouse, zebrafish and mosquito. Click on Human to view the main entry point for the human genome. The current version of human Ensembl is version 6.28.1, based on the NCBI’s 28th build of the genome. To perform a text search, enter ‘ADAM2’ in the text box, and limit the search by selecting Gene from the pull-down search. Click on the upper button labeled Lookup. A single result is returned with a link to the ADAM2 gene (Fig. 1.8).

Click on either of the ADAM2 links to retrieve the GeneView window. The returned page contains four sections of data. The first section (Fig. 1.9) is an overview of ADAM2, including links to accession numbers and protein domains and families. Links to the Ensembl view of highly similar mouse sequences are presented in the Homology Matches section. Some of these fields will be described in more detail in later examples. The second section of the GeneView window provides information on the gene transcript (Fig. 1.10). The sequence of the cDNA is shown, as is a graphic of its intron–exon structure. A limited amount of the genomic context around the gene is shown schematically as well.

8p12. Clicking on any of these items recenters the display around that item. The section of interest is boxed in red on the DNA(contigs) map. The genes annotated by Ensembl as being around ADAM2 are Q96KB2 and ADAM18.

The bottom panel of the ContigView, the Detailed View (Fig. 1.14), shows a zoomed-in view of the boxed region, highlighting all features that have been mapped to this region of the human genome. The navigator buttons between the Overview and the Detailed View move the display to the left and right and zoom in and out. The features to be displayed can be changed by selecting the Features pull-down menu and then checking which features to view.

The Features shown in Fig. 1.14 are the defaults. The DNA (contigs) map separates items on the forward strand (above) from those on the reverse (below). The only feature on the reverse strand in this view is a single Genscan transcript, predicted by the GENSCAN gene prediction program11 (see Question 7). The forward strand shows five types of features. Starting at the bottom, the ADAM2 transcript is shown in red, indicating that it is a known transcript corresponding to a near-full-length cDNA sequence, protein sequence or both already available in the public sequence database. Black transcripts are predicted based on EST or protein sequence similarity. EST Transcr. links to individual aligning ESTs, whereas the UniGene track near the top displays UniGene clusters. The Genscan model on the forward strand contains many exons found in the known transcript. The Proteins and Human proteins boxes indicate protein sequences that align to this version of the genome, whereas NCBI Transcr. links to the NCBI Map Viewer. Positioning the computer mouse over any feature brings up the feature’s name and links to more detailed information.

The NCBI, UCSC and Ensembl sometimes use different symbols for the same genes, so it can be difficult to compare the views obtained by the different browsers. Furthermore, the three sites maintain independent annotation pipelines and do not all attempt to align the same mRNA sequences to the genome. The NCBI is currently displaying build 29, Ensembl shows build 28, and UCSC offers both builds 28 (December 2001) and 29 (April 2002), although all examples from UCSC in this guide will be illustrated using the better-annotated build 28. Because of the differences between the two assemblies, there are subtle discrepancies between what is shown at the NCBI and what is available at UCSC and Ensembl. However, it is fairly easy to navigate among the three sites. The NCBI, for example, links to Ensembl and UCSC through the black boxes at the top of LocusLink entries for human genes, and Ensembl directs users to NCBI and UCSC through the “Jump to” link in its ContigView. Some versions of UCSC’s Genome Browser have links to Ensembl and NCBI’s Map Viewer in the blue bar at the top of each browser page.
Question 2
How can sequence-tagged sites within a DNA sequence be identified?
doi:10.1038/ng967
The NCBI’s electronic PCR (e-PCR) tool12, which is part of the UniSTS resource, can be used to find STS markers within a DNA fragment of interest. UniSTS (http://www.ncbi.nih.gov/genome/sts/) contains all the available data on STS markers, including primer sequences, product size, mapping information and alternative names. Links to other NCBI resources such as Entrez, LocusLink and the MapViewer are also provided. e-PCR looks for potential STSs in a DNA sequence by searching for subsequences with the correct orientation and distance that could represent the PCR primers used to generate known STSs.

The e-PCR home page can be found by going to the NCBI home page, at http://www.ncbi.nlm.nih.gov, and then following the Electronic PCR link in the right-hand column. On the e-PCR home page, paste the sequence of interest or enter an accession number into the large text box at the top of the page. The accession number of the sequence for this example is AF288398. This sequence contains only one STS, stSG47693, which is located between nucleotides (nt) 2102 and 2232 of the sequence under study (Fig. 2.1).

Click on the marker name to bring up details of the STS from UniSTS (Fig. 2.2). The primer information and PCR product size are listed at the top of the page, along with alternative names for the marker. Often STSs are known by different names on different maps. Cross-references to LocusLink, UniGene and the Genebridge 4 map to which this STS was mapped are shown next. The mapping information section contains links to the NCBI’s MapViewer. At the bottom of the page, the Electronic PCR results show other sequences, including contigs, mRNAs and ESTs that may contain this STS marker.

To see the genomic context of the STS marker in all maps to which it has been mapped, click on the link labeled MapViewer at the top of the Mapping Information section. This map view (Fig. 2.3) shows two maps. Note that, in this view, the STS stSG47693 is called RH92759 (highlighted in pink). Gene Map ’99–Genebridge 4 (GM99_GB4, left) has 46,000 STS markers mapped onto the GB4 RH panel by the International Radiation Hybrid Consortium. The STS map (right) shows the NCBI’s placement of STSs onto the genome sequence assembly using e-PCR. Gray lines connect markers that appear in both maps, whereas the red line denotes where the STS RH92759 appears on both maps. In the region shown, there are a total of 211 STSs on the STS map, but only 20 are labeled in this view. To the right of the STS map, the green and yellow circles show the maps on which the STS markers have been placed. One can zoom in or out of this view by clicking on the lines of the zoom tool in the left sidebar.
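The idea behind e-PCR described above, searching a sequence for two primer sites in the correct orientation and at roughly the expected distance, can be illustrated with a short Python sketch. It is a simplified illustration, not the NCBI tool itself: the function names, the exact-match search, the `margin` parameter and the toy sequence and primers are all assumptions made for this example.

```python
from typing import List, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def find_sts(sequence: str, forward_primer: str, reverse_primer: str,
             expected_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Find candidate STS hits on the given strand of `sequence`.

    A hit is the forward primer followed, downstream, by the reverse
    complement of the reverse primer, with a product size within
    `margin` bases of `expected_size`. Returns 1-based inclusive
    (start, end) coordinates. Exact matching only (an assumption).
    """
    seq = sequence.upper()
    fwd = forward_primer.upper()
    rev_site = reverse_complement(reverse_primer)
    hits = []
    start = seq.find(fwd)
    while start != -1:
        site = seq.find(rev_site, start + len(fwd))
        while site != -1:
            end = site + len(rev_site)          # 0-based exclusive end
            if abs((end - start) - expected_size) <= margin:
                hits.append((start + 1, end))   # convert to 1-based inclusive
            site = seq.find(rev_site, site + 1)
        start = seq.find(fwd, start + 1)
    return hits


if __name__ == "__main__":
    # Toy sequence and primers, invented for illustration only.
    toy = "TTTT" + "ACGTAC" + "A" * 20 + "GGATCA" + "CCCC"
    print(find_sts(toy, "ACGTAC", "TGATCC", expected_size=32, margin=5))
```

A fuller version would also scan the opposite strand and tolerate a few primer mismatches; those refinements are left out to keep the sketch short.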
Question 3
During a positional cloning project aimed at finding a human disease
gene, linkage data have been obtained suggesting that the gene of
interest lies between two sequence-tagged site markers. How can all
the known and predicted candidate genes in this interval be identified?
What BAC clones cover that particular region?
doi:10.1038/ng968
some cases, it may be useful to select a page size larger than the default of 20 to view more data in the browser window.

Fig. 3.10 shows the maps, as specified in the Maps & Options window. The green dots to the right of the STS map show all the maps on which the markers appear. This is a fairly long region of chromosome 10, and not every STS marker is shown. In particular, although there are 611 STSs in this region, only 20 are shown by name in this view. For each known gene, the Genes_Seq map shows all the exons that have been mapped to the genome. Exons for individual known mRNAs are shown on the RNA (Transcript) map. Unless a gene is alternatively spliced, the Genes_Seq and RNA maps will be the same. The GScan (GenomeScan) map

more details about an entry, including the clone name, by clicking on the accession number to link to Entrez. The clone name is visible directly in the MapViewer if the Comp map is the master. A map can be quickly made the master map by clicking on the blue arrow next to its name.

Because this is a zoomed-out view of the chromosome, individual genes and GenBank entries are difficult to visualize. Zooming in, using the controls in the blue sidebar, will provide a region in more detail. Alternatively, click on the Data As Table View in the left sidebar to retrieve all data, including those hidden in this view, as a text-based table (partially shown in Fig. 3.11).
Question 4
A user wishes to find all the single nucleotide polymorphisms that lie
between two sequence-tagged sites. Do any of these single nucleotide
polymorphisms fall within the coding region of a gene? Where can any
additional information about the function of these genes be found?
doi:10.1038/ng969
The starting point for this search would be the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI13, which is located at http://www.ncbi.nlm.nih.gov/SNP. There is a series of links on the page that allow the user to search using either information about the database submission itself or information regarding genes and gene loci.

For this particular search, assume that the region of interest is known and defined by two STS markers, RH70674 and G32133. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names ‘RH70674’ and ‘G32133’ into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 81 within the region of interest. Go to page 3 of the display by entering ‘3’ in the Page box and clicking Display.

The resulting page (Fig. 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with ‘rs’). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).

The next set of columns, labeled Gene, indicates whether these SNPs are associated with particular features, such as genes, mRNAs or coding regions. The three columns (L, T and C) are either lit up or appear gray in every row. Taking each in order:
If the L (for locus) appears in blue, part or all of the marker position lies either within 2 kilobases (kb) of the 5′ end of a gene feature or within 500 bases of the 3′ end of a gene feature.
If the T (for transcript) appears in green, part or all of the marker position overlaps with a known mRNA. This does not mean, however, that the SNP marker necessarily falls within a coding region.
If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region.

The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates whether the marker has been validated (shown by a star) or is unvalidated (shown by light blue boxes). Validated markers have been verified by independent re-analysis of the sequence. All of the unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes, which, according to the scale at the top of the column, means that there is a >95% success rate in validation. This figure indicates the probability that this marker is real. (The success rate is defined as 1 – false-positive rate.)

In the penultimate column, the symbol TT (not shown here) indicates that individual genotypes are available for this marker. Finally, the Linkout Avail column indicates which markers are linked to other databases; a P in this column indicates that the variation has been mapped to a known protein structure. For a complete description of all the features within this display, click on any part of the header above the columns.

Returning to the original question, one of the SNPs displayed on this page does indeed fall within a coding region, as indicated by an orange C. To obtain more information on any particular SNP, simply click on the hyperlinked SNP Cluster ID. Clicking on rs1059133, for example, produces a new page, with all available information on that SNP (Fig. 4.2). Under the header marked Submitter records for this RefSNP Cluster is a list of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown in the next header. Under the header marked NCBI Resource Links are GenBank and NCBI RefSeq entries that are associated with this SNP. Scrolling further down on the SNP page (Fig. 4.3), the gene whose coding region this SNP falls within is indicated on the LocusLink Analysis section (ADAM2, a disintegrin and metalloproteinase domain 2). The SNP allele is G/C, a non-synonymous change leading to replacement of the Asp residue in the reference sequence by a His residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP.

To answer the final part of this question requires jumping from dbSNP to LocusLink10. To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM2 and provides numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM2 protein belongs to a family of membrane-anchored proteins that have been implicated in processes as diverse as fertilization, muscle development and neurogenesis.

One often-overlooked source of information on genes and gene products is OMIM14. This is an electronic version of the

Using the UCSC browser, users can retrieve the positions of genome annotations such as SNPs as a text file suitable for loading into a spreadsheet program. While looking at the browser for a defined chromosomal region, click on the Tables link (Fig. 1.6, upper blue bar). Similarly, to export a list of genome annotations in a defined chromosomal region at Ensembl, click on Export from any ContigView window (Fig. 1.14, center yellow bar).
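The meaning of the L, T and C indicators described above can be made concrete with a small Python sketch that assigns the three flags to a SNP position relative to one gene. This is a literal reading of the definitions in the text, not dbSNP's code: the `Gene` record, the function names and the example coordinates are hypothetical, exon intervals are assumed to stand in for the known mRNA, and coordinates are taken to be 1-based and inclusive.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


@dataclass
class Gene:
    """Hypothetical gene record used only for this illustration."""
    start: int
    end: int
    strand: str               # "+" or "-"
    exons: List[Interval]     # stands in for the known mRNA
    cds: List[Interval]       # coding portions of the exons


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def snp_flags(snp: Interval, gene: Gene) -> Dict[str, bool]:
    """Return the L, T and C flags for one SNP relative to one gene.

    L: within 2 kb of the 5' end or within 500 bases of the 3' end.
    T: overlaps the known mRNA (its exons).
    C: overlaps a coding region.
    """
    if gene.strand == "+":
        five_prime = (gene.start - 2000, gene.start - 1)
        three_prime = (gene.end + 1, gene.end + 500)
    else:  # on the reverse strand the 5' end is at the higher coordinate
        five_prime = (gene.end + 1, gene.end + 2000)
        three_prime = (gene.start - 500, gene.start - 1)
    return {
        "L": overlaps(snp, five_prime) or overlaps(snp, three_prime),
        "T": any(overlaps(snp, exon) for exon in gene.exons),
        "C": any(overlaps(snp, part) for part in gene.cds),
    }


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    gene = Gene(10000, 20000, "+",
                exons=[(10000, 10500), (15000, 15400), (19000, 20000)],
                cds=[(10200, 10500), (15000, 15400), (19000, 19500)])
    print(snp_flags((10100, 10100), gene))  # in the 5' untranslated region
    print(snp_flags((15100, 15100), gene))  # in a coding exon
```

The first call shows the case the text warns about: a SNP can overlap the mRNA (T) without falling in a coding region (C).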
Question 5
Given a fragment of mRNA sequence, how would one find where that
piece of DNA mapped in the human genome? Once its position has been
determined, how would one find alternatively spliced transcripts?
doi:10.1038/ng970
For the purpose of this example, the fragment of mRNA of interest is contained within GenBank accession number BG334944. First, retrieve the nucleotide sequence of this EST using the NCBI’s Entrez interface, at http://www.ncbi.nlm.nih.gov/Entrez/. Type ‘BG334944’ into the text box at the top of the page, change the pull-down menu to Nucleotide and press Go. The resulting page shows one entry, corresponding to accession number BG334944. To retrieve this sequence in FASTA format (a common format for bioinformatics programs), change the pull-down menu on this page to FASTA and then press Text (Fig. 5.1). A new web page containing only the sequence, in FASTA format, is produced (Fig. 5.2); copy the resulting sequence.

To determine where this sequence maps within the genome, use UCSC’s BLAT tool8. Begin this search by pointing your web browser to the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar on the side of the page, and then click Blat. Paste the FASTA-formatted sequence obtained from Entrez (above) into the large text box on the BLAT search page (Fig. 5.3), change the Freeze pull-down menu to Dec. 2001, change the Query pull-down menu to DNA and then press Submit. The server will (very quickly) return the search results; in this case, a single match of length 636 is found on the forward strand of chromosome 9 (Fig. 5.4).

To obtain more details on this hit, click the details link, to the left of the entry. A long web page is returned, with three major sections: the mRNA sequence (Fig. 5.5, top), the genomic sequence (Fig. 5.5, middle) and an alignment of the mRNA sequence against the genomic sequence (see Fig. 5.9 for an example). In the alignment in Fig. 5.5, matching bases in the cDNA and genomic sequences are colored in darker blue and capitalized. Gaps are indicated in lower-case black type. Light blue upper-case bases mark the boundaries of aligned regions on either side of a gap and are often splice sites.

Returning to the BLAT summary page for this search (Fig. 5.4), click on browser. This will produce a graphic representation of where this particular mRNA sequence aligns to the genome (Fig. 5.6). The track labeled Chromosome Band indicates that the mRNA maps to 9q34.11. The query sequence itself is represented on the line labeled Your Sequence from BLAT Search (arrow, Fig. 5.6). The sequence is shown as being discontinuous: regions of similarity are shown as vertical lines, gaps are shown as thin horizontal lines, and the direction of the alignment is indicated by the arrowheads. The aligned regions of the EST query correspond to the exons of a known gene, shown on the line immediately below (Known Genes, here RAB9P40). Typing the EST name, BG334944, directly into a UCSC search box would have generated a similar result to that shown in Fig. 5.6, but part of the purpose of this example is to illustrate the use of BLAT.

Approximately halfway down the graphic is a track labeled Human ESTs That Have Been Spliced. This track is at first shown in dense mode, with all the ESTs condensed onto a single line. To see all of the ESTs that align with the genome in this region, potentially representing differentially spliced transcripts, click on the track’s label. This will expand this area of the figure so that each EST occupies a single line (Fig. 5.7). The ESTs are of varying length, but most contain the same exons as the known gene and are (presumably) spliced in the same way. Close inspection indicates that some of the ESTs are missing one or more exons compared with the known gene. Consider the lines marked BE798864 and W52533: the former appears to be missing the fifth exon, whereas the latter is missing the fourth, fifth and sixth exons.

Any of the ESTs can be examined in more detail by clicking on that particular line. Here, click on the line for BE798864 (arrow, Fig. 5.7) to reach the information page for this EST (Fig. 5.8). The EST is 99.8% identical to the genomic sequence; clicking anywhere on the hyperlinked line in the section marked EST/Genomic Alignments returns the actual side-by-side alignment (Fig. 5.9). Differences exist at the ends of the EST, but the sequences are identical in the region surrounding the putative missing exon.

An alternatively spliced mRNA is more likely to be of biological significance when it changes the sequence of the encoded, wildtype protein. To determine whether EST BE798864 could encode a protein different from that of the known gene (RAB9P40), one can simply compare the two sequences directly against each other using the NCBI’s BLAST 2 Sequences tool. First, open a new web browser window, because information from the above search will be needed here; this will prevent having to use the browser’s Back and Forward keys excessively and is a good general rule when using multiple web tools. Then access the BLAST home page, at http://www.ncbi.nlm.nih.gov/BLAST. Select BLAST 2 Sequences, under the header labeled Pairwise BLAST. On this page, the user can simply enter accession numbers rather than cutting and pasting sequences into the text boxes. For the EST, simply enter its accession number

Ensembl also displays database hits that overlap with each exon in a transcript. These hits may include proteins as well as ESTs and mRNAs, and may illustrate alternatively spliced products. The hits are shown as green boxes in the TransView (Fig. 13.5), which can be accessed in a number of ways; for example, by clicking on the View Evidence box for a transcript on the GeneView (Fig. 1.10). Another good starting point for visualizing alternatively spliced transcripts is the NCBI’s Model Maker (follow the mm link in Fig. 1.2). The Model Maker displays putative exons from mRNAs, ESTs and gene predictions that align with the genome. Users can select individual exons from these alignments and build a customized gene model. As the Model Maker displays the nucleotide sequence of the model along with its three-frame translation, the effects of adding, modifying or deleting exons can be quickly evaluated.
sequence 2 (the known gene) is denoted as the subject. The predictions must, however, be tested computationally by looking
known gene’s protein translation is also shown, starting at the at the quality of the EST–genomic alignment as shown above.
end of the third row of the alignment. Examination of the align- Final proof of alternative splicing can, of course, only be gener-
ment shows that the EST is missing 153 nt (nt 360–512 of the ated at the laboratory bench.
Question 6
How would one retrieve the sequence of a gene, along with all anno-
tated exons and introns, as well as a certain number of flanking bases for
use in primer design?
doi:10.1038/ng971
This type of search can be initiated at the UCSC Genome Browser home page, located at http://genome.ucsc.edu. Select Human from the pull-down menu labeled Organism, and then click on Browser. This brings the user to the Human Genome Browser Gateway, from which a number of text- or position-based searches can be performed on current or older versions of the genome assembly. In this case, select the Dec. 2001 assembly, type the name of the gene of interest (PTPN1) into the position box, and then click Submit. The Browser returns all genes starting with the characters ‘PTPN1’ (Fig. 6.1). The gene of interest here is the one called PTPN1; click on the hyperlinked PTPN1 (arrow, Fig. 6.1) to view the genomic context of this gene (Fig. 6.2).
The text box at the top of Fig. 6.2 gives the absolute base pair position of this gene (chromosome 20, positions 48929540–49003636) and indicates that the gene spans 74 kb. The track labeled Chromosome Bands shows that PTPN1 is located at 20q13.13. Finally, the track marked Known Genes shows that the gene is on the forward strand, as the arrows on that track are pointing to the right. The exons within this gene are indicated by the vertical lines in the Known Genes track.
One way to obtain sequence upstream of a gene is described in Question 7. Here we explain how to retrieve flanking sequence on both sides of a gene. To retrieve an adequate amount of sequence with which to design primers, one can increase the size of the region displayed by changing the position numbers within the position box at the top of the figure. To add an additional 1,000 nt at the 5′ end and an additional 200 nt at the 3′ end, for example, change the text in the position box to ‘chr20:48928540-49003836’ and click Jump. This now redraws the graphic with the new boundaries.
To obtain the actual sequence within the region, click on the DNA link in the blue bar at the top of the page. This produces a new page, entitled Get DNA in Window (Fig. 6.3). Click the button next to extended case/color options and then click Submit. By selecting this option, the user can highlight features in the sequence by changing the format (case, underline, bold, italic) and/or color (red, green, blue) of the text. Colors can be varied in darkness and mixed together by changing the values in the boxes under Red, Green and Blue to any number between 0 and 255; examples of how to specify color in RGB (red-green-blue) format are given below the table. At this point, check the Toggle Case box in the Known Genes row, change the red saturation to 255 and leave the other color values set at zero (Fig. 6.4). Once the user clicks Submit, a new page is presented with the entire length of the sequence specified above (chr20:48928540-49003836), and the exons within this range are shown in red capital letters (Fig. 6.5). This genomic sequence can now be saved and imported into a primer design or sequence assembly package for further analysis.
The Extended DNA Case/Color Options page can be used to combine and differentiate between genomic tracks. For example, return to the Options page, leave the Known Genes row as before but now also check the Underline square in the Mouse Blat row of the table. Clicking Submit produces a page on which the human exons still appear in red capital letters, but hits from the mouse sequence are now shown as underlined text (Fig. 6.6). In this section of the gene, the conserved mouse sequence overlaps with the exons.
One way to retrieve sequence for a defined chromosomal region at the NCBI is with the seq link on the MapViewer, visible when the Gene_Seq map is the master (Fig. 1.2). At Ensembl, export genomic nucleotide sequence with the Export→FASTA link in any ContigView window (Fig. 1.14, center yellow bar).
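For readers who prefer to script the coordinate arithmetic, the window widening and the case toggling described for PTPN1 can be sketched in Python. This is a minimal illustration, not part of the UCSC browser: the function names padded_position and highlight_exons are hypothetical, the exon interval in the usage note is invented, and coordinates are assumed to be 1-based and inclusive as typed into the position box.

```python
def padded_position(chrom, start, end, upstream, downstream, strand="+"):
    """Return a browser-style position string widened by flanking bases.

    upstream/downstream are counted relative to the gene, so on the
    minus strand the upstream (5') padding is added to the right-hand end.
    """
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    else:
        new_start, new_end = start - downstream, end + upstream
    new_start = max(1, new_start)  # never run off the start of the chromosome
    return f"{chrom}:{new_start}-{new_end}"


def highlight_exons(sequence, window_start, exons):
    """Lower-case the whole window, then upper-case each exon interval.

    window_start is the genomic coordinate of the first base of `sequence`;
    exons is a list of (start, end) genomic coordinates, 1-based inclusive.
    """
    chars = list(sequence.lower())
    for exon_start, exon_end in exons:
        lo = max(exon_start - window_start, 0)
        hi = min(exon_end - window_start + 1, len(chars))
        for i in range(lo, hi):
            chars[i] = chars[i].upper()
    return "".join(chars)


# PTPN1 example from the text: 1,000 nt added 5' and 200 nt added 3'
# of a forward-strand gene at chr20:48929540-49003636.
print(padded_position("chr20", 48929540, 49003636, 1000, 200))
```

With the values from the text, the printed position is chr20:48928540-49003836. As a toy usage of the second function, highlight_exons("acgtacgtac", 101, [(103, 105)]) returns "acGTAcgtac", which mirrors the effect of the Toggle Case option on a sequence retrieved in lower case.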
Question 7
How would an investigator easily find compiled information describing
the structure of a gene of interest? Is it possible to obtain the sequence
of any putative promoter regions?
doi:10.1038/ng972
One place to initiate this search is at UCSC’s Genome Browser, at http://genome.ucsc.edu. For purposes of this example, consider the gene encoding pendrin (PDS), a protein associated with developmental abnormalities of the cochlea, sensorineural hearing loss and diffuse thyroid enlargement (goiter).
From the UCSC home page, choose Human from the pull-down Organism list, and click on Browser. The user is now at the Human Genome Browser Gateway. The search in this case is simple: select Dec. 2001 from the assembly pull-down menu, type pendrin into the position box, and then click Submit. The returned results indicate one known gene and two mRNA sequences; click on the accession number of the mRNA sequence AF030880 to continue. The user will now be presented with a graphic overview of the region containing this mRNA. To gain a better perspective of the region, click on the 1.5× button next to zoom out. Finally, click the reset all button in the middle of the page to reset the tracks to their default settings.
Carrying out these steps will produce an output similar to that shown in Fig. 7.1. For the purpose of this question, however, the default settings are not ideal. Using the Track Controls at the bottom of the figure, and following the example in Fig. 7.2, set some tracks to hide mode (not shown), others to dense (all data condensed onto one line) and some to full (a separate line for each feature, up to 300). Before considering the actual data within these tracks, a brief discussion of the content and representation of these tracks is warranted. Many were provided to UCSC by outside individuals. Further information on the gene prediction methods briefly discussed below can be found elsewhere (ref. 15).
The general convention for the Known Genes and predicted gene tracks (Fig. 7.1) is that each coding exon is shown as a tall, vertical bar or block. 5′ and 3′ untranslated regions are shown as shorter vertical bars or blocks.
Connecting introns are shown as very thin lines. The direction of transcription is indicated by the arrows along that thin line.
Known Genes are taken from mRNA reference sequences within LocusLink (ref. 10). These reference sequences have been aligned against the genome using BLAT.
The Acembly Gene Predictions With Alt-splicing track is derived from the alignment of human mRNA and EST sequence data against the genome, using the program Acembly. This program attempts to find the best alignment of each mRNA against the genome and considers alternative splice models. If more than one gene model with statistical significance can be produced, each of these is shown in the display. Additional information on Acembly can be found on the NCBI web site at http://www.ncbi.nih.gov/IEB/Research/Acembly/.
The Ensembl Gene Predictions track (ref. 7) is provided by Ensembl. The Ensembl genes are predicted by a range of methods, including homology to known mRNAs and proteins, ab initio gene prediction using GENSCAN and gene prediction HMMs.
The Fgenesh++ Gene Predictions come from a method that predicts internal exons by looking for structural features such as donor and acceptor splice sites, putative coding regions and intronic regions both 5′ and 3′ to a putative exon using a dynamic programming algorithm; the method also takes into account protein similarity data (ref. 16).
The Genscan Gene Predictions derive from a method called GENSCAN, through which introns, exons, promoter sites and poly(A) signals can be identified. Here, the method does not expect the query sequence to represent one and only one gene, so it can make accurate predictions for either partial genes or multiple genes separated by intergenic DNA (ref. 11).
The Human mRNAs from Genbank track shows alignments between human mRNAs in GenBank and the genome sequence. The Spliced ESTs and Human EST tracks show the alignment of ESTs from GenBank against the genome. Because ESTs usually represent fragments of transcribed genes, there is a high likelihood that an EST corresponds to an exonic region.
Finally, the Repeating Elements by RepeatMasker track shows, as its name would suggest, repetitive elements such as short and long interspersed nuclear elements (SINEs and LINEs), long terminal repeats (LTRs) and low-complexity regions (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker). It is customary to remove or ‘mask’ these elements before applying a gene prediction method to a nucleotide sequence.
Returning to the example shown in Fig. 7.2, notice that most of the tracks return a nearly identical gene prediction; as a rule, exons predicted by multiple methods increase the likelihood that the prediction is actually correct and does not represent a ‘false positive’. Most of the methods show a 3′ untranslated region, indicated by the heavy, shorter block at the left of the predictions. The Acembly track shows three possible alternative splices in addition to the full-length product shown in the third line of that section, a prediction that agrees with those shown in most of the other tracks. The Genscan track extends off to both the right and the left: GENSCAN can be used to predict multiple genes, and this display implies that the method has been applied in this fashion.
Although these graphical overviews are useful, the investigator will more often than not want the actual sequence corresponding to these blocks. For this example, the Fgenesh++ prediction will be used as the basis for obtaining raw sequence data, but the steps will be identical regardless of which track is chosen. Click on the track labeled Fgenesh++ Gene Predictions to go to a summary page describing the prediction (Fig. 7.3). The region has sequence similarity to the pendrin gene (which was already known at the beginning of the example). The size and the beginning- and end-points of the prediction are given, and it is indicated that the prediction lies on the minus strand; this was also indicated in Fig. 7.2 by the left-pointing arrows in the intronic regions. To obtain the sequence, click on Genomic Sequence. The user will be taken to a
The NCBI also provides gene predictions, computed using the program GenomeScan (ref. 17). These models are shown on the GenomeScan and Gene_Seq maps.
Question 8
How can one find all the members of a human gene family?
doi:10.1038/ng973
The HUGO Gene Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/) has been working to develop a unique symbol, as well as a longer and more descriptive name, for each human gene. Thus, members of many gene families, previously cloned in different laboratories and known by a variety of terms, now share a common gene symbol. A text search in any of the genome browsers will often return links to all named members of a gene family that have been mapped to the genome. Whereas Ensembl and UCSC currently return lists of the genes, the NCBI presents both a list and a graphical overview.
Go to the NCBI home page at http://www.ncbi.nlm.nih.gov/ and click on the Human map viewer link on the right side to access the Map Viewer search page. Enter the term ‘ADAM*[sym]’ in the text query box. The asterisk, or wild card, will match any string of characters, whereas the term [sym] limits the search to items with ADAM as their gene symbol. Other advanced search options are available by clicking the Advanced Search box or by reading the online documentation. The search returns 41 hits, which include members of the ADAM family as well as other related families whose names start with the term ‘ADAM’, such as ADAMTS and ADAMDEC. To limit the search to ADAM genes only, eliminate the undesired gene symbols with the Boolean NOT term, using the query ADAM*[sym] NOT ADAMTS*[sym] NOT ADAMDEC1*[sym]. The graphic at the top of the returned page shows the location of each gene with a red tick mark (Fig. 8.1). It is immediately clear that the 19 mapped ADAM genes are distributed among 11 chromosomes, and that some, such as those at the tips of the q arms of chromosomes 10 and 14, are close together. The list at the bottom of the page presents links to the 19 genes.
Another way to search for homologous genes in the genome is through a basic local alignment search tool (BLAST) search at the NCBI or Ensembl. BLAT searches at UCSC are not as sensitive as BLAST searches and may not find as many homologous genes. In this example, all genomic sequences homologous to the ADAM2 protein will be found using the Ensembl BLAST interface. From the Ensembl Human home page at http://www.ensembl.org/Homo_sapiens/, click on the link to BLAST. Paste the sequence of the ADAM2 protein (GenBank accession NP_001455.2) into the query box (having obtained the protein sequence from the NCBI’s Entrez database by following the steps in Question 5). Set the database to Homo sapiens, genomic sequence to search the Ensembl genome assembly, and choose TBLASTN as the executable (Fig. 8.2). Use the default parameters for the remaining settings. When done, click Search. The returned page will contain a retrieval ID (Fig. 8.3), which, when the search is finished, will link to the search results page (Fig. 8.4).
The top of the results page shows a graphical overview of the locations of hits. These hits may be to the entire protein or just to a single domain. The hits are colored by BLAST score, red being most similar, blue least similar and green intermediate. Some of the hits, like the pairs on the q arms of chromosomes 10 and 14, lie in positions similar to those of ADAMs mapped by the NCBI (Fig. 8.1), but others, such as those on chromosomes 12 and Y, are unique to the BLAST search. These unique hits may represent real members of the ADAM family that have not yet been named and would therefore not show up in a text-based search. Alternatively, they may be unnamed pseudogenes or nonsignificant BLAST hits. One gene on chromosome 1 is found in the text-based search at the NCBI but not in the BLAST search at Ensembl. The similarity between this gene and ADAM2 is not high enough for it to appear in the BLAST search using the default Ensembl parameters.
Clicking on an arrow next to one of the hits shown in Figure 8.4 activates a pop-up menu that gives the details of the BLAST report and provides links to the BLAST alignment and the ContigView (Figs 8.5 and 8.6, respectively, for the hit on chromosome 12). The hit on chromosome 12 contains a stop codon and is probably an intronless pseudogene. The bottom of the results page (Fig. 8.4) shows a summary of the BLAST hits. Clicking on a hit links to the BLAST alignment (Fig. 8.5). A link in the middle of the results page (Fig. 8.4) provides the entire BLAST report in standard format. Clicking on a hit in the BLAST report retrieves the ContigView for the region around the hit (similar to what is shown in Fig. 8.6).
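The logic of the wild-card query with Boolean NOT terms (ADAM*[sym] NOT ADAMTS*[sym] NOT ADAMDEC1*[sym]) can be illustrated offline with a short Python sketch. It does not reproduce the NCBI search engine; it only applies the same include/exclude pattern logic to a list of gene symbols already in hand. The function name select_symbols and the short symbol list are illustrative assumptions, and matching here is case-sensitive by choice.

```python
from fnmatch import fnmatchcase


def select_symbols(symbols, include, exclude=()):
    """Keep symbols matching `include` but none of the `exclude` patterns.

    Patterns use shell-style wild cards, where '*' matches any run of
    characters (including none), similar in spirit to ADAM*[sym].
    """
    return [
        s for s in symbols
        if fnmatchcase(s, include)
        and not any(fnmatchcase(s, pattern) for pattern in exclude)
    ]


# Illustrative symbol list only; a real list would come from a database export.
symbols = ["ADAM2", "ADAM10", "ADAMTS1", "ADAMDEC1", "ADAM", "MADAM"]

# Equivalent in spirit to: ADAM*[sym] NOT ADAMTS*[sym] NOT ADAMDEC1*[sym]
adam_only = select_symbols(symbols, "ADAM*", exclude=["ADAMTS*", "ADAMDEC1*"])
print(adam_only)
```

Listing the exclusions explicitly, rather than trying to write one pattern that matches only ADAM followed by digits, keeps the filter close to the form of the query typed into the Map Viewer.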
Question 9
Are there ways to customize displays and designate preferences? Can
tracks or features be added to displays by users on the basis of their
own research?
doi:10.1038/ng974
In this example, the UCSC browser will be used to view particular tracks. Start at the UCSC home page (http://genome.ucsc.edu), click on Browser in the blue sidebar on the left-hand side of the page, and set the Genome Browser Gateway to a region of interest. For example, one could set the genome to Human and the assembly to Dec. 2001, type chr22:38496887-39496866 into the position box, and click Submit to display a representative region of the December 2001 assembly of human chromosome 22. A number of tracks are already displayed in dense format (Fig. 9.1). Below the graphic showing the specified region are pull-down menus that allow the user to change the appearance of the graphic, under the heading Track Controls (Fig. 9.2). There are three options in each of these pull-down menus:
• Hide, which allows the user to eliminate that particular track from the display.
• Dense, which displays all annotations or features for that track on a single line.
• Full, which displays each annotation or feature for that track on a separate line; this is the ‘exploded view’ that is illustrated in a number of the questions in this guide.
Once the desired selections have been made, the user clicks on the refresh button to redraw the graphic. Further customization of individual tracks can be achieved by clicking on the track name in the Track Controls section of the browser. The user can, for example, customize the EST track controls to color red all ESTs from a certain library that contain a particular keyword in their GenBank entry or to eliminate all such ESTs from the display. The browser retains these selections for all subsequent sessions; the default settings can be restored by clicking on the reset all button.
One of the attractive features of the UCSC system is that users can add their own annotations, features or tracks to their local displays. These changes are not written or saved in any way to the original data held at UCSC. To customize the display, the user returns to the Human Genome Browser Gateway page and scrolls down to the Add Your Own Tracks section. Here, the user is presented with a large text box into which properly formatted text can be typed or pasted. Alternatively, the specifications can be in a text file, which the user can select by using the Browse button above the large text box. As another option, if the text file is posted on the user’s local web page, the user can share the custom track of annotations with other colleagues simply by telling them the URL of the file. Colleagues can then view the custom annotation by starting the UCSC browser and entering this URL into the large text box.
For the purposes of this example, enter the following text file into the entry field (Fig. 9.3) and click Submit at the top of the page:
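As a general illustration of what properly formatted custom-track text can look like (this is not the specific file used in this example), the Python sketch below assembles a small annotation in BED format with a browser line and a track line. The helper name bed_custom_track, the track name, the description and the three feature intervals are all hypothetical; only the chromosome and window are taken from the region used above. BED start coordinates are zero-based and end coordinates are exclusive.

```python
def bed_custom_track(name, description, features, color=(255, 0, 0), position=None):
    """Build custom-track text: optional browser line, track line, BED rows.

    features is an iterable of (chrom, start, end, label) tuples, with
    start zero-based and end exclusive, as BED expects.
    """
    lines = []
    if position:
        # Ask the browser to open on this region when the track is loaded.
        lines.append(f"browser position {position}")
    r, g, b = color
    lines.append(f'track name={name} description="{description}" color={r},{g},{b}')
    for chrom, start, end, label in features:
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical features inside the chr22 window used in this example.
features = [
    ("chr22", 38500000, 38501000, "cloneA"),
    ("chr22", 38750000, 38752500, "cloneB"),
    ("chr22", 39100000, 39100800, "cloneC"),
]

track_text = bed_custom_track(
    "myClones",
    "Hypothetical clone positions",
    features,
    color=(0, 0, 255),
    position="chr22:38496887-39496866",
)
print(track_text, end="")
```

The resulting text can be pasted into the large text box or saved to a file and selected with the Browse button, as described above. Generating it from a list of tuples makes it easy to regenerate the track when the underlying annotations change.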
Question 10
For a given protein, how can one determine whether it contains any
functional domains of interest? What other proteins contain the same
functional domains as this protein? How can one determine whether
there is a similarity to other proteins, not only at the sequence level, but
also at the structural level?
doi:10.1038/ng975
To demonstrate how to find functional domains within a protein, the human testis-determining factor TDF, also known as the sex-determining protein SRY, will be used as an example.
Although the search could be commenced from the Entrez search box on the NCBI home page, a better way to perform the initial search is from LocusLink (ref. 10). One of the advantages of using LocusLink lies in its standardization of gene and protein names with appropriate cross-referencing, making it more likely that the correct protein will be found on the first attempt. From the NCBI home page at http://www.ncbi.nlm.nih.gov/, choose LocusLink from the pull-down menu in the upper left corner, type the gene name, ‘TDF’, into the query box, and click Go. Four loci are returned (Fig. 10.1). The first column gives the Locus ID, which is a stable identifier associated with that gene locus. Clicking on the Locus ID produces a LocusLink report view; more detailed information on the report view can be found in the LocusLink Help feature and in the literature (ref. 15). The second column, marked Org, gives a shorthand version of the organism name. Here, there is one entry from Drosophila (Dm), one from mouse (Mm), one from human (Hs) and one from rat (Rn). A series of alphabet blocks shown to the right of each entry provide jumping-off points to other database resources. The locus of interest here is the third entry in the list, because that is the one for the human form of TDF/SRY.
To find additional information on the protein, click on the second P (in green) on that line. This takes the user to the protein entries corresponding to that particular LocusLink entry (Fig. 10.2). At this point, the user can click on any of the hyperlinks to look at the raw database information available on any of the proteins listed.
Consider the first entry in the list, an NCBI Reference Protein sequence with accession number NP_003131. To the right of the accession number is a series of hyperlinks. Clicking on the link labeled BLink will take the user to the BLink page for the protein of interest (Fig. 10.3). BLink stands for ‘BLAST Link’ and provides the graphical results of pre-computed BLAST searches that have been performed not just for this protein sequence, but for every protein sequence within the Entrez Proteins data domain. The pre-computed BLAST results for TDF/SRY are shown in the section beginning with the label ‘204 aa’. Across the top are a number of buttons that allow the user to ask a series of questions regarding their protein of interest. As the object of this question is to find the protein domains present within the TDF/SRY protein, the user can click on CDD-Search (Conserved Domain Database Search; ref. 18). Doing this will produce a graphical overview of any domains present within the protein, as well as a sequence alignment of those domains with the query sequence (Fig. 10.4). In this case, one functional domain is found: an HMG box, which is a DNA-binding domain found in many nuclear proteins. The domain was found in both of the databases comprising CDD (Pfam and SMART), as can be seen by looking at the accession numbers in the hit list.
To determine which other proteins contain this same HMG-box domain, click on the box labeled Show, right under the graphical view near the top of the page. This will invoke the domain architecture retrieval tool (DART). DART shows functional domains within a protein and, more importantly, other proteins with a similar domain architecture (Fig. 10.5). The query (the HMG box) is shown at the top of the page in red. Every other protein in the NCBI’s non-redundant sequence database having that same domain is then shown below the query, with the HMG box again colored red. Other domains within the found proteins are also shown, in various colors and shapes, with a key appearing at the bottom of the web page. Clicking on any of the links to the left would provide additional information about these new proteins.
Although a protein domain has now been identified within the query protein, no in-depth information has yet been provided about the function of that domain. Whereas a circuitous path could be followed from the DART page to find this information, an easier method is to use another web-based resource, called InterPro. InterPro is an integrated resource for information about protein families, domains and functional sites, bringing together information from a number of protein domain-based resources, such as PROSITE, PRINTS, Pfam and ProDom (ref. 19). The InterPro Simple Search engine can be accessed from the InterPro home page, at http://www.ebi.ac.uk/interpro. Clicking on Text Search, on the left, brings the user to the search page; for this search, type ‘HMG Box’ into the text box and hit Search. Three hits are returned (Fig. 10.6). For purposes of this example, follow the link from the first hit, for high mobility group proteins HMG1 and HMG2 (IPR000135). The resulting InterPro summary page (Fig. 10.7) provides information on the function, intracellular location and, most importantly, metabolic role of this particular protein within the cell, in an executive summary format. References are provided at the bottom of the web page for users who wish for more in-depth information about the domain. Users can also retrieve all of the full-length sequences containing the domain; the reader is referred to the InterPro documentation for more details.
The final part of this question asks whether similarity to the query protein can be found at the structural as well as the sequence level. Answering this question requires a new search against NCBI Structures. From the NCBI home page, change the pull-down menu in the query box at the top of the page to Struc-
At Ensembl, the GeneView links directly to the InterPro domain(s) found in the protein (Fig. 1.9).
Question 11
An investigator has identified and cloned a human gene, but no corre-
sponding mouse ortholog has yet been identified. How can a mouse
genomic sequence with similarity to the human gene sequence be
retrieved?
doi:10.1038/ng976
For purposes of this example, assume that the user does not already have the human sequence of interest to hand. The first step will be to locate the human gene of interest using the UCSC Genome Browser. Begin by pointing the web browser to the UCSC Genome Browser home page, at http://genome.ucsc.edu. Select Human from the Organism pull-down menu and then click on Browser; both are located on the blue navigation bar at the left side of the page. This will take the user to the Human Genome Browser Gateway. Select the Dec. 2001 version of the UCSC genome assembly, type the gene symbol ‘AGPS’ into the position box, and then click Submit. On the resulting page, follow the link for AGPS in the Known Genes section.
The result of the search on AGPS is shown in Fig. 11.1. In the main figure is a series of ‘tracks’, which are labeled along the left-hand side. The Known Genes track shows AGPS, corresponding to the query. Clicking on AGPS returns a summary of information on that gene, including the full name of the protein product (alkylglycerone phosphate synthase precursor), a link to the GeneCards database at the Weizmann Institute (ref. 20) and links to the translated protein, mRNA and genomic sequences. Focus now on the track labeled Mouse Translated Blat Alignments. This track shows the results of aligning the November 2001 version of the mouse genome assembly with the human genome using the program BLAT (ref. 8) in its translated protein mode. More details about the BLAT algorithm and about how the mouse BLAT track is automatically generated can be found by clicking on the Mouse Blat hyperlink found below the main graphical display.
Click anywhere within the Mouse Blat track to expand the single BLAT track so that it now shows each individual mouse sequence that aligns with human sequence in the region of interest (Fig. 11.2). Especially in a translated mode, mouse and human gene sequences are usually more similar in exons than in introns. Look carefully at the two alignments that derive from a mouse sequence called chr3 81178k (Fig. 11.2, arrow). On the Mouse Blat track, the brown vertical lines represent alignments and the horizontal lines are gaps. These alignments correspond to the blue vertical lines indicating the exons of AGPS on the Known Genes track.
To see the kind of information available for a translated BLAT alignment, click on the mouse genomic sequence labeled chr3 81178k. The resulting page (Fig. 11.3) provides the details of the alignment of the trace with the human genome assembly. This mouse genomic sequence is 607 nt in length and aligns with the human sequence in eight blocks. Within the blocks, the mouse and human sequences are 78% identical. To view the alignment itself, click on the View details of parts of alignment. . . link. On the resulting page (Fig. 11.4), the mouse sequence is shown on top, with the region of alignment in blue. The human genomic sequence is shown next, and a side-by-side alignment of the human and mouse sequences is at the bottom of the web page (not shown).
The NCBI’s UniGene_Mouse map shows alignments of mouse mRNA and EST sequences with the human genome. Add this map using Maps & Options (Fig. 3.9). The easiest way to find the mouse ortholog of a human gene is probably to use Ensembl’s precomputed Homology Matches. These matches, where available, link directly from a human gene to a putative mouse homolog (Fig. 1.9).
Question 12
How does a user find characterized mouse mutants corresponding to
human genes?
doi:10.1038/ng977
The NCBI provides a set of maps that show chromosomal regions homologous between mouse and human. This resource can be accessed directly at http://www.ncbi.nlm.nih.gov/Homology/. For this example using a known, mapped human gene, however, it is easier to start the search from the LocusLink entry for tyrosinase. The LocusLink (ref. 10) query page can be found at http://www.ncbi.nlm.nih.gov/LocusLink/. Select Human from the Organism pull-down menu, enter ‘tyrosinase’ into the Query box and click Go. To view the entry for tyrosinase (TYR), click on its LocusLink ID, 7299.

On the resulting page (Fig. 12.1), links to the mouse homology maps are in the section of the LocusLink summary page marked Relationships. In this case, there are four maps available for TYR showing mouse alignments: NCBI vs. MGD aligns the current NCBI assembly of the human genome with the MGD (Mouse Genome Database, ref. 21, at The Jackson Laboratory) genetic map; UCSC vs. MGD aligns the current UCSC genome assembly with the MGD genetic map; NCBI vs. EST-based RH Map aligns the NCBI assembly with the Whitehead–MRC RH map; and UCSC vs. Hudson et al. aligns the 7 October 2001 UCSC assembly with the Whitehead–MRC RH Map (ref. 22). The Hs and Mm links adjacent to each map name show the mouse–human homology map with the master chromosome as human or mouse, respectively. Click on the Hs link next to the NCBI vs. MGD map.

The resulting mouse–human map shows the mouse genes that are the likely orthologs of human genes on human chromosome 11 (Fig. 12.2). Depending on the browser being used, one may have to click the View as text box to obtain the output; the resulting output will appear in text format, slightly different from what is shown in Fig. 12.2. Chromosomal locations of the mouse genes are shown, where known. The green circles link to the UniSTS entry for each locus; those on the left link to the human UniSTS entry, whereas those on the right link to the mouse UniSTS entry. The cytogenetic positions are hyperlinked to either the human or mouse Map Viewer, as appropriate. Gene symbols are linked to LocusLink (ref. 10). The tyrosinase gene, highlighted in pink, maps to mouse chromosome 7 at 44 cM, a piece of information that will be needed in the next step.

The mouse models themselves are described at the Mouse Genome Informatics site at the Jackson Laboratory. Go to the Mouse Genome Informatics home page, at http://www.informatics.jax.org/, and use the Query Forms pull-down menu to select Linkage Maps. On the resulting page, customize the search to find the region around the mouse gene Tyr. Under Chromosome, set the number to 7; then set the chromosomal region to between 40 and 48 cM.

Many of the uncloned mouse mutants are not mapped in high-resolution crosses, and many mappings are carried out with a small number of mice, relative to another easy-to-score phenotype for another mouse mutant that maps to the same chromosome. It is thus necessary to be lenient in looking for potential uncloned mouse mutants (±4 cM relative to the location). In this case, as the NCBI data place the gene at 44 cM, the region from 40 to 48 cM should be searched.

Further down the page (Fig. 12.3), under Markers, set Include DNA segments to No to reduce the number of markers shown. Do include syntenic markers, which are DNA markers and mutant alleles linked to chromosome 7 that have not been finely mapped but that may be associated with a phenotype of interest relative to TYR. Under Comparative Maps, Show homologs from species, choose human (Homo sapiens). Select Show all markers. Use the default setting for all other options, and hit Retrieve.

The gene Tyr is found on page 2 of the output, at 44 cM (Fig. 12.4). The mouse chromosome is shown schematically on the left and expands as one moves to the right. In the rightmost columns are the names of the mouse markers in a particular region in blue and, if there is a corresponding human ortholog, the name of that ortholog in black. Some of the displayed mouse markers are genes, some are STSs, some are recessive mutants (all small letters) and some are dominant alleles (initial capital letter). At the bottom of the page are syntenic markers, those which have been mapped to chromosome 7 but not to an exact position.

Clicking the blue Tyr link at 44 cM opens up a summary of the Genes, Markers, and Phenotypes for that gene (Fig. 12.5). Of particular interest in this case are the phenotypic alleles. There are 99 mouse strains with mutations in the Tyr gene.

Users can also view chromosomal regions that are homologous between mouse and human by using Ensembl’s SyntenyView, available from the ContigView by clicking the Jump to→syntenyview link (Fig. 1.14, center yellow bar).
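The lenient search window used in this example (44 cM ± 4 cM, giving 40 to 48 cM) is simple arithmetic, sketched below. The function names and the two extra marker entries are hypothetical and serve only as illustration; only the Tyr position of 44 cM and the ±4 cM tolerance come from the text.

```python
def cm_search_window(position_cm, tolerance_cm=4.0):
    """Return (low, high) genetic-map bounds in cM, never below 0 cM."""
    low = max(0.0, position_cm - tolerance_cm)
    high = position_cm + tolerance_cm
    return low, high


def markers_in_window(markers, position_cm, tolerance_cm=4.0):
    """Return names of (name, position_cm) markers inside the window."""
    low, high = cm_search_window(position_cm, tolerance_cm)
    return [name for name, pos in markers if low <= pos <= high]


# Tyr at 44 cM is taken from the text; the other two entries are
# hypothetical placeholders, not real MGI markers.
example_markers = [
    ("Tyr", 44.0),
    ("hypothetical_marker_a", 39.5),
    ("hypothetical_marker_b", 47.0),
]
low, high = cm_search_window(44.0)
nearby = markers_in_window(example_markers, 44.0)
```

Here `low` and `high` give the values to enter as the chromosomal region on the Linkage Maps query form, and `nearby` keeps only the markers that fall inside that region.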
Figure 12.2
Figure 12.4
Question 13
A user has identified an interesting phenotype in a mouse model and
has been able to narrow down the critical region for the responsible
gene to approximately 0.5 cM. How does one find the mouse genes in
this region?
doi:10.1038/ng978
Ensembl provides a mouse genome browser, similar to the one available for humans. It is being updated with the latest mouse genome sequence assemblies and, at the time of writing, displays the MGSC version 3 assembly of the mouse genome, with sequence data from February 2002. The sequence is estimated to cover 96% of mouse euchromatic DNA, and Ensembl has predicted that it contains over 22,000 genes. Start at the Ensembl mouse home page, at http://www.ensembl.org/Mus_musculus/. Choose Marker from the pull-down menu, type the marker name ‘RH114718’ in the adjacent box, and press Lookup. Click either of the resulting links to view more details about this radiation hybrid marker. RH114718 has been mapped to a single position on chromosome 19 and is also known as MGI:102447, MTH1904 and D19MIT109 (Fig. 13.1). Click on the chromosomal position to view the genomic context of the marker (Fig. 13.2).

The Overview section of Fig. 13.2 shows a region of 1 Mb of chromosome 19 centered around the marker, labeled D19MIT109 in this view. More than 30 mouse genes are predicted in this region, some already known and some new. The Detailed View at the bottom of the page is a zoomed-in display of the region around the marker. To get a better view of the genes and transcripts in this region, zoom out on the bottom view by clicking on the longest bar in the zoom control (closest to the minus sign). The Detailed View will now show the same region of chromosome 19 as the overview, but with many additional features (Fig. 13.3). The splice patterns of the genes and gene predictions are shown, as are regions of homology between the genome and other proteins and mRNAs. Pointing the computer mouse at any feature allows the user to open a small menu that links to additional descriptions.

Consider the new gene indicated by the red arrow in Fig. 13.3. To view general information about this gene, hold the computer mouse over the gene graphic and select Transcript Information from the pop-up menu. The GeneView window (Fig. 13.4) provides a description of this gene, as well as a link to the GeneView window for the putative human ortholog (Fig. 13.4, Homology Matches section). To view the database sequences that align with the predicted exons of the new mouse gene, place the computer mouse pointer over the gene in the Detailed View (Fig. 13.3, arrow) and select Supporting evidence from the pop-up menu. Fig. 13.5 depicts the mRNA and protein sequences that align with exons in the new gene. Click on any of the green boxes to see the alignment of the database sequence with the new transcript.

The zoomed-out Detailed View also provides links to computed regions of orthology between the mouse and human genomes (Fig. 13.3, pink bars). As the mouse genome assembly and annotation lag behind those of the human, it may also be useful to view the human genes in an orthologous region of the genome.

UCSC also provides a mouse genome browser and the BLAT search tool for use with the latest mouse genome sequence assemblies. The links are available from the UCSC genome browser home page, at http://genome.ucsc.edu/. Mouse genome analysis tools developed at the NCBI, including a mouse Map Viewer and mouse BLAST pages, are available from http://www.ncbi.nlm.nih.gov/genome/guide/mouse/.
Figure 13.2
Figure 13.4
In working through the examples in the User’s Guide, the reader is exposed to a number of databases, web sites and other resources of enormous value for performing in silico analysis of biological data. Familiarity with and use of this vast arsenal can help the researcher to plan and execute experiments more intelligently. In using these resources and, more importantly, in drawing biological conclusions based on the results gleaned from these sites, however, there are a number of caveats and potential pitfalls of which the user should be aware. Although some of the specific points we now discuss go beyond the sample questions included in this guide, the basic lessons to be learned apply to the full range of bioinformatic analyses.

The user must understand the capabilities, and the limitations, of the programs being used. In the same way that molecular biologists need to understand the chemistry underlying a routine assay or the physics behind separation techniques, they must have a basic understanding of what search or analysis methods actually do once the ‘Submit’ button has been pressed. Understanding what the chemistry, physics or search methods can and cannot reveal is critical if the user is to extract the full meaning of the results but not overinterpret them. By understanding the methods, users can also optimize them and end up with a better set of results than if these sequence-based search methods were treated simply as a ‘black box’.

A specific case in which the reader could have encountered difficulty deals with the detection of domains within a protein, as described in Question 10. Consider the part of the question that discussed the Conserved Domain Database (CDD) at the NCBI. The CDD is a ‘secondary database’, one in which the entries have been derived from other databases, in this case Pfam (ref. 23) and the Simple Modular Architecture Research Tool (SMART; ref. 24). Pfam provides collections of multiple sequence alignments that represent known, common protein domains. Pfam is subdivided into two parts: Pfam A, which is manually curated, and Pfam B, which is automatically generated. By virtue of being ‘hand-crafted’, the entries in Pfam A are of higher quality and are therefore more reliable than those in Pfam B. Nevertheless, both Pfam A and Pfam B provide broad coverage across the spectrum of known protein domains.

The second source database, SMART, provides information on 500 domain families, but with a specific emphasis on those domains that have been implicated in signaling or have been found in extracellular or chromatin-associated proteins. This was a deliberate choice by the developers, who wished to tackle what might be called ‘tougher-to-detect’ or ‘tougher-to-define’ domains. At the outset, simply knowing the scope of the target database tells the user whether or not it is an appropriate choice for a sequence of interest, especially when some biochemical data may already be available. If users were to search solely against SMART and find nothing, without understanding the limited scope of the data underlying the resource, they might erroneously conclude that the protein of interest had no known domains.

Continuing with this example, and assuming that the user now understands the scope of the underlying source databases, a second problem quickly surfaces. When searching Pfam and SMART through the CDD interface at the NCBI, the search is performed using a variation of the BLAST algorithm called RPS-BLAST (ref. 25). If one were, however, to go directly to the Pfam or SMART web sites and issue the query there, the searches would be performed using a very different algorithm, a hidden Markov model (ref. 26). Although a description of the two different methods is beyond the scope of this discussion, it is important to understand that they are fundamentally different and will therefore produce different results. An extended discussion on this point, using specific examples, is available (ref. 27). The CDD front end will miss those SMART and Pfam entries that represent short domains, repeats and motifs (ref. 28). Understanding what the methods do does not require comprehending advanced mathematical equations: basic explanations in layman’s terms can be found in any one of a number of reviews or textbooks (refs 7,8).

One can often carry out a search and become excited on the identification of a motif; frequently such a motif is rather small. The Lys-Asp-Glu-Leu motif is an example; it targets proteins to the endoplasmic reticulum. But one should beware the ‘short-motif’ pitfall. The level of sequence identity required for significant homology is much higher for smaller regions: they either match or they don’t. For very short motifs, homology cannot be inferred by sequence identity, meaning that short motifs may not be at all helpful in describing what a protein does.

Longer motifs have greater power in identifying true positives and eliminating false positives. More importantly, the supporting information is made available by simply clicking past the first page of summary results provided by the search engine. Even, or especially, the newest of users is encouraged to click away and discover the information and assumptions underlying the results that the searches have produced. These are self-explanatory in many cases.

With respect to complete sequences, the reader is advised to recall that the preliminary analyses of the human genome sequence led to a large reduction in the estimated number of genes contained in the human genome. Earlier, numbers of the order of 80,000 to as high as 140,000 had been suggested (ref. 29). With the draft sequence of the genome in hand, new estimates lie closer to 30,000–35,000 genes (ref. 11). If this is correct, the human would have only twice as many genes as are observed in either the roundworm or the fruit fly (ref. 11). At the same time, human genes appear (in general) to have a more complex structure.

This pronounced ‘reduction’ in the number of genes in the human genome obviously challenges the one-gene, one-protein hypothesis (or, more properly, the one-gene, one-enzyme hypothesis; ref. 30), as the number of proteins in the human proteome is thought to be well in excess of 35,000 (ref. 11). One explanation of the large number of individual proteins that can be generated from this relatively small number of genes is alternative splicing, a process by which the transcripts from a single gene can be processed differently and thus give rise to several distinct proteins. Particularly germane to this discussion is that many proteins have more than one function, depending on where they are found in the cell or within the body as a whole.

An interesting example of this phenomenon is the multifunctional protein phosphoglucose isomerase (ref. 31). This protein catalyzes the interconversion of D-glucose-6-phosphate and D-fructose-6-phosphate. It is identical to neuroleukin, a protein secreted by T cells that promotes the survival of some embryonic spinal neurons and sensory nerves. It is also identical to an autocrine motility factor that might be involved in metastasis, and to a differentiation and maturation mediator implicated in
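The ‘short-motif’ pitfall discussed above can be made concrete with a rough back-of-the-envelope calculation. The sketch below estimates how often a motif of a given length would be expected to occur purely by chance. It rests on strong simplifying assumptions that are not from the original text: all 20 amino acids are taken to be equally frequent and independent, every residue is counted as a possible start position, and any positional requirement of a real motif is ignored. The function names and the database size are hypothetical.

```python
def chance_match_probability(motif_length, alphabet_size=20):
    """Probability that one fixed position matches an exact motif by chance,
    assuming uniform, independent residues (a deliberate simplification)."""
    return (1.0 / alphabet_size) ** motif_length


def expected_chance_hits(motif_length, total_residues, alphabet_size=20):
    """Approximate expected number of chance matches in a database,
    treating every residue as a potential motif start position."""
    return total_residues * chance_match_probability(motif_length, alphabet_size)


# Hypothetical database size, chosen only to make the arithmetic easy to follow.
TOTAL_RESIDUES = 16_000_000

hits_4_residues = expected_chance_hits(4, TOTAL_RESIDUES)
hits_10_residues = expected_chance_hits(10, TOTAL_RESIDUES)
```

Under these assumptions a four-residue motif matches a given position with probability 1 in 160,000, so on the order of a hundred chance matches would be expected in a database of this hypothetical size, whereas a ten-residue motif would be expected to match by chance far less than once. This is why a hit to a very short motif, taken alone, says little about homology or function.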
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map_search db.html
important in this age of genetic and genomic research. The following web sites provide an introduction to important issues related to genome biology as applied to human health and provide a jumping-off point for further information.

DOE ELSI Program
http://www.ornl.gov/hgmis/elsi/elsi.html

http://www.ornl.gov/hgmis/education/education.html

Genetics Education Center
http://www.kumc.edu/gec/

NHGRI Exploring our Molecular Selves Multimedia Kit
http://www.genome.gov/Pages/EducationKit/