
Gene 417 (2008) 1–4

Contents lists available at ScienceDirect

Gene
journal homepage: www.elsevier.com/locate/gene

Review

What is a gene? An updated operational definition


Graziano Pesole
Dipartimento di Biochimica e Biologia Molecolare "E. Quagliariello", Università di Bari, Via Orabona 4, 70126 Bari, Italy
Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D, 70125 Bari, Italy

Article info

Article history: Received 17 September 2007; received in revised form 28 February 2008; accepted 6 March 2008; available online 26 March 2008

Keywords: Genomics; Bioinformatics; Alternative splicing; Alternative transcription start sites; Alternative transcription termination

Abstract

A crucial pre-requisite for large-scale annotation of eukaryotic genomes is the definition of what constitutes a gene. This issue is addressed here in the light of novel and surprising gene features that have recently emerged from large-scale genomic and transcriptomic analyses. The updated operational definition proposed here is: a gene is a discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs). This definition is specifically designed for eukaryotic chromosomal genes and emphasizes the commonality of the genetic material that gives rise to final, functional products (ncRNAs or proteins) derived from a single gene. It may be useful in several applications and should help in the provision of a comprehensive inventory of the genes of a given organism, finally allowing answers to the basic question of how many genes are encoded in its genome. © 2008 Elsevier B.V. All rights reserved.

Contents
1. Problematic issues in eukaryotic chromosomal gene definition
2. An updated operational gene definition
Acknowledgments
References

1. Problematic issues in eukaryotic chromosomal gene definition

A major goal of a genome sequencing project for a specific organism is the definition of its entire gene complement. To accomplish this task, several fairly accurate gene prediction tools are generally used together with large-scale production of expression evidence (e.g. cDNA and EST sequences). In this review I deal with the problem of defining what a gene is, a crucial pre-requisite for large-scale annotation of eukaryotic genomes. By comparison, gene assessment in prokaryotic genomes is much simpler owing to their higher gene density (about 80% of a prokaryotic genome is protein coding) and the lack of introns. The

Abbreviations: AS, alternative splicing; miRNA, micro RNA; ncRNA, non-coding RNA; ORF, open reading frame; TSS, transcription start site; TTS, transcription termination site; TU, transcriptional unit; UTR, untranslated region.
Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Via Orabona, 4, 70125 Bari, Italy. Tel.: +39 080 5443588; fax: +39 080 5443317. E-mail address: graziano.pesole@biologia.uniba.it.
0378-1119/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2008.03.010

identification of significantly large Open Reading Frames (ORFs) is an obvious solution for the identification of the majority of protein coding prokaryotic genes. Short prokaryotic genes are more problematic but can generally be identified with suitable bioinformatics approaches validated by transcription and translation evidence.

To date, 77 eukaryotic genome projects have been completed (Liolios et al., 2006), but for none of them are we able to answer the simple question of how many genes they contain. This is mostly due to gaps in the genome sequences and to the incompleteness of gene annotation. However, even if all gaps were closed and a full gene annotation were available and validated by comprehensive transcriptional evidence, we could still be unable to provide a reliable estimate of the gene number, in part because of the lack of a clear and unambiguous definition of what a gene is. Several definitions have been proposed, such as this one from one of the most widely used Molecular Biology textbooks: "A gene is the segment of DNA specifying a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding segments (exons)" (Lewin, 2007; http://www.ergito.com/).

G. Pesole / Gene 417 (2008) 1–4

This exemplar definition, apart from its ambiguous use of the term exon, is hardly satisfactory as it does not consider some problematic gene features recently highlighted by work carried out at the RIKEN Institute on the transcriptional landscape of the mouse genome (Carninci et al., 2005) and most recently by the international Encyclopedia of DNA Elements (ENCODE) project (Gerstein et al., 2007), which strongly challenge the conventional view of genes. Indeed, the classical "one gene, one protein" definition is no longer acceptable and is also impractical (Pearson, 2006). In particular:

1) A large fraction of genes do not encode proteins. Indeed, over 50% of the transcriptional units (TUs) identified in mouse do not appear to be coding, and the majority of them are alternatively spliced and polyadenylated.
2) The same gene locus may encode a large variety of transcripts and proteins through alternative transcription start sites (TSS), alternative transcription termination sites (TTS) and alternative splicing (AS). In some cases AS may generate mRNAs encoding completely unrelated proteins using different coding frames.
3) Some genes have been found to overlap each other on the same or opposite strands. The discontinuous structure of eukaryotic genes potentially allows "Russian doll" gene models, where one gene can be completely contained inside one or more introns of another gene without sharing any exonic regions.
4) The ligation of two distinct mRNA molecules encoded by separate gene loci through the trans-splicing mechanism is another phenomenon widespread in some eukaryote lineages such as

nematodes and ascidians (Hastings, 2005), which may further increase the complexity of the gene expression pattern.
5) Finally, recent computational and experimental analyses point to the existence of chimeric transcripts produced by the co-transcription of tandem gene pairs, and potentially encoding fusion proteins (Parra et al., 2006).

2. An updated operational gene definition

In the light of the above features one might ask whether it is still appropriate to maintain a gene-centric view of molecular biology, or whether it is better to consider only functional products (proteins and ncRNAs) that may be in some way related by the molecular processes involved in their expression, such as the sharing of a promoter (or TSS), a transcription termination site (TTS) or one or more splice sites. Indeed, to understand the relationships between the different cellular components in a systems biology framework, it may be more appropriate to consider functional products rather than genes, in the light of their specific expression in different conditions (e.g. tissue, developmental stage or pathological status). However, I believe that despite the many problems that have emerged in recent years it would be premature to announce the death of the gene concept, mostly because the tight connection between a functional product and its encoding genetic material cannot be disregarded. At the same time, an updated operational definition is needed to allow the unambiguous association between transcripts, proteins, and their encoding genes. In agreement with Gerstein et al. (2007) this updated definition should adopt a bottom-up criterion, i.e. emphasize the ultimate

Fig. 1. The discrete genomic region depicted here encodes one non-coding and eight protein coding spliced transcripts (ncRNA in yellow; 5′UTR and 3′UTR in light and dark pink, respectively; protein coding sequence in green; dotted lines represent RNA removed or spliced out by maturation). Four different genes (numbered 1–4) can be annotated according to the gene definition proposed here. A specific set of transcripts can be clustered and assigned to the same gene if the transcript projections on the genome sequence, limited to the regions encoding the final products (e.g. the green and the yellow boxes for the protein coding and non-coding RNA genes, respectively), overlap each other. The clustering procedure is iterated and may include non-overlapping transcripts in the same gene cluster. For example, in the case of gene 3, the transcript isoforms encoding products DE and FE are clustered because they overlap through the region E; the transcript FG is then added to this cluster because region F overlaps with one member of the cluster. The transcript encoding the product AE can be identified as a chimeric transcript originating from the concatenation of two exons belonging to two different genes, as these two exons are prevalently expressed by two unrelated genes (i.e. genes 2 and 3). The gene coordinates, denoted by the arrowed lines, are the leftmost and rightmost mapping positions on the genome of all transcripts belonging to the same gene cluster. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
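The clustering rule described in the legend of Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of any published tool: the `Transcript` class, the function names and the toy coordinates are hypothetical, all transcripts are assumed to lie on the same strand, and intervals are taken as half-open.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


@dataclass(frozen=True)
class Transcript:
    name: str
    span: Interval                         # extent of the precursor transcript
    product_regions: Tuple[Interval, ...]  # regions encoding the final product


def share_product_region(a: Transcript, b: Transcript) -> bool:
    """True if any product-encoding region of a overlaps one of b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.product_regions
               for s2, e2 in b.product_regions)


def cluster_transcripts(transcripts: List[Transcript]) -> List[List[Transcript]]:
    """Group transcripts into genes by transitive overlap (union-find)."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if share_product_region(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)  # merge the two clusters

    clusters: Dict[int, List[Transcript]] = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t)
    return list(clusters.values())


def gene_coordinates(cluster: List[Transcript]) -> Interval:
    """Leftmost and rightmost mapping positions of the cluster's transcripts."""
    return (min(t.span[0] for t in cluster), max(t.span[1] for t in cluster))


# Toy data loosely modelled on gene 3 and product H of Fig. 1 (made-up coordinates).
D, E, F, G, H = (100, 200), (300, 400), (220, 280), (500, 600), (700, 800)
toy = [
    Transcript("DE", (90, 420), (D, E)),
    Transcript("FE", (210, 420), (F, E)),
    Transcript("FG", (210, 620), (F, G)),
    Transcript("H", (690, 820), (H,)),
]
genes = cluster_transcripts(toy)
for gene in genes:
    print(sorted(t.name for t in gene), gene_coordinates(gene))
```

With these toy coordinates, DE, FE and FG fall into one cluster even though DE and FG share no product-encoding region, because both overlap FE; H remains in a cluster of its own. Untranslated regions contribute to the gene coordinates through `span` but play no role in deciding which transcripts are related.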


functional gene products, either ncRNAs (e.g. miRNA) or proteins, and consider the regulatory regions involved in their expression at both the transcriptional (e.g. promoter, enhancer) and post-transcriptional (i.e. 5′UTR and 3′UTR) level as gene-related. Thus, the proposed operational definition can be summarized as: a gene is a discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs). This definition does not include cis-regulatory regions, because sequence elements controlling the expression of a gene are not necessarily located upstream of it and may be dispersed throughout the genome (Gerstein et al., 2007), making the accurate definition of their boundaries unfeasible. In addition, some of the transcriptional regulatory elements may themselves be transcribed (Zhu et al., 2007).

An example to illustrate the application of this definition is shown in Fig. 1, which describes a genomic region encoding nine different transcripts that give rise to one ncRNA and seven functional proteins. According to the above definition, (i) ABC and AC and (ii) DE, FE and FG form two clusters of related proteins, generated by alternatively spliced products of genes 2 and 3, respectively. I would suggest that two (or more) proteins are related (i.e. belong to the same gene cluster) if their encoding genome sequences overlap each other. Indeed, products with overlapping encoding genome sequences, like DE and FE, have a strict genetic relationship as a mutation in the shared genomic region

(i.e. the E region) would affect both products. It should be noted that the relationship between two products can be indirect, as DE and FG are related through FE (see also the legend of Fig. 1). Related proteins may also have completely different sequences, as in the case of DE and FG, or when the expressed products use different reading frames. According to the gene definition proposed here the transcript encoding the product H should be assigned to a different gene (4), even if it shares the same TSS with the transcripts encoding ABC and AC, given that H and ABC (or AC) are completely unrelated proteins, i.e. encoded by non-overlapping genomic regions. This is in line with the recent observation that different genes may share distal 5′UTRs, possibly providing a specific expression pattern (Denoeud et al., 2007). Furthermore, the existence of trans-splicing, where exons from two separate transcripts are spliced together to form a mature mRNA molecule, has been shown in some eukaryotes (Hastings, 2005).

In the genomic region drawn in Fig. 1, we are also able to identify an additional gene (1) encoding a ncRNA giving rise to the mature product X. This situation accounts for miRNA genes, often expressed as polycistronic primary miRNAs and located in the introns of coding or non-coding RNAs (Kim and Nam, 2006). Finally, AE can be identified as a fusion protein originating from the co-transcription of two tandem genes (2 and 3, expressing non-overlapping mature transcripts) through the formation of a chimeric transcript, on the basis that the prevalent expression forms of the genes which provide exons to this product form two unrelated transcript

Fig. 2. (A) Seven alternative mRNAs expressed by the CDKN2A gene in human as determined by the ASPIC program (Castrignanò et al., 2006) (RefSeq IDs are shown on the right of known isoforms). (B) Alternative proteins encoded by the transcript isoforms shown in (A).


clusters, i.e. with the 3′ end of the transcripts of the first cluster lying upstream of the 5′ end of the transcripts of the second cluster, and encode unrelated and non-overlapping functional products. Once the related mature products have been defined one can easily go back to the relevant precursor transcripts, and determine the gene coordinates on the genome as their leftmost and rightmost mapping positions (Fig. 1). In this way a single gene locus is defined to encode a set of related products, and its genomic coordinates are established by the precursor transcripts.

The gene definition proposed here differs from the one proposed by Gerstein et al. (2007), "A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products", in that in the current proposal: (i) each gene is assigned a contiguous genomic region; (ii) gene coordinates include the 5′ and 3′ untranslated (UTR) sequences of the mRNA that are included in the precursor transcript. Therefore, according to the proposed definition a genomic tract encoding a trans-spliced leader is not included in the genomic region assigned to a given gene, as we assume that a gene is a contiguous genome region and, furthermore, the trans-leader corresponds to an untranslated region of the transcript which does not contribute to the final product. The definition provided in the current paper is not only simpler but also operationally more appropriate, as it unambiguously defines the genomic region to be considered in the analysis of alternative splicing, usually carried out by aligning gene-related transcripts (typically a Unigene cluster) to the relevant genomic region, where alternatively spliced 5′UTRs are frequently observed.

To deal with a real example, Fig. 2 shows the splicing pattern of the gene CDKN2A, as determined by the ASPIC program (Castrignanò et al., 2006). It should be noted that the first and second transcripts (CDKN2A.Ref and CDKN2A.Tr2 in Fig.
2A) encode two completely different proteins, 116 and 173 aa long, respectively (Fig. 2B), and the corresponding coding sequences use different reading frames. CDKN2A.Tr2, .Tr3 and .Tr4 encode the same product but differ in their 3′UTR. CDKN2A.Tr5, .Tr6 and .Tr7 encode different partially overlapping proteins of 105, 146 and 138 residues, respectively. Note that the products of CDKN2A.Ref and CDKN2A.Tr5 are indirectly related through the product of CDKN2A.Tr6. This example highlights a possible problem that may arise with the proposed definition. Indeed, in most real gene predictions we know neither the location of the coding sequence, if any, nor the function of the encoded protein. In fact, in this case only CDKN2A.Ref, .Tr2 and .Tr6 correspond to known transcripts included in the RefSeq collection (Pruitt et al., 2007). A pragmatic solution to this problem is to annotate the longest possible open reading frame as a functional product (even in the absence of strong supporting data). In this way all inferred transcripts, CDKN2A.Tr1–.Tr7, will be assigned to the same gene locus.

It is now quite clear that an unequivocal and universal gene definition is not possible, and it has therefore been proposed that the operational units of a genome could be better represented by the different expressed transcripts, as they actually relate the genome sequence to function and phenotype (Gingeras, 2007). However, the gene concept, with suitable revision and update, still remains a key issue in Molecular Biology, underlining the centrality of the relationship between genotype and phenotype. An operational definition such as that proposed here may be extremely useful for the unambiguous classification of transcripts in discrete gene loci, such as those provided by the Unigene database (Wheeler et al., 2007), and may be more appropriate for computational analysis involving alignment of genome and transcript sequences. By way of contrast, the Gerstein et al. (2007) gene definition, which includes a discontinuous genome region with the exclusion of UTRs, cannot be used to delineate the genome region to be considered in bioinformatics analyses for the detection of novel splicing isoforms and of splicing events located in non-coding portions of mRNAs.

The simple operational gene definition proposed here, while not universal (it is specifically designed for chromosomal eukaryotic genes; e.g. genes of RNA viruses do not fit this definition), allows unambiguous definition of gene coordinates and of gene-related transcripts. It may have a wide range of applicability and help in the provision of a comprehensive inventory of the genes of a given organism, finally allowing answers to the basic question of how many genes are encoded in its genome.

Acknowledgments

This work was supported by the Italian Ministry of University and Research (Fondo Italiano Ricerca di Base: Laboratorio Internazionale di Bioinformatica), Associazione Italiana Ricerca sul Cancro and Telethon. I thank David Horner (University of Milano) for stimulating discussions and critical reading of the manuscript.

References
Carninci, P., et al., 2005. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563.
Castrignanò, T., et al., 2006. ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization. Nucleic Acids Res. 34, W440–W443.
Denoeud, F., et al., 2007. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759.
Gerstein, M.B., et al., 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681.
Gingeras, T.R., 2007. Origin of phenotypes: genes and transcripts. Genome Res. 17, 682–690.
Hastings, K.E., 2005. SL trans-splicing: easy come or easy go? Trends Genet. 21, 240–247.
Kim, V.N., Nam, J.W., 2006. Genomics of microRNA. Trends Genet. 22, 165–173.
Lewin, B., 2007. Genes IX. Jones and Bartlett, Sudbury, Massachusetts.
Liolios, K., Tavernarakis, N., Hugenholtz, P., Kyrpides, N.C., 2006. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334.
Parra, G., et al., 2006. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 16, 37–44.
Pearson, H., 2006. Genetics: what is a gene? Nature 441, 398–401.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65.
Wheeler, D.L., et al., 2007. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 35, D5–D12.
Zhu, X., Ling, J., Zhang, L., Pi, W., Wu, M., Tuan, D., 2007. A facilitated tracking and transcription mechanism of long-range enhancer function. Nucleic Acids Res. 35, 5532–5544.
