
HIV-1 and HIV-2 LTR Nucleotide Sequences: Assessment of the Alignment by N-block Presentation, Retroviral Signatures of Overrepeated Oligonucleotides, and a Probable Important Role of Scrambled Stepwise Duplications/Deletions in Molecular Evolution


Ivan Laprevotte,* Maude Pupin, Eivind Coward, Gilles Didier,* Christophe Terzian, Claudine Devauchelle,* and Alain Henaut*

*Laboratoire Genome et Informatique, Universite de Versailles Saint Quentin-en-Yvelines, Versailles, France; Laboratoire d'Informatique Fondamentale de Lille, Equipe Bioinformatique, Universite des Sciences et Technologie de Lille, Villeneuve d'Ascq, France; Deutsches Krebsforschungszentrum, Theoretische Bioinformatik (H0300), Im Neuenheimer Feld 280, Heidelberg, Germany; and Institut de Genetique Humaine, Montpellier, France

Previous analyses of retroviral nucleotide sequences suggest a so-called scrambled duplicative stepwise molecular evolution (many sectors with successive duplications/deletions of short and longer motifs) that could have stemmed from one or several starter tandemly repeated short sequence(s). In the present report, we tested this hypothesis by focusing on the long terminal repeats (LTRs) (and flanking sequences) of 24 human and 3 simian immunodeficiency viruses. By using a calculation strategy applicable to short sequences, we found consensus overrepresented motifs (often containing CTG or CAG) that were congruent with the previously defined retroviral signature. We also show many local repetition patterns that are significant when compared with simply shuffled sequences. First- and second-order Markov chain analyses demonstrate that a major portion of the overrepresented oligonucleotides can be predicted from the dinucleotide compositions of the sequences, but by no means can biological mechanisms be deduced from these results: some of the listed local repetitions remain significant against dinucleotide-conserving shuffled sequences; together with previous results, this suggests that interspersed and/or local mononucleotide and oligonucleotide repetitions could have biased the dinucleotide compositions of the sequences. We searched for suggestive evolutionary patterns by scrutinizing a reliable multiple alignment of the 27 sequences. A manually constructed alignment based on homology blocks was in good agreement with the polypeptide alignment in the coding sectors and has been exhaustively assessed by using a multiplied alphabet obtained by the promising mathematical strategy called the N-block presentation (taking into account the environment of each nucleotide in a sequence). Sector by sector, we hypothesize many successive duplication/deletion scenarios that fit our previous evolutionary hypotheses. This suggests an important duplication/deletion role for the reverse transcriptase, particularly in inducing stuttering cryptic simplicity patterns.

Key words: HIV-1 and HIV-2 LTR nucleotide sequences, multiple alignment, N-block presentation, retroviral signatures of overrepeated oligonucleotides, scrambled stepwise duplications/deletions, cryptic simplicity.

Address for correspondence and reprints: Ivan Laprevotte, Laboratoire Genome et Informatique, Universite de Versailles Saint Quentin-en-Yvelines, 45 avenue des Etats-Unis, 78035 Versailles cedex, France. E-mail: laprevotte@genetique.uvsq.fr.

Mol. Biol. Evol. 18(7):1231–1245. 2001
© 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction

Previously, computer-aided analyses of retroviral nucleotide sequences aimed to unravel putative molecular evolution models from sequence comparisons and oligonucleotide distributions (Laprevotte et al. 1984, 1997; Laprevotte 1989, 1992; Terzian et al. 1997). An analysis of 24 viruses from the 10 classes of vertebrate retroviruses has shown common features for the overrepresented oligonucleotides three to six bases in length (Laprevotte et al. 1997): alternating purine and pyrimidine stretches are emphasized and displayed clearly in each of two subsets on both sides of CTG (or TTG) or CAG, respectively. Two general consensuses show up: CCTGG and CAGR; both are found in most classes of retroviruses and at least one is found in each class. This retroviral signature was not found among yeast, plant, and invertebrate retrotransposons, which indicates that the vertebrate retroviruses are a distinct homogeneous group (Terzian et al. 1997); this fits a common evolutionary origin for the retroviruses and is consistent with the universal rule of a trend toward TG/CT excess, which was proposed as a generative principle of nucleotide sequences (Ohno and Yomo 1990). Most of the oligonucleotides 3–6 bases in length with significantly larger than average numbers of occurrences appear to be internally repeated (with mono- or oligonucleotide internal iterations), suggesting an evolutionary stage by slippage-like local duplications. As a whole, these results are consistent with a scrambled duplicative stepwise molecular evolution (many sectors with successive duplications/deletions of short and longer motifs). Core consensuses could correspond to intermediary evolutionary stages, with short tandem repeats giving rise to longer oligonucleotide repeats as previously hypothesized (Southern 1972; Ohno 1988).

In the present study, we tested these evolutionary hypotheses by focusing on the long terminal repeats (LTRs) (and flanking sequences) that bound proviral DNA sequences from two groups of human immunodeficiency viruses (HIV): 15 HIV-1s together with a chimpanzee simian immunodeficiency virus (SIV), and 9 HIV-2s together with a macaque SIV and a sooty mangabey SIV. It is known that following retrovirus integration into the host-cell genome, the double-stranded proviral DNA is flanked by two identical LTRs, with the 5′ LTR element serving as the binding site for transcription factors (reviewed in Ou and Gaynor 1995; Pereira

FIG. 1. Diagrammatic presentation of the 5′/3′ long terminal repeats (LTRs) of the human and simian immunodeficiency viruses. U3, R, U5, PBS, PPT, and Nef are defined in the text. U3, R, and U5 make up the LTR. The intercalated and interrupted sectors, together with the dotted line, represent the rest of the retrovirus sequence. The hatched zones correspond to the flanking eukaryotic sequences and the Nef gene-coding sequence. The actual sequences aligned and studied in the present paper are recapitulated in the bottom line.

et al. 2000). The HIV nef gene open reading frame partially overlaps the 3′ LTR (fig. 1). The HIV LTRs are short sequences that can be visually compared and have been subjected to exhaustive sequencing and biological studies because of an important pathological concern (as a result of the worldwide AIDS crisis) and the presence of transcription control sites on them. It is already known that retrovirus LTRs have multiplied motifs that may correspond to experimentally determined regulatory elements (Frech, Brack-Werner, and Werner 1996). Several studies have shown that the LTR structures and their regulation are of particular interest for HIV expression (Gaynor 1992). Here, we use the control sites that are conserved during evolution as starter homology blocks for a reliable multiple alignment of the 27 LTR sequences. We also list overrepresented words by using a new calculation strategy (Klaerr-Blanchard, Chiapello, and Coward 2000) applicable to short sequences such as the LTRs and their coding and noncoding sectors. CTG is often found in the overlapping multiplied motifs described in HIV-1 LTRs (Seto, Brunck, and Bernstein 1989). Moreover, the sequences of HIV (1 and 2) are the most biased in favor of the overrepresented trinucleotides in the LTRs (Laprevotte et al. 1997). We search for putative short- and longer-range duplications/deletions by comparisons with shuffled sequences and by scrutinizing the thoroughly assessed alignment of the 27 sequences sector by sector. The results agree with the previous hypotheses that retrovirus nucleotide sequences evolved by scrambled stepwise short- and longer-range duplications/deletions.

Materials and Methods

The 27 Studied Nucleotide Sequences

The sequences represented a portion of the plus strand of the proviral DNA (this strand corresponds to the viral RNA). These are the LTRs together with flanking sectors (fig. 1 and the alignment on the web page).
The 27 5′ nucleotides were those located upstream of the 3′ LTR. They included the polypurine tract (PPT) that is the binding site for the primer of DNA plus-strand reverse transcription. The 40 3′ nucleotides were those located downstream of the 5′ LTR. They included the primer-binding site (PBS) for minus-strand reverse transcription. These two flanking sectors were highly conserved 5′ and 3′ landmarks that bound the alignment (see below). The three regions of the LTR were 5′-U3-R-U5-3′ (reviewed in Peterlin 1995; Vogt 1997). U3 corresponds to the unique regulatory sequence at the 3′ end of viral RNA, R corresponds to the terminal direct repeat RNA, and U5 corresponds to the unique regulatory sequence at the 5′ end of viral RNA.

The 27 sequences are listed in table 1 and in the left column of the alignment on the web page. The upper (HIV-1) group is that of the HIV-1s together with a related (Berkhout 1996) chimpanzee SIV (CIVCG [X52154]). The lower (HIV-2) group corresponds to the HIV-2s together with related SIVs (Berkhout 1996): a sooty mangabey SIV (RESIVSMM [X14307]) and a macaque SIV (RESIMM251 [M19499]). HIV-1 sequences were grouped according to their degrees of reciprocal homology (determined by pairwise comparisons taking into account only the putative base substitutions and excluding the other evolutionary events; data not shown). HIV-2 sequences (L07625 and X61240 excluded) are listed in alphabetical order of their EMBL accession numbers. L07625 and X61240 were brought together: as seen in the alignment, they shared many common features that separate them from the other HIV-2 sequences (Kreutz et al. 1992; Barnett et al. 1993).

The Method for Finding Exceptional Words in a Sequence

The calculation strategy (Klaerr-Blanchard, Chiapello, and Coward 2000) is applicable to short sequences. The basic operation is to count occurrences of words of a given length in sub-sequences called windows (here, the LTRs and their coding and noncoding sectors). The P value is the probability for any word to occur in at least its observed number of positions.
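As a rough illustration of the counting step and of the P value just defined, the following Python sketch counts overlapping words in a window and approximates the P value with a binomial tail under a base-composition model. It is only a simplified stand-in and not the published calculation of Klaerr-Blanchard, Chiapello, and Coward (2000): it treats positions as independent and therefore ignores the effect of self-overlapping words. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def count_words(window: str, k: int) -> Counter:
    """Count every (possibly overlapping) word of length k in the window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def word_probability(word: str, window: str) -> float:
    """Probability of the word at one position, from the window's base composition."""
    freq = Counter(window)
    p = 1.0
    for base in word:
        p *= freq[base] / len(window)
    return p


def approx_p_value(word: str, window: str) -> float:
    """Binomial tail P(X >= observed count).

    Assumption of this sketch: positions are independent, so the
    correlation created by self-overlapping words is ignored.
    """
    k = len(word)
    n = len(window) - k + 1          # number of possible word positions
    observed = count_words(window, k)[word]
    p = word_probability(word, window)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(observed, n + 1))
```

For example, `approx_p_value("CTG", window)` returns a small value when CTG occurs more often in `window` than its base composition alone would suggest.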
The calculation strategy allows for the possibility that a word overlaps with itself. Since the probabilities are calculated for words actually occurring in the window, the first occurrence of any word is considered given (probability 1). The P value is considered significant when it is lower than 0.01 or even 0.001 (tables 2–5). When the Bernoulli model is used, the calculation takes into account the length and the base composition of the observed window. When the model is a first-order Markov chain, the dinucleotide composition is additionally taken into account. The implementation of these probability calculations is a part of the program Excep (Coward 1998).

Methods for Finding Probable Local Repetition Sectors

We set a priori patterns that could correspond to local repetitions. A numerical value was used, that is, the percentage of the bases in the observed sequence


Table 1
The 27 Retroviral Nucleotide Sequences Analyzed in this Work

Virus                                                                                   EMBL Accession No.
Human immunodeficiency virus type 1, isolate BRU                                        K02013
Human immunodeficiency virus type 1, isolate PV 22                                      K02083
Human immunodeficiency virus type 1, (HXB2)                                             K03455
Human immunodeficiency virus type 1, NY5/BRU (LAV-1) recombinant clone pNL4-3           M19921
Human T-cell leukaemia type III (HTLV-III) proviral genome (AIDS virus)                 X01762
Human immunodeficiency virus type 1, isolate ARV-2/SF2                                  K02007
Human immunodeficiency virus type 1, isolate MN                                         M17449
Human immunodeficiency virus type 1, isolate RF (HAT-3)                                 M17451
Human immunodeficiency virus type 1 (HIV-1)                                             M26727
Human immunodeficiency virus type 1, isolate ELI                                        K03454
Human immunodeficiency virus type 1, isolate Z2                                         M22639
Human immunodeficiency virus type 1 (HIV-1)                                             M27323
Human lymphadenopathy virus (MAL isolate)                                               K03456
Human immunodeficiency virus type 1 (HIV-1)                                             L20571
Human immunodeficiency virus type 1, Ugandan isolate U455                               M62320
Chimpanzee immunodeficiency virus (CIVCG), SIV(cpz)                                     X52154

Simian immunodeficiency virus from sooty mangabey monkey (RESIVSMM)                     X14307
Simian (macaque) immunodeficiency virus                                                 M19499
Human immunodeficiency virus type 2 from strain HIV-2UC1                                L07625
Human immunodeficiency virus type 2, isolate D205                                       X61240
Human immunodeficiency virus type 2, isolate HIV2FG                                     J03654
Human immunodeficiency virus type 2, isolate SBLISY                                     J04498
Human immunodeficiency virus type 2 (HIV-2)                                             J04542
Human immunodeficiency virus type 2, isolate ROD                                        M15390
Human immunodeficiency virus type 2 (HIV-2)                                             M30502
Human immunodeficiency virus type 2, isolate GH-1, clone 8                              M30895
Human immunodeficiency virus type 2 (HIV-2)                                             M31113

that were included in at least one of the repetitions so defined. For each of the 27 sequences, the actual value was compared with those of 100 simply shuffled sequences (Bernoulli model) or 100 shuffled sequences additionally conserving the exact starter dinucleotide count (first-order Markov chain model). The result was considered significant (table 6) when every random value was lower than that of the observed sequence. The result was considered somewhat significant when no more than 5 of the 100 random values were above the observed one. A computer program that shuffled the letters of a sequence while accurately conserving its dinucleotide, or even trinucleotide, composition was implemented (Kandel et al. 1996; Coward 1999). The search for approximately duplicated sectors was done using the program BESTFIT (Smith and Waterman 1981; Devereux 1989): the alignment score between two sectors was compared with 100 random values corresponding to the alignments of one of the two observed sectors with 100 shufflings of the other.

The N-Presentation Algorithm

The N-presentation strategy (Didier 1999) enables one to code a set of biological sequences (here, nucleotide sequences) by using a multiplied alphabet accounting for the neighborhood of each letter in the sequences. The N-presentation rank is the length of the neighborhood considered (here, the 8- and the 12-presentations were performed). In the N-presentation, each nucleotide is renamed by its letter followed by a number

(its type). A type of nucleotide appears in two different positions in the sequences if the same neighborhood of length N covers these two positions with the same relative rank (i.e., the neighborhood starts at the same distance [< N] from the two positions). The calculation procedure also enables one to identically rename the same starter characters when they are included in oligonucleotides that are only partially similar. In this study, the N-block presentation computer program was used to assess the alignment of the 27 sequences. Actually, this alignment was difficult to assess with the four-letter alphabet of DNA because of a large number of putative duplications/deletions that made the problem of the gaps difficult to manage and because the homology blocks were often difficult to distinguish from noise. The homology blocks were much easier to delineate using an alphabet made up of a large number of characters (4,711 here when the 12-presentation was used). To begin with, the alignment was constructed manually (see below), inasmuch as the available software could not use an alphabet with more than 26 letters.

The Alignment Strategy

The alignment can be found on the web page http://genome.genetique.uvsq.fr/laprevotte/. Within each HIV-1 or HIV-2 group, the sequences were closely related and the alignments were easily constructed, such that the two consensuses were easily deduced. The point is to align HIV-1 and HIV-2 sequences together (these sequences are supposed to have a common evolutionary


Table 2
Overrepresented Oligonucleotides in the Entire Sequences

Column headings (overrepresented oligonucleotides):
Dinucleotides: AG, CT
Trinucleotides: ACT, AGA, AGG, CAG, CTC, CTG, CTT, GCT, GGA, TCT, TGG
Tetranucleotides: AAGA, ACAA, ACAC, ACCC, ACTA, ACTG, AGAA, AGAC, AGCA, AGCT, AGGA, AGTG, CAAG, CAGA, CAGG, CCAG, CCCT, CCTG, CTAG, CTCT, CTGA, CTGG, CTTG, CTTT, GAAG, GACT, GAGG, GCAG, GCCT, GCTG, GCTT, GGAA, GGGA, TACT, TCTC, TCTG, TGCC, TGCT, TGGC, TGGG

Rows (sequences): CIVCG, K02013, K02083, K03455, M19921, X01762, K02007, M17449, M17451, M26727, K03454, M22639, M27323, K03456, L20571, M62320, RESIVSMM, RESIMM251, L07625, X61240, J03654, J04498, J04542, M15390, M30502, M30895, M31113

[Table body: the X and x cell marks and the boldface type are not recoverable from the extracted text.]

NOTE. Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X = significant; x = somewhat significant (see Materials and Methods).


Table 3
Overrepresented Oligonucleotides in the Coding Sectors

Column headings (overrepresented oligonucleotides):
Dinucleotides: AG, CT
Trinucleotides: ACA, AGA, CTG, GCT
Tetranucleotides: AAGA, ACAA, ACAC, AGAA, AGGC, CACA, CCAG, CTGG, GAAG, GACT, GAGA, GGAA, GGAT, GGCT, TACC, TCAG, TGGA, TGGC, TTTG

Rows (sequences): CIVCG, K02013, K02083, M19921, K02007, M17451, M26727, K03454, M22639, M27323, K03456, L20571, M62320, RESIVSMM, RESIMM251, L07625, X61240, J03654, J04498, J04542, M15390, M30502, M30895, M31113

[Table body: the X and x cell marks and the boldface type are not recoverable from the extracted text.]

NOTE. Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X = significant; x = somewhat significant (see Materials and Methods). Three HIV-1 sequences are not included because of premature stop codons.

Table 4
Overrepresented Oligonucleotides in the Noncoding Sectors

Column headings (overrepresented oligonucleotides):
Dinucleotides: AG, CT
Trinucleotides: ACT, AGA, AGG, CAG, CTG, CTT, GCT, TCT
Tetranucleotides: AAGC, AGAG, AGCA, AGTG, CACT, CAGG, CCCT, CCTG, CTAG, CTCT, CTGC, CTGG, CTTG, CTTT, GACT, GCAG, GCTG, GGGA, TAAA, TACT, TCTC, TGCT, TGGG

Rows (sequences): CIVCG, K02013, K02083, M19921, K02007, M17451, M26727, K03454, M22639, M27323, K03456, L20571, M62320, RESIVSMM, RESIMM251, L07625, X61240, J03654, J04498, J04542, M15390, M30502, M30895, M31113

[Table body: the X and x cell marks and the boldface type are not recoverable from the extracted text.]

NOTE. Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X = significant; x = somewhat significant (see Materials and Methods).


Table 5
Overrepresented Oligonucleotides Versus a First-Order Markov Chain Model

Column headings (overrepresented oligonucleotides):
Trinucleotides: ATC, CGC
Tetranucleotides: ACAA, AGAC, AGTG, ATCC, CGCT, CGGG, CTGG, GAAG, GATC, GCGC, GGGA, GTGT, TAAA, TACC, TATA, TGGC, TGTA

Rows (sequences): CIVCG, K02013, K02083, M19921, X01762, K02007, M17451, M26727, K03454, M22639, M27323, K03456, L20571, M62320, RESIVSMM, RESIMM251, L07625, X61240, J03654, J04498, M30502, M30895

[Table body: the X and x cell marks, the boldface type, and the per-sequence position ranges (footnotes a, b, and c) are not recoverable from the extracted text.]

NOTE. Overrepresented oligonucleotides are displayed from top to bottom (those included in CCTGG or CAGR are in boldface type). X = significant; x = somewhat significant (see Materials and Methods).
a The entire sequence.
b The coding sector.
c The noncoding sector.
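To make the idea behind a first-order Markov chain comparison more concrete, the short Python sketch below computes the expected count of a word from the dinucleotide counts of a sequence, using the standard plug-in estimate N(w1w2)·N(w2w3)···/(N(w2)···). This is an illustrative assumption about how such an expectation can be obtained; it is not the P value calculation implemented in Excep, and the function names are hypothetical.

```python
from collections import Counter


def observed_count(word: str, seq: str) -> int:
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def markov1_expected(word: str, seq: str) -> float:
    """Expected count of word (length >= 2) under a first-order Markov chain.

    Plug-in estimate: the count of the first dinucleotide, multiplied by the
    estimated transition probability of each following dinucleotide.
    """
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    starts = Counter(seq[:-1])       # bases that begin a dinucleotide
    expected = float(di[word[:2]])
    for i in range(1, len(word) - 1):
        base = word[i]
        if starts[base] == 0:
            return 0.0
        expected *= di[word[i:i + 2]] / starts[base]
    return expected
```

A word whose observed count is close to `markov1_expected` is already explained by the dinucleotide composition; only a clear excess over this expectation would remain noteworthy under the first-order model.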

progenitor). The alignment was constructed manually. To begin with, it was based on eight consensus elements (or groups of elements), that is, 18 positions highlighted by Frech, Brack-Werner, and Werner (1996), who studied common modular structures in primate lentiviral LTR sequences. Most of these core blocks enable one to propose a reliable alignment of the corresponding and neighboring HIV-1 and HIV-2 sectors, provided that some local corrections are made. In addition, PPT and PBS were very significant core blocks, together with the 5′ and 3′ ends of the LTRs, respectively. The rest of the alignment was built by recursively searching the intercalary sectors for perfectly matched segments of at least three bases in length. For each step, a new intercalary alignment was then based on the longest perfect match between any paired HIV-1 and HIV-2 sequences, that is, consistent with the prealigned bordering sectors. This match was a priori assumed to be the closest to the putative original sequence. In addition, probable duplication/deletion events were taken into account. Especially in the case of an unequal number of repeated motifs between HIV-1s and HIV-2s, gaps were inserted (gaps were not treated explicitly but remain as those parts of the sequences that did not belong to any of the aligned segments; Morgenstern, Dress, and Werner 1996). Eventually, the alignment was based on the nucleotides printed on the line labeled common sectors. These nucleotides covered 643 positions (approximately 58%) of the alignment. In the coding reading frame, the polypeptide alignment was constructed in the same way based on the conserved amino acids (and the corresponding codons).

In order to assess and to locally correct the nucleotide alignment while increasing the signal-to-noise ratio, alphabets of more than four letters were additionally used: that of the polypeptide alignment in the coding reading frame (as just mentioned), and that obtained by the 8- and 12-ranked N-block presentation for the whole of the sequences. There was good agreement between the polypeptide and the nucleotide alignments except for a few locations (see the web page). All of the aligned sequences were coded using a 12-ranked and an 8-ranked N-block presentation (the


Table 6
Sectors Suggesting Local Repetition Scenarios

Column headings: repetition patterns A to K, each tested against model 0 and/or model 1 (see NOTE).

Rows (sequences): CIVCG, K02013, K02083, K03455, M19921, X01762, K02007, M17449, M17451, M26727, K03454, M22639, M27323, K03456, L20571, M62320, RESIVSMM, RESIMM251, L07625, X61240, J03654, J04498, J04542, M15390, M30502, M30895, M31113

[Table body: the X and x cell marks are not recoverable from the extracted text.]

NOTE.
A = perfect tandem repeats of words at least two bases in length plus local repetitions of words at least three bases in length with no more than five bases intercalated;
B = tandemly repeated dinucleotides;
C = motifs at least six bases in length made up of two distinct and alternate letters;
D = ABCDABCDABCD patterns;
E = nonoverlapping repetitions of words at least 10 bases in length with no more than five letters intercalated;
F = dinucleotides repeated at least nine times in windows 50 bases in length;
G = trinucleotides repeated at least six times in windows 50 bases in length;
H = trinucleotides repeated at least four times in windows 25 bases in length;
I = tetranucleotides repeated at least four times in windows 50 bases in length;
J = sectors more than 15 bases in length that are made up of no more than two letters;
K = sectors at least 30 bases in length made up of no more than three letters (at most one base excepted, this latter not included in the computation of the numerical value defined in Materials and Methods).
For 0, the results are compared with a Bernoulli model; for 1, the results are compared with a first-order Markov chain model. X = significant; x = somewhat significant (see Materials and Methods).
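Two of the pattern definitions in the note to table 6 are simple enough to sketch in code. The Python example below is a minimal illustration, under stated assumptions, of how patterns C and D could be located and how the percentage of bases they cover could be computed. The regular expressions, the dictionary `PATTERNS`, and the function name are hypothetical, and pattern D is taken here to mean any tetranucleotide repeated in tandem at least three times, which is an interpretation rather than the authors' stated rule.

```python
import re

# Lookahead-based expressions so that overlapping occurrences are all reported.
PATTERNS = {
    # C: at least six bases made up of two distinct, alternating letters
    "C": re.compile(r"(?=(([ACGT])(?!\2)([ACGT])(?:\2\3){2,}\2?))"),
    # D: a tetranucleotide repeated in tandem at least three times (assumption)
    "D": re.compile(r"(?=(([ACGT]{4})\2{2,}))"),
}


def covered_percentage(seq: str, pattern: re.Pattern) -> float:
    """Percentage of bases of seq included in at least one occurrence."""
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for match in pattern.finditer(seq):
        # Group 1 holds the occurrence captured inside the lookahead.
        for i in range(match.start(1), match.end(1)):
            covered[i] = True
    return 100.0 * sum(covered) / len(seq)
```

Applied to a sequence containing GCTTGCTTGCTT, `covered_percentage(seq, PATTERNS["D"])` counts those 12 bases as covered.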

latter being less stringent). The N-block presentation corroborated the homology blocks (see the web page); in addition, it made local corrections of the alignment possible.

Results

Overrepresented Oligonucleotides Appear to Be Congruent with the So-Called Retroviral Signature

We used a new calculation strategy (Materials and Methods) to perform on a short sector of the retroviral genome an investigation similar to that performed on complete sequences (Laprevotte et al. 1997). For the Bernoulli model, the results are displayed in tables 2–4. The overrepresented oligonucleotides of 2–4 bases in length are shown in the whole sequences (table 2), in the 5′ part (coding for the 3′ end of the nef gene; table 3), and in the 3′ (noncoding) part (table 4). As a whole, the overrepeated words selected from the entire lengths of the sequences (table 2) appeared to be congruent with the retroviral signature previously found, particularly the core consensuses CCTGG and CAGR (Laprevotte et al. 1997). For instance, 32 of the 53 oligonucleotides displayed shared a continuous sector, covering more than half of their length, with one of these consensuses. The 2 dinucleotides, 5 (out of 11)

trinucleotides, and 4 (out of 40) tetranucleotides were completely included in these consensuses. In comparison, the corresponding proportions of such oligonucleotides among all possible di-, tri-, and tetranucleotides were 7/16, 6/64, and 4/256, respectively. The most often selected words were AG, CT, AGA, CTG, CTGG, GGGA, and, only for the HIV-2 group, CAG, which is complementary to CTG. The word CTG was overrepresented in all of the HIV-1 sequences except for M26727 (although it did contain an overrepeated CCAG). Except for the macaque virus (although it showed an overrepresented CAG), the HIV-2 group also showed an overrepresented CTG, including the sooty mangabey virus (RESIVSMM), which is supposed to be the evolutionary progenitor of the HIV-2s (Gao et al. 1999). In the coding sector (table 3), only CIVCG and RESIVSMM showed overrepresented oligonucleotides including CTG. In addition, two HIV-1s and six HIV-2s showed an overrepresented CCAG. The noncoding sector (table 4) appeared to be much more congruent with the retroviral signature: the sequences studied showed at least one overrepresented oligonucleotide including CTG (except for K03456, M26727, RESIVSMM, and RESIMM251); six HIV-2s out of nine


FIG. 2. Percentages of the sequences occupied by approximate tandem repeats (perfect tandem repeats of a word at least two bases in length and locally repeated oligonucleotides at least three bases in length) for 27 × 100 simple (left), dinucleotide-conserving (middle), and trinucleotide-conserving (right) shufflings of the 27 sequences.

showed at least one overrepresented word including CAG. For K03455, X01762, and M17449, only the entire lengths of the sequences were studied because of a premature stop codon (tables 3–5).

Table 5 displays the tri- and tetranucleotides that were found to be overrepresented when a first-order Markov chain model was used. Only the sequence RESIVSMM had an overrepresented oligonucleotide (CTGG, in the entire sequence and in the coding sector) that was congruent with the so-called retroviral signature. GGGA remained significantly overrepresented in the noncoding sectors of all of the HIV-1 sequences that were tested (except for CIVCG) and three HIV-2 sequences (L07625, M30895, and X61240). The major portion of these overrepresented GGGAs was clustered in the sectors (aligned with CIVCG 397–458) where the repeated sites NF-KB and SP.1 were located (see the alignment on the web page). Actually, for these sequences with overrepresented GGGAs, the noncoding sectors were 341–547 bases in length and included between 7 and 10 occurrences of this word. In these actual sectors, the zone in which NF-KB and SP.1 sites were clustered (being only 59–73 bases in length) included as many as four or five occurrences of GGGA (boxed by a thick line in the alignment).

Sectors Suggesting Local Repetition Scenarios

The simulation procedures (Materials and Methods) were aimed at finding local repetition processes

such as those hypothesized previously (Laprevotte et al. 1997). For each sequence, the patterns studied in table 6 (column A) were approximate tandem repeats (i.e., either perfect tandem repeats of a word at least two bases in length or local repetitions of a motif at least three bases in length with no more than five bases intercalated between two successive occurrences of this word). These repeated words are boxed by a thin line in the alignment on the web page. For each sequence, the significance of the numerical value (Materials and Methods) was assessed against the Bernoulli model (table 6, column A, left). Except for K03456, the HIV-1 group sequences (15 sequences) appeared to be significant (for CIVCG, K02007, M17449, M17451, and L20571) or somewhat significant (for the other 10 sequences). In the HIV-2 group, only 3 sequences out of 11 were at least somewhat significant: L07625 (somewhat significant), M15390 (somewhat significant), and X61240 (significant). On average, the values for the significant results were greater by 5% than those for the nonsignificant results. The first-order Markov chain model (table 6, column A, right) showed that, for a major portion, the significant results correlated with the dinucleotide compositions of the corresponding sequences: 2 HIV-2 and 8 HIV-1 sequences were no longer significant; X61240 and 4 HIV-1 sequences, including CIVCG, became only somewhat significant; the degree of significance was conserved only for K02083, M17451, and M26727. In figure 2, the numerical values defined for table 6 (column A) are displayed for three sets of 2,700 (27 ×


100) shuffled sequences. For each distribution graph, each of the 27 starter sequences was shuffled 100 times (Materials and Methods). For the left graph, the shuffled sequences rigorously conserved only the starter nucleotide compositions. For the middle graph, the dinucleotide counts were additionally exactly conserved, as were the trinucleotide compositions for the right graph. The middle graph was shifted further from the left graph than the right graph was from the middle one, such that the major part of the increase of the random values was accounted for by the first-order Markov chain model. Hence, the repeated sequences investigated in table 6 (column A) appear to be accounted for, to a large extent, by the dinucleotide compositions of the sequences.

Columns B, C, D, and E of table 6 focus on particular repetitions that were part of those recapitulated in column A. Column B focuses on the tandemly repeated dinucleotides and displays only a few significant results. For columns C, D, and E, significant results are defined using a first-order Markov chain model. Column C displays sequences with overrepresented motifs at least six bases in length, made up of only two distinct and alternate letters; these are all of the HIV-1 sequences (CIVCG excepted) and a single HIV-2 sequence; on average, this overrepresentation accounts for an overestimation of about 5% of the numerical value that is computed for column A. Column D shows that all of the HIV-2 sequences have overrepresented repetitions of the form ABCDABCDABCD; actually, this only highlights the sector GCTTGCTTGCTT (boxed by a thick line in the alignment on the web page), which extends from position RESIVSMM-671 to position RESIVSMM-682. Column E accounts for nonoverlapping repetitions of words at least 10 bases in length with no more than five letters intercalated between two successive identical motifs.
For each starter sequence, no more than one random sequence with such a repetition (and with only two copies) occurs, so this repetition (boxed by a thick line in the alignment) is considered significant in an observed sequence (it accounts for an overestimation of about 3% of the numerical value calculated for column A). Together with the overrepetitions displayed in column C, these overrepresented patterns could account for at least some of the significant results presented in column A. Only a single HIV-2 sequence (L07625) shows such a pattern (two copies of a decanucleotide), located in the sector of the SP-1 sites (see below). In contrast, all the HIV-1 sequences but L20571 show such a repetition, that is, two copies of the NF-κB site (see below). In the same sector of the alignment, L20571 (which is usually distinct from the other sequences of its group; Gürtler et al. 1994) shows a probable imperfect duplication (boxed by a thick line in the alignment) of a segment including the NF-κB site (underlined), but with an unequal number of CTGs and Gs:

359 ACTG---ACACTGC-GGGACTTTCCAG 381
382 ACTGCTGACACTGCGGGGACTTTCCAG 408
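The comparison of an observed sequence with simply shuffled ones can be sketched as follows. This is only an illustration under stated assumptions: the statistic (fraction of the sequence covered by perfect tandem repeats), the function names `tandem_repeat_coverage` and `shuffle_test`, the `max_unit` limit, and the fixed random seed are choices made here, not the procedure of Materials and Methods. Only mononucleotide composition is conserved by the shuffling, which corresponds to the Bernoulli model.

```python
import random


def tandem_repeat_coverage(seq, min_unit=2, max_unit=10):
    """Fraction of positions covered by perfect tandem repeats, i.e. a word
    of min_unit..max_unit bases immediately followed by a copy of itself."""
    n = len(seq)
    if n == 0:
        return 0.0
    covered = [False] * n
    for unit in range(min_unit, max_unit + 1):
        for start in range(n - 2 * unit + 1):
            if seq[start:start + unit] == seq[start + unit:start + 2 * unit]:
                for i in range(start, start + 2 * unit):
                    covered[i] = True
    return sum(covered) / n


def shuffle_test(seq, statistic, n_shuffles=100, seed=0):
    """Return (observed value, number of shuffled sequences whose value is
    at least as large). Shuffling conserves only the base composition."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    observed = statistic(seq)
    letters = list(seq)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if statistic("".join(letters)) >= observed:
            at_least += 1
    return observed, at_least


# The column D sector quoted in the text is a perfect tandem repeat.
observed, at_least = shuffle_test("GCTTGCTTGCTT", tandem_repeat_coverage)
print(observed, at_least)
```

A small count of shuffled sequences reaching the observed value would be read as a significant result; the exact thresholds separating "significant" from "somewhat significant" are those of Materials and Methods and are not reproduced here.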

Columns F, G, H, and I of table 6 account for significantly clustered di-, tri-, or tetranucleotides in windows 50 bases in length (for column H, trinucleotides repeated at least four times in a window 25 bases in length). The window slides step by step along the sequence (step = 1), with the purpose of delineating simply repeated sequences or cryptic simplicity stretches (rich in short direct repeats, as discussed in Tautz, Trick, and Dover 1986; Treier, Pfeifle, and Tautz 1989). The most numerous significant results are those displayed in column F (dinucleotides repeated at least nine times in windows 50 bases in length). As compared with the Bernoulli model, all of the sequences (RESIMM251 excepted) show at least a somewhat significant result. As compared with a first-order Markov chain model, the results remain significant for the HIV-1 group except for L20571 (CIVCG remains only somewhat significant); of the HIV-2 sequences, five remain at least somewhat significant, in contrast to the other five (J03654, L07625, M15390, M31113, and X61240). On average, the numerical value to be compared (Materials and Methods) was overestimated by about 12% for the sequences that remained significant when a first-order Markov chain was used, as compared with those which were not significant or did not remain significant. Results similar to those in column F of table 6 (albeit less numerous) are displayed in columns G (trinucleotides repeated at least six times in windows 50 bases in length), H, and I (tetranucleotides repeated at least four times in windows 50 bases in length).

Column J of table 6 accounts for significant overrepresentations of sectors more than 15 bases in length that are made up of no more than two letters. According to the Bernoulli model, the HIV-1 sequences (except for K02013, K02007, and K03456) were somewhat significant; except for M17451, they were not significant when compared with a first-order Markov chain model.
For the HIV-2 group, when a first-order Markov chain model was used, RESIVSMM, J04542, L07625, M30502, and M31113 remained significant, while J03654, J04498, and M15390 did not; the three other sequences in the group were not significant anyway. For the sequences that remained significant when compared to shuffled sequences conserving the exact starter dinucleotide count (first-order Markov chain model), the numerical value was 3.6% or 3.7%, except for M17451 (2.3%), which conserved only a borderline significance; for those sequences which did not conserve their significance or remained nonsignificant, the parameter was 0%–2.3%.

Column K of table 6 accounts for significant overrepresentations of sectors at least 30 bases in length made up of no more than three letters (with at most one base excepted, this base not being included in the computation of the numerical value defined above). HIV-1 sequences (except for M17449 and L20571) were significant even against a first-order Markov chain model. For significant sequences, the numerical values range from 18.7% to 32% (14% and 10.1% for M17449 and L20571, respectively). As a whole, HIV-2 sequences (values from 0% to 17%) were not significant.
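The window criteria of columns F to I can be illustrated with a short sketch. The word lengths, minimum counts, and window sizes below are those quoted in the text; the function names, the dictionary `COLUMN_RULES`, and the choice to count overlapping occurrences that lie entirely inside a window are assumptions made for this illustration. The sketch only detects clusters; it does not assess their significance against shuffled sequences.

```python
# (word length, minimum number of occurrences, window length) per column,
# as quoted in the text for table 6.
COLUMN_RULES = {"F": (2, 9, 50), "G": (3, 6, 50), "H": (3, 4, 25), "I": (4, 4, 50)}


def max_window_count(seq, word, window=50):
    """Largest number of occurrences of `word` lying entirely inside one
    window of `window` bases when the window slides one base at a time."""
    k = len(word)
    if k == 0 or k > window:
        return 0
    starts = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]
    best = 0
    left = 0
    for right, pos in enumerate(starts):
        # Drop the leftmost occurrences until first and last fit in one window.
        while pos + k - starts[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best


def clustered_words(seq, column):
    """Words of the sequence that satisfy the clustering rule of a column."""
    k, min_count, window = COLUMN_RULES[column]
    words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted(w for w in words if max_window_count(seq, w, window) >= min_count)


print(clustered_words("AG" * 9, "F"))
```

Using the sorted start positions with two pointers avoids recounting every window from scratch, which matters little for LTR-sized sequences but keeps the sketch linear in the number of occurrences.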


The Reliability of the Alignment of the 27 Nucleotide Sequences

A reliable alignment is an essential tool in the present work. The accurate alignment of previously identified benchmarks and its congruency with the polypeptide alignment and with the N-block presentation coding of the sequences (Materials and Methods and the alignment on the web page) allowed us to consider this alignment reliable for testing local molecular evolution hypotheses. Three available multiple-sequence alignment programs were tested (table 7) against the benchmarks found in both the HIV-1 and the HIV-2 groups to select the most suitable algorithm for aligning the actual sequences studied in this work. Clustal-X (Thompson, Plewniak, and Poch 1999) is a progressive alignment method comparing individual residues by using a Needleman-Wunsch-based algorithm (Needleman and Wunsch 1970) and employing gap penalties; Mabios (Abdeddaïm 1997) and Dialign (Morgenstern, Dress, and Werner 1996) calculate homology blocks, the best combinations of which are chosen to select the benchmarks on which the rest of the alignment is constructed. At first, it appeared that no two programs constructed the same alignment. Clustal-X produced a total misalignment downstream of the HIV-1 deletion zone following the TAR Common Sector (CIVCG-519); moreover, the deleted sequence J03654 was oddly aligned (data not shown). Mabios and/or Dialign aligned all of the benchmarks (the R-U5 junction excepted); the alignment of five of these benchmarks was more accurate when Mabios was used (with Dialign constructing an alignment that was less accurate or only partial). However, Dialign was the only program aligning all but one of the indicated benchmarks, particularly the much-conserved polyadenylation site (Poly(A)), and highlighting the duplication of the NF-κB site in the HIV-1 group (by inserting a gap in HIV-2 sequences in front of one of the two NF-κB copies). Hence, as regards the present alignment, Dialign appeared to be the most reliable program of those tested; it was further tested for two sectors where the alignment was difficult to construct even manually: the set of sequences aligned with CIVCG 328–463 and that aligned with CIVCG 558–609 (see the web page). The nucleotides of the alignment constructed with Dialign in these sectors were coded by the 8-ranked N-block presentation, HIV-1 and HIV-2 matching letters being highlighted in red as on the web page (data not shown). These highlighted letters are less numerous than in the manually aligned corresponding sectors, which suggests that an alignment constructed with such a program has at least to be visually refined.

Table 7
Tests of Three Available Multiple-Sequence Alignment Programs Against HIV-1 and HIV-2 Common Core Blocks

Benchmarks (rows, in order): Polypurine tract; LTR (U3) 5′ end; Primate element block (a); NF-κB (b); SP-1 (c); TATA box; U3-R junction; TAR common sector (d); Poly(A); R-U5 junction; Poly(A) downstream element (e); LTR (U5) 3′ end; Primer-binding site.
Clustal-X and Mabios: X X X x x X X x X X X x x X X x X X X
Dialign: x X X X x x x x X x X X

NOTE. X = perfect alignment; x = imperfect or partial alignment.
(a) Four sites.
(b) Dialign is the only program highlighting the duplication of the NF-κB site in the HIV-1 group.
(c) The SP-1-sites sector exhibits clustered GGGAs and stretches of Gs, so several alignments are possible.
(d) Dialign constructs an almost perfect alignment for approximately the 5′ two thirds of the sector.
(e) Dialign constructs a correct alignment except for the two 3′ bases.

Discussion

Previous analyses of retroviral nucleotide sequences have suggested a scrambled stepwise duplicative molecular evolution. Genetic diversity in these sequences is usually presumed to arise as a consequence of reverse transcriptase infidelity (Katz and Skalka 1990), which is due to numerous phenomena, such as nucleotide miscopying, duplication, deletion, recombination, and G-to-A hypermutation of the viral genome (Vartanian et al. 1991). Duplications have been reported to occur much less frequently than deletions. Additionally, an oligopurine sequence bias has been reported to occur in eukaryotic viruses (Beasty and Behe 1988). In this paper, we focused on the LTRs of human and simian immunodeficiency viruses to perform a high-resolution study.

The listed overrepresented oligonucleotides (often containing CTG or CAG, as previously shown) are congruent with the previously named retroviral signature. This is particularly clear in the noncoding part of the LTRs, which strengthens the previous conclusion that these overrepresented oligonucleotides are not merely predictable from the codon usage in the coding frames of retroviral sequences; nucleotide repetitiveness and codon usage appeared not to be strictly correlated, and overrepresented/clustered nucleotide motifs showed up without regard to the coding/noncoding sectors and the phase of the reading frame (Laprevotte 1989, 1992). However, one must keep in mind that an initially coding foreign genetic material could be inserted by recombination into a noncoding part of a retroviral sequence (Katz and Skalka 1990). For instance, the nef gene is thought to be a captured element (Myers 1997), which could account for the less characteristic results achieved here in the coding part of the LTRs. Additionally, it has recently been suggested that reading-frame-independent force(s) may influence synonymous codon choice (Antezana and Kreitman 1999).

By no means can biological mechanisms be deduced from the correlation of overrepresented words and significant local repetitions with the dinucleotide compositions of the corresponding sequences. It is impossible to decide between two hypotheses: either the doublet frequencies, due to any event, account for these words found to be overrepresented or clustered when compared with a Bernoulli model, or a large number of duplications of oligonucleotides (such as AG and CT; tables 2–4) bias the dinucleotide composition of the sequence and account for the nonsignificance of many repetitions when tested against a first-order Markov chain model. Such duplications could favor particular nucleotide motifs for biochemical reasons or because of starter tandemly repeated sequences. Previous studies of complete retroviral sequences (Laprevotte 1992; Laprevotte et al. 1997) strengthen the second hypothesis by demonstrating that for most of the overrepeated oligonucleotides, the observed frequency is not merely a consequence of dinucleotide distribution (many overrepresented oligonucleotides remained significant versus a first-order Markov chain model; moreover, the correlation between the dinucleotide distribution in the subset of the overrepresented oligonucleotides and that of the whole sequence was variable: high, weak, or even null). The fact that for RESIVSMM (which is supposed to be the evolutionary progenitor of the HIV-2s; Gao et al. 1999) CTGG is overrepresented even when a first-order Markov chain model is used (table 5) fits the same hypothesis. Moreover, many of the putative locally repeated sectors remain significant even against a first-order Markov chain model (table 6), giving examples of probable duplications that are not accounted for by the dinucleotide composition of the sequence; these are tandem repeats, local repetitions, clusters of oligonucleotides, and monotonous sectors made up of no more than two or three letters. Columns F and K of table 6 show many repetitions and monotonous sectors that may cover up to 30% of the sequence and that are significant even against Markov-1 random sequences (in these cases, the percentage is overestimated by about 10%–15%).
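The difficulty of choosing between the two hypotheses can be made concrete with a toy calculation. The sketch below compares the observed count of a word with its expected count under a Bernoulli model (base frequencies only) and under a first-order Markov model estimated from the observed dinucleotide counts, using the common estimator N(ab)·N(bc)/N(b). This estimator and the function names are assumptions for illustration; the tests reported in this paper rely on shuffled sequences rather than on this formula.

```python
from collections import Counter


def count_words(seq, k):
    """Counts of all overlapping words of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bernoulli_expected(seq, word):
    """Expected count of `word` if bases were drawn independently."""
    n = len(seq)
    mono = count_words(seq, 1)
    p = 1.0
    for base in word:
        p *= mono[base] / n
    return p * (n - len(word) + 1)


def markov1_expected(seq, word):
    """Expected count of `word` (three bases or more) given the observed
    dinucleotide counts: N(w1w2) * N(w2w3) / N(w2) * ..."""
    di = count_words(seq, 2)
    mono = count_words(seq, 1)
    expected = float(di[word[:2]])
    for i in range(1, len(word) - 1):
        if mono[word[i]] == 0:
            return 0.0
        expected *= di[word[i:i + 2]] / mono[word[i]]
    return expected


# A purely tandem-repeated starter sequence (toy example).
seq = "CTG" * 4
print(count_words(seq, 3)["CTG"], bernoulli_expected(seq, "CTG"), markov1_expected(seq, "CTG"))
```

In this toy case CTG looks strongly overrepresented against the Bernoulli expectation, yet it is fully "predicted" by the dinucleotide counts, although those counts are themselves the product of the tandem duplication. This is the sense in which nonsignificance against a first-order Markov model does not rule out a duplicative origin.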
Moreover, about one third of the alignment includes sectors boxed by a thick line in at least one sequence or one HIV-1/HIV-2 consensus (see the web page). As seen below, these sectors suggest local repetition events. Hence, it appears that in any case the dinucleotide compositions cannot account for all of the listed repetitive patterns and that these patterns cover a large portion of the sequences.

The discrepancies between the results (tables 2–6) for the HIV-1 and the HIV-2 groups suggest distinct mono- or oligonucleotide duplication/deletion scenarios that could have occurred since the evolutionary divergence between the two groups; this led us to search the reliable alignment of the sequences for patterns that are both statistically significant and evolutionarily suggestive. Differentiated sectors can be delineated in the alignment (see the web page) in terms of the degrees of homology between HIV-1 and HIV-2 aligned sectors. The 5′ landmark that is the PPT, together with the 5′ end of the LTR (CIVCG 10–37) and the 3′ landmark that is the PBS (CIVCG 682–705), are highly conserved and highlighted by the 12-ranked N-block presentation, as are six other homology blocks; out of these six blocks, the NF-κB site (CIVCG 397–407 and 409–418) and the polyadenylation signal (CIVCG 570–581) are in

the noncoding part of the sequence; the other four (CIVCG [105–136], [168–181], [215–229], and [284–296]) align with conserved sectors in the polypeptide sequence nef.

The major portion of the coding sectors (up to and including position CIVCG-346) is to be distinguished from the rest of the alignment: the aligned sectors (except for two) measure about the same length (337–346 bases). Except for M17449 (which exhibits a premature stop codon), the lengths are equal or differ, as expected, by multiples of three. The length is longer for the L20571 sequence (349 bases); a scan of the colored alignment corroborates the fact that L20571 is a divergent isolate among the HIV-1 group (Gürtler et al. 1994). In the HIV-2 group, J03654 (Zagury et al. 1988) is deleted between positions CIVCG-88 and CIVCG-317 (excluded). This could be accounted for by two successive deletion events. Let us write the HIV-2 consensus between the positions CIVCG-79 and CIVCG-94 while supposing a jump of the reverse transcriptase (Katz and Skalka 1990; Zhang and Temin 1994) from the first aga to the second; then, the sequence becomes TATACTTAGAAGG. Eleven out of the 13 letters of this motif match the HIV-2 consensus between positions CIVCG-309 and CIVCG-321 (TATARYTACAAGG), suggesting a second jump between the two motifs.

Furthermore, in spite of the conservation of the lengths of the major part of the coding sectors, gaps have to be inserted in the sequences in order to align the homology blocks, suggesting any number of duplications/deletions. For instance, for the sectors aligned with CIVCG from position 39 to position 58, four demonstrative sequences lead to the proposal of a suggestive alignment:

M26727     ATTTACTCCCAG----AAAAGACA 58
M62320     ATTCACTCACAG----AAAAGACA 58
RESIMM251  ATTTATT-ACAGTGCAAGAAGACA 58
L07625     ATTTACT-ATAGTGAGAGAAGACA 58

If this is biologically meaningful, the (underlined) serine codons (TCM for the HIV-1 group, AGT for the HIV-2 group) cannot be aligned; the evolutionary hypothesis of a simultaneous double-nucleotide substitution (TC↔AG), as discussed elsewhere (Averof et al. 2000), is not confirmed in this actual case of an assumed reverse transcriptase-directed evolution. This suggests a less straightforward evolutionary mechanism and further emphasizes the importance of accurate alignments in testing local evolutionary hypotheses.

The sectors aligned with CIVCG-81 to CIVCG-95 are particularly suggestive. Between the two landmarks that are the conserved 5′ TAY (tyrosine) and 3′ GGV (glycine), the HIV-1 and HIV-2 groups do not match; this would lead to consideration of a large number of nucleotide substitutions if only these molecular evolution events were to be taken into account in the alignment. In addition, except for L20571, the HIV-1 sectors include a stretch of


more than six bases made up of only two different alternate letters; seven of the aligned HIV-2 sectors include a stretch of more than 15 bases made up of only two letters. In the two groups, these selected stretches are boxed by a thick line in the alignment (see the web page) when the pattern is significantly overrepresented in the entire corresponding sequence (table 6, columns C and J). The two aligned K03454 (HIV-1) and M31113 (HIV-2) sequences can be taken as an example:

K03454  TACAACACAC------AAGGCAT 97
M31113  TACTTAGAAAAGGAAGAGGGAAT 97
        tyrosine          glycine
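The two kinds of stretches mentioned for this example can be located mechanically. The following sketch, with hypothetical function names, measures the longest run of two strictly alternating letters and the longest sector made up of at most a given number of distinct letters; it is applied to the two segments shown above with the gaps removed. It only measures stretch lengths and does not test their overrepresentation in the whole sequence.

```python
def longest_alternating_run(seq):
    """Length of the longest stretch of the form ABABAB... with A != B."""
    best = run = 0
    for i, base in enumerate(seq):
        if i > 0 and base != seq[i - 1]:
            if run >= 2 and base == seq[i - 2]:
                run += 1  # the ABAB... pattern continues
            else:
                run = 2   # a new two-letter pattern starts at position i-1
        else:
            run = 1
        best = max(best, run)
    return best


def longest_sector(seq, max_letters=2):
    """Length of the longest substring using at most max_letters distinct bases."""
    counts = {}
    best = left = 0
    for right, base in enumerate(seq):
        counts[base] = counts.get(base, 0) + 1
        while len(counts) > max_letters:
            counts[seq[left]] -= 1
            if counts[seq[left]] == 0:
                del counts[seq[left]]
            left += 1
        best = max(best, right - left + 1)
    return best


k03454 = "TACAACACAC------AAGGCAT".replace("-", "")
m31113 = "TACTTAGAAAAGGAAGAGGGAAT"
print(longest_alternating_run(k03454))  # 7 (ACACACA)
print(longest_sector(m31113))           # 17 (A and G only)
```

Calling `longest_sector(seq, max_letters=3)` gives a rough analogue of the three-letter sectors of column K, without the allowance for one excepted base.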

One can imagine that at a pause site (Wu et al. 1995), the reverse transcriptase could replicate the same short template several times, thus expanding the last mono- or oligonucleotide of the nascent DNA strand. If that is the case, the alternate Cs and As in the HIV-1 group should have arisen from the 5′ end of the sector, and the aligned sector in the HIV-2 group should have arisen from the 3′ end. This could somewhat mimic the performance of the telomerase, a cellular reverse transcriptase which synthesizes short repeat sequences rich in T and G and carries its own RNA template with a segment complementary to one and a half copies of the telomeric repeat. One should keep in mind that there is a close relationship between the cellular telomerase active subunit and retroelement reverse transcriptases (reviewed in Boeke and Stoye 1997; Malik, Burke, and Eickbush 2000). It has been suggested that the first steps of DNA synthesis by reverse transcriptases of non-LTR retrotransposons might be similar to the generation of telomeric repeats (Chaboissier, Finnegan, and Bucheton 2000).

From position CIVCG-347 downward, the major portion of the aligned sequences is noncoding. Consequently, their lengths do not necessarily differ by multiples of three; they are much more divergent between the HIV-1 and the HIV-2 groups and even, within the HIV-2 group, between the two SIV-2 and the HIV-2 sequences. The duplication/deletion events appear to have been much less constrained during evolution than they

have been in the coding parts. In this respect, several sectors deserve scrutiny.

The alignment between positions CIVCG-348 and CIVCG-386 can be accounted for by stepwise duplications/deletions (see the web page). HIV-1 clones have been described (Estable et al. 1996) where the HIV-1 empty sectors are occupied by the so-called most frequent naturally occurring length polymorphism (MFLNP on the web page), which shows varying lengths and appears more or less clearly to contain repeated sectors. Here, the aligned sectors in the HIV-2 group do not appear to be deleted.

Between CIVCG-394 and CIVCG-413 (excluded), HIV-2 sequences (SIVs excluded) exhibit two imperfectly repeated sectors that could be the remnant of a duplication event. The L07625 and X61240 HIV-2 sequences are to be distinguished from the other seven (Kreutz et al. 1992; Barnett et al. 1993), as they differ in numerous locations all along the alignment. In each of them, the two homologous sectors (boxed by a thick line in the alignment) extend from position L07625-436 to position L07625-460 and from position L07625-461 to position L07625-484, respectively, and do not coincide with those of the other HIV-2 sequences:

GG-AACTAGCTGACACTGCACAAGAR
GGAAACTAGCWGACACYGCA--GGGA

Each alignment score is greater than those of 100 random alignments (BESTFIT program). The other seven HIV-2 sequences show two successive homologous sectors with five intercalated letters that appear to be a duplication of the 3′ end of the first sector. The alignment score is less than a random score for at most five random sequences out of 100 (BESTFIT). For J03654, M15390, and M31113 (homologous sectors boxed by a thick line in the alignment), no random score is greater than the observed one. The first sector extends from position L07625-408 to position L07625-454, the second from position L07625-457 to position L07625-483 (with M30895 having an additional sector because of a probable duplication of CTGCAG at the 3′ end of the alignment).
The most significant alignment is that of M15390:

408 AGTTAA--AGACAGGAACAGCT-ATACTTGGTCAGGG 441
442 CAGGA 446
447 AGT-AACTA-ACAGAAACAGCTGAGACT-G--CAGGG 478

Additionally, from position L07625-465 to position L07625-536, binding sites for transcription factors show a variable number of copies, which also suggests duplication/deletion events. First, from the left to the right, two sites alternate: the Bel-1 similar region (for RESIVSMM and RESIMM251, to a lesser extent for the

HIV-2 sequences, and only partially for the HIV-1 group), NF-κB (only for the HIV-1 group), the Bel-1 similar region (for L20571 and partially for M30895), and, finally, NF-κB (for both the HIV-1 and the HIV-2 groups). Aligned L20571 (HIV-1) and RESIMM251 (SIV, HIV-2 group) clearly show this pattern:


L20571     ACT-GC---------GGGACTTTCCAGACTGCTGACA-CTGCGGGGACTTTCCA 407
RESIMM251  ACTCGCTGAGATAG---------------------------CAGGGACTTTCCA 451
           [Bel-1 similar region] [ NF-κB ][ Bel-1 similar region ] [ NF-κB ]

Downstream, the SP-1 sites are located in a variable way at four possible segments in a zone with G stretches and a cluster of GGGAs including those located in the NF-κB sites (see above). Column K of table 6 shows that the HIV-1 sequences (except for M17449 and L20571) include overrepresented sectors at least 30 bases in length made up of no more than three letters (with at most one base excepted). Such a sector is found in these sequences between positions K02013-462 and K02013-494 (the corresponding sector is boxed by a thick line at the HIV-1 consensus; see the web page). In this sector, as well as upstream and downstream, the HIV-1 group shows clusters of CTs, CTGs, and CTGGs (table 6, columns F, G, H, and I). These words are boxed by a thick line in the alignment when the corresponding pattern is overrepresented against a first-order Markov chain model in the corresponding sequence taken as a whole (table 6). All of these words are scattered in a region that could be accounted for, at least partly, by stepwise duplications/deletions of mono- or oligonucleotides taken from tandemly repeated CTGs.

The aligned sectors extending from the 5′ end of the R region (CIVCG-501, the beginning of viral RNA; see above) to the positions aligned with CIVCG-565 correspond to the TAR region, which has been extensively studied concerning its biological meaning and the stable stem-loop structure that forms TAR RNA (reviewed in Ou and Gaynor 1995; Rabson and Graves 1997). The HIV-1 TAR RNA contains both a loop and a bulge structure that are critical for Tat-mediated activation. The HIV-2 TAR RNA is capable of forming a complex structure that consists of two discrete stem-loop regions. Possible evolution routes from simple one-hairpin to complex branched TAR structures have been discussed in the literature.
The extended portion of the HIV-2 TAR, relative to the HIV-1 TAR, has the greatest similarity to a human immunoglobulin pseudogene sequence, suggesting (see above) that this subsequence is a captured element (reviewed in Myers 1997). In the alignment, the sector referred to as TAR Common Sector is conserved between HIV-1 and HIV-2. It corresponds to the upper portion of the HIV-1 stem-loop (the bulge-and-loop zone) and to the 5′ HIV-2 stem-loop region. The two successive sectors of the HIV-2 consensus that are boxed by a thick line in the alignment on the web page (the first including the TAR Common Sector) correspond (apart from a few bases) to the two HIV-2 TAR RNA discrete stem-loop regions:

GCAGATTGAGCCCTGGGAGGTTCTCT-CCAGCACT
GCAGGTAGAG-CCTGGG-TGTTCCCTGCTAG-ACT

These two sectors have an alignment score greater than

any obtained from 100 randomizations (BESTFIT). Whatever the possible evolution routes discussed with regard to this region, the possibility of a duplication event has to be taken into account. In contrast, the corresponding zone of the HIV-1 sequence appears to be much deleted. Additionally, the HIV-2 M30502 sequence (from position 596 to position 648) shows (table 6, column F) a cluster of AGs (boxed by a thick line in the alignment).

The rest of the HIV-2 TAR region additionally shows an extended portion relative to HIV-1. The 42/43-base-long HIV-2 sectors included between positions CIVCG-559 and CIVCG-565 (aligned with a zone of HIV-2 consensus that is boxed by a thick line) show clustered oligonucleotides (boxed by a thick line when they correspond to a pattern that is overrepresented in the entire corresponding sequence; table 6, columns D and F–I), which, again, suggest stuttering local duplications (see the web page).

As a whole, the results discussed above fit the molecular-evolution model hypothesized previously (Laprevotte 1989, 1992; Laprevotte et al. 1997): overrepresented oligonucleotides are scattered throughout the entire range of the retroviral sequences; they share complementary core consensuses that fit the rule of a trend to TG/CT excess (Ohno and Yomo 1990) and suggest starter tandemly repeated oligonucleotides (short tandem repeats giving rise to longer oligonucleotide repeats, as hypothesized previously [Southern 1972; Ohno 1988]); they are mixed with scrambled short-scale repetitions, deletions/duplications, tandem repeats, and cryptic simplicity patterns, suggesting a molecular evolution by scrambled stepwise short- and longer-range duplications/deletions (in addition to nucleotide miscopying).
Even though this model gives a good account of the repetitive aspects of retroviral nucleotide sequences, other evolutionary processes may be considered, such as gene conversion (leading to homogeneity throughout DNA sequences; see discussion in Laprevotte 1989) and a converging evolution toward repeated motifs serving useful functions (Laprevotte et al. 1997). This also leads to consideration of possible selective pressures maintaining the repeats.

Conclusions

The listed overrepresented oligonucleotides (selected here by using a calculation strategy applicable to short sequences and often containing CTG or CAG) are congruent with the retroviral signature (previously defined for the entire sequences) when focusing on the noncoding part of the HIV LTRs (this retroviral signature was not found among yeast, plant, and invertebrate retrotransposons; Terzian et al. 1997). The coding part is much less characteristic, which strengthens the hypothesis that the nef gene is a captured element. It appears that the search for consensus overrepresented oligonucleotides could be part of an analysis aimed at defining retrovirus-like sequences in a genome. The biased dinucleotide distribution could be, in large part, the consequence of interspersed duplications of oligonucleotide motifs.

Sector by sector, we hypothesize a large number of local duplication/deletion scenarios that span a great portion of the alignment and could account for length divergences between the HIV-1 and HIV-2 groups. Consequently, base substitutions are by no means the only evolutionary process to take into account for comparisons of such sequences and their phylogenetic analyses. Altogether, our results support our previous hypotheses on the molecular evolution of retroviral nucleotide sequences: a large portion of the sequences can be accounted for by scrambled stepwise short- and longer-range duplications/deletions. There is an emerging hypothesis of an important duplication/deletion role for the reverse transcriptase that could (in addition to already-proposed scenarios) generate perfect or stuttering tandem repeats and then a cryptic simplicity of the sequence. The consensus overrepresented motifs and the numerous cryptic simplicity sectors observed suggest one or several starter tandemly repeated short motif(s). Additional comparisons of decreasingly homologous sequences using a fast and reliable method for the alignments could further unravel these evolutionary patterns.

A reliable and accurate alignment of the compared sequences is an essential tool for performing a high-resolution molecular evolution study. The accurate assessment of the nucleotide alignment with already-identified benchmarks, with the polypeptide alignment, and with the N-block presentation coding of the sequences allows us to consider the alignment reliable.
The multiplied alphabet obtained by the mathematical strategy called N-block presentation appears to be a promising method to increase the signal-to-noise ratio in nucleotide alignment studies.

It is well known that in eukaryotic cells, reverse transcription processes are not restricted to parasitic retroviruses, and that a diverse set of genes, referred to as retrotranscripts, derived from their normal progenitor genes via an mRNA intermediate (Boeke and Stoye 1997). These elements, as well as retroviruses and retrotransposons, are a source of genomic variation, as could be an increasing number of human endogenous retrovirus sequences that have been demonstrated (Kjellman, Sjogren, and Widegren 1999). The endogenous IAP particles of mice may also contribute to the generation of genetic diversity in this host population. Furthermore, it has been hypothesized that if the prebiotic genetic material was RNA, reverse transcription might have been required to formulate DNA-based genetic information (Katz and Skalka 1990). All of these data and others, taken together, suggest that further investigation of reverse transcription could shed light on some aspects of eukaryotic genome evolution and consequently not be restricted to the biology of retroviruses.
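
To make concrete what it means for an oligonucleotide to be overrepresented relative to the dinucleotide composition of a sequence, the following minimal sketch compares the observed count of a word with the count expected under a first-order Markov model estimated from the sequence itself. It is a generic textbook estimate, not the calculation strategy for short sequences used in this study; the function names `kmer_counts`, `markov1_expected`, and `observed_vs_expected` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov1_expected(seq, word):
    """Expected count of `word` under a first-order Markov model.

    Uses the usual estimate built from the sequence's own counts:
    the product of the counts of the word's overlapping dinucleotides,
    divided by the product of the counts of its interior nucleotides.
    For a trinucleotide abc this is N(ab) * N(bc) / N(b).
    """
    mono = kmer_counts(seq, 1)
    di = kmer_counts(seq, 2)
    expected = 1.0
    for i in range(len(word) - 1):
        expected *= di[word[i:i + 2]]
    for i in range(1, len(word) - 1):
        if mono[word[i]] == 0:
            return 0.0  # an interior letter never occurs in seq
        expected /= mono[word[i]]
    return expected


def observed_vs_expected(seq, word):
    """Return (observed count, first-order Markov expected count)."""
    observed = kmer_counts(seq, len(word))[word]
    return observed, markov1_expected(seq, word)
```

A word whose observed count clearly exceeds this expectation is not explained by the dinucleotide composition alone. Assessing significance would still require a comparison with dinucleotide-conserving shuffled sequences, which this sketch does not attempt.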

Supplementary Material

The multiple alignment of the 27 HIV-1 and HIV-2 LTR nucleotide sequences is available from the website http://genome.genetique.uvsq.fr/laprevotte. In addition, this full sequence alignment is directly available from I.L.

LITERATURE CITED
ABDEDDAIM, S. 1997. Incremental computation of transitive closure and greedy alignment. Lect. Notes Comput. Sci. 1264:167-179.
ANTEZANA, M. A., and M. KREITMAN. 1999. The nonrandom location of synonymous codons suggests that reading frame-independent forces have patterned codon preferences. J. Mol. Evol. 49:36-43.
AVEROF, M., A. ROKAS, K. H. WOLFE, and P. M. SHARP. 2000. Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science 287:1283-1286.
BARNETT, S. W., M. QUIROGU, A. WERNER, D. DINA, and J. A. LEVY. 1993. Distinguishing features of an infectious molecular clone of the highly divergent and noncytopathic human immunodeficiency virus type 2 UC1 strain. J. Virol. 67:1006-1014.
BEASTY, A. M., and M. J. BEHE. 1988. An oligopurine sequence bias occurs in eukaryotic viruses. Nucleic Acids Res. 16:1517-1528.
BERKHOUT, B. 1996. Structure and function of the human immunodeficiency virus. Prog. Nucleic Acid Res. Mol. Biol. 54:1-34.
BOEKE, J. D., and J. P. STOYE. 1997. Retrotransposons, endogenous retroviruses, and the evolution of the retroelements. Pp. 343-435 in J. M. COFFIN, S. H. HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold Spring Harbor Laboratory Press, New York.
CHABOISSIER, M. C., D. FINNEGAN, and A. BUCHETON. 2000. Retrotransposition of the I factor, a non-long terminal repeat retrotransposon of Drosophila, generates tandem repeats at the 3' end. Nucleic Acids Res. 28:2467-2472.
COWARD, E. 1998. Mathematical methods for repeated patterns in biological sequences. Dr. Ing. thesis, Norwegian University of Science and Technology, Trondheim, Norway.
COWARD, E. 1999. Shufflet: shuffling sequences while conserving the k-let counts. Bioinformatics 15:1058-1059.
DEVEREUX, J. 1989. The GCG sequence analysis software package. Version 6.0. Genetics Computer Group, Madison, Wis.
DIDIER, G. 1999. Caractérisation des N-écritures et application à l'étude des suites de complexité ultimement n + cste. Theor. Comput. Sci. 215:31-49.
ESTABLE, M. C., B. BELL, A. MERZOUKI, J. S. G. MONTANER, M. V. O'SHAUGHNESSY, and I. J. SADOWSKI. 1996. Human immunodeficiency virus type 1 long terminal repeat variants from 42 patients representing all stages of infection display a wide range of sequence polymorphism and transcription activity. J. Virol. 70:4053-4062.
FRECH, K., R. BRACK-WERNER, and T. WERNER. 1996. Common modular structure of lentivirus LTRs. Virology 224:256-267.
GAO, F., E. BAILES, L. ROBERTSON et al. (12 co-authors). 1999. Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397:436-441.
GAYNOR, R. 1992. Cellular transcription factors involved in the regulation of HIV-1 gene expression. AIDS 6:347-363.
GURTLER, L. G., P. H. HAUSER, J. EBERLI, A. VON BRUNN, S. KNAPP, L. ZEKENG, J. M. TSAGUE, and L. KAPTUE. 1994. A new subtype of human immunodeficiency virus type 1 (MVP-5180) from Cameroon. J. Virol. 68:1581-1585.
KANDEL, D., Y. MATIAS, R. UNGER, and P. WINKLER. 1996. Shuffling biological sequences. Discrete Appl. Math. 71:171-185.
KATZ, R. A., and A. M. SKALKA. 1990. Generation of diversity in retroviruses. Annu. Rev. Genet. 24:409-445.
KJELLMAN, C., H. O. SJOGREN, and B. WIDEGREN. 1999. HERV-F, a new group of human endogenous retrovirus sequences. J. Gen. Virol. 80:2383-2392.
KLAERR-BLANCHARD, M., H. CHIAPELLO, and E. COWARD. 2000. Detecting localized repeats in genomic sequences: a new strategy and its application to B. subtilis and A. thaliana sequences. Comput. Chem. 24:57-70.
KREUTZ, R., U. DIETRICH, H. KUHNEL, K. NIESELT-STRUWE, M. EIGEN, and H. RUBSAMEN-WAIGMANN. 1992. Analysis of the envelope region of the highly divergent HIV-2 ALT isolate extends the known range of variability within the primate immunodeficiency viruses. AIDS Res. Hum. Retroviruses 8:1619-1629.
LAPREVOTTE, I. 1989. Scrambled duplications in the feline leukemia virus gag gene: a putative pattern for molecular evolution. J. Mol. Evol. 29:135-148.
LAPREVOTTE, I. 1992. Mo-MuLV nucleotide sequence exhibits three levels of oligomeric repetitions, suggesting a stepwise molecular evolution. J. Mol. Evol. 35:420-428.
LAPREVOTTE, I., S. BROUILLET, C. TERZIAN, and A. HENAUT. 1997. Retroviral oligonucleotide distributions correlate with biased nucleotide compositions of retrovirus sequences, suggesting a duplicative stepwise molecular evolution. J. Mol. Evol. 44:214-225.
LAPREVOTTE, I., A. HAMPE, C. J. SHERR, and F. GALIBERT. 1984. Nucleotide sequence of the gag gene and gag-pol junction of feline leukemia virus. J. Virol. 50:884-894.
MALIK, H. S., W. D. BURKE, and T. H. EICKBUSH. 2000. Putative telomerase catalytic subunits from Giardia lamblia and Caenorhabditis elegans. Gene 251:101-108.
MORGENSTERN, B., A. DRESS, and T. WERNER. 1996. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93:12098-12103.
MYERS, G. 1997. Retroviral sequences. Pp. 709-755 in J. M. COFFIN, S. H. HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold Spring Harbor Laboratory Press, New York.
NEEDLEMAN, S. B., and C. D. WUNSCH. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.
OHNO, S. 1988. Codon preference is but an illusion created by the construction principle of coding sequences. Proc. Natl. Acad. Sci. USA 85:4378-4382.
OHNO, S., and T. YOMO. 1990. Various regulatory sequences are deprived of their uniqueness by the universal rule of TA/CG deficiency and TG/CT excess. Proc. Natl. Acad. Sci. USA 87:1218-1222.
OU, S.-H. I., and R. B. GAYNOR. 1995. Intracellular factors involved in gene expression of human retroviruses. Pp. 97-184 in J. LEVY, ed. The Retroviridae. Vol. 4. Plenum Press, New York and London.
PEREIRA, L. A., K. BENTLEY, A. PEETERS, M. J. CHURCHILL, and N. J. DEACON. 2000. A compilation of cellular transcription factor interactions with the HIV-1 LTR promoter. Nucleic Acids Res. 28:663-668.
PETERLIN, B. M. 1995. Molecular biology of HIV. Pp. 185-238 in J. LEVY, ed. The Retroviridae. Vol. 4. Plenum Press, New York and London.
RABSON, A. B., and B. J. GRAVES. 1997. Synthesis and processing of viral RNA. Pp. 205-261 in J. M. COFFIN, S. H. HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold Spring Harbor Laboratory Press, New York.
SETO, M. H., T. K. BRUNCK, and R. L. BERNSTEIN. 1989. Overlapping redundant sextuplets identical with regulatory elements of HIV-1 and SV40. Nucleic Acids Res. 17:2783-2800.
SMITH, T. F., and M. S. WATERMAN. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
SOUTHERN, E. 1972. Repetitive DNA in mammals. Pp. 19-27 in R. A. PFEIFFER, ed. Modern aspects of cytogenetics: constitutive heterochromatin in man. Symposia Medica Hoechst No. 6. Schattauer Verlag, Stuttgart, Germany.
TAUTZ, D., M. TRICK, and G. A. DOVER. 1986. Cryptic simplicity in DNA is a major source of genetic variation. Nature 322:652-656.
TERZIAN, C., I. LAPREVOTTE, S. BROUILLET, and A. HENAUT. 1997. Genomic signatures: tracing the origin of retroelements at the nucleotide level. Genetica 100:271-279.
THOMPSON, J. D., F. PLEWNIAK, and O. POCH. 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27:2682-2690.
TREIER, M., C. PFEIFLE, and D. TAUTZ. 1989. Comparison of the gap segmentation gene hunchback between Drosophila melanogaster and Drosophila virilis reveals novel modes of evolutionary change. EMBO J. 8:1517-1525.
VARTANIAN, J.-P., A. MEYERHANS, B. ASJO, and S. WAIN-HOBSON. 1991. Selection, recombination, and G→A hypermutation of human immunodeficiency virus type 1 genomes. J. Virol. 65:1779-1788.
VOGT, P. K. 1997. Historical introduction to the general properties of retroviruses. Pp. 1-25 in J. M. COFFIN, S. H. HUGHES, and H. E. VARMUS, eds. Retroviruses. Cold Spring Harbor Laboratory Press, New York.
WU, W., B. M. BLUMBERG, P. J. FAY, and R. A. BAMBARA. 1995. Strand transfer mediated by immunodeficiency virus reverse transcriptase in vitro is promoted by pausing and results in misincorporation. J. Biol. Chem. 270:325-332.
ZAGURY, J. F., G. FRANCHINI, M. REITZ et al. (15 co-authors). 1988. Genetic variability between isolates of human immunodeficiency virus (HIV) type 2 is comparable to the variability among HIV type 1. Proc. Natl. Acad. Sci. USA 85:5941-5945.
ZHANG, H., and H. M. TEMIN. 1994. Retrovirus recombination depends on the length of sequence identity and is not error prone. J. Virol. 68:2409-2414.

PIERRE CAPY, reviewing editor
Accepted March 14, 2001
