VIRUS DETECTION USING DEGENERATE PCR PRIMERS
Field of the invention
The invention relates to a method of detecting new viruses using a high throughput polymerase chain reaction (PCR) assay.
Background of the invention
Biological materials can often become contaminated or infected with unidentified organisms. For example, cells grown in tissue culture often exhibit signs of a cytopathic effect consistent with a virus infection but the identity of the virus may not be apparent. Human blood products, such as factor VIII for the treatment of haemophiliacs, can be contaminated with unidentified viruses, as was demonstrated by infection of many haemophiliacs with human immunodeficiency virus in the early 1980s. Similarly, two decades ago 20% of individuals who received transfused blood contracted hepatitis C (Randall, 2001, J Pediatr. Oncol. Nurs. 18(1), 4-15).
Summary of the invention
PCR allows amplification of a specific region of a polynucleotide. The specificity of the reaction is due to the primers which, during the course of PCR, bind to the region to be amplified in a sequence specific manner. Degenerate primers can be designed which amplify sequence from substantially all members of a virus family. Such primers typically bind to nucleotide sequence which is conserved across the virus family. The invention provides a PCR based high throughput screen that uses such degenerate primers for detecting unknown viruses.
In particular, the invention provides a high throughput method for screening a biological sample for unknown viruses, which method comprises
(a) subjecting DNA from the sample to PCR amplification conditions using simultaneously multiple pairs of degenerate primers, wherein each primer binds a sequence that is conserved across members of a family of viruses and each pair of primers selectively directs amplification of sequence of said family;
(b) sequencing PCR product obtained in step (a); and
(c) comparing the sequence of the PCR product with the sequences in at least one database comprising viral sequences to determine whether the sequence is present in, or absent from, the database, wherein absence of the sequence from the database suggests that the sequence may be from an unknown virus.
Detailed description of the invention
General description
There are a number of human diseases in which unidentified viruses are thought to play a causative role. For example, unidentified viruses are believed to play a role in cancers such as leukaemia, autoimmune diseases such as rheumatic disease, cardiovascular diseases such as dilated cardiomyopathy and Kawasaki disease, and prostatitis (zur Hausen 2001 The Lancet 357, 381-384; Greaves 1997 The Lancet 349, 344-349; Rowley and Shulman 1998 Clinical Microbiology Reviews 11(3), 405-414; Kawai 1999 Circulation 99, 1091-1100; and Domingue and Hellstrom 1998 Clinical Microbiology Reviews 11(4), 604-613). Particles resembling retroviruses have been reported in affected tissue from patients with psoriasis, Sjogren's syndrome and rheumatoid arthritis (Iversen 1990 J. Invest. Dermatol. 90, 41S-3S; Garry et al 1990 Science 250, 1127-9; Yamano et al 1997 J. Clin. Pathol. 50, 223-30; and Stransky et al 1993 Br. J. Rheumatol. 32, 1044-8). The invention provides a way of screening for the viruses which may cause or contribute to such diseases. Once identified, the viruses may be used as a target for developing diagnostic tests for, or therapies against, the diseases.
The method of the invention is based on obtaining sequences from viruses so that they can be compared with known viral sequences to determine whether they are from novel viruses. The sequences of the novel viruses are amplified using PCR primers which recognise sequences which are conserved (similar/homologous) in known members of virus families. The primers direct amplification of sequence between the conserved regions to give a PCR product whose sequence can be compared with that of known viruses.
The biological sample which is screened may be any sample susceptible to infection by a virus. It may, for example, be a tissue culture sample (e.g. tissue culture supernatant), or a sample of animal (including human) or plant material. In a particularly preferred embodiment the invention is directed to the identification of unknown human viruses, and in this case the sample will generally be derived from one or more humans. A sample derived from a human or animal may be from a range of tissue and fluid types, for example blood serum, seminal fluid, breast milk, saliva, cerebrospinal fluid, urine, bile, bronchial lavage fluid, nasal secretion, eye secretion or vaginal wash.
Before the sample is subject to PCR it may be processed. In one embodiment the virus material in the sample is concentrated, for example by ultracentrifugation. The virus material may also be purified in a manner which increases the content of viral nucleic acid relative to non-viral nucleic acid. For example the viral nucleic acid may be concentrated by centrifuging the biological sample under conditions such that cell debris is pelleted and virus particles remain in the supernatant; collecting the supernatant; and centrifuging the supernatant under conditions such that virus particles are pelleted.
The initial centrifugation to pellet the cell debris may, for example, be carried out at 100 to 10,000 g, preferably from 1000 to 10,000 g. The subsequent centrifugation to pellet the virus particles is carried out at a higher g force, for example 50,000 to 500,000 g, preferably about 100,000 g.
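Because the centrifugation conditions above are stated as g forces, it can help to convert them to a rotor speed for a given instrument. The following is a minimal sketch using the standard relation RCF = 1.118 x 10^-5 x r x rpm^2 (r in cm). The function names and the 8 cm rotor radius in the usage note are illustrative assumptions and are not taken from this specification.

```python
import math

# Standard conversion constant for RCF (in g) from rpm and radius in cm.
RCF_CONSTANT = 1.118e-5


def rcf_for_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (in g) at a given rotor speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_for_rcf(rcf_g: float, radius_cm: float) -> float:
    """Rotor speed (rev/min) needed to reach a given g force at a radius."""
    return math.sqrt(rcf_g / (RCF_CONSTANT * radius_cm))
```

For example, with a hypothetical rotor radius of 8 cm, `rpm_for_rcf(100000, 8.0)` gives a speed between 33,000 and 34,000 rpm for the preferred virus-pelleting step.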
The purification of viral nucleic acid may include a step of treating a suspension comprising the virus with a nuclease so as to digest extraneous nucleic acid, wherein the viral nucleic acid is protected from digestion by viral coat or core protein. The nuclease is preferably a non sequence-specific nuclease which digests DNA and/or RNA, for example micrococcal nuclease S7 (Roche Molecular Biochemicals, Catalogue 107 921).
The processing may also comprise a nucleic acid purification, such as phenol/chloroform nucleic acid purification or the use of a column which selectively binds nucleic acid. In one embodiment purification is carried out using a Qiagen™ column.
Typically processing of the sample increases the purity of the virus nucleic acid present in the sample (for example leading to an increase in concentration of 2-fold to 1000-fold of viral nucleic acid).
The processing of the sample may comprise the reverse transcription of viral RNA in the sample to DNA, i.e. RNA from the unknown virus is processed to produce the equivalent (such as the same or a complementary) DNA sequence. Thus the DNA which is subject to PCR conditions may be cDNA. This is required when the unknown virus has an RNA genome. Thus the processing may comprise reverse transcription of the RNA to produce a complementary DNA strand and then optionally synthesising a second DNA strand before carrying out PCR. This can be achieved by using a primer which directs initiation at random sequences in a reverse transcription reaction and then in a second strand synthesis reaction.
Random reverse transcription may be directed using a primer which directs initiation of DNA synthesis at random sequences. Such a primer may be made by synthesising it so that it contains a random sequence, for example a sequence of at least 6 consecutive nucleotides (e.g. from 6 to 20 nucleotides) wherein each nucleotide may be any of the four possible natural nucleotides, i.e. A, T, C or G. In other words, such a primer contains a sequence NNNNNN wherein each N is A, T, C or G.
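To illustrate what a random primer of the form NNNNNN represents, the following minimal sketch enumerates the sequences covered by a run of N positions and counts them. The function names are illustrative assumptions; the only facts taken from the text are that each N may be A, T, C or G and that a run of at least 6 positions is used.

```python
from itertools import product

BASES = "ATCG"  # the four natural nucleotides named in the text


def expand_random_primer(length: int = 6):
    """Yield every sequence represented by `length` consecutive N positions."""
    for combo in product(BASES, repeat=length):
        yield "".join(combo)


def pool_complexity(length: int = 6) -> int:
    """Number of distinct sequences in a fully random primer pool."""
    return len(BASES) ** length
```

A random hexamer therefore corresponds to a pool of `pool_complexity(6)` = 4096 distinct sequences, and the pool grows four-fold with each additional N position.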
In one embodiment a "single tube system" is used for the reverse transcription and then PCR with the multiple pairs of degenerate primers. In such a system the sample (typically after being processed) is added to a mixture of reagents which allow both reverse transcription and PCR to occur. Thus typically the mixture will comprise both a reverse transcriptase and a thermostable DNA polymerase. The mixture may comprise the Titan™ reagents from Roche Molecular Biochemicals™ (cat no. 1855476) which use the avian myeloblastosis virus reverse transcriptase and a Pwo (Pyrococcus woesei) thermostable DNA polymerase. Alternatively the ProSTAR™ system from Stratagene™ may be used.
The PCR reaction is carried out in a PCR mixture that generally comprises the following: the template DNA (which will be amplified in the event of virus detection), one or more primer pairs specific for members of a virus family, a thermostable polymerase enzyme (typically a DNA polymerase, such as Taq polymerase), deoxynucleotide triphosphates (dATP, dTTP, dCTP and dGTP) and a suitable buffer.
The PCR reaction generally comprises cycles of the following steps: a denaturation step, a primer annealing step and a polynucleotide synthesis step. Typically the PCR reaction comprises at least 25 cycles, such as 30, 35, 40 or more cycles, up to a maximum of 60 cycles for example.
Generally, in the denaturation step, the PCR mixture is heated to a temperature at which the DNA in the PCR mixture (in particular the region to be amplified) denatures to single-stranded form. The denaturing temperature is generally from 85 to 98 °C.
In one embodiment the PCR reaction comprises a "hot start" in which the PCR mixture is kept at the denaturing temperature for an extended amount of time before commencement of the thermal cycles, such as for 5 to 30 minutes, preferably 10 to 20 minutes. The use of Amplitaq Gold™ DNA polymerase (Applied Biosystems™) is preferred when the PCR reaction comprises a hot start.
In the primer annealing step the primers bind to template nucleotide sequence in a sequence specific manner. This step is generally carried out at a temperature of from 30 to 65 °C. In the polynucleotide synthesis step the polymerase replicates/synthesises nucleotide sequence based on template sequence by addition of nucleotides to the 3' end of the bound primers. This step is generally carried out at about 72 °C.
In the method of the invention, the sample (generally after processing as described above) is subject to PCR conditions using a panel of multiple degenerate primer pairs. In the course of a PCR reaction such primers are capable of binding the conserved sequences of the genome of a family of viruses. These conserved regions typically have a role in providing a necessary or advantageous activity or property to the virus. Generally, the conserved sequences may be coding or non-coding sequences. In one embodiment the conserved sequences code for or are from virus proteins which have the following activities: DNA or RNA polymerase (replicase), topoisomerase (helicase/gyrase), endonuclease (integrase), nucleic acid binding protein, protease, transcription factors, envelope glycoproteins, structural protein (e.g. capsid or nucleocapsid protein).
As discussed above, multiple pairs of primers are used in the method, each of the primer pairs used being selective (or specific) for members of a virus family (for example selective for a subfamily or genus). In the disclosure below regarding the numbers of primers used in different embodiments of the invention it is understood that this refers to the numbers of primers which are substantially specific for members of a virus family. However, in some embodiments additional primer pairs may be used which are selective for more than one family (for example selective for 2 to 10, such as 3 to 6 families). Such embodiments are within the scope of the present invention.
The panel of primer pairs may comprise sets of primer pairs which perform a nested PCR reaction. Generally such a set of primer pairs comprises a first and second primer pair. The first primer pair is able to amplify a template nucleotide sequence from a virus to form a PCR product. The second primer pair is able to amplify a nucleotide sequence using the PCR product generated by the first primer pair as a template. Multiply nested sets of primer pairs may also be used. The use of nested sets of primer pairs allows increased sensitivity and specificity.
The panel of primers used is capable of detecting viruses which are single-stranded or double-stranded DNA or single-stranded or double-stranded RNA viruses. The viruses are generally capable of infecting prokaryotic or eukaryotic cells, such as bacterial, animal, plant, yeast or fungal cells. Preferably the viruses are mammalian (preferably primate) or avian viruses, such as human, pig, horse, sheep, goat, cow, chicken, turkey or duck viruses.
The viruses are typically from any combination of the following families: Adenoviridae, Arenaviridae, Arteriviridae, Astroviridae, Birnaviridae, Bunyaviridae, Caliciviridae, Circoviridae, Coronaviridae, Deltavirus, Filoviridae, Flaviviridae, Hepadnaviridae, Herpesviridae, Orthomyxoviridae, Papovaviridae, Paramyxoviridae, Parvoviridae, Picornaviridae, Polydnaviridae, Poxviridae, Reoviridae, Retroviridae, Rhabdoviridae, Togaviridae or Bornavirus.
Typically in the method 12 to 300 different primer pairs are used, such as 24 to 200 or 48 to 100 primer pairs. These primers may all be used in the same multi-well plate (placed on a thermal cycling machine). The plate may be a 96-well or 384-well plate. In a preferred embodiment at least one of the wells in which the PCR is done comprises more than one primer pair, such as 2, 3, 4, 5, 6, 7, 8 or 9 primer pairs. Typically 3 to 96, such as 12 to 48, of the wells comprise more than one primer pair.
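The distribution of primer pairs over a multi-well plate described above can be sketched as a simple layout table. This is only an illustration: the well naming (row letter plus column number), the fixed number of primer pairs per well and the function names are assumptions, not part of the method as described.

```python
from string import ascii_uppercase


def plate_wells(rows: int = 8, cols: int = 12):
    """Well names in row-major order, e.g. A1..H12 for a 96-well plate."""
    return [f"{ascii_uppercase[r]}{c}" for r in range(rows) for c in range(1, cols + 1)]


def assign_pairs(pair_ids, pairs_per_well: int = 1, rows: int = 8, cols: int = 12):
    """Map each well to the list of primer pair identifiers placed in it."""
    wells = plate_wells(rows, cols)
    layout = {}
    for i, pair_id in enumerate(pair_ids):
        well_index = i // pairs_per_well  # fill one well before moving to the next
        if well_index >= len(wells):
            raise ValueError("more primer pairs than the plate can hold")
        layout.setdefault(wells[well_index], []).append(pair_id)
    return layout
```

For instance, 100 hypothetical primer pairs placed two per well occupy 50 wells of a 96-well plate; a 384-well plate corresponds to `rows=16, cols=24`.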
In one embodiment, some or all of the primer pairs used in the same well carry different labels. Thus, one or both primers of each primer pair carries a label. When both primers of a primer pair carry a label then these labels are different from each other. Typically, at least one of the primers in each primer pair will carry a different label from that used for the other primer pairs in the same well. The PCR product generated by labelled primers carries the labels present on the primers.
Thus, after different primer pairs have been used for PCR in the same well, detection of the labels in the PCR products can be used to deduce which primer pair has directed the PCR reaction. In one embodiment all forward primers of the group are labelled with one colour and the reverse primers are labelled with a different colour.
In a preferred embodiment the primers are labelled with a fluorescent label, such as fluorescein based labels (e.g. fluorescein isothiocyanate). Different primer pairs may be labelled with fluorescent labels of different colours. The fluorescent labels which are used may be capable of detection by a Beckman Coulter CEQ2000™ or Applied Biosystems A3700™ fluorescent DNA analyser. The fluorescent labels may be obtained from Beckman Coulter™ or Applied Biosystems™.
Another way of being able to determine which PCR products are generated by which primer pair is for each primer pair in the group to generate a PCR product of different size to the PCR products generated by the other primer pairs of the group. Typically each PCR product which is generated by the group of primers differs in size from all the other PCR products by at least 20, such as at least 50, 100, 200, 500, 1000 or more nucleotides. Each PCR product may for example differ in size from all other PCR products by up to a maximum of 3000 nucleotides.
In a preferred embodiment, multiple biological samples are screened simultaneously by subjecting DNA from multiple samples to PCR conditions using simultaneously multiple pairs of primers. Generally each of the samples is from a different (typically human) individual. Typically 2 to 80, such as 5 to 40 samples are screened simultaneously in the method.
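The size-based discrimination of PCR products described above can be checked computationally before a group of primer pairs is placed in the same well. The following is a minimal sketch; the function names, the expected product sizes and the matching tolerance are illustrative assumptions, and products from unknown viruses may in practice differ in size from those predicted for known family members.

```python
from itertools import combinations


def sizes_resolvable(product_sizes, min_gap: int = 20) -> bool:
    """True if every pair of expected product sizes differs by at least min_gap."""
    return all(abs(a - b) >= min_gap for a, b in combinations(product_sizes.values(), 2))


def identify_pair(observed_size: int, product_sizes, tolerance: int = 10):
    """Return the single primer pair whose expected size matches, else None."""
    matches = [name for name, size in product_sizes.items()
               if abs(size - observed_size) <= tolerance]
    return matches[0] if len(matches) == 1 else None
```

With hypothetical expected sizes `{"pairA": 200, "pairB": 260, "pairC": 400}` the group is resolvable at the minimum 20-nucleotide spacing, and an observed product of 255 nucleotides would be attributed to `pairB`.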
In one embodiment, DNA from multiple samples is mixed together before being subject to PCR conditions. Typically 2 to 10 such as 5 to 8 samples are pooled together in this way. After the DNA has been subject to PCR conditions any PCR product which is obtained may be sequenced. Typically prior to sequencing the PCR product is gel purified and cloned into a vector, for example a plasmid or a bacteriophage vector.
Suitable plasmids are known and commercially available, such as pBluescript™ (Stratagene) and pGEM-T-Easy™ (Promega). Suitable bacteriophage include bacteriophage λ and M13. Alternatively the sequencing reaction may be carried out on the PCR product itself, for example using one of the PCR primers as a sequencing primer.
Preferably an automated sequencer is used to obtain the sequence of the PCR product, such as a Beckman Coulter CEQ2000™ or Applied Biosystems A3700™ DNA analyser.
Designing the primers
Each of the primer pairs used in the method of the invention binds a sequence conserved across members of a virus family and selectively directs amplification of sequence from the members of the family. The multiple primer pairs which are used are typically designed by:
(i) providing a plurality of amino acid sequences from members of a first virus family,
(ii) comparing the sequences to identify conserved regions,
(iii) designing a first primer pair using a computer based method, wherein each primer in the pair binds a nucleotide sequence that encodes a conserved region identified in (ii) and wherein the primer pair is designed to amplify by PCR the nucleotide sequence between the nucleotide sequences that encode conserved regions in members of the first virus family, and
(iv) repeating steps (i) to (iii) for each virus family.
The multiple primer pairs may also be designed by:
(i) providing a plurality of nucleotide sequences from members of a first virus family,
(ii) comparing the sequences to identify conserved regions,
(iii) designing a first primer pair using a computer based method, wherein each primer in the pair binds a conserved region identified in (ii) and wherein the primer pair is designed to amplify by PCR the nucleotide sequence between the conserved regions in members of the first virus family, and
(iv) repeating steps (i) to (iii) for each virus family.
The multiple pairs of primers are capable of detecting unknown viruses in a sample, wherein such a sample originates from a single individual or is a pooled sample from individuals of the same species. Thus the panel of primers detects viruses which infect the same species.
The number of primers designed by the above steps is typically the same as the numbers of primers mentioned above for use in the method of the invention. The primer pairs which are designed bind sequence which is conserved across members of a virus family. The panel of primer pairs which is designed may comprise primer pairs that bind sequence which is conserved across substantially all members of the family or across a subset of the members of the family, for example across all members of a subfamily or of a genus. Generally, the primer pairs bind at least 70%, at least 80%, or at least 90% of the known viruses of the family, subfamily or genus. Preferably less than 10, such as less than 5, primer pairs will be used for the detection of any given family, subfamily or genus in the panel.
The panel of primer pairs is generally capable of detecting viruses from at least 10, 15, 20, 30 or more families, typically up to a maximum of 35 families.
The panel of primer pairs may comprise sets of primer pairs which perform a nested PCR reaction. Generally such a set of primer pairs comprises a first and second primer pair. The first primer pair is able to amplify a template nucleotide sequence from a virus to form a PCR product. The second primer pair is able to amplify a nucleotide sequence using the PCR product generated by the first primer pair as a template. The use of nested sets of primer pairs allows increased sensitivity. In a preferred embodiment each primer pair is specific for a particular virus family, so that it does not detect viruses of other families.
The plurality of amino acid or nucleotide sequences are provided from different known viruses of the same family. The sequences will be for the same protein of the different viruses. Typically at least 5, 10, 20, 50, 100 or more sequences are provided. The maximum number of sequences provided will, for example, be 300 sequences.
Each of the sequences which is provided is typically at least 20, 50, 100, 200 or more amino acids or nucleotides in length. In general the maximum length of the nucleotide sequences is 1000 nucleotides and the maximum length of the amino acid sequences is 300 amino acids. The sequences may be obtained from a database of sequences, such as GenBank. The sequences may be obtained from a database comprising virus sequences which are organised into homologous protein families (based on sequence similarity relationships). In a preferred embodiment the sequences are obtained from the VIDA database (described in Alba et al (2001) Nucleic Acids Research 29, 133-136) or the Virus Division of GenBank. The sequences may be provided in the form of a database, preferably in computer-readable form. The sequences are preferably provided in the form of a computer-readable database constructed using programs which identify homologous protein families, such as GeneTableMaker, MKDOM or PSCBuilder.
The sequences which have been provided are compared to identify conserved regions. Typically such conserved regions will have a length of at least 12 nucleotides, such as at least 15, 21, 27, 36, 99 or more nucleotides (generally up to a maximum length of 200 nucleotides) or at least 4, 5, 7, 10, 25 or more amino acids (generally up to a maximum length of 50 amino acids).
Across the conserved region the virus sequences which are being provided will of course share identity or similarity. Typically the amino acids or nucleotides in at least 50% of the positions in the region will be the same in at least 50%, 60%, 70% or 80% of the viruses of the group (i.e. in the family, genus or subfamily).
The algorithm which identifies conserved regions generally uses a multiple sequence alignment method. The method may comprise (a) aligning all pairs of sequences separately to calculate a distance matrix giving the divergence of each pair of sequences, (b) calculating a guide tree from the distance matrix, and (c) aligning the sequences progressively according to the branching order in the guide tree. A preferred algorithm for aligning the conserved sequences is CLUSTALW as described in Thompson et al (1994) Nucleic Acids Research 22, 4673-80. Other algorithms that can be used for aligning sequences are MultAlin (Corpet (1988) Nucleic Acids Research 16, 10881-90) or Jalview (Clamp et al (1998) http://barton.ebi.co.uk). BLOCKS of conserved regions of amino acids may be extracted from the multiple alignments, typically using the program Blocks Multiple Alignment Processor. Alternatively the entire process of performing multiple alignments and extracting BLOCKS can be performed using BLOCKMAKER (Henikoff and Henikoff (1994) Genomics 19, 97-107).
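Once a multiple alignment is available, the identification of conserved regions can be pictured as a scan over alignment columns. The sketch below is a deliberately simplified stand-in for the BLOCKS extraction programs named above and does not reproduce them: it assumes the sequences are already aligned to equal length with "-" for gaps, it requires every column of a region to meet the conservation threshold (stricter than the "at least 50% of the positions" criterion in the text), and all names and default values are illustrative.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction: float = 0.8):
    """Flag each column whose most common residue reaches min_fraction."""
    n = len(alignment)
    flags = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        flags.append(residue != "-" and count / n >= min_fraction)
    return flags


def conserved_runs(alignment, min_fraction: float = 0.8, min_length: int = 4):
    """Return (start, end) half-open intervals of consecutive conserved columns."""
    flags = conserved_columns(alignment, min_fraction)
    runs, start = [], None
    for i, ok in enumerate(flags + [False]):  # sentinel closes a trailing run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs
```

For the made-up aligned peptides `["MKTAYDDGQW", "MRTAYDDGHW", "MKSAYDDGQW", "LKTAYDDG-W"]`, the default settings report one conserved run covering columns 3 to 7 (the shared `AYDDG` segment).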
The output from the alignment and BLOCK extraction step (i.e. the information describing the identified conserved regions) is then entered into the algorithm which designs the primers. Such output is typically in the form of partial sequences which correspond to the conserved regions (BLOCKS). These BLOCKS are input into a primer design algorithm. In one embodiment such an algorithm is CODEHOP. In the primer design step the conserved regions which are chosen as targets for primers preferably comprise few codons with degenerate counterparts, i.e. preferably the sequence has a low redundancy, such as a redundancy of less than 512 fold, 256 fold or 128 fold. Each primer binds in accordance with Watson-Crick base pairing and thus the binding is sequence specific. Each primer will thus be designed to be wholly or partially complementary to the sequence to which it binds.
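The redundancy of a conserved amino acid segment can be estimated by multiplying the number of codons that encode each residue in the standard genetic code. The sketch below counts the distinct nucleotide sequences that could encode a peptide; this is an assumption about how "redundancy" is to be computed, and a synthesised degenerate oligonucleotide may have a somewhat higher degeneracy (for example for leucine, serine and arginine). The function names are illustrative.

```python
# Number of codons per amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def coding_redundancy(peptide: str) -> int:
    """Number of distinct nucleotide sequences that encode the peptide."""
    total = 1
    for amino_acid in peptide.upper():
        total *= CODON_COUNTS[amino_acid]
    return total


def is_low_redundancy(peptide: str, limit: int = 512) -> bool:
    """True if the coding redundancy is below the chosen limit."""
    return coding_redundancy(peptide) < limit
```

Segments rich in methionine and tryptophan (one codon each) score lowest: `coding_redundancy("YMDD")` is 8, whereas `coding_redundancy("LSRL")` is 1296 and would fail even the 512-fold limit.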
Each of the primers typically has a length of at least 8 nucleotides, such as at least 10, 12, 15, 20, 30, 40 or more nucleotides (up to a maximum of 50 nucleotides for example). In one embodiment the primer may comprise at least 2, 4 or 6, up to a maximum of 10 for example, inosine bases. Inosine is able to bind to any of the four nucleotides and therefore use of inosine causes a reduction in effective redundancy. Each primer pair will be designed so that the PCR product generally has a length of at least 20, such as at least 50, 100, 200, 500, 1000 or more nucleotides (and typically up to a maximum of 5 × 10^3 nucleotides long).
Each primer is preferably designed so that it anneals to a single site, i.e. the primer will not bind to any other site in the genome of the relevant viruses. Each primer is preferably designed so that it does not exhibit secondary structure, i.e. the nucleotides in the primer will not bind substantially to any other nucleotide in the primer apart from those to which it is covalently linked. In addition preferably each primer is designed so that it does not bind other primers with the same sequence. In one embodiment the 3' region, and preferably the 3' terminal nucleotide of the primer, binds to the target sequence with high affinity; thus preferably this region or nucleotide comprises a G or C.
Generally each primer is designed to have an annealing temperature of from 30 to 65 °C, such as 50 to 60°C or 35 to 45°C. In addition each primer pair may be designed to ensure that the two primers do not bind to each other.
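Two of the design preferences above, a G or C at the 3' terminus and a temperature within a target window, lend themselves to a quick computational check. The sketch below uses the Wallace rule (2 °C per A/T, 4 °C per G/C) as a rough melting temperature estimate for short, non-degenerate oligonucleotides. The Wallace rule is not mentioned in this specification and is used here only as an assumed proxy for the annealing temperature; the function names are also illustrative.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc  # degenerate or inosine positions are ignored


def has_gc_3prime(primer: str) -> bool:
    """True if the 3' terminal nucleotide is G or C."""
    return primer.upper()[-1] in "GC"


def within_window(primer: str, low: float = 30, high: float = 65) -> bool:
    """True if the estimated temperature lies in the stated 30 to 65 deg C range."""
    return low <= wallace_tm(primer) <= high
```

For the hypothetical 16-mer `ATGCATGCATGCATGC` the estimate is 48 °C, which falls inside the 30 to 65 °C window, and the primer ends in C as preferred.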
The primers are designed by a computer based algorithm. In one embodiment such an algorithm designs primers according to the following rules:
1) A set of blocks is input, where a block is an aligned array of amino acid sequence segments without gaps that represents a highly conserved region of homologous proteins. A weight is provided for each sequence segment, which can be increased to favour the contribution of selected sequences in designing the primer. A codon usage table is chosen for the target genome.
2) An amino acid position-specific scoring matrix (PSSM) is computed for each block using the odds ratio method.
3) A consensus amino acid residue is selected for each position of the block as the highest scoring amino acid in the matrix.
4) For each position of the block, the most common codon corresponding to the amino acid chosen in step 3 is selected utilizing the user-selected codon usage table. This selection is used for the default 5' consensus clamp in step 8.
5) A DNA PSSM is calculated from the amino acid matrix (step 2) and the codon usage table. The DNA matrix has three positions for each position of the amino acid matrix. The score for each amino acid is divided among its codons in proportion to their relative weights from the codon usage table, and the scores for each of the four different nucleotides are combined in each DNA matrix position. Nucleotide positions are treated independently when the scores are combined. As an option, the highest scoring nucleotide residue from each position can replace the most common codons from step 4 that are used in the consensus clamp.
6) The degeneracy is determined at each position of the DNA matrix based on the number of bases found there. As an option, a weight threshold can be specified such that bases that contribute less than a minimum weight are ignored in determining degeneracy.
7) Possible degenerate core regions are identified by scanning the DNA matrix in the 3' to 5' direction. A core region must start on an invariant 3' nucleotide position, have length of 11 or 12 positions ending on a codon boundary, and have a maximum degeneracy of 128 (this is the default setting of CODEHOP). The degeneracy of a region is the product of the number of possible bases in each position.
8) Candidate degenerate core regions are extended by addition of a 5' consensus clamp from step 4 or 5. The length of the clamp is controlled by a melting point temperature calculation (the CODEHOP default is 60 °C) and is usually about 20 nucleotides.
9) Steps 7 and 8 are repeated on the reverse complement of the DNA matrix from step 5 for primers corresponding to the opposite DNA strand.
In one embodiment CODEHOP (Rose et al (1998) Nucleic Acids Research 26, 1628-1635) is used to design the primer pairs. This program uses the above rules.
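Rules 6 and 7 above can be illustrated with a short scan over a simplified DNA matrix. This sketch is not the CODEHOP implementation: it assumes the matrix has already been reduced to the set of allowed bases at each position (listed 5' to 3', starting on a codon boundary), and it interprets "ending on a codon boundary" as the 5' end of the core falling on the first base of a codon. Function names and the example input are illustrative.

```python
from math import prod


def degeneracy(position_sets) -> int:
    """Product of the number of possible bases at each position (rule 7)."""
    return prod(len(bases) for bases in position_sets)


def find_core_regions(position_sets, lengths=(11, 12), max_degeneracy: int = 128):
    """Scan 3' to 5' for candidate degenerate cores.

    Returns (start, end, degeneracy) tuples with half-open 0-based coordinates.
    """
    cores = []
    for end in range(len(position_sets) - 1, -1, -1):  # 3' to 5' scan
        if len(position_sets[end]) != 1:  # the 3' position must be invariant
            continue
        for length in lengths:
            start = end - length + 1
            if start < 0 or start % 3 != 0:  # 5' end must sit on a codon boundary
                continue
            d = degeneracy(position_sets[start:end + 1])
            if d <= max_degeneracy:
                cores.append((start, end + 1, d))
    return cores
```

As an example, the peptide M-D-D-W reverse-translates to `ATG GAY GAY TGG`, which can be written as `["A", "T", "G", "G", "A", "CT", "G", "A", "CT", "T", "G", "G"]`. Its degeneracy is 4, and the scan reports a 12-position core and an 11-position core, both starting at position 0.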
The primers designed by the algorithm may then be mapped back to the original sequence to choose primer pairs which provide the desired length of PCR product.
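Choosing primer pairs by product length can be sketched as follows, assuming each candidate primer has already been mapped to a coordinate on a reference sequence. The coordinate convention (forward primers by the position of their 5' end, reverse primers by the position just past their 5' end on the same strand), the names and the length limits are illustrative assumptions.

```python
def choose_pairs(forward_sites, reverse_sites, min_len: int = 100, max_len: int = 1000):
    """Return (forward, reverse, product_length) tuples within the length limits.

    forward_sites / reverse_sites map primer names to reference coordinates.
    """
    pairs = []
    for f_name, f_start in forward_sites.items():
        for r_name, r_end in reverse_sites.items():
            length = r_end - f_start  # negative if the reverse site is upstream
            if min_len <= length <= max_len:
                pairs.append((f_name, r_name, length))
    return sorted(pairs, key=lambda pair: pair[2])
```

With hypothetical sites `{"F1": 100, "F2": 900}` and `{"R1": 450, "R2": 1500}`, the combinations F1/R1 (350 nucleotides) and F2/R2 (600 nucleotides) are retained, while F1/R2 is too long and F2/R1 is in the wrong orientation.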
The above-described computer based method is repeated until the desired number of primer pairs have been designed. Optionally the primer pairs can then be synthesised and tested. They are typically tested to determine the optimal conditions for using the primers in a PCR reaction.
In one embodiment the primers are tested for their ability to amplify one or more of the plurality of nucleotide sequences from known viruses which were used to design the primers, or in the case of amino acid sequences from known viruses being used to design the primers the primers may be tested for their ability to amplify the nucleotide sequence from the virus which encodes the amino acid sequence.
The primers may be tested in a range of buffer conditions to determine optimal buffer conditions for PCR using the primers. The buffer conditions which may be tested include pH (typically between 7 and 10), magnesium concentration (typically from 0.5 mM to 5 mM), potassium chloride (typically from 0 to 100 mM), ammonium chloride (typically 0 to 100 mM), glycerol (typically 0 to 20%), dimethylsulphoxide (typically 0 to 20%), ethanol (typically 0 to 20%), sorbitol (typically 0 to 20%) or betaine (typically 1M betaine).
The primers may be tested at a range of different temperatures to determine the optimal temperatures in the PCR reaction. Preferably the primers are tested in PCR reactions in which a range of primer annealing temperatures is tested. Typically the range of temperatures is from 30 to 65 °C.
The panel of primer pairs or a group of primers within the panel may be designed to be used together on the same plate (i.e. using the same thermal cycles). Thus such primer pairs will be designed to work at the same annealing temperature.
In one embodiment a group of primer pairs within the panel are designed to have similar optimal conditions for use in PCR so that they can be used optimally in the same well or reaction vessel, i.e. that they can be used in multiplex PCR. Such a group typically comprises at least 2, 3, 4, 5, 6 or more primer pairs (up to a maximum of 8 primer pairs for example).
To provide such primer pairs the computer based method steps may be used to design primer pairs which are calculated to have similar annealing temperatures and/or the primers are tested to select primer pairs which can be used optimally together. Such testing typically determines whether the primers work optimally with the same buffers and/or whether the primers have similar annealing temperatures.
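Selecting primer pairs with similar annealing temperatures for use in the same well can be sketched as a simple greedy grouping. The 2 °C tolerance, the greedy strategy and the function name are assumptions for illustration; the maximum group size of 8 follows the example given above for multiplex groups.

```python
def group_for_multiplex(annealing_temps, tolerance: float = 2.0, max_size: int = 8):
    """Group primer pairs whose annealing temperatures lie within `tolerance`.

    annealing_temps maps primer pair names to temperatures in deg C.
    Pairs are sorted by temperature and a new group is started whenever the
    spread from the coolest member exceeds the tolerance or the group is full.
    """
    groups, current = [], []
    for name, temp in sorted(annealing_temps.items(), key=lambda item: item[1]):
        too_far = current and temp - annealing_temps[current[0]] > tolerance
        if too_far or len(current) >= max_size:
            groups.append(current)
            current = []
        current.append(name)
    if current:
        groups.append(current)
    return groups
```

For hypothetical temperatures `{"A": 50.0, "B": 51.5, "C": 55.0, "D": 56.0, "E": 60.0}` this yields the groups `[["A", "B"], ["C", "D"], ["E"]]`. Buffer compatibility, which the text also mentions, would still need to be confirmed experimentally.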
Validating a PCR product as being from a novel virus
After sequencing the PCR product(s), the next step is to determine whether each sequence is present in at least one database of known nucleic acid sequences, typically sequences of viruses known to infect the individuals from which the samples are derived. Appropriate databases include the virus subdivision of GenBank or the VIDA database.
In addition each sequence is typically also compared with a database of human sequences to exclude sequences which are human sequences. Such a database is generally a comprehensive or consensus human genome database. Preferably, at least one of the human sequence databases searched contains an essentially complete human genome sequence. However, it needs to be borne in mind that, although there has recently been a great deal of publicity about the "completion" of the human genome sequence, not all the human genome has in fact been sequenced, and it is possible that a cloned sequence could fall within the unsequenced part of the genome. The human genome contains large areas with repetitive sequences, and much of the unsequenced genome is within these areas.
In order to make as comprehensive a search as possible, it is desirable to search a range of different types of database; in addition to a human genome database, it is desirable to search, for example, a database comprising expressed sequence tags (ESTs) and a database comprising repetitive elements of the human genome. Appropriate databases include GenBank, the EMBL database, the Celera human genome database, the Ensembl human genome database, the DNA Data Bank of Japan (DDBJ), the Incyte LifeSeq™ database of ESTs and the Repbase database of repetitive elements in the human genome. Where the sequence is found to be not present in any of the interrogated databases of known sequences, this indicates that the nucleic acid may be from a previously unknown virus. The nucleic acid then becomes a candidate for further investigation and may be designated a Primary Candidate Virus (PCV).
It is generally necessary to confirm by experimentation that a nucleic acid sequence designated a PCV is not in fact a human sequence. A preferred way of doing this involves designing and synthesising a specific primer set (or sets) to amplify the nucleic acid designated a PCV and determining whether the set(s) are able to amplify any DNA in a sample of complete genomic human DNA. The amplification conditions for each primer set may be optimised using the original sample from which the PCV derives or using the PCR product which is obtained in the method of the invention.
The primer set may be used to screen one or more samples of human genomic DNA, for example from 1 to 100 samples, preferably from 5 to 50 samples. As an alternative to PCR, human genomic DNA may be probed with a labelled probe containing sequence from the original PCR product (e.g. by Southern blotting). If the PCV cannot be detected in human DNA by experimentation (by PCR or hybridisation with a labelled probe), it may then be subjected to further analysis. It may be designated a Secondary Candidate Virus (SCV).
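The staged designation described above can be summarised as a small decision function. The following Python sketch is illustrative only: the function name, the argument names and the returned labels other than "PCV" and "SCV" are hypothetical, and the database search results and laboratory result are assumed to have been obtained elsewhere.

```python
# Minimal sketch of the candidate-designation logic (hypothetical names).

def classify_candidate(db_hits, detected_in_human_dna=None):
    """Designate a sequenced PCR product.

    db_hits: mapping of database name -> True if the sequence was found there
        (viral, human genome, EST and repetitive-element databases alike).
    detected_in_human_dna: None if the experimental check has not been done,
        otherwise True/False for whether PCR or probe hybridisation detected
        the sequence in human genomic DNA.
    """
    if any(db_hits.values()):
        return "known sequence"      # present in at least one database
    if detected_in_human_dna is None:
        return "PCV"                 # absent from all databases, not yet tested
    if detected_in_human_dna:
        return "human sequence"      # experimentally shown to be human
    return "SCV"                     # survives the experimental check
```

For example, a product with no hit in any searched database is reported as a PCV until the human genomic DNA experiment has been performed, and as an SCV only once that experiment is negative.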
The further analysis of an SCV may include gene walking to determine whether the original cloned nucleic acid sequence exists in nature as part of a longer sequence, such as the genomic sequence of an unknown virus. Gene walking may be carried out using techniques known in the art, such as vectorette PCR (Allen et al, PCR Methods Appl. 4:71-75), rapid amplification of cDNA ends (RACE, Frohman et al Proc Natl Acad Sci U S A. 85:8998-9002), rapid amplification of genomic ends (RAGE, Cormack and Somssich. 1997. Gene. 194:273-276) and methods derived from these. Alternatively, the SCV sequence may be "extended" by screening a DNA or cDNA library using the original cloned nucleic acid sequence as a probe.
The additional sequence information obtained through DNA walking may reveal information about the identity of the SCV which cannot be determined from the original clone. The additional information may therefore be analysed, for example to determine whether it contains an open reading frame (i.e. a sequence encoding a protein); the presence of an open reading frame provides further support for the suggestion that the SCV is a virus. Furthermore, the additional information may identify the SCV as being related to a known virus; for example, the information may identify the SCV as being a new member of a known family of viruses.
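A search for an open reading frame in the extended sequence could be sketched as follows. This Python example scans all six reading frames for the longest stretch running from an ATG to an in-frame stop codon. The standard genetic code, the ATG-only start rule and the function names are assumptions made for illustration; the description does not prescribe a particular ORF-finding method.

```python
# Minimal six-frame ORF scan (assumptions: standard stop codons, ATG start only).

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return the longest ATG...stop ORF (stop codon included), or "" if none."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    orf = strand[start:i + 3]      # close it at the in-frame stop
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```

A long ORF found in this way would then typically be translated and compared with known viral proteins, consistent with the family-level identification mentioned above.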
A further step may then be to determine whether a newly-identified candidate virus is associated with a disease, for example with a cancer, autoimmune disease, cardiovascular disease or other disease mentioned above. This may be done by obtaining a specimen from each member of a group of subjects with a disease; determining whether the cloned nucleic acid or other nucleic acid of the same virus is present in each specimen; and determining whether the proportion of subjects in whom the nucleic acid is present is greater in the group of subjects who have the disease than in a control group of subjects who do not have the disease, wherein a said greater proportion suggests that the virus may cause or contribute to the disease.
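The comparison of proportions between the disease group and the control group could, for instance, be assessed with a one-sided Fisher's exact test. The choice of this test is an assumption: the description states only that a greater proportion in the disease group is suggestive, and does not name a statistical method. The function and argument names below are hypothetical.

```python
# Minimal sketch: one-sided Fisher's exact test on a 2x2 table (assumed method).
from math import comb


def fisher_one_sided(pos_disease, n_disease, pos_control, n_control):
    """Probability of seeing at least `pos_disease` positives in the disease
    group if positives were distributed at random between the two groups."""
    total_pos = pos_disease + pos_control
    total = n_disease + n_control
    denominator = comb(total, n_disease)
    upper = min(total_pos, n_disease)
    numerator = 0
    for k in range(pos_disease, upper + 1):
        # k positives in the disease group, the rest of that group negative.
        numerator += comb(total_pos, k) * comb(total - total_pos, n_disease - k)
    return numerator / denominator
```

A small value indicates that the excess of positive specimens in the disease group is unlikely to be due to chance alone; it does not by itself establish that the virus causes the disease.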
Typically, the process of determining whether the nucleic acid is present in or absent from a specimen from a subject may be carried out by PCR using primers specific for the novel sequence (including any contiguous sequence obtained by DNA walking). Initially, perhaps from 10 to 50 patients from a disease group may be tested, but if positive results are obtained in initial studies, the investigation may be extended to a larger group (e.g. a group of up to 100, 500, 1000 or 10,000). The nature of the biological specimens taken from the members of the group varies depending on the disease association that is being investigated; where possible, specimens are from disease-affected tissue and from peripheral blood of the subjects (for a published example of this see Griffiths et al, 1999, Arthritis Rheumatism, 42:448-454). The specimens may be from the same tissue and fluid types as the biological samples used in the initial screening assay described above.
Once a new virus has been identified and found to be positively associated with a particular disease or condition, serological and genetically-based diagnostic assays for infection by the virus may readily be devised. Genetically-based assays can be developed by using the nucleotide sequence of the virus to design probes and/or PCR primers for specifically detecting the nucleic acid of the virus. Serological assays can be developed by producing recombinant proteins or protein fragments encoded by the virus and testing for the presence of antibodies to these proteins in human sera. Alternatively, antibodies specific for the proteins of the virus may be made and the antibodies used to detect the virus directly. The serological assays may take the format of an ELISA, western blot or immunofluorescence assay. Correlations may be sought between serological data and genetic data. Furthermore, the organism provides a target for the development of therapies and/or prophylactic vaccines against the disease.
The following Example illustrates the invention:
Example
The Example below refers to Figure 1, which shows how primers were designed using a database known as 'VIDA' and computer programs known as 'CLUSTALW', 'BLOCKS' and 'CODEHOP'.
Designing a panel of primers
A panel of primers was designed for detecting unknown viruses from the family Herpesviridae according to the strategy shown in Figure 1. The amino acid sequences of herpes virus DNA packaging protein UL15 were obtained from the VIDA database (Alba et al, see above). These sequences are shown in Table 1.
The sequences obtained from the VIDA database were then imported into CLUSTALW. This compares the protein sequences to identify conserved regions and then aligns the sequences according to the conserved regions. The alignment produced by CLUSTALW is shown in Table 2.
The BLOCKS program was then used to extract the sequences of the conserved regions identified by CLUSTALW, and to enter these sequences into CODEHOP. The primer sequences were then designed by CODEHOP using the conserved sequences. The output from the CODEHOP program is shown in Table 3.
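The idea of extracting conserved regions from a multiple alignment can be illustrated with a simplified sketch. The following Python function is a stand-in written for illustration only: it is not the algorithm used by BLOCKS or CODEHOP, and the thresholds, the function name and the toy alignment used to exercise it are assumptions. It reports ungapped runs of alignment columns in which only a small number of distinct residues occur.

```python
# Simplified stand-in for conserved-block extraction (not the BLOCKS algorithm).

def conserved_blocks(alignment, min_len=4, max_residues=2):
    """Find conserved, gap-free column runs in a protein alignment.

    alignment: list of equal-length aligned sequences, '-' marking a gap.
    Returns (start, end) column ranges, 0-based with `end` exclusive, in which
    every column is gap-free and holds at most `max_residues` distinct residues.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    blocks = []
    start = None
    # Iterate one column past the end so that a trailing block is closed.
    for col in range(length + 1):
        conserved = False
        if col < length:
            column = {seq[col] for seq in alignment}
            conserved = "-" not in column and len(column) <= max_residues
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks
```

Blocks found in this way correspond to the kind of conserved amino acid stretch from which degenerate primers are designed; the shorter and less variable the block, the lower the degeneracy of a primer back-translated from it.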
Table 1. All protein sequences of DNA packaging protein UL15 extracted from VIDA. The sequences are listed unaligned.
>gi_l 0180719 MFGGL GEETKRHFERLMKTKNDRLGASHRNERS IIUJGD VDAPFLNFAIPVPRRHQTVMPAIGI HNCC
DSLGIYSAITTRM YSS IACSEFDELREDSVPRCYPEITNAQAF S PMMMRVANS I IFQEYDEMECAAHR
NAYYSTrøSFISrøTSDAFKQ TVFISRFSK LIASFRDVNKI-DD^^
KMIIJfflATYFVTSVIJ_GDHAERAERI*l*RVAFDTPHFSDIV^
MSSFEGIRIGYTSHIRKAIEPVFEDIGDR RR FGAHRVDHV GETITFSFPSGL STvTFASSHNTNSI RGQDFNLLFVDEANFIRPDAVQTI IGFLNQATCKI IFVSSTNSGKASTSFLYG KGSADDLLNVVTYICaD
EHMKH TDYTN SCSC IJ-^P FI MDGA 1RRTAEMFLPDS MQEIIGGGV D ICQGDRSIFTASA
IDRFLIYRPSTVHNQDPFSQDLYVYVDPAFTAiOTKASGTGVAVIG YG DYIVFGIi3HYFLRA TGESSD
SIGYCVAQO.IQICAIHRKRFGVIKIAIEGNSNQDSAVAIATRIAIEMISYM AAVAPTPHNVSFYHSKS
NGTOVEYPYFTΛQRQKTTAFDFFIAQΗTSGRVTjASQDLVSTOTS^ RTFSGKKGGNDDTWA TMAVYISAHIPDMAFAPIRV
>gi_7673189
MFGGLLGEET RHFERLMKTKNDR GASHRNERS IRDGDMVDAPFIINFAI PVPRRHQTVMPAIGILHNCC DS GIYSAITTRMLYSSIACSEFDE RRDSVPRCYPRITNAQAF SP MMRVANSIIFQEYDEMECAAHR NAYYSTMNSFISMRTSDAFKQLTVFISRFSK LIASFRDTΛ^ IMIFDAC3IJCI^CFTWRSRRASERLIRVAFDTPHFSDIV RHFRQRKTVFLVPRRHGKT FLVPLIAIA MSSFEGIRIGYISHIRKAIEPVFΈDIGDRLRRWFGAHRVDHVKGETITFSFPSGLKSTVTFASSHNTNSI RGQDFN LFVΓEA FIRPDAVQTI IGFUSTQATCKI IFVSSTWSGKASTSFLYG KGSADDLLNVVTYICΓJ EHMKHVTΩY ATSCSCYVLNKPVFITIYRDGAMRRTA.LMFLPDSFMQEIIGGGV^ IDRFLIYRPSTVITOQDPFSQDLYVYVDPAFTAOT' ASGTGVAVIGKYGTDYIWGLEHYFLRAT.TGESSG S IGYCVAQCLIQICAIHRKRFGVI IAIEGNSNQDSAVAIA RIAIEMISYMKAAVAPTPHNVSFYHSKS NGTDVEYPYFLLQRQKTTAFDFFIAQFNSGRVI-ASQDLVSTTVSLTTDPVEYLT QLTNISEVVTGPTCT RTFSGK GGNDD WALTMAVYISAHIPDMAFAPIRV
>gi_5689285
MFGGALGESAK HFERLLRDRTffiRLGASRKiraCTjARGGSL^ DGTGIYSAIATRLLYAGIVSSEFGEVRRES SNGHISKRiraEAIiIiAPTL RVANS ITFHEYDDAQCAAHR
NAYYSTrøroOS]yKTSDAFQQLASFIDRFSKLTJAAS
KMIIJfflATYFLTSVI^DHAERAERTJ^VIFDI^^
MSSFEGIRIGYTSHIRK&IEPVFEEIGDR RR FGTQCVDHVKGETITFSFPSGSRS V FASSHIirrNSI
RGQDF-*T LFVDEAOTIRPDAVQTIIGFTJNQANCKIIFV^^ EHMKHVrmrTNATSCSCYVIJrePVFITMDGAMRRTAEMFLPDSFM EIIGGIT^
VERLIiYRPS VRNQDI SRDLYVYVDPAFTAlSriTtASGTGIAVIGRYGADYIIFGLEHFFLRA TGESAD
AIGECAAQCIAQICAIHCERFGTIRVAVEGNSNQDSAVAIATRISIDLASYVQSGVAPAPHDVCFYHSKP
AGSIWEYPFFIiQRQKTAAFDFFIARF SGRVTJiSQD VSTTISLSTDPVEY TKQL ISrLSEVVTGATGT
RTFSGKKGGYDDTWALV AVYISAHASDATFAPIRGVEATCRGP EA >gi_1869837
MFGQQLASDVQQY ERLEKQRQQ VGVDEASAGLIIGGDA RVPFI^FATATPKRHQTVVPGVG LHDCC
EHSP FSAVARRLLFNSLVPAQI.RGRDFGGDHTA T^FIjAPE VRAVAR RFRECaPEDAV'PQRimYYSV
TjraFQAL-ffiSEAFRQLVHFVRDFAQl^KTSFRASSLAETTG^
HATYF]^AAVL GDHAEQVOT'FLRLVFEIPLFSDTAVRHFRQRATVFLVPRRHGK FLVPLIALSLASFR GIKIGYTAHIRKATEPVFDEIDACLRG FGSSRVDHVKGETISFSFPDGSRSTIVFASSHNTNGIRGQDF
NLLFVDEANFIRPDAVQTIMGFI-NQANCKIIFVSSTOTGKA≤TSF^
VVTHTNATACSCYIL.-r PVFIΗTOGAVRRTADI^
PSTTTNSGLMAPELYVYVDPAFTANTF-^GTGIAWGRYRDDFIIFAΪ^^
SlαAQVLAIΛPGAFRSVRVAVEGNSSQDSAVAIATHVHTEMHRIIASAGANGPGPEIiFYHCEPPGGAVLY PFFLIuNKQKTPAI*ΕYFIKKFNSGGVMASQELVS TVRLQTDPVEYLSEQT-NNLIETVSPNTDVRMYSGKR •
NGAADDLMVAVIMAIYLAAPTGIPPAFFPITRTS
>gi_59501
MFGQQLASDVQQYIjERIiEKQRQLKVGADEASAG TMGGDAIΛVPFIJDFATATPKRHQTVVPGVGTωro
EHSPLFSAVARR LFNSLVPAQL GPJJFGGDHTAKI^F APEiVRAVAR RFKECAPADVVPQRNAYYSV l-NTFQALHRSEAFRQLVHFVRDFAQLLKTSFRASSLTETC^
HATΫFLAAVI-LGDHAEQVNTFLRLVFEIPLFSDAAVRI^^
GIKIGYTAHI - EPVFEEIDAC RG FGS RroHVKGETISFSFPDGSRS I FASSHN NGI GQDF
NLLFVDEANFIRPDAVQTIMGFMQANCKIIFVSSTNTGKASTSF^
VV HTNATACSCYILNKPVFITimGAVRRTADLFI-ADSFMQEIIGGQARETGDDRPV T SAGERFL YR PS TNSGIjmPDLYVYVDPAFTANTRASGTGVAVVGRYRDDYI IFA EHFF RALTGSAPADIARCVVH
SLTQVTJU-HPGAFRGVRVAVEGNSSQDSAVAIATHVΗTEMHRIJΛSEσADAGSGPEILFYHCEPPGSAV
YPFFI-TJraQKTPAFEHFIKKFNSGGVMASQEIVSATVRL^^
RNGASDDLMVAVIMAIY AAQAGPPHTFAPITRVS
>gi_2S05992 MFGKALSRETIQYFE LRKEVQSRSGA lSrPJAEAQTGGEDDVKTAFLOTAIPTPQRHQ VVPGVG HDC
(^TAQIFASVARRLLFRSLSK RGGES ERr^PSSVEAYVDP VKQALKTISFVEY-TOAEaRSCR.TAYYS
IM1TOFDSLRSSDAFHQVANFVARFSR VDTSFNGADLDGDGQQTS RI VDVPTYGKQRGTLE FQKMI
MHATYFIAAVILGDHADRIGAF KMVF 'PEFSDATIRHFRQRATVFLVPRRHGKTWFTjVPLIALALiATF
KGIKIGYTAHIRKA EPVFDEIGAR RQWFGNSPVDHV GENISFSFPDGS STIVFASSHNTNGIRGQD FNLLFVDEANFIRPEAVQTI IGFLNQTNCKI IFVSSTIWGKASTSFLYH KGAADELLNVVTYICDEHME
RVKAHΪNATSCSCYILNKP IT DGAMR1OTAELFLPDS
YRPSTVANQDIMSNMLYVYVDPAFTTNAMASGTGVAVVGRYRSNWIVFGLEHFFLSALTGSSAELIARCV
AQCLAKWAIHSRPFDSVRIAVEGNSSQDAAVAIATNIQLEL-ITLRQADVVHMPGTVLFYHCTPPGSSVA
YPFFLLQKQKTGAFDHFIKAFNSGLVLASQELISNTVRLQTDPVEYI-IiTQMKNLTEVITGTSETRVFTGK RNGASDDMLVALVMAVYMASLPPTTNAFSSLSTQ
>gi_330792
MFGRVLIGRETVQYFEAIJUFFIVQARRGAKILIRAAΞAQNGGEDDAKTAFIJ-TFAIPTPQRHQTVVPGVGTLHDC CETAQIFASVARRIJ.FRSLSKWQSGEARERLDPASVEAYVDPKVRQALKTISFVEYSDDEARSCRNAYYS IMNTFDALRSSDAFHQVASFVARFSRLVDTSFNGADI^GDGQQAS-OΪARVDVPTYGKQRGTLELFQKMIL Π-ATΪFI I GDHADRIGAFL-^VF T EFSDATIRHFRQRAT FL PRRHGKT F P IAIAR-A F * KGIKIGYTAHIR-^TEPVFDEIGARLRQ FGNSPVXIHVKGENISFSFPDGSKSTIVFASSHCΓΓNGIRGQD F-TLLFVDEANFIRPEAVQTIIGFIINQTNCKIIFVSSTOT RVKAHTNATACSCYIMKPVFI MDGAMRNTAELFLPDSFMQEIIG∞ YRPS VANQDI SSD YVDPAFTIWA ASGTGVA GRYRSN V FG^FFIHFF S I-TGSSAE IARCV AQCLAQVFAIHKRPFRJSVRVAVEGNSSQDAAVAIA NIQLET-TRLRRADVVPMPGAVLFYHCTPHGSSV YPFFT^QKQCTGAFDHFIKAFNSGSVI^QELVSOTVRLQTDPVE RNGASDD LVALVMAVYLSSLPPTSDAFSΞLPAQ
>gi_971317
MFGGAVGEQSARYFQPJI*IffiRQRRAAERGARPDGGGGARGEDDARVPFLDFAVAAPKRHQTVVPGVGTLH * GYCELAPLFAATASRT.TiTiTSMARAEAGIJTOGTGEAHVSRETAGVIiSAI^FAAHPPAEAAAHCNAYHSVMA
AI^SMRASGAFAQVAAFVARFSRLVGTSFSHLGGGDDADPPRAKRARVEPPSGQTRGALELFQKMILMPA
TYFVAATIi GEHAERIGAFl_RVAFNTPDFSDAAVAHFRQRATVFLVPRRHGKT FLVPLIALALATFKGI IGYTAHIRKATEPVFEEIVARLRQWFGGERVDHVKGEVISFSFPDGARSTIVTASSHNTNGIRGQDFNL LFVDEANFIRPEAVQTIVGFIJJQASCKIIFVSSTNTGKASTSFLYNLKGASDGI NVVTYIC^ AHGGATACS CYVLNKPVF ITMDAAARNTAETFLPNS FMQE I IGGGEVARRAEPAAVFTRAAGEQFLLYRP STAAARGP PERLYMYIDPAFTSNARASGSGIAVVGRHRGSWi.VLGLEHFFLPALTGSSAAEIARCAVRC FAQVI^VHRRRLDGLFVAVEGNSSQDSAVAIALGVRRELDSLAASGAVPMPAETRFYHCRPPGSAVAYPF FIJLQKQKTAAFDHFIRLFNSGRVVASQDLASLTVRLQTΓJPVEYLFEQLQNLTESTAGPGGARAFSG RRG AADDLMVALVMAVFVGSLPPTDGAFCPLAPRPPAD >gi_5869808 MSLIMFGRTLGEESVRYFERLKRRRDERFGTI£SPTPCSTRQGSLGNATQIPFIαI!rFAIDVTRRHQAVIPG IGTLHNCCEYIP FSATARRAMFGAFLSSTGYNCTPNVVLKPWRYSVNANVSPELKKAVSSVQFYEYSPE EAAPHRNAYSGVMNTFRAFSLSDSFCQLSTFTQRFSYLVETSFESIEECGSHGKRA VDVPIYGRYKGTL ELI^KMILMHTTHFISSVI^^
IALVMATFRGIKVGYTAHIRKATEPVFEGIKSRT.EQWFGANYVDHVKGESITFSFTDGSYSTAVFASSHN TNGIRGQDFNLLFVDEANFIRPDAVQTI VGFLNQTNCKI IFVSSΩ^GKASTSFLYNIΛGSSDQLLNVVT WCTDHMPRVLAHSDVTACSCYVIiNKPWITMDGAMRRTA^
T TARERFILYRPSTVANCAILSSVLYVYVDPAFTSNTRASGTGVAIVGRYKSD IIFGLEHFFLRALTG TSSSEIGRCVTQCLGHIl aiiHP.π'F .rVHVSIEGNSSQDSAVAISIΛIAQQFAVLE GNV SSAPVLLFY HS IPPGCSVAYPFFLLQ QKTPAVDYFVKRFNSGNI IASQELVSLTVKLGVDPVEYLCKQLDNL EVIKG riMGCrLDTKTYTG GTTGTMSDDIiMVALIMSVYIGSSCIPDSVFMPIK >gi_5708110
^rLGKES EIVIKYRDAL KR ^rERGPDD DGQEMSDS FI ASICDR DSA DTM S AS FQFAID QRHQACIAPIGSFHNCCAISRAFSYMASEIIYIΪNLASYSTKYTDTDAALNDLQVSPKRQLFTGAAEDSIL PALRQK ANLNFARFAPSDSLIHDKAFDGIMNGYRGFVKSDEFSQLNl^IYRFHTIiLKKS RAKLEKTTSEQRDGTLELFQK ILMHATYFASS I CLGEGSTERSNRYLSTVFNTSLFSENI IQHFRQRTT VFLVPRRHGKT FLVPLISLLVSSFEGIRIGYTAHLRKATEPVFIEIFTRLYKWFGAKQVEQVKGETITF TFPJ^GNKSAIVFASSQITIWGIiRGQDFNFLFVDEANFIKPAAi TVMGFljNQTNCKLFFVSST^ IJLYNL GKTNSLLNVVTYIC^EHMPEIQKKTDVTTCSCYVLHKPVF^ GGRAGKYDSDRTLVPVRALDQFLIYRPΞTSSKPNISGLGKILTVYVDPAFTTNRSASGTGIALVTALRDS VTJ>1GAEHFY ^ALTGEAALEIAQCVYLCIAYCCLIHAGAFP^IRIAVEGNSSQDSAAAIAGNLTE]^LDS ' LRRRLGFSLTFAHSRQPGTAMAHPFYLIilffiQKSPJ^DLFVSLFNSGRFMASQELVSNTLVLSKDPCEYLV DQIRNITVTHGQGPDSFRTFSGKQGRVPDDMLVAAVMSTYLALEGSPTAGYHPIAPIGRRQRPA >gi_1813970 MIJRGDSAAKIQERYAELQKRKSHPTSCISTAFT-WATL^KRYQMMHPELGLAHSOSIEAFLPLMAFCGRH RDYNSPEESQREIiFHERLKSALDKLTFRPCSEEQRASYQKLDALTELYRDPQFQQINNFMTDFKK LDG GFSTAVEGDAKAIRI-EPFQKNLLIHVIFFIAVTKIPVLANRVLQYLIHAFQIDFLSQTSIDIFKQKATVF LVPRRHG-ΩWFIIPIISFLLKHMIGISIGYVAHQKHVSQFVLKEVEFRσUCTFARDYVVEN DNVISIDH RGAKSTALFASCYNTNSIRGQNFIftiTiTiVDEAHFIKKEAFNTILGFLAQNTTKIIFISSTNTTSDSTCFLT RLNNAPFDMT-N SYVCEEHLHSFTEKGDATACPCTRLHK^^ NKISQNTVLITDQSREEFDILRYSTLNTNAYDYFGKTLYVYLDPAFTTNRKASGTGVAAVGAYRHQFLIY
GLF^FFIJU3LSESSEVAIAEC^UMIISVLSLHPYLDELRIAVEGNTNQAAAVRIACI.IRQSVQSSTLIR VLFYH PDQNHIΞQPFYIMGRDKALAVEQFISRFNSGYIKASQELVSYTIKLSHDPIEYLLEQIQNLHRV TLAEGTTARYSAKRQNRISDDLIIAVIMATYLCIJDIHAIRFRV'S
>gi_27 629S MLRSCDIDAIQKAYQSIΪW HEQDVKISSTFPNSAIFCQKRFIILTPELGFTHAYCRHVKPLYLFCDRQR HV SKIAICDPLNCALSKLKFTAIIEKmΕVQYQKHLELQTSFYRNPMFLQIEKFIQDFQR ICGDFENT
NKKERI LEPFQKS ILIHI IFFISVTKLPTIΛNHVLDYLKYKFDIEFINESSVNILKQKASVFLVPRRHG KTWFMIPVICFLLIO^n^GISIGYVAHQKHVSHFV KDVEFKCRRFFPQKNITCQDNVITIEHETIKSTAL FASf^NTHSIRGQSFNLLIVDESHFIKKDAFSTILGFLPQSSTKIIFISSTNSGNHSTSFLTKLSNSPFE LT SYVCΕDHVHILlTORGNATTCACTUuHKPKFISINA^ LITEQGLIEFDLFRYSTISKQIIPFLGKELYΪYIDPAYTINRRASGTGVAAIGTYGDQYIIYGMEHYFLE SLLSNSDAS IAECASHMILAVLELHPFFTELKI I IEGNSNQSSAVKIACILKQTISVIRYKHITFFHTLD QSQIAQPFYIJLGRΞKRLAVEYFISNFNSGYIKASQELISFTIKITYDPIEYVIEQI NLHQININEHVTY NAKKQTCSDDLLISI IMAIYMCHEGKQTSFKEI >gi_32S496 LRT03ITHI NNYEAIIWKGERDCSTISTKYPNSAIFYKKRFIMLTPELGFAHSYNQQVKPLYTFCEKQ RHLKNRKPLTILPSLSHKLQEMKFLPASDKSFESQYTEFLESFKILYREPLFLQIDGFIKDFRKWIKGEF NDFGDTRKIQLEPFQKNILIHVIFFIATCKLPALANRVINYLTHVFDIEFVNE^
RHGKT FIVPIISFIiLKNIEGISIGYVAHQKHVSHFVMKEVEFKCRRMFPE TITCLDNVITIDHQNIKS . TALFASCTNTHQSIRGQSFNl-LIVDΞSHFIKKDAFSTILGFLPQASTKI^^ SPFH-fl-SWSYVOSDHAHMLNERGNATACSCYRLHKPKFISINAEV^
INDVLITEQGQTEFEFFRYSTINKNLIPFLGKDLYVYTjDPAYTG-mRASGTGIAAIGTYLDQYIVYGMEH YFlBSI- π?SSDTAIAECAAHMI SIIΛ HPFFTEVKIIIEG S QAS VKIACIIKENITANKSIQV FF HTPDQNQIAQPFYLLGKEKKLAVEFFISNEWSGNIKASQELISFTIKITYDPVEYAl^QIRNIHQISVNN YITYSAKKQACSDDLIIAIIMAIYVCSGNSSASFREI >gi_854039
MKIJrøSPFEMLS SYVCTDHAHMLϊ-TERGNATACSCYR^
AT rVINDVLITEQGQTEFEFFRYSTINK.TLIPFLGKDLYVYLDPAYTGNRRASGTGIAAIGTYLDQYIV YGMEHYFI^SLMTSSDTAIAE aAH ILSILDLHPFFTEVKIIIEGNSNQASAVKIACIIKENITANKSI QVTFFHTPDQNQIAQPFYIiGKE.OCr.AVEFFISNFNSGKIKASQELISFTIKITYDPVΕYATjEQIRNIHQ ISVNNYITYSAKKQACSDDLIIAIIMAIYVCSGNSSASFREI >gi_5733564
^π-RTCDI HI-*^n*IYEAIIW GE NCSTISTKYPNSAIF K RFIM TPELGF HSY QQVKP TFCEKQ RHLKNRKPLTILPSLTR LQEMKFLPASDKSFESQYTEFLESFKILYSEPLFLQIDGFIKDFRKWΪ GEF NDFGDTRKIQLEPFQKNILIHVIFFIAVTKLPALANRVINYLTHVFDIEFVNESTIΛTTL QKTNVFLVPR RHGKTWFIVPI ISFLLKNIEGIS IGYVAHQKHVSHFVMKEVEFKCRRMFPEKTITCLDNVITIDHQNI S TALFASCYITOISIRGQSFNLLIVDESHFIKKDAFSTILGFLPQASTKILFISSTNSGNHSTSF^ PFEKLSWSYVOTDHAH LlffiRGNATACSCΩtLHKPKFISINAEVKKT^^
NDVLITEQGQTEFEFFRYSTINKNLIPFLGKDLYVYLDPAYTGNRRASGTGIAAIGTYLDQYIVYGMEHY FLESLMTSSDTAIAECAAHMILS ILDLHPFFTEVKI I IEGNSNQASAVKIACI IKENITANKS IQVTFFH TPDQNQIAQPFΥLLGKEKKLAVEFFISNFNSGNIKASQELISFTIKITYDPVEYALEQIRNIHQISVNNY ITYSAKKQACSDDLIIAIIMAIYVCSGNSSASFREI >gi_4996048 KTjraSPFEMLSVVSYVCEDHAHMIiNERGNATACSCYRLHKPKFISINAEVKKTANLFLEGA^ ATaWINDVLITEQGQTEFEFFRYSTINKNLIPFLGKDLYVYIjDPAYTGNRRASGTGIAAIGTYLDQYIV YGMEHYFLESLMTSSDTAIAECAAHMILS ILDLHPFFTEVKI I IEGNSNQASAVKIACI IKENITANKS I QVTFFHTPDQNQIAQPFYLLGKEKKJ^VEFFISNFNSGNIKASQELISFTIKITYDPVEYALEQIRNIHQ ISVNNYITYSAKKQACSDDLIIAIIMAIYVCSGNSSASFREI >gi_1136808 MI*LSRHRERLAAN]^ETAKDAGERWELSAPTFTRHCPKTARMaHPFIGVV^ ' TPTSANPDVGTPRPSEDNVPAKPRLLESLSTYLQMRCVREDAHVSTADQLVEYQAGRKTHDSLHACSVYR ELQAFLVNLSSFLNGCYVPGVHVπ_EPFQQQLVMHTFFFLVSIKA.PQKTHQLFGLFKQYFGLFETPNSVLQ
TFKQKASVFLIPRRHGKTWIVVAIISMLLASVENINIGYVAHQKHVANSV7AEIIKTLCKWFPPKNLNIK KENGTIIYTRPGGRSSSIiMCATCFNKNSIRGQTFNLLYVDEANFIKKDALPAILGFMLQKDAKLIFISSV NSSDRSTSFLLNLRNAQEKm-N SYVOTDHREDFm
GAFDTEIMGEGAASSNATLYRVVGDAALTQFDMCRVDTTAQEVQKCLGKQLFVYIDPAYTNNTEASGTGV GA VTSTQTP S ILG^IEHFF RDL G 2 & EIASCaC MIK IAVTJHTTIE V AAVEG SSQDSGVA IATVIiMICPLPIHFLHYTDKSSALQWPIYMLGGEKSSAFETFtYALNSGTLSASQTVVSNTIKISFDPV TYLVEQVRAIKCVPIJIDGGQSYSAKQKHMSDDLLVAVVMAHFMATDDRHMYKPISPQ >gi_1718281
MLQKDAKLIFISSVNSSDRSTSFLLNLRNAQEK TLNWSYVCADiπffiDF IDES IKTTTNLF EGAFDTELMGEGAASSNATLYRVVGDAALTQFDMCRVDTTAQQVQKCLGKQLFVYID PAYTN ΕASGTGVGAVVTSTQTPTRSLILGMEHFFLRDLTGAAAYEIASCΑCTMIKAIAVLHPTIERVN AAVEGNSSQDSGVAIATVLNEICPLPIHFLHYTDKSSALQWPIYfTLGGEKSSAFETFIYALNSGTLSASQ TVVSNTIKISFDPVTYLVEQVRAIKC^PLRDGGQSYSAKQKHMSDDLLVAVVMAHFMATDDRHMYKPISP
Q >gi_2246S15
' mQKDAKLIFISSVNSSDRSTSFLINLRNAQEKMLNWSWCAD∞ IDESIKTTTNLF EGAFDTELMGEGAASSNATLYRVVGDAALTQFDMCRVDTTAQQVQKCLGKQLFVYID PAYTNNTEASGTGVGAVVTSTQTPTRSLILGMEHFFLRDLTGAAAYEIASCACTMIKAIAVLHPTIERVN AAVEGNSSQDSGVAIATVLNEICPLPIHF1HYTDKSSALQWPIYMLGGEKSSAFETFIYALNSGTLSASQ TVVSNTIKISFDPVTYLVEQVRAIKCVPLRDGGQSYSAKQKHMSDDLLVAVVMAHFMATDDRHMYKPISP
Q
>gi_2246552
AmLSRHRERLAAI*rLQETAKDAGERWELSAPTFTRΞCPKTARMAHPFIGVVHRINSYSSVLETYCTRHHPA P SA PDVGTP PSED PAKPRL ESLSTYLQ I C REDAHVSTADQ VEYQA RKTHDS HACS R E QAF ]mSSF ^GCY PGVH LEPFQQQL riϊTFFFL SI PQKTHQ FGLFKQYFG FE PNSVLQ TFKQKASVFLIPRRHGKTWIVVAI ISMLIΛMVENINIGYVAHQKIOTANSVFAEI IKTLC^WFPPKNLNIK KENGTI IYTRPGGRSSSLM<^TCFNKNSIRGQTFNLLYVDEANFIKKDALPAILGFra_QKDAKLIFISSV NSSDRSTSFLI∞ RNAQEKMLNVVSYVCADHREDFHLQDALVSCPCYRLHIPTYITIDESIKTTTNL GAFOTELMGEGAASSNATLYRVVGDAALTQFDMraVDTTAQQVQKCLGKQLFVYIDPAYTNNTEASGTGV GAVVTSTQTPTRSLILGMEHFFLRDLTGAAAYEIASCACTMIKAIAVLHPTIERVNAAVEGNSSQDSGVA lATVIJTEICPLPI-IFIJHYTDKSSALQWPIY GGEKSSAFETFIYALNSGTLSASQTVVSNTIKISFDPV TYLVEQVRAIKCWPLRDGGQSYSAKQKIMSDDLLVAVVMAHFMATDDRHMYKPISPQ >gi_4494933 IttQKDAKLIFISSSNSSDKSTSFLI1NL:KDAHE--3'ffiNVVNYVCPD IDETVRSTTNLFLΞGAFSTELMGDAATSAQSMHKIVSDSSLSQLDLCRVKSTSQDIQGAMKPCLHVYIDP AYTNNTDASGTGIGAVIAVNHKVIKCILLGVEHFFIj LTGTAAYQIASCAAALIRAIVTLHPQITHVNV AVEGNSSQDAGVAIATVIiNEICSVPLSFIHHVDKITmiRSPIYMLGPEKAKAFΕ.SFIYALNSGTFSASQT VVSHTIKLSFDPVAYLIDQIKAIRCIPLKDGGHTYCAKQKTMSDDVLVAAVMAHYMATNDKFVFKSLE >gi_7330018 rø.QKDAKLIFISSSNSSD:KSTSFLIjNLKDAHEKMI NVVN^
IDETVRSTTNLFLΞGAFSTELMGDAATSAQSMHKIVSDSSLSQLDLCRVESTSQDIQGAMKPCLHVYIDP AYTNNTDASGTGIGAVIAVNHKVIKCILLGVEHFFLRDLTGTAAYQIASCAAALIRAIVTIJIPQITHVNV AVEGNSSQDAGVAIATV]^NEICSVPLSFLHH2y3KNTLIRSPIYPrLGPEKAKAFESFIYALNSGTFSASQT VVSHTIKLSFDPVAYLIDQIKAIRCIPLKDGGHTYα^QKTMSDDVT.VAAVMAHYMATNDK^ . >gi_4019255 IjIiKAKKAIMEJffiTEASSTQSETEWTVDTPTMITNIKKSER AYSKIGVIPSINLYSASLTSFCRLYRP -
LALKQPLPQTGTLRLLPSEKPYISQKLSlSYVKSLTLKIOTrHDIEA
FIINLSSFLNGCYVKKSTΞIEPFQLQLILHTFYFLISIKSPESTNKLFTJIFKEYFGLGEMDSA LQNFKQ
KAjSIFLIPRRHGKT I AIISty iTSVE.ttHVGYVAHQKHVANSVFra
TL IYKI PGKKPSTLMCASCFNKNS IRGQTFNLLYIDEANFIKKDSLPAILGFMLQKDAKLIFISSVNSGD ICATSFLFIIΠJKNASEKMΓJNIV.WICPDIIKDDFSLQDSLISCPCΎKLYIPTYITIDETIKOT'T^
TEIMGDISVMSK1WIHKVIGETAIMQFDLCRIDTTKPEITQCLNSIMYLYIDPAYTNNSEASGTGIGAII ALKNNSSKCIIVGIEHYFLCTLTGTATYQIASCACSLIRAALVLYPHIQAVHVAVEGNSSQDSAVAISTF TJFFICSPVKVNFMHYKDKTTAMQWPIYMLGSEKSQAFESFIYAINSGTISASQSIISNTIKLTFDPISYLI EQIRAIRCΥPLRDGSHTYC^KKRTVSDDVLVAVVMAHFFSTSNKHIFKQLNSI >gi_4019257 ^QKDAKLIFISS NSGD SFL LK ASEKM]^IVNYICPDHKDDFSLQDSLISCPCYKLYIPTYI IDETIKNTTNLFI*DGAFTTELMGDISVMSKNNIH3OTIGETAIMQFDL< IDTTKPE
PAYTIrøSEASGTGIGAIIALKNNSSKCIIVGIEHYFLKDLTGTATYQIASCACSLIRAALVLYPHIQAVH VAVEGNSSQDSAVAISTFLNECSPVKVNFMHYKDKTTAMQWPIYMLGSEKSQAFESFIYAINSGTISASQ SI ISNTIKLTFDPISYLIEQIRAIRCYPLRDGSHTYCAKKRTVSDD VLVAWMAHFFSTSNKHIFKQLNS I
>gi_60355
MLLLKAKKAIIENLSEVSSTQAETDWDMSTPTIITNTSKSERTAYSKIGVIPSVNLYSSTLTSFCKLYHP LTLNQTQPQTGTLRIiLPHEKPLILQDLSlTΪVKLLTSQNVCHDTEANTEYNAAVQTQKTSMECPTYLELRQ FVINLSSFLiNGCYVKRSTHIEPFQLQLILHTFYFLISIKSPESTNRLFDIFKEYFGLREirDPDMLQIFKQ KASIFLIPRRHGKTWIVVAIISMLLTSVENIHVGYVAHQKHVANSVFTEIINTLQKWFPSRYIDIKKENG TIIYKSPD K S IMCATCF-^NSIRGQTFNLL IDEANFIKKDSLP ILGFrQKD K IFISSV SGD RATSFLFNLKNASEKMLNIVNYICPDHKDDFSLQDSLISCPCYKLYIPTYITIDETIK-W^ TELMGDMSGISKSNMHKVISEMAITQFDLOlADTTKPEITQCLNSTMYIYIDPAYTNlSrSEASGTGIGAIL TFKNNSSKCIIVGMEHYFLKDLTGTATYQIASCACSLIRASLVLYPHIQCVHVAVEGNSSQDSAVAISTL INECSPIKVYFIHYKDKTTTMQWPIYMLGAEKSIAFESFIYAINSGTISASQSIISNTIKLSFDPISYLI EQIRSIRCYPLRDGSHTYCAKKRTVSDDVLVAVVMAYFFATSNKHIFKPIJNST >gi_S95201
MLQKDAKIIFISSVNSSDQTTSFLYNLKNAKEKMI^AΠSΓYVCPQHREDFSLQESVVSCPCYRLHIPTYIA IDENIKDTTNLFMEGAFTTEIJMGDGAAATTQTNMHKVVGEPALVQFTJLCRVDTGSPEAQRGLNPTLFLYV
DPAYTNNTEASGTGMGAVVSMKNSDRCVVVGVEHFFLKELTGASSLQIASCAAALIRSLATLHPFVREAH AIEGNSSQDSAVAIATLLHERSPLPVKFLHHADKATGVQWPMYILGAEKARAFETFIYALNSNTLSCGQ AIVSNTIKLSFDPVAYLIEQIRAIKCTPLKDGTVSY(^AK3IKGGSDDTLVAVVMAHYFATSDFJIVFKNHMK
QI >gi_4928934
MIJ.SSFRNHLQKISΓ^KYSVQAQNIDWPVETPVLISKDSKTNRLAHPLIGVISRINLYSPTLKYYCDEYST
TKQPKFTPDIGYVRDLK-O-RDQYFLPKLQHHLSTL EAYHHVDRQA
FLINLSCFLINGCYVSKSTCIELFQKQLILHTFYFLISIKTPEETISRKMFTFFKHYVGLFDIDDNMLQCFKQ
KSTVFLIPRRHGKTWI A ISVLTjASVE rølGYVAHQKHVANAVFTEI ITTLYQWFPSKNIEIKKENG I IYTKPGRKPSTLMCATCFNKNSIRGQTENILYVDEANFIKKEALPAILGFMLQKDAKIIFISSVNSAD KSTSFLFNLRNAKEKMLNVVNYVCPEHKEDFNLQSTLTSCPCYRiaiPTYITIDESIKNTTNLF^^ TE]^GDISTFPTSSMFICVVEEQALFHFDICRVDTTQIDTVKIIDNVLYVYVDPAYTSNSEASGTGIGAVV PLKTKVKTIILGIEHFYLKfΛTGTASQQIAYCVTSMIKAILTLHPHINHVNVAVEGNSSQDSAVAISTFI NΞYCPVPVFFAHO^TERSSVFQWPIYILGSEKSQAFEKFICAINTGTLSASQTIVSNTIKISFDPyAYLME QIRAIR- LPLKDGSYTYCAKQKTMSDDTLVAVT ANYMAISEKHTFKELCKT >gi_1632798 *
rLYASQRGR EN ALQQDSTTQGCLGAETPSI^r ^G S'DR AHP VG IHAS LYCPr RAYC HY GPRPVFVASDESLPMFGASPAIJHTPVQVQMCLLPEIIRDTLQRLLPPPNLEDSEALTEFKTSVSSARAILE
DPNFLEΓREFI^SLASFLSGQYKHKPARLEAFQKQVVLHSFYFLISIKSLEITDTMFDIFQSAFGLEEMT LEKLHIFKQKASVFLIPRRHGKTWI WAI ISLILSNLSNVQIGYVAHQKHVASAVFTEI IDTLTKSFDSK RVEVNKETSTITFRHSGKISSTVMCATCFNKNS IRGQTFHLLFVDEANFIKKEALPAILGFMLQKDAKI I FISSVNSADQATSFLYKLKEAQERIIIJSTVVSYVCQEHRQDFDMQDSMVSCPCFRLHIPSYITMDSNIRATT T^FΩGAFSTELMGDTSSLSQGSLSRTVRDDAINQIJELCKVDTLNPRVAGRLASSLYVYVΫPAYTNNTSA SGTGIAAVTHDRADPNRVIVLGLEHFFLKDLTGDAALQIATCVVALVSSIVTLHPHLEEVKVAVEGNSSQ DSAVAIAS I IGESCPLPCAFVHTKDKTSSLQWPMYIJLTLFFIKSKAFΕRLIYAVNTASLSASQVTVSNTIQL SFDPVLYLISQIRAIKPIPIJRDGTYTYTGKQRITTSDDVLVALVMAHFIATTQKHTFKKVH
>gi_2337991
MFYVKVMPALQKACEELQNQWSAKSGKWPVPETPLVAVETRRSERWPHPYLGLLPGVAAYSSTLEDY'CΞL YNPYIDALTRCDLGQTHRRVATQPVLSDQLCQQLKKLFSCPPJTTSVKAKLEFEAAVRTHQALDNSQVFLE - r l-N SAFIJS RYSD SSHIE FQ Q IMH FFLVSIKAPE CEKF■ NI KL F IDTroQA,π.DI FKQKASVFLIPRRHGKTWIVVAIISILIiASVQDLRIGYVAHQKHVANAVFTEVINTLHTFFPGKYMDVKK ENGTIIFGLPNKKPSTLLCATCFNKNSIRGQTFQLLFVDEANFIKKDALPTILGFMLQKDAKIIFISSSN SSDQSTSFLYNLKGASERML-TVVSYVCSNHKEDFSMQDGLISCPCΥSLHVPSYISIDEQIKTTTNLFLDG VFDTELMGDSSCGTLSTFQIISESALSQFELCRIDTASPQVQAHLNSTVHMYIDPAFTNNLDASGTGISV IGRLGAKTKVILGCEHFFLQKLTGTAALQIASCATSLLRSWIIHPMIKCAQITIEGNSSQDSAVAIANF IDE^PIPVTFYHQSDIrKGVLCPLYLLGQEKAVAFΕSFIYAra!π.GLCKASQLIVSHTIKLSFDPVTYLL EQVRAIKCQSLRDGSHTΪHAKQKNLSDDLLVSVVMSLYLSSANTLPFKPLHIERFF >gi_2317977 l QKDA IFFISS SGE TTSFLYNL-αD E M SY CSEH rEDFN QSAI ACPCY VPEFIT INDNIKCTTNIΛLEGSFATEIiMG-MQSHTEVSGNSMIHESSLTR^ AYGNNVHASGTGIVAMSHCKHTKKCI ILGLEHFFLNNLTGTAAHNIASCATATi EGILFQHPWIQEIRCI IEGNSNQDSAVAIATFISHNIKLPTLFASYRDKTGMQWPIYJTLSGDKTLAFQNFISSLNQGLLCASQTVV SNTVIιLSSDPISYLIEQIKNTKCIYHKNKTITFQSKTHTMSDDVLIACTOTCYVMTTNKISYISFSIK
FIASKKSYFEAVYRSTVSSHSEEFWKSDDPVYFTQYKKQ- NRLPNAYLGTLHSASKYSENFRHYVATFS NSPLDFPQSVF-TORNPCEYSVPYIXlSALQCSAKTLVGCSVSTTEPJ-rEYEVCKEATRCFKDAMSHKVLKVF LSN S F KGHY S QAFLE FQ QLILHSFMF ASI CPETT KLFDEFKFL Dt YFDN D LTFLQK SPAFLIPRRHGKTWIVTAIISMLLTSVDDI1HIGYVAHQKHVSLAVFLEISNIIJ.AWFPRKNIDIKKENGV ILYSHPGKKSSTLMC^TCFNKNSIRGQTFNLLFVDEANFIKKEALPAILGFMLQKDAKIFFISSVNSGEK TTSFLYNLi ANEKMVNVVSYVCSEHMEDFNKQSAITACPσπU.YVPEFITINDNIKCTTNLLLEGSFAT ELMGN QSHTEVSGNS IHESSLTRLDFYRCDTAGQGAPTTENTLFVYIDPAYGNNVHASGTGIVAMSHC KHTKKCI ILGLEHFFLNNLTGTAAHNIASCATALLEGILFQHPWIQEIRCI IEGNSNQDSAVAIATFISH NIKLPTLFASYRDKTGMQWPIYMLSGDKTLAFQNFISSLNQGLLCASQTVVSNTVLLSSDPISYLIEQIK NTKCIYIQCNKTITFQSKTHTMSDDVLIACVMTCYVMTTNKISYISFSIK
1 10 20 30 40 50 GO 70 80 90 100 110 120 130
1 + + + + 4 + + _+ .—„_+ _,—+ + + 1 gi_101B0719 HFGGLLGEET RHFERLH TKIIDR GnSHRHERSIRDG---^DrlVDRPF— LHFfllPVPRRHQTVMPHIGILHHCCDSLGIYSniTTRM YSSIflCSEFDELRRD S¥PRCYP gi_7G73189 HFGGLLGEETKRHFERLrlKTKHDRLGflSHRHERSIRDG DMVDRPF— LHFBIPVPRRHQTVrlPfllGILHHCCDSLGIYSfllTTRMLYSSIflCSEFDELRRD SYPRCYP gi_5G89285 rfFGGRLGESRKKHFERLLRDRNERLGRSRKNΕCLHRGG SLVDRPF~-LNFHISVPRRHQTWPHVGTLH0CCDGTGT.YSRIRTRLLYRGIVSSEFGEVRRE SLSNGHJ gi_1869837 HFGQQLflSDVQQYLERLEKQRQQ VGV-DEflSRGLTLG GDRLRVPF— DFfiTRTPKRHQTVVPGVGTLHDCCEHSPLFSRVRRRLLFHSLVPRQLRGRDFG GD H gi_59501 HFGQQLRSDVQQY ERLEKQRQL VGR-DERSRGLTMG GDπLRVPF~LDFRTRTP RHQTVVPGVGTLHDCCEHSPLFΞflVRRRLLFHSLVPflQLKGRDFG GD H
CO c gi_2G05992 ttFG flLSRETIQYFETLRKEVQ,SRSGR-KHRRRERQTG--GEODVKTRF-- -LHFfllPTPQRHQTVVPGVGTLHDCCETRQIFRSVflRRLLFRSLSKMRGGES ER LD P gi_330792
DO HFGRVLGRETVQYFERLRREVQRRRGR-KHRRRERQHG--GEDDR TRF~LHFRIPTPQRHQTVVPGVGTLHDCCETRQIFRΞVRRRLLFRSLS HQSGERRER LD P CO gi_971317 HFGGflVGEQSflRYFQRLLRERQRRflRE-RGflRPDGGGGRRGEDDflRVPF~LDFflVRHP RHQTVVPGVGTLHGYCELflPLFRRTflSRLLLTSrlflRflERG Lll T g±_58G9808 HSLIMFGRtLGEESVRYFERL RRRDERFGTLESPTPCSTRQGSLGHflTQIPF~LHFRIDVTRRHQRVIPGIGTLHHCCEYIPLFSRTRRRR FGRFLSSTGYHCTPH VVLKPMR gi_5708110 rJLGKESVEIVKRYRDRLRKRTHERGPDDVDGQEMSDSHFITTRSICDRHDSRRDTHHSPflSRFQFfllDVPQRHQRCIflPIGSFHHCCRISRRFSYHRSEIIYEHLflSYSTKYTDTDRflLHDLQVSPKRQL gi_1813970 MLRGDSflflKIQERYRELQKRKSHPTSCIST-flFTNVfiTLCRKRYQHrlHPELGLflHSCHEnFLPLrlRFCGRHRDYHSPEESQREL m gi_2746296 HLRSCDIDRIQKRYQSIIHKHEQDVK-ISS-TFPHSRIFCQKRFIILTPELGFTHRYCRHVKPLYLFCDRQRHV KSK— I gi_325496 HLRTCDITHIKHNYEflllHKGERDCSTIST-KYPHSfllFYKKRFIHLTPELGFflHSYHQqVKPLYTFCEKQRHL KNR PL '
CO gi_5733564 HLRTCDITHIKHHYERIIMKGERKCSTIST-KYPHSRIFYKKRFIHLTPELGFRHSYHQQV PLYTFCEKQRHL HRKPL
Im gi_113S808 HLLSRHRERLRRHLEETR D— RGE^RHEL-SfiPTFTRHCPKTRRh'flHPFIGVVHRIHSYSSVLETYCTRHHP RTPTSflHPDV GTPRPSE m gi_224G552 MLLSRHRERLRRHLQETRKD-- RGETRHEL-SRPTFTRHCP TRRHRHPFIGVVHRIHSYSSVLETYCTRHHP RTPTSflHPDV GTPRPSE
H 'gi_4019255 MLLL flKKflLHEHLTEflSSTT-QSETEHTV-DTPTMITHI KSERMHYSKIGVIPSIHLYSflSLTSFCRLYRP-r-- LflL QPLPQT GTLRLLP gi_G0355 MLLLKRKKRHEHLSEVSStr-QRETDHDrl-STPTHTHTSKSERTRYS IGVIPSVHLYSSTLTSFCKLYHP LTLHQTQPQT GTLRLLP c gi_4928934 HLLSSFRHHLQ HYEKYSVQ—RQHIDHPV-ETPVLIS DSKTHRLRHPLIGVISRIHLYSPTLKYYCDEYST TKQPKFTPDI GYVRDLK r gi_2337991 HFYVKVHPRLQ flCEELQHQHSRKSGKHPVPETPLVflVETRRSER PHPYLGLLPGVRRYSSTLEDYCHLYHP YIDRLTRCDL GQTHRRV m gi_1632798 ttLYRSQRGRLTEHLRHRLQQDSTTQGCLGR-ETPSIMYTGRKSDRHRHPLVGTIHRSHLYCPHLRRYCRHYGPRPVFVRSDESLPrlF GRSPRLH ro gi_GG25593 MFTRS KSYFEflVYRSTVSS— HSEEFMKSDDPVYFTQY KQCHRLPHRYLGTLHSRSKYSEHFRHYVRTFSH SPLDFPQSVF HERHPCE gi_1718281 gi_2246515
CO gi_4*^9<1933 gi_\7330018
> c gi_4019257 gi_G95201 gi_2317977 gi_854039 gi_499G048
Consensus
TABLE 2
131 140 150 1G0 170 180 190 200 210 220 230 240 250 2G0
I + + + + '. + + + + + + + + _* 1 gi_10180719 RITHRQRFLSPHrlrlRVRHSIIFQEYDEHECRflHRHRYYSTMHSFISHRTSDRFKQLTVFISRFSKLLIRSFRDVHKLDDHTVK— RRRIDRPSYD LHGTLELFQKMILHHRTYFVTSVLLGD-HRERR gi_7G73189 RITNflQRFLSPHMMRVfiHSIIFQEYDEMECRflHRHflYYSTttHSFTSMRTSDflFKQLTVraSRFS LLrflSFRDVH LDDHTVK-n RRRIDRPSYDKLHGTLELFQKHIFDRCHLFCHFCFTWR-SRRflS gi_5G89285 SKRHRERLLRPTLTRVRHSITFHEYDDflQCRfiHRHRYYSTrlHTFGSMRTSDRFQQLflSFIDRFS LLRRSFKDVHILDRHHRP-i-kRRRITRPSYDKPHGTLELFQ HILMHflTYFLTSVLLED-HflERfl gi_18G9837 — TflKLEFLflPELVRRVRRLRFRECflPEDflVPQRHflYYSVLHTFQRLHRSERFRQLVHFVRDFRQLLKTSFRRSSLRETTGPP-KKRfl VDVRTHGqTYGTLELFQ MILMHflTYFLRflVLLGD-HflEQV gi_59501
CO — TflKLEFLflPELVRflVflRLRFKECflPflDVVPQRHHYYSVLHTFQflLHRSERFRQLVHFVRDFRQLLKTSFRRSSLTETTGPP-K RR VDVRTHGRTYGTLELFQKHILHHRTYFLRRVLLGD-HREQV c gi_2605992 — SSVERYVDPKV QRLKTISFVEYHDRERRSCRHRYYSIHHTFDSLRSSDflFHQVflHFVflRFSRLVDTSFHGflDLDGDGQQT-SKRIKVDVPTYG QRGTLELFQ HILMHflTYFIRRVILGD-HflDRI
DO gi_330792 — -RSVERYVDPKVRQRLKTISFVEYSDDERRSCRHRYYSIrlHTFDflLRSSDflFHQVflSFVRRFSRLVDTSFHGflDLDGDGQQfl-SKRflRVDVPTYG QRGTLELFQKrllLMHflTYFIflflVILGD-HflDRI CO gi_971317 — GTGEflHVSRELflGVLSRLRFRflHPPflEflRflHCHRYHSVrRRLESHRflSGRFRQVfiRFVflRFSRLVGTSFSHLGGGDDRDPPRRKRRRVEPPS-GQTRGRLELFQKrllLrlPflTYFVRRTLLGE-HRERI gi_58G9808 --YSVHflHVSPELKKflVSSVQFYEYSPEEflRPHRHflYSGVi HTFRflFSLSDSFCqLSTFTQRFSYLVETSFESIEECGSHG KRRKVDVPIYGRYKGTLELFQKrllLrlHTTHFISSVLLGD-HflDRV gi_5708110 FTGRREDSILPRLRQKLRHLHFRRFRPSDSLIHDKRFDGIHHGYRGFV SDEFSQLHHFIYRFHTLLKKSFSGQflSHDY RfiKLE TTSEQRDGTLELFQKrffLMIRTYFflSSICLGEGSTERS gi L813970 LFHERL SRLDKLTFRPCSEEQ-R RSYQK-LDRLTELYRDPQFQQΪNHFHTDFkKHLDGGFSTRVEGD RKRIRLEPFQKHLLIHVIFFIRVT IPV-LflHRV m gi_274G29G RICOPLHCRLSKLKFTRIIEKHTE VQYQKHLELQTSFYRHPMFLQIEKFIQDFQRHICGDFEHT--H KERIKLEPFQ SILIHIIFFISVT LPT-LRHHV gi_32549G TILPSLSHKLQEM FLPRSDKSFE SQYTEFLESFKILYREPLFLQIDGFI DFRKHI GEFNDF--GD TRKIQLEPFQKHILIHYIFFIflvT LPR-LflNRV
CO
I gi_57335G4 -TILPSLTRKLQErlKFLPRSDKSFE SQYTEFLESF iLYREPLFLqiDGFIKDFRKHI GEFHDF~GD- TR iqLEPFqKHILIHVIFFIHVTKLPfl-LflHRV m gi_1136808 DHVPRKPRLLESLSTYLQMRCVREDRHVSTRDqLVEYqRGR THDSLHflCSVYRELqRFLVHLSSFLHGCYVP- -GVHMLEPFqQqLVMHTFFFLVSIkflPq-KTHqL t •- m gi_224G552 DHVPRKPRLLESLSTYLQr/RCVREDRHVSTRDQLVEYQflflR THDSLHflCSVYRELQRFLVHLSSFLHGCYVP- -GVH LEPFqqqLVHHTFFFLVSI RPq- THqL gi_4019255 SE KPYISqKLSNYVKSLTL HV HDIERE~REYYRSVqTEKTFMECPI,YLELRqFIIHLSSFLHGCYVK- -KSTHIEPFqLQLILHTFYFLISIkSPE-STHkL
73 gi_G0355 HE KPLILPDLSHYV LLTSqHVCHDTERH— TEYHRRVqτqKTSHECPTYLELRQFVIHLSSFLHGCYVK- -RSTHIEPFQLqLILHTFYFLISIKSPE-STHRL c gi_4928934 KH DQYFLP LQHHLSTLCEflYHHVDRqflq--VEFHRSILTLKRFHflHGVLHELKQFLIHLSCFLHGCYVS- -KSTCIELFQKqLILHTFYFLISIKTPE-ETNKrf gi_2337991 RT qPVLSDqLGqqLKKLFSCPRHTSVKRK— LEFERRVRTHqRLDHSqVFLEL TFVLNLSRFLHKRYSD- m -RSSHIELFq qLIi HTFFFLVSI HPE-LCEKF gi_lG32798 TPVqVqHCLLPELRDTLqRLLPPPHLEDSERL— TEFktSVSSRRRILEDPHFLErlREFVTSLRSFLSGQYKH- -KPfiRLERFQKQWLHSFYFLISIKSLE-ITDTrf r gi_6G25593 YSV-- TPYLDSRLqCSR TLVGCSVSTTER HEYEVC ERTRCF DRHSHKVL VFLSHLSMFLKGHYKS- - qRFLEPFqKQLILHSFHFVflSIKCPE-TTT L gi_1718281
CO gi._224G515 gi_4494933 gl_7330018
> c gi.,4019257 gi_G95201 gi_2317977 gi_854039 gi_4996048
TABLE 2 CONTINUED
651 660 670 680 690 700 710 720 730 740 750 760 770 780
| + + + + + + + + + + + + |
gi_10180719 flP--TPHH*/SFYHSKSHGTDVEYPYFLLqRQKTTflFDFFIflqFHSGRVLflSQDLVSTTySLTTDPVEYLTKQLTHISEVVTG PTCTRTFSGKKGG-- -NDDtWflLTilflVYISfiH-IPDrlRFflPIRV
gi_7673189 RP--TPHHVSFYHSKSHGTDVEYPYFLLqRQKTTRFDFFinqFHSGRVLRSqDLVSTTVSLTTDPVEYLTKQLTNISEVVTG PTCTRTFSGK I-G---HDDtVVHLTMflVYrSflH-IPDMflFflPIRV
gi_5689285 RP--flPHDVCFYHSkPRGSNVEYPFFLLqRQ ;TRfiFDFFIHRFHSGRVLflSQDLVSTTISLSTDPVEYLTKQLTHLSEVVTG RTGTRTFSGKKGG-- ^YDDTVVRLVHflVYISflH-fiSDRTFflPIRG
gi_1869837 RH-GPGPELLFYHCEPPGGRVLYPFFLLHKQkTPflFEYFIKKFHSGGVHHSqELVSVTVRLQTDPVEYLSEQLHNLIETVSP HTDVRHYSGKRHGR-^RDDLHVfiVIHRIYLRflPTGIPPflFFP-ITR
gi_59501 RDRGSGPELLFYHCEPPGSflVLYPFFLLNKQKTPRFEHFIKKFHSGGVrlRSqEiVSflTVRLQTDPVEYLLEqLHHLTETVSP HTDVRTYSGKRHGR--SDULHVflVIHflIYLRflqflGPPHTFRPITR
gi_2605992 VH--«PGTVLFYHCTPPGSεVRYPFFLLqKQKTGflFDHFIKRFHSGLVLRSqELlSHTVRLQTDPVEYLLTqrlKNLTEVITG TjSETRVFTGKRHGfl~-SDDHLVRLVrlflVY«flSLPPTTHRFSSLST
gi_330792 VP--HPGRVLTYHCTPHGSSVflYPFFLLqKQKTGflFDHFIKflFNSGSVLfiSQELYSNTVRLOT^
gi_971317 VP~HPflETRFYHCRPPGSRVRYPFFLLqKQKTRflFDHFIRLFHSGRVVflSqDLflSLTVRLqTDPVEYLFEqLqHLTESTRG PGGRRRFSGKRRGH-~flQDLrlVflLVHRVFVGSLPPTDGflFCPLHP
gi_5889808 LS-- SflPVLLFYHSIPPGCSVRYPFFLLQKQKTPflVDYFVKRFHSGHIIflS8ELVSLTVKLGVDPVEYLCKQLDNLTEVIKGGHGHLDTKTYTGKGTTGTMS0DLHVRLIHSVYIGSSCIPDSVFrtPIK
gi_5708110 LGFSLTFflllΞRQPGTflHRHPFYLLHKQKSRRFDLFVSLFH£GRFttRSqE VSHTLVLSKDPCEYLVDqiRHrr--VTHGqGPDSFRTFSGKQGRV-^PDDHLVRRVrlSTYLRLEGSPTflGYHPIRP
gi_1813970 RVLFYHTPDQHHIEQ-PFYLHGRDKRLfiVEqFlSRFHSGYIKflSqELVSYTIKLSHDP-EYLLEQIQHLHRVTLfl EGTTHRYSRKRQHR-ISDDLIIRVIHRTYLCDDIHRIRFRVS
gi_2746296 HITFFHTLDqsqifiα-PFYLLGREKRLflVEYFlSHFHSBYIKflSqELlSFπKITYDPIEYVIEqiKHLHQIHIH EHVT~YHRKKq-T-CSDDLLISIIHflIYHCHEGKqTSFKEI
gi_325438 QVTFFHTPDQHqifiQ-PFYLLGKEKKLflVEFFISHFHSGHIKBSQELlSFTIKITYDPVEYRLEqiRNIHQISVH NYrr--YSHKKn-R-CSDϋLIIflIIHflIYVCSGHSSRSFREr
gi_5733564 Ql/TFFHTPDqHqiflQ-PFYLLGKEKKLHVEFFISHFHSGHI RSQELlSFTIKITYDPVEYflLEqiRHIHQISVH NYIT~YSRKKQ-H-CSDDLIIRIIr(RIYVCSGHSSRSFREI
gi_1136808 PIHFLHYTDKSSRLqHPIY»lLGGEKSSflFETFIYRLHSGTLSflSqTVVSNTIKIBFDPVTYLVEQVRRIKCVPLR--~DGGqS-YSflKQK-H-HSDDLLVRVVHflHFHRTDDRHHYKPISPq
gi_2246552 PIHFLHYTPKSSflLQHPIYHLGGEKSSRFETFI¥flLHSGTLSfiSqTVVSHTTJ ISFDPVTYLVEqVRRIKCVPLR-----DGGqS-YSRKqK-H-r(SDDLLVflVVrfflllFHRTDDRHHYKPISPQ
gi_4019255 KVHFHHYKDKTTRHQμPIYnLGSEKSqflFESFIYRIHSGTISflSqSIISHTIKLTFDPISYLIEqiRflIRCYPLR---DGSHT-YCflKKR-T-VSDDVLVRVVHRHFFSTSHKHIFKqLHSI
gi_60355 KVYFIHYKDKTTTMQMPIYHLGREKSIflFESFIYRIHSGTISnSqsnSHTIKLSFDPISYLIEqiRSIRCYPLR---DGSHT-YCflKKR-T-VSDDVLVflVVHnYFFRTSHKHIFKPLHST
gi_4928934 PVFFflHCHERSSVFQ PIYILGSEKSqnFEKFICRUITGTLSRSqTlVSHTIKISFDPVRYLMEQIRRIRCLPLK-- -DGSYT-YCRKqK-T-HSDDTLVflVVHRNYHRISEKHTFKELCKT
gi_2337931 PVTFYHqSDKTKGVLCPLYLLGqEKflVflFESraYRMHLGLCKflSQUVSHTαLSFDPVTYLLEDVRflIKCQSLR---DGSHT--YHflKQK-H-LSDDLLVSVVHSLYLSSfl TLPFKPLHIER
gi_1632798 PCRFVHTKDKTSSLQHPrlYLLTHFKSKflFERLIYflVHTRSLSflSqVTVSHTiqLSFDPVLYLIsqiRflIKPIPLR---DGTYT-YTGKQR-N-LSDDVLVRLVHRHFLRTTqKHTFKKVH
gi_6625593 PTLFRSYRDKT-GHqHPIYMLSGDKTLRFqHFISSLHQGLLCflSqTVVSHTVLLSSDPISYLIEqiKHTKCIYHK HKTIT-FqSKTH-T-«SDDVLIfiCVMTCYVHTTHKISYISFSIK
gi_1718281 PIHFLHYTDKSSRLQHPIYHLGGEKSSfiFETFIYRLHSGTLSRSqTVVSHTIKISFDPVTYLVEqVRHIKCVPLR---- DGGqSYSRKQKH~HSDDLLVflWHflHFHflTDORHHYKPISPq
gi_2246515 PIHFLHYTDKSSRLQμPIYHLGGEKSSfiFETFIYRLHSGTLSftSqTVVSMTIKISFDPVTYLVEqVRflIKCVPLR-^--DGGqSYSflKQKH--HSDrjLLVflWHflHFHnTDDRHrlYKPISPQ
gi_4494933 PLSFLHHVDKHTLIRSPIYttLGPEKRKRFΕSFIYRLHSGTFSRSqTVVSHTIKLSFDPVRY DqiKRIRCrPLK-^ -- DGGHTYCRKQKT--rlSDD\>LVflflVHflHYrlflTH„KFVFKSLE
gi_7330018 PLSFLHHHDKHT RSPIYMLGPEKRKRFESFIYflLHSGTFSflSqTVVSHTIKLSFDPVRYLIDQIKfllRCIPLK-*^ -- DGGHTYCflKQKT--HSDnVLVflfiVHRHYHHTl*IDKFVFKSLE
gi_4019257 KVHFrlHYKDKTTflHQHPIYrlLGSEKSqflFESFIYfllHSGTISHSqSlISNTIKLTFDPISYLIEQIRfllRCYPLR-1 -- DGSHTYCRKKRT~VSDDVLVRWHHHFFSTSHKHIFKqLHSI
gi_635201 PVKFLHHRDKRTGYQHPHYILGfiEKRRflFETFIYfiLHSHTLSCGQfllVSMTIKLSFDPVflYUEQIRRIKCYPLK- -- DGTVSYCRKHKG--GSDDTLVRVVHRHYFRTSDRHVFKHHHKQI
gi_2317977 PTLFflSYRDK-TGHQHPIYrlLSGD TLfiFQHFISSLHQGLLCRSqTVVSHTVLLSSDPISYLIEQIKHTKCIYHK HKTITFqSKTHT--HSDDVLIRCVHTCYVMTTH ISYISFSIK
gi_854039 qVTFFHTPDqHqiRQ-PFYLLGKEKKLflVEFFlSMFHSGHIKnSQELISFTIKrTYDPVEYflLEqiRHIHqiSV NHYITYSflKKqR--CSDDLIIRIIHfllYVCSGHSSflSFREI
gi_4936048 qVTFFHTPOqHqiRQ-PFYLLGKEKKLflVEFFISHFHSGHIkflSQELISFTIKITYDPVEYRLEqiRHIHqiSV HHYITYSHKKQR~CSDDLIIRIIHflIYVCSGHSSRSFREI
Consensus .„p„F.h.. « q.PiY$Lg< ftK..flf#.FI„a.Hsg_ . RSQ, SnT !klsfDP! ,ϊl,iq!ral.c..l .aK SDD.l!R Ha.ϊ;„t k
TABLE 2 CONTINUED
Table 3. Degenerate primers generated by CODEHOP
Block x7263xbliD
T L Y V Y I D P
oligo: 5'-AACCTGTACGTGtayntngaycc-3' degen=64 temp=33.4 Extend clamp
T L Y V Y I D P A
oligo: 5'-AACCTGTACGTGTACntngayccngc-3' degen=128 temp=36.0 Extend clamp
T L Y V Y I D P A Y
oligo: 5'-AACCTGTACGTGTACATngayccngcnt-3' degen=128 temp=42.5 Extend clamp
Complement of Block x7263xbliD
Y I D P A Y T N N T
atrnanctrggGCGGATGTGGTTGTTGT
oligo: 5'-TGTTGTTGGTGTAGGCGggrtcnanrta-3' degen=64 temp=62.9
D P A Y T N N T R A
anctrggncgnaTGTGGTTGTTGTGGGTCCG
oligo: 5'-GCCTGGGTGTTGTTGGTGTangcnggrtcna-3' degen=128 temp=61.8
D P A Y T N N T R A
ctrggncgnawGTGGTTGTTGTGGGTCCG
oligo: 5'-GCCTGGGTGTTGTTGGTGwangcnggrtc-3' degen=64 temp=61.0
Block x7263xbliE
C I I F G M E H F F
oligo: 5'-TGGATCATCTTCGGCATngarcaytwyt-3' degen=64 temp=55.7 Extend clamp
F G M E H F F L
oligo: 5'-CATCTTCGGCATGGAGcaytwytwyyt-3' degen=64 temp=62.0
Complement of Block x7263xbliE
E H F F L R D L T G
ctygtrawrawGGACTTCCTGGACTGCCC
oligo: 5'-CCCGTCAGGTCCTTCAGGwarwartgytc-3' degen=32 temp=61.7
H F R D L T G
tygtrawrawrrACTTCCTGGACTGCCCG
oligo: 5'-GCCCGTCAGGTCCTTCArrwarwartgyt-3' degen=128 temp=60.8
H F F L R D L T G
gtrawrawrraCTTCCTGGACTGCCCG
oligo: 5'-GCCCGTCAGGTCCTTCarrwarwartg-3' degen=64 temp=S0.8
Block x7263xbliF
E V H I A V E G N
oligo: 5'-GGACGTGCACGTCGCCrtngarggnaa-3' degen=64 temp=63.8
Complement of Block x7263xbliF
E G N S S Q D S A
anctyccnttrwGGTTGGTCCTGAGGCGG
oligo: 5'-GGCGGAGTCCTGGTTGGwrttnccytcna-3' degen=128 temp=62.7
E G N S S Q D S A V
ctyccnttrwsGTTGGTCCTGAGGCGGC
oligo: 5'-CGGCGGAGTCCTGGTTGswrttnccytc-3' degen=64 temp=63.9
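The degen values listed in Table 3 appear to be the product, over the lower-case degenerate 3' core of each oligo, of the number of bases each IUPAC ambiguity code represents (for example y = 2, n = 4). The following Python sketch illustrates that arithmetic under this assumption. It is not part of CODEHOP, and the names IUPAC_CHOICES, degeneracy and three_to_five are hypothetical names introduced here for illustration only. The second function reflects the observation that, for the complement-strand entries, the sequence printed above each oligo line appears to be the same oligo written 3' to 5'.

```python
# Illustrative sketch only: these helper names are hypothetical and are not
# taken from CODEHOP or from the specification.
from math import prod

# Number of bases represented by each IUPAC nucleotide code.
IUPAC_CHOICES = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "s": 2, "w": 2, "k": 2, "m": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def degeneracy(oligo: str) -> int:
    """Return the number of distinct sequences encoded by a degenerate oligo.

    Upper-case clamp bases are unambiguous and each contribute a factor of 1.
    """
    return prod(IUPAC_CHOICES[base] for base in oligo.lower())


def three_to_five(oligo: str) -> str:
    """Return the oligo written 3'->5' (a plain reversal, not a complement)."""
    return oligo[::-1]


if __name__ == "__main__":
    # First forward primer of Block x7263xbliD: y * n * n * y = 2 * 4 * 4 * 2.
    print(degeneracy("AACCTGTACGTGtayntngaycc"))  # 64
    # Last complement-strand primer of Block x7263xbliF, written 3'->5'.
    print(three_to_five("CGGCGGAGTCCTGGTTGswrttnccytc"))
```

Applied to the oligos in Table 3, this calculation gives the same values as the listed degen figures, which is consistent with the assumption stated above.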