WO2002074994A2 - Improved ORESTES sequencing method

Info

Publication number
WO2002074994A2
Authority
WO
WIPO (PCT)
Prior art keywords
cell
nucleic acid
organism
sequences
cdna
Prior art date
Application number
PCT/US2001/046665
Other languages
French (fr)
Other versions
WO2002074994A3 (en
Inventor
Andrew John George Simpson
Emmanuel Dias-Neto
Ricardo R. Brentani
Original Assignee
Ludwig Institute For Cancer Research
Priority date
Filing date
Publication date
Application filed by Ludwig Institute For Cancer Research
Publication of WO2002074994A2
Publication of WO2002074994A3

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6846: Common amplification features
    • C12Q1/6869: Methods for sequencing

Definitions

  • the single stranded cDNA is prepared, it is used in an amplification reaction.
  • the single primer used is identical to the first primer, as described supra, and that low stringency conditions be employed. Using identical primers tends to produce longer products, but this is not required.
  • the darker portion is a sequence obtained in accordance with the invention.
  • RNA samples from S. mansoni DNA and human cancer tissues were collected from freshly perfused hamsters, and frozen immediately.
  • Total RNA was isolated from approximately 500 mg of tissue, using standard methods.
  • Messenger RNA was isolated using a commercially available mRNA kit, resuspended in 40 µl of buffer, and then treated with DNase (1 U/10 µl) for 15 minutes at room temperature. The DNase was inactivated by incubating at 65 °C for 10 minutes. Integrity of the mRNA was checked routinely by performing control amplification of three different messages, using three different pairs of primers which match the 5' end of S. mansoni control genes. The control genes were actin (abundant message), retinoic X receptor, 4211 bases (long mRNA), and cathepsin C (low abundance message). Controls showed that the mRNA was intact.
  • PROTOCOL 3 (S. MANSONI):
  • Protocol 1 was used on 333 different ORESTES cDNA libraries which had been derived from three different human tissues, and four different mRNA preparations. The clones that were derived from this set of libraries were sequenced, yielding 65,091 ESTs, with an average of 195 ESTs per minilibrary. This sequence set was used for normalization analysis, and positional distribution. Analysis of normalization was performed using 11,848 ESTs, which matched complete human genes present in the UniGene cluster. The average size of the clusters tagged by the sequences generated was 495. Size distribution is set forth in figure 4. Positional distribution is shown in figure 5. The central portion of the transcripts is preferentially tagged.
  • Steps 2-38 are then repeated, 26 times 39) 72 °C 30 seconds PROTOCOL 5
  • a set of approximately 250,000 ESTs was generated, and then analyzed.
  • the ESTs were size selected, in distinct size ranges, varying from 0.3 to 1.5kb.
  • the size selected fragments were cloned into pUC18, using standard methods.
  • the plasmids were cloned into cells, and colonies were grown overnight in liquid media. PCR was carried out thereafter, and the PCR products were used for DNA sequencing.
  • sequences were trimmed to exclude primer sequences and vector sequences, as well as low quality regions.
  • the sequences were clustered, using "CAP3", in accordance with Huang, et al., Genome Res. 9:868-877 (1999), incorporated by reference. Preliminary analysis of the sequences showed that 18% of the sequences were derived from rRNA and mtDNA transcripts, or were composed almost entirely of repetitive sequences. These were excluded from further analysis.
  • the method involves forming a cDNA library by contacting a sample of mRNA with at least one arbitrary primer, at low stringency conditions, followed by reverse transcription. The resulting, single stranded cDNA is then amplified, with at least one arbitrary primer, at low stringency, to create a mini-library of cDNA.
  • These nucleotide sequences are derived from internal, coding regions of mRNA.
  • the resulting nucleic acid molecules are then sequenced.
  • the improvement comprises varying the annealing temperature for the amplification reaction, wherein a high annealing temperature is used, for a number of cycles, followed by a number of cycles with a low annealing temperature.
  • a source of pre-existing sequence information e.g., a nucleotide sequence library.
  • pre-existing information which corresponds to internal mRNA sequences can be identified.
  • the method is applied to eukaryotes.
  • the method as described herein is applicable to any organism, including single cell organisms such as yeast, parasites such as Plasmodium, and multicellular organisms. All plants and animals, including humans, can be studied in accordance with the methods described herein.
  • a second feature of the invention is a method for developing so-called "contig" sequences.
  • These are nucleotide sequences which are generated following comparing sequences produced in accordance with this method to previously determined sequences, to determine if there is overlap. This is of interest because longer sequences are of great interest in that they define the target molecule with much greater accuracy.
  • These contigs may be produced by comparing sequences developed in accordance with the method, as well as by comparing the sequences to pre-existing sequences in a databank. The aim is simply to find overlap between two sequences.
  • the power of the inventive method is such that there are innumerable applications. For example, it is frequently desirable to carry out analyses of populations of subjects.
  • the invention can be used to carry out genetic analyses of large or small populations. Further, it can be used to study living systems to determine if, e.g., there have been genetic shifts which render an individual or population more or less likely to be afflicted with diseases such as cancer, to determine antibiotic resistance or non-tolerance, and so forth.
  • the invention can also be used in the study of congenital diseases, and the risk of affliction to a fetus, as well as the study of whether such conditions are likely to be passed to offspring via ova or sperm.
  • analyses for pathological conditions can be carried out in all animals, plants, birds, fish, etc.
  • the invention is applicable to all eukaryotes, not just humans, and not just animals.
  • the genomes of food crops can be studied to determine if resistance genes are present, have been incorporated into a genome following transfection, and so forth. Defects in plant genomes can also be studied in this way.
  • the method permits the artisan to determine when pathogens which integrate into the genome, such as retroviruses and other integrating viruses, such as influenza virus, have undergone shifts or mutations, which may require different approaches to therapy.
  • This aspect of the invention can also be applied to eukaryotic pathogens, such as trypanosomes, different types of Plasmodium, and so forth.
  • the method described herein can also be applied to DNA directly. More specifically, there are organisms, such as particular types of bacteria, which are very difficult to culture. One can apply the inventions described herein to DNA of these or other bacteria directly, rather than to cDNA prepared from mRNA. Essentially, the methodology used is the same as the methodology described supra, except genomic DNA is used. In such a case, random fragments are produced, rather than ORF segments. Using PCR in this type of approach means that very small amounts of DNA are needed, hence difficulties in culture are avoided. It is estimated that less than one microgram of DNA would be necessary to sequence an entire genome of a prokaryote.
  • annealing temperatures ranging from 72 °C to 30 °C, more preferably about 60 °C to 40 °C, are preferred. These so-called "touchdowns" are used at least once, more preferably up to 40 or 50 times. More preferably, they are used 1-30 times. Essentially, denaturing, annealing of abundant message, annealing of primers, and then primer extension constitute one cycle, and the temperature may vary at each of the steps.

Abstract

The invention involves a method for obtaining nucleotide sequence information from nucleic acid molecules, such as cDNA. The method involves the use of arbitrary primers, and low stringency conditions. Rather than providing information from the termini of nucleic acid molecules, the method provides information on the more interesting and relevant internal portions of nucleic acid molecules. The method shows how to secure information on ORFs, and how to prepare contig sequences from any source.

Description

IMPROVED ORESTES SEQUENCING METHOD
RELATED APPLICATIONS
This application is a continuation in part of Serial No. 09/406,117, filed September 27, 1999, as well as Serial No. 09/196,716, filed November 20, 1998. Both of these applications are incorporated by reference in their entirety.
FIELD OF THE INVENTION
The invention relates to improved methods for determining the sequences of nucleic acid molecules. More particularly, it relates to improved methods for preferentially sequencing internal portions of nucleic acid molecules, such as those portions referred to as open reading frames, or "ORFs". The method is such that one can essentially eliminate sequencing of non-coding portions. Preferably, the method is applied to complementary DNA, or "cDNA", obtained from eukaryotes. The method is applicable to all organisms, eukaryotic organisms in particular, be they single cell or complex. All nucleic acid molecules, including plant and animal molecules, can be studied with this method. Repeated application of the method permits the sequencing of essentially the entire coding component of an organism, regardless of the complexity of the genome under consideration. Application of the method has led to the identification of hundreds of previously unknown nucleic acid molecules. Further application of the method permits the construction of "contigs" or constructs of sequenced nucleic acid molecules. Application of the method also allows one to assign previously identified nucleotide sequences to internal regions of genes. The improved methods described herein address a limitation of the method described in the grandparent application, which is the generation of relatively small numbers of ESTs derivable from single cDNA libraries, under limits of reasonable redundancy. The improvements described herein led to substantial increases in the number and percentage of non-redundant ESTs derived from mini-libraries. Rare transcripts are more prevalent when using the improvements to the method.
BACKGROUND AND PRIOR ART
The area of nucleic acid research has seen tremendous advances in knowledge and understanding in the recent past. One of the goals in the field has been the determination of the sequence of the entire chromosomal component, or "genome" of organisms. This has been achieved for several non-nucleated organisms (prokaryotes), and of one organism with a nucleus, a "eukaryote". Eukaryotes have much more complex genomes than prokaryotes, for reasons which will be discussed infra.
The interest in sequencing entire genomes of organisms has been explained in detail in both technical and non-technical publications, and need not be repeated here. See, for example, Venter, et al., "Shotgun Sequencing of The Human Genome", Science 280:1640-1642 (1998), and Pennisi, "A Planned Boost for Genome Sequencing, But the Plan Is in Flux", Science 281:148-149 (1998).
Various approaches to what is a large, and complex project have been advanced. For example, the so-called "Shotgun" approach, developed by Venter et al, is very well known. In this approach, genomic DNA is cleaved into very small pieces, and these pieces are then sequenced. The approach is repeated, and after an undefined number of repeats, sequences are aligned to permit, at least in theory, a determination of the complete genomic sequence.
This approach has been used by Venter et al on prokaryotes, and it has been proposed for use on more complex eukaryotes, such as humans. The proposed approach to eukaryotes is not without drawbacks and criticism, however. A sizable portion of the scientific community is of the view that the resulting information will be riddled with gaps. The human genome, in contrast to prokaryotic genomes is characterized by a large number of repetitive sequences. It is felt by many that the overlapping of repetitive sequences could lead to incorrect alignment of the larger fragments from which they are derived.
A second approach, which has found more widespread acceptance, is to cleave the genome into relatively large fragments, and then to "map" the larger, non-sequenced fragments to show overlap prior to sequencing the material. After this overlapping, which results in a physical map of the genome, the segments are fragmented, and sequenced. While this approach should, in theory, eliminate the gaps in the sequence, it is time consuming and costly. Further, both of these approaches suffer from a fundamental drawback, as will all approaches which begin with eukaryotic genomic DNA, as will now be explained.
Eukaryotic DNA consists of both "coding" and "non-coding" DNA. For purposes of this invention, only coding DNA is under consideration, as it is this material which is transcribed and then translated into proteins. This coding DNA is sometimes referred to as "open reading frames" or "ORFs", and this terminology will be used hereafter.
As compared to prokaryotes, eukaryotic DNA has a much more complex structure. Genes generally consist of a non-coding, regulatory portion of hundreds of nucleotides followed by coding regions ("exons"), separated by non-coding regions ("introns"). When DNA is transcribed into messenger RNA, or mRNA, and then translated into protein, it is only these exons which are of interest. It has been estimated that, for humans, of the approximately 3 billion nucleotides which make up the genome, only about 3% are coding sequences. The shotgun and mapping approaches referred to supra do not differentiate between coding and non-coding regions. Hence, a method which would permit sequencing of only coding regions would be of great interest, especially if the method permits development of longer "contigs" of sequence information.
One such method is, in fact, known. This is the "Expressed Sequence Tag" or "EST" approach. In this approach, one works with complementary DNA or "cDNA" rather than genomic DNA. In brief, as indicated supra, genomic DNA is transcribed into mRNA. The mRNA contains the relevant ORF in contiguous form, i.e., without intervening introns. These molecules are very fragile and their existence transient. In the laboratory, one can employ various enzymes, i.e., so-called "reverse transcriptases", to prepare complementary DNA, or "cDNA", which is much more stable than mRNA. One then sequences the cDNA, incompletely, from either the 5' or 3' end. These incomplete sequences, in theory, serve as identifying "tags" for nucleic acid molecules of interest. Literally millions of ESTs have been prepared, and are accessible via known data bases, such as GenBank.
There are problems with this approach as well. First, large amounts of extremely high quality mRNA are necessary, and this is not always available. Also, one must bear in mind that the non-coding regions of mRNA molecules are found at the 5' and 3' ends, and this is carried over into the cDNA molecule. As a result, the information obtained may not be very useful. For example, it frequently provides no information about the actual protein encoded by the molecule. Clearly, there is a need for a system which provides more useful information about nucleic acid molecules.
Dias Neto et al., Gene 186:135-142 (1997), the disclosure of which is incorporated by reference, applied a method for determining sequence information from the parasite S. mansoni which involved, inter alia, the use of arbitrary primers, and low stringency hybridization conditions. There is no discussion in this paper of the ability to identify and to sequence internal portions of an open reading frame. The paper itself appears to have only been cited a single time by other investigators. Nor is there any discussion within the reference of investigating sequences for overlap, so as to develop "contigs", i.e., longer nucleotide sequences prepared by determining overlap of two smaller sequences.
U.S. Patent No. 5,487,985 to McClelland, et al., incorporated by reference, teaches a method referred to as "AP-PCR", or arbitrarily primed polymerase chain reaction. The method employs a single primer designed so that there is a degree of internal mismatch between the primer and the template. Following amplification with the primer, a second PCR is carried out. The amplification products are separated on a gel to yield a so-called "fingerprint" of the organism or individual under study. The '985 patent does not discuss the identification of internal portions of open reading frames, nor does it discuss the analysis of sequences to develop contigs. Dias Neto, et al., Proc. Natl. Acad. Sci. USA 97(7):3491-6 (2000), incorporated by reference in its entirety, describes the methodology set forth in the grandparent application, the "ORESTES" methodology, which is elaborated upon in examples 1-8 of the disclosure which follows.
The methodology as applied in the patent applications and article referred to supra yields a relatively small number of ESTs, derived from a single cDNA library. This is a very limiting step, and results in a need to produce a large number of cDNA libraries, which in turn requires larger quantities of mRNA.
In eukaryotic cells, genes can be classified as those which generate abundant, intermediate, or rare transcripts. The majority of genes (approximately 11,000/cell) are represented at very low frequencies (10 copies or less). An intermediate number of transcripts (about 300 copies per cell) is produced by about 500 genes per cell, while it is estimated that fewer than 10 genes are represented by 12,000 or more copies of transcript per cell. Clearly, it is necessary to obtain all genes when a complete transcriptome of a cell is desired. The methodology described in the grandparent application and Dias Neto, et al., supra, is efficient at generating ESTs from the central portion of transcripts, and at generating ESTs from partially normalized mRNA populations. A major limitation of the method, however, is the requirement of a large number of low complexity cDNA libraries in order to obtain a detailed transcription profile. Dias Neto, et al., supra, describe generation of about 40 different sequences per primer used. Low complexity amplification patterns were found using the ORESTES method. While it is not completely clear why this was the case, one possible reason is the low annealing temperature (37°C) used in the annealing step of the reaction. Primers anneal to sites of higher complementarity before the temperature reaches the annealing point, which explains, in part, the normalizing capacity of the method. When low temperatures are reached during the annealing step, low complementarity sites are tagged, especially those with high copy number. When the number of amplification cycles is the same for each tagged site, those that are more prevalent will be over-represented as predominant bands on the ORESTES profile. The more abundant targets compete for reagents with less abundant copies, resulting in a profile that does not accurately reflect the gene diversity of a given cell.
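The abundance classes quoted above can be turned into a rough back-of-envelope calculation. The following is only a sketch: it treats "10 copies or less" as exactly 10, "fewer than 10 genes" as exactly 10, and "12,000 or more" as exactly 12,000, all of which are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
# Nominal abundance classes taken from the text. The exact values are
# illustrative bounds (assumptions), not measurements.
CLASSES = {
    "rare": {"genes": 11_000, "copies_per_gene": 10},
    "intermediate": {"genes": 500, "copies_per_gene": 300},
    "abundant": {"genes": 10, "copies_per_gene": 12_000},
}


def transcript_share(classes):
    """Return each class's share of all transcript copies in the cell."""
    totals = {
        name: c["genes"] * c["copies_per_gene"] for name, c in classes.items()
    }
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


def gene_share(classes):
    """Return each class's share of all expressed genes."""
    total_genes = sum(c["genes"] for c in classes.values())
    return {name: c["genes"] / total_genes for name, c in classes.items()}


if __name__ == "__main__":
    for name in CLASSES:
        print(
            f"{name:>12}: {gene_share(CLASSES)[name]:.1%} of genes, "
            f"{transcript_share(CLASSES)[name]:.1%} of transcript copies"
        )
```

Under these assumed figures, rare genes account for the large majority of expressed genes but well under half of the transcript copies, which is why sampling that follows copy number tends to under-represent them.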
Hence, it is a feature of the invention to modify the ORESTES protocols so as to develop protocols which are efficient, and which increase the informative value of ORESTES generated cDNA libraries. Specifically, protocols have been modified so as to increase the number of cycles of high stringency annealing and to reduce the number of cycles of low stringency, i.e., lower annealing temperatures. The aim is to increase the frequency of rare genes, and decrease the frequency of abundant genes, leading to a more complex and accurate representation of a transcriptome. How this is achieved will be seen, in particular, in examples 9 et seq. which follow.
BRIEF DESCRIPTION OF THE FIGURES
Figures 1A and 1B both show, schematically, prior art genome sequencing approaches.
Figure 1C shows the invention, schematically.
Figure 2 presents both a theoretical probability curve (dark ovals) and actual results (white ovals), obtained when practicing the invention. The data points refer to the probability of securing the sequence of a particular portion of cDNA molecule when practicing the invention.
Figure 3 shows construction of a contig, using the invention.
Figure 4 shows size distribution of ORESTES clusters produced in accordance with the improved method.
Figure 5 shows positional distribution of the data shown in figure 4.
Figure 6 shows size distribution from a second experimental run.
Figure 7 shows positional distribution of the data of figure 6.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
One aspect of the invention, as discussed supra, is a method for obtaining nucleotide sequence information from organisms, preferably information from open reading frames of cDNA of eukaryotic organisms. As a first step, messenger RNA ("mRNA") is extracted from a cell. The extraction of mRNA is a standard technique, the details of which are well known by the artisan of ordinary skill. For example, it is well known that eukaryotic mRNA, as compared to other forms of RNA, is characterized by a "poly A" tail. One can separate mRNA from other types of RNA by passing it over a column which contains oligomers of the base thymidine. These "oligo dT" molecules hybridize to the poly A sequences on the mRNA molecules, and these then remain on the column. Other approaches to separation of mRNA are known. All can be used. If prokaryotic mRNA is being considered, separation using poly A/poly T hybridization is not carried out. It is preferred to treat the resulting material to reduce or to eliminate contamination by DNA. Adding a DNA degrading enzyme, such as DNase, is preferred. This is carried out prior to contact with the column. It is also preferred to pass the purified RNA over the column at least twice.
The separated mRNA is then used to prepare a cDNA. The preparation of the cDNA represents the first inventive step in the method of the invention. To prepare the cDNA, the mRNA is combined with a sample of a single, arbitrary primer. By "arbitrary" is meant that the primer used does not have to be designed to correspond to any particular mRNA molecule. Indeed, it should not be, because the primer is going to be used to make all of the cDNA. Details on the design of arbitrary primers can be found in Dias-Neto, et al., supra, McClelland, et al, supra, and Serial No. 08/907,129 filed August 6, 1997 and incorporated by reference.
The primer is preferably at least 15 nucleotides long. Theoretically, it should not exceed about 50 nucleotides, but it can. Most preferably, the primer is 15-30 nucleotides long. While the sequence of the primer can be totally arbitrary, it is preferred that the total content of nucleotides "G" and "C" in the primer be compatible with the "G" and "C" content of the open reading frames of the organism under consideration. It is found that this favors amplification of the desired sequences. General rules of primer construction favor a G and C content of at least 50%.
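The length and G+C guidelines in the preceding paragraph can be checked mechanically. The sketch below is a minimal illustration, not part of the described method: the function names are hypothetical, and the default thresholds (15 to 30 nucleotides, at least 50% G+C) simply restate the "most preferred" figures given above.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def primer_ok(primer: str, min_len: int = 15, max_len: int = 30,
              min_gc: float = 0.5) -> bool:
    """Check a candidate primer against the length and G+C guidelines.

    The defaults mirror the preferred ranges stated in the text; a real
    design would compare against the G+C usage of the organism's ORFs.
    """
    return min_len <= len(primer) <= max_len and gc_fraction(primer) >= min_gc


if __name__ == "__main__":
    for candidate in ("GCGCATATGCGCATGC", "ATATATATATATATAT", "GCGC"):
        print(candidate, primer_ok(candidate))
```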
"Arbitrary primer" as used herein does not exclude specific design choices within the primers. For example, the four bases at the 3' end of a given primer are generally considered the most important portion for hybridization. Hence, it is desirable to include as many different primers as possible, to cover all variations within this 4 base sequence. There are 256 variants possible, since there are four nucleotides. In order to identify products from a particular source, a "marker" sequence can be used, i.e., a stretch of predefined nucleotides. The remainder of the primer should be selected to correspond to overall GC usage, as described supra. Hence, for a primer 25 nucleotides long, the first 17 should correspond to GC usage for the organism in question. Nucleotides 18-21 would be a "tag", such as "GGCC." Then, all possible combinations of four nucleotides would follow, to produce 256 primers, which contain a known marker. This procedure could be repeated with a second set of primers, where the marker at 18-21 is different.
In practice, each set of variants is used with mRNA from a single source, and would permit the artisan to mark all sequences from a source, and still permit pooling.
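The primer layout described above (a 17-nucleotide region matched to GC usage, a 4-nucleotide marker at positions 18-21, then every possible 4-nucleotide 3' end) can be enumerated directly. In the sketch below, the function name and the example 17-nucleotide prefix are hypothetical placeholders; only the "GGCC" marker and the 25-nucleotide, 256-variant layout come from the text.

```python
from itertools import product


def primer_set(prefix: str, tag: str = "GGCC", variable_len: int = 4):
    """Return prefix + tag + every possible 3'-terminal base combination."""
    return [
        prefix + tag + "".join(tail)
        for tail in product("ACGT", repeat=variable_len)
    ]


# Hypothetical 17-nt prefix, used only for illustration. A real prefix
# would be chosen to reflect the GC usage of the organism under study.
EXAMPLE_PREFIX = "GACCTGCAGGCATGCAA"

primers = primer_set(EXAMPLE_PREFIX)

if __name__ == "__main__":
    print(len(primers), "primers, each", len(primers[0]), "nt long")
    print(primers[0], "...", primers[-1])
```

A second marked set for a different mRNA source would be obtained by calling the same function with a different `tag`, so that pooled products can still be traced to their source.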
The primer is combined with the mRNA under low stringency conditions. What is meant by this is that the conditions are selected so that the primer will hybridize to partially, rather than to only completely complementary sequences. Again, this is necessary because the primer will amplify an arbitrary sample of the mRNA pool, not just one sequence. There are standard rules and formulas for approximating high and low stringency, and the artisan of ordinary skill is familiar with these. Attention is drawn to Simpson, et al., U.S. Patent application Serial No. 08/907,129, filed August 6, 1997, incorporated by reference, for more information on this, as well as Dias-Neto, et al, and McClelland, et al., supra.
The arbitrary primer and mRNA are mixed with appropriate reagents, such as reverse transcriptase, a buffer, and dNTPs, to yield a pool of single stranded, cDNA molecules.
Once the single stranded cDNA is prepared, it is used in an amplification reaction. In this second reaction, it is preferred, but not required, that the single primer used is identical to the first primer, as described supra, and that low stringency conditions be employed. Using identical primers tends to produce longer products, but this is not required.
The result of this amplification is a mini library. One can carry out cDNA synthesis in multiple, separate reactions, using different arbitrary primers, "A", "B", "C" and "D". Four pools of single stranded cDNA are then produced, i.e., "A", "B", "C" and "D". Each pool is then amplified using each of the four primers, to generate mini-libraries AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, and DD. These mini-libraries are used in the sequencing reaction which follows.
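The pairing of cDNA-synthesis primers with amplification primers is a simple Cartesian product, as the following sketch shows. The function name is hypothetical; the primer labels are the ones used in the paragraph above.

```python
from itertools import product


def mini_libraries(primers):
    """Pair each cDNA-synthesis primer with each amplification primer.

    The first letter names the primer used for reverse transcription and
    the second names the primer used for amplification.
    """
    return [first + second for first, second in product(primers, repeat=2)]


libraries = mini_libraries(["A", "B", "C", "D"])

if __name__ == "__main__":
    print(len(libraries), "mini-libraries:", ", ".join(libraries))
```

With n primers the scheme yields n squared mini-libraries, so the 4 primers in the example give 16.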
Once the cDNA is prepared, the resulting products are isolated, such as by size fractionation on a gel, column, or other solid phase. The resulting cDNA can be removed from the gel, such as by elution, and then subjected to standard methodologies for cloning and sequencing. When methodologies such as these are used, it is possible to define the minimum and maximum sizes of the sample pool to be analyzed. For example, it is preferred to excise all cDNA of from about 0.25kb to about 1.5kb from the solid phase. The skilled artisan can also choose different fractions of the sample pool, such as all materials between 0.5kb and 1.0kb, or 0.75kb to 1.5kb, and so forth. Any variation between the 0.25 and 1.5kb limits is possible. Such variations are referred to as "fractions" of solid phase associated product. The molecules can then be used in, e.g., standard vectors, so as to express sufficient copies of the molecules in host organisms such as E. coli or other prokaryotes, or any desired eukaryotes. In the alternative, one need not separate the products on a solid phase, but can use them directly in cloning methodologies such as those described supra.
Key to this feature of the invention, as is described herein, is the use of arbitrary primers under low stringency conditions. This combination permits the artisan to sequence internal regions of cDNA preferentially, as compared to the 5' and 3' ends, as is typical in standard prior art approaches. Specifically, consider a portion of a cDNA molecule which is a distance "S" from the 3' end of the molecule. For this portion of the molecule to be amplified by a primer, the primer must bind on both sides of the region to be amplified. If the complete length of the molecule is represented by "L", the probability of a primer binding to the nucleic acid molecule on both sides of a point on a nucleic acid molecule is S(L-S).
The highest probability for inclusion within amplified cDNA is at the exact middle of the molecule. The lowest probability, in contrast, is at the extreme 5' and 3' ends. To elaborate, assume a point directly in the middle of a cDNA molecule, i.e., if the molecule is "x + 1" nucleotides long, .5x nucleotides precede the midpoint, and .5x nucleotides follow it. The likelihood of a primer hybridizing to a point on the molecule preceding the middle is .5x, and following it is also .5x. If "x" is 1, then the probability of hybridization surrounding the midpoint is .5(1-.5), or .25, i.e., 25%. Similarly, assume a point on the same molecule located .9x away from the 3' end. In this case, since the molecule is "x" units long, the point is .1x from the 5' end, i.e., .1 units precede it, and .9 units follow it. If the length is 1, then the probability of hybridization surrounding this point is .9(1-.9), or 9%. Hence, by using a primer and conditions which permit hybridization of the primer anywhere along the molecule, one actually secures the majority of amplified products from within a cDNA molecule, rather than at the ends. In figure 2 of this application, one sees a curve which results when the theoretical model is applied (dark ovals), and a curve obtained in practice (light ovals). It will be seen that, remarkably, the practice of the invention is actually very close to the theory.
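The S(L-S) model above is easy to tabulate. The sketch below reproduces the two worked values from the text (25% at the midpoint, 9% at a point 0.9 of the way along) for a molecule of unit length. The function name is hypothetical, and the value is best read as a relative weight rather than a calibrated probability.

```python
def tag_probability(s: float, length: float = 1.0) -> float:
    """Relative chance that priming sites flank a point s from the 3' end.

    Implements the S(L - S) expression from the text, where L is the
    length of the molecule.
    """
    if not 0 <= s <= length:
        raise ValueError("s must lie between 0 and the molecule length")
    return s * (length - s)


# Profile along a molecule of unit length, sampled every tenth.
profile = [(round(i / 10, 1), tag_probability(i / 10)) for i in range(11)]

if __name__ == "__main__":
    for position, weight in profile:
        print(f"{position:.1f}  {weight:.2f}  {'#' * round(weight * 100)}")
```

The printed profile is a parabola that peaks at the midpoint and falls to zero at both ends, which is the shape of the theoretical curve described for figure 2.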
One very practical result of this approach is that the mRNA is normalized, and bias in copy number is eliminated. The probability of producing an EST from a given mRNA is proportional to the length of that molecule and not its abundance within the source being analyzed.
A further aspect of the invention is the construction of contigs, once the sequence information has been determined. One creates a contig by comparing sequence information and finding overlaps. For example, the last 300 nucleotides of a sequence may be identical to the first 300 nucleotides of a second sequence. The artisan can essentially splice the first and second sequences together, to produce a longer one. The splicing can be done with two or more sequences found in the particular experiment that is carried out, or by comparing deduced sequences to sequences which are available in a public data base, a private data base, a journal, or any other source of sequence information.
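The overlap-and-splice step described above can be sketched in a few lines. This is an illustrative sketch only, assuming exact (error-free) overlaps between the end of one sequence and the start of another; the function name merge_by_overlap and the min_overlap parameter are hypothetical, and real assemblers tolerate mismatches.

```python
from typing import Optional


def merge_by_overlap(first: str, second: str, min_overlap: int = 20) -> Optional[str]:
    """Splice two sequences when a suffix of `first` equals a prefix of `second`.

    Returns the merged sequence using the longest exact overlap of at
    least `min_overlap` bases, or None when no such overlap exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(first), len(second))
    for k in range(longest, min_overlap - 1, -1):
        if first[-k:] == second[:k]:
            return first + second[k:]
    return None
```

For example, with min_overlap=4 the sequences "AAACCCGGG" and "CCGGGTTT" share the five bases "CCGGG" and merge into "AAACCCGGGTTT".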
A further aspect of the invention is the ability to compare information obtained using the inventive method to pre-existing information, in order to determine if a known nucleotide sequence is an internal sequence of a particular gene. This can be done because, as explained supra, the method described herein generates an extremely high percentage of internal sequences, with a very low percentage of sequences at the ends of a given molecule. The prior art methods either generate predominantly terminal sequences, or internal sequences on a completely random basis. Hence, it is probable that nucleotide sequences of unknown origin are contained within various sources of sequence information. Data generated using the methods of this invention can be compared to this pre-existing information very easily, and can result in a determination that a particular nucleotide sequence is, in fact, an internal sequence.
The methodology referred to as "Touchdown PCR" by Don, et al., Nucleic Acid Res. 19(14): 4008 (1991), incorporated by reference, is a technique designed to improve PCR yield and specificity. In brief, the technique involves starting with a high annealing temperature followed by a gradual decrease, especially over the first few cycles. As a result, the first priming only occurs with sequences having perfect homology with the primers. Any other methodology which involves variation in the annealing temperature is also a part of the invention. Such methodologies are known to the skilled artisan and need not be reiterated here.
Only good interactions, i.e., those involving bases at the 3' end of the oligonucleotide primer, are productive in terms of generating double stranded cDNA, when high temperatures are used in the annealing step. The number of poorly expressed genes, as pointed out, supra, is 1,000 times higher than the number of abundant genes. Hence, the population of rare genes is 1,000 times more complex in terms of nucleotide composition. Many more chances for a complementary binding site are presented than there are for abundant genes, offering many more chances for complementary sites for primer binding. Application of the touchdown concept, i.e., the use of a number of higher stringency annealing cycles in early PCR cycles, led to an increase in the number of copies of PCR fragments derived from rare genes at the start of the cycle, thereby favoring the chances for hybridization at later cycles where low annealing temperatures are used. The gradual reduction of annealing temperature reduces the requirements for primer/template interaction, and the generation of amplification bands. Molecules of lower complementarity are targeted only in later cycles; however, their amplification has remained low.
Hence, as is shown in examples 9 et seq. which follow, by increasing the number of high stringency annealing cycles (i.e., those carried out at high temperature), and reducing the number of low stringency cycles (low temperature), rare genes become more highly represented, and the complexity of the amplification profile is increased. In effect, were one to attempt to generate a standard "fingerprint" of the type used in the art, one would be disappointed. The operation of the invention results in a sample pool containing many more sequences, which appear almost as a "blur" or "smear" as compared to standard DNA products on gels. Such complexity is key to the invention, however, and is desirable.
The practice of the invention and how it is achieved will be seen in the examples which follow.
EXAMPLE 1
This example describes the generation of a cDNA library in accordance with the invention. While colon cancer cells from a human were used, any cell could also be treated in the manner described herein.
The mRNA was extracted from a sample of colon cancer cells, in accordance with standard methods well known to the artisan, and not repeated here. It was then divided into approximately 5μl aliquots, which contained anywhere from 1 to 10ng of mRNA. The samples were then stored at -70 °C until used.
The aliquots of mRNA were then used to prepare single stranded cDNA, using 25 pmol samples of a single, arbitrary primer. Several different experiments were carried out, using a different, single arbitrary primer in each case. The single, arbitrary primers used were:
5' - GAAGCTGGTA AACAAAAGG - 3' SEQ ID NO: 392
5' - AGCTGCATGA TGTGAGCAAG - 3' SEQ ID NO: 393
5' - CCCGCTCCTC CTGAGCACCC - 3' SEQ ID NO: 394
5' - GAGTCGATTT CAGGTTG - 3' SEQ ID NO: 395
5' - TGCTTAAGTT CAGCGGG - 3' SEQ ID NO: 396
In each case, 25 pmols of arbitrary primer were mixed with the aliquot of mRNA, 100 units of Moloney murine leukemia virus reverse transcriptase, reverse transcriptase buffer (25mM Tris-HCl, pH 8.3, 75mM KCl, 3mM MgCl2, 10mM DTT), and 100mM of each dNTP, to a final volume of 20ul. The mixture was incubated for 30 minutes, at 37 °C, to yield single stranded cDNA.
EXAMPLE 2
The single stranded cDNA produced in example 1, supra, was used as the template in a PCR amplification reaction. In this, a sample of 1ul of single stranded cDNA was combined, together with the same primer that had been used to generate the cDNA. Amplification was carried out, using 12uM of primer, 200 uM of each dNTP, 1.5mM MgCl2, 1 unit of DNA polymerase, and buffer (50mM KCl, 10mM Tris-HCl, pH9.0, and 0.1% Triton X-100), to reach a final volume of 15ul. Then, 35 cycles of amplification were carried out, 1 cycle consisting of 95 °C for 1 minute, (denaturation), 37 °C for 1 minute (annealing), and extension at 72 °C, for 1 minute. In the final cycle extension was increased for 5 minutes. The amplification products were used in the analyses which follow. Additional experiments were also carried out, in the same fashion, using different primers.
EXAMPLE 3
In order to analyze the amplification products, 3ul samples were mixed with 3ul of sample buffer, 0.05% bromophenol blue, 0.05% xylene cyanol FF, and 7% sucrose (w/v), in distilled water, and then visualized on silver stained, 6% polyacrylamide gels, following Sanguinetti, et al, Biotechniques 17:3-6 (1994), incorporated by reference.
The steps set forth supra result in banding patterns on the gel, each band representing a different sequence. The most complex banding patterns were analyzed, as discussed in example 4, infra. It is important to note that controls were run during the experiments, to make sure that genomic DNA had not contaminated the samples. In brief, the control experiments used mRNA and genomic DNA, without reverse transcription PCR. The profiles obtained should differ, in each case from those obtained using reverse transcribed mRNA, and did so.
EXAMPLE 4
The cDNAs generated in the preceding examples were mixed, by pooling 10-20ul of each set of products into a final volume of 60ul, followed by electrophoresis through a 1% low melting point agarose gel containing ethidium bromide to stain the cDNA fragments. Known DNA size standards were also provided.
The gel portions containing fragments between 0.25 and 1.5 kilobases were excised, using a sterile razor blade. Excised agarose was then heated to 65 °C for 10 minutes, in 1/10 volume of NaOAc (3mM, pH 7.0), and cDNA was recovered via standard phenol/chloroform extraction and ethanol precipitation, followed by resuspension in 40ul of water. The thus recovered cDNA was used in the following experiments.
EXAMPLE 5
The cDNA extracted supra was treated with 10 units of Klenow fragment cDNA polymerase, and 10 units of T4 polynucleotide kinase, for 45 minutes at 37°C. The reaction mixture was then extracted, once, with phenol, and the DNA was then recovered by passage through a standard Sephacryl S-200 column. Recovered cDNA was then ligated into the commercially available plasmid pUC18, and the plasmids were used to transform receptive E. coli, using standard methodologies. This resulted in sufficient amounts of individual cDNA molecules for the experiments which follow.
EXAMPLE 6
Individual bacterial clones were established from the transformants of example 5. These were then used to prepare sequencing templates, following standard methodologies and sequenced. Standard computational procedures, and publicly accessible databases were employed in analyzing the resulting sequences. There were some cases where the analysis revealed two, different cDNAs in the clone. This could be determined, since the primer sequence is present only at both ends of the cDNA. Thus, if the primer was found in the middle of the sequence, it indicated that the sequences on either side were from different cDNAs. The two sequences were treated as separate sequences in analyzing the results.
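The check for clones carrying two different cDNAs can be illustrated as follows. This is a minimal sketch, assuming that the primer appears at one end of each insert and its reverse complement at the other, and that matches are exact; the function names revcomp and internal_primer_sites are hypothetical and not taken from the original disclosure.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def internal_primer_sites(insert: str, primer: str) -> list:
    """Positions where the primer (either strand) occurs away from the ends.

    An occurrence starting at position 0 or finishing at the last base is
    treated as a normal terminal primer; anything else suggests that two
    cDNAs were joined in one clone.
    """
    hits = []
    for motif in {primer, revcomp(primer)}:
        start = insert.find(motif, 1)  # skip a match at the very start
        while start != -1:
            if start + len(motif) < len(insert):  # skip a match at the very end
                hits.append(start)
            start = insert.find(motif, start + 1)
    return sorted(hits)
```

An empty result is consistent with a single cDNA; any reported position marks a candidate junction at which the read could be split into separate sequences for analysis.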
Of 413 cDNA sequences studied, 337 were not found in the public databases referred to, supra. Sixteen of these sequences had a partial match to known sequences, allowing a contig to be formed.
There were another 42 sequences which were similar, but not identical to, sequences in public databases, suggesting that these 42 sequences are related to the pre-existing material.
Twenty six of the sequences were completely contained within known, complete human sequences. This permitted generation of the empirical curve shown in Figure 2. Twenty two of the twenty six sequences were completely or partially within open reading frames of known genes.
Some of the sequences obtained showed partial homology to known genes, suggesting their function. Other sequences were found which showed no homology to known sequences.
Some of the sequences which were found in these experiments are set forth at SEQ ID NOS: 1 - 241.
EXAMPLE 7
This example shows the use of the invention as applied to breast cancer cells.
A sample of an infiltrative breast carcinoma with attached portions of normal tissues was operatively resected from a subject. The material was kept at -70 °C until used. The sample was characterized, inter alia, by a large tumor mass and a very small amount of normal tissue. Three x 20 micron-thick slices were taken across the tumor mass and any attached normal tissue was microdissected out to leave "pure" tumor tissue. One slice was treated to remove mRNA, as described, supra. Three cDNA libraries were prepared, using SEQ ID Nos: 392 & 393, as well as :
5' - AGGAGTGACG GTTGATCAGT - 3' SEQ ID NO: 397
Reverse transcription was carried out as with the colon cancer sample, as described supra. Then, PCR amplification was carried out by combining 12.8uM of the same primer used in the reverse transcription, 125uM of each dNTP, 1.5 mM MgCl2, 1 unit of thermostable DNA polymerase, and buffer (50mM KCl, 10mM Tris-HCl, pH 9.0, and 0.1% Triton X-100), to a final volume of 20ul. Amplification was carried out by executing 1 cycle (denaturation at 94 °C for 1 minute, annealing at 37 °C for 2 minutes, and extension at 72 °C, for 2 minutes), followed by 34 cycles at 94 °C for 45 seconds, annealing at 55 °C for 1 minute and extension at 72 °C for 5 minutes. When analyzed for banding, as described supra, the samples revealed a complex pattern.
The products were eluted from their gels, cloned into pUC-18, and the plasmids were transformed into E. coli strain DH5α, all as described supra. Plasmids were subjected to minipreparation, using the known alkaline lysis method, and then about 150 of the molecules were sequenced. Of these, 69% were not found in any databank consulted, and appear to represent new sequences. A total of 22% was characterized by large quantities of repetitive elements and retroviral sequences. A total of 4% corresponded to known human sequences, another 4% to ribosomal RNA and mitochondrial sequences, and 8% were redundant sequences. The new sequences are set forth as SEQ ID NOS: 242-391.
EXAMPLE 8
An example of how a contig sequence can be built is described herein.
With reference to figure 3, the darker portion is a sequence obtained in accordance with the invention.
When the sequence was compared to sequences already accessible in databases, there was substantial overlap with a known sequence at the 3' end, and some overlap at the 5' end. This permitted construction of a 1,064 nucleotide long contig. The first sequence is a tentative human consensus sequence, as taught by Adams, et al., Nature 377: 3-17 (1995), while the third sequence is an EST obtained from human gall bladder cells, identified as human gall bladder EST 51121.
EXAMPLE 9
Analysis was carried out on samples from S. mansoni and human cancer tissues. With respect to S. mansoni, live adult worms were collected from freshly perfused hamsters, and frozen immediately. Total RNA was isolated from approximately 500 mg of tissue, using standard methods. Messenger RNA was isolated using a commercially available mRNA kit, resuspended in 40μl of buffer, and then treated with DNase (1U/10μl), for 15 minutes, at room temperature. The DNase was inactivated by incubating at 65 °C for 10 minutes. Integrity of mRNA was checked routinely by performing control amplification of three different messages using 3 different pairs of primers which match the 5' end of S. mansoni control genes. The control genes were actin (abundant message), retinoic X receptor, 4211 bases (long mRNA), and cathepsin C (low abundance message). Controls showed that the mRNA was intact.
Reverse transcription was carried out at 42 °C for 60 minutes, using 200 units of enzyme, 250ng of mRNA, and 20 pmole of primer, at a final volume of 20μl. Samples were treated with RNaseH for 20 minutes, at 37 °C, after which aliquots were subjected to PCR, as described infra. Controls to establish that contaminating DNA was absent were carried out by performing PCR with an aliquot of mRNA not submitted to treatment with reverse transcriptase.
With respect to human samples, tissue samples were taken from excised tumors, and frozen in liquid nitrogen immediately after resection. Total RNA was extracted using standard methods, and RNA degradation was evaluated via Northern Blotting, using GAPDH cDNA as a probe. Samples containing intact mRNA were treated with DNase I (10U/50ng of total RNA), and absence of contaminating genomic DNA was confirmed via PCR, using primers for mitochondrial D-loop, and for the p53 gene. Amplified product was blotted onto nylon membranes, and hybridized with α-32P-dCTP labeled probe for amplified sequences. Any samples which qualified, i.e., had no detectable DNA, were processed for isolation of poly A+ RNA. Following this, cDNA templates were produced by heating 10 to 100ng of purified cDNA, at 65 °C for 5 minutes and then subjected to reverse transcription at 37°C for 60 minutes, in the presence of 200 units of MoMLV reverse transcriptase, and 15pmols of randomly selected primer, at a final volume of 20μl.
The products described herein were used in the examples which follow.
EXAMPLE 10
This example describes the PCR amplification process used. One microliter of either schistosomes or human single stranded cDNA was analyzed, using the same primer that had been used for cDNA synthesis. Different cycling parameters were used. Essentially, after cDNA was denatured, at 75 °C, touchdown PCR was carried out, using an annealing temperature which varied from 60°C to 40°C. Progressive reductions of 0.5 to 4°C per cycle were used, together with from 40-45 total cycles. Complexity of the profiles was evaluated by visual inspection, following electrophoresis on 8% silver stained polyacrylamide gels, or in 1% ethidium bromide stained agarose gels.
The varying protocols were as follows. In each protocol, the steps 1-4 are: denaturing, annealing of abundant message, annealing of primer, and then primer extension.
FOR HUMAN MATERIALS
PROTOCOL 1:
1. 75°C - 1 minute
2. 52°C - 1 minute
3. 72°C - 1 minute
4. 95°C - 45 seconds
5. Repeat 2-4, 4 times
6. 48°C - 1 minute
7. 72°C - 1 minute
8. 95°C - 45 seconds
9. Repeat 6-8, 4 times
10. 44°C - 1 minute
11. 72°C - 1 minute
12. 95°C - 45 seconds
13. Repeat 10-12, 4 times
14. 56°C - 1 minute
15. 72°C - 1 minute
16. 95°C - 45 seconds
17. Repeat 14-16, 28 times
18. 56°C - 1 minute
19. 72°C - 7 minutes
PROTOCOL 2:
1. 94°C - 5 minutes
2. 94°C - 30 seconds
3. 50°C - 1 minute
4. 72°C - 1 minute
5. Repeat 2-4, 2 times
6. 94°C - 30 seconds
7. 49°C - 1 minute
8. 72°C - 1 minute
9. Repeat 6-8, 2 times
10. 94°C - 30 seconds
11. 48°C - 1 minute
12. 72°C - 1 minute
13. Repeat 10-12, 2 times
14. 94°C - 30 seconds
15. 47°C - 1 minute
16. 72°C - 1 minute
17. Repeat 14-16, 2 times
18. 94°C - 30 seconds
19. 46°C - 1 minute
20. 72 °C - 1 minute
21. Repeat 18-20, 2 times
22. 94°C - 30 seconds
23. 45 °C - 1 minute
24. 72 °C - 1 minute
25. Repeat 22-24, 2 times
26. 94°C - 30 seconds
27. 44°C - 1 minute
28. 72°C - 1 minute
29. Repeat 26-28, 2 times
30. 94°C - 30 seconds
31. 43°C - 1 minute
32. 72°C - 1 minute
33. Repeat 30-32, 2 times
34. 94°C - 30 seconds
35. 50°C - 1 minute
36. 72°C - 1 minute
37. Repeat 34-36, 27 times
38. 72°C - 7 minutes
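The annealing temperatures of Protocol 2 can be written out cycle by cycle. The following is a rough sketch, assuming that "Repeat 2-4, 2 times" means two additional passes (three cycles at each temperature) and that "Repeat 34-36, 27 times" gives 28 cycles at the final temperature; both readings are assumptions, and the function name touchdown_annealing is hypothetical.

```python
def touchdown_annealing(start_c, stop_c, step_c, cycles_per_step,
                        final_c, final_cycles):
    """List the annealing temperature (whole degrees C) for every cycle.

    The temperature falls from start_c to stop_c in steps of step_c,
    holding each value for cycles_per_step cycles, and then stays at
    final_c for final_cycles cycles.
    """
    n_steps = (start_c - stop_c) // step_c + 1
    temps = []
    for i in range(n_steps):
        temps.extend([start_c - i * step_c] * cycles_per_step)
    temps.extend([final_c] * final_cycles)
    return temps


# Protocol 2 under the stated reading: 50 down to 43 degrees, then 50 degrees.
protocol2 = touchdown_annealing(50, 43, 1, 3, 50, 28)
```

Under this reading the list begins 50, 50, 50, 49 and reaches 43 before returning to 50 for the remaining cycles; a different reading of the repeat counts would change only the cycles_per_step and final_cycles arguments.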
PROTOCOL 3 (S. MANSONI):
1. 94 °C - 9 minutes
2. 60°C - 1 minute
3. Slow temperature ramp increase to 72°C (15% maximum speed)
4. 72°C - 1 minute
5. 94°C - 1 minute
6. reduce set temperature of "2" by 0.5 °C
7. Repeat 3-6, 29 times
8. 45 °C - 1 minute
9. Slow temperature ramp increase to 72°C (15% maximum speed)
10. 72°C - 1 minute
11. 94°C - 1 minute
12. Repeat 8-11, 8 times
13. 45°C - 1 minute
14. 72°C - 8 minutes
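The per-cycle reduction in Protocol 3 can be sketched in the same spirit. This sketch assumes that each pass through the repeated steps anneals at the current set point of step 2, that "Repeat 3-6, 29 times" yields 30 touchdown cycles, and that "Repeat 8-11, 8 times" yields 9 cycles at 45°C; the function name protocol3_annealing and its parameters are illustrative only.

```python
def protocol3_annealing(start_c=60.0, drop_c=0.5, touchdown_cycles=30,
                        plateau_c=45.0, plateau_cycles=9):
    """Annealing set point for each cycle of a linear touchdown ramp."""
    ramp = [start_c - drop_c * i for i in range(touchdown_cycles)]
    return ramp + [plateau_c] * plateau_cycles


temps = protocol3_annealing()
# Set point left after the last 0.5 degree reduction of the ramp.
final_set_point = 60.0 - 0.5 * 30
```

On these assumptions the set point after the thirtieth reduction is 45°C, which agrees with the temperature given in step 8.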
"Protocol 1" was used on 333 different ORESTES cDNA libraries which had been derived from three different human tissues, and four different mRNA preparations. The clones that were derived from this set of libraries were sequenced, yielding 65,091 ESTs, with an average of 195 ESTs per minilibrary. This sequence set was used for normalization analysis, and positional distribution. Analysis of normalization was performed using 11,848 ESTs, which matched complete human genes present in the UniGene cluster. The average size of the clusters tagged by the sequences generated was 495. Size distribution is set forth in figure 4. Positional distribution is shown in figure 5. The central portion of the transcripts is preferentially tagged.
Protocol 2's products were also analyzed, using 362 different ORESTES cDNA libraries. These were derived from the same tissues as were used in protocol 1, together with a library derived from breast cancer. A total of 68,992 sequences were generated, averaging 190.6 ESTs/minilibrary. Normalization and positional distribution analysis was carried out, using 11,402 ESTs which matched human genes in the UniGene cluster. The average size was 552. Figure 6 shows the size distribution. Figure 7 shows positional distribution.
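A positional distribution of the kind referred to above can be tabulated from alignment coordinates. The sketch below assumes each EST has been aligned to a complete transcript and is described by a hypothetical tuple (est_start, est_end, transcript_length) in nucleotides; the function name positional_histogram and the choice of ten bins are illustrative, not taken from the original analysis.

```python
def positional_histogram(hits, bins=10):
    """Count ESTs by the relative position of their midpoint on the transcript.

    Bin 0 covers the first tenth of the transcript and the last bin the
    final tenth (for bins=10). Integer arithmetic avoids rounding issues
    at bin boundaries.
    """
    counts = [0] * bins
    for est_start, est_end, transcript_length in hits:
        idx = (est_start + est_end) * bins // (2 * transcript_length)
        counts[min(idx, bins - 1)] += 1
    return counts


example_hits = [(400, 600, 1000), (450, 750, 1000), (0, 100, 1000), (900, 1000, 1000)]
example_counts = positional_histogram(example_hits)
```

A profile in which the central bins hold most of the counts would correspond to the preferential tagging of the central portion of transcripts reported for these libraries.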
Additional protocols were used for both Homo sapiens and S. mansoni cDNA. These protocols follow. In both protocols, the annealing temperature drops through most of the cycle. For example, in protocol 4, the annealing temperature starts at 52 °C, and drops by 1 °C intervals to 45 °C, and then increases again, to 48 °C. In protocol 5, the temperature is allowed to decrease by 2°C intervals through step 14.
PROTOCOL 4
1) 75 °C 5 minutes
2) 94°C 30 seconds
3) 52 °C 1 minute
4) 72 °C 1 minute
5) 94 °C 30 seconds
6) 94°C 30 seconds
7) 51 °C 1 minute
8) 72°C 1 minute
9) 94°C 30 seconds
10) 50°C 30 seconds
11) 72°C 1 minute
12) 94 °C 1 minute
13) 94°C 30 seconds
14) 49°C 30 seconds
15) 72°C 1 minute
16) 94°C 1 minute
17) 94°C 30 seconds
18) 48°C 30 seconds
19) 72°C 1 minute
20) 94°C 1 minute
21) 94°C 30 seconds
22) 94°C 30 seconds
23) 47°C 1 minute
24) 72°C 1 minute
25) 94°C 30 seconds
26) 94°C 30 seconds
27) 46°C 1 minute
28) 72°C 1 minute
29) 94°C 30 seconds
30) 94°C 30 seconds
31) 45°C 1 minute
32) 72°C 1 minute
33) 94 °C 30 seconds
34) 94 °C 30 seconds
35) 72°C 1 minute
36) 48°C 1 minute
37) 72 °C 1 minute
38) 94°C 30 seconds
Steps 2-38 are then repeated, 26 times
39) 72 °C 30 seconds
PROTOCOL 5
1) 75°C 5 minutes
2) 66°C 10 seconds
3) 64°C 10 seconds
4) 62°C 10 seconds
5) 60°C 10 seconds
6) 58°C 10 seconds
7) 56°C 10 seconds
8) 54 °C 10 seconds
9) 52°C 10 seconds
10) 50°C 10 seconds
11) 48°C 10 seconds
12) 46°C 10 seconds
13) 44°C 10 seconds
14) 44°C 10 seconds
15) 95°C 30 seconds
Steps 2-15 are repeated 45 times
16) 72°C 7 minutes
EXAMPLE 11
The results obtained using protocol 3 were analyzed further. Specifically, the sequence information was compared to sequence information available in the public GenBank data base. The results follow:
Library H8-1 (cDNA, using "Protocol #3")
306 sequences = 53.9 % non-redundant
209 valid sequences (less contaminant rRNA) = 76.5 % non-redundant
Sequences partially or totally known (70.7 %)
Homologous to ESTs from S. mansoni 30.8 %
Homologous to full-length S. mansoni seqs 39.9 % (31.7 % rRNA)
New S. mansoni sequences (29.1 %)
no match (not homologous to any GenBank seq) 23.9 %
Homologous to full-length non-S. mansoni seqs 5.2 % (by BLASTX)
Library F241 (cDNA from experiment similar to that above, with different primer, using "Protocol #3")
401 sequences = 54.6 % non-redundant
290 valid sequences (less contaminant rRNA) = 71.3 % non-redundant
Sequences partially or totally known (70.4 %)
Homologous to parts of ESTs from S.mansoni 32.9 %
Homologous to full-length S. mansoni seqs 38.5 % (24.9 % rRNA)
New S. mansoni sequences (28.6 %)
Homologous to EST from S. japonicum 1.7 %
no match (not homologous to any GenBank seq) 15.1 %
Homologous to full-length non-S. mansoni seqs 10.9 % (1.7 % by BLASTN, 9.2 % by BLASTX)
In each library, about 29% of the sequences were new S. mansoni sequences. Library H8-1 showed 54% non-redundant sequences after 306 clones were sequenced. The single most redundant gene (97 out of 306, or 31%) was 18S ribosomal RNA. This was probably a contaminant in the mRNA. The 18S rRNA sequence which was repeatedly cloned matches the 3' end of the oligonucleotide primer used in the experiment precisely.
With respect to library F24-I, the results were similar. The single most redundant gene was another ribosomal RNA distinct from that in the first library. When the contaminants are removed from the data pool, the resulting sequence information shows 76.5% non-redundancy in 209 sequences for library H8-1, and 71.3% non-redundancy in 290 sequences from library F24I.
Approximately 300 sequences, having a non-redundancy rate of about 75%, were found. This is considerably higher than those shown previously, e.g., the average of 40 sequences per library shown by Dias-Neto, et al, Proc. Natl. Acad. Sci. USA 97(7): 3491-6 (2000), cited supra.
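The effect of removing contaminant rRNA on the non-redundancy figure can be illustrated with a small calculation. This is a sketch under a stated assumption, namely that non-redundancy is the number of distinct sequences divided by the number of valid sequences; the exact definition used for the libraries is not given in the text, and the function name non_redundancy and the toy labels are hypothetical.

```python
def non_redundancy(assignments, contaminants=frozenset()):
    """Return (valid_count, fraction_distinct) for a list of clone labels.

    `assignments` holds one label per sequenced clone (for example, the
    gene or cluster it matched); labels listed in `contaminants` are
    dropped before the fraction of distinct labels is computed.
    """
    valid = [label for label in assignments if label not in contaminants]
    if not valid:
        return 0, 0.0
    return len(valid), len(set(valid)) / len(valid)


# Toy data only: three rRNA clones and four clones from three genes.
toy_library = ["rRNA"] * 3 + ["geneA", "geneB", "geneB", "geneC"]
```

With the toy data, the fraction of distinct labels rises from 4/7 to 3/4 once the rRNA clones are excluded, mirroring the direction of the change reported for libraries H8-1 and F24I.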
EXAMPLE 12
This example provides further details on the improved method. Biological samples were taken from tumors and surrounding normal tissues and were used for RNA extraction, following standard methods. Samples were frozen in liquid nitrogen immediately after resection and were then allowed to partially thaw to -20 °C for microdissection, in order to enrich for tumor cells. In addition, breast cell lines were provided via another investigator. In all, human breast, colon, stomach and head and neck tumors were used. RNA extraction and template preparation were carried out in accordance with Dias-Neto et al, supra, except 1 μl samples of neat cDNA were amplified, via PCR, using touchdown PCR. In the touchdown methodology used, cDNA was denatured at 75 °C, followed by annealing temperatures that varied from 60 °C to 41 °C, using progressive reduction of 1-2 °C per cycle. A total of 45 cycles were used.
A set of approximately 250,000 ESTs was generated, and then analyzed. First, the ESTs were size selected, in distinct size ranges, varying from 0.3 to 1.5kb. The size selected fragments were cloned into pUC18, using standard methods. The plasmids were cloned into cells, and colonies were grown overnight in liquid media. PCR was carried out thereafter, and the PCR products were used for DNA sequencing.
All sequences were trimmed to exclude primer sequences and vector sequences, as well as low quality regions. The sequences were clustered, using "CAP3", in accordance with Huang, et al, Genome Res 9:868-877 (1999), incorporated by reference. Preliminary analysis of the sequences showed that 18% of the sequences were derived from rRNA and mtDNA transcript, or were composed almost entirely of repetitive sequences. These were excluded from further analysis.
The remaining sequences, when processed by CAP3, permitted assembly into 81,429 contigs. Of these, 1,181, or 1.45%, were found to match sequences on chromosome 22. This chromosome represents approximately 1.1% of the total human genome, and the result is consistent with the previously detected, high gene density in the chromosome, as reported by Craig, et al, Nat. Genet 7:376-382 (1994), Deloukas, et al, Science 282:744-746 (1998), and Saccone, et al, Gene 174:85-94 (1996). It is also consistent with the fact that chromosome 22 contains a number of highly expressed genes. Of the genes known to be found on chromosome 22, 66.6% are ranked in the top 10% of the most highly expressed human genes.
Alignment between the contigs and position of genes on chromosome 22 as described by Dunham, et al, Nature 402:489-495 (1999), led to identification of at least one contig for 162 of the 247 known genes on this chromosome.
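The proportions quoted for chromosome 22 follow directly from the counts given. The short sketch below simply redoes that arithmetic; the helper name percent is illustrative, and the gene coverage percentage it yields is derived here from the stated counts rather than quoted from the text.

```python
def percent(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Counts taken from the text above.
contig_share = percent(1_181, 81_429)   # contigs matching chromosome 22
gene_coverage = percent(162, 247)       # known genes with at least one contig
```

The first value rounds to 1.45%, as stated; the second comes to roughly 65.6% of the known genes on the chromosome.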
The foregoing examples disclose the invention, which is an improvement on the ORESTES method for identification of nucleotide sequences which correspond essentially in toto to coding regions or open reading frames of organisms. As shown, supra, the method involves forming a cDNA library by contacting a sample of mRNA with at least one arbitrary primer, at low stringency conditions, followed by reverse transcription. The resulting, single stranded cDNA is then amplified, with at least one arbitrary primer, at low stringency, to create a mini-library of cDNA. These nucleotide sequences are derived from internal, coding regions of mRNA. The resulting nucleic acid molecules are then sequenced. These can then be compared to a source of pre-existing sequence information, e.g., a nucleotide sequence library. The improvement comprises varying the annealing temperature for the amplification reaction, wherein a high annealing temperature is used, for a number of cycles, followed by a number of cycles with a low annealing temperature. Thus, pre-existing information which corresponds to internal mRNA sequences can be identified. Preferably, the method is applied to eukaryotes. The method as described herein is applicable to any organism, including single cell organisms such as yeast, parasites such as Plasmodium, and multicellular organisms. All plants and animals, including humans, can be studied in accordance with the methods described herein.
More specific approaches using the inventive method will be clear to the skilled artisan. For example, one can determine sequences associated with cancer via, e.g., carrying out the invention on a sample of cancer cells and corresponding normal cells, and then studying the resulting mini-libraries for differences there between. These differences can include expression of genes in cancer cells not expressed in normal cells, lack of expression of genes in cancer cells which are expressed in normal cells, as well as mutations in the genes.
In another embodiment of the invention, one can determine if and where variation occurs in the nucleotide sequences of an organism. This can be done by producing sequences from different sources of an organism. These different sources can be, e.g., cells taken from different tissues, different individual organisms, and so forth. Such an approach will identify polymorphisms among individuals and mutations present in specific pathological conditions, such as cancer. This approach can be accomplished using the "marked" primers as is described supra.
In addition to cancer, other pathological conditions can be studied. These conditions include not only mammalian conditions, such as diseases affecting humans, but also diseases of plants. Essentially, any scientific investigation which calls for analysis of a eukaryotic genome is facilitated by this aspect of the invention.
A second feature of the invention is a method for developing so-called "contig" sequences. These are nucleotide sequences which are generated by comparing sequences produced in accordance with this method to previously determined sequences, to determine if there is overlap. This is of interest because longer sequences define the target molecule with much greater accuracy. These contigs may be produced by comparing sequences developed in accordance with the method, as well as by comparing the sequences to pre-existing sequences in a databank. The aim is simply to find overlap between two sequences.
The power of the inventive method is such that there are innumerable applications. For example, it is frequently desirable to carry out analyses of populations of subjects. The invention can be used to carry out genetic analyses of large or small populations. Further, it can be used to study living systems to determine if, e.g., there have been genetic shifts which render an individual or population more or less likely to be afflicted with diseases such as cancer, to determine antibiotic resistance or non-tolerance, and so forth.
Studies on populations can also identify genes associated with diseases. Exemplary, but by no means exhaustive, of the types of conditions which can be studied are heart disease, bronchitis, Alzheimer's disease, diseases associated with particular human leukocyte antigens, autoimmune diseases, and so forth.
The invention can also be used in the study of congenital diseases, and the risk of affliction to a fetus, as well as the study of whether such conditions are likely to be passed to offspring via ova or sperm. Such analyses for pathological conditions can be carried out in all animals, plants, birds, fish, etc.
The invention, as discussed supra, is applicable to all eukaryotes, not just humans, and not just animals. In the area of agriculture, for example, the genomes of food crops can be studied to determine if resistance genes are present, have been incorporated into a genome following transfection, and so forth. Defects in plant genomes can also be studied in this way. Similarly, the method permits the artisan to determine when pathogens which integrate into the genome, such as retroviruses and other integrating viruses, such as influenza virus, have undergone shifts or mutations, which may require different approaches to therapy. This aspect of the invention can also be applied to eukaryotic pathogens, such as trypanosomes, different types of Plasmodium, and so forth.
The method described herein can also be applied to DNA directly. More specifically, there are organisms, such as particular types of bacteria, which are very difficult to culture. One can apply the inventions described herein to DNA of these or other bacteria directly, rather than to cDNA prepared from mRNA. Essentially, the methodology used is the same as the methodology described supra, except genomic DNA is used. In such a case, random fragments are produced, rather than ORF segments. Using PCR in this type of approach means that very small amounts of DNA are needed, hence difficulties in culture are avoided. It is estimated that less than one microgram of DNA would be necessary to sequence an entire genome of a prokaryote.
As was shown supra, methodologies where there are variations in the annealing temperature, especially those methods referred to generically as "touchdown PCR," are especially preferred. Variations in annealing temperature ranging from 72 °C to 30 °C, more preferably about 60 °C to 40 °C, are preferred. These so-called "touchdowns" are used at least once, more preferably up to 40 or 50 times. More preferably, they are used 1-30 times. Essentially, denaturing, annealing of abundant message, annealing of primers, and then primer extension constitute one cycle, and the temperature may vary at each of the steps.
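A touchdown annealing profile of the kind described above can be sketched, purely as an example, as a list of annealing temperatures that step down from a starting value to a final value over a chosen number of cycles. The linear decrement, the default values (60 °C to 40 °C over 21 cycles), and the function name `touchdown_schedule` are assumptions made for illustration; the disclosure specifies only the preferred temperature ranges and numbers of touchdowns.

```python
# Illustrative touchdown annealing schedule.
# Assumption: the temperature drops by an equal amount each cycle (linear ramp);
# the disclosure does not prescribe a particular decrement.

from typing import List


def touchdown_schedule(start_c: float = 60.0,
                       end_c: float = 40.0,
                       cycles: int = 21) -> List[float]:
    """Return one annealing temperature (in degrees C) per cycle, from start_c down to end_c."""
    if cycles < 1:
        raise ValueError("at least one cycle is required")
    if start_c < end_c:
        raise ValueError("start temperature must not be below end temperature")
    if cycles == 1:
        return [start_c]
    step = (start_c - end_c) / (cycles - 1)  # per-cycle decrement
    return [round(start_c - i * step, 2) for i in range(cycles)]


if __name__ == "__main__":
    for cycle, temp in enumerate(touchdown_schedule(), start=1):
        print(f"cycle {cycle:2d}: anneal at {temp:.1f} C")
```

The same function can be called with the broader range, for example `touchdown_schedule(72.0, 30.0, 30)`, to stay within the 1-30 touchdowns mentioned above.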
Other aspects of the invention will be clear to the skilled artisan and need not be set forth herein.
The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, it being recognized that various modifications are possible within the scope of the invention.

Claims

1. A method for determining open reading frames of the genome of an organism,
comprising:
(a) contacting messenger RNA from a cell of said organism with a single,
oligonucleotide primer at low stringency,
(b) preparing single stranded cDNA by reverse transcribing said messenger
RNA with said single, oligonucleotide primer,
(c) amplifying said single stranded cDNA with a second, single
oligonucleotide primer, to form an amplification product of nucleic acid molecules,
(d) sequencing the nucleic acid molecules of (c),
(e) repeating steps (a), (b) and (c) with a different pair of oligonucleotide
primers, and
(f) sequencing nucleic acid molecules produced in (e).
2. The method of claim 1, wherein the oligonucleotide primers of (a) & (c) are identical
to each other.
3. The method of claim 1, wherein the oligonucleotide primers of (a) & (c) differ from
each other.
4. The method of claim 1, wherein said organism is a eukaryote.
5. The method of claim 4, wherein said eukaryote is an animal.
6. The method of claim 5, wherein said animal is a mammal.
7. The method of claim 4, wherein said eukaryote is a human.
8. The method of claim 4, wherein said organism suffers from a pathological condition.
9. The method of claim 8, wherein said pathological condition is cancer.
10. The method of claim 9, wherein said cancer is colon cancer or breast cancer.
11. The method of claim 7, wherein said eukaryote is a multicellular organism.
12. The method of claim 4, wherein said eukaryote is not an animal.
13. The method of claim 12, wherein said eukaryote is a plant.
14. A method for determining that a known nucleotide sequence from a genome of an
organism corresponds to a nucleotide sequence of an open reading frame, comprising:
(a) contacting messenger RNA from a cell of said organism with at least one single stranded oligonucleotide primer, at low stringency,
(b) preparing single stranded cDNA by reverse transcribing said messenger RNA with said single, oligonucleotide primer,
(c) amplifying said single stranded cDNA with at least one, single stranded oligonucleotide primer, to form an amplification product comprising at least one nucleic acid molecule,
(d) sequencing said at least one nucleic acid molecule, and
(e) comparing the sequence determined in (d) to known nucleotide sequences for an organism from which said cell is taken to determine if any nucleotide sequences correspond to said at least one nucleic acid molecule, wherein any nucleotide sequences which do correspond are from an open reading frame.
15. The method of claim 14, wherein the oligonucleotide primers of (b) and (c) are identical to each other.
16. The method of claim 14, wherein the oligonucleotide primers of (b) and (c) differ from each other.
17. The method of claim 14, wherein said cell is a eukaryotic cell.
18. The method of claim 17, wherein said eukaryotic cell is an animal cell.
19. The method of claim 18, wherein said animal is a mammal.
20. The method of claim 17, wherein said eukaryotic cell is a human cell.
21. The method of claim 17, wherein said eukaryotic cell is associated with a pathological condition.
22. The method of claim 21, wherein said eukaryotic cell is a cancer cell.
23. The method of claim 22, wherein said cancer cell is a colon cancer cell or a breast cancer cell.
24. The method of claim 14, wherein said cell is a cell from a multicellular organism.
25. The method of claim 14, wherein said cell is a non-animal cell.
26. The method of claim 25, wherein said non-animal cell is a plant cell.
27. A method for preparing a contig, nucleic acid molecule from a genome of an organism, comprising:
(a) contacting messenger RNA from a cell with at least one oligonucleotide, at low stringency,
(b) preparing cDNA by reverse transcribing said messenger RNA with said single stranded oligonucleotide,
(c) amplifying said single stranded cDNA with at least one oligonucleotide primer to form an amplification product comprising at least one nucleic acid molecule,
(d) sequencing said at least one nucleic acid molecule,
(e) comparing the sequence of said at least one nucleic acid molecule to other nucleic acid molecules to determine any overlap there between, and
(f) constructing a contig nucleic acid molecule.
28. The method of Claim 27, wherein said cell is a eukaryotic cell.
29. The method of Claim 28, wherein said eukaryotic cell is an animal.
30. The method of Claim 29, wherein said animal is a mammalian cell.
31. The method of Claim 30, wherein said mammalian cell is a human cell.
32. The method of Claim 28, wherein said eukaryotic cell is a plant cell.
33. The method of Claim 27, comprising comparing said sequence and said at least one nucleic acid molecule electronically.
34. The method of claim 27, wherein the oligonucleotides of (a) & (c) are the same.
35. The method of claim 27, wherein the oligonucleotides of (a) & (c) differ from each other.
36. A method for sequencing all or part of a genome of an organism, comprising:
(a) contacting genomic DNA from a cell of said organism with a single oligonucleotide primer at low stringency, to generate a random set of nucleic acid molecules,
(b) amplifying said random set of nucleic acid molecules with a second oligonucleotide primer, to generate an amplification product,
(c) sequencing nucleic acid molecules in said amplification product,
(d) repeating steps (a), (b) and (c) with a different oligonucleotide primer, and
(e) sequencing nucleic acid molecules produced in (a).
37. The method of claim 36, wherein the oligonucleotide primers of (a) and (b) are identical to each other.
38. The method of claim 36, wherein said organism is a prokaryote.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US24631300P 2000-11-07 2000-11-07
US60/246,313 2000-11-07

Publications (2)

Publication Number Publication Date
WO2002074994A2 true WO2002074994A2 (en) 2002-09-26
WO2002074994A3 WO2002074994A3 (en) 2003-08-14

Family

ID=22930138

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/046665 WO2002074994A2 (en) 2000-11-07 2001-11-01 Improved orestes sequencing method

Country Status (1)

Country Link
WO (1) WO2002074994A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487985A (en) * 1990-10-15 1996-01-30 Stratagene Arbitrarily primed polymerase chain reaction method for fingerprinting genomes
WO2000031299A2 (en) * 1998-11-20 2000-06-02 Ludwig Institute For Cancer Research Method for determining nucleotide sequences using arbitrary primers and low stringency
WO2001051518A2 (en) * 2000-01-14 2001-07-19 Ludwig Institute For Cancer Research Nucleic acids encoding human semaphorin proteins and use thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DIAS NETO E ET AL: "Minilibraries constructed from cDNA generated by arbitrarily primed RT-PCR: an alternative to normalized libraries for the generation of ESTs from nanogram quantities of mRNA" GENE: AN INTERNATIONAL JOURNAL ON GENES AND GENOMES, ELSEVIER SCIENCE PUBLISHERS, BARKING, GB, vol. 186, 1997, pages 135-142, XP002138178 ISSN: 0378-1119 cited in the application *
EMMANUEL DIAS NETO ET AL: "SHOTGUN SEQUENCING OF THE HUMAN TRANSCRIPTOME WITH ORF EXPRESSED SEQUENCE TAGS" PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE. WASHINGTON, US, vol. 97, no. 7, 28 March 2000 (2000-03-28), pages 3491-3496, XP000996193 ISSN: 0027-8424 cited in the application *
RADELOFF ET AL: "PRESELECTION OF SHOTGUN CLONES BY OLIGONUCLEOTIDE FINGERPRINTING: AN EFFICIENT AND HIGH THROUGHPUT STRATEGY TO REDUCE REDUNDANCY IN LARGE-SCALE SEQUENCING PROJECTS" NUCLEIC ACIDS RESEARCH, OXFORD UNIVERSITY PRESS, SURREY, GB, vol. 26, no. 23, December 1998 (1998-12), pages 5358-5364, XP002103597 ISSN: 0305-1048 *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP