US20050282162A1 - Methods of validating snps and compiling libraries of assays - Google Patents

Methods of validating SNPs and compiling libraries of assays

Publication number
US20050282162A1
Authority
US
United States
Prior art keywords
snps
snp
various embodiments
nucleic acid
acid sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/502,761
Inventor
Francisco De La Vega
Janet Ziegle
Hadar Isaac
Charles Scafe
Eugene Spier
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Applied Biosystems LLC
Original Assignee
Applera Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Applera Corp
Priority to US10/502,761
Assigned to APPLERA CORPORATION reassignment APPLERA CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ZIEGLE, JANET S., SPIER, EUGENE G., SCAFE, CHARLES R., DE LA VEGA, FRANCISCO, WANG, YU N., ISSAC, HADAR I.
Assigned to APPLERA CORPORATION reassignment APPLERA CORPORATION CORRECTED RECORDATION TO CORRECT CONVEYING PARTY NAME (HADAR L. ISAAC) PREVIOUSLY RECORDED ON REEL 015605, FRAMES 0300-0308. Assignors: ZIEGLE, JANET S., SPIER, EUGENE G., SCAFE, CHARLES R., DE LA VEGA, FRANCISCO, WANG, YU N., ISAAC, HADAR I.
Publication of US20050282162A1
Assigned to APPLIED BIOSYSTEMS INC. reassignment APPLIED BIOSYSTEMS INC. CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: APPLERA CORPORATION
Assigned to APPLIED BIOSYSTEMS, LLC reassignment APPLIED BIOSYSTEMS, LLC MERGER (SEE DOCUMENT FOR DETAILS). Assignors: APPLIED BIOSYSTEMS INC.
Legal status: Abandoned

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/6858: Allele-specific amplification
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B01: PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01L: CHEMICAL OR PHYSICAL LABORATORY APPARATUS FOR GENERAL USE
    • B01L3/00: Containers or dishes for laboratory use, e.g. laboratory glassware; Droppers
    • B01L3/54: Labware with identification means
    • B01L3/545: Labware with identification means for laboratory containers
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N35/00: Automatic analysis not limited to methods or materials provided for in any single one of groups G01N1/00 - G01N33/00; Handling materials therefor
    • G01N35/00584: Control arrangements for automatic analysers
    • G01N35/00594: Quality control, including calibration or testing of components of the analyser
    • G01N35/00613: Quality control
    • G01N35/00663: Quality control of consumables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0621: Item configuration or customization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0633: Lists, e.g. purchase orders, compilation or processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0641: Shopping interfaces
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/22: Social work
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20: Screening of libraries
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10: Ontologies; Annotations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60: In silico combinatorial chemistry
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B01: PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01L: CHEMICAL OR PHYSICAL LABORATORY APPARATUS FOR GENERAL USE
    • B01L2300/00: Additional constructional details
    • B01L2300/02: Identification, exchange or storage of information
    • B01L2300/021: Identification, e.g. bar codes
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B01: PHYSICAL OR CHEMICAL PROCESSES OR APPARATUS IN GENERAL
    • B01L: CHEMICAL OR PHYSICAL LABORATORY APPARATUS FOR GENERAL USE
    • B01L3/00: Containers or dishes for laboratory use, e.g. laboratory glassware; Droppers
    • B01L3/54: Labware with identification means
    • B01L3/545: Labware with identification means for laboratory containers
    • B01L3/5453: Labware with identification means for laboratory containers for test tubes
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00: ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Definitions

  • Assays can include probes and/or primers that hybridize with a target nucleic acid sequence. These probes and primers can be useful for visualizing or amplifying the target nucleic acid sequence.
  • the target nucleic acid sequence can have a Single Nucleotide Polymorphism (SNP) contained therein.
  • SNP: Single Nucleotide Polymorphism (plural: SNPs)
  • SNPs are promising tools for mapping susceptibility mutations that contribute to complex diseases. Although most SNPs are neutral and do not affect phenotype, they can be used as surrogate markers for positional cloning of genetic loci, because of the allelic association, known as linkage disequilibrium (LD), that can be shared by groups of adjacent SNPs. LD is eroded by gene conversion and recombination, and the amount of LD depends on the age of the mutations and on the demographic history of the population. The extent of LD across a genomic region dictates the density of SNP markers necessary to ensure association between a marker and the causative allele sought.
  • LD: linkage disequilibrium
  • methods are provided for SNP validation that take into consideration a number of findings and statistics. Studies have reported a discontinuous structure in the patterns of LD across a set of regions sampled from the human genome, where long stretches of strong LD are punctuated by recombination hot-spots. These LD “blocks” show little evidence of historical recombination. According to various embodiments, these results are deconvoluted to predict that a reduced set of contiguous chromosomal segments, or haplotypes, exist in specific populations.
  • haplotype diversity is made up of only 4 to 6 so-called common haplotypes.
  • LD block patterns change depending on the population sampled because of historical differences; for example, populations that have experienced bottlenecks (e.g., Caucasians) show longer LD blocks and less evidence of historical recombination events, than other populations.
  • the haplotype diversity in a given population is typically constant in a given region irrespective of the number of SNPs sampled; therefore typing an arbitrarily large number of SNPs within an LD block is unnecessary. Selecting the minimum subset of SNPs within LD blocks, or any other discrete genetic locus, that enables discrimination of the common haplotypes present in a block without loss of information can be used to validate SNPs and/or to compile a concise library of assays useful for genetic analysis.
  • a method of compiling a library of polynucleotide data sets is provided, wherein the data sets can correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B).
  • the method can include the step of selecting for the library polynucleotide data sets that each correspond to a respective polynucleotide that contains a sequence that is complementary to a respective first allele included in each of the at least one target nucleic acid sequences, if, under a set of reaction conditions a number of parameters are met by each polynucleotide corresponding to the data sets included in the library.
  • the parameters can include: (1) the respective polynucleotide has a background signal value less than or equal to a first defined value, where the background signal value is a first normalized ratio of a fluorescence intensity of the respective polynucleotide reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions; (2) the respective polynucleotide has a signal generation value of greater than or equal to a second defined value, wherein the signal generation value is the difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity and (ii) the background signal value; (3) the respective polynucleotide has a specificity value of less than or equal to a third defined value, wherein the specificity value is the difference
  • a method of compiling a library of polynucleotide data sets is provided, wherein the data sets can correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B).
  • the method can include the step of determining a background signal value by calculating a first normalized ratio of a fluorescence intensity of a respective polynucleotide that contains a sequence that is complementary to a first allele included in the at least one target nucleic acid sequence, reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions.
  • the method can include the step of comparing a difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity, and (ii) the background signal value.
  • the method can include the step of comparing a difference between (i) a third normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value.
  • the method can include the step of determining whether at least one individual from a population of individuals has a genotype identifiable under the first conditions that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype.
  • the method can include the step of determining whether at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence.
  • Various combinations of the herein described method steps and/or parameters can be used.
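  • The three fluorescence-based parameters described above (background signal, signal generation, and specificity) can be illustrated with a minimal sketch. It assumes one passive-reference intensity shared by all three readings, and takes the specificity value as the difference between the third normalized ratio and the background signal value, as in the comparison step above. The names `AssayReadings`, `assay_metrics`, and `passes`, and the three threshold arguments, are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class AssayReadings:
    """Raw fluorescence intensities for one candidate polynucleotide (hypothetical container)."""
    probe_no_target: float      # probe + first assay reactants, target sequence absent
    probe_with_target: float    # probe + first assay reactants, target (first allele) present
    probe_second_allele: float  # probe + second assay reactants carrying the second allele
    passive_reference: float    # passive-reference dye, same excitation conditions (assumed shared)


def assay_metrics(r: AssayReadings) -> dict:
    """Normalize each reading to the passive-reference dye and derive the three values."""
    if r.passive_reference <= 0:
        raise ValueError("passive-reference intensity must be positive")
    background = r.probe_no_target / r.passive_reference
    signal = r.probe_with_target / r.passive_reference - background
    specificity = r.probe_second_allele / r.passive_reference - background
    return {"background": background, "signal": signal, "specificity": specificity}


def passes(m: dict, max_background: float, min_signal: float, max_specificity: float) -> bool:
    """Apply the first, second, and third defined values (thresholds are caller-supplied)."""
    return (m["background"] <= max_background
            and m["signal"] >= min_signal
            and m["specificity"] <= max_specificity)


# Illustrative numbers only; they are not measurements from the source.
metrics = assay_metrics(AssayReadings(2.0, 10.0, 3.0, 4.0))
accepted = passes(metrics, max_background=0.6, min_signal=1.5, max_specificity=0.3)
```

  • Keeping the metric calculation separate from the threshold check lets the same readings be re-evaluated against different defined values without recomputing the ratios.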
  • a method of confirming the existence of a SNP can include the step of identifying a location corresponding to a possible SNP in a polynucleotide in a first collection of data sets containing information on genomic deoxyribonucleic acid (DNA) samples in the form of data sets corresponding to polynucleotides.
  • the method can include the step of confirming the existence of the SNP if at least one condition is met. A condition can be met if a second collection of data sets containing information on genomic deoxyribonucleic acid (DNA) samples contains information that identifies the location as containing the possible SNP.
  • a condition can be met, for example, if at least two data sets from the first collection of data sets contain information corresponding to a minor allele of the possible SNP at the location, wherein the at least two data sets represent genomic deoxyribonucleic acid (DNA) samples obtained from two independent sources.
  • a condition can be met if a data set in a third collection of data sets corresponds to a consensus sequence of genomic deoxyribonucleic acid (DNA) samples that contains the minor allele of the possible SNP.
  • the source of the consensus sequence of genomic deoxyribonucleic acid (DNA) samples, and the sources of the genomic deoxyribonucleic acid (DNA) samples from the first collection of data sets, can be independent.
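  • The confirmation conditions above can be read as a logical OR over three checks. The following is a minimal sketch under stated assumptions: each data set is reduced to a hypothetical `Observation` record (location key, allele, source identifier), "independent sources" is modeled simply as distinct source identifiers, and the consensus condition is applied only when its source does not appear in the first collection. None of these names or encodings come from the source.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Observation:
    location: str  # key for the candidate SNP position (format is an assumption)
    allele: str    # allele reported at that location
    source: str    # identifier of the DNA source behind the data set


def confirm_snp(location: str,
                minor_allele: str,
                first_collection: Iterable[Observation],
                second_collection_snp_locations: Set[str],
                third_collection_consensus: Iterable[Observation]) -> bool:
    """Return True if at least one of the three confirmation conditions is met."""
    first = list(first_collection)

    # Condition 1: a second collection already lists this location as a possible SNP.
    in_second = location in second_collection_snp_locations

    # Condition 2: the minor allele is seen in at least two data sets of the
    # first collection that come from independent (here: distinct) sources.
    minor_sources = {o.source for o in first
                     if o.location == location and o.allele == minor_allele}
    two_independent = len(minor_sources) >= 2

    # Condition 3: a consensus sequence in a third collection carries the minor
    # allele, and its source is independent of the first collection's sources.
    first_sources = {o.source for o in first}
    consensus_hit = any(o.location == location
                        and o.allele == minor_allele
                        and o.source not in first_sources
                        for o in third_collection_consensus)

    return in_second or two_independent or consensus_hit


# Illustrative records only.
first = [Observation("chr1:100", "T", "donorA"),
         Observation("chr1:100", "T", "donorB"),
         Observation("chr1:200", "G", "donorA")]
```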
  • a library contains data corresponding to respective oligonucleotides that can function as assays to detect Single Nucleotide Polymorphisms (SNPs).
  • the library can have a number of data sets corresponding to not more than a sufficient number of oligonucleotides necessary to provide a collection of assays that provides a maximum statistical loss of a defined percentage of haplotype diversity across a human genome.
  • an algorithm to select the minimal subset of SNPs required for capturing the diversity of haplotype blocks or other genetic loci is provided.
  • the algorithm can be used to quickly select the minimum SNP subset with no loss of haplotype information.
  • the algorithm can be used in a more aggressive mode to further reduce the original SNP set, with minimal loss of information.
  • FIGS. 1a-b are graphs of SNPs per LD block versus minimum information SNP subset for African-American and Caucasian populations, respectively;
  • FIGS. 2a-2e are schematic diagrams of quenchable dyes that can be part of a mixture of reagents provided and/or used according to various embodiments;
  • FIG. 3 is a workflow diagram according to various embodiments;
  • FIG. 4 is a graph showing visualized assay results, according to various embodiments;
  • FIG. 5 is a flowchart showing an algorithm according to various embodiments;
  • FIG. 6 is a flowchart showing an algorithm according to various embodiments;
  • FIG. 7 is an illustration of SNPs selected by a method according to various embodiments;
  • FIG. 8 shows a table of SNPs and genes on a hypothetical chromosome;
  • FIG. 9 is a histogram of gene lengths of the hypothetical chromosome of FIG. 8;
  • FIG. 10 is a histogram of the specified maximum distance between adjacent SNPs;
  • FIG. 11 is a histogram of actual maximum distance between adjacent SNPs;
  • FIG. 12 is a histogram of total selected SNPs per gene; and
  • FIG. 13 is a histogram of the number of newly identified SNPs selected per gene.
  • nucleic acid analogs can be used in addition to or instead of nucleic acids.
  • nucleic acid analogs can include the family of peptide nucleic acids (PNA), wherein the sugar/phosphate backbone of DNA or RNA has been replaced with acyclic, achiral, and neutral polyamide linkages.
  • PNA: peptide nucleic acid
  • a probe or primer can have a PNA polymer instead of a DNA polymer.
  • the 2-aminoethylglycine polyamide linkage, with nucleobases attached to the linkage through an amide bond, can be used as a PNA and has been shown to possess exceptional hybridization specificity and affinity.
  • An example of a PNA is as shown below in a partial structure with a carboxyl-terminal amide:
  • Nucleobase as used herein means any nitrogen-containing heterocyclic moiety capable of forming Watson-Crick hydrogen bonds in pairing with a complementary nucleobase or nucleobase analog, e.g. a purine, a 7-deazapurine, or a pyrimidine.
  • Typical nucleobases are the naturally occurring nucleobases such as, for example, adenine, guanine, cytosine, uracil, thymine, and analogs of the naturally occurring nucleobases, e.g.
  • Nucleoside refers to a compound consisting of a nucleobase linked to the C-1′ carbon of a sugar, such as, for example, ribose, arabinose, xylose, and pyranose, in the natural β or the α anomeric configuration.
  • the sugar can be substituted or unsubstituted.
  • Substituted ribose sugars can include, but are not limited to, those riboses having one or more of the carbon atoms, for example, the 2′-carbon atom, substituted with one or more of the same or different Cl, F, -R, -OR, -NR₂ or halogen groups, where each R is independently H, C₁-C₆ alkyl or C₅-C₁₄ aryl.
  • Ribose examples can include ribose, 2′-deoxyribose, 2′,3′-dideoxyribose, 2′-haloribose, 2′-fluororibose, 2′-chlororibose, and 2′-alkylribose, e.g. 2′-O-methyl, 4′-α-anomeric nucleotides, 1′-α-anomeric nucleotides, 2′-4′- and 3′-4′-linked and other “locked” or “LNA”, bicyclic sugar modifications.
  • Exemplary LNA sugar analogs within a polynucleotide can include the following structures: where B is any nucleobase.
  • Sugars can have modifications at the 2′- or 3′-position such as methoxy, ethoxy, allyloxy, isopropoxy, butoxy, isobutoxy, methoxyethyl, alkoxy, phenoxy, azido, amino, alkylamino, fluoro, chloro and bromo.
  • Nucleosides and nucleotides can have the natural D configurational isomer (D-form) or the L configurational isomer (L-form).
  • when the nucleobase is a purine, e.g. adenine or guanine, the ribose sugar is attached to the N9-position of the nucleobase.
  • when the nucleobase is a pyrimidine, e.g. cytosine, uracil, or thymine, the pentose sugar is attached to the N1-position of the nucleobase.
  • Nucleotide refers to a phosphate ester of a nucleoside and can be in the form of a monomer unit or within a nucleic acid.
  • Nucleotide 5′-triphosphate refers to a nucleotide with a triphosphate ester group at the 5′ position, and can be denoted as “NTP”, or “dNTP” and “ddNTP” to particularly point out the structural features of the ribose sugar.
  • the triphosphate ester group can include sulfur substitutions for the various oxygens, e.g. α-thio-nucleotide 5′-triphosphates.
  • polynucleotide and oligonucleotide mean single-stranded and double-stranded polymers of, for example, nucleotide monomers, including 2′-deoxyribonucleotides (DNA) and ribonucleotides (RNA) linked by internucleotide phosphodiester bond linkages, e.g. 3′-5′ and 2′-5′, inverted linkages, e.g. 3′-3′ and 5′-5′, branched structures, or internucleotide analogs.
  • DNA: 2′-deoxyribonucleotides
  • RNA: ribonucleotides
  • Polynucleotides can have associated counter ions, such as H⁺, NH₄⁺, trialkylammonium, Mg²⁺, Na⁺ and the like.
  • a polynucleotide can be composed entirely of deoxyribonucleotides, entirely of ribonucleotides, or chimeric mixtures thereof.
  • Polynucleotides can be comprised of internucleotide, nucleobase and sugar analogs.
  • a polynucleotide or oligonucleotide can be a PNA polymer.
  • Polynucleotides can range in size from a few monomeric units, e.g.
  • nucleotides are in 5′ to 3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted.
  • Internucleotide analog as used herein means a phosphate ester analog or a non-phosphate analog of a polynucleotide.
  • Phosphate ester analogs can include: (i) C₁-C₄ alkylphosphonate, e.g. methylphosphonate; (ii) phosphoramidate; (iii) C₁-C₆ alkyl-phosphotriester; (iv) phosphorothioate; and (v) phosphorodithioate.
  • Non-phosphate analogs can include compounds wherein the sugar/phosphate moieties are replaced by an amide linkage, such as a 2-aminoethylglycine unit, commonly referred to as PNA.
  • Heterozygous as used herein means both members of a pair of alleles of a gene are present in a sample obtained from a single source, wherein a gene can have two alleles due to, for example, the fusion of two dissimilar gametes with respect to the gene.
  • Heterozygous assay as used herein means an assay adapted to identify the allelic state of a gene having one or both members of a pair of alleles.
  • Homozygous as used herein means one member of a pair of alleles is present in a sample obtained from a single source, wherein a gene can have one allele due to, for example, the fusion of two identical gametes with respect to the gene.
  • “Homozygous assay” as used herein means an assay adapted to identify only one of two possible allelic states of a gene having one or both members of a pair of alleles.
  • Lossy as used herein means the loss of haplotype diversity in a linkage disequilibrium block.
  • “Lossless” as used herein means that there is no loss of haplotype diversity in a linkage disequilibrium block.
  • a library of assays can be provided.
  • the library of assays can have from about 100,000 to about 500,000 polynucleotides, for example, about 150,000 to about 250,000 polynucleotides.
  • a library of data sets can be provided.
  • the library of data sets can have from about 100,000 to about 500,000 data sets, for example, about 150,000 to about 250,000 data sets.
  • an algorithm can select a minimum subset of SNPs without loss of haplotype information, or an even smaller subset with some acceptable loss of information.
  • a SNP set was reduced by 18% for an African American population and by 32% for a Caucasian population with no loss of haplotype distribution information.
  • the algorithm can produce optimal results in a reasonable time.
  • the algorithm can allow for the real-time calculation of minimum SNP subsets for haplotype blocks.
  • family relationships of the DNA donors can be used to increase haplotype inference accuracy.
  • the Expectation-Maximization algorithm introduced by Excoffier and Slatkin can be accurate, especially in regions of low diversity.
  • the analysis of haplotype distributions in genetic studies aimed at finding susceptibility mutations in case-control populations can be useful in finding associations. Therefore, haplotype inference can be used in disease and pharmacogenomic studies.
  • a probability vector P of length M can be defined, where P_i is the relative frequency of the i-th haplotype.
  • a haplotype/SNP allele state matrix A of N columns and M rows is defined, wherein A_ij (the i-th row of the j-th column of the matrix) indicates the allele state (‘1’ or ‘2’) of the j-th SNP for the i-th haplotype.
  • the algorithm can use other measures of information.
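  • As one possible in-memory representation of the vector P and matrix A defined above, the following minimal sketch stores A as M rows (haplotypes) by N columns (SNPs). The class name `HaplotypeBlock`, its validation rules, and the example frequencies are assumptions made for illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HaplotypeBlock:
    """Hypothetical container for the allele state matrix A and frequency vector P."""
    A: List[List[int]]  # M rows (haplotypes) x N columns (SNPs); entries are 1 or 2
    P: List[float]      # P[i - 1] is the relative frequency of the i-th haplotype

    def __post_init__(self) -> None:
        if len(self.P) != len(self.A):
            raise ValueError("P needs one frequency per haplotype row of A")
        if len({len(row) for row in self.A}) > 1:
            raise ValueError("every haplotype row must cover the same N SNPs")
        if any(a not in (1, 2) for row in self.A for a in row):
            raise ValueError("allele states must be 1 or 2")
        if any(p < 0 or p > 1 for p in self.P):
            raise ValueError("relative frequencies must lie in [0, 1]")

    @property
    def shape(self) -> Tuple[int, int]:
        """(M, N): number of haplotypes and number of SNPs."""
        return len(self.A), (len(self.A[0]) if self.A else 0)

    def allele(self, i: int, j: int) -> int:
        """Allele state of the j-th SNP for the i-th haplotype (1-based, as in the text)."""
        return self.A[i - 1][j - 1]


# Two haplotypes over four SNPs; the frequencies are made up for the example.
block = HaplotypeBlock(A=[[1, 1, 1, 2], [2, 2, 1, 1]], P=[0.6, 0.4])
```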
  • the algorithm can consist of two phases (phases I and II). These phases can be performed sequentially.
  • the operations can be outlined in lossless mode as follows.
  • in phase I, any column that is identical to another column, or is the exact opposite of another column, can be eliminated.
  • a column in a matrix that is identical to another column can represent a SNP that behaves identically to another SNP for all tested samples.
  • the redundant SNP will not provide any additional information.
  • a column is the exact opposite of another column in the matrix, this represents a SNP where the behavior can always be predicted from the behavior of another SNP simply by inverting it. Therefore, according to such embodiments, this SNP will not provide new information.
  • after phase I, it can be assumed that the N columns of matrix A have been reduced to N′ unique columns, where N′ ≤ N.
  • any column whose elimination does not reduce the number of unique rows can be eliminated.
  • Each row in a matrix can represent the allelic states of the SNPs for a specific haplotype. Removing a “useful” SNP can eliminate the ability to detect at least one haplotype.
  • two or more haplotypes can register the same allelic state at the remaining SNPs, thereby reducing the number of unique rows. Therefore, if the elimination of a column does not reduce the number of unique rows, it can be omitted.
  • Phase I can be a “sub-set” of phase II, in the sense that if phase I is skipped, phase II can eliminate the SNPs that phase I would have eliminated. Phase I can be computationally easier to perform than phase II, for example, in lossy mode. Therefore it can be more efficient to begin with phase I.
  • Example 1 illustrates a method according to various embodiments, in lossless mode.
  • SNPs and four haplotypes that yield the following allelic responses are illustrated in Table 1.
  • TABLE 1. Haplotype/SNP Allele State Matrix

                   SNP 1   SNP 2   SNP 3   SNP 4
    Haplotype 1      1       1       1       2
    Haplotype 2      2       2       1       1
    Haplotype 3      2       2       2       1
    Haplotype 4      1       2       2       2
  • the fourth column is the exact opposite of the first column. This implies that either SNP 4 or SNP 1 is redundant. If SNP 4 is removed from the SNP set, no information is lost. When SNP 1 registers allele “1”, the state of SNP 4 is known to be allele “2”, and conversely, when SNP 1 registers allele “2”, the state of SNP 4 is known to be allele “1”. Removing SNP 4 leaves the matrix seen in Table 2.

    TABLE 2. Haplotype/SNP Allele State Matrix after Phase I

                   SNP 1   SNP 2   SNP 3
    Haplotype 1      1       1       1
    Haplotype 2      2       2       1
    Haplotype 3      2       2       2
    Haplotype 4      1       2       2
  • phase I is complete.
  • Table 3 depicts the three remaining matrices, following the removal of SNP 1 , SNP 2 , or SNP 3 , respectively.
  • the first and the third matrices only have three unique rows, whereas the second matrix has four unique rows.
  • SNP 2 can be eliminated with no loss of haplotype detection.
  • the set {SNP 1, SNP 3} can provide the same haplotype detection ability as the full set {SNP 1, SNP 2, SNP 3, SNP 4}.
  • each phase can cause the elimination of exactly one SNP.
  • each phase can result in the elimination of multiple SNPs or no SNPs.
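  • The two lossless phases of Example 1 can be sketched in Python as follows. This is a minimal illustration, not the MATLAB implementation mentioned elsewhere in this description: the function names are hypothetical, columns are indexed from 0, and phase II is written as a simple greedy loop that drops one column at a time, which is an assumption and is not guaranteed to find the smallest possible set for every matrix.

```python
def invert(col):
    """Swap allele states 1 <-> 2 in a column."""
    return tuple(3 - a for a in col)


def phase1(matrix):
    """Keep the first of any group of columns that are identical or exact opposites."""
    kept, seen = [], set()
    for j in range(len(matrix[0])):
        col = tuple(row[j] for row in matrix)
        if col in seen or invert(col) in seen:
            continue  # redundant SNP: adds no new information
        seen.add(col)
        kept.append(j)
    return kept


def unique_rows(matrix, cols):
    """Number of haplotypes that can be told apart using only `cols`."""
    return len({tuple(row[j] for j in cols) for row in matrix})


def phase2_lossless(matrix, cols):
    """Greedily drop any column whose removal keeps the number of unique rows."""
    cols = list(cols)
    target = unique_rows(matrix, cols)
    changed = True
    while changed:
        changed = False
        for j in list(cols):
            rest = [c for c in cols if c != j]
            if rest and unique_rows(matrix, rest) == target:
                cols = rest
                changed = True
                break
    return cols


# Allele states from Table 1 (rows = haplotypes 1-4, columns = SNPs 1-4).
TABLE_1 = [
    (1, 1, 1, 2),
    (2, 2, 1, 1),
    (2, 2, 2, 1),
    (1, 2, 2, 2),
]

after_phase1 = phase1(TABLE_1)
minimal = phase2_lossless(TABLE_1, after_phase1)
print([f"SNP {j + 1}" for j in after_phase1])
print([f"SNP {j + 1}" for j in minimal])
```

  • On the Table 1 data, phase I drops SNP 4 and phase II drops SNP 2, leaving SNP 1 and SNP 3, in agreement with the example.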
  • the retained SNP set can be optimized to minimize the loss of haplotype detection.
  • Phase I can remain unchanged and phase II can select the optimal SNPs to eliminate.
  • the entropy H for the resulting P is computed.
  • the selection with the highest H can be chosen as the best selection.
  • N′ − k columns can be eliminated.
  • the resulting matrix (with k columns) can have fewer unique rows than the full matrix (with N′ columns).
  • the relative frequency (probability) of a “major” haplotype is equal to the sum of the frequencies of the “minor” haplotypes.
  • the repeating rows can be combined into a single row, and their respective probabilities can be summed to form a new probability.
  • the vector P can be shorter and can have larger numbers. This can reduce the value of the entropy, H.
  • the combination with the smallest reduction of entropy can be deemed the optimal selection.
  • k SNPs can be used with no loss of information, as in Example 1.
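  • The lossy selection described above can be sketched as follows: for each candidate set of k columns, rows that become identical are merged, their frequencies are summed, and the entropy of the merged vector P is computed; the selection with the highest entropy is kept. The sketch assumes Shannon entropy with a base-2 logarithm (the text does not fix the base), uses hypothetical function names, and applies hypothetical haplotype frequencies to the Table 2 matrix purely for illustration.

```python
from itertools import combinations
from math import log2


def entropy(probs):
    """Shannon entropy in bits (the base-2 logarithm is an assumption)."""
    return -sum(p * log2(p) for p in probs if p > 0)


def merged_probs(matrix, probs, cols):
    """Combine haplotypes that look identical on `cols` and sum their frequencies."""
    merged = {}
    for row, p in zip(matrix, probs):
        key = tuple(row[j] for j in cols)
        merged[key] = merged.get(key, 0.0) + p
    return list(merged.values())


def best_subset(matrix, probs, k):
    """Return the k-column selection (0-based indices) with the highest entropy."""
    n_cols = len(matrix[0])
    return max(
        combinations(range(n_cols), k),
        key=lambda cols: entropy(merged_probs(matrix, probs, cols)),
    )


# Matrix from Table 2; the frequencies below are hypothetical.
TABLE_2 = [(1, 1, 1), (2, 2, 1), (2, 2, 2), (1, 2, 2)]
FREQS = [0.4, 0.3, 0.2, 0.1]

for k in (1, 2):
    cols = best_subset(TABLE_2, FREQS, k)
    h = entropy(merged_probs(TABLE_2, FREQS, cols))
    print(k, [f"SNP {j + 1}" for j in cols], round(h, 3))
```

  • The exhaustive search over combinations grows quickly with the number of columns, which is why pruning with phase I first can matter in practice.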
  • Example 2 illustrates the lossy mode using an LD block that was discovered using the Caucasian population panel on Chromosome 6, overlapping the human gene TTK (RefSeq ID NM_003318, Celera ID hCG401205).
  • the block consists of 17 SNPs, and the EM algorithm inferred 8 haplotypes, with two major ones: haplotype 2 and haplotype 7 with frequencies of approximately 43% and 33%, respectively. The remaining 24% of the diversity is spread among the remaining 6 haplotypes.
  • Table 4 summarizes the allelic states of the 17 SNPs, as well as the respective probability, for each of the 8 haplotypes.
  • TABLE 4: Original Haplotype/SNP Allele State Matrix (table body not recovered)
  • the optimal selection of 3 SNPs causes haplotypes 3 and 4 to merge and causes haplotypes 6, 7, and 8 to merge, with a total loss of 9.2% of the original entropy.
  • the optimal single SNP is SNP 16. With SNP 16 alone, the detection ability is reduced to “haplotype 2” or “other.” Since haplotype 2 is the most common, with 43.2% of the frequency, SNP 16 would be the most useful choice if only a single SNP were chosen.
  • TABLE 6: Lossy Min. SNP Set Example (table body not recovered)
  • genotyping data was used from 11,160 SNPs distributed in a gene-centric fashion across chromosomes 6, 21, and 22, with intragenic spacing averaging 12 Kb, 8 Kb, and 9 Kb, respectively.
  • the SNPs were scored with 5′ nuclease assays including TAQMAN-MGB probes from Applied Biosystems' Assays-on-Demand™ SNP Genotyping Products (Foster City, Calif., USA).
  • the samples typed included 45 African-American and 45 Caucasian DNAs from the Coriell Human Diversity Collection available from Coriell Institute for Medical Research, Camden, N.J., USA.
  • LD blocks and haplotypes were computed independently for each population using methods described in Abecasis, et al., Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97-101 (2002) and Gabriel et al., The structure of haplotype blocks in the human genome Science 296:2225-2229 (2002), both of which are herein incorporated in their entireties by reference. Only blocks of 3 or more SNPs were considered. Therefore, only 4,864 SNPs were used for the African-American population and 7,347 SNPs were used for the Caucasian population. The Caucasian population is known, in general, to have more and longer LD blocks. The algorithm was implemented in MATLAB v 6.1, available from The MathWorks Inc., Natick, Mass., USA, without further optimization. The computations were completed on a 700 MHz PC in less than 1 minute.
  • Table 5 summarizes the results after applying the algorithm to the haplotype blocks detected in data for chromosomes 6, 21, and 22.
  • the African-American population panel is denoted by ‘A’ and the Caucasian population panel is denoted by ‘C’.
  • TABLE 7: Results Summary, by chromosome (Chr.) and population (Pop.) (table body not recovered)
  • FIG. 1 illustrates the relationship between the original number of SNPs in an LD block (horizontal axis) and the minimum number of SNPs required to genotype the LD block with no loss of information (vertical axis).
  • the thickness of the ‘x’ corresponds to the number of different blocks found in chromosome 6 with the same properties.
  • The method described by Judson et al., How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 3:379 (2002) is essentially equivalent to phase II of the lossy version of various embodiments, except that the algorithm is limited to k ≤ 11. This is expected, because without the efficient pruning of SNPs performed by phase I, the exponential nature of phase II can result in practically infinite execution time. For example, the largest block found in the above examples consisted of 22 SNPs. In previous efforts, comparing sub-sets of 1 to 11 out of 22 SNPs required examining over 2.4 million combinations. Even after that computation, the optimal solution is not assured since it is only a local optimum.
  • phase II can find the global optimum (3 SNPs in lossless mode or 2 SNPs with less than 10% loss) in 15 comparisons only.
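  • The combination counts quoted above can be checked with a short calculation. The helper name is hypothetical. The second call assumes that phase I reduced the 22-SNP block to 4 unique columns; the text does not state that number, but it is the value for which all non-empty subsets add up to 15 comparisons.

```python
from math import comb


def subsets_up_to(n_snps, k_max):
    """Number of SNP subsets of size 1..k_max drawn from n_snps SNPs."""
    return sum(comb(n_snps, k) for k in range(1, k_max + 1))


# Subsets of 1 to 11 SNPs out of the 22-SNP block, without phase I pruning.
print(subsets_up_to(22, 11))  # 2449867, i.e. "over 2.4 million"

# All non-empty subsets of 4 columns (assumed size after phase I).
print(subsets_up_to(4, 4))  # 15
```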
  • the premise of the algorithm is that the DNA sample size for each population is large enough so that the inferred haplotypes adequately represent reality. There can be a risk that a SNP whose behavior is identical to another SNP (and thus deemed worthless in terms of new information) for the sample size used, could differentiate an additional haplotype inferred with a larger sample size. This risk can be lower for common haplotypes. Rare haplotypes can harbor a causative mutation and can be present in higher frequency in some cases. Experimental errors can eliminate data points and thus can render suboptimal the minimum SNP subset. Therefore, additional SNPs can enhance the minimum SNP subset to enhance robustness.
  • SNPs putative single nucleotide polymorphisms
  • databases containing information on SNPs can be used for conducting genetic studies. Putative SNPs can be validated and can be assembled into a standardized SNP marker map or a database containing data sets corresponding to the standardized SNP marker map. SNP information can be easily accessible and standardized assay reagents can be developed, validated, and made available to enable high throughput and automation, for example, to screen many SNPs on many individuals.
  • a reference SNP database can be produced from SNP and genomic information from both proprietary and public databases.
  • the database can be used for linkage disequilibrium (LD) mapping and can be used to provide, for example, validated, ready-to-use assays and reagents.
  • the database can provide high-density coverage of any known gene regions and can enable easier and more affordable candidate gene association studies and candidate region association studies.
  • researchers can select SNPs across candidate genes or chromosomal regions that are most suitable for a given study, and can quickly translate that information into practice by, for example, directly obtaining assay protocols and reagents for those SNPs.
  • a “core” set of SNPs and the associated assay reagents can be compiled into a database or library and expanded or refined as additional information becomes available, such as haplotype definition for some or all of the genome.
  • the SNP database can also be used to compile a fixed set of chromosome-based assays for cost-efficient whole genome association (WGA) studies using, for example, oligonucleotide ligation assay (OLA) PCR Bead Array systems for ultra-high throughput genotyping.
  • Linkage disequilibrium is the non-random association of alleles in a chromosomal segment, and can be the basis of all genetic mapping. Selecting SNPs as genetic markers for LD studies involves considering all genetic and assay-specific technical factors that affect the ability to find association between a marker and the susceptibility mutations being mapped.
  • the extent of LD across a genomic region can dictate the SNP density necessary to ensure association between a marker and the allele sought.
  • Early attempts to model the extent of LD predicted very short LD of only a few kilobases (kb). However, recent empirical surveys report average LD levels between 5 kb and 60 kb, and extending up to hundreds of kb, which implies that the number of SNPs required for WGA studies could range from 50,000 to 250,000, and that markers spaced by tens of kb will suffice for candidate gene studies.
  • Common SNPs are the most likely to be useful for LD studies across more than one population since they represent ancient mutations that arose before ethnic group segregation. Simulation studies suggest that common SNPs are more likely than coding SNPs (cSNPs) to be in LD with a given causative allele regardless of whether the allele is present at low or high frequency.
  • common SNPs can be used to assemble a database in a hybrid gene-based approach.
  • SNPs can be considered “common” when the minor allele frequency is, for example, greater than 15% in at least one of the populations used for validation.
  • a gene list can include 25,083 gene regions derived by Celera Genomics.
  • a gene region can be defined as bounded by the first and last transcribed base, including untranslated regions, plus 10 kilobases (kb) upstream and downstream to account, for example, for uncharacterized exons and regulatory regions.
  • SNPs can be selected within gene regions at an average density of, for example, one SNP per 10 kb, such that the map can resemble a gene-focused picket fence. Density for specific regions can be adjusted as data on recombination and LD extent emerges. Additional SNPs in intergenic regions, such as, for example, non-coding regions of homology between mouse and human, can be added to a database or library.
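  • The gene-region definition and the target density above can be expressed as a small calculation. This is a sketch under stated assumptions: coordinates are treated as 1-based and inclusive, the function names and the example coordinates are hypothetical, and the SNP count is only the number implied by an average of one SNP per 10 kb, not a selection procedure.

```python
from math import ceil

FLANK_BP = 10_000           # 10 kb upstream and downstream
TARGET_SPACING_BP = 10_000  # average of one SNP per 10 kb


def gene_region(first_transcribed, last_transcribed, flank=FLANK_BP):
    """Extend the transcribed interval by `flank` bases on both sides."""
    start = max(1, first_transcribed - flank)
    return start, last_transcribed + flank


def target_snp_count(region, spacing=TARGET_SPACING_BP):
    """Number of SNPs implied by the target average density for a region."""
    start, end = region
    return max(1, ceil((end - start + 1) / spacing))


# Hypothetical gene transcribed from base 150,000 to base 185,000.
region = gene_region(150_000, 185_000)
print(region, target_snp_count(region))
```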
  • obtaining a validated SNP assay at the end of a process of validating possible SNPs can be enhanced by defining prioritization criteria.
  • One criterion that can be used to validate a possible SNP is evidence of independent discovery of the minor allele.
  • a data set corresponding to a possible SNP can be cross-referenced against data available in public sources.
  • observation of the minor allele in genotypes of two independent donors can be used to validate the SNP.
  • a SNP database obtained using the above criteria can include about one million data sets corresponding to SNPs.
  • One step can include identifying a location corresponding to a possible SNP in a polynucleotide in a first collection of data sets.
  • the first collection can contain information on genomic deoxyribonucleic acid (DNA) samples in the form of data sets corresponding to polynucleotides.
  • Another step can include confirming the existence of the SNP if at least one of a number of conditions is present or met.
  • a condition can be that a second collection of data sets containing information on genomic DNA samples contains information that identifies the location as containing the possible SNP.
  • a condition can be that at least two data sets from the first collection of data sets contain information corresponding to a minor allele of the possible SNP at the location.
  • the at least two data sets representing genomic DNA samples are obtained from two independent sources.
  • a condition can be that a data set that corresponds to a consensus sequence of genomic DNA samples in a third collection of data sets has the minor allele of the possible SNP.
  • the source of the genomic DNA of the consensus sequence and the sources of the genomic deoxyribonucleic acid (DNA) samples from the first collection of data sets are independent.
  • the third database of genomic DNA samples can be, for example, a public database of the Human Genome Project.
  • the first database of at least one genomic DNA sample can be, for example, a proprietary database of the Human Genome Project.
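  • The confirmation logic above, in which a possible SNP is confirmed if at least one of the listed conditions is met, can be sketched as follows. The class, its field names, and the function name are hypothetical; how each piece of evidence is extracted from the collections of data sets is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class SnpEvidence:
    # A second collection of data sets also reports a SNP at this location.
    in_second_collection: bool
    # Number of independent sources in the first collection showing the minor allele.
    independent_minor_allele_sources: int
    # An independent consensus sequence (third collection) carries the minor allele.
    independent_consensus_has_minor_allele: bool


def is_confirmed(evidence: SnpEvidence) -> bool:
    """Confirm the SNP if at least one of the three conditions holds."""
    return (
        evidence.in_second_collection
        or evidence.independent_minor_allele_sources >= 2
        or evidence.independent_consensus_has_minor_allele
    )


print(is_confirmed(SnpEvidence(False, 2, False)))  # two independent donors
print(is_confirmed(SnpEvidence(False, 1, False)))  # single observation only
```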
  • a multi-step, high-throughput assay design pipeline can be provided to ensure optimum performance of assays.
  • the methods provided can enable automation, minimize assay failure, and ensure compatibility of the SNP sequence with, for example, TAQMAN probe-based 5′ nuclease chemistry, available from Applied Biosystems, Foster City, Calif., and/or other assay formats.
  • a stringent scoring system can be used to select only those SNP context sequences with the highest probability of success.
  • a bioinformatics process can be used to design assays.
  • a step can involve masking SNPs adjacent to the target polymorphism and/or any sequence discrepancy between the Celera and the HGP human genome assembly, within the 600 bases of a context sequence. This can prevent primers and probes from being placed on top of other SNPs and can maximize the chance that the probes will hybridize to the correct genomic sequence.
  • TAQMAN 5′ nuclease primers and probes can be designed using, for example, the ASSAYS-BY-DESIGN custom oligonucleotide reagent service (Applied Biosystems, Foster City, Calif.). Oligonucleotides can be designed in batch mode without manual intervention, and a scoring scheme can select the best sequences for a given SNP. The design algorithm can implement thermodynamic and heuristic rules and additional empirically-derived factors can increase manufacturability and assay performance. According to various embodiments, probes can be designed successfully for, for example, 97% of SNPs. After this step, a further computational quality-control step can also be performed in the context of the genome that can allow the elimination of potentially problematic SNP targets that may arise from repeated genomic regions, pseudo-SNPs, and/or other possible assembly artifacts.
  • primers and probes can be synthesized, and additional quality-control steps can occur.
  • oligonucleotide integrity can be tested.
  • assay performance can be tested against a panel of 10 DNA samples.
  • assays that pass post-manufacturing quality control can be validated in the population panels.
  • Assay validation in population panels can ensure that the locus is polymorphic and that the allele frequency is adequate for association studies in a variety of populations. For example, ninety (90) samples from the Coriell Human Variation Collection were obtained. By obtaining individual genotypes from a panel of 45 African Americans, a panel of 45 Caucasians, and a chimp DNA sample (to provide insight into ancestral alleles), sufficient information was obtained to estimate linkage disequilibrium between the SNPs in the LD map and to computationally infer common haplotypes. Assay validation in population panels can provide additional information on the usefulness of the markers, the coverage provided for a given study, and/or provide an independent assessment of assay performance.
  • the performance of each assay, which can comprise at least one polynucleotide, can be benchmarked against criteria such as, for example: background signal (e.g., low signal in the control experiments run without template); signal generation (e.g., good separation between control experiments run without template and allele clusters); and specificity.
  • a criterion can be that a maximum of three clusters of fluorescing sample and a minimum of two clusters of fluorescing sample must be observed.
  • Another criterion can be that at least 90% of samples yield callable genotypes.
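  • The two panel-level criteria above, namely two or three clusters of fluorescing sample and at least 90% callable genotypes, can be checked as in the following sketch. The function and parameter names are hypothetical, and the cluster count is assumed to come from an upstream clustering step that excludes the no-template controls.

```python
def passes_panel_criteria(n_clusters, n_called, n_samples, min_call_rate=0.90):
    """Check the cluster-count and callable-genotype criteria for one assay."""
    if n_samples <= 0:
        return False
    cluster_ok = 2 <= n_clusters <= 3
    call_rate_ok = n_called / n_samples >= min_call_rate
    return cluster_ok and call_rate_ok


print(passes_panel_criteria(n_clusters=3, n_called=85, n_samples=90))
print(passes_panel_criteria(n_clusters=1, n_called=90, n_samples=90))
print(passes_panel_criteria(n_clusters=2, n_called=70, n_samples=90))
```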
  • a method for compiling a library of polynucleotide data sets that each correspond to polynucleotides that can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B).
  • the method can include the step of selecting for the library polynucleotide data sets that each correspond to a respective polynucleotide that contains a sequence that is complementary to a respective first allele included in each of the at least one target nucleic acid sequences, if, under a set of reaction conditions, a number of parameters are met by each polynucleotide corresponding to the data sets included in the library.
  • the respective polynucleotide has a background signal value less than or equal to a first defined value, where the background signal value is a first normalized ratio of a fluorescence intensity of the respective polynucleotide reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions;
  • the respective polynucleotide has a signal generation value of greater than or equal to a second defined value, wherein the signal generation value is the difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity and (ii) the background signal value;
  • the respective polynucleotide has a specificity value of less than or equal to a third defined value, wherein the specificity value is the difference between (i)
  • the first reaction conditions can comprise a 900 nM final primer concentration and a 250 nM final probe concentration under thermal cycling conditions.
  • the first defined value can be about 2.0.
  • the second defined value can be about 1.0.
  • the third defined value can be about 2.0.
  • At least, for example, 0.01% of individuals from the population can have the identifiable genotype.
  • At least 10% of individuals from the population can have the identifiable genotype.
  • At least 20% of individuals from the population can have the identifiable genotype.
  • the identifiable genotype can result from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence.
  • the reaction can occur under the first conditions.
  • the population can have a frequency of the minor allele of greater than or equal to about 5%.
  • the minor allele frequency can be greater than or equal to about 10%.
  • the minor allele frequency can be greater than or equal to about 15%.
  • methods can include not selecting a second polynucleotide data set that corresponds to a second polynucleotide if one or more of parameters (1)-(5), above, is not met by the second polynucleotide.
  • a library of polynucleotide data sets can be compiled using methods according to various embodiments.
  • a library of assays can be compiled using methods according to various embodiments.
  • the method can include manufacturing a library of assays wherein each assay can be made using a polynucleotide data set compiled in the library.
  • a library of polynucleotides can be compiled by manufacturing polynucleotides corresponding to polynucleotide data sets compiled using methods according to various embodiments.
  • a library of assays can be compiled using methods according to various embodiments.
  • a method of detecting a SNP can be provided.
  • a step of the method can be reacting a sample containing a target nucleic acid sequence that has a target SNP with an assay selected from the library of assays compiled according to methods described herein.
  • a step can be determining the genotype of the target nucleic acid sequence that has the target SNP by detecting a characteristic attributable to the genotype of the target SNP in the sample.
  • a method for compiling a library of polynucleotide data sets that correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B).
  • the method can include the step of determining a background signal value by calculating a first normalized ratio of a fluorescence intensity of a respective polynucleotide that contains a sequence that is complementary to a first allele included in the at least one target nucleic acid sequence, reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions.
  • the method can include the step of comparing a difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity, and (ii) the background signal value.
  • the method can include the step of comparing a difference between (i) a third normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value.
  • the method can include the step of determining whether at least one individual from a population of individuals has a genotype identifiable under the first conditions that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype.
  • the method can include the step of determining whether at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions, that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence.
  • the method can include a combination of some of or all of these steps.
  • the polynucleotide data set corresponding to the respective polynucleotide can be selected for the library if, for example, the background signal value in parameter (1) is less than or equal to about two, if the ratio from the comparison in parameter (2) is greater than or equal to about one, if the ratio from the comparison in parameter (3) is less than or equal to about two, if the at least one individual of parameter (4) has the identifiable genotype, and if the at least one individual of parameter (5) has the identifiable minor allele.
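  • The steps above can be sketched as a calculation followed by a selection test. The sketch makes several assumptions: a single passive-reference intensity is used for all three reactions, the function and parameter names are hypothetical, the intensities in the example are invented, and the thresholds are the example values given in this description (about 2.0, about 1.0, and about 2.0). Parameters (4) and (5) are reduced to two booleans indicating whether at least one individual showed the identifiable genotype and the identifiable minor allele.

```python
def normalized_ratio(reporter_intensity, passive_reference_intensity):
    """Reporter fluorescence divided by passive-reference dye fluorescence."""
    return reporter_intensity / passive_reference_intensity


def assay_metrics(no_target, first_allele, second_allele, passive_reference):
    """Return (background, signal generation, specificity) for one probe."""
    background = normalized_ratio(no_target, passive_reference)
    signal = normalized_ratio(first_allele, passive_reference) - background
    specificity = normalized_ratio(second_allele, passive_reference) - background
    return background, signal, specificity


def select_for_library(background, signal, specificity,
                       genotype_identified, minor_allele_identified,
                       max_background=2.0, min_signal=1.0, max_specificity=2.0):
    """Apply parameters (1)-(5); thresholds are the example values from the text."""
    return (
        background <= max_background
        and signal >= min_signal
        and specificity <= max_specificity
        and genotype_identified
        and minor_allele_identified
    )


# Invented raw intensities, for illustration only.
metrics = assay_metrics(no_target=1500, first_allele=4500,
                        second_allele=2000, passive_reference=1000)
print(metrics)
print(select_for_library(*metrics, genotype_identified=True,
                         minor_allele_identified=True))
```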
  • a library of polynucleotide data sets can be compiled using methods according to various embodiments.
  • a library of polynucleotides can be compiled by manufacturing polynucleotides corresponding to polynucleotide data sets compiled using a method or methods according to various embodiments.
  • a method of compiling a library of assays can be provided.
  • the method can include manufacturing a library of assays, wherein each assay is manufactured using a polynucleotide data set compiled in a library according to various embodiments.
  • methods of detecting a SNP can be provided.
  • the method can include the step of reacting a sample containing a target nucleic acid sequence that has a target SNP with an assay selected from the library of assays compiled using a method or methods according to various embodiments.
  • a step can include determining the genotype of the target nucleic acid sequence that has the target SNP by detecting a characteristic attributable to the genotype of the target SNP in the sample.
  • automatic allele-calling software can be used to analyze validated assay data without user intervention. According to various embodiments, at least, for example, 90% of the assay data can be processed automatically to identify an allele. According to various embodiments, an automated validation process can be used for high volume commercial or research purposes.
  • FIG. 4 is a plot 100 of fluorescence data from many SNP assays according to various embodiments.
  • the x-axis represents relative fluorescence of a 6-FAM dye label and the y-axis represents relative fluorescence of a VIC dye label.
  • Cluster 110 represents the relative fluorescence of control samples having probes labeled with 6-FAM and VIC, respectively. The control samples did not contain a target nucleic acid sequence.
  • the background signal value is the average of the relative fluorescence of the control samples as represented by cluster 110 in FIG. 4 .
  • the background signal value in FIG. 4 is less than about 2.0.
  • Cluster 120 represents the relative fluorescence of samples having homozygous alleles (allele 2) that hybridized with probes labeled with 6-FAM.
  • Cluster 130 represents the relative fluorescence of samples having heterozygous alleles (alleles 1 and 2) that hybridized with probes labeled with VIC and 6-FAM, respectively.
  • Cluster 140 represents the relative fluorescence of samples having homozygous alleles (allele 1) that hybridized with probes labeled with VIC.
  • the signal generation value for the assays is based, at least in part, on the average of the relative fluorescence of at least one of clusters 120 , 130 , and 140 .
  • FIG. 5 is a flowchart showing a comparison of reference values determined by experimental assays according to various embodiments against target values.
  • TBV is a target background value
  • TsigV is the target signal value
  • TSpV is the target specificity value
  • TIP is the target identifiable percentage, or the minimum frequency that the assay produces an identifiable genotype
  • TMI is the target minor allele frequency, or the minimum frequency that the minor allele appears in the population.
  • FIG. 6 is a flowchart that illustrates the selection of a data set into a library of data sets, according to various embodiments.
  • Experimental or reference data is obtained by performing an assay against individuals in a population. Experimental data can be obtained at least until results are statistically significant for a given population. When a statistically significant sample size has been determined, the experimental data can be processed according to various embodiments as shown, for example, in FIG. 5 . If the selection criteria are met, the data set corresponding to polynucleotides used in the experimental assays can be added into the library. If the selection criteria are not met, the polynucleotide can be discarded, or it can be redesigned and the experimental assays repeated.
  • SNP assays and reagents can use easily automated chemistry and can be compatible with readily available high-throughput instrumentation and software systems.
  • SNP assays and reagents can have few enzymatic steps, no post-reaction transfer of liquids, and/or universal reaction conditions that can facilitate robotic liquid handling automation.
  • the assays, reagents, and/or high-throughput workflow can be easy to implement and automate and/or can use components that are ready-to-use out-of-the-box and require no optimization.
  • TAQMAN probe-based 5′ nuclease assay chemistry available from Applied Biosystems, Foster City, Calif., can meet almost any assay requirement and can unite PCR amplification and signal generation into a single step, thereby simplifying automation of both reaction set-up and data collection.
  • a hybridization probe with fluorogenic and quencher tags is cleaved by the 5′ nuclease activity of Thermus aquaticus (Taq) DNA polymerase during PCR amplification. Cleavage produces fluorescence by freeing the fluorogenic molecule from the quencher.
  • FIGS. 2 a - 2 e provide an overview of the TAQMAN probe-based 5′ nuclease assay chemistry for SNP genotyping.
  • the TAQMAN system is adapted to provide allelic discrimination and high-throughput SNP genotyping.
  • Chemistry improvements have increased assay design flexibility, enabled easy protocol standardization, enabled the use of universal reagents, and reduced background fluorescence, all of which can be desirable for high throughput SNP processing and allelic discrimination.
  • conjugated probes can significantly stabilize probe-template complexes, enabling the use of probes in the 13-mer to 20-mer size range.
  • conjugated probes can have better mismatch discrimination, can be easier to design for challenging genetic regions such as those high in GC content or those in variable context sequences, and/or can increase the signal-to-noise ratio by bringing the quencher closer to the fluorescent tag.
  • conjugated probes can increase the melting temperature window that can be used in reaction protocols, thereby allowing all the SNP assays to run under identical conditions.
  • the most precious reagent can be the DNA sample itself.
  • the 7900HT system can use a 5 microliter reaction volume and can consume one nanogram of DNA per genotyping reaction, thereby minimizing reagent costs and conserving DNA template samples.
  • the 5′ nuclease assay can be suited for automation because of its easy, three-step workflow.
  • a universal master mix, including probes and primers, can be added directly to plates of dry or fresh DNA using standard robotics. Plates can be sealed and cycled using standard thermal cyclers, such as, for example, the Applied Biosystems Dual 384-Well GENEAMP PCR System 9700 thermal cycler, available from Applied Biosystems, Foster City, Calif. Following cycling, plates can be automatically read on the 7900HT, which can support the collection of more than 250,000 genotypes per day.
  • the availability of thermal cyclers with automated lid handling can increase throughput by enabling robotics integration for 24-hour unattended operation. Automation software can also increase both quality and throughput. For example, automation of allele calling can remove inter-technician variability, increasing confidence in data quality and reducing the time spent on data analysis by 8.5 person-hours per day.
  • FIG. 3 provides an example of an automated workflow system.
  • an assay that uses two different types of probes wherein the polynucleotide and the reporter dyes differ.
  • the first type of probe can have a first polynucleotide with a VIC reporter dye attached to the 5′ end of the first polynucleotide
  • the second type of probe can have a second polynucleotide with a 6-FAM reporter dye attached to the 5′ end of the second polynucleotide
  • the first and second polynucleotides can differ by at least one nucleic acid residue at the same location in the polynucleotide when the polynucleotides are aligned 5′ to 3′.
  • the dye-labeled probes can be adapted to perform a heterozygous assay or a homozygous assay.
  • the probe can anneal to a complementary sequence between the forward and reverse primer sites. At the time of annealing, the probe is intact and the proximity of the reporter dye to the quencher can result in suppression of fluorescence of the reporter dye.
  • a polymerase can cleave a reporter dye only when the probe has completely, mostly, or substantially hybridized to the target DNA sequence. When the reporter dye is cleaved from the probe, the relative fluorescence of the reporter dye increases. The increase in relative fluorescence can be caused to only occur if the amplified target DNA sequence is complementary, mostly complementary, or substantially complementary to the probe. Therefore, the fluorescent signal generated by PCR amplification can indicate which alleles are present in a sample.
  • Mismatches between a probe and a target DNA sequence can reduce efficiency of probe hybridization and/or a polymerase can be more likely to displace a mismatched probe without cleaving it and therefore not produce a fluorescent signal. For example, if only one of two possible reporter dyes fluoresces during an assay, then the presence of a homozygous gene is indicated. For further example, if both possible reporter dyes fluoresce during an assay, then the presence of a heterozygous gene is indicated.
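The allele-calling logic described above can be sketched in a few lines. This is only an illustrative sketch: the function name `call_genotype`, the single shared `threshold`, and the label strings are assumptions introduced here, not part of the described system, which may call alleles in a different way (for example, by clustering).

```python
def call_genotype(vic_signal, fam_signal, threshold=1.0):
    """Return a genotype call from two normalized reporter signals.

    vic_signal and fam_signal are hypothetical background-corrected
    fluorescence values for the VIC-labeled and FAM-labeled probes.
    The threshold is an assumed cutoff, not a value from the source.
    """
    vic_on = vic_signal >= threshold
    fam_on = fam_signal >= threshold
    if vic_on and fam_on:
        return "heterozygous"          # both reporter dyes fluoresce
    if vic_on:
        return "homozygous allele 1"   # only the VIC probe was cleaved
    if fam_on:
        return "homozygous allele 2"   # only the FAM probe was cleaved
    return "no call"                   # neither probe produced signal


print(call_genotype(2.4, 0.1))
print(call_genotype(1.8, 2.1))
```

A fixed threshold keeps the example short; the point is only that one dye indicates a homozygote and two dyes indicate a heterozygote.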
  • At least one primer can be provided, wherein the primer can be a sequence that is shorter than the target DNA sequence.
  • the primer can have a polynucleotide and/or a minor groove binder.
  • the primer can be a sequence that is complementary to, or mostly complementary to, the target DNA sequence.
  • the primer can be at least 90% homologous to a corresponding length of the target DNA sequence, at least 80% homologous to a corresponding length of the target DNA sequence, at least 70% homologous to a corresponding length of the target DNA sequence, or at least 50% homologous to a corresponding length of the target DNA sequence.
  • thermostable DNA polymerase such as, for example, Thermus aquaticus (Taq), and at least 4 embodiments of a deoxyribonucleic acid (e.g., adenine, thymine, cytosine, and guanine)
  • the polymerase can be, for example, AMPLITAQ GOLD, available from Applied Biosystems, Foster City, Calif.
  • components of a fluorogenic 5′ nuclease assay or other assay reagents that utilize 5′ nuclease chemistry for example, TAQMAN minor groove binder probes, available from Applied Biosystems, Foster City, Calif., can be provided.
  • Some or all of the above-listed components can be replaced by or used with commercially-available products, for example, buffers or AMPLITAQ GOLD PCR MASTER MIX (Applied Biosystems, Foster City, Calif.).
  • a high-quality LD map of validated SNPs was created by integrating information from both public and private human genome efforts.
  • a set of over 200,000 validated, easy-to-use, individual SNP assays and TAQMAN ready-to-use assay reagents created by using methods according to various embodiments can be provided.
  • a minor groove binder and a non-fluorescent quencher, and the integration of the 5′ nuclease chemistry with an automated detection system, such as, for example, the 7900HT, can be used.
  • a web-based bioinformatics and ordering system can be provided where a customer can search for SNPs and order assay reagents, thus reducing the time and costs associated with candidate-gene and candidate-region association studies.
  • an LD map can, for example, enable candidate-gene and candidate-region association studies using 5′ nuclease chemistry and/or be implemented on an ultra-high throughput SNP genotyping platform to enable WGA studies.
  • the 5′ nuclease chemistry system can leverage the specificity of the OLA-PCR assay chemistry and the highly parallel detection of, for example, BEADARRAY technology, available from Illumina, Inc., San Diego, Calif.
  • the system can enable the generation of about 2,100,000 genotypes per day and all components of the assay can be universal except for, for example, the SNP-specific OLA probes.
  • assays for over 4,000,000 SNPs from the Celera database can cover every gene in the human genome. According to various embodiments, many SNPs can have the necessary variability for genetic association studies and assays for the SNPs can be provided. According to various embodiments, assays can be grouped together into convenient SNP sets optimized for specific assays such as, for example, p450 genotyping and disease-specific gene studies.
  • FIGS. 2 a - 2 e are schematic diagrams showing the interaction of components that can be part of a mixture of reagents according to various embodiments.
  • primer 52 has annealed to template strand 54 . Replication of the template strand from primer 52 will occur in the 5′ to 3′ direction.
  • Probe 50 including a generic reporter dye R, quencher Q, and minor groove binder MGB, has annealed to the template strand 54 .
  • Arrow 53 shows that as the complementary strand (not shown) is produced from the template strand 54 starting at the forward primer 52 , the complementary strand will meet probe 50 .
  • FIG. 2 b shows the complementary strand 55 as it meets probe 50 a .
  • Polymerase 60 cleaves VIC reporter dye V during the production of complementary strand 55 given that probe 50 a has annealed to the target strand 54 because the target strand 54 and the probe 50 a are completely complementary.
  • FIG. 2 c shows the complementary strand 55 as it meets probe 50 b .
  • Polymerase 60 does not cleave FAM reporter dye F during the production of complementary strand 55 given that probe 50 b has not hybridized with the target strand 54 because of a mismatched base pair at location 64 .
  • FIG. 2 d shows the complementary strand 55 as it meets probe 50 b .
  • Polymerase 60 cleaves FAM reporter dye F during the production of complementary strand 55 given that probe 50 b has annealed to the target strand 54 because the target strand 54 and the probe 50 b are completely complementary.
  • FIG. 2 e shows the complementary strand 55 as it meets probe 50 a .
  • Polymerase 60 does not cleave VIC reporter dye V during the production of complementary strand 55 given that the probe 50 a has not hybridized with the target strand 54 because of a mismatched base pair at location 66 .
  • FIG. 7 is an illustration of SNPs selected by a method according to various embodiments.
  • a library contains a plurality of data sets, corresponding to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one respective Single Nucleotide Polymorphism (SNP) in a nucleic acid sequence.
  • the nucleic acid sequence can include three or more adjacent SNPs and the data sets can correspond to at least the three or more adjacent SNPs, respectively.
  • Each adjacent SNP can be spaced a distance from at least one other adjacent SNP and each of the distances between adjacent SNPs can be from about 75% to about 125% of an average of the distances between the adjacent SNPs.
  • the corresponding, adjacent SNPs can be spaced apart along at least a region of a chromosome.
  • the corresponding, adjacent SNPs can be spaced apart along at least a region of a gene.
  • the distances between all corresponding, adjacent SNPs can be equal, plus or minus 30%.
  • the distances between all corresponding, adjacent SNPs can be equal, plus or minus 20%.
  • Each of the distances between corresponding, adjacent SNPs can be from about 90% to about 110% of an average of the distances between the adjacent SNPs.
  • Each of the distances between corresponding, adjacent SNPs can be from about 95% to about 105% of an average of the distances between the adjacent SNPs.
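The spacing ranges above (for example, each distance falling within about 75% to 125% of the average distance) can be checked mechanically. The following is a minimal sketch under the assumption that SNP locations are given as integer base positions; the function name `spacing_is_uniform` and the `tolerance` parameter are illustrative and do not come from the source.

```python
def spacing_is_uniform(positions, tolerance=0.25):
    """Check that every gap between adjacent SNPs lies within
    (1 - tolerance) to (1 + tolerance) times the average gap.

    tolerance=0.25 corresponds to the 75%-125% range, and
    tolerance=0.05 to the 95%-105% range.
    """
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return True  # fewer than two SNPs: nothing to compare
    mean_gap = sum(gaps) / len(gaps)
    low = (1 - tolerance) * mean_gap
    high = (1 + tolerance) * mean_gap
    return all(low <= gap <= high for gap in gaps)


# Gaps are 100, 110, and 90 bases; the average is 100.
snps = [0, 100, 210, 300]
print(spacing_is_uniform(snps, tolerance=0.25))
print(spacing_is_uniform(snps, tolerance=0.05))
```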
  • the nucleic acid sequence can be a consensus sequence.
  • the distances between all corresponding, adjacent SNPs can be less than a specified maximum distance.
  • the specified maximum distance can be, for example, 10 kilobases, 15 kilobases, 20 kilobases, or 30 kilobases.
  • the algorithm can select a minimum number of SNPs per region. For example, three SNPs per gene can be selected. The distances between all corresponding, adjacent SNPs can be greater than a specified minimum distance.
  • the nucleic acid sequence can be a consensus sequence corresponding to the human genome.
  • the nucleic acid sequence can be a nucleic acid sequence data set.
  • the library can comprise a number of data sets corresponding to not more than a sufficient number of oligonucleotides necessary to provide a collection of assays that can provide a maximum statistical loss of haplotype diversity, across the region, of less than ten (10) percent.
  • the sufficient number of oligonucleotides can be obtained by providing a matrix comprised of data representing haplotype blocks and SNP locations.
  • the columns of the matrix can contain data representing existence of respective SNPs within a haplotype block and the rows of the matrix can contain data representing respective haplotype blocks.
  • At least one column can be eliminated, wherein elimination of the at least one column may not reduce the number of rows in the matrix that contains non-duplicative information.
  • At least one column of the matrix that is identical to a second column of the matrix and/or completely opposite to a second column of the matrix can be eliminated.
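The matrix reduction described above can be sketched as follows, assuming a 0/1 encoding in which each row is a haplotype and each column is a SNP. The function names and the greedy left-to-right order in which columns are tested are assumptions made for illustration; the source does not specify an order.

```python
def drop_redundant_columns(matrix):
    """Keep one column from each group of identical or fully opposite columns."""
    n_cols = len(matrix[0])
    columns = [tuple(row[j] for row in matrix) for j in range(n_cols)]
    kept, seen = [], set()
    for j, col in enumerate(columns):
        flipped = tuple(1 - v for v in col)
        if col in seen or flipped in seen:
            continue  # carries the same information as an earlier column
        seen.add(col)
        kept.append(j)
    return kept


def distinct_rows(matrix, cols):
    """Count haplotypes that remain distinguishable using only `cols`."""
    return len({tuple(row[j] for j in cols) for row in matrix})


def minimal_columns(matrix):
    """Greedily remove columns whose removal does not merge any rows."""
    kept = drop_redundant_columns(matrix)
    target = distinct_rows(matrix, kept)
    for j in list(kept):
        trial = [c for c in kept if c != j]
        if trial and distinct_rows(matrix, trial) == target:
            kept = trial
    return kept


haplotypes = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
]
print(minimal_columns(haplotypes))
```

Because the elimination is greedy, the result is a subset from which no further column can be removed, which is not guaranteed to be the smallest possible subset.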
  • FIG. 7 illustrates SNPs that can be selected according to various embodiments.
  • SNPs 710 , 720 , 730 , and 740 that are present in region 702 of nucleic acid sequence 700 have been selected based on prioritization criteria.
  • SNPs 710 , 720 , 730 , and 740 are separated from adjacent SNPs by a distance no greater than distance 760 .
  • the effective range of usefulness of each SNP is equal to plus or minus one half of distance 760 . Therefore, there are no gaps in coverage of region 702 of nucleic acid sequence 700 .
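The coverage argument above (each SNP is useful over plus or minus one half of the maximum distance, so a region is fully covered when those ranges leave no gaps) can be checked with a short routine. This is a sketch only; the function name `coverage_gaps` and the use of plain numeric coordinates are assumptions for illustration.

```python
def coverage_gaps(region_start, region_end, snp_positions, max_distance):
    """Return the uncovered intervals of a region, treating each SNP as
    covering its position plus or minus half of max_distance."""
    half = max_distance / 2
    intervals = sorted((p - half, p + half) for p in snp_positions)
    gaps = []
    cursor = region_start  # everything to the left of cursor is covered
    for low, high in intervals:
        if low > cursor:
            gaps.append((cursor, min(low, region_end)))
        cursor = max(cursor, high)
        if cursor >= region_end:
            break
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return gaps


# Four SNPs spaced exactly max_distance apart cover the whole region.
print(coverage_gaps(0, 40, [5, 15, 25, 35], max_distance=10))
# Dropping one SNP opens a gap between positions 10 and 20.
print(coverage_gaps(0, 40, [5, 25, 35], max_distance=10))
```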
  • SNPs 750 , 752 , 754 , 756 were not selected using the same selection criteria.
  • the selection criteria can be used to select data sets for the library, where the data sets correspond to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one respective SNP.
  • at least one selection criterion of the selection criteria can be used.
  • One criterion can be a specified maximum distance between adjacent SNPs.
  • Another criterion can be a specified minimum distance between adjacent SNPs.
  • a criterion can be a target distance between adjacent SNPs. The actual distance between adjacent SNPs can vary from the target distance by, for example, 5%, 10%, 20%, or 30%.
  • An algorithm to prioritize SNPs can select as few SNPs as possible to achieve a coverage of the region where the largest gap between adjacent SNPs is less than or equal to the specified maximum distance.
  • An algorithm to prioritize SNPs may not select a SNP if the distance between that SNP and either of its adjacent SNPs is less than the specified minimum distance.
  • Another prioritization criterion can involve using known SNPs.
  • Known SNPs can be preferentially selected for inclusion in the library because, for example, known SNPs are well-characterized and may already have assays designed to target them, so there may be little or no additional cost associated with providing an oligonucleotide, or a data set corresponding to an oligonucleotide, for an assay directed to the SNP.
  • An algorithm to prioritize SNPs can consider known SNPs as “must use” SNPs or as “prefer to use” SNPs.
  • a known SNP is a previously characterized SNP marker that has low or no cost associated with designing and/or manufacturing one or more oligonucleotides.
  • Known SNPs can be preferentially used as a selection criterion.
  • a newly identified SNP is a SNP marker that is relatively uncharacterized and therefore there is a higher relative cost associated with designing and/or manufacturing one or more oligonucleotides.
  • a selection criterion can be a newly identified SNP.
  • Newly identified SNPs can be selected after all or some of the known SNPs have been selected because, for example, newly identified SNPs require investigation into whether a functional assay can be produced for the newly identified SNPs and therefore newly identified SNPs have greater costs associated with them than known SNPs.
  • An algorithm to prioritize SNPs can consider newly identified SNPs as “must never use” SNPs, “use only after using all known” SNPs, or as “try not to use” SNPs.
  • selection criteria can be used to select data sets corresponding to the largest number of known SNPs and the smallest number of newly identified SNPs that provide a coverage of the region, where the distance between adjacent, selected SNPs is approximately equal.
  • a specified maximum distance can be allowed between two adjacent selected SNPs, as well as between the two outermost selected SNPs and the boundaries of the region, gene, chromosome, or other boundary denoting the beginning and ending of the nucleic acid sequence.
  • One requirement can be, for example, to never have a distance between adjacent selected SNPs that is greater than a maximum required distance, unless maintaining the maximum required distance is impossible because there are no adjacent SNPs that fall within the maximum required distance.
  • the largest distance between adjacent selected SNPs can be as small as possible.
  • an algorithm can be provided that selects, based on prioritization criteria, a plurality of data sets that corresponds to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one selected SNP.
  • a region can be at least a part of the nucleic acid sequence.
  • the region can be separated into sub-regions.
  • the sub-regions can, for example, be separated by known SNPs, if any.
  • Each sub-region can be solved and optimized independently according to various embodiments of the algorithm.
  • the sub-region can be bound by known SNPs. If there are no known SNPs, then the region may not be separated into sub-regions.
  • Various embodiments of the algorithm can be repeated for each sub-region.
  • Each sub-region can begin, for example, with the region's 5′ end or a known SNP and can end with the 3′ end or a known SNP.
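The division of a region into sub-regions bounded by known SNPs can be sketched as below. The function name and the representation of each sub-region as a `(start, end)` pair are illustrative assumptions.

```python
def split_into_subregions(region_start, region_end, known_snps):
    """Split a region at each known SNP lying strictly inside it.

    Each sub-region begins at the region's 5' end or a known SNP and
    ends at the next known SNP or the region's 3' end. With no known
    SNPs, the region is returned as a single sub-region.
    """
    inner = sorted(p for p in known_snps if region_start < p < region_end)
    bounds = [region_start] + inner + [region_end]
    return list(zip(bounds, bounds[1:]))


print(split_into_subregions(0, 100, [70, 30]))
print(split_into_subregions(0, 100, []))
```

Each returned pair can then be handed to the selection steps independently, as the text describes.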
  • a step of the algorithm can be a locally optimal solution (“local step”) that can determine the number of newly identified SNPs that can be utilized.
  • the step can include selecting all the newly identified SNPs.
  • the step can include selecting the known SNPs closest to one or both ends of the region or sub-region of the nucleic acid sequence.
  • the step can include selecting the newly identified SNPs closest to one or both ends of the region or sub-region of the nucleic acid sequence.
  • the step can iteratively eliminate newly identified SNPs until, for example, no newly identified SNPs can be eliminated.
  • the step can iteratively eliminate newly identified SNPs until, for example, the remaining number of known SNPs and newly identified SNPs is at or below a minimum number of SNPs, if any, that can be selected.
  • the distance between all adjacent SNPs can be calculated.
  • the first and/or last distances between adjacent SNPs in a region or sub-region can be doubled if the distances are between a SNP and the edge of the region or sub-region.
  • the consecutive distances between adjacent SNPs can be added until the cumulative sum of consecutive distances exceeds the specified maximum distance. If the cumulative sum of consecutive distances exceeds the specified maximum distance after more than one consecutive distance, i.e., more than just two adjacent SNPs, or more than just one SNP and an edge of a region or sub-region, then one SNP in the set of newly identified SNPs can be removed.
  • the local step can include identifying the smallest distance between adjacent SNPs in the sequence of adjacent SNPs that makes up the cumulative sum, where the cumulative sum of adjacent distances is greater than the specified maximum distance. If there is only one distance between adjacent SNPs having a distance equal to the smallest distance, then one of the two adjacent SNPs bounding the smallest distance can be eliminated if one of the SNPs is a newly identified SNP. For example, if both SNPs in the pair of adjacent SNPs having a distance equal to the smallest distance are newly identified SNPs, one of the SNPs can be eliminated at random or the SNP nearest the 5′ end or the SNP nearest the 3′ end can be eliminated by convention.
  • each SNP can have two distance values associated with it.
  • the distance value (5′) is the distance to the adjacent SNP on the 5′ end of the nucleic acid sequence
  • the distance value (3′) is the distance to the adjacent SNP on the 3′ end. If there is more than one distance between adjacent SNPs having a distance equal to the smallest distance, e.g. a “tie,” then the tie can be broken by, for example, choosing the smallest distance on the 3′ end of the nucleic acid sequence. For another example, the smallest distance can be chosen arbitrarily (e.g.
  • the process of the local step can be stopped. If the total number of SNPs is at or below a specified minimum number of SNPs, the process of the local step can be stopped.
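The local step can be illustrated with a simplified greedy variant. This sketch is not the exact procedure described above: instead of walking cumulative sums of consecutive distances, it repeatedly removes the newly identified SNP whose removal leaves the smallest worst-case gap, breaking ties toward the 5′ end, and stops when no removal keeps every gap within the maximum distance or when a minimum SNP count is reached. The function name, parameter names, and tie-breaking rule are assumptions. Edge distances are doubled, following the description above.

```python
def local_step(region_start, region_end, known, new, max_distance, min_snps=0):
    """Greedily drop newly identified SNPs while all gaps stay <= max_distance."""
    known_set = set(known)
    selected = sorted(known_set | set(new))

    def gaps(snps):
        if not snps:
            return [float("inf")]  # an empty selection covers nothing
        edge_5 = 2 * (snps[0] - region_start)   # edge distances are doubled
        edge_3 = 2 * (region_end - snps[-1])
        inner = [b - a for a, b in zip(snps, snps[1:])]
        return [edge_5] + inner + [edge_3]

    while len(selected) > min_snps:
        best = None
        for i, pos in enumerate(selected):
            if pos in known_set:
                continue  # known SNPs are never eliminated
            worst = max(gaps(selected[:i] + selected[i + 1:]))
            if worst <= max_distance and (best is None or (worst, pos) < best):
                best = (worst, pos)
        if best is None:
            break  # no newly identified SNP can be eliminated
        selected.remove(best[1])
    return selected


kept = local_step(0, 100, known=[50], new=[10, 20, 30, 70, 80, 90], max_distance=45)
print(kept)
```

The number of newly identified SNPs that survive is the quantity the global step would then take as its input.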
  • a step of the algorithm can be a globally optimal solution (“global step”) that can determine the optimum selection of newly identified SNPs given the number of newly identified SNPs that can be utilized. For example, the number of newly identified SNPs that can be utilized can be provided by the local step. According to various embodiments, at least one step of the algorithm is performed.
  • N can be a value from 1 to K.
  • K can be selected by determining the specified minimum and/or specified maximum distances of the region or sub-region.
  • the global step can include calculating all possible selections P of the number of newly identified SNPs to be selected N out of the total number of newly identified SNPs in the region or sub-region K.
  • the global step can include calculating the largest distance between adjacent, selected SNPs for each selection P.
  • the global step can include choosing the selection P with the “smallest” largest distance between adjacent, selected SNPs.
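The global step, as described, enumerates every selection of N newly identified SNPs out of K and keeps the one with the smallest largest distance. A brute-force sketch follows. The function name is hypothetical, and doubling the distances to the region edges is an assumption carried over from the local-step description, not something the text states for the global step.

```python
from itertools import combinations


def global_step(region_start, region_end, known, new, n_new):
    """Try every choice of n_new newly identified SNPs and return
    (largest_gap, chosen_new_snps) for the choice with the smallest
    largest gap. Returns None if no selection contains any SNP."""
    best = None
    for choice in combinations(sorted(new), n_new):
        snps = sorted(set(known) | set(choice))
        if not snps:
            continue
        gaps = [b - a for a, b in zip(snps, snps[1:])]
        gaps.append(2 * (snps[0] - region_start))   # assumed: edges doubled
        gaps.append(2 * (region_end - snps[-1]))
        worst = max(gaps)
        if best is None or worst < best[0]:
            best = (worst, choice)  # ties keep the earliest selection
    return best


result = global_step(0, 100, known=[50], new=[10, 20, 30, 70, 80, 90], n_new=2)
print(result)
```

Exhaustive enumeration grows combinatorially with K, which is one reason to solve each sub-region separately, as the text suggests.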
  • a minimum distance between adjacent selected SNPs can be specified.
  • a minimum number of total markers can be specified.
  • Prioritization criteria can be assigned to different, respective newly identified SNPs that can assign preference to some newly identified SNPs over other newly identified SNPs.
  • the distance value associated with a “high priority” newly identified SNP can be, for example, marked or changed so that the “high priority” newly identified SNP is always preferred or selected over a “low priority” newly identified SNP.
  • FIG. 8 details a hypothetical chromosome having 969 genes on the chromosome. Of those 969 genes, 1,639 known SNPs are present on the chromosome and are well characterized. The chromosome contains 11,095 newly identified SNPs that are not well characterized. Of the 969 genes, 611 genes contain SNPs and 358 genes do not contain SNPs. Of the 611 genes containing SNPs, FIG. 8 lists the average gene length, in bases, the number of newly identified SNPs per gene, the number of known SNPs per gene, the total number of selected SNPs per gene that were selected using various embodiments, and the number of newly identified selected SNPs per gene that were selected using various embodiments. FIG.
  • FIG. 9 is a histogram of gene lengths of the genes found on the hypothetical chromosome of FIG. 8 .
  • FIG. 10 is a histogram of the specified maximum distance between adjacent SNPs, according to various embodiments, of the selected SNPs from the hypothetical chromosome of FIG. 8 .
  • FIG. 11 is a histogram of the actual maximum distance between adjacent SNPs, according to various embodiments, of selected SNPs of the hypothetical chromosome of FIG. 8 .
  • FIG. 12 is a histogram of total selected SNPs per gene, according to various embodiments, from the hypothetical chromosome of FIG. 8 .
  • FIG. 13 is a histogram of the number of newly identified SNPs per gene that were selected from the hypothetical chromosome of FIG. 8 using various embodiments.

Abstract

Libraries of assays and methods of compiling the libraries are provided. The assays can identify Single Nucleotide Polymorphisms (SNPs). Methods of validating SNPs are provided. Methods of constructing linkage disequilibrium maps using sets or subsets of SNPs are also provided.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit under 35 U.S.C. § 119(e) of prior U.S. Provisional Patent Applications No. 60/352,039, filed Jan. 25, 2002; 60/352,356, filed Jan. 28, 2002; 60/369,127, filed Apr. 1, 2002; 60/369,657, filed Apr. 3, 2002; 60/370,921, filed Apr. 9, 2002; 60/376,171, filed Apr. 26, 2002; 60/380,057, filed May 6, 2002; 60/383,627, filed May 28, 2002; 60/383,954, filed May 29, 2002; 60/390,708, filed Jun. 21, 2002; 60/394,115, filed Jul. 5, 2002; and 60/399,860, filed Jul. 31, 2002; U.S. Non-Provisional patent application Ser. No. 10/335,707, filed Jan. 2, 2003; U.S. Non-Provisional patent application Ser. No. 10/335,690, filed Jan. 2, 2003; and U.S. Non-Provisional patent application Ser. No. 10/334,793, filed Jan. 2, 2003; and International Patent Application No. PCT/US03/00128, filed Jan. 2, 2003; all of which are incorporated herein in their entireties by reference.
  • BACKGROUND
  • Assays can include probes and/or primers that hybridize with a target nucleic acid sequence. These probes and primers can be useful for visualizing or amplifying the target nucleic acid sequence. The target nucleic acid sequence can have a Single Nucleotide Polymorphism (SNP) contained therein. A relative indication of concentration of the target nucleic acid sequence can be obtained based on relative fluorescence of the probes.
  • Single nucleotide polymorphisms (SNPs) are an abundant form of genetic variation. These single nucleotide changes are found approximately every 500 bp in the human genome. Almost all SNPs are bi-allelic, that is, only two different alleles exist. Typically, one allele is present in the majority of the chromosomes of a population, and the alternative variant, that is, the minor allele, is present with less frequency. Only alleles that are present at a frequency greater than 1% are considered polymorphisms.
  • SNPs are promising tools for mapping susceptibility mutations that contribute to complex diseases. Although most SNPs are neutral and do not affect phenotype, they can be used as surrogate markers for positional cloning of genetic loci, because of the allelic association, known as linkage disequilibrium (LD), that can be shared by groups of adjacent SNPs. LD is eroded by gene conversion and recombination, and the amount of LD depends on the age of the mutations and on the demographic history of the population. The extent of LD across a genomic region dictates the density of SNP markers necessary to ensure association between a marker and the causative allele sought.
  • Early attempts to model the extent of LD on theoretical grounds predicted very short regions of LD, extending only a few kilobases (Kb). However, empirical surveys reported average LD distances between 5 Kb and 60 Kb, with the upper range extending up to hundreds of Kb.
  • Previous efforts for typing the specific SNP alleles present in a DNA sample produce unphased genotypes (i.e., the alleles detected cannot be assigned to either the maternal or the paternal chromosome). Although there are a few cumbersome methods to directly determine haplotypes, previous algorithms are widely used to infer the haplotypes from genotypes using maximum-likelihood or Bayesian principles.
  • SUMMARY
  • According to various embodiments, methods are provided for SNP validation that take into consideration a number of findings and statistics. Studies have reported a discontinuous structure in the patterns of LD across a set of regions sampled from the human genome, where long stretches of strong LD are punctuated by recombination hot-spots. These LD “blocks” show little evidence of historical recombination. According to various embodiments, these results are deconvoluted to predict that a reduced set of contiguous chromosomal segments, or haplotypes, exist in specific populations. For example, for a block spanning tens of Kbs for which 10 SNPs exist, instead of the 2^10 (1,024) theoretically possible haplotypes, it has been found that 95% of the haplotype diversity is made up of only 4 to 6 so-called common haplotypes.
  • It is noteworthy that these LD block patterns change depending on the population sampled because of historical differences; for example, populations that have experienced bottlenecks (e.g., Caucasians) show longer LD blocks and less evidence of historical recombination events than other populations. The haplotype diversity in a given population is typically constant in a given region irrespective of the number of SNPs sampled; therefore typing an arbitrarily large number of SNPs within an LD block is unnecessary. Selecting the minimum subset of SNPs within LD blocks, or any other discrete genetic locus, that enables discrimination of the common haplotypes present in a block without loss of information can be used to validate SNPs and/or to compile a concise library of assays useful for genetic analysis.
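To make the haplotype arithmetic concrete: 10 bi-allelic SNPs allow 2^10 = 1,024 haplotypes in theory, yet a handful of common haplotypes can account for 95% of the observed diversity. The sketch below counts how many haplotypes are needed to reach a coverage level. The function name is hypothetical and the frequencies are invented purely for illustration; they are not data from the source.

```python
def common_haplotypes(frequencies, coverage=0.95):
    """Return the most frequent haplotypes whose combined frequency
    first reaches the requested coverage."""
    chosen, total = [], 0.0
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    for haplotype, freq in ranked:
        chosen.append(haplotype)
        total += freq
        if total >= coverage:
            break
    return chosen


print(2 ** 10)  # theoretical haplotypes for 10 bi-allelic SNPs

# Made-up frequencies for six observed haplotypes in one block.
observed = {"H1": 0.40, "H2": 0.30, "H3": 0.16, "H4": 0.10, "H5": 0.02, "H6": 0.02}
print(common_haplotypes(observed))
```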
  • According to various embodiments, a method of compiling a library of polynucleotide data sets is provided. The data sets can correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B). The method can include the step of selecting for the library polynucleotide data sets that each correspond to a respective polynucleotide that contains a sequence that is complementary to a respective first allele included in each of the at least one target nucleic acid sequences, if, under a set of reaction conditions, a number of parameters are met by each polynucleotide corresponding to the data sets included in the library. The parameters can include: (1) the respective polynucleotide has a background signal value less than or equal to a first defined value, where the background signal value is a first normalized ratio of a fluorescence intensity of the respective polynucleotide reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions; (2) the respective polynucleotide has a signal generation value of greater than or equal to a second defined value, wherein the signal generation value is the difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity and (ii) the background signal value; (3) the respective polynucleotide has a specificity value of less than or equal to a third defined value, wherein the specificity value is the difference between (i) a third normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value; (4) at least one individual from a population of individuals has a genotype identifiable under the first conditions, that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype; and (5) at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable minor allele, and at least one individual that does not have the identifiable minor allele.
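Parameters (1) through (3) above reduce to three ratios against the passive-reference dye. The sketch below computes them and applies the three thresholds. All names are hypothetical, a single passive-reference intensity is assumed for all three reactions, and the threshold values in the example are placeholders rather than values from the source.

```python
def validate_probe(reporter_no_target, reporter_match, reporter_mismatch,
                   passive_ref, max_background, min_signal, max_cross_signal):
    """Evaluate the background, signal generation, and specificity criteria.

    Each reporter_* argument is a raw reporter fluorescence intensity;
    passive_ref is the passive-reference dye intensity under the same
    excitation conditions (assumed identical across reactions here).
    """
    background = reporter_no_target / passive_ref                    # parameter (1)
    signal_generation = reporter_match / passive_ref - background    # parameter (2)
    specificity = reporter_mismatch / passive_ref - background       # parameter (3)
    passed = (background <= max_background
              and signal_generation >= min_signal
              and specificity <= max_cross_signal)
    return {
        "background": background,
        "signal_generation": signal_generation,
        "specificity": specificity,
        "passed": passed,
    }


# Placeholder intensities and thresholds, for illustration only.
report = validate_probe(200, 1500, 300, passive_ref=1000,
                        max_background=0.5, min_signal=1.0, max_cross_signal=0.3)
print(report["passed"])
```

Parameters (4) and (5) depend on genotyping a population and are not modeled in this sketch.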
  • According to various embodiments, a method of compiling a library of polynucleotide data sets is provided. The data sets can correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B). The method can include the step of determining a background signal value by calculating a first normalized ratio of a fluorescence intensity of a respective polynucleotide that contains a sequence that is complementary to a first allele included in the at least one target nucleic acid sequence, reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions. The method can include the step of comparing a difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity, and (ii) the background signal value. The method can include the step of comparing a difference between (i) a third normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value. 
The method can include the step of determining whether at least one individual from a population of individuals has a genotype identifiable under the first conditions that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype. The method can include the step of determining whether at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence. Various combinations of the herein described method steps and/or parameters can be used.
  • According to various embodiments, a method of confirming the existence of a SNP is provided. The method can include the step of identifying a location corresponding to a possible SNP in a polynucleotide in a first collection of data sets containing information on genomic deoxyribonucleic acid (DNA) samples in the form of data sets corresponding to polynucleotides. The method can include the step of confirming the existence of the SNP if at least one condition is met. A condition can be met if a second collection of data sets containing information on genomic deoxyribonucleic acid (DNA) samples contains information that identifies the location as containing the possible SNP. A condition can be met, for example, if at least two data sets from the first collection of data sets contain information corresponding to a minor allele of the possible SNP at the location, wherein the at least two data sets represent genomic deoxyribonucleic acid (DNA) samples obtained from two independent sources. A condition can be met if a data set in a third collection of data sets corresponds to a consensus sequence of genomic deoxyribonucleic acid (DNA) samples that contains the minor allele of the possible SNP. The source of the consensus sequence of genomic deoxyribonucleic acid (DNA) samples, and the sources of the genomic deoxyribonucleic acid (DNA) samples from the first collection of data sets, can be independent.
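  • The "at least one condition" logic of this paragraph can be sketched as a simple predicate. The function and parameter names are hypothetical, and how each input would be derived from the underlying collections of data sets is not specified here.

```python
def snp_confirmed(in_second_collection,
                  independent_minor_allele_sources,
                  in_independent_consensus):
    """Return True if at least one of the three confirmation conditions holds.

    in_second_collection: a second collection also reports the SNP at the location.
    independent_minor_allele_sources: number of independent DNA sources in the
        first collection whose data sets show the minor allele at the location.
    in_independent_consensus: an independently sourced consensus sequence in a
        third collection contains the minor allele.
    """
    return bool(
        in_second_collection
        or independent_minor_allele_sources >= 2
        or in_independent_consensus
    )
```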
  • According to various embodiments, a library is provided that contains data corresponding to respective oligonucleotides that can function as assays to detect Single Nucleotide Polymorphisms (SNPs). The library can have a number of data sets corresponding to not more than a sufficient number of oligonucleotides necessary to provide a collection of assays that provides a maximum statistical loss of a defined percentage of haplotype diversity across a human genome.
  • According to various embodiments, an algorithm to select the minimal subset of SNPs required for capturing the diversity of haplotype blocks or other genetic loci is provided. The algorithm can be used to quickly select the minimum SNP subset with no loss of haplotype information. In addition, the algorithm can be used in a more aggressive mode to further reduce the original SNP set, with minimal loss of information.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1 a-b are graphs of SNPs per LD block v. minimum information SNP subset for African-American and Caucasian populations, respectively;
  • FIGS. 2 a-2 e are schematic diagrams of quenchable dyes that can be part of a mixture of reagents provided and/or used according to various embodiments;
  • FIG. 3 is a workflow diagram according to various embodiments;
  • FIG. 4 is a graph showing visualized assay results, according to various embodiments;
  • FIG. 5 is a flowchart showing an algorithm according to various embodiments;
  • FIG. 6 is a flowchart showing an algorithm according to various embodiments;
  • FIG. 7 is an illustration of SNPs selected by a method according to various embodiments;
  • FIG. 8 shows a table of SNPs and genes on a hypothetical chromosome;
  • FIG. 9 is a histogram of gene lengths of the hypothetical chromosome of FIG. 8;
  • FIG. 10 is a histogram of the specified maximum distance between adjacent SNPs;
  • FIG. 11 is a histogram of actual maximum distance between adjacent SNPs;
  • FIG. 12 is a histogram of total selected SNPs per gene; and
  • FIG. 13 is a histogram of the number of newly identified SNPs selected per gene.
  • DESCRIPTION OF VARIOUS EMBODIMENTS
  • According to various embodiments, nucleic acid analogs can be used in addition to or instead of nucleic acids. Examples of nucleic acid analogs can include the family of peptide nucleic acids (PNA), wherein the sugar/phosphate backbone of DNA or RNA has been replaced with acyclic, achiral, and neutral polyamide linkages. For example, a probe or primer can have a PNA polymer instead of a DNA polymer. The 2-aminoethylglycine polyamide linkage, with nucleobases attached to the linkage through an amide bond, can be used as a PNA and has been shown to possess exceptional hybridization specificity and affinity. An example of a PNA is shown below in a partial structure with a carboxyl-terminal amide:
    Figure US20050282162A1-20051222-C00001
  • “Nucleobase” as used herein means any nitrogen-containing heterocyclic moiety capable of forming Watson-Crick hydrogen bonds in pairing with a complementary nucleobase or nucleobase analog, e.g. a purine, a 7-deazapurine, or a pyrimidine. Typical nucleobases are the naturally occurring nucleobases such as, for example, adenine, guanine, cytosine, uracil, thymine, and analogs of the naturally occurring nucleobases, e.g. 7-deazaadenine, 7-deazaguanine, 7-deaza-8-azaguanine, 7-deaza-8-azaadenine, inosine, nebularine, nitropyrrole, nitroindole, 2-aminopurine, 2-amino-6-chloropurine, 2,6-diaminopurine, hypoxanthine, pseudouridine, pseudocytosine, pseudoisocytosine, 5-propynylcytosine, isocytosine, isoguanine, 7-deazaguanine, 2-azapurine, 2-thiopyrimidine, 6-thioguanine, 4-thiothymine, 4-thiouracil, O6-methylguanine, N6-methyladenine, O4-methylthymine, 5,6-dihydrothymine, 5,6-dihydrouracil, 4-methylindole, pyrazolo[3,4-D]pyrimidines, “PPG”, and ethenoadenine.
  • “Nucleoside” as used herein refers to a compound consisting of a nucleobase linked to the C-1′ carbon of a sugar, such as, for example, ribose, arabinose, xylose, and pyranose, in the natural β or the α anomeric configuration. The sugar can be substituted or unsubstituted. Substituted ribose sugars can include, but are not limited to, those riboses having one or more of the carbon atoms, for example, the 2′-carbon atom, substituted with one or more of the same or different Cl, F, —R, —OR, —NR2 or halogen groups, where each R is independently H, C1-C6 alkyl or C5-C14 aryl. Ribose examples can include ribose, 2′-deoxyribose, 2′,3′-dideoxyribose, 2′-haloribose, 2′-fluororibose, 2′-chlororibose, and 2′-alkylribose, e.g. 2′-O-methyl, 4′-α-anomeric nucleotides, 1′-α-anomeric nucleotides, 2′-4′- and 3′-4′-linked and other “locked” or “LNA”, bicyclic sugar modifications. Exemplary LNA sugar analogs within a polynucleotide can include the following structures:
    Figure US20050282162A1-20051222-C00002

    where B is any nucleobase.
  • Sugars can have modifications at the 2′- or 3′-position such as methoxy, ethoxy, allyloxy, isopropoxy, butoxy, isobutoxy, methoxyethyl, alkoxy, phenoxy, azido, amino, alkylamino, fluoro, chloro and bromo. Nucleosides and nucleotides can have the natural D configurational isomer (D-form) or the L configurational isomer (L-form). When the nucleobase is a purine, e.g. adenine or guanine, the ribose sugar is attached to the N9-position of the nucleobase. When the nucleobase is a pyrimidine, e.g. cytosine, uracil, or thymine, the pentose sugar is attached to the N1-position of the nucleobase.
  • “Nucleotide” as used herein refers to a phosphate ester of a nucleoside and can be in the form of a monomer unit or within a nucleic acid. “Nucleotide 5′-triphosphate” as used herein refers to a nucleotide with a triphosphate ester group at the 5′ position, and can be denoted as “NTP”, or “dNTP” and “ddNTP” to particularly point out the structural features of the ribose sugar. The triphosphate ester group can include sulfur substitutions for the various oxygens, e.g. α-thio-nucleotide 5′-triphosphates.
  • As used herein, the terms “polynucleotide” and “oligonucleotide” mean single-stranded and double-stranded polymers of, for example, nucleotide monomers, including 2′-deoxyribonucleotides (DNA) and ribonucleotides (RNA) linked by internucleotide phosphodiester bond linkages, e.g. 3′-5′ and 2′-5′, inverted linkages, e.g. 3′-3′ and 5′-5′, branched structures, or internucleotide analogs. Polynucleotides can have associated counter ions, such as H+, NH4+, trialkylammonium, Mg2+, Na+ and the like. A polynucleotide can be composed entirely of deoxyribonucleotides, entirely of ribonucleotides, or chimeric mixtures thereof. Polynucleotides can be comprised of internucleotide, nucleobase and sugar analogs. For example, a polynucleotide or oligonucleotide can be a PNA polymer. Polynucleotides can range in size from a few monomeric units, e.g. 5-40, when they are more commonly referred to in the art as oligonucleotides, to several thousands of monomeric nucleotide units. Unless otherwise denoted, whenever a polynucleotide sequence is represented, it will be understood that the nucleotides are in 5′ to 3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine.
  • “Internucleotide analog” as used herein means a phosphate ester analog or a non-phosphate analog of a polynucleotide. Phosphate ester analogs can include: (i) C1-C4 alkylphosphonate, e.g. methylphosphonate; (ii) phosphoramidate; (iii) C1-C6 alkyl-phosphotriester; (iv) phosphorothioate; and (v) phosphorodithioate. Non-phosphate analogs can include compounds wherein the sugar/phosphate moieties are replaced by an amide linkage, such as a 2-aminoethylglycine unit, commonly referred to as PNA.
  • “Heterozygous” as used herein means both members of a pair of alleles of a gene are present in a sample obtained from a single source, wherein a gene can have two alleles due to, for example, the fusion of two dissimilar gametes with respect to the gene.
  • “Heterozygous assay” as used herein means an assay adapted to identify the allelic state of a gene having one or both members of a pair of alleles.
  • “Homozygous” as used herein means one member of a pair of alleles is present in a sample obtained from a single source, wherein a gene can have one allele due to, for example, the fusion of two identical gametes with respect to the gene.
  • “Homozygous assay” as used herein means an assay adapted to identify only one of two possible allelic states of a gene having one or both members of a pair of alleles.
  • “Lossy” as used herein means the loss of haplotype diversity in a linkage disequilibrium block.
  • “Lossless” as used herein means that there is no loss of haplotype diversity in a linkage disequilibrium block.
  • According to various embodiments, a library of assays can be provided. The library of assays can have from about 100,000 to about 500,000 polynucleotides, for example, about 150,000 to about 250,000 polynucleotides. According to various embodiments, a library of data sets can be provided. The library of data sets can have from about 100,000 to about 500,000 data sets, for example, about 150,000 to about 250,000 data sets.
  • According to various embodiments, an algorithm is provided that can select a minimum subset of SNPs without loss of haplotype information, or an even smaller subset with some acceptable loss of information. In an example, a SNP set was reduced by 18% for an African American population and by 32% for a Caucasian population with no loss of haplotype distribution information. The algorithm can produce optimal results in a reasonable time. The algorithm can allow for the real-time calculation of minimum SNP subsets for haplotype blocks.
  • According to various embodiments, an algorithm to select the minimal subset of SNPs required for capturing the diversity of haplotype blocks or other genetic loci is provided. The algorithm can be used to quickly select the minimum SNP subset with no loss of haplotype information. In addition, the algorithm can be used in a more aggressive mode to further reduce the original SNP set, with minimal loss of information.
  • When SNPs are initially selected for typing, often not much is known about the existence or location of LD blocks, or about the number and relative frequencies of haplotypes within the blocks. It is therefore typical in previous efforts to “over-sample” the chromosomal region, (i.e., select SNPs as densely as one's budget permits). Since there may be large costs associated with detecting the genotype for each SNP, it can be practical to minimize the number of SNPs used in a study. When beginning with a population sample large enough to allow for accurate inference of the haplotype distributions, various embodiments can reduce the set of SNPs to the minimum number required for adequate coverage with no loss of haplotype information. Furthermore, various embodiments can be used to eliminate additional SNPs while minimizing loss of haplotype information.
  • According to various embodiments, family relationships of the DNA donors, if available, can be used to increase haplotype inference accuracy. In the absence of family information, the Expectation-Maximization algorithm introduced by Excoffier and Slatkin can be accurate, especially in regions of low diversity. According to various embodiments, the analysis of haplotype distributions in genetic studies aimed at finding susceptibility mutations in case-control populations can be useful in finding associations. Therefore, haplotype inference can be used in disease and pharmacogenomic studies.
  • According to various embodiments, given a block containing N SNPs and M haplotypes, a probability vector P of length M can be defined where Pi is the relative frequency of the ith haplotype. According to various embodiments, A, a haplotype/SNP allele state matrix of N columns and M rows is defined, wherein Aij (the ith row of the jth column of the matrix) indicates the allele state (‘1’ or ‘2’) of the jth SNP for the ith haplotype. The algorithm can eliminate columns of A while preserving as much of the information in P as possible. Quantifying the information in P can be defined using the Shannon Entropy equation:
    $$H = -\sum_{i=1}^{M} P_i \ln(P_i)$$
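  • As a minimal sketch of this entropy measure, the function below evaluates H for a probability vector P. The function name, the `base` parameter, and the four-haplotype example distribution are assumptions made for illustration; the option to use base 2 is included because the entropies in the examples that follow are reported in bits.

```python
import math


def shannon_entropy(p, base=math.e):
    """H = -sum(P_i * log(P_i)); zero-frequency haplotypes contribute nothing."""
    if abs(sum(p) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    return -sum(x * math.log(x, base) for x in p if x > 0)


# Hypothetical block with four equally common haplotypes.
uniform = [0.25, 0.25, 0.25, 0.25]
h_nats = shannon_entropy(uniform)           # natural logarithm
h_bits = shannon_entropy(uniform, base=2)   # same quantity expressed in bits
```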
  • According to various embodiments, it can be useful to use an algorithm in lossless mode. According to such embodiments, it can be irrelevant which information measure is used for the haplotype distribution. The algorithm can use other measures of information.
  • According to various embodiments, the algorithm can consist of two phases (phases I and II). These phases can be performed sequentially. The operations in lossless mode can be outlined as follows.
  • In the first phase (Phase I), any column that is identical to another column, or is the exact opposite of another column, can be eliminated. A column in a matrix that is identical to another column can represent a SNP that behaves identically to another SNP for all tested samples. Thus, when the number of DNA samples is large enough to infer the major haplotypes, the redundant SNP will not provide any additional information. Similarly, when a column is the exact opposite of another column in the matrix, this represents a SNP where the behavior can always be predicted from the behavior of another SNP simply by inverting it. Therefore, according to such embodiments, this SNP will not provide new information. According to various embodiments, after phase I, it can be assumed that N columns of matrix A have been reduced to N′ unique columns where N′≦N.
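  • Phase I as described above can be sketched as follows. The function name and the choice to keep the first occurrence of each redundant column are assumptions; the matrix is the one tabulated in Table 1 of Example 1.

```python
def phase_one(matrix):
    """Return indices of columns kept after dropping duplicates and exact opposites.

    `matrix` is a list of haplotype rows; each entry is an allele state, 1 or 2.
    """
    flip = {1: 2, 2: 1}
    seen = set()
    kept = []
    for j in range(len(matrix[0])):
        column = tuple(row[j] for row in matrix)
        opposite = tuple(flip[a] for a in column)
        if column in seen or opposite in seen:
            continue  # redundant: identical to, or the inverse of, a kept column
        seen.add(column)
        kept.append(j)
    return kept


# Allele state matrix of Table 1 (rows: haplotypes 1-4; columns: SNP1-SNP4).
table_1 = [
    [1, 1, 1, 2],
    [2, 2, 1, 1],
    [2, 2, 2, 1],
    [1, 2, 2, 2],
]
kept = phase_one(table_1)  # zero-based column indices
```

  • For this matrix the sketch keeps the columns for SNP1, SNP2, and SNP3 and drops SNP4, which is the exact opposite of SNP1.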
  • According to various embodiments, in the second phase (Phase II) any column whose elimination does not reduce the number of unique rows can be eliminated. Each row in a matrix can represent the allelic states of the SNPs for a specific haplotype. Removing a “useful” SNP can eliminate the ability to detect at least one haplotype. In that case, two or more haplotypes register the same allelic state at the remaining SNPs, thereby reducing the number of unique rows. Therefore, if the elimination of a column does not reduce the number of unique rows, the corresponding SNP can be omitted.
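  • A lossless Phase II can be sketched as a single greedy pass over the columns. This is an assumption about one way to apply the rule: the order in which columns are tried is not specified above, and a greedy pass is not guaranteed to find the smallest possible subset for every matrix. The function names are hypothetical, and the matrix is the one shown in Table 2 of Example 1.

```python
def count_unique_rows(matrix, columns):
    """Number of distinct haplotype rows when only `columns` are observed."""
    return len({tuple(row[j] for j in columns) for row in matrix})


def phase_two_lossless(matrix, columns):
    """Drop, one at a time, columns whose removal keeps every row distinct."""
    kept = list(columns)
    target = count_unique_rows(matrix, kept)
    for j in list(kept):
        trial = [c for c in kept if c != j]
        if trial and count_unique_rows(matrix, trial) == target:
            kept = trial  # column j carried no additional information
    return kept


# Table 2 matrix (rows: haplotypes 1-4; columns: SNP1, SNP2, SNP3).
table_2 = [
    [1, 1, 1],
    [2, 2, 1],
    [2, 2, 2],
    [1, 2, 2],
]
kept = phase_two_lossless(table_2, [0, 1, 2])  # zero-based column indices
```

  • Here only the SNP2 column can be removed without merging two haplotypes, leaving SNP1 and SNP3.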
  • Phase I can be a “sub-set” of phase II, in the sense that if phase I is skipped, phase II can eliminate the SNPs that phase I would have eliminated. Phase I can be computationally easier to perform than phase II, for example, in lossy mode. Therefore it can be more efficient to begin with phase I.
  • EXAMPLE 1
  • Example 1 illustrates a method according to various embodiments, in lossless mode. Four SNPs and four haplotypes that yield the following allelic responses are illustrated in Table 1.
    TABLE 1
    Haplotype/SNP Allele State Matrix
    SNP1 SNP2 SNP3 SNP4
    Haplotype 1 1 1 1 2
    Haplotype 2 2 2 1 1
    Haplotype 3 2 2 2 1
    Haplotype 4 1 2 2 2
  • The fourth column is the exact opposite of the first column. This implies that either SNP4 or SNP1 is redundant. If SNP4 is removed from the SNP set, no information is lost. When SNP1 registers allele “1”, the state of SNP4 is known as allele “2”, and conversely, when SNP1 registers allele “2”, the state of SNP4 is known as allele “1”. Removing SNP4 leaves the matrix seen in Table 2.
    TABLE 2
    Haplotype/SNP Allele State Matrix after Phase I
    SNP1 SNP2 SNP3
    Haplotype 1 1 1 1
    Haplotype 2 2 2 1
    Haplotype 3 2 2 2
    Haplotype 4 1 2 2
  • The three columns are unique (including accounting for opposites), thus phase I is complete. N=4 has been reduced to N′=3, as phase II is entered.
  • Table 3 depicts the three remaining matrices, following the removal of SNP1, SNP2, or SNP3, respectively. The first and the third matrices only have three unique rows, whereas the second matrix has four unique rows. Thus, if the haplotype list is exhaustive, SNP2 can be eliminated with no loss of haplotype detection.
    TABLE 3
    Three Possible Haplotype/SNP Allele State Matrices
                  | SNP2 SNP3 | SNP1 SNP3 | SNP1 SNP2
    Haplotype 1   | 1    1    | 1    1    | 1    1
    Haplotype 2   | 2    1    | 2    1    | 2    2
    Haplotype 3   | 2    2    | 2    2    | 2    2
    Haplotype 4   | 2    2    | 1    2    | 1    2
  • According to various embodiments, the set {SNP1, SNP3} can provide the same haplotype detection ability as the full set {SNP1, SNP2, SNP3, SNP4}. In Example 1, each phase can cause the elimination of exactly one SNP. However, according to various embodiments, each phase can result in the elimination of multiple SNPs or no SNPs.
  • EXAMPLE 2
  • Further elimination of SNPs, beyond the lossless elimination shown in Example 1 above, can be implemented. According to such embodiments, the retained SNP set can be optimized to minimize the loss of haplotype detection. Phase I can remain unchanged and phase II can select the optimal SNPs to eliminate.
  • According to various embodiments, for the $\binom{N'}{k}$ possible selections of k SNPs, the entropy H for the resulting P is computed. The selection with the highest H can be chosen as the best selection. When only k out of N′ SNPs are selected, N′−k columns can be eliminated. The resulting matrix (with k columns) can have fewer unique rows than the full matrix (with N′ columns). Several “minor” haplotypes can be measured as a single “major” haplotype when a row is repeated more than once. This can occur because with fewer SNPs the ability to make the finer distinction between them is lost. According to various embodiments, the relative frequency (probability) of a “major” haplotype is equal to the sum of the frequencies of the “minor” haplotypes. According to such embodiments, when elimination of columns results in repeating rows, the repeating rows can be combined into a single row, and their respective probabilities can be summed to form a new probability. The vector P can be shorter and can have larger numbers. This can reduce the value of the entropy, H. According to various embodiments, the combination with the smallest reduction of entropy can be deemed the optimal selection. According to various embodiments, if all the rows are unique after elimination of N′−k columns, the entropy is not reduced, and k SNPs can be used with no loss of information, as in Example 1.
  • Example 2 applies the algorithm in lossy mode to an LD block that was discovered using the Caucasian population panel, in Chromosome 6, overlapping the human gene TTK (RefSeq ID NM003318, Celera ID hCG401205). The block consists of 17 SNPs, and the EM algorithm inferred 8 haplotypes, with two major ones: haplotype 2 and haplotype 7 with frequencies of approximately 43% and 33%, respectively. The remaining 24% of the diversity is spread among the remaining 6 haplotypes. Table 4 summarizes the allelic states of the 17 SNPs, as well as the respective probability, for each of the 8 haplotypes.
    TABLE 4
    Original Haplotype/SNP Allele State Matrix
    Haplotype SNP No.
    Number P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
    1 0.1136 1 1 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2
    2 0.4318 1 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 2
    3 0.0114 1 1 1 2 2 1 2 1 2 1 2 2 1 1 1 1 1
    4 0.0454 1 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1
    5 0.0454 2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2
    6 0.0118 2 2 2 2 2 1 2 1 2 1 2 1 1 1 1 1 1
    7 0.3292 2 2 2 2 2 1 2 1 2 1 2 2 1 1 1 1 1
    8 0.0114 2 2 2 2 2 1 2 1 2 2 1 2 1 1 1 1 1
  • After running phase I of the algorithm, the number of SNPs is reduced almost immediately to 7, with the remaining SNP set being {SNP1, SNP2, SNP4, SNP10, SNP12, SNP16, SNP17}. All the haplotype information is preserved, including haplotype distribution, as shown in Table 5. The entropy of the original distribution of haplotypes is H(P)=2.0351 bits. Phase II of the algorithm was then performed.
    TABLE 5
    Haplotype/SNP Allele State Matrix After Phase I
    Haplotype
    Number P SNP1 SNP2 SNP4 SNP10 SNP12 SNP16 SNP17
    1 0.1136 1 1 1 2 1 1 2
    2 0.4318 1 1 1 2 1 2 2
    3 0.0114 1 1 2 1 2 1 1
    4 0.0454 1 2 1 2 1 1 1
    5 0.0454 2 1 1 2 1 1 2
    6 0.0118 2 2 2 1 1 1 1
    7 0.3292 2 2 2 1 2 1 1
    8 0.0114 2 2 2 2 2 1 1
  • Table 6 shows the optimum SNP subset for k SNPs out of the 7 SNPs that survived phase I. Note that the haplotype information is fully preserved all the way down to k=5; therefore, if a lossless version of Phase II were run, the minimum SNP subset size with no loss of information would be 5. At a SNP subset size of 4, some information is lost: haplotype 8, the rarest haplotype, is merged into haplotype 7, and the ability to distinguish between the two is lost. As a result, 3.5% of the entropy of the original haplotype distribution is lost. The optimal selection of 3 SNPs causes haplotypes 3 and 4 to merge and causes haplotypes 6, 7, and 8 to merge, with a total loss of 9.2% of the original entropy. The optimal single SNP is SNP16. With SNP16 alone, the detection ability is reduced to “haplotype 2” or “other.” Since haplotype 2 is the most common, with 43.2% of the frequency, if only a single SNP were chosen, SNP16 would be the most useful choice.
    TABLE 6
    Lossy Min. SNP Set Example
    No. of SNPs (k) | No. of Combinations $\binom{8}{k}$ | Optimal Set of k SNPs | Haplotype Distribution Resulting from the Optimal SNP Set | Resulting Entropy (H) (bits)
    7 | 8 | {SNP1, SNP2, SNP4, SNP10, SNP12, SNP16, SNP17} | (0.114, 0.432, 0.011, 0.045, 0.045, 0.012, 0.329, 0.011) | 2.0351
    6 | 28 | {SNP1, SNP4, SNP10, SNP12, SNP16, SNP17} | (0.114, 0.432, 0.011, 0.045, 0.045, 0.012, 0.329, 0.011) | 2.0351
    5 | 56 | {SNP1, SNP10, SNP12, SNP16, SNP17} | (0.114, 0.432, 0.011, 0.045, 0.045, 0.012, 0.329, 0.011) | 2.0351
    4 | 70 | {SNP1, SNP12, SNP16, SNP17} | (0.114, 0.432, 0.011, 0.045, 0.045, 0.012, 0.341) | 1.9631
    3 | 56 | {SNP1, SNP16, SNP17} | (0.114, 0.432, 0.057, 0.045, 0.352) | 1.8475
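  • The entropies of Example 2 can be recomputed from the Table 4 data with a short sketch. The function names are hypothetical, and a base-2 logarithm is assumed because the entropies are reported in bits. The search is an exhaustive scan over subsets of the phase I survivors; where several subsets tie for the highest entropy, `max` returns the first one encountered, which need not be the set listed in Table 6.

```python
import math
from itertools import combinations

# Table 4: haplotype frequencies P and allele states for SNP1..SNP17.
P = [0.1136, 0.4318, 0.0114, 0.0454, 0.0454, 0.0118, 0.3292, 0.0114]
ROWS = """
1 1 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2
1 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 2
1 1 1 2 2 1 2 1 2 1 2 2 1 1 1 1 1
1 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1
2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2
2 2 2 2 2 1 2 1 2 1 2 1 1 1 1 1 1
2 2 2 2 2 1 2 1 2 1 2 2 1 1 1 1 1
2 2 2 2 2 1 2 1 2 2 1 2 1 1 1 1 1
"""
A = [[int(x) for x in line.split()] for line in ROWS.strip().splitlines()]


def phase_one(matrix):
    """Keep the first of each group of identical or exactly opposite columns."""
    seen, kept = set(), []
    for j in range(len(matrix[0])):
        col = tuple(row[j] for row in matrix)
        inv = tuple(3 - a for a in col)  # swaps allele states 1 and 2
        if col not in seen and inv not in seen:
            seen.add(col)
            kept.append(j)
    return kept


def merged_entropy_bits(matrix, p, columns):
    """Entropy after summing frequencies of haplotypes that look identical."""
    merged = {}
    for row, freq in zip(matrix, p):
        key = tuple(row[j] for j in columns)
        merged[key] = merged.get(key, 0.0) + freq
    return -sum(q * math.log2(q) for q in merged.values() if q > 0)


def best_subset(matrix, p, columns, k):
    """Exhaustively pick the k columns that retain the most entropy."""
    return max(combinations(columns, k),
               key=lambda s: merged_entropy_bits(matrix, p, s))


survivors = phase_one(A)
survivor_names = [f"SNP{j + 1}" for j in survivors]
entropy_by_k = {
    k: merged_entropy_bits(A, P, best_subset(A, P, survivors, k))
    for k in range(3, len(survivors) + 1)
}
```

  • Under these assumptions the sketch reproduces the seven surviving SNPs of Table 5 and, to within rounding, the entropies in the last column of Table 6.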
  • To validate and assess the utility of the algorithm, genotyping data was used from 11,160 SNPs distributed in a gene-centric fashion across chromosomes 6, 21, and 22, with intragenic spacing averaging 12 Kb, 8 Kb, and 9 Kb, respectively. The SNPs were scored with 5′ nuclease assays including TAQMAN-MGB probes from Applied Biosystems' Assays-on-Demand™ SNP Genotyping Products (Foster City, Calif., USA). The samples typed included 45 African-American and 45 Caucasian DNAs from the Coriell Human Diversity Collection available from Coriell Institute for Medical Research, Camden, N.J., USA. LD blocks and haplotypes were computed independently for each population using methods described in Abecasis, et al., Merlin—rapid analysis of dense genetic maps using sparse gene flow trees, Nat Genet 30:97-101 (2002) and Gabriel et al., The structure of haplotype blocks in the human genome, Science 296:2225-2229 (2002), both of which are herein incorporated in their entireties by reference. Only blocks of 3 or more SNPs were considered. Therefore, only 4,864 SNPs were used for the African-American population and 7,347 SNPs were used for the Caucasian population. The Caucasian population is known, in general, to have more and longer LD blocks. The algorithm was implemented in MATLAB v 6.1, available from The MathWorks Inc., Natick, Mass., USA, without further optimization. The computations were completed on a 700 MHz PC in less than 1 minute.
  • Table 7 summarizes the results after applying the algorithm to the haplotype blocks detected in data for chromosomes 6, 21, and 22. The African-American population panel is denoted by ‘A’ and the Caucasian population panel is denoted by ‘C’.
    TABLE 7
    Results Summary
    Chr. | Pop. | Total No. of SNPs | Mean Spacing Between SNPs (bp) | Mean Spacing Between SNPs in Genes (bp) | No. of Haplotype Blocks | Mean Block Size (bp) | Mean SNPs per Block | Mean Min. SNP per Block, Lossless | Mean Min. SNP per Block, <10% Loss
    6 | A | 2,504 | 24,386 | 10,840 | 646 | 23,000 | 3.88 | 2.94 | 2.44
    6 | C | 4,009 | 23,694 | 10,630 | 883 | 34,000 | 4.54 | 2.86 | 2.33
    21 | A | 955 | 12,424 | 7,382 | 242 | 14,933 | 3.95 | 2.92 | 2.39
    21 | C | 1,555 | 11,921 | 7,031 | 336 | 21,032 | 4.63 | 2.88 | 2.32
    22 | A | 1,405 | 10,041 | 6,035 | 350 | 13,714 | 4.01 | 2.99 | 2.47
    22 | C | 1,783 | 9,080 | 7,760 | 417 | 17,505 | 4.28 | 2.81 | 2.27
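  • The per-block SNP reduction implied by Table 7 can be estimated with the sketch below. It divides mean values taken from the last three columns of the table, so the result is only an approximation of the reduction over all blocks; the dictionary layout and function name are assumptions made for illustration.

```python
# (chromosome, population) -> (mean SNPs per block,
#                              mean min. SNPs lossless,
#                              mean min. SNPs with <10% loss), from Table 7.
table_7 = {
    (6, "A"): (3.88, 2.94, 2.44),
    (6, "C"): (4.54, 2.86, 2.33),
    (21, "A"): (3.95, 2.92, 2.39),
    (21, "C"): (4.63, 2.88, 2.32),
    (22, "A"): (4.01, 2.99, 2.47),
    (22, "C"): (4.28, 2.81, 2.27),
}


def reduction(original, reduced):
    """Fractional reduction in the mean number of SNPs per block."""
    return 1.0 - reduced / original


lossless = {key: reduction(v[0], v[1]) for key, v in table_7.items()}
lossy = {key: reduction(v[0], v[2]) for key, v in table_7.items()}
```

  • For chromosome 6, for example, this gives a lossless reduction of roughly 24% for the African-American panel and roughly 37% for the Caucasian panel.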
  • FIG. 1 illustrates the relationship between the original number of SNPs in an LD block (horizontal axis) and the minimum number of SNPs required to genotype the LD block with no loss of information (vertical axis). The thickness of the ‘x’ corresponds to the number of different blocks found in chromosome 6 with the same properties.
  • Previous algorithms for finding the minimum SNP sub-set have been concerned with complete genes or randomly selected loci, as opposed to LD blocks. In such previous efforts, the number of haplotypes, and more importantly, the amount of information in the haplotype distribution, was expected to be much higher than those presented in the above examples. As a result, these previous efforts were challenged by the numerical complexity, and thus locally optimal solutions were sought. In contrast, algorithms according to various ones of the present embodiments can compute the global optimum, whether in lossless or lossy mode, making them superior to previous efforts.
  • The method described by Judson et al., How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 3:379 (2002) is essentially equivalent to phase II of the lossy version of various embodiments except that the algorithm is limited to k≦11. This is expected, because without the efficient pruning of SNPs performed by phase I, the exponential nature of phase II can result in practically infinite execution time. For example, the largest block found in the above examples consisted of 22 SNPs. In previous efforts, comparing sub-sets of 1 to 11 out of 22 SNPs required examining over 2.4 million combinations. Even after that computation, the optimal solution is not assured since it is only a local optimum. In general, sub-sets of 12 to 22 would also need to be examined in order to assure a global optimum, bringing the total number of combinations to almost 4.2 million. Various embodiments can use phase I to quickly reduce the 22 SNPs to a subset of 4 SNPs. As a result, phase II can find the global optimum (3 SNPs in lossless mode or 2 SNPs with less than 10% loss) in 15 comparisons only.
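  • The combination counts quoted in this paragraph can be checked with a few lines of arithmetic. The variable names are hypothetical, and the sketch assumes that every non-empty subset of a given size counts as one comparison.

```python
from math import comb

n = 22  # SNPs in the largest block

# Subsets of size 1 through 11 out of 22 SNPs.
up_to_11 = sum(comb(n, k) for k in range(1, 12))

# Subsets of every size from 1 through 22.
all_sizes = sum(comb(n, k) for k in range(1, n + 1))

# Non-empty subsets of the 4 SNPs that remain after phase I.
after_phase_one = sum(comb(4, k) for k in range(1, 5))
```

  • These evaluate to 2,449,867, 4,194,303, and 15, respectively, in line with the figures of over 2.4 million, almost 4.2 million, and 15 given above.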
  • The method described in the on-line supplement to the paper by Johnson et al., Haplotype tagging for the identification of common disease genes, Nat Genet 29:233-237 (2001) compromises the maximization of the information detected by the SNP set with other considerations, e.g. maximization of the individual SNPs' properties. There is little explanation of the method details. One example provided tries to show that a haplotype matrix with full rank cannot be pruned. Note, however, that the haplotype matrix used in Example 1 and shown in Table 1 herein is of full rank, yet was pruned with no loss of information. A counterexample is the 2 by 2 identity matrix, which is known to be full rank, but has the second column as the perfect inverse of the first column, thus providing no new information. The Johnson et al. on-line supplement also provides executable programs, but the maximum subset size is set to k≦5, thereby guaranteeing suboptimal results based on the finding that the global optimum is greater than 5 in some blocks.
  • Previous reports have suggested that the Caucasian population has longer LD blocks than the African-American population, which is considered more diverse. Examples 1 and 2 herein used the same SNPs for both populations, but kept only SNPs which formed LD blocks of 3 or more SNPs. This left more SNPs for the Caucasian population. As Table 7 shows, the Caucasian population yielded more blocks, and more SNPs per block. However, after “compression” of the SNP sets of each block into the minimum required to represent the information (with no loss), the two populations are almost the same. This can reflect the arbitrary criteria that define an LD block, and can reflect that the criteria were applied uniformly to both populations.
  • The premise of the algorithm, according to various embodiments, is that the DNA sample size for each population is large enough so that the inferred haplotypes adequately represent reality. There can be a risk that a SNP whose behavior is identical to another SNP (and thus deemed worthless in terms of new information) for the sample size used, could differentiate an additional haplotype inferred with a larger sample size. This risk can be lower for common haplotypes. Rare haplotypes can harbor a causative mutation and can be present in higher frequency in some cases. Experimental errors can eliminate data points and thus can render suboptimal the minimum SNP subset. Therefore, additional SNPs can be added to the minimum SNP subset to enhance robustness.
  • Data was provided from over 11,000 SNPs with an average spacing of 6 to 11 Kb, across all the genes of chromosomes 6, 21, and 22, and typed on DNA samples of 45 unrelated African-Americans and 45 Caucasians from the Coriell Human Diversity Collection. With no loss of information, the number of SNPs required to capture the haplotype block diversity was reduced by 25% for the African-American population and by 36% for the Caucasian population. With a maximum loss of 10% of haplotype distribution information, the SNP reduction was 38% and 49%, respectively, for the two populations. All computations were performed in less than 1 minute for the dataset used.
  • The availability of human genome data, including putative single nucleotide polymorphisms (SNPs), can enable new clinical and research methods. According to various embodiments, databases containing information on SNPs can be used for conducting genetic studies. Putative SNPs can be validated and can be assembled into a standardized SNP marker map or a database containing data sets corresponding to the standardized SNP marker map. SNP information can be easily accessible and standardized assay reagents can be developed, validated, and made available to enable high throughput and automation, for example, to screen many SNPs on many individuals.
  • According to various embodiments, a reference SNP database can be produced from SNP and genomic information from both proprietary and public databases. The database can be used for linkage disequilibrium (LD) mapping and can be used to provide, for example, validated, ready-to-use assays and reagents. The database can provide high-density coverage of any known gene regions and can enable easier and more affordable candidate gene association studies and candidate region association studies. Researchers can select SNPs across candidate genes or chromosomal regions that are most suitable for a given study, and can quickly translate that information into practice by, for example, directly obtaining assay protocols and reagents for those SNPs. A “core” set of SNPs and the associated assay reagents can be compiled into a database or library and expanded or refined as additional information becomes available, such as haplotype definition for some or all of the genome. The SNP database can also be used to compile a fixed set of chromosome-based assays for cost-efficient whole genome association (WGA) studies using, for example, oligonucleotide ligation assay (OLA) PCR Bead Array systems for ultra-high throughput genotyping.
  • Linkage disequilibrium (LD) is the non-random association of alleles in a chromosomal segment, and can be the basis of all genetic mapping. Selecting SNPs as genetic markers for LD studies involves considering all genetic and assay-specific technical factors that affect the ability to find association between a marker and the susceptibility mutations being mapped.
  • According to various embodiments, the extent of LD across a genomic region can dictate the SNP density necessary to ensure association between a marker and the allele sought. Early attempts to model the extent of LD predicted very short LD of only a few kilobases (kb). However, recent empirical surveys report average LD levels between 5 kb and 60 kb, and extending up to hundreds of kb, which implies that the number of SNPs required for WGA studies could range from 50,000 to 250,000, and that markers spaced by tens of kb will suffice for candidate gene studies. Common SNPs are the most likely to be useful for LD studies across more than one population since they represent ancient mutations that arose before ethnic group segregation. Simulation studies suggest that common SNPs are more likely than coding SNPs (cSNPs) to be in LD with a given causative allele regardless of whether the allele is present at low or high frequency.
  • According to various embodiments, common SNPs can be used to assemble a database in a hybrid gene-based approach. SNPs can be considered “common” when the minor allele frequency is, for example, greater than 15% in at least one of the populations used for validation. A gene list can include 25,083 gene regions derived by Celera Genomics. A gene region can be defined as bounded by the first and last transcribed base, including untranslated regions, plus 10 kilobases (kb) upstream and downstream to account, for example, for uncharacterized exons and regulatory regions. SNPs can be selected within gene regions at an average density of, for example, one SNP per 10 kb, such that the map can resemble a gene-focused picket fence. Density for specific regions can be adjusted as data on recombination and LD extent emerges. Additional SNPs in intergenic regions, such as, for example, non-coding regions of homology between mouse and human, can be added to a database or library.
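  • As a minimal, non-limiting sketch of the gene-region definition and the "picket fence" selection described above, the following Python code pads a transcribed span by 10 kb on each side and then keeps SNPs roughly 10 kb apart. The names GeneRegion, gene_region, and pick_picket_fence are hypothetical, and the greedy spacing rule is an assumption chosen for illustration; the text specifies only an average density, not a selection algorithm.

```python
from dataclasses import dataclass

FLANK_BP = 10_000           # 10 kb upstream and downstream, per the text
TARGET_SPACING_BP = 10_000  # about one SNP per 10 kb, per the text


@dataclass(frozen=True)
class GeneRegion:
    chrom: str
    start: int  # inclusive coordinate
    end: int    # inclusive coordinate


def gene_region(chrom, first_transcribed, last_transcribed, flank=FLANK_BP):
    """Pad the transcribed span by `flank` bases on each side."""
    return GeneRegion(chrom, max(1, first_transcribed - flank), last_transcribed + flank)


def pick_picket_fence(region, snp_positions, spacing=TARGET_SPACING_BP):
    """Greedy rule (an assumption): keep a SNP inside the region if it lies
    at least `spacing` bases beyond the previously kept SNP."""
    chosen = []
    for pos in sorted(snp_positions):
        if not (region.start <= pos <= region.end):
            continue
        if not chosen or pos - chosen[-1] >= spacing:
            chosen.append(pos)
    return chosen
```

  • For example, a gene transcribed from base 50,000 to base 80,000 yields a region spanning 40,000 to 90,000, and candidate SNPs outside that span are ignored.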
  • Over four million unique SNPs have been reported. However, literature reports state that as few as 50% of SNPs randomly selected from public databases are polymorphic and can yield working assays. According to various embodiments, obtaining a validated SNP assay at the end of a process of validating possible SNPs can be enhanced by defining prioritization criteria. One criterion that can be used to validate a possible SNP is evidence of independent discovery of the minor allele. A data set corresponding to a possible SNP can be cross-referenced against data available in public sources. When a data set corresponding to a possible SNP has no equivalent in the public domain, observation of the minor allele in genotypes of two independent donors can be used to validate the SNP. When a data set corresponding to a possible SNP has a minor allele that is found in only one donor, an independent instance of the minor allele found by searching the consensus assembly of the public Human Genome Project can be used to validate the SNP. A SNP database obtained using the above criteria can include about one million data sets corresponding to SNPs.
  • According to various embodiments, methods of confirming the existence of a SNP are provided. One step can include identifying a location corresponding to a possible SNP in a polynucleotide in a first collection of data sets. The first collection can contain information on genomic deoxyribonucleic acid (DNA) samples in the form of data sets corresponding to polynucleotides. Another step can include confirming the existence of the SNP if at least one of a number of conditions is present or met. A condition can be that a second collection of data sets containing information on genomic DNA samples contains information that identifies the location as containing the possible SNP. A condition can be that at least two data sets from the first collection of data sets contain information corresponding to a minor allele of the possible SNP at the location. According to various embodiments, the at least two data sets representing genomic DNA samples are obtained from two independent sources. A condition can be that a data set that corresponds to a consensus sequence of genomic DNA samples in a third collection of data sets has the minor allele of the possible SNP. According to various embodiments, the source of the genomic DNA of the consensus sequence and the sources of the genomic deoxyribonucleic acid (DNA) samples from the first collection of data sets are independent. The third database of genomic DNA samples can be, for example, a public database of the Human Genome Project. The first database of at least one genomic DNA sample can be, for example, a proprietary database of the Human Genome Project.
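  • The confirmation conditions above can be sketched, purely for illustration, as a single predicate that returns true when at least one condition is met. The function name confirm_snp and its parameters are hypothetical, and representing each data set's source by a donor identifier is an assumption made for this sketch.

```python
def confirm_snp(in_second_collection, minor_allele_donors,
                consensus_has_minor_allele=False,
                consensus_is_independent=False):
    """Return True if at least one confirmation condition is met.

    in_second_collection: a second collection reports a SNP at this location.
    minor_allele_donors: donor identifiers (hypothetical representation) of
        first-collection data sets that carry the minor allele.
    consensus_has_minor_allele: the consensus sequence in a third collection
        carries the minor allele.
    consensus_is_independent: the consensus source is independent of the
        first-collection donors.
    """
    # Condition 1: the location is independently reported in a second collection.
    if in_second_collection:
        return True
    # Condition 2: the minor allele is seen in at least two independent donors.
    if len(set(minor_allele_donors)) >= 2:
        return True
    # Condition 3: an independent consensus sequence carries the minor allele.
    if consensus_has_minor_allele and consensus_is_independent:
        return True
    return False
```

  • In this sketch, two data sets from the same donor count as one source, reflecting the embodiments in which the at least two data sets are obtained from independent sources.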
  • According to various embodiments, a multi-step, high-throughput assay design pipeline can be provided to ensure optimum performance of assays. The methods provided can enable automation, minimize assay failure, and ensure compatibility of the SNP sequence with, for example, TAQMAN probe-based 5′ nuclease chemistry, available from Applied Biosystems, Foster City, Calif., and/or other assay formats. A stringent scoring system can be used to select only those SNP context sequences with the highest probability of success.
  • According to various embodiments, a bioinformatics process can be used to design assays. A step can involve masking SNPs adjacent to the target polymorphism and/or any sequence discrepancy between the Celera and the HGP human genome assembly, within the 600 bases of a context sequence. This can prevent primers and probes from being placed on top of other SNPs and can maximize the chance that the probes will hybridize to the correct genomic sequence.
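  • A minimal sketch of the masking step is shown below. It assumes, for illustration only, that masked positions are written as the character "N" and that a downstream design tool avoids placing primers and probes over masked bases; the function name mask_context and the zero-based index convention are hypothetical.

```python
def mask_context(context, target_index, other_variant_indices, mask_char="N"):
    """Mask neighboring SNPs and assembly discrepancies in a context sequence.

    context: the context sequence as a string (for example, 600 bases).
    target_index: zero-based index of the target polymorphism, never masked.
    other_variant_indices: zero-based indices of adjacent SNPs or of
        discrepancies between assemblies.
    """
    chars = list(context)
    for i in other_variant_indices:
        # Skip the target itself and any index outside the sequence.
        if i != target_index and 0 <= i < len(chars):
            chars[i] = mask_char
    return "".join(chars)


masked = mask_context("ACGTACGTAC", target_index=4, other_variant_indices=[1, 4, 8])
print(masked)  # ANGTACGTNC
```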
  • TAQMAN 5′ nuclease primers and probes can be designed using, for example, the ASSAYS-BY-DESIGN custom oligonucleotide reagent service (Applied Biosystems, Foster City, Calif.). Oligonucleotides can be designed in batch mode without manual intervention, and a scoring scheme can select the best sequences for a given SNP. The design algorithm can implement thermodynamic and heuristic rules and additional empirically-derived factors can increase manufacturability and assay performance. According to various embodiments, probes can be designed successfully for, for example, 97% of SNPs. After this step, a further computational quality-control step can also be performed in the context of the genome that can allow the elimination of potentially problematic SNP targets that may arise from repeated genomic regions, pseudo-SNPs, and/or other possible assembly artifacts.
  • Finally, the primers and probes can be synthesized, and additional quality-control steps can occur. For example, oligonucleotide integrity can be tested. For further example, assay performance can be tested against a panel of 10 DNA samples. According to various embodiments, assays that pass post-manufacturing quality control can be validated in the population panels.
  • Assay validation in population panels can ensure that the locus is polymorphic and that the allele frequency is adequate for association studies in a variety of populations. For example, ninety (90) samples from the Coriell Human Variation Collection were obtained. By obtaining individual genotypes from a panel of 45 African Americans, a panel of 45 Caucasians, and a chimp DNA sample (to provide insight into ancestral alleles), sufficient information was obtained to estimate linkage disequilibrium between the SNPs in the LD map and to computationally infer common haplotypes. Assay validation in population panels can provide additional information on the usefulness of the markers, the coverage provided for a given study, and/or provide an independent assessment of assay performance.
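  • As an illustrative sketch of how a panel's genotypes can indicate whether a locus is polymorphic, the following code estimates the minor allele frequency from per-sample genotype calls. The function name and the encoding of each genotype as a two-letter string (with None for an uncalled sample) are assumptions made for this example.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Estimate minor allele frequency from diploid genotype strings.

    genotypes: iterable of two-letter strings such as "AA" or "AG";
        None or "" marks a sample with no call and is skipped.
    Returns 0.0 when fewer than two alleles are observed (not polymorphic).
    """
    counts = Counter(allele for g in genotypes if g for allele in g)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    # The minor allele is the second most frequent allele at a biallelic SNP.
    return sorted(counts.values(), reverse=True)[1] / total


panel = ["AA"] * 30 + ["AG"] * 12 + ["GG"] * 3  # a hypothetical 45-sample panel
maf = minor_allele_frequency(panel)             # 18 G alleles out of 90
```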
  • According to various embodiments, the performance of each assay that can be comprised of at least one polynucleotide can be benchmarked against criteria, such as, for example: background signal (e.g., low signal in the control experiments run without template); signal generation (e.g., good separation between control experiments run without template and allele clusters); and specificity. A criterion can be that a maximum of three clusters of fluorescing sample and a minimum of two clusters of fluorescing sample must be observed. Another criterion can be that at least 90% of samples yield callable genotypes.
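  • The cluster-count and call-rate criteria above can be sketched as follows. This sketch assumes, for illustration, that each cluster of fluorescing sample corresponds to one distinct genotype label and that uncallable samples are recorded as None; the function name is hypothetical.

```python
def passes_cluster_and_call_rate(calls, min_call_rate=0.90):
    """Check that 2 to 3 genotype clusters are seen and enough samples are callable.

    calls: one genotype label per sample (for example "1/1", "1/2", "2/2"),
        with None for a sample that could not be called.
    """
    if not calls:
        return False
    called = [c for c in calls if c is not None]
    n_clusters = len(set(called))          # assumption: one cluster per label
    call_rate = len(called) / len(calls)
    return 2 <= n_clusters <= 3 and call_rate >= min_call_rate
```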
  • According to various embodiments, a method is provided for compiling a library of polynucleotide data sets that each correspond to polynucleotides that can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B). The method can include the step of selecting for the library polynucleotide data sets that each correspond to a respective polynucleotide that contains a sequence that is complementary to a respective first allele included in each of the at least one target nucleic acid sequences, if, under a set of reaction conditions, a number of parameters are met by each polynucleotide corresponding to the data sets included in the library. These parameters can include: (1) the respective polynucleotide has a background signal value less than or equal to a first defined value, where the background signal value is a first normalized ratio of a fluorescence intensity of the respective polynucleotide reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions; (2) the respective polynucleotide has a signal generation value of greater than or equal to a second defined value, wherein the signal generation value is the difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity and (ii) the background signal value; (3) the respective polynucleotide has a specificity value of less than or equal to a third defined value, wherein the specificity value is the difference between (i) a third normalized ratio of 
the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value; (4) at least one individual from a population of individuals has a genotype identifiable under the first conditions, that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype; and (5) at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence; wherein the population includes at least one individual that has the identifiable minor allele, and at least one individual that does not have the identifiable minor allele. Various embodiments of these and/or other parameters can be used in deciding whether to select a polynucleotide or a polynucleotide data set for a library.
  • According to various embodiments, the first reaction conditions can comprise a 900 nM final primer concentration and a 250 nM final probe concentration under thermal cycling conditions. The first defined value can be about 2.0, the second defined value can be about 1.0, and the third defined value can be about 2.0. At least, for example, 0.01% of individuals from the population can have the identifiable genotype. At least 10% of individuals from the population can have the identifiable genotype. At least 20% of individuals from the population can have the identifiable genotype. The identifiable genotype can result from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence. The reaction can occur under the first conditions. The population can have a frequency of the minor allele of greater than or equal to about 5%. For example, the minor allele frequency can be greater than or equal to about 10%. The minor allele frequency can be greater than or equal to about 15%. According to various embodiments, methods can include not selecting a second polynucleotide data set that corresponds to a second polynucleotide if one or more of parameters (1)-(5), above, is not met by the second polynucleotide.
  • A library of polynucleotide data sets can be compiled using methods according to various embodiments. A library of assays can be compiled using methods according to various embodiments. The method can include manufacturing a library of assays wherein each assay can be made using a polynucleotide data set compiled in the library. A library of polynucleotides can be compiled by manufacturing polynucleotides corresponding to polynucleotide data sets compiled using methods according to various embodiments.
  • According to various embodiments, a method of detecting a SNP can be provided. A step of the method can be reacting a sample containing a target nucleic acid sequence that has a target SNP with an assay selected from the library of assays compiled according to methods described herein. A step can be determining the genotype of the target nucleic acid sequence that has the target SNP by detecting a characteristic attributable to the genotype of the target SNP in the sample.
  • According to various embodiments, a method is provided for compiling a library of polynucleotide data sets that correspond to polynucleotides that each can function as (A) a primer for producing a nucleic acid sequence that is complementary to at least one target nucleic acid sequence including a target SNP, (B) a probe for rendering detectable the at least one target nucleic acid sequence including a target SNP, or (C) both (A) and (B). The method can include the step of determining a background signal value by calculating a first normalized ratio of a fluorescence intensity of a respective polynucleotide that contains a sequence that is complementary to a first allele included in the at least one target nucleic acid sequence, reacted with first assay reactants in the absence of the target nucleic acid sequence, and under first conditions of fluorescence excitation, to a dye fluorescence intensity of a passive-reference dye under the first conditions. The method can include the step of comparing a difference between (i) a second normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with the first assay reactants in the presence of the target nucleic acid sequence, to the dye fluorescence intensity, and (ii) the background signal value. The method can include the step of comparing a difference between (i) a third normalized ratio of the fluorescence intensity of the respective polynucleotide reacted with second assay reactants that contain a second allele included in the at least one target nucleic acid sequence to the dye fluorescence intensity, wherein the second allele differs from the first allele, and (ii) the background signal value. 
The method can include the step of determining whether at least one individual from a population of individuals has a genotype identifiable under the first conditions that results from reacting the respective polynucleotide with the first assay reactants and in the presence of the target nucleic acid sequence, wherein the population includes at least one individual that has the identifiable genotype and at least one individual that does not have the identifiable genotype. The method can include the step of determining whether at least one individual from the population has an identifiable minor allele of the identifiable genotype, under the first conditions, that results from reacting the respective polynucleotide with the first assay reactants in the presence of the target nucleic acid sequence. The method can include a combination of some of or all of these steps.
  • According to various embodiments, the polynucleotide data set corresponding to the respective polynucleotide can be selected for the library if, for example, the background signal value in parameter (1) is less than or equal to about two, if the ratio from the comparison in parameter (2) is greater than or equal to about one, if the ratio from the comparison in parameter (3) is less than or equal to about two, if the at least one individual of parameter (4) has the identifiable genotype, and if the at least one individual of parameter (5) has the identifiable minor allele.
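  • The selection test described above can be sketched as follows, using the example thresholds of about two, about one, and about two. The class and function names are hypothetical, and the sketch assumes that the three normalized ratios (probe fluorescence divided by passive-reference dye fluorescence) have already been measured.

```python
from dataclasses import dataclass


@dataclass
class AssayReadings:
    no_template_ratio: float   # normalized ratio without target (background)
    matched_ratio: float       # normalized ratio with the first allele present
    mismatched_ratio: float    # normalized ratio with the second allele present
    any_identifiable_genotype: bool      # parameter (4)
    any_identifiable_minor_allele: bool  # parameter (5)


def select_for_library(r, max_background=2.0, min_signal=1.0, max_specificity=2.0):
    """Return True if all five parameters are met for one polynucleotide."""
    background = r.no_template_ratio                 # parameter (1)
    signal = r.matched_ratio - background            # parameter (2)
    specificity = r.mismatched_ratio - background    # parameter (3)
    return (background <= max_background
            and signal >= min_signal
            and specificity <= max_specificity
            and r.any_identifiable_genotype
            and r.any_identifiable_minor_allele)
```

  • A polynucleotide that fails any one of the five checks is not selected in this sketch, which mirrors the embodiments in which a data set is not selected if one or more of parameters (1)-(5) is not met.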
  • A library of polynucleotide data sets can be compiled using methods according to various embodiments. A library of polynucleotides can be compiled by manufacturing polynucleotides corresponding to polynucleotide data sets compiled using a method or methods according to various embodiments.
  • A method of compiling a library of assays can be provided. The method can include manufacturing a library of assays, wherein each assay is manufactured using a polynucleotide data set compiled in a library according to various embodiments.
  • According to various embodiments, methods of detecting a SNP can be provided. The method can include the step of reacting a sample containing a target nucleic acid sequence that has a target SNP with an assay selected from the library of assays compiled using a method or methods according to various embodiments. A step can include determining the genotype of the target nucleic acid sequence that has the target SNP by detecting a characteristic attributable to the genotype of the target SNP in the sample.
  • EXAMPLE 3
  • Using chromosome 22 as a pilot project, 2,260 SNP assays were validated. Of those assays tested, 94% of the SNPs tested with population panels were polymorphic and 90% of the assays passed performance criteria. When a minor allele frequency cut-off of 15% or greater was used, 85% of the samples from the African-American panel, and 90% from the Caucasian panel, met the performance criteria. Ninety-nine percent (99%) of SNPs had a minor allele frequency of 15% or greater in at least one of the population panels. Ninety-six percent (96%) of samples had a minor allele frequency of 20% or greater in at least one population tested, and 62% of samples had a minor allele frequency of 20% or greater in both populations tested.
  • According to various embodiments, automatic allele-calling software can be used to analyze validated assay data without user intervention. According to various embodiments, at least, for example, 90% of the assay data can be processed automatically to identify an allele. According to various embodiments, an automated validation process can be used for high volume commercial or research purposes.
  • FIG. 4 is a plot 100 of fluorescence data from many SNP assays according to various embodiments. The x-axis represents relative fluorescence of a 6-FAM dye label and the y-axis represents relative fluorescence of a VIC dye label. Cluster 110 represents the relative fluorescence of control samples having probes labeled with 6-FAM and VIC, respectively. The control samples did not contain a target nucleic acid sequence. The background signal value is the average of the relative fluorescence of the control samples as represented by cluster 110 in FIG. 4. The background signal value in FIG. 4 is less than about 2.0. Cluster 120 represents the relative fluorescence of samples having homozygous alleles (allele 2) that hybridized with probes labeled with 6-FAM. Cluster 130 represents the relative fluorescence of samples having heterozygous alleles (alleles 1 and 2) that hybridized with probes labeled with VIC and 6-FAM, respectively. Cluster 140 represents the relative fluorescence of samples having homozygous alleles (allele 1) that hybridized with probes labeled with VIC. The signal generation value for the assays is based, at least in part, on the average of the relative fluorescence of at least one of clusters 120, 130, and 140.
  • FIG. 5 is a flowchart showing a comparison of reference values determined by experimental assays according to various embodiments against target values. As used herein TBV is a target background value; TsigV is the target signal value; TSpV is the target specificity value; TIP is the target identifiable percentage, or the minimum frequency that the assay produces an identifiable genotype; and TMI is the target minor allele frequency, or the minimum frequency that the minor allele appears in the population.
  • FIG. 6 is a flowchart that illustrates the selection of a data set into a library of data sets, according to various embodiments. Experimental or reference data is obtained by performing an assay against individuals in a population. Experimental data can be obtained at least until results are statistically significant for a given population. When a statistically significant sample size has been determined, the experimental data can be processed according to various embodiments as shown, for example, in FIG. 5. If the selection criteria are met, the data set corresponding to polynucleotides used in the experimental assays can be added into the library. If the selection criteria are not met, the polynucleotide can be redesigned or discarded, and the experimental assays can be repeated.
  • According to various embodiments, SNP assays and reagents can use easily automated chemistry and can be compatible with readily available high-throughput instrumentation and software systems. According to various embodiments, SNP assays and reagents can have few enzymatic steps, no post-reaction transfer of liquids, and/or universal reaction conditions that can facilitate robotic liquid handling automation. The assays, reagents, and/or high-throughput workflow can be easy to implement and automate and/or can use components that are ready-to-use out-of-the-box and require no optimization.
  • TAQMAN probe-based 5′ nuclease assay chemistry, available from Applied Biosystems, Foster City, Calif., can meet almost any assay requirement and can unite PCR amplification and signal generation into a single step, thereby simplifying automation of both reaction set-up and data collection. In the TAQMAN system, a hybridization probe with fluorogenic and quencher tags is cleaved by the 5′ nuclease activity of Thermus aquaticus (Taq) DNA polymerase during PCR amplification. Cleavage produces fluorescence by freeing the fluorogenic molecule from the quencher. By using two probes, one specific to each allele of the SNP and labeled with distinct fluorogenic tags, both alleles can be specifically detected in a single tube. In addition, the fluorescent 5′ nuclease assays can be part of an easy-to-use, automated system for SNP genotyping. FIGS. 2 a-2 e provide an overview of the TAQMAN probe-based 5′ nuclease assay chemistry for SNP genotyping.
  • The TAQMAN system is adapted to provide allelic discrimination and high-throughput SNP genotyping. Chemistry improvements have increased assay design flexibility, enabled easy protocol standardization, enabled the use of universal reagents, and reduced background fluorescence, all of which can be desirable for high throughput SNP processing and allelic discrimination.
  • For example, probes of up to 30 bases were previously required to achieve the specificity needed for scoring SNPs. The conjugation of a DNA minor groove binder (MGB) to a probe can significantly stabilize probe-template complexes, enabling the use of probes in the 13-mer to 20-mer size range. Such conjugated probes can have better mismatch discrimination, can be easier to design for challenging genetic regions such as those high in GC content or those in variable context sequences, and/or can increase the signal-to-noise ratio by bringing the quencher closer to the fluorescent tag. For high-throughput SNP scoring, conjugated probes can increase the melting temperature window that can be used in reaction protocols, thereby allowing all the SNP assays to run under identical conditions.
  • Previous 5′ nuclease quenchers emitted their own fluorescence, which complicated signal detection. New non-fluorescent quenchers can provide improved signal detection and can facilitate automated allele calling, another desirable feature useful for high-throughput SNP scoring.
  • In a pharmacogenomic study, the most precious reagent can be the DNA sample itself. High-quality DNA quantification using real-time analysis on the ABI PRISM 7900HT Sequence Detection System with the TAQMAN RNase P Control Reagents Kit, both available from Applied Biosystems, Foster City, Calif., permits optimal amounts of DNA per reaction to maximize study efficiency. The 7900HT system can use a 5 microliter reaction volume and can consume one nanogram of DNA per genotyping reaction, thereby minimizing reagent costs and conserving DNA template samples.
  • The 5′ nuclease assay can be suited for automation because of its easy, three-step workflow. A universal master mix, including probes and primers, can be added directly to plates of dry or fresh DNA using standard robotics. Plates can be sealed and cycled using standard thermal cyclers, such as, for example, the Applied Biosystems Dual 384-Well GENEAMP PCR System 9700 thermal cycler, available from Applied Biosystems, Foster City, Calif. Following cycling, plates can be automatically read on the 7900HT that can support the collection of more than 250,000 genotypes per day. In addition, the availability of thermal cyclers with automated lid handling can increase throughput by enabling robotics integration for 24-hour unattended operation. Automation software can also increase both quality and throughput. For example, automation of allele calling can remove inter-technician variability, increasing confidence in data quality and reducing the time spent on data analysis by 8.5 person-hours per day. FIG. 3 provides an example of an automated workflow system.
  • According to various embodiments, an assay that uses two different types of probes can be provided wherein the polynucleotide and the reporter dyes differ. For example, the first type of probe can have a first polynucleotide with a VIC reporter dye attached to the 5′ end of the first polynucleotide, and the second type of probe can have a second polynucleotide with a 6-FAM reporter dye attached to the 5′ end of the second polynucleotide, and the first and second polynucleotides can differ by at least one nucleic acid residue at the same location in the polynucleotide when the polynucleotides are aligned 5′ to 3′. The dye-labeled probes can be adapted to perform a heterozygous assay or a homozygous assay.
  • The probe can anneal to a complementary sequence between the forward and reverse primer sites. At the time of annealing, the probe is intact and the proximity of the reporter dye to the quencher can result in suppression of fluorescence of the reporter dye. A polymerase can cleave a reporter dye only when the probe has completely, mostly, or substantially hybridized to the target DNA sequence. When the reporter dye is cleaved from the probe, the relative fluorescence of the reporter dye increases. The increase in relative fluorescence can be caused to only occur if the amplified target DNA sequence is complementary, mostly complementary, or substantially complementary to the probe. Therefore, the fluorescent signal generated by PCR amplification can indicate which alleles are present in a sample. Mismatches between a probe and a target DNA sequence can reduce efficiency of probe hybridization and/or a polymerase can be more likely to displace a mismatched probe without cleaving it and therefore not produce a fluorescent signal. For example, if one of two possible reporter dyes fluoresces during an assay, then the presence of a homozygous gene is indicated. For further example, if both possible reporter dyes fluoresce during an assay, then the presence of a heterozygous gene is indicated.
  • According to various embodiments, at least one primer can be provided, wherein the primer can be a sequence that is shorter than the target DNA sequence. The primer can have a polynucleotide and/or a minor groove binder. The primer can be a sequence that is complementary to, or mostly complementary to, the target DNA sequence. The primer can be at least 90% homologous to a corresponding length of the target DNA sequence, at least 80% homologous to a corresponding length of the target DNA sequence, at least 70% homologous to a corresponding length of the target DNA sequence, or at least 50% homologous to a corresponding length of the target DNA sequence.
  • According to various embodiments, a thermostable DNA polymerase, such as, for example, Thermus aquaticus (Taq), and at least 4 embodiments of a deoxyribonucleic acid (e.g., adenosine, thymine, cytosine, and guanine) can be provided. The polymerase can be, for example, AMPLITAQ GOLD, available from Applied Biosystems, Foster City, Calif. According to various embodiments, components of a fluorogenic 5′ nuclease assay or other assay reagents that utilize 5′ nuclease chemistry, for example, TAQMAN minor groove binder probes, available from Applied Biosystems, Foster City, Calif., can be provided. Some or all of the above-listed components can be replaced by or used with commercially-available products, for example, buffers or AMPLITAQ GOLD PCR MASTER MIX (Applied Biosystems, Foster City, Calif.).
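  • The interpretation of reporter dye signals described above can be sketched as a simple lookup. The mapping of the VIC dye to allele 1 and the 6-FAM dye to allele 2 follows the description of FIG. 4; the function name, the boolean inputs, and the returned labels are assumptions for illustration, and a practical system would first decide from measured intensities whether each dye fluoresced.

```python
def call_genotype(vic_fluoresced, fam_fluoresced):
    """Translate which reporter dyes fluoresced into a genotype label.

    vic_fluoresced: True if the VIC-labeled probe (allele 1) gave signal.
    fam_fluoresced: True if the 6-FAM-labeled probe (allele 2) gave signal.
    """
    if vic_fluoresced and fam_fluoresced:
        return "heterozygous (alleles 1 and 2)"
    if vic_fluoresced:
        return "homozygous (allele 1)"
    if fam_fluoresced:
        return "homozygous (allele 2)"
    return "no call"  # neither dye fluoresced, e.g., a no-template control
```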
  • According to various embodiments, a high-quality LD map of validated SNPs was created by integrating information from both public and private human genome efforts. A set of over 200,000 validated, easy-to-use, individual SNP assays and TAQMAN ready-to-use assay reagents created by using methods according to various embodiments can be provided. A minor groove binder and a non-fluorescent quencher, and the integration of the 5′ nuclease chemistry with an automated detection system, such as, for example, the 7900HT, can be used. According to various embodiments, a web-based bioinformatics and ordering system can be provided where a customer can search for SNPs and order assay reagents, thus reducing the time and costs associated with candidate-gene and candidate-region association studies. According to various embodiments, an LD map can, for example, enable candidate-gene and candidate-region association studies using 5′ nuclease chemistry and/or be implemented on an ultra-high throughput SNP genotyping platform to enable WGA studies. According to such embodiments, the 5′ nuclease chemistry system can leverage the specificity of the OLA-PCR assay chemistry and the highly parallel detection of, for example, BEADARRAY technology, available from Illumina, Inc., San Diego, Calif. According to such embodiments, the system can enable the generation of about 2,100,000 genotypes per day and all components of the assay can be universal except for, for example, the SNP-specific OLA probes.
  • According to various embodiments, assays for over 4,000,000 SNPs from the Celera database can cover every gene in the human genome. According to various embodiments, many SNPs can have the necessary variability for genetic association studies and assays for the SNPs can be provided. According to various embodiments, assays can be grouped together into convenient SNP sets optimized for specific assays such as, for example, p450 genotyping and disease-specific gene studies.
  • FIGS. 2 a-2 e are schematic diagrams showing the interaction of components that can be part of a mixture of reagents according to various embodiments. In FIG. 2 a, primer 52 has annealed to template strand 54. Replication of the template strand from primer 52 will occur in the 5′ to 3′ direction. Probe 50, including a generic reporter dye R, quencher Q, and minor groove binder MGB, has annealed to the template strand 54. Arrow 53 shows that as the complementary strand (not shown) is produced from the template strand 54 starting at the forward primer 52, the complementary strand will meet probe 50. FIG. 2 b shows the complementary strand 55 as it meets probe 50 a. Polymerase 60 cleaves VIC reporter dye V during the production of complementary strand 55 given that probe 50 a has annealed to the target strand 54 because the target strand 54 and the probe 50 a are completely complementary. FIG. 2 c shows the complementary strand 55 as it meets probe 50 b. Polymerase 60 does not cleave FAM reporter dye F during the production of complementary strand 55 given that probe 50 b has not hybridized with the target strand 54 because of a mismatched base pair at location 64. FIG. 2 d shows the complementary strand 55 as it meets probe 50 b. Polymerase 60 cleaves FAM reporter dye F during the production of complementary strand 55 given that probe 50 b has annealed to the target strand 54 because the target strand 54 and the probe 50 b are completely complementary. FIG. 2 e shows the complementary strand 55 as it meets probe 50 a. Polymerase 60 does not cleave VIC reporter dye V during the production of complementary strand 55 given that the probe 50 a has not hybridized with the target strand 54 because of a mismatched base pair at location 66.
  • FIG. 7 is an illustration of SNPs selected by a method according to various embodiments.
  • According to various embodiments, a library is provided that contains a plurality of data sets, corresponding to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one respective Single Nucleotide Polymorphism (SNP) in a nucleic acid sequence. The nucleic acid sequence can include three or more adjacent SNPs and the data sets can correspond to at least the three or more adjacent SNPs, respectively. Each adjacent SNP can be spaced a distance from at least one other adjacent SNP and each of the distances between adjacent SNPs can be from about 75% to about 125% of an average of the distances between the adjacent SNPs. The corresponding, adjacent SNPs can be spaced apart along at least a region of a chromosome. The corresponding, adjacent SNPs can be spaced apart along at least a region of a gene. The distances between all corresponding, adjacent SNPs can be equal, plus or minus 30%. The distances between all corresponding, adjacent SNPs can be equal, plus or minus 20%. Each of the distances between corresponding, adjacent SNPs can be from about 90% to about 110% of an average of the distances between the adjacent SNPs. Each of the distances between corresponding, adjacent SNPs can be from about 95% to about 105% of an average of the distances between the adjacent SNPs. The nucleic acid sequence can be a consensus sequence. The distances between all corresponding, adjacent SNPs can be less than a specified maximum distance. The specified maximum distance can be, for example, 10 kilobases, 15 kilobases, 20 kilobases, or 30 kilobases. According to various embodiments, the algorithm can select a minimum number of SNPs per region. For example, three SNPs per gene can be selected. The distances between all corresponding, adjacent SNPs can be greater than a specified minimum distance. The nucleic acid sequence can be a consensus sequence corresponding to the human genome. The nucleic acid sequence can be a nucleic acid sequence data set.
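  • As a minimal illustration of the spacing criterion described above, the following Python sketch checks whether every distance between adjacent SNPs falls within a chosen fraction of the average distance. The function name `spacing_within_tolerance`, its parameters, and the use of integer base-pair coordinates are assumptions made for illustration only and do not appear in the specification.

```python
from typing import Sequence


def spacing_within_tolerance(positions: Sequence[int],
                             low: float = 0.75,
                             high: float = 1.25) -> bool:
    """Return True if every gap between adjacent SNPs lies within
    [low, high] times the mean gap (hypothetical helper)."""
    if len(positions) < 3:
        raise ValueError("need at least three SNP positions")
    ordered = sorted(positions)
    # Distances between each pair of adjacent SNPs.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return all(low * mean_gap <= g <= high * mean_gap for g in gaps)
```

  • The default bounds correspond to the "about 75% to about 125%" range; passing `low=0.9, high=1.1` or `low=0.95, high=1.05` would correspond to the narrower ranges also mentioned.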
  • The library can comprise a number of data sets corresponding to not more than a sufficient number of oligonucleotides necessary to provide a collection of assays that can provide a maximum statistical loss of haplotype diversity, across the region, of less than ten (10) percent. The sufficient number of oligonucleotides can be obtained by providing a matrix comprised of data representing haplotype blocks and SNP locations. The columns of the matrix can contain data representing existence of respective SNPs within a haplotype block and the rows of the matrix can contain data representing respective haplotype blocks. At least one column can be eliminated, wherein elimination of the at least one column may not reduce the number of rows in the matrix that contains non-duplicative information. At least one column of the matrix that is identical to a second column of the matrix and/or completely opposite to a second column of the matrix can be eliminated.
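  • The column elimination described above can be sketched as follows. This is only an illustrative reading, assuming a binary (0/1) matrix stored as a list of rows; the function name `drop_redundant_columns` is hypothetical. A column is dropped when it is identical to, or the exact complement of, a column already kept.

```python
from typing import List, Sequence


def drop_redundant_columns(matrix: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of columns to keep in a 0/1 matrix
    (rows = haplotype blocks, columns = SNPs)."""
    n_cols = len(matrix[0]) if matrix else 0
    kept: List[int] = []
    seen = set()
    for j in range(n_cols):
        col = tuple(row[j] for row in matrix)
        flipped = tuple(1 - v for v in col)
        # Skip a column that duplicates, or is completely opposite to,
        # a column that has already been kept.
        if col in seen or flipped in seen:
            continue
        seen.add(col)
        kept.append(j)
    return kept
```

  • Because a dropped column carries the same information as a kept column, any two rows that differed before the elimination still differ afterwards, so the number of distinct rows is unchanged.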
  • FIG. 7 illustrates SNPs that can be selected according to various embodiments. SNPs 710, 720, 730, and 740 that are present in region 702 of nucleic acid sequence 700 have been selected based on prioritization criteria. SNPs 710, 720, 730, and 740 are separated from adjacent SNPs by a distance no greater than distance 760. The effective range of usefulness of each SNP is equal to plus or minus one half of distance 760. Therefore, there are no gaps in coverage of region 702 of nucleic acid sequence 700. SNPs 750, 752, 754, and 756 were not selected using the same selection criteria. The selection criteria can be used to select data sets for the library, where the data sets correspond to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one respective SNP. According to various embodiments, at least one selection criterion of the selection criteria can be used. One criterion can be a specified maximum distance between adjacent SNPs. Another criterion can be a specified minimum distance between adjacent SNPs. A criterion can be a target distance between adjacent SNPs. The actual distance between adjacent SNPs can vary from the target distance by, for example, 5%, 10%, 20%, or 30%. An algorithm to prioritize SNPs can select as few SNPs as possible to achieve a coverage of the region where the largest gap between adjacent SNPs is less than or equal to the specified maximum distance. An algorithm to prioritize SNPs may not select a SNP if the distance between that SNP and one of its two adjacent SNPs is less than the specified minimum distance.
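  • The coverage idea illustrated by FIG. 7 can be expressed as a short check. The sketch below assumes that each selected SNP covers plus or minus one half of the specified maximum distance, so a region is covered when no gap between adjacent SNPs exceeds the maximum distance and each end of the region lies within half that distance of the nearest SNP. The name `region_is_covered` and its parameters are hypothetical.

```python
from typing import Sequence


def region_is_covered(positions: Sequence[int],
                      region_start: int,
                      region_end: int,
                      max_gap: int) -> bool:
    """Return True if the selected SNPs leave no coverage gap in the region."""
    ordered = sorted(positions)
    if not ordered:
        return False
    half = max_gap / 2
    # Each SNP is assumed to be useful over +/- half of the maximum distance,
    # so the region ends must lie within that range of the outermost SNPs.
    if ordered[0] - region_start > half or region_end - ordered[-1] > half:
        return False
    # Adjacent SNPs may be at most max_gap apart.
    return all(b - a <= max_gap for a, b in zip(ordered, ordered[1:]))
```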
  • Another prioritization criterion can involve using known SNPs. Known SNPs can be preferentially selected to include in the library because, for example, known SNPs are well characterized and may have assays that have previously been designed to target them; they may therefore have little or no additional cost associated with providing an oligonucleotide or a data set corresponding to an oligonucleotide for an assay directed to the SNP. An algorithm to prioritize SNPs can consider known SNPs as “must use” SNPs or as “prefer to use” SNPs.
  • According to various embodiments, a known SNP is a previously characterized SNP marker that has low or no cost associated with designing and/or manufacturing one or more oligonucleotides. Known SNPs can be preferentially used as a selection criterion. A newly identified SNP is a SNP marker that is relatively uncharacterized and therefore has a higher relative cost associated with designing and/or manufacturing one or more oligonucleotides. Whether a SNP is newly identified can also be a selection criterion. Newly identified SNPs can be selected after all or some of the known SNPs have been selected because, for example, newly identified SNPs require investigation into whether functional assays can be produced for them and therefore have greater costs associated with them than known SNPs. An algorithm to prioritize SNPs can consider newly identified SNPs as “must never use” SNPs, “use only after using all known” SNPs, or as “try not to use” SNPs.
  • For example, according to various embodiments, using at least one selection criterion, data sets that correspond to one or more respective oligonucleotides that function as a respective assay to hybridize with at least one selected SNP, were selected. Selection criteria can be used to select data sets corresponding to the largest number of known SNPs and the smallest number of newly identified SNPs that provide a coverage of the region, where the distance between adjacent, selected SNPs is approximately equal.
  • According to various embodiments, a specified maximum distance can be allowed between two adjacent selected SNPs, as well as between the two outermost selected SNPs and the boundaries of the region, gene, chromosome, or other boundary denoting the beginning and ending of the nucleic acid sequence. One requirement can be, for example, to never have a distance between adjacent selected SNPs that is greater than a maximum required distance, unless maintaining the maximum required distance is impossible because there are no adjacent SNPs that fall within the maximum required distance. For further example, the largest distance between adjacent selected SNPs can be as small as possible.
  • According to various embodiments, an algorithm can be provided that selects, based on prioritization criteria, a plurality of data sets that corresponds to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one selected SNP. A region can be at least a part of the nucleic acid sequence. The region can be separated into sub-regions. The sub-regions can, for example, be separated by known SNPs, if any. Each sub-region can be solved and optimized independently according to various embodiments of the algorithm. The sub-region can be bound by known SNPs. If there are no known SNPs, then the region may not be separated into sub-regions. Various embodiments of the algorithm can be repeated for each sub-region. Each sub-region can begin, for example, with the region's 5′ end or a known SNP and can end with the 3′ end or a known SNP.
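  • The division of a region into sub-regions bounded by known SNPs can be sketched as below. The function name `split_into_subregions` and the representation of a sub-region as a `(start, end)` tuple are assumptions for illustration; the specification does not prescribe a data structure.

```python
from typing import List, Sequence, Tuple


def split_into_subregions(region_start: int,
                          region_end: int,
                          known_snps: Sequence[int]) -> List[Tuple[int, int]]:
    """Split a region at each known SNP lying strictly inside it."""
    inner = sorted({p for p in known_snps if region_start < p < region_end})
    bounds = [region_start, *inner, region_end]
    # Each sub-region runs from the 5' end or a known SNP
    # to the next known SNP or the 3' end.
    return list(zip(bounds, bounds[1:]))
```

  • With no known SNPs the function returns the whole region as a single sub-region, matching the case in which the region is not separated.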
  • According to various embodiments, a step of the algorithm can be a locally optimal solution (“local step”) that can determine the number of newly identified SNPs that can be utilized. The step can include selecting all the newly identified SNPs. The step can include selecting the known SNPs closest to one or both ends of the region or sub-region of the nucleic acid sequence. The step can include selecting the newly identified SNPs closest to one or both ends of the region or sub-region of the nucleic acid sequence. The step can iteratively eliminate newly identified SNPs until, for example, no newly identified SNPs can be eliminated. The step can iteratively eliminate newly identified SNPs until, for example, the remaining number of known SNPs and newly identified SNPs is at or below a minimum number of SNPs, if any, that can be selected.
  • For the local step, the distance between all adjacent SNPs can be calculated. The first and/or last distances between adjacent SNPs in a region or sub-region can be doubled if the distances are between a SNP and the edge of the region or sub-region. From the 5′ end of the region or sub-region, the consecutive distances between adjacent SNPs can be added until the cumulative sum of consecutive distances exceeds the specified maximum distance. If the cumulative sum of consecutive distances exceeds the specified maximum distance after more than one consecutive distance, i.e., more than just two adjacent SNPs, or more than just one SNP and an edge of a region or sub-region, then one SNP in the set of newly identified SNPs can be removed.
  • According to various embodiments, the local step can include identifying the smallest distance between adjacent SNPs in the sequence of adjacent SNPs that makes up the cumulative sum, where the cumulative sum of adjacent distances is greater than the specified maximum distance. If there is only one distance between adjacent SNPs having a distance equal to the smallest distance, then one of the two adjacent SNPs bounding the smallest distance can be eliminated if one of the SNPs is a newly identified SNP. For example, if both SNPs in the pair of adjacent SNPs having a distance equal to the smallest distance are newly identified SNPs, one of the SNPs can be eliminated at random or the SNP nearest the 5′ end or the SNP nearest the 3′ end can be eliminated by convention. Because each distance value is the distance between two adjacent SNPs, each SNP can have two distance values associated with it. The distance value (5′) is the distance to the adjacent SNP on the 5′ end of the nucleic acid sequence, and the distance value (3′) is the distance to the adjacent SNP on the 3′ end. If there is more than one distance between adjacent SNPs having a distance equal to the smallest distance, e.g. a “tie,” then the tie can be broken by, for example, choosing the smallest distance on the 3′ end of the nucleic acid sequence. For another example, the smallest distance can be chosen arbitrarily (e.g. the smallest distance closest to the 5′ end, the smallest distance closest to the 3′ end, or the smallest distance that falls in the middle of the other smallest distances). After removing a SNP adjacent to the smallest distance, the cumulative sum can be recomputed and the process can be reiterated using the remaining SNPs. According to various embodiments, if no newly identified SNPs can be removed, the process of the local step can be stopped. If the total number of SNPs is at or below a specified minimum number of SNPs, the process of the local step can be stopped.
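  • The following Python sketch gives one simplified reading of the local step; it is not presented as the exact procedure of the specification. It assumes that a newly identified SNP can be removed only if the gap created by its removal stays at or below the specified maximum distance, that the SNP whose removal creates the smallest merged gap is removed first, that ties are broken toward the 5′ end, that edge distances are doubled, and that known SNPs are never removed. The name `local_step` and all parameter names are hypothetical.

```python
from typing import List, Sequence


def local_step(new_snps: Sequence[int],
               known_snps: Sequence[int],
               start: int,
               end: int,
               max_gap: int,
               min_count: int = 0) -> List[int]:
    """Iteratively eliminate newly identified SNPs (simplified sketch)."""
    known = set(known_snps)
    selected = sorted(set(new_snps) | known)

    def merged_gap(i: int) -> int:
        # Gap that would result from removing selected[i].
        left = selected[i - 1] if i > 0 else None
        right = selected[i + 1] if i < len(selected) - 1 else None
        if left is None:            # new 5' edge distance, doubled
            return 2 * (right - start)
        if right is None:           # new 3' edge distance, doubled
            return 2 * (end - left)
        return right - left

    # Stop at the minimum count, and never remove the last remaining SNP.
    while len(selected) > max(min_count, 1):
        best = None
        for i, pos in enumerate(selected):
            if pos in known:
                continue            # known SNPs are always retained
            gap = merged_gap(i)
            # Strict '<' keeps the candidate nearest the 5' end on ties.
            if gap <= max_gap and (best is None or gap < best[0]):
                best = (gap, i)
        if best is None:
            break                   # no newly identified SNP can be removed
        del selected[best[1]]
    return selected
```

  • For nine evenly spaced candidates at positions 2 through 18 in a region from 0 to 20 with a maximum distance of 8, this sketch retains the SNPs at positions 2, 10, and 18.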
  • According to various embodiments, a step of the algorithm can be a globally optimal solution (“global step”) that can determine the optimum selection of newly identified SNPs given the number of newly identified SNPs that can be utilized. For example, the number of newly identified SNPs that can be utilized can be provided by the local step. According to various embodiments, at least one step of the algorithm is performed.
  • According to various embodiments, if there are K newly identified SNPs in the region or sub-region, and N SNPs will be selected, then there can be P=K!/[N!(K−N)!] possible selections of SNPs. N can be a value from 1 to K. The value K can be selected by determining the specified minimum and/or specified maximum distances of the region or sub-region. According to various embodiments, the maximum amount of time T that is dedicated to calculating the global step can be specified. If a computer can compute one selection in time t, then if P<T/t, the global step can be performed. If P>=T/t, then the global step may not be performed. The global step can include calculating all possible selections P of the number of newly identified SNPs to be selected N out of the total number of newly identified SNPs in the region or sub-region K. The global step can include calculating the largest distance between adjacent, selected SNPs for each selection P. The global step can include choosing the selection P with the “smallest” largest distance between adjacent, selected SNPs. The global step can be illustrated by the following operation:
    $$\min_{\text{all selections of } N \text{ out of } K} \; \max_{i = 1, \ldots, N} \left[ \mathrm{SNP}_i - \mathrm{SNP}_{i-1} \right]$$
  • The smallest value of N can be selected for which Min {Max [SNPi − SNPi−1]} (where i=1, . . . , N) is less than T.
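  • The global step can be sketched as an exhaustive search over all selections of N out of K newly identified SNPs. In the sketch below, the names `global_step`, `left`, `right`, and `max_selections` are hypothetical. It is assumed that the sub-region boundaries (for example, known SNPs or the region ends) are fixed points included when measuring distances, and that `max_selections` stands for the quotient T/t described above.

```python
from itertools import combinations
from math import comb
from typing import List, Optional, Sequence, Tuple


def global_step(candidates: Sequence[int],
                n_select: int,
                left: int,
                right: int,
                max_selections: Optional[float] = None
                ) -> Optional[Tuple[Optional[List[int]], Optional[int]]]:
    """Choose the selection with the smallest largest adjacent distance."""
    ordered = sorted(candidates)
    total = comb(len(ordered), n_select)   # P = K! / [N! (K - N)!]
    if max_selections is not None and total >= max_selections:
        return None                        # P >= T/t: global step not performed
    best_combo, best_gap = None, None
    for combo in combinations(ordered, n_select):
        points = [left, *combo, right]
        # Largest distance between adjacent points for this selection.
        largest = max(b - a for a, b in zip(points, points[1:]))
        if best_gap is None or largest < best_gap:
            best_combo, best_gap = list(combo), largest
    return best_combo, best_gap
```

  • Because the number of selections grows combinatorially with K, the `max_selections` guard reflects the time limit described above; when the guard triggers, the result of the local step could be used on its own.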
  • According to various embodiments, a minimum distance between adjacent selected SNPs can be specified. A minimum number of total markers can be specified. Prioritization criteria can be assigned to respective newly identified SNPs to give some newly identified SNPs preference over others. To assign preference to some newly identified SNPs, the distance value associated with a “high priority” newly identified SNP can be, for example, marked or changed so that the “high priority” newly identified SNP is always preferred or selected over a “low priority” newly identified SNP.
  • FIG. 8 details a hypothetical chromosome having 969 genes on the chromosome. Of those 969 genes, 1,639 known SNPs are present on the chromosome and are well characterized. The chromosome contains 11,095 newly identified SNPs that are not well characterized. Of the 969 genes, 611 genes contain SNPs and 358 genes do not contain SNPs. Of the 611 genes containing SNPs, FIG. 8 lists the average gene length, in bases, the number of newly identified SNPs per gene, the number of known SNPs per gene, the total number of selected SNPs per gene that were selected using various embodiments, and the number of newly identified selected SNPs per gene that were selected using various embodiments. FIG. 9 is a histogram of gene lengths of the genes found on the hypothetical chromosome of FIG. 8. FIG. 10 is a histogram of the specified maximum distance between adjacent SNPs, according to various embodiments, of the selected SNPs from the hypothetical chromosome of FIG. 8. FIG. 11 is a histogram of the actual maximum distance between adjacent SNPs, according to various embodiments, of selected SNPs of the hypothetical chromosome of FIG. 8. FIG. 12 is a histogram of total selected SNPs per gene, according to various embodiments, from the hypothetical chromosome of FIG. 8. FIG. 13 is a histogram of the number of newly identified SNPs per gene that were selected from the hypothetical chromosome of FIG. 8 using various embodiments.
  • The present invention relates to the foregoing and other embodiments as will be apparent to those skilled in the art from consideration of the present specification and practice of the present invention disclosed herein. It is intended that the present specification and examples be considered as exemplary only with a true scope and spirit of the invention being indicated by the following claims and equivalents thereof.

Claims (6)

1-34. (canceled)
35. A library that contains a plurality of data sets, corresponding to one or more respective oligonucleotides that can function as a respective assay to hybridize with at least one respective Single Nucleotide Polymorphism (SNP) in a nucleic acid sequence, wherein the library is compiled using a method comprising the steps of:
providing a representation of a nucleic acid sequence;
designating a region within at least a part of the nucleic acid sequence;
determining the locations of known SNPs, if any, within the region;
determining the locations of newly identified SNPs, if any, within the region; and
selecting for the library a collection of data sets that corresponds to a collection of the known SNPs and the newly identified SNPs, wherein all of the SNPs of collection are:
(1) spaced less than a specified maximum distance apart from one another along the nucleic acid sequence;
(2) spaced more than a minimum distance apart from one another along the nucleic acid sequence;
(3) spaced apart from one another such that each of the distances between adjacent selected SNPs is from about 75% to about 125% of an average of the distances between the adjacent selected SNPs; or
(4) a combination of (1), (2), and (3).
36. The library of claim 35, wherein the method further comprises the step of designating a specified maximum distance between adjacent selected SNPs that correspond to respective data sets.
37. The library of claim 35, wherein the method further comprises the step of designating a specified minimum distance between adjacent selected SNPs that correspond to respective data sets.
38. The library of claim 35, wherein the method further comprises the step of designating both a specified minimum distance between adjacent selected SNPs that correspond to respective data sets and a specified maximum distance between adjacent selected SNPs that correspond to respective data sets.
39. The library of claim 35, wherein each of the distances between adjacent selected SNPs is from about 75% to about 125% of an average of the distances between the adjacent selected SNPs.
US10/502,761 2002-01-25 2003-01-27 Methods of validating snps and compiling libraries of assays Abandoned US20050282162A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/502,761 US20050282162A1 (en) 2002-01-25 2003-01-27 Methods of validating snps and compiling libraries of assays

Applications Claiming Priority (14)

Application Number Priority Date Filing Date Title
US35203902P 2002-01-25 2002-01-25
US35235602P 2002-01-28 2002-01-28
US36912702P 2002-04-01 2002-04-01
US36965702P 2002-04-03 2002-04-03
US37092102P 2002-04-09 2002-04-09
US37617102P 2002-04-26 2002-04-26
US38005702P 2002-05-06 2002-05-06
US38362702P 2002-05-28 2002-05-28
US38395402P 2002-05-29 2002-05-29
US39070802P 2002-06-21 2002-06-21
US39411502P 2002-07-05 2002-07-05
US39986002P 2002-07-31 2002-07-31
PCT/US2003/002240 WO2003065000A2 (en) 2002-01-25 2003-01-27 METHODS OF VALIDATING SNPs AND COMPILING LIBRARIES OF ASSAYS
US10/502,761 US20050282162A1 (en) 2002-01-25 2003-01-27 Methods of validating snps and compiling libraries of assays

Publications (1)

Publication Number Publication Date
US20050282162A1 true US20050282162A1 (en) 2005-12-22

Family

ID=27671405

Family Applications (7)

Application Number Title Priority Date Filing Date
US10/334,793 Abandoned US20040018506A1 (en) 2002-01-25 2003-01-02 Methods for placing, accepting, and filling orders for products and services
US10/335,690 Abandoned US20040063109A2 (en) 2002-01-25 2003-01-02 Single-tube, ready-to-use assay kits, and methods using same
US10/502,761 Abandoned US20050282162A1 (en) 2002-01-25 2003-01-27 Methods of validating snps and compiling libraries of assays
US12/015,143 Abandoned US20080228589A1 (en) 2002-01-25 2008-01-16 Methods For Placing, Accepting, And Filling Orders For Products and Services
US13/458,879 Expired - Lifetime US9464320B2 (en) 2002-01-25 2012-04-27 Methods for placing, accepting, and filling orders for products and services
US15/256,827 Expired - Lifetime US10689692B2 (en) 2002-01-25 2016-09-06 Methods for placing, accepting, and filling orders for products and services
US16/894,028 Abandoned US20200392571A1 (en) 2002-01-25 2020-06-05 Methods for placing, accepting, and filling orders for products and services

Country Status (6)

Country Link
US (7) US20040018506A1 (en)
EP (2) EP1706826A4 (en)
JP (3) JP2005516300A (en)
AU (1) AU2003209375A1 (en)
CA (1) CA2474482A1 (en)
WO (3) WO2003065146A2 (en)

EP3467690A1 (en) * 2017-10-06 2019-04-10 Emweb bvba Improved alignment method for nucleic acid sequences
CN109871941B (en) * 2019-02-18 2020-02-21 中科寒武纪科技股份有限公司 Data processing method and device and related products
AU2020291343A1 (en) * 2019-06-14 2021-12-02 Seegene, Inc. Computer-implemented method for collaborative development of reagents for detection of target nucleic acids
US11361847B1 (en) 2021-02-06 2022-06-14 Timothy A. Hodge System and method for rapidly reporting testing results
US11864010B2 (en) * 2021-07-02 2024-01-02 Cisco Technology, Inc. Automated activation of unsolicited probe responses
WO2023081627A1 (en) * 2021-11-02 2023-05-11 Koireader Technologies, Inc. System for transportation and shipping related data extraction

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5374395A (en) * 1993-10-14 1994-12-20 Amoco Corporation Diagnostics instrument
US6100030A (en) * 1997-01-10 2000-08-08 Pioneer Hi-Bred International, Inc. Use of selective DNA fragment amplification products for hybridization-based genetic fingerprinting, marker assisted selection, and high-throughput screening
US20010039014A1 (en) * 2000-01-11 2001-11-08 Maxygen, Inc. Integrated systems and methods for diversity generation and screening
US6316320B1 (en) * 1997-04-04 2001-11-13 Mitsubishi Denki Kabushiki Kaisha DRAM device with improved memory cell reliability
US6329230B1 (en) * 1998-06-11 2001-12-11 Fujitsu Quantum Devices Limited High-speed compound semiconductor device having an improved gate structure
US6480791B1 (en) * 1998-10-28 2002-11-12 Michael P. Strathmann Parallel methods for genomic analysis

Family Cites Families (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US39014A (en) * 1863-06-23 Improvement in harvesters
DE3529478A1 (en) 1985-08-16 1987-02-19 Boehringer Mannheim Gmbh 7-DEAZA-2'-DEOXYGUANOSINE NUCLEOTIDES, METHOD FOR THE PRODUCTION THEREOF AND THEIR USE FOR NUCLEIC ACID SEQUENCING
US5081584A (en) 1989-03-13 1992-01-14 United States Of America Computer-assisted design of anti-peptides based on the amino acid sequence of a target peptide
DE3924424A1 (en) 1989-07-24 1991-01-31 Boehringer Mannheim Gmbh NUCLEOSIDE DERIVATIVES, METHOD FOR THE PRODUCTION THEREOF, THEIR USE AS A MEDICINAL PRODUCT AND THEIR USE IN THE SEQUENCING OF NUCLEIC ACID
US5670633A (en) 1990-01-11 1997-09-23 Isis Pharmaceuticals, Inc. Sugar modified oligonucleotides that detect and modulate gene expression
DK51092D0 (en) 1991-05-24 1992-04-15 Ole Buchardt OLIGONUCLEOTIDE ANALOGUE DESCRIBED BY PEN, MONOMERIC SYNTHONES AND PROCEDURES FOR PREPARING THEREOF, AND APPLICATIONS THEREOF
US5582986A (en) 1991-06-14 1996-12-10 Isis Pharmaceuticals, Inc. Antisense oligonucleotide inhibition of the ras gene
DE4140463A1 (en) 1991-12-09 1993-06-17 Boehringer Mannheim Gmbh 2'-DEOXY-ISOGUANOSINE, ITS ISOSTERIC ANALOGS AND THE APPLICATION THEREOF
US6300058B1 (en) 1992-01-29 2001-10-09 Hitachi Chemical Research Center, Inc. Method for measuring messenger RNA
JPH0778804B2 (en) * 1992-05-28 1995-08-23 日本アイ・ビー・エム株式会社 Scene information input system and method
ATE282695T1 (en) 1992-07-20 2004-12-15 Isis Pharmaceuticals Inc PSEUDO HALF-KNOT FORMING RNA THROUGH HYBRIDIZATION OF ANTISENSE OLIGONUCLEOTIDES TO TARGETED RNA SECONDARY STRUCTURES
US5556749A (en) 1992-11-12 1996-09-17 Hitachi Chemical Research Center, Inc. Oligoprobe designstation: a computerized method for designing optimal DNA probes
US5593834A (en) 1993-06-17 1997-01-14 The Research Foundation Of State University Of New York Method of preparing DNA sequences with known ligand binding characteristics
EP1245286B1 (en) * 1993-10-22 2009-11-25 Abbott Laboratories Reaction tube and method of use to minimize contamination
WO1996001693A1 (en) * 1994-07-11 1996-01-25 Akzo Nobel N.V. Micro sample tube with reduced dead volume and bar code capability
US5948360A (en) * 1994-07-11 1999-09-07 Tekmar Company Autosampler with robot arm
US5846719A (en) 1994-10-13 1998-12-08 Lynx Therapeutics, Inc. Oligonucleotide tags for sorting and identification
US5727163A (en) * 1995-03-30 1998-03-10 Amazon.Com, Inc. Secure method for communicating credit card data when placing an order on a non-secure network
US6329139B1 (en) * 1995-04-25 2001-12-11 Discovery Partners International Automated sorting system for matrices with memory
GB9524381D0 (en) * 1995-11-29 1996-01-31 Anthony Nolan Bone Marrow Trus Method for identifying an unknown allele
EP1179600B1 (en) 1996-06-04 2005-05-11 University Of Utah Research Foundation Monitoring hybridization during PCR
US5825881A (en) * 1996-06-28 1998-10-20 Allsoft Distributing Inc. Public network merchandising system
US6189013B1 (en) * 1996-12-12 2001-02-13 Incyte Genomics, Inc. Project-based full length biomolecular sequence database
US6058373A (en) * 1996-10-16 2000-05-02 Microsoft Corporation System and method for processing electronic order forms
US6143877A (en) 1997-04-30 2000-11-07 Epoch Pharmaceuticals, Inc. Oligonucleotides including pyrazolo[3,4-D]pyrimidine bases, bound in double stranded nucleic acids
US5960411A (en) * 1997-09-12 1999-09-28 Amazon.Com, Inc. Method and system for placing a purchase order via a communications network
US6456942B1 (en) * 1998-01-25 2002-09-24 Combimatrix Corporation Network infrastructure for custom microarray synthesis and analysis
US6251588B1 (en) 1998-02-10 2001-06-26 Agilent Technologies, Inc. Method for evaluating oligonucleotide probe sequences
US6127121A (en) 1998-04-03 2000-10-03 Epoch Pharmaceuticals, Inc. Oligonucleotides containing pyrazolo[3,4-D]pyrimidines for hybridization and mismatch discrimination
KR20000021073A (en) * 1998-09-25 2000-04-15 박원배 Inhibitors for biosynthesis of cholesterol
US7013221B1 (en) * 1999-07-16 2006-03-14 Rosetta Inpharmatics Llc Iterative probe design and detailed expression profiling with flexible in-situ synthesis arrays
US6316230B1 (en) * 1999-08-13 2001-11-13 Applera Corporation Polymerase extension at 3′ terminus of PNA-DNA chimera
EP1214331B1 (en) 1999-08-30 2006-10-11 Roche Diagnostics GmbH 2-azapurine compounds and their use
US6271002B1 (en) * 1999-10-04 2001-08-07 Rosetta Inpharmatics, Inc. RNA amplification method
WO2001037167A1 (en) 1999-11-16 2001-05-25 Regency Ventures Ltd, Charted Corporation Services A method and system for configurating products
US6660845B1 (en) 1999-11-23 2003-12-09 Epoch Biosciences, Inc. Non-aggregating, non-quenching oligomers comprising nucleotide analogues; methods of synthesis and use thereof
US6727356B1 (en) * 1999-12-08 2004-04-27 Epoch Pharmaceuticals, Inc. Fluorescent quenching detection reagents and methods
US6282550B1 (en) * 2000-01-10 2001-08-28 Tangerine Technologies, Inc. Apparatus and method of utilizing a database to correlate customer requests and suppliers capabilities for custom synthesis of polymers
EP1252513A4 (en) 2000-01-25 2007-07-18 Affymetrix Inc Method, system and computer software for providing a genomic web portal
AU2001240991A1 (en) * 2000-03-15 2001-09-24 Genset Methods, software, and apparati for designing, ordering, pricing, tracking and directing production of custom biologicals
JP2001258568A (en) 2000-03-22 2001-09-25 Hitachi Ltd Primer design system
US6511277B1 (en) * 2000-07-10 2003-01-28 Affymetrix, Inc. Cartridge loader and methods
WO2002037391A2 (en) * 2000-11-03 2002-05-10 Myetribute, Inc. System and method for conducting pet, death, dna and other related transactions over a computer network
US7117095B2 (en) 2000-11-21 2006-10-03 Affymetrix, Inc. Methods for selecting nucleic acid probes
US20030082544A1 (en) * 2001-07-11 2003-05-01 Third Wave Technologies, Inc. Methods and systems for validating detection assays, developing in-vitro diagnostic DNA or RNA analysis products, and increasing revenue and/or profit margins from in-vitro diagnostic DNA or RNA analysis assays
WO2002044994A2 (en) * 2000-11-30 2002-06-06 Third Wave Technologies, Inc. Systems and methods for detection assay ordering, design, production, inventory, sales and analysis for use with or in a production facility
US20040014067A1 (en) 2001-10-12 2004-01-22 Third Wave Technologies, Inc. Amplification methods and compositions
JP2005516296A (en) 2002-01-25 2005-06-02 アプレラ コーポレイション Computer operating method and / or computer network for arranging biotechnology products
US20030190652A1 (en) 2002-01-25 2003-10-09 De La Vega Francisco M. Methods of validating SNPs and compiling libraries of assays
US20040018506A1 (en) 2002-01-25 2004-01-29 Koehler Ryan T. Methods for placing, accepting, and filling orders for products and services
AU2003298706A1 (en) 2002-12-04 2004-06-23 Applera Corporation Multiplex amplification of polynucleotides
WO2004055709A2 (en) 2002-12-13 2004-07-01 Applera Corporation Methods for identifying, viewing, and analyzing syntenic and orthologous genomic regions between two or more species
US7430496B2 (en) 2004-06-16 2008-09-30 Tokyo Electron Limited Method and apparatus for using a pressure control system to monitor a plasma processing system

Also Published As

Publication number Publication date
JP2005516300A (en) 2005-06-02
WO2003065146A3 (en) 2004-09-10
AU2003209375A1 (en) 2003-09-02
CA2474482A1 (en) 2003-08-07
WO2003065000A2 (en) 2003-08-07
US10689692B2 (en) 2020-06-23
EP1468103A1 (en) 2004-10-20
US20120303472A1 (en) 2012-11-29
JP2006294059A (en) 2006-10-26
EP1468103A4 (en) 2008-12-31
US20170268051A1 (en) 2017-09-21
US20200392571A1 (en) 2020-12-17
EP1706826A2 (en) 2006-10-04
EP1706826A4 (en) 2008-01-30
US20080228589A1 (en) 2008-09-18
US20040063109A2 (en) 2004-04-01
US20040018506A1 (en) 2004-01-29
US9464320B2 (en) 2016-10-11
JP2005515785A (en) 2005-06-02
US20030175774A1 (en) 2003-09-18
WO2003064670A1 (en) 2003-08-07
WO2003065000A3 (en) 2004-10-07
WO2003065146A2 (en) 2003-08-07

Similar Documents

Publication Publication Date Title
US20050282162A1 (en) Methods of validating snps and compiling libraries of assays
US10689695B2 (en) Multiplex amplification of polynucleotides
US6703228B1 (en) Methods and products related to genotyping and DNA analysis
EP1927064B1 (en) Melting curve analysis with exponential background subtraction
US9081737B2 (en) Methods for predicting stability and melting temperatures of nucleic acid duplexes
US6461816B1 (en) Methods for controlling cross-hybridization in analysis of nucleic acid sequences
Tebbutt et al. Microarray genotyping resource to determine population stratification in genetic association studies of complex disease
US20120184449A1 (en) Fetal genetic variation detection
US20140329238A1 (en) Detection of gene duplications
EP1056889B1 (en) Methods related to genotyping and dna analysis
MX2007005364A (en) Single step detection assay.
Fondevila et al. Forensic SNP genotyping with SNaPshot: technical considerations for the development and optimization of multiplexed SNP assays
CA2543033A1 (en) Direct nucleic acid detection in bodily fluids
US20030190652A1 (en) Methods of validating SNPs and compiling libraries of assays
Brion et al. New technologies in the genetic approach to sudden cardiac death in the young
US20040072217A1 (en) Methods of analysis of linkage disequilibrium
Dearlove High throughput genotyping technologies
US20030235848A1 (en) Characterization of CYP 2D6 alleles
Smit et al. Semiautomated DNA mutation analysis using a robotic workstation and molecular beacons
Drmanac et al. Sequencing by hybridization arrays
EP1483405A2 (en) METHODS OF VALIDATING SNPs AND COMPILING LIBRARIES OF ASSAYS
Kornilov et al. Molecular genetics methods for developmental scientists
Deharvengt et al. Nucleic acid analysis in the clinical laboratory
US20240084374A1 (en) Method for estimation of fetal fraction in cell-free dna from maternal sample
US20030144799A1 (en) Regulatory single nucleotide polymorphisms and methods therefor

Legal Events

Date Code Title Description
AS Assignment

Owner name: APPLERA CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DE LA VEGA, FRANCISCO;ZIEGLE, JANET S.;ISSAC, HADAR I.;AND OTHERS;REEL/FRAME:015605/0300;SIGNING DATES FROM 20041123 TO 20041215

AS Assignment

Owner name: APPLERA CORPORATION, CALIFORNIA

Free format text: CORRECTED RECORDATION TO CORRECT CONVEYING PARTY NAME (HADAR L. ISAAC) PREVIOUSLY RECORDED ON REEL 015605, FRAMES 0300-0308.;ASSIGNORS:DE LA VEGA, FRANCISCO;ZIEGLE, JANET S.;ISAAC, HADAR I.;AND OTHERS;REEL/FRAME:016362/0202;SIGNING DATES FROM 20041123 TO 20041215

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: APPLIED BIOSYSTEMS INC., CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:APPLERA CORPORATION;REEL/FRAME:023994/0538

Effective date: 20080701

Owner name: APPLIED BIOSYSTEMS, LLC, CALIFORNIA

Free format text: MERGER;ASSIGNOR:APPLIED BIOSYSTEMS INC.;REEL/FRAME:023994/0587

Effective date: 20081121