WO2001080156A1 - Method and system for determining haplotypes from a collection of polymorphisms - Google Patents

Method and system for determining haplotypes from a collection of polymorphisms

Publication number
WO2001080156A1
Authority
WO
WIPO (PCT)
Prior art keywords
haplotype
pair
genotype
haplotypes
frequency
Prior art date
Application number
PCT/US2001/012831
Other languages
French (fr)
Inventor
J. Claiborne Stephens
Andreas Windemuth
Original Assignee
Genaissance Pharmaceuticals, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genaissance Pharmaceuticals, Inc.
Priority to US10/258,155 (US20030211501A1)
Priority to EP01927246A (EP1290613A1)
Priority to AU2001253720A (AU2001253720A1)
Publication of WO2001080156A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the invention relates to the field of genomics, and genetics, including genome analysis and the study of DNA variation.
  • the invention relates to the field of predicting haplotype information from unphased and/or incomplete genotype information for an organism.
  • the invention is particularly useful in the human health care, veterinary and agricultural fields.
  • haplotypes have historically had greatest importance in the analysis of pedigree data. More recently, with the capacity to generate DNA sequence information for a large number of individuals, "haplotype" has come to mean the specific sequence of alternative variants (e.g., single nucleotide polymorphisms or "SNPs") at the polymorphic sites, often coming from a contiguous piece of DNA.
  • the inventors herein are aware of only one reference (Clark 1990) that discloses an algorithm for assigning haplotypes to unrelated individuals in a population sample. It proceeds by assigning haplotypes that are observed as homozygotes or single-site heterozygotes, then interrogating whether one (or more) of these is consistent with an ambiguous individual, that is, an individual heterozygous at two or more sites. It is an order-dependent algorithm, in that different orders give different answers, and so must be applied several times to look for differences and a single best answer.
  • This reference has identified three main problems in applying the algorithm: 1) it might never get started, if there are no unambiguous individuals; 2) it might not be able to resolve every individual in a sample; and 3) it might resolve certain individuals incorrectly.
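
The prior-art procedure described above can be sketched in a few lines. This is a minimal illustration written for this text, not code from Clark (1990) or from the patent; the function names `complement` and `clark_resolve` and the tuple-based genotype encoding are assumptions made for the example.

```python
def complement(hap, genotype):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for allele, (a, b) in zip(hap, genotype):
        if allele == a:
            other.append(b)
        elif allele == b:
            other.append(a)
        else:
            return None  # hap is not consistent with this genotype
    return tuple(other)


def clark_resolve(genotypes):
    """genotypes: list of tuples of (allele, allele) pairs, one pair per site."""
    known, assigned = [], {}

    def remember(h):
        if h not in known:
            known.append(h)

    # Seed: homozygotes and single-site heterozygotes are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            h1 = tuple(a for a, _ in g)
            h2 = tuple(b for _, b in g)
            assigned[i] = (h1, h2)
            remember(h1)
            remember(h2)

    # Inference: explain each ambiguous genotype as a known haplotype plus
    # its complement. The result depends on the order of individuals.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in assigned:
                continue
            for h in known:
                other = complement(h, g)
                if other is not None:
                    assigned[i] = (h, other)
                    remember(other)
                    progress = True
                    break
    return assigned


sample = [(("A", "A"), ("C", "C")), (("A", "G"), ("C", "T"))]
resolved = clark_resolve(sample)
```

A sample containing only double heterozygotes returns an empty assignment, which is the first problem listed above: the procedure never gets started.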
  • the methods and tools described herein provide processes for predicting haplotypes and haplotype pairs from unphased and/or incomplete genotype data.
  • the processes are preferably carried out with the aid of a computer.
  • the exemplified methods and tools are partially embodied in a computer program coupled to a database used to display and analyze haplotype, genotype and related statistical information. It includes novel graphical and computational methods for treating haplotypes, genotypes, and related data in a consistent and easy-to-interpret manner.
  • the invention relates to a process for deriving the presence and frequency of haplotypes from a collection of genotypes of several individual polymorphisms in a locus, measured for a sample group of individuals.
  • the process begins with an exhaustive enumeration (expansion) of all possible haplotypes (called the "Hap Expansion" phase), then proceeds through a self-consistent, iterative process to deduce the haplotypes most likely to be present, and the most likely assignment of haplotype pairs to each individual (called the "Hap Assignment" phase).
  • the process also results in a probability score specifying the likelihood of the result being correct.
  • the process takes advantage of family relationships among the sample group, but does not require them.
  • the process is embodied in the HAP™ Builder program, a computer code written in Java providing an interface for a skilled person to efficiently carry out the process and store the results.
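
As an illustration of the Hap Expansion phase, the sketch below enumerates every unordered haplotype pair consistent with one unphased genotype. It is a hypothetical example, not the HAP™ Builder code: the function name `expand`, the use of `None` for an unmeasured site, and the `alleles` parameter (the alleles assumed possible at a missing site) are all assumptions.

```python
from itertools import product


def expand(genotype, alleles=("A", "C", "G", "T")):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list with one entry per polymorphic site, either an
    (allele, allele) tuple or None when the site was not measured.
    """
    per_site = []
    for site in genotype:
        if site is None:
            # Missing data: any combination of the assumed alleles is possible.
            options = [(x, y) for x in alleles for y in alleles]
        else:
            a, b = site
            # A heterozygous site can be phased two ways, a homozygous site one.
            options = [(a, b)] if a == b else [(a, b), (b, a)]
        per_site.append(options)

    pairs = set()
    for combo in product(*per_site):
        h1 = "".join(x for x, _ in combo)
        h2 = "".join(y for _, y in combo)
        pairs.add(tuple(sorted((h1, h2))))  # order within a pair is irrelevant
    return sorted(pairs)


double_het = expand([("A", "G"), ("C", "T")])
with_missing = expand([("A", "A"), None], alleles=("C", "T"))
```

The number of candidate pairs grows quickly with the number of ambiguous sites, which is why the expansion is described as exhaustive.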
  • the invention relates to a method and tools for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
  • (e) determining, for each genotype obtained in step (a), a pair score F_k for each pair of haplotypes that is consistent with that genotype, wherein F_k is a function of the frequency f_j for each of the haplotypes in the pair;
  • (f) calculating, for each genotype and consistent haplotype pair whose pair score F_k meets a pair score criterion, a probability p_k that assignment of that haplotype pair to the genotype would be correct;
  • (g) generating a revised haplotype frequency f_j for each haplotype, wherein the revised haplotype frequency f_j is a function of the probability p_k for each consistent haplotype pair which contains the haplotype; and
  • (h) repeating steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency f_j employed in step (e) is replaced by the revised frequency f_j determined in step (g).
  • Steps (a) through (d) are called the initiation, or Hap Expansion, phase of the method. Steps (a), (b) and (c) can be performed for one individual at a time or in parallel.
  • Steps (e) through (g) are called the Hap Assignment phase of the method. Steps (e) and (f) can be performed for one genotype at a time or in parallel.
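
The iteration in the Hap Assignment phase can be sketched as follows. The text only says that the pair score is "a function of" the haplotype frequencies, so the specific choices here are assumptions: a uniform starting frequency, a pair score equal to the product of the two frequencies (doubled for two different haplotypes), probabilities obtained by normalizing the scores within each genotype, and an end condition based on the largest frequency change. The pair score criterion of step (f) is omitted, and the name `assign` is hypothetical.

```python
def assign(candidates, max_iter=50, tol=1e-9):
    """candidates: one list per individual of consistent (hap1, hap2) pairs."""
    haps = sorted({h for pairs in candidates for pair in pairs for h in pair})
    f = {h: 1.0 / len(haps) for h in haps}  # assumed uniform starting point
    probs = []
    for _ in range(max_iter):
        totals = dict.fromkeys(haps, 0.0)
        probs = []
        for pairs in candidates:
            # Step (e): assumed pair score, product of the two frequencies.
            scores = [f[a] * f[b] * (1 if a == b else 2) for a, b in pairs]
            norm = sum(scores)
            # Step (f): probability that each pair is the correct assignment.
            p = [s / norm for s in scores]
            probs.append(p)
            # Step (g): accumulate expected haplotype counts.
            for (a, b), pk in zip(pairs, p):
                totals[a] += pk
                totals[b] += pk
        revised = {h: totals[h] / (2 * len(candidates)) for h in haps}
        change = max(abs(revised[h] - f[h]) for h in haps)
        f = revised  # step (h): the revised frequencies feed the next pass
        if change < tol:
            break
    return f, probs


candidates = [
    [("AC", "AC")],                # homozygote, one possible pair
    [("GT", "GT")],                # homozygote, one possible pair
    [("AC", "GT"), ("AT", "GC")],  # double heterozygote, two possible pairs
]
freqs, pair_probs = assign(candidates)
```

In this small example the ambiguous individual is resolved in favor of the pair built from haplotypes already seen in the two homozygotes.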
  • the invention also relates to a method and tools for assigning a haplotype pair to a polymorphic genomic region of an individual, comprising:
  • the invention also relates to methods and tools for predicting the probable haplotypes and haplotype pairs of one or more loci of an individual.
  • the invention also relates to methods and tools for estimating the probability that the predicted haplotypes and haplotype pairs are correct.
  • the invention also relates to a method and tools for filling in missing genotype data for any individual and polymorphic site that was not or could not be measured, comprising using the most probable assignment of haplotypes determined by the methods of the invention to construct the most likely genotype for the individual.
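
A minimal sketch of this fill-in step, assuming the most probable haplotype pair has already been determined and that an unmeasured site is encoded as `None` (both the encoding and the name `fill_missing` are illustrative assumptions):

```python
def fill_missing(genotype, best_pair):
    """Replace unmeasured sites with the alleles implied by the best pair.

    genotype: list of (allele, allele) tuples, or None where no data exist.
    best_pair: the two haplotype strings judged most probable.
    """
    h1, h2 = best_pair
    filled = []
    for i, site in enumerate(genotype):
        if site is None:
            # Unphased genotype at this site: the two alleles, order ignored.
            filled.append(tuple(sorted((h1[i], h2[i]))))
        else:
            filled.append(site)
    return filled


completed = fill_missing([("A", "G"), None, ("C", "C")], ("ATC", "GCC"))
```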
  • the invention also relates to methods of constructing a haplotype database for a population containing reference haplotype pair frequency data.
  • the invention also relates to methods of predicting the presence of a haplotype pair in an individual using such a database. The methods comprise accessing the database containing reference haplotype pair frequency data to determine a probability, for each of the possible haplotype pairs, that the individual has the possible haplotype pair; and analyzing the determined probabilities to predict haplotype pairs for the individual.
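
The database lookup described above might look like the following sketch. The dictionary standing in for the reference database, the frequencies in it, and the rule of normalizing reference frequencies over the possible pairs are assumptions for illustration only.

```python
def predict_pair(possible_pairs, reference_freq):
    """Rank an individual's possible haplotype pairs by reference frequency.

    reference_freq: mapping from (hap1, hap2) to the pair's frequency in the
    reference population (a stand-in for the database in this sketch).
    """
    weights = {pair: reference_freq.get(pair, 0.0) for pair in possible_pairs}
    total = sum(weights.values())
    if total == 0:
        return None, {}  # none of the possible pairs is in the reference data
    probs = {pair: w / total for pair, w in weights.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Illustrative numbers only, not data from the patent.
reference = {("AC", "AC"): 0.20, ("AC", "GT"): 0.30, ("AT", "GC"): 0.10}
best, probs = predict_pair([("AC", "GT"), ("AT", "GC")], reference)
```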
  • the methods and tools of the invention make it possible to determine haplotypes and haplotype pairs in an individual, or in a plurality of individuals, based on unphased and/or incomplete genotype information.
  • the individuals may be part of a population such as the general population, an ethno-geographic group, or a clinical or disease population, or they may all be of the same gender.
  • the method and tools of the invention can be used to determine the haplotypes and haplotype pairs of genes responsible for specific desirable traits, e.g., drought tolerance and/or improved crop yields, and reduce the time and effort needed to transfer desirable traits.
  • the invention includes methods, computer programs and databases to analyze and make use of genotype information to deduce and/or predict haplotype information. These include methods, programs, and databases for finding and measuring the frequency of haplotypes and/or haplotype pairs in a population; and methods, programs, and databases for inferring an individual's haplotype from the individual's genotype.
  • FIGURES 1A and 1B. System Architecture Schematic.
  • FIGURE 2. First part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
  • FIGURE 3. Second part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
  • FIGURE 4 Third part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
  • FIGURE 5 DecoGenHiPTM Builder View.
  • the top half of Figure 5 is a screen showing a set of candidate genes for which polymorphism data has been obtained or is in the process of being obtained, and which may be selected for being haplotyped.
  • the columns on the right side of the screen indicate various stages in the process of analyzing target regions of the gene identified in the corresponding row.
  • the various colors provide an immediate visual indicator of the status of the gene at each stage of analysis.
  • the bottom part of this figure is a screen which provides information concerning the sequencing of various regions of the selected candidate gene.
  • FIGURE 6 Gene Structure View. This screen shows the location of features in the gene (such as promoter, introns, exons, etc.), as well as actual sequence data, for a gene for which the "Anno" column has been selected in the screen of Figure 5.
  • FIGURE 7 Gene Haplotypes View.
  • the screen in the top right side of this view shows information about the polymorphic sites in a gene for which the "Haplo" column has been selected in the screen of Figure 5, such as the location of the polymorphic sites, the type of polymorphism, and an indication of the frequency with which each polymorphism has been seen in various world population groups.
  • the screen includes boxes which may be checked to include the polymorphic site in a haplotype analysis.
  • FIGURE 8 Gene Haplotypes View (Cont).
  • the screen in the top left side of this view shows an items selection menu which results after the "Edit" item is clicked on in Figure 7.
  • the "DeHarv" menu item is highlighted.
  • FIGURE 9. Gene Haplotypes View (Cont).
  • the screen in the top right side of this view shows information which results after the "DeHARV" menu item is selected in the "Edit" menu on the top left side of Figure 8; only six of the polymorphic sites shown in the screen in the top right of Figure 8 are selected in the screen in the top right side of Figure 9.
  • FIGURE 10 Gene Haplotypes View (Cont.). This view shows screens which result when the "Filter Polymorphisms" menu item is selected in the Edit menu of Figure 9 (which is not shown open in Figure 9).
  • the box in the middle right side of this view labeled "ScoredDiplotype Objects" shows the unphased genotypes of subjects in the database and their ethno-geographic origin.
  • the screen in the bottom right side of this view labeled "ScoredHaplotypes Objects" shows the expanded haplotypes enumerated from the genotypes in the middle screen for each of the selected (accepted) polymorphic sites.
  • FIGURE 11 Gene Haplotypes View (Cont.). This view shows screens which result when the "Assign" menu item in the Edit menu of Figure 10 (which is not shown open in Figure 10) is selected one time.
  • the screen in the middle of the figure labeled "Scored Diplotype Objects" shows the "Hap1" and "Hap2" pair assignments (i.e., the genotype to haplotype resolution) for each of the individuals in the population being examined after several iterations of the HAP™ Builder algorithm, as well as the HapPair Score assigned to them.
  • “Scored Haplotype Objects” (shown in the lower right side of the view in Figure 11) provides the different haplotypes determined in the examined population, with a haplotype frequency score, as well as the number of times each haplotype is seen in the entire population and in the various population groups.
  • FIGURE 12. Gene Haplotypes View (Cont.). This view shows a window labeled "HapPair Objects" which is displayed as a result of clicking on the "Score" cell for row 94 (individual UP002) in the "ScoredDiplotypes Objects" box in the center of Figure 12. This window contains the 15 most likely haplotype pairs for subject UP002 based on the current haplotype pair scores.
  • FIGURE 13. Gene Haplotypes View (Cont.). This view shows screens which result after the "Assign" command in the Edit menu in Figure 12 has been invoked multiple times.
  • FIGURE 14. Gene Haplotypes View (Cont.). This view shows a screen with "warnings" (e.g., missing genotype data) highlighted in light gray. This view also shows a screen with the icon for the individual UP002 highlighted in dark gray in the family tree schematic because the Mendelian inheritance rules are violated.
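
The kind of Mendelian inheritance check that triggers the highlighting described for Figure 14 can be sketched as below. This is a generic trio check assuming autosomal sites with genotypes for both parents, not the program's own rule set; the name `mendelian_ok` is hypothetical.

```python
def mendelian_ok(child, father, mother):
    """Return True if the child's genotype is compatible with both parents.

    Each argument is a list of (allele, allele) tuples, one per site. At each
    site the child must carry one allele from the father and one from the mother.
    """
    for (a, b), f, m in zip(child, father, mother):
        if not ((a in f and b in m) or (b in f and a in m)):
            return False
    return True


consistent = mendelian_ok([("A", "G")], [("A", "A")], [("G", "G")])
violation = mendelian_ok([("G", "G")], [("A", "A")], [("G", "G")])
```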
  • Allele - A particular form of a genetic locus, distinguished from other forms by its particular nucleotide sequence.
  • Ambiguous polymorphic site - A heterozygous polymorphic site or a polymorphic site for which nucleotide sequence information is lacking.
  • Candidate Gene - A gene which is hypothesized to be responsible for a disease, condition, or the response to a treatment, or to be correlated with one of these.
  • Genotype - An unphased 5' to 3' sequence of nucleotide pair(s) found at one or more polymorphic sites in a locus on a pair of homologous chromosomes in an individual.
  • genotype includes a full-genotype and/or a sub-genotype as described below.
  • Full-genotype - The unphased 5' to 3' sequence of nucleotide pairs found at all known polymorphic sites in a locus on a pair of homologous chromosomes in a single individual.
  • Sub-genotype - The unphased 5' to 3' sequence of nucleotides seen at a subset of the known polymorphic sites in a locus on a pair of homologous chromosomes in a single individual.
  • Genotyping - A process for determining a genotype of an individual.
  • Haplotype - A 5' to 3' sequence of nucleotides found at one or more polymorphic sites in a locus on a single chromosome from a single individual.
  • haplotype includes a full-haplotype and/or a sub-haplotype as described below.
  • Full-haplotype - The 5' to 3' sequence of nucleotides found at all known polymorphic sites in a locus on a single chromosome from a single individual.
  • Sub-haplotype - The 5' to 3' sequence of nucleotides seen at a subset of the known polymorphic sites in a locus on a single chromosome from a single individual.
  • Haplotype pair - The two haplotypes found for a locus in a single individual.
  • Haplotyping - A process for determining one or more haplotypes in an individual and includes use of family pedigrees, molecular techniques and/or statistical inference.
  • Haplotype data - Information concerning one or more of the following for a specific gene: a listing of the haplotype pairs in each individual in a population; a listing of the different haplotypes in a population; frequency of each haplotype in that or other populations, and any known associations between one or more haplotypes and a trait.
  • Isoform - A particular form of a gene, mRNA, cDNA or the protein encoded thereby, distinguished from other forms by its particular sequence and/or structure.
  • Isogene - One of the isoforms of a gene found in a population.
  • An isogene contains all of the polymorphisms present in the particular isoform of the gene.
  • Isolated - As applied to a biological molecule such as RNA, DNA, oligonucleotide, or protein, isolated means the molecule is substantially free of other biological molecules such as nucleic acids, proteins, lipids, carbohydrates, or other material such as cellular debris and growth media. Generally, the term "isolated" is not intended to refer to a complete absence of such material or to absence of water, buffers, or salts, unless they are present in amounts that substantially interfere with the methods of the present invention.
  • Locus - A location on a chromosome or DNA molecule corresponding to a gene or a physical or phenotypic feature.
  • Naturally-occurring - A term used to designate that the object it is applied to, e.g., naturally-occurring polynucleotide or polypeptide, can be isolated from a source in nature and has not been intentionally modified by man.
  • Nucleotide pair - The nucleotides found at a polymorphic site on the two copies of a chromosome from an individual.
  • Phased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, phased means the combination of nucleotides present at those polymorphic sites on a single copy of the locus is known.
  • Polymorphism - The sequence variation observed in an individual at a polymorphic site. Polymorphisms include nucleotide substitutions, insertions, deletions and microsatellites and may, but need not, result in detectable differences in gene expression or protein function.
  • Polymorphism data - Information concerning one or more of the following for a specific gene: location of polymorphic sites; sequence variation at those sites; frequency of polymorphisms in one or more populations; the different genotypes and/or haplotypes determined for the gene; frequency of one or more of these genotypes and/or haplotypes in one or more populations; any known association(s) between a trait and a genotype or a haplotype for the gene.
  • Polymorphism Database - A collection of polymorphism data arranged in a systematic or methodical way and capable of being individually accessed by electronic or other means.
  • Polynucleotide - A nucleic acid molecule comprised of single-stranded RNA or DNA or comprised of complementary, double-stranded DNA.
  • Reference Population - A group of subjects or individuals who are predicted to be representative of the genetic variation found in the general population.
  • the reference population represents the genetic variation in the population at a certainty level of at least 85%, preferably at least 90%, more preferably at least 95% and even more preferably at least 99%.
  • SNP - Single Nucleotide Polymorphism
  • Subject - An individual whose genotypes or haplotypes or response to treatment or disease state are to be determined.
  • Treatment - A stimulus administered internally or externally to a subject.
  • Unphased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, unphased means the combination of nucleotides present at those polymorphic sites on a single copy of the locus (i.e., located on a single DNA strand) is not known.
  • the present invention may be implemented with a computer, an example of which is shown in Figure 1A.
  • the computer includes a central processing unit (CPU) connected by a system bus or other connecting means to a communication interface, system memory (RAM), non-volatile memory (ROM), and one or more other storage devices such as a hard disk drive, a diskette drive, and a CD ROM drive.
  • the computer may also include an internal or external modem (not shown).
  • the computer also includes a display device, such as a CRT monitor or an LCD display, and an input device, such as a keyboard, mouse, pen, touch-screen, or voice activation system.
  • the computer stores and executes various programs such as an operating system and application programs.
  • the computer may be embodied, for example, as a personal computer, work station, laptop, mainframe, or a personal digital assistant.
  • the computer may also be embodied as a distributed multi-processor system or as a networked system such as a LAN having a server and client terminals.
  • the present invention uses a program, referred to as the HAP™ Builder program, that generates views (or screens) displayed on a display device and which the user can interact with to accomplish a variety of tasks and analyses.
  • the HAP™ Builder program allows users to view and analyze large amounts of information such as subject identifiers (e.g., subject number or cell line number); gene-related data (e.g., gene name, gene symbol, GenBank accession number); family data (e.g., family number, father, mother, number of siblings); polymorphism data (e.g., region, position, nucleotide changes (i.e., the polymorphic nucleotide(s) as compared to the reference nucleotide(s)); genotype data (e.g., scored diplotype objects); haplotype data (e.g.
  • the HAP™ Builder program is preferably written in the Java programming language. However, the program may be written using any conventional programming language such as, for example, C, C++, Visual Basic™ or Visual Pascal™.
  • the HAP™ Builder program may be stored and executed on the computer. It may also be stored and executed in a distributed manner.
  • the data processed by the HAP™ Builder program is preferably stored as part of a relational database (e.g., an instance of an Oracle™ database or a set of ASCII flat files). This data can be stored on, for example, a CD ROM or in one or more storage devices accessible by the computer.
  • the data may be stored on one or more databases in communication with the computer via a network. In one scenario, the data will be delivered to the user on any standard media
  • the HAP™ Builder program and data may also be installed on a local machine. The HAP™ Builder program and data will then be on the machine that the user directly accesses.
  • Figure 1B shows an implementation where a network interconnects one or more host computers with one or more user terminals.
  • the communication network may, for example, include one or more local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), or a collection of interconnected networks such as the Internet.
  • the network may be wired, wireless, or some combination thereof.
  • the host computer may, for example, be a world wide web server ("web server").
  • the user terminal may, for example, be a client device such as the computer shown in Figure 1A.
  • a web server stores information documents called pages.
  • a server process listens for incoming connections from clients (e.g., browsers running on a client device). When a connection is established, the client sends a request and the server sends a reply. The request typically identifies a page by its Uniform Resource Locator (URL) and the reply includes the requested page.
  • This client-server protocol is typically performed using the hypertext transfer protocol ("http"). Pages are viewed using a browser program. They are written in a language called hypertext markup language ("html"). A typical page includes text and formatting commands called tags. Pages may also include links (pointers) to other pages. Strings of text or images that are links to other pages are called hyperlinks.
  • Hyperlinks are highlighted (e.g., by color, underlining) and may be invoked by placing the cursor on the highlighted area and selecting it (e.g., by clicking the mouse button).
  • a page may also contain a URL reference to a portion of multimedia data such as an image, video segment, or audio file.
  • Pages may also point to a Java program called an applet. When the browser connects to where the applet is stored, the applet is downloaded to the client device and executed there in a secure manner. Pages may also contain forms that prompt a user to enter information or that have active maps. Data entered by a user may be handled by common gateway interface (CGI) programs. Such programs may, for example, provide web users with access to one or more databases.
  • the host computer may include a CPU connected by a system bus or other connecting means to a communication interface, system memory (RAM), nonvolatile memory (ROM), and a mass storage device.
  • the mass storage device may, for example, be a collection of magnetic disk drives in a RAID system.
  • the mass storage device may, for example, store the aforementioned web pages, applets, and the like.
  • the host computer may also include an input device, such as a keyboard, and a display device to allow for control and management by an administrator. Additionally, the host computer may be connected to additional devices such as printers, auxiliary monitors or other input/output devices.
  • the input device and display device may also be provided on another computer coupled to the host computer.
  • the host computer may be embodied, for example, as one or more mainframes, workstations, personal computers, or other specialized hardware platforms. The functionality of the host computer may be centralized or may be implemented as a distributed system. As also shown in Figure 1B, the host computer may communicate with one or more databases stored on any of a variety of hardware platforms.
  • the HAP™ Builder program will be web-based and will be delivered as an applet that runs in a web browser. In this case, the data will reside on a server machine and will be delivered to the HAP™ Builder program using a standard protocol (e.g., HTTP with cgi-bin).
  • the network connection could use a dedicated line.
  • the network connection could use a secure protocol such as Secure Socket Layer (SSL) which only provides access to the server from a specified set of IP addresses.
  • the HAP™ Builder program can be installed on a user machine and the data can reside on a separate server machine. Communication between the two machines can be handled using standard client-server technology. An example would be to use the TCP/IP protocol to communicate between the client and an Oracle server.
  • data could be directly imported into the HAP™ Builder program by the user. This import could be carried out by reading files residing on the user's local machine, or by cutting and pasting from a user document into the interface of the HAP™ Builder program.
  • some or all of the data or the results of analyses of the data could be exported from the HAP™ Builder program to the user's local computer. This export could be carried out by saving a file to the local disk or by cutting and pasting to a user document.
  • various calculations are performed to generate items displayed on a screen or to control items displayed on a screen. As is well known, some basic calculations may be performed using database query language (SQL), while other computations are performed by the HAP™ Builder program (i.e., the Java program which, as previously mentioned, may be an applet downloaded over the internet).
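
As an example of a basic calculation that could be pushed to SQL, the sketch below counts how often each haplotype appears among assigned haplotype pairs. It uses an in-memory SQLite database purely for illustration; the table name `assignment`, its columns, and the sample rows are hypothetical and do not describe the actual database schema.

```python
import sqlite3


def haplotype_counts(rows):
    """Count haplotype occurrences across assigned pairs using SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assignment (subject TEXT, hap1 TEXT, hap2 TEXT)")
    conn.executemany("INSERT INTO assignment VALUES (?, ?, ?)", rows)
    # Stack both haplotype columns, then group to get one count per haplotype.
    result = conn.execute(
        """
        SELECT hap, COUNT(*) AS n
        FROM (
            SELECT hap1 AS hap FROM assignment
            UNION ALL
            SELECT hap2 AS hap FROM assignment
        )
        GROUP BY hap
        ORDER BY n DESC, hap
        """
    ).fetchall()
    conn.close()
    return result


# Illustrative rows only.
counts = haplotype_counts([
    ("UP001", "AC", "AC"),
    ("UP002", "AC", "GT"),
    ("UP003", "GT", "AT"),
])
```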
  • the invention relates to a process for deriving the presence and frequency of haplotypes from a collection of genotypes of several individual polymorphisms in a gene locus, measured for a sample group of individuals.
  • the process begins with an exhaustive enumeration (expansion) of all possible haplotypes (called the "Hap Expansion" phase), then proceeds through a self-consistent, iterative process to produce the haplotypes most likely to be present, as well as the most likely assignment of haplotype pairs to each individual (called the "Hap Assignment" phase).
  • the process also results in a probability score specifying the likelihood of the result being correct.
  • the process takes advantage of family relationships among the sample group, but does not require them.
  • the process is embodied in the HAP™ Builder program, a computer code written in Java providing an interface for a skilled person to efficiently carry out the process and store the results.
  • the invention relates to a method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
  • (e) determining, for each genotype obtained in step (a), a pair score F_k for each pair of haplotypes that is consistent with that genotype, wherein F_k is a function of the frequency f_j for each of the haplotypes in the pair;
  • (g) generating a revised haplotype frequency f_j for each of the haplotypes, wherein the revised haplotype frequency f_j is a function of the probability p_k for each consistent haplotype pair which contains the haplotype; and
  • (h) repeating steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency f_j employed in step (e) is replaced by the revised frequency f_j determined in step (g).
  • Steps (a) through (d) are called the initiation, or Hap Expansion, phase of the method. Steps (a), (b) and (c) can be performed for one individual at a time or in parallel.
  • Steps (e) through (g) are called the Hap Assignment phase of the method. Steps (e) and (f) can be performed for one genotype at a time or in parallel.
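
The text leaves the functional forms in steps (e) through (g) open. One common choice, given here only as an assumed illustration and not as the formulas of the patent, treats the two haplotypes of a pair as independent draws from the population. For a genotype whose k-th consistent pair consists of haplotypes a_k and b_k, and a sample of N individuals indexed by i:

```latex
% Assumed forms, for illustration only.
% Step (e): pair score from the current haplotype frequencies.
F_k = c_k \, f_{a_k} f_{b_k}, \qquad
c_k = \begin{cases} 1 & a_k = b_k \\ 2 & a_k \neq b_k \end{cases}

% Step (f): probability that pair k is the correct assignment.
p_k = \frac{F_k}{\sum_{k'} F_{k'}}

% Step (g): revised frequency of haplotype j over N individuals.
f_j^{\text{new}} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{k}
  p_k^{(i)} \left( \delta_{j,a_k} + \delta_{j,b_k} \right)
```

With these forms the revised frequencies sum to one, because each individual contributes exactly two haplotypes in expectation.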
  • the above procedure may be modified as follows: After the genotypes are obtained (or, as they are obtained), they are combined into groups, where all the genotypes in each group are identical. Groups may optionally be characterized by one or more additional criteria, such as, for example, a requirement that all individuals from whom the genotypes are derived must belong to a single population group.
  • Additional criteria may be, for example, a requirement that all individuals from whom the genotypes are derived must be of the same gender, or belong to a single clinical or disease population, or to a population exhibiting a particular response to a drug or other stimulus, or to a population characterized by a particular genotype or haplotype at some other polymorphic region.
  • all members of a group minimally have the same genotype, but there may be more than one group with the same genotype.
  • the number of individuals sharing a distinct genotype within a group g is called the multiplier, n_g.
  • Hap expansion is preferably carried out only once for each distinct (different) genotype, and the multiplier is used at the end of the expansion to give the appropriate weight to the frequency scores.
  • step (h) generating a revised haplotype frequency f_i for each haplotype, wherein the revised haplotype frequency f_i is a function of the product (n_g)(p_k) for each consistent haplotype pair which contains the haplotype; and (i) repeating steps (f) through (h) until an end condition is reached, with the proviso that for each repetition the frequency f_i employed in step (f) is replaced by the revised frequency f_i determined in step (h).
  • steps (a) through (e) are the initiation, or Hap Expansion, phase of the method.
  • steps (a), (b), (c) and (d) can be performed for one individual (or group) at a time or in parallel.
  • Steps (f) through (h) are the Hap Assignment phase of the method.
  • Steps (f) and (g) can be performed for one group at a time or in parallel. Characterizing groups by one or more additional criteria may be done before or after the enumerating step, but is preferably done before the enumerating step.
  • the invention also relates to a method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising:
  • the invention also relates to a method for assigning a haplotype pair for a polymorphic genomic region of an individual, comprising: (a) obtaining the genotype for the polymorphic genomic region from the individual;
  • the invention also relates to a method and tools for estimating the probability that the haplotype pairs assigned by the methods described immediately above are correct, comprising (a) the steps described immediately above, and further comprising (b) determining the probability score p from the formula
  • N_rank is the number of pairs of haplotypes selected by the practitioner for consideration, which is preferably a subset of all the possible consistent pairs. Typically, one would select only the N_rank highest scoring pairs for consideration.
  • the invention also relates to methods of constructing a haplotype database for a population, comprising: (a) identifying individuals to include in the population;
  • the invention also relates to methods of predicting the presence of a haplotype pair in an individual comprising, in order: (a) obtaining a genotype for the individual;
  • the methods and tools of the invention make it possible to determine haplotypes and haplotype pairs in an individual, or a plurality of individuals in a population, based on unphased and/or incomplete genotype information.
  • the individuals may, for example, be of the same gender, and/or part of the general population, an ethno-geographic population group, a clinical or disease population, or a population exhibiting a particular response to a stimulus (e.g. a response to a drug).
  • Frequency and probability scores for haplotypes and haplotype pairs are preferably calculated and used within the same population, but may be used across different populations and population groups if desired.
  • the method and tools of the invention can be used to determine the haplotypes and haplotype pairs of genes responsible for specific desirable traits, e.g., drought tolerance and/or improved crop yields, and reduce the time and effort needed to transfer desirable traits.
  • the invention includes methods, computer programs, and databases for analyzing and making use of genotype information to deduce and/or predict haplotype information. These include methods, programs, and databases for finding and measuring the frequency of haplotypes and/or haplotype pairs in a population; and methods, programs, and databases for predicting an individual's haplotypes from the individual's genotype. Various aspects of the invention are discussed in further detail below.
  • the minimum number of individuals being haplotyped be greater than the number of haplotypes expected from the number of polymorphisms in the loci being haplotyped.
  • the present inventors have empirically determined that the number of haplotypes for a gene, on average, is about 1.1 to 1.3 times the number of individual polymorphisms in the gene being studied (data not shown).
  • if the skilled artisan is interested in detecting all haplotypes for the polymorphic locus that exist in the general population above a fairly low frequency, then the size of the reference population should be sufficient to predict the existence of multiple copies of such haplotypes with high certainty. For example, in a sample of 100 individuals, a haplotype present at a frequency of 10% would be expected to occur in 19 individuals, once as a homozygote and 18 times as a heterozygote. Thus, for pharmacogenetic applications, it is desirable to use genotypes from about 100 unrelated individuals in the HAP™ Builder process described herein to establish the haplotypes that exist in the general population for a particular polymorphic locus of interest, e.g., a typical gene of pharmaceutical relevance.
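The figures of 1 homozygote and 18 heterozygotes follow from Hardy-Weinberg proportions. The short check below assumes those proportions hold; the function name `expected_carriers` is hypothetical.

```python
def expected_carriers(n_individuals, freq):
    """Expected numbers of homozygous and heterozygous carriers of a haplotype
    with population frequency `freq`, assuming Hardy-Weinberg proportions."""
    homozygotes = n_individuals * freq ** 2
    heterozygotes = n_individuals * 2 * freq * (1 - freq)
    return homozygotes, heterozygotes

hom, het = expected_carriers(100, 0.10)
total = hom + het  # about 1 + 18 = 19 carriers
```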
  • haplotypes for a very polymorphic locus e.g., one that has > 60 polymorphic sites
  • a larger reference population such as 200, 400, 600, 800, or up to about 1000 individuals.
  • Any given genotype may be heterozygous at any of the variable sites. If a genotype is found homozygous at all sites (e.g., A/A, C/C, C/C, T/T), in the absence of genotyping error the only possible assignment is two simultaneous copies of the same haplotype (ACCT). If a genotype is heterozygous at one position (e.g., A/A, C/C, C/G, T/T), there is likewise only one possible assignment, i.e. the combination of two haplotypes (ACCT and ACGT). If the genotype is heterozygous at more than one position, there are multiple assignments possible.
  • the Hap Expansion constitutes a way of enumerating all possibly observed haplotypes and assigning to them a score that amounts to an initial estimate of their frequency. Hap Expansion goes through the following steps:
  • For each genotype, all possible combinations of haplotypes that are consistent with the genotype are determined. For a fully homozygous genotype, there will be one haplotype. For a singly heterozygous genotype there will be two. For a doubly heterozygous sample there will be four, etc. In general, if there are n heterozygous positions, there will be 2^n haplotypes that are consistent with the observed genotype.
  • each haplotype in the expansion will have an evidence score of 2/2^n assigned to it.
  • a homozygous genotype will generate one haplotype with score 2
  • a singly heterozygous genotype will generate two haplotypes with a score of 1 each
  • a doubly heterozygous genotype will generate four haplotypes with a score of 0.5, etc.
  • 2^n haplotypes with a score of 2/2^n each will be generated, with the proviso that if the polymorphic genomic region is haploid or hemizygous in the individual (e.g., if it is from a sex-linked, mitochondrial or chloroplast gene), an evidence score of 1 is assigned.
  • the frequency scores for each haplotype are summed across all the samples to yield the initial haplotype frequency. For example, one haplotype may occur in the expansions of two genotypes, one singly heterozygous and one doubly heterozygous. The total initial frequency for this haplotype from the two genotypes would then be 1 plus 0.5, or 1.5. Where multiple identical genotypes have been grouped together, the evidence scores for haplotypes associated with that genotype are multiplied by the group multiplier n_g to simultaneously account for all occurrences of that genotype. The total initial frequency, added up across all haplotypes, will be two times the number of samples if all genomic regions are diploid; if haploid or hemizygous regions are represented the total will be reduced accordingly.
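The Hap Expansion steps above can be sketched as follows for diploid regions with complete data. This is an illustrative sketch only: the function names and the representation of a genotype as a list of allele pairs are assumptions, and missing data, haploid regions and error weighting are not handled.

```python
from collections import defaultdict
from itertools import product

def expand(genotype):
    """Return {haplotype: evidence score} for one diploid genotype.

    `genotype` is a list of (allele, allele) tuples, one per polymorphic site.
    With n heterozygous sites there are 2**n haplotypes, each scored 2 / 2**n.
    """
    options = [sorted(set(site)) for site in genotype]
    n_het = sum(len(o) == 2 for o in options)
    score = 2 / 2 ** n_het
    return {"".join(h): score for h in product(*options)}

def initial_frequencies(groups):
    """Sum evidence scores over (genotype, n_g) groups to get initial f_i."""
    f = defaultdict(float)
    for genotype, n_g in groups:
        for hap, s in expand(genotype).items():
            f[hap] += n_g * s
    return dict(f)

# One singly heterozygous and one doubly heterozygous genotype (hypothetical data).
single_het = [("A", "A"), ("C", "C"), ("C", "G"), ("T", "T")]
double_het = [("A", "A"), ("C", "C"), ("C", "G"), ("G", "T")]
freqs = initial_frequencies([(single_het, 1), (double_het, 1)])
```

Here the haplotype ACCT appears in both expansions and receives 1 + 0.5 = 1.5, and the scores sum to twice the number of samples, as the text states.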
  • the evidence score is a function of the number of ambiguous polymorphic sites being haplotyped.
  • an ambiguous polymorphic site is either a heterozygous site or a site for which nucleotide sequence information is lacking.
  • the evidence score s_i obeys one or both of the following formulas:
  • n is the number of ambiguous positions being haplotyped in the genotype.
  • an evidence score of 1 is assigned.
  • the initial frequency f_i is calculated from the sum of the evidence scores across all the different individuals (or genotypes, where the evidence scores s_i are weighted appropriately by multiplying by the group multipliers n_g), for each of the enumerated possible haplotypes h_i, wherein it is understood that h_i is an index, e.g., h_i, h_j, etc.
  • the set of evidence scores s_i for each genotype, and the derived initial frequency scores f_i for each haplotype summed across all four genotypes, are also shown.
  • Table 2 shows similar information for the same haplotypes, but where two additional individuals having genotype 2 have been added to the population, and illustrates an embodiment of the invention wherein grouping of identical genotypes has been carried out.
  • Table 2 column headings: Genotype 1 (A/G, A/C, C/T, G/T; score s_i); Genotype 2, 3 occurrences (A/A, A/C, C/T, G/T; score (n_g)(s_i)); Genotype 3 (A/G, C/C, C/T, G/T; score s_i); Genotype 4 (G/G, A/C, T/T, G/T; score s_i); summed frequency f_i.
  • haplotype frequency scores f_i generated in the Hap Expansion serve as an initial estimate of the expected frequency of the haplotypes. Many genotypes will allow only one possible combination of two of the haplotypes from the expansion. This is true for the homozygous and singly heterozygous genotypes. For multiply heterozygous genotypes, there are generally many possibilities. However, since of the 2^n theoretically possible haplotypes only about n actually occur, many real haplotypes will occur in more than one of the samples, and their frequency scores f_i will be higher than those of rare or non-occurring haplotypes. A pair frequency score F_k can be assigned to each pair of haplotypes.
  • the pair frequency score F_k is a function of the haplotype frequency scores (f_i, f_j) for each of the haplotypes h_i and h_j in the pair.
  • the frequency score criterion may be user defined or it may be a default value.
  • the frequency score criterion is set at f_i > 0.1. This means that haplotypes with less than a 10% chance of occurring in any individual in the entire sample are eliminated from the Hap Assignment phase.
  • lower values for the frequency score criterion (e.g., f_i > 0.01, or f_i > 0.001) will result in slightly greater accuracy in making hap pair assignments, but greater computing time and/or resources will be required for lower values.
  • the value for this criterion may be any number that the practitioner skilled in the art might find suitable to balance the desired degree of accuracy with the constraints on available time and computational resources.
  • the pair score criterion is preferably (a) a specific numerical cutoff; (b) a function of the values of the pair scores; or (c) a function of the rankings of the pair scores.
  • the Hap Assignment phase of the method of the invention preferably further comprises determining a probability p_k that the haplotype pair which has been assigned to the genotype is correct.
  • the probability score p_k is determined by (a) ranking each of the pair scores F_k for all the N possible pairs for a genotype with the highest score (F_0) first; and (b) defining the probability p_k as the score of the first pair divided by the sum of all scores:
  • the probability score p_k is determined by (a) ranking each of the pair scores F_k for all the possible pairs for a genotype with the highest score (F_0) first; (b) disregarding all but the N_rank highest ranking assignments; and (c) defining the probability p_k as the score of the first pair divided by the sum of all scores, using the formula:
  • a revised haplotype frequency score f_i is calculated for each of the haplotypes, wherein the revised haplotype frequency f_i is a function of the previously determined probability p_k for each consistent haplotype pair which contains the haplotype.
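The formulas themselves did not survive in this text, but the verbal definition (top-ranked score F_0 divided by the sum of the retained scores) can be sketched as below. The function name and the `n_rank` parameter are assumptions chosen to mirror N_rank; how each pair score F_k is computed from the haplotype frequencies is not specified here, so the pair scores are taken as given.

```python
def top_pair_probability(pair_scores, n_rank=None):
    """Probability that the best-scoring haplotype pair is the correct one.

    Scores are ranked with the highest (F_0) first; if `n_rank` is given,
    only the n_rank highest scores are kept before normalizing.
    """
    ranked = sorted(pair_scores, reverse=True)
    if n_rank is not None:
        ranked = ranked[:n_rank]
    return ranked[0] / sum(ranked)

p_all = top_pair_probability([1.0, 3.0])                 # 3 / (3 + 1)
p_top2 = top_pair_probability([6.0, 3.0, 1.0], n_rank=2)  # 6 / (6 + 3)
```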
  • a new set of frequency estimates is calculated from the probability scores.
  • the new frequency f_i of haplotype i is calculated as the sum of the p_k for all pair assignments containing haplotype i, counting homozygous pairs (i, i) twice. Again, the sum of the frequencies across all haplotypes will be two times the number of samples, if all genomic regions are diploid; if haploid or hemizygous regions are represented the total will be reduced accordingly.
  • Revised haplotype pair scores F_k and revised probability scores p_k can be determined based on the revised haplotype frequency scores f_i using the methods described above. These steps can be repeated until an end condition is reached.
  • the new frequency f_i of haplotype h_i is calculated as the sum of the products (n_g)(p_k) for all pair assignments containing haplotype h_i, further multiplied by 2 for homozygous pairs (i, i).
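A minimal sketch of this frequency update follows. It assumes that every candidate pair for a genotype carries its own probability p_k and that these probabilities sum to 1 within each genotype; the tuple layout and the function name are hypothetical.

```python
from collections import defaultdict

def revise_frequencies(assignments):
    """Recompute haplotype frequencies from pair probabilities.

    `assignments` is an iterable of (hap_i, hap_j, p_k, n_g) tuples, one per
    candidate haplotype pair of each genotype group.
    """
    f = defaultdict(float)
    for hap_i, hap_j, p_k, n_g in assignments:
        f[hap_i] += n_g * p_k
        f[hap_j] += n_g * p_k  # a homozygous pair (i, i) is thereby counted twice
    return dict(f)

# Two individuals homozygous for ACCT, plus one doubly heterozygous individual
# with two candidate pairs (hypothetical probabilities).
assignments = [
    ("ACCT", "ACCT", 1.0, 2),
    ("ACCT", "ACGG", 0.75, 1),
    ("ACCG", "ACGT", 0.25, 1),
]
revised = revise_frequencies(assignments)
```

With three diploid samples the revised frequencies sum to 6, matching the statement that the total is twice the number of samples.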
  • the end condition can be one of many possible parameters. It may be user definable or a default condition. It may also be variable or set.
  • the end condition may be met when the above-mentioned iteration steps are repeated a preset number of times.
  • the end condition may be met when one or more of the parameters f_i, F_k, and p_k stabilizes; or the end condition may be met simply when the operator chooses to stop.
  • Stabilization can mean: (1) the maximum difference between consecutive iterations of F_k or p_k goes below a threshold; (2) the ranking (or truncated ranking) does not change for a given number of iterations, or (3) any suitable quantity does not change more than a threshold.
  • the preferred end condition is (1). In another preferred method of the invention exemplified herein, the end condition is met when the operator chooses to stop.
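Stabilization criterion (1) can be sketched as a simple comparison of two consecutive iterations. The function name, the dictionary representation and the default threshold are assumptions for illustration; the text does not give a threshold value.

```python
def has_stabilized(previous, current, threshold=1e-4):
    """True when the largest change in any score between two consecutive
    iterations is below `threshold`.

    `previous` and `current` map a pair (or haplotype) identifier to its score;
    an identifier missing from one iteration is treated as having score 0.
    """
    keys = set(previous) | set(current)
    largest_change = max(
        (abs(current.get(k, 0.0) - previous.get(k, 0.0)) for k in keys),
        default=0.0,
    )
    return largest_change < threshold

still_moving = has_stabilized({"a": 0.70, "b": 0.30}, {"a": 0.75, "b": 0.25})
settled = has_stabilized({"a": 0.75, "b": 0.25}, {"a": 0.75001, "b": 0.24999})
```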
  • genotypes that meet an ambiguity criterion have their haplotypes enumerated (see Fig. 2).
  • the ambiguity criterion may be user definable or may be a default value.
  • the ambiguity criterion is preferably a function of the number of ambiguous polymorphic sites in the genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which information is lacking.
  • haplotype pairs that meet a pair score criterion will be kept.
  • the pair score criterion is a function of the pair score F_k for each of the haplotype pairs, as discussed above.
  • the pair score F_k is retained only if it is one of the top 15 pair scores.
  • only those haplotype pairs whose pair scores F_k are greater than a certain percentage of F_k,max (the highest F_k associated with any consistent haplotype pair) will be kept.
  • the possibility of error in the input data may be considered.
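The two pair score criteria mentioned above (a rank cutoff and a fraction of the highest score) can be sketched as follows. Function names and the dictionary layout are assumptions; the fraction used in the example is arbitrary, since the text does not state a percentage.

```python
def keep_top_pairs(pair_scores, top_n=15):
    """Keep only the `top_n` highest-scoring haplotype pairs."""
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

def keep_above_fraction(pair_scores, fraction):
    """Keep pairs whose score exceeds `fraction` times the highest score."""
    cutoff = fraction * max(pair_scores.values())
    return {pair: s for pair, s in pair_scores.items() if s > cutoff}

scores = {("ACCT", "ACGG"): 6.0, ("ACCG", "ACGT"): 3.0, ("ACCT", "ACGT"): 0.2}
top_two = keep_top_pairs(scores, top_n=2)
above_tenth = keep_above_fraction(scores, 0.10)
```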
  • a fully homozygous sample may generate a number of haplotypes, i.e., the principal one, and all the additional ones which would be introduced if any one of the positions were changed to be heterozygous.
  • the score of each such additional haplotype is considerably reduced by multiplying by the assumed error probability, a number usually set at 0.01 - 0.02.
  • if the most highly scoring haplotype pair is not consistent with the input genotype (despite the strong 1%-2% penalty factor), the difference is highlighted and reported as a probable misread (e.g., a sequencing error).
  • the preferred way to modify the method to allow for the possibility of errors is as follows. For each measured genotype (i.e. for each individual at each polymorphic position), replace the exclusive determination (either of A, A/C, C) by the specification of probabilities as follows: p_c, the probability for the common allele; p_h, for the heterozygote; and p_r, for the rare allele. Preferably, these probabilities are estimated individually for each genotype measurement according to the quality of the raw data by the procedure used to determine the genotypes. Alternatively, a single error probability p_err can be defined that estimates the probability for any given allele to be determined erroneously.
  • each haplotype may be individually weighted for its consistency with a given individual genotype.
  • the preferred weights for the Hap Expansion are
  • in step (f) of the Hap Assignment process, a pair of haplotypes h_i h_j may be weighted for its consistency with an individual genotype.
  • the preferred weights for the Hap Assignment are
  • F'_k = W_ij F_k
  • Case 1 An inconsistent pair has a score that is comparable to, but lower than, the score of a consistent pair. In this case, one may conclude that there is a significant probability that the genotype causing the inconsistency was measured incorrectly. In the presently implemented version of the invention, this genotype is characterized as a possible miscall (error detection).
  • Case 2 An inconsistent pair ranks first on the list of possible assignments. In the presently implemented version of the invention, this genotype is characterized as most likely wrong, and we choose the haplotype assignment in spite of its inconsistency, effectively overriding the genotyping call (error correction).
  • a recursive pruning algorithm is used in the Hap expansion phase to eliminate from consideration enumerated haplotypes whose evidence scores s_i are below a given threshold value, and/or is used in the Hap assignment phase to eliminate from consideration haplotype pairs whose pair scores F_k are below a threshold value.
  • a pruning algorithm is preferred because the number of possible haplotypes grows exponentially with the number of sites, and because an exhaustive enumeration is rarely desirable.
  • the threshold for W_i is an evidence score criterion
  • the threshold for W_ij is a pair score criterion.
  • Step (e) ensures that any branches of the search that are already doomed because of too many mismatches and/or rare polymorphisms will not be followed. This changes the computational complexity of the algorithm such that it rises more or less linearly rather than exponentially with the number of sites, making the calculation more practical.
  • the evidence score criterion is chosen to optimize the use of computer resources.
  • H. MENDELIAN INHERITANCE Another aspect of the method of invention provides a method for optionally adjusting the assignment probability scores p_k to reflect the requirement of Mendelian inheritance between individuals who are related. For example, when there is at least one multi-generation family included among the individuals whose polymorphic genomic regions are being haplotyped, the probability p_k may be reduced for each pair assignment for each genotype in the family that does not obey Mendelian inheritance. In the simplest embodiment of this aspect of the invention, the scores for any assignment which does not obey Mendelian inheritance with respect to other higher ranking assignments for the relatives are set to zero.
  • Another, preferred, embodiment is the multiplication of an unadjusted probability score p_k by (1 - p_k'), where p_k' is the score of any assignment k' of a related person that is in conflict with the first assignment k (i.e., p_k is replaced by p_k(1 - p_k'), where p_k' is the probability calculated for a pair assignment of a related genotype).
  • Hap assignments are preferably constrained to obey Mendelian segregation rules, i.e. one of the copies must be inherited from the father, and one from the mother. This constraint is used in the HAP™ Builder process to eliminate solutions that violate inheritance rules and increase the probability scores of those that do not. It will be appreciated that individuals who in fact are not the offspring of an expected parent are readily identified, and their haplotype pair assignments will not be subjected to the Mendelian segregation criterion.
  • Mendelian segregation rules can also be used to validate the HAP™ Builder process. For example, genotype information from individuals belonging to one or more three-generation families may be entered into the database so that they can be treated as either being related or not being related. The haplotype assignments under each of these conditions can be compared for consistency.
  • Another aspect of the method of the invention provides for the optional adjustment of the assignment probability scores p_k to reflect Hardy-Weinberg Equilibrium.
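The preferred Mendelian adjustment can be sketched as below. The function name is hypothetical, and applying the factor once per conflicting relative assignment (a running product) is an assumption; the text only describes multiplication by (1 - p_k') for a conflicting assignment k'.

```python
def mendelian_adjust(p_k, conflicting_relative_scores):
    """Reduce an assignment probability p_k for Mendelian conflicts.

    For every conflicting assignment of a relative, with probability p_rel,
    p_k is multiplied by (1 - p_rel).
    """
    for p_rel in conflicting_relative_scores:
        p_k *= 1.0 - p_rel
    return p_k

adjusted = mendelian_adjust(0.8, [0.5])  # 0.8 * (1 - 0.5)
unchanged = mendelian_adjust(0.8, [])    # no conflicts, no penalty
```

A conflicting relative assignment with probability 1 drives the score to zero, which reproduces the simplest embodiment as a limiting case.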
  • the probability p k may be reduced for each pair assignment for each genotype in the population group that does not obey Hardy-Weinberg Equilibrium.
  • This score adjustment is to multiply the scores p_ii, p_jj, and p_ij by one minus the chi-squared value for the deviation from Hardy-Weinberg equilibrium for all pairs of different haplotypes h_i and h_j:
  • the probability p_k may be reduced to p'_k by the above formula, wherein f_i and f_j are the frequencies of haplotypes h_i and h_j in the population group and F_ii, F_jj and F_ij are the frequencies of each possible pair of haplotypes h_i and h_j in the population group.
  • Another aspect of the invention provides methods and tools to infer haplotypes from every genotype, despite the presence of ambiguous polymorphic sites (sites where data is absent).
  • the input to the program for each genotype measurement is a set of three probabilities, one each for a homozygous common allele, for a heterozygote and for a homozygous rare allele. If no data is available at all, in the HAP™ Builder program as currently implemented these probabilities default to 0.25, 0.5, and 0.25, respectively. The program accommodates these probabilities and still generates the most likely haplotype pair assignments. The missing genotypes can then be inferred by combining the appropriate alleles from the assigned pair of haplotypes.
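The last step, recovering missing genotypes from an assigned haplotype pair, can be sketched as follows. The constant and function names are hypothetical, and haplotypes are assumed to be strings with one character per polymorphic site.

```python
# Default (common homozygote, heterozygote, rare homozygote) probabilities
# used when no data is available at a site, as stated in the text.
DEFAULT_PRIOR = (0.25, 0.5, 0.25)

def infer_genotype(hap_a, hap_b):
    """Combine the alleles of an assigned haplotype pair, site by site,
    into the implied unphased genotype."""
    return ["/".join(sorted((a, b))) for a, b in zip(hap_a, hap_b)]

inferred = infer_genotype("ACCT", "ACGT")
```

Any site that was missing in the input can then be read from the corresponding position of the inferred genotype.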
  • an end condition is tested for after a new set of haplotype frequency scores have been iterated (see Figure 4).
  • an end condition can be tested for at any point during the iterations, and such alternative embodiments are considered part of the invention.
  • the end condition is a function of the pair scores, it will be appropriate to test for the end condition after a new set of pair scores have been iterated.
  • operator intervention based upon human judgment, is the means for terminating the iterations, the iterations can of course be ended at any point.
  • each genotype is never assumed to be just one that was measured, but could be any with weighted probability.
  • an A is not just an A, it is an A with a probability of (1 - p)^2, an A/G with probability of 2p(1 - p), or even a G with a (vanishing) probability of p^2.
  • p is the "error probability", and is usually close to 0.
  • a value of 0.01 is employed in the current preferred embodiment, corresponding to a (probably exaggerated) accuracy of 99% in calling the genotypes.
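The three weights for a homozygous call can be computed directly from the error probability. This is a small worked example with a hypothetical function name, using the stated value p = 0.01.

```python
def call_weights(p_err=0.01):
    """Weights for a homozygous call (e.g. A) under a per-allele error
    probability p_err: call correct, one allele wrong, both alleles wrong."""
    as_called = (1 - p_err) ** 2
    one_allele_wrong = 2 * p_err * (1 - p_err)
    both_wrong = p_err ** 2
    return as_called, one_allele_wrong, both_wrong

w_same, w_het, w_other = call_weights(0.01)  # roughly 0.9801, 0.0198, 0.0001
```

The three weights always sum to 1, since they are the terms of ((1 - p) + p)^2.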
  • these probabilities are used as weights, so that, for example, the genotype A G C/T, which would normally expand into the haplotype pair AGC+AGT, may expand into something like this:
  • a simple recursive pruning method is used to find all the contributions above a certain threshold weight, currently set at 0.01, such that only single error possibilities are used.
  • a similar pruning algorithm is used for the haplotype pair assignment, where assignments are made that do not exactly fit the genotype, with the appropriate low weight.
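A recursive pruning search of this kind can be sketched as below. This is an illustration under stated assumptions, not the patented algorithm: the function name, the per-site list of (allele, weight) choices and the example weights 0.98 and 0.02 are all hypothetical. The idea is that a branch is abandoned as soon as its accumulated weight falls below the threshold, which is safe because every per-site weight is at most 1.

```python
def pruned_expand(site_options, threshold=0.01):
    """Enumerate haplotypes site by site, pruning low-weight branches.

    `site_options` holds one list per site of (allele, weight) choices.
    Returns {haplotype: weight} for haplotypes whose weight, the product of
    the chosen per-site weights, is at least `threshold`.
    """
    results = {}

    def recurse(pos, prefix, weight):
        if weight < threshold:
            return  # weights never exceed 1, so this branch cannot recover
        if pos == len(site_options):
            results[prefix] = weight
            return
        for allele, w in site_options[pos]:
            recurse(pos + 1, prefix + allele, weight * w)

    recurse(0, "", 1.0)
    return results

# Genotype A, G, C/T: the called alleles get weight 0.98, a hypothetical
# miscalled alternative gets 0.02, and the heterozygous site is unweighted.
sites = [
    [("A", 0.98), ("G", 0.02)],
    [("G", 0.98), ("A", 0.02)],
    [("C", 1.0), ("T", 1.0)],
]
haps = pruned_expand(sites, threshold=0.01)
```

With these weights, haplotypes containing a single assumed error survive the 0.01 threshold while those containing two errors are pruned, in line with the statement that only single error possibilities are used.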
  • Genotype data for input into the HAP™ Builder program may be generated by the practitioner by sequencing DNA from a population of interest, or may be obtained from various commercial sources of genotype data such as commercial SNP database providers.
  • Publicly available SNP databases may also be used, such as for example the Human Genic Bi-Allelic Sequences database (HGBASE), the dbSNP database maintained by the National Center for Biotechnology Information, and the Human SNP database maintained by the Whitehead Institute at the Massachusetts Institute of Technology. These public databases are readily accessible via the internet.
  • the data is suitably formatted when stored in a DecoGen™ database as described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
  • a person may use a user terminal to view a screen which allows the user to see all of the candidate genes, or a subset thereof, and to bring up further information.
  • This screen (as well as all the other screens described herein) may, for example, be presented as a web page, or a series of web pages, from a web server. This web based use may involve a dedicated phone line, if desired. Alternatively, this screen may be served over the network from a non- web based server or may simply be generated within the user terminal.
  • An example of such a screen referred to herein is illustrated in the top half of Figure 5.
  • the top half of Figure 5 is an example of a screen showing a set of candidate genes for which polymorphism data has been obtained or is in the process of being obtained.
  • This polymorphism data and other information described below may be stored in a database such as the one described in U.S. application serial no. 60/141,521 and in international application WO 01/01218, or is calculated from information stored in such a database. Most of the information shown in later figures is specific to the Index Repository described herein.
  • the screen shows genes for which data is currently available in a database useful in the invention and those queued for processing (and for which data will appear in the database).
  • the "Row” column indicates the order in which genes were entered into the database
  • the "Id” column is a numerical identifier for the gene having the symbol and name indicated in the "Symbol” and “Name” columns.
  • the columns on the right side of the screen indicate various stages in the process of analyzing target regions of the gene identified in the corresponding row. For example, "Anno” is shorthand for "Annotation”, which is the operation performed at the beginning of the gene analysis process to annotate different features of the gene structure, such as the locations and sequences of the promoter, exons and introns as described in more detail below.
  • the number in the Anno column provides the number of different annotated features of the gene.
  • the "PCR” and “Sequ” columns indicate how many of the target regions of the gene have been analyzed successfully by the PCR and Sequencing production groups, respectively.
  • the number of polymorphic sites identified for the gene is shown in the "Geno” column.
  • the number of haplotypes deduced by the HAP™ Builder method of the present invention is shown in the "Haplo” column.
  • the various colors provide an immediate visual indicator of the status of the gene at each stage of analysis, with green and yellow indicating completely done and in progress, respectively, and white indicating that no target regions have arrived at that stage in the analysis process.
  • the status of genes in the different production stages may be indicated by different degrees or types of shading.
  • the genes in the database may be sorted by various criteria by clicking on any of the columns shown in the top half of Figure 5, e.g., clicking on "Id” allows the genes to be sorted in ascending or descending numerical order, clicking on "Name” allows the genes to be sorted in alphabetical order, and clicking on "Sequ” allows the genes to be sorted by number of fragments.
  • the user can select a gene to examine in detail by using the mouse (or other user-input device such as keyboard, roller ball, voice recognition, etc.) to select the candidate gene.
  • the prodynorphin gene is selected, as indicated by the purple color in what is shown in Row 408 of the figure.
  • the screen may optionally include a "find” feature, to locate a candidate gene of interest.
  • a single click on the selected gene brings up the screen shown in the bottom half of Figure 5, which provides sequencing workflow information, i.e., numerical workflow identifiers for the sequencing and PCR reactions ("Run” and “PCR” columns), in both forward and reverse directions (“Dir” column), that have been performed for various fragments from each of the target regions of the gene (for example, fragment exon 3.1 from exon 3).
  • a check in the "Ready” column indicates when a gene fragment is ready to be analyzed for polymorphisms and the "Status” column indicates whether there is sequencing information for both strands of the fragment.
  • Such information and screens are not necessary for using the methods of the present invention, but may be used to monitor the progress and/or extent of sequencing of candidate gene(s) (or other loci) input into the database and may be useful in providing an estimate of the reliability of the sequence data which has been input into the database. Decisions about whether or not to proceed with polymorphism analysis in one or more of the fragments of the selected gene may be based on the status of the sequencing runs. For example, if sequence information is available for both strands, the sequence will be more reliable and, therefore, so will the polymorphism data.
  • Figure 6 shows an example of the annotation screen, which is reached by clicking on "Anno” in the screen depicted in Figure 5.
  • the PDYN gene contains 10 features, each of which has the indicated lengths and the indicated start and stop positions with respect to the indicated Accession number.
  • the Accession number is typically the GenBank Accession number for the gene, although it may be an identifying number from another publicly available database or an internal identifying number. If the complete gene sequence is not known, the "Accession" column may contain multiple identifying numbers for partial sequences. A check in the "Rev” column indicates the coding sequence for the gene is found in the reverse complement of the Accession number.
  • the "Seqlen” column indicates the number of nucleotides entered into the "Sequence” box at the end of the row. The amount of sequence shown may be increased by enlarging the window; the entire sequence for a feature may be displayed by clicking on the particular sequence of interest.
  • the information contained in the "Anno” screen is typically derived from GenBank and other public data sources.
  • a single click on the haplotype (“Haplo") column in that row brings up the screen for the HAP™ Builder program, an example of which is shown in Figure 7.
  • the screen exemplified in Figure 7 shows several boxes at the same time, although one or more of the boxes may be expanded by dragging the dividers between the boxes.
  • the window on the left (labeled "Family Objects" in Figure 7) will typically show a list of the different multi-generation families available for polymorphism analysis and relevant information concerning each family, such as numerical identifiers for the father and mother, and the number of children "siblings". This window will typically show a family tree below the list of families. Males are shown as rectangular boxes and females are shown as ovals.
  • Family 1333 is selected in the box on the upper left side, therefore, the family tree for that family is displayed. Family trees for other families may be displayed by clicking on the name of the desired family in the top of the window. If nothing had been clicked on, Family 13291 would have been the default family tree displayed.
  • the screen exemplified in Figure 7 will typically also show a box that provides information about the polymorphism data for a selected gene (labeled "ScoredPolymorphism Objects" in top right side of Figure 7).
  • Each row contains information for a different polymorphic site (PS) identified in the gene from a population (a group of people whose nucleotide sequences have been examined for this gene).
  • PS polymorphic site
  • the screen indicates that eleven PS were detected in the PDYN gene.
  • the "Region” column indicates the region in the gene where the polymorphic site is located (e.g., the promoter, the first intron, the first exon, etc.).
  • the number in the first "Pos” column indicates the location of the polymorphism in the indicated region of the gene, while the number in the second “Pos” column indicates the location of the polymorphism in the genomic sequence, based on the numbering of the Accession sequence.
  • the Accession number is preferably the same Accession number as presented in the "Anno” screen, although it may be a different number.
  • the rows can be sorted by clicking on “Row”, “Position” or “Accession”. Clicking on “Row” orders the gene from 5' to 3'.
  • the “Change” column typically contains the identity of the alternative nucleotides observed at the indicated PS and, for those polymorphisms which result in amino acid variation, the identity of the alternative amino acids.
  • the "Wild" column contains the number of individuals in the analyzed population homozygous for the wild-type allele (the most common allele or reference allele).
  • the "Mut" column contains the number of individuals homozygous for the least common allele or uncommon variant allele.
  • the "Het” column contains the number of individuals heterozygous at that PS.
  • the most and least common nucleotide (or encoded amino acid) at each site is defined by looking at the genotypes of all individuals in the population at that particular site. The nucleotide that shows up most often is called the most common nucleotide. The one that shows up less often is termed the least common.
  • the "Err" column indicates the number of individuals in which the variation in the "Change" column may have been incorrectly determined. Checking a box in a row under the "Accept" column indicates that the haplotype is to include genotype information for the polymorphic site in that row. When a box under the "Accept" column is not checked, the genotype information concerning the polymorphic site described in that row will not be considered in the haplotype analysis for each of the individuals.
  • the screen exemplified in Figure 7 displays the polymorphism frequency calculated for various groups of the analyzed population.
  • the different population groups are African American (AF), Asian (AS), Caucasian (CA), primate (PT; one chimpanzee individual named "Harv”) and other (OT; three native American individuals).
  • the PDYN data set shown in Figure 7 includes five "chimp-specific" polymorphic sites, i.e., the human individuals examined were all monomorphic at the position, but the chimpanzee had at least one alternative allele at that position.
  • the rows containing these "chimp-specific" sites may be removed from this window by selecting the "Edit" button in the top left corner of the screen (which brings up the pull-down menu illustrated in Figure 8), then selecting "De-HARV", which unchecks the appropriate boxes in the "Accept" column (as shown in Figure 9), and then selecting "Filter Polymorphisms".
  • the resulting human polymorphic sites for the PDYN data set are shown in Figure 10.
  • the box in the middle right side of the screen shown in Figure 10 labeled "Scored Diplotype Objects" provides the genotype at each of the selected (accepted) polymorphic sites for each individual in the population being examined.
  • the genotype data is shown for each of the 6 human polymorphic sites selected in the screen at the top of the figure for the PDYN gene in the indicated individuals from the Index Repository.
  • Each row contains genotype information for a different individual and the genotypes for additional individuals in the population may be accessed by scrolling up and down, or by enlarging the window.
  • the empty cells colored pink indicate those polymorphic sites for which sequence information is not present in the database.
  • the "Subject" and "Eth" columns list the numerical identifier and ethnicity (i.e., population group) for the individual, respectively, using the same two-letter codes for the population groups described above.
  • the "Hap1" and "Hap2" columns are empty in Figure 10, but during the haplotype assignment process described above, these columns will indicate the most likely resolved haplotypes for the genotype for each individual in each row, based on the pair frequency score Fk determined for that pair by the method described herein, and listed in the "Score" column after each iteration of the haplotype assignment phase.
  • this screen initially appears when the user clicks on the "Haplo” button in the screen shown in Figure 7.
  • the user selects the "Assign" command in the pull-down menu in Figure 10 (not shown).
  • An example of a screen showing the result following one or more iterations of the haplotype pair assignment phase is shown, e.g., in Figures 12 and 13, respectively.
  • the numbers in the "Hap1" and "Hap2" columns in the screens correspond to the HAP ID numbers in the window labeled "Scored Haplotype Objects" in the lower right side of the screens shown in Figures 12 and 13.
  • for the genotype for individual UP018 in row 85 of the window in the middle right side of the screen in Figure 12, the number 2 appears in the "Hap1" column and the number 7 appears in the "Hap2" column.
  • the window labeled "Scored Haplotype Objects" provides the different haplotypes determined for the selected (accepted) polymorphic sites for the selected gene in the examined population.
  • Each row contains a unique haplotype, with the current haplotype frequency score fj of each haplotype listed in the "Score" column.
  • the number of times each haplotype is seen in the entire population and in the various population groups is indicated in the "Count" and following six columns, respectively, where "AF", "AS", "CA", "HL" and "OT" are as described above.
  • the information in this window can also be sorted by haplotype frequency score fj, by clicking on "Score”.
  • the PT and OT columns may be hidden manually or not considered in the HAP™ Builder process.
  • the "Information Entropy" shown at the top of the "ScoredHaplotypes Objects” window is a measure of the amount of variability of the locus. It measures the amount of information (in bits) that is needed to specify the genotype at the locus. If a locus has only one possible haplotype, there is only one possibility and the information entropy is zero. If there are four equally likely haplotypes, 2 bits of information are needed to specify which of the four is present.
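  • the general formula is $H = -\sum_j f_j \log_2 f_j$, where $f_j$ is the frequency of haplotype $j$ and the sum runs over all haplotypes at the locus.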
  • the first number shown in the "ScoredHaplotype objects" box is the information entropy of the locus as calculated from the possible haplotypes and their frequencies.
  • the second number is the same quantity under the (erroneous) assumption that all polymorphisms are independent of each other. The former is never larger than the latter, and the difference indicates the degree to which the polymorphisms are linked.
  • the largest possible information entropy is the number of polymorphisms N (reached if all N polymorphisms are balanced and independent of each other, or, in other words, if all 2^N possible haplotypes are equally likely); more typically the values are between 0.5 and 3.
  • a large information entropy for a locus indicates greater variability, i.e., more haplotypes exist, and thus this locus may be more useful in finding associations with phenotypes than a locus with a smaller information entropy. This information is not used in building haplotypes.
  • Selecting the "Edit” menu in the top left corner of Figure 8 brings up the menu shown, having the following command selections: “Assign”; “New Locus”; “De-HARN”; “Filter Polymorphisms”; “Filter Haplotypes”; "Store”; and "Export”.
  • Each invoking of the "Assign” command causes an additional iteration of the above-described haplotype assignment method to be carried out.
  • Selecting "New Locus" clears out the scores and haplotype assignments and fills the "ScoredPolymorphism objects" box with data for all available polymorphisms for the locus.
  • Selecting "De-HARV" removes the "Accept" checkmarks from those polymorphisms that are specific to the chimpanzee, i.e., those which are monomorphic in the human population. This selection is usually made when using the HAP™ Builder program, but does not need to be. The individual "Accept" checkmarks can also be modified manually. Selecting "Filter Polymorphisms" will eliminate from the list and from the analysis all polymorphisms which are not checked in the "Accept" column. Simultaneously, the Hap Expansion is performed and the resulting haplotypes are displayed in the "ScoredHaplotype objects" box.
  • Selecting "Filter Haplotypes” allows the user to eliminate those haplotypes from the "ScoredHaplotype objects" box which have not been assigned as top choice to any individual.
  • Selecting "Store" stores all the information into a database. This includes the list of haplotypes, the haplotype frequencies, and the haplotype pair assignments and assignment scores.
  • Selecting "Export" allows the user to write the data into a text file, from which it can be read into a spreadsheet program or otherwise stored or transmitted. Clicking on the "Assign" command in the Edit menu in Figure 10 updates the middle and lower boxes, as shown in Figure 11.
  • Haplotypes have been assigned to each genotype (i.e., the "-"s have been replaced by haplotype ID numbers in the "Hap1" and "Hap2" columns), and pair frequency scores have been assigned.
  • haplotype frequency scores fj have been assigned, as well as other information.
  • Figure 12 shows a screen on the bottom left labeled "HapPair Objects" which results when subject UP002 is selected in the "ScoredDiplotypes Objects" box in the middle right of Figure 12. This contains the 15 most likely haplotype pairs for the individual UP002 based on the current haplotype pair scores Fk which are shown in the center right box.
  • Figure 13 shows the changes that occur to the screens after the iteration process is completed (following multiple selections of "Assign" and optional manual interventions), and the "Filter Haplotypes" option is selected. Only six Scored Haplotype Objects are shown in Figure 13 as compared to eleven in Figure 12, because all haplotypes not assigned to at least one individual have been dropped. Missing genotype data appears as blanks in the "Scored Diplotype Objects" box. Figures 14 and 15 show such blanks for the ABCB1 gene. The header of the center right box indicates that there are 10 warnings, flagged by boxes highlighted in pink in the ScoredDiplotype Objects box, and 5 errors, flagged by boxes highlighted in red.
  • Figure 14 shows a situation where the assigned haplotypes do not obey Mendelian segregation in one of the families (Family 1333).
  • the operator may conclude that the mother should be assigned a different pair such as 1,4 or 1,6; or may conclude that a different pair containing at least one copy of haplotype 1 needs to be assigned to the grandmother (UP018).
  • Figure 15 shows how manual intervention can be used to fix the problem.
  • the "HapPair Objects" window at the top of the figure has been brought up by clicking on UP002 (Row 92 in the "ScoredDiplotype Objects” box).
  • the Row 2 haplotype pair 1,6 can be assigned to subject UP002, and the requirements of Mendelian inheritance can be satisfied.
  • the flag (red color) will then disappear from the family tree, but an additional error will appear in the "ScoredDiplotype Objects" window at the position which had to be overridden to accommodate the non-matching pair.
  • information that is stored in a database includes (1) the positions of one or
  • the gene locus or other loci
  • it also includes individual identifiers and ethnicity or other phenotypic characteristics (such as age, gender or clinical
  • the haplotypes, their frequencies, and other information about each of the members of the population being analyzed are stored and displayed, preferably in the manner shown, e.g., in Figures 7-15.
  • the information shown in Figures 7-15 includes a unique identifier
  • the methods of the invention preferably use a tool called the DecoGen™ program described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
  • the tool consists, in part, of: a.
  • One or more databases that contain (1) genotypes (or haplotypes) for a gene (or other loci) for many individuals (i.e., people, animals, plants, etc., depending on the application) for one or more genes and, optionally, (2) a list of the names or functions of the genes (or other loci), whose functions can be, but are not limited to: disease causation, drug response, plant yields, plant disease resistance, plant drought resistance, plant interaction with pest-management strategies, etc.
  • the databases could include information generated either internally or externally (e.g., GenBank). Examples of databases which may be used in the present invention are described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
  • b A set of computer programs that analyze and display the relationships between the genotypes and the haplotypes for an individual.
  • the methods of the invention preferably also use a tool called the HAP™ Builder program.
  • Features of this tool which are novel include: a. A new genotype-to-haplotype method that allows the user to infer an individual's haplotypes or sub-haplotypes for a given gene.
  • the steps required for this to work are (a) determine the haplotype (or sub-haplotype) frequencies from the reference population by expanding the genotypes of a reference population; (b) optionally, correct the observed frequencies to conform to Hardy-Weinberg equilibrium and/or Mendelian inheritance (unless it is determined that the deviation is not due to sampling bias, sequencing error or questionable paternity); and (c) use the statistical approach described in this application (and shown schematically in Figures 2-4) to predict individuals' haplotypes or sub-haplotypes from their genotypes. b.
  • the preferred embodiment of the present invention uses a relational database which provides a robust, scalable and releasable data storage and data management mechanism.
  • the computing hardware and software platforms with 7x24 teams of database administration and development support, provide the relational database with advantageous guaranteed data quality, data security, and data availability.
  • the database model of the present invention provides tables and their relationships optimized for efficiently storing, searching and otherwise utilizing a genomics- oriented database.
  • a data model (or database model) describes the data fields one wishes to store and the relationships between those data fields.
  • the model is a blueprint for the actual way that data is stored, but is generic enough that it is not restricted to a particular database implementation (e.g., Sybase™ or Oracle™).
  • the model covers the data required by, and/or generated by, the HAP™ Builder program. It contains at least 4 submodels which contain logically related subsets of the data. These relevant submodels, which are described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, are described below.
  • Gene Repository This is the sub-model that describes the gene loci and its related domains. Preferably, it captures the information on gene, gene structure, species, gene map, gene family, therapeutic applications of genes, gene naming conventions and published literature including the patent information on these objects.
  • Population Repository This is the part of the data model that encapsulates the patient and population information. Preferably, it covers the entities such as patient, ethnic and geographical background of patient and population, medical conditions of the patients, family and pedigree information of the patients, patient haplotype and polymorphism information and their clinical trial outcomes.
  • Polymorphism Repository This is the part of the model that covers the haplotype and the polymorphism associated with genes and, preferably, patient cohorts used in clinical studies.
  • the polymorphisms include those due to single nucleotide polymorphisms (SNPs), large and small insertions and deletions, RFLPs, repeats, frame shifts and alternative splicings.
  • Sequence Repository Genetic sequence information in the form of genomic DNA, cDNA, mRNA and protein is captured by this data model as is the location relationship between the gene structural features and the sequences.
  • the haplotype and other data developed using the methods and/or tools described herein may be used in a partnership of two or more companies (referred to herein as the Partnership) to integrate knowledge of human population and evolutionary variation into the discovery, development and delivery of pharmaceuticals, in the ways described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
  • the database and analytical tools of the invention are envisioned to be useful in a variety of settings, including various research settings, pharmaceutical companies, hospitals, and independent or commercial establishments. It is expected that users will include physicians (e.g., for diagnosing a particular disease or prescribing a particular drug), pharmaceutical companies, generics companies, diagnostics companies, contract research organizations and managed care groups, including HMOs, and even patients themselves.
  • Hawley ME, Kidd KK (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 86:409-411.
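The two "Information Entropy" numbers described above for the "ScoredHaplotypes Objects" window can be illustrated with a short sketch. It assumes the standard Shannon entropy in bits, haplotypes coded as tuples of 0/1 alleles, and hypothetical function names; it is not taken from the HAP™ Builder program itself.

```python
from math import log2


def haplotype_entropy(freqs):
    """Entropy (bits) of the locus from haplotype frequencies."""
    return -sum(f * log2(f) for f in freqs.values() if f > 0)


def independent_site_entropy(freqs):
    """Entropy (bits) if every polymorphic site were independent.

    Sums the per-site binary entropy computed from the marginal
    frequency of allele 1 at each site.
    """
    n_sites = len(next(iter(freqs)))
    total = 0.0
    for i in range(n_sites):
        p = sum(f for hap, f in freqs.items() if hap[i] == 1)
        for q in (p, 1.0 - p):
            if q > 0:
                total -= q * log2(q)
    return total


# Two fully linked sites: only haplotypes 00 and 11 occur.
linked = {(0, 0): 0.5, (1, 1): 0.5}
print(haplotype_entropy(linked), independent_site_entropy(linked))
```

For the fully linked example the first value is 1 bit and the second is 2 bits; the gap of 1 bit reflects the linkage between the two sites.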

Abstract

Methods, computer programs and databases for determining haplotypes from a collection of polymorphisms are provided. These include methods, programs, and databases to find and measure the frequency of haplotypes in the general population; and methods, programs, and databases for predicting an individual's haplotypes from the individual's genotype for a gene.

Description

TITLE
METHOD AND SYSTEM FOR DETERMINING HAPLOTYPES FROM A COLLECTION OF POLYMORPHISMS
FIELD OF THE INVENTION
The invention relates to the field of genomics and genetics, including genome analysis and the study of DNA variation. In particular, the invention relates to the field of predicting haplotype information from unphased and/or incomplete genotype information for an organism. The invention is particularly useful in the human health care, veterinary and agricultural fields.
BACKGROUND OF THE INVENTION
The investigation of haplotypes began when it was recognized that certain pairs of loci violated Mendel's second law: rather than the independent segregation of variants at separate loci, there was a correlation in the transmission pattern from one locus to the next. Such correlated variants are called "haplotypes". Haplotypes have historically had greatest importance in the analysis of pedigree data. More recently, with the capacity to generate DNA sequence information for a large number of individuals, "haplotype" has come to mean the specific sequence of alternative variants (e.g., single nucleotide polymorphisms or "SNPs") at the polymorphic sites, often coming from a contiguous piece of DNA. In such applications attention has been diverted from family pedigrees to population samples, so there has been considerable interest in obtaining haplotypes when there is no recourse to familial transmission patterns. A number of molecular mechanisms have been described, such as sperm typing, single molecule dilution, cloning, or allele-specific amplification (AS-PCR), but all are currently limited to research investigations.
As early as 1971, it was realized that the ambiguity inherent in multilocus, but unphased, genotypes could be evaluated and at least partially overcome by statistical estimation of haplotype frequencies in a population. The first implementation of an algorithm for resolving phase of genotypes (Hill, 1975) is based on Hill's theory for two loci (Hill, 1974), each with two alleles, in which case an explicit maximum likelihood solution for haplotype frequencies exists. More recently, there have been extensions of that theory for additional loci and multiple alleles (Clark, 1990; Long et al, 1995; Excoffier and Slatkin, 1995; Hawley and Kidd, 1995). The focus of such algorithms is generally on the statistical estimation of population haplotype frequencies, and will be reviewed below.
Three algorithms published in 1995 (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995) all use the Expectation-Maximization (EM) algorithm for estimating haplotype frequencies in a population. The EM algorithm was originally proposed in 1977 (Dempster et al., 1977) as a general method of obtaining maximum likelihood estimates from data that are incomplete in some sense. In the application to haplotype frequency estimation, the incompleteness is the phase of the multiply heterozygous individuals. The paper by Long et al. goes beyond haplotype frequency estimation to construct a model framework for testing the statistical association among loci. Furthermore, it makes allowance for the possibility of null alleles at one or more loci. The model and algorithm are described for three loci, although the authors claim applicability to more complicated situations. The paper by Hawley and Kidd deals explicitly with multiple populations, but again this reference is focused primarily on frequency estimation. The mathematical basis of a maximum likelihood approach to haplotype estimation is explained well in the paper by Excoffier and Slatkin (1995). Although the latter paper mentions, as a potential application, "inferring which gametes are most likely associated to form genotypes in all sampled individuals", it does not say how this can be done. All three methods based on the EM algorithm require multiple starting conditions to facilitate finding the true maximum-likelihood solution, and there is still no guarantee that the true maximum will be found. These published algorithms also have the very pragmatic problem of being limited by at least one of the following: the maximum number of polymorphic positions, possible haplotypes, or heterozygous sites in an individual.
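For orientation, the EM iteration for haplotype frequencies can be written as follows. The notation is introduced here for illustration and is not that of the cited papers: $f_h^{(t)}$ is the estimate of the frequency of haplotype $h$ at iteration $t$, $S(g)$ is the set of unordered haplotype pairs consistent with genotype $g$, $g_1, \dots, g_N$ are the sampled genotypes, and Hardy-Weinberg proportions are assumed.

```latex
\begin{align}
P^{(t)}(a,b \mid g) &=
  \frac{c_{ab}\, f_a^{(t)} f_b^{(t)}}
       {\sum_{(a',b') \in S(g)} c_{a'b'}\, f_{a'}^{(t)} f_{b'}^{(t)}},
  \qquad
  c_{ab} = \begin{cases} 1 & a = b \\ 2 & a \neq b \end{cases}
  && \text{(E-step)} \\
f_h^{(t+1)} &=
  \frac{1}{2N} \sum_{n=1}^{N} \sum_{(a,b) \in S(g_n)}
  \bigl(\delta_{ah} + \delta_{bh}\bigr)\, P^{(t)}(a,b \mid g_n)
  && \text{(M-step)}
\end{align}
```

The E-step distributes each ambiguous genotype over its consistent haplotype pairs, and the M-step recounts haplotypes using those fractional assignments. Because the likelihood can have several local maxima, the result may depend on the starting frequencies $f^{(0)}$, which is why multiple starting conditions are used.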
The inventors herein are aware of only one reference (Clark 1990) that discloses an algorithm for assigning haplotypes to unrelated individuals in a population sample. It proceeds by assigning haplotypes that are observed as homozygotes or single-site heterozygotes, then interrogating whether one (or more) of these is consistent with an ambiguous individual, that is, an individual heterozygous at two or more sites. It is an order-dependent algorithm, in that different orders give different answers, and so must be applied several times to look for differences and a single best answer. This reference has identified three main problems in applying the algorithm: 1) it might never get started, if there are no unambiguous individuals; 2) it might not be able to resolve every individual in a sample; and 3) it might resolve certain individuals incorrectly. Indeed, the reference included the results of computer simulations that evaluate the severity of these potential problems. Also, in a recent application of Clark's algorithm to the LPL locus (Clark et al. 1998), the algorithm required supplementation by AS-PCR, a molecular technique for resolving haplotypes, since every individual in the sample was a heterozygote.
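The order dependence and the failure modes listed above can be seen in a minimal sketch written in the spirit of that algorithm. The coding (0, 1 or 2 copies of the variant allele per site for genotypes, 0/1 per site for haplotypes) and the function name are assumptions made for illustration, and the sketch omits refinements of the published method.

```python
def clark_resolve(genotypes):
    """Resolve genotypes by subtracting already-known haplotypes.

    Returns (resolved, unresolved): a dict mapping genotype index to a
    haplotype pair, and a list of indices that could not be resolved.
    """
    known = []      # haplotypes found so far, in discovery order
    resolved = {}

    # Seed with unambiguous genotypes: at most one heterozygous site.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            hap_a = tuple(1 if x == 2 else 0 for x in g)
            hap_b = tuple(1 if x >= 1 else 0 for x in g)
            resolved[idx] = (hap_a, hap_b)
            for hap in (hap_a, hap_b):
                if hap not in known:
                    known.append(hap)

    # Repeatedly explain an ambiguous genotype as known haplotype plus
    # complement; the first match wins, so input order matters.
    changed = True
    while changed:
        changed = False
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for hap in known:
                comp = tuple(x - y for x, y in zip(g, hap))
                if all(c in (0, 1) for c in comp):
                    resolved[idx] = (hap, comp)
                    if comp not in known:
                        known.append(comp)
                    changed = True
                    break

    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved


resolved, unresolved = clark_resolve([(0, 0, 0), (1, 1, 0), (1, 1, 1)])
print(resolved, unresolved)
```

If the sample contains no homozygote or single-site heterozygote, `known` stays empty and every genotype is returned as unresolved, which corresponds to the first problem noted above.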
None of the prior art disclosed or suggested an approach for assigning haplotypes to unrelated individuals that was amenable to a high-throughput mode of analysis. Moreover, none of the prior art disclosed or suggested an approach for incorporating error analysis or for estimating missing data. Finally, none of the prior art disclosed or suggested a process that would not require multiple starting conditions, nor did they disclose or suggest a process that would be amenable to the complications implicit in data with dozens of polymorphic loci. Thus, there is a need to develop a process that assigns haplotype pairs to unrelated individuals; prioritizes automation, robustness, and statistical evaluation of the accuracy of the results; and has the capacity to cope with data of substantially greater complexity than that addressed in prior art.
The methods and tools described herein provide processes for predicting haplotypes and haplotype pairs from unphased and/or incomplete genotype data. The processes are preferably carried out with the aid of a computer.
The exemplified methods and tools are partially embodied in a computer program coupled to a database used to display and analyze haplotype, genotype and related statistical information. It includes novel graphical and computational methods for treating haplotypes, genotypes, and related data in a consistent and easy-to-interpret manner.
SUMMARY OF THE INVENTION
The invention relates to a process for deriving the presence and frequency of haplotypes from a collection of genotypes of several individual polymorphisms in a locus, measured for a sample group of individuals. The process begins with an exhaustive enumeration (expansion) of all possible haplotypes (called the "Hap Expansion" phase), then proceeds through a self-consistent, iterative process to deduce the haplotypes most likely to be present, and the most likely assignment of haplotype pairs to each individual (called the "Hap Assignment" phase). The process also results in a probability score specifying the likelihood of the result being correct. The process takes advantage of family relationships among the sample group, but does not require them. The process is embodied in the HAP™ Builder program, a computer code written in Java providing an interface for a skilled person to efficiently carry out the process and store the results.
More specifically, the invention relates to a method and tools for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
(a) obtaining a genotype for the polymorphic genomic region from each of the individuals;
(b) enumerating all possible haplotypes hi that are consistent with each genotype;
(c) assigning an evidence score si to each of the enumerated haplotypes hi;
(d) calculating an initial haplotype frequency fi for each haplotype among the possible haplotypes, wherein the initial haplotype frequency fi is a function of the evidence score si;
(e) determining for each genotype obtained in step (a) a pair score Fk for each pair of haplotypes that is consistent with that genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(f) calculating, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct;
(g) generating a revised haplotype frequency fi for each haplotype, wherein the revised haplotype frequency fi is a function of the probability pk for each consistent haplotype pair which contains the haplotype; and
(h) repeating steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (e) is replaced by the revised frequency fi determined in step (g).
Steps (a) through (d) are called the initiation, or Hap Expansion, phase of the method. Steps (a), (b) and (c) can be performed for one individual at a time or in parallel.
Steps (e) through (g) are called the Hap Assignment phase of the method. Steps (e) and (f) can be performed for one genotype at a time or in parallel.
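The two phases can be sketched in a few lines. The sketch makes several assumptions that this passage does not fix: genotypes are coded per site as 0, 1 or 2 copies of the variant allele; the evidence score is taken to be the number of genotypes whose expansion contains the haplotype; the pair score is the Hardy-Weinberg product of the two frequencies (doubled when the haplotypes differ); the pair score criterion keeps pairs scoring at least `min_ratio` of the best pair for that genotype; and the end condition is a small change in frequencies. All names are hypothetical, and this is not the HAP™ Builder code.

```python
from collections import Counter
from itertools import product

SITE_OPTIONS = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}


def expand(genotype):
    """Step (b): unordered haplotype pairs consistent with a genotype."""
    pairs = set()
    for combo in product(*(SITE_OPTIONS[g] for g in genotype)):
        hap_a, hap_b = zip(*combo)
        pairs.add(tuple(sorted((hap_a, hap_b))))
    return sorted(pairs)


def pair_score(pair, freq):
    """Step (e): assumed Hardy-Weinberg pair score."""
    a, b = pair
    return freq[a] * freq[b] * (1 if a == b else 2)


def build_haplotypes(genotypes, min_ratio=0.01, tol=1e-6, max_iter=100):
    # Hap Expansion phase: steps (a) through (d).
    expansions = [expand(g) for g in genotypes]
    evidence = Counter()
    for pairs in expansions:
        for hap in {h for pair in pairs for h in pair}:
            evidence[hap] += 1                      # step (c), assumed score
    total = sum(evidence.values())
    freq = {h: s / total for h, s in evidence.items()}   # step (d)

    # Hap Assignment phase: steps (e) through (g), repeated per step (h).
    assignments = []
    for _ in range(max_iter):
        revised = dict.fromkeys(freq, 0.0)
        assignments = []
        for pairs in expansions:
            scores = {p: pair_score(p, freq) for p in pairs}
            cutoff = min_ratio * max(scores.values())
            kept = {p: s for p, s in scores.items() if s > 0 and s >= cutoff}
            norm = sum(kept.values())
            probs = {p: s / norm for p, s in kept.items()}   # step (f)
            for (a, b), p in probs.items():                  # step (g)
                revised[a] += p / (2 * len(genotypes))
                revised[b] += p / (2 * len(genotypes))
            best = max(probs, key=probs.get)
            assignments.append((best, probs[best]))
        change = max(abs(revised[h] - freq[h]) for h in freq)
        freq = revised
        if change < tol:                                     # end condition
            break
    return freq, assignments


freq, assignments = build_haplotypes([(0, 0), (0, 0), (2, 2), (1, 1)])
print(assignments[3])
```

In this toy sample the double heterozygote is assigned the pair made of the two haplotypes already seen in homozygotes, because the alternative pair has no supporting evidence elsewhere in the sample and its score falls below the cutoff after the first iteration.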
The invention also relates to a method and tools for assigning a haplotype pair to a polymorphic genomic region of an individual, comprising:
(a) obtaining the genotype for the polymorphic genomic region from the individual;
(b) enumerating all possible haplotypes hi for the genotype;
(c) providing a frequency fi for each of the possible haplotypes, where fi is determined by the method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, as discussed herein;
(d) determining a pair score Fk for each pair of possible haplotypes hi that are consistent with the genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair; and
(e) assigning to the genotype the haplotype pair having the highest pair score Fk.
The invention also relates to methods and tools for predicting the probable haplotypes and haplotype pairs of one or more loci of an individual. The invention also relates to methods and tools for estimating the probability that the predicted haplotypes and haplotype pairs are correct. The invention also relates to a method and tools for filling in missing genotype data for any individual and polymorphic site that was not or could not be measured, comprising using the most probable assignment of haplotypes determined by the methods of the invention to construct the most likely genotype for the individual. The invention also relates to methods of constructing a haplotype database for a population containing reference haplotype pair frequency data. The invention also relates to methods of predicting the presence of a haplotype pair in an individual using such a database. The methods comprise accessing the database containing reference haplotype pair frequency data to determine a probability, for each of the possible haplotype pairs, that the individual has the possible haplotype pair; and analyzing the determined probabilities to predict haplotype pairs for the individual.
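As an illustration of assigning a pair to a single individual and of filling in a missing site, the following sketch takes previously estimated haplotype frequencies as input. It assumes genotypes coded per site as 0, 1 or 2 copies of the variant allele with `None` marking a site that was not measured, and a Hardy-Weinberg product as the pair score; the function names and the example frequencies are hypothetical.

```python
from itertools import product

SITE_OPTIONS = {
    0: [(0, 0)],
    1: [(0, 1), (1, 0)],
    2: [(1, 1)],
    None: [(0, 0), (0, 1), (1, 0), (1, 1)],   # missing site: anything goes
}


def candidate_pairs(genotype):
    """Unordered haplotype pairs consistent with the observed sites."""
    pairs = set()
    for combo in product(*(SITE_OPTIONS[g] for g in genotype)):
        hap_a, hap_b = zip(*combo)
        pairs.add(tuple(sorted((hap_a, hap_b))))
    return sorted(pairs)


def assign_pair(genotype, freq):
    """Return the highest-scoring pair and its share of the total score."""
    best, best_score, total = None, 0.0, 0.0
    for a, b in candidate_pairs(genotype):
        score = freq.get(a, 0.0) * freq.get(b, 0.0) * (1 if a == b else 2)
        total += score
        if score > best_score:
            best, best_score = (a, b), score
    if best is None:          # no candidate pair has known haplotypes
        return None, 0.0
    return best, best_score / total


def fill_genotype(pair):
    """Most likely complete genotype implied by an assigned pair."""
    hap_a, hap_b = pair
    return tuple(x + y for x, y in zip(hap_a, hap_b))


# Hypothetical reference frequencies for a two-site locus.
freq = {(0, 0): 0.6, (1, 1): 0.3, (0, 1): 0.05, (1, 0): 0.05}
pair, prob = assign_pair((1, None), freq)
print(pair, round(prob, 3), fill_genotype(pair))
```

The returned share of the total score is one simple way to express how confident the assignment is; a value close to 1 means the competing pairs are all rare under the reference frequencies.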
The methods and tools of the invention make it possible to determine haplotypes and haplotype pairs in an individual, or in a plurality of individuals, based on unphased and/or incomplete genotype information. The individuals may be part of a population such as the general population, an ethno-geographic group, or a clinical or disease population, or they may all be of the same gender.
Similarly, in agricultural biotechnology, the method and tools of the invention can be used to determine the haplotypes and haplotype pairs of genes responsible for specific desirable traits, e.g., drought tolerance and/or improved crop yields, and reduce the time and effort needed to transfer desirable traits.
The invention includes methods, computer programs and databases to analyze and make use of genotype information to deduce and/or predict haplotype information. These include methods, programs, and databases for finding and measuring the frequency of haplotypes and/or haplotype pairs in a population; and methods, programs, and databases for inferring an individual's haplotype from the individual's genotype.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGURES 1A and 1B. System Architecture Schematic.
FIGURE 2. First part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
FIGURE 3. Second part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
FIGURE 4. Third part of Flow Chart for a method and system for determining haplotypes from a collection of polymorphisms.
FIGURE 5. DecoGenHiP™ Builder View. The top half of Figure 5 is a screen showing a set of candidate genes for which polymorphism data has been obtained or is in the process of being obtained, and which may be selected for being haplotyped. The columns on the right side of the screen indicate various stages in the process of analyzing target regions of the gene identified in the corresponding row. The various colors provide an immediate visual indicator of the status of the gene at each stage of analysis. The bottom part of this figure is a screen which provides information concerning the sequencing of various regions of the selected candidate gene.
FIGURE 6. Gene Structure View. This screen shows the location of features in the gene (such as promoter, introns, exons, etc.), as well as actual sequence data, for a gene for which the "Anno" column has been selected in the screen of Figure 5.
FIGURE 7. Gene Haplotypes View. The screen in the top right side of this view shows information about the polymorphic sites in a gene for which the "Haplo" column has been selected in the screen of Figure 5, such as the location of the polymorphic sites, the type of polymorphism, and an indication of the frequency with which each polymorphism has been seen in various world population groups. The screen includes boxes which may be checked to include the polymorphic site in a haplotype analysis.
FIGURE 8. Gene Haplotypes View (Cont.). The screen in the top left side of this view shows an items selection menu which results after the "Edit" item is clicked on in Figure 7. The "DeHarv" menu item is highlighted.
FIGURE 9. Gene Haplotypes View (Cont.). The screen in the top right side of this view shows information which results after the "DeHARV" menu item is selected in the "Edit" menu on the top left side of Figure 8; only six of the polymorphic sites shown in the screen in the top right of Figure 8 are selected in the screen in the top right side of Figure 9.
FIGURE 10. Gene Haplotypes View (Cont.). This view shows screens which result when the "Filter Polymorphisms" menu item is selected in the Edit menu of Figure 9 (which is not shown open in Figure 9). The box in the middle right side of this view labeled "ScoredDiplotype Objects" shows the unphased genotypes of subjects in the database and their ethno-geographic origin. The screen in the bottom right side of this view labeled "ScoredHaplotypes Objects" shows the expanded haplotypes enumerated from the genotypes in the middle screen for each of the selected (accepted) polymorphic sites.
FIGURE 11. Gene Haplotypes View (Cont.). This view shows screens which result when the "Assign" menu item in the Edit menu of Figure 10 (which is not shown open in Figure 10) is selected one time. The screen in the middle of the figure labeled "Scored Diplotype Objects" shows the "Hap1" and "Hap2" pair assignments (i.e., the genotype to haplotype resolution) for each of the individuals in the population being examined after several iterations of the HAP™ Builder algorithm, as well as the HapPair Score assigned to them. The window labeled "Scored Haplotype Objects" (shown in the lower right side of the view in Figure 11) provides the different haplotypes determined in the examined population, with a haplotype frequency score, as well as the number of times each haplotype is seen in the entire population and in the various population groups.
FIGURE 12. Gene Haplotypes View (Cont.). This view shows a window labeled "HapPair Objects" which is displayed as a result of clicking on the "Score" cell for row 94 (individual UP002) in the "ScoredDiplotypes Objects" box in the center of Figure 12. This window contains the 15 most likely haplotype pairs for subject UP002 based on the current haplotype pair scores.
FIGURE 13. Gene Haplotypes View (Cont.). This view shows screens which result after the "Assign" command in the Edit menu in Figure 12 has been invoked multiple times.
FIGURE 14. Gene Haplotypes View (Cont.). This view shows a screen with "warnings" (e.g., missing genotype data) highlighted in light gray. This view also shows a screen with the icon for the individual UP002 highlighted in dark gray in the family tree schematic because the Mendelian inheritance rules are violated.
FIGURE 15. Gene Haplotypes View (Cont.). This view shows a window labeled "15 HapPair Objects" which results when subject UP002 is selected in the Scored DiplotypeObjects list.
DETAILED DESCRIPTION OF THE INVENTION
I. DEFINITIONS
In the context of this disclosure, the following terms shall be defined as follows unless otherwise indicated:
Allele - A particular form of a genetic locus, distinguished from other forms by its particular nucleotide sequence.
Ambiguous polymorphic site - A heterozygous polymorphic site or a polymorphic site for which nucleotide sequence information is lacking.
Candidate Gene - A gene which is hypothesized to be responsible for a disease, condition, or the response to a treatment, or to be correlated with one of these.
Gene - A segment of DNA that contains all the information for the regulated biosynthesis of an RNA product, including promoters, exons, introns, and other untranslated regions that control expression.
Genotype - An unphased 5' to 3' sequence of nucleotide pair(s) found at one or more polymorphic sites in a locus on a pair of homologous chromosomes in an individual. As used herein, genotype includes a full-genotype and/or a sub-genotype as described below.
Full-genotype - The unphased 5' to 3' sequence of nucleotide pairs found at all known polymorphic sites in a locus on a pair of homologous chromosomes in a single individual.
Sub-genotype - The unphased 5' to 3' sequence of nucleotides seen at a subset of the known polymorphic sites in a locus on a pair of homologous chromosomes in a single individual.
Genotyping - A process for determining a genotype of an individual.
Haplotype - A 5' to 3' sequence of nucleotides found at one or more polymorphic sites in a locus on a single chromosome from a single individual. As used herein, haplotype includes a full-haplotype and/or a sub-haplotype as described below.
Full-haplotype - The 5' to 3' sequence of nucleotides found at all known polymorphic sites in a locus on a single chromosome from a single individual.
Sub-haplotype - The 5' to 3' sequence of nucleotides seen at a subset of the known polymorphic sites in a locus on a single chromosome from a single individual.
Haplotype pair - The two haplotypes found for a locus in a single individual.
Haplotyping - A process for determining one or more haplotypes in an individual and includes use of family pedigrees, molecular techniques and/or statistical inference.
Haplotype data - Information concerning one or more of the following for a specific gene: a listing of the haplotype pairs in each individual in a population; a listing of the different haplotypes in a population; frequency of each haplotype in that or other populations, and any known associations between one or more haplotypes and a trait.
Isoform - A particular form of a gene, mRNA, cDNA or the protein encoded thereby, distinguished from other forms by its particular sequence and/or structure.
Isogene - One of the isoforms of a gene found in a population. An isogene contains all of the polymorphisms present in the particular isoform of the gene.
Isolated - As applied to a biological molecule such as RNA, DNA, oligonucleotide, or protein, isolated means the molecule is substantially free of other biological molecules such as nucleic acids, proteins, lipids, carbohydrates, or other material such as cellular debris and growth media. Generally, the term "isolated" is not intended to refer to a complete absence of such material or to absence of water, buffers, or salts, unless they are present in amounts that substantially interfere with the methods of the present invention.
Locus - A location on a chromosome or DNA molecule corresponding to a gene or a physical or phenotypic feature.
Naturally-occurring - A term used to designate that the object it is applied to, e.g., naturally-occurring polynucleotide or polypeptide, can be isolated from a source in nature and which has not been intentionally modified by man.
Nucleotide pair - The nucleotides found at a polymorphic site on the two copies of a chromosome from an individual.
Phased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, phased means the combination of nucleotides present at those polymorphic sites on a single copy of the locus is known.
Polymorphic genomic region - A region comprising one or more polymorphic sites in a single contiguous region or in two or more noncontiguous regions of a single chromosome.
Polymorphic site (PS) - A position within a locus at which at least two alternative sequences are found in a population, the most frequent of which has a frequency of no more than 99%.
Polymorphic variant - A gene, mRNA, cDNA, polypeptide or peptide whose nucleotide or amino acid sequence varies from a reference sequence due to the presence of a polymorphism in the gene.
Polymorphism - The sequence variation observed in an individual at a polymorphic site. Polymorphisms include nucleotide substitutions, insertions, deletions and microsatellites and may, but need not, result in detectable differences in gene expression or protein function.
Polymorphism data - Information concerning one or more of the following for a specific gene: location of polymorphic sites; sequence variation at those sites; frequency of polymorphisms in one or more populations; the different genotypes and/or haplotypes determined for the gene; frequency of one or more of these genotypes and/or haplotypes in one or more populations; any known association(s) between a trait and a genotype or a haplotype for the gene.
Polymorphism Database - A collection of polymorphism data arranged in a systematic or methodical way and capable of being individually accessed by electronic or other means.
Polynucleotide - A nucleic acid molecule comprised of single-stranded RNA or DNA or comprised of complementary, double-stranded DNA.
Population Group - A group of individuals sharing a common ethnogeographic origin.
Reference Population - A group of subjects or individuals who are predicted to be representative of the genetic variation found in the general population. Typically, the reference population represents the genetic variation in the population at a certainty level of at least 85%, preferably at least 90%, more preferably at least 95%, and even more preferably at least 99%.
Single Nucleotide Polymorphism (SNP) - Typically, the specific pair of nucleotides observed at a single polymorphic site. In rare cases, three or four nucleotides may be found.
Subject - An individual whose genotypes or haplotypes or response to treatment or disease state are to be determined.
Treatment - A stimulus administered internally or externally to a subject.
Unphased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, unphased means the combination of nucleotides present at those polymorphic sites on a single copy of the locus (i.e., located on a single DNA strand) is not known.
II. METHODS OF IMPLEMENTING THE INVENTION
The present invention may be implemented with a computer, an example of which is shown in Figure 1A. The computer includes a central processing unit (CPU) connected by a system bus or other connecting means to a communication interface, system memory (RAM), non-volatile memory (ROM), and one or more other storage devices such as a hard disk drive, a diskette drive, and a CD ROM drive. The computer may also include an internal or external modem (not shown). The computer also includes a display device, such as a CRT monitor or an LCD display, and an input device, such as a keyboard, mouse, pen, touch-screen, or voice activation system. The computer stores and executes various programs such as an operating system and application programs. The computer may be embodied, for example, as a personal computer, work station, laptop, mainframe, or a personal digital assistant. The computer may also be embodied as a distributed multi-processor system or as a networked system such as a LAN having a server and client terminals.
The present invention uses a program, referred to as the HAP™ Builder program, that generates views (or screens) displayed on a display device and which the user can interact with to accomplish a variety of tasks and analyses. For example, the HAP™ Builder program allows users to view and analyze large amounts of information such as subject identifiers (e.g., subject number or cell line number); gene-related data (e.g., gene name, gene symbol, GenBank accession number); family data (e.g., family number, father, mother, number of siblings); polymorphism data (e.g., region, position, nucleotide changes (i.e., the polymorphic nucleotide(s) as compared to the reference nucleotide(s))); genotype data (e.g., scored diplotype objects); haplotype data (e.g., haplotype identifiers, haplotype frequencies, haplotype pairs, and haplotype pair scores (indicating the probability that the haplotype pair of an individual is correct)); and population data (e.g., ethnic, geographical, clinical, and genotype and haplotype data for various populations). The HAP™ Builder program is preferably written in the Java programming language. However, the program may be written using any conventional programming language such as, for example, C, C++, Visual Basic™ or Visual Pascal™. The HAP™ Builder program may be stored and executed on the computer. It may also be stored and executed in a distributed manner.
The data processed by the HAP™ Builder program is preferably stored as part of a relational database (e.g., an instance of an Oracle™ database or a set of ASCII flat files). This data can be stored on, for example, a CD ROM or in one or more storage devices accessible by the computer. The data may be stored on one or more databases in communication with the computer via a network. In one scenario, the data will be delivered to the user on any standard media (e.g., CD, floppy disk, tape) or can be downloaded over the internet. The HAP™ Builder program and data may also be installed on a local machine. The HAP™ Builder program and data will then be on the machine that the user directly accesses.
Figure 1B shows an implementation where a network interconnects one or more host computers with one or more user terminals. The communication network may, for example, include one or more local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), or a collection of interconnected networks such as the Internet. The network may be wired, wireless, or some combination thereof. The host computer may, for example, be a world wide web server ("web server"). The user terminal may, for example, be a client device such as the computer shown in Figure 1A.
A web server stores information documents called pages. A server process listens for incoming connections from clients (e.g., browsers running on a client device). When a connection is established, the client sends a request and the server sends a reply. The request typically identifies a page by its Uniform Resource Locator (URL) and the reply includes the requested page. This client-server protocol is typically performed using the hypertext transfer protocol ("http"). Pages are viewed using a browser program. They are written in a language called hypertext markup language ("html"). A typical page includes text and formatting comments called tags. Pages may also include links (pointers) to other pages. Strings of text or images that are links to other pages are called hyperlinks.
Hyperlinks are highlighted (e.g., by color, underlining) and may be invoked by placing the cursor on the highlighted area and selecting it (e.g., by clicking the mouse button). A page may also contain a URL reference to a portion of multimedia data such as an image, video segment, or audio file. Pages may also point to a Java program called an applet. When the browser connects to where the applet is stored, the applet is downloaded to the client device and executed there in a secure manner. Pages may also contain forms that prompt a user to enter information or that have active maps. Data entered by a user may be handled by common gateway interface (CGI) programs. Such programs may, for example, provide web users with access to one or more databases.
As shown in Figure 1B, the host computer may include a CPU connected by a system bus or other connecting means to a communication interface, system memory (RAM), non-volatile memory (ROM), and a mass storage device. The mass storage device may, for example, be a collection of magnetic disk drives in a RAID system. The mass storage device may, for example, store the aforementioned web pages, applets, and the like. The host computer may also include an input device, such as a keyboard, and a display device to allow for control and management by an administrator. Additionally, the host computer may be connected to additional devices such as printers, auxiliary monitors or other input/output devices. The input device and display device may also be provided on another computer coupled to the host computer. The host computer may be embodied, for example, as one or more mainframes, workstations, personal computers, or other specialized hardware platforms. The functionality of the host computer may be centralized or may be implemented as a distributed system. As also shown in Figure 1B, the host computer may communicate with one or more databases stored on any of a variety of hardware platforms.
In an Internet embodiment, for example one involving the system of Figure 1B, the HAP™ Builder program will be web-based and will be delivered as an applet that runs in a web browser. In this case, the data will reside on a server machine and will be delivered to the HAP™ Builder program using a standard protocol (e.g., HTTP with cgi-bin). To provide extra security, the network connection could use a dedicated line. Furthermore, the network connection could use a secure protocol such as Secure Socket Layer (SSL) which only provides access to the server from a specified set of IP addresses.
In another embodiment, the HAP™ Builder program can be installed on a user machine and the data can reside on a separate server machine. Communication between the two machines can be handled using standard client-server technology. An example would be to use the TCP/IP protocol to communicate between the client and an Oracle server.
It may be noted that in any of the prior scenarios, some or all of the data used by the HAP™ Builder program could be directly imported into the HAP™ Builder program by the user. This import could be carried out by reading files residing on the user's local machine, or by cutting and pasting from a user document into the interface of the HAP™ Builder program. In yet a further embodiment, some or all of the data or the results of analyses of the data could be exported from the HAP™ Builder program to the user's local computer. This export could be carried out by saving a file to the local disk or by cutting and pasting to a user document. In the present invention various calculations are performed to generate items displayed on a screen or to control items displayed on a screen. As is well known, some basic calculations may be performed using database query language (SQL), while other computations are performed by the HAP™ Builder program (i.e., the Java program which, as previously mentioned, may be an applet downloaded over the internet).
III. METHODS OF THE INVENTION
The invention relates to a process for deriving the presence and frequency of haplotypes from a collection of genotypes of several individual polymorphisms in a gene locus, measured for a sample group of individuals. The process begins with an exhaustive enumeration (expansion) of all possible haplotypes (called the "Hap Expansion" phase), then proceeds through a self-consistent, iterative process to produce the haplotypes most likely to be present, as well as the most likely assignment of haplotype pairs to each individual (called the "Hap Assignment" phase). The process also results in a probability score specifying the likelihood of the result being correct. The process takes advantage of family relationships among the sample group, but does not require them. The process is embodied in the HAP™ Builder program, a computer code written in Java providing an interface for a skilled person to efficiently carry out the process and store the results.
More specifically, the invention relates to a method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
(a) obtaining a genotype for the polymorphic genomic region from each of the individuals;
(b) enumerating all possible haplotypes h_i consistent with each genotype;
(c) assigning an evidence score s_i to each of the enumerated haplotypes h_i;
(d) calculating an initial haplotype frequency f_i for each haplotype among the possible haplotypes, wherein the initial haplotype frequency f_i is a function of the evidence score s_i;
(e) determining for each genotype obtained in step (a) a pair score F_k for each pair of haplotypes that are consistent with that genotype, wherein F_k is a function of the frequency f_i for each of the haplotypes in the pair;
(f) calculating, for each genotype and consistent haplotype pair whose pair score F_k meets a pair score criterion, a probability p_k that assignment of that haplotype pair to the genotype would be correct;
(g) generating a revised haplotype frequency f_i for each of the haplotypes, wherein the revised haplotype frequency f_i is a function of the probability p_k for each consistent haplotype pair which contains the haplotype; and
(h) repeating steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency f_i employed in step (e) is replaced by the revised frequency f_i determined in step (g).
Steps (a) through (d) are called the initiation, or Hap Expansion, phase of the method. Steps (a), (b) and (c) can be performed for one individual at a time or in parallel.
Steps (e) through (g) are called the Hap Assignment phase of the method. Steps (e) and (f) can be performed for one genotype at a time or in parallel.
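The two phases can be illustrated with a minimal Python sketch. The text says only that F_k is "a function of" the haplotype frequencies and that the loop stops at "an end condition", so the choices here are assumptions made for illustration: the pair score is taken as the product of the two frequencies, p_k as each pair's share of the summed scores for its genotype, the pair score criterion as "keep every pair", and the end condition as a fixed iteration count. The function names `consistent_pairs` and `assign_haplotype_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def consistent_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype."""
    choices = [sorted(set(site)) for site in genotype]
    pairs = set()
    for first in product(*choices):
        # The partner haplotype carries the other allele at every site.
        second = [b if x == a else a for x, (a, b) in zip(first, genotype)]
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)


def assign_haplotype_pairs(genotypes, iterations=20):
    """Return (assignments, frequencies) for a list of diploid genotypes."""
    pairs_by_genotype = [consistent_pairs(g) for g in genotypes]

    # Steps (b)-(d): evidence score 2 / 2**n per enumerated haplotype,
    # summed over genotypes to give the initial frequencies f_i.
    freq = defaultdict(float)
    for pairs in pairs_by_genotype:
        haplotypes = {h for pair in pairs for h in pair}
        for h in haplotypes:
            freq[h] += 2.0 / len(haplotypes)

    assignments = []
    for _ in range(iterations):  # step (h): fixed iteration count (assumed)
        revised = defaultdict(float)
        assignments = []
        for pairs in pairs_by_genotype:
            # Step (e): pair score F_k, assumed here to be f_a * f_b.
            scores = [freq[a] * freq[b] for a, b in pairs]
            total = sum(scores)
            # Step (f): p_k as each pair's share of the total score (assumed).
            probs = [score / total for score in scores]
            # Step (g): revised f_i accumulates p_k for pairs containing h_i.
            for (a, b), p in zip(pairs, probs):
                revised[a] += p
                revised[b] += p
            best = max(range(len(pairs)), key=probs.__getitem__)
            assignments.append((pairs[best], probs[best]))
        freq = revised
    return assignments, dict(freq)
```

With two homozygous individuals (AC/AC and GT/GT) and one double heterozygote (A/G, C/T), the sketch resolves the heterozygote to the pair AC and GT, because those haplotypes are supported by the unambiguous individuals.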
In order to use computing resources more efficiently, particularly when large numbers of individuals are being haplotyped, it is preferred that the above procedure be modified as follows: After the genotypes are obtained (or, as they are obtained), they are combined into groups, where all the genotypes in each group are identical. Groups may optionally be characterized by one or more additional criteria, such as, for example, a requirement that all individuals from whom the genotypes are derived must belong to a single population group. Additional criteria may be, for example, a requirement that all individuals from whom the genotypes are derived must be of the same gender, or belong to a single clinical or disease population, or to a population exhibiting a particular response to a drug or other stimulus, or to a population characterized by a particular genotype or haplotype at some other polymorphic region. Thus, in this embodiment, all members of a group minimally have the same genotype, but there may be more than one group with the same genotype. The number of individuals sharing a distinct genotype within a group g is called the multiplier, n_g. Hap expansion is preferably carried out only once for each distinct (different) genotype, and the multiplier is used at the end of the expansion to give the appropriate weight to the frequency scores. In this preferred embodiment, the method comprises the steps of:
(a) obtaining a genotype for the polymorphic genomic region from each of the individuals;
(b) grouping the genotypes obtained in step (a) into groups, wherein in each group g there are n_g identical genotypes (any unique genotypes are regarded as groups having n_g = 1);
(c) enumerating all possible haplotypes h_i that are consistent with each distinct genotype;
(d) assigning an evidence score s_i to each of the enumerated possible haplotypes h_i;
(e) for each group g, calculating an initial haplotype frequency (f_i) for each haplotype among the possible haplotypes, wherein the initial haplotype frequency f_i is a function of the product (s_i)(n_g);
(f) determining, for each group g, a pair score F_k for each pair of haplotypes that is consistent with the genotype of that group, wherein F_k is a function of the frequency f_i for each of the haplotypes in the pair;
(g) calculating, for each genotype and consistent haplotype pair whose pair score F_k meets a pair score criterion, a probability p_k that assignment of that haplotype pair to the genotype would be correct;
(h) generating a revised haplotype frequency f_i for each haplotype, wherein the revised haplotype frequency f_i is a function of the product (n_g)(p_k) for each consistent haplotype pair which contains the haplotype; and
(i) repeating steps (f) through (h) until an end condition is reached, with the proviso that for each repetition the frequency f_i employed in step (f) is replaced by the revised frequency f_i determined in step (h).
In this embodiment, steps (a) through (e) are the initiation, or Hap Expansion, phase of the method. Steps (a), (b), (c) and (d) can be performed for one individual (or group) at a time or in parallel. Steps (f) through (h) are the Hap Assignment phase of the method. Steps (f) and (g) can be performed for one group at a time or in parallel. Characterizing groups by one or more additional criteria may be done before or after the enumerating step, but is preferably done before the enumerating step.
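The grouping step can be sketched as a simple count of identical genotypes. This is an illustrative sketch only: the record layout (a genotype plus a population label), the labels themselves, and the function names `normalize` and `group_genotypes` are assumptions, and sorting the two alleles at each site is one way to make A/G and G/A compare as the same unphased genotype.

```python
from collections import Counter


def normalize(genotype):
    """Sort the two alleles at each site so A/G and G/A compare equal."""
    return tuple(tuple(sorted(site)) for site in genotype)


def group_genotypes(records, by_population=False):
    """Count identical genotypes; returns {(genotype, population or None): n_g}.

    `records` is an iterable of (genotype, population) tuples. When
    `by_population` is true, the population label acts as an additional
    grouping criterion, so one genotype may appear in several groups.
    """
    groups = Counter()
    for genotype, population in records:
        key = (normalize(genotype), population if by_population else None)
        groups[key] += 1
    return groups
```

Each count is the multiplier n_g for its group, so the expansion needs to run only once per distinct genotype.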
The invention also relates to a method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising:
(a) obtaining the genotype for the polymorphic genomic region from the individual;
(b) enumerating all possible haplotypes h_i for the genotype;
(c) providing a frequency f_i for each of the possible haplotypes, where f_i is determined by one of the methods described herein;
(d) determining a pair score F_k for each pair of possible haplotypes h_i that are consistent with the genotype, wherein F_k is a function of the frequency f_i for each of the haplotypes in the pair; and
(e) assigning to the genotype the haplotype pair having the highest pair score F_k.
The invention also relates to a method for assigning a haplotype pair for a polymorphic genomic region of an individual, comprising:
(a) obtaining the genotype for the polymorphic genomic region from the individual;
(b) enumerating all possible haplotypes h_i for the genotype;
(c) providing a frequency f_i for each of the possible haplotypes, where f_i has been previously determined by the method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals discussed herein;
(d) determining a pair score F_k for each pair of possible haplotypes h_i that are consistent with the genotype, wherein F_k is a function of the frequency f_i for each of the haplotypes in the pair; and
(e) assigning to the genotype the haplotype pair having the highest pair score F_k.
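Predicting a single individual's pair from previously determined frequencies reduces to scoring each consistent pair and taking the maximum. The sketch below assumes, as an illustration only, that the pair score is the product of the two haplotype frequencies; the text requires only that F_k be a function of them. The names `consistent_pairs` and `predict_pair` and the example frequencies are hypothetical.

```python
from itertools import product


def consistent_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype."""
    choices = [sorted(set(site)) for site in genotype]
    pairs = set()
    for first in product(*choices):
        # The partner haplotype carries the other allele at every site.
        second = [b if x == a else a for x, (a, b) in zip(first, genotype)]
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)


def predict_pair(genotype, freq):
    """Return (pair, score) for the consistent pair with the highest score.

    `freq` maps haplotype strings to frequencies f_i; haplotypes absent
    from the mapping are treated as having frequency zero.
    """
    scored = [(freq.get(a, 0.0) * freq.get(b, 0.0), (a, b))
              for a, b in consistent_pairs(genotype)]
    score, pair = max(scored)
    return pair, score
```

For a double heterozygote A/G, C/T with frequencies of 0.45 for AC and GT and 0.05 for AT and GC, the sketch selects the pair AC and GT.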
In a preferred embodiment, the invention also relates to a method and tools for estimating the probability that the haplotype pairs assigned by the methods described immediately above are correct, comprising (a) the steps described immediately above, and further comprising (b) determining the probability score p from the formula
\[ p = \frac{F_k}{\sum_{l=0}^{N_{rank}-1} F_l} \]
In the formula above, N_rank is the number of pairs of haplotypes selected by the practitioner for consideration, which is preferably a subset of all the possible consistent pairs. Typically, one would select only the N_rank highest scoring pairs for consideration.
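A small sketch of the probability score follows. It assumes that the score is the assigned pair's F value divided by the sum of the F values of the N_rank highest scoring pairs; this is one natural reading of the passage, not a confirmed transcription of the formula, and the function name `assignment_probability` is hypothetical.

```python
def assignment_probability(pair_scores, n_rank):
    """Probability that the top-scoring pair is correct among the N_rank best.

    `pair_scores` is a list of pair scores F_k for one genotype. Only the
    `n_rank` highest scores enter the normalizing sum (an assumption).
    """
    ranked = sorted(pair_scores, reverse=True)[:n_rank]
    total = sum(ranked)
    if total == 0:
        raise ValueError("all considered pair scores are zero")
    return ranked[0] / total
```

Restricting the sum to the N_rank best pairs keeps the calculation cheap when a genotype has many heterozygous sites, at the cost of slightly overstating the probability when the discarded pairs carry non-negligible scores.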
The invention also relates to a method for filling in missing genotype data, comprising using, in the above-mentioned methods of the invention, the following genotype probabilities for any genotypes that could not be measured:
p_c = 0.25, where p_c represents the probability that the genotype is homozygous for the more common allele,
p_h = 0.5, where p_h represents the probability that the genotype is heterozygous, and
p_r = 0.25, where p_r represents the probability that the genotype is homozygous for the less common allele;
and for any individual polymorphic site that could not be measured, using the most probable assignment of haplotypes resulting from the use of the probabilities p_c, p_h, and p_r to construct the most likely genotype by combining the two haplotype alleles at this position.
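The handling of missing data can be sketched in two parts: expanding an unmeasured site into its three candidate genotypes with the stated probabilities, and rebuilding the genotype from an assigned haplotype pair. How the weights enter the scoring is not spelled out in this passage, so the sketch only produces them; the use of `None` to mark an unmeasured site, the `site_alleles` argument, and the function names are assumptions.

```python
from itertools import product
from math import prod

# Prior genotype probabilities stated in the text for an unmeasured site.
P_COMMON_HOM, P_HET, P_RARE_HOM = 0.25, 0.5, 0.25


def candidate_genotypes(genotype, site_alleles):
    """Expand unmeasured sites (None) into weighted candidate genotypes.

    `site_alleles[i]` is (common_allele, rare_allele) for site i.
    Returns a list of (completed_genotype, prior_weight).
    """
    options = []
    for site, (common, rare) in zip(genotype, site_alleles):
        if site is None:
            options.append([((common, common), P_COMMON_HOM),
                            ((common, rare), P_HET),
                            ((rare, rare), P_RARE_HOM)])
        else:
            options.append([(site, 1.0)])
    return [(tuple(s for s, _ in combo), prod(w for _, w in combo))
            for combo in product(*options)]


def genotype_from_pair(hap1, hap2):
    """Rebuild the genotype, including formerly missing sites, from a pair."""
    return tuple(tuple(sorted(pair)) for pair in zip(hap1, hap2))
```

Once the most probable haplotype pair has been assigned, `genotype_from_pair` combines the two alleles at each position, which fills in the site that could not be measured.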
The invention also relates to methods of constructing a haplotype database for a population, comprising:
(a) identifying individuals to include in the population;
(b) determining haplotype data for each individual in the population from genotype information;
(c) organizing the haplotype data for the individuals in the population into fields; and
(d) storing the haplotype data for individuals in the population according to the fields.
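As a rough illustration of organizing haplotype data into fields and storing it, the sketch below uses an in-memory SQLite database from the Python standard library. The table name, column names, and sample identifiers are all hypothetical; the text does not prescribe a schema, and it names Oracle or flat files as the preferred storage.

```python
import sqlite3


def build_haplotype_db(rows):
    """Create an in-memory table with one haplotype pair per subject.

    `rows` holds (subject_id, gene, population, hap1, hap2, pair_score).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE haplotype_pair (
               subject_id TEXT PRIMARY KEY,
               gene       TEXT NOT NULL,
               population TEXT,
               hap1       TEXT NOT NULL,
               hap2       TEXT NOT NULL,
               pair_score REAL)"""
    )
    conn.executemany(
        "INSERT INTO haplotype_pair VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def haplotype_counts(conn, gene):
    """Count how many chromosomes carry each haplotype for one gene."""
    query = """SELECT hap, COUNT(*) FROM (
                   SELECT hap1 AS hap FROM haplotype_pair WHERE gene = ?
                   UNION ALL
                   SELECT hap2 AS hap FROM haplotype_pair WHERE gene = ?
               ) AS h
               GROUP BY hap ORDER BY hap"""
    return dict(conn.execute(query, (gene, gene)).fetchall())
```

The second function shows the kind of basic calculation that can be left to SQL: a haplotype count per gene, obtained by stacking the two haplotype columns and grouping.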
The invention also relates to methods of predicting the presence of a haplotype pair in an individual comprising, in order:
(a) obtaining a genotype for the individual;
(b) enumerating all possible haplotype pairs which are consistent with the genotype;
(c) accessing a database containing reference haplotype pair frequency data to determine a probability, for each of the possible haplotype pairs, that the individual has a possible haplotype pair; and
(d) analyzing the determined probabilities to predict haplotype pairs for the individual.
The methods and tools of the invention make it possible to determine haplotypes and haplotype pairs in an individual, or a plurality of individuals in a population, based on unphased and/or incomplete genotype information. The individuals may, for example, be of the same gender, and/or part of the general population, an ethno-geographic population group, a clinical or disease population, or a population exhibiting a particular response to a stimulus (e.g., a response to a drug). Frequency and probability scores for haplotypes and haplotype pairs are preferably calculated and used within the same population, but may be used across different populations and population groups if desired.
Similarly, in agricultural biotechnology, the method and tools of the invention can be used to determine the haplotypes and haplotype pairs of genes responsible for specific desirable traits, e.g., drought tolerance and/or improved crop yields, and reduce the time and effort needed to transfer desirable traits.
The invention includes methods, computer programs, and databases for analyzing and making use of genotype information to deduce and/or predict haplotype information. These include methods, programs, and databases for finding and measuring the frequency of haplotypes and/or haplotype pairs in a population; and methods, programs, and databases for predicting an individual's haplotypes from the individual's genotype. Various aspects of the invention are discussed in further detail below.
A. POPULATION SIZE
In the methods of the invention relating to deriving the presence and frequency of haplotypes from a collection of genotypes, it is preferred that the minimum number of individuals being haplotyped be greater than the number of haplotypes expected from the number of polymorphisms in the loci being haplotyped. Based on an analysis of over 3500 genes, the present inventors have empirically determined that the number of haplotypes for a gene, on average, is about 1.1 to 1.3 times the number of individual polymorphisms in the gene being studied (data not shown). For example, in a locus containing 15 polymorphisms, it is expected that the number of haplotypes in the general population is between about 17 and about 20 (i.e., 1.1 x 15 = 16.5 and 1.3 x 15 = 19.5); thus the number of individuals in a reference population being haplotyped should preferably be at least 20.
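The sizing arithmetic in this section can be checked with a short sketch. The first function applies the 1.1 to 1.3 multiple and rounds up. The second computes expected carrier counts for the 10% frequency example discussed in the next paragraph, assuming Hardy-Weinberg proportions (p squared homozygotes and 2p(1 - p) heterozygotes); the text does not name that model, so it is an assumption about how the figure of 19 was derived. Both function names are hypothetical.

```python
import math
from fractions import Fraction

# Empirical multiples quoted in the text, kept exact to avoid float rounding.
LOW, HIGH = Fraction(11, 10), Fraction(13, 10)


def expected_haplotype_range(n_polymorphisms):
    """Round the 1.1x to 1.3x rule of thumb up to whole haplotypes."""
    return (math.ceil(LOW * n_polymorphisms),
            math.ceil(HIGH * n_polymorphisms))


def expected_carriers(sample_size, haplotype_frequency):
    """Expected (homozygotes, heterozygotes) under Hardy-Weinberg proportions."""
    p = haplotype_frequency
    return sample_size * p * p, sample_size * 2 * p * (1 - p)
```

For 15 polymorphisms the first function gives the range 17 to 20, and for 100 individuals at a haplotype frequency of 0.10 the second gives about 1 homozygote and 18 heterozygotes.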
If, on the other hand, the skilled artisan is interested in detecting all haplotypes for the polymorphic locus that exist in the general population above a fairly low frequency, then the size of the reference population should be sufficient to predict the existence of multiple copies of such haplotypes with high certainty. For example, in a sample of 100 individuals, a haplotype present in a frequency of 10% would be expected to occur in 19 individuals, once as a homozygote and 18 times as a heterozygote. Thus, for pharmacogenetic applications, it is desirable to use genotypes from about 100 unrelated individuals in the HAP™ Builder process described herein to establish the haplotypes that exist in the general population for a particular polymorphic locus of interest, e.g., a typical gene of pharmaceutical relevance. However, if establishing haplotypes for a very polymorphic locus, e.g., one that has > 60 polymorphic sites, then it would be preferable to use a larger reference population, such as 200, 400, 600, 800, or up to about 1000 individuals.
B. HAP EXPANSION
Any given genotype may be heterozygous at any of the variable sites. If a genotype is found homozygous at all sites (e.g., A/A, C/C, C/C, T/T), in the absence of genotyping error the only possible assignment is two simultaneous copies of the same haplotype (ACCT). If a genotype is heterozygous at one position (e.g., A/A, C/C, C/G, T/T), there is likewise only one possible assignment, i.e., the combination of two haplotypes (ACCT and ACGT). If the genotype is heterozygous at more than one position, there are multiple assignments possible. The Hap Expansion constitutes a way of enumerating all possibly observed haplotypes and assigning to them a score that amounts to an initial estimate of their frequency. Hap Expansion goes through the following steps:
(1) For each genotype, all possible combinations of haplotypes that are consistent with the genotype are determined. For a fully homozygous genotype, there will be one haplotype. For a singly heterozygous genotype there will be two. For a doubly heterozygous sample there will be four, etc. In general, if there are n heterozygous positions, there will be 2^n haplotypes that are consistent with the observed genotype.
(2) Evidence scores are assigned to the haplotypes found in step (1). In an embodiment of the invention which is exemplified herein, each haplotype in the expansion will have an evidence score of 2/2^n assigned to it. Thus, a homozygous genotype will generate one haplotype with a score of 2, a singly heterozygous genotype will generate two haplotypes with a score of 1 each, a doubly heterozygous genotype will generate four haplotypes with a score of 0.5 each, etc. In this embodiment, 2^n haplotypes with a score of 2/2^n each will be generated, with the proviso that if the polymorphic genomic region is haploid or hemizygous in the individual (e.g., if it is from a sex-linked, mitochondrial or chloroplast gene), an evidence score of 1 is assigned.
(3) The evidence scores for each haplotype are summed across all the samples to yield the initial haplotype frequency. For example, one haplotype may occur in the expansions of two genotypes, one singly heterozygous and one doubly heterozygous. The total initial frequency for this haplotype from the two genotypes would then be 1 plus 0.5, or 1.5. Where multiple identical genotypes have been grouped together, the evidence scores for haplotypes associated with that genotype are multiplied by the group multiplier n_g to simultaneously account for all occurrences of that genotype. The total initial frequency, added up across all haplotypes, will be two times the number of samples if all genomic regions are diploid; if haploid or hemizygous regions are represented the total will be reduced accordingly.
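Steps (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration for diploid regions only, assuming a hypothetical text input format of one "X/Y" entry per polymorphic site; the function names are invented for the sketch, and the four genotypes used are those of the NoName example in Table 1.

```python
from collections import defaultdict
from itertools import product


def parse(text):
    """'A/G A/C C/T G/T' -> [('A', 'G'), ('A', 'C'), ...] (hypothetical input format)."""
    return [tuple(site.split("/")) for site in text.split()]


def expand_genotype(genotype):
    """Steps (1) and (2): map each consistent haplotype to its evidence score 2/2^n."""
    # A heterozygous site offers two alleles, a homozygous site only one.
    choices = [sorted(set(site)) for site in genotype]
    n_het = sum(len(c) == 2 for c in choices)
    score = 2 / 2 ** n_het
    return {"".join(hap): score for hap in product(*choices)}


def initial_frequencies(genotypes):
    """Step (3): sum the evidence scores of each haplotype over all samples."""
    f = defaultdict(float)
    for genotype in genotypes:
        for hap, score in expand_genotype(genotype).items():
            f[hap] += score
    return dict(f)


genotypes = [parse(t) for t in (
    "A/G A/C C/T G/T",
    "A/A A/C C/T G/T",
    "A/G C/C C/T G/T",
    "G/G A/C T/T G/T",
)]
f = initial_frequencies(genotypes)
for hap in sorted(f):
    print(hap, f[hap])
print("total:", sum(f.values()))  # two times the number of samples
```

Because every score is a power of two, the sums are exact in floating point, which makes the totals easy to check against the number of samples.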
In the Hap Expansion phase, the evidence score is a function of the number of ambiguous polymorphic sites being haplotyped. As used herein, an ambiguous polymorphic site means either a heterozygous site or a site for which nucleotide sequence information is lacking. Preferably, the evidence score s_i obeys one or both of the following formulas:
$$0 < s_i \le 2 \quad \text{and} \quad \sum_{i=1}^{2^n} s_i = 2$$
wherein n is the number of ambiguous positions being haplotyped in the genotype. In a current preferred embodiment exemplified herein, the evidence score is s_i = 2/2^n, wherein n is the number of ambiguous positions being haplotyped in the genotype. In each of the above embodiments, however, if the polymorphic genomic region is haploid or hemizygous, an evidence score of 1 is assigned.
Also, preferably, the initial frequency f_i is calculated from the sum of the evidence scores across all the different individuals (or genotypes, where the evidence scores s_i are weighted appropriately by multiplying by the group multipliers n_g), for each of the enumerated possible haplotypes h_i, wherein it is understood that i is an index, e.g., h_1, h_2, etc.
An example of the assignment of evidence scores for the fictitious NoName gene, having four polymorphic sites, is illustrated in Table 1. Such a gene would have a total of 3^4 = 81 possible genotypes, and 2^4 = 16 possible haplotypes. Haplotype expansions of four genotypes, from a population of four different individuals, are illustrated. Table 1 shows all possible haplotypes which could be enumerated from each of the four illustrated genotypes, one of which is heterozygous at each site (16 haplotypes), two of which are heterozygous at only 3 of these sites (8 haplotypes) and one of which is heterozygous at only 2 of these sites (4 haplotypes). The set of evidence scores s_i for each genotype, and the derived initial frequency scores f_i for each haplotype summed across all four genotypes, are also shown. Table 2 shows similar information for the same haplotypes, but where two additional individuals having genotype 2 have been added to the population, and illustrates an embodiment of the invention wherein grouping of identical genotypes has been carried out. In this embodiment, the evidence scores for genotype 2 are multiplied by n_g (in this example, n_g = 3), prior to calculation of the frequency scores f_i.
Table 1
Assignment of evidence scores s_i and initial frequency scores f_i for all possible haplotypes expanded from four genotypes of the NoName Gene
Genotype 1: A/G A/C C/T G/T
Genotype 2: A/A A/C C/T G/T
Genotype 3: A/G C/C C/T G/T
Genotype 4: G/G A/C T/T G/T
(A dash indicates that the haplotype is not consistent with the genotype.)
h_i     Haplotype   s_i (Genotype 1)   s_i (Genotype 2)   s_i (Genotype 3)   s_i (Genotype 4)   f_i
1       A A C G     0.125              0.250              -                  -                  0.375
2       A A C T     0.125              0.250              -                  -                  0.375
3       A A T G     0.125              0.250              -                  -                  0.375
4       A A T T     0.125              0.250              -                  -                  0.375
5       A C C G     0.125              0.250              0.250              -                  0.625
6       A C C T     0.125              0.250              0.250              -                  0.625
7       A C T G     0.125              0.250              0.250              -                  0.625
8       A C T T     0.125              0.250              0.250              -                  0.625
9       G A C G     0.125              -                  -                  -                  0.125
10      G A C T     0.125              -                  -                  -                  0.125
11      G A T G     0.125              -                  -                  0.500              0.625
12      G A T T     0.125              -                  -                  0.500              0.625
13      G C C T     0.125              -                  0.250              -                  0.375
14      G C C G     0.125              -                  0.250              -                  0.375
15      G C T G     0.125              -                  0.250              0.500              0.875
16      G C T T     0.125              -                  0.250              0.500              0.875
Totals              2.000              2.000              2.000              2.000              8.000
Table 2
Assignment of evidence scores s_i and initial frequency scores f_i for all possible haplotypes expanded from four genotypes of the NoName Gene, with grouping of genotype 2
Genotype 1: A/G A/C C/T G/T
Genotype 2 (3 occurrences): A/A A/C C/T G/T
Genotype 3: A/G C/C C/T G/T
Genotype 4: G/G A/C T/T G/T
(A dash indicates that the haplotype is not consistent with the genotype.)
h_i     Haplotype   s_i (Genotype 1)   (n_g)(s_i) (Genotype 2)   s_i (Genotype 3)   s_i (Genotype 4)   f_i
1       A A C G     0.125              0.750                     -                  -                  0.875
2       A A C T     0.125              0.750                     -                  -                  0.875
3       A A T G     0.125              0.750                     -                  -                  0.875
4       A A T T     0.125              0.750                     -                  -                  0.875
5       A C C G     0.125              0.750                     0.250              -                  1.125
6       A C C T     0.125              0.750                     0.250              -                  1.125
7       A C T G     0.125              0.750                     0.250              -                  1.125
8       A C T T     0.125              0.750                     0.250              -                  1.125
9       G A C G     0.125              -                         -                  -                  0.125
10      G A C T     0.125              -                         -                  -                  0.125
11      G A T G     0.125              -                         -                  0.500              0.625
12      G A T T     0.125              -                         -                  0.500              0.625
13      G C C T     0.125              -                         0.250              -                  0.375
14      G C C G     0.125              -                         0.250              -                  0.375
15      G C T G     0.125              -                         0.250              0.500              0.875
16      G C T T     0.125              -                         0.250              0.500              0.875
Totals              2.000              6.000                     2.000              2.000              12.000
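The grouping of identical genotypes shown in Table 2 can be sketched as follows. The sketch assumes diploid regions and a hypothetical text representation of each genotype; `collections.Counter` is used here simply as one convenient way to obtain the group multipliers n_g, and is not taken from the described implementation.

```python
from collections import Counter, defaultdict
from itertools import product


def grouped_initial_frequencies(genotype_strings):
    """Initial frequencies f_i with each distinct genotype expanded once and weighted by n_g."""
    groups = Counter(genotype_strings)  # distinct genotype -> group multiplier n_g
    f = defaultdict(float)
    for text, n_g in groups.items():
        choices = [sorted(set(site.split("/"))) for site in text.split()]
        n_het = sum(len(c) == 2 for c in choices)
        s = 2 / 2 ** n_het              # evidence score for one occurrence
        for hap in product(*choices):
            f["".join(hap)] += n_g * s  # (n_g)(s_i)
    return dict(f)


population = (
    ["A/G A/C C/T G/T"]
    + ["A/A A/C C/T G/T"] * 3          # genotype 2 occurs three times
    + ["A/G C/C C/T G/T", "G/G A/C T/T G/T"]
)
f = grouped_initial_frequencies(population)
print(f["AACG"], f["ACCG"], sum(f.values()))
```

Expanding each distinct genotype once and scaling by n_g gives the same totals as expanding every individual separately, while avoiding repeated enumeration for common genotypes.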
C. HAP ASSIGNMENT
The haplotype frequency scores f_i generated in the Hap Expansion serve as an initial estimate of the expected frequency of the haplotypes. Many genotypes will allow only one possible combination of two of the haplotypes from the expansion. This is true for the homozygous and singly heterozygous genotypes. For multiply heterozygous genotypes, there are generally many possibilities. However, since of the 2^n theoretically possible haplotypes only about n actually occur, many real haplotypes will occur in more than one of the samples, and their frequency scores f_i will be higher than those of rare or non-occurring haplotypes. A pair frequency score F_k can be assigned to each pair of haplotypes. The pair frequency score F_k is a function of the haplotype frequency scores (f_i, f_j) for each of the haplotypes h_i and h_j in the pair. In a preferred embodiment, only haplotypes that meet a frequency score criterion are considered when assigning the pair frequency score F_k. The frequency score criterion may be user defined or it may be a default value. In a particularly preferred embodiment, the frequency score criterion is set at f_i > 0.1. This means that haplotypes with less than a 10% chance of occurring in any individual in the entire sample are eliminated from the Hap Assignment phase. Lower values for the frequency score criterion (e.g., f_i > 0.01, or f_i > 0.001) will result in slightly greater accuracy in making hap pair assignments, but greater computing time and/or resources will be required for lower values. Thus, the value for this criterion may be any number that the practitioner skilled in the art might find suitable to balance the desired degree of accuracy with the constraints on available time and computational resources.
In the Hap Assignment phase, the pair score criterion is preferably (a) a specific numerical cutoff; (b) a function of the values of the pair scores; or (c) a function of the rankings of the pair scores. In a preferred method of the invention exemplified herein, each pair (i, j)_k of haplotypes h_i and h_j in a sample is assigned a pair score F_k = 2 f_i f_j if i ≠ j, or F_k = f_i^2 if i = j, with the proviso that F_k = f_i when the polymorphic genomic region is haploid or hemizygous. The Hap Assignment phase of the method of the invention preferably further comprises determining a probability p_k that the haplotype pair which has been assigned to the genotype is correct. In one embodiment of the invention, the probability score p_k is determined by (a) ranking each of the pair scores F_k for all the N possible pairs for a genotype with the highest score (F_0) first; and (b) defining the probability p_k as the score of the first pair divided by the sum of all scores:
$$p_k = \frac{F_0}{\sum_{l=0}^{N-1} F_l}$$
Under these criteria, if there are two equally likely assignments (e.g., F_0 = 36, F_1 = 36), a score of 0.5 is given to the assignment, reflecting the fact that we are only 50% certain that one of these pairs is the correct one. As another example, if there is one pair with score 8, and one with score 2, the assignment of the first pair is made with an 80% probability. The ranking is performed this way for all the genotypes.
In a preferred embodiment of the invention, which is exemplified herein, the probability score p_k is determined by (a) ranking each of the pair scores F_k for all the possible pairs for a genotype with the highest score (F_0) first; (b) disregarding all but the N_rank highest ranking assignments; and (c) defining the probability p_k as the score of the first pair divided by the sum of all scores, using the formula:
[Equation image imgf000030_0001 not reproduced in this text; per the description above, the sum in the denominator runs over the N_rank retained pair scores.]
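The pair scoring and ranking described in this section can be sketched as below for diploid regions. The function names, the data layout (haplotypes as strings, genotypes as lists of allele tuples), and the `n_rank` keyword are assumptions made for illustration; truncation to the N_rank best pairs is implemented simply by slicing the ranked list.

```python
from itertools import combinations_with_replacement


def consistent_pairs(genotype, haplotypes):
    """Unordered haplotype pairs whose alleles reproduce the genotype at every site."""
    target = [tuple(sorted(site)) for site in genotype]
    pairs = []
    for h1, h2 in combinations_with_replacement(sorted(haplotypes), 2):
        if [tuple(sorted(ab)) for ab in zip(h1, h2)] == target:
            pairs.append((h1, h2))
    return pairs


def pair_score(h1, h2, f):
    """F_k = f_i^2 for a homozygous pair, 2 f_i f_j otherwise."""
    return f[h1] ** 2 if h1 == h2 else 2 * f[h1] * f[h2]


def top_pair_probability(scores, n_rank=None):
    """Score of the best pair divided by the sum of the retained scores."""
    ranked = sorted(scores, reverse=True)
    if n_rank is not None:
        ranked = ranked[:n_rank]  # keep only the N_rank highest-ranking pairs
    return ranked[0] / sum(ranked)


print(top_pair_probability([36, 36]))                  # two equally likely pairs
print(top_pair_probability([2, 8]))                    # one pair four times as likely
print(top_pair_probability([8, 2, 1, 1], n_rank=2))    # truncated ranking
```

With the two worked examples from the text, the function returns 0.5 and 0.8 respectively.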
D. ITERATION
In the method of the present invention, a revised haplotype frequency score f_i is calculated for each of the haplotypes, wherein the revised haplotype frequency f_i is a function of the previously determined probability p_k for each consistent haplotype pair which contains the haplotype. In a method of the invention exemplified herein, a new set of frequency estimates is calculated from the probability scores. The new frequency f_i of haplotype i is calculated as the sum of the p_k for all pair assignments containing haplotype i, counting homozygous pairs (i, i) twice. Again, the sum of the frequencies across all haplotypes will be two times the number of samples, if all genomic regions are diploid; if haploid or hemizygous regions are represented the total will be reduced accordingly.
Revised haplotype pair scores F_k and revised probability scores p_k can be determined based on the revised haplotype frequency scores f_i using the methods described above. These steps can be repeated until an end condition is reached. In a preferred embodiment, where identical genotypes have been grouped together prior to Hap Expansion, the new frequency f_i of haplotype h_i is calculated as the sum of the products (n_g)(p_k) for all pair assignments containing haplotype h_i, further multiplied by 2 for homozygous pairs (i, i). The end condition can be one of many possible parameters. It may be user definable or a default condition. It may also be variable or set. For example, the end condition may be met when the above-mentioned iteration steps are repeated a preset number of times. Alternatively, the end condition may be met when one or more of the parameters f_i, F_k, and p_k stabilizes; or the end condition may be met simply when the operator chooses to stop.
Stabilization can mean: (1) the maximum difference between consecutive iterations of F_k or p_k goes below a threshold; (2) the ranking (or truncated ranking) does not change for a given number of iterations; or (3) any suitable quantity does not change more than a threshold. The preferred end condition is (1). In another preferred method of the invention exemplified herein, the end condition is met when the operator chooses to stop.
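The iteration can be sketched as a simple loop. The sketch below makes several simplifying assumptions that are not stated in the text: every consistent pair of a genotype is kept, each pair's probability is taken as its score divided by the sum of scores for that genotype, genotypes are not grouped, all regions are diploid, and stabilization is tested on the frequencies f_i. The function name and the small three-individual data set are invented for illustration.

```python
def iterate_frequencies(pair_lists, f, tol=1e-6, max_iter=100):
    """Alternate between pair probabilities p_k and haplotype frequencies f_i.

    pair_lists: one list of consistent (h_i, h_j) pairs per genotype.
    f: initial frequency scores from the Hap Expansion.
    """
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_f = {hap: 0.0 for hap in f}
        for pairs in pair_lists:
            scores = [f[a] ** 2 if a == b else 2 * f[a] * f[b] for a, b in pairs]
            total = sum(scores)
            for (a, b), score in zip(pairs, scores):
                p = score / total  # assumed per-pair probability
                new_f[a] += p
                new_f[b] += p      # a homozygous pair (a == b) is thus counted twice
        change = max(abs(new_f[hap] - f[hap]) for hap in f)
        f = new_f
        if change < tol:           # stabilization of the frequencies
            break
    return f, iteration


# Three individuals at two sites: AC/AC, a double heterozygote A/G C/T, and GT/GT.
pair_lists = [
    [("AC", "AC")],
    [("AC", "GT"), ("AT", "GC")],
    [("GT", "GT")],
]
initial = {"AC": 2.5, "GT": 2.5, "AT": 0.5, "GC": 0.5}  # from the 2/2^n expansion
final, n_iter = iterate_frequencies(pair_lists, initial)
print(final, n_iter)
```

In this toy data set the ambiguous double heterozygote is resolved in favor of the pair built from the two haplotypes that are also seen in homozygous individuals, and the frequencies always sum to two times the number of samples.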
E. AMBIGUITY CRITERION AND PAIR SCORE CRITERION
In one embodiment of the invention, only genotypes that meet an ambiguity criterion have their haplotypes enumerated (see Fig. 2). The ambiguity criterion may be user definable or may be a default value. The ambiguity criterion is preferably a function of the number of ambiguous polymorphic sites in the genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which information is lacking.
In another aspect of the invention, only haplotype pairs that meet a pair score criterion will be kept. The pair score criterion is a function of the pair score F_k for each of the haplotype pairs, as discussed above. In an embodiment of the invention described herein, the pair score F_k is retained only if it is one of the top 15 pair scores. In another embodiment, only those haplotype pairs whose pair scores F_k are greater than a certain percentage of F_kmax (the highest F_k associated with any consistent haplotype pair) will be kept.
F. ERROR DETECTION AND CORRECTION
At all stages in the method, the possibility of error in the input data may be considered. In the expansion, for example, even a fully homozygous sample may generate a number of haplotypes, i.e., the principal one, and all the additional ones which would be introduced if any one of the positions were changed to be heterozygous. In the method of the invention exemplified herein, the score of each such additional haplotype is considerably reduced by multiplying by the assumed error probability, a number usually set at 0.01 - 0.02. At the assignment stage, if the most highly scoring haplotype pair is not consistent with the input genotype (despite the strong 1%-2% penalty factor), the difference is highlighted and reported as a probable misread (e.g., a sequencing error).
The preferred way to modify the method to allow for the possibility of errors is as follows. For each measured genotype (i.e., for each individual at each polymorphic position), replace the exclusive determination (either of A, A/C, C) by the specification of probabilities as follows: p_c, the probability for the common allele; p_h, for the heterozygote; and p_r, for the rare allele. Preferably, these probabilities are estimated individually for each genotype measurement according to the quality of the raw data by the procedure used to determine the genotypes. Alternatively, a single error probability p_err can be defined that estimates the probability for any given allele to be determined erroneously. In the latter case the genotype probabilities are as follows:
(a) for a measured homozygous common allele: p_c = (1 - p_err)^2, p_h = 2(1 - p_err) p_err, p_r = p_err^2;
(b) for a measured heterozygote: p_c = (1/2) p_err, p_h = 1 - p_err, p_r = (1/2) p_err; and
(c) for a measured homozygous rare allele: p_c = p_err^2, p_h = 2(1 - p_err) p_err, p_r = (1 - p_err)^2.
Preferably a more complex formula involving the allele frequencies would be used. In step (b) of the Hap Expansion process described earlier, each haplotype may be individually weighted for its consistency with a given individual genotype. The preferred weights for the Hap Expansion are
$$w_i = \prod_{k=1}^{N} \left[ c_{ik}\left(p_c + \frac{p_h}{2}\right) + r_{ik}\left(p_r + \frac{p_h}{2}\right) \right] \qquad \text{(Formula 1)}$$
where k = 1 ... N enumerates the polymorphic sites, c_ik = 1, r_ik = 0 if the allele of haplotype h_i at position k is the common allele, and c_ik = 0, r_ik = 1 if it is the rare allele. The score s_i is then simply multiplied by the weight so that the weighted score is s_i' = w_i s_i.
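A short sketch of the single-error-probability case follows. It assumes Formula 1 is read as a product over sites of (p_c + p_h/2) where the haplotype carries the common allele and (p_r + p_h/2) where it carries the rare allele; the encoding of haplotypes as strings of 'c' and 'r' and the function names are hypothetical choices for the sketch.

```python
def genotype_probabilities(call, p_err):
    """Return (p_c, p_h, p_r) for a measured call: 'common', 'het' or 'rare'."""
    if call == "common":
        return ((1 - p_err) ** 2, 2 * (1 - p_err) * p_err, p_err ** 2)
    if call == "het":
        return (0.5 * p_err, 1 - p_err, 0.5 * p_err)
    if call == "rare":
        return (p_err ** 2, 2 * (1 - p_err) * p_err, (1 - p_err) ** 2)
    raise ValueError(f"unknown call: {call}")


def haplotype_weight(haplotype, site_probs):
    """Weight w_i of a haplotype given per-site (p_c, p_h, p_r), per the assumed Formula 1.

    haplotype: string over {'c', 'r'} (common / rare allele at each site).
    """
    w = 1.0
    for allele, (p_c, p_h, p_r) in zip(haplotype, site_probs):
        w *= (p_c + p_h / 2) if allele == "c" else (p_r + p_h / 2)
    return w


# Measured genotype: homozygous common, homozygous common, heterozygous.
site_probs = [genotype_probabilities(c, 0.01) for c in ("common", "common", "het")]
print(haplotype_weight("ccc", site_probs))  # consistent with the measurement
print(haplotype_weight("rcc", site_probs))  # requires one miscalled site
```

Under this reading, a site measured as homozygous common contributes a factor of 1 - p_err for the common allele and p_err for the rare allele, so each additional mismatch lowers the weight by roughly the error probability.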
In step (f) of the Hap Assignment process, a pair of haplotypes (h_i, h_j) may be weighted for its consistency with an individual genotype. The preferred weights for the Hap Assignment are
$$w_{ij} = \prod_{k=1}^{N} \left[ c_{ik} c_{jk}\, p_c + \left( c_{ik} r_{jk} + r_{ik} c_{jk} \right) p_h + r_{ik} r_{jk}\, p_r \right] \qquad \text{(Formula 2)}$$
Step (f) is then modified such that the F_k are replaced by F_k' = w_ij F_k (recall that k stands for a pair of haplotypes (h_i, h_j)). With these modifications, it is possible for a pair of haplotypes which is inconsistent with the observed genotype to nevertheless score high in the ranked list of assignments.
Two cases may be distinguished. Case 1: An inconsistent pair has a score that is comparable to, but lower than, the score of a consistent pair. In this case, one may conclude that there is a significant probability that the genotype causing the inconsistency was measured incorrectly. In the presently implemented version of the invention, this genotype is characterized as a possible miscall (error detection). Case 2: An inconsistent pair ranks first on the list of possible assignments. In the presently implemented version of the invention, this genotype is characterized as most likely wrong, and we choose the haplotype assignment in spite of its inconsistency, effectively overriding the genotyping call (error correction).
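The pair weighting and the two cases can be sketched as follows. The sketch assumes Formula 2 is read as a per-site factor of p_c when both haplotypes carry the common allele, p_r when both carry the rare allele, and p_h otherwise. The text does not quantify what "comparable" means in Case 1, so the `comparable` fraction below is a purely hypothetical parameter, as are the function names and return labels.

```python
def pair_weight(hap_i, hap_j, site_probs):
    """Weight w_ij of a haplotype pair given per-site (p_c, p_h, p_r), per the assumed Formula 2."""
    w = 1.0
    for a, b, (p_c, p_h, p_r) in zip(hap_i, hap_j, site_probs):
        if a == "c" and b == "c":
            w *= p_c
        elif a == "r" and b == "r":
            w *= p_r
        else:
            w *= p_h
    return w


def classify_genotype(scored_pairs, comparable=0.1):
    """scored_pairs: list of (weighted score F_k', is_consistent_with_measured_genotype)."""
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    best_score, best_is_consistent = ranked[0]
    if not best_is_consistent:
        return "corrected"            # Case 2: the inconsistent pair ranks first
    for score, is_consistent in ranked[1:]:
        if not is_consistent and score >= comparable * best_score:
            return "possible miscall"  # Case 1: comparable but lower score
    return "consistent"


# Per-site probabilities for p_err = 0.01: one homozygous common call, one heterozygous call.
site_probs = [(0.9801, 0.0198, 0.0001), (0.005, 0.99, 0.005)]
print(pair_weight("cc", "cr", site_probs))  # pair matching the measured genotype
print(pair_weight("cc", "rr", site_probs))  # pair requiring a miscall at site 1
print(classify_genotype([(10.0, True), (4.0, False)]))
```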
G. PRUNING
In one embodiment of the invention, a recursive pruning algorithm is used in the Hap Expansion phase to eliminate from consideration enumerated haplotypes whose evidence scores s_i are below a given threshold value, and/or is used in the Hap Assignment phase to eliminate from consideration haplotype pairs whose pair scores F_k are below a threshold value. A pruning algorithm is preferred because the number of possible haplotypes grows exponentially with the number of sites, and because an exhaustive enumeration is rarely desirable. Since the weights w_i and w_ij are written as products across the sites in the above formulas, they can be recursively enumerated as follows: (a) generate the two alleles for the first position (the first polymorphic position chosen to be included in haplotyping); (b) calculate the first factor of the weight, i.e., evaluate Formula 1 or Formula 2 with k = 1; (c) for each of the first position alleles, generate the two alleles at the second position; (d) for each combination of alleles, calculate the second factor of the weight and multiply it by the first factor, i.e., evaluate Formula 1 or Formula 2 for k = (1, 2); (e) do not continue with combinations where the weight is below a given threshold; and (f) continue generating additional combinations, one site at a time, until all positions have been visited. The threshold for w_i is an evidence score criterion, and the threshold for w_ij is a pair score criterion. In other words, one generates sub-haplotypes with one, two, three, ... up to the full set of polymorphic sites. One generates a new set of subhaplotypes from the previous set of subhaplotypes by creating the two possible combinations for each of the nth subhaplotypes with either of the two alleles for the (n+1)th polymorphic site. The net result is that the weights are recalculated and the ambiguity threshold is re-tested as each polymorphic position is considered in turn.
Whenever a subhaplotype containing multiple rare polymorphisms is encountered, the weight of all haplotypes comprising that subhaplotype is necessarily reduced below the threshold, indicating that all haplotypes comprising that subhaplotype need not be visited. Without the pruning step (e), for some genotypes, many possible haplotypes would be generated, most of them having vanishing weight. Step (e) ensures that any branches of the search that are already doomed because of too many mismatches and/or rare polymorphisms will not be followed. This changes the computational complexity of the algorithm such that it rises more or less linearly rather than exponentially with the number of sites, making the calculation more practical. In a preferred embodiment, the evidence score criterion is chosen to optimize the use of computer resources. The evidence score criterion s_trunc is a score threshold, below which the recursive search for consistent haplotypes based on the current sub-haplotype is truncated. It must be less than 1, and should be as small as the time available for computation allows. It is related to the number n_a of ambiguous polymorphic sites allowed in a genotype, beyond which no more contributions to the hap expansion will be generated, in the following way: s_trunc = (1/2)^(n_a). If the number of samples is N, the maximum frequency a haplotype could have in the population and escape consideration is (N)(s_trunc). A preferred value for the threshold is therefore s_trunc < 1/N. In a preferred embodiment s_trunc = 0.01. As alternative examples, s_trunc may be 0.001, 0.0001, or 0.00001, or any other number that the practitioner skilled in the art might find suitable in view of the constraints on available time and computational resources.
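The recursive enumeration with pruning, steps (a) to (f), can be sketched as below for the haplotype weights. The sketch assumes the per-site factor of Formula 1 is (p_c + p_h/2) for the common allele and (p_r + p_h/2) for the rare allele, encodes haplotypes as strings of 'c' and 'r', and counts visited nodes only to illustrate the saving; none of these names come from the described implementation.

```python
def expand_with_pruning(site_probs, threshold):
    """Enumerate haplotypes whose running weight never falls below the threshold.

    site_probs: list of (p_c, p_h, p_r), one tuple per polymorphic site.
    Returns ({haplotype: weight}, number of search nodes visited).
    """
    results = {}
    visited = 0

    def extend(prefix, weight, k):
        nonlocal visited
        visited += 1
        if k == len(site_probs):          # all positions have been visited
            results[prefix] = weight
            return
        p_c, p_h, p_r = site_probs[k]
        for allele, factor in (("c", p_c + p_h / 2), ("r", p_r + p_h / 2)):
            new_weight = weight * factor  # multiply in the factor for site k
            if new_weight >= threshold:   # step (e): abandon doomed branches
                extend(prefix + allele, new_weight, k + 1)

    extend("", 1.0, 0)
    return results, visited


# Ten sites, all measured homozygous common, with p_err = 0.01.
site_probs = [(0.9801, 0.0198, 0.0001)] * 10
results, visited = expand_with_pruning(site_probs, threshold=0.005)
print(len(results), "haplotypes kept of", 2 ** 10)
print(visited, "nodes visited")
```

In this example only the all-common haplotype and the ten single-mismatch haplotypes survive, since any branch with two rare alleles already has a weight of at most 0.0001.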
H. MENDELIAN INHERITANCE
Another aspect of the method of the invention provides a method for optionally adjusting the assignment probability scores p_k to reflect the requirement of Mendelian inheritance between individuals who are related. For example, when there is at least one multi-generation family included among the individuals whose polymorphic genomic regions are being haplotyped, the probability p_k may be reduced for each pair assignment for each genotype in the family that does not obey Mendelian inheritance. In the simplest embodiment of this aspect of the invention, the scores for any assignment which does not obey Mendelian inheritance with respect to other higher ranking assignments for the relatives are set to zero. Another, preferred, embodiment is the multiplication of an unadjusted probability score p_k by (1 - p_k'), where p_k' is the score of any assignment k' of a related person that is in conflict with the first assignment k (i.e., p_k =
[Equation image imgf000035_0001 not reproduced in this text]
where p_k' is the probability calculated for a pair assignment of a related genotype). Another embodiment, which is currently implemented in the example described herein, is the interactive fixing of selected assignments (by setting p_k = 1) according to the judgment of the operator. The scores are renormalized after such modification by using the formula
[Equation image imgf000036_0001 not reproduced in this text] (Formula 3)
In other words, if some of the samples come from individuals that are related, Hap assignments are preferably constrained to obey Mendelian segregation rules, i.e. one of the copies must be inherited from the father, and one from the mother. This constraint is used in the HAP™ Builder process to eliminate solutions that violate inheritance rules and increase the probability scores of those that do not. It will be appreciated that individuals who in fact are not the offspring of an erstwhile parent are readily identified, and their haplotype pair assignments will not be subjected to the Mendelian segregation criterion.
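A rough sketch of the (1 - p_k') adjustment is given below. Several points are assumptions rather than statements from the text: a candidate pair is treated as conflicting with a parent's assignment when the two share no haplotype, the relatives are a child's parents, and renormalization divides by the sum of the adjusted scores (Formula 3 itself is not reproduced in the text). Haplotype labels such as "H1" are placeholders.

```python
def conflicts(child_pair, parent_pair):
    """Assumed conflict rule: the child must share at least one haplotype with each parent."""
    return not (set(child_pair) & set(parent_pair))


def adjust_for_relatives(child_probs, relatives_probs):
    """Multiply each child p_k by (1 - p_k') for every conflicting relative assignment k'.

    child_probs: {pair: p_k}; relatives_probs: list of {pair: p_k'} dicts.
    """
    adjusted = {}
    for pair, p in child_probs.items():
        for relative in relatives_probs:
            for rel_pair, p_rel in relative.items():
                if conflicts(pair, rel_pair):
                    p *= (1 - p_rel)
        adjusted[pair] = p
    total = sum(adjusted.values())
    if total == 0:
        return adjusted                  # nothing left to renormalize
    return {pair: p / total for pair, p in adjusted.items()}  # assumed renormalization


child = {("H1", "H2"): 0.5, ("H3", "H4"): 0.5}
father = {("H1", "H5"): 0.9, ("H3", "H5"): 0.1}
print(adjust_for_relatives(child, [father]))
```

In this toy case the child's two candidate pairs start out equally likely, and the father's strongly supported assignment shifts the balance toward the pair that shares a haplotype with it.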
Mendelian segregation rules can also be used to validate the HAP™ Builder process. For example, genotype information from individuals belonging to one or more three-generation families may be entered into the database so that they can be treated as either being related or not being related. The haplotype assignments under each of these conditions can be compared for consistency.
I. HARDY-WEINBERG EQUILIBRIUM
Another aspect of the method of the invention provides for the optional adjustment of the assignment probability scores p_k to reflect Hardy-Weinberg Equilibrium. In any given Mendelian population, the numbers of heterozygotes and homozygotes are related by the Hardy-Weinberg principle: f_00 = p^2, f_01 = 2pq, and f_11 = q^2, where f_00 is the frequency of wild type homozygotes, f_01 the frequency of heterozygotes, and f_11 the frequency of mutant homozygotes; p and q are the frequencies of the wild type and mutant alleles, respectively. This is true for individual polymorphisms, and can also be extended to haplotypes (f_ii = p_i^2, f_ij = 2 p_i p_j). Since we observe three variables (or n + n(n-1)/2), which reduce to just 2 (or n), the Hardy-Weinberg principle gives us an additional constraint which is used in the HAP™ Builder process to increase the probability scores of those haplotypes which satisfy the Hardy-Weinberg principle, and decrease the scores of those that do not.
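The Hardy-Weinberg proportions are easy to check numerically. The sketch below computes expected genotype counts for an allele or haplotype of a given frequency and applies them to the population-size example of section A (a haplotype at 10% frequency in 100 individuals); the function name and dictionary keys are chosen only for illustration.

```python
def hardy_weinberg_expected(p, n_individuals):
    """Expected genotype counts for an allele (or haplotype) of frequency p."""
    q = 1.0 - p
    return {
        "homozygous": p * p * n_individuals,        # p^2
        "heterozygous": 2 * p * q * n_individuals,  # 2pq
        "non_carrier": q * q * n_individuals,       # q^2
    }


counts = hardy_weinberg_expected(0.10, 100)
carriers = counts["homozygous"] + counts["heterozygous"]
print(counts)
print("expected carriers:", round(carriers, 6))
```

The result reproduces the figures quoted earlier: about one homozygote and eighteen heterozygotes, or nineteen carriers in total.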
When there is at least one population group included among the individuals whose polymorphic genomic regions are being haplotyped and whose genotypes would be expected to reflect Hardy-Weinberg Equilibrium, the probability pk may be reduced for each pair assignment for each genotype in the population group that does not obey Hardy-Weinberg Equilibrium.
The Hardy-Weinberg equilibrium postulates a relationship between the frequencies of homozygous assignments and heterozygous assignments such that
$$F_{ii}^2 + 2F_{ij} + F_{jj}^2 = 1 \qquad \text{(Formula 4)}$$
where F_ii, F_jj, and F_ij are the frequencies F_k of the three possible assignments for any given pair of different haplotypes h_i and h_j. One embodiment of this score adjustment is to multiply the scores p_ii, p_jj, and p_ij by one minus the chi-squared value for the deviation from Hardy-Weinberg equilibrium for all pairs of different haplotypes h_i and h_j:
p_k' = [Equation image imgf000037_0001 not reproduced in this text] (Formula 5)
In other words, when a haplotype pair does not fit the Hardy-Weinberg equation, the probability p_k may be reduced to p_k' by the above formula, wherein f_i and f_j are the frequencies of haplotypes h_i and h_j in the population group and F_ii, F_jj and F_ij are the frequencies of each possible pair of haplotypes h_i and h_j in the population group.
J. INFERRING HAPLOTYPES AT AMBIGUOUS SITES
Another aspect of the invention provides methods and tools to infer haplotypes from every genotype, despite the presence of ambiguous polymorphic sites (sites where data is absent). As described elsewhere herein, the input to the program for each genotype measurement is a set of three probabilities, one each for a homozygous common allele, for a heterozygote and for a homozygous rare allele. If no data is available at all, in the HAP™ Builder program as currently implemented these probabilities default to 0.25, 0.5, and 0.25, respectively. The program accommodates these probabilities and still generates the most likely haplotype pair assignments. The missing genotypes can then be inferred by combining the appropriate alleles from the assigned pair of haplotypes.
K. END CONDITIONS
The iterative calculation and re-calculation of pair scores, probability scores, and haplotype frequency scores leads to convergence of these values toward certain limits. It is not necessary to allow these limits to be reached in order to make use of the invention, since at some point the assignments of haplotypes and haplotype pairs that the practitioner can make based on the scores will not be altered by further iterations. For this reason the practitioner of this invention may specify end conditions that will trigger the termination of iterations by the HAP™ Builder program. Alternatively, the practitioner may use his or her own judgment and terminate the iterations at will when the iterations have produced a satisfactory result.
Examples of suitable end conditions are:
(a) the values of all f_i, p_k, and/or F_k in consecutive iterations differ by less than a preselected amount, or differ by less than a preselected percentage,
(b) a preset number of iterations have been carried out,
(c) the rank order of the F_k for the haplotype pairs under consideration does not change in consecutive iterations.
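The three example end conditions can be expressed as small predicates. In the sketch below the `mode` argument, the default tolerance, and the default iteration limit are all hypothetical; the dictionaries passed in may hold whichever quantity (f_i, p_k, or F_k) the practitioner chooses to monitor.

```python
def max_abs_change(old, new):
    """Largest absolute difference between two consecutive sets of values."""
    return max(abs(new[key] - old[key]) for key in new)


def same_rank_order(old_scores, new_scores):
    """True when the keys, sorted by descending score, come out in the same order."""
    def rank(scores):
        return sorted(scores, key=scores.get, reverse=True)
    return rank(old_scores) == rank(new_scores)


def end_condition_met(mode, iteration, old, new, tolerance=1e-6, max_iterations=50):
    """Test one end condition: 'tolerance' (a), 'count' (b) or 'rank' (c)."""
    if mode == "tolerance":
        return max_abs_change(old, new) < tolerance
    if mode == "count":
        return iteration >= max_iterations
    if mode == "rank":
        return same_rank_order(old, new)
    raise ValueError(f"unknown mode: {mode}")


old_F = {"pair_1": 5.0, "pair_2": 1.0}
new_F = {"pair_1": 4.0, "pair_2": 2.0}
print(end_condition_met("rank", 3, old_F, new_F))       # ranking unchanged
print(end_condition_met("tolerance", 3, old_F, new_F))  # values still moving
```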
In the particular embodiment exemplified herein, an end condition is tested for after a new set of haplotype frequency scores have been iterated (see Figure 4). However, it will be apparent that an end condition can be tested for at any point during the iterations, and such alternative embodiments are considered part of the invention. For example, if the end condition is a function of the pair scores, it will be appropriate to test for the end condition after a new set of pair scores have been iterated. Where operator intervention, based upon human judgment, is the means for terminating the iterations, the iterations can of course be ended at any point.
L. SUMMARY OF THE MAJOR ADVANTAGES OF PREFERRED EMBODIMENTS OF THE INVENTION
1) The use of cut-offs to speed up the process. This happens in one or several places: (a) the Hap Expansion only considers contributions down to a certain probability cut-off, (b) only a limited number of haplotype pair assignments are kept, and (c) haplotypes whose frequency falls below a certain frequency threshold are dropped from the next iteration. Pruning is the preferred way to effect cut-offs (a) and (b).
2) The assignment of "competitive quality scores" based on the ranked list of possible assignments. This accounts well for ambiguous calls, where the score will be close to 0.5 because there are two equally likely solutions to the problem.
3) The "error detection and correction" aspect. A genotype is never assumed to be exactly the one that was measured; it could be any genotype, each with a weighted probability. For example, an A is not just an A, it is an A with a probability of (1-p)^2, an A/G with a probability of 2p(1-p), or even a G with a (vanishing) probability of p^2. Here, p is the "error probability", and is usually close to 0. A value of 0.01 is employed in the current preferred embodiment, corresponding to a (probably exaggerated) accuracy of 99% in calling the genotypes. In the Hap Expansion, these probabilities are used as weights, so that, for example, the genotype A G C/T, which would normally expand into the haplotype pair AGC+AGT, may expand into something like this:
0.94(AGC + AGT) + 0.01(AGC + CGT) + 0.01(CGC + AGT) + ...
Obviously, there are a very large number of possibilities, most of them with very small weights. A simple recursive pruning method is used to find all the contributions above a certain threshold weight, currently set at 0.01, such that only single error possibilities are used. A similar pruning algorithm is used for the haplotype pair assignment, where assignments are made that do not exactly fit the genotype, with the appropriate low weight.
4) Integration of family data, Hardy-Weinberg equilibrium. Family transmission and Hardy-Weinberg equilibrium may be checked as part of the iteration procedure, which should increase the accuracy of the calls even for those assignments made for individuals that are not related.
IV. EXAMPLE
Genotype data for input into the HAP™ Builder program may be generated by the practitioner by sequencing DNA from a population of interest, or may be obtained from various commercial sources of genotype data such as commercial SNP database providers. Publicly available SNP databases may also be used, such as for example the Human Genic Bi-Allelic Sequences database (HGBASE), the dbSNP database maintained by the National Center for Biotechnology Information, and the Human SNP database maintained by the Whitehead Institute at the Massachusetts Institute of Technology. These public databases are readily accessible via the internet. The data is suitably formatted when stored in a DecoGen™ database as described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
In the present invention, a person may use a user terminal to view a screen which allows the user to see all of the candidate genes, or a subset thereof, and to bring up further information. This screen (as well as all the other screens described herein) may, for example, be presented as a web page, or a series of web pages, from a web server. This web based use may involve a dedicated phone line, if desired. Alternatively, this screen may be served over the network from a non-web based server or may simply be generated within the user terminal. An example of such a screen referred to herein is illustrated in the top half of Figure 5.
The top half of Figure 5 is an example of a screen showing a set of candidate genes for which polymorphism data has been obtained or is in the process of being obtained. This polymorphism data and other information described below may be stored in a database such as the one described in U.S. application serial no. 60/141,521 and in international application WO 01/01218, or is calculated from information stored in such a database. Most of the information shown in later figures is specific to the Index Repository described herein.
The screen shows genes for which data is currently available in a database useful in the invention and those queued for processing (and for which data will appear in the database). The "Row" column indicates the order in which genes were entered into the database, while the "Id" column is a numerical identifier for the gene having the symbol and name indicated in the "Symbol" and "Name" columns. The columns on the right side of the screen indicate various stages in the process of analyzing target regions of the gene identified in the corresponding row. For example, "Anno" is shorthand for "Annotation", which is the operation performed at the beginning of the gene analysis process to annotate different features of the gene structure, such as the locations and sequences of the promoter, exons and introns as described in more detail below. The number in the Anno column provides the number of different annotated features of the gene. The "PCR" and "Sequ" columns indicate how many of the target regions of the gene have been analyzed successfully by the PCR and Sequencing production groups, respectively. The number of polymorphic sites identified for the gene is shown in the "Geno" column. Similarly, the number of haplotypes deduced by the HAP™ Builder method of the present invention is shown in the "Haplo" column. The various colors provide an immediate visual indicator of the status of the gene at each stage of analysis, with green and yellow indicating completely done and in progress, respectively, and white indicating that no target regions have reached that stage in the analysis process. Alternatively, the status of genes in the different production stages may be indicated by different degrees or types of shading.
The genes in the database may be sorted by various criteria by clicking on any of the columns shown in the top half of Figure 5, e.g., clicking on "Id" allows the genes to be sorted in ascending or descending numerical order, clicking on "Name" allows the genes to be sorted in alphabetical order, and clicking on "Sequ" allows the genes to be sorted by number of fragments. The user can select a gene to examine in detail by using the mouse (or other user-input device such as keyboard, roller ball, voice recognition, etc.) to select the candidate gene. In the example depicted in Figure 5, the prodynorphin gene is selected, as indicated by the purple color in what is shown in Row 408 of the figure. The screen may optionally include a "find" feature, to locate a candidate gene of interest. In the exemplified screen, a single click on the selected gene brings up the screen shown in the bottom half of Figure 5, which provides sequencing workflow information, i.e., numerical workflow identifiers for the sequencing and PCR reactions ("Run" and "PCR" columns), in both forward and reverse directions ("Dir" column), that have been performed for various fragments from each of the target regions of the gene (for example, fragment exon 3.1 from exon 3). A check in the "Ready" column indicates when a gene fragment is ready to be analyzed for polymorphisms and the "Status" column indicates whether there is sequencing information for both strands of the fragment. Such information and screens are not necessary for using the methods of the present invention, but may be used to monitor the progress and/or extent of sequencing of candidate gene(s) (or other loci) input into the database and may be useful in providing an estimate of the reliability of the sequence data which has been input into the database. Decisions about whether or not to proceed with polymorphism analysis in one or more of the fragments of the selected gene may be based on the status of the sequencing runs.
For example, if sequence information is available for both strands, the sequence will be more reliable and, therefore, so will the polymorphism data.
Figure 6 shows an example of the annotation screen, which is reached by clicking on "Anno" in the screen depicted in Figure 5. As indicated in Figure 6, the PDYN gene contains 10 features, each of which has the indicated length and the indicated start and stop positions with respect to the indicated Accession number. The Accession number is typically the GenBank Accession number for the gene, although it may be an identifying number from another publicly available database or an internal identifying number. If the complete gene sequence is not known, the "Accession" column may contain multiple identifying numbers for partial sequences. A check in the "Rev" column indicates the coding sequence for the gene is found in the reverse complement of the Accession number. The "Seqlen" column indicates the number of nucleotides entered into the "Sequence" box at the end of the row. The amount of sequence shown may be increased by enlarging the window; the entire sequence for a feature may be displayed by clicking on the particular sequence of interest. The information contained in the "Anno" screen is typically derived from GenBank and other public data sources. In the screen exemplified in Figure 5, a single click on the haplotype ("Haplo") column in that row brings up the screen for the HAP™ Builder program, an example of which is shown in Figure 7.
The screen exemplified in Figure 7 shows several boxes at the same time, although one or more of the boxes may be expanded by dragging the dividers between the boxes. The window on the left (labeled "Family Objects" in Figure 7) will typically show a list of the different multi-generation families available for polymorphism analysis and relevant information concerning each family, such as numerical identifiers for the father and mother, and the number of children ("siblings"). This window will typically show a family tree below the list of families. Males are shown as rectangular boxes and females are shown as ovals. Family 1333 is selected in the box on the upper left side; therefore, the family tree for that family is displayed. Family trees for other families may be displayed by clicking on the name of the desired family in the top of the window. If nothing had been clicked on, Family 13291 would have been the default family tree displayed.
The screen exemplified in Figure 7 will typically also show a box that provides information about the polymorphism data for a selected gene (labeled "ScoredPolymorphism Objects" in top right side of Figure 7). Each row contains information for a different polymorphic site (PS) identified in the gene from a population (a group of people whose nucleotide sequences have been examined for this gene). In this example, the screen indicates that eleven PS were detected in the PDYN gene. The "Region" column indicates the region in the gene where the polymorphic site is located (e.g., the promoter, the first intron, the first exon, etc.). The number in the first "Pos" column indicates the location of the polymorphism in the indicated region of the gene, while the number in the second "Pos" column indicates the location of the polymorphism in the genomic sequence, based on the numbering of the Accession sequence. The Accession number is preferably the same Accession number as presented in the "Anno" screen, although it may be a different number. The rows can be sorted by clicking on "Row", "Position" or "Accession". Clicking on "Row" orders the gene from 5' to 3'. The "Change" column typically contains the identity of the alternative nucleotides observed at the indicated PS and, for those polymorphisms which result in amino acid variation, the identity of the alternative amino acids. In the screen exemplified in Figure 7, the "Wild" column contains the number of individuals in the analyzed population homozygous for the wild-type, or the most common allele or reference allele. Similarly, the "Mut" column contains the number of individuals homozygous for the least common allele or uncommon variant allele, and the "Het" column contains the number of individuals heterozygous at that PS. The most and least common nucleotide (or encoded amino acid) at each site is defined by looking at the genotypes of all individuals in the population at that particular site. 
The nucleotide that shows up most often is called the most common nucleotide. The one that shows up less often is termed the least common. In situations where more than 2 nucleotides are seen at a site (which is rare but not unknown in human genes) all nucleotides except the most common one are lumped together in the least common category. The "Err" column indicates the number of individuals in which the variation in the "Change" column may have been incorrectly determined. Checking a box in a row under the "Accept" column indicates that the haplotype is to include genotype information for the polymorphic site in that row. When a box under the "Accept" column is not checked, the genotype information concerning the polymorphic site described in that row will not be considered in the haplotype analysis for each of the individuals. For example, if a genotype has only one uncommon variant nucleotide (and, therefore, is not very informative for purposes of haplotype building), or if the genotype containing the polymorphism occurs in only one person, or does not obey Hardy-Weinberg equilibrium, it may be excluded from the analysis by not checking the relevant box for the polymorphism in the "Accept" column. In addition, the screen exemplified in Figure 7 displays the polymorphism frequency calculated for various groups of the analyzed population. In the screen, the different population groups are African American (AF), Asian (AS), Caucasian (CA), primate (PT; one chimpanzee individual named "Harv") and other (OT; three native American individuals). The PDYN data set shown in Figure 7 includes five "chimp-specific" polymorphic sites, i.e., the human individuals examined were all monomorphic at the position, but the chimpanzee had at least one alternative allele at that position.
The rows containing these "chimp-specific" sites may be removed from this window by selecting the "Edit" button in the top left corner of the screen (which brings up the pull-down menu illustrated in Figure 8), then selecting "De-HARN", which unchecks the appropriate boxes in the "Accept" column (as shown in Figure 9), and then selecting "Filter Polymorphisms". The resulting human polymorphic sites for the PDYN data set are shown in Figure 10. As indicated by comparing the "Scored Haplotype Objects" boxes of Figure 9 and Figure 10, the number of possible haplotypes expanded from the diplotypes goes down significantly (54 to 18) after the De-HARN and polymorphism filtering steps were carried out for PDYN. In one embodiment, selecting "De-HARN" also hides the PT column. Alternatively, the program could be configured so that hiding the PT column and filtering of the "chimp-specific" sites would be accomplished in a one-step operation.
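The per-site tallies and the chimp-specific filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual code: genotypes are taken to be complete (missing data is not handled), and the names `score_site` and `is_human_monomorphic` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # the two alleles observed at one site in one individual


def score_site(genotypes: List[Genotype]) -> Dict[str, int]:
    """Tally the "Wild", "Het" and "Mut" classes at one polymorphic site."""
    allele_counts = Counter(a for g in genotypes for a in g)
    most_common = allele_counts.most_common(1)[0][0]
    # Homozygous for the most common allele.
    wild = sum(1 for a, b in genotypes if a == b == most_common)
    # All alleles other than the most common one are lumped together,
    # so "Mut" means that neither allele is the most common one.
    mut = sum(1 for a, b in genotypes if a != most_common and b != most_common)
    het = len(genotypes) - wild - mut
    return {"Wild": wild, "Het": het, "Mut": mut}


def is_human_monomorphic(human_genotypes: List[Genotype]) -> bool:
    """True if every human individual carries the same single allele at the site.

    Such a site can only be polymorphic because of the chimpanzee sample,
    so it would be a candidate for unchecking in the "Accept" column.
    """
    return len({a for g in human_genotypes for a in g}) <= 1
```

For instance, the genotypes A/A, A/G, G/G and A/A give two "Wild", one "Het" and one "Mut" individual, with A as the most common allele.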
The box in the middle right side of the screen shown in Figure 10 labeled "Scored Diplotype Objects" provides the genotype at each of the selected (accepted) polymorphic sites for each individual in the population being examined. For example, in this screen, the genotype data is shown for each of the 6 human polymorphic sites selected in the screen at the top of the figure for the PDYN gene in the indicated individuals from the Index Repository. Each row contains genotype information for a different individual and the genotypes for additional individuals in the population may be accessed by scrolling up and down, or by enlarging the window. The empty cells colored pink indicate those polymorphic sites for which sequence information is not present in the database. The "Subject" and "Eth" columns list the numerical identifier and ethnicity (i.e., population group) for the individual, respectively, using the same two-letter codes for the population groups described above. The "Hap1" and "Hap2" columns are empty in Figure 10, but during the haplotype assignment process described above, these columns will indicate the most likely resolved haplotypes for the genotype for each individual in each row, based on the pair frequency score F_k determined for that pair by the method described herein, and listed in the "Score" column after each iteration of the haplotype assignment phase. As mentioned previously, this screen initially appears when the user clicks on the "Haplo" button in the screen shown in Figure 7. To begin the HAP™ Builder process, the user selects the "Assign" command in the pull-down menu in Figure 10 (not shown). Examples of screens showing the result following one or more iterations of the haplotype pair assignment phase are shown, e.g., in Figures 12 and 13.
The numbers in the "Hap1" and "Hap2" columns in the screens correspond to the HAP ID numbers in the window labeled "Scored Haplotype Objects" in the lower right side of the screens shown in Figures 12 and 13. For example, for the genotype for individual UP018 in row 85 of the window in the middle right side of the screen in Figure 12, the number 2 appears in the "Hap1" column and the number 7 appears in the "Hap2" column. This indicates that the initial most likely resolved haplotypes for individual UP018 are GCCTAG and ACCCAG, identified with ID numbers 2 and 7 respectively, in the window in the lower right hand side of the screen. Compare the resolved haplotypes for individual UP018 in Figure 13, after "Assign" has been selected a number of times. The number 1 appears in the "Hap1" column and the number 2 appears in the "Hap2" column. The score for this assignment is lower in Figure 13 than in Figure 12 and the genotype at polymorphic site 1 is highlighted in red as a possible "error" (e.g., a possible sequencing error).
The window labeled "Scored Haplotype Objects" (shown, e.g., in the lower right side of the screen exemplified in Figure 13) provides the different haplotypes determined for the selected (accepted) polymorphic sites for the selected gene in the examined population. Each row contains a unique haplotype, with the current haplotype frequency score f_j of each haplotype listed in the "Score" column. The number of times each haplotype is seen in the entire population and in the various population groups is indicated in the "Count" and following six columns, respectively, where "AF", "AS", "CA", "HL" and "OT" are as described above. The information in this window can also be sorted by haplotype frequency score f_j, by clicking on "Score". In other embodiments, the PT and OT columns may be hidden manually or not considered in the HAP™ Builder process.
The "Information Entropy" shown at the top of the "ScoredHaplotypes Objects" window is a measure of the amount of variability of the locus. It measures the amount of information (in bits) that is needed to specify the genotype at the locus. If a locus has only one possible haplotype, there is only one possibility and the information entropy is zero. If there are four equally likely haplotypes, 2 bits of information are needed to specify which of the four is present. The general formula is
$$E = -\frac{1}{\ln 2} \sum_i k_i \ln k_i \qquad \text{(Formula 6)}$$
where k_i is the probability for a given possibility of outcome and the sum is over all possibilities. For a single polymorphism, there are only two possibilities, and the information entropy depends on the allele frequency k as
$$E_1 = -\frac{1}{\ln 2} \left( k \ln k + (1 - k) \ln(1 - k) \right). \qquad \text{(Formula 7)}$$
If the polymorphism is balanced (k = 0.5), E_1 becomes one. If it is rare (k ≈ 0), E_1 approaches zero. The first number shown in the "ScoredHaplotype objects" box is the information entropy of the locus as calculated from the possible haplotypes and their frequencies. The second number is the same quantity under the (erroneous) assumption that all polymorphisms are independent of each other. The former is never larger than the latter, and the difference indicates the degree to which the polymorphisms are linked. The largest possible information entropy is the number of polymorphisms N (reached if all N polymorphisms are balanced and independent of each other, or, in other words, if all 2^N possible haplotypes are equally likely); more typically the values are between 0.5 and 3.
A large information entropy for a locus indicates greater variability, i.e., more haplotypes exist, and thus this locus may be more useful in finding associations with phenotypes than a locus with a smaller information entropy. This information is not used in building haplotypes.
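The two entropy figures can be reproduced with the short Python sketch below. It assumes that haplotypes are given as equal-length strings mapped to frequencies that sum to one; the function names are hypothetical and are not taken from the HAP™ Builder program.

```python
import math
from typing import Dict, Sequence, Tuple


def information_entropy(probabilities: Sequence[float]) -> float:
    """Formula 6: entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0) / math.log(2)


def site_entropy(k: float) -> float:
    """Formula 7: entropy in bits of a single two-allele polymorphism."""
    return information_entropy([k, 1.0 - k])


def locus_entropies(hap_freqs: Dict[str, float]) -> Tuple[float, float]:
    """Return (entropy from haplotype frequencies, entropy assuming independent sites)."""
    linked = information_entropy(list(hap_freqs.values()))
    n_sites = len(next(iter(hap_freqs)))
    independent = 0.0
    for pos in range(n_sites):
        # Allele frequencies at this position, summed over all haplotypes.
        allele_freq: Dict[str, float] = {}
        for hap, f in hap_freqs.items():
            allele_freq[hap[pos]] = allele_freq.get(hap[pos], 0.0) + f
        independent += information_entropy(list(allele_freq.values()))
    return linked, independent
```

With two fully linked sites, for example `{"AC": 0.5, "GT": 0.5}`, the first value is 1 bit and the second is 2 bits; the 1-bit difference reflects the complete linkage between the two sites.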
Selecting the "Edit" menu in the top left corner of Figure 8, for example, brings up the menu shown, having the following command selections: "Assign"; "New Locus"; "De-HARN"; "Filter Polymorphisms"; "Filter Haplotypes"; "Store"; and "Export". Each invocation of the "Assign" command causes an additional iteration of the above-described haplotype assignment method to be carried out. Selecting "New Locus" clears out the scores and haplotype assignments and fills the "ScoredPolymorphism objects" box with data for all available polymorphisms for the locus. Selecting "De-HARN" removes the "Accept" checkmarks from those polymorphisms that are specific to the chimpanzee, i.e., those which are monomorphic in the human population. This selection is usually made when using the HAP™ Builder program, but does not need to be. The individual "Accept" checkmarks can also be modified manually. Selecting "Filter Polymorphisms" will eliminate from the list and from the analysis all polymorphisms which are not checked in the Accept column. Simultaneously, the Hap Expansion is performed and the resulting haplotypes are displayed in the "ScoredHaplotype objects" box. Selecting "Filter Haplotypes" allows the user to eliminate those haplotypes from the "ScoredHaplotype objects" box which have not been assigned as top choice to any individual. Selecting "Store" stores all the information into a database. This includes the list of haplotypes, the haplotype frequencies, and the haplotype pair assignments and assignment scores. Selecting "Export" allows the user to write the data into a text file, from which it can be read into a spreadsheet program or otherwise stored or transmitted. Clicking on the "Assign" command in the Edit Menu in Figure 10 updates the boxes shown in the middle and lower boxes of Figure 11.
In the middle box on the right side of Figure 9, haplotypes have been assigned to each genotype (i.e., the "-"s have been replaced by haplotype ID numbers in the "Hap1" and "Hap2" columns), and pair frequency scores have been assigned. In the bottom box on the right side of Figure 11, haplotype frequency scores f_j have been assigned, as well as other information. There are 18 "Scored Haplotype Objects" shown in Figure 10, but only 11 "Scored Haplotype Objects" shown in Figure 11 because all haplotypes below the Hap frequency threshold of 0.1 have been dropped.
Clicking on 1333 in the Family Objects box in, e.g., Figures 11, 12 or 13 brings up the Family shown in the box on the top left side of those figures. The haplotypes assigned to the members of the family are indicated below the family members which were included in the HAP™ Builder process. Figure 12 shows a screen on the bottom left labeled "HapPair Objects" which results when subject UP002 is selected in the screen in the "ScoredDiplotypes Objects" box in the middle right of Figure 12. This contains the 15 most likely haplotype pairs for the individual UP002 based on the current haplotype pair scores F_k which are shown in the center right box. The pairs are shown with their pair probability scores in the "Score" column in the upper left box labeled "Hap Pair Objects". The "Err" column indicates the number of positions at which the haplotype pair is not consistent with the measured genotype. If a pair in the list is clicked on, it will rise to the top of the list and the selected individual is assigned that pair with a probability score of 1. Figure 13 shows the changes that occur to the screens after reiterations of steps (e) through (g) of the Haplotype Assignment phase of the invention described above. This occurs when the "Assign" command in the Edit menu of Figure 12 is invoked. Note the revised Scores in the boxes on the right side of the screen. A revised Score would also be visible on the HapPair Objects box (not shown). Figure 13 shows the changes that occur to the screens after the iteration process is completed (following multiple selections of "Assign" and optional manual interventions), and the "Filter Haplotypes" option is selected. Only six Scored Haplotype Objects are shown in Figure 13 as compared to eleven in Figure 12, because all haplotypes not assigned to at least one individual have been dropped. Missing genotype data appears as blanks in the "Scored Diplotype Objects" box. Figures 14 and 15 show such blanks for the ABCB1 gene.
The header of the center right box indicates that there are 10 warnings, flagged by boxes highlighted in pink in the ScoredDiplotype Objects box, and 5 errors, flagged by boxes highlighted in red. (Not all rows are visible in the Figures.) Figure 14 shows a situation where the assigned haplotypes do not obey Mendelian segregation in one of the families (Family 1333). The individual UP002, whose symbol has been flagged by red coloring, has been called as a 1,1 genotype (homozygous for haplotype 1). She could have inherited one haplotype 1 from her father, but could not have inherited the other from her mother, since her mother was called as a 4,6 genotype. The operator may conclude that the mother should be assigned a different pair such as 1,4 or 1,6; or may conclude that a different pair containing at least one copy of haplotype 1 needs to be assigned to the grandmother (UP018).
Figure 15 shows how manual intervention can be used to fix the problem. The "HapPair Objects" window at the top of the figure has been brought up by clicking on UP002 (Row 92 in the "ScoredDiplotype Objects" box). By clicking on the second pair (Row 2) in this window, haplotype pair 1,6 can be assigned to subject UP002, and the requirements of Mendelian inheritance can be satisfied. The flag (red color) will then disappear from the family tree, but an additional error will appear in the "ScoredDiplotype Objects" window at the position which had to be overridden to accommodate the non-matching pair. In this particular case, it is more likely that the (6,4) haplotype assignment of the grandmother (UP018) is incorrect, since neither her daughter nor any of her grandchildren have either the 4 or the 6 haplotype. Manual intervention is useful to address such ambiguities.
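The Mendelian segregation check behind the red flag can be illustrated as follows. This is a simplified sketch that ignores recombination within the locus and new mutations; the function name and the father's pair used in the usage note are hypothetical.

```python
from typing import Tuple

Pair = Tuple[int, int]  # the two haplotype ID numbers assigned to one individual


def mendelian_consistent(child: Pair, father: Pair, mother: Pair) -> bool:
    """True if the child could have received one haplotype from each parent."""
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)
```

Taking a father with the hypothetical pair (1, 3), a child called as (1, 1) is inconsistent with a mother called as (4, 6), whereas reassigning the child to (1, 6), or the mother to (1, 4) or (1, 6), removes the conflict.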
The information that is stored in a database, such as a database associated with the DecoGen™ program exemplified herein, includes (1) the positions of one or more, preferably two or more, most preferably all, of the sites in the gene locus (or other loci) that are variable (i.e., polymorphic) across members of the reference population and (2) the nucleotides found for each individual's 2 haplotypes at each of the polymorphic sites. Preferably, it also includes individual identifiers and ethnicity or other phenotypic characteristics (such as age, gender or clinical information, if any) of each individual.
In the preferred embodiment of the invention, the haplotypes, their frequencies, and other information about each of the members of the population being analyzed, are stored and displayed, preferably in the manner shown, e.g., in Figures 7-15. The information shown in Figures 7-15 includes a unique identifier (shown in the "Subject" column), ethnicity, genotype, and (in Figures 11-15) the 2 haplotypes predicted for each individual. Only some of the individuals are visible in the screen. Scrolling up or down with the scroll bar brings information for other individuals into view. The subjects seen in these figures are from a reference population of healthy individuals.
V. TOOLS OF THE INVENTION
The methods of the invention preferably use a tool called the DecoGen™ program described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference. The tool consists, in part, of:
a. One or more databases that contain (1) genotypes (or haplotypes) for a gene (or other loci) for many individuals (i.e., people, animals, plants, etc., depending on the application) for one or more genes and, optionally, (2) a list of the names or functions of the genes (or other loci), whose functions can be, but are not limited to: disease causation, drug response, plant yields, plant disease resistance, plant drought resistance, plant interaction with pest-management strategies, etc. The databases could include information generated either internally or externally (e.g., GenBank). Examples of databases which may be used in the present invention are described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference.
b. A set of computer programs that analyze and display the relationships between the genotypes and the haplotypes for an individual.
The methods of the invention preferably also use a tool called the HAP™ Builder Program. Specific aspects of this tool which are novel include:
a. A new genotype-to-haplotype method that allows the user to infer an individual's haplotypes or sub-haplotypes for a given gene. The steps required for this to work are (a) determine the haplotype (or sub-haplotype) frequencies from the reference population by expanding the genotypes of a reference population; (b) optionally, correct the observed frequencies to conform to Hardy-Weinberg equilibrium and/or Mendelian inheritance (unless it is determined that the deviation is not due to sampling bias, sequencing error or questionable paternity); and (c) use the statistical approach described in this application (and shown schematically in Figures 2-4) to predict individuals' haplotypes or sub-haplotypes from their genotypes.
b. A method of displaying measurements of the probability of the correctness of the assignment of haplotypes or sub-haplotypes to individuals, as well as the ability to manually change the genotype, the haplotype pair assignments, and the probability of the assignments.
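Steps (a) and (c) of the genotype-to-haplotype method can be illustrated with a small Python sketch. It is not the HAP™ Builder implementation. It assumes complete genotypes (no missing sites) and takes the pair score to be the product of the two haplotype frequencies. It omits the optional corrections of step (b), any pair score criterion, and the weighting of possible genotyping errors. The names `consistent_pairs` and `assign_haplotypes` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Genotype = Sequence[Tuple[str, str]]  # one unordered allele pair per site
Pair = Tuple[str, str]                # two haplotype strings, sorted


def consistent_pairs(genotype: Genotype) -> List[Pair]:
    """Enumerate the unordered haplotype pairs that exactly fit a genotype."""
    # Only heterozygous sites offer a choice of phase.
    options = [(0, 1) if a != b else (0,) for a, b in genotype]
    pairs = set()
    for choice in product(*options):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def assign_haplotypes(
    genotypes: List[Genotype], iterations: int = 20
) -> Tuple[Dict[str, float], List[Pair]]:
    """Iteratively estimate haplotype frequencies and the best pair per genotype."""
    candidates = [consistent_pairs(g) for g in genotypes]

    # Initial frequencies: each genotype spreads equal evidence over its pairs.
    freq: Dict[str, float] = defaultdict(float)
    for pairs in candidates:
        for h1, h2 in pairs:
            freq[h1] += 1.0 / len(pairs)
            freq[h2] += 1.0 / len(pairs)
    total = sum(freq.values())
    freq = {h: f / total for h, f in freq.items()}

    posteriors: List[List[float]] = []
    for _ in range(iterations):
        revised: Dict[str, float] = defaultdict(float)
        posteriors = []
        for pairs in candidates:
            # Assumed pair score: product of the two haplotype frequencies.
            scores = [freq[h1] * freq[h2] for h1, h2 in pairs]
            norm = sum(scores)
            probs = [s / norm for s in scores]
            posteriors.append(probs)
            # Each pair contributes its probability to both of its haplotypes.
            for (h1, h2), p in zip(pairs, probs):
                revised[h1] += p
                revised[h2] += p
        total = sum(revised.values())
        freq = {h: revised.get(h, 0.0) / total for h in freq}

    best = [max(zip(probs, pairs))[1] for probs, pairs in zip(posteriors, candidates)]
    return freq, best
```

In a toy population with one AC/AC homozygote, one GT/GT homozygote and one double heterozygote (A/G, C/T), the homozygotes raise the frequencies of AC and GT, so the double heterozygote is resolved as the pair (AC, GT) instead of (AT, GC).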
VI. DATA/DATABASE MODEL
The preferred embodiment of the present invention uses a relational database which provides a robust, scalable and releasable data storage and data management mechanism. The computing hardware and software platforms, with 7x24 teams of database administration and development support, provide the relational database with advantageous guaranteed data quality, data security, and data availability. The database model of the present invention provides tables and their relationships optimized for efficiently storing, searching and otherwise utilizing a genomics-oriented database.
A data model (or database model) describes the data fields one wishes to store and the relationships between those data fields. The model is a blueprint for the actual way that data is stored, but is generic enough that it is not restricted to a particular database implementation (e.g., Sybase™ or Oracle™). In the preferred embodiment of the present invention, the model covers the data required by, and/or generated by, the HAP™ Builder program. It contains at least 4 submodels which contain logically related subsets of the data. These relevant submodels, which are described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, are described below.
1. Gene Repository: This is the sub-model that describes the gene loci and their related domains. Preferably, it captures the information on gene, gene structure, species, gene map, gene family, therapeutic applications of genes, gene naming conventions and published literature including the patent information on these objects.
2. Population Repository: This is the part of the data model that encapsulates the patient and population information. Preferably, it covers the entities such as patient, ethnic and geographical background of patient and population, medical conditions of the patients, family and pedigree information of the patients, patient haplotype and polymorphism information and their clinical trial outcomes.
3. Polymorphism Repository: This is the part of the model that covers the haplotype and the polymorphism associated with genes and, preferably, patient cohorts used in clinical studies. The polymorphisms include those due to single nucleotide polymorphisms (SNPs), large and small insertions and deletions, RFLPs, repeats, frame shifts and alternative splicings.
4. Sequence Repository: Genetic sequence information in the form of genomic DNA, cDNA, mRNA and protein is captured by this data model as is the location relationship between the gene structural features and the sequences.
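As a rough illustration of how the four submodels might relate to one another, the following Python sketch defines one record type per repository. Every class and field name here is a hypothetical simplification chosen for this example; the actual tables and relationships are those described in the applications incorporated by reference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gene:
    """Gene Repository: one gene locus."""
    gene_id: int
    symbol: str
    name: str


@dataclass
class PolymorphicSite:
    """Polymorphism Repository: one polymorphic site of a gene."""
    gene_id: int          # refers to Gene.gene_id
    region: str           # e.g. promoter, exon or intron
    position: int
    alleles: List[str]


@dataclass
class Subject:
    """Population Repository: one individual and the haplotypes assigned to it."""
    subject_id: str
    population_group: str
    haplotype_ids: List[int] = field(default_factory=list)


@dataclass
class SequenceFeature:
    """Sequence Repository: location of a gene feature on an accession sequence."""
    gene_id: int          # refers to Gene.gene_id
    accession: str
    feature: str
    start: int
    stop: int
```

The shared `gene_id` field stands in for the foreign-key relationships that a relational implementation would declare between the corresponding tables.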
VII. BUSINESS MODELS
The haplotype and other data developed using the methods and/or tools described herein may be used in a partnership of two or more companies (referred to herein as the Partnership) to integrate knowledge of human population and evolutionary variation into the discovery, development and delivery of pharmaceuticals, in the ways described in U.S. application serial no. 60/141,521, filed June 25, 1999, and international application WO 01/01218, which are incorporated herein by reference. The database and analytical tools of the invention are envisioned to be useful in a variety of settings, including various research settings, pharmaceutical companies, hospitals, and independent or commercial establishments. It is expected that users will include physicians (e.g., for diagnosing a particular disease or prescribing a particular drug), pharmaceutical companies, generics companies, diagnostics companies, contract research organizations and managed care groups, including HMOs, and even patients themselves.
However, as discussed above, it is obvious that various aspects of the invention may be useful in other settings, such as in the agricultural and veterinary venues. The examples described herein illustrate certain embodiments of the present invention, but should not be construed as limiting its scope in any way. Certain modifications and variations will be apparent to those skilled in the art from the teachings of the foregoing disclosure and the following examples, and these are intended to be encompassed by the spirit and scope of the invention.
VIII. REFERENCES
1. Clark AG. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol. 7:111-122.
2. Clark, A.G., et al. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63:595-612.
3. Dempster, A.P., et al. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. J. R. Stat. Soc. [B] 39:1-38.
4. Excoffier L, Slatkin M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921-927.
5. Hawley ME, Kidd KK. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 86:409-411.
6. Hill WG. (1975) Tests for association of gene frequencies at several loci in random mating diploid populations. Biometrics 31:881-888.
7. Long JC, Williams RC, Urbanek M. (1995) An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet 56:799-810.
Modifications of the above described modes for carrying out the invention that are obvious to those of skill in the fields of chemistry, medicine, computer science and related fields are intended to be within the scope of the following claims.

Claims

What is Claimed is:
1. A method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
(a) obtaining a genotype for the polymorphic genomic region from each of the individuals;
(b) enumerating all possible haplotypes hi that are consistent with each genotype;
(c) assigning an evidence score si to each of the enumerated haplotypes hi;
(d) calculating an initial haplotype frequency fi for each haplotype among the possible haplotypes, wherein the initial haplotype frequency fi is a function of the evidence score si;
(e) determining for each genotype obtained in step (a) a pair score Fk for each pair of haplotypes that is consistent with that genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(f) calculating, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct;
(g) generating a revised haplotype frequency fi for each haplotype, wherein the revised haplotype frequency fi is a function of the probability pk for each consistent haplotype pair which contains the haplotype; and
(h) repeating steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (e) is replaced by the revised frequency fi determined in step (g).
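As a reading aid only, steps (a) through (h) of claim 1 can be sketched in Python. This is a minimal illustration under stated assumptions and not the applicants' implementation: genotypes are assumed to be diploid tuples of allele pairs with no missing data; the names `consistent_pairs` and `assign_haplotype_pairs` are hypothetical; the evidence score is taken as 2/2^n (claim 3); the pair score follows claim 8; every consistent pair is retained (no pair score criterion is applied); the probability pk is assumed to be the pair score normalised over the pairs consistent with the genotype; and the end condition is assumed to be stabilisation of the frequencies or an iteration limit.

```python
from collections import defaultdict
from itertools import product


def consistent_pairs(genotype):
    """Step (b): unordered haplotype pairs consistent with an unphased genotype.

    genotype is a tuple of (allele, allele) tuples, one per polymorphic site.
    """
    pairs = set()
    for phase in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = "".join(site[0] for site in phase)
        h2 = "".join(site[1] for site in phase)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def assign_haplotype_pairs(genotypes, max_iter=100, tol=1e-9):
    pairs_by_genotype = [consistent_pairs(g) for g in genotypes]
    total = 2 * len(genotypes)  # chromosomes sampled (diploid assumption)

    # Steps (c)-(d): evidence score 2 / 2**n per enumerated haplotype,
    # summed over individuals and normalised (normalisation is an assumption).
    freq = defaultdict(float)
    for pairs in pairs_by_genotype:
        haps = {h for pair in pairs for h in pair}
        score = 2 / len(haps)  # len(haps) == 2**n for n heterozygous sites
        for h in haps:
            freq[h] += score / total

    probs = []
    for _ in range(max_iter):  # step (h)
        revised = defaultdict(float)
        probs = []
        for pairs in pairs_by_genotype:
            # Step (e): pair score Fk = 2*fi*fj if i != j, otherwise fi**2.
            scores = [2 * freq[a] * freq[b] if a != b else freq[a] ** 2
                      for a, b in pairs]
            # Step (f): pk assumed to be Fk normalised over this genotype.
            norm = sum(scores)
            p = {pair: s / norm for pair, s in zip(pairs, scores)}
            probs.append(p)
            # Step (g): each pair contributes pk to both of its haplotypes.
            for (a, b), pk in p.items():
                revised[a] += pk / total
                revised[b] += pk / total
        change = max(abs(revised[h] - freq[h]) for h in list(freq))
        freq = revised
        if change < tol:  # assumed end condition: frequencies have stabilised
            break
    return dict(freq), probs
```

With three individuals typed at two sites, one homozygous A/C, one doubly heterozygous and one homozygous G/T, the sketch resolves the ambiguous individual toward the pair built from the two haplotypes already observed unambiguously.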
2. The method of claim 1, wherein the evidence score si obeys a formula selected from the group consisting of:
$$0 < s_i < 2 \quad\text{and}\quad \sum_{i=1}^{n} s_i = 2,$$
where n is the number of ambiguous positions in said genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which nucleotide sequence information is lacking.
3. The method of claim 2 wherein the evidence score si is $2/2^{n}$, wherein n is the number of ambiguous positions in said genotype, with the proviso that if the polymorphic genomic region is haploid or hemizygous in the individual, an evidence score of 1 is assigned.
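The evidence score of claim 3 can be illustrated with a short Python sketch. The function name `evidence_score`, the tuple-of-allele-pairs genotype layout, and the use of `None` for a missing allele call are assumptions made for this example only.

```python
def evidence_score(genotype, haploid=False):
    """Evidence score 2 / 2**n, where n counts ambiguous sites.

    A site is ambiguous if it is heterozygous or if allele information
    is missing (represented here, by assumption, as None).
    """
    if haploid:
        # Proviso of claim 3: haploid or hemizygous regions score 1.
        return 1.0
    n = sum(1 for a, b in genotype if a is None or b is None or a != b)
    return 2 / 2 ** n
```

Under these assumptions a fully homozygous genotype scores 2, one ambiguous site gives 1, and two ambiguous sites give 0.5, so each additional ambiguous site halves the evidence that the genotype lends to any single haplotype.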
4. The method of claim 1 wherein the initial frequency fi is calculated from the sum of the evidence scores across all the different individuals, for each of the possible haplotypes hi.
5. The method of claim 1 wherein the enumerating step is applied only to each genotype that meets an ambiguity criterion.
6. The method of claim 5, wherein the ambiguity criterion is a function of the number of ambiguous polymorphic sites in the genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which information is lacking.
7. The method of claim 1 wherein the pair score criterion is chosen from the group consisting of (a) a specific numerical cutoff; (b) a function of the values of the pair scores; and (c) a function of the rankings of the pair scores.
8. The method of claim 1, wherein the pair score $F_k = 2 f_i f_j$ if $i \neq j$, and otherwise $F_k = f_i^2$, with the proviso that the pair score $F_k = f_i$ when the polymorphic genomic region is haploid or hemizygous, where fi and fj are the haplotype frequencies for the haplotypes hi and hj in the pair.
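The pair score of claim 8 has the same form as the genotype proportions expected under Hardy-Weinberg equilibrium. A minimal Python sketch follows; the function name `pair_score` and the convention of passing a single haplotype for the haploid or hemizygous case are assumptions of this example.

```python
def pair_score(freq, hi, hj=None):
    """Pair score Fk from haplotype frequencies.

    freq maps haplotype -> frequency. Passing only hi models the
    haploid or hemizygous proviso (an interface chosen for this sketch).
    """
    if hj is None:
        return freq[hi]                 # Fk = fi
    if hi != hj:
        return 2 * freq[hi] * freq[hj]  # Fk = 2 * fi * fj
    return freq[hi] ** 2                # Fk = fi ** 2
```

For example, with frequencies 0.5 and 0.25 for two haplotypes, the heterozygous pair scores 2 x 0.5 x 0.25 = 0.25, and the homozygous pair of the first haplotype scores 0.5 squared, also 0.25.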
9. The method of claim 7, wherein the pair score criterion is a function of the rankings of the pair scores, and wherein the probability pk is calculated by:
(a) ranking each of the pair scores Fk by score, with the highest score first;
(b) disregarding all but the Nrank highest ranking assignments; and
(c) defining the probability pk as:
Figure imgf000058_0001
10. The method of claim 1, wherein the end condition is selected from the group consisting of: (i) steps (e) through (g) have been repeated a preset number of times; (ii) one or more of the parameters fi, Fk, and pk has stabilized; and (iii) the operator chooses to stop.
11. The method of claim 1, wherein the plurality of individuals includes at least one multi-generation family and the probability pk is reduced for each pair assignment for each genotype in the family that does not obey Mendelian inheritance.
12. The method of claim 11, wherein the reduced probability is reduced to 0 or is reduced by the formula pk(1 - pk'), where pk' is the probability calculated for a pair assignment of a related genotype.
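The two reductions named in claim 12 can be written out as a small Python sketch. The function name `reduce_for_mendelian_violation` is hypothetical, and treating the absence of a related-genotype probability as the reduce-to-zero option is an interface assumption of this example, not something the claim specifies.

```python
def reduce_for_mendelian_violation(p_k, p_k_related=None):
    """Reduce pk for a pair assignment that violates Mendelian inheritance.

    If no probability for the related genotype's pair assignment is
    supplied, the probability is reduced to 0; otherwise it is reduced
    to pk * (1 - pk').
    """
    if p_k_related is None:
        return 0.0
    return p_k * (1 - p_k_related)
```

Read this way, the more confident the conflicting assignment in the relative (pk' close to 1), the more strongly the inconsistent assignment is penalised.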
13. The method of claim 1, wherein the plurality of individuals comprises at least one population group and the probability pk is reduced for each pair assignment for each genotype in the population group that does not obey Hardy-Weinberg Equilibrium.
14. The method of claim 13, wherein the probability pk is reduced by the formula
$$p_k (F_{ii} - f_i^2)^2 + (F_{ij} - 2 f_i f_j)^2 + (F_{jj} - f_j^2)^2,$$
wherein fi and fj are the frequencies of haplotypes hi and hj in the population group and Fii, Fij and Fjj are the frequencies of each possible pair of haplotypes hi and hj in the population group.
15. The method of claim 1, wherein steps (a), (b) and (c) are performed for one individual at a time.
16. The method of claim 1, wherein steps (a), (b) and (c) are performed for each of the individuals in parallel.
17. The method of claim 1, wherein one or both of steps (e) and (f) are performed for one genotype at a time.
18. The method of claim 1, wherein one or both of steps (e) and (f) are performed for each genotype in parallel.
19. A method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising:
(a) obtaining the genotype for the polymorphic genomic region from the individual;
(b) enumerating all possible haplotypes hi for the genotype;
(c) providing a frequency fi for each of the possible haplotypes, where fi is determined by the method of claim 1;
(d) determining a pair score Fk for each pair of possible haplotypes hi that are consistent with the genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair; and
(e) assigning to the genotype the haplotype pair having the highest pair score Fk.
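The prediction steps of claim 19 can be sketched as follows, as an illustration only. It assumes that haplotype frequencies have already been estimated and are supplied as a dictionary, that the genotype is a diploid tuple of allele pairs with no missing data, and that the pair score has the form given in claim 8; the name `predict_pair` is hypothetical.

```python
from itertools import product


def predict_pair(genotype, freq):
    """Return the consistent haplotype pair with the highest pair score.

    genotype: tuple of (allele, allele) per site; freq: haplotype -> frequency.
    Haplotypes absent from freq are treated as having frequency 0.
    """
    best_pair, best_score = None, -1.0
    seen = set()
    for phase in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = "".join(site[0] for site in phase)
        h2 = "".join(site[1] for site in phase)
        pair = tuple(sorted((h1, h2)))
        if pair in seen:
            continue  # each unordered pair is scored once
        seen.add(pair)
        f1, f2 = freq.get(pair[0], 0.0), freq.get(pair[1], 0.0)
        score = 2 * f1 * f2 if pair[0] != pair[1] else f1 ** 2
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair, best_score
```

A doubly heterozygous genotype has two candidate pairs; the one whose member haplotypes are more frequent in the supplied table is selected.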
20. The method of claim 1, further including generating an error estimate.
21. A method of constructing a haplotype database for a population, comprising:
(a) determining haplotype data for a plurality of individuals from genotype information using the method of claim 1;
(c) organizing the haplotype data for the plurality of individuals into fields; and
(d) storing the haplotype data for the plurality of individuals according to the fields.
22. The method of claim 21, wherein the haplotype data comprises haplotype frequencies and haplotype pair scores for a polymorphic genomic region.
23. The method of claim 22, wherein the probabilities are reduced for haplotype pairs that do not meet the requirements of the Hardy-Weinberg equilibrium.
24. The method of claim 22 wherein the haplotype data further comprises probabilities that pair assignments are correct.
25. The method of claim 24, wherein the validating comprises correcting an observed distribution of haplotypes or haplotype pairs for effects imposed by a limited number of individuals in the population.
26. The method of claim 25, wherein the validating further comprises analyzing compliance of the observed distribution with Mendelian inheritance principles.
27. The method of claim 21, wherein the population is selected from the group consisting of a reference population, a clinical population, a disease population, an ethnic population, a family population and a same-sex population.
28. A method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising
(a) identifying a genotype for the individual;
(b) enumerating all possible haplotype pairs which are consistent with the genotype;
(c) determining a probability for each possible haplotype pair that the individual has that possible haplotype pair by accessing a database containing frequency data for reference haplotype pairs; and
(d) analyzing the determined probabilities to predict an individual's haplotype pair.
29. The method of claim 28, further comprising storing the haplotype pair.
30. The method of claim 29, further comprising generating an error estimate.
31. A computer implemented method for generating haplotype pair and haplotype frequency screens for display on a display device, comprising the steps of:
(a) displaying in a first area a plurality of selectable items each corresponding to a polymorphic site for a predetermined gene;
(b) selecting one or more of said selectable items;
(c) displaying in a second area the haplotype pairs occurring in a reference population for the selected polymorphic sites;
(d) displaying in a third area data indicative of haplotype frequencies for a plurality of member groupings within the population.
32. A computer system for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising: a database for storing genotyping information; a processor connected to the database; a computer program for controlling the processor connected to said database comprising instruction code to:
(a) accept input of a genotype for the polymorphic genomic region from each of the individuals and store said genotype within said database;
(b) enumerate all possible haplotypes hi consistent with each genotype and store said haplotypes hi within said database;
(c) calculate an evidence score si for each of said possible haplotypes hi and store said evidence score si within said database;
(d) calculate an initial haplotype frequency fi for each haplotype hi among the possible haplotypes, and store the haplotype frequency fi in said database, wherein the haplotype frequency fi is a function of the evidence score si;
(e) calculate for each genotype received in step (a) a pair score Fk for each pair of haplotypes that are consistent with that genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(f) calculate, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct and store the probability pk in said database;
(g) calculate a revised haplotype frequency fi for each of the haplotypes, wherein the revised haplotype frequency fi is a function of the probability pk for each consistent haplotype pair which contains the haplotype and storing the revised frequency fi in said database; and
(h) repeat steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (e) is replaced by the revised frequency fi determined in step (g) and stored in the database.
33. The computer system of claim 32 wherein the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user and stored in said database.
34. The computer system of claim 33 wherein the computer system is connected to the internet and the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user through the internet.
35. The computer system of claim 33 wherein the computer system is connected to the internet and the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user through electronic mail.
36. The computer system of claim 33 wherein the genotype for the polymorphic genomic region from each of the individuals is obtained from a database of one or more known genotypes.
37. A computer readable medium comprising instruction code to:
(a) accept input of a genotype for the polymorphic genomic region from each of the individuals and store said genotype within said database;
(b) enumerate all possible haplotypes hi consistent with each genotype and store said haplotypes hi within said database;
(c) calculate an evidence score si for each of said possible haplotypes hi and store said evidence score si within said database;
(d) calculate an initial haplotype frequency fi for each haplotype hi among the possible haplotypes, and store the haplotype frequency fi in said database, wherein the haplotype frequency fi is a function of the evidence score si;
(e) calculate for each genotype received in step (a) a pair score Fk for each pair of haplotypes that are consistent with that genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(f) calculate, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct and store the probability pk in said database;
(g) calculate a revised haplotype frequency fi for each of the haplotypes, wherein the revised haplotype frequency fi is a function of the probability pk for each consistent haplotype pair which contains the haplotype and storing the revised frequency fi in said database; and
(h) repeat steps (e) through (g) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (e) is replaced by the revised frequency fi determined in step (g).
38. The method of any one of claims 1-27, wherein all of the individuals in the plurality of individuals meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
39. The computer system of any one of claims 32-36, wherein all of the individuals in the plurality of individuals meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
40. The computer-readable medium of claim 37, wherein all of the individuals in the plurality of individuals meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
41. A method for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising:
(a) obtaining a genotype for the polymorphic genomic region from each of the individuals;
(b) grouping the genotypes obtained in step (a) into groups, wherein in each group g there are ng identical genotypes, and wherein any unique genotypes are regarded as groups having ng = 1;
(c) enumerating all possible haplotypes hi that are consistent with each distinct genotype;
(d) assigning an evidence score si to each of the enumerated possible haplotypes hi;
(e) for each group g, calculating an initial haplotype frequency (fi) for each haplotype among the possible haplotypes, wherein the initial haplotype frequency fi is a function of the product (si)(ng);
(f) determining for each group g, a pair score Fk for each pair of haplotypes that is consistent with the genotype of that group, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(g) calculating, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct;
(h) generating a revised haplotype frequency fi for each haplotype, wherein the revised haplotype frequency fi is a function of the product (ng)(pk) for each consistent haplotype pair which contains the haplotype; and
(i) repeating steps (f) through (h) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (f) is replaced by the revised frequency fi determined in step (h).
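The grouping of steps (b) through (e) of claim 41 can be illustrated with a short Python sketch: identical genotypes are counted once and their contribution is weighted by the group size ng, so the enumeration is done once per distinct genotype rather than once per individual. The names `haplotypes_for` and `initial_frequencies`, the hashable tuple-of-allele-pairs genotype layout, the evidence score 2/2^n, and the normalisation by the number of chromosomes are all assumptions of this example.

```python
from collections import Counter, defaultdict
from itertools import product


def haplotypes_for(genotype):
    """All haplotypes consistent with an unphased diploid genotype."""
    return {"".join(site[0] for site in phase)
            for phase in product(*[((a, b), (b, a)) for a, b in genotype])}


def initial_frequencies(genotypes):
    """Group identical genotypes, then weight evidence scores by group size."""
    groups = Counter(genotypes)            # genotype -> ng
    total = 2 * sum(groups.values())       # chromosomes sampled (diploid)
    freq = defaultdict(float)
    for genotype, n_g in groups.items():
        haps = haplotypes_for(genotype)
        s = 2 / len(haps)                  # 2 / 2**n, n = heterozygous sites
        for h in haps:
            freq[h] += s * n_g / total     # contribution proportional to si * ng
    return groups, dict(freq)
```

With three copies of a homozygous A/C genotype and one doubly heterozygous genotype, the sketch forms two groups (ng = 3 and ng = 1) and assigns the haplotype AC an initial frequency of 0.8125, with 0.0625 for each of the other three haplotypes.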
42. The method of claim 41, wherein the evidence score si obeys a formula selected from the group consisting of:
$$0 < s_i < 2 \quad\text{and}\quad \sum_{i=1}^{n} s_i = 2,$$
where n is the number of ambiguous positions in said genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which nucleotide sequence information is lacking.
43. The method of claim 42 wherein the evidence score si is $2/2^{n}$, wherein n is the number of ambiguous positions in said genotype, with the proviso that if the polymorphic genomic region is haploid or hemizygous in the individual, an evidence score of 1 is assigned.
44. The method of claim 41 wherein the initial frequency fi is calculated from the sum of the products (ng)(si) across all the different genotypes, for all genotypes consistent with the haplotype hi.
45. The method of claim 41 wherein the enumerating step is applied only to each genotype that meets an ambiguity criterion.
46. The method of claim 45, wherein the ambiguity criterion is a function of the number of ambiguous polymorphic sites in the genotype, wherein an ambiguous polymorphic site is either a heterozygous site or is a site for which information is lacking.
47. The method of claim 41 wherein the pair score criterion is chosen from the group consisting of (a) a specific numerical cutoff; (b) a function of the values of the pair scores; and (c) a function of the rankings of the pair scores.
48. The method of claim 41, wherein the pair score $F_k = 2 f_i f_j$ if $i \neq j$, and otherwise $F_k = f_i^2$, with the proviso that the pair score $F_k = f_i$ when the polymorphic genomic region is haploid or hemizygous, where fi and fj are the haplotype frequencies for the haplotypes hi and hj in the pair.
49. The method of claim 47, wherein the pair score criterion is a function of the rankings of the pair scores, and wherein the probability pk is calculated by:
(a) ranking each of the pair scores Fk by score, with the highest score first;
(b) disregarding all but the Nrank highest ranking assignments; and
(c) defining the probability pk as:
Figure imgf000068_0001
50. The method of claim 41, wherein the end condition is selected from the group consisting of: (i) steps (e) through (g) have been repeated a preset number of times; (ii) one or more of the parameters fi, Fk, and pk has stabilized; and (iii) the operator chooses to stop.
51. The method of claim 41, wherein the plurality of individuals includes at least one multi-generation family and the probability pk is reduced for each pair assignment for each genotype in the family that does not obey Mendelian inheritance.
52. The method of claim 51, wherein the reduced probability is reduced to 0 or is reduced by the formula pk(1 - pk'), where pk' is the probability calculated for a pair assignment of a related genotype.
53. The method of claim 41, wherein the plurality of individuals comprises at least one population group and the probability pk is reduced for each pair assignment for each genotype in the population group that does not obey Hardy-Weinberg Equilibrium.
54. The method of claim 53, wherein the probability pk is reduced by the formula
$$p_k (F_{ii} - f_i^2)^2 + (F_{ij} - 2 f_i f_j)^2 + (F_{jj} - f_j^2)^2,$$
wherein fi and fj are the frequencies of haplotypes hi and hj in the population group and Fii, Fij and Fjj are the frequencies of each possible pair of haplotypes hi and hj in the population group.
55. The method of claim 41, wherein steps (c), (d) and (e), are performed for one group at a time.
56. The method of claim 41, wherein steps (c), (d) and (e), are performed for each of the groups in parallel.
57. The method of claim 41, wherein one or both of steps (f) and (g) are performed for one genotype at a time.
58. The method of claim 42, wherein one or both of steps (f) and (g) are performed for each genotype in parallel.
59. A method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising:
(a) obtaining the genotype for the polymorphic genomic region from the individual;
(b) enumerating all possible haplotypes hi for the genotype;
(c) providing a frequency fi for each of the possible haplotypes, where fi is determined by the method of claim 41;
(d) determining a pair score Fk for each pair of possible haplotypes hi that are consistent with the genotype, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair; and
(e) assigning to the genotype the haplotype pair having the highest pair score Fk.
60. The method of claim 41, further including generating an error estimate.
61. A method of constructing a haplotype database for a population, comprising:
(a) determining haplotype data for a plurality of individuals from genotype information using the method of claim 41;
(c) organizing the haplotype data for the plurality of individuals into fields; and
(d) storing the haplotype data for the plurality of individuals according to the fields.
62. The method of claim 61, wherein the haplotype data comprises haplotype frequencies and haplotype pair scores for a polymorphic genomic region.
63. The method of claim 61, wherein the probabilities are reduced for haplotype pairs that do not meet the requirements of the Hardy-Weinberg equilibrium.
64. The method of claim 61 wherein the haplotype data further comprises probabilities that pair assignments are correct.
65. The method of claim 61, wherein the validating comprises correcting an observed distribution of haplotypes or haplotype pairs for effects imposed by a limited number of individuals in the population.
66. The method of claim 65, wherein the validating further comprises analyzing compliance of the observed distribution with Mendelian inheritance principles.
67. The method of claim 61, wherein the population is selected from the group consisting of a reference population, a clinical population, a disease population, an ethnic population, a family population and a same-sex population.
68. A method for predicting an individual's haplotype pair for a polymorphic genomic region, comprising
(a) identifying a genotype for the individual;
(b) enumerating all possible haplotype pairs which are consistent with the genotype;
(c) determining a probability for each possible haplotype pair that the individual has that possible haplotype pair by accessing a database prepared by the method of claim 61 and containing frequency data for reference haplotype pairs; and
(d) analyzing the determined probabilities to predict an individual's haplotype pair.
69. The method of claim 68, further comprising storing the haplotype pair.
70. The method of claim 69, further comprising generating an error estimate.
71. A computer implemented method for generating haplotype pair and haplotype frequency screens for display on a display device, comprising the steps of:
(a) displaying in a first area a plurality of selectable items each corresponding to a polymorphic site for a predetermined gene;
(b) selecting one or more of said selectable items;
(c) displaying in a second area the haplotype pairs occurring in a reference population for the selected polymorphic sites;
(d) displaying in a third area data indicative of haplotype frequencies for a plurality of member groupings within the population; wherein the data indicative of haplotype frequencies is retrieved from a database prepared by the method of claim 61.
72. A computer system for assigning haplotype pairs for a polymorphic genomic region to a plurality of individuals, comprising: a database for storing genotyping information; a processor connected to the database; and a computer program for controlling the processor connected to said database, comprising instruction code to:
(a) accept input of a genotype for the polymorphic genomic region from each of the individuals and store said genotype within said database;
(b) group the genotypes input in step (a) into groups, wherein in each group g there are ng identical genotypes, and wherein any unique genotypes are regarded as groups having ng = 1;
(c) enumerate all possible haplotypes hi consistent with the genotype of each group g, and store said haplotypes hi within said database;
(d) calculate an evidence score si for each of said possible haplotypes hi and store said evidence score si within said database;
(e) for each group g, calculate an initial haplotype frequency fi for each haplotype hi among the possible haplotypes, and store the haplotype frequency fi in said database, wherein the initial haplotype frequency fi is a function of the product (si)(ng);
(f) calculate for each group obtained in step (b) a pair score Fk for each pair of haplotypes that are consistent with that group, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(g) calculate, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct and store the probability pk in said database;
(h) calculate a revised haplotype frequency fi for each of the haplotypes, wherein the revised haplotype frequency fi is a function of the product (ng)(pk) for each consistent haplotype pair which contains the haplotype and storing the revised frequency fi in said database; and
(i) repeat steps (f) through (h) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (f) is replaced by the revised frequency fi determined in step (h) and stored in the database.
73. The computer system of claim 72 wherein the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user and stored in said database.
74. The computer system of claim 73 wherein the computer system is connected to the internet and the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user through the internet.
75. The computer system of claim 73 wherein the computer system is connected to the internet and the genotype for the polymorphic genomic region from each of the individuals is obtained electronically from a remote user through electronic mail.
76. The computer system of claim 73 wherein the genotype for the polymorphic genomic region from each of the individuals is obtained from a database of one or more known genotypes.
77. A computer readable medium comprising instruction code to:
(a) accept input of a genotype for the polymorphic genomic region from each of the individuals and store said genotype within said database;
(b) group the genotypes input in step (a) into groups, wherein in each group g there are ng identical genotypes, and wherein any unique genotypes are regarded as groups having ng = 1;
(c) enumerate all possible haplotypes hi consistent with the genotype of each group g, and store said haplotypes hi within said database;
(d) calculate an evidence score si for each of said possible haplotypes hi and store said evidence score si within said database;
(e) for each group g, calculate an initial haplotype frequency fi for each haplotype hi among the possible haplotypes, and store the haplotype frequency fi in said database, wherein the initial haplotype frequency fi is a function of the product (si)(ng);
(f) calculate for each group obtained in step (b) a pair score Fk for each pair of haplotypes that are consistent with that group, wherein Fk is a function of the frequency fi for each of the haplotypes in the pair;
(g) calculate, for each genotype and consistent haplotype pair whose pair score Fk meets a pair score criterion, a probability pk that assignment of that haplotype pair to the genotype would be correct and store the probability pk in said database;
(h) calculate a revised haplotype frequency fi for each of the haplotypes, wherein the revised haplotype frequency fi is a function of the product (ng)(pk) for each consistent haplotype pair which contains the haplotype and storing the revised frequency fi in said database; and
(i) repeat steps (f) through (h) until an end condition is reached, with the proviso that for each repetition the frequency fi employed in step (f) is replaced by the revised frequency fi determined in step (h) and stored in the database.
78. The method of any one of claims 41-71, wherein the groups are further characterized in that all the individuals, from whom the genotypes in the group are derived, meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
79. The computer system of any one of claims 72-76, wherein the groups are further characterized in that all the individuals, from whom the genotypes in the group are derived, meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
80. The computer-readable medium of claim 77, wherein the groups are further characterized in that all the individuals, from whom the genotypes in the group are derived, meet one or more criteria selected from the group consisting of:
(a) having the same gender;
(b) belonging to the same population group;
(c) belonging to the same clinical or disease population;
(d) exhibiting a particular response to a stimulus;
(e) having in common a particular genotype at a different polymorphic region; and
(f) having in common a particular haplotype at a different polymorphic region.
PCT/US2001/012831 2000-04-18 2001-04-18 Method and system for determining haplotypes from a collection of polymorphisms WO2001080156A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US10/258,155 US20030211501A1 (en) 2001-04-18 2001-04-18 Method and system for determining haplotypes from a collection of polymorphisms
EP01927246A EP1290613A1 (en) 2000-04-18 2001-04-18 Method and system for determining haplotypes from a collection of polymorphisms
AU2001253720A AU2001253720A1 (en) 2000-04-18 2001-04-18 Method and system for determining haplotypes from a collection of polymorphisms

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US19834000P 2000-04-18 2000-04-18
US60/198,340 2000-04-18

Publications (1)

Publication Number Publication Date
WO2001080156A1 true WO2001080156A1 (en) 2001-10-25

Family

ID=22732979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/012831 WO2001080156A1 (en) 2000-04-18 2001-04-18 Method and system for determining haplotypes from a collection of polymorphisms

Country Status (3)

Country Link
EP (1) EP1290613A1 (en)
AU (1) AU2001253720A1 (en)
WO (1) WO2001080156A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CLARK ET AL.: "Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase", AM. J. HUM. GENET., vol. 63, 1998, pages 595 - 612, XP002944466 *
HAWLEY ET AL.: "HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes", J. HEREDITY, vol. 86, no. 5, 1995, pages 409 - 411, XP002944465 *
LONG ET AL.: "An E-M algorithm and testing strategy for multiple-locus haplotypes", AM. J. HUM. GENET., vol. 56, 1995, pages 799 - 810, XP002944464 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002008425A3 (en) * 2000-07-21 2003-09-25 Genaissance Pharmaceuticals Haplotypes of the adrb3 gene
WO2002008425A2 (en) * 2000-07-21 2002-01-31 Genaissance Pharmaceuticals, Inc. Haplotypes of the adrb3 gene
WO2002010454A2 (en) * 2000-07-28 2002-02-07 Genaissance Pharmaceuticals, Inc. Haplotypes of the alas2 gene
WO2002010454A3 (en) * 2000-07-28 2003-09-12 Genaissance Pharmaceuticals Haplotypes of the alas2 gene
WO2002012561A2 (en) * 2000-08-03 2002-02-14 Genaissance Pharmaceuticals, Inc. Haplotypes of the or1g1 gene
WO2002012561A3 (en) * 2000-08-03 2003-09-25 Genaissance Pharmaceuticals Haplotypes of the or1g1 gene
WO2002012499A2 (en) * 2000-08-04 2002-02-14 Genaissance Pharmaceuticals, Inc. Haplotypes of the ntf3 gene
WO2002012498A2 (en) * 2000-08-04 2002-02-14 Genaissance Pharmaceuticals, Inc. Haplotypes of the isl1 gene
WO2002012342A2 (en) * 2000-08-04 2002-02-14 Genaissance Pharmaceuticals, Inc. Haplotypes of the edg4 gene
WO2002012342A3 (en) * 2000-08-04 2003-08-28 Genaissance Pharmaceuticals Haplotypes of the edg4 gene
WO2002012499A3 (en) * 2000-08-04 2003-08-28 Genaissance Pharmaceuticals Haplotypes of the ntf3 gene
WO2002012498A3 (en) * 2000-08-04 2003-08-28 Genaissance Pharmaceuticals Haplotypes of the isl1 gene
WO2002063044A2 (en) * 2001-02-08 2002-08-15 Genaissance Pharmaceuticals, Inc. Haplotypes of the il15 gene
WO2002063044A3 (en) * 2001-02-08 2004-02-26 Genaissance Pharmaceuticals Haplotypes of the il15 gene
US7115726B2 (en) 2001-03-30 2006-10-03 Perlegen Sciences, Inc. Haplotype structures of chromosome 21
US11031098B2 (en) 2001-03-30 2021-06-08 Genetic Technologies Limited Computer systems and methods for genomic analysis
US6969589B2 (en) 2001-03-30 2005-11-29 Perlegen Sciences, Inc. Methods for genomic analysis
EP1386973A4 (en) * 2001-04-19 2004-09-08 Hubit Genomix Inc Method of estimating diplotype from genotype of individual
EP1386973A1 (en) * 2001-04-19 2004-02-04 Hubit Genomix, Inc. Method of estimating diplotype from genotype of individual
US6897025B2 (en) 2002-01-07 2005-05-24 Perlegen Sciences, Inc. Genetic analysis systems and methods
US7135286B2 (en) 2002-03-26 2006-11-14 Perlegen Sciences, Inc. Pharmaceutical and diagnostic business systems and methods
US6955883B2 (en) 2002-03-26 2005-10-18 Perlegen Sciences, Inc. Life sciences business systems and methods
US7250258B2 (en) 2003-12-15 2007-07-31 Pgxhealth Llc CDK5 genetic markers associated with galantamine response
CN103745134A (en) * 2014-01-09 2014-04-23 北京林业大学 Allohexaploid genetic linkage analytical method
CN107076729A (en) * 2014-10-16 2017-08-18 康希尔公司 Variant calls device
EP3207369A4 (en) * 2014-10-16 2018-06-13 Counsyl, Inc. Variant caller

Also Published As

Publication number Publication date
EP1290613A1 (en) 2003-03-12
AU2001253720A1 (en) 2001-10-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWE Wipo information: entry into national phase

Ref document number: 10258155

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2001927246

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001927246

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2001927246

Country of ref document: EP