US20040107054A1 - Method for determining discrete quantitative structure activity relationships - Google Patents


Publication number
US20040107054A1
Authority
US
United States
Prior art keywords
compounds
activity
model
descriptors
recited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/699,459
Inventor
Paul Labute
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chemical Computing Group ULC
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US10/699,459
Assigned to CHEMICAL COMPUTING GROUP INC. (assignment of assignors interest; see document for details). Assignors: LABUTE, PAUL
Publication of US20040107054A1
Abandoned

Classifications

    • C: CHEMISTRY; METALLURGY
    • C07: ORGANIC CHEMISTRY
    • C07K: PEPTIDES
    • C07K1/00: General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10: TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10T: TECHNICAL SUBJECTS COVERED BY FORMER US CLASSIFICATION
    • Y10T436/00: Chemistry: analytical and immunological testing
    • Y10T436/10: Composition for standardization, calibration, simulation, stabilization, preparation or preservation; processes of use in preparation for chemical testing
    • Y10T436/11: Automated chemical analysis

Definitions

  • HTS high throughput screening
  • QSAR Quantitative Structure Activity Relationships
  • Determining a QSAR generally includes the following steps:
  • the ligand needs to be expressed in some quantitative manner. This step generally includes selecting a collection of numbers that characterize the ligand. These numbers are called molecular descriptors or descriptors.
  • Activity is traditionally measured as the amount of ligand needed to produce a particular interference with a target. The amount needed is on a continuous scale.
  • molecular descriptors are usually target-specific. Physical properties are often used. Mathematical properties based on the line drawing of a chemical compound are also used. The use of the electric field of the ligand as a molecular descriptor is called Comparative Molecular Field Analysis (CoMFA) and has been the subject of previous patents. Other molecular descriptor sets include “fingerprints” or “holograms”, which are descriptions of small sub-structures in the ligand.
  • the most widely used method of determining the functional relationship is the statistical technique of regression or least squares. Techniques such as genetic algorithms and partial least squares are used to select the “important” descriptors from the “less important” descriptors or “noise”.
  • HTS technologies report a binary condition; a candidate ligand is either “active” or “inactive”. Some HTS technologies report a discrete measure; i.e. activity on a scale of 1 to 10. In either case, classical QSAR techniques require a continuous activity measurement, e.g. accurate to two to three decimal places.
  • a conventional data set would consist of n observations (y_i, x_i). Without loss of generality it may be assumed that the slope is greater than zero, m > 0, that the x_i have mean 0 and variance 1, and that activity is indicated by the condition y ≥ 0 (i.e., when x ≥ −b/m).
  • a is the number of active compounds. These estimates are completely different than those obtained from non-binary input (e.g., the b estimate is always in the range [0,1] for binary data).
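  • The [0,1] range noted above can be checked numerically. The sketch below assumes that the estimates in question are ordinary least-squares estimates of a line y = mx + b, and the descriptor values are hypothetical. With centered x_i, the least-squares intercept equals the mean of the y_i, which for 0/1 data is the fraction of active compounds, a/n.

```python
def least_squares_line(xs, ys):
    """Return (m, b) minimizing the sum of squared residuals y - (m*x + b)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Hypothetical descriptor values with mean 0 and (population) variance 1.
xs = [-1.5, -0.5, 0.0, 0.5, 1.5]
# Binary activity classes: 2 of the 5 compounds are active.
ys = [0, 0, 0, 1, 1]

m, b = least_squares_line(xs, ys)
a, n = sum(ys), len(ys)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}, a/n = {a / n:.3f}")
```

  • Because b collapses to a/n, the fitted intercept reflects only the proportion of actives in the data, which is one way to see why regression on binary input behaves differently from regression on continuous activity values.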
  • Another object of the present invention is to provide a method for developing a quantitative structure activity relationship that allows the prediction of a candidate compound for a particular target to be identified as either active or inactive.
  • a further object of the present invention is to provide a method for developing a quantitative structure activity relationship that is less sensitive to High Throughput Screening input data error and outliers than the prior art.
  • Still a further object of the present invention is to provide a method for developing a quantitative structure activity relationship and analyze candidate compounds with the use of computer equipment.
  • Yet a further object of present invention is to provide a method for developing a quantitative structure activity relationship that is not significantly influenced by data boundary effects.
  • Still a further object of the present invention is to predict whether or not a chemical compound is a member of a particular set.
  • Yet another object of this present invention is to provide a method for developing a quantitative structure activity relationship that includes obtaining a training set of chemical compounds with molecular descriptors consisting of a number of multidimensional vectors with an activity class for each of the vectors; partitioning the multidimensional vectors in groups having interdependence; transforming the descriptors such that the interdependence of the groups is lessened; estimating a probability distribution of the descriptors by assuming that the probability distribution of the product of each of the groups is approximately equal to the probability distribution of the molecular descriptors; performing the partitioning, transforming and estimating steps for each of the activity classes; and, developing a probability distribution for the activity classes.
  • Still a further object of the present invention is to provide a method for predicting activity of candidate ligands that includes developing a prediction model; obtaining a candidate chemical compound; and, applying the prediction model to the candidate compound.
  • Yet another object of the present invention is to provide a system for predicting activity of candidate compounds as either active or inactive that includes an analyzer that receives a training set of chemical compounds; a prediction model that is developed by the analyzer and is based on the training set; and, a sorter that receives a candidate ligand and receives the model from the analyzer, and that applies the model to the candidate ligand to predict the activity of the candidate ligand.
  • Still a further object of the present invention is to provide a computer-based method of generating a quantitative structure activity relationship that includes calculating a numerical representation of molecules consisting of n numbers per molecule; and, estimating a probability distribution that a molecule is active.
  • FIG. 1 is a flow diagram of the method of the present invention
  • FIG. 2 is a flow diagram of the analyzer with its input and output
  • FIG. 3 is a mathematical flow diagram of the analyzer with its input and output
  • FIG. 4 is a mathematical flow diagram of the sorter with its input and output
  • FIG. 5 is a flow diagram of binary QSAR analysis in MOE.
  • FIG. 6 is a graph of accuracy versus active percentage compounds.
  • FIG. 1 is a flow chart of the present invention, showing the overall structure of the method of developing a discrete Quantitative Structure Activity Relationship (QSAR), and applying the discrete QSAR to candidate compounds to determine the probability that a particular candidate will be active.
  • QSAR Quantitative Structure Activity Relationship
  • Training set 4 are results from High Throughput Screening (HTS) experiments.
  • Training set 4 may be data from sources other than HTS, even virtual or hypothetical data.
  • Training set 4 comprises molecular descriptors to describe the chemical structures of the compounds, and the activity classes or discrete binding affinities associated with the descriptors.
  • Training set 4 is sent to an analyzer 8 to develop a model 12 .
  • Analyzer 8 is a computer. The functions of analyzer 8 may alternatively be performed by some other means, even by hand calculations. Analyzer 8 will be described further below.
  • Model 12 is a mathematical function that is the output of Analyzer 8 . Model 12 is developed based on the chemical structures of training set 4 and the activity classes associated therewith, as will be thoroughly discussed below.
  • Candidate compounds 16 and model 12 are sent to a sorter 20 .
  • Candidate compounds 16 are experimental data. However, candidate compounds 16 may be from any source, even virtual or hypothetical compounds.
  • Sorter 20 applies model 12 to candidate compounds 16 to determine the activity of each candidate compound 16 for a particular target.
  • Sorter 20 is a computer. The functions of sorter 20 may be performed by other means, such as hand calculations. Sorter 20 will be described further below.
  • Analyzer 8 and sorter 20 are connected together, to allow sorter 20 to receive model 12 , and to a display device, not shown.
  • the display device will allow a user to inspect the outputs of analyzer 8 and sorter 20 .
  • the display device is preferably a computer monitor. However, the display device may also be a printer, etc.
  • FIG. 2 displays the overall general process that analyzer 8 performs, along with its input, training set 4 , and its output, model 12 .
  • the following is a description of the steps shown in FIG. 2; more mathematical detail is set forth below.
  • Training set 4 is characterized by a number of multidimensional molecular descriptor vectors with an activity class associated with every vector.
  • the multidimensional descriptor vectors of training set 4 are partitioned into groups, 32 .
  • This partition is arbitrary. The only restriction is the fact that the higher the dimension, the more data is needed and more computer memory is needed.
  • the groups have interdependence.
  • the product of the distributions of each of the groups is assumed to approximately equal the distribution of the original molecular descriptors, as represented by 40 .
  • Steps 32 , 36 and 40 need to be performed for each activity class, as represented by 44 .
  • the distribution of the activity classes must also be estimated 48 .
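  • The steps above can be summarized with Bayes' theorem. The block below is a sketch in notation that is assumed here rather than taken from the source: z_g denotes the transformed descriptors in group g, f_{k,g} the estimated distribution of group g within activity class k, and G the number of groups.

```latex
\Pr(Y = k \mid X = x)
  = \frac{\Pr(X = x \mid Y = k)\,\Pr(Y = k)}
         {\sum_{j} \Pr(X = x \mid Y = j)\,\Pr(Y = j)},
\qquad
\Pr(X = x \mid Y = k) \approx \prod_{g=1}^{G} f_{k,g}(z_g)
```

  • The first identity is exact; the product form is the approximation, and it is only as good as the transformation is at lessening the interdependence between groups.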
  • FIG. 3 displays the mathematical flow diagram for analyzer 8 and FIG. 4 displays the mathematical flow diagram for sorter 20 .
  • Analyzer 8 accepts training set 4 data, which may be characterized by {(y_i, x_i)}.
  • y_i is represented by 52 and x_i is represented by 56 .
  • Training set 4 is the results of m HTS experiments on a common target.
  • the prior distribution of Y is estimated using a maximum likelihood estimator or a Bayes estimator. Any method of estimating these probabilities may be used.
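  • The two estimators named above can be sketched as follows. The source does not say which Bayes estimator is intended, so the add-one form below (the posterior mean under a uniform prior) is an assumption, as are the function names and the sample data.

```python
from collections import Counter

def ml_prior(classes):
    """Maximum likelihood estimate: the observed relative frequency of each class."""
    counts = Counter(classes)
    return {k: counts[k] / len(classes) for k in sorted(counts)}

def add_one_prior(classes, labels):
    """A simple Bayes estimate: posterior mean under a uniform prior.

    One pseudo-count per label keeps every class probability above zero,
    even for a class that never occurs in the training set.
    """
    counts = Counter(classes)
    return {k: (counts[k] + 1) / (len(classes) + len(labels)) for k in labels}

# Hypothetical training classes: 3 active (1) and 7 inactive (0) compounds.
classes = [1] * 3 + [0] * 7
prior_ml = ml_prior(classes)
prior_bayes = add_one_prior(classes, labels=[0, 1])
print(prior_ml, prior_bayes)
```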
  • Our method to approximate the distributions of X is to transform a multidimensional distribution into a product of one-dimensional distributions.
  • the multidimensional distribution may be transformed into simply a collection of lesser dimensional distributions.
  • the idea is to partition the multidimensional distributions into smaller groups to reduce the dimensions to enable one, or a computer, to work with the data.
  • For this task, let W be a random variable over the reals and let f(w) be the probability density for W.
  • the function f can be estimated by accumulating a histogram of the observed sample values on a set of B bins (b_0, b_1], ..., (b_{B−1}, b_B] defined by B+1 numbers b_k < b_{k+1}, where b_0 is minus infinity and b_B is plus infinity. Any method of estimating continuous distributions, other than the one explained here, may be employed.
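  • The histogram estimate can be sketched as below. The function name, the bin edges, and the samples are hypothetical. Only the finite boundaries are passed in, and the two outer bins extend to minus and plus infinity as described above.

```python
from bisect import bisect_left

def histogram_probabilities(samples, inner_edges):
    """Estimate the probability of each half-open bin (lower, upper].

    inner_edges holds the finite boundaries in increasing order; the first
    bin is unbounded below and the last bin is unbounded above.
    """
    counts = [0] * (len(inner_edges) + 1)
    for w in samples:
        # bisect_left returns the number of edges strictly below w, which is
        # the index of the bin (lower, upper] that contains w.
        counts[bisect_left(inner_edges, w)] += 1
    return [c / len(samples) for c in counts]

samples = [-2.0, -0.3, 0.0, 0.4, 1.0, 3.5]
inner_edges = [-1.0, 0.0, 1.0]   # four bins in total
probs = histogram_probabilities(samples, inner_edges)
print(probs)
```

  • Note that the observations 0.0 and 1.0 sit exactly on a boundary and are counted in the bin below it, which is the boundary sensitivity that a smoothed estimate is meant to reduce.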
  • predictions for a candidate ligand c can be made in two ways. Depending on whether the activity classes are an ordered scale, a user will choose the class that has the maximum probability, or use the expected class value for the prediction.
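  • Both prediction rules are short enough to state in code. This is a minimal sketch: the posterior probabilities are hypothetical, and the class labels are assumed to be numbers so that an expected value is meaningful.

```python
def most_probable_class(posterior):
    """Prediction for unordered classes: the class with maximum probability."""
    return max(posterior, key=posterior.get)

def expected_class(posterior):
    """Prediction for an ordered scale: the probability-weighted class value."""
    return sum(k * p for k, p in posterior.items())

# Hypothetical posterior over an ordered three-level activity scale.
posterior = {1: 0.1, 2: 0.5, 3: 0.4}
best = most_probable_class(posterior)
mean = expected_class(posterior)
print(best, round(mean, 2))
```

  • The expected value can fall between classes, which only makes sense when the classes form an ordered scale; for unordered classes the maximum-probability rule is the natural choice.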
  • FIG. 4 displays the mathematical flow diagram for sorter 20 .
  • the input for sorter 20 is a candidate ligand c, 16 , and model 12 .
  • Candidate ligand 16 must go through the same process that each x i did as described above. These steps are represented as 84 and 88 , which mimic 72 and 76 above.
  • the output 24 of sorter 20 is the activity class of candidate ligand 16 .
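  • A sorter of this kind can be sketched as follows, assuming the candidate's descriptors have already been transformed in the same way as the training data and that each transformed descriptor has its own one-dimensional histogram per class. The data layout, names, and numbers are hypothetical.

```python
from bisect import bisect_left

def class_posteriors(z, priors, bin_probs, inner_edges):
    """Posterior probability of each activity class for one candidate.

    z                  : transformed descriptor values of the candidate
    priors[k]          : estimated probability of class k
    bin_probs[k][j][b] : probability that descriptor j falls in bin b, given class k
    inner_edges[j]     : sorted finite bin boundaries for descriptor j
    """
    scores = {}
    for k, prior in priors.items():
        score = prior
        for j, value in enumerate(z):
            score *= bin_probs[k][j][bisect_left(inner_edges[j], value)]
        scores[k] = score
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# Hypothetical two-descriptor model with two bins per descriptor.
priors = {0: 0.8, 1: 0.2}
inner_edges = [[0.0], [0.0]]
bin_probs = {
    0: [[0.7, 0.3], [0.6, 0.4]],
    1: [[0.1, 0.9], [0.2, 0.8]],
}
posterior = class_posteriors([0.5, 1.2], priors, bin_probs, inner_edges)
print(posterior)
```

  • In this example the candidate is predicted active even though actives are the minority class, because both descriptors fall in bins that are far more common among active compounds.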
  • If the model predictions {p_i} are in substantial agreement with the input activity classes {y_i} and the model is judged to be suitably “predictive”, then the model can be used to predict activity class of new candidates. Otherwise, the model can be adjusted by returning to the step of calculating a set of descriptors.
  • the model that is developed is then used to predict the activity class of a (possibly novel) chemical structure by calculating the same descriptors as were calculated in the step of calculating a set of descriptors for the model and by applying the model.
  • An objective of application of the present invention is to build a model to predict the 0 or 1 class when presented with chemical structure.
  • the present invention requires a numerical description of both the activity class and, for each activity class a vector of numbers (the descriptors, or quantification of the chemical structure).
  • the source of the initial data set is quite arbitrary so long as a set of descriptors can be determined from the chemical structures.
  • the chemical structures need not refer to actual compounds; that is, they can be virtual compounds or hypothesized compounds.
  • the activity classes can be any arbitrary classification of the structure; in most cases, this activity class will be some quantification or classification of biological activity.
  • the activity data can be converted into, for example, two activity classes, “active” and “inactive”, by comparison to a threshold value.
  • the threshold value is picked at 5.85. If the activity value is less than 5.85, the activity class is 0. Otherwise, it would be 1. This results in the following data set depicted in Table 2.
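  • The thresholding rule can be written directly. The activity values below are hypothetical and are not the entries of the tables referred to above.

```python
THRESHOLD = 5.85

def activity_class(activity_value, threshold=THRESHOLD):
    """Return 0 when the activity is below the threshold, otherwise 1."""
    return 0 if activity_value < threshold else 1

# Hypothetical continuous activity values.
activities = [4.20, 5.84, 5.85, 7.10]
classes = [activity_class(a) for a in activities]
print(classes)
```

  • A value exactly equal to the threshold is assigned class 1, matching the rule that only values less than 5.85 map to class 0.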
  • the preparation of the initial data set would be performed by a computer.
  • the structures are drawn with commercially available chemical drawing programs or chemical information systems that return chemical structures.
  • Such computer representations of chemical structures typically encode the connectivity and element labels of chemical structures. Some representations encode the depiction while some encode only the connectivity. For example, the same data of Table 2 can be represented textually using SMILES strings (a character-based encoding of chemical structures).
  • a molecular descriptor is a number calculated from a chemical structure. For example, if chemical structures are encoded using chemical formulas, then the molecular weight of the structure is an example of a molecular descriptor. The molecular weight is a number that can be calculated from the chemical formula. It is preferred that the molecular descriptors be calculated by means of a computer. However, they may also be derived by mental calculations from introspection and examination of the data. Scientific literature contains many examples of descriptors used in QSAR studies.
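  • The molecular weight example can be sketched as below. The parser is a simplification that handles only flat formulas such as C18H24O2 (no parentheses or charges), and the atomic weight table is deliberately limited to four elements; both are assumptions made for illustration.

```python
import re

# Standard atomic weights, rounded; extend the table as needed.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(formula):
    """Sum atomic weights over the element/count pairs of a flat formula."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total

# Estradiol, the reference compound of the estrogen receptor example.
estradiol_mw = molecular_weight("C18H24O2")
print(round(estradiol_mw, 3))
```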
  • molecular descriptors are: molar refractivity; octanol/water partition coefficients; pKa; number of carbons; number of triple bonds; number of aromatic atoms; sum of the positive partial charges on each atom; water accessible surface areas; heat of formation; topological connectivity indices; topological shape indices; electrotopological state indices; structure fragment counts; van der Waals volume; etc.
  • quality, accuracy and predictiveness of the calculated model will depend on which descriptors are chosen for a particular data set. Automatic and/or statistical methods are used to help select appropriate descriptors in the iterative model building procedure described herein.
  • the descriptors and activity classes are used to estimate the model parameters.
  • the model can be used to “predict” or “back test” the activity classes of the training set.
  • statistical cross-validation procedures, such as “leave-one-out”, may be used to estimate the quality and predictiveness of the model.
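  • Leave-one-out cross-validation refits the model once per compound, each time predicting the compound that was held out. The sketch below uses a nearest-class-mean rule on a single descriptor purely as a stand-in model; it is not the method of the invention, and all names and data are hypothetical.

```python
def fit_class_means(xs, ys):
    """Stand-in model: the mean descriptor value of each class."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means

def predict_nearest_mean(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def leave_one_out_accuracy(xs, ys, fit, predict):
    """Fraction of compounds predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(model, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)

xs = [0.1, 0.3, 0.2, 0.9, 1.1, 0.5]
ys = [0, 0, 0, 1, 1, 1]
loo = leave_one_out_accuracy(xs, ys, fit_class_means, predict_nearest_mean)
print(round(loo, 3))
```

  • Passing fit and predict as arguments keeps the validation loop independent of the model, so the same loop could wrap any model-building procedure.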
  • the iterative procedure terminates.
  • Exact termination criteria cannot be specified since accuracy and predictiveness will depend on the applications of the model. For example, a relatively high accuracy of the model will be needed if the model is to be used to search databases of available compounds in an effort to locate compounds with activity class “1”. On the other hand, a less accurate model can be used if only a trend or gross indication of activity is required. In other words, the termination criteria are problem dependent.
  • each new structure is prepared in the same manner as the initial data set.
  • the same molecular descriptors are calculated for the new structure and the vector of descriptors is used as input to the calculated model.
  • the sorter, which utilizes the model, will output a predicted activity class. Typical uses of such a model would be compound database searching, focusing of combinatorial libraries, or de novo design (the attempt to create new molecules by modification of chemical structures).
  • the estrogen receptor is an extensively studied pharmaceutical target for which a large number of ligand analogs have been generated and characterized. In addition, structural studies have elucidated the mechanism of the estrogen receptor-ligand interaction and identified the binding determinants.
  • the estrogen receptor binding affinity data of estrogen analogs have been transformed into a binary data format.
  • a predictive binary QSAR model has been derived and this model has been applied to a test set of other estrogen analogs. Both active and inactive analogs were predicted with high accuracy.
  • the binary QSAR model was stable for a variety of binary activity cutoff values and the model was quite insensitive to boundary effects.
  • PCA Principal Components Analysis
  • Conventional procedures for histogram construction are sensitive to bin boundaries since every observation, no matter how close to a bin boundary, is treated as though it falls in the center of the bin. To reduce this sensitivity, each observation is replaced with a Gaussian density with variance σ². This variance can be interpreted as an observation error or as a smoothing parameter.
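  • One way to realize this smoothing is to give each bin the probability mass that the observation's Gaussian places inside it. The sketch below follows that reading; the function names and the value of sigma are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smoothed_bin_probabilities(samples, inner_edges, sigma):
    """Histogram in which each sample is spread as a Gaussian of width sigma."""
    edges = [-math.inf] + list(inner_edges) + [math.inf]
    probs = [0.0] * (len(edges) - 1)
    for w in samples:
        cdf = [normal_cdf((b - w) / sigma) for b in edges]
        for k in range(len(probs)):
            # Mass of the Gaussian centered at w that lies inside bin k.
            probs[k] += cdf[k + 1] - cdf[k]
    return [p / len(samples) for p in probs]

# One observation lying exactly on the boundary between two bins.
on_edge = smoothed_bin_probabilities([0.0], inner_edges=[0.0], sigma=0.12)
print(on_edge)
```

  • A conventional histogram would assign this observation wholly to one bin; with smoothing it contributes half of its weight to each side, so a small shift of the boundary changes the estimate only slightly.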
  • estradiol has a value of 100, with lower affinity analogs having lower values and higher affinity analogs higher RBA values.
  • a total of 463 compounds were selected (tested for binding at 0 to 4° C.), 410 of which were used as a training set to derive a binary QSAR model, and 53 compounds as a test set to evaluate the model by predicting active and inactive compounds.
  • Table 4 shows the composition of estrogen analogs used in this analysis.
  • the continuous biological activity data was expressed in binary form using a threshold criterion (log RBA). Any compounds with log RBA larger than or equal to this criterion were classified as active, and any compounds with lower log RBA values were classified as inactive. Different activity threshold values were used to alter the percentage of active compounds in the training set.
  • Performance of a binary QSAR model was measured as follows: let m_0 represent the number of active compounds, m_1 the number of inactive compounds, c_0 the number of active compounds correctly labeled by the QSAR model, and c_1 the number of inactive compounds correctly labeled by the QSAR model. Three parameters of performance were calculated: 1. accuracy on active compounds, c_0/m_0; 2. accuracy on inactive compounds, c_1/m_1; 3. overall accuracy on all of the compounds, (c_0 + c_1)/(m_0 + m_1). The derived binary QSAR model was cross-validated by a leave-one-out procedure.
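  • The three performance parameters can be computed as below, assuming class 1 denotes active and class 0 denotes inactive. The labels and predictions are hypothetical.

```python
def binary_qsar_accuracy(actual, predicted):
    """Return (accuracy on actives, accuracy on inactives, overall accuracy)."""
    pairs = list(zip(actual, predicted))
    m0 = sum(1 for y, _ in pairs if y == 1)             # active compounds
    m1 = sum(1 for y, _ in pairs if y == 0)             # inactive compounds
    c0 = sum(1 for y, p in pairs if y == 1 and p == 1)  # actives labeled correctly
    c1 = sum(1 for y, p in pairs if y == 0 and p == 0)  # inactives labeled correctly
    return c0 / m0, c1 / m1, (c0 + c1) / (m0 + m1)

actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
acc_active, acc_inactive, acc_all = binary_qsar_accuracy(actual, predicted)
print(acc_active, acc_inactive, acc_all)
```

  • Reporting the two per-class accuracies alongside the overall figure matters when actives are rare, since a model that labels everything inactive can still score well overall.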
  • a set of 410 compounds was chosen to be a training set to derive the binary QSAR model.
  • the range of the biological activities (log RBA) was −2.02 to 2.60.
  • Table 5 shows the data profiles with different threshold values.
  • a value of 1.7 of log RBA which corresponds to 50% of RBA was selected as the threshold to derive the binary QSAR model. Based on this threshold criterion, 62 compounds were active and 348 compounds were inactive in the training set. A smoothing factor was introduced to minimize the sensitivity of the derived model to the selection of bin boundaries as mentioned earlier.
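  • The correspondence between the threshold and the RBA scale is a base-10 logarithm, which the short check below makes explicit. It assumes, as the surrounding text indicates, that RBA is expressed relative to estradiol at 100, and it uses the compound counts reported above.

```python
import math

rba_threshold = 50.0                       # RBA relative to estradiol = 100
log_rba_threshold = math.log10(rba_threshold)

n_active, n_inactive = 62, 348             # training-set counts at this threshold
active_fraction = n_active / (n_active + n_inactive)

print(f"log10(50) = {log_rba_threshold:.3f}")
print(f"active fraction = {active_fraction:.1%}")
```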
  • the binary QSAR model is also influenced by the number of principal components used. A 5 × 7 factor analysis was carried out to determine the effects of different smoothing factor values and principal component numbers on the binary QSAR analysis of the data set analyzed. Table 6 summarizes the results of the analysis.
  • Table 6 shows that an optimal binary QSAR model was obtained by a combination of principal component numbers of 12 and a smoothing factor value of 0.12. Using this combination, the non-cross-validated accuracy is 85% on active compounds, 93% on inactive compounds, 92% for all the compounds. The cross-validated accuracy is 76% on active compounds, 93% on inactive compounds, and 90% for all the compounds. Any departure from these parameter values decreased the non-cross-validated and/or cross-validated accuracy.
  • 3-Keto and 3-methyl ether derivatives have much lower binding affinities because they lack a hydrogen bond donor.
  • the aromatic ring system is required for strong binding because analogs lacking aromatic moieties have only low binding affinity. It follows that structural differences between active and inactive compounds are distinct but may be quite limited.
  • the estrogen analogs are considered to be a challenging test case for binary QSAR analysis because small structural modifications, which actually change binary activity in a more continuous way, are considered here to render compounds either active or inactive.
  • Estrogen Receptor Ligands (representative structures not reproduced)
      Category                             Number of compounds
      Estradiol derivatives                165
      3-keto steroids                      2
      nonaromatic analogs                  4
      metahexestrol derivatives            15
      hexestrol derivatives                50
      diethylstilbestrol derivatives       10
      triphenylethylene analogs            40
      2-phenylbenzothiophene analogs       68
      2-phenylindole analogs               61
      indene analogs                       45
      Phenol and biphenols                 3

Abstract

Method for developing a quantitative structure activity relationship that includes obtaining a training set of chemical compounds with molecular descriptors consisting of a number of multidimensional vectors with an activity class for each of the vectors; partitioning the multidimensional vectors into groups having interdependence; transforming the descriptors such that the interdependence of the groups is lessened; estimating a probability distribution of the descriptors by assuming that a probability distribution of a product of each of the groups is approximately equal to the probability distribution of the molecular descriptors; performing the partitioning, transforming and estimating steps for each of the activity classes; and, developing a probability distribution for the activity classes.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of application Ser. No. 09/252,912, which is incorporated herein by reference, and which application Ser. No. 09/252,912 claims priority of United Kingdom application no. 9803466.3, filed Feb. 19, 1998. [0001]
  • FIELD OF THE INVENTION
  • This invention relates generally to the method of determining relationships between the structure or properties of chemical compounds and the biological activity of those compounds. [0002]
  • BACKGROUND OF THE INVENTION
  • The pharmaceutical and biotechnology industries are continuously searching for effective therapeutic or diagnostic agents. The process for finding effective agents includes target identification, ligand identification, toxicology and clinical trials. [0003]
  • Target identification is basically the identification of a particular biological component, namely a protein and its association with particular disease states or regulatory systems. A protein identified in a search for a chemical compound (drug) that can affect a disease or its symptoms is called a target. Proteins are large chemical compounds comprising a polymer chain of amino acids. The word protein is used herein to refer to any chemical compound that is involved in the regulation or control of biological systems (e.g. enzymes) and whose function can be interfered with by a drug. [0004]
  • The word disease is used herein to refer to an acquired condition or genetic condition. A disease can alter the normal biological systems of the body, causing an over or under abundance of chemical compounds (i.e. a “chemical imbalance”). The regulatory systems for these chemical compounds involve the use, by the body, of certain proteins to detect imbalances or cause the body to produce neutralizing compounds in an attempt to restore the chemical imbalance. The word body is used herein to refer to any biological system: e.g. plant, animal or bacterial. [0005]
  • Ligand identification includes the search for a chemical compound that binds to a particular target. A ligand is a chemical compound that can attach itself to a protein and interfere with the normal functioning of the protein. A useful analogy is viewing the protein as a “lock” and the ligand as a “key.” A ligand that fits the “lock” is called “active.” [0006]
  • Toxicological and clinical trials involve characterizing the effects on the entire body of an identified ligand for a particular target. Additionally, the overall effectiveness regarding the disease must also be measured. These efforts are conducted in model bodies (i.e. generally animals) and then ultimately on the intended body (i.e. generally humans). [0007]
  • The present invention relates to ligand identification. In other words, a target has been identified and the identity of an active ligand is desired. Ligand identification generally involves the developing of a hypothesis that a particular chemical compound will be active, performing a physical experiment to determine if the hypothesized compound is active, and if the compound is not active, then returning to the step of developing a hypothesis. [0008]
  • There are several methods available for developing hypotheses that a particular chemical compound will be active. [0009]
  • A very slow and unpredictable process is introspection. That is, the expertise gained by humans in the hypothesis-experiment process can be put to use in developing new hypotheses regarding the selection of candidate ligands. [0010]
  • Computer simulation methods have also been proposed to reduce the cost of physical experiments. These methods include simulations of activity and suggestions for new candidate ligands. These simulations have not had broad success and are generally too slow and unreliable unless a number of active compounds have already been discovered and minor modifications are desired to improve some property. [0011]
  • The current method of choice is generally called high throughput screening (HTS). This includes the automation of the physical experiment step with robots so that hundreds of thousands or millions of experiments can be performed in a short period of time. This process has allowed a brute-force approach to ligand discovery. The hypothesis phase consists of obtaining large collections of molecules either from external suppliers or through combinatorial chemistry type production of large numbers of compounds. Combinatorial chemistry is a methodology in which many chemical reactions are performed simultaneously to produce a large collection of compounds. The large collection of compounds can then be physically tested with robots and activity results measured. [0012]
  • The universe of possible ligands is extremely large, estimated at between $10^{40}$ and $10^{400}$ compounds. Accordingly, even with HTS approaches it is impossible to physically test all possible ligand candidates. Thus, methods are needed to discard the majority of the possibilities in advance or as the search proceeds. [0013]
  • It is generally accepted that the structure, composition, or physical properties of a ligand directly affect its biological activity against a target. The attempt to transform this qualitative belief into a quantitative method of activity assessment is known as the determination of Quantitative Structure Activity Relationships, or QSAR. QSAR began with the work of Hansch and was further developed by others. See Hansch, C., Fujita, T., ρ-σ-π Analysis, A Method for the Correlation of Biological Activity and Chemical Structure, J. Am. Chem. Soc., 1964; Cramer, R. D., Patterson, D. E., Bunce, J. D., Comparative Molecular Field Analysis (CoMFA), 1. Effect of Shape on Binding of Steroids to Carrier Proteins, J. Am. Chem. Soc., 1988, 110, 5959-5967; and Rogers, D., Hopfinger, A. J., Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, J. Chem. Info. Comp. Sci., 1994, 34. [0014]
  • Determining a QSAR generally includes the following steps: [0015]
  • First, a quantitative measure of activity needs to be defined. [0016]
  • Second, the ligand needs to be expressed in some quantitative manner. This step generally includes selecting a collection of numbers that characterize the ligand. These numbers are called molecular descriptors or descriptors. [0017]
  • Then, a functional relationship between activity and the selected descriptors must be determined. This includes developing a mathematical function that has the property that “activity=a function of the descriptors”, to a suitable high level of accuracy. [0018]
  • The functional relationship and the molecular descriptors are generally used to predict the activity of new candidate ligands. [0019]
  • Activity is traditionally measured as the amount of ligand needed to produce a particular interference with a target. The amount needed is on a continuous scale. [0020]
  • The selection of molecular descriptors is usually target-specific. Physical properties are often used. Mathematical properties based on the line drawing of a chemical compound are also used. The use of the electric field of the ligand as a molecular descriptor is called Comparative Molecular Field Analysis (CoMFA) and has been the subject of previous patents. Other molecular descriptor sets include “fingerprints” or “holograms”, which are descriptions of small sub-structures in the ligand. [0021]
  • The most widely used method of determining the functional relationship is the statistical technique of regression or least squares. Techniques such as genetic algorithms and partial least squares are used to select the “important” descriptors from the “less important” descriptors or “noise”. [0022]
  • The use of high throughput screening (HTS) to identify active compounds has greatly challenged commonly used QSAR techniques. HTS usually generates large amounts of assay data, which initially classifies compounds as active or inactive. In addition, compounds in screening libraries are typically noncongeneric, i.e., they do not share similar core structures. This makes it difficult, if not impossible, to analyze HTS data by classical QSAR techniques and to predict active compounds. [0023]
  • Higher throughput reduces the precision of the activity measurement. Many HTS technologies report a binary condition; a candidate ligand is either “active” or “inactive”. Some HTS technologies report a discrete measure; i.e. activity on a scale of 1 to 10. In either case, classical QSAR techniques require a continuous activity measurement, e.g. accurate to two to three decimal places. [0024]
  • Many HTS techniques have the unfortunate property that the activity measurement is error prone. The error rate is significant enough to warrant special attention since classical QSAR technology is very sensitive to error and outliers (data extremes). A significant error rate will neutralize the predictive capabilities of classical QSAR technology. [0025]
  • To illustrate, consider the following simple example. Suppose that activity y is linearly related to a single descriptor x. The linear relationship is expressed as follows: [0026]
  • y=mx+b
  • A conventional data set would consist of n observations $(y_i, x_i)$. Without loss of generality it may be assumed that the slope is greater than zero, $m>0$, the $x_i$ have mean 0 and variance 1, and that activity is indicated by the condition that $y<0$ (i.e. when $x < -b/m$). [0027]
  • Using linear regression, the estimates for m and b are: [0028]
    $$\hat{m} = \frac{1}{n}\sum_{i=1}^{n} y_i x_i, \qquad \hat{b} = \bar{y}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
  • When presented with HTS binary measurements (i.e. 1 is active and 0 is inactive) representing the condition that $y_i<0$, the linear regression estimates become: [0029]
    $$\hat{m} = \frac{1}{n}\sum_{x_i < -b/m} x_i, \qquad \hat{b} = \frac{a}{n}$$
  • where a is the number of active compounds. These estimates are completely different from those obtained from non-binary input (e.g., the b estimate is always in the range [0,1] for binary data). For example, the estimated descriptor value at the boundary between active and inactive is: [0030]
    $$x = -\frac{1}{\frac{1}{a}\sum_{x_i < -b/m} x_i}$$
  • This is inversely proportional to the mean active descriptor value. Contrast the above equation, which was developed with linear regression, with −b/m, the true descriptor value at the boundary. The assumptions of linear regression are not satisfied with binary HTS data. [0031]
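  • The contrast between continuous and binary input can be checked numerically. The following is a minimal sketch, not part of the described method: the values m = 2 and b = 1, the sample size, the random seed and all function names are illustrative assumptions. It applies the regression estimates above first to continuous activities and then to the corresponding 0/1 labels.

```python
import random

def standardize(values):
    """Shift and scale values to sample mean 0 and (population) variance 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def regression_estimates(xs, ys):
    """Least-squares slope and intercept, valid when xs has mean 0, variance 1."""
    n = len(xs)
    m_hat = sum(x * y for x, y in zip(xs, ys)) / n
    b_hat = sum(ys) / n
    return m_hat, b_hat

# Hypothetical "true" relationship y = m*x + b (illustrative values only).
m_true, b_true = 2.0, 1.0
rng = random.Random(0)
xs = standardize([rng.gauss(0.0, 1.0) for _ in range(10_000)])

# Continuous activity: regression recovers m and b.
ys = [m_true * x + b_true for x in xs]
m_cont, b_cont = regression_estimates(xs, ys)

# Binary activity: 1 (active) when y < 0, otherwise 0 (inactive).
labels = [1 if y < 0 else 0 for y in ys]
m_bin, b_bin = regression_estimates(xs, labels)

a = sum(labels)                                  # number of active compounds
sum_active_x = sum(x for x, lab in zip(xs, labels) if lab == 1)
true_boundary = -b_true / m_true                 # -b/m
estimated_boundary = -a / sum_active_x           # -1 / (mean active descriptor)

print("continuous:", m_cont, b_cont)
print("binary:    ", m_bin, b_bin)
print("boundary:  ", estimated_boundary, "vs true", true_boundary)
```

  • Because every active compound in this sketch has a negative descriptor value, the boundary estimated from the binary labels is positive, while the true boundary −b/m is −0.5.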
  • OBJECTS AND SUMMARY OF THE INVENTION
  • It is an object of the present invention to provide a method for developing a quantitative structure activity relationship that overcomes the shortfalls of the prior art. [0032]
  • Another object of the present invention is to provide a method for developing a quantitative structure activity relationship that allows the prediction of a candidate compound for a particular target to be identified as either active or inactive. [0033]
  • A further object of the present invention is to provide a method for developing a quantitative structure activity relationship that is less sensitive to High Throughput Screening input data error and outliers than the prior art. [0034]
  • Still a further object of the present invention is to provide a method for developing a quantitative structure activity relationship and analyze candidate compounds with the use of computer equipment. [0035]
  • Yet a further object of the present invention is to provide a method for developing a quantitative structure activity relationship that is not significantly influenced by data boundary effects. [0036]
  • Still a further object of the present invention is to predict whether or not a chemical compound is a member of a particular set. [0037]
  • Yet another object of this present invention is to provide a method for developing a quantitative structure activity relationship that includes obtaining a training set of chemical compounds with molecular descriptors consisting of a number of multidimensional vectors with an activity class for each of the vectors; partitioning the multidimensional vectors in groups having interdependence; transforming the descriptors such that the interdependence of the groups is lessened; estimating a probability distribution of the descriptors by assuming that the probability distribution of the product of each of the groups is approximately equal to the probability distribution of the molecular descriptors; performing the partitioning, transforming and estimating steps for each of the activity classes; and, developing a probability distribution for the activity classes. [0038]
  • Still a further object of the present invention is to provide a method for predicting activity of candidate ligands that includes developing a prediction model; obtaining a candidate chemical compound; and, applying the prediction model to the candidate compound. [0039]
  • Yet another object of the present invention is to provide a system for predicting activity of candidate compounds as either active or inactive that includes an analyzer that receives a training set of chemical compounds; a prediction model that is developed by the analyzer and is based on the training set; and, a sorter that receives a candidate ligand and receives the model from the analyzer, the sorter applying the model to the candidate ligand to predict the activity of the candidate ligand. [0040]
  • Still a further object of the present invention is to provide a computer-based method of generating a quantitative structure activity relationship that includes calculating a numerical representation of molecules consisting of n numbers per molecule; and, estimating a probability distribution that a molecule is active. [0041]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a flow diagram of the method of the present invention; [0042]
  • FIG. 2 is a flow diagram of the analyzer with its input and output; [0043]
  • FIG. 3 is a mathematical flow diagram of the analyzer with its input and output; [0044]
  • FIG. 4 is a mathematical flow diagram of the sorter with its input and output; [0045]
  • FIG. 5 is a flow diagram of binary QSAR analysis in MOE; and, [0046]
  • FIG. 6 is a graph of accuracy versus percentage of active compounds. [0047]
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • FIG. 1 is a flow chart of the present invention, showing the overall structure of the method of developing a discrete Quantitative Structure Activity Relationship (QSAR), and applying the discrete QSAR to candidate compounds to determine the probability that a particular candidate will be active. [0048]
  • A training set of compounds 4 is obtained. Training set 4 consists of results from High Throughput Screening (HTS) experiments. Training set 4 may instead be data from sources other than HTS, even virtual or hypothetical data. [0049]
  • Training set 4 comprises molecular descriptors to describe the chemical structures of the compounds, and the activity classes or discrete binding affinities associated with the descriptors. [0050]
  • Training set 4 is sent to an analyzer 8 to develop a model 12. Analyzer 8 is a computer. The functions of analyzer 8 may alternatively be performed by some other means, even by hand calculations. Analyzer 8 will be described further below. [0051]
  • Model 12 is a mathematical function that is the output of analyzer 8. Model 12 is developed based on the chemical structures of training set 4 and the activity classes associated therewith, as will be thoroughly discussed below. [0052]
  • Candidate compounds 16 and model 12 are sent to a sorter 20. Candidate compounds 16 are experimental data. However, candidate compounds 16 may be from any source, even virtual or hypothetical compounds. [0053]
  • Sorter 20 applies model 12 to candidate compounds 16 to determine the activity of each candidate compound 16 for a particular target. Sorter 20 is a computer. The functions of sorter 20 may be performed by other means, such as hand calculations. Sorter 20 will be described further below. [0054]
  • Analyzer 8 and sorter 20 are connected together, to allow sorter 20 to receive model 12, and to a display device, not shown. The display device will allow a user to inspect the outputs of analyzer 8 and sorter 20. The display device is preferably a computer monitor. However, the display device may also be a printer, etc. [0055]
  • An example of a type of computer software program that will perform the methods described herein is entitled “Molecular Operating Environment” available through a license from the Chemical Computing Group Inc. of Montreal, Quebec, Canada. [0056]
  • FIG. 2 displays the overall general process that analyzer 8 performs, along with its input, training set 4, and its output, model 12. The following is a description of the steps shown in FIG. 2; more mathematical detail is set forth below. [0057]
  • Training set 4 is characterized by a number of multidimensional molecular descriptor vectors with an activity class associated with every vector. [0058]
  • The multidimensional descriptor vectors of training set 4 are partitioned into groups, 32. This partition is arbitrary. The only restriction is the fact that the higher the dimension, the more data is needed and more computer memory is needed. The groups have interdependence. [0059]
  • As represented by 36, the molecular descriptors are transformed to lessen the interdependence of the groups. Transformation will be discussed further below. [0060]
  • To estimate the distribution of the original molecular descriptors, the product of the distributions of each of the groups is assumed to approximately equal the distribution of the original molecular descriptors, as represented by 40. [0061]
  • Steps 32, 36 and 40 need to be performed for each activity class, as represented by 44. [0062]
  • The distribution of the activity classes must also be estimated, 48. [0063]
  • With all of the distributions estimated, they are combined to establish a prediction function, i.e. model 12, that will determine whether a candidate compound belongs to a particular activity class. [0064]
  • The mathematical methods employed in the above steps will now be set forth for the particular case of a partition into groups of size 1 and a single transformation applicable to all classes. FIG. 3 displays the mathematical flow diagram for analyzer 8 and FIG. 4 displays the mathematical flow diagram for sorter 20. [0065]
  • Analyzer 8 accepts training set 4 data, which may be characterized by $\{(y_i, x_i)\}$. $y_i$ is represented by 52 and $x_i$ is represented by 56. Training set 4 is the result of m HTS experiments on a common target. Thus, there are m pairs $(y_i, x_i)$, where the $y_i$ are discrete values that, without loss of generality, may be assumed to be numbers in {1, 2, . . . , k}, and each $x_i$ is a vector of n numbers (the molecular descriptors), written $x_i = (x_{i1}, \ldots, x_{in})$ and represented by 60. [0066]
  • We will now introduce a random variable Y over the values {1, 2, . . . , k}, not shown in the flow diagram, and a random variable over n-vectors (a random molecular descriptor), $X = (X_1, \ldots, X_n)$, not shown in the flow diagram. [0067]
  • The conditional distribution Pr(Y|X) is used to determine the probability that a new molecule L, not shown, belongs to activity class y with Pr(Y=y|X=L). The molecule can then be sorted into the class that has the highest probability, mathematically represented using the Bayes theorem: [0068]
    $$\Pr(Y=y \mid X=x) = \frac{\Pr(X=x \mid Y=y)\,\Pr(Y=y)}{\sum_{i=1}^{k} \Pr(X=x \mid Y=i)\,\Pr(Y=i)}$$
  • To use this formula for practical purposes it is necessary to analyze the HTS data in an effort to approximate the distributions on the right hand side of the equation. [0069]
  • The prior distribution of Y is estimated using a maximum likelihood estimator or a Bayes estimator. Any method of estimating these probabilities may be used. A Bayes estimator has been chosen, since it is well defined for all inputs, where $C_j$ is the number of times that $y_i = j$ in the HTS experimental data: [0070]
    $$\Pr(Y=j) \approx g(j) = \frac{C_j + 1}{m + k}$$
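  • The prior estimate g(j) = (C_j + 1)/(m + k) can be sketched in a few lines. The function name class_priors and the sample labels are illustrative assumptions, not part of the described method.

```python
from collections import Counter

def class_priors(labels, k):
    """Estimate Pr(Y = j) for j = 1..k as (C_j + 1) / (m + k).

    labels: observed activity classes y_i, each in {1, ..., k}
    k:      number of activity classes
    """
    m = len(labels)
    counts = Counter(labels)  # C_j = number of times y_i == j
    return {j: (counts.get(j, 0) + 1) / (m + k) for j in range(1, k + 1)}

# Three hypothetical observations over k = 3 classes; class 3 is never seen.
priors = class_priors([1, 1, 2], k=3)
print(priors)
```

  • Adding 1 to every count keeps the estimate defined, and non-zero, even for a class that never occurs in the training data or for an empty training set.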
  • Estimating the k distributions of the form Pr(X=x|Y=j) is more problematic since the X is a vector of n numbers: for values of n of five or more, a straightforward histogram build-up, or counting procedure cannot be used in practice because there will not be enough experimental data to approximate the distribution with any reasonable accuracy. [0071]
  • Our method to approximate the distributions of X is to transform a multidimensional distribution into a product of one-dimensional distributions. However, it is understood herein that, rather than transforming into one-dimensional distributions, the multidimensional distribution may be transformed into simply a collection of lower-dimensional distributions. The idea is to partition the multidimensional distributions into smaller groups to reduce the dimensions to enable one, or a computer, to work with the data. [0072]
  • Thus, to decorrelate 64 each multidimensional vector $x_i$, the method of principal component analysis is used to determine a p by n linear transform Q and an n-vector u, collectively 68, such that the random variable Z=Q(X−u) has a covariance matrix equal to the p by p identity matrix. For the purposes of approximation, it is assumed that the individual coordinates of Z are independent so that the following approximation can be made and is represented as 68 in FIG. 3: [0073]
    $$\Pr(X=x \mid Y=y) \approx \Pr(Z=Q(x-u) \mid Y=y) = \prod_{i=1}^{p} \Pr(Z_i=z_i \mid Y=y)$$
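  • The decorrelating transform can be illustrated for the smallest non-trivial case. The sketch below is restricted to n = p = 2, where the eigen-decomposition of the covariance matrix has a closed form; a general implementation would use a numerical eigen-solver. All function names and the synthetic data are assumptions made for illustration, and the sample covariance is assumed to be non-singular.

```python
import math
import random

def sample_covariance(points):
    """Return (mean, (a, b, c)) for the 2x2 covariance [[a, b], [b, c]]."""
    m = len(points)
    u = (sum(p[0] for p in points) / m, sum(p[1] for p in points) / m)
    a = sum((p[0] - u[0]) ** 2 for p in points) / m
    b = sum((p[0] - u[0]) * (p[1] - u[1]) for p in points) / m
    c = sum((p[1] - u[1]) ** 2 for p in points) / m
    return u, (a, b, c)

def whiten_2d(points):
    """Return (Q, u) such that z = Q (x - u) has identity sample covariance."""
    u, (a, b, c) = sample_covariance(points)
    # Rotation angle of the principal axes of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    axes = [(math.cos(theta), math.sin(theta)),
            (-math.sin(theta), math.cos(theta))]
    q = []
    for vx, vy in axes:
        # Variance along this principal axis (the corresponding eigenvalue).
        lam = a * vx * vx + 2.0 * b * vx * vy + c * vy * vy
        q.append((vx / math.sqrt(lam), vy / math.sqrt(lam)))
    return q, u

def apply_transform(q, u, x):
    """Compute z = Q (x - u) for a single 2-vector x."""
    dx, dy = x[0] - u[0], x[1] - u[1]
    return (q[0][0] * dx + q[0][1] * dy, q[1][0] * dx + q[1][1] * dy)

# Synthetic, strongly correlated two-descriptor data (illustrative only).
rng = random.Random(1)
points = []
for _ in range(500):
    t = rng.gauss(0.0, 1.0)
    points.append((t + 0.3 * rng.gauss(0.0, 1.0), 2.0 * t + rng.gauss(0.0, 1.0)))

q, u = whiten_2d(points)
z = [apply_transform(q, u, x) for x in points]
z_mean, z_cov = sample_covariance(z)
print(z_mean, z_cov)  # mean close to (0, 0); covariance close to (1, 0, 1)
```

  • Decorrelation removes only linear dependence between coordinates; treating the coordinates of Z as independent remains an approximation, as stated above.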
  • To accomplish this task (estimating each one-dimensional distribution), let W be a random variable over the reals and let ƒ(w) be the probability density for W. The function ƒ can be estimated by accumulating a histogram of the observed sample values on a set of B bins $(b_0, b_1], \ldots, (b_{B-1}, b_B]$ defined by B+1 numbers with $b_k < b_{k+1}$, where $b_0$ is minus infinity and $b_B$ is plus infinity. Any method of estimating continuous distributions, other than the one explained here, may be employed. [0074]
  • The usual procedure for counting the number of observations among m samples in bin k>0 is: [0075]
    $$B_k = \sum_{i=1}^{m} \delta\big(w_i \in (b_{k-1}, b_k]\big) = \sum_{i=1}^{m} \int_{b_{k-1}}^{b_k} \delta(x - w_i)\,dx$$
  • This procedure has an unfortunate sensitivity to the selection of bin boundaries since observations close to a bin boundary are treated as if they were in the middle of one of the bins. In view of this sensitivity, it is desirable to spread the observations out over the bins. In other words, rather than having a single observation point, the observation will be blurred over several bins. Here, the blurring area is created by a bell-curve. However, any type of spreading or blurring may be used. [0076]
  • Accordingly, to reduce the sensitivity to the bin boundaries, the delta function in the above equation is replaced with a Gaussian, with variance $s^2$. This can be thought of as an observation error as well as a smoothing parameter. The equation now becomes: [0077]
    $$B_k = \sum_{i=1}^{m} \int_{b_{k-1}}^{b_k} \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(x-w_i)^2}{s^2}\right] dx = \frac{1}{2}\sum_{i=1}^{m}\left[\operatorname{erf}\left(\frac{b_k - w_i}{s\sqrt{2}}\right) - \operatorname{erf}\left(\frac{b_{k-1} - w_i}{s\sqrt{2}}\right)\right]$$
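  • The error-function form of the smoothed bin count translates directly into code. This is a minimal sketch; the function name smoothed_histogram, the bin edges and the sample values are illustrative assumptions.

```python
import math

def smoothed_histogram(samples, edges, s):
    """Gaussian-smoothed bin counts B_1..B_B.

    samples: observed values w_i
    edges:   [b_0, ..., b_B] with b_0 = -inf and b_B = +inf
    s:       standard deviation of the smoothing Gaussian (s > 0)
    """
    scale = s * math.sqrt(2.0)
    counts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        total = 0.0
        for w in samples:
            # Mass of a Gaussian centred at w that falls inside (lo, hi].
            total += 0.5 * (math.erf((hi - w) / scale) - math.erf((lo - w) / scale))
        counts.append(total)
    return counts

edges = [-math.inf, 0.0, 1.0, math.inf]
counts = smoothed_histogram([0.0, 0.5, 2.0], edges, s=0.25)
print(counts, sum(counts))
```

  • Each observation contributes a total weight of 1 spread across the bins, so the counts still sum to m. An observation lying exactly on a boundary is split evenly between the two neighbouring bins instead of being assigned wholly to one of them.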
  • Using the above techniques, an estimation 76 is made for p·k distributions from the HTS experimental data. In other words, we approximate the one-dimensional distributions with $f_j(z,y)$ for j in {1, . . . , p} and y in {1, . . . , k}. The collection $f_j(z,y)$, $C_y$ is represented by 80 in FIG. 3. The final approximation, or model 12, with z=Q(x−u), is: [0078]
    $$\Pr(Y=y \mid X=x) \approx \frac{(C_y+1)\prod_{j=1}^{p} f_j(z_j, y)}{\sum_{i=1}^{k} (C_i+1)\prod_{j=1}^{p} f_j(z_j, i)}$$
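  • The final approximation combines the class counts with the per-coordinate density estimates. The sketch below assumes the densities are supplied as callables; the Gaussian stand-ins, the class counts and all names are hypothetical and replace the histogram estimates only for the sake of a short example.

```python
import math

def posterior(z, class_counts, densities):
    """Approximate Pr(Y = y | X = x) for every class y.

    z:            transformed descriptor vector (z_1, ..., z_p)
    class_counts: dict mapping class y to C_y
    densities:    dict mapping class y to a list of p callables f_j(., y)
    Assumes at least one class has a non-zero score.
    """
    scores = {}
    for y, c_y in class_counts.items():
        score = c_y + 1.0
        for j, z_j in enumerate(z):
            score *= densities[y][j](z_j)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def gaussian(mu):
    """Hypothetical stand-in for an estimated one-dimensional density."""
    return lambda t: math.exp(-0.5 * (t - mu) ** 2) / math.sqrt(2.0 * math.pi)

class_counts = {1: 30, 2: 10}
densities = {1: [gaussian(-1.0), gaussian(0.0)],
             2: [gaussian(1.0), gaussian(0.0)]}
probs = posterior((1.0, 0.2), class_counts, densities)
print(probs)
```

  • The common factor 1/(m + k) from the prior estimate cancels between numerator and denominator, which is why only C_y + 1 appears in the scores.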
  • Now that model 12 is developed, predictions for a candidate ligand c can be made in two ways. Depending on whether the activity classes are an ordered scale, a user will choose the class that has the maximum probability, or use the expected class value for the prediction. [0079]
  • FIG. 4 displays the mathematical flow diagram for sorter 20. The input for sorter 20 is a candidate ligand c, 16, and model 12. The transform Q and n-vector u, 68, and $f_j(z,y)$, $C_y$, 80, make up model 12. [0080]
  • Candidate ligand 16 must go through the same process that each $x_i$ did as described above. These steps are represented as 84 and 88, which mimic 72 and 76 above. The output 24 of sorter 20 is the activity class of candidate ligand 16. [0081]
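  • The two ways of turning class probabilities into a prediction can be sketched as follows. The function names and the example probabilities are illustrative assumptions.

```python
def most_probable_class(probabilities):
    """Return the class with the maximum probability."""
    return max(probabilities, key=probabilities.get)

def expected_class(probabilities):
    """Return the expected class value; meaningful only for an ordered scale."""
    return sum(y * p for y, p in probabilities.items())

# Hypothetical output of the model for one candidate on a 1..3 scale.
probs = {1: 0.2, 2: 0.5, 3: 0.3}
print(most_probable_class(probs), expected_class(probs))
```

  • The expected value uses the numeric spacing of the classes, so it suits an ordered activity scale; for unordered classes the maximum-probability rule is the natural choice.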
  • The steps outlined above are to be performed with the use of computer software and a computer. However, it is understood that the steps may be performed by some other means, even by manual calculations. [0082]
  • Use of the present invention typically will be iterative, with efforts directed at selecting, determining, discovering or inventing those descriptors that lead to an accurate and predictive model of biological activity. A typical sequence of general QSAR steps, not necessarily the inventive steps, is the following. [0083]
  • Obtain a collection of chemical structures and a collection of activity class numbers $\{y_i\}$ in the range 1, . . . , k such that each chemical structure has an associated activity class number. [0084]
  • For each chemical structure, calculate a set of descriptors $x = (x_1, \ldots, x_n)$. The complete input data set will be the $\{(y_i, x_i)\}$. [0085]
  • Apply the procedure described herein and depicted in FIGS. 2 and 3, using the input training set as the set of “candidates”, to obtain a “model” consisting of $(Q, u, f_j, C_y)$ and a collection of model predictions $p_i$. [0086]
  • If the model predictions $\{p_i\}$ are in substantial agreement with the input activity classes $\{y_i\}$ and the model is judged to be suitably “predictive”, then the model can be used to predict the activity class of new candidates. Otherwise, the model can be adjusted by returning to the step of calculating a set of descriptors. [0087]
  • The model that is developed is then used to predict the activity class of a (possibly novel) chemical structure by calculating the same descriptors as were calculated in the step of calculating a set of descriptors for the model and by applying the model. [0088]
  • An objective of application of the present invention, as stated above, is to build a model to predict the 0 or 1 class when presented with a chemical structure. [0089]
  • The present invention requires, for each chemical structure, a numerical activity class and a vector of numbers (the descriptors, or quantification of the chemical structure). The source of the initial data set is quite arbitrary so long as a set of descriptors can be determined from the chemical structures. As mentioned above, the chemical structures need not refer to actual compounds; that is, they can be virtual compounds or hypothesized compounds. The activity classes can be any arbitrary classification of the structure; in most cases, this activity class will be some quantification or classification of biological activity. [0090]
  • Because the source of the data set is arbitrary, it is impossible to enumerate all possible ways in which a data set, or training set, can be assembled. [0091]
  • Research into scientific literature regarding experiments with the Carbonic Anhydrase II receptor revealed information describing physical experiments to determine and quantify the binding affinity of a variety of chemical compounds. Each compound is given a numerical value indicating its binding affinity. Table 1 depicts the nature of an example initial data set. Beside each drawing in Table 1, is a quantitative experimental assessment of binding affinity. [0092]
  • The activity data can be converted into, for example, two activity classes, “active” and “inactive”, by comparison to a threshold value. For example purposes, the threshold value is picked at 5.85. If the activity value is less than 5.85, the activity class is 0. Otherwise, it would be 1. This results in the following data set depicted in Table 2. [0093]
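  • The conversion to two activity classes is a single comparison against the threshold. In the sketch below, the function name and the sample affinity values are illustrative assumptions; only the threshold 5.85 comes from the text.

```python
def to_activity_class(activity, threshold=5.85):
    """Return 0 if the activity value is below the threshold, otherwise 1."""
    return 0 if activity < threshold else 1

# Hypothetical binding-affinity values (not taken from Table 1).
affinities = [4.20, 5.85, 7.10]
classes = [to_activity_class(value) for value in affinities]
print(classes)
```

  • A value exactly equal to the threshold falls in class 1, matching the rule that only values less than 5.85 are assigned class 0.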
  • It is preferred that the preparation of the initial data set would be performed by a computer. In such a case, the structures are drawn with commercially available chemical drawing programs or chemical information systems that return chemical structures. Such computer representations of chemical structures typically encode the connectivity and element labels of chemical structures. Some representations encode the depiction while some encode only the connectivity. For example, the same data of Table 2 can be represented textually using SMILES strings (a character-based encoding of chemical structures). [0094]
  • The nature of encoding of the chemical structures is not critical, as long as molecular descriptors can be calculated from the structure encoding. [0095]
  • A molecular descriptor is a number calculated from a chemical structure. For example, if chemical structures are encoded using chemical formulas, then the molecular weight of the structure is an example of a molecular descriptor. The molecular weight is a number that can be calculated from the chemical formula. It is preferred that the molecular descriptors be calculated by means of a computer. However, they may also be derived by mental calculations from introspection and examination of the data. Scientific literature contains many examples of descriptors used in QSAR studies. Examples of molecular descriptors are: molar refractivity; octanol/water partition coefficients; pKa; number of carbons; number of triple bonds; number of aromatic atoms; sum of the positive partial charges on each atom; water accessible surface areas; heat of formation; topological connectivity indices; topological shape indices; electrotopological state indices; structure fragment counts; van der Waals volume; etc. In general, the quality, accuracy and predictiveness of the calculated model will depend on which descriptors are chosen for a particular data set. Automatic and/or statistical methods are used to help select appropriate descriptors in the iterative model building procedure described herein. [0096]
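  • As an illustration of a descriptor computed from a structure encoding, the sketch below derives the molecular weight from a simple chemical formula. It is an assumption-laden example: the function name is hypothetical, only a few elements are tabulated (standard atomic weights, rounded), and formulas with parentheses or charges are not handled.

```python
import re

# Standard atomic weights (rounded) for a handful of common elements.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def molecular_weight(formula):
    """Sum atomic weights for a flat formula such as 'C2H6O'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total

print(molecular_weight("C2H6O"))  # ethanol, used here only as a sample input
```

  • An element missing from the table raises a KeyError, which is preferable to silently returning a wrong descriptor value.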
  • As set forth above, the descriptors and activity classes are used to estimate the model parameters. The model can be used to “predict” or “back test” the activity classes of the training set. Statistical cross-validation procedures, such as “leave-one-out”, may be used to estimate the quality and predictiveness of the model. [0097]
  • When the results of the model building and descriptor selection procedure are judged suitably accurate, the iterative procedure terminates. Exact termination criteria cannot be specified since accuracy and predictiveness will depend on the applications of the model. For example, a relatively high accuracy of the model will be needed if the model is to be used to search databases of available compounds in an effort to locate compounds with activity class “1”. On the other hand, a less accurate model can be used if only a trend or gross indication of activity is required. In other words, the termination criteria are problem dependent. [0098]
  • To use the calculated model to predict activity classes of chemical structures not present in the initial training set, the following is performed. Each new structure is prepared in the same manner as the initial data set. The same molecular descriptors are calculated for the new structure and the vector of descriptors is used as input to the calculated model. The sorter, which utilizes the model, will output a predicted activity class. Typical uses of such a model would be compound database searching, focusing of combinatorial libraries, or de novo design (the attempt to create new molecules by modification of chemical structures). [0099]
  • The following is based on and is a partial reproduction of: “Binary Quantitative Structure-Activity Relationship (QSAR) Analysis of Estrogen Receptor Ligands”, Gao, H., Williams, C., Labute, P., Bajorath, J., J. Chem. Inf. Comput. Sci. 1999, 39, 164-168, which is incorporated herein by reference. [0100]
  • The above methods for discrete or binary QSAR correlate compound structures, using molecular descriptors, with a “binary” expression of activity, i.e., 1=active and 0=inactive, and calculate a probability distribution for active and inactive compounds in a training set. This function can then be used to predict active compounds for a given target in a test set. The present invention is applied below to a drug discovery problem, the analysis of estrogen receptor ligands. [0101]
  • The estrogen receptor is an extensively studied pharmaceutical target for which a large number of ligand analogs have been generated and characterized. In addition, structural studies have elucidated the mechanism of the estrogen receptor-ligand interaction and identified the binding determinants. The estrogen receptor binding affinity data of estrogen analogs have been transformed into a binary data format. A predictive binary QSAR model has been derived and this model has been applied to a test set of other estrogen analogs. Both active and inactive analogs were predicted with high accuracy. The binary QSAR model was stable for a variety of binary activity cutoff values and the model was quite insensitive to boundary effects. [0102]
  • The binary QSAR analysis procedure used in this study is generally depicted in FIG. 5. Binary QSAR estimates, from a training set, the probability density Pr(Y=1|X=x), where Y is a Bernoulli random variable (i.e. Y takes on the value 1 for “active” or 0 for “inactive”) and X is a random n-vector of real numbers (a random collection of molecular descriptors). [0103]
  • A Principal Component Analysis (PCA) is conducted on the training set to calculate a p by n linear transform, Q, and an n-vector, u, such that the random p-vector Z=Q(X−u) has mean zero and covariance equal to the p by p identity matrix. The quantity p is referred to as the number of principal components. [0104]
  • The original molecular descriptors are transformed by Q and u to obtain a decorrelated and normalized set of descriptors. The desired probability density is then approximated by applying Bayes' theorem and assuming that the transformed descriptors are mutually independent: [0105]
    $$\Pr(Y=1 \mid X=x) \approx \left[1 + \frac{\Pr(Y=0)}{\Pr(Y=1)} \prod_{i=1}^{p} \frac{\Pr(Z_i=z_i \mid Y=0)}{\Pr(Z_i=z_i \mid Y=1)}\right]^{-1}, \qquad Z = Q(X-u) = (Z_1, \ldots, Z_p)$$
  • Each conditional probability density $\Pr(Z_i=z_i \mid Y=y)$ is estimated by constructing a histogram. Conventional procedures for histogram construction are sensitive to bin boundaries since every observation, no matter how close to a bin boundary, is treated as though it falls in the center of the bin. To reduce this sensitivity, each observation is replaced with a Gaussian density with variance $\sigma^2$. This variance can be interpreted as an observation error or as a smoothing parameter. [0106]
  • Once all of the 2p+2 probability densities have been estimated from the training set, the desired Pr(Y=1|X=x) is constructed using the above formula. [0107]
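  • For two classes, the ratio form of the formula can be evaluated as sketched below. The prior values follow the (C + 1)/(m + k) estimator applied to 348 inactive and 62 active compounds, the split reported for the training set in this study; the density values and all names are illustrative assumptions.

```python
def binary_posterior(prior0, prior1, dens0, dens1):
    """Approximate Pr(Y = 1 | X = x) from priors and per-coordinate densities.

    dens0[i], dens1[i]: Pr(Z_i = z_i | Y = 0) and Pr(Z_i = z_i | Y = 1),
    already evaluated at the candidate's transformed descriptors.
    Assumes prior1 and every dens1[i] are non-zero.
    """
    ratio = prior0 / prior1
    for f0, f1 in zip(dens0, dens1):
        ratio *= f0 / f1
    return 1.0 / (1.0 + ratio)

# Priors from 348 inactive and 62 active compounds (m = 410, k = 2).
prior0, prior1 = 349 / 412, 63 / 412
# Hypothetical density values for p = 2 principal components.
dens0, dens1 = [0.10, 0.50], [0.40, 0.25]
p_active = binary_posterior(prior0, prior1, dens0, dens1)
print(p_active)
```

  • The ratio form is algebraically the same as normalizing the two class scores, but it needs only one division per coordinate and makes explicit that the prediction depends on the likelihood ratios.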
  • The binding data of estrogen analogs to estrogen receptors of different species was collected from literature. There is little, if any, evidence for receptor-species difference in estrogen analog structure-affinity relationships. There are two subtypes of estrogen receptors, ER-α and ER-β. The data reported here is presumed to come from ER-α, since this subtype is the predominant one in uterine and breast tissue. The binding data was placed on a common “relative binding affinity” (RBA) scale. Values on this scale were calculated as a percentage of the ratio or IC[0108] 50 values of test compounds to displace 50% of [3H]estradiol from estrogen receptor binding. Thus, on the RBA scale, estradiol has a value of 100, with lower affinity analogs having lower values and higher affinity analogs higher RBA values. A total of 463 compounds were selected (tested for binding at 0 to 4° C.), 410 of which were used as a training set to derive a binary QSAR model, and 53 compounds as a test set to evaluate the model by predicting active and inactive compounds. Table 4 shows the composition of estrogen analogs used in this analysis. The continuous biological activity data was expressed in binary form using a threshold criterion (log RBA). Any compounds with log RBA larger than or equal to this criterion were classified as active, and any compounds with lower log RBA values were classified as inactive. Different activity threshold values were used to alter the percentage of active compounds in the training set.
  • Molecular descriptors were calculated using the 1998.03 version of MOE, from the Chemical Computing Group Inc. of Montreal, Quebec, Canada, and binary QSAR analysis was carried out with the MOE binary QSAR function. [0109]
  • Performance of a binary QSAR model was measured as follows: let m0 represent the number of active compounds, m1 the number of inactive compounds, c0 the number of active compounds correctly labeled by the QSAR model, and c1 the number of inactive compounds correctly labeled by the QSAR model. Three parameters of performance were calculated: 1. accuracy on active compounds, c0/m0; 2. accuracy on inactive compounds, c1/m1; 3. overall accuracy on all of the compounds, (c0+c1)/(m0+m1). The derived binary QSAR model was cross-validated by a leave-one-out procedure. [0110]
  • In this procedure, only one object is eliminated at a time and the process is repeated until all objects have been eliminated once and only once. Accuracy was calculated for each step, and an average accuracy for all the steps was reported as a measure of the internal predictivity of the model within the training set. [0111]
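The three performance parameters and the leave-one-out loop can be sketched as below. The sketch pools the held-out predictions before scoring, which gives the same overall accuracy as averaging the per-step results. The nearest-mean classifier is a hypothetical stand-in used only to make the example runnable; it is not the binary QSAR model, and all function names are assumptions.

```python
def performance(y_true, y_pred):
    """Return (accuracy on actives, accuracy on inactives, overall accuracy)."""
    m0 = sum(1 for t in y_true if t == 1)  # number of active compounds
    m1 = sum(1 for t in y_true if t == 0)  # number of inactive compounds
    c0 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    c1 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return c0 / m0, c1 / m1, (c0 + c1) / (m0 + m1)


def leave_one_out(X, y, fit, predict):
    """Hold out each compound once, refit on the rest, score pooled predictions."""
    held_out_predictions = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        held_out_predictions.append(predict(model, X[i]))
    return performance(y, held_out_predictions)


# Hypothetical stand-in classifier: label by the nearer class mean of one descriptor.
def fit_nearest_mean(X, y):
    active = [x[0] for x, label in zip(X, y) if label == 1]
    inactive = [x[0] for x, label in zip(X, y) if label == 0]
    return sum(active) / len(active), sum(inactive) / len(inactive)


def predict_nearest_mean(model, x):
    mean_active, mean_inactive = model
    return 1 if abs(x[0] - mean_active) < abs(x[0] - mean_inactive) else 0
```

Both classes must be present in the data, otherwise m0 or m1 is zero and the per-class accuracy is undefined.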
  • A set of 410 compounds was chosen to be a training set to derive the binary QSAR model. The range of the biological activities (log RBA) was −2.02 to 2.60. Table 5 shows the data profiles with different threshold values. [0112]
  • A log RBA value of 1.7, which corresponds to an RBA of 50%, was selected as the threshold to derive the binary QSAR model. Based on this threshold criterion, 62 compounds were active and 348 compounds were inactive in the training set. A smoothing factor was introduced to minimize the sensitivity of the derived model to the selection of bin boundaries as mentioned earlier. The binary QSAR model is also influenced by the number of principal components used. A 5×7 factor analysis was carried out to determine the effects of different smoothing factor values and principal component numbers on the binary QSAR analysis of the data set analyzed. Table 6 summarizes the results of the analysis. [0113]
  • In this study, two-dimensional (2D) molecular descriptors were used and were shown to perform well in compound clustering. In addition, Kier's shape indices were used, which contain implicit three-dimensional (3D) information. Explicit 3D descriptors were not considered, to avoid biasing the analysis with predicted conformational effects. Different combinations of molecular descriptors were systematically explored to identify a set that captures the structural characteristics of estrogen analogs and the resulting activities well. This was done for the learning set, as in more conventional QSAR analysis. [0114]
  • Table 6 shows that an optimal binary QSAR model was obtained with 12 principal components and a smoothing factor value of 0.12. Using this combination, the non-cross-validated accuracy is 85% on active compounds, 93% on inactive compounds, and 92% for all the compounds. The cross-validated accuracy is 76% on active compounds, 93% on inactive compounds, and 90% for all the compounds. Any departure from these parameter values decreased the non-cross-validated and/or cross-validated accuracy. Thirteen molecular descriptors were used to derive the binary QSAR model (Table 7), including four atomic connectivity indices, four molecular shape indices, one total hydrophobic accessible surface area descriptor, one charge descriptor, one aromatic bond descriptor, and two indicator variables for specific functional group and molecular structure. One of the descriptors used is I,es. A number of diethylstilbestrol (DES) analogs are found to be more potent estrogen receptor ligands than estradiol itself, despite their structural similarity (log RBA is 2.48 for DES versus 2.00 for estradiol). Because structural features that account for the higher potency of DES analogs were not obvious, the indicator variable I,es was included to account for this effect. A phenolic OH group that resembles the 3-OH of the estradiol molecule is required for tight binding to the estrogen receptor. To account for this specific structural effect, an indicator variable, I,OH, was used. [0115]
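The parameter selection can be checked against the Table 6 values with a short script. The two rows printed for each PCA number are not labeled in the table; this sketch assumes the first row is the non-cross-validated accuracy on active compounds and the second row the cross-validated accuracy, an assumption that agrees with the 0.85 and 0.76 quoted above for 12 components and a smoothing factor of 0.12. The selection rule (highest cross-validated value, ties broken by the non-cross-validated value) is likewise an illustrative choice.

```python
smoothing_factors = [0.08, 0.10, 0.12, 0.14, 0.16, 0.20, 0.25]

# PCA number -> (assumed non-cross-validated row, assumed cross-validated row)
table6 = {
    6: ([0.79, 0.76, 0.74, 0.69, 0.69, 0.60, 0.52],
        [0.63, 0.61, 0.60, 0.60, 0.55, 0.48, 0.45]),
    8: ([0.81, 0.77, 0.77, 0.76, 0.76, 0.73, 0.66],
        [0.71, 0.71, 0.71, 0.69, 0.68, 0.65, 0.55]),
    10: ([0.85, 0.84, 0.84, 0.84, 0.81, 0.77, 0.73],
         [0.68, 0.68, 0.66, 0.68, 0.69, 0.68, 0.65]),
    12: ([0.85, 0.85, 0.85, 0.82, 0.81, 0.81, 0.79],
         [0.69, 0.71, 0.76, 0.76, 0.73, 0.73, 0.71]),
    13: ([0.85, 0.85, 0.82, 0.82, 0.82, 0.82, 0.79],
         [0.63, 0.66, 0.69, 0.69, 0.69, 0.66, 0.68]),
}


def best_setting(table, factors):
    """Pick the (PCA number, smoothing factor) cell with the best accuracies."""
    candidates = [
        (cross_validated, fitted, n_components, factor)
        for n_components, (fitted_row, cv_row) in table.items()
        for factor, fitted, cross_validated in zip(factors, fitted_row, cv_row)
    ]
    cross_validated, fitted, n_components, factor = max(
        candidates, key=lambda c: (c[0], c[1])
    )
    return n_components, factor, fitted, cross_validated


print(best_setting(table6, smoothing_factors))
```

Under these assumptions the script selects 12 principal components and a smoothing factor of 0.12, the combination reported in the text.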
  • The effects of ten different threshold values (log RBA values ranging from −2 to 2) on the binary QSAR model were analyzed (FIG. 2). Accuracy on active compounds ranged from 70% to 98%, with the highest accuracy obtained for 98% active compounds and the lowest for 7% active compounds. The overall accuracy remains stable at different threshold values (around 90%). FIG. 6 shows that the selected threshold values cause the observed overall accuracy to fluctuate by approximately 10%. The minimum overall accuracy obtained is about 80%. Thus, on the basis of these findings, the overall binary QSAR accuracy remains stable irrespective of the chosen threshold values. [0116]
  • Compounds with biological activity near the binary threshold value may fall into either the active or inactive category, which also depends on the experimental error. To analyze the influence of boundary effects on the binary QSAR model, compounds with log RBA values between 1.0 and 1.7 were omitted. Therefore, in these calculations, binary classification corresponds to the largest difference in biological activities. This data set consisted of 292 inactive and 62 active (17.5%) compounds. In the resulting QSAR model, an accuracy of 87% on active, 95% on inactive, and 93% for all 354 compounds was achieved. The performance is only slightly better than that obtained for the original training set. These results indicate that the boundary effects tested have only marginal influence on the binary QSAR accuracy, indicating that the binary QSAR model is stable. The obtained accuracy is not critically dependent on binary classification of observed activities, which is important with respect to the analysis of screening data. [0117]
  • In order to evaluate the predictive value of the binary QSAR model, 53 randomly selected estrogen analogs were tested. Seven out of 9 active compounds (78%), and 43 out of 44 inactive compounds (98%) were correctly predicted (overall accuracy of 94%), which is consistent with the cross-validation result. The percentage of active compounds in the test set was 15%. If the compounds were selected and tested based on the binary QSAR model, the “hit rate” of active compounds would be 5-fold higher than for randomly selected compounds, even for this small data set. [0118]
  • The X-ray structures of the ligand binding domain of the ER-α receptor in complex with estradiol and raloxifene have been reported in the past. The ligands are buried within the hydrophobic core of the ligand binding domain, but the polar ends of estradiol form hydrogen bonds to the only polar amino acid residues in the binding site. Glu353 forms a hydrogen bond to the A-ring phenolic hydroxyl group and His524 forms a hydrogen bond with the 17β-hydroxyl group. The phenolic hydroxyl group is required for binding. The 3-OH group on estradiol can act as a hydrogen bond donor or acceptor, but the hydrogen bond donor ability is more important than the acceptor ability in stabilizing the complex. 3-Keto and 3-methyl ether derivatives have much lower binding affinities because they lack a hydrogen bond donor. The aromatic ring system is required for strong binding because analogs lacking aromatic moieties have only low binding affinity. It follows that structural differences between active and inactive compounds are distinct but may be quite limited. The estrogen analogs are considered to be a challenging test case for binary QSAR analysis because small structural modifications, which actually change binary activity in a more continuous way, are considered here to render compounds either active or inactive. [0119]
  • Estrogen analogs have also been studied by conventional or classical QSAR techniques. Earlier QSAR studies on estrogen analogs did not reveal a consistent positive hydrophobic contribution for receptor-ligand binding, except for substituents at the 11-β position of estradiol derivatives, although hydrophobicity expressed as log P(o/w) differs significantly among the analogs. Similarly, in the binary QSAR model, log P(o/w) was not found to be a significant descriptor. In contrast, ASA-H (which does not strictly correlate with log P(o/w) (r²=0.62)) was found to be significant. This finding suggests that the strength of van der Waals/hydrophobic interactions between ligands and receptor is more important than the differences in energy required to desolvate the hydrophobic ligands. [0120]
  • Conventional QSAR methods based on regression techniques, such as multiple linear regression, partial least squares and, occasionally, neural networks, have been used to cluster compounds. These methods seek to minimize the squared error between the model and the observed data. This optimization of the model parameters introduces sensitivity to errors in experiments and regression analyses. In contrast, binary QSAR does not use any form of regression analysis; there is no attempt to minimize the model errors with regard to model parameters. It is a nonlinear modeling method. Because no regression is used, the model estimation procedure is very fast, which is in contrast to neural networks that require a lengthy training phase. Therefore, binary QSAR can efficiently process large data sets such as HTS data. [0121]
  • Several other clustering methods have been tested to classify compounds into different clusters. These methods are qualitative in that they are based only on chemical structural information, regardless of biological activities. Compounds with similar structural features are clustered together. However, compounds with similar biological activities may appear in different clusters depending on their degree of structural similarity. In this case, identification of active clusters may be a nontrivial task. In contrast, binary QSAR takes both structure and activity information into account, and deduces a probability distribution function for a novel compound to be either active or inactive. [0122]
  • While this invention has been described as having a preferred design, it is understood that it is capable of further modification, uses and/or adaptions following in general the principle of the invention and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains, and as may be applied to the essential features set forth, and fall within the scope of the invention or the limits of the appended claims. [0123]
    TABLE 1
    STRUCTURE                                   ACTIVITY
    Figure US20040107054A1-20040603-C00001      5.80
    Figure US20040107054A1-20040603-C00002      5.92
    . . .                                       . . .
    Figure US20040107054A1-20040603-C00003      6.40
  • [0124]
    TABLE 2
    STRUCTURE                                   CLASS
    Figure US20040107054A1-20040603-C00004      0
    Figure US20040107054A1-20040603-C00005      1
    . . .                                       . . .
    Figure US20040107054A1-20040603-C00006      1
  • [0125]
    TABLE 3
    STRUCTURE (SMILES)                          CLASS
    CC(C)NS(O)(O)C1=NN−C(NC(C)=OS1              0
    CNS(O)(O)clcccc([C1])C1                     1
    . . .                                       . . .
    NS(O)(O)clccc(N=C−c2c(O)cccc1)cc1           1
  • [0126]
    TABLE 4
    Composition of Estrogen Receptor Ligands
    Category                          representative structure                  number of compounds
    Estradiol derivatives             Figure US20040107054A1-20040603-C00007    165
    3-keto steroids                   Figure US20040107054A1-20040603-C00008      2
    nonaromatic analogs               Figure US20040107054A1-20040603-C00009      4
    metahexestrol derivatives         Figure US20040107054A1-20040603-C00010     15
    hexestrol derivatives             Figure US20040107054A1-20040603-C00011     50
    diethylstilbestrol derivatives    Figure US20040107054A1-20040603-C00012     10
    triphenylethylene analogs         Figure US20040107054A1-20040603-C00013     40
    2-phenylbenzothiophene analogs    Figure US20040107054A1-20040603-C00014     68
    2-phenylindole analogs            Figure US20040107054A1-20040603-C00015     61
    indene analogs                    Figure US20040107054A1-20040603-C00016     45
    Phenol and biphenols              Figure US20040107054A1-20040603-C00017      3
  • [0127]
    TABLE 5
    Data Profiles at Different Binary Threshold Values
    threshold value (log RBA)   active compounds   inactive compounds   active %
    −2.0                        404                  6                  98%
    −1.5                        394                 16                  96%
    −1.0                        382                 28                  93%
    0.0                         307                103                  75%
    1.0                         177                233                  43%
    1.2                         146                264                  36%
    1.5                          92                318                  22%
    1.7                          62                348                  15%
    1.8                          53                357                  13%
    2.0                          27                383                   7%
  • [0128]
    TABLE 6
    Effects of PCA Number and Smoothing Factor on Binary QSAR
    PCA no.   smoothing factor
              0.08   0.10   0.12   0.14   0.16   0.20   0.25
    6         0.79   0.76   0.74   0.69   0.69   0.60   0.52
              0.63   0.61   0.60   0.60   0.55   0.48   0.45
    8         0.81   0.77   0.77   0.76   0.76   0.73   0.66
              0.71   0.71   0.71   0.69   0.68   0.65   0.55
    10        0.85   0.84   0.84   0.84   0.81   0.77   0.73
              0.68   0.68   0.66   0.68   0.69   0.68   0.65
    12        0.85   0.85   0.85   0.82   0.81   0.81   0.79
              0.69   0.71   0.76   0.76   0.73   0.73   0.71
    13        0.85   0.85   0.82   0.82   0.82   0.82   0.79
              0.63   0.66   0.69   0.69   0.69   0.66   0.68
  • [0129]
    TABLE 7
    Molecular Descriptors Used in the Binary QSAR
    symbol     description
    b-ar       number of aromatic bonds
    ASA-H      total hydrophobic accessible surface area
    0X         zero-order atomic connectivity index
    0Xv        zero-order atomic valence connectivity index
    1X         first-order atomic connectivity index
    1Xv        first-order atomic valence connectivity index
    1K         Kier first shape index
    2K         Kier second shape index
    3K         Kier third shape index
    Φ          Kier molecular flexibility index
    Peoe-PC+   total of positive charge in Gasteiger & Marsili charge model
    I,OH       indicator variable for phenolic hydroxy group; I,OH = 1 for compounds containing phenolic OH and 0 for other compounds
    I,es       indicator variable for hexestrol derivatives; I,es = 1 for hexestrol compounds and 0 for other compounds.

Claims (11)

What is claimed is:
1. A computer-based method of generating a quantitative structure activity relationship comprising:
a) calculating a numerical representation of molecules consisting of n numbers per molecule; and,
b) estimating a probability distribution that a said molecules is active.
2. A method as recited in claim 1, wherein:
a) said estimating step is calculated with Bayes Theorem.
3. A method as recited in claim 1, wherein:
a) said probability distribution of said estimating step comprises n one-dimensional distributions.
4. A method as recited in claim 1, wherein:
a) said estimating step is performed by using a means to remove linear correlations between said n numbers per molecule.
5. A method as recited in claim 4, wherein:
a) said means to remove linear correlations between said n numbers per molecule is a principal components analysis.
6. A method as recited in claim 4, wherein:
a) said means to remove linear correlations between said n numbers per molecule is a matrix diagonalization.
7. A method as recited in claim 1, wherein:
a) said estimating step is performed by using a means to remove dependencies between said n numbers per molecule.
8. A method as recited in claim 7, wherein:
a) said means to remove dependencies between said n numbers per molecule is a principal components analysis.
9. A method as recited in claim 7, wherein:
a) said means to remove dependencies between said n numbers per molecule is a matrix diagonalization.
10. A method as recited in claim 1, wherein:
a) said estimating step is performed by estimating a distribution over a single number.
11. A method as recited in claim 1, wherein:
a) said estimating step is performed by replacing a single observation with a Gaussian distribution.
US10/699,459 1998-02-19 2003-11-03 Method for determining discrete quantitative structure activity relationships Abandoned US20040107054A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/699,459 US20040107054A1 (en) 1998-02-19 2003-11-03 Method for determining discrete quantitative structure activity relationships

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB9803466.3 1998-02-19
GBGB9803466.3A GB9803466D0 (en) 1998-02-19 1998-02-19 Discrete QSAR:a machine to determine structure activity and relationships for high throughput screening
US09/252,912 US6691045B1 (en) 1998-02-19 1999-02-19 Method for determining discrete quantitative structure activity relationships
US10/699,459 US20040107054A1 (en) 1998-02-19 2003-11-03 Method for determining discrete quantitative structure activity relationships

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/252,912 Continuation US6691045B1 (en) 1998-02-19 1999-02-19 Method for determining discrete quantitative structure activity relationships

Publications (1)

Publication Number Publication Date
US20040107054A1 true US20040107054A1 (en) 2004-06-03

Family

ID=10827222

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/252,912 Expired - Lifetime US6691045B1 (en) 1998-02-19 1999-02-19 Method for determining discrete quantitative structure activity relationships
US10/699,459 Abandoned US20040107054A1 (en) 1998-02-19 2003-11-03 Method for determining discrete quantitative structure activity relationships

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/252,912 Expired - Lifetime US6691045B1 (en) 1998-02-19 1999-02-19 Method for determining discrete quantitative structure activity relationships

Country Status (5)

Country Link
US (2) US6691045B1 (en)
EP (1) EP0938055A3 (en)
JP (1) JPH11345225A (en)
CA (1) CA2262215C (en)
GB (1) GB9803466D0 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001057495A2 (en) * 2000-02-01 2001-08-09 The Government Of The United States Of America As Represented By The Secretary, Department Of Health & Human Services Methods for predicting the biological, chemical, and physical properties of molecules from their spectral properties
IL152198A0 (en) * 2000-04-12 2003-05-29 Janssen Pharmaceutica Nv Method and apparatus for detecting outliers in biological/pharmaceutical screening experiments
EP1295243A4 (en) * 2000-05-11 2010-09-01 Becton Dickinson Co System for identifying clusters in scatter plots using smoothed polygons with optimal boundaries
US7529718B2 (en) 2000-08-14 2009-05-05 Christophe Gerard Lambert Fast computer data segmenting techniques
AU2002217843A1 (en) * 2000-11-06 2002-05-15 Thrasos, Inc. Computer method and apparatus for classifying objects
AU2002240131A1 (en) * 2001-01-26 2002-08-06 Bioinformatics Dna Codes, Llc Modular computational models for predicting the pharmaceutical properties of chemical compounds
US20050214757A1 (en) * 2001-10-15 2005-09-29 Wilson David I Diagnosis and therapy of conditions by detection or modulation of the alms1 gene or protein
US20030167135A1 (en) * 2001-12-19 2003-09-04 Camitro Corporation Non-linear modelling of biological activity of chemical compounds
US7777743B2 (en) * 2002-04-19 2010-08-17 Computer Associates Think, Inc. Viewing multi-dimensional data through hierarchical visualization
US7444310B2 (en) * 2002-04-19 2008-10-28 Computer Associates Think, Inc. Automatic model maintenance through local nets
AU2003241302A1 (en) * 2002-04-19 2003-11-03 Computer Associates Think, Inc Using neural networks for data mining
WO2005001743A1 (en) * 2003-06-11 2005-01-06 Verachem Llc Fast assignment of partial atomic charges
WO2006004986A1 (en) * 2004-06-29 2006-01-12 Pharmix Corporation Estimating the accuracy of molecular property models and predictions
US7435037B2 (en) * 2005-04-22 2008-10-14 Shell Oil Company Low temperature barriers with heat interceptor wells for in situ processes
WO2008062680A1 (en) * 2006-11-24 2008-05-29 Nec Corporation System, method and program for evaluating the performance of apparatus for predicting interaction between molecules
US20120059599A1 (en) * 2010-09-03 2012-03-08 University Of Louisville Hybrid fragment-ligand modeling for classifying chemical compounds
US9201916B2 (en) * 2012-06-13 2015-12-01 Infosys Limited Method, system, and computer-readable medium for providing a scalable bio-informatics sequence search on cloud
CN106126793B (en) * 2016-06-17 2019-03-05 四川大学 Rock crushing Key Blocks localization method based on discrete element method
US10768893B2 (en) 2017-11-20 2020-09-08 Accenture Global Solutions Limited Using similarity analysis and machine learning techniques to manage test case information
CN115308158B (en) * 2022-10-12 2023-01-06 中国人民解放军国防科技大学 Method and device for quantitatively judging activity of biological material based on extinction characteristic

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5025388A (en) * 1988-08-26 1991-06-18 Cramer Richard D Iii Comparative molecular field analysis (CoMFA)
US5526281A (en) * 1993-05-21 1996-06-11 Arris Pharmaceutical Corporation Machine-learning approach to modeling biological activity for molecular design and to modeling other characteristics
US5703792A (en) * 1993-05-21 1997-12-30 Arris Pharmaceutical Corporation Three dimensional measurement of molecular diversity
US5434796A (en) * 1993-06-30 1995-07-18 Daylight Chemical Information Systems, Inc. Method and apparatus for designing molecules with desired properties by evolving successive populations
US5463564A (en) * 1994-09-16 1995-10-31 3-Dimensional Pharmaceuticals, Inc. System and method of automatically generating chemical compounds with desired properties
US5574656A (en) * 1994-09-16 1996-11-12 3-Dimensional Pharmaceuticals, Inc. System and method of automatically generating chemical compounds with desired properties
US5684711A (en) * 1994-09-16 1997-11-04 3-Dimensional Pharmaceuticals, Inc. System, method, and computer program for at least partially automatically generating chemical compounds having desired properties
US5699268A (en) * 1995-03-24 1997-12-16 University Of Guelph Computational method for designing chemical structures having common functional characteristics

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080234135A1 (en) * 2007-03-22 2008-09-25 Infosys Technologies Ltd. and Indian Institute of Science, Bangalore Ligand identification and matching software tools
US20080234996A1 (en) * 2007-03-22 2008-09-25 Infosys Technologies Ltd. and Indian Institute of Science,Bangalore Annotating descriptions of chemical compounds
US8468001B2 (en) 2007-03-22 2013-06-18 Infosys Limited Ligand identification and matching software tools
US8468002B2 (en) 2007-03-22 2013-06-18 Infosys Limited Annotating descriptions of chemical compounds
WO2011041247A1 (en) * 2009-10-02 2011-04-07 Exxonmobil Research And Engineering Company A system for the determination of selective absorbent molecules through predictive correlations
US20110202328A1 (en) * 2009-10-02 2011-08-18 Exxonmobil Research And Engineering Company System for the determination of selective absorbent molecules through predictive correlations
WO2019009451A1 (en) * 2017-07-06 2019-01-10 부경대학교 산학협력단 Method for screening new targeted drugs through numerical inversion of quantitative structure-performance relationship and molecular dynamics computer simulation
US11705224B2 (en) 2017-07-06 2023-07-18 Pukyong National University Industry-University Cooperation Foundation Method for screening of target-based drugs through numerical inversion of quantitative structure-(drug)performance relationships and molecular dynamics simulation

Also Published As

Publication number Publication date
EP0938055A2 (en) 1999-08-25
US6691045B1 (en) 2004-02-10
CA2262215C (en) 2005-04-26
EP0938055A3 (en) 2002-06-12
CA2262215A1 (en) 1999-08-19
JPH11345225A (en) 1999-12-14
GB9803466D0 (en) 1998-04-15

Legal Events

Date Code Title Description
AS Assignment

Owner name: CHEMICAL COMPUTING GROUP INC., CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LABUTE, PAUL;REEL/FRAME:015112/0557

Effective date: 20040304

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION