WO2010045475A1 - Techniques for predicting HIV viral tropism and classifying amino acid sequences

Info

Publication number: WO2010045475A1
Application number: PCT/US2009/060871
Authority: WO
Inventors: Guochun Liao; Ming Zheng
Original assignee: F. Hoffmann-La Roche AG
Priority date: 2008-10-17
Filing date: 2009-10-15
Publication date: 2010-04-22
Also published as: CA2740879A1; JP2012506099A; CN102203603A; EP2347255A1

Abstract

Techniques for categorizing a test sequence are disclosed. An exemplary technique includes defining and utilizing Position Specific Score Matrices that take into account dependencies of adjacent positions. An embodiment includes predicting HIV viral tropism with improved specificity and sensitivity. Another embodiment includes subdividing a training data set into a set of data subsets, training a plurality of classifiers based on the data subsets, and taking a vote of the plurality of classifiers. Yet another embodiment relates to weighting, in creating a training set, a specified data point based on a distance from the specified data point to an average of a reference plurality of data points.

Description

TECHNIQUES FOR PREDICTING HIV VIRAL TROPISM AND CLASSIFYING AMINO ACID SEQUENCES

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

[0001] The present patent application claims benefit of priority to US Provisional Patent Application No. 61/106,405, filed October 17, 2008, which is incorporated by reference.

BACKGROUND OF THE INVENTION

[0002] Embodiments of the present invention generally relate to techniques for sequence-based testing. More particularly, the present invention relates to improving computational techniques for predicting HIV viral tropism.

[0003] HIV is a lentivirus (a member of the retrovirus family), infection with which can lead to acquired immunodeficiency syndrome (AIDS), a condition in humans in which the immune system begins to fail under influence of the virus. HIV primarily infects vital cells in the human immune system such as helper T cells (specifically CD4+ T cells), macrophages and dendritic cells, which can lead to decreased immune response. When CD4+ T cell numbers decline below a critical level, cell-mediated immunity is lost, and the body becomes progressively more susceptible to opportunistic infections.

[0004] One pathway through which HIV viruses enter human host cells is by recognizing and binding to CD4 on the cell membrane and recruiting at least one of two co-receptors, CCR5 or CXCR4. Patients infected with viruses that recruit only CCR5 sites can be treated with CCR5-antagonist based drugs. It is therefore helpful to correctly identify viral tropism to aid in the effective administration of drugs. Unfortunately, the current techniques for HIV tropism determination have many limitations.

[0005] Many efforts have been made to construct a classifier to predict tropism based on the V3 loop of the gp120 protein of HIV, which is currently believed to be the dominant determinant of tropism and consists of around 35 amino acids.

[0006] A simple charge rule was first introduced to predict HIV tropism by de Jong et al., J Virol 66(2):757-765 (1992) and Fouchier et al., J. Clin. Microbiol. 33(4):906-911 (1995). In one version, the rule classifies the virus as "using CXCR4" if there is a positively charged amino acid at position 11 or 25, and as "not using CXCR4" otherwise. In 2001, Resch et al., Virology 288(1):51-62 (2001) proposed a neural network model using 16 amino acids in the V3 loop to predict tropism. Pillai et al., AIDS Res Hum Retroviruses 19(2):145-149 (2003) introduced machine learning methods that included decision trees and Support Vector Machines (SVM).
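By way of illustration only, the charge rule described above can be sketched as follows. The sketch assumes 1-based positions on an aligned 35-residue V3 loop and assumes that "positively charged" means lysine (K) or arginine (R); the function name is hypothetical and is not taken from the cited references.

```python
# Hypothetical helper illustrating the 11/25 charge rule.
# Assumption: "positively charged" is taken to mean K or R only.
POSITIVE = {"K", "R"}


def uses_cxcr4_by_charge_rule(v3_loop: str) -> bool:
    """Return True if position 11 or 25 (1-based) holds a positively charged residue."""
    if len(v3_loop) < 25:
        raise ValueError("sequence too short for positions 11 and 25")
    # Python strings are 0-based, so positions 11 and 25 are indices 10 and 24.
    return v3_loop[10] in POSITIVE or v3_loop[24] in POSITIVE
```

A sequence for which the function returns False would be called "not using CXCR4" under this version of the rule.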

[0007] Jensen et al. (J Virol 77(24):13376-13388 (2003)) proposed a Position Specific Score Matrix (PSSM) method for predicting tropism. In 2004, Sing et al. (Learning mixtures of localized rules by maximizing the area under the ROC curve. Jose Hernandez-Orallo, editor, 1st International Workshop on ROC Analysis in Artificial Intelligence, pages 89-96, Valencia, Spain, August 2004) proposed using mixtures of localized rules learned by maximizing the area under the ROC curve for tropism prediction purposes. A rigorous comparison of these and other methods was performed by Sing in 2004 (Master's thesis, Max Planck Institute for Informatics, 2004) using public data downloaded from Los Alamos National Lab (LANL). The performances of some of these methods in terms of the sensitivities at 99%, 95% and 90% specificity are summarized in Figs. 1A and 1B.

[0008] Sensitivity and specificity are statistical measures of the performance of a binary classification test. Sensitivity is defined as the proportion of positives that are correctly identified - e.g., the percentage of sick people who are correctly identified as having the condition. Specificity is defined as the proportion of negatives that are correctly identified - e.g., the percentage of well people who are correctly identified as not having the condition. These concepts are closely related to the general concepts of type I and type II errors.
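For concreteness, the two measures can be computed from true and predicted labels as in the following minimal sketch. It assumes binary labels in which a truthy value marks the positive class (for example, a CXCR4-using virus); that labeling convention and the function name are illustrative assumptions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels where True/1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # positives called positive
    fn = sum(1 for t, p in pairs if t and not p)      # positives missed
    tn = sum(1 for t, p in pairs if not t and not p)  # negatives called negative
    fp = sum(1 for t, p in pairs if not t and p)      # negatives called positive
    sensitivity = tp / (tp + fn)  # proportion of positives correctly identified
    specificity = tn / (tn + fp)  # proportion of negatives correctly identified
    return sensitivity, specificity
```

The false positive rate plotted on a ROC curve is 1 - specificity, and the true positive rate is the sensitivity.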

[0009] As can be seen from Figs. 1A and 1B, a gain in specificity usually comes at the expense of sensitivity, and vice versa. For example, at a specificity of 99%, sensitivity ranged from 22% to 44%. At a specificity of 95%, sensitivity ranged from 55% to 74%. At a specificity of 90%, sensitivity ranged from 66% to 79%.

[0010] When viruses with CXCR4 tropism or dual tropism can be effectively identified as one group and those with CCR5-only tropism identified as a distinct and separate group, CCR5-antagonist based drugs can be more effectively administered in the clinical setting.

BRIEF SUMMARY OF THE INVENTION

[0011] We have now invented an improved computational prediction method that meets more stringent clinical requirements. Applicants hereby disclose techniques for improved prediction of HIV viral tropism.

[0012] According to an embodiment of the invention, techniques for categorizing a test sequence as a first class (e.g. CXCR4) or a second class (e.g. CCR5) are disclosed. An exemplary technique includes providing a first training set that includes a plurality of sequences of the first class and a second training set that includes a plurality of sequences of the second class. The technique includes determining a plurality of probabilities associated with a plurality of positions that take into account the dependency between elements in adjacent positions.

[0013] An embodiment provides that the technique includes determining a plurality of probabilities associated with a plurality of positions wherein the plurality of positions include a position, a preceding position, and a succeeding position. The technique includes determining a probability that a position on a sequence of the first class and a position on the test sequence are occupied by elements belonging to a first specific category, given that a preceding position on the sequence of the first class and a preceding position on the test sequence are occupied by elements belonging to a second specific category, and given that a succeeding position on the sequence of the first class and a succeeding position on the test sequence are occupied by elements belonging to a third specific category.

[0014] The technique includes determining a probability that a position on a sequence of the second class and a position on the test sequence are occupied by elements belonging to a fourth specific category, given that a preceding position on the sequence of the second class and a preceding position on the test sequence are occupied by elements belonging to a fifth specific category, and given that a succeeding position on the sequence of the second class and a succeeding position on the test sequence are occupied by elements belonging to a sixth specific category.

[0015] According to an embodiment, two pluralities of elements, one on a first sequence and another on a second sequence, are considered to be of the same type if each of every pair of corresponding elements belongs to a specific predetermined category of amino acids. Depending on the embodiments, the predetermined categories of amino acids can be defined differently. The categorization can be used to reduce the complexity of the calculations required to compare sequence similarities.

[0016] According to one embodiment, the 20 known amino acids are divided into four categories. The first category consists of H, K and R (histidine, lysine, and arginine, respectively); the second category consists of A, F, I, L, M, P, V and W (alanine, phenylalanine, isoleucine, leucine, methionine, proline, valine, and tryptophan); the third category consists of C, G, N, Q, S, T and Y (cysteine, glycine, asparagine, glutamine, serine, threonine, and tyrosine); and the fourth category consists of D and E (aspartic acid and glutamic acid).

[0017] In another embodiment, the 20 known amino acids are divided into twelve categories. The first category consists of A and P; the second category consists of F and W; the third category consists of I, L and V; the fourth category consists of M; the fifth category consists of H; the sixth category consists of K and R; the seventh category consists of D; the eighth category consists of E; the ninth category consists of N, S and T; the tenth category consists of Q; the eleventh category consists of C and G; and the twelfth category consists of Y.

[0018] According to an embodiment, the technique for categorizing a test sequence as a first class (e.g. CXCR4) or a second class (e.g. CCR5) includes determining a score for the test sequence based on the above-described plurality of probabilities and categorizing the test sequence as the first class or the second class based on the score.

[0019] Another embodiment of the invention provides for a technique for classifying a test data point based on a vote of a multitude of classifiers. The technique includes providing a training set that includes a plurality of data points and subdividing the plurality of data points into a plurality of data subsets. In a specific embodiment, data points taken from each patient may be grouped into a data subset. In another specific embodiment, data points taken from patients of a particular locale (for example, a city) may instead be grouped into one specific data subset.

[0020] The technique includes forming a plurality of training sets, each formed with one data point from each data subset, and training a plurality of classifiers, each based on one of the plurality of training sets. In an embodiment where data points from each patient are grouped in a specific data subset, each training set is made of data points where each data point is obtained from a separate patient and the total number of data points equals the number of patients. In an embodiment where data points from each locale are grouped in a specific data subset, each training set is made of data points where each data point is obtained from a separate locale and the total number of data points equals the number of locales.

[0021] The technique further includes determining a plurality of tentative categorizations for the test data point using a plurality of classifiers trained on the training sets determined above. The technique includes categorizing a test data point based on a vote of the plurality of tentative categorizations. The plurality of data points that can be associated with this embodiment can include biomarkers, amino acid sequences, nucleotide sequences, and the like.

[0022] Another embodiment of the invention provides for a technique for training a classifier based on weighting individual data points in accordance with a distance from some reference plurality of data points. The reference plurality of data points can be defined globally to be the total data points - or individually for each individual data point to be the total data points excluding the individual data point in question.

[0023] Depending on the embodiment, the weighting can be based on a linear distance, a geometric distance, or other types of distance. By over-weighting outlier under-sampled data points (i.e., the points far away from the reference plurality of data points), the method attempts to compensate for the under-sampled data points relative to the over-sampled data points (i.e., the points near to the reference plurality of data points).

[0024] In this embodiment, some data points are derived from over-sampled sources, while others are from relatively under-sampled sources. The technique includes weighting each of the plurality of data points in accordance with a distance from an average of some reference plurality of data points. The plurality of data points that can be associated with this embodiment can include biomarkers, amino acid sequences, nucleotide sequences, and the like.
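The distance-based weighting described above may be sketched as follows. This is a minimal illustration under stated assumptions: data points are assumed to be already encoded as equal-length numeric feature vectors, the distance is assumed to be Euclidean, and each weight is assumed to be proportional to the distance; none of these choices is prescribed by the text. The function name and the `leave_one_out` flag are hypothetical.

```python
import math


def distance_weights(points, leave_one_out=False):
    """Weight each point by its Euclidean distance to the mean of the reference points.

    points: list of equal-length numeric feature vectors (an assumed encoding).
    leave_one_out: if True, the reference set for each point excludes that point
        (requires at least two points); otherwise the reference set is all points.
    Returns weights normalized so that they sum to len(points).
    """
    n, dim = len(points), len(points[0])
    totals = [sum(p[d] for p in points) for d in range(dim)]
    dists = []
    for p in points:
        if leave_one_out:
            center = [(totals[d] - p[d]) / (n - 1) for d in range(dim)]
        else:
            center = [totals[d] / n for d in range(dim)]
        dists.append(math.dist(p, center))
    total = sum(dists)
    if total == 0:  # all points coincide: fall back to uniform weights
        return [1.0] * n
    return [n * d / total for d in dists]
```

With three identical points and one outlier, the outlier receives the largest weight, which is the over-weighting of under-sampled points that the embodiment aims for.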

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] Fig. 1A is a simplified ROC plot showing the performance of various previously existing techniques for predicting HIV viral tropism;

[0026] Fig. 1B is a simplified graph focusing on exemplary regions of interest for prediction of HIV viral tropism;

[0027] Fig. 2 is a simplified diagram illustrating generally a technique for classifying a test data point according to an embodiment of the invention;

[0028] Fig. 3 is a simplified diagram illustrating a technique for determining a test data point as belonging to a predetermined category using Position Specific Score Matrixes according to an embodiment of the invention;

[0029] Figs. 4A, 4B, and 4C illustrate three mathematical models associated with Position Specific Score Matrixes for predicting tropism according to embodiments of the invention;

[0030] Figs. 5A and 5B illustrate two embodiments for categorizing amino acids according to an embodiment of the invention;

[0031] Fig. 6 is a simplified flow chart illustrating techniques for classifying a data point based on a vote of a multitude of classifiers;

[0032] Fig. 7 is a simplified flow diagram illustrating techniques for weighting a training set according to an embodiment of the invention; and

[0033] Fig. 8 is a simplified block diagram of a computer system that can be used to practice various embodiments of the invention described in this application.

DETAILED DESCRIPTION OF THE INVENTION

[0034] Embodiments of the present invention can be applied to techniques for gene-based testing. More particularly, the present invention is useful for improving computational techniques for predicting HIV viral tropism.

A. Techniques for Categorizing a Test Sequence Taking into Account Dependencies Between Elements in Adjacent Positions on a Sequence

[0035] An embodiment of the invention provides for techniques for categorizing a test sequence based on an improved PSSM-based model. Position Specific Score Matrices offer a way to represent information in training sets in terms of probabilities that an element will occupy a particular position on a hypothetical sequence. Position Specific Score Matrices can be used to estimate the probability that two hypothetical sequences belong to a same class by comparing the specificity of each element on the two hypothetical sequences.

[0036] In an exemplary PSSM, each column (or row, as the case may be, depending on the embodiment) can represent a type of element (e.g., A, C, G, or T for DNA sequences; one of twenty known amino acids for protein sequences). If an element of type A at position I is strongly conserved across all known binding sites, for example, then a normalized version of the matrix may be 1 at i=I, j=A, and 0 at i=I, j≠A. In general, the probability that each position contains a certain type of element is determined independently of elements in adjacent positions.

[0037] According to an embodiment, however, the hypothesis that adjacent positions are occupied by independent elements does not always hold in reality. Applicants introduce a probabilistic model that takes into account a moderate amount of dependency between elements. According to an embodiment, Applicants introduce PSSM models with dependent probabilities to better process the joint distribution of the sequences for estimating HIV tropism.

[0038] In one embodiment, a Markov probability model is assumed, where each position depends on the position before it. By relaxing the imposition of a unidirectional dependency in a sequence, a more dedicated Markov model can also be created that assumes that each position depends on its immediate neighbors.

[0039] Fig. 1A is a simplified ROC plot showing the performance of various previously existing techniques for predicting HIV viral tropism. Depicted along the x axis is the false positive rate, which by definition is equal to (1 - specificity). Depicted on the y axis is the true positive rate, which by definition is equal to the sensitivity. A rigorous comparison of these methods is performed based on public data downloaded from Los Alamos National Lab (LANL).
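Reading a sensitivity at a fixed specificity off such a ROC curve amounts to choosing a score threshold. The following is a minimal sketch, assuming that each sample has a numeric score, that higher scores indicate the positive class, and that a sample is called positive when its score is strictly greater than the threshold; the function name and these conventions are illustrative assumptions.

```python
def sensitivity_at_specificity(neg_scores, pos_scores, target_specificity):
    """Find the lowest threshold whose specificity meets the target.

    Returns (threshold, true_positive_rate, false_positive_rate).
    A sample is called positive when its score is strictly greater than the threshold.
    """
    for threshold in sorted(set(neg_scores) | set(pos_scores)):
        specificity = sum(s <= threshold for s in neg_scores) / len(neg_scores)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in pos_scores) / len(pos_scores)
            return threshold, sensitivity, 1.0 - specificity
    raise ValueError("no threshold reaches the requested specificity")
```

Sweeping `target_specificity` over a range of values yields the (false positive rate, true positive rate) pairs that make up the plotted curve.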

[0040] Fig. 1B is a simplified graph focusing on exemplary regions of interest according to embodiments of the invention. The graph shows, among others, the performances of previously known techniques for predicting HIV viral tropism. The range of interest, according to the embodiment, spans specificities from 90% to 99%. The corresponding range of sensitivity extends from a little over 20% to about 80%.

Table I: Comparison of the performance of several techniques for predicting HIV viral tropism.

[0041] Table I is a simplified illustration of the performance of several techniques for predicting HIV viral tropism, including embodiments of the current invention ("New method" in the Table). The results are shown as sensitivities at various specificities.

[0042] As can be seen from Table I and Figs. 1A and 1B, a gain in specificity usually comes at the expense of sensitivity, and vice versa. For example, at a specificity of 99%, the range of sensitivity extends from 22% ("Decision Tree") to 44% ("SVM: linear kernel"). At a specificity of 95%, the range of sensitivity extends from 55% to 74%. At a specificity of 90%, the range of sensitivity extends from 66% to 79%.

[0043] The specificity and sensitivity can both be improved through techniques provided by embodiments of the invention. At a specificity of 99%, for example, the sensitivities have been improved from an average of around 40% for previously known techniques (e.g., average of 37% for PSSM and 44% for SVM: linear kernel) to 57-58% for exemplary techniques disclosed by Applicants. Similarly, at a specificity of 95%, sensitivities have been improved from around 69% to 77%. At a specificity of 90%, sensitivities have been improved from 78% to 85%.

[0044] Fig. 2 is a simplified flowchart illustrating generally a technique for classifying a test data point according to an embodiment of the invention. According to this embodiment, the techniques include steps for determining a training set of a first class (2010), determining a training set of a second class (2020), and training a classifier that can be used to categorize a test data point as belonging to the first class or the second class (2030). The classifier should be based on one or more Position Specific Scoring Matrices that take into account the dependencies between elements in adjacent positions on a sequence.

[0045] In the most basic case, a classifier is trained to recognize one class of data point (i.e., the first class). Data points that are recognized not to be of the first class are assigned to the second class. Depending on the embodiments, a classifier may also be trained to recognize more than two classes of data points - in which case multiple sets of training sets corresponding to multiple classes may be needed.

[0046] In some embodiments, more than one classifier may be used, as disclosed in some embodiments below. Depending on the embodiments, some of the classifiers may be specialized to recognize a subset of classes while others may be specialized to recognize other subsets of classes. In other embodiments, each of the classifiers may be trained to recognize all the sets of classes. Many such variations are possible and recognizable to one of skill in the art. All of these variations are contemplated as part of the invention and fall under the scope of the current application.

[0047] In general, there are many ways to train a classifier based on one or more Position Specific Scoring Matrices. Fig. 3 is a simplified diagram illustrating a technique for determining a test data point based on Position Specific Scoring Matrices that take into account dependencies of adjacent positions. According to an embodiment, the training of a classifier is based on determining one or more Position Specific Score Matrices. An exemplary technique includes creating a first Position Specific Score Matrix based on the training set of the first class wherein individual entries of each matrix take into account dependencies between positions on a prototypical sequence (of the first class) (3110). The technique includes creating a second Position Specific Score Matrix based on the training set of the second class wherein individual entries of each matrix take into account dependencies between positions on a prototypical sequence (of the second class) (3120).

[0048] Based on these two Position Specific Score Matrices, the technique includes processing a first score associated with the first Position Specific Score Matrix and processing a second score associated with the second Position Specific Score Matrix (3130). The technique then includes determining a classification score for a test data point based on a ratio between the first score and the second score (3140).

[0049] Figs. 4A, 4B, and 4C illustrate three exemplary mathematical models that can be used to create the first Position Specific Score Matrix and the second Position Specific Score Matrix for predicting tropism.

[0050] Traditionally, the PSSM method involves modeling the different amino acids in the V3 loop as statistically independent entities constituting a whole sequence. In mathematical terms, such a PSSM method models the likelihood of a sequence such as the V3 loop sequence, $S = \{s_1, s_2, \ldots, s_N\}$, as $\prod_{i=1}^{N} f^{X4}_{s_i,i}$ (or $\prod_{i=1}^{N} f^{R5}_{s_i,i}$), where $N$ is the number of elements in a sequence, which for the V3 loop sequence is 35.

[0051] A score for determining the likelihood that a sample virus can (or cannot) use CXCR4 as co-receptor can be calculated by a formula such as

$$\mathrm{Score}(S) = \sum_{i=1}^{N} \log \frac{f^{X4}_{s_i,i}}{f^{R5}_{s_i,i}}$$

[0052] For each amino acid $a$ ($= 1, 2, \ldots, 20$) (i.e., the 20 known amino acids) at each position $i$ ($= 1, 2, \ldots, 35$), $f^{X4}_{a,i}$ and $f^{R5}_{a,i}$ are estimated from the training data. According to an embodiment, the method requires that all sequences be of identical lengths. In this embodiment, all sequences can be aligned against a common reference HIV V3 loop sequence compiled from sequences in the LANL database. Insertions can be removed such that only the remaining amino acids are considered. In one embodiment, gaps can be inserted when necessary, and contribute a 0 to the score. In this model, the elements occupying the various positions on a sequence are considered to be independent of each other.
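The independent-position model described above can be illustrated with the following minimal sketch, which estimates per-position residue frequencies from two aligned training sets and scores a test sequence by summing per-position log ratios, with gaps contributing 0. The gap symbol "-", the pseudo count of 0.1 used for smoothing at this stage, the natural logarithm, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"  # assumed gap symbol for aligned sequences


def position_frequencies(sequences, pseudo_count=0.1):
    """Estimate per-position residue frequencies from aligned, equal-length sequences."""
    table = []
    for i in range(len(sequences[0])):
        counts = Counter(seq[i] for seq in sequences if seq[i] != GAP)
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudo_count * len(AMINO_ACIDS)
        table.append({a: (counts[a] + pseudo_count) / total for a in AMINO_ACIDS})
    return table


def pssm_score(test_seq, freq_x4, freq_r5):
    """Sum of per-position log ratios of CXCR4 to CCR5 frequencies; gaps contribute 0."""
    score = 0.0
    for i, residue in enumerate(test_seq):
        if residue == GAP:
            continue
        score += math.log(freq_x4[i][residue] / freq_r5[i][residue])
    return score
```

Under these assumptions a positive score favors the CXCR4 class and a negative score favors the CCR5 class; the decision threshold would in practice be tuned to a target specificity.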

[0053] According to an embodiment, a moderate amount of dependency of adjacent amino acids can be introduced to better model the joint distribution of the sequences in each class or training set. According to the embodiment, a Markov probability model can be assumed where each position depends on the position before it. According to the embodiment, by relaxing the imposition of a unidirectional dependency in a sequence, a more dedicated Markov model can also be created that assumes that each position depends on its immediate neighbors, as depicted in Fig. 4A for CXCR4 and Fig. 4B for CCR5.

[0054] According to the embodiment illustrated in Fig. 4A, $f^{X4}_{s_1,1}$ and $f^{X4}_{s_{35},35}$ can be defined as in traditional PSSM (for the beginning and ending elements), but for intermediate elements, $g^{X4}_i(s_i \mid s_{i-1}, s_{i+1})$ can be used to represent the probability that an element (e.g., amino acid) at the i-th position is $s_i$ given that the corresponding virus can use CXCR4 as the co-receptor and the surrounding amino acids are $s_{i-1}$ and $s_{i+1}$, respectively. According to the embodiment illustrated in Fig. 4B, $f^{R5}_{s_1,1}$ and $f^{R5}_{s_{35},35}$ are defined as in PSSM, and $g^{R5}_i(s_i \mid s_{i-1}, s_{i+1})$, similarly, can be defined to represent the probability that an element (e.g., amino acid) at the i-th position is $s_i$ given that the corresponding virus can use CCR5 as the co-receptor and the surrounding amino acids are $s_{i-1}$ and $s_{i+1}$, respectively.

[0055] A ratio of the pseudo-likelihood scores based on CXCR4 and CCR5 introduced in Figs. 4A and 4B can be determined such as that shown in Fig. 4C. Depending on the specific embodiments, the scores can be further modified with a weighting factor $w_i$ to allow different positions to be weighted differently in accordance with predetermined knowledge relating to the relative importance of the positions with respect to determining tropism, as shown in Fig. 4D.

[0056] Positions 5, 11 and 25 have been shown to be particularly important in the literature and can be given a weight of 3, according to one embodiment. Positions 7, 8, 10, 13, 18-22, 24, 27 and 32 have been shown in Resch et al. 2001 also to be important and can be given a weight of 2, according to this embodiment. Other positions can be assigned to have a regular weight of 1.
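The position weights listed above can be tabulated as in the following sketch. How a weight enters the score is not recoverable from the text alone, so the sketch assumes that each weight simply multiplies the log ratio of its position; that combination rule and both function names are illustrative assumptions.

```python
def position_weights(length=35):
    """Build the weights described above; list index 0 corresponds to position 1."""
    weights = [1] * length  # regular weight for all other positions
    for pos in (7, 8, 10, 13, 18, 19, 20, 21, 22, 24, 27, 32):
        weights[pos - 1] = 2
    for pos in (5, 11, 25):
        weights[pos - 1] = 3
    return weights


def weighted_score(per_position_log_ratios, weights):
    """Assumed combination rule: sum of weight times log ratio over all positions."""
    if len(per_position_log_ratios) != len(weights):
        raise ValueError("length mismatch between log ratios and weights")
    return sum(w * r for w, r in zip(weights, per_position_log_ratios))
```

With these values, 3 positions carry weight 3, 12 positions carry weight 2, and the remaining 20 positions carry weight 1.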

[0057] Figs. 5A and 5B illustrate two embodiments for categorizing amino acids, adapted to simplify the calculation needed for determining PSSM. According to an embodiment, to estimate $g^{X4}_i(s_i \mid s_{i-1}, s_{i+1})$ at position $i$, 20x20 = 400 distributions need to be estimated from the training data, corresponding to the 400 possible combinations of $(s_{i-1}, s_{i+1})$. This can result in a very computationally-intensive model. In one embodiment, the model can be simplified by merging amino acids into specific predetermined categories.

[0058] As an example, the 20 amino acids can be grouped into 4 predetermined categories according to their physico-chemical properties: H, K and R as category 1; A, F, I, L, M, P, V and W as category 2; C, G, N, Q, S, T and Y as category 3; and D and E as category 4, as shown in Fig. 5A. Alternatively, the 20 amino acids can be grouped into 12 smaller predetermined categories: A and P as category 1; F and W as category 2; I, L and V as category 3; M as category 4; H as category 5; K and R as category 6; D as category 7; E as category 8; N, S and T as category 9; Q as category 10; C and G as category 11; and Y as category 12, as shown in Fig. 5B.

[0059] In one embodiment, when calculating with respect to $s_{i-1}$ and $s_{i+1}$, only the 4 major categories are used. In such embodiments, only 4x4 = 16 distributions need to be calculated instead of 400 distributions. In some embodiments, the distribution $g^{X4}_i(s_i \mid s_{i-1}, s_{i+1})$ can be defined on the 12 small categories instead of the 20 amino acids for $s_i$, further simplifying the calculations.
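The two groupings can be written down as lookup tables, as in the following sketch. The category memberships are taken from the description; the variable and function names are hypothetical, and the use of 0-based string indices is an implementation assumption.

```python
# Category memberships as listed in the description (one-letter amino acid codes).
FOUR_CATEGORIES = {1: "HKR", 2: "AFILMPVW", 3: "CGNQSTY", 4: "DE"}
TWELVE_CATEGORIES = {
    1: "AP", 2: "FW", 3: "ILV", 4: "M", 5: "H", 6: "KR",
    7: "D", 8: "E", 9: "NST", 10: "Q", 11: "CG", 12: "Y",
}


def invert(categories):
    """Map each one-letter amino acid code to its category number."""
    return {aa: cat for cat, members in categories.items() for aa in members}


AA_TO_MAJOR = invert(FOUR_CATEGORIES)    # used for the two neighbors
AA_TO_MINOR = invert(TWELVE_CATEGORIES)  # used for the center residue


def neighbor_context(seq, i):
    """Return the (left, right) major categories around 0-based index i."""
    return AA_TO_MAJOR[seq[i - 1]], AA_TO_MAJOR[seq[i + 1]]


# Number of neighbor contexts per position: 4 x 4 instead of 20 x 20.
N_CONTEXTS = len(FOUR_CATEGORIES) ** 2
```

Each position then needs 16 conditional distributions over 12 center categories, rather than 400 distributions over 20 amino acids.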

[0060] In some embodiments, at each position, the marginal distribution of the amino acids in each group, $f^{X4}_{a,i}$ and $f^{R5}_{a,i}$, where $a = 1, 2, \ldots, 12$ and $i = 1, 2, \ldots, 35$, is estimated, where the small category identities of the amino acids, instead of the amino acids themselves, constitute the sample space of the marginal distribution. In an exemplary calculation, a pseudo count can be set to 0.1, an arbitrary small number. When the conditional probabilities $g^{X4}_i(a \mid a_{-1}, a_{1})$, where $i = 1, 2, \ldots, 35$, $a = 1, 2, \ldots, 12$ and $a_{-1}, a_{1} = 1, 2, \ldots, 4$, are estimated, because of the small number of sequences in the training data with $a_{-1}$ at position $i - 1$ and $a_{1}$ at position $i + 1$, $\gamma \cdot f^{X4}_{a,i}$ can be used as the pseudo counts.

[0061] In a specific embodiment, $\gamma$ is a constant that can be used to adjust the contribution of the input training sequences. This factor has been set to 10 for many of the data sets tested. In these embodiments, the pseudo counts represent the best guess of the conditional probabilities without the actual information. Generally, without looking at the actual combination of $(a, a_{-1}, a_{1})$, the best guess is just the marginal distribution. In these embodiments, the larger $\gamma$ is, the smaller the contribution from each sequence with $a_{-1}$ at position $i - 1$ and $a_{1}$ at position $i + 1$ is.

[0062] Depending on the embodiment, when a gap is present in a sequence compared to a standard aligned sequence, elements associated with the gap should be modified. If $s_i$ is a gap in an aligned sequence, the contribution of $s_i$ to the score can be set to 0 automatically. In the embodiment, when $s_i$ is an actual amino acid, two possibilities can exist. In a first case, only one of $s_{i-1}$ and $s_{i+1}$ is a gap; in a second case, both $s_{i-1}$ and $s_{i+1}$ are gaps. To deal with the first case, a log ratio of the partial conditional distributions, i.e., $g^{X4}_i(a \mid a_{-1} \text{ at position } i - 1)$ (or $g^{X4}_i(a \mid a_{1} \text{ at position } i + 1)$, depending on which of $s_{i-1}$ and $s_{i+1}$ is a gap), can be used to replace the corresponding term in the score calculation. The partial conditional distributions can be estimated in a similar manner to $g^{X4}_i(a \mid a_{-1}, a_{1})$, except that according to one embodiment the constant $\gamma$ can be set to 5, as the number of sequences with $a_{-1}$ at position $i - 1$ or $a_{1}$ at position $i + 1$ is larger. In the second case, the marginal distribution can be used to replace the conditional distributions to calculate the log ratio.
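The role of the pseudo counts can be made concrete with the following sketch, which estimates the conditional distribution of the center category given its two neighbor categories at a single position. It assumes that the pseudo count for each center category is the constant gamma times the marginal probability of that category, and that the estimate is normalized by the context count plus gamma; the normalization, the data layout and the function name are illustrative assumptions rather than the disclosed implementation.

```python
from collections import Counter


def smoothed_conditional(center_cats, left_cats, right_cats, marginal, gamma=10.0):
    """Estimate P(center | left, right) at one position with pseudo counts gamma * marginal.

    center_cats, left_cats, right_cats: parallel lists, one entry per training
        sequence, giving the category at positions i, i-1 and i+1 respectively.
    marginal: dict mapping each center category to its marginal probability at i.
    Returns a dict keyed by (left, right); each value is a dict over center categories.
    Contexts never seen in training are absent; a caller would fall back to `marginal`.
    """
    counts = Counter(zip(left_cats, right_cats, center_cats))
    context_totals = Counter(zip(left_cats, right_cats))
    table = {}
    for context, n_context in context_totals.items():
        denom = n_context + gamma  # the marginal sums to 1, so pseudo counts sum to gamma
        table[context] = {
            a: (counts[context + (a,)] + gamma * marginal[a]) / denom
            for a in marginal
        }
    return table
```

A larger gamma pulls every estimate toward the marginal distribution, which is the behavior described for contexts supported by only a few training sequences.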

Experimental Data

[0063] Various experiments have been run to confirm the results shown in Table I. In one test, HIV V3 loop sequences with tropism were downloaded from the LANL database. Sequences with tropism other than CCR5, CXCR4 or dual were removed. The sequences were then aligned to the reference sequence "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC", and those not starting or ending with amino acid C were also removed. 1314 sequences with CCR5 tropism and 486 sequences with either CXCR4 or dual tropism were available for analysis.

[0064] For the data set, identical sequences with different tropism were removed to ensure data quality, and non-unique sequences were removed. Finally, 606 unique CCR5 tropic sequences from 375 patients and 213 unique CXCR4 or dual tropic sequences from 90 patients were included in the analysis.
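The filtering steps described above can be sketched as follows. The sketch assumes records of the form (sequence, tropism) with tropism labels "CCR5", "CXCR4" or "dual", pools CXCR4 and dual into one class as the text does, and omits the alignment step; the label strings and the function name are hypothetical.

```python
def clean_sequences(records):
    """Filter (sequence, tropism) records and return {sequence: class label}.

    Assumed tropism labels: "CCR5", "CXCR4", "dual"; anything else is dropped.
    """
    allowed = {"CCR5", "CXCR4", "dual"}
    labels_by_seq = {}
    for seq, tropism in records:
        if tropism not in allowed:
            continue  # drop sequences with other tropism annotations
        if not (seq.startswith("C") and seq.endswith("C")):
            continue  # keep only sequences that start and end with cysteine
        label = "CCR5" if tropism == "CCR5" else "CXCR4/dual"
        labels_by_seq.setdefault(seq, set()).add(label)
    # Keep one copy of each sequence, and only if its labels never conflict.
    return {
        seq: next(iter(labels))
        for seq, labels in labels_by_seq.items()
        if len(labels) == 1
    }
```

Keying the result by sequence removes redundant copies, and the single-label check removes identical sequences reported with different tropism.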

[0065] In the experiment, in-house data (CCR5, CXCR4 or dual tropic) and sequences purchased from Monogram (CXCR4 or dual tropic) were included in the analysis. 1422 CCR5-tropic sequences and 617 CXCR4 or dual tropic sequences were compiled. After conflicting or redundant sequences were removed, 621 unique sequences with CCR5 tropism from 381 patients and 262 unique sequences with CXCR4 or dual tropism from 113 patients were used for analysis. The quality of the extra data is believed to be higher than that of the sequences from the LANL database.

[0066] In each cross validation, 100 sequences with CCR5 tropism and ~25 patients with CXCR4 or dual tropism were selected as testing samples, all from different patients. Sequences from other patients were used as training samples. The performance of the new algorithm on the LANL dataset and the expanded dataset is shown in Table I.
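A patient-disjoint split of the kind described above can be sketched as follows for one tropism class. The sketch assumes records of the form (patient_id, sequence), draws one test sequence from each selected test patient, and uses all sequences from the remaining patients for training; the record layout and the function name are illustrative assumptions.

```python
import random


def patient_disjoint_split(records, n_test_patients, rng):
    """Split (patient_id, sequence) records into train and test lists.

    Every test sample comes from a different patient, and no test patient
    contributes any sample to the training list.
    """
    by_patient = {}
    for patient, seq in records:
        by_patient.setdefault(patient, []).append(seq)
    test_patients = set(rng.sample(sorted(by_patient), n_test_patients))
    test = [(p, rng.choice(by_patient[p])) for p in sorted(test_patients)]
    train = [(p, s) for p, s in records if p not in test_patients]
    return train, test
```

Repeating the split with different random seeds gives one cross-validation round per seed.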

B. TECHNIQUES FOR CLASSIFYING A DATA POINT BASED ON A MULTITUDE OF CLASSIFIERS

[0067] Another aspect of the invention relates to classifying a data point based on forming a multitude of classifiers. The method includes subdividing a data set into a set of data subsets, forming a plurality of training sets each created by sampling one data point from each data subset, training a plurality of classifiers based on the plurality of training sets, and taking a vote of the decisions made by the plurality of classifiers.

[0068] In a specific embodiment for training a classifier for a patient infected with HIV, a set of training data is obtained from several patients. Instead of building one classifier for the entire data set, many classifiers can be built. Data from each patient can be defined as forming an individual data subset. Multiple training sets can then be derived by randomly obtaining one data point from each data subset (i.e., each patient). According to the embodiment, within any one training set, information from the other sequences of the same patient is thus ignored. Next, each classifier can be trained on one of the training sets. A final prediction result can then be derived based on a vote of the pool of classifiers.

[0069] Fig. 6 is a simplified flow chart illustrating techniques for classifying a data point based on a vote of a multitude of classifiers. According to an embodiment, a technique 6000 for more fully utilizing the information contained in a training set is provided. The technique includes providing a data set (6010), subdividing the data set into a set of data subsets (6020), determining a set of training sets each formed by selecting one data point from each data subset (6030), and creating a set of classifiers each trained on one of the training sets (6040).

[0070] The technique further includes determining a set of tentative categorizations for a test data point based on the set of classifiers (6050) and determining a categorization for the test data point based on a vote of the set of tentative categorizations (6060). Depending on the embodiment, a majority vote scheme may be proposed to fully utilize such information. Other types of schemes can be used, such as a 2/3 vote for example. Depending on the embodiments, the scheme may involve a dynamic evaluation of specific data sets without deviating from the scope and spirit of the invention.
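Steps 6010 through 6060 can be sketched as below. This is an illustrative sketch under stated assumptions, not the classifier of the invention: the nearest-neighbour classifier over Hamming distance is a simple stand-in for whatever classifier is trained on each training set, and the names `build_ensemble`, `vote`, `threshold` and `default` are hypothetical. Equal-length (aligned) sequences are assumed.

```python
import random
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def train_nearest_neighbour(training_set):
    """Stand-in classifier (assumption): label of the closest training point."""
    def classify(seq):
        nearest = min(training_set, key=lambda item: hamming(seq, item[0]))
        return nearest[1]
    return classify

def build_ensemble(data, n_classifiers=25, seed=0):
    """data: list of (subset_id, sequence, label); subset_id is e.g. a patient."""
    rng = random.Random(seed)
    subsets = defaultdict(list)                      # step 6020
    for subset_id, seq, label in data:
        subsets[subset_id].append((seq, label))
    ensemble = []
    for _ in range(n_classifiers):
        # Step 6030: one randomly chosen data point from each data subset.
        training_set = [rng.choice(points) for points in subsets.values()]
        ensemble.append(train_nearest_neighbour(training_set))  # step 6040
    return ensemble

def vote(ensemble, seq, threshold=0.5, default=None):
    """Steps 6050-6060: collect tentative categorizations, then vote.

    The winning label is returned only if its share of the votes strictly
    exceeds `threshold`; otherwise `default` is returned.
    """
    tentative = [clf(seq) for clf in ensemble]       # step 6050
    label, count = Counter(tentative).most_common(1)[0]
    return label if count / len(tentative) > threshold else default
```

With `threshold=0.5` this behaves as a simple majority vote; passing `threshold=2/3` gives one possible reading of the stricter 2/3 scheme mentioned above.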

[0071] In yet another embodiment, a set of data may be obtained for patients from three geographic locations, for example Location 1, Location 2, and Location 3. The set of data need not include an equal number of points from each location, but may include, for example, 5 data points from Location 1, 8 data points from Location 2, and 10 data points from Location 3. To more fully utilize data from the data set, several training data sets may be formed, where each training data set is formed by randomly selecting one data point from Location 1, randomly selecting one data point from Location 2, and randomly selecting one data point from Location 3. Each training data set is then used to train one of a multitude of classifiers. To categorize a test data point, a simple majority vote of the decisions made by the multitude of classifiers is taken to derive a final categorization for the test data point.
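For the example counts in paragraph [0071], the number of distinct training sets that can be formed is 5 × 8 × 10 = 400. The short sketch below enumerates them and draws a few at random; the placeholder identifiers such as `L1-0` and the choice of seven training sets are assumptions made purely for illustration.

```python
import itertools
import random

# Placeholder data points: 5, 8 and 10 points from the three locations.
locations = {
    "Location 1": [f"L1-{i}" for i in range(5)],
    "Location 2": [f"L2-{i}" for i in range(8)],
    "Location 3": [f"L3-{i}" for i in range(10)],
}

# Every way of choosing exactly one data point per location.
all_training_sets = list(itertools.product(*locations.values()))
n_possible = len(all_training_sets)  # 5 * 8 * 10 = 400

# Draw a handful of distinct training sets, one per classifier to be trained.
rng = random.Random(0)
sampled_training_sets = rng.sample(all_training_sets, k=7)
```

Each sampled training set holds three data points, one per location, so no location dominates any single classifier regardless of how many points it contributed to the full data set.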

C. TECHNIQUES FOR WEIGHTING A TRAINING SET BASED ON AN AVERAGE DISTANCE FROM A REFERENCE PLURALITY OF DATA POINTS

[0072] According to an aspect of the invention, a technique for processing data obtained from multiple closely related sources is disclosed. In one example, multiple HIV sequences are obtained from a patient. (Note that including all sequences from a patient may introduce bias into the classifier.) In another example, many sequences are obtained from different patients. However, the sample spaces of viral sequences from different patients may overlap (for example, when patients are infected from the same source, so that samples from different patients share a high degree of similarity). In that case, including data sets from separate patients may also unnecessarily introduce bias.

[0073] In both cases above, the assumption that the viral samples are independent during the calculation of the amino acid distributions may introduce bias into the classification methods. In order to partially compensate for such bias, a sequence-reweighting procedure is employed.

[0074] According to an embodiment, each sequence in a training set may be assigned a weight relating to the distance of that sequence to a reference data set. The distance may be calculated from the specified data point to an average of the reference data set, where distance in general may be defined in terms of the degree of similarity between positions on two sequences. According to the embodiment, data points from under-sampled sources can be emphasized while data points from over-sampled sources can be de-emphasized.

[0075] Fig. 7 is a simplified flow diagram illustrating techniques for weighting a training set according to an embodiment of the invention. According to an embodiment, the technique includes obtaining a set of data points sampled from various sources, some sources being over-sampled relative to other sources (7010), determining a reference plurality of data points for each of the set of data points (7020), determining a distance from a specified data point to an average of a reference plurality of data points (7030), weighting each data point in accordance with its distance to the average of its reference plurality of data points (7040), and forming a training set in accordance with the weighting of the data points (7050).

[0076] Depending on the specific embodiment, the reference plurality of data points may be the same for all data points or different for each data point. In an embodiment in which the reference plurality is the same for all data points, it is simply the entire data set. In an embodiment in which the reference plurality differs for each data point, the reference plurality for a specified data point may be the entire data set excluding the specified data point. Depending on the embodiment, the estimation of probabilistic distributions such as $q_l^{X4}(s_l \mid s_{l-1}, s_{l+1})$ can be weighted in proportion to the distance metrics associated with the data making up the training set.
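One possible reading of steps 7020 through 7040 is sketched below. Several choices here are assumptions rather than details given in the text: distance is taken as the Hamming distance between aligned, equal-length sequences; the "distance to an average of the reference plurality" is interpreted as the arithmetic mean of the distances to each reference sequence; and the weights are scaled linearly so that they average to 1. The function name `sequence_weights` and the `exclude_self` flag are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def sequence_weights(sequences, exclude_self=True):
    """Weight each sequence by its mean distance to a reference set.

    exclude_self=True  -> reference set is every other sequence
    exclude_self=False -> reference set is the entire data set
    """
    n = len(sequences)
    if n < 2:
        return [1.0] * n

    mean_dists = []
    for i, seq in enumerate(sequences):
        reference = [s for j, s in enumerate(sequences)
                     if not (exclude_self and j == i)]            # step 7020
        mean_dists.append(
            sum(hamming(seq, r) for r in reference) / len(reference)
        )                                                         # step 7030

    total = sum(mean_dists)
    if total == 0:
        # All sequences identical: fall back to uniform weights.
        return [1.0] * n
    # Step 7040: weight proportional to distance, normalized to mean 1.
    return [n * d / total for d in mean_dists]
```

Under this reading, a sequence that has many near-duplicates in the data set sits close to the rest and receives a small weight, while a sequence from an under-sampled source lies farther away and receives a larger one. The resulting weights could then replace unit counts when estimating position-specific frequencies.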

REPRESENTATIVE COMPUTER SYSTEM

[0077] Many aspects of the techniques disclosed in this application can be implemented by computer systems. Fig. 8 is a simplified block diagram of a computer system 100 that can be used to practice an embodiment of the various inventions described in this application. As shown in Fig. 8, computer system 100 includes a processor 102 that communicates with a number of peripheral subsystems via a bus subsystem 104. These peripheral subsystems can include a storage subsystem 106, comprising a memory subsystem 108 and a file storage subsystem 110, user interface input devices 112, user interface output devices 114, and a network interface subsystem 116.

[0078] Bus subsystem 104 provides a mechanism for letting the various components and subsystems of computer system 100 communicate with each other as intended. Although bus subsystem 104 is shown schematically as a single bus, alternative embodiments of the bus subsystem may utilize multiple busses.

[0079] Network interface subsystem 116 provides an interface to other computer systems, networks, and portals. Network interface subsystem 116 serves as an interface for receiving data from and transmitting data to other systems from computer system 100.

[0080] User interface input devices 112 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a barcode scanner, a touch screen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and mechanisms for inputting information to computer system 100.

[0081] User interface output devices 114 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices, etc. The display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device, and the like. In general, use of the term "output device" is intended to include all possible types of devices and mechanisms for outputting information from computer system 100.

[0082] Storage subsystem 106 can be configured to store the basic programming and data constructs that provide the functionality of the present invention. Software (code modules or instructions) that provides the functionality of the present invention can be stored in storage subsystem 106. These software modules or instructions can be executed by processor(s) 102. Storage subsystem 106 may also provide a repository for storing data used in accordance with the present invention. Storage subsystem 106 can comprise memory subsystem 108 and file/disk storage subsystem 110.

[0083] Memory subsystem 108 can include a number of memories including a main random access memory (RAM) 118 for storage of instructions and data during program execution and a read only memory (ROM) 120 in which fixed instructions are stored. File storage subsystem 110 provides persistent (non-volatile) storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disk Read Only Memory (CD-ROM) drive, an optical drive, removable media cartridges, and other like storage media.

[0084] Computer system 100 can be of various types including a personal computer, a portable computer, a workstation, a network computer, a mainframe, a kiosk, a server or any other data processing system. Due to the ever-changing nature of computers and networks, the description of computer system 100 depicted in Fig. 8 is intended only as an example for purposes of illustrating the preferred embodiment of the computer system. Many other configurations having more or fewer components than the system depicted in Fig. 8 are possible.

[0085] Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.

[0086] Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented using hardware, software, or combinations thereof.

[0087] The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the inventions.

Claims

WHAT IS CLAIMED IS:

1. A method for categorizing a test sequence as a first class or a second class, the method comprising: providing a first training set including a plurality of sequences of the first class; providing a second training set including a plurality of sequences of the second class; determining a plurality of probabilities associated with a plurality of positions, the plurality of positions including a position, a preceding position, and a succeeding position, the plurality of probabilities including: a probability that a position on a sequence of the first class and a position on the test sequence are occupied by elements belonging to a first specific category, wherein a preceding position on the sequence of the first class and a preceding position on the test sequence are occupied by elements belonging to a second specific category, and wherein a succeeding position on the sequence of the first class and a succeeding position on the test sequence are occupied by elements belonging to a third specific category; and a probability that a position on a sequence of the second class and a position on the test sequence are occupied by elements belonging to a fourth specific category, wherein a preceding position on the sequence of the second class and a preceding position on the test sequence are occupied by elements belonging to a fifth specific category, and wherein a succeeding position on the sequence of the second class and a succeeding position on the test sequence are occupied by elements belonging to a sixth specific category; determining a score for the test sequence based on the plurality of probabilities; and categorizing the test sequence as the first class or the second class based on the score.

2. The method of claim 1 further comprising determining the plurality of probabilities for every position on a sequence except for a beginning position and a last position on the sequence.

3. The method of claim 1 wherein the determining the score includes weighting each of the plurality of probabilities by a weighting factor.

4. The method of claim 1 wherein the plurality of probabilities further includes: a probability that a beginning position on the sequence of the first class and a beginning position on the test sequence are occupied by elements belonging to a seventh specific category; a probability that a last position on the sequence of the first class and a last position on the test sequence are occupied by elements belonging to an eighth specific category; a probability that a beginning position on the sequence of the second class and a beginning position on the test sequence are occupied by elements belonging to a ninth specific category; and a probability that a last position on the sequence of the second class and a last position on the test sequence are occupied by elements belonging to a tenth specific category.

5. The method of claim 1 wherein the test sequence, the plurality of sequences of the first class, and the plurality of sequences of the second class are amino acid sequences.

6. The method of claim 5 wherein a sequence of the first class includes an amino acid sequence comprising the V3 loop of the human immunodeficiency virus (HIV) GP 120 protein of the CXCR4 type and wherein a sequence of the second class includes an amino acid sequence comprising the V3 loop of the human immunodeficiency virus (HIV) GP 120 protein of the CCR5 type.

7. The method of claim 5 wherein amino acids are divided into one of four predetermined categories, the four predetermined categories including a first category including H, K and R, a second category including A, F, I, L, M, P, V and W, a third category including C, G, N, Q, S, T and Y, and a fourth category including D and E.

8. The method of claim 5 wherein amino acids are divided into one of twelve predetermined categories, the twelve predetermined categories including a first category including A and P, a second category including F and W, a third category including I, L and V, a fourth category including M, a fifth category including H, a sixth category including K and R, a seventh category including D, an eighth category including E, a ninth category including N, S and T, a tenth category including Q, an eleventh category including C and G, and a twelfth category including Y.

9. The method of claim 5 wherein amino acids are divided into one of a plurality of predetermined categories, the predetermined categories based on physicochemical properties of each amino acid.

10. The method of claim 5 wherein each of the twenty types of amino acids makes up one of twenty predetermined categories.

11. The method of claim 1 wherein the test sequence, the plurality of sequences of the first class, and the plurality of sequences of the second class are nucleotide sequences.

12. A method for categorizing a data point based on a plurality of data points, the method comprising: providing the plurality of data points; subdividing the plurality of data points into a plurality of data subsets, each of the plurality of data subsets satisfying a criterion; determining a plurality of training sets, each of the plurality of training sets formed by selecting one data point from each of the plurality of data subsets; training a plurality of classifiers, each of the plurality of classifiers trained on one of the plurality of training sets; determining a plurality of tentative categorizations for the data point associated with the plurality of classifiers; and categorizing the data point based on a vote of the plurality of tentative categorizations.

13. The method of claim 12 wherein the data point represents a plurality of measurements associated with a plurality of V3 loops of the human immunodeficiency virus (HIV) GP 120 protein, the method adapted to classify an amino acid sequence as a sequence of the CCR5 class or a sequence of the CXCR4 class.

14. The method of claim 12 wherein the categorizing is based on a majority vote.

15. The method of claim 12 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of biomarkers.

16. The method of claim 12 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of nucleotide sequences.

17. The method of claim 12 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of amino acid sequences.

18. The method of claim 13 wherein the plurality of data points is obtained from one or more human beings.

19. The method of claim 13 wherein the plurality of data points is obtained from one or more mammals.

20. A method for training a classifier based on a plurality of data points obtained from a plurality of sources, the method comprising weighting a specified data point in accordance with a distance between the specified data point and an average of a reference plurality of data points.

21. The method of claim 20 wherein the reference plurality of data points includes all of the plurality of data points, including the specified data point.

22. The method of claim 20 wherein the reference plurality of data points includes all of the plurality of other data points, excluding the specified data point.

23. The method of claim 20 wherein the average is an arithmetic average.

24. The method of claim 20 wherein the average is a geometric average.

25. The method of claim 20 wherein each data point is a measure of an amino acid sequence and each distance between two data points is a measure of dissimilarity between two amino acid sequences, wherein a larger distance represents a larger degree of dissimilarity and a smaller distance represents a smaller degree of dissimilarity.

26. The method of claim 20 wherein the weighting is based on a linear distance.

27. The method of claim 20 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of biomarkers.

28. The method of claim 20 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of amino acid sequences.

29. The method of claim 20 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of nucleotide sequences.

30. The method of claim 20 wherein the plurality of data points is associated with a plurality of measurements associated with a plurality of V3 loops of the human immunodeficiency virus (HIV) GP 120 protein, the method adapted to classify an amino acid sequence as a sequence of the CCR5 class or a sequence of the CXCR4 class.