WO1998052129A1

WO1998052129A1 - Constitutional analysis of protein domains

Info

Publication number: WO1998052129A1
Application number: PCT/AU1998/000355
Authority: WO
Inventors: John Redmond; Nicolle Hannah Packer; Andrew Arthur Gooley; Keith Leslie Williams
Original assignee: Macquarie Research Ltd.
Priority date: 1997-05-15
Filing date: 1998-05-14
Publication date: 1998-11-19
Also published as: AUPO680897A0

Abstract

A method of assigning a First Sequence to a peptide or polypeptide amino acid sequence, the method comprising reading from one direction of the peptide or polypeptide amino acid sequence and recording at least some of the constituent amino acids in the order in which the amino acids are first encountered so as to generate the First Sequence containing at least some of the constituent amino acids of the peptide or polypeptide, and methods of creating and searching compressed amino acid sequence databases using First Sequences of peptides or polypeptides.

Description

Constitutional analysis of protein domains

Technical Field

The present invention relates generally to use of amino acid analytical information of peptides, polypeptides and proteins.

Background Art

With the consolidation of databases which contain implied sequences of proteins deduced from sequences of nucleic acids, and with the availability of high-resolution methods for the separation of analytical amounts of proteins, there is a pressing need for methods for the characterisation or possible identification of the separated proteins in the light of comparisons with protein information already available in the databases.

To date, the primary method for the characterisation of intact proteins has been by means of automated N-terminal sequencing, whereby the order of amino acids is defined for a limited domain at one end of the protein, or at one end of a derived peptide, such as is derived by enzymic digestion of the protein. The databases are then interrogated for a match against such a sequence.

Inasmuch as the majority of the sequences in the existing protein databases are in fact sequences deduced from DNA sequences, and not obtained directly by analysis of pure proteins, the match of the N-terminal sequence is often to a protein domain within the database sequence. This results from the presence of a signal sequence of amino acids which is absent in the mature protein. Furthermore, there may be truncation at the other end of the protein (the C-terminus), whereby protein synthesis (translation) is stopped earlier than expected or the protein is cleaved after synthesis. This also has the consequence that the physical end of the mature protein is within the sequence reported in the database. There are many examples of genes which produce various spliced protein products which can have different N- and C-termini. Any method for analysis and comparison with stored data should therefore be robust enough to handle any such unpredicted events.

The present inventors have developed a method using a specification of the amino acids at the ends of a protein, together with associated software, to provide a powerful means for the identification of the ends of an isolated mature protein and for comparison with reference sequences in databases. The present invention can also be applied to the analysis of peptide fragments derived from the protein, such as in those cases where the termini of the intact protein cannot be analysed because of the presence of blockage at one or other end. (It should be noted that the standard method of N-terminal sequencing also requires that peptides be prepared in the event of such a block.)

The present inventors have also developed methods for assigning short identifying sequences (First Sequences) to peptides or polypeptides that can be used as concise distinguishing markers.

Disclosure of Invention

The present invention relates to distinguishing a peptide or polypeptide by means of a generated 'First Sequence' which summarises the order in which at least some of the constituent amino acids are first encountered when the amino acids are read off, in order, from one of the ends of the amino acid sequence of the peptide or polypeptide.

In a first aspect, the present invention consists in a method of assigning a First Sequence to a peptide or polypeptide amino acid sequence comprising reading from one direction of the peptide or polypeptide amino acid sequence and recording at least some of the constituent amino acids in the order in which they are first encountered so as to generate the First Sequence containing at least some of the constituent amino acids of the peptide or polypeptide.

It will be appreciated that the First Sequence can be generated by reading from either the carboxyl- (C-) or amino- (N-) direction of the polypeptide. In this way, the C-terminal First Sequence (CFS) and N-terminal First Sequence (NFS) are generated for the same peptide or polypeptide by reading from each end of the peptide or polypeptide.

In a preferred embodiment, when an amino acid is first encountered and assigned to a First Sequence, further occurrences of that amino acid in the polypeptide amino acid sequence are ignored. The First Sequences can normally be up to 20 amino acids in length as there are a possible 20 naturally occurring amino acids in polypeptides. As it becomes technically possible to characterise the positions of post-translationally modified amino acid residues, such as phosphoserine and N-acetylglucosaminylasparagine, the First Sequences of many peptides and polypeptides will be expanded to contain more than 20 constituents. It has been found by the present inventors that incomplete First Sequences, comprising fewer than the complete set of the constituent amino acids, can be used to generate a substantially unique identifying sequence for any given polypeptide. In this way, the first occurrence of at least 5, preferably at least 8, and more preferably about 10-14, constituent amino acids in a given polypeptide sequence will generate a partial CFS or NFS that is substantially unique to that polypeptide.
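The rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the inventors' software: the function name `first_sequence`, its parameters and the arbitrary 10-residue test string are assumptions made for the example. The result is returned in the order of encounter, so a C-terminal First Sequence produced this way begins with the C-terminal residue.

```python
from typing import Optional


def first_sequence(sequence: str, terminus: str = "N",
                   max_length: Optional[int] = None) -> str:
    """Return residues in the order in which they are first encountered.

    terminus="N" reads left to right; terminus="C" reads right to left.
    max_length (a hypothetical parameter) stops early, giving a partial
    First Sequence.
    """
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    residues = sequence if terminus == "N" else reversed(sequence)
    seen = set()
    recorded = []
    for residue in residues:
        if max_length is not None and len(recorded) >= max_length:
            break
        if residue in seen:
            continue  # later occurrences of a recorded residue are ignored
        seen.add(residue)
        recorded.append(residue)
    return "".join(recorded)


toy = "APGDKEGSEG"  # arbitrary test string, one-letter amino acid codes
print(first_sequence(toy))                # APGDKES
print(first_sequence(toy, terminus="C"))  # GESKDPA
print(first_sequence(toy, max_length=5))  # APGDK
```

Because each residue is recorded once only, the output can never be longer than the number of distinct residue codes in the input.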

It will also be appreciated that one or more constituent amino acids may be ignored when generating a First Sequence of a peptide or polypeptide. Although it is preferred that each constituent amino acid is only counted once when generating the first or second sequences, it will be appreciated that each amino acid may be counted more than once to generate a First Sequence. This would, however, generate larger First Sequences for a given polypeptide and would not provide as good compression of an amino acid sequence when compared with counting each constituent amino acid only once.

In a second aspect, the present invention consists in a method of compressing the amino acid sequence data of a polypeptide to that of a string of amino acids forming a First Sequence obtained according to the method of the first aspect of the present invention.

It will be appreciated that an additional compressed sequence (First Sequence) can be generated by assigning to the polypeptide a First Sequence generated from the other terminus, according to the method of the first aspect of the present invention.

In a third aspect, the present invention consists in a method of compressing a database containing known or predicted amino acid sequences of peptides or polypeptides, the method comprising assigning a First Sequence for each peptide or polypeptide according to the method of the first aspect of the present invention and storing the First Sequences as a compressed database.

If a derived database is prepared which contains one or more First Sequences derived from a primary database containing known or predicted amino acid sequences of many polypeptides, a significant compression of the structural information is obtained. It will be appreciated that a search for the presence of a First Sequence in such a database will be significantly faster than searching the uncompressed database. If a further derived database is prepared from a primary database containing known or predicted amino acid sequences of many polypeptides whereby each of the polypeptides is represented as an unsigned integer number, obtained by a hashed merging of two or more of the First Sequences corresponding to the polypeptide, an even greater degree of compression will be obtained. The characterisation of a protein by a distinctive integer means that the entries of a database can be ordered to provide rapid searching and comparison. It will be realised that this approach can be used to provide a keyed index to existing databases of polypeptides.

In a fourth aspect, the present invention consists in a method of searching a database containing amino acid sequences of known or predicted First Sequences of peptides or polypeptides for the presence or identity of a test peptide or polypeptide, the method comprising assigning to the test polypeptide a First Sequence according to the method of the first aspect of the present invention to form a test First Sequence, and comparing the test First Sequence with the corresponding First Sequences of the peptides or polypeptides in the database to detect or locate the presence of the test First Sequence.
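As a rough sketch of the third and fourth aspects, a primary database can be reduced to First Sequences and then to one unsigned integer per entry, so that an ordered index can be searched by bisection. The specification does not state how the hashed merging is performed; the use of CRC-32 over the joined N- and C-terminal First Sequences, the helper names and the two toy entries below are all assumptions made for illustration.

```python
import bisect
import zlib
from typing import Dict, List, Tuple


def first_sequence(sequence: str, from_c_terminus: bool = False) -> str:
    """Residues in order of first encounter (dict keys keep insertion order)."""
    residues = reversed(sequence) if from_c_terminus else sequence
    return "".join(dict.fromkeys(residues))


def integer_key(sequence: str) -> int:
    """ASSUMED hash: CRC-32 of 'NFS|CFS', an unsigned 32-bit integer."""
    merged = first_sequence(sequence) + "|" + first_sequence(sequence, True)
    return zlib.crc32(merged.encode("ascii"))


def build_index(proteins: Dict[str, str]) -> List[Tuple[int, str]]:
    """Compressed, ordered index of (integer key, entry name) pairs."""
    return sorted((integer_key(seq), name) for name, seq in proteins.items())


def lookup(index: List[Tuple[int, str]], test_sequence: str) -> List[str]:
    """Names of all entries whose key equals that of the test sequence."""
    key = integer_key(test_sequence)
    i = bisect.bisect_left(index, (key, ""))
    names = []
    while i < len(index) and index[i][0] == key:
        names.append(index[i][1])
        i += 1
    return names


# Toy entries only; real entries would come from a protein database.
toy_db = {"toyA": "APGDKEGSEG", "toyB": "RDVTLEASRE"}
index = build_index(toy_db)
print(lookup(index, "APGDKEGSEG"))
```

The compression is lossy by design: any sequence with the same pair of First Sequences (for example `APGDKEGSEGG`) maps to the same key, so a hit on the index is a candidate to be confirmed against the primary database.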

Throughout this specification, unless the context requires otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

In order that the present invention may be more clearly understood, preferred forms will be described with reference to the following examples and the drawing.

Brief Description of Drawings

Figure 1 shows an amino acid sequence of the polypeptide microneme-rhoptry antigen and the C-terminal and N-terminal First Sequences thereof assigned according to the method of the present invention.

Modes for Carrying Out the Invention

In summary, the techniques for protein identification by the present invention exploit the informational content of constitutional analyses of protein domains. The techniques are powerful and general, and are expected to be an effective alternative to identification strategies based on amino acid sequences of proteins. The validity of the algorithms used by the present inventors has been demonstrated by extensive searches and comparisons of real sequences in both the SwissProt and Owl databases.

A FIRST SEQUENCE OF A POLYPEPTIDE

A First Sequence of a protein or peptide is a string of the constituent amino acids, each of which is given in the order in which it is first encountered when moving from one end of the protein. This step is illustrated by generation of a C-terminal First Sequence (CFS) from the first protein in the SwissProt protein database. A printout of this protein is shown in Figure 1. The first line in Figure 1 gives the database label for the entry, and the second line an explanatory comment. Then follows a complete listing of the 924 amino acids in the protein, disposed in groups of 10 amino acids for clarity. This listing is terminated by an asterisk, the standard marker for the end of this database field, which is ignored for the purposes of generating First Sequences. The first occurrences of amino acids starting from the C-terminus (the right-hand end) are shown in BOLD in the above protein sequence, and the CFS (which in this case contains 20 amino acids) on the second-last line of Figure 1.

In this way, the CFS is built up as a protein signature in which each of the constituent amino acids appears only once. There are 20! (2.43 x 10¹⁸) possible CFS signatures that can be constructed from 20 amino acids. It will be appreciated that, in terms of the significance of the sequence, a much heavier weighting is given to amino acids which occur in the first part (the right-hand or C-terminal end) of the CFS, and that this part of the CFS therefore has a much higher informational content.
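The figure of 20! quoted above can be checked directly. The short sketch below also counts ordered partial signatures of k distinct residues, which is an added illustration and not a figure from the specification; it assumes all 20 amino acids are available and says nothing about how real proteins are distributed among those signatures.

```python
import math

AMINO_ACIDS = 20

full = math.factorial(AMINO_ACIDS)
print(full)           # 2432902008176640000
print(f"{full:.2e}")  # 2.43e+18

# Number of ordered signatures of k distinct residues: 20! / (20 - k)!
for k in (5, 8, 10, 14):
    print(k, math.perm(AMINO_ACIDS, k))
```

`math.perm` requires Python 3.8 or later.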

Similarly, a NFS can be constructed by moving from left to right along the main protein chain. In some 78% of sequences available in the databases, however, methionine (M) is present in the first position, which reduces its informational content. It is therefore of advantage to include this residue in the First Sequence signature, but not to the exclusion of the first subsequent occurrence of M. Such a signature is termed a MFS, and usually contains two M residues. When the protein contains two or more M residues and one of these is at the N-terminus, the MFS and NFS differ only in the presence of an additional M residue at the left-hand end of the sequence. The last line in Figure 1 is such a MFS. The NFS and MFS are, of course, quite different sequences from the CFS of the same protein. In combination with the CFS, they are extremely compact and powerful descriptors of a protein. In the above example, 41 characters (the sum of the lengths of the MFS and CFS) are used as a (presently) unique signature for a protein of 924 amino acids.

Either the NFS or the CFS can be considered to be an ordered set of the constituent amino acids of the whole protein. It is also possible to describe any domain of a protein in terms of the set of the amino acids which it contains. Any of the subsequences of a First Sequence are a true set of constituent amino acids in a domain of the main protein sequence, but ONLY if the subsequence is aligned with the appropriate terminus of the First Sequence (the C-terminus of the CFS or the N-terminus of the NFS or MFS). For example, the block of the first 16 amino acids of the C-terminal domain in Figure 1 contains only the amino acids in the set {P, A, S, V, G, I, L}. The set {P, A, S, V, G, I}, however, does not represent the constituent amino acids of the subterminal domain lacking the terminal L residue, because of the occurrence of a second L residue.

For a truncated protein lacking one or more of the amino acid residues in the C-terminal domain, a new CFS can be constructed. In general, this will be quite different from that of the non-truncated protein, as can be demonstrated by inspection. This has important implications for the characterisation of polypeptides by compositional analysis.
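The alignment rule can be illustrated with a small sketch. The 20-residue `toy` sequence is invented so that its last 16 residues contain exactly {P, A, S, V, G, I, L} with a second L inside the block; it is not the sequence of Figure 1. The `mfs` helper follows one reading of the MFS definition (keep an N-terminal M, then record first occurrences in the remainder), which is an assumption. First Sequences are returned in order of encounter.

```python
def first_sequence(sequence: str, from_c_terminus: bool = False) -> str:
    """Residues in order of first encounter from the chosen terminus."""
    residues = reversed(sequence) if from_c_terminus else sequence
    return "".join(dict.fromkeys(residues))


def mfs(sequence: str) -> str:
    """ASSUMED reading: an initial M is kept and does not mask the next M."""
    if sequence.startswith("M"):
        return "M" + first_sequence(sequence[1:])
    return first_sequence(sequence)


def c_terminal_domain(sequence: str, k: int) -> str:
    """Longest C-terminal block built only from the first k residues
    encountered when reading from the C-terminus."""
    allowed = set(first_sequence(sequence, from_c_terminus=True)[:k])
    length = 0
    for residue in reversed(sequence):
        if residue not in allowed:
            break
        length += 1
    return sequence[len(sequence) - length:]


toy = "MDEK" + "IGLVSSAPPAVSGAPL"  # hypothetical sequence

block = c_terminal_domain(toy, 7)
print(block, len(block))            # IGLVSSAPPAVSGAPL 16
print(set(block) == set("PASVGIL"))  # True

# Dropping the terminal L does not give a block with set {P, A, S, V, G, I},
# because a second L lies inside the block.
print(set(toy[-16:-1]) == set("PASVGI"))  # False

print(first_sequence("MKTMA"), mfs("MKTMA"))  # MKTA MKTMA
```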

APPLICATIONS OF FIRST SEQUENCES

1. Identification of a protein by constitutional analysis of its C-terminal or N-terminal domain

Each of the First Sequences corresponds to the set of amino acids which occur one or more times in a protein, and inspection of the Owl database shows that some 59% of the entries contain all of the 20 amino acids. Other entries either lack one or more of these amino acids, or have ambiguities introduced by the presence of X, B or Z where specific residues have not been characterised fully. As it corresponds to 59% of the proteins, the set of 20 constituent amino acids has limited informational content but, when a more restricted set of amino acids is applied to either a whole protein or a protein domain, the informational content becomes considerable. This is illustrated by the fact that 76,487 of the 128,719 entries (59%) in the Owl database contain all 20 amino acids but that, when Q, W, C, K and I are removed from the set, there is only one match. This result was obtained by differential scanning of the SwissProt database, using a Shuffling Algorithm as described below.

A simple application of constitutional analysis is in the specification of the C-terminal domain of a protein by liberation of the group of amino acids in this domain by either enzymic or chemical treatment. The enzymic method (using a carboxypeptidase) has proved problematic, because of the enormous difference in rate of liberation of different amino acid residues, with the result that it is difficult to establish a reliable order for the release of the amino acids. An outcome of the application of the Shuffling Algorithm to a large number of test proteins in the Owl and SwissProt databases is the demonstration that it is not necessary to establish the order of release of the amino acids, and that it is sufficient to establish which amino acids are released (i.e. to establish the set of constituent amino acids in the terminal domain). This is done by scoring these amino acids on a plus/minus basis (no attempt need be made to assess relative amounts of each amino acid, but this information may be used at a later stage to resolve any ambiguities). A database can then be searched on the basis of the occurrence of this set of amino acids in the protein. This may be at the C-terminus of the protein but, in the case of a protein which has been subjected to C-terminal truncation, it will be necessary to search for an internal domain with the same constituent amino acids.
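A minimal sketch of the plus/minus search at a terminus follows. It relies on the observation that a terminal domain containing exactly the released residues exists if, and only if, the first n distinct residues encountered from that terminus are the released set, where n is the size of the set. The function names and the three toy entries are hypothetical, and the internal-domain search needed for truncated proteins is not covered here.

```python
from typing import Dict, Iterable, List


def first_sequence(sequence: str, from_c_terminus: bool = False) -> str:
    """Residues in order of first encounter from the chosen terminus."""
    residues = reversed(sequence) if from_c_terminus else sequence
    return "".join(dict.fromkeys(residues))


def matches_terminus(sequence: str, released: Iterable[str],
                     from_c_terminus: bool = True) -> bool:
    """True if the terminal domain contains all and only the released residues.

    `released` is scored plus/minus: amounts and order are discarded.
    """
    observed = set(released)
    leading = first_sequence(sequence, from_c_terminus)[:len(observed)]
    return set(leading) == observed


def search(database: Dict[str, str], released: Iterable[str],
           from_c_terminus: bool = True) -> List[str]:
    """Names of entries whose terminal domain matches the released set."""
    observed = set(released)
    return [name for name, seq in database.items()
            if matches_terminus(seq, observed, from_c_terminus)]


toy_db = {  # hypothetical entries
    "toy1": "MDEKIGLVSSAPPAVSGAPL",
    "toy2": "MKTAYIAKQR",
    "toy3": "MSSGAPLLAP",
}
print(search(toy_db, "LPA"))     # ['toy1', 'toy3']
print(search(toy_db, "LPAGSV"))  # ['toy1']
```

In this toy database the larger released set removes the ambiguity left by the smaller one, with no information about the order of release.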

It will be appreciated that this principle can also be applied to constitutional analysis at the N-terminus of proteins, using the NFS or MFS. Matching to database entries may be successful at the N-terminus of the protein but, in the case of a protein from which an N-terminal sequence has been removed, it will be necessary to search for an internal domain with the same constituent amino acids. In combination, the constitutional analysis of the N- and C-termini of a protein, with reference to sequence information in a database, should permit identification of a protein. The N-terminal domain of a protein can be defined, in terms of its constituent amino acids, by using enzymic release by aminopeptidase, by using rapid Edman chemistry in a detuned N-terminal sequencer and/or by mass spectrometry. Other methods may become available in the future, but they will not affect the validity of the algorithm.

The potential for success of this approach has been evaluated by scanning the Owl database of 128,719 protein sequences by matching, using the Shuffling Algorithm, against all or part of the CFS and MFS of some 250 representative proteins which were selected at intervals from the same database. The purpose of this experiment was to evaluate how large a set of terminal amino acids is needed in order to define a protein uniquely. The outcome is that determining the set of 5-8 amino acids at the N- or C-terminus of a polypeptide greatly limits the number of possible protein matches. If this result is used in combination with a somewhat larger set (10-14 amino acids) from the same terminus, the match is usually unique. Such results are typically generated by graded enzymatic treatment of a polypeptide, such as with carboxypeptidase. Equivalent information can be generated by the use of mass spectrometry, such as MALDI-TOF, to deduce constitutional information about terminal domains. The only exceptions so far encountered to the success of such an experiment have been multiple matches to such proteins as actins and histocompatibility antigens, which occur in the database as either multiple entries or proteins of extremely high homology.

It will be appreciated that the set of amino acids at one terminus can be considered in combination with the set from the other terminus, with the same successful outcome. In practical terms, a protein can be identified by defining the set of amino acids released from either the C- or N-terminus at two or more different times, or at two or more different extents. Optimally, the first set will consist of some 5-8 residues, while the later set should have some 3-5 additional amino acids. This outcome has the important implications that it is not necessary to study the detailed kinetics of release and that it is possible to use the technique on a mass screening (automated) basis (e.g., in 96-well microtitre plates). It therefore avoids the well-known difficulties of using carboxypeptidases and aminopeptidases to determine sequences at the C- and N-termini of proteins, because knowing the constitution of these domains has enough informational content for protein identification. The approach is successful, even in the presence of 'problem' amino acids, such as phosphorylated or glycosylated amino acids, in the terminal domain.

Another outcome of a systematic evaluation of this algorithm is the specification of degrees of relatedness of proteins, which may be of value in the taxonomy or classification of proteins (see Application 4 below).

2. Identification of a mature protein by comparison of constitutional domains with predicted proteins in a database

Since the majority of the sequences in the protein databases are sequences deduced from DNA sequences, and the terminal sequences of mature proteins may be anywhere in the reference sequences, rather than only at the ends, a more extensive search needs to be made of the protein sequences to identify internal matching. As before, these domains may contain the constituent amino acids in any order. A scan of the Owl database for matches against the CFS and MFS of 246 selected proteins was made. To ensure that the results were of practical significance, the CFS and MFS were constructed from terminal blocks of only 10 amino acids and they each contained, on average, only 5-8 different amino acids. The searches of the database were for the occurrence of constitutional domains matching both the CFS and MFS at any point in the polypeptide chains. In a large number of cases, the combination of CFS and MFS of the reference protein match only one or a small number of proteins. In some extreme cases, however, there are a large number of matches, for a combination of possible reasons. Firstly, if there are several amino acid repeats at or near the termini of the reference polypeptides, which fall within the allocated string of 10 terminal amino acids, the First Sequences are too short (e.g. 5 amino acids). Secondly, if the constitution blocks are inherently more common in the database (which may be of considerable significance in the evolution of proteins), then more matches are made. An extreme example of this outcome is Urechistachykinin II, which has only 5 amino acids in each of the First Sequences and matches to 541 proteins. Inspection of the matches, however, reveals high levels of redundancy or near-redundancy.

When the Owl database was searched for matches against 246 reference proteins, an aggregate total of 1716 matches was obtained, corresponding to an average match of 7.0 per reference protein. When the extreme cases with high scores were excluded, such as when large numbers of redundant or near-redundant entries were present, the match score was much lower (approximately 3).

The results demonstrate that a global search based on the constitutions of two protein domains of 10 amino acids provides a powerful and simple means of matching a protein to entries in a database. Moreover, the search can be made even more discriminating by selecting a modest increase in the lengths of the match domains. Present results indicate that the domains should optimally contain 7 or 8 different amino acids.

3. Identification of a protein from the constitutional analysis of one or more derived peptides

Proteins can be converted to peptides by the use of proteases. One or more of these peptides can be analysed chemically (to define the constituent amino acids) or by mass spectrometry, to define possible constituent amino acids (within the limits of confidence imposed by amino acid isomers or near-coincidence of molecular mass of amino acids). It will be appreciated that each of these peptides can be considered as a constitution block, and that comparisons can be made with database entries as under Application 2 above. It will also be appreciated that, in the event of blockages at the termini of the isolated protein, it will be necessary to prepare peptides in this way.

A constitutional analysis of a peptide, such as that obtained by the action of a protease on a polypeptide or protein, has some informational content, but is unlikely to be of great value in itself for the identification of the protein from which the peptide has been derived. This can be demonstrated by searching a protein database for proteins containing constitutional domains corresponding to all or part of such a peptide. As indicated, searching of a database can be done on the basis of constitutional domains (First Sequences), but the present inventors have found that the division of a constitutional domain into constitutional subdomains (First subsequences) makes the matching to database entries much more powerful.

An outline of the principles of searching on the basis of constitutional domains is given below, in which the present inventors define two broad categories of subdomains: discrete subdomains (discrete First subsequences) and merged subdomains (merged First subsequences). Constitutional domains and subdomains can be specified by experimental data derived from:

Amino acid analysis. Here a physical domain of a protein, corresponding to a derived peptide, is subjected to total hydrolysis and analysis of the resultant amino acids. For the purposes of the present explanation, the amino acid analysis defines a single domain of the protein.

Edman Degradation. In its normal application, Edman degradation provides a partial sequence of a protein or peptide, starting at the N-terminus. In one specific adaptation of this technique, individual cycles of the degradation are not submitted for analysis, but instead they are combined and analysed in batches corresponding to, for example, two to five cycles. This reduces the time and cost of analysis. Each of the batched analytical runs corresponds to a specific number of cycles of degradation and, therefore, to domains of a specific number of amino acid residues in the protein sequence. Each cycle therefore defines one of the contiguous discrete subdomains which can be used for database searching.

Alternatively, modified N-terminal sequencing can be used, in which the cleavage step is shortened to provide much faster through-put. A consequence of this approach is that the repetitive yield of each degradation cycle is somewhat lowered and significant 'lag' or smearing between the cycles is introduced if the cycles are submitted individually to analysis. As before, cycles can be batched for analysis but, because of the 'lag', the number of cycles (and the number of residues in the subdomain) is less well defined. If the 'lag' is significant, the effective number of cycles will become progressively less than the actual number of degradation cycles applied. Database searching should therefore be performed for merged subdomains of unspecified length.

Aminopeptidase or carboxypeptidase degradation. A peptide or protein can be digested with one of these enzymes to obtain the step-wise release of amino acid residues. When timed aliquots of the product mixture are collected, however, it is not possible to deduce sequence information because of the significant differences in the rates of release of different amino acids by the enzymes. Instead, mixtures of amino acids are obtained in each sample. When these mixtures are analysed, however, they correspond to merged subdomains of unspecified length. Database searching can be done on this basis.

C-terminal chemical sequencing. This approach has been only partially successful for the determination of sequences in the C-terminal domain of peptides and proteins because of significant differences in the rates of release of different amino acid residues. From the present perspective, it is appropriate to consider the released residues as members of merged subdomains.

Mass spectrometry. By one of a number of means of induced fragmentation, a protein or (preferably) a peptide can be partially converted into smaller peptide and amino acid fragments. This provides some sequence information, as well as considerable constitutional information about subdomains. This can be used to specify database searches on the basis of these subdomains.
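The distinction drawn in the list above between data that define discrete subdomains (batched Edman cycles) and data that define merged subdomains (timed enzymic aliquots) can be sketched as follows. Both functions are idealised simulations on a known sequence, with hypothetical names; real release is not synchronous, which is the reason the merged case carries no definite length.

```python
from typing import FrozenSet, List, Sequence, Tuple


def batched_edman(sequence: str, cycles_per_batch: int,
                  batches: int) -> List[Tuple[FrozenSet[str], int]]:
    """Discrete subdomains: one (residue set, length) pair per batch of
    consecutive degradation cycles, starting at the N-terminus."""
    subdomains = []
    for i in range(batches):
        block = sequence[i * cycles_per_batch:(i + 1) * cycles_per_batch]
        if not block:
            break
        subdomains.append((frozenset(block), len(block)))
    return subdomains


def timed_aliquots(sequence: str,
                   residues_released: Sequence[int]) -> List[FrozenSet[str]]:
    """Merged subdomains: each aliquot yields the set of all residues
    released so far from the N-terminus. The counts are used only to
    simulate the data; they would be unknown to the analyst."""
    return [frozenset(sequence[:n]) for n in residues_released]


peptide = "APGDKEGSEG"
for residue_set, length in batched_edman(peptide, 5, 2):
    print(sorted(residue_set), length)
for residue_set in timed_aliquots(peptide, [5, 10]):
    print(sorted(residue_set))
```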

The examples described below illustrate the use of the subdomain algorithm for the searching of the SwissProt database. The examples demonstrate the dramatic increase in informational content of a domain when it is increasingly divided into subdomains, especially in respect to merged subdomains, and demonstrate the validity of applying the subdomain algorithm to previously problematic techniques for the identification of proteins.

EXAMPLES

To illustrate some of the present methods, two protein domains have been selected from the SwissProt database, revision dated December 1997. These domains are intended to be representative of domains found in most proteins, and have been selected deliberately to exclude the less common amino acids, such as cysteine (C), histidine (H) and tryptophan (W), the presence of which would tend to increase the uniqueness, and therefore the informational content, of the domain.

Domain APGDKEGSEG

This domain is used to illustrate the subdomain algorithm. It is considered representative, as it contains only common amino acids, and there are repetitions, with three G and two E residues. Therefore, although the domain is 10 amino acid residues long, there are only 7 different amino acids in it. This is less than the more preferred number (10 different amino acids) for the highest informational content.

The SwissProt database contains 69,113 protein entries. If it is searched for domains corresponding to APGDKEGSEG (i.e. domains at least 10 amino acids long which contain only and all the 7 amino acids specified), 1311 matching proteins (1.90% of the database) are identified. On this basis, the informational content of such a domain is limited with respect to identifying the protein. If, however, the domain is divided into two or more subdomains, such as APGDK and EGSEG, the informational content is increased considerably. At this point, the present inventors introduce a notation for defining the subdomains and how they are determined:
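The whole-domain search described above (a stretch at least 10 residues long that contains only and all of the 7 specified amino acids) can be sketched as follows. The function name `find_domains` and the test strings are hypothetical. The sketch relies on the fact that any qualifying stretch lies inside a maximal run of allowed residues, and that the maximal run then qualifies as well, so only maximal runs need to be examined.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def find_domains(sequence: str, domain: str,
                 min_length: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (start, run) for each maximal run of allowed residues that is
    long enough and contains every residue of the domain at least once."""
    required = set(domain)
    if min_length is None:
        min_length = len(domain)
    hits = []
    position = 0
    for allowed, group in groupby(sequence, key=lambda aa: aa in required):
        run = "".join(group)
        if allowed and len(run) >= min_length and set(run) == required:
            hits.append((position, run))
        position += len(run)
    return hits


print(find_domains("MWAPGDKEGSEGHC", "APGDKEGSEG"))  # [(2, 'APGDKEGSEG')]
print(find_domains("WGESGEKDGPAW", "APGDKEGSEG"))    # [(1, 'GESGEKDGPA')]
print(find_domains("WAPGDKESW", "APGDKEGSEG"))       # [] (run too short)
print(find_domains("WAPGDKEGGEGW", "APGDKEGSEG"))    # [] (S is missing)
```

The second call shows why such a search has limited discriminating power: any ordering of the same residues matches.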

APGDK>EGSEG defines a domain APGDKEGSEG which consists of two subdomains, APGDK and EGSEG, which have been determined by analysis from the left-hand (N-terminal) end.

APGDK<EGSEG defines a domain which consists of the same two subdomains, but which have been determined by analysis from the right-hand (C-terminal) end.

A greater number of subdomains may be defined, such as in APG>DKE>GSEG, which will generally consist of smaller numbers of constituent amino acids. Furthermore, depending on the experimental situation, these subdomains may be defined as discrete or merged.

Discrete subdomains are non-overlapping and of definite and defined length.

For example, the discrete subdomains of the domain APGDK>EGSEG correspond to: a constitution block APGDK, which consists of 5 amino acid residues, contiguous with another constitution block EGSEG, also consisting of 5 amino acid residues, which is located immediately at the C-terminal end of APGDK. There are no intervening amino acids, and these constitution blocks do not overlap, even when they contain amino acids in common.
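Matching with discrete subdomains can be sketched as a sliding comparison of contiguous, fixed-length blocks. The function name and test strings are hypothetical, and each block is compared by its set of residues only; requiring the counts of each residue to match as well would be a stricter variant that is not assumed here.

```python
from typing import List, Sequence


def find_discrete(sequence: str, blocks: Sequence[str]) -> List[int]:
    """Start positions at which the contiguous blocks match in order.

    Each block must have the stated length and contain all and only the
    residues of the corresponding subdomain, in any order.
    """
    required = [set(block) for block in blocks]
    lengths = [len(block) for block in blocks]
    total = sum(lengths)
    hits = []
    for start in range(len(sequence) - total + 1):
        position = start
        for residue_set, length in zip(required, lengths):
            if set(sequence[position:position + length]) != residue_set:
                break
            position += length
        else:  # every block matched
            hits.append(start)
    return hits


blocks = ["APGDK", "EGSEG"]
print(find_discrete("MWAPGDKEGSEGHC", blocks))  # [2]
print(find_discrete("MWKDGPAGESGEHC", blocks))  # [2] (shuffled within blocks)
print(find_discrete("MWGESGEKDGPAHC", blocks))  # []  (blocks in wrong order)
```

The last string contains the undivided domain in shuffled form but no longer matches once the domain is divided into two ordered blocks.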

When the notation APGDK > EGSEG is used, it is assumed that the analysis of the domain is carried out from the N-terminal end. Correspondingly, the domain notation APGDK < EGSEG is used when the analysis is carried out from the C-terminal end.

Merged domains overlap with one another, and may be of somewhat indefinite length. When the domain is APGDK>EGSEG and the subdomains are understood to be merged, the domain corresponds to: a constitution block which contains all and only the amino acids in the set {APGDK} and which may be of any length equal to or greater than the minimum length of (in this case) 5 amino acids; and an overlapping domain, starting at precisely the same position as the first block, and defined as containing all and only the amino acids in the set {APGDKEGSEG}, which is the sum of {APGDK} and {EGSEG} and which is reduced to {APGDKES} by the removal of duplicates, which again may be of any length. When the notation APGDK>EGSEG is used, it is again assumed that the analysis of the domain is carried out from the N-terminal end. The two subdomains are aligned at the N-terminal end of the overall domain.
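A sketch of matching with merged subdomains aligned at the N-terminal end follows. For every start position the residues are accumulated while they stay within the largest set, and each cumulative set must be reached exactly at some length. The names are hypothetical, and the default minimum lengths (the cumulative residue counts of the blocks, 5 and 10 here) are an assumption, since the text gives a minimum only for the first block. The C-terminal case can be handled in the same way on the reversed sequence with the blocks taken in reverse order.

```python
from typing import List, Optional, Sequence


def find_merged_n_terminal(sequence: str, blocks: Sequence[str],
                           min_lengths: Optional[Sequence[int]] = None
                           ) -> List[int]:
    """Start positions from which every cumulative residue set is reached."""
    if not blocks:
        return []
    targets = []
    seen = set()
    for block in blocks:
        seen |= set(block)
        targets.append(frozenset(seen))  # e.g. {APGDK}, then {APGDKES}
    if min_lengths is None:
        # ASSUMPTION: minimum length = cumulative residue count of the blocks
        min_lengths, total = [], 0
        for block in blocks:
            total += len(block)
            min_lengths.append(total)
    widest = targets[-1]
    hits = []
    for start in range(len(sequence)):
        reached = [False] * len(targets)
        present = set()
        for length, residue in enumerate(sequence[start:], start=1):
            if residue not in widest:
                break  # the overall domain ends here
            present.add(residue)
            for i, target in enumerate(targets):
                if present == target and length >= min_lengths[i]:
                    reached[i] = True
        if all(reached):
            hits.append(start)
    return hits


blocks = ["APGDK", "EGSEG"]
print(find_merged_n_terminal("MWAPGDKEGSEGHC", blocks))    # [2]
print(find_merged_n_terminal("MWAPGGDDKEGSEGHC", blocks))  # [2]
print(find_merged_n_terminal("MWEGSEGAPGDKHC", blocks))    # []
```

The second string has a first subdomain of 7 residues and still matches, which illustrates the indefinite length of merged subdomains.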

The domain notation APGDK < EGSEG is used when the analysis is carried out from the C-terminal end. In this case the first subdomain is defined by the set {EGSEG}, which is reduced to {GSE} by elimination of duplicates, and the second subdomain is defined as {APGDKES} after elimination of duplicates. These subdomains are aligned at the C-terminal end of the overall domain.

It will be seen that there are four methods for the matching of database entries against a defined domain:

(i) N-terminal discrete contiguous subdomains

(ii) C-terminal discrete contiguous subdomains

(iii) N-terminal merged contiguous subdomains

(iv) C-terminal merged contiguous subdomains
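The merged case (methods (iii) and (iv)) can be sketched as below. The names `merged_sets` and `matches_merged` are hypothetical. The matching rule is an assumption drawn from the description above: reading from the aligned terminus, the set of amino acids seen so far must at some point equal each merged set in turn, with no limit on the length of each subdomain.

```python
def merged_sets(blocks, terminus="N"):
    """Cumulative amino acid sets, built from the aligned terminus outwards."""
    ordered = blocks if terminus == "N" else list(reversed(blocks))
    seen = set()
    result = []
    for block in ordered:
        seen |= set(block)          # duplicates vanish in the set union
        result.append(frozenset(seen))
    return result


def matches_merged(sequence, blocks, terminus="N"):
    """True if the running constitution of sequence hits every merged set in order."""
    targets = merged_sets(blocks, terminus)
    residues = sequence if terminus == "N" else sequence[::-1]
    seen = set()
    next_target = 0
    for residue in residues:
        seen.add(residue)
        while next_target < len(targets) and seen == targets[next_target]:
            next_target += 1
        if next_target == len(targets):
            return True
        if not seen <= targets[-1]:
            return False            # a residue outside the whole domain was read
    return False


print(merged_sets(["APGDK", "EGSEG"], "C"))
print(matches_merged("AAPPGDKKEGSEGXYZ", ["APGDK", "EGSEG"], "N"))
print(matches_merged("AEPGDKS", ["APGDK", "EGSEG"], "N"))
```

In the last call E is read before the first subdomain {APGDK} is complete, so the running set never equals it and the match fails, even though the overall constitution {APGDKES} is eventually reached.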

An analysis of the matching algorithm shows that the results of matching by methods (i) and (ii) are identical. Methods (i) and (ii) are more discriminating than methods (iii) and (iv) (see Tables 1 and 2 below). Moreover, for matching of experimental data to the database, it may be of advantage to define more than two subdomains for matching, such as AP>GDK>EGSEG or APG<DKE<GSEG. As before, if the subdomains are discrete, the notation specifies a number of compositional subdomains which are contiguous and ordered with respect to one another. If the domains are merged, the notation specifies a number of overlapping subdomains which are aligned at either the N- or C-terminus of the overall domain.

AP>GDK>EGSEG using merged domains, for example, corresponds to the three subdomains {AP}, {APGDK} and {APGDKES}, which are aligned at the N-terminal end.

Table 1. The result of searching the SwissProt database for the domain APGDKEGSEG (10 amino acid residues, 7 different).

* correct match, plus two identical and one additional false matches. If analysis had been continued for one further amino acid residue, there would have been a unique match (LI instead of T or N).

Table 2. The result of searching the SwissProt database for the domain RDVTLEASRE (10 amino acid residues, 8 different).

The results in Table 1 indicate that the subdivision of a protein constitutional domain into two or more subdomains confers a greatly increased informational content. While the domain APGDKEGSEG gave 558 matches to entries in the database, its subdivision into two equal discrete subdomains gave a single match and identified the protein. The results in Table 1 suggest that, if only two subdomains are to be defined, they should be of approximately equal size. As indicated, merged domains are less powerful for matching, but the final entry in Table 1, using three subdomains, gave matches to seven proteins, four of which were false matches to highly homologous collagen fragments.

In Table 2, no unique match was found, but rather a minimal set of five matches. These were almost identical phosphatases from different sources, all of which had identical sequences in the domain on which the search was conducted. These could not have been distinguished from one another on the basis of a conventional search on the sequence.

4. Determination of familial relationship between proteins

Because it is possible to identify a protein on the basis of matches of constitutional domains, it is important to identify how large these domains need to be for effective matching. To this end, the databases were scanned using the Shuffling Algorithm, which scans the whole of a database and scores matches against terminal constitutional domains of various lengths derived from some 250 reference proteins. The results are obtained in tabular form, in which scores of proteins with differing degrees of relatedness are grouped in different rows of the results array. It will be appreciated that the algorithm provides a new method for classification of protein family relationships.

5. Techniques for more efficient use of large databases

With the rapid increase in the number of protein sequences resulting from the Human Genome and related projects, the size of databases is increasing rapidly, with the result that accessing and processing information is becoming more difficult, despite improvements in computer hardware. There is therefore a need for methods for data compression and data scanning. The combination of two complementary First Sequences of 20 amino acids (e.g., a CFS and an MFS) provides for 5.9 x 10³⁶ possible discriminating combinations in only 40 bytes of storage space. Two or more proteins may have the same combination of signatures, but only if they are very closely related, such as those with repeat motifs or point mutations deep within the proteins. A scan of the Owl and SwissProt databases for matches against a combination of MFS and CFS from 498 entries extracted at intervals from the same databases, showed that very few proteins have the same combination of signatures.
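The figure of 5.9 x 10³⁶ is consistent with counting each First Sequence as one ordering of the 20 standard amino acids, so that a pair of First Sequences gives (20!)² combinations. The short Python check below rests on that interpretation, which is an assumption and is not stated explicitly in the text.

```python
import math

# Assumption: each First Sequence is one permutation of the 20 standard amino acids.
orderings_per_first_sequence = math.factorial(20)

# Two independent First Sequences (e.g. a CFS and an MFS) multiply the counts.
pair_combinations = orderings_per_first_sequence ** 2

# At one byte per residue, the pair occupies 2 x 20 = 40 bytes.
storage_bytes = 2 * 20

print(f"{pair_combinations:.2e} combinations in {storage_bytes} bytes")
```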

Together with hashing and key indexing schemes, this combination of First Sequences provides a powerful, compact and efficient method for classification, storage and access of protein and similar sequence information. As an illustration of the power of a simple hashing scheme, the CFS and MFS from each of the same 498 entries in the Owl database were combined into a 32-bit integer. This was used to scan all 128,719 entries for matches based on the 32-bit integers constructed for each entry. The results of searching in this way are identical to those obtained by direct matching to the combination of the CFS and MFS.

The size of an unsigned 32-bit integer (4.29 x 10⁹) is, of course, smaller than the figure of 5.9 x 10³⁶ for the direct combination of a CFS and MFS, but is likely to be more than adequate for the management of practical databases. A small number of coincidences ('collisions') have been noted, but in no case has there yet been found a coincidence between unrelated proteins. The collisions are likely to occur only between closely related proteins, which have limited structural differences deep within the protein chain. An example which has been identified is the family of Human Immunodeficiency Antigens which differ in only one or a few isolated amino acids.

The identification of a protein with a single 32-bit integer represents an enormous compression of structural information. As an example, a representative protein of some 300 amino acid residues, which is defined by a string of 300 ASCII characters (requiring 300 bytes of storage), is compressed to 4 bytes, corresponding to a compression by a factor of 75.

Another very important consequence of the representation of a protein as an integer is that it is possible to arrange all the entries in a protein database in a sequential order, based on the value of that integer. As a result, using a keyed index system, extremely fast access can be obtained to any entry in the database, regardless of its position in the main data file. In the event that there are collisions with more than one protein represented by the same integer, standard linked-list implementations of buckets can be applied, with little compromise in efficiency.

It is anticipated that protein databases will include increasingly large numbers of data for 'real' proteins, rather than those inferred or predicted from nucleic acid sequences. It will be appreciated that when proteins lack N-terminal signal sequences and/or C-terminal sequences, they will have different hashed integers and be ranked differently in an ordered database. In many cases, however, where there are minor deletions or modifications deep within the polypeptide chain, they will have identical hashed values. An advantage of this outcome is that both the 'hypothetical' and 'real' proteins can be accommodated and distinguished in the same database.
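A keyed index with collision buckets of the kind described above might look like the following Python sketch. The names `build_index`, `lookup` and `toy_key` are hypothetical. Python lists stand in for the linked-list buckets, and `toy_key` is only a placeholder integer key, not the hashing scheme of this specification.

```python
from bisect import bisect_left
from collections import defaultdict


def toy_key(sequence):
    """Placeholder 32-bit key (sum of character codes); an assumption for illustration."""
    return sum(ord(residue) for residue in sequence) & 0xFFFFFFFF


def build_index(entries, key_of):
    """Return (sorted integer keys, buckets mapping key -> list of entry ids)."""
    buckets = defaultdict(list)
    for entry_id, sequence in entries:
        buckets[key_of(sequence)].append(entry_id)   # colliding entries share a bucket
    return sorted(buckets), dict(buckets)


def lookup(index, key):
    """Binary search of the ordered keys, then return the whole bucket."""
    sorted_keys, buckets = index
    position = bisect_left(sorted_keys, key)
    if position < len(sorted_keys) and sorted_keys[position] == key:
        return list(buckets[key])
    return []


entries = [("P1", "ACD"), ("P2", "DCA"), ("P3", "WWW")]
index = build_index(entries, toy_key)
print(lookup(index, toy_key("ACD")))   # P1 and P2 collide under the placeholder key
print(lookup(index, toy_key("WWW")))
```

Because the keys are integers held in sorted order, a lookup costs a binary search over the in-memory index plus a scan of one (usually very short) bucket.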

It will be appreciated that this approach to the ordering of proteins is general and that it can be applied directly to DNA databases, or indirectly to the same databases, such as by conversion of nucleotide base sequences to implied amino acid sequences, followed by the generation of First Sequences and hashing. The associated compression of the First Sequence information means that it is practical to load and hold whole indexes to database files in memory, to provide a very significant improvement in the speed of searching and matching.

ALGORITHMS

1. Algorithm for the construction of a C-First Sequence (CFS)

This is constructed from the complete single-letter sequence of a protein chain by:

a. Allocation of an empty string for the CFS;

b. Working from the right-hand end (the C-terminus) of the main protein chain, each of the amino acid residues is inspected in turn. As each of the amino acids is encountered for the first time, it is added to the left-hand end of the CFS. The movement through the main protein chain is from right to left, and the CFS grows by addition of new amino acids at the left-hand end.

With obvious modifications, this algorithm can also be used for the construction of an NFS or MFS.
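Steps a. and b. translate almost directly into code. The Python sketch below uses hypothetical function names. The N-terminal variant is one reading of the 'obvious modifications' (scan left to right and grow the string at its right-hand end) and is an assumption. No MFS variant is shown.

```python
def c_first_sequence(protein):
    """Build the CFS: scan from the C-terminus, prepend each new amino acid."""
    cfs = ""                                # step a: empty string
    for residue in reversed(protein):       # step b: right-to-left scan
        if residue not in cfs:              # first encounter only
            cfs = residue + cfs             # grow at the left-hand end
    return cfs


def n_first_sequence(protein):
    """Assumed N-terminal analogue: scan left to right, append each new amino acid."""
    nfs = ""
    for residue in protein:
        if residue not in nfs:
            nfs += residue
    return nfs


print(c_first_sequence("APGDKEGSEG"))
print(n_first_sequence("APGDKEGSEG"))
```

For the ten-residue example domain APGDKEGSEG both results contain the same seven amino acids, but in different orders, since each records the order of first encounter from its own end.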

2. The Shuffling Algorithm

This algorithm is described here in relation to the use of a CFS but, with obvious modifications, it can be applied to an NFS or MFS. The algorithm is used to compare a CFS (or the C-terminal part of an incomplete CFS, such as would be obtained experimentally) with amino acid sequences in a database.

Description:

a. A reference CFS is constructed, either from the sequence of the reference protein or, more usually, by using constitutional information obtained by release of amino acids from the C-terminal domain of the protein. In this latter case, the partial CFS would be in the form of one, or (preferably) more, constitutional sets, and would lack the implicit sequence information in a full CFS.

b. A two-dimensional scoring matrix is set up with dimensions (n × n), where n = the length of the reference CFS. All elements of the matrix are cleared to 0.

c. Each entry in the protein database is scanned in the following loop:

(i) A test CFS is constructed for the entry.

(ii) A scoring pointer is arranged so that it points to the [0,0] element of the matrix (at the top left corner of the matrix). In the event that the reference CFS is shorter than the test CFS, they are aligned at the C-terminal end and the tests carried out on the shorter length. The scoring pointer is then advanced to the [n,n] element of the scoring matrix, where n here denotes the difference in length of the CFSs. If the reference CFS is longer than the test CFS, they are aligned at the C-terminus as before, but scoring still commences at the [0,0] element of the scoring matrix.

(iii) The test CFS is compared with the reference CFS, to determine whether it contains all and only the amino acids present in the reference CFS.

(iv) If there is a match, the element of the scoring matrix (which is currently addressed by the pointer) is incremented by 1. If there is no match, no score is recorded and the scoring pointer is incremented to point to the next row in the score matrix. In either case, the scoring pointer is incremented to point to the next column.

(v) The reference and test CFSs are shortened by one character by removal of the left-most character; then steps (i) to (v) are repeated in a loop until the final character of the reference CFS has been scanned. In this sense, there is a shuffling action towards the right and each successive scan uses a shorter substring of the reference and test CFS.

d. After the application of the algorithm to all the entries in a database, the scoring matrix is output to an ASCII file as a tab-separated table, suitable for importation into word-processing and spreadsheet programs.

As observed, the Shuffling Algorithm makes a stepwise comparison of the reference and test CFSs. It should be understood that a comparison is made of the whole of the active parts of the CFSs in terms of amino acid constitution, regardless of sequence. Therefore, a positive match at any point in the loop (c. (i)-(v) above) corresponds to a constitutional match between the CFSs from the left-most active character to the C-terminal end.

The score in the top left element of the matrix (the 0,0 element) corresponds to the total number of matches to the complete reference CFS. This corresponds to the total number of proteins which contain all and only the amino acids in the set corresponding to the reference CFS. The score in the 0,1 element (immediately to the right of this) corresponds to the number of matches for this CFS which has been shortened by removal of one character from the left-hand end. Scores in the matrix positions further to the right correspond to matches to the CFS after further shortening.

In the event, therefore, of a series of successful matches of a reference CFS to a test CFS, scores are recorded along one of the rows of the score matrix. As indicated, when there is no match, scoring is transferred to the next row below. Therefore, in the case of a CFS of 20 amino acids, there will be a score matrix of 20 rows, the top row of which contains the scores of matches with no previous errors, the second row of which contains matches with one previous error, and so on down to the bottom row in which there are matches with 20-1 = 19 errors.

A consequence of this scoring algorithm is that all first matches are scored in the diagonal elements of the matrix, and that matches subsequent to this are in the row to the right of and commencing at the diagonal element.

It will be noted, too, that the majority of diagonal scores do not continue to the right in this way. This is because of fortuitous matches, which have no structural homology with the test CFS, and therefore have little or no sequence properties in common with it. The placement of first matches in the diagonal elements illustrates the principle that failures move diagonally downwards. If, for example, there is a failure to match at element 0,0 then the next opportunity to match will be at element 1,1. On the other hand, if there is a successful match at element 0,0 but failure to match at element 0,1 then the next opportunity to match is at element 1,2. In this sense, there is a 'knight's move' from one successful score through a failure to another successful score. If there is no successful score after the 'knight's move', the failures will move diagonally downwards as before until (possibly) there is another match.
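The scoring behaviour described above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the inventors' implementation: the function names are hypothetical, the reference is a full CFS string rather than experimental constitutional sets, and unequal lengths are handled simply by comparing, at each step, the active part of the reference with the C-terminal part of the test CFS of at most the same length.

```python
def c_first_sequence(protein):
    """CFS: scan from the C-terminus, prepend each amino acid on first encounter."""
    cfs = ""
    for residue in reversed(protein):
        if residue not in cfs:
            cfs = residue + cfs
    return cfs


def shuffle_scores(reference_cfs, database):
    """Return the n x n score matrix for one reference CFS over all entries."""
    n = len(reference_cfs)
    matrix = [[0] * n for _ in range(n)]
    for protein in database:
        test_cfs = c_first_sequence(protein)
        row = 0                                        # pointer starts at [0,0]
        for col in range(n):                           # the column always advances
            ref_active = reference_cfs[col:]           # left-most character dropped each step
            test_active = test_cfs[-len(ref_active):]  # aligned at the C-terminal end
            if set(test_active) == set(ref_active):    # all and only the same amino acids
                matrix[row][col] += 1
            else:
                row += 1                               # a failure moves scoring down a row
    return matrix


def as_tab_separated(matrix):
    """Format the matrix as a tab-separated table (step d)."""
    return "\n".join("\t".join(str(value) for value in row) for row in matrix)


reference = c_first_sequence("APGDKEGSEG")
scores = shuffle_scores(reference, ["APGDKEGSEG", "WWGDKEGSEG"])
print(as_tab_separated(scores))
```

In this toy run the first entry matches at every step and fills the top row. The second entry fails twice and then matches, so its first score falls on the diagonal element 2,2 and continues to the right along row 2.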

To assess the value of applying the Shuffling Algorithm, CFSs for 246 proteins were collected by selecting proteins at regular intervals through the Owl database. Each of these in turn was used as the reference CFS for comparison with all the 128,719 entries in the same database.

A similar approach was used to collect 246 reference MFSs, which were compared with all entries in the Owl database. The only difference was that the shuffling action used was in the right-to-left direction, rather than the left-to-right direction used above for CFSs. In this case, the diagonal on which first matches are recorded is from top-right to bottom-left.

3. Illustrative algorithm for hashing of CFS and MFS

a. Construct the CFS and MFS for the protein. Unless these include post-translationally modified amino acids, these will never be longer than 24 characters (bytes).

b. Set aside a buffer of 24 bytes, and clear it to zeros.

c. Transfer the MFS to the buffer. Set up a pointer to the start of the buffer.

d. Set up an accumulator of 4 bytes and clear it to zero.

e. Combine the 4 bytes addressed by the pointer with the value in the accumulator, using an exclusive-or (XOR) merge, and replace the result in the accumulator.

f. Add 4 to the value of the pointer.

g. Repeat steps e. and f., until all 24 bytes in the buffer have been merged with the accumulator.

h. Mask the value in the accumulator with the hexadecimal value $1F1F1F1F to eliminate unwanted bits and left-shift the accumulator value by 3 places. Save the result.

i. Clear the buffer to zero, transfer the CFS into it and set the pointer to the start of the buffer.

j. Repeat steps d. to g. as before. Mask the result by an AND with $1F1F1F1F and merge with the previous result obtained in h., using an XOR operation. The result is taken as the unsigned integer to describe the protein.

The upper-case ASCII representation of the alphabetical characters requires 7 bits, but only 5 bits are distinctive. The masking and left-shifting in step h. are designed to minimise overlap between the bits derived from the MFS and those from the CFS, but other methods can be used for masking and shifting.

It will be appreciated that there are many possible variants, using (e.g.) OR, AND, NAND, + and -, to carry out the merges with the accumulator. Furthermore, for the potentially enormous databases of the future, and to accommodate the inclusion of post-translationally modified amino acids, a larger integer value can be developed, such as an 8-byte integer, which can accommodate 1.8 x 10¹⁹ different values.
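Steps a. to j. of the illustrative hashing algorithm can be sketched in Python as follows. The function names are hypothetical. The byte order used when reading each group of 4 bytes is not specified in the text, so little-endian is assumed here. The two First Sequences are taken as ready-made strings, since the construction of an MFS is not shown.

```python
import struct

MASK = 0x1F1F1F1F     # keeps the 5 distinctive low bits of each byte
BUFFER_SIZE = 24      # steps b. and i.: 24-byte zeroed buffer


def fold(first_sequence):
    """Steps b. to g.: XOR the buffer into a 4-byte accumulator, 4 bytes at a time."""
    data = first_sequence.encode("ascii")
    if len(data) > BUFFER_SIZE:
        raise ValueError("First Sequence longer than the 24-byte buffer")
    buffer = data.ljust(BUFFER_SIZE, b"\x00")            # zero-padded buffer
    accumulator = 0
    for (word,) in struct.iter_unpack("<I", buffer):     # assumed little-endian words
        accumulator ^= word
    return accumulator


def hash_first_sequences(mfs, cfs):
    """Steps h. to j.: mask and shift the MFS part, mask the CFS part, XOR them."""
    mfs_part = ((fold(mfs) & MASK) << 3) & 0xFFFFFFFF
    cfs_part = fold(cfs) & MASK
    return mfs_part ^ cfs_part


print(hex(fold("ABCDE")))
print(hash_first_sequences("A", "A"))
```

With the mask applied first, the 3-place left shift cannot overflow 32 bits (0x1F1F1F1F shifted by 3 is 0xF8F8F8F8), so the final AND with 0xFFFFFFFF is only a safeguard.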

DEFINITIONS

Sequence: a string of ASCII characters which represents in a left-to-right sense the order of amino acids, as read from the N-terminal to the C-terminal end of a protein or peptide.

Constitutional analysis: a set of the amino acids present in a protein or peptide, without specification of the relative amounts. No specific sequence of amino acids is implied.

Compositional analysis: a listing of the amino acids present in a protein or peptide, together with the proportions or percentages of each. No specific sequence of amino acids is implied.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

1. A method of assigning a First Sequence to a peptide or polypeptide amino acid sequence, the method comprising reading from one direction of the peptide or polypeptide amino acid sequence and recording at least some of the constituent amino acids in the order in which the amino acids are first encountered so as to generate the First Sequence containing at least some of the constituent amino acids of the peptide or polypeptide.

2. The method according to claim 1 wherein the First Sequence is generated by reading from the carboxyl- (C-) direction of the peptide or polypeptide amino acid sequence thereby generating a C-terminal First Sequence (CFS).

3. The method according to claim 1 wherein the First Sequence is generated by reading from the amino- (N-) direction of the peptide or polypeptide amino acid sequence thereby generating a N-terminal First Sequence (NFS).

4. The method according to claim 1 wherein both a C-terminal First Sequence (CFS) is generated by reading from the carboxyl- (C-) direction of the peptide or polypeptide amino acid sequence and a N-terminal First Sequence (NFS) is generated by reading from the amino- (N-) direction of the peptide or polypeptide amino acid sequence.

5. The method according to any one of claims 1 to 4 wherein when an amino acid is first encountered and assigned to a First Sequence, further occurrences of the amino acid in the peptide or polypeptide amino acid sequence are ignored.

6. The method according to any one of claims 1 to 5 wherein the First Sequence is up to 20 amino acids in length.

7. The method according to any one of claims 1 to 5 wherein the First Sequence is at least 5 amino acids in length.

8. The method according to any one of claims 1 to 5 wherein the First Sequence is at least 8 amino acids in length.

9. The method according to any one of claims 1 to 5 wherein the First Sequence is 10 to 14 amino acids in length.

10. The method according to any one of claims 1 to 9 wherein one or more constituent amino acids are ignored when generating a First Sequence of a peptide or polypeptide.

11. The method according to any one of claims 1 to 10 wherein one or more constituent amino acids are included more than once to generate the First Sequence.

12. The method according to any one of claims 1 to 10 wherein the First Sequence is made up of two or more First subsequences.

13. A method of compressing amino acid sequence data of a peptide or polypeptide, the method comprising assigning to the peptide or amino acid a First Sequence according to the method of any one of claims 1 to 12 and storing the First Sequence data.

14. A method of compressing a database containing known or predicted amino acid sequences of peptides or polypeptides, the method comprising assigning a First Sequence for each peptide or polypeptide according to the method of any one of claims 1 to 12 and storing the First Sequences as a compressed database.

15. The method according to claim 14 wherein each of the peptides or polypeptides is represented as an unsigned integer number obtained by a hashed merging of two or more of the First Sequences corresponding to the peptide or polypeptide.

16. A method of searching a database containing amino acid sequences of known or predicted First Sequences of peptides or polypeptides for the presence or identity of a test peptide or polypeptide, the method comprising assigning to the test polypeptide a First Sequence according to the method of any one of claims 1 to 12 to form a test First Sequence, and comparing the test First Sequence with the corresponding First Sequences of the peptides or polypeptides in the database to detect or locate the presence of the test First Sequence.