WO2003029487A2 - DNA sequencer - Google Patents

DNA sequencer

Info

Publication number
WO2003029487A2
WO2003029487A2 · PCT/GB2002/004398
Authority
WO
WIPO (PCT)
Prior art keywords
model
base
traces
carrier
penalty function
Prior art date
Application number
PCT/GB2002/004398
Other languages
French (fr)
Other versions
WO2003029487A3 (en)
Inventor
Aled Wynne Jones
Peter John Kelly
Daniel Reginald Ewart Timson
Mike Raymond Reynolds
Nicholas Vasilopoulos
Original Assignee
Scientific Generics Limited
Priority date
Filing date
Publication date
Priority claimed from GB0123875A external-priority patent/GB0123875D0/en
Application filed by Scientific Generics Limited filed Critical Scientific Generics Limited
Priority to AU2002329425A priority Critical patent/AU2002329425A1/en
Publication of WO2003029487A2 publication Critical patent/WO2003029487A2/en
Publication of WO2003029487A3 publication Critical patent/WO2003029487A3/en

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 25/00 — ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • This invention relates to DNA sequencing.
  • DNA sequencing is of major importance in modern gene manipulation, and is an important tool for medical diagnosis as well as in providing methods for identifying the source of DNA in criminal and security applications .
  • One method of sequencing DNA that is currently used is the Sanger or dideoxy method.
  • the DNA under investigation is denatured with alkali so that single strands of the DNA are formed, and the strands are annealed with a primer, that is to say, with a short sequence of bases that is known to exist in the DNA.
  • the primer attaches in a known location on the DNA, and, if the bases are also present in the solution, a complementary strand of DNA will form on the parent strand. If a small percentage of the bases are modified so that they are missing a hydroxyl group in the 3' position in the sugar phosphate, they will not provide a suitable substrate for chain elongation, and so will terminate the chain growth at that point.
  • each base may be identified by spectroscopic methods at the point at which it reaches the end of the electrophoresis column, and so a trace may be formed of strength of any base against time. Together the four traces, one for each base, should correspond to the sequence of the bases in the complementary DNA strand.
  • As the complementary strands of DNA grow and are terminated, the complementary strands are used up, with the result that the peak intensity, and accordingly the signal-to-noise ratio, for the longer lengths of strand is reduced.
  • Alternatively, depending on the concentration of the terminator bases, the terminator bases may be used up, with the same result as above. Added to this, the peak widths will increase (from the diffusion spread) so that the peaks corresponding to the longer strands merge into one another. All these effects can combine to make it difficult to deduce the correct base order.
  • the resultant peaks may merge together to form a single broad peak.
  • Current analysis methods employ software called Phred (see "Base-calling of automated sequencer traces using Phred. I. Accuracy assessment", by B. Ewing, L. Hillier, M. Wendl and P. Green, Genome Research 8:175-185, 1998, and "Base-calling of automated sequencer traces using Phred. II. Error probabilities", by B. Ewing and P. Green, Genome Research 8:186-194, 1998).
  • base calling in the Sanger method of sequencing is performed by refining models of the DNA sequence, predicting their traces and comparing the predicted traces with the observed traces .
  • the base calling is an iterative process involving modifications to the models and selection of those models whose predicted traces fit the observed traces most closely.
  • the present invention provides a carrier which carries a computer program comprising processor implementable instructions for base calling DNA bases from a dataset corresponding to observed traces, the instructions comprising:
  • code for generating a database in the computer memory corresponding to a model of a DNA sequence; and code for refining the model of the DNA sequence by:
  • the invention provides apparatus for sequencing DNA which comprises :
  • (ii) means for performing electrophoresis on the contents of the vessel;
  • the present invention provides a method of sequencing DNA which comprises :
  • the invention may, in principle, be employed using terminator bases having any form of characteristic spectra, for instance I.R. or u.v. absorption spectra, or X-ray spectra.
  • Phosphorescent dyes could instead be employed provided that the decay time of the phosphors were sufficiently short to distinguish different bases in the traces .
  • the invention is not restricted to dye terminator chemistry, but may be employed with dye primer chemistry in which a dye is attached to the primer, and separate electrophoresis columns are employed for each base type.
  • the invention may be employed using radioisotope labelled bases or primers, for example using 32P or 35S radioisotopes.
  • the program and method according to the present invention have the advantage that base calling and the quality of the predicted models are much less prone to error due to differences in the experimental setup and conditions.
  • the method according to the present invention is more mathematically rigorous and less empirical.
  • the penalty function that is generated by the program is essentially a figure of merit expressing the degree to which the traces predicted by any particular model differ from the observed traces, and it should be minimised during the process.
  • the penalty function may take any of a number of forms, but one form that is useful includes a degree of misfit between the observed and the predicted traces, and a measure of divergence from a general model of DNA, that is to say, a measure of divergence from a model of DNA that does not take into account the particular sequence of the bases in the helix.
  • the spacing of the base peaks is generally regular along the trace, and so the measure of divergence from such a model would be the irregularity of the base peak spacing (referred to herein as the "spacing norm"). Also, it is possible to vary the relative weighting of the misfit and the irregularity of the bases in generating the penalty function.
  • simulated annealing constitutes one iteration in the process of refining in step (b) above.
  • This is a process by which changes that are at least partly random are made to a model, the effects of the changes are calculated, i.e. the form of the predicted traces, and a decision is made whether to accept the new model as the best model or revert to the previous model.
  • the model will gradually improve as observed by a reduction in the penalty function.
  • Models that may be used in the modelling process will normally include a number of continuous variables for example relating to peak strength, peak shape, correlation between peaks, peak position etc. which may be varied in a continuous manner, and discrete variables, for example base type and number of bases, which can only be changed in a discrete manner.
  • a number of changes may be made to discrete variables in the model as part of the simulated annealing process, namely: addition of a base to the model, removal of a base from the model, and replacement of one base for another in the model.
  • This process may be performed entirely randomly, but this may be expensive in computational time, and it may be better to weight the addition, removal and replacement in order to reduce the time taken to generate better models.
  • the changes and the type of base change may be chosen in accordance with the degree of misfit or irregularity of the base peak spacing in order to accelerate optimisation of the model by concentrating on those regions that contribute excessively to the penalty function.
  • the penalty function will be reduced, and this will normally involve the program accepting the modified model during a simulated annealing process if the penalty function of the modified model is less than that of the previous model.
  • If the program only ever accepts a modified or new model when its penalty function is less than that of the previous model, it is possible for the modelling process to become trapped in a local minimum, that is to say, in a model or group of models having the lowest penalty function that is obtainable by making a single change to the bases in an attempt to reduce the penalty function, but which may not be the lowest penalty function achievable.
  • the probability of acceptance may also decrease as the number of steps (b) that have been performed increases.
  • the simulated annealing process enables the program to make changes in discrete variables in the model, that is to say, changes to the type and number of the bases.
  • the program preferably performs a process on any model to minimise differences between the predicted traces and the observed traces with respect to continuous variables of the model before calculating the penalty function.
  • Such a process is preferably a least squares inversion process that is performed iteratively with at least five and especially at least ten steps.
  • This process may be performed to minimise the difference between the predicted and observed traces with respect to any one or more of the following parameters: the position of each base, the strength of each base signal, the background signal, the correlation between signals from different bases, the peak width and the peak skew in the traces.
  • the program generates an initial model of the DNA sequence that is refined by the simulated annealing process. It is not absolutely necessary in the broadest aspect of the invention for the initial model to depend on the observed sequence: an initial model could be generated at random, or from a pre-stored table, and then simply be refined, although this could potentially involve a significant extra computational effort, and any improvement in the similarity of the initial model with the actual DNA sequence will reduce the effort significantly.
  • the program therefore preferably generates an initial model that takes the observed traces into account.
  • a number of methods may be employed to generate the initial model, but generation of the model will normally be achieved by assigning a base to a peak in a trace that is higher than any of the other traces at that position, that is, by assigning bases to the most significant peaks of the traces.
  • the correlation of the peaks of different traces can then be set to zero, and/or the peak widths can be set to the peak separation in order to generate easily identifiable peaks.
  • a typical base sequence that is operated on will contain in the region of 1000 bases, and the number of samples in the observed traces will be typically about 15,000, i.e. about 15 samples per base called.
  • the number of samples per base may vary depending among other things on the diffusion properties of the DNA, and can be for example between 5 and 25 samples per peak.
  • Such a sequence length may well be too great for ease of manipulation, and so it is preferred for the program to divide the traces generated from the electrophoresis into a plurality of smaller traces on which it operates to determine the base sequence, and then, after the base sequences corresponding to the smaller traces have been determined, to join together the models of the base sequence corresponding to the smaller traces to generate a model for the entire sequence.
  • the traces may be divided into smaller traces corresponding to from 30 to 200 bases and preferably in the region of 50 to 100 bases.
  • the smaller traces will preferably overlap other smaller traces corresponding to adjacent portions of DNA sequence by a small number of bases, for example at least two and especially at least three bases. In such a case, when the models are joined together, too many bases will be present at the overlapping regions and excess bases will need to be removed.
  • the process may be terminated if the simulated annealing step (b) has been performed at least 100, and preferably at least 300, times since a modified model has been accepted as the model, and if the simulated annealing step (b) has been performed more times since a modified model was accepted than the number of iterations it took to get from the starting model to the most recently accepted model.
  • This criterion is most likely for relatively clean data sets, which only need a few alterations to the starting model.
  • the process may be terminated when the simulated annealing step has been performed at least 1000 times, and preferably at least 3000 times since the most recent time a modified model has been accepted as the model.
  • This criterion will normally be employed with more complicated data where the model may not have settled down completely but is likely to be fairly close to its optimum.
  • the program may be terminated simply when the simulated annealing step (b) has been performed a large number of times, for example at least 100,000 times or preferably at least 300,000 times. In this case, it is most likely that the data is highly noisy and the modelling process is confused, with the result that any models produced are of doubtful validity. The three termination criteria are sketched in code below.
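  • A minimal sketch of the three termination tests, using the minimum figures quoted above as thresholds; the function name and signature are our own, not the patent's.

```python
def should_terminate(total_iters, iters_since_accept, iters_to_last_accept):
    """Sketch of the three termination criteria described above.

    total_iters: simulated annealing steps performed so far.
    iters_since_accept: steps since a modified model was last accepted.
    iters_to_last_accept: steps it took to reach the most recently
        accepted model from the starting model.
    The thresholds are the minimum figures quoted in the text; in
    practice they would be tuned.
    """
    # Clean data: a quiet spell that both exceeds 100 steps and outlasts
    # the whole improvement phase.
    if iters_since_accept >= 100 and iters_since_accept > iters_to_last_accept:
        return True
    # More complicated data: no accepted change for a long time.
    if iters_since_accept >= 1000:
        return True
    # Highly noisy data: give up after a very large total number of steps.
    return total_iters >= 100_000
```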
  • the apparatus and method according to the invention not merely produces a model for the DNA sequence, but also generates a probability that the assignment of any base in the model is correct. This may be achieved by seeing how the model deteriorates on removal of that base.
  • the program may operate by:
  • step (1) removing a base from the model, or changing its type, relaxing the model, and calculating the resulting penalty function; and
  • step (2) comparing the penalty function obtained in step (1) with the penalty function of the model without removal or changing of any base therefrom.
  • the difference in the penalty functions may then be converted into a probability or some other figure of merit for that base.
  • This process of deriving the probability of the assignment of bases will normally be performed after the termination criteria have been satisfied, and preferably after any shorter sequences corresponding to the shorter traces have been joined together.
  • Any program embodying the invention may be in the form of source or object code or in any other form suitable for the implementation of processes according to the invention.
  • the carrier may be any entity or device capable of carrying the program.
  • the carrier may comprise a storage medium such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium for example a floppy disc or hard disc.
  • the carrier may be a transmissible carrier such as an electrical or optical signal which can be conveyed via electrical or optical cable or by radio or other means.
  • Figure 1 is schematic illustration of DNA replication in the Sanger method of DNA sequencing
  • Figure 2 is a drawing showing some of the products of the method shown in figure 1
  • Figure 3 is a schematic representation of the apparatus according to the present invention
  • Figure 4 is a drawing of the form of input data obtained during the process according to the invention
  • Figure 5 is a graphical representation of an observed trace obtained in the method of DNA sequencing according to the invention
  • Figure 6 is a graphical representation of part of the trace shown in figure 5 in greater detail;
  • Figures 7a and b are graphical representations showing correlation between different bases in the process;
  • Figures 8A and 8B are typical traces of part of a DNA sequence obtained during the method according to the invention;
  • Figure 9 is a schematic illustration showing changes to the skew of the trace peaks;
  • Figure 10 is a schematic illustration showing some of the variables employed in modelling the DNA sequence;
  • Figure 11 is a block diagram illustrating the data structure and process steps of the base calling program according to the invention.
  • Figure 12 is a diagram of the data forming the input dataset of a section, a model, and the output dataset of that section;
  • Figure 13 is a flow diagram showing the overall base calling method used in the process according to the invention;
  • Figure 14 is a block diagram illustrating the data structure and process steps of the sectioning step of the present invention;
  • Figure 15 is a flow diagram showing the process for finding the start of clean data;
  • Figure 16 is a block diagram illustrating the data structure and process steps in the simulated annealing process employed in the present invention;
  • Figure 17 is a flow diagram illustrating the simulated annealing process;
  • Figure 18 is a block diagram illustrating the data structure for the starting model generator;
  • Figure 19 is a flow diagram illustrating generation of the starting model;
  • Figure 20 is a block diagram illustrating the data structure for the least squares inversion process employed in the invention;
  • Figure 21 is a flow diagram illustrating the least squares inversion process employed in the invention;
  • Figure 22 is a block diagram illustrating the data structure and programs employed in the model modifier
  • Figure 23 is a flow diagram of the algorithm used for modifying the model by removal of a base
  • Figure 24 is a flow diagram of the algorithm used for modifying the model by the addition of a base
  • Figure 25 is a flow diagram of the algorithm used for modifying the model by changing a base in the model
  • Figure 26 is a block diagram illustrating the data structure and programs employed in the model acceptance step of the simulated annealing process;
  • Figure 27 is a flow diagram of the model acceptance step of the simulated annealing process;
  • Figure 28 is a block diagram illustrating the data structure employed in the base assignment probability calculation;
  • Figure 29 is a flow diagram illustrating the steps of the base assignment probability calculation;
  • Figure 30 is a schematic diagram showing joining up of smaller DNA sections to form a larger section;
  • Figure 31 is a flow diagram showing base removal during joining up of the DNA sequences;
  • Figure 32 is a diagram of part of an observed trace of clean data together with a DNA model obtained according to the invention, and a predicted trace
  • Figure 33 is a diagram of part of an observed trace of poorer data together with a DNA model obtained according to the invention, and a graph of base assignment probabilities
  • Figure 34 is a graph showing base calling errors made by the process according to the invention and those made by Phred by way of comparison.
  • figure 1 shows schematically the growth of a complementary DNA chain in the Dideoxy or chain terminator method of DNA sequencing.
  • this method multiple copies of the DNA to be sequenced are placed in solution along with an alkali to denature the DNA, that is, to cause the two DNA strands to unravel, a quantity of a primer and of each of the four bases, adenine (A), guanine (G), cytosine (C) and thymine (T).
  • a deoxynucleotide triphosphate is employed that incorporates the base, but the term "base" will be employed for simplicity since no confusion will occur.
  • the primer is a short sequence of bases that is known to exist in the DNA to be sequenced, and attaches to a known location in the DNA strand. A complementary strand of DNA will then form, by attachment of adenine to thymine and attachment of guanine to cytosine.
  • a small percentage of each base is a terminator base in which a hydroxyl group is missing from the 3' position of the nucleotide (shown in figure 1 by A* and C*), and which includes a fluorescent chromophore.
  • a starting strand as shown in figure 2 will thus generate a mixture of complementary DNA strands each strand varying in length from another strand by a single base, and each strand being terminated by one of the terminator bases. All possible lengths of DNA sequences are created because of the terminators that are present, although the shorter ones are more probable because a certain proportion of the strands are used up for each unit of growth of the strand.
  • FIG. 3 shows apparatus according to the invention for sequencing a sample of DNA.
  • the apparatus comprises a vessel 1 for receiving the DNA sample together with the primer and the bases, including the terminator bases.
  • the apparatus includes a column 2, for example in the form of a capillary or a column of gel, and means 4 for applying an electric field along the column in order to cause electrophoresis of the contents of the vessel. Because the sequences of DNA all have different lengths and masses, they will diffuse along the column 2 at slightly different rates, and will appear at the far end 6 of the column in the order of their length.
  • a laser 8 or other light source is provided at the far end 6 of the column 2, or at another appropriate detection point, and is shone into the material finishing the electrophoresis.
  • Because each of the terminator bases includes a fluorescent dye that fluoresces with a characteristic spectrum, the material finishing the diffusion will fluoresce under the action of the laser 8.
  • the apparatus includes a beam splitter 10 and four filters 12 that match the fluorescence spectrum of each of the four bases. Also, the apparatus includes A/D converters 14 that digitise the signals from the filters 12, and a computer 16 that can receive the data from the A/D converters .
  • the contents of the vessel 1 will be caused to diffuse along the column 2 under the action of the electric field, and will exit the column at its far end 6, whereupon it will be subjected to the laser beam and the fluorescence will be detected, filtered, digitised by A/D converters 14 and stored in the computer 16.
  • the rate of the A/D converter is matched to the speed of diffusion so that approximately fifteen data samples are recorded per base as the sequences diffuse along the column.
  • FIG. 4 shows a typical sample of the input data that is collected by the computer 16 from the electrophoresis. The data comprises a set of approximately 15,000 optical intensity values from each of the four filters corresponding to each of the bases A, G, C and T. At the start of the data collection, for example up to a certain sample number,
  • the first bases will not have reached the end of the column, with the result that the values for each base will be relatively low and constant, and provide the background level for each trace.
  • the values in the input data trace will rise dramatically as the bases are detected at the end of the electrophoresis column.
  • the maximum intensity values recorded, corresponding to the peak height of the traces, will initially be relatively high and will then slowly decrease over the course of the 15,000 samples taken, as shown in figure 5, which is a superposition of traces for each of the four bases. As a result, at the end of the 15,000 samples, the peak height is typically only about 20% of the peak height at the beginning, depending on the concentration of the terminator bases.
  • the background random noise is generally not very large, at least at the beginning of the trace, and the average signal-to-noise ratio across the trace is higher than 10. This random noise is due to thermal, shot and quantization noise from the A/D converter and laser scattering. The noise must be subtracted or taken into account in initial processing. Each detector has a constant baseline (this can be seen at the start of the trace) caused by a constant background voltage from the hardware; a sketch of one way to remove it follows.
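  • A minimal baseline-subtraction sketch (not the patent's own code); the (4, L) array layout and the length of the quiet precursor region are assumptions.

```python
import numpy as np

def subtract_baseline(traces, quiet_len=500):
    """Estimate each detector's constant baseline from the flat start of
    the trace (before the first bases arrive) and subtract it.

    traces: array of shape (4, L), one row per base type (A, G, C, T).
    quiet_len: assumed length of the flat precursor region, in samples.
    """
    baseline = np.median(traces[:, :quiet_len], axis=1, keepdims=True)
    # Clip so that intensities remain zero or positive after subtraction.
    return np.clip(traces - baseline, 0.0, None)
```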
  • the traces for the different bases can show a high degree of correlation, due to overlap in the absorption or emission spectra of the different terminator bases, which results in peaks being formed simultaneously in traces for different bases.
  • Figure 6 shows an example of this. For example, whenever there is a peak in trace G there is also a peak in trace T.
  • the T sensor is activated by the presence of base G, which is due to an overlap in the fluorescence spectrum from the G base with the frequency window for the T trace .
  • Figure 7(b) shows the correlation between bases A and C.
  • a peak in trace A always corresponds to a small peak in trace C whereas a peak in trace C does not necessarily correspond to a peak in trace A.
  • There appear to be three distinct patterns in the correlation between A and C. These three patterns are due to some combination of:
  • the model incorporates the following parameters:
  • the model contains a parameter for background noise for each of the four base types since often each trace will have a different offset level. Background terms are restricted to zero or positive.
  • the values of the correlation terms are allowed to vary in the range 0 to 1.
  • the diagonal terms of the correlation matrix (i.e. C_AA, C_GG, C_CC and C_TT) are by definition unity, leaving 12 variable correlation terms.
  • the expression "correlation term" used herein refers only to the 12 variable parameters, and excludes the four fixed diagonal terms.
  • the correlation matrix is not necessarily symmetric (i.e. C_GA is not necessarily the same as C_AG)
  • the correlation between traces is not significant.
  • the non-diagonal terms of the correlation matrix can be set to zero to reduce computational load and increase model stability.
  • Figure 8 shows two sets of traces .
  • In the top plot the peaks are symmetric, whilst in the bottom plot each peak is skewed to the left.
  • In the top plot the peaks are typically of width 10 samples, whilst those of the bottom plot are about twice as broad.
  • σ defines the width of the peak and p the skew.
  • the integral is a function of σ and p.
  • the function is chosen because of its ability to model varying skews in peak shape. However there may well be other functions that are suitable. In particular no attempt has been made to choose a function that matches the underlying physics of diffusion.
  • the peaks of the traces are of varying height, and so a strength S_n is associated with each peak. This acts as a multiplier to scale the peak shape function. Strengths are restricted to zero or positive. One possible peak-shape family is sketched below.
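  • The exact form of G(x, σ, p) is not reproduced in this text; the gamma-style pulse below is purely illustrative, with σ setting the width, p the skew, and the strength S_n scaling the shape.

```python
import numpy as np

def peak_shape(x, sigma, p):
    """Illustrative skewed peak G(x, sigma, p); a stand-in, not the
    patent's actual function.  Larger p gives a more symmetric peak.
    Note the maximum falls at x = p * sigma rather than x = 0, which
    illustrates the 'peak offset' mentioned later in the text."""
    xpos = np.clip(np.asarray(x, dtype=float), 0.0, None)
    return (xpos / sigma) ** p * np.exp(-xpos / sigma)

def predicted_peak(x, position, strength, sigma, p):
    """One base's contribution: strength S_n times the unit peak shape."""
    return strength * peak_shape(x - position, sigma, p)
```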
  • the base type and number of bases are the only discrete parameters .
  • Figure 11 is a block diagram showing the data structure and programs employed for the base calling procedure of the present invention which operates on the input data shown in figure 4, that is to say, the intensity values for each sample of the four traces.
  • the input and output data sets are shown in Figure 12 together with datasets forming the model.
  • the model consists of a set of parameters that can be split into two groups. One group consists of parameters that apply to a single peak, namely peak strength, peak position, and base type. The other group consists of those parameters, called global parameters, that apply over the whole length of the section, namely background terms, correlation terms and the peak shape terms σ and p.
  • Figure 13 is a flow diagram showing the overall base calling process.
  • the traces in the input data store will typically be about 15,000 samples long, corresponding to about 1000 bases.
  • the input data is filtered to remove any high-frequency signals, and the start of clean data (i.e. when the first bases have appeared from the electrophoresis) is recorded as j_start.
  • the dataset is too large to be handled in its entirety, and so is split into 15 sections, each of which corresponds to about 1000 samples, or about 70 bases. Each section is operated on by the computer to determine the base sequence and 15 section output datasets are generated, each corresponding to 70 bases. The sequences are then joined together to provide a sequence of the entire DNA sample of 1000 bases, which is stored in the output data store; the sectioning is sketched below.
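  • A sketch of the splitting, using the section length and overlap quoted in the text (both would be tunable in practice):

```python
def split_into_sections(n_samples, section_len=1000, overlap=50):
    """Yield (start, stop) sample indices for overlapping sections.

    With 15,000 samples this gives roughly 15 sections of about 1000
    samples each, adjacent sections sharing `overlap` samples.
    """
    start = 0
    while start < n_samples:
        stop = min(start + section_len, n_samples)
        yield (start, stop)
        if stop == n_samples:
            break
        start = stop - overlap
```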
  • Figure 12 is a schematic indication showing the structure of the data operated on and generated by the base calling procedure.
  • the input dataset 20 of one section of the filtered input data is largely in the form shown in Figure 4 having intensity values for each of the four bases at each of 1000 samples.
  • the program generates a model 22 for each section comprising typically 70 predicted bases and, for each of the bases an associated position and intensity, together with the global parameters.
  • After the base sequence has been deduced probability scores are calculated for each base type at each base position called to generate an output dataset 24 for that section.
  • one base type (A, G, C or T) is called at each position, but this is simply the base type having the greatest probability.
  • the output could, if desired, consist only of the base type probabilities.
  • the sum of these four probabilities may be less than unity.
  • the discrepancy corresponds to the probability of there being no base at that position.
  • the models for each section are then reunited to generate an output of the program for the entire set of 1000 bases corresponding to the input dataset of 15,000 samples .
  • each section may have its own peak width and skew.
  • Figure 14 is a block diagram for data and program structure for the process of sectioning the data, and a flow diagram of the Clean Data Start Determinator is shown in figure 15.
  • Once the data has been read into the memory, it is filtered to remove high-frequency noise which could confuse the algorithm calculating the initial model, as well as degrading the other algorithms.
  • the particular type of low pass filter is not critical to the invention, but given that the data is already in digital form, a digital filter such as an FIR filter is appropriate; one simple choice is sketched below.
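  • For example, a short moving-average FIR filter would serve (the tap count here is an arbitrary choice):

```python
import numpy as np

def lowpass_fir(trace, taps=9):
    """Simple moving-average FIR low-pass filter.

    The text notes that the particular low-pass filter is not critical;
    any FIR design with a suitable cut-off would do equally well.
    """
    kernel = np.ones(taps) / taps
    return np.convolve(trace, kernel, mode="same")
```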
  • the "typical" base peak spacing Z 0 is determined. This is an important parameter and is used to calculate the spacing norm during the simulated annealing process. This parameter is derived from the data and should remain approximately constant along each section of the trace gradually increasing between sections.
  • the root-mean-square (rms) data value at each sample j is calculated as r_j = sqrt((1/4) Σ_m d_o(m,j)^2), where:
  • d_o is the observed data,
  • m represents the base type of the trace (A, G, C or T) and L is the trace length.
  • the running average is calculated over a window: a_j = (1/w) Σ_{j' = j ... j+w-1} r_j', where
  • w is the window length, typically 100 samples
  • If j_start is less than the window length then the precursor is of negligible duration and can be ignored by setting j_start to zero. If there is no value of j_start satisfying step 4, or if j_start is greater than L/2, then again it is assumed that there is no easily distinguishable precursor and j_start is set to zero.
  • The reason for requiring in step 4 that three successive blocks of data are of large amplitude is that often at the start of the trace there is noise of high amplitude but short duration (compared to the window length). The whole search is sketched below.
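  • A sketch of the search for j_start. The threshold rule (twice the global median of the smoothed rms) is an assumption, since the criterion of step 4 is only partly reproduced above; the structure (rms, running average, three successive large blocks, and the fall-back tests on j_start) follows the text.

```python
import numpy as np

def find_clean_data_start(traces, w=100, factor=2.0):
    """Locate j_start, the start of clean data.

    traces: array (4, L).  w is the running-average window length
    (typically 100 samples); `factor` sets the assumed amplitude
    threshold relative to the median of the smoothed rms.
    """
    L = traces.shape[1]
    # rms across the four traces at each sample
    rms = np.sqrt(np.mean(traces.astype(float) ** 2, axis=0))
    # running average over a window of w samples
    smooth = np.convolve(rms, np.ones(w) / w, mode="same")
    thresh = factor * np.median(smooth)
    j_start = 0
    for j in range(0, L - 3 * w, w):
        # require three successive large blocks, rejecting short bursts
        # of high-amplitude noise near the start of the trace
        if all(smooth[j + k * w: j + (k + 1) * w].mean() > thresh
               for k in range(3)):
            j_start = j
            break
    # negligible precursor, or no distinguishable precursor at all
    if j_start < w or j_start > L // 2:
        j_start = 0
    return j_start
```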
  • the data is split into sections for further processing beginning with the section that starts at j start .
  • the section immediately following this is done next, followed by the subsequent sections until the end of the trace is reached, whereupon the program returns to the start of the dataset and processes the precursor (i.e. that part of the dataset before j_start).
  • the reason for this is that even if the precursor contains no useful data, there is no harm in processing it other than computational effort.
  • This part of the data is processed last because the peak spacing norm Z_0 should be determined by the cleanest data, i.e. the data immediately following j_start. At each section the value of Z_0 used is an average calculated from that section and all the preceding sections.
  • Figure 16 is a block diagram of the data and program structure of the simulated annealing process
  • figure 17 is a flow diagram of the process.
  • the simulated annealing is a process by which random changes can be made to a model, the effects of these changes calculated, and a determination made as to whether to keep the new model or revert to the previous one. By repeating the process many times the model will gradually improve, as defined by a reduction of the penalty function for that model.
  • The penalty function is thus defined as E = (E_d + λ_s E_s) / (D - N_p), where:
  • D is the number of data samples (four times the length of each trace); N_p is the number of free parameters in the model and will vary as bases are added and removed; λ_s is a constant which reflects the relative weight given to E_s and E_d (large λ_s will result in models with very evenly-spaced bases, while small λ_s will disregard base spacing in favour of achieving a good fit to the data); and E_d is the misfit and E_s is the spacing norm, both of which are defined below.
  • the unweighted misfit is defined as E_d = Σ_i (d_o,i - d_p,i)^2, the sum over all data samples of the squared difference between the observed data d_o and the predicted data d_p.
  • this summation can be weighted by the
  • Z_0 is the 'typical' base spacing. It is important to get an accurate estimate of this by ignoring extreme spacings.
  • the base spacing Z_0 is calculated by the following steps:
  • Z is the value of Z calculated from the bases of
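  • In code, the penalty function and its two ingredients might look as follows; the quadratic form of the spacing norm E_s is an assumption, since the text defines its ingredients but not the formula itself.

```python
import numpy as np

def penalty(observed, predicted, positions, z0, lam_s, n_params):
    """Penalty function E = (E_d + lam_s * E_s) / (D - N_p).

    observed, predicted: arrays (4, L); D is then four times the trace
    length.  E_d is the unweighted sum-of-squares misfit; E_s penalises
    deviations of the base spacings from the typical spacing z0 (the
    quadratic form used here is an assumed choice).
    """
    D = observed.size
    e_d = np.sum((observed - predicted) ** 2)               # misfit E_d
    spacings = np.diff(np.sort(np.asarray(positions, float)))
    e_s = np.sum((spacings - z0) ** 2)                      # spacing norm E_s
    return (e_d + lam_s * e_s) / (D - n_params)
```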
  • the simulated annealing process is iterative with the steps as shown below:
  • the model is allowed to relax by a least-squares inversion process. This varies the continuous parameters to find the parameter set that minimises the penalty function, but cannot change any of the discrete parameters such as base type, and will only find the local minimum in the vicinity of the starting model;
  • a random change is made to the model, such as adding a random new base, removing a base or changing the type of base;
  • the new penalty function is compared with the best penalty function of a model on record. If the new penalty function is smaller, the new model is immediately accepted as the best model;
  • Figure 18 is a block diagram illustrating the data structure for the starting model generator, and figure 19 is a flow diagram of the initial model generation.
  • the initial model, in common with all models, has three values (base type, base position and base strength) for each of approximately 70 bases, plus 18 global parameters (σ, p, four background terms B_m and twelve correlation terms), making about 228 values for a typical section.
  • the initial model is chosen as follows
  • p is set to 3, which corresponds to approximately symmetric peak shapes .
  • σ is set to the average peak separation, to give reasonably distinct peaks.
  • the base positions P_n are set to match the peak positions found in step 1, taking into account the peak offset (i.e. the maximum of the peak shape function G(x, σ, p) does not occur at zero x).
  • the base type for each base, T_n, is set to match the maximum trace, m, as found in step 1.
  • G_max is the maximum value of G(x, σ, p). A sketch of this starting-model construction follows.
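  • A minimal sketch of steps 1 and 4 of the construction; the significance threshold and the data layout are assumptions.

```python
import numpy as np

BASES = "AGCT"

def initial_model(traces, min_strength):
    """Assign a base to each significant local maximum of whichever trace
    is highest there, giving the starting model's per-base parameters.

    traces: array (4, L).  Positions would later be corrected for the
    peak offset, since the maximum of G(x, sigma, p) is not at x = 0.
    """
    envelope = traces.max(axis=0)   # height of the highest trace
    which = traces.argmax(axis=0)   # which base type that trace is
    bases = []
    for j in range(1, traces.shape[1] - 1):
        is_peak = envelope[j] >= envelope[j - 1] and envelope[j] > envelope[j + 1]
        if is_peak and envelope[j] > min_strength:
            bases.append({"type": BASES[which[j]],
                          "position": j,
                          "strength": envelope[j]})
    return bases
```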
  • Figure 20 shows the data structure for this process and Figure 21 is a flow diagram of the process.
  • the predicted data is nonlinear with respect to the model parameters, and so it is necessary to linearise about the starting model of each iteration and to iterate the solution (effectively the same as the Newton-Raphson method but extended to N_P dimensions, where N_P is the number of continuous parameters in the model).
  • i is the data index, incorporating both the position along the trace and the trace type (i.e. A, G, C or T),
  • δd is the difference between the observed data and that predicted by the starting model,
  • δm is the change in the model at this iteration, so that the model after the iteration (which will then become the starting model for the next iteration) is m_0 + δm.
  • N is the number of bases
  • δm = (A^T A + λ_s F')^-1 (A^T δd - λ_s (F m_0 + G'))
  • F' and G' are the F and G matrices above but with zeroes inserted for the terms corresponding to the non-positional parameters of the model.
  • Once the A^T A and A^T δd matrices have been calculated, it may be necessary to add a damping term to them in order to make the inverted matrix stable. For example, if the starting model derived from the data contains no bases of one type, then the correlation values that relate to that base will be unconstrained by the data.
  • the damping acts to suppress the values of unconstrained parameters. This damping is very light and should not much affect the solution. It is achieved by adding a third term to E_inv; one damped update step is sketched below.
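  • A minimal sketch of the damped update. The size of the damping term is an assumed value, F and F' are not distinguished here, and G' is treated as a vector so that the expression is well formed.

```python
import numpy as np

def gauss_newton_step(A, dd, Fp, Gp, m0, lam_s, damping=1e-6):
    """One damped Gauss-Newton step:

        dm = (A^T A + lam_s F' + damping*I)^-1 (A^T dd - lam_s (F' m0 + G'))

    A: Jacobian of the predicted data with respect to the continuous
       model parameters (one row per data sample).
    dd: residual between observed and predicted data (delta d).
    Fp, Gp: the zero-padded F' matrix and G' vector described above.
    The extra `damping` multiple of the identity is the light damping
    that suppresses unconstrained parameters.
    """
    n = A.shape[1]
    lhs = A.T @ A + lam_s * Fp + damping * np.eye(n)
    rhs = A.T @ dd - lam_s * (Fp @ m0 + Gp)
    return np.linalg.solve(lhs, rhs)
```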
  • the penalty function of the model, or at least the numerator of the penalty function, is then calculated and compared with that of the best inverted model. If it is not less than that of the best inverted model, so that the best inverted model corresponds to the minimum so far, the best inverted model is returned and the inversion process is terminated; otherwise the new model is saved as the best inverted model and the iteration is repeated.
  • a base type is changed e.g. a base of type G is changed to type A (20% probability)
  • Figure 23 is a flow diagram showing the removal of a base from the model. Where a base is removed, one base is removed from the model at a time. That base is chosen at random based on either the base spacing (50% probability) or the base strength (50% probability).
  • the spacing Z_n of the base n is calculated.
  • Z_n is given by the spacing between that base and the adjacent base.
  • Z_0 is subtracted from this to give a spacing anomaly ΔZ_n.
  • positive values of ΔZ_n are set to zero to ensure that only bases that are too close to other bases are removed.
  • a base is chosen at random with the probability of base n being chosen being proportional to |ΔZ_n| or ΔZ_n².
  • the program may calculate the cumulative value of ΔZ_n, ΣΔZ_n, and generate a random number between 0 and ΣΔZ_n.
  • the value of n corresponding to the random number is then selected, so that the chance of choosing base n is proportional to its ΔZ_n. If removal of a base is chosen due to the base strength, the program looks for bases of low strength, which therefore make little contribution to the predicted data value and hence do little to reduce the misfit.
  • the probability of removing a particular base is inversely proportional to the strength of that base (with the proviso that if the strength is below a threshold, the probability is set to the inverse of that threshold to avoid inverting very small numbers).
  • One base is chosen at random and removed from the model, as sketched below.
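  • A sketch of the removal choice. Taking Z_n as the distance to the nearer neighbour is an assumption, since the text says only "the adjacent base"; the positions are assumed to be sorted.

```python
import numpy as np

rng = np.random.default_rng()

def choose_base_to_remove(positions, strengths, z0, strength_floor=1e-6):
    """Pick one base to remove: 50% by spacing anomaly, 50% by weakness."""
    positions = np.asarray(positions, dtype=float)
    strengths = np.asarray(strengths, dtype=float)
    if rng.random() < 0.5:
        # spacing branch: bases packed more tightly than z0 are favoured
        left = np.diff(positions, prepend=positions[0] - z0)
        right = np.diff(positions, append=positions[-1] + z0)
        dz = np.minimum(left, right) - z0       # spacing anomaly
        weights = np.where(dz < 0.0, -dz, 0.0)  # zero out positive anomalies
    else:
        # strength branch: weak bases contribute little to the fit, with
        # a floor to avoid inverting very small numbers
        weights = 1.0 / np.maximum(strengths, strength_floor)
    if weights.sum() == 0.0:
        weights = np.ones_like(weights)         # fall back to uniform choice
    return rng.choice(len(positions), p=weights / weights.sum())
```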
  • Where a base is added, a single base is added to the model in each random change.
  • the position and type of the added base are chosen at random. The position is chosen either first based on the misfit (80% probability) or the base spacing (20% probability).
  • Figure 24 is a flow diagram showing the process of adding a base to the model.
  • the program initially determines whether a base is added to reduce the misfit or to improve the base spacing. If a base is added in order to reduce the misfit, the program aims to introduce a new base at positions where the discrepancy between the observed data and that predicted by the model is greatest. For each position j, the misfit summed over all four traces is calculated: misfit_j = Σ_m (δd_m,j)^2.
  • a position is then chosen at random, with the probability of choosing position j being proportional to misfit_j.
  • the program biasses the insertion of new bases towards regions where the bases are too sparse.
  • the type of the new base is chosen at random with the probability of type m being proportional to δd_m².
  • the position is corrected for the peak offset.
  • the strength of the new base is set to the average strength of all the other bases.
  • the new base is inserted into the model such that the bases are ordered according to their position. This is a prerequisite for calculating the spacing norm. The whole addition step is sketched below.
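  • A sketch of the misfit-driven branch of the addition; the model representation follows the earlier sketches and is our own.

```python
import numpy as np

rng = np.random.default_rng()

BASES = "AGCT"

def add_base(model, observed, predicted, avg_strength, peak_offset):
    """Add one base where the data is worst fit (the 80% misfit branch).

    observed, predicted: arrays (4, L); model: position-ordered list of
    {"type", "position", "strength"} dicts, as in the earlier sketches.
    """
    resid = observed - predicted
    misfit = np.sum(resid ** 2, axis=0)      # misfit_j summed over traces
    j = rng.choice(misfit.size, p=misfit / misfit.sum())
    # base type m chosen with probability proportional to (delta d_m)^2
    w = resid[:, j] ** 2 + 1e-12             # epsilon guards an all-zero column
    m = rng.choice(4, p=w / w.sum())
    model.append({"type": BASES[m],
                  "position": j - peak_offset,  # correct for the peak offset
                  "strength": avg_strength})    # average of existing strengths
    model.sort(key=lambda b: b["position"])     # keep bases ordered by position
    return model
```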
  • the program looks for regions of large discrepancy between observed data and the model, and changes the base type of a base in that region.
  • the number of bases in the model is unchanged, as are all base positions and strengths.
  • Figure 25 is a flow diagram showing the process of changing a base in the model. The process has the following steps :
  • a position is chosen at random, with the probability of choosing position j being proportional to misfit_j.
  • the base that (once corrected for the peak offset) is closest to this new position is found.
  • the type of the new base to replace the existing base is chosen with the probability of type m being proportional to δd_m².
  • the steps from step 2 onward are repeated.
  • a few parameters may be optimised relating to the bases in the vicinity of a base that has been changed. For example where a base has been changed, whether by addition, removal or replacement, the parameters of the base in question (if present) and those of a number of bases e.g. four, on either side of the relevant position, may be optimised. This scheme is the one used most often.
  • the global parameters may be optimised. For example, if a new model is accepted as the best model, the global parameters only may be optimised to take into account the change in the bases.
  • Scheme 3 is slightly more complicated than the first two, since a change in the global parameters may well require compensating changes in the parameters for each of the bases. For example, if it produces an increase in one of the background parameters, the strength of all bases of that base type will have to be decreased to compensate.
  • If the peak width σ increases, the peak offset will also increase and the nominal positions of all the bases will have to decrease in order that the maximum of the peak still matches the data. If the skew p changes, both the strength and the position of each base will need to change.
  • any base in the new model that is less than a fraction (e.g. 5%) of the average base strength may optionally be removed.
  • The contribution from that base is calculated: at the position of maximum contribution, the contribution from that base should be greater than a given fraction of the sum of the contributions from all the other bases, or else the base is removed. However, in practice such schemes are found to be unnecessary.
  • Figure 26 is a block diagram of the data structure and program for deciding whether or not to accept the new model
  • figure 27 is a flow diagram of the process.
  • the penalty function of the new model that has been generated by modifying the current model is calculated, after removal of small peaks and relaxing the model, and is compared with that of the current model to give:
  • ΔE = (new penalty function) - (current penalty function).
  • If ΔE ≤ 0, the new model is immediately accepted. If ΔE > 0, the new model is accepted or rejected at random with a probability of acceptance equal to exp(-ΔE/T), where T is a parameter referred to as the annealing temperature.
  • The term "temperature" is used by analogy with the Boltzmann factor exp(-ΔE/kT), where k is the Boltzmann constant and T is the absolute temperature. This factor defines the probability of a system jumping out of an energy well of depth ΔE.
  • the new model is accepted, its global parameters are optimised and it becomes the current model, and therefore the starting point from which further random changes are made.
  • The choice of annealing temperature is not obvious and is somewhat empirical. If it is set too low, the model is given very little latitude to experiment and so may find it hard to break out of local minima. On the other hand, if it is set too high, the model is given too much freedom and may wander very far from regions of low penalty function. The acceptance rule is sketched below.
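  • A minimal sketch of this Metropolis-style acceptance rule; the function name is our own.

```python
import math
import random

def accept_model(delta_e, temperature):
    """Metropolis acceptance: always keep an improvement, and keep a
    worse model with probability exp(-delta_e / T)."""
    if delta_e <= 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)
```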
  • the simulated annealing is stopped when any of the following criteria are satisfied: (i) 400 000 iterations have been performed; (ii) a large number of iterations have been performed since the most recent time a modified model was accepted; (iii) at least a minimum number of iterations have been performed since a modified model was accepted, and more iterations have elapsed since that acceptance than it took to get from the starting model to the most recently accepted model.
  • the last of these is the most likely for relatively clean data sets, which only need a few alterations to the starting model.
  • the second occurs for more complicated data: the model may not have settled down completely but is likely to be fairly close to its optimum.
  • the first criterion is used to terminate the process where there is little prospect of a sensible result. To have got this far, it is most likely that the data is highly noisy, the modelling process is thoroughly confused, and any models produced are of doubtful validity.
  • the penalty function, E, of this model is calculated.
  • the first base in the model is chosen and altered in four different ways: the base is successively replaced by a base of the three other types; and the base is removed completely.
  • a similar problem occurs when there are a series of bases of the same type and which do not form clear, distinguishable peaks in the traces. For example if there are a whole series of merged G peaks, the algorithm may decide that the most likely sequence is four bases of type G. However the penalty function for only three bases may be very similar. Under the process outlined above, each of the four bases might be assigned a relatively low probability, since under relaxation, when any of the four is removed, the remaining three will shuffle up to cover the gap without increasing the penalty function significantly. Therefore the probabilities thus calculated may be misleading.
  • Clusters, consisting of two or more consecutive bases of the same type, are identified and the least probable base of the cluster is removed (or replaced, according to the values of P_altered,i).
  • the model is relaxed and the probabilities of the remaining bases in the cluster recalculated. This is repeated until only one base remains unaltered in the cluster. Therefore, within a cluster, the probability assigned to a base is its probability once all less probable bases have been removed (or changed). One way of turning the penalty values into probabilities is sketched below.
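  • The conversion from penalty differences to probabilities is not reproduced in this text; a Boltzmann-style weighting with an assumed scale is one possibility.

```python
import numpy as np

def base_probabilities(penalties, scale=1.0):
    """Convert the penalty values for one base position (unaltered model,
    three substitutions, removal) into probabilities.

    A Boltzmann-style weighting exp(-E/scale) is an assumed choice: lower
    penalty means higher probability.  Returns values summing to one, in
    the same order as `penalties`; the removal entry gives the
    probability of there being no base at that position.
    """
    e = np.asarray(penalties, dtype=float)
    w = np.exp(-(e - e.min()) / scale)   # shift by the minimum for stability
    return w / w.sum()
```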
  • FIG. 30 shows schematically how two overlapping sections are used to create a model spanning the overlap.
  • the vertical lines are the bases.
  • every base is assumed to be of equal strength (as indicated by the height of the lines), and the non-vertical (either horizontal or sloping) dotted line shows how each model is scaled prior to forming the composite model.
  • Each section (including the composite) is 1000 samples long and the overlap is 50 samples.
  • the overlap contains bases from both models, scaled in strength.
  • the two original models are in some cases consistent: for example both call Ts at the same places 20 and 22 in the sequence, although the exact positions vary slightly.
  • the right-hand pair of Ts in the overlap section of the composite model are very close and would probably be deemed an obvious pair and merged.
  • the right hand section contains an extra base (type C) at position 24 not seen in the other section and has called a base as G at position 26 where the left hand section called A.
  • the probabilities of the bases left in the overlap region are calculated as described above. This may require recalculating the probabilities of some of the bases outside the overlap if they form clusters with bases within the overlap region.
  • Base sequences were deduced from the modelling process and were compared to the true base sequence for a number of DNA datasets and compared to the base sequences called by the industry standard Phred. All the traces within a particular dataset were for the same DNA fragment for which the "true" sequence had been deduced from a large number of individual traces from other experiments.
  • Figures 32A to 32C show the results of modelling for a very clean trace.
  • Figure 32A is that of the original data
  • Figure 32B plot is that of the model with the height of each base proportional to its strength
  • Figure 32C is a trace predicted by the model. For this set of relatively clean data, the model matched the true answer exactly, as did Phred.
  • Figures 33A to C are the results of modelling a somewhat poorer dataset with non-symmetric peak shapes and merged peaks.
  • the plots are of the original data (Figure 33A) , the model ( Figure 33B) and the base probabilities (Figure 33C).
  • In Figure 33C, for each base, the probability of each base type is indicated by the length of each line segment at that position. The probability of there being no base at all is plotted as a white line (and so appears as a gap) at the bottom of the vertical line.
  • the base at position 752 is indisputably a T, and so the thick line occupies the whole region from 0 to 1 in Figure 33C.
  • the base at position 843 is most likely C so the thick dashed line is much longer than any of the other line styles at that position.
  • the base at 832 has almost equal probability of being any of the base types although T is slightly more likely than the others. There is also a substantial probability of there being no base at all, as indicated by the gap from 0 to 0.25.
  • the model matches the true answer exactly, with the exception that an extra base of type T has been inserted at position 832. However, as can be seen from the bottom plot, a very low probability has been assigned to this base. Phred made seven errors on the same dataset.
  • Figure 34 is a graph comparing base calling errors made by the process according to the invention with those made by Phred.
  • the number of errors was calculated by an algorithm that sought to align the deduced and true sequence by adding or deleting bases to maximise the match.
  • Each base call is either right or wrong: no account is taken of the base probability.
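  • One standard way to implement such an alignment is the edit distance: the minimum number of substitutions, insertions and deletions needed to turn the called sequence into the true one. The sketch below is our own; the exact algorithm used is not specified here.

```python
def alignment_errors(called, true):
    """Edit distance between the called and true base sequences."""
    m, n = len(called), len(true)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining called bases
    for j in range(n + 1):
        d[0][j] = j                      # insert all remaining true bases
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if called[i - 1] == true[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n]

# Example: alignment_errors("AGCT", "AGT") == 1 (one deletion).
```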
  • the error rate is plotted as a function of position, where position is measured in terms of bases from the start of the DNA fragment, as shown in Figure 34. It can be seen that, as compared with Phred, the region of low error rate has been extended at both ends of the dataset: by between 10 and 20 bases at the start (from base 40 to base 20 or 30, depending on the exact threshold chosen) and by about 50 bases at the end (in the region of base number 900).
  • the modelling process contains many parameters (e.g. annealing temperature, relative weighting of the spacing norm and misfit in the penalty function). Performance may improve if these are fine-tuned.
  • the first base peak is large, but subsequent peaks can be small; compression can also occur, especially in GC-rich regions.
  • a G peak can be very large and may have a shoulder on either side.
  • Homopolymer regions can cause enzyme slippage, causing the appearance of multiple overlapping sequences in the region beyond. Homopolymer regions can also cause 'wave' patterns in the peaks following them.
  • the first C is weaker, especially after G.
  • the second C in a string of 2 or more Cs can have an elevated signal.
  • C following a string of Ts can be enhanced.
  • the third can be weak, and the following Gs can be strong.
  • G following string of Ts can be very strong, and may obscure following peaks.
  • the first is strong but the others may be weak, depending on what precedes the string.
  • parameters can be introduced into the model which modify how the predicted values are calculated. The values of these parameters will determine just how much the predicted values for the later bases are suppressed.

Abstract

A carrier carries a computer program for base calling DNA bases from a dataset corresponding to observed traces obtained from electrophoresis. The program generates a database in the computer memory corresponding to a model of a DNA sequence and refines the model of the DNA sequence by the following operations: (a) making a change to the base sequence of the model; (b) predicting the form of the traces from the modified model; (c) comparing the predicted traces with the observed traces in the dataset to generate a penalty function; (d) determining whether or not to accept the modified model based on the value of the penalty function; and (e) repeating operations (a) to (d) until termination criteria have been achieved. The program has the advantage that it is less empirical and dependent on experimental setup than existing programs such as Phred.

Description

DNA SEQUENCER
This invention relates to DNA sequencing.
DNA sequencing is of major importance in modern gene manipulation, and is an important tool for medical diagnosis as well as in providing methods for identifying the source of DNA in criminal and security applications .
One method of sequencing DNA that is currently used is the Sanger or dideoxy method. In one form of this method the DNA under investigation is denatured with alkali so that single strands of the DNA are formed, and the strands are annealed with a primer, that is to say, with a short sequence of bases that is known to exist in the DNA. The primer attaches in a known location on the DNA, and, if the bases are also present in the solution, a complementary strand of DNA will form on the parent strand. If a small percentage of the bases are modified so that they are missing a hydroxyl group in the 3' position in the sugar phosphate, they will not provide a suitable substrate for chain elongation, and so will terminate the chain growth at that point. The result is that a distribution of complementary copies of the original DNA sequence is generated having all possible lengths of the DNA sequence starting from the primer location. The mixture may then be subjected to electrophoresis. Because the different strand lengths will have different mobilities, they will appear at the end of the electrophoresis column in the order of their length. If, in addition, the four different terminator bases have characteristic identifying spectra, for example due to the incorporation of fluorescent dyes therein, each base may be identified by spectroscopic methods at the point at which it reaches the end of the electrophoresis column, and so a trace may be formed of the strength of each base against time. Together the four traces, one for each base, should correspond to the sequence of the bases in the complementary DNA strand.
However, there are a number of difficulties that may be encountered in this method of sequencing DNA.
(1) As the complementary strands of DNA grow and are terminated, the complementary strands are used up, with the result that the peak intensity, and accordingly the signal-to-noise ratio, for the longer lengths of strand is reduced. Alternatively, depending on the concentration of the terminator bases, the terminator bases may be used up, with the same result as above. Added to this, the peak widths will increase (from the diffusion spread) so that the peaks corresponding to the longer strands merge into one another. All these effects can combine to make it difficult to deduce the correct base order.
(2) There can be a degree of overlap in the spectra with which different bases fluoresce, with the result that a peak in a trace corresponding to one base may be accompanied by a peak in another trace. This correlation between peaks in different traces may make it difficult to determine which base is responsible for the peak.
(3) If two or more bases of the same type are next to each other in the genome, the resultant peaks may merge together to form a single broad peak.
Current analysis methods employ software called Phred (see "Base-calling of automated sequencer traces using Phred. I. Accuracy assessment", by B. Ewing, L. Hillier, M. Wendl and P. Green, Genome Research 8:175-185, 1998, and "Base-calling of automated sequencer traces using Phred. II. Error probabilities", by B. Ewing and P. Green, Genome Research 8:186-194, 1998). However, this tends to be somewhat empirical and dependent on the experimental setup, and it is quite possible for incorrect base assignments to be made when sequencing.
According to the present invention, base calling in the Sanger method of sequencing is performed by refining models of the DNA sequence, predicting their traces and comparing the predicted traces with the observed traces . The base calling is an iterative process involving modifications to the models and selection of those models whose predicted traces fit the observed traces most closely.
Thus, according to one aspect, the present invention provides a carrier which carries a computer program comprising processor implementable instructions for base calling DNA bases from a dataset corresponding to observed traces, the instructions comprising:
(i) code for generating a database in the computer memory corresponding to a model of a DNA sequence; and (ii) code for refining the model of the DNA sequence by:
(a) making a change to the base sequence of the model and generating a database in the memory corresponding to the modified model;
(b) predicting the form of the traces from the modified model;
(c) comparing the predicted traces with the observed traces in the dataset to generate a penalty function;
(d) determining whether or not to accept the modified model as the best model so far based on the value of the penalty function; and
(e) repeating operations (a) to (d) until termination criteria have been achieved.
According to another aspect, the invention provides apparatus for sequencing DNA which comprises:
(i) a vessel for receiving a sample of DNA to be sequenced, and other reactants for growing a complementary DNA strand, including terminator bases having characteristic absorption spectra;
(ii) means for performing electrophoresis on the contents of the vessel;
(iii) means for obtaining spectra from material as it finishes electrophoresis at a frequency that is characteristic of each of the terminator bases to generate a trace corresponding to each base; and
(iv) a computer that has been programmed with a program according to the invention.
According to yet another aspect, the present invention provides a method of sequencing DNA which comprises:
(i) mixing a sample of DNA to be sequenced with a quantity of primer and bases for growing a complementary DNA strand, the bases including terminator bases having characteristic absorption spectra;
(ii) performing electrophoresis on the contents of the vessel;
(iii) obtaining spectra from material as it finishes electrophoresis at a frequency that is characteristic of each of the terminator bases to generate a dataset of observed traces corresponding to each base; and
(iv) determining the base sequence from the traces by means of a computer provided with a program according to the invention.
The invention may, in principle, be employed using terminator bases having any form of characteristic spectra, for instance I.R. or u.v. absorption spectra, or X-ray spectra. However, it is most common to employ terminator bases that incorporate fluorescent dyes, and to observe the fluorescence of the material at the end of electrophoresis. Phosphorescent dyes could instead be employed provided that the decay time of the phosphors were sufficiently short to distinguish different bases in the traces.
Indeed, the invention is not restricted to dye terminator chemistry, but may be employed with dye primer chemistry in which a dye is attached to the primer, and separate electrophoresis columns are employed for each base type. Furthermore, the invention may be employed using radioisotope labelled bases or primers, for example using 32P or 35S radioisotopes.

The program and method according to the present invention have the advantage that base calling and the quality of the predicted models are much less prone to error due to differences in the experimental setup and conditions. In contrast to the prior art methods, the method according to the present invention is more mathematically rigorous and less empirical. In addition, in contrast to known software such as Phred, it is not necessary to invoke a fifth base type (corresponding to Phred's N) to indicate the presence of a base where no clear peak is visible in any of the traces.
The penalty function that is generated by the program is essentially a figure of merit quantifying the degree to which any particular model predicts traces that differ from the observed traces, and should be minimised during the process. The penalty function may take any of a number of forms, but one form that is useful includes a degree of misfit between the observed and the predicted traces, and a measure of divergence from a general model of DNA, that is to say, a measure of divergence from a model of DNA that does not take into account the particular sequence of the bases in the helix. For example, according to one general model of DNA and its diffusion properties, the spacing of the base peaks is generally regular along the trace, and so the measure of divergence from such a model would be the irregularity of the base peak spacing (referred to herein as the "spacing norm"). Also, it is possible to vary the relative weighting of the misfit and the irregularity of the bases in generating the penalty function.
One form of modelling process that may be employed according to the invention is referred to as "simulated annealing" and constitutes one iteration in the process of refining in step (b) above. This is a process by which changes that are at least partly random are made to a model, the effects of the changes are calculated, i.e. the form of the predicted traces, and a decision is made whether to accept the new model as the best model or revert to the previous model. By repeating the process many times, the model will gradually improve as observed by a reduction in the penalty function.
Models that may be used in the modelling process will normally include a number of continuous variables, for example relating to peak strength, peak shape, correlation between peaks and peak position, which may be varied in a continuous manner, and discrete variables, for example base type and number of bases, which can only be changed in a discrete manner. One such model that may be employed according to the invention is described in detail below.
A number of changes may be made to discrete variables in the model as part of the simulated annealing process, namely: addition of a base to the model, removal of a base from the model, and replacement of one base for another in the model. This process may be performed entirely randomly, but this may be expensive in computational time, and it may be better to weight the addition, removal and replacement in order to reduce the time taken to generate better models. For example, the changes and the type of base change may be chosen in accordance with the degree of misfit or irregularity of the base peak spacing in order to accelerate optimisation of the model by concentrating on those regions that contribute excessively to the penalty function.
In the process of optimising the model of the base sequence, the penalty function will be reduced, and this will normally involve the program accepting the modified model during a simulated annealing process if the penalty function of the modified model is less than that of the previous model. However, if the program only ever accepts a modified or new model if the penalty function is less than that of the previous model, it is possible for the modelling process to become trapped in a local minimum, that is to say, in a model or group of models having the lowest penalty function that is obtainable by making a single change to the bases in an attempt to reduce the penalty function, but which may not be the lowest penalty function achievable. Thus, it is preferred to allow the penalty function to be increased during the simulated annealing process in some circumstances to allow the model to escape a local minimum. This may be achieved by allowing the program to accept a modified model having a penalty function that is greater than that of the previous model, with a probability of acceptance that depends on the difference between the penalty function of the modified model and that of the previous model. The probability of acceptance may also decrease as the number of steps (b) that have been performed increases. Such a process allows changes to the model that can generate relatively large increases in the penalty function to be made at the beginning of the process, with the prospect of the model moving from one local minimum of the penalty function to a different region of potentially lower penalty function; but as the refining process continues, the prospect of changes to the model allowing it to escape a local minimum is gradually reduced.
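A minimal sketch of such an acceptance rule is given below. The Metropolis-style exponential form and the geometric cooling schedule are assumptions for illustration only; the text above requires merely that the acceptance probability depend on the penalty difference and decrease as the number of steps increases.

```python
import math
import random

def accept_model(e_new: float, e_old: float, step: int,
                 t0: float = 1.0, cooling: float = 0.999) -> bool:
    """Decide whether to accept a modified model.

    Always accept if the penalty function decreased; otherwise accept
    with a probability that shrinks both with the size of the increase
    and with the number of annealing steps performed so far.  The
    exp(-dE/T) form and geometric cooling are assumed, not specified
    in the source.
    """
    if e_new < e_old:
        return True
    temperature = t0 * (cooling ** step)   # decreases as steps increase
    if temperature <= 0.0:
        return False
    return random.random() < math.exp(-(e_new - e_old) / temperature)
```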
The simulated annealing process enables the program to make changes in discrete variables in the model, that is to say, changes to the type and number of the bases. When such a change has been made, and before the penalty function has been calculated, it is desirable to make relatively minor changes to the model in order to optimise it with respect to continuous variables. Thus, the program preferably performs a process on any model to minimise differences between the predicted traces and the observed traces with respect to continuous variables of the model before calculating the penalty function. Such a process is preferably a least squares inversion process that is performed iteratively with at least five and especially at least ten steps. This process may be performed to minimise the difference between the predicted and observed traces with respect to any one or more of the following parameters: the position of each base, the strength of each base signal, the background signal, the correlation between signals from different bases, and the peak width and peak skew in the traces.
According to the invention, the program generates an initial model of the DNA sequence that is refined by the simulated annealing process. It is not absolutely necessary in the broadest aspect of the invention for the initial model to depend on the observed sequence: an initial model could be generated at random, or from a pre-stored table, and then simply be refined, although this could potentially involve a significant extra computational effort, and any improvement in the similarity of the initial model with the actual DNA sequence will reduce the effort significantly. The program therefore preferably generates an initial model that takes the observed traces into account. A number of methods may be employed to generate the initial model, but generation of the model will normally be achieved by assigning a base to a peak in a trace that is higher than any of the other traces at that position, that is, by assigning bases to the most significant peaks of the traces. The correlation of the peaks of different traces can then be set to zero, and/or the peak widths can be set to the peak separation in order to generate easily identifiable peaks.
A typical base sequence that is operated on will contain in the region of 1000 bases, and the number of samples in the observed traces will be typically about 15,000, i.e. about 15 samples per base called. However, the number of samples per base may vary depending among other things on the diffusion properties of the DNA, and can be for example between 5 and 25 samples per peak. Such a sequence length may well be too great for ease of manipulation, and so it is preferred for the program to divide the traces generated from the electrophoresis into a plurality of smaller traces on which it operates to determine the base sequence, and then, after the base sequences corresponding to the smaller traces have been determined, to join together the models of the base sequence corresponding to the smaller traces to generate a model for the entire sequence. The traces may be divided into smaller traces corresponding to from 30 to 200 bases and preferably in the region of 50 to 100 bases. In addition, the smaller traces will preferably overlap other smaller traces corresponding to adjacent portions of DNA sequence by a small number of bases, for example at least two and especially at least three bases. In such a case, when the models are joined together, too many bases will be present at the overlapping regions and excess bases will need to be removed. This may be achieved by refining those parts of the composite model so formed corresponding to the overlapping regions of the smaller traces by removing each base of the overlapping regions of the smaller traces in turn, comparing the predicted traces in their overlapping regions after removal of each base with the observed traces and rejecting any model that has a greater penalty function than the model without removal of any bases.
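A sketch of this sectioning is given below; the section length of 1000 samples and the 50-sample overlap are the typical figures quoted later in the text, and the function name is illustrative.

```python
import numpy as np

def split_into_sections(traces: np.ndarray, section_len: int = 1000,
                        overlap: int = 50) -> list[np.ndarray]:
    """Split traces of shape (4, n_samples) into overlapping sections.

    Adjacent sections share `overlap` samples (roughly three bases at
    about 15 samples per base), so that the section models can later
    be joined and the duplicated bases in the overlaps removed.
    """
    n = traces.shape[1]
    sections = []
    start = 0
    while start < n:
        end = min(start + section_len, n)
        sections.append(traces[:, start:end])
        if end == n:
            break
        start = end - overlap   # step back to create the overlap
    return sections
```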
Some criteria for terminating the refinement of the model will normally be required in order to stop the process. Any of a number of criteria may be employed. For example, the process may be terminated if the simulated annealing step (b) has been performed at least 100, and preferably at least 300, times since a modified model was last accepted as the model, and has been performed more times since a modified model was last accepted than the number of iterations it took to get from the starting model to the most recently accepted model. This criterion is most likely to apply to relatively clean data sets, which only need a few alterations to the starting model. Alternatively or in addition, the process may be terminated when the simulated annealing step has been performed at least 1000 times, and preferably at least 3000 times, since the most recent time a modified model was accepted as the model. This criterion will normally be employed with more complicated data where the model may not have settled down completely but is likely to be fairly close to its optimum. Finally, the program may be terminated simply when the simulated annealing step (b) has been performed a large number of times, for example at least 100,000 times or preferably at least 300,000 times. In this case, it is most likely that the data is highly noisy and the modelling process is confused, with the result that any models produced are of doubtful validity.
It is highly desirable that the apparatus and method according to the invention not merely produces a model for the DNA sequence, but also generates a probability that the assignment of any base in the model is correct. This may be achieved by seeing how the model deteriorates on removal of that base. For example the program may operate by:
(1) predicting the form of the traces of the model after removal or changing of a base and comparing the predicted traces with the observed traces to generate a penalty function; and
(2) comparing the penalty function obtained in step (1) with the penalty function of the model without removal or changing of any base therefrom.
The difference in the penalty functions may then be converted into a probability or some other figure of merit for that base. This process of deriving the probability of the assignment of bases will normally be performed after the termination criteria have been satisfied, and preferably after any shorter sequences corresponding to the shorter traces have been joined together.
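The text does not fix the function used to convert penalty differences into probabilities, so the Boltzmann/softmax mapping below, over the four base types plus a "no base" alternative, is only an assumed illustration; the helper name and the scale parameter are hypothetical.

```python
import numpy as np

def base_probabilities(penalty_per_type: dict[str, float],
                       penalty_no_base: float,
                       scale: float = 1.0) -> dict[str, float]:
    """Convert penalty-function values into per-base probabilities.

    penalty_per_type maps 'A', 'G', 'C', 'T' to the penalty of the
    model with that base type at the position; penalty_no_base is the
    penalty with the base removed.  The softmax mapping and the scale
    factor are assumptions; the source only states that the penalty
    difference is converted into a probability or figure of merit.
    """
    labels = list(penalty_per_type) + ["none"]
    penalties = np.array(list(penalty_per_type.values()) + [penalty_no_base])
    weights = np.exp(-scale * (penalties - penalties.min()))
    probs = weights / weights.sum()
    return dict(zip(labels, probs))
```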
Any program embodying the invention may be in the form of source or object code or in any other form suitable for the implementation of processes according to the invention. The carrier may be any entity or device capable of carrying the program.
For example, the carrier may comprise a storage medium such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disc. Further, the carrier may be a transmissible carrier such as an electrical or optical signal which can be conveyed via electrical or optical cable or by radio or other means.

One form of program, apparatus and method according to the present invention will now be described by way of example with reference to the accompanying drawings, in which:
Figure 1 is a schematic illustration of DNA replication in the Sanger method of DNA sequencing;
Figure 2 is a drawing showing some of the products of the method shown in figure 1;
Figure 3 is a schematic representation of the apparatus according to the present invention;
Figure 4 is a drawing of the form of input data obtained during the process according to the invention;
Figure 5 is a graphical representation of an observed trace obtained in the method of DNA sequencing according to the invention;
Figure 6 is a graphical representation of part of the trace shown in figure 5 in greater detail;
Figures 7a and 7b are graphical representations showing correlation between different bases in the process;
Figures 8A and 8B are typical traces of part of a DNA sequence obtained during the method according to the invention;
Figure 9 is a schematic illustration showing changes to the skew of the trace peaks;
Figure 10 is a schematic illustration showing some of the variables employed in modelling the DNA sequence;
Figure 11 is a block diagram illustrating the data structure and process steps of the base calling program according to the invention;
Figure 12 is a diagram of the data forming the input dataset of a section, a model, and the output dataset of that section;
Figure 13 is a flow diagram showing the overall base calling method used in the process according to the invention;
Figure 14 is a block diagram illustrating the data structure and process steps of the sectioning step of the present invention;
Figure 15 is a flow diagram showing the process for finding the start of clean data;
Figure 16 is a block diagram illustrating the data structure and process steps in the simulated annealing process employed in the present invention;
Figure 17 is a flow diagram illustrating the simulated annealing process;
Figure 18 is a block diagram illustrating the data structure for the starting model generator;
Figure 19 is a flow diagram illustrating generation of the starting model;
Figure 20 is a block diagram illustrating the data structure for the least squares inversion process employed in the invention;
Figure 21 is a flow diagram illustrating the least squares inversion process employed in the invention;
Figure 22 is a block diagram illustrating the data structure and programs employed in the model modifier;
Figure 23 is a flow diagram of the algorithm used for modifying the model by removal of a base;
Figure 24 is a flow diagram of the algorithm used for modifying the model by the addition of a base;
Figure 25 is a flow diagram of the algorithm used for modifying the model by changing a base in the model;
Figure 26 is a block diagram illustrating the data structure and programs employed in the model acceptance step of the simulated annealing process;
Figure 27 is a flow diagram of the model acceptance step of the simulated annealing process;
Figure 28 is a block diagram illustrating the data structure employed in the base assignment probability calculation;
Figure 29 is a flow diagram illustrating the steps of the base assignment probability calculation;
Figure 30 is a schematic diagram showing joining up of smaller DNA sections to form a larger section;
Figure 31 is a flow diagram showing base removal during joining up of the DNA sequences;
Figure 32 is a diagram of part of an observed trace of clean data together with a DNA model obtained according to the invention, and a predicted trace;
Figure 33 is a diagram of part of an observed trace of poorer data together with a DNA model obtained according to the invention, and a graph of base assignment probabilities;
Figure 34 is a graph showing base calling errors made by the process according to the invention and those made by Phred by way of comparison.
Referring to the accompanying drawings, figure 1 shows schematically the growth of a complementary DNA chain in the dideoxy or chain terminator method of DNA sequencing. In this method, multiple copies of the DNA to be sequenced are placed in solution along with an alkali to denature the DNA, that is, to cause the two DNA strands to unravel, together with a quantity of a primer and of each of the four bases, adenine (A), guanine (G), cytosine (C) and thymine (T). In fact, a deoxynucleotide triphosphate is employed that incorporates the base, but the term "base" will be employed for simplicity since no confusion will occur. The primer is a short sequence of bases that is known to exist in the DNA to be sequenced, and attaches to a known location in the DNA strand. A complementary strand of DNA will then form, by attachment of adenine to thymine and of guanine to cytosine.
A small percentage of each base is a terminator base in which a hydroxyl group is missing from the 3' position of the nucleotide (shown in figure 1 by A* and C*), and which includes a fluorescent chromophore. As the complementary DNA strand forms, a small percentage of the strands will incorporate terminator bases from the solution and the strand growth will stop at that point. A starting strand as shown in figure 2 will thus generate a mixture of complementary DNA strands, each strand varying in length from another strand by a single base, and each strand being terminated by one of the terminator bases. All possible lengths of DNA sequences are created because of the terminators that are present, although the shorter ones are more probable because a certain proportion of the strands is used up for each unit of growth of the strand.
Figure 3 shows apparatus according to the invention for sequencing a sample of DNA. The apparatus comprises a vessel 1 for receiving the DNA sample together with the primer and the bases, including the terminator bases. In addition, the apparatus includes a column 2, for example in the form of a capillary or a column of gel, and means 4 for applying an electric field along the column in order to cause electrophoresis of the contents of the vessel. Because the sequences of DNA all have different lengths and masses, they will diffuse along the column 2 at slightly different rates, and will appear at the far end 6 of the column in the order of their length. A laser 8 or other light source is provided at the far end 6 of the column 2, or at another appropriate detection point, and is shone onto the material finishing the electrophoresis. Because each of the terminator bases includes a fluorescent dye that fluoresces with a characteristic spectrum, the material finishing the diffusion will fluoresce under the action of the laser 8. The apparatus includes a beam splitter 10 and four filters 12 that match the fluorescence spectrum of each of the four bases. Also, the apparatus includes A/D converters 14 that digitise the signals from the filters 12, and a computer 16 that can receive the data from the A/D converters.
In operation, the contents of the vessel 1 will be caused to diffuse along the column 2 under the action of the electric field, and will exit the column at its far end 6, whereupon it will be subjected to the laser beam and the fluorescence will be detected, filtered, digitised by A/D converters 14 and stored in the computer 16. The rate of the A/D converter is matched to the speed of diffusion so that approximately fifteen data samples are recorded per base as the sequences diffuse along the column 2. Each line from a filter 12 to its A/D converter 14 will therefore carry a trace corresponding to a frequency window around the dye frequency of one of the bases A, C, T, G as shown. The peaks of the traces, as represented in Figure 3, are idealised in that they are shown as being well separated, of equal height and exhibiting no correlation, so that assigning bases to the peaks would not present any problem. In practice, assigning bases to peaks is much more complicated. Figure 4 shows a typical sample of the input data that is collected by the computer 16 from the electrophoresis. The data comprises a set of approximately 15,000 optical intensity values from each of the four filters corresponding to each of the bases A, G, C and T. At the start of the data collection, for example up to sample No. 100, the first bases will not have reached the end of the column, with the result that the values for each base will be relatively low and constant, and provide the background level for each trace. At one sampling point, corresponding to the appearance of the first base from the column 2, the values in the input data trace will rise dramatically as the bases are detected at the end of the electrophoresis column. The maximum intensity values recorded, corresponding to the peak height of the traces, will initially be relatively high and will then slowly decrease over the course of the 15,000 samples taken, as shown in figure 5, which is a superposition of traces for each of the four bases, with the result that, at the end of the 15,000 samples, the peak height is typically only about 20% of the peak height at the beginning, depending on the concentration of the terminator bases. The reason for this is that a certain proportion of the strands is used up for each unit of growth, so that the concentration of the strands decays exponentially over time (and therefore with the sampled length of the strand). In addition, because the DNA strands toward the end of the trace diffuse more slowly than those at the beginning of the trace, the peaks in the trace will be broadened with respect to the earlier peaks, with the result that the signal to noise ratio in the traces is diminished further.
The background random noise is generally not very large, at least at the beginning of the trace, and the average signal to noise ratio across the trace is higher than 10. This random noise is due to thermal, shot and quantization noise from the A/D converter and laser scattering. The noise must be subtracted or taken into account in initial processing. Each detector has a constant baseline (this can be seen at the start of the trace) caused by a constant background voltage from the hardware.
An additional source of background noise is the DNA diffusion process itself. Dependent on the exact sequence of DNA there is a small probability that it will "fold" in on itself and alter its diffusion characteristics. This may result in DNA strands of differing lengths diffusing at similar times and this will add to the noise. This effect is known as sequence compression. It depends on the detailed chemistry being used, and results from intrastrand complementarity causing single strand DNA to form regions of local secondary structure.
In addition to the noise, the traces for the different bases can show a high degree of correlation, due to overlap in the absorption or emission spectra of the different terminator bases, which results in peaks being formed simultaneously in traces for different bases.
Figure 6 shows an example of this. For example, whenever there is a peak in trace G there is also a peak in trace T. The T sensor is activated by the presence of base G, which is due to an overlap of the fluorescence spectrum from the G base with the frequency window for the T trace.
Any attempt to deduce the order of bases in the genome should take these correlations into account. For example, there are peaks of similar height in trace C at positions 120 and 150 in Figure 6. However the peak at 150 may well be caused by a base of type A, since there is a peak in trace A at the same position, whilst that at position 120 appears uncorrelated with any of the other traces so may well be genuine. If each trace were to be considered in isolation, it would not be possible to make these distinctions.
The correlation characteristics for traces G and T are shown in Figure 7(a). It can be seen that there is a degree of asymmetry in the correlation. A high value of trace G is always accompanied by a high value of trace T. However, there is a second distribution showing that a high value of trace T (up to about 4000) is not necessarily accompanied by a high value in the G trace. Therefore, in the presence of a G base there is a corresponding peak in the T trace. However, there are two possibilities in the presence of a T base: either there is a corresponding peak in the G trace, which corresponds to the same distribution in the correlation pattern, and there is a second correlation between the T sensor and either the C or A bases; or, in the presence of the T base, there is little correlation with the G trace.
Figure 7(b) shows the correlation between bases A and C.
As can be seen, a peak in trace A always corresponds to a small peak in trace C, whereas a peak in trace C does not necessarily correspond to a peak in trace A. There appear to be three distinct patterns in the correlation between A and C. These three patterns are due to some combination of:
(i) an A base causing peaks in the A and/or C trace;
(ii) a C base causing peaks in the C and/or A trace;
(iii) a third base causing peaks in the A and/or C trace.
It is possible to use these correlation patterns between each pair of traces to determine which base causes each peak in the trace. This can only be done by taking all four entire traces and their correlation into account, and cannot be done on a peak by peak basis. It should be noted that these figures were taken from a trace with particularly high correlation statistics.
The DNA Model
The model incorporates the following parameters:
1) Background terms: Bm for m = 0 to 3 (corresponding to A, G, C and T).
The model contains a parameter for background noise for each of the four base types since often each trace will have a different offset level. Background terms are restricted to zero or positive.
2) Correlation terms: Cm,m' for m, m' = 0 to 3 (corresponding to A, G, C and T).
These terms quantify the effect of the correlations in the peaks for the different bases referred to above. Thus, for example, if the G and A traces were correlated and the model has called a base G at a given position, the data predicted by the model would show a peak in trace G around that position, but there would also be a peak in the predicted trace A at the same position, and the ratio of the amplitude of trace A to that of trace G would be CAG.
The values of the correlation terms are allowed to vary in the range 0 to 1. The diagonal terms of the correlation matrix (i.e. CAA, CGG, CTT and CCC) are by definition unity, leaving 12 variable correlation terms. (The expression "correlation term" used herein refers only to the 12 variable parameters, and excludes the four fixed diagonal terms.) The correlation matrix is not necessarily symmetric (i.e. CGA is not necessarily the same as CAG).
In many experiments the correlation between traces is not significant. In such cases the non-diagonal terms of the correlation matrix can be set to zero to reduce computational load and increase model stability.
3) Peak shape terms
Figure 8 shows two sets of traces. In the upper set the peaks are symmetric whilst in the second plot each peak is skewed to the left. Also, in the top figure the peaks are typically of width 10 samples, whilst those of the bottom plot are about twice as broad.
To model the variation in both peak width and skew a skewed peak-shape function G(x, σ, p) may be used (equation (1); its explicit form is reproduced only as an image in the original), with G = 0 outside the region of the peak.
where x is the position on the x axis relative to the nominal base position, σ defines the width of the peak and p the skew.
The shape of this function for various values of p is shown in Figure 9.
It should be noted that the function is not normalised: its integral is a function of σ and p. Furthermore, the maximum of G is offset from x = 0 (cf. the Gaussian function); the difference between the position of the maximum and x = 0 will be referred to as the peak offset. Changing the value of p will change the maximum height and width of G, as well as the peak offset.
The function is chosen because of its ability to model varying skews in peak shape. However there may well be other functions that are suitable. In particular no attempt has been made to choose a function that matches the underlying physics of diffusion.
4) Position: Pn for n = 0 to N−1, where N is the number of bases in the model.
Figure 10 is a schematic diagram of various peaks indicating various model parameters. Each base is allocated a position (corresponding to x = 0 in figure 9), and the maximum value of the peak associated with that base is offset from the nominal base position.
5) Strength: Sn for n = 0 to N−1.
The peaks of the traces are of varying height, and so a strength Sn is associated with each peak. This acts as a multiplier to scale the peak shape function. Strengths are restricted to 0 or positive.
6) Base type: Tn for n = 0 to N−1.
This indicates whether the base is of type A, G, C or T.
The base type and number of bases are the only discrete parameters.
Using these parameters, the data predicted by the model, dp, is given by:

dp m,j = Bm + Σn Sn · Cm,Tn · G(j − Pn, σ, p)    (2)

where j is the distance along the trace measured in samples.
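A minimal sketch of this prediction in Python follows. Since the explicit form of G survives only as an image, the gamma-like skewed peak used here is an assumed stand-in with the qualitative properties described above (zero below a cut-off, width set by σ, skew set by p, maximum offset from the nominal position); the dictionary layout of the model is likewise only illustrative.

```python
import numpy as np

def peak_shape(x: np.ndarray, sigma: float, p: float) -> np.ndarray:
    # Assumed stand-in for the patent's G(x, sigma, p): zero for
    # x <= -sigma, width governed by sigma, skew governed by p, and a
    # maximum offset from x = 0 (here at x = (p - 1) * sigma).
    t = np.clip(x / sigma + 1.0, 0.0, None)
    return np.power(t, p) * np.exp(-t)

def predict_traces(model: dict, length: int) -> np.ndarray:
    """Equation (2): dp[m, j] = Bm + sum over bases n of
    Sn * C[m, Tn] * G(j - Pn, sigma, p)."""
    B = np.asarray(model["B"], dtype=float)          # four background terms
    C = np.asarray(model["C"], dtype=float)          # 4x4 correlation matrix
    dp = np.repeat(B[:, None], length, axis=1)
    j = np.arange(length, dtype=float)
    for pos, strength, btype in zip(model["P"], model["S"], model["T"]):
        g = peak_shape(j - pos, model["sigma"], model["p"])
        dp += strength * np.outer(C[:, btype], g)    # one skewed peak per base
    return dp
```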
Base Calling Procedure

Figure 11 is a block diagram showing the data structure and programs employed for the base calling procedure of the present invention, which operates on the input data shown in figure 4, that is to say, the intensity values for each sample of the four traces. The input and output data sets are shown in Figure 12 together with datasets forming the model. The model consists of a set of parameters that can be split into two groups. One group consists of parameters that apply to a single peak, namely peak strength, peak position, and base type. The other group consists of those parameters, called global parameters, that apply over the whole length of the section, namely the background terms, the correlation terms and the peak shape terms σ and p. Figure 13 is a flow diagram showing the overall base calling process. The traces in the input data store will typically be about 15,000 samples long, corresponding to about 1000 bases. The input data is filtered to remove any high frequency signals and the start of clean data (i.e. when the first bases have appeared from the electrophoresis) is recorded as jstart. The dataset is too large to be handled in its entirety, and so is split into 15 sections, each of which corresponds to about 1000 samples, or about 70 bases. Each section is operated on by the computer to determine the base sequence and 15 section output datasets are generated, each corresponding to 70 bases. The sequences are then joined together to provide a sequence of the entire DNA sample of 1000 bases which is stored in the output data store.
Figure 12 is a schematic indication showing the structure of the data operated on and generated by the base calling procedure. The input dataset 20 of one section of the filtered input data is largely in the form shown in Figure 4, having intensity values for each of the four bases at each of 1000 samples. The program generates a model 22 for each section comprising typically 70 predicted bases and, for each of the bases, an associated position and intensity, together with the global parameters. After the base sequence has been deduced, probability scores are calculated for each base type at each base position called to generate an output dataset 24 for that section. As shown in Figure 12, one base type (A, G, C or T) is called at each position, but this is simply the base type having the greatest probability. The output could, if desired, consist only of the base type probabilities. The sum of these four probabilities may be less than unity; the discrepancy corresponds to the probability of there being no base at that position. The models for each section are then reunited to generate an output of the program for the entire set of 1000 bases corresponding to the input dataset of 15,000 samples.
This breaking into sections also allows the global parameters of the model to change along the trace. For example, at the start of the trace, the peaks are usually quite narrow and distinct, but as one progresses along the trace, the peaks tend to broaden and may become more skewed. Under this scheme, each section may have its own peak width and skew.
Figure 14 is a block diagram of the data and program structure for the process of sectioning the data, and a flow diagram of the Clean Data Start Determinator is shown in figure 15. Once the data has been read into the memory, it is filtered to remove high frequency noise which could confuse the algorithm calculating the initial model as well as degrading the other algorithms. The particular type of low pass filter is not critical to the invention, but given that the data is already in digital form, a digital filter such as an FIR filter is appropriate. As part of the base calling procedure, the "typical" base peak spacing Z0 is determined. This is an important parameter and is used to calculate the spacing norm during the simulated annealing process. This parameter is derived from the data and should remain approximately constant along each section of the trace, gradually increasing between sections. However, it is much easier to estimate at the start of the trace where the peaks are cleaner and more distinct. The very start of the trace contains a precursor, representing the time before any genome fragments have been detected, and which therefore contains no useful data. An example is Figure 5: the first useful data only comes through at roughly sample 3000, as can be seen from the dramatic increase in amplitude. Henceforth it is assumed that the cleanest parts of the trace are those with greatest amplitude.
The algorithm for estimating where the clean data starts is as follows:

1. The root-mean-square (rms) data value at each sample j is calculated:

rmsj = sqrt( (1/4) Σm (do m,j)² )

where do is the observed data and m represents the base type of the trace (A, G, C or T).

2. The running average is calculated over a window:

RAj = (1/w) Σj'=j..j+w−1 rmsj'

where w is the window length, typically 100 samples.

3. The average of rms over the whole trace of length L is calculated:

rms̄ = (1/L) Σj rmsj

4. The lowest j (jstart) for which RAj, RAj+w and RAj+2w are all greater than a set fraction of rms̄ is found. This gives the position that marks the onset of consistently high amplitude data.

5. If jstart is less than the window length then the precursor is of negligible duration and can be ignored by setting jstart to zero. If there is no value of jstart satisfying step 4, or if jstart is greater than L/2, then again it is assumed that there is no easily distinguishable precursor and jstart is set to zero.

6. Finally jstart is rounded down to the nearest multiple of 10.
The reason for requiring in step 4 that three successive blocks of data are of large amplitude is that often at the start of the trace there is noise of high amplitude but short duration (compared to the window length).
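A sketch of this clean-data-start search in Python follows; the threshold fraction of the trace-average rms used in step 4 is not legible in the source, so the value below is an assumption.

```python
import numpy as np

def find_clean_data_start(d: np.ndarray, w: int = 100, frac: float = 0.5) -> int:
    """Implement steps 1-6 above for traces d of shape (4, L).

    `frac`, the fraction of the trace-average rms used as the step-4
    threshold, is an assumed value.
    """
    L = d.shape[1]
    rms = np.sqrt((d.astype(float) ** 2).mean(axis=0))        # step 1
    ra = np.convolve(rms, np.ones(w) / w, mode="valid")       # step 2: RA_j
    threshold = frac * rms.mean()                             # step 3
    j_start = None
    for j in range(len(ra) - 2 * w):                          # step 4
        if ra[j] > threshold and ra[j + w] > threshold and ra[j + 2 * w] > threshold:
            j_start = j
            break
    if j_start is None or j_start < w or j_start > L // 2:    # step 5
        j_start = 0
    return (j_start // 10) * 10                               # step 6
```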
Once jstart has been determined, the data is split into sections for further processing, beginning with the section that starts at jstart. The section immediately following this is done next, followed by the subsequent sections until the end of the trace is reached, whereupon the program returns to the start of the dataset and processes the precursor (i.e. that part of the dataset before jstart). The reason for this is that even if the precursor contains no useful data, there is no harm in processing it other than computational effort. This part of the data is processed last because the typical peak spacing Z0 should be determined by the cleanest data, i.e. the data immediately following jstart. At each section the value of Z0 used is an average calculated from that section and all the preceding sections.
A typical sectioned trace is shown in Table 1.
TABLE 1

[Table 1 is reproduced only as an image in the original.]
Although most sections will be 1000 samples long, some sections will differ from this, for example the last section and sections corresponding to samples before jstart.
There is a small overlap of typically 50 samples between sections (about three bases) which is employed when the sections are reunited at the end of the process.
Once the data have been sectioned, the data for each section can be subjected to the simulated annealing operation. Figure 16 is a block diagram of the data and program structure of the simulated annealing process, and figure 17 is a flow diagram of the process. The simulated annealing is a process by which random changes can be made to a model, the effects of these changes calculated, and a determination made as to whether to keep the new model or revert to the previous one. By repeating the process many times the model will gradually improve, as defined by a reduction of the penalty function for that model.
Any of a number of different penalty functions could be employed involving discrepancies of the matching of the predicted peaks with the observed data and discrepancies with a general model of DNA. The function used herein for any model is a single number and is a linear combination of the misfit Ed between the predicted trace and the observed trace, and the misfit Es between the predicted base spacing and a constant base spacing. This particular penalty function may be justified with reference to Bayesian statistics which are not reproduced here. The penalty function is thus defined as:
E = (Ed + λs·Es) / (D − Np)    (3)
where D is the number of data samples (four times the length of each trace); Np is the number of free parameters in the model, and will vary as bases are added and removed; λs is a constant which reflects the relative weight given to Es and Ed (large λs will result in models with very evenly-spaced bases, while small λs will disregard base spacing in favour of achieving a good fit to the data); and Ed is the misfit and Es is the spacing norm, both of which are defined below.
Misfit
The unweighted misfit is defined as follows:

Ed = Σm Σj (do m,j − dp m,j)²    (4)

where L is the number of samples in each trace (D = 4L), do is the observed data and dp is as defined by equation (2).
Optionally this summation can be weighted by the uncertainty estimate for each data sample, σm,j, to give

Ed = Σm Σj ((do m,j − dp m,j) / σm,j)²

This might be useful if, for example, one trace is much noisier than the others. Otherwise the σ normalisation term is absorbed into the overall scaling of E and λs is changed appropriately.
Spacing norm

This is a measure of the irregularity of the base spacing:

Es = Σn (Pn+1 − Pn − Z0)²    (5)

where the summation is over the N bases in the model and Pn is the position of base n (it is assumed that the bases have been ordered according to their position).
Z0 is the 'typical' base spacing. It is important to get an accurate estimate of this by ignoring extreme spacings.
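Taking the typical spacing Z0 (whose calculation is described next) and the weight λs as given, the penalty function of equations (3) to (5), as reconstructed above, can be sketched as follows:

```python
import numpy as np

def misfit(d_obs: np.ndarray, d_pred: np.ndarray) -> float:
    """Unweighted misfit Ed, equation (4): summed over all traces and samples."""
    return float(((d_obs - d_pred) ** 2).sum())

def spacing_norm(positions: np.ndarray, z0: float) -> float:
    """Spacing norm Es, equation (5): deviation of adjacent spacings from Z0."""
    p = np.sort(np.asarray(positions, dtype=float))
    return float(((np.diff(p) - z0) ** 2).sum())

def penalty(d_obs, d_pred, positions, z0, lambda_s: float, n_params: int) -> float:
    """Penalty function E = (Ed + lambda_s * Es) / (D - Np), equation (3)."""
    D = d_obs.size                                   # four times the trace length
    return (misfit(d_obs, d_pred)
            + lambda_s * spacing_norm(positions, z0)) / (D - n_params)
```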
The base spacing, Z0, is calculated by the following steps:

1. Ensure the bases are in order of increasing position.
2. Exclude bases whose peak value lies outside the data range, as the positions of these bases are likely to be poorly defined.
3. Calculate the separation between bases two apart and divide by 2, i.e. zi = (Pi+1 − Pi−1)/2. The reason for using bases two apart, rather than adjacent bases, is that it gives a less noisy estimate of the spacing.
4. Sort the spacings by size.
5. Calculate the average, Z̄, of the middle 50% (sorted by size) of the spacings.
As explained in section 2, the value of Z0 from earlier sections is used to weight Z0 for later sections. If Z0,k is the value of Z0 used for section k (k = 0, 1, ...), and Z̄k is the value of Z̄ calculated from the bases of section k, then

Z0,k = Z̄k for k = 0, and
Z0,k = (k·Z0,k−1 + Z̄k) / (k + 1) for k > 0.

Z̄ and Z0 are recalculated every time any of the base positions changes.
The simulated annealing process is iterative, with the steps as shown below:

1) An initial, or starting, model is chosen;
2) The model is allowed to relax by a least-squares inversion process. This varies the continuous parameters to find the parameter set that minimises the penalty function, but cannot change any of the discrete parameters such as base type, and will only find the local minimum in the vicinity of the starting model;
3) The penalty function of the model is calculated;
4) A random change is made to the model, such as adding a random new base, removing a base or changing the type of a base;
5) The model is relaxed (step (2));
6) Any small peaks are removed from the trace and the model is relaxed again;
7) The penalty function of the new model is calculated;
8) The new penalty function is compared with the best penalty function of a model on record. If the new penalty function is smaller, the new model is immediately accepted as the best model;
9) It is decided whether or not to accept the new model as the current model;
10) It is decided whether or not the termination criteria have been achieved. If not, the process returns to step 4;
11) If the termination criteria have been achieved then the probabilities of the base calls are calculated and the operation is finished.
Dealing with these steps in turn:
1) Choosing an initial model
In theory, no initial model is necessary, and a model could be chosen at random, but this would be extremely wasteful in processing time. Figure 18 is a block diagram illustrating the data structure for the starting model generator, and figure 19 is a flow diagram of the initial model generation.
The closer the initial model is to the final model the better: it reduces processing time if only a few changes need to be made; and the model is less likely to get stuck in regions of local minima far from the true minimum.
The initial model, in common with all models, has three values (base type, base position and base strength) for each of approximately 70 bases, and 18 global parameters (σ, p, four background terms (Bm) and twelve correlation terms) making about 228 values for a typical section.
The initial model is chosen as follows:

1. The peaks in the traces are identified. For a peak to be identified at sample number j in trace m, two conditions must be satisfied:
(1) do m,j−1 < do m,j >= do m,j+1, i.e. there must be a local maximum in that trace; and
(2) do m,j > do m',j for m ≠ m', i.e. that trace must be the biggest of the four.

2. For each of the four traces the minimum value is found. This is assumed to be the background level, Bm.
3. p is set to 3, which corresponds to approximately symmetric peak shapes.
4. σ is set to the average peak separation, to give reasonably distinct peaks.
5. The base positions, Pn, are set to match the peak positions found in step 1, taking into account the peak offset (i.e. the maximum of the peak shape function G(x,σ,p) does not occur at zero x).
6. The base type for each base, Tn, is set to match the maximum trace, m, as found in step 1.
7. For each base, its strength, Sn, is set to match the observed data for the appropriate base type, m:
do m,j = Bm + Sn·Gmax, where Gmax is the maximum value of G(x,σ,p).
8. The correlation terms are set to zero.
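A compact sketch of steps 1 to 8 follows; for simplicity Gmax is taken as 1 in step 7 and the peak offset of step 5 is ignored, so the strengths and positions are only first estimates to be refined by the inversion.

```python
import numpy as np

def starting_model(d: np.ndarray) -> dict:
    """Generate the initial model from the observed traces (steps 1-8 above)."""
    n_traces, L = d.shape
    # Step 1: local maxima that are also the largest of the four traces.
    peaks = []
    for j in range(1, L - 1):
        m = int(d[:, j].argmax())
        if d[m, j - 1] < d[m, j] >= d[m, j + 1]:
            peaks.append((j, m))
    B = d.min(axis=1)                                          # step 2
    js = [j for j, _ in peaks]
    return {
        "p": 3.0,                                              # step 3
        "sigma": float(np.mean(np.diff(js))) if len(js) > 1 else 10.0,  # step 4
        "P": np.array(js, dtype=float),                        # step 5 (offset ignored)
        "T": np.array([m for _, m in peaks]),                  # step 6
        # Step 7: Bm + Sn * Gmax should match the data peak; Gmax taken as 1.
        "S": np.array([d[m, j] - B[m] for j, m in peaks], dtype=float),
        "B": B,
        "C": np.eye(n_traces),                                 # step 8
    }
```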
2) Relaxation - Least squares inversion
The purpose of this process is to make relatively small adjustments to the continuous parameters of the model in order to reduce the penalty function. It will alter the parameters of existing bases, but will not add or remove bases or change their base type. Figure 20 shows the data structure for this process and Figure 21 is a flow diagram of the process. The predicted data is nonlinear with respect to the model parameters, and so it is necessary to linearise about the starting model of each iteration and to iterate the solution (effectively the same as the Newton-Raphson method but extended to NP dimensions, where NP is the number of continuous parameters in the model).

Since the number of data samples, D, and the number of free parameters, Np, in the penalty function given by equation (3) are both constant during the inversion, minimising the penalty function is equivalent to minimising
Einv = Ed + λs·Es

It is necessary to find the model parameters, mj, corresponding to the (local) minimum of Einv. At the minimum, ∂Einv/∂mj will be zero for all j.
The unweighted misfit (equation (4)) can be written as

Ed = Σi (do,i − dp,i)²
   = Σi (do,i − dp0,i − Σj Ai,j·δmj)²
   = (δd − A·δm)ᵀ(δd − A·δm)
where:

i is the data index, incorporating both the position along the trace and the trace type (i.e. A, G, C or T);

dp0,i is the data predicted by the starting model (m0);

Ai,j is the expected change in the predicted value for data i for each unit of change in model parameter j (Ai,j = ∂dp,i/∂mj), evaluated at m0;

δd is the difference between the observed data and that predicted by the starting model; and

δm is the change in the model at this iteration, so that the model after the iteration (which will then become the starting model for the next iteration) is m0 + δm.
Therefore ∂Ed/∂(δmj) is the j'th element of −2(Aᵀδd − AᵀA·δm). Similarly the spacing norm (equation (5)) can be written as:

Es = Σn=0..N−2 (Pn+1 − Pn − Z0)²
where N is the number of bases.
Differentiating with respect to P gives:

∂Es/∂P0 = −2(P1 − P0 − Z0);
∂Es/∂PN−1 = 2(PN−1 − PN−2 − Z0); and
∂Es/∂Pi = 2(Pi − Pi−1 − Z0) − 2(Pi+1 − Pi − Z0) for all other i.

This can be written in matrix form: ∂Es/∂Pi is the ith element of 2FP + 2G, where F is the N×N tridiagonal matrix with diagonal elements (1, 2, 2, ..., 2, 1) and −1 in the positions adjacent to the diagonal, P is the N element vector (P0, ..., PN−1)ᵀ, and G is the N element vector (Z0, 0, ..., 0, −Z0)ᵀ.
Linearising around the starting model, m0, 2FP + 2G becomes 2(F·δP + F·Pm0 + G), where δP is the change in position and Pm0 is the starting position. Recombining the terms for misfit and spacing norm, and writing in matrix form, gives:

δm = (AᵀA + λs·F′)⁻¹ (Aᵀδd − λs·(F′·m0 + G′))

where F′ and G′ are the F and G matrices above but with zeroes inserted for the terms corresponding to the non-positional parameters of the model.
Once the AᵀA and Aᵀδd matrices have been calculated, it may be necessary to add a damping term to them in order to make the inverted matrix stable. For example, if the starting model derived from the data contains no bases of one type, then the correlation values that relate to that base will be unconstrained by the data.
The damping acts to suppress the values of unconstrained parameters. This damping is very light and should not much affect the solution. It is achieved by adding a third term to Einv, such that
Einv = Ed + λs·Es + λc·Ec

where λc is a small constant and

Ec = Σj (mj / Cj)²

where the summation is over the parameters of the model. The values of Cj are constant and are chosen to be approximately equal to what are deemed typical values for each model parameter. By considering ∂Ec/∂mj and arranging the terms of C into a diagonal matrix Ĉ, the following equation is obtained:

δm = (AᵀA + λs·F′ + λc·Ĉ)⁻¹ (Aᵀδd − λs·(F′·m0 + G′) − λc·C·m0)

(The distinction between C and Ĉ enables m and δm to be damped to different extents respectively. For example, there is no 'typical' value for position, so one should not try to suppress the value of position parameters (effectively forcing all bases towards position zero). However, in cases where position is poorly constrained, one may wish the change in position to be suppressed, to allow stable matrix inversion.)
After δm has been calculated and the model has been updated the model is tested for validity by checking whether or not the parameters are physically plausible, in particular by testing whether
(a) the background values are positive or zero; (b) the correlation values are between 0 and 1; and
(c) the base strengths are zero or positive.
If any of the parameters lie outside these ranges they are moved within the range. The penalty function of the model or at least the numerator of the penalty function is then calculated and compared with that of the best inverted model. If it is not less than that of the best inverted model, so that the best inverted model corresponds to the minimum so far, the best inverted model is returned and the inversion process is terminated, otherwise the new model is saved as the best inverted model and the iteration is repeated from the
step of constructing the AᵀA and Aᵀδd matrices, until at least 10 iterations have been performed.
4) Random Change to Model
Three types of changes to the model are allowed: • a base is removed (20% probability)
• a base is added (60% probability)
• a base type is changed e.g. a base of type G is changed to type A (20% probability)
Although the changes are random they are not uniformly distributed, but are biassed towards regions with large misfits or with base spacings that are too big or too small. This accelerates the simulated annealing process by concentrating on regions that contribute excessively to the penalty function.
Figure 23 is a flow diagram showing the removal of a base from the model. Where a base is removed, one base is removed from the model at a time. That base is chosen at random based on either the base spacing (50% probability) or the base strength (50% probability).
If removal of a base is chosen due to a large contribution to the spacing norm, the spacing Zn of the base n, given by (Pn+1 − Pn−1)/2 for each base other than the first and last base, is calculated. For the first and last base, Zn is given by the spacing between that base and the adjacent base. Z0 is subtracted from this to give a spacing anomaly ΔZn. Positive values of ΔZn are set to zero to ensure that only bases that are too close to other bases are removed. A base is chosen at random, with the probability of base n being chosen being proportional to |ΔZn| or ΔZn². For example, the program may calculate the cumulative value of |ΔZn|, Σ|ΔZn|, and generate a random number between 0 and Σ|ΔZn|; the probability of a given n being selected by the random number is then proportional to |ΔZn|. If removal of a base is chosen due to the base strength, the program looks for bases of low strength, which therefore make little contribution to the predicted data value and hence do little to reduce the misfit. The probability of removing a particular base is inversely proportional to the strength of that base (with the proviso that if the strength is below a threshold, the probability is set to the inverse of that threshold to avoid inverting very small numbers). One base is chosen at random and removed from the model.
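The cumulative-sum sampling described above might be sketched as follows for the spacing-based choice; the handling of the two end bases follows the text, and the use of |ΔZn| as the weight (rather than ΔZn²) is one of the two options mentioned.

```python
import numpy as np

def choose_base_to_remove_by_spacing(positions: np.ndarray, z0: float) -> int:
    """Pick a base for removal, biased towards bases packed too closely.

    positions is assumed sorted and to contain at least two bases.
    """
    p = np.asarray(positions, dtype=float)
    z = np.empty_like(p)
    z[1:-1] = (p[2:] - p[:-2]) / 2.0          # interior bases: (P[n+1]-P[n-1])/2
    z[0] = p[1] - p[0]                        # end bases: spacing to the neighbour
    z[-1] = p[-1] - p[-2]
    dz = np.minimum(z - z0, 0.0)              # keep only "too close" anomalies
    weights = np.abs(dz)
    if weights.sum() == 0.0:
        return int(np.random.randint(len(p)))
    # cumulative-sum sampling: P(base n) proportional to |dZ_n|
    return int(np.searchsorted(np.cumsum(weights),
                               np.random.rand() * weights.sum()))
```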
Where a base is added, a single base is added to the model in each random change. The position and type of the added base are chosen at random. The position is chosen first, based either on the misfit (80% probability) or on the base spacing (20% probability).
Figure 24 is a flow diagram showing the process of adding a base to the model. The program initially determines whether a base is added to reduce the misfit or to improve the base spacing. If a base is added in order to reduce the misfit, the program aims to introduce a new base at positions where the discrepancy between the observed data and that predicted by the model is greatest. For each position j, the misfit summed over all four traces is calculated:

misfitj = Σm (do m,j − dp m,j)², where m = A, G, C and T.
A position is then chosen at random, with the probability of choosing position j being proportional to misfitj.
If a base is added in order to reduce the spacing norm, the program biasses the insertion of new bases towards regions where the bases are too sparse.
1. The spacing between one base and the following base is calculated: Zn = Pn+1 − Pn for n = 0...N−2.
2. The typical spacing Z0 is subtracted to get the spacing anomaly: ΔZn = Zn − Z0.
3. If ΔZn < 0, set ΔZn = 0. This ensures that only regions where bases are too far apart are considered, not those where they are too close.
4. A base location is chosen at random, with the probability of base location n being chosen being proportional to ΔZn².
5. The mid-point between the base location chosen and that immediately after it is calculated.
6. The peak offset is added, to give the position where the data predicted for the new base is at its maximum. This gives the insertion point for the new base.
7. If the insertion point is outside the data range, it is moved to be within the data range.
Having chosen the position with one of the two algorithms above, the following steps are performed to decide the new base type:
1. At the new position the misfit for each trace is calculated: Δdm = do m,j − dp m,j, where j is the new position and m = A, G, C or T.
2. The type of the new base is chosen at random, with the probability of type m being proportional to Δdm².
3. The position is corrected for the peak offset.
4. The strength of the new base is set to the average strength of all the other bases.
5. The new base is inserted into the model such that the bases are ordered according to their position. This is a pre-requisite for calculating the spacing norm.
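The misfit-driven choice of insertion position and base type just described can be sketched together as follows:

```python
import numpy as np

def choose_insertion_by_misfit(d_obs: np.ndarray, d_pred: np.ndarray):
    """Choose the position and type of a new base from the misfit.

    Position j is drawn with probability proportional to the misfit
    summed over the four traces; type m with probability proportional
    to the squared per-trace misfit at that position.
    """
    per_sample = ((d_obs - d_pred) ** 2).sum(axis=0)          # misfit_j
    j = int(np.searchsorted(np.cumsum(per_sample),
                            np.random.rand() * per_sample.sum()))
    dd = (d_obs[:, j] - d_pred[:, j]) ** 2                    # (Delta d_m)^2
    m = int(np.searchsorted(np.cumsum(dd), np.random.rand() * dd.sum()))
    return j, m
```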
In order to change the base type in the model, the program looks for regions of large discrepancy between observed data and the model, and changes the base type of a base in that region. The number of bases in the model is unchanged, as are all base positions and strengths.
Figure 25 is a flow diagram showing the process of changing a base in the model. The process has the following steps :
1. For each position j, the misfit summed over all four traces is calculated: misfitj = Σm (do m,j − dp m,j)², where m = A, G, C and T.
2. A position is chosen at random, with the probability of choosing position j being proportional to misfitj.
3. The base that (once corrected for the peak offset) is closest to this new position is found.
4 At the maximum of that base the misfit for each trace is calculated:
Δdm = dom,j - dpm,j, where j is the base maximum and m = A, G, C or T.
5 The type of the new base to replace the existing base is chosen, with the probability of type m being proportional to Δdm².
6 If the new base type is the same as before, steps 2 onwards are repeated.
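A corresponding sketch of the type-change move, under the same assumed data layout as the addition sketch above; the retry loop realises step 6:

import random

TRACES = ['A', 'G', 'C', 'T']

def propose_base_change(bases, observed, predicted, peak_offset):
    """Change the type of one base in place; positions and strengths are
    untouched. `bases` is an assumed list of dicts with 'type' and
    'position' keys."""
    n_samples = len(observed['A'])
    misfit = [sum((observed[m][j] - predicted[m][j]) ** 2 for m in TRACES)
              for j in range(n_samples)]
    while True:
        j = random.choices(range(n_samples), weights=misfit)[0]
        # Base whose offset-corrected position is closest to j.
        i = min(range(len(bases)),
                key=lambda k: abs(bases[k]['position'] + peak_offset - j))
        peak = int(round(bases[i]['position'] + peak_offset))
        peak = min(max(peak, 0), n_samples - 1)
        type_weights = [(observed[m][peak] - predicted[m][peak]) ** 2
                        for m in TRACES]
        new_type = TRACES[random.choices(range(4), weights=type_weights)[0]]
        if new_type != bases[i]['type']:
            bases[i]['type'] = new_type
            return i  # index of the changed base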
As described above, whenever a random change is made to the model, all the parameters of the model are optimised, which can be very inefficient in terms of computational effort. A change to a base is expected generally to have only a local effect on the trace and will not affect, for example, peaks at the other end of the trace. It is therefore not necessary to optimise all the parameters for each iteration. This may give slightly worse performance per iteration, as quantified by the penalty function, but this is outweighed by the increased number of iterations that can be performed in a given time. A number of different optimisation schemes may be chosen. For example:
1) All parameters may be optimised. This option is appropriate at the start of the simulated annealing process, since none of the parameters has yet been optimised.
2) A few parameters may be optimised, relating to the bases in the vicinity of a base that has been changed. For example, where a base has been changed, whether by addition, removal or replacement, the parameters of the base in question (if present) and those of a number of bases, e.g. four, on either side of the relevant position may be optimised. This scheme is the one used most often.
3) The global parameters may be optimised. For example, if a new model is accepted as the best model, the global parameters only may be optimised to take into account the change in the bases.
Scheme 3 is slightly more complicated than the first two, since a change in the global parameters may well require compensating changes in the parameters for each of the bases. For example, if it produces an increase in one of the background parameters, the strength of all bases of that base type will have to be decreased to compensate.
Similarly, if the peak width σ increases, the peak offset will also increase and the nominal positions of all the bases will have to decrease in order that the maximum of the peak still matches the data. If the skew p changes, both the strength and the position of each base will need to change.
This could be achieved by optimising all the parameters again (scheme 1), but this would remove most of the gains in computational load. Fortunately the effect is usually the same for all bases and can be calibrated without needing the full inversion.
If the backgrounds have changed by ΔB, then the strength of base n should be changed by ΔSn, where ΔSn Gmax + ΔB = 0, and Gmax is the maximum value of G(x, σ, p).
The function relating the peak offset and Gmax to σ and p is highly non-linear and must be treated in a more sophisticated manner. Two extra parameters may be introduced into the inversion, corresponding to how the strength of each base should be scaled and how much the position should be shifted (these will be the same for every base). These two parameters are optimised alongside the peak shape parameters and backgrounds, and the strength and position of each base updated accordingly.
The cumulative misfit and total misfit are required for calculating the penalty function and for biassing the random changes. Under scheme 2, the misfit of data points far from the changed base will be unchanged. By maintaining a record of the misfit and cumulative misfit at each point, and only recalculating the misfit of points near the bases that have changed, processing times can be further reduced.
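A minimal sketch of this bookkeeping, assuming the per-sample misfit is cached in a list and that the caller derives the window [lo, hi) from the positions of the changed bases; both are assumptions rather than details given in the text:

def update_misfit_window(misfit, observed, predicted, lo, hi,
                         traces=('A', 'G', 'C', 'T')):
    """Recompute the per-sample misfit only inside the window [lo, hi)
    and return the change to apply to the cached total misfit."""
    delta = 0.0
    for j in range(lo, hi):
        new = sum((observed[m][j] - predicted[m][j]) ** 2 for m in traces)
        delta += new - misfit[j]
        misfit[j] = new
    return delta

The returned delta is simply added to the cached total misfit, so the penalty function can be re-evaluated without touching samples far from the change.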
6) Removal of Small Peaks
In order to prevent a multitude of bases, each of very small amplitude, any base in the new model whose strength is less than a fraction (e.g. 5%) of the average base strength may optionally be removed. (More complicated thresholding schemes are also possible. For example, for each base, the contribution to the predicted data due to that base is calculated. At the position of maximum contribution, the contribution from that base should be greater than a given fraction of the sum of the contributions from all the other bases, or else the base is removed. In practice, however, such schemes are found to be unnecessary.)
7) and 8) The penalty function of the resultant model is then calculated and compared to the penalty function of the current model.
9) Decision on Accepting New Model
Figure 26 is a block diagram of the data structure and program for deciding whether or not to accept the new model, and Figure 27 is a flow diagram of the process. The penalty function of the new model that has been generated by modifying the current model is calculated after removal of small peaks and relaxing the model, and is compared with that of the current model to give:
ΔE = (new penalty function) - (current penalty function).
If ΔE < 0, the new model is immediately accepted. If ΔE > 0, the new model is accepted or rejected at random, with a probability of acceptance equal to exp(-ΔE/T), where T is a parameter referred to as the annealing temperature. The term "temperature" is used by analogy with the Boltzmann factor exp(-ΔE/kT), where k is the Boltzmann constant and T is the absolute temperature. This factor defines the probability of a system jumping out of an energy well of depth ΔE.
If the new model is accepted, its global parameters are optimised and it becomes the current model, and therefore the starting point from which further random changes are made.
Over time the annealing temperature is reduced. This has the effect that at the start of the process the model is relatively free to try out new solutions in its search for the global minimum of the penalty function, even if the intermediate solutions show a worse penalty function. As time goes on, the process is given less and less freedom, forcing it to home in on a particular solution.
The choice of annealing temperature is not obvious and is somewhat empirical. If it is set too low, the model is given very little latitude to experiment and may find it hard to break out of local minima. On the other hand, if it is set too high, the model is given too much freedom and may wander very far from regions of low penalty function.
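The acceptance rule itself is fully specified above; the cooling schedule is not, so the geometric decay in the Python sketch below is purely an assumed example:

import math
import random

def accept_new_model(new_penalty, current_penalty, temperature):
    """Metropolis-style acceptance: always accept an improvement,
    otherwise accept with probability exp(-ΔE/T)."""
    delta_e = new_penalty - current_penalty
    if delta_e < 0:
        return True  # improvements are always accepted
    return random.random() < math.exp(-delta_e / temperature)

def cooled(temperature, rate=0.999):
    # Geometric cooling; the text only says the temperature is reduced
    # over time, so both the schedule and the rate are assumptions.
    return temperature * rate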
10) Termination Criteria
The simulated annealing is stopped when any of the following criteria is satisfied:
(i) 400 000 iterations have been performed;
(ii) 3 000 iterations have passed since the most recent 'best' model (i.e. the model with the lowest penalty function so far) was first seen;
(iii) 300 or more iterations have passed since the most recent best model, and the number of iterations since this model is greater than the number of iterations required to get from the initial model to this model.
The last of these is the most likely for relatively clean data sets, which need only a few alterations to the starting model. The second occurs for more complicated data: the model may not have settled down completely but is likely to be fairly close to its optimum. The first criterion is used to terminate the process where there is little prospect of a sensible result: for the process to have run this long, the data is probably highly noisy, the modelling process is thoroughly confused, and any models produced are of doubtful validity.
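A sketch of the three stopping rules, assuming best_iteration records the iteration at which the current best model was first seen (which also equals the number of iterations taken to reach it from the initial model):

def should_terminate(iteration, best_iteration):
    since_best = iteration - best_iteration
    return (iteration >= 400_000                     # criterion (i)
            or since_best >= 3_000                   # criterion (ii)
            or (since_best >= 300                    # criterion (iii)
                and since_best > best_iteration))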
11) Base Call Probabilities
Once the program has generated a final model, the probability that each base in the section has been called correctly is calculated, based on how much that base contributes to reducing the penalty function of the model. From this a quality score can be calculated. The data structure for this operation is shown in Figure 28 and a flow chart of the procedure is shown in Figure 29.
1. The best model from the simulated annealing is chosen. This should already be fully relaxed.
2. The penalty function, E, of this model is calculated.
3. The first base in the model is chosen and altered in four different ways: the base is successively replaced by a base of each of the three other types; and the base is removed completely.
4. For each of these four alterations, the model is relaxed to generate a best fit to the observed data, and the penalty function of the relaxed, altered model calculated, giving Ealtered,i for i = 0 to 3.
5. An estimate is made of how much the penalty function is expected to change under these alterations, σE. The same value of σE is used for all bases in the model.
6. The probability of the base called in the best model being correct is:

P = X / (X + Σi Xaltered,i)

where X = exp(-E/σE) and Xaltered,i = exp(-Ealtered,i/σE).

7. This can be converted to a quality factor Q = -10 log10(1 - P).

8. The probability that the base should instead be each of the four other possibilities is:

Paltered,i = Xaltered,i / (X + Σi Xaltered,i)
9. The above process is repeated for all the other bases in the best model.
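The following sketch realises steps 5 to 8 for a single base. The negative sign in the exponents is an assumption consistent with a lower penalty function indicating a better model; E_altered holds the four penalties (three substitutions and the removal):

import math

def base_call_probabilities(E, E_altered, sigma_E):
    """Boltzmann-style weights from the best model's penalty E and the
    four altered-model penalties."""
    x = math.exp(-E / sigma_E)
    x_alt = [math.exp(-e / sigma_E) for e in E_altered]
    total = x + sum(x_alt)
    p_correct = x / total
    p_altered = [xa / total for xa in x_alt]  # three substitutions + removal
    return p_correct, p_altered

def quality(p_correct):
    """Quality factor Q = -10 log10(1 - P), clamped to avoid log(0)."""
    return -10.0 * math.log10(max(1.0 - p_correct, 1e-12))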
Occasionally bases will be found for which Ealtered < E, i.e. the replacement/removal of that base will actually improve the penalty function. Normally these would have been removed by the simulated annealing process but, due to the random nature of that process, are sometimes overlooked. Such bases should ultimately be replaced/removed from the best model, but the order in which they are removed is important.
For example, if the bases in one region are too closely spaced, so that the removal of any one base will reduce the spacing norm, then several bases may well have Ealtered < E. Removing all such bases is overkill and will most likely worsen the penalty function. Instead, the base with the lowest Ealtered should be changed first, the model allowed to relax and E recalculated. Ealtered should then be recalculated for all the remaining bases. This process is repeated until no base has Ealtered < E.
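A sketch of this greedy clean-up loop. The two callbacks stand in for machinery described elsewhere in the text and are assumptions of the sketch: penalty_of relaxes a model and returns its penalty function, and altered_models yields every single-base alteration (three substitutions and a removal per base) of a model:

def prune_improving_alterations(model, penalty_of, altered_models):
    """Apply the best single-base alteration while any lowers the
    penalty; relax and re-evaluate after each change."""
    E = penalty_of(model)
    while True:
        candidates = [(penalty_of(m), m) for m in altered_models(model)]
        if not candidates:
            return model
        E_best, best = min(candidates, key=lambda c: c[0])
        if E_best >= E:
            return model  # no single alteration improves the penalty
        model, E = best, E_best  # accept it, then repeat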
A similar problem occurs when there is a series of bases of the same type which do not form clear, distinguishable peaks in the traces. For example, if there is a whole series of merged G peaks, the algorithm may decide that the most likely sequence is four bases of type G. However, the penalty function for only three bases may be very similar. Under the process outlined above, each of the four bases might be assigned a relatively low probability, since under relaxation, when any of the four is removed, the remaining three will shuffle up to cover the gap without increasing the penalty function significantly. Therefore the probabilities thus calculated may be misleading.
This is resolved (to a certain extent) as follows. Clusters, consisting of two or more consecutive bases of the same type, are identified and the least probable base of the cluster removed (or replaced, according to the values of Paltered,i). The model is relaxed and the probabilities of the remaining bases in the cluster recalculated. This is repeated until only one base remains unaltered in the cluster. Therefore, within a cluster, the probability assigned to a base is its probability once all less probable bases have been removed (or changed).
Reuniting the sections
Once the final models have been generated for all the sections of the DNA sequence, it is necessary to join the sections together to form a model for the entire sequence.
Each section overlaps the adjacent section by 50 samples, corresponding to about four bases. Figure 30 shows schematically how two overlapping sections are used to create a model spanning the overlap. The vertical lines are the bases; for clarity, every base is assumed to be of equal strength (as indicated by the height of the lines). The non-vertical (horizontal or sloping) dotted line shows how each model is scaled prior to forming the composite model. Each section (including the composite) is 1000 samples long and the overlap is 50 samples. The overlap contains bases from both models, scaled in strength. In the above example, the two original models are in places consistent: for example, both call Ts at places 20 and 22 in the sequence, although the exact positions vary slightly. The right-hand pair of Ts in the overlap section of the composite model are very close and would probably be deemed an obvious pair and merged. However, the right-hand section contains an extra base (type C) at position 24 not seen in the other section, and has called a base as G at position 26 where the left-hand section called A.
The process for joining the sections of model to form the composite model is as follows:
1 The average of each of the global parameters (background, correlations, and peak shape parameters) of the two models is used for the composite.
2 A mask is applied to each of the models, which acts as the scaling function for the base strengths. This mask is unity outside the overlap and linearly decreases to zero within the overlap (Figure 30); a sketch of this mask follows the numbered steps below. The principle behind this is that the closer a base is to the ends of the section, the less well determined its properties are likely to be, so the corresponding weighting should be less.
3 The two models are then added together, taking into account the offset of the two sections.
4 The positions and strengths of the bases of the composite model are then corrected to take into account any changes in the global parameters in step 1 (e.g. changes in σ and p will change the peak offset). Obvious pairs of bases are then merged into a single base, by adding their strengths and taking a weighted average of their positions. To be deemed a pair, two bases have to be of the same type and within a very small distance of each other. (Although not strictly required by the algorithm, this does make the inversion more stable.)
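The sketch below illustrates the mask of step 2 and its application, using the example dimensions given above (1000-sample sections with a 50-sample overlap); the dictionary representation of a base is an assumption carried over from the earlier sketches:

def overlap_mask(section_length=1000, overlap=50, left=True):
    """Unity outside the overlap, falling linearly to zero across it.
    For the left-hand section the ramp is at the right-hand end, and
    vice versa for the right-hand section."""
    mask = [1.0] * section_length
    for k in range(overlap):
        ramp = 1.0 - (k + 1) / overlap  # falls linearly towards zero
        index = section_length - overlap + k if left else overlap - 1 - k
        mask[index] = ramp
    return mask

def scale_base_strengths(bases, mask):
    """Scale each base's strength by the mask value at its position."""
    scaled = []
    for b in bases:
        j = min(max(int(round(b['position'])), 0), len(mask) - 1)
        scaled.append({**b, 'strength': b['strength'] * mask[j]})
    return scaled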
Even after the obvious pairs of bases have been merged the overlap section is still likely to be too crowded. This will be reflected in the spacing norm of the composite model. The mechanism for removing this excess is very similar to that described for calculating the probabilities of the bases called, and a flow diagram of this process is shown in Figure 31.
First of all the composite model is allowed to relax and its penalty function, E, calculated. Then each base within the overlap zone is removed in turn, the depleted model relaxed, and the new penalty function Edepleted calculated. The base with the lowest Edepleted is then removed, as long as Edepleted < E, and the model allowed to relax again and E recalculated. This process is repeated until none of the surviving bases has Edepleted < E.
The probabilities of the bases left in the overlap region are calculated as described above. This may require recalculating the probabilities of some of the bases outside the overlap if they form clusters with bases within the overlap region.
Results
For a number of DNA datasets, base sequences deduced by the modelling process were compared both to the true base sequence and to the base sequences called by the industry-standard Phred program. All the traces within a particular dataset were for the same DNA fragment, for which the "true" sequence had been deduced from a large number of individual traces from other experiments.
Figures 32A to 32C show the results of modelling for a very clean trace. Figure 32A shows the original data, Figure 32B shows the model with the height of each base proportional to its strength, and
Figure 32C is a trace predicted by the model. For this set of relatively clean data, the model matched the true answer exactly, as did Phred.
Figures 33A to 33C are the results of modelling a somewhat poorer dataset, with non-symmetric peak shapes and merged peaks. The plots are of the original data (Figure 33A), the model (Figure 33B) and the base probabilities (Figure 33C). In Figure 33C, for each base, the probability of each base type is indicated by the length of each line segment at that position. The probability of there being no base at all is plotted as a white line (and therefore appears as a gap) at the bottom of the vertical line.
The base at position 752 is indisputably a T, and so the thick line occupies the whole region from 0 to 1 in Figure 33C. The base at position 843 is most likely C so the thick dashed line is much longer than any of the other line styles at that position. The base at 832 has almost equal probability of being any of the base types although T is slightly more likely than the others. There is also a substantial probability of there being no base at all, as indicated by the gap from 0 to 0.25.
The model matches the true answer exactly with the exception that an extra base of type T has been inserted at position 832. However, as can be seen from the bottom plot, a very low probability has been assigned to this base. Phred made seven errors on the same dataset.
Figure 34 is a graph comparing base calling errors made by the process according to the invention with those made by Phred. The number of errors was calculated by an algorithm that sought to align the deduced and true sequences by adding or deleting bases to maximise the match. There are three types of error: insertion, where there is an extra base in the deduced model; deletion, where a base has been missed out; and substitution, where the base is of the wrong type. Each of these is given equal weighting. Each base call is either right or wrong: no account is taken of the base probability.
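The text does not give the alignment algorithm itself; the sketch below assumes a standard edit-distance alignment, which matches the description of insertions, deletions and substitutions carrying equal weight:

def count_errors(deduced, true_seq):
    """Minimum number of equally weighted insertions, deletions and
    substitutions needed to turn the deduced sequence into the true one."""
    n, m = len(deduced), len(true_seq)
    # dp[i][j] = errors aligning deduced[:i] with true_seq[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # i insertions (extra bases in the model)
    for j in range(m + 1):
        dp[0][j] = j          # j deletions (missed bases)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if deduced[i - 1] == true_seq[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # insertion
                           dp[i][j - 1] + 1,       # deletion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[n][m]

For example, count_errors('ACGT', 'AGGT') returns 1, corresponding to a single substitution.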
When calculating error rates for Phred, base calls of N (no base) are also counted as errors.
Conventionally the error rate is plotted as a function of position, where position is measured in terms of bases from the start of the DNA fragment, as shown in Figure 34. It can be seen that, as compared with Phred, the region of low error rate has been extended at both ends of the dataset: by between 10 and 20 bases at the start (from base 40 to base 20 or 30, depending on the exact threshold chosen) and by about 50 bases at the end (in the region of base number 900).
POSSIBLE IMPROVEMENTS
Further modifications and improvements to the base calling analysis can be made by accounting for the following properties in more detail. These represent minor modifications to the software or correlation statistics and are relatively simple to implement:
1) Feeding in prior knowledge of the correlation matrix, for example the spectral dependency of laser filters, spacings, etc.
2) Regions where traces saturate (i.e. are at their maximum possible value for a number of samples) could be identified and processed appropriately.
3) The modelling process contains many parameters (e.g. annealing temperature, relative weighting of the spacing norm and misfit in the penalty function). Performance may improve if these are fine-tuned.
4) Those skilled in the art will appreciate that any minimisation method can be used instead of least squares inversion.
5) The shape of the peaks could be modified to take into account the diffusion equation, the filter function and the fluorescence function.
6) Detailed chemistry effects could be taken into account in the correlation matrix, such as:
- In a series of consecutive As or Ts, the first base peak is large, but subsequent peaks can be small.
- Compression, especially in GC-rich regions.
- For a series of Ts followed by a G, the G peak can be very large and may have a shoulder on either side.
- Homopolymer regions (e.g. poly-A tails in cDNA) can cause enzyme slippage, causing the appearance of multiple overlapping sequences in the region beyond. Homopolymer regions can also cause 'wave' patterns in the peaks following them.
- Secondary structure formation can cause a 'false stop' for the enzyme, so that all 4 peaks attain high values at a base position.
- Dye primer peak patterns:
- In a run of 2 or more Cs, the first C is weaker, especially after G.
- The first A in a run of 2 or more As is large when preceded by a C.
- The second C in a string of 2 or more Cs can have an elevated signal.
- Dye terminator peak patterns:
- C after G is weak, especially if starting a run of Cs.
- T after G can be weak.
- A after T can be weak.
- C following a string of Ts can be enhanced.
- In a string of 4 or more Gs, the third can be weak, and the following Gs can be strong.
- G following a string of Ts can be very strong, and may obscure following peaks.
- In a string of As, the first is strong but the others may be weak, depending on what precedes the string.
- In a string of Ts, the first is strong but the others may be weak, depending on what precedes the string.
These effects can be taken into account by modifications to how the predicted values are calculated.
For example, if it is observed that in a series of consecutive A bases the later peaks are suppressed, then parameters can be introduced into the model which modify how the predicted values are calculated. The values of these parameters will determine just how much the predicted values for the later bases are suppressed.
Similarly if it is observed that, under certain circumstances, the shape of the peaks tends to be altered, then this can also be reflected in the way predicted values are calculated, with parameters controlling this.
The values of these parameters (and hence, for example, how much subsequent peaks are suppressed) should be deduced from the data in a similar fashion to the way the correlations are calculated. As this is effectively an averaging process, it may cause problems when certain scenarios occur only rarely, so that the values of the parameters are poorly constrained.
If the expected values for these parameters are known a priori, then this knowledge can also be used, by adding appropriate terms to the penalty function in a way similar to how the spacing norm is currently included. This will penalise models whose parameters differ greatly from those expected.
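A sketch of such a prior term; the quadratic form and the per-parameter weights are assumptions, since the text requires only that models whose parameters differ greatly from those expected are penalised:

def prior_penalty(params, expected, weights):
    """Weighted quadratic penalty that grows as each parameter departs
    from its a priori expected value."""
    return sum(w * (p - e) ** 2
               for p, e, w in zip(params, expected, weights))

This value would simply be added to the existing misfit and spacing-norm terms of the penalty function.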

CLAIMS:
1. A carrier which carries a computer program comprising processor implementable instructions for base calling DNA bases from a dataset corresponding to observed traces, the instructions comprising:
(i) code for generating a database in the computer memory corresponding to a model of a DNA sequence; and
(ii) code for refining the model of the DNA sequence by:
(a) making a change to the base sequence of the model and generating a database in the memory corresponding to the modified model;
(b) predicting the form of the traces from the modified model;
(c) comparing the predicted traces with the observed traces in the dataset to generate a penalty function;
(d) determining whether or not to accept the modified model as the best model so far based on the value of the penalty function; and
(e) repeating operations (a) to (d) until termination criteria have been achieved.
2. A carrier as claimed in claim 1, wherein the program includes code for generating a database in the memory into which the dataset corresponding to the observed traces can be read.
3. A carrier as claimed in claim 1 or claim 2, wherein the penalty function generated by the program includes a degree of misfit between the observed and the predicted traces, and a measure of divergence from a general model of DNA.
4. A carrier as claimed in claim 3, wherein the divergence from a general model of DNA is the irregularity of the spacing of the base peaks in the predicted traces.
5. A carrier as claimed in claim 3 or claim 4, wherein the penalty function is defined as (Ed + λs Es) / (D - Np)
where Ed is a measure of the discrepancy between observed and predicted traces; Es is a measure of the irregularity of spacing of the base peaks in the predicted traces; λs is a positive number;
D is the number of data samples in each trace and Np is the number of parameters in the model.
6. A carrier as claimed in any one of claims 1 to 5, wherein the code for refining the model makes changes in the base sequence that are at least partly random.
7. A carrier as claimed in any one of claims 1 to 6, wherein the code for refining the model adds a base to the base sequence during some of steps (a).
8. A carrier as claimed in any one of claims 1 to 7, wherein the code for refining the model removes a base from the base sequence during some of steps (a).
9. A carrier as claimed in any of claims 1 to 8, wherein the code for refining the model replaces one base with another base during some of steps (a).
10. A carrier as claimed in any one of claims 1 to 9, wherein the code for refining the model accepts the modified model if the penalty function of the modified model is less than that of the previous model.
11. A carrier as claimed in any one of claims 1 to 10 wherein, if the penalty function of the modified model is greater than that of the previous model, the code for refining the model may accept the modified model with a probability of acceptance that depends on the difference between the penalty function of the modified model and that of the previous model.
12. A carrier as claimed in claim 11, wherein the probability of acceptance of the modified model decreases as the number of steps (a) that have been performed increases .
13. A carrier as claimed in claim 12, wherein, if the penalty function of the modified model is greater than that of the previous model, the code for refining the model will accept the modified model as the model if the value of
exp(-ΔE/T)
where ΔE is the difference between the penalty function of the modified model and that of the previous model; and
T is a value that is reduced as the number of steps (a) that have been performed increases, is greater than a random number between 0 and 1.
14. A carrier as claimed in any one of claims 1 to 13, wherein the code for refining the model performs a least squares inversion process on any model to minimise the penalty function.
15. A carrier as claimed in claim 14, wherein the least squares inversion process is performed iteratively with at least five steps.
16. A carrier as claimed in claim 14 or claim 15, wherein the least squares inversion process is performed to minimise the penalty function with respect to the position of each base, the strength of each base signal, the background signal, the correlation between signals from different bases, the peak width and the peak skew.
17. A carrier as claimed in any one of claims 1 to 16, wherein the code for generating the database generates an initial model by assigning a base to a peak in a trace that is higher than any of the other traces at that position.
18. A carrier as claimed in claim 17, wherein the initial model is generated by setting the correlation of different traces to zero.
19. A carrier as claimed in claim 17 or claim 18, wherein the initial model is generated by setting the peak width of the predicted traces to the peak separation.
20. A carrier as claimed in any of claims 1 to 19, which includes code for dividing the traces generated from the electrophoresis into a plurality of smaller traces on which it operates to determine the base sequence.
21. A carrier as claimed in claim 20, wherein the traces are divided into smaller traces corresponding to from 30 to 200 bases.
22. A carrier as claimed in claim 21, wherein the smaller traces overlap smaller traces corresponding to adjacent portions of DNA sequence by at least two bases.
23. A carrier as claimed in any one of claims 20 to 22, which includes code for joining together the models of the base sequence corresponding to the smaller traces .
24. A carrier as claimed in claim 23, wherein the smaller traces overlap smaller traces corresponding to adjacent portions of DNA sequence by at least two bases, and after the code for joining the models has joined together the models of the base sequence, it refines those parts of the composite model so formed corresponding to the overlapping regions of the smaller traces by:
(1) removing each base from those parts of the model corresponding to the overlapping regions of the smaller traces in turn to form one or more depleted models;
(2) comparing the predicted trace of the or each depleted model in its overlapping region with the observed trace to calculate a penalty function;
(3) comparing the penalty function of the depleted model having the lowest penalty function with the penalty function of the undepleted model and, if its penalty function is lower, accepting the depleted model as the undepleted model for the next iteration; and
(4) repeating operations (1) to (3) until no further improvement in penalty function occurs.
25. A carrier as claimed in claim 24, wherein the code for joining together the models generates a penalty function in respect of those parts of the model corresponding to the overlapping regions of the smaller traces that includes a degree of misfit between the observed and the predicted traces and a measure of divergence from a general model of DNA for determining which model to reject.
26. A carrier as claimed in any one of claims 1 to 25, wherein the termination criteria include the fact that step (a) has been performed at least 100 times since a modified model has been accepted as the best model to date and that step (a) has been performed more times since a modified model has been accepted as the best model to date than the number of iterations it took to generate the latest modified model.
27. A carrier as claimed in any one of claims 1 to 26, wherein the termination criteria include the fact that step (b) has been performed at least 1000 times since a modified model has been accepted as the best model to date.
28. A carrier as claimed in any one of claims 1 to 27, wherein the termination criteria include the fact that step (b) has been performed at least 100,000 times.
29. A carrier as claimed in any one of claims 1 to 28, wherein the code for refining the model includes code for calculating the probability that the assignment of any base in the model is correct.
30. A carrier as claimed in claim 29, wherein the code for calculating the probability that the assignment of a base is correct: (1) predicts the form of the traces of the model after removal or change of the base and compares the predicted traces with the observed traces to generate a penalty function; and
(2) compares the penalty function obtained in step (1) with the penalty function of the model without removal or change of any base therefrom.
31. A carrier as claimed in claim 30, wherein the code for calculating the probability calculates the probability from the equation:
P = X / (X + Σi Xaltered,i)

where P is the probability that the assignment of a base in the model is correct,

X = exp(-E/σE) and Xaltered,i = exp(-Ealtered,i/σE),
where E is the penalty function of the model; Ealtered,i is the penalty function generated from the model after removal or change of the base, and
σE is a measure of a typical change in the penalty function caused by changing or removing a base.
32. A carrier as claimed in claim 30 or claim 31, wherein the code for refining the model calculates the probability of assignment of a base after the termination criteria have been satisfied.
33. A carrier which carries a computer program for sequencing DNA substantially as hereinbefore described with reference to, and as shown in, Figures 3 to 23 of the accompanying drawings.
34. A carrier as claimed in any one of claims 1 to 33, which comprises a storage medium.
35. A carrier as claimed in claim 34, wherein the storage medium is a ROM, a RAM or a magnetic storage medium.
36. A computer which includes a storage medium forming a carrier as claimed in any one of claims 1 to 34.
37. Apparatus for sequencing DNA, which comprises: (i) a vessel for receiving a sample of DNA to be sequenced and other reactants for growing a complementary DNA strand including terminator bases having characteristic absorption spectra;
(ii) means for performing electrophoresis on the contents of the vessel;
(iii) means for obtaining spectra from material as it finishes electrophoresis at a frequency that is characteristic of each terminator base to generate a trace corresponding to each base; and
(iv) a computer as claimed in claim 36 for determining the base sequence from the traces.
38. A method for sequencing DNA which comprises:
(i) mixing a sample of DNA to be sequenced with a quantity of primer and bases for growing a complementary DNA strand, the bases including terminator bases having characteristic absorption spectra;
(ii) performing electrophoresis on the contents of the vessel;
(iii) obtaining spectra from material as it finishes electrophoresis at a spectral frequency that is characteristic of each of the terminator bases to generate a dataset of observed traces corresponding to each base; and
(iv) determining the base sequence from the traces by means of a computer as claimed in claim 36.
39. A storage medium which stores a dataset corresponding to a DNA base sequence that has been obtained by a method as claimed in claim 38.


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ewing B et al., "Base-calling of automated sequencer traces using Phred. I. Accuracy assessment", Genome Research, Cold Spring Harbor Laboratory Press, US, vol. 8, 1998, pages 175-185, XP000915054, ISSN: 1088-9051 *



