US20040236776A1 - Method and apparatus for significance testing and confidence interval construction based on user-specified distributions - Google Patents

Publication number
US20040236776A1
Authority
US
United States
Prior art keywords
data set
numerical value
test statistic
statistical data
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/878,410
Inventor
Terrence Peace
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US10/878,410
Publication of US20040236776A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Definitions

  • the present invention relates to the analysis of statistical data, preferably on a computer and using a computer implemented program.
  • the invention more specifically relates to a method and apparatus that accurately analyzes statistical data when that data is not “normally distributed,” by which is meant, as used herein, that the data set does not correspond to a “normal probability distribution” or does not show a bell-shaped curve.
  • the subject invention therefore provides a method and apparatus capable of evaluating statistical data and outputting reliable analytical results without relying on transformation techniques.
  • U.S. Pat. No. 5,893,069 to White, Jr., entitled “System and method for testing prediction model,” discloses a computer implemented statistical analysis method to evaluate the efficacy of prediction models as compared to a “benchmark” model.
  • White discloses the “bootstrap” method of statistical analysis in that it randomly generates data sets from the empirical data set itself.
  • the subject invention removes this restriction so that any function of the data may be used as a test statistic.
  • the instant invention will permit all aspects of more than one distribution to be tested one against the other in a single analysis and determine significant differences, if any exist.
  • Yet another object of the present invention is to provide a method and apparatus that enables a user to perform sensitivity analysis on the underlying data.
  • the invention achieves the above objects by providing a technique to analyze empirical data within its original distribution rather than transforming it to a Normal distribution. It is preferably implemented using a digital processing computer, and therefore encompasses a computer as well as a method and program to be executed by a digital processing computer.
  • the technique comprises, in part, the computer generating numerous random data bases of the same size and distribution of the original database to provide comparisons to numerical relationships arising purely by chance.
  • the best mode of the invention requires input from the user defining a number of options, although alternative modes of the invention would involve the computer determining options at predetermined stages in the analysis.
  • the method and program disclosed herein is superior to prior art in that it allows data to be analyzed more accurately and efficiently, permits the data to be analyzed in accordance with any distribution (including the distribution which generated the data), avoids the errors which may be introduced by data transformation, and facilitates sensitivity analysis.
  • FIG. 1 is a schematic diagram of the hypothesis testing evaluation system.
  • FIGS. 2 a and 2 b depict a flow chart showing the steps for executing the hypothesis testing method and program.
  • FIG. 3 is a flow chart showing the steps for executing the hypothesis testing method and program in which the hypothesis takes the form of confidence intervals.
  • the present invention supplies a computer and appropriate software or programming that more accurately analyzes statistical data when that data is not “normally distributed.”
  • the invention therefore provides a method and apparatus for evaluating statistical data and outputting reliable analytical results without relying on traditional prior art transformation techniques, which introduce error.
  • the practice of the present invention results in several unexpectedly superior benefits over the prior art statistical normalizations.
  • the t-statistic, for example, is often used to test whether two samples have the same mean.
  • the numerical value of the t-statistic is calculated and then related to tables that had been prepared using a knowledge of the distribution of this test statistic.
  • Prior to the subject invention, a test statistic has been useless until its distribution has been discovered; thus, for all practical purposes, the number of potential test statistics has been relatively small. The subject invention removes this restriction; any function of the data may be used as a test statistic.
  • the invention enables the user to make inferences on multiple parameters simultaneously. For example, suppose that the null hypothesis (to be disproved) is that two distributions arising from two potentially related conditions are the same. Traditional data analysis might reveal that the two means are not quite significantly different, nor are the two variances. The result is therefore inconclusive; no formal test exists within the general linear model to determine if the two distributions are different and that this difference is statistically significant. The present invention will permit all aspects of both distributions to be tested one against the other in a single analysis and determine significant differences, if any exist.
  • sensitivity analysis is a natural extension of the data analysis under the invention, whereas sensitivity analysis is extremely difficult and impractical using current methods and software.
  • Sensitivity analysis examines the effect on conclusions of small changes in the assumptions. For example, if the assumption is that the process that generated the data is a Beta (2,4), then a repeat analysis under a slightly different assumption (e.g. Beta (2,5)) should not produce a markedly different result. If it does, conclusions obtained from the initial assumption should be treated with caution.
  • Such sensitivity analysis under the invention is simple and is suggested by the method itself.
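  • As a minimal sketch of such a sensitivity check, the Python below repeats one Monte Carlo analysis under Beta(2,4) and again under Beta(2,5), and reports the percentile of the observed statistic under each assumption. The function names, the choice of the sample mean as test statistic, the ten data values and the fixed seed are illustrative assumptions, not part of the disclosure.

```python
import random
import statistics


def monte_carlo_percentile(data, alpha, beta, n_iter=1000, seed=0):
    """Percent of simulated sample means that fall below the observed mean.

    Each simulated data set has the same size as `data` and is drawn
    from Beta(alpha, beta), the assumed generating distribution.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable runs
    observed = statistics.fmean(data)
    below = 0
    for _ in range(n_iter):
        sample = [rng.betavariate(alpha, beta) for _ in range(len(data))]
        if statistics.fmean(sample) < observed:
            below += 1
    return 100.0 * below / n_iter


def sensitivity(data, assumptions, **kwargs):
    """Repeat the analysis under each (alpha, beta) assumption."""
    return {
        params: monte_carlo_percentile(data, *params, **kwargs)
        for params in assumptions
    }


# Hypothetical observations on the unit interval.
data = [0.21, 0.35, 0.18, 0.42, 0.30, 0.27, 0.39, 0.25, 0.33, 0.29]
result = sensitivity(data, [(2, 4), (2, 5)])
for params, pct in result.items():
    print(f"Beta{params}: observed mean at about the {pct:.1f}th percentile")
```

  • If the two percentiles lead to different conclusions, the result obtained under the initial assumption would warrant the caution described above.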
  • U.S. Pat. No. 5,893,069 to White discloses a computer implemented statistical analysis method to evaluate the efficacy of prediction models as compared to a “benchmark” model.
  • the invention disclosed herein is superior to this prior art in that it tests the null hypothesis against entirely independent, randomly-generated data sets having the identical size and dimension as the original data set, with a distribution defined to best describe the process which generated the original data set under the null hypothesis.
  • the present invention is remarkably superior to that of White, in that the present invention enables the evaluation of an empirically determined test statistic by comparison to an unadulterated, randomly produced vector of values of that test statistic.
  • If the empirical test statistic falls within an extreme random-data-based range of values (e.g. above the 95th percentile or below the 5th percentile), the null hypothesis which is being tested can be rejected as false, with a high level of confidence that is not merited in the prior art with respect to non-normal data distributions. Therefore, the ability is greatly enhanced to determine accurately whether certain factors are significantly interrelated or whether certain populations are significantly different.
  • Statistical hypothesis testing is the basis of much statistical inference, including determining the statistical significance of regression coefficients and of a difference in the means.
  • a number of important problems in statistics can be reduced to problems in hypothesis testing, which can be analyzed using the disclosed invention.
  • One example is determining the likelihood ratio L, which itself is an example of a test statistic.
  • the likelihood ratio may be generalized so that different theoretical distributions are used in the numerator and denominator.
  • the likelihood ratio or its generalization may be invoked repeatedly to solve a multiple decision problem, in which more than two hypotheses are being tested. For example, in the case of testing an experimental medical treatment, the standard treatment would be abandoned only if the new treatment were notably better. The statistical analysis would therefore produce three relevant possibilities: an experimental treatment that is much worse, much better or about the same as the standard treatment, only one of which would result in rejection of the standard treatment. These types of multiple decision problems may be solved using the disclosed invention by the repeated use of the likelihood ratio as the test statistic.
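  • One way to picture the generalized likelihood ratio is the sketch below, which computes the ratio on a log scale with an exponential density in the numerator and a normal density in the denominator. The two distribution families, their parameters and all function names are assumptions chosen for illustration; the disclosure does not prescribe them.

```python
import math


def exponential_log_pdf(rate):
    """Return the log density function of an Exponential(rate) distribution."""
    def log_pdf(x):
        return math.log(rate) - rate * x if x >= 0 else float("-inf")
    return log_pdf


def normal_log_pdf(mu, sigma):
    """Return the log density function of a Normal(mu, sigma) distribution."""
    def log_pdf(x):
        z = (x - mu) / sigma
        return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return log_pdf


def log_likelihood(data, log_pdf):
    """Sum of log densities, assuming independent observations."""
    return sum(log_pdf(x) for x in data)


def log_likelihood_ratio(data, numerator_log_pdf, denominator_log_pdf):
    """log L = log-likelihood under the numerator minus that under the denominator."""
    return (log_likelihood(data, numerator_log_pdf)
            - log_likelihood(data, denominator_log_pdf))


# Hypothetical data; positive values favor the numerator distribution.
data = [0.3, 1.2, 0.7, 2.5, 0.1]
log_l = log_likelihood_ratio(data, exponential_log_pdf(1.0), normal_log_pdf(1.0, 1.0))
print(f"log likelihood ratio: {log_l:.4f}")
```

  • Working on the log scale avoids numerical underflow when many densities are multiplied; the resulting number can serve as a test statistic like any other function of the data.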
  • Prediction problems may also be analyzed, whether predicting future events from past observations of the same events (e.g. time series analysis), or predicting the value of one variable from observed values of other variables (e.g. regression).
  • the significance of the statistical model's performance, meaning the likelihood that the model would predict to the same level of accuracy due only to chance, may also be estimated.
  • the method and program disclosed may also be used in this case and, in most practical situations, will prove to be superior.
  • the instant invention may also be used to determine confidence intervals, which is a closely related statistical device. Whereas hypothesis testing begins with the numerical value of the test statistic and derives the respective probability, a confidence interval begins with a range of probabilities and derives a range of possible test statistics. A common confidence interval is the 95 percent confidence interval, and ranges between the two percentiles P2.5 and P97.5. Given the symmetrical relation of the two techniques, there would be nearly identical methods of calculation. A slight modification of the disclosed method, which is obvious to those skilled in the art, enables the user to construct confidence intervals as opposed to test hypotheses, with a greater level of accuracy.
  • this invention relates to determining the likelihood of a statistical observation given particular statistical requirements. It can be used to determine the efficacy of statistical prediction models, the statistical significance of hypotheses, and the best of several hypotheses under the multiple decision paradigm, as well as to construct confidence intervals, all without first transforming the data into a “normal” distribution. It is most preferably embodied on a computer, and is a method to be implemented by computer and a computer program that accomplishes the steps necessary for statistical analysis. Incorporation of a computer system is most preferred to enable the invention.
  • the computer system includes a digital processing apparatus, such as a computer or central processing unit 1 , capable of executing the various steps of the method and program.
  • the computer 1 is a personal computer known to those skilled in the art, such as those manufactured by IBM, Dell Computer Corporation, Hewlett Packard and Apple. Any corresponding operating system may be involved, such as those sold under the trademark “Windows.”
  • Other embodiments include networked computers, notebook computers, handheld computing devices and any other microprocessor-driven device capable of executing the steps disclosed herein.
  • the computer includes the set of computer-executable instructions 2 , in computer readable code, that encompass the method or program disclosed herein.
  • the instructions may be stored and accessible internally to the computer, such as in the computer's RAM, conventional hard disk drive, or any other executable data storage medium.
  • the instructions may be contained on an external data storage device compatible with a computer readable medium, such as a floppy diskette 3 , magnetic tape or compact disk, compatible with and executable by the computer 1 .
  • the system can include peripheral computer equipment known in the art, including output devices, such as a video monitor 4 and printer 5 , and input devices, such as a keyboard 6 and a mouse 7 .
  • Additional potential output devices include other computers, audio and visual equipment and mechanical apparatus.
  • Additional potential input devices include scanners, facsimile devices, trackballs, keypads, touch screens and voice recognition devices.
  • the computer executable instructions 2 begin by defining the structure of database 11 of FIG. 2, a flowchart of the computer executable steps.
  • the original data to be analyzed is collected into the database 12 .
  • This original data introduced at step 12 may consist of known empirical data; theoretical, hypothetical or other synthetically generated data; or any combination thereof.
  • the original database 12 is stored as a computer accessible database 8 of FIG. 1.
  • the database 8 can be internal to or remote from the computer 1 .
  • the database 8 can be input onto the computer accessible medium in any fashion desired by the user, including manually typing, scanning or otherwise downloading the database.
  • a test statistic 13 is then specified, together with a formal hypothesis 14 in terms of said test statistic, known as the null hypothesis, concerning database 12.
  • the term test statistic is used to denote a function of the data that will be used to test the hypothesis.
  • the terms “numerical value of the test statistic” and “numerical test statistic” denote a particular value calculated by using that function on a given data set. Determination of a test statistic may be accomplished, for example, as described in P. G. Hoel, S. C. Port & C. J. Stone, Introduction to Statistical Theory (1971), which is incorporated herein by reference, or by other known means.
  • Examples of test statistics include a two sample t-statistic, which approximates the “Student's t-distribution” under fairly general assumptions, the Pearson product-moment correlation coefficient r, and the likelihood ratio L.
  • Embodiments of the invention would include computing the numerical values of several test statistics simultaneously, in order to test compound hypotheses or to test several independent hypotheses at the same time.
  • Embodiments of the invention may include the realm of test statistics known in the art to be previously input to the computer and stored in computer accessible database 2 , either internal to or remote from the computer 1 . Specifying a test statistic 13 of FIG. 2 may then be accomplished by the user, when prompted in the course of program execution, selecting from the test statistic database. Likewise, the computer 1 may include executable instructions to select the test statistic 13 from the database of test statistics. It is also contemplated that the user might define their own test statistic.
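  • To illustrate the idea of a selectable test statistic, the sketch below treats each test statistic as an ordinary function and keeps the functions in a small registry from which one can be chosen by name. The registry, the Welch form of the two sample t-statistic and the user-defined `range_ratio` statistic are assumptions made for illustration only.

```python
import math
import statistics


def welch_t(x, y):
    """Two sample t-statistic in Welch's unequal-variance form (an assumption)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        vx / len(x) + vy / len(y)
    )


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of paired samples."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def range_ratio(x, y):
    """Hypothetical user-defined statistic: ratio of the two sample ranges."""
    return (max(x) - min(x)) / (max(y) - min(y))


# Stand-in for a stored collection of test statistics, keyed by name.
TEST_STATISTICS = {
    "welch_t": welch_t,
    "pearson_r": pearson_r,
    "range_ratio": range_ratio,
}

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
for name, func in TEST_STATISTICS.items():
    print(name, round(func(x, y), 4))
```

  • Because every entry has the same call signature, a statistic whose sampling distribution is unknown can be used in exactly the same way as a classical one.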
  • the hypothesis 14 may take several forms. Embodiments of this invention encompass any form of statistical problem that can be defined in terms of a hypothesis.
  • the formal hypothesis 14 would be a “null hypothesis” addressing, for example, the degree to which two variables represented in the original data set 12 are interrelated or the degree to which two variables have different means.
  • the formal hypothesis 14 may also take any form alternative to a null hypothesis.
  • the hypothesis may be a general hypothesis arising from a multiple decision problem, which results in the original data falling within one of three alternative possibilities.
  • the hypothesis represents the intended practical application of the computer and computer executable program, including testing the validity of prediction models and comparing results of experimental versus conventional medical treatments.
  • the computer determines a numerical value NTS of the test statistic 13 from the data set in terms of said formal hypothesis, as indicated in block 15 of FIG. 2.
  • Confidence intervals may also be constructed by a similar technique embodied by this invention, as indicated in FIG. 3.
  • the primary difference between FIG. 2 and FIG. 3 relates to the interchanged roles of test statistic and probability: in hypothesis testing the probability is derived from the test statistic, while in confidence interval determination, a range of test statistics is derived from probabilities. Otherwise, the basic underlying novel concept is the same.
  • the disclosed invention may be seen more clearly by reference to block 16 of FIG. 2 (and block 45 of FIG. 3).
  • the user specifies the probability distribution in block 16 that defines the original data set 12 .
  • This distribution is the one from which the user theorizes the data may have arisen under the hypothesis 14 .
  • Conventional data analysis usually specifies the normal probability distribution, but under the disclosed invention, any distribution of data may be tested.
  • One may appropriately specify the probability distribution from various considerations, such as theory, prior experimentation, the shape of the data's marginal distributions, intuition, or any combination thereof.
  • the types and application of common probability distributions of statistical data sets are set forth and described in detail in various texts, including by way of example N. L. Johnson & S. Kotz, Distributions in Statistics, Vols. 1-3 (1970), which is incorporated herein by reference.
  • Embodiments of the invention include the realm of statistical distributions known in the art to be previously input to the computer and stored in computer accessible database 8 of FIG. 1, either internal to or remote from the computer 1 .
  • the step in block 16 of specifying a distribution may then be performed by the computer based on its analysis of the original database 12 .
  • the user may specify the distribution by selecting from among the previously stored database of options, or defining any other distribution, including those not previously studied.
  • the number of iterations N to be performed by the computer in analyzing the hypothesis 14 is specified. This is an integer that, in the preferred embodiment, would be no less than 1,000.
  • the invention contemplates any number of iterations, the general rule being that the accuracy of testing the hypothesis 14 increases with the number of iterations N.
  • Factors affecting determination of N include the capabilities of computer 1 , including processor speed and memory capacity.
  • the computer then initializes variable i, setting it to zero in step 18. This variable indexes each random data set generated in subsequent steps.
  • the computer then enters a repetitive loop of generating data for purposes of comparing and analyzing the original database 12 .
  • the loop begins on each iteration with incrementing integer i by one.
  • the computer then generates a set of random data RDS(i) at block 20 of the same size, dimension and distribution as the original data set 12 .
  • the computer may generate the random data using any technique known to the art that approximates truly random results.
  • the preferred embodiment incorporates the so-called Monte Carlo technique, which is described in the published text G. S. Fishman, Monte Carlo: Concepts, Algorithms and Applications (1995), which is incorporated herein by reference.
  • the computer determines at block 21 a corresponding numerical value TS(i) of the test statistic, which is one example of a test statistic that might arise at random under the null hypothesis 14 , distributed as distribution 16 .
  • This numerical value is stored in a numerical test statistic array 22 .
  • the computer compares i with the value N to determine whether they are yet equal to one another. If i is still less than N, the computer returns to the beginning of the repetitive loop as shown in block 24 and increments variable i by one at block 19 . The computer then generates another set of random data RDS(i) at block 20 of the same size, dimension and distribution as the original data set 12 . Using this randomly generated data set, the computer again determines at block 21 a corresponding numerical value TS(i) of the test statistic and stores TS(i) in the numerical test statistic array 22 . This process is repeated until the computer determines that i equals N at the conclusion of the repetitive loop at decisional diamond 23 . At that time, the computer will have stored an array consisting of N numerical values of test statistics derived from randomly generated data sets.
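  • The repetitive loop just described can be sketched as follows. This is only an illustration under stated assumptions: the data set is one-dimensional, the user-specified distribution is supplied as a hypothetical `sampler` callable, and the names `simulate_test_statistics`, `rds` and `ts` are invented here to mirror RDS(i) and TS(i).

```python
import random
import statistics


def simulate_test_statistics(test_statistic, sampler, size, n_iter=1000, seed=None):
    """Generate n_iter random data sets and return their test statistic values.

    test_statistic: function mapping a data set (list) to a number
    sampler:        function taking a random.Random and returning one draw
                    from the distribution specified under the null hypothesis
    size:           number of observations in the original data set
    """
    rng = random.Random(seed)
    ts = []                                         # numerical test statistic array
    for i in range(1, n_iter + 1):                  # i runs from 1 to N
        rds = [sampler(rng) for _ in range(size)]   # random data set RDS(i)
        ts.append(test_statistic(rds))              # TS(i) is stored in the array
    return ts


# Example: the sample mean of 25 observations under an assumed Beta(2, 4).
def beta_2_4(rng):
    return rng.betavariate(2, 4)


ts = simulate_test_statistics(statistics.fmean, beta_2_4, size=25, n_iter=1000, seed=42)
print(len(ts), round(min(ts), 3), round(max(ts), 3))
```

  • Passing the distribution and the statistic as callables keeps the loop unchanged when either is swapped, which is what allows any distribution and any function of the data to be used.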
  • After the computer has stored an array of randomly generated numerical test statistics, it must determine where among them falls the numerical test statistic NTS corresponding to the original data set 12.
  • As used herein, the “percentile value” is the value of the data dependent statistic (e.g. the median or 50th percentile), and the “percentile index” is the ordinal number that defines the percentile (e.g. the 95th in 95th percentile).
  • More specifically, the computer must determine a percentile value P corresponding to NTS, so that the percentile index p may be determined. This percentile index p may then be used to infer the likelihood that the value of NTS arose by chance, which is the statistical significance of NTS.
  • the invention includes any manner of correlating NTS with a percentile value P based on the numerical test statistic array of randomly generated results.
  • a preferred embodiment of the invention is shown in blocks 25 through 35 of FIG. 2.
  • the preferred embodiment technique begins with initializing variable j to one at block 25 .
  • the computer sorts the numerical test statistic array into ascending order at block 26 , resulting in an ordered array OTS having the same dimensions and containing the same data as the test statistic array of step 22 .
  • the computer is able to systematically compare the original numerical test statistic NTS with the randomly based numbers to determine its corresponding percentile value P and associated percentile index p.
  • This systematic comparison begins at decision diamond 27, which first compares the numerical value NTS with the smallest numerical value in the array of stored numerical test statistics, defined as OTS(1). If NTS is less than OTS(1), then it is known that NTS is smaller than the entire set of numerical test statistics corresponding to randomly generated data sets having the same size, dimension and distribution as the original data set 12.
  • the computer determines that NTS is in the “zeroth” percentile, indicating that the original numerical test statistic NTS is an extreme data point beyond the bounds of the randomly generated values and, therefore, that the likelihood of the event happening by chance under a two-tailed null hypothesis is very remote. The conclusion of the computerized evaluation therefore may be to reject the null hypothesis or to re-execute the program using a higher value N to potentially expand the randomly generated comparison set.
  • the computer outputs its results as shown in block 28 , which include the percentile index zero of the original numerical test statistic NTS.
  • the invention contemplates any variation of data output at the final step, in any form compatible with the computer system.
  • a preferred embodiment is an output to a monitor 4 or printer 5 of FIG. 1 that identifies the numerical value of the test statistic NTS derived from the original data set 12 , the corresponding percentile index p relating to the likelihood of NTS arising by chance, and the number of random data sets N on which p is based. In this case of NTS being less than all randomly based test statistics, p would equal zero.
  • This raw percentile value may also be interpreted in terms of the null hypothesis 14 ; in the case of a two-tailed test, such an extreme value would lead to rejecting the null hypothesis, while in a one-tailed test this could lead to accepting the null hypothesis.
  • If the computer determines that OTS(1) is not greater than NTS, it moves to decision diamond 29, which tests the other extreme. In other words, the computer determines whether NTS is larger than the highest value OTS(N) of the numerical test statistics corresponding to randomly generated data sets having the same size and dimension as the original data set 12. If the answer is yes, then the computer determines that NTS is in the “one hundredth” percentile, usually indicating that the null hypothesis should be rejected because the test statistic is statistically significant (i.e. not likely to have resulted from chance). The results are then output as described above and as provided in block 30 of FIG. 2.
  • If NTS does not fall beyond either extreme, the computer moves to a repetitive loop, consisting of steps 31 through 33, which brackets NTS between two numerical test statistics arising from randomly generated data.
  • the variable j is incremented by 1 at block 31 .
  • the computer determines whether the numerical value OTS(j) is larger than the numerical value NTS. If not, the computer returns to the beginning of the loop at block 31, as indicated by block 33, increments j by one, and again compares the numerical value OTS(j) with NTS. This process is repeated until OTS(j) is larger than NTS, which means that NTS falls between OTS(j) and OTS(j − 1).
  • the percentile value P and associated percentile index p therefore correspond to this positioning of NTS on the ordered array OTS of test statistics correlating to randomly generated data sets. Once this bracketed value is known, the computer proceeds to output the results.
  • the output of a preferred embodiment will be a range of percentile indices, as shown in block 35 .
  • the percentile indices corresponding to the percentile values which bracket NTS are described as being between (j − 1)/N × 100 percent and j/N × 100 percent.
  • For example, if the repetitive loop of blocks 31 through 33 determines that OTS(950), out of a set of 1000 numerical test statistics arising from respective randomly generated databases, is the lowest value of OTS(j) higher than NTS, then the value of NTS lies between the percentile values with indices 949/1000 × 100% and 950/1000 × 100%, or 94.9% < P < 95.0%, where P in this case refers to the probability rather than the percentile, although of course the two are closely related. Probability P is estimated by the percentile indices. As described above, this information regarding the value of probability P is output from the computer among other relevant data as shown in block 35.
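  • A compact way to reproduce the bracketing of blocks 25 through 35 is sketched below. It uses a binary search from Python's standard `bisect` module in place of the step-by-step comparison loop, which is a substitution made here for brevity; the function name and the handling of an NTS exactly equal to the largest simulated value are likewise assumptions.

```python
from bisect import bisect_right


def percentile_bracket(nts, test_statistics):
    """Return (low, high) percentile indices, in percent, that bracket nts.

    (0, 0) means nts is below every simulated value; (100, 100) means it
    is at or above the largest one (treating equality this way is an
    assumption, since the text only describes the strictly greater case).
    """
    ots = sorted(test_statistics)        # ordered array OTS
    n = len(ots)
    if n == 0:
        raise ValueError("need at least one simulated test statistic")
    if nts < ots[0]:
        return (0.0, 0.0)
    if nts >= ots[-1]:
        return (100.0, 100.0)
    k = bisect_right(ots, nts)           # number of OTS values <= nts
    # With 1-based j = k + 1, OTS(j) is the smallest value larger than nts,
    # so the bracket is (j - 1)/N x 100 to j/N x 100.
    return (100.0 * k / n, 100.0 * (k + 1) / n)


# Simulated values 1..1000 and an NTS just below OTS(950).
low, high = percentile_bracket(949.5, range(1, 1001))
print(f"{low:.1f}% to {high:.1f}%")
```

  • With the inputs shown, OTS(950) is the lowest simulated value above NTS, matching the 949/1000 and 950/1000 indices in the example above.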
  • the output probability P reveals the likelihood that the original numerical value of the test statistic might have arisen from random processes alone.
  • the computer determines the “significance” of the original numerical test statistic NTS. For example, if the computer determines that NTS is within the 96th percentile among the numerical ordered test statistic array OTS, it may be safe to conclude that it did not occur by chance, but rather has statistical significance in a one-tailed test (i.e. it is significant at the 4 percent level). Based on this information, the original hypothesis 14, whether it represents a prediction model or a relationship between two variables represented in the original database 12, may be rejected.
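  • The arithmetic linking a percentile index to a significance level can be sketched as below. The upper-tail convention for the one-tailed case follows the 96th percentile example; the doubling rule for the two-tailed case, the 5 percent default threshold and all function names are common conventions assumed here rather than taken from the disclosure.

```python
def one_tailed_level(percentile_index):
    """Upper-tail significance level, in percent, for an index in [0, 100]."""
    if not 0.0 <= percentile_index <= 100.0:
        raise ValueError("percentile index must lie between 0 and 100")
    return 100.0 - percentile_index


def two_tailed_level(percentile_index):
    """Two-tailed level: twice the smaller tail (an assumed convention)."""
    smaller_tail = min(percentile_index, one_tailed_level(percentile_index))
    return min(100.0, 2.0 * smaller_tail)


def reject_null(percentile_index, alpha_percent=5.0, two_tailed=False):
    """True if the null hypothesis would be rejected at alpha_percent."""
    if two_tailed:
        level = two_tailed_level(percentile_index)
    else:
        level = one_tailed_level(percentile_index)
    return level <= alpha_percent


print(one_tailed_level(96.0))               # 4.0
print(reject_null(96.0))                    # True
print(reject_null(96.0, two_tailed=True))   # False
```

  • The last line shows why the choice between a one-tailed and a two-tailed test needs to be fixed before the result is interpreted: the same percentile can lead to opposite decisions.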
  • FIG. 3 shows a related embodiment using the same theory regarding generation of random databases of the same size, dimension and distribution as the original database.
  • Although the term “test statistic” is usually associated with hypothesis testing, this term will be retained in the discussion of confidence intervals in order to emphasize the essential similarity of the two procedures.
  • The term “test statistic” will be used to denote some function of the data to be found in the database, e.g. arithmetic mean, and will be used to subsume terms such as “estimator” and “decision function.” The initialization is identical to that shown in FIG.
  • the user specifies the size of the confidence interval at block 43 , having ends of the interval defined as “Lo” and “Hi.”
  • the confidence interval specified at this step usually would be symmetrical of size 95 percent. This means that, in this mode, the disclosed invention will identify the two values between which an event is 95 percent likely to occur.
  • For such an interval, the corresponding value of “Lo” is 0.025 and the corresponding value of “Hi” is 0.975 (which defines a 0.950 interval, or a 95 percent interval).
  • the disclosed invention continues as shown in FIG. 2 and described above.
  • the distribution is specified at block 44, the numerical value of the test statistic is calculated at block 45, the number of iterations is specified at block 46, and an array of random databases and the array of corresponding numerical values of the statistic are generated in the repetitive loop of blocks 48 to 52.
  • the numerical statistic array is then sorted at block 53 into ascending order to accommodate analysis of the numerical value of the statistic specified in block 42 and calculated in block 45 .
  • Blocks 55 through 58 determine the numerical values defining the high and low margins of the desired confidence intervals.
  • the computer determines which two values of OS to use in calculating the upper limit of the confidence interval, by multiplying Hi by N and identifying the smallest integer greater than or equal to that product. That integer and its successor are used to identify the required values of OS.
  • N is equal to 1000 and the confidence interval is symmetrical, in the preferred embodiment, the values of OS would be those indexed by 0.975 × 1000 = 975 and its successor, 976.
  • the upper endpoint of the confidence interval would be given by a function g of these two OS values, g(OS(975), OS(976)). Note that the functions f and g will depend on the current statistical practice and the philosophy of the developer, but will typically be functions such as maximum, minimum, or linear combination.
  • the final step of the confidence interval analysis is to output the relevant data, as shown in block 59 .

Abstract

A computer and computer implemented method and program product for analyzing statistical data in which the data to be analyzed need not be transformed into a “Normal” distribution, thus avoiding introduction of error. Generally, the computer first determines a test statistic (formula) and associated null hypothesis. Then the distribution from which the original data arose, consistent with the null hypothesis, is defined. The computer then produces numerous randomly-generated data sets of the identical size and dimensions of the original statistical data set, according to the distribution defined above. A numerical value of the test statistic is computed from the test statistic formula for each randomly generated data set and stored in a vectored array. The numerical value of the test statistic computed from the original statistical data is then compared with the array and the associated percentile determined. With this information, the significance of the numerical value of the test statistic derived from the original data can be determined and the null hypothesis may be rejected, and if so, at what level of significance. Embodiments of the invention may likewise be used in alternative statistical applications, including computation of confidence intervals and likelihood ratios.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation of U.S. patent application Ser. No. 09/594,144, filed on Jun. 15, 2000, the content of which is hereby incorporated by reference in its entirety. [0001]
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0002]
  • The present invention relates to the analysis of statistical data, preferably on a computer and using a computer implemented program. The invention more specifically relates to a method and apparatus that accurately analyzes statistical data when that data is not “normally distributed,” by which is meant, as used herein, that the data set does not correspond to a “normal probability distribution” or does not show a bell-shaped curve. [0003]
  • 2. Description of the Prior Art [0004]
  • Conventional data analysis involves the testing of statistical hypotheses for validation. The usual method for testing these hypotheses, in most situations, is based on the well known “General Linear Model,” which produces valid results only if the data are either normally distributed or approximately so. [0005]
  • Where the data set to be analyzed is not normally distributed, the known practice is to transform the data by non-linear transformation to comply with the assumptions of most statistical tests. This practice is disclosed in, for example, Hoaglin, Mosteller, Tukey, UNDERSTANDING ROBUST AND EXPLORATORY DATA ANALYSIS (1977), which is incorporated herein by reference. It was previously thought that data could be transformed to comply with known distributional assumptions without affecting the integrity of the analysis. More recent research has demonstrated, however, that the practice of non-linear transformation actually introduces unintended and significant error into the analysis. See, e.g., Terrence B. Peace, Ph.D., TRANSFORMATION AND CORRELATION (2000) and TRANSFORMATION AND T-TEST (2000), which are incorporated herein by reference. A solution to this problem is needed. The subject invention therefore provides a method and apparatus capable of evaluating statistical data and outputting reliable analytical results without relying on transformation techniques. [0006]
  • U.S. Pat. No. 5,893,069 to White, Jr., entitled “System and method for testing prediction model,” discloses a computer implemented statistical analysis method to evaluate the efficacy of prediction models as compared to a “benchmark” model. White discloses the “bootstrap” method of statistical analysis in that it randomly generates data sets from the empirical data set itself. [0007]
  • SUMMARY OF THE INVENTION
  • It is therefore an object of the invention disclosed herein to provide a method and apparatus, preferably implemented on a computer and with appropriate software, which more accurately analyzes statistical data distributed non-normally. [0008]
  • It is another object of the instant invention to provide a computer and computer implemented method and program by which statistical data can be analyzed under virtually any distributional assumptions, including normality. [0009]
  • It is yet another object of the invention to analyze said data without transforming the naturally occurring distribution of the original data into a Normal distribution, thereby avoiding errors which transformation may introduce into the analysis, said transformation preceding traditional data analysis techniques. [0010]
  • It is another object of the invention to enable and otherwise enhance sensitivity analysis to cross-check results of the analysis. [0011]
  • It is a further object of the present invention to provide a method and apparatus for the analysis of statistical data for use in various disciplines which rely in whole or part on statistical data analysis and forecasts, including marketing, economics, materials, administration and medical research. [0012]
  • It is an additional object of the present invention to provide a method and apparatus of statistical analysis which enable the user to construct new test statistics, rather than rely on those test statistics with distributions that have already been determined. The subject invention removes this restriction so that any function of the data may be used as a test statistic. [0013]
  • It is a further object of the present invention to provide a method and apparatus for statistical analysis that enables the user to make inferences on multiple parameters simultaneously. The instant invention will permit all aspects of more than one distribution to be tested one against the other in a single analysis and determine significant differences, if any exist. [0014]
  • Yet another object of the present invention is to provide a method and apparatus that enables a user to perform sensitivity analysis on the underlying data. [0015]
  • These and other objects will become readily apparent to a person of skill in the art having regard for this disclosure. [0016]
  • The invention achieves the above objects by providing a technique to analyze empirical data within its original distribution rather than transforming it to a Normal distribution. It is preferably implemented using a digital processing computer, and therefore encompasses a computer, as well as a method and program to be executed by a digital processing computer. The technique comprises, in part, the computer generating numerous random databases of the same size and distribution as the original database to provide comparisons to numerical relationships arising purely by chance. The best mode of the invention requires input from the user defining a number of options, although alternative modes of the invention would involve the computer determining options at predetermined stages in the analysis. The method and program disclosed herein are superior to the prior art in that they allow data to be analyzed more accurately and efficiently, permit the data to be analyzed in accordance with any distribution (including the distribution which generated the data), avoid the errors which may be introduced by data transformation, and facilitate sensitivity analysis. [0017]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention is illustrated in the figures of the accompanying drawings, which are meant to be exemplary and not limiting: [0018]
  • FIG. 1 is a schematic diagram of the hypothesis testing evaluation system. [0019]
  • FIGS. 2a and 2b depict a flow chart showing the steps for executing the hypothesis testing method and program. [0020]
  • FIG. 3 is a flow chart showing the steps for executing the hypothesis testing method and program in which the hypothesis takes the form of confidence intervals. [0021]
  • DETAILED DESCRIPTION OF INVENTION
  • As discussed above, the present invention supplies a computer and appropriate software or programming that more accurately analyzes statistical data when that data is not “normally distributed.” The invention therefore provides a method and apparatus for evaluating statistical data and outputting reliable analytical results without relying on traditional prior art transformation techniques, which introduce error. The practice of the present invention results in several unexpectedly superior benefits over the prior art statistical normalizations. [0022]
  • First, it enables the user to construct new and possibly more revealing test statistics, rather than relying on those test statistics with distributions that have already been determined. For example, the “t-statistic” is often used to test whether two samples have the same mean. The numerical value of the t-statistic is calculated and then related to tables that had been prepared using a knowledge of the distribution of this test statistic. Prior to the subject invention, a test statistic has been useless until its distribution has been discovered; thus, for all practical purposes, the number of potential test statistics has been relatively small. The subject invention removes this restriction; any function of the data may be used as a test statistic. [0023]
  • Second, the invention enables the user to make inferences on multiple parameters simultaneously. For example, suppose that the null hypothesis (to be disproved) is that two distributions arising from two potentially related conditions are the same. Traditional data analysis might reveal that the two means are not quite significantly different, nor are the two variances. The result is therefore inconclusive; no formal test exists within the general linear model to determine if the two distributions are different and that this difference is statistically significant. The present invention will permit all aspects of both distributions to be tested one against the other in a single analysis and determine significant differences, if any exist. [0024]
  • Third, sensitivity analysis is a natural extension of the data analysis under the invention, whereas sensitivity analysis is extremely difficult and impractical using current methods and software. Sensitivity analysis examines the effect on conclusions of small changes in the assumptions. For example, if the assumption is that the process that generated the data is a Beta (2,4), then a repeat analysis under a slightly different assumption (e.g. Beta (2,5)) should not produce a markedly different result. If it does, conclusions obtained from the initial assumption should be treated with caution. Such sensitivity analysis under the invention is simple and is suggested by the method itself. [0025]
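The sensitivity check described above can be sketched in a few lines. This is only an illustrative sketch, not the disclosed program: the function name `upper_tail_fraction`, the choice of the arithmetic mean as test statistic, the observed value 0.40, the sample size 20 and the iteration count are all assumptions made for the example. It estimates how often a random sample mean reaches the observed value under Beta(2,4) and again under Beta(2,5), so that the two results can be compared.

```python
import random
import statistics

def upper_tail_fraction(nts, a, b, size, n_iter=2000, seed=1):
    """Fraction of random Beta(a, b) samples whose mean is at least nts.

    Hypothetical helper: the test statistic (arithmetic mean) and all
    default values are illustrative assumptions, not taken from the source.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        sample = [rng.betavariate(a, b) for _ in range(size)]
        if statistics.mean(sample) >= nts:
            hits += 1
    return hits / n_iter

nts = 0.40   # assumed observed value of the test statistic
size = 20    # assumed size of the original data set
p_base = upper_tail_fraction(nts, 2, 4, size)  # initial assumption: Beta(2,4)
p_alt = upper_tail_fraction(nts, 2, 5, size)   # nearby assumption: Beta(2,5)
print(f"Beta(2,4): {p_base:.4f}  Beta(2,5): {p_alt:.4f}")
```

If the two fractions differ markedly, the conclusion drawn under the initial assumption would, as the text says, deserve caution.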
  • U.S. Pat. No. 5,893,069 to White discloses a computer implemented statistical analysis method to evaluate the efficacy of prediction models as compared to a “benchmark” model. However, the invention disclosed herein is superior to this prior art in that it tests the null hypothesis against entirely independent, randomly-generated data sets having the identical size and dimension as the original data set, with a distribution defined to best describe the process which generated the original data set under the null hypothesis. [0026]
  • The present invention is remarkably superior to that of White, in that the present invention enables the evaluation of an empirically determined test statistic by comparison to an unadulterated, randomly produced vector of values of that test statistic. Under the disclosed invention, when the empirical test statistic falls within an extreme random-data-based range of values (e.g. above the 95th percentile or below the 5th percentile), the null hypothesis which is being tested can be rejected as false, with a high level of confidence that is not merited in the prior art with respect to non-normal data distributions. Therefore, the ability is greatly enhanced to determine accurately whether certain factors are significantly interrelated or whether certain populations are significantly different. [0027]
  • Statistical hypothesis testing is the basis of much statistical inference, including determining the statistical significance of regression coefficients and of a difference in the means. A number of important problems in statistics can be reduced to problems in hypothesis testing, which can be analyzed using the disclosed invention. One example is determining the likelihood ratio L, which itself is an example of a test statistic. When formulated so that the likelihood ratio is less than one, then the null hypothesis is rejected when the likelihood ratio is less than some predetermined constant k. When the constant k is weighted by the so-called prior probabilities of Bayes Theory, the disclosed invention encompasses Bayesian analyses as well. As related to the disclosed invention, the likelihood ratio may be generalized so that different theoretical distributions are used in the numerator and denominator. [0028]
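One way to write the generalized likelihood ratio mentioned above is sketched below. The notation is assumed for illustration and does not appear in the source: x denotes the data, f_0 and f_1 the (possibly different) theoretical densities used in the numerator and denominator, and Θ_0, Θ_1 their parameter sets.

```latex
% Assumed notation; k is the predetermined rejection constant from the text.
L(x) = \frac{\sup_{\theta \in \Theta_0} f_0(x \mid \theta)}
            {\sup_{\theta \in \Theta_1} f_1(x \mid \theta)},
\qquad \text{reject } H_0 \text{ when } L(x) < k .
```

In the classical case, where f_0 = f_1 and Θ_0 is a subset of Θ_1, this ratio cannot exceed one; with different families in the numerator and denominator that bound is no longer guaranteed.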
  • Also, the likelihood ratio or its generalization may be invoked repeatedly to solve a multiple decision problem, in which more than two hypotheses are being tested. For example, in the case of testing an experimental medical treatment, the standard treatment would be abandoned only if the new treatment were notably better. The statistical analysis would therefore produce three relevant possibilities: an experimental treatment that is much worse, much better or about the same as the standard treatment, only one of which would result in rejection of the standard treatment. These types of multiple decision problems may be solved using the disclosed invention by the repeated use of the likelihood ratio as the test statistic. [0029]
  • Prediction problems may also be analyzed, whether predicting future events from past observations of the same events (e.g. time series analysis), or predicting the value of one variable from observed values of other variables (e.g. regression). The significance of the statistical model's performance, meaning the likelihood that the model would predict to the same level of accuracy due only to chance, may also be estimated. The method and program disclosed may also be used in this case and, in most practical situations, will prove to be superior. [0030]
  • The instant invention may also be used to determine confidence intervals, a closely related statistical device. Whereas hypothesis testing begins with the numerical value of the test statistic and derives the respective probability, a confidence interval begins with a range of probabilities and derives a range of possible test statistics. A common confidence interval is the 95 percent confidence interval, which ranges between the two percentiles P2.5 and P97.5. Given the symmetrical relation of the two techniques, the methods of calculation are nearly identical. A slight modification of the disclosed method, which is obvious to those skilled in the art, enables the user to construct confidence intervals as opposed to test hypotheses, with a greater level of accuracy. [0031]
  • Thus, this invention relates to determining the likelihood of a statistical observation given particular statistical requirements. It can be used to determine the efficacy of statistical prediction models, the statistical significance of hypotheses, and the best of several hypotheses under the multiple decision paradigm, as well as to construct confidence intervals, all without first transforming the data into a “normal” distribution. It is most preferably embodied on a computer, and is a method to be implemented by computer and a computer program that accomplishes the steps necessary for statistical analysis. Incorporation of a computer system is most preferred to enable the invention. [0032]
  • Referring to FIG. 1, the computer system includes a digital processing apparatus, such as a computer or central processing unit 1, capable of executing the various steps of the method and program. In the preferred embodiment, the computer 1 is a personal computer known to those skilled in the art, such as those manufactured by IBM, Dell Computer Corporation, Hewlett Packard and Apple. Any corresponding operating system may be involved, such as those sold under the trademark “Windows.” Other embodiments include networked computers, notebook computers, handheld computing devices and any other microprocessor-driven device capable of executing the steps disclosed herein. [0033]
  • As shown in FIG. 1, the computer includes the set of computer-executable instructions 2, in computer readable code, that encompass the method or program disclosed herein. The instructions may be stored and accessible internally to the computer, such as in the computer's RAM, conventional hard disk drive, or any other executable data storage medium. Alternatively, the instructions may be contained on an external data storage device compatible with a computer readable medium, such as a floppy diskette 3, magnetic tape or compact disk, compatible with and executable by the computer 1. [0034]
  • The system can include peripheral computer equipment known in the art, including output devices, such as a video monitor 4 and printer 5, and input devices, such as a keyboard 6 and a mouse 7. Embodiments of the invention contemplate any peripheral equipment available to the art. Additional potential output devices include other computers, audio and visual equipment and mechanical apparatus. Additional potential input devices include scanners, facsimile devices, trackballs, keypads, touch screens and voice recognition devices. [0035]
  • The computer executable instructions 2 begin by defining the structure of database 11 of FIG. 2, a flowchart of the computer executable steps. The original data to be analyzed is collected into the database 12. This original data introduced at step 12 may consist of known empirical data; theoretical, hypothetical or other synthetically generated data; or any combination thereof. The original database 12 is stored as a computer accessible database 8 of FIG. 1. The database 8 can be internal to or remote from the computer 1. The database 8 can be input onto the computer accessible medium in any fashion desired by the user, including manually typing, scanning or otherwise downloading the database. [0036]
  • Referring to FIG. 2, the user specifies a test statistic 13 and formal hypothesis 14 in terms of said test statistic, known as the null hypothesis, concerning database 12. The term test statistic is used to denote a function of the data that will be used to test the hypothesis. The terms “numerical value of the test statistic” and “numerical test statistic” denote a particular value calculated by using that function on a given data set. Determination of a test statistic may be accomplished as described in, for example, P. G. Hoel, S. C. Port & C. J. Stone, INTRODUCTION TO STATISTICAL THEORY (1971), which is incorporated herein by reference, or by other known means. Examples of test statistics include a two sample t-statistic, which approximates the “Student's t-distribution” under fairly general assumptions, the Pearson product-moment correlation coefficient r, and the likelihood ratio L. Embodiments of the invention would include computing the numerical values of several test statistics simultaneously, in order to test compound hypotheses or to test several independent hypotheses at the same time. [0037]
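Because a test statistic is simply a function of the data, two of the examples named above can be written as ordinary functions. The sketch below is illustrative only: the function names are hypothetical, and the t-statistic uses the unequal-variance (Welch) form, which is one common choice and is an assumption here since the text does not specify a variant.

```python
import math
import statistics

def two_sample_t(x, y):
    """Two-sample t-statistic, unequal-variance (Welch) form (assumed variant)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(vx / len(x) + vy / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient r of paired samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements, used only to exercise the functions.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(two_sample_t(x, y), pearson_r(x, y))
```

Any other function of the data set with the same call shape, such as a difference of medians, could be substituted without changing the rest of the procedure.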
  • Embodiments of the invention may include the realm of test statistics known in the art to be previously input to the computer and stored in computer accessible database 2, either internal to or remote from the computer 1. Specifying a test statistic 13 of FIG. 2 may then be accomplished by the user, when prompted in the course of program execution, selecting from the test statistic database. Likewise, the computer 1 may include executable instructions to select the test statistic 13 from the database of test statistics. It is also contemplated that the user might define their own test statistic. [0038]
  • The hypothesis 14, specified in terms of said test statistic 13, may take several forms. Embodiments of this invention encompass any form of statistical problem that can be defined in terms of a hypothesis. In the preferred embodiment of the invention, the formal hypothesis 14 would be a “null hypothesis” addressing, for example, the degree to which two variables represented in the original data set 12 are interrelated or the degree to which two variables have different means. However, the formal hypothesis 14 may also take any form alternative to a null hypothesis. [0039]
  • For example, the hypothesis may be a general hypothesis arising from a multiple decision problem, which results in the original data falling within one of three alternative possibilities. Regardless of the form, the hypothesis represents the intended practical application of the computer and computer executable program, including testing the validity of prediction models and comparing results of experimental versus conventional medical treatments. [0040]
  • Using the specified hypothesis 14, the computer determines a numerical value NTS of the test statistic 13 from the data set in terms of said formal hypothesis, as indicated in block 15 of FIG. 2. Confidence intervals may also be constructed by a similar technique embodied by this invention, as indicated in FIG. 3. The primary difference between FIG. 2 and FIG. 3 relates to the interchanged roles of test statistic and probability: in hypothesis testing the probability is derived from the test statistic, while in confidence interval determination, a range of test statistics is derived from probabilities. Otherwise, the basic underlying novel concept is the same. [0041]
  • The disclosed invention may be seen more clearly by reference to block 16 of FIG. 2 (and block 45 of FIG. 3). In the preferred embodiment, the user specifies the probability distribution in block 16 that defines the original data set 12. This distribution is the one from which the user theorizes the data may have arisen under the hypothesis 14. Conventional data analysis usually specifies the normal probability distribution, but under the disclosed invention, any distribution of data may be tested. One may appropriately specify the probability distribution from various considerations, such as theory, prior experimentation, the shape of the data's marginal distributions, intuition, or any combination thereof. The types and application of common probability distributions of statistical data sets are set forth and described in detail in various texts, including by way of example N. L. Johnson & S. Kotz, DISTRIBUTIONS IN STATISTICS, Vols. 1-3 (1970), which is incorporated herein by reference. [0042]
  • Embodiments of the invention include the realm of statistical distributions known in the art to be previously input to the computer and stored in computer accessible database 8 of FIG. 1, either internal to or remote from the computer 1. The step in block 16 of specifying a distribution may then be performed by the computer based on its analysis of the original database 12. In the alternative, the user may specify the distribution by selecting from among the previously stored database of options, or defining any other distribution, including those not previously studied. [0043]
  • As shown in the next block 17 of FIG. 2, the number of iterations N to be performed by the computer in analyzing the hypothesis 14 is specified. This is an integer that, in the preferred embodiment, would be no less than 1,000. The invention contemplates any number of iterations, the general rule being that the accuracy of testing the hypothesis 14 increases with the number of iterations N. Factors affecting determination of N include the capabilities of computer 1, including processor speed and memory capacity. The computer then initializes variable i, setting it to zero in step 18. This variable will index each randomly produced data set generated in subsequent steps. [0044]
  • In the preferred embodiment, beginning at block 19, the computer then enters a repetitive loop of generating data for purposes of comparing and analyzing the original database 12. The loop begins on each iteration with incrementing integer i by one. The computer then generates a set of random data RDS(i) at block 20 of the same size, dimension and distribution as the original data set 12. The computer may generate the random data using any technique known to the art that approximates truly random results. The preferred embodiment incorporates the so-called Monte Carlo technique, which is described in the published text G. S. Fishman, MONTE CARLO—CONCEPTS, ALGORITHMS AND APPLICATIONS (1995), which is incorporated herein by reference. [0045]
  • Using this randomly generated data set, the computer determines at block 21 a corresponding numerical value TS(i) of the test statistic, which is one example of a test statistic that might arise at random under the null hypothesis 14, distributed as distribution 16. This numerical value is stored in a numerical test statistic array 22. [0046]
  • At decision diamond 23, the computer compares i with the value N to determine whether they are yet equal to one another. If i is still less than N, the computer returns to the beginning of the repetitive loop as shown in block 24 and increments variable i by one at block 19. The computer then generates another set of random data RDS(i) at block 20 of the same size, dimension and distribution as the original data set 12. Using this randomly generated data set, the computer again determines at block 21 a corresponding numerical value TS(i) of the test statistic and stores TS(i) in the numerical test statistic array 22. This process is repeated until the computer determines that i equals N at the conclusion of the repetitive loop at decision diamond 23. At that time, the computer will have stored an array consisting of N numerical values of test statistics derived from randomly generated data sets. [0047]
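The loop of blocks 19 through 24 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed program: the function name `simulate_test_statistics`, the one-dimensional data layout, the sample values, the Beta(2,4) distribution and the arithmetic mean as test statistic are all chosen only for the example, and Python's standard pseudo-random generator stands in for whatever Monte Carlo technique an implementer prefers.

```python
import random
import statistics

def simulate_test_statistics(test_statistic, draw, size, n_iter=1000, seed=0):
    """Build the array TS(1..N) from N randomly generated data sets.

    test_statistic: function mapping a data set to a number (block 21)
    draw: function taking a random.Random and returning one random value
          from the distribution specified in block 16 (assumed interface)
    size: number of observations in the original data set
    """
    rng = random.Random(seed)
    ts = []
    for _ in range(n_iter):                      # i = 1 .. N
        rds = [draw(rng) for _ in range(size)]   # RDS(i), block 20
        ts.append(test_statistic(rds))           # TS(i), block 21, stored in array 22
    return ts

# Hypothetical original data set and its numerical test statistic NTS.
original = [0.21, 0.35, 0.18, 0.44, 0.29, 0.52, 0.31, 0.26]
nts = statistics.mean(original)

ts = simulate_test_statistics(
    statistics.mean,
    lambda rng: rng.betavariate(2, 4),  # assumed distribution under the null hypothesis
    size=len(original),
)
print(len(ts), nts)
```

A multi-dimensional data set would need a `draw` that returns a whole row, or a generator that builds the full table at once; that extension is left out here to keep the sketch short.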
  • After the computer has stored an array of randomly generated numerical test statistics, it must determine where among them falls the numerical test statistic NTS corresponding to the original data set 12. In this process, the value of the data dependent statistic, e.g. the median or 50th percentile, will be referred to as the “percentile value P” and the ordinal number that defines the percentile, e.g. the 95th in 95th percentile, will be referred to as the “percentile index p.” More specifically, the computer must determine a percentile value P corresponding to NTS, so that the percentile index p may be determined. This percentile index p may then be used to infer the likelihood that the value of NTS arose by chance, which is the statistical significance of NTS. [0048]
  • The invention includes any manner of correlating NTS with a percentile value P based on the numerical test statistic array of randomly generated results. However, a preferred embodiment of the invention is shown in blocks 25 through 35 of FIG. 2. The preferred embodiment technique begins with initializing variable j to one at block 25. The computer then sorts the numerical test statistic array into ascending order at block 26, resulting in an ordered array OTS having the same dimensions and containing the same data as the test statistic array of step 22. However, with the array arranged in an incrementally sorted format, the computer is able to systematically compare the original numerical test statistic NTS with the randomly based numbers to determine its corresponding percentile value P and associated percentile index p. [0049]
  • This systematic comparison begins at decision diamond 27, which first compares the numerical value NTS with the smallest numerical value in the array of stored numerical test statistics, defined as OTS(1). If NTS is less than OTS(1), then it is known that NTS is smaller than the entire set of numerical test statistics corresponding to randomly generated data sets having the same size, dimension and distribution as the original data set 12. The computer determines that NTS is in the “zeroth” percentile, indicating that the original numerical test statistic NTS is an extreme data point beyond the bounds of the randomly generated values and, therefore, that the likelihood of the event arising by chance under a two-tailed null hypothesis is very remote. The conclusion of the computerized evaluation therefore may be to reject the null hypothesis or to re-execute the program using a higher value N to potentially expand the randomly generated comparison set. [0050]
  • The computer outputs its results as shown in block 28, which include the percentile index zero of the original numerical test statistic NTS. The invention contemplates any variation of data output at the final step, in any form compatible with the computer system. A preferred embodiment is an output to a monitor 4 or printer 5 of FIG. 1 that identifies the numerical value of the test statistic NTS derived from the original data set 12, the corresponding percentile index p relating to the likelihood of NTS arising by chance, and the number of random data sets N on which p is based. In this case of NTS being less than all randomly based test statistics, p would equal zero. This raw percentile value may also be interpreted in terms of the null hypothesis 14; in the case of a two-tailed test, such an extreme value would lead to rejecting the null hypothesis, while in a one-tailed test this could lead to accepting the null hypothesis. [0051]
  • If at decision diamond 27 the computer determines that OTS(1) is not greater than NTS, it moves to decision diamond 29, which tests the other extreme. In other words, the computer determines whether NTS is larger than the highest value OTS(N) of the numerical test statistics corresponding to randomly generated data sets having the same size and dimension as the original data set 12. If the answer is yes, then the computer determines that NTS is in the “one hundredth” percentile, usually indicating that the null hypothesis should be rejected because the test statistic is statistically significant (i.e. not likely to have resulted from chance). The results are then output as described above and as provided in block 30 of FIG. 2. [0052]
  • If NTS does not fall beyond either extreme, the computer moves to a repetitive loop, consisting of steps 31 through 33, which brackets NTS between two numerical test statistics arising from randomly generated data. First, the variable j is incremented by 1 at block 31. Then, at decision diamond 32, the computer determines whether the numerical value OTS(j) is larger than the numerical value NTS. If not, the computer returns to the beginning of the loop at block 31, as indicated by block 33, increments j by one, and again compares the numerical value OTS(j) with NTS. This process is repeated until OTS(j) is larger than NTS, which means that NTS falls between OTS(j) and OTS(j−1). The percentile value P and associated percentile index p therefore correspond to this positioning of NTS on the ordered array OTS of test statistics correlating to randomly generated data sets. Once this bracketed value is known, the computer proceeds to output the results. [0053]
  • [0054] The output of a preferred embodiment will be a range of percentile indices, as shown in block 35. The percentile indices corresponding to the percentile values which bracket NTS are described as being between (j−1)/N×100 percent and j/N×100 percent. For example, if the repetitive loop of blocks 31 through 33 determines that OTS(950) out of a set of 1000 numerical test statistics arising from respective randomly generated databases is the lowest value of OTS(j) higher than NTS, then the value of NTS lies between the percentile values with indices (949/1000)×100% and (950/1000)×100%, or 94.9%<P<95.0%, where P in this case refers to the probability rather than the percentile, although of course the two are closely related. Probability P is estimated by the percentile indices. As described above, this information regarding the value of probability P is output from the computer among other relevant data as shown in block 35.
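The bracketing of NTS described for blocks 27 through 35 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bracket_percentile` is hypothetical, a binary search from the standard `bisect` module stands in for the linear loop of blocks 31 through 33, and the handling of a tie between NTS and the largest OTS value is an assumption, since the text only covers the strictly larger case.

```python
from bisect import bisect_right


def bracket_percentile(nts, ots):
    """Return (low, high) percentile indices, in percent, that bracket nts.

    ots must be sorted in ascending order; ots[0] plays the role of OTS(1)
    and ots[-1] the role of OTS(N).
    """
    n = len(ots)
    if n == 0:
        raise ValueError("ots must contain at least one test statistic")
    if nts < ots[0]:
        # NTS is below every simulated statistic: percentile index zero.
        return (0.0, 0.0)
    if nts > ots[-1]:
        # NTS is above every simulated statistic: the "one hundredth" percentile.
        return (100.0, 100.0)
    # k counts the simulated statistics that are <= nts, so the first value
    # strictly larger than nts is OTS(j) with j = k + 1 (1-based).
    k = bisect_right(ots, nts)
    if k == n:
        # Assumption: a tie with the maximum is reported as the top bracket.
        return (100.0 * (n - 1) / n, 100.0)
    # Bracket is (j-1)/N x 100 percent to j/N x 100 percent.
    return (100.0 * k / n, 100.0 * (k + 1) / n)
```

With 1000 sorted statistics in which OTS(950) is the first value above NTS, the function returns a bracket of roughly 94.9 to 95.0 percent, matching the worked example in the text.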
  • [0055] The output probability P reveals the likelihood that the original numerical value of the test statistic might have arisen from random processes alone. In other words, the computer determines the “significance” of the original numerical test statistic NTS. For example, if the computer determines that NTS is within the 96th percentile among the numerical ordered test statistic array OTS, it may be safe to conclude that it did not occur by chance, but rather has statistical significance in a one-tailed test (i.e. it is significant at the 4 percent level). Based on this information, the original hypothesis 14, whether it represents a prediction model or a relationship between two variables represented in the original database 12, may be rejected.
  • [0056] FIG. 3 shows a related embodiment using the same theory regarding generation of random databases of the same size, dimension and distribution as the original database. Although the term “test statistic” is usually associated with hypothesis testing, this term will be retained in the discussion of confidence intervals in order to emphasize the essential similarity of the two procedures. As before, the term “test statistic” will be used to denote some function of the data to be found in the database, e.g. arithmetic mean, and will be used to subsume terms such as “estimator” and “decision function”. The initialization is identical to that shown in FIG. 2, except that instead of specifying a null hypothesis at block 14, the user specifies the size of the confidence interval at block 43, having ends of the interval defined as “Lo” and “Hi.” As a practical matter, the confidence interval specified at this step usually would be symmetrical of size 95 percent. This means that, in this mode, the disclosed invention will identify the two values between which an event is 95 percent likely to occur. The corresponding value of “Lo” is 0.025 and the corresponding value of “Hi” is 0.975 (which defines a 0.950 interval, or a 95 percent interval).
  • [0057] After the confidence interval is specified, the disclosed invention continues as shown in FIG. 2 and described above. The distribution is specified at block 44, the numerical value of the test statistic is calculated at block 45, the number of iterations is specified at block 46 and an array of random databases and the array of corresponding numerical values of the statistic are generated in the repetitive loop of blocks 48 to 52. Also, the numerical statistic array is then sorted at block 53 into ascending order to accommodate analysis of the numerical value of the statistic specified in block 42 and calculated in block 45.
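The generation loop of blocks 48 to 52 and the sort at block 53 might look like the following minimal Python sketch. The names `simulate_statistics`, `sampler` and `statistic` are hypothetical, the data sets are assumed to be one-dimensional lists of numbers, and the standard-library `random.Random` generator is only one possible source of random draws; the text does not prescribe any of these choices.

```python
import random
import statistics


def simulate_statistics(sampler, statistic, size, n_iter, seed=None):
    """Build the sorted array of statistics from n_iter random data sets.

    sampler(rng) draws one observation from the user-specified distribution;
    statistic(data) computes the chosen statistic for one data set.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_iter):
        # One random data set of the same size as the original data set.
        data = [sampler(rng) for _ in range(size)]
        values.append(statistic(data))
    # Ascending order, as required by the later percentile lookups.
    values.sort()
    return values


if __name__ == "__main__":
    # Hypothetical usage: standard normal data, arithmetic mean as statistic.
    os_values = simulate_statistics(
        lambda rng: rng.gauss(0.0, 1.0), statistics.mean,
        size=30, n_iter=1000, seed=1,
    )
    print(os_values[0], os_values[-1])
```

Passing the distribution as a callable keeps the sketch independent of any particular distribution, which is in the spirit of a user-specified distribution.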
  • [0058] Hereafter, the process is customized to the extent necessary to format usable and appropriate output from the computer. Blocks 55 through 58 determine the numerical values defining the high and low margins of the desired confidence intervals. At blocks 55 and 56, the computer determines which two values of OS to use in calculating the lower limit of the confidence interval, by multiplying Lo by N and identifying the greatest integer less than or equal to that product. That integer and its successor are used to identify the required values of OS. Assuming that N was specified as 1000, with a symmetric 95 percent confidence interval, in the preferred embodiment, the indices of the OS values would be 0.025×1000=25 and the next higher index, 26. The lower endpoint of the confidence interval would be given by a function f of these two OS values, f(OS(25), OS(26)).
  • [0059] Similarly, at blocks 57 and 58, the computer determines which two values of OS to use in calculating the upper limit of the confidence interval, by multiplying Hi by N and identifying the smallest integer greater than or equal to that product. That integer and its successor are used to identify the required values of OS. Again assuming N is equal to 1000 and the confidence interval is symmetrical, in the preferred embodiment, the indices of the OS values would be 0.975×1000=975 and its successor, 976. The upper endpoint of the confidence interval would be given by a function g of these two OS values, g(OS(975), OS(976)). Note that the functions f and g will depend on the current statistical practice and the philosophy of the developer, but will typically be functions such as maximum, minimum, or linear combination. The final step of the confidence interval analysis is to output the relevant data, as shown in block 59.
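The endpoint selection of blocks 55 through 58 can be illustrated with a short Python sketch. The function names `confidence_interval` and `midpoint` are hypothetical, and the default choice of a simple average for both f and g is an assumption made here for illustration; the text leaves f and g to the developer. The rounding before `floor` and `ceil` is a guard against floating-point error in the products Lo×N and Hi×N and is likewise not taken from the text.

```python
import math


def midpoint(a, b):
    """One possible linear combination of two neighbouring OS values."""
    return (a + b) / 2.0


def confidence_interval(os_values, lo=0.025, hi=0.975, f=midpoint, g=midpoint):
    """Return (lower, upper) endpoints from the sorted statistic array.

    os_values must be sorted in ascending order; os_values[k - 1] plays the
    role of OS(k), because the text numbers the array from 1.
    """
    n = len(os_values)
    i_lo = math.floor(round(lo * n, 9))  # greatest integer <= Lo x N
    i_hi = math.ceil(round(hi * n, 9))   # smallest integer >= Hi x N
    if i_lo < 1 or i_hi + 1 > n:
        raise ValueError("N is too small for the requested interval")
    lower = f(os_values[i_lo - 1], os_values[i_lo])  # f(OS(i_lo), OS(i_lo + 1))
    upper = g(os_values[i_hi - 1], os_values[i_hi])  # g(OS(i_hi), OS(i_hi + 1))
    return (lower, upper)
```

For N = 1000 with Lo = 0.025 and Hi = 0.975, the sketch reads OS(25), OS(26), OS(975) and OS(976), as in the worked example. Passing `f=min, g=max` instead of the defaults gives the widest interval those four values allow.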
  • [0060] While the invention as herein described is fully capable of attaining the above-described objects, it is to be understood that it is the preferred embodiment of the present invention and is thus representative of the subject matter which is broadly contemplated, that the scope of the present invention fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of the present invention is accordingly to be limited by nothing other than the appended claims.

Claims (20)

What is claimed:
1. An apparatus for analyzing statistical data, the apparatus comprising a computing device for executing computer readable code and having an input device; a storage device in communication with the computing device; and a programming code reading device in communication with the computing device, which reads computer executable code, the computer executable code causing the computing device to:
receive a set of original statistical data and store the statistical data set in the storage device;
calculate a numerical value corresponding to the statistical data set according to a test statistic formula;
receive a probability distribution relating to the statistical data set;
generate a plurality of random data sets of at least the same size and dimension as the statistical data set and distributed according to the probability distribution;
calculate a numerical value corresponding to each of the plurality of random data sets according to the test statistic formula to produce a corresponding plurality of numerical values; and
compare the numerical value calculated from the statistical data set to the plurality of numerical values calculated from the plurality of random data sets to determine a relationship between them.
2. The apparatus of claim 1, in which the relationship between the numerical value calculated from the statistical data set and the plurality of numerical values calculated from the plurality of random data sets is defined by a null hypothesis, which null hypothesis is accepted or rejected based on the relationship.
3. The apparatus of claim 1, in which the relationship between the numerical value calculated from the statistical data set and the plurality of numerical values calculated from the plurality of random data sets is indicative of a confidence interval, which confidence interval is defined by certain of the plurality of numerical values calculated from the plurality of random data sets.
4. The apparatus of claim 1, in which the plurality of random data sets is generated using random processes expressed in a Monte Carlo technique.
5. The apparatus of claim 1, in which the computing device receives and stores a plurality of probability distributions in the computer data storage device and determines the probability distribution by comparing the original data set with the plurality of stored probability distributions.
6. The apparatus of claim 1, in which the probability distribution of the statistical data set is derived by the computing device.
7. A method for analyzing statistical data, comprising:
collecting a set of original data;
calculating a numerical value corresponding to the statistical data set according to a specified test statistic formula;
specifying a probability distribution relating to the statistical data set;
generating a plurality of random data sets of at least the same size and dimension as the statistical data set and distributed according to the probability distribution;
calculating a numerical value corresponding to each of the plurality of random data sets according to the test statistic formula to produce a corresponding plurality of numerical values;
calculating a plurality of percentile values and corresponding percentile indices from the plurality of numerical values; and
comparing the numerical value calculated from the statistical data set to at least one of the plurality of percentile values to determine a relationship between them.
8. The method of claim 7, in which the relationship between the numerical value calculated from the statistical data set and the plurality of numerical values calculated from the plurality of random data sets is defined by a null hypothesis, which null hypothesis is accepted or rejected based on the relationship.
9. The method of claim 7, in which the relationship between the numerical value calculated from the statistical data set and the plurality of numerical values calculated from the plurality of random data sets is indicative of membership in a confidence interval, which confidence interval is defined by certain of the plurality of numerical values calculated from the plurality of random data sets.
10. The method of claim 7, in which the plurality of random data sets is generated using random processes expressed in a Monte Carlo technique.
11. The method of claim 7, in which the steps are implemented by a computing apparatus comprising a computing device having an input device, a data storage device in communication with the computing device, and programming code readable by the computing device.
12. A method for analyzing an original statistical data set, the original statistical data set having a size, a dimension and a distribution in accordance with a specified probability distribution, the method comprising:
generating a plurality of random data sets, each random data set having the size, the dimension and the distribution as the original statistical data set;
calculating a plurality of numerical values of test statistics corresponding to the plurality of random data sets, each numerical value being calculated according to a test statistic formula; and
determining a relationship between the plurality of numerical values and a numerical value of a test statistic of the original data set, calculated in accordance with the test statistic formula.
13. The method for analyzing the original statistical data set according to claim 12, in which the relationship between the plurality of numerical values and the numerical value corresponding to the original statistical data set tests whether the original statistical data set is characterized by at least one factor that is not based on chance.
14. A method for testing validity of a prediction model based on an original data set, comprising:
deriving the prediction model;
specifying a test statistic formula relating to the derived prediction model;
computing a numerical value NTS of the test statistic using the test statistic formula and the original data set;
specifying a probability distribution relating to the original data set;
creating a plurality of random data sets RDB(i) using randomly generated data, in which i is a positive integer;
computing a plurality of numerical values TS(i) of the test statistic corresponding to the plurality of random data sets RDB(i), and storing each numerical value TS(i) in a numerical test statistic array; and
comparing the numerical value NTS with the numerical test statistic array to determine a non-empty set of percentile values corresponding to the numerical value NTS and an associated non-empty set of percentile indices.
15. The method for testing validity of a prediction model according to claim 14, in which creating the plurality of random data sets RDB(i) comprises using randomly generated data according to a Monte Carlo technique.
16. The method of claim 14, in which the prediction model is derived from at least observations of a time series made before a time t.
17. The method of claim 16, further comprising determining the validity of the prediction model by comparing predictions of the time series made by the prediction model with observations of the time series made after the time t.
18. The method of claim 14, further comprising modifying the prediction model based on the determined validity of the prediction model.
19. The method of claim 14, in which the prediction model is selected from among at least two previously derived prediction models.
20. The method of claim 14, in which the prediction model is derived from at least observed values of variables other than time series.
US10/878,410 2000-06-15 2004-06-29 Method and apparatus for significance testing and confidence interval construction based on user-specified distributions Abandoned US20040236776A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/878,410 US20040236776A1 (en) 2000-06-15 2004-06-29 Method and apparatus for significance testing and confidence interval construction based on user-specified distributions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/594,144 US6847976B1 (en) 2000-06-15 2000-06-15 Method and apparatus for significance testing and confidence interval construction based on user-specified distribution
US10/878,410 US20040236776A1 (en) 2000-06-15 2004-06-29 Method and apparatus for significance testing and confidence interval construction based on user-specified distributions

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/594,144 Continuation US6847976B1 (en) 2000-06-15 2000-06-15 Method and apparatus for significance testing and confidence interval construction based on user-specified distribution

Publications (1)

Publication Number Publication Date
US20040236776A1 true US20040236776A1 (en) 2004-11-25

Family

ID=32908836

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/594,144 Expired - Fee Related US6847976B1 (en) 2000-06-15 2000-06-15 Method and apparatus for significance testing and confidence interval construction based on user-specified distribution
US10/878,410 Abandoned US20040236776A1 (en) 2000-06-15 2004-06-29 Method and apparatus for significance testing and confidence interval construction based on user-specified distributions

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/594,144 Expired - Fee Related US6847976B1 (en) 2000-06-15 2000-06-15 Method and apparatus for significance testing and confidence interval construction based on user-specified distribution

Country Status (1)

Country Link
US (2) US6847976B1 (en)

Also Published As

Publication number Publication date
US6847976B1 (en) 2005-01-25

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION