Comparison of Cox, Weibull and Gompertz Regression Models in Survival Analysis Using Breast Cancer Data
CHAPTER ONE
Objectives of the study
The following are the objectives of the study:
- To describe the survival function using the Kaplan-Meier (K-M) approach, and to compare the survival curves using Log-rank tests.
- To fit the three survival analysis models to data on breast cancer.
- To evaluate the fitted models using model comparison.
CHAPTER TWO
LITERATURE REVIEW
Introduction
The main benefits of survival models are key to their growth in popularity. First, survival models are able to take time-varying variables into account within the modeling process (Golub, 2007). Secondly, survival models are not restricted to the assumption that the distributions of the variables in the data must be normal (Sloot and Verchuren, 1990). Thirdly, survival models only produce positive predictions of time (Gokovali et al., 2007). These properties matter because survival time may not follow a normal distribution, must be positive in predictions, and may be influenced by time-varying variables.
Survival analysis is a collection of statistical procedures for analysing data for which the outcome variable of interest is the time until an event occurs. It is the study of the time between entry into observation and a subsequent event. The term 'survival analysis' came into being from initial studies, where the event of interest was death. The term 'event' simply means a transition from one discrete state to another at an instantaneous moment in time. Nowadays, the scope of survival analysis has become wide. Scientists use it for time until onset of disease and time until an earthquake, financial experts use it for time until a stock market crash, engineers use it for time until equipment fails, and so on. The most common events studied are death, disease, relapse, and recovery. A few examples of studies where the tools of survival analysis are used are: leukemia (blood cancer) patients and time in remission; time to develop heart disease for normal individuals; a 13-year follow-up of an elderly population (60+ years) to see how long subjects remain alive; and time until death for heart transplant patients (Kleinbaum and Klein, 2005).
It is not known exactly when the statistical procedures of survival analysis originated. They probably originated centuries ago, but it was only after World War II that a new era of survival analysis emerged, stimulated by an interest in the reliability of military equipment. At the end of the war, these newly developed statistical methods, which had moved from strict mortality data research to failure time research, quickly spread through private industry as customers became more demanding of safer and more reliable products. As the uses of survival analysis grew, parametric models gave way to nonparametric and semi-parametric approaches because of their appeal in dealing with the ever-growing field of clinical trials in medical research.
Survival analysis was well suited for such work because medical intervention follow-up studies could start without all experimental units enrolled at the start of the observation period and could end before all experimental units had experienced an event. This is extremely important because even in the best-designed studies there will be subjects who choose to quit participation, who move too far away to follow, or who die from some unrelated event. The researcher was no longer forced to withdraw the experimental unit and all associated data from the study; instead, a technique called censoring enabled researchers to analyze data that were incomplete because of delayed entry or withdrawal from the study. This allowed each experimental unit to contribute all of the information possible to the model for the amount of time the researcher was able to observe the unit. The last great stride in the application of survival analysis techniques has been a direct result of the availability of software packages and high-performance computers, which are now able to run these computationally intensive algorithms relatively efficiently.
In medical research, three classes of survival analysis techniques are mainly used to obtain estimates of survival probabilities and of the mean or median survival time: non-parametric, semi-parametric and parametric.
Survival analysis is also used for the comparison of two or more treatments or procedures.
CHAPTER THREE
METHODOLOGY
Introduction
The data on 312 cases of breast cancer used in this study were obtained from Ahmadu Bello University Teaching Hospital, Shika, Zaria, and cover the period from January 1997 to December 2012.
The survival analysis methods used in the analysis of the data comprise non-parametric, semi-parametric and parametric models:
- Non-parametric – Kaplan-Meier or Product Limit Estimator
- Semi-parametric – Cox Proportional Hazard Model
- Parametric – Weibull and Gompertz Regression Models
These models were considered because they are widely used in fitting survival data. They were fitted to the data with a view to finding the best fit.
The statistical package Stata version 12 was used for analysing the data.
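For reference, the three regression models can be written in proportional hazards form as sketched below. The notation is an assumption made here for illustration and is not taken from the study: x is the covariate vector, β the coefficient vector, h_0(t) the baseline hazard, and λ, p and γ the baseline parameters in one common parameterisation.

```latex
% Proportional hazards form shared by all three models
h(t \mid x) = h_0(t)\,\exp(x^{\top}\beta)

% Cox: h_0(t) is left unspecified (semi-parametric)
% Weibull baseline hazard (\lambda > 0, p > 0)
h_0(t) = \lambda\, p\, t^{\,p-1}

% Gompertz baseline hazard (\lambda > 0)
h_0(t) = \lambda\, \exp(\gamma t)

% Hazard ratio for a one-unit increase in covariate x_j
\mathrm{HR}_j = \exp(\beta_j)
```

Under this form the models differ only in how the baseline hazard is treated, which is why their hazard ratios are directly comparable.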
Kaplan-Meier Estimator
In cancer trials, the Kaplan-Meier (K-M) method is the recommended technique in survival analysis, and it is the most popular method for developing the survival function (Collett, 2003). The method is used to measure the fraction of subjects living for a certain period of time after treatment. It is applied by analyzing the distribution of patients' survival times following their recruitment to a study, and it expresses these in terms of the proportion of patients still alive up to a given time after recruitment. Graphically, a plot of the proportion of patients surviving against time has a characteristic decline; the steepness of the curve indicates the efficacy (or ability to produce a desired result) of the treatments being investigated. A shallower curve indicates a more effective treatment. In analysing survival data, two functions that depend on time are of particular interest: the survival function and the hazard function. The survival function, denoted by S(t), is the probability of surviving at least to time t. The hazard function, denoted by h(t), is the instantaneous risk of dying at time t, conditional on having survived to that time. The graph of S(t) against t is called the survival curve.
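The product-limit calculation behind the survival curve can be illustrated with a minimal sketch. At each distinct event time the running survival probability is multiplied by (1 - deaths / number at risk). The function names and the toy data are hypothetical and are not the study's data; the study itself used Stata.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- observed follow-up times
    events -- 1 if the event (death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    curve = []
    surv = 1.0
    for t in sorted(deaths):
        # Subjects still under observation just before time t
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, surv))
    return curve


def median_survival(curve):
    """Smallest event time at which S(t) falls to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data (hypothetical): five subjects, two of them censored
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_median = median_survival(toy_curve)
print(toy_curve, toy_median)
```

Censored subjects stay in the risk set up to their censoring time and then leave it without counting as deaths, which is how incomplete follow-up still contributes information.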
CHAPTER FOUR
ANALYSIS AND DISCUSSION OF RESULTS
Introduction
In this chapter, the results obtained from the analysis of the data are discussed. The data collected for the analysis are attached as Appendix I. The K-M approach was used to describe the survival functions of the breast cancer patients, and Log-rank tests were used to compare the survival curves of the patients. The breast cancer data were used to fit the three models, viz. the Cox Proportional Hazard Model, the Weibull Regression Model and the Gompertz Regression Model. The models were evaluated using the Akaike Information Criterion (AIC) to determine the best model among the three. The model with the lowest AIC value is usually considered the best in terms of goodness of fit to the breast cancer data.
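The AIC comparison can be sketched as follows, using the standard definition AIC = -2 log L + 2k, where log L is the maximised log-likelihood and k the number of estimated parameters. The model labels, log-likelihoods and parameter counts below are hypothetical placeholders and are not the study's results.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


def best_by_aic(fits):
    """fits maps a model label to (log_likelihood, n_params).

    Returns the label with the lowest AIC and the full score table.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores


# Hypothetical values for illustration only
hypothetical_fits = {
    "model_A": (-520.0, 7),
    "model_B": (-410.0, 9),
    "model_C": (-415.0, 9),
}
best, scores = best_by_aic(hypothetical_fits)
print(best, scores)
```

The 2k term penalises extra parameters, so a model with a slightly higher likelihood does not automatically win.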
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
Summary
Using the Kaplan-Meier survival analysis procedure, the study examined the distribution of survival time of the breast cancer patients. The mean age of the patients was 43.39 years with a standard deviation of 11.74 years. The overall median survival time was 10 months, and the 5-year overall survival rate was about 35.0%. The median survival time was 16 months for housewives and 10 months for other occupations. Median survival time for age group 39 years was 19 months, and for age groups 40-49, 50-59 and 60 years and above it was 7, 5 and 10 months respectively.
Log-rank tests were used to compare the survival curves of the breast cancer patients. For stage of breast cancer, the Log-rank statistic is 300.23 with a P-value of 0.0001. For occupation, the Log-rank statistic is 0.01 with a P-value of 0.9899. For age group, the Log-rank statistic is 8.81 with a P-value of 0.0378. Lastly, for results of the treatment, the Log-rank statistic is 139.11 with a P-value of 0.0001.
The Cox model estimates gave a hazard ratio of 0.85 for occupation and 0.76 for results of the treatment. The hazard ratio is 1.14 for stage II breast cancer patients and 2.87 for stage III patients. The estimated hazard ratios for age groups 39, 40-49, 50-59, and 60 years and above are 0.61, 0.49 and 0.35 respectively.
For the Weibull model, the estimated hazard ratio for occupation is 0.85 and for results of the treatment is 0.91. The hazard ratio is 1.14 for stage II patients and 2.78 for stage III patients. The hazard ratios for age groups 39, 40-49, 50-59, and 60 years and above are 0.50, 0.42 and 0.29 respectively.
For the Gompertz regression model, the estimated hazard ratio for occupation is 0.854 and for results of the treatment is 0.869. The hazard ratio is 1.15 for stage II patients and 2.69 for stage III patients. The hazard ratios for age groups 39, 40-49, 50-59, and 60 years and above are 0.57, 0.423 and 0.29 respectively.
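As a reading aid, a hazard ratio can be converted into a percentage change in hazard relative to the reference category, using (HR - 1) x 100. The short sketch below applies this to the Cox estimates reported above. The helper names are hypothetical, and the reference categories are an assumption, since the summary does not state them.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio implied by a regression coefficient: exp(beta)."""
    return math.exp(beta)


def percent_change_in_hazard(hr):
    """Percentage change in hazard relative to the reference category."""
    return (hr - 1.0) * 100.0


# Cox hazard ratios as reported in the summary
cox_hr = {
    "occupation": 0.85,
    "results of treatment": 0.76,
    "stage II": 1.14,
    "stage III": 2.87,
}
for name, hr in cox_hr.items():
    change = percent_change_in_hazard(hr)
    print(f"{name}: HR = {hr:.2f}, change in hazard = {change:+.0f}%")
```

A ratio above 1 indicates a higher hazard than the reference category and a ratio below 1 a lower hazard; whether a difference is statistically significant depends on its confidence interval, which is not given in the summary.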
Conclusion
In this study, the Kaplan-Meier procedure was used to estimate the survival curves of breast cancer patients at Ahmadu Bello University Teaching Hospital, Zaria. The mean age of the patients was 43.39 years with a standard deviation of 11.74 years, and the overall median survival time was 10 months. The median is the survival time at which the cumulative survival function equals 0.5, so 50% of the breast cancer patients survived 10 months or less and the other 50% survived beyond 10 months after they were diagnosed with the disease. The 5-year overall survival rate of about 35.0% for breast cancer patients admitted at Ahmadu Bello University Teaching Hospital, Zaria, between 1997 and 2012 is low compared to survival rates in some developing nations.
The Log-rank statistic for stage of breast cancer is 300.23 with a P-value of 0.0001; this indicates that the survival experience of the breast cancer patients differs across stages of breast cancer at the 1% level of significance.
The Log-rank statistic for age group is 8.81 with a P-value of 0.0378; since this P-value exceeds 0.01, the survival experience of the four age groups does not differ significantly at the 1% level, although the difference would be significant at the 5% level. The Log-rank statistic for results of the treatment is 139.11 with a P-value of 0.0001; this indicates that breast cancer patients have significantly different survival experience with regard to the results of their treatment at the 1% level. On the other hand, the Log-rank statistic for occupation is 0.01 with a P-value of 0.9899; this indicates that the survival curves for housewives and other occupations do not differ significantly at the 1% level.
From the results obtained for the Cox, Weibull and Gompertz regression models, occupation of the breast cancer patients, stage of the disease and results of the treatment have a significant effect on the mortality of the patients, and an increase in the risk of death from breast cancer is associated with an increase in age.
This research work analysed a breast cancer dataset using three proportional hazards models: the Cox model under the semi-parametric framework, and the Weibull and Gompertz models under the parametric framework. The results of the analysis revealed that, for our breast cancer data, the parametric Weibull regression model performed best and the Cox regression model performed worst.
The results of this study show that, for our breast cancer data, the parametric Weibull regression model could better determine the factors associated with breast cancer than the semi-parametric Cox proportional hazards model and the parametric Gompertz model. In other words, the Weibull model provided a better fit to the study data than the Cox proportional hazards and Gompertz models. Therefore, it would be better for researchers in the health care field to consider this model in their research on breast cancer if the assumption of proportional hazards is not fulfilled.
Recommendations
The following are the recommendations made:
- Public awareness regarding early detection of breast cancer is
- Necessary strategies to enhance cancer registry for breast cancer in an effort to increase the level of case ascertainment and completeness of the data is recommended.
- Further research on survival rate is recommended to observe the temporal changes in survival rate among Nigerian breast cancer patients which reflect the impact of the effectiveness of prevention
- The federal government should provide more funds so that researchers can make discoveries which will significantly enhance progress in the understanding of the epidemiology of the
- Further research should also be made by the government and health practitioners on new drugs that will effectively combat breast cancer
Suggestions for Further Research
This study can be extended by applying the Cox, Weibull and Gompertz regression methods under different sample sizes and percentages of censoring. It can also be extended by comparing these models with the logistic regression model to find out which of the models fits the data better and gives the best result.
REFERENCES
- Aggrey, S. E. (2002). Comparison of three Nonlinear and Spline Regression Models for describing Chicken Growth Curves. Poultry Sci. 81, 1782–1788.
- Agresti A. (2002). Categorical Data Analysis.(2nd ed) New York: John Wiley and Sons.
- Ahuja J.C. and Nash S.W. (1967). The Generalised Gompertz-Verhulst family of Distributions. Sankhya Ser A; 29(2): 141 – 156
- Akaike H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In B. N. Petrov and F. Csaki (eds), 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, 267 – 281
- Altman D.G. and de Stavola B. L. (1994). Practical Problem in Fitting Proportional Hazard Model to Data with Updated Measurement of Observational Studies. John Wiley and Sons: Chichester.
- Altman D.G., (1977). Practical Statistics for Medical Research. Chapman and Hall/CRC: Boca Raton.
- Bang J. W. (1949). Maximum Likelihood Estimates of Proportion of Patients Cured by Cancer Therapy. Journal of Royal Statistical Society Series B.
- Berry G. (1983). The Analysis of Mortality by the Subject Year Method. Biometrics 30: 173 – 184
- Berwick V., Cheek L., Bell T. (2004). Statistical Review 12: Survival Analysis. Crit. Care 8: 389 – 394
- Bijan Moghimi-Dehkordi, Azdeh Safee, Mohamad Amin Pourhoseigholi, Reza Fatemi, Ziaoddi Tabeie, Mohammed Reza Zali (2008) Statistical Comparison of Survival Models for Analysis of Cancer Data. Asian Pacific Journal of Cancer Prevention
- Blossfeld H. P., Hamerle A., Mayer K.U. (1989). Event History Analysis. Hillsdale, N. J.: Lawrence Erlbaum
- Borkowf C. R. (2005). A Simple Hybrid Variance Estimator for the Kaplan – Meier Survival Function. Statistics in Medicine 24: 827 – 851