Department of Mathematics, NSHM Knowledge Campus, Durgapur-713212, West Bengal, INDIA
Senior Bio-statistician, Department of Bio-Statistics, Quintiles Clinical Research, Bangalore, Karnataka, INDIA
Research Scholar, Department of Statistics, Aliah University, Kolkata, West Bengal, INDIA
The annual death rate from heart disease is increasing. Major factors behind this increase are (a) misdiagnosis by medical doctors and (b) ignorance on the part of patients. Heart disease can be described as any kind of disorder that affects the heart.
The ‘statlog’ heart disease dataset from the UCI Machine Learning Repository, covering 270 patients, is used in this article. The dataset comprises patient attributes together with a diagnosis that records whether heart disease is present or absent in the patient. The present article aims to identify the risk factors/variables which influence this diagnosis. Classification is a very important part of disease diagnosis, but identifying the risk factors/variables is also relevant. Two classification techniques, namely Support Vector Machines (SVM) and Multi-Layer Perceptron ensembles (MLPE), and one advanced regression technique, the generalized additive model (GAM) with binomial distribution and ‘logit’ link, are used for diagnosis and for identification of risk factors/variables.
The GAM explains approximately 65% of the deviance, with an adjusted R-squared value of approximately 0.70. Sensitivity analysis has been performed under SVM, which is the best model for this dataset with a classification accuracy rate of approximately 85%. MLPE gives a classification accuracy rate of approximately 82%. Maximum heart rate, vessel, old peak, chest pain, and thallium scan are the most important factors/variables found through both the sensitivity analysis under SVM and the GAM.
The present article attempts to extract some new information regarding heart disease through probabilistic modeling, which may provide better assistance for treatment decision making using individual patient risk factors and the benefits of a specific treatment. These findings may help medical practitioners provide better medical treatment.
The heart is the most essential organ of the human body; it is a strong muscle about the size of a fist. Any disorder that affects the heart, from infection to genetic defects and blood vessel disease, is referred to as heart disease [1]. Heart disease is a serious condition, and proper diagnosis at an early stage remains a challenging task [2]. In fact, up to 25% of people with heart disease have no symptoms despite insufficient blood flow to the heart, a condition that is referred to as silent heart disease [3]. In the United States of America, about 600,000 people die of heart disease every year, which amounts to one in every four deaths.
In health care, data mining and statistical machine learning play a vital role in medical applications, including diagnosis, prognosis, and therapy.
Data mining (DM) techniques
In the statistical analysis of clinical trials and observational studies, the identification of and adjustment for prognostic factors is an important activity for obtaining valid results. The failure to consider important prognostic variables, particularly in observational studies, can lead to errors in estimating treatment differences. In addition, incorrect modeling of prognostic factors can result in the failure to identify nonlinear trends or threshold effects on survival. This article describes flexible statistical methods that may be used to identify and characterize the effect of potential prognostic factors on disease endpoints. These methods are called ‘Generalized Additive Models’ (GAM).
In this research work, we used the heart disease dataset obtained from the UCI Machine Learning Repository to develop intelligent systems using data mining and GAM for the diagnosis of heart disease. The results obtained from these systems were compared, and the system with the highest recognition rate was taken as the best system for diagnosis of heart disease. This system aims to address the problem of misdiagnosis of heart disease and also to identify the risk factors or important biomedical parameters responsible for probable heart disease. This can guide doctors regarding prognostic factors and give patients greater awareness of heart disease.
The present article considers 270 heart disease patients with 14 factors or variables. The current secondary data set is taken from the report. The data set can be downloaded at
In the present article, data mining techniques with sensitivity analysis are used to diagnose heart disease and to find the factors that are most important in this diagnosis. In addition, generalized additive logistic models are applied to find the risk factors for heart disease. For data mining, Multi-Layer Perceptron ensembles (MLPE) and Support Vector Machines (SVM) are used for classification, and sensitivity analysis is then carried out only on the better of these two classifiers for this heart disease data set.
The data mining techniques aim to classify the data using different classifiers, whereas the GAM aims to identify the risk factors for this disease. Brief descriptions of the methods used are given below.
DM is an iterative process that consists of several steps. The CRISP-DM,
This work addresses steps 4 and 5, with an emphasis on the use of NNs and SVMs to solve classification and regression goals. Both tasks require a supervised learning, where a model is adjusted to a dataset of examples that map
To evaluate a model for classification, common metrics are.
In DM techniques, NN refers to the popular multilayer perceptron (MLP). A major concern in its use is the difficulty of defining the proper network for a specific application, owing to its sensitivity to the initial conditions and to overfitting and underfitting problems, which limit its generalization capability. A very promising way to partially overcome such drawbacks is the use of MLP ensembles (MLPE); averaging and voting techniques are widely used in classical statistical pattern recognition and can be fruitfully applied to MLP classifiers. For the classification problem, MLPE, which is a combination of MLP models, is used. This network includes one hidden layer of
Where is the output of the network for node
Where is the predicted probability and is the NN output for class
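The relation between the per-class network outputs and the predicted probability, and the averaging used by an ensemble, can be illustrated with a short sketch. It assumes a softmax transformation of the raw outputs and a simple average of the member probabilities; both are common choices but are assumptions here, since the exact formulas are not recoverable from the text. The function names and the example outputs are hypothetical.

```python
import math


def softmax(outputs):
    """Convert raw per-class network outputs into probabilities that sum to 1.

    Assumption: a softmax transformation; the article's own formula is not shown.
    """
    # Subtracting the maximum improves numerical stability without changing the result.
    shift = max(outputs)
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(member_probs):
    """Average the class probabilities predicted by the ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


# Hypothetical raw outputs of three MLPs for the classes (absence, presence).
member_outputs = [[0.3, 1.2], [0.1, 0.9], [0.8, 0.7]]
avg_probs = ensemble_average([softmax(o) for o in member_outputs])
predicted_class = avg_probs.index(max(avg_probs))  # index of the most probable class
```

Averaging probabilities rather than hard votes keeps the ensemble output usable for threshold-based metrics such as the AUC.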
When compared with NNs, SVMs present theoretical advantages, such as the absence of local minima in the learning phase.
Here, SVM uses the sequential minimal optimization (SMO) learning algorithm with the popular Gaussian kernel, which has fewer parameters than other kernels (e.g. polynomial):

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right), \quad \gamma > 0$$
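The Gaussian kernel is usually written as K(x, z) = exp(-γ‖x − z‖²) with a single parameter γ > 0. A minimal sketch of that standard form follows; the function name `gaussian_kernel` and the example vectors are illustrative and not taken from the article.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Identical points give a similarity of 1; the value decays with distance.
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger γ makes the similarity fall off faster with distance, which is why γ is usually tuned together with the SVM cost parameter.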
The sensitivity analysis is a simple procedure that is applied after the training procedure and analyzes the model responses when a given input is changed. Let
For a more detailed analysis, the variable effect characteristic (VEC) curve, Cortez
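The idea of changing one input and observing the model response can be sketched as a one-dimensional sensitivity analysis. The version below holds every other input at its mean, varies the chosen input over equally spaced levels between its minimum and maximum, and uses the range of the responses as the sensitivity measure. The range measure, the number of levels, and all names are assumptions for illustration; the article's exact measure is not recoverable from the text. The (level, response) pairs for each input are the points a VEC curve would plot.

```python
import math


def sensitivity_importance(predict, data, levels=7):
    """Return (relative importances, VEC points) for each input column.

    predict: function mapping a list of input values to a numeric response.
    data: list of rows, each a list of numeric inputs.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    n_inputs = len(data[0])
    means = [sum(row[a] for row in data) / len(data) for a in range(n_inputs)]
    ranges, vec_curves = [], []
    for a in range(n_inputs):
        lo = min(row[a] for row in data)
        hi = max(row[a] for row in data)
        curve = []
        for j in range(levels):
            x = list(means)  # all other inputs stay at their mean
            x[a] = lo + (hi - lo) * j / (levels - 1)
            curve.append((x[a], predict(x)))
        responses = [r for _, r in curve]
        ranges.append(max(responses) - min(responses))
        vec_curves.append(curve)
    total = sum(ranges)
    importances = [r / total if total > 0 else 0.0 for r in ranges]
    return importances, vec_curves


# Hypothetical fitted model that depends only on the first input.
toy_model = lambda x: 1.0 / (1.0 + math.exp(-2.0 * x[0]))
toy_data = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
importances, vec_curves = sensitivity_importance(toy_model, toy_data, levels=5)
```

Because the procedure only calls `predict`, it can be applied to any fitted classifier that returns a numeric response, such as a class probability.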
GAMs provide more flexibility than do GLMs, as they relax the hypothesis of linear dependence between the covariates and the expected value of the response variable. The main drawback of GAMs lies in the estimation of the smooth functions
Generalized additive models can be used in virtually any setting where linear models are used. For a single observation (
In the logistic regression model the outcome
Classification accuracy refers to the ability of the model to correctly predict the class label of new or previously unseen data. It is the percentage (%) of test set examples correctly classified by the classifier, and the quality of classification can be assessed through this overall accuracy. That is,

$$\text{Accuracy}(T) = \frac{100}{|T|} \sum_{t \in T} \text{eval}(t), \qquad \text{eval}(t) = \begin{cases} 1 & \text{if } \text{classify}(t) = t.c \\ 0 & \text{otherwise} \end{cases}$$

where $T$ is the set of data items to be classified (the test set in this case), $t \in T$, $t.c$ is the class of item $t$, and $\text{classify}(t)$ returns the classification of $t$ by the classifier used (here, SVM and MLPE). For more details see.
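As a small illustration of this definition, the sketch below computes the percentage of test items whose predicted class matches the true class. The function name and the example labels are hypothetical; the class coding (1 = absence, 2 = presence) follows the dataset description.

```python
def classification_accuracy(true_classes, predicted_classes):
    """Percentage of items whose predicted class equals the true class."""
    if len(true_classes) != len(predicted_classes):
        raise ValueError("both lists must have the same length")
    if not true_classes:
        raise ValueError("the test set must not be empty")
    correct = sum(1 for t, p in zip(true_classes, predicted_classes) if t == p)
    return 100.0 * correct / len(true_classes)


# Hypothetical test set of five patients (1 = absence, 2 = presence).
true_classes = [1, 2, 2, 1, 2]
predicted_classes = [1, 2, 1, 1, 2]
acc = classification_accuracy(true_classes, predicted_classes)  # 4 of 5 correct
```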
AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate against the false positive rate as the threshold for classifying an item as 0 or 1 is varied from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. One characteristic of the AUC is that it is independent of the fraction of the test population which is class 0 or class 1; this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
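The area under the ROC curve equals the probability that a randomly chosen class 1 item receives a higher score than a randomly chosen class 0 item, with ties counted as one half. The sketch below uses that pairwise form instead of tracing the curve; the function name and the example scores are hypothetical.

```python
def auc(labels, scores):
    """AUC from pairwise comparisons of class 1 scores against class 0 scores.

    labels: 0/1 class labels; scores: higher values indicate class 1.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores; three of the four (class 1, class 0) pairs are ordered correctly.
balanced = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Duplicating the class 0 items changes the class balance but not the AUC.
unbalanced = auc([0, 0, 0, 0, 1, 1], [0.1, 0.4, 0.1, 0.4, 0.35, 0.8])
```

The second call illustrates the independence from class proportions mentioned above.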
Arrange the training examples in a random order.
Divide the training examples into $k$ folds.
For each fold $i = 1, \dots, k$:
(i) Train the classifier using all the examples that do not belong to fold $i$.
(ii) Test the classifier on all the examples in fold $i$.
(iii) Compute $n_i$, the number of examples in fold $i$ that were wrongly classified.
Return the following estimate of the classifier error: $E = \frac{1}{m} \sum_{i=1}^{k} n_i$, where $m$ is the total number of training examples.
To obtain an accurate estimate to the accuracy of a classifier,
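The cross-validation steps above can be sketched as follows. The function names, the default of 10 folds, the fixed random seed, and the majority-class learner used in the example are illustrative assumptions, not the article's implementation (which was carried out in R).

```python
import random


def k_fold_error(examples, labels, train, k=10, seed=0):
    """Estimate the classifier error rate by k-fold cross-validation.

    train(X, y) must return a function predict(x) giving a class label.
    """
    n = len(examples)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of examples")
    order = list(range(n))
    random.Random(seed).shuffle(order)         # arrange examples in random order
    folds = [order[i::k] for i in range(k)]    # divide the indices into k folds
    wrong = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [j for j in order if j not in held_out]
        predict = train([examples[j] for j in train_idx],
                        [labels[j] for j in train_idx])
        # count the wrongly classified examples in the held-out fold
        wrong += sum(1 for j in fold if predict(examples[j]) != labels[j])
    return wrong / n


def train_majority(X, y):
    """Hypothetical baseline learner: always predict the most frequent class."""
    majority = max(set(y), key=y.count)
    return lambda x: majority


# 15 patients without and 5 with heart disease (1 = absence, 2 = presence).
toy_examples = list(range(20))
toy_labels = [1] * 15 + [2] * 5
error = k_fold_error(toy_examples, toy_labels, train_majority, k=5)
```

Calling the function several times with different `seed` values and averaging the results gives a more stable estimate.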
All GAM regression and data mining analyses were performed in the R statistical software using the appropriate library packages.
Heart disease (HD) has a very strong, significant positive association with the chest pain type of a patient. Of the four types of chest pain, asymptomatic chest pain increases the log odds of HD by 2.7777 (p-value 0.0008). Therefore, a patient has a higher chance of HD if he/she has asymptomatic chest pain.
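A change in log odds is often easier to read as an odds ratio, obtained by exponentiating the coefficient. The sketch below applies this to the reported estimate for asymptomatic chest pain (2.777748, standard error 0.829641). The 95% Wald interval is a standard approximation added here for illustration and is not reported in the article; the function names are hypothetical.

```python
import math


def odds_ratio(log_odds_change):
    """Convert a change in log odds into an odds ratio."""
    return math.exp(log_odds_change)


def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% Wald interval for the odds ratio (assumed method)."""
    return (math.exp(estimate - z * std_error),
            math.exp(estimate + z * std_error))


# Estimate and standard error for Chest Pain 4 (asymptomatic) as reported.
or_chest_pain_4 = odds_ratio(2.777748)
ci_low, ci_high = wald_interval(2.777748, 0.829641)
```

An odds ratio above 1 with an interval that excludes 1 is consistent with the reported significant positive association.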
| Variable name | Operationalization | Mean | Standard deviation | Proportion of levels of attributes |
|---|---|---|---|---|
| Age (Year) | Age at study | 54.43 | 9.10 | --- |
| Sex | Gender (Female = 1; Male = 2) | --- | --- | 1 = 32.22%; 2 = 67.78% |
| Chest Pain | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic) | --- | --- | 1 = 7.41%; 2 = 15.56%; 3 = 29.26%; 4 = 47.78% |
| Resting BP | Resting blood pressure (in mm Hg on admission to the hospital) | 131.34 | 17.86 | --- |
| Cholesterol | Serum cholesterol in mg/dl | 249.66 | 51.69 | --- |
| Fasting BS | Fasting blood sugar > 120 mg/dl (1 = False; 2 = True) | --- | --- | 1 = 85.19%; 2 = 14.81% |
| Resting ECG | Resting electrocardiographic results (1 = Normal; 2 = Having ST-T; 3 = Hypertrophy) | --- | --- | 1 = 48.52%; 2 = 0.74%; 3 = 50.74% |
| Max HR | Maximum heart rate achieved | 149.68 | 23.17 | --- |
| Exercise Ang | Exercise induced angina (1 = No; 2 = Yes) | --- | --- | 1 = 67.04%; 2 = 32.96% |
| Oldpeak | ST depression induced by exercise relative to rest | 1.05 | 1.14 | --- |
| Slope | The slope of the peak exercise ST segment (1 = Up sloping; 2 = Flat; 3 = Down sloping) | --- | --- | 1 = 48.15%; 2 = 45.19%; 3 = 6.67% |
| Vessel | Number of major vessels (0-3) colored by fluoroscopy (treated as a discrete variable) | --- | --- | 0 = 59.26%; 1 = 21.48%; 2 = 12.22%; 3 = 7.04% |
| Thal | Thallium heart scan (1 = Normal; 2 = Fixed defect; 3 = Reversible defect) | --- | --- | 1 = 56.30%; 2 = 5.19%; 3 = 38.52% |
| Heart disease | Diagnosis of heart disease (1 = Absence; 2 = Presence) | --- | --- | 1 = 55.56%; 2 = 44.44% |
Estimation of parametric coefficients

| Covariates | Estimate | Standard Error | Z value | p-value |
|---|---|---|---|---|
| Chest Pain 2 | 1.498281 | 0.963307 | 1.555 | 0.119862 |
| Chest Pain 3 | 0.662778 | 0.824066 | 0.804 | 0.421237 |
| Chest Pain 4 | 2.777748 | 0.829641 | 3.348 | 0.000814 |
| Resting ECG 2 | 2.187153 | 3.543705 | 0.617 | 0.537107 |
| Resting ECG 3 | 0.768672 | 0.439692 | 1.748 | 0.080429 |
|ACC (Classification Accuracy Rate in %)||AUC (Area Under Curve in 0-1)|
In the GAM fitted model, for every one-unit increase in Cholesterol, the log odds of HD increase by 0.0098 (p-value 0.029). Cholesterol has a significant positive association with HD, which indicates that patients with high Cholesterol have a higher chance of HD.
HD has a strong, significant negative association with the maximum heart rate (Max. HR) of a patient. For every one-unit increase in Max. HR, the log odds of HD decrease by 0.0326 (p-value 0.003). That is, patients with a higher maximum heart rate have a lower risk of HD.
For every one-unit increase in Old peak, the log odds of HD increase by 0.5150 (p-value 0.020). HD has a significant positive association with Old peak; therefore, patients with a high Old peak value have a higher risk of HD.
In this GAM fitted model, for every one-unit increase in Resting BP, the log odds of HD increase by 0.0243 (p-value 0.040). Resting BP has a significant positive association with HD, which indicates that patients with high Resting BP have a higher chance of HD.
Heart disease (HD) is positively associated with the Resting ECG of a patient, although only at the 10% significance level. Of the three types of Resting ECG, a Hypertrophy result increases the log odds of HD by 0.7686 (p-value 0.080). Therefore, patients have a higher chance of HD if their Resting ECG shows Hypertrophy rather than the other results.
The sex (gender) of a patient has a very strong, significant positive association with HD. Being male increases the log odds of HD by 2.0802 (p-value <0.001) relative to being female. This indicates that male patients have a higher chance of HD.
HD has a very strong, significant positive association with the thallium heart scan (Thal) result. A Reversible defect in a patient's thallium heart scan report increases the log odds of HD by 1.6939 (p-value <0.001). That is, a patient has a higher chance of HD if his/her thallium heart scan report shows a Reversible defect rather than the other results.
The number of major vessels (Vessel), treated as a discrete variable in this GAM fitted model, has a very strong, significant positive association with HD. Every increase of one in Vessel raises the log odds of HD by 1.2636 (p-value <0.001).
In this GAM fitted model, only one covariate, namely Age, is used as a smoothing term. As this is a nonparametric method of estimation, a Chi-square test statistic has been used for testing the hypothesis. From
It also noticed from
In the above predictive formula, all the covariates except Age enter this additive model parametrically. Age is the only smoothing term, and its approximate significance has been judged through a nonparametric method (Chi-square test).
The current article considers Heart Disease/HD (whether a patient has heart disease or not) as the response variable. It is a binary variable with values ‘1’ and ‘2’, which stand for absence and presence of heart disease respectively. HD has been modeled using a generalized additive model. The GAM fitted model results are displayed in
Data Mining Techniques: (a) Multi-Layer Perceptron Neural Network ensemble (MLPE); (b) Support Vector Machine (SVM)
Histogram of residuals.
Smoothing term (Age) plot with confidence belt.
Absolute residual plot.
Normal probability plots of residuals.
Input Importance Chart.
Variable effect characteristic (VEC) curve for Max. HR (most important input variable).
The current reported results (
Fifth, the final model of HD is selected by identifying the appropriate statistical distribution. The HD distribution is identified herein as the binomial distribution. For further details, please see the references.
To the best of our knowledge, the present models (Results & interpretation section) can be considered one of the best first building blocks of a regression analysis. The current models may provide better assistance for treatment decision making using individual patient risk factors and the benefits of a specific treatment. The current results point to several interesting conclusions. These findings may help medical practitioners provide better medical treatment. The thallium scan report and the chest pain type are highly important for identifying heart disease patients. Male patients in particular are advised to take care of their heart in older age.
We would like to acknowledge all the previous authors who have worked on this data set, and the UCI Machine Learning Repository for making this dataset available. Finally, we are very thankful to the reviewers for their valuable comments, which improved this article.
SVM: Support vector machine
MLPE: Multi-layer perceptron ensemble
MLP: Multi-layer perceptron
GAM: Generalized additive model
VEC: Variable effect characteristic curve