## INTRODUCTION

The heart is the most essential organ of the human body: a strong muscle roughly the size of a fist. Any disorder that affects the heart, from infection to genetic defects and blood vessel disease, is referred to as heart disease.^{1} Heart disease is serious, and its proper diagnosis at an early stage remains a challenging task.^{2} In fact, up to 25% of people with heart disease have no symptoms despite insufficient blood flow to the heart, a condition referred to as silent heart disease.^{3} In the United States of America, about 600,000 people die of heart disease every year, which amounts to one in every four deaths.^{4} Diagnosis usually begins when a patient visits the doctor to have symptoms checked. Patients may present with shortness of breath, pain in the chest or back, painful and persistent coughing, or any number of other symptoms, none of which immediately alerts the doctor to a diagnosis of heart disease. Many studies on heart disease diagnosis have been carried out all over the world, generally using artificial intelligence techniques or data mining methods.^{5-8} The use of data mining techniques in medical diagnosis has been increasing gradually. There is no doubt that the evaluation of patient data and the decisions of experts are the most important factors in diagnosis. However, artificial intelligence or machine learning techniques are sometimes used to support disease diagnosis.^{5-9-11}

In health care, data mining or statistical machine learning plays a vital role in medical applications including diagnosis, prognosis, and therapy.^{12} Clinical data mining involves the conceptualization, extraction, analysis, and interpretation of available clinical data for practical knowledge-building, clinical decision making, and practitioner reflection.^{12} A medical diagnosis is a classification problem.^{13} In predictive data mining, the data set consists of instances; each instance is characterized by attributes or features, and a special attribute represents the outcome variable or class.^{14} Often, the goal of a data mining project is to build a model from the available data. Data mining models are therefore objective rather than subjective, since they are driven by the available data.

Data mining (DM) techniques^{15} aim at extracting high-level knowledge from raw data. There are several DM algorithms, each with its own advantages. DM techniques perform regression and classification tasks. In the case of neural networks (NNs), the backpropagation algorithm was first introduced in 1974^{16} and later popularized in 1986.^{17} Since then, NNs have become increasingly used. More recently, support vector machines (SVMs) have also been proposed.^{18,19} Due to their higher flexibility and nonlinear learning capabilities, both NNs and SVMs are gaining attention within the DM field, often attaining high predictive performance.^{20,21} SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. Indeed, the SVM was recently considered one of the most influential DM algorithms.^{22} Therefore, this paper studies the use of SVMs for heart disease diagnosis.

In the statistical analysis of clinical trials and observational studies, identifying and adjusting for prognostic factors is important for obtaining valid outcomes. Failure to consider important prognostic variables, particularly in observational studies, can lead to errors in estimating treatment differences. In addition, incorrect modeling of prognostic factors can result in failure to identify nonlinear trends or threshold effects on survival. This article describes flexible statistical methods, called ‘Generalized Additive Models’ (GAM),^{23} that may be used to identify and characterize the effect of potential prognostic factors on disease endpoints.

Many mathematical and statistical methodologies for building classification models, from classical statistical methods to machine learning theory to classification trees, have been reviewed and compared.^{24-27} Much research has been done on better and more accurate models for the Heart Disease Dataset. One work^{28} gives a knowledge-driven approach. Logistic regression was initially used by Dr. Robert Detrano for heart disease diagnosis.^{29} Newton Cheung utilized the C4.5, Naive Bayes, BNND and BNNF algorithms and reached classification accuracies of 81.11%, 81.48%, 81.11% and 80.96%, respectively.^{30} Another study proposed a method that uses an artificial immune system (AIS) and obtained higher classification accuracy than the previous works.^{31} Comparative results of many studies performed on this heart disease data have also been reported.^{10}

In the present article, 10-fold cross-validation with 5 runs in each experiment is performed to obtain a more stable classification accuracy rate. One aim of the present article is to explore the relationship between a patient's chance of having heart disease and other biomedical parameters treated as cofactors. Because the relationship between the cofactors and the response variable is complex, GAM is introduced here for better prediction accuracy. Another aim of this study is to find the classifier with the best performance evaluation measures and to identify the important input variables for heart disease diagnosis using strong data mining techniques. Many authors have applied various classification techniques to this dataset for heart disease diagnosis,^{5-11} but SVM and MLPE have probably not been used under a proper modeling scheme. This study shows a high classification accuracy rate and presents a variable input importance chart for heart disease diagnosis.

In this research work, we used the heart disease dataset obtained from the UCI Machine Learning Repository to develop intelligent systems, using data mining and GAM, for the diagnosis of heart disease. The results obtained from these systems were compared, and the system with the highest recognition rate was taken as the best system for diagnosing heart disease. Such a system is intended to reduce the misdiagnosis of heart disease and to identify the risk factors, or important biomedical parameters, responsible for probable heart disease. This can guide doctors regarding prognostic factors and make patients more aware of heart disease.

## MATERIALS AND METHODS

#### MATERIALS

The present article considers 270 heart disease patients with 14 factors or variables. The data are secondary and can be downloaded at http://archive.ics.uci.edu/ml/datasets.html. The covariates, factors and their levels are described in Table 1, together with summary statistics such as the mean, standard deviation, and proportion of the levels. The data contain 5 continuous variables and 9 categorical attributes. Table 1 also shows how each variable or attribute and its levels are operationalized in the present report. Here, the presence or absence of heart disease in a patient plays the role of the dependent variable (for regression) or output variable (for classification), and the rest of the variables play the role of independent variables or cofactors.

#### METHODS

In the present article, data mining techniques with sensitivity analysis are used to diagnose heart disease and to find the factors most responsible for the diagnosis. In addition, generalized additive logistic models are applied to find the risk factors for heart disease. For data mining, multilayer perceptron ensembles (MLPE) and support vector machines (SVM) are used for classification; sensitivity analysis is then carried out only on the better of these two classifiers for this heart disease data set.^{20}

The best GAM^{32} can be selected through model checking criteria, namely the R-square value, the AIC or UBRE value, and regression diagnostic plots such as the normal probability plot and the plot of residuals against fitted values.^{14,32} Whether a cofactor is significant is judged through its p-value. For this heart disease data set, the absence or presence of heart disease is taken as the response variable (Y), and age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise-induced angina, oldpeak, slope of the peak exercise ST segment, number of major vessels, and thal (thallium scan) are the cofactors (X_{i}'s).

The data mining techniques aim to classify the data using different classifiers, whereas GAM aims to identify the risk factors for the disease. Brief descriptions of the methods used are given below.

#### Data Mining Techniques

DM is an iterative process that consists of several steps. CRISP-DM,^{33} a tool-neutral methodology supported by industry (e.g. SPSS, DaimlerChrysler), partitions a DM project into 6 phases: 1. business understanding; 2. data understanding; 3. data preparation; 4. modeling; 5. evaluation; and 6. deployment.

This work addresses phases 4 and 5, with an emphasis on the use of NNs and SVMs to solve classification and regression goals. Both tasks require supervised learning, where a model is adjusted to a dataset of examples that map *I* inputs into a given target. In classification, models output a probability *p(c)* for each possible class *c*, such that the probabilities over all classes sum to 1. To assign a target class *c*, one option is to set a decision threshold *D* ∈ [0,1] and then output *c* if *p(c)* > *D*, and otherwise return ¬*c* (not *c*). This method is used to build receiver operating characteristic (ROC) curves. Another option is to output the class with the highest probability; this method allows the definition of a multi-class confusion matrix. For more details see.^{34}
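The two decision rules described above can be illustrated with a minimal Python sketch. The function names, the class labels and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def classify_by_threshold(p_positive, threshold=0.5):
    """Return 1 (class c) if p(c) > D, otherwise 0 (not c)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold D must lie in [0, 1]")
    return 1 if p_positive > threshold else 0


def classify_by_max_probability(class_probs):
    """Return the label whose predicted probability is highest."""
    return max(class_probs, key=class_probs.get)


# Hypothetical predicted probabilities p(c) for four patients.
probs = [0.10, 0.45, 0.55, 0.90]

# Lowering D turns more cases into positives; sweeping D over [0, 1]
# is what traces out a ROC curve.
labels_at_0_5 = [classify_by_threshold(p, 0.5) for p in probs]
labels_at_0_4 = [classify_by_threshold(p, 0.4) for p in probs]

# Highest-probability rule, usable with any number of classes.
winner = classify_by_max_probability({"absent": 0.3, "present": 0.7})
```

The threshold rule trades true positives against false positives, whereas the highest-probability rule gives one label per case and so feeds directly into a confusion matrix.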

To evaluate a classification model, common metrics^{35} are the ROC area (AUC), the confusion matrix, accuracy (ACC), and the true positive and true negative rates (TPR/TNR). A classifier should present high values of ACC, TPR, TNR and AUC. The model’s generalization performance is often estimated by holdout validation (i.e. a train/test split) or by k-fold cross-validation.^{14} The latter is more robust but requires around k times more computation, since k models are fitted.
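As a small illustration of how ACC, TPR and TNR follow from the confusion matrix, the sketch below computes them for a binary problem. The labels are toy values invented for the example, and the function names are assumptions of this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_rates(y_true, y_pred, positive=1):
    """Return (ACC, TPR, TNR) as fractions between 0 and 1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)  # sensitivity: positives correctly detected
    tnr = tn / (tn + fp)  # specificity: negatives correctly detected
    return acc, tpr, tnr


# Toy labels (1 = disease present, 0 = absent), invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, tpr, tnr = classification_rates(y_true, y_pred)
```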

#### MLPE neural network model

In DM, NN usually refers to the popular multilayer perceptron (MLP). A major concern in its use is the difficulty of defining the proper network for a specific application, owing to the sensitivity to initial conditions and to overfitting and underfitting problems, which limit generalization capability. A promising way to partially overcome these drawbacks is to use MLP ensembles (MLPE); averaging and voting techniques are widely used in classical statistical pattern recognition and can be fruitfully applied to MLP classifiers. For the classification problem here, an MLPE, which is a combination of MLP models, is used. Each network includes one hidden layer of *H* neurons with logistic functions (Figure 1 (a)). The overall model is given in the form:

$$
y_i = f_i\left( w_{i,0} + \sum_{j=I+1}^{I+H} f_j\left( \sum_{n=1}^{I} x_n w_{j,n} + w_{j,0} \right) w_{i,j} \right)
$$

where *y _{i}* is the output of the network for node *i*, *w _{i,j}* is the weight of the connection from node *j* to *i*, and *f _{j}* is the activation function for node *j*. For a binary classification (*N _{c}* = 2), there is one output neuron with a logistic function. Under multi-class tasks (*N _{c}* > 2), there are *N _{c}* linear output neurons, and the softmax function is used to transform these outputs into class probabilities:

$$
p(i) = \frac{\exp(y_i)}{\sum_{c=1}^{N_c} \exp(y_c)}
$$

where *p(i)* is the predicted probability and *y _{i}* is the NN output for class *i*. Training (BFGS algorithm) is stopped when the error slope approaches zero or after a maximum number of epochs. For classification, it maximizes the likelihood.^{14} Since NN training is not optimal, the final solution depends on the choice of starting weights. To address this, one can train different networks and then either select the NN with the lowest error or use an ensemble of all NNs and output the average of the individual predictions.^{14} In general, ensembles are better than individual learners.^{36} The final NN performance depends crucially on the number of hidden nodes. The simplest NN has *H* = 0, while more complex NNs use a high *H* value.
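The forward pass, the softmax transformation and the ensemble average can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the rminer implementation used in the study. The weights are random and untrained (no BFGS fitting is performed), and the weight layout and function names are hypothetical.

```python
import math
import random


def logistic(z):
    """Logistic activation used by the hidden and binary output neurons."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(outputs):
    """Turn linear outputs y_1..y_Nc into class probabilities p(i)."""
    shift = max(outputs)  # subtracting the maximum avoids overflow
    exps = [math.exp(o - shift) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_output(x, hidden_weights, output_weights):
    """One-hidden-layer MLP for binary classification.

    hidden_weights: H rows of [bias, w_1, ..., w_I] (assumed layout)
    output_weights: [bias, w_1, ..., w_H]           (assumed layout)
    """
    hidden = [logistic(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in hidden_weights]
    z = output_weights[0] + sum(w * h for w, h in zip(output_weights[1:], hidden))
    return logistic(z)


def random_network(n_inputs, n_hidden, rng):
    """Draw starting weights; a real model would now be trained."""
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def ensemble_output(x, networks):
    """MLPE prediction: the average of the individual network outputs."""
    predictions = [mlp_output(x, h, o) for h, o in networks]
    return sum(predictions) / len(predictions)


rng = random.Random(0)
networks = [random_network(n_inputs=3, n_hidden=4, rng=rng) for _ in range(5)]
ensemble_probability = ensemble_output([0.2, -1.0, 0.5], networks)
class_probs = softmax([2.0, 1.0, 0.1])
```

Averaging over several starting points is what reduces the dependence on any single set of initial weights.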

#### Support Vector Machine (SVM) model

When compared with NNs, SVMs present theoretical advantages, such as the absence of local minima in the learning phase.^{14} The basic idea is to transform the input *x* ∈ *R ^{I}* into a high *m*-dimensional feature space by using a nonlinear mapping. Then, the SVM finds the best linear separating hyperplane, related to a set of support vector points, in the feature space (Figure 1 (b)). The transformation φ(*x*) depends on a kernel function.

Here, the SVM uses the sequential minimal optimization (SMO) learning algorithm with the popular Gaussian kernel, which has fewer parameters than other kernels (e.g. polynomial): *K*(*X,X′*) = *exp*(−*γ*‖*X − X′*‖^{2}), *γ* > 0. The classification performance is affected by two hyperparameters: *γ*, the parameter of the kernel, and *C*, a penalty parameter. The probabilistic SVM output is given by ^{37}

$$
f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x_i, x) + b, \qquad p(x) = \frac{1}{1 + \exp\left(A f(x) + B\right)}
$$

where *m* is the number of support vectors, *y _{i}* ∈ {−1, 1} is the output for a binary classification, *b* and *α _{i}* are coefficients of the model, and *A* and *B* are determined by solving a regularized maximum likelihood problem. When *N _{c}* > 2, the one-against-one approach is used, which trains *N _{c}*(*N _{c}* − 1)/2 binary classifiers, and the output is given by pairwise coupling.^{37}
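A minimal Python sketch of the Gaussian kernel, the SVM decision value and the sigmoid that maps it to a probability is given below. The support vectors, coefficients and the values of *A*, *B* and *γ* are hypothetical toy numbers. In practice they would come from SMO training and from the regularized maximum likelihood fit, neither of which is performed here.

```python
import math


def gaussian_kernel(x, x_prime, gamma):
    """K(X, X') = exp(-gamma * ||X - X'||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """f(x) = sum_i y_i * alpha_i * K(x_i, x) + b over the support vectors."""
    return b + sum(y * a * gaussian_kernel(sv, x, gamma)
                   for sv, y, a in zip(support_vectors, labels, alphas))


def sigmoid_probability(f_value, A, B):
    """p(x) = 1 / (1 + exp(A * f(x) + B)); A is typically negative."""
    return 1.0 / (1.0 + math.exp(A * f_value + B))


def one_against_one_count(n_classes):
    """Number of binary classifiers trained when N_c > 2."""
    return n_classes * (n_classes - 1) // 2


# Hypothetical two-support-vector model, for illustration only.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
alphas = [0.8, 0.8]
f_pos = decision_value([1.0, 1.0], support_vectors, labels, alphas, b=0.0, gamma=0.5)
f_neg = decision_value([0.0, 0.0], support_vectors, labels, alphas, b=0.0, gamma=0.5)
p_pos = sigmoid_probability(f_pos, A=-2.0, B=0.0)
p_neg = sigmoid_probability(f_neg, A=-2.0, B=0.0)
```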

#### Sensitivity Analysis

The sensitivity analysis is a simple procedure that is applied after the training procedure and analyzes the model responses when a given input is changed. Let *y _{a,j}* denote the output obtained by holding all input variables at their average values except *x _{a}*, which varies through its entire range (*x _{a,j}*, with *j* ∈ {1, 2, …, *L*} levels). The variance (*V _{a}*) of *y _{a,j}* is used as a measure of input relevance.^{38} If *N _{c}* > 2 (multi-class), *V _{a}* is set as the sum of the variances for each output class probability (*p*(*c*) _{a,j}). A high variance (*V _{a}*) suggests a high *x _{a}* relevance, thus the input relative importance (*R _{a}*) is given by:

$$
R_a = \frac{V_a}{\sum_{i=1}^{I} V_i} \times 100 \ (\%)
$$

For a more detailed analysis, the variable effect characteristic (VEC) curve has been proposed by Cortez *et al*., which plots the *x _{a,j}* values (x-axis) versus the *y _{a,j}* predictions (y-axis).^{39}
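The procedure can be sketched in Python as follows. The toy data, the linear toy model and the function names are assumptions made for illustration only, and the sample variance is used for *V _{a}* (the choice of variance estimator does not change *R _{a}* because every input uses the same number of levels).

```python
from statistics import mean, variance


def sensitivity_responses(model, rows, a, levels=6):
    """Return y_{a,j}: predictions while input a sweeps its range in `levels` steps."""
    baseline = [mean(column) for column in zip(*rows)]  # other inputs at their averages
    low = min(row[a] for row in rows)
    high = max(row[a] for row in rows)
    responses = []
    for j in range(levels):
        point = list(baseline)
        point[a] = low + (high - low) * j / (levels - 1)
        responses.append(model(point))
    return responses


def relative_importance(model, rows, levels=6):
    """Return R_a (in %) for every input, from the variances V_a of y_{a,j}."""
    n_inputs = len(rows[0])
    variances = [variance(sensitivity_responses(model, rows, a, levels))
                 for a in range(n_inputs)]
    total = sum(variances)
    if total == 0:
        raise ValueError("the model does not respond to any input")
    return [100.0 * v / total for v in variances]


def toy_model(x):
    """Hypothetical fitted model: ignores the second input entirely."""
    return 3 * x[0] + x[2]


# Hypothetical data with three inputs.
rows = [[0, 0, 0], [1, 5, 1], [2, 10, 2]]
importance = relative_importance(toy_model, rows)
vec_curve = sensitivity_responses(toy_model, rows, a=0)  # y-values of a VEC curve
```

In this toy case the ignored input receives zero importance and the remaining importance is split according to how strongly the output varies with each input.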

#### Generalized Additive Model (GAM)

GAM^{32,-40} is an extension of the Generalized Linear Model (GLM)^{41} where the modeling of the mean functions relaxes the assumption of linearity, albeit additivity of the mean function pertaining to the covariates is assumed. Whilst the mean functions of some covariates may be assumed to be linear, the non-linear mean functions are modeled using smoothing methods, such as kernel smoothers, lowess, smoothing splines or regression splines. In general, the model has the following structure

$$
g(\mu) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)
$$

where *μ* = *E*(*Y*) for a response variable with some exponential family distribution, *g* is the *link* function, and the *f _{i}* are smooth functions of the covariates *X _{i}*, for each *i* = 1, 2, …, *p*.

GAMs provide more flexibility than GLMs, as they relax the hypothesis of linear dependence between the covariates and the expected value of the response variable. The main difficulty with GAMs lies in estimating the smooth functions *f _{i}*, and there are different ways to address this. One of the most common alternatives is based on splines, which allow GAM estimation to be reduced to the GLM context.^{42} Smoothing splines^{43} use as many knots as there are unique values of the covariate *X _{i}* and control the model’s smoothness by adding a penalty to the least squares fitting objective.^{44,45}

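Generalized additive models can be used in virtually any setting where linear models are used. For a single observation (the *i*^{th}), the basic idea is to replace the linear predictor $\sum_{j} \beta_j x_{ij}$ with an additive predictor $\sum_{j} f_j(x_{ij})$, where each $f_j$ is a smooth function.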

In the logistic regression model, the outcome *y _{i}* is ‘0’ or ‘1’, with ‘1’ indicating an event and ‘0’ indicating no event. (In this article, ‘1’ indicates absence of heart disease and ‘0’ indicates presence of heart disease in a patient.) The generalized additive logistic model then assumes that the log-odds are given by

$$
\log\left[\frac{p(x_i)}{1 - p(x_i)}\right] = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip})
$$

where *p*(*x _{i}*) is the probability of an event for observation *i*, and *f _{1}*, *f _{2}*, …, *f _{p}* are smooth functions estimated by a spline algorithm. For more details, see these references.^{23-32}

#### Performance Evaluation Measures

#### Classification Accuracy (ACC)

Classification accuracy refers to the ability of the model to correctly predict the class label of new or previously unseen data. It is the percentage (%) of test set examples correctly classified by the classifier. The quality of classification can be assessed through the overall accuracy, that is

$$
\mathrm{ACC}(T) = \frac{\sum_{t \in T} \mathrm{assess}(t)}{|T|} \times 100, \qquad \mathrm{assess}(t) = \begin{cases} 1, & \text{if } \mathrm{classify}(t) = t.c \\ 0, & \text{otherwise} \end{cases}
$$

where *T* is the set of data items to be classified (the test set in this case), *t* ∈ *T*, *t.c* is the class of item *t*, and classify(*t*) returns the classification of *t* by the classifier used (here, SVM and MLPE). For more details see.^{46}

#### Area under Curve (AUC)

AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate against the false positive rate as the threshold for classifying an item as 0 or 1 is increased from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. One characteristic of the AUC is that it is independent of the fraction of the test population that is class 0 or class 1; this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
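One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below implements that pairwise definition. The scores and the function name are toy values and assumptions for illustration.

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy scores: three of the four positive/negative pairs are ranked correctly.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the value is a ratio over positive/negative pairs, changing the proportion of the two classes does not by itself change the AUC, which matches the balance-independence noted above.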

#### k-fold Cross Validation

*k*-fold cross validation is a common technique for estimating the performance of a classifier. Given a set of *m* training examples, a single run of *k*-fold cross validation proceeds as follows:

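1. Arrange the training examples in a random order.
2. Divide the training examples into *k* folds (*k* chunks of approximately *m/k* examples each).
3. For *i* = 1, 2, …, *k*:
   - Train the classifier using all the examples that do not belong to fold *i*.
   - Test the classifier on all the examples in fold *i*.
   - Compute *n _{i}*, the number of examples in fold *i* that were wrongly classified.
4. Return the following estimate of the classifier error:

$$
E = \frac{\sum_{i=1}^{k} n_i}{m}
$$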

To obtain an accurate estimate of the accuracy of a classifier, *k*-fold cross validation is run several times, each with a different random arrangement in Step 1. The results of the runs are then averaged to produce the final classification accuracy. For more details see.^{14}
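The repeated *k*-fold procedure can be sketched in Python as below. The majority-class learner is a hypothetical placeholder standing in for SVM or MLPE, the data are invented, and the function names are assumptions of this sketch.

```python
import random


def k_fold_error(examples, labels, train_fn, k, rng):
    """One run of k-fold cross validation; returns the error estimate."""
    m = len(examples)
    order = list(range(m))
    rng.shuffle(order)                        # Step 1: random arrangement
    folds = [order[i::k] for i in range(k)]   # Step 2: k chunks of about m/k
    wrong = 0
    for test_idx in folds:                    # Step 3: hold out each fold once
        held_out = set(test_idx)
        train_idx = [j for j in order if j not in held_out]
        predict = train_fn([examples[j] for j in train_idx],
                           [labels[j] for j in train_idx])
        wrong += sum(1 for j in test_idx if predict(examples[j]) != labels[j])
    return wrong / m                          # Step 4: misclassified / m


def repeated_k_fold_accuracy(examples, labels, train_fn, k=10, runs=5, seed=0):
    """Average accuracy over several runs with different random arrangements."""
    rng = random.Random(seed)
    errors = [k_fold_error(examples, labels, train_fn, k, rng) for _ in range(runs)]
    return 1.0 - sum(errors) / runs


def train_majority(train_x, train_y):
    """Placeholder learner: always predicts the most frequent training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


# Invented data: 30 cases of class 0 and 10 cases of class 1.
examples = list(range(40))
labels = [0] * 30 + [1] * 10
accuracy = repeated_k_fold_accuracy(examples, labels, train_majority, k=10, runs=5)
```

With `k=10` and `runs=5` this mirrors the "10-fold cross validation with 5 runs" setting used in the experiments, although any learner with the same train-then-predict shape could be passed as `train_fn`.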

All GAM regression and data mining work was performed in the R statistical software with the appropriate library packages^{40-47} (http://www3.dsi.uminho.pt/pcortez/rminer.html).^{34}

## RESULTS AND DISCUSSIONS

Table 2 presents the summarized results of the generalized additive model used for heart disease diagnosis. The response variable is whether or not a patient has heart disease; the rest of the variables are cofactors. GAM estimation has two parts: parametric estimation for the cofactors that enter the model parametrically, and non-parametric estimation for the smoothed cofactors. In the present article, Age is the only smoothed cofactor, and the rest are estimated parametrically. The detailed results and interpretation of Table 2 (binomial model fitted with logit link) are as follows. The GAM regression coefficients give the change in the log odds of heart disease (the response) for a one-unit increase in a cofactor (predictor). Here, P-values up to approximately the 10% level are considered significant, and those between 10% and approximately 20% partially significant.^{40,41-49,50}
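Since a coefficient is a change in log odds, exponentiating it gives the corresponding multiplicative change in the odds (the odds ratio). The short Python sketch below shows this conversion for two estimates taken from Table 2. The function names are hypothetical, and the conversion is a standard property of logit models rather than a result reported by the authors.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a cofactor."""
    return math.exp(coefficient)


def probability_from_log_odds(log_odds):
    """Invert the logit link: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Estimates copied from Table 2.
or_chest_pain_4 = odds_ratio(2.777748)   # positive estimate: odds ratio above 1
or_max_hr = odds_ratio(-0.032619)        # negative estimate: odds ratio below 1
```

A positive estimate therefore multiplies the odds of HD by a factor greater than 1, and a negative estimate by a factor smaller than 1.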

#### Results of Estimation of Parametric coefficients

Heart disease (HD) has a very strong, positive and significant association with a patient's chest pain type. Of the four types of chest pain, asymptomatic chest pain changes the log odds of HD by 2.7777 with p-value 0.0008. Therefore, a patient has a higher chance of HD if he/she has asymptomatic chest pain.

##### Table 1

##### Table 2

Estimation of parametric coefficients

| Covariates | Estimate | Standard Error | Z value | p-value |
|---|---|---|---|---|
| Intercept | -6.644423 | 2.600914 | -2.555 | 0.010629 * |
| Chest Pain 2 | 1.498281 | 0.963307 | 1.555 | 0.119862 |
| Chest Pain 3 | 0.662778 | 0.824066 | 0.804 | 0.421237 |
| Chest Pain 4 | 2.777748 | 0.829641 | 3.348 | 0.000814 *** |
| Cholesterol | 0.009850 | 0.004513 | 2.183 | 0.029053 * |
| Max. HR | -0.032619 | 0.011325 | -2.880 | 0.003974 ** |
| Old peak | 0.515073 | 0.223007 | 2.310 | 0.020906 * |
| Resting BP | 0.024378 | 0.011871 | 2.053 | 0.040025 * |
| Resting ECG 2 | 2.187153 | 3.543705 | 0.617 | 0.537107 |
| Resting ECG 3 | 0.768672 | 0.439692 | 1.748 | 0.080429 . |
| Sex 2 | 2.080282 | 0.624856 | 3.329 | 0.000871 *** |
| Thal 2 | 0.063903 | 0.845742 | 0.076 | 0.939771 |
| Thal 3 | 1.693988 | 0.477088 | 3.551 | 0.000384 *** |
| Vessel | 1.263642 | 0.285799 | 4.421 | <0.0001 *** |

Approximate significance of smooth terms (non-parametric)

| Smooth Covariate | Edf | Ref. df | Chi.sq | p-value |
|---|---|---|---|---|
| Age | 8.1 | 8.593 | 14.18 | 0.0957 . |

##### Table 3

In the fitted GAM, for every one-unit increase in Cholesterol the log odds of HD increase by 0.0098 with p-value 0.029. Cholesterol has a positive, significant association with HD, which indicates that patients with high Cholesterol have a higher chance of HD.

HD has a strong, negative and significant association with a patient's maximum heart rate (Max. HR). For every one-unit increase in Max. HR the log odds of HD decrease by 0.0326 with p-value 0.003. That means patients with a higher maximum heart rate have a lower risk of HD.

For a one-unit increase in Old peak the log odds of HD increase by 0.5150 with p-value 0.020. HD is positively and significantly associated with Old peak. Therefore, patients with a high Old peak value have a higher risk of HD.

In this fitted GAM, for every one-unit increase in Resting BP the log odds of HD increase by 0.0243 with p-value 0.040. Resting BP has a positive, significant association with HD, which indicates that patients with high Resting BP have a higher chance of HD.

HD is positively and significantly associated with a patient's Resting ECG result. Of the three types of Resting ECG result, hypertrophy changes the log odds of HD by 0.7686 with p-value 0.080. Therefore, patients have a higher chance of HD if their Resting ECG shows hypertrophy than if it shows another result.

The sex (gender) of a patient has a very strong, positive and significant association with HD. Being male changes the log odds of HD by 2.0802, relative to being female, with p-value <0.001. This indicates that male patients have a higher chance of HD.

HD has a very strong, positive and significant association with the thallium heart scan (Thal) result. A reversible defect in the thallium heart scan report changes the log odds of HD by 1.6939 with p-value <0.001. That is, a patient has a higher chance of HD if his/her thallium heart scan report shows a reversible defect than if it shows another result.

The number of major vessels (Vessel), treated as a discrete variable in this fitted GAM, has a very strong, positive and significant association with HD. Every increase of one in Vessel raises the log odds of HD by 1.2636 with p-value <0.001.

#### Results of Non-parametric estimation for approximate significance of Smooth term

In this fitted GAM, only one cofactor, Age, is used as a smooth term. As this is a non-parametric method of estimation, the Chi-square test statistic is used for testing the hypothesis. From Table 2, the smooth term for the cofactor Age is partially significant, with p-value 0.0957.

It is also noted from Table 2 that the fitted GAM has an adjusted R-square value of 0.70, with 65% of the deviance explained. The UBRE (unbiased risk estimator) score is -0.2423, which is also very low compared with other models.

From Table 2, the final selected GAM fitted binary logistic model of the Heart disease (y) is shown below
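As a hedged reconstruction rather than the authors' printed formula, the fitted model can be written out from the Table 2 estimates. This sketch assumes that every parametric term listed in Table 2 enters the linear predictor, uses $[\cdot]$ as an indicator for a factor level (a notation introduced here), and writes $s(\text{Age})$ for the estimated smooth function of Age:

$$
\begin{aligned}
\log\frac{P(\mathrm{HD})}{1 - P(\mathrm{HD})} ={}& -6.644423 + 1.498281\,[\text{Chest Pain}=2] + 0.662778\,[\text{Chest Pain}=3] + 2.777748\,[\text{Chest Pain}=4] \\
& + 0.009850\,\text{Cholesterol} - 0.032619\,\text{Max. HR} + 0.515073\,\text{Old peak} + 0.024378\,\text{Resting BP} \\
& + 2.187153\,[\text{Resting ECG}=2] + 0.768672\,[\text{Resting ECG}=3] + 2.080282\,[\text{Sex}=2] \\
& + 0.063903\,[\text{Thal}=2] + 1.693988\,[\text{Thal}=3] + 1.263642\,\text{Vessel} + s(\text{Age})
\end{aligned}
$$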

In the above predictive formula, all the cofactors except Age enter the additive model parametrically. Age is the only smooth term, and its approximate significance is judged through a non-parametric method (Chi-square test).

In Figures 2 and 3, the GAM diagnostic plots are examined for the binomial logit model. Figure 2(a) shows the histogram of the residuals for the binomial logit GAM, which indicates that the residuals are normally distributed. Figure 2(b) shows the plot of the smooth term for the cofactor Age with its confidence band; it shows that the effect of Age is non-linear.

In Figure 3(a), the absolute residual values are plotted against the fitted values of the GAM. This residual plot is essentially flat, indicating that the variance is constant with respect to the mean. Figure 3(b) shows the normal probability plot for the fitted model, which reveals no systematic departure: there is no evidence of lack of fit with respect to the response distribution, the variables, or outliers.

#### Results of Data Mining Techniques

Table 3 presents the results of the data mining techniques for heart disease diagnosis. Two classification methods, SVM and MLPE, are used. Two performance measures, the classification accuracy rate (ACC) and the area under the curve (AUC), are evaluated using 10-fold cross validation with 5 runs in each experiment. Table 3 shows that SVM is superior to MLPE on both performance measures. After 10-fold cross validation with 5 runs, the average ACC for SVM is almost 85%, whereas MLPE shows an accuracy rate of 82%. The AUC values for SVM and MLPE are almost 0.90 and 0.86, respectively.

Figure 4 shows the plots from the sensitivity analysis under SVM. Figure 4(a) shows the input importance bar chart for heart disease diagnosis. Maximum heart rate is the most important input variable for heart disease diagnosis under SVM (the better of the two data mining classifiers). Figure 4(b) shows the variable effect characteristic (VEC) curve for Max. HR, which is decreasing; the results from Table 2 also suggest this.

## CONCLUSION

The current article considers heart disease (HD), that is, whether or not a patient has heart disease, as the response variable. It is a binary variable with values ‘1’ and ‘2’, which stand for absence and presence of heart disease, respectively. HD has been modeled using a generalized additive model. The results of the fitted GAM are displayed in Table 2.

The reported results (Table 2), though not completely conclusive, are revealing. The determinants of HD are derived satisfying the following regression analysis criteria. First, the determinants are selected based on the analyses of the fitted GAM. Second, the final model is selected based on UBRE.^{40-47} Third, the final model is justified based on GAM diagnostic plots. Fourth, the standard errors of the estimates are very small, indicating that the estimates are stable.^{48} Fifth, the final model of HD is selected by locating the appropriate statistical distribution, identified herein as the binomial distribution. For further details, please see the references.^{49,50}

To the best of our knowledge, the present models (Results and Discussions section) can be considered a good first building block of a regression analysis. The models may assist treatment decision making using individual patient risk factors and the benefits of a specific treatment. The results point to several interesting conclusions, and these findings may help medical practitioners provide better treatment. The thallium scan report and chest pain type are highly important for identifying heart disease patients. Male patients in particular are advised to take care of their heart as they grow older.