^{1}

^{2}

^{3}

Department of Mathematics, NSHM Knowledge Campus, Durgapur-713212, West Bengal, INDIA

Senior Bio-statistician, Department of Bio-Statistics, Quintiles Clinical Research, Bangalore, Karnataka, INDIA

Research Scholar, Department of Statistics, Aliah University, Kolkata, West Bengal, INDIA

The yearly death rate due to heart disease is increasing. Major factors behind this increase are (a) misdiagnosis by medical doctors and (b) ignorance on the part of patients. Heart disease can be described as any kind of disorder that affects the heart.

The ‘statlog’ heart disease dataset from the UCI Machine Learning Repository, covering 270 patients, is used in this article. The dataset comprises attributes of patients examined for heart disease, together with a diagnosis confirming whether heart disease is present or absent in each patient. The present article aims to identify the risk factors/variables which influence this diagnosis. Classification is a very important part of disease diagnosis, but identifying the risk factors/variables is also relevant. Two classification techniques, namely Support Vector Machines (SVM) and Multi-Layer Perceptron ensembles (MLPE), and one advanced regression technique, the Generalized Additive Model (GAM) with binomial distribution and ‘logit’ link, have been used for diagnosis and for identification of risk factors/variables.

The GAM explains approximately 65% of the deviance, with an adjusted R-square value of approximately 0.70. Sensitivity analysis has been performed under SVM, which is the best model for this dataset with a classification accuracy rate of approximately 85%. MLPE gives a classification accuracy rate of approximately 82%. Maximum heart rate, vessel, old peak, chest pain and thallium scan are the most important factors/variables found through both the sensitivity analysis under SVM and the GAM.

The present article attempts to extract some new information regarding heart disease through probabilistic modeling, which may provide better assistance for treatment decision making using the individual patient risk factors and the benefits of a specific treatment. These findings may help medical practitioners provide better medical treatment.

The heart, a strong muscle about the size of a fist, is one of the most essential organs of the human body. Any disorder that affects the heart, from infection to genetic defects and blood vessel disease, is referred to as heart disease.^{1} Heart disease is a serious disease, and proper diagnosis at an early stage remains a challenging task.^{2} In fact, up to 25% of people with heart disease have no symptoms despite insufficient blood flow to the heart, a condition that is referred to as silent heart disease.^{3} In the United States of America about 600,000 people die of heart disease every year, which amounts to one in every four deaths.^{4} Diagnosis usually begins when a patient visits the doctor to have symptoms checked. Patients may present with shortness of breath, pain in the chest or back, painful and persistent coughing, or any number of other symptoms, none of which immediately alerts the doctor to a diagnosis of heart disease. Many studies of heart disease diagnosis have been carried out all over the world, generally using artificial intelligence techniques or data mining methods.^{5-8} The use of data mining techniques in medical diagnosis has been increasing gradually. There is no doubt that the evaluation of data taken from patients and the decisions of experts are the most important factors in diagnosis. However, artificial intelligence or machine learning techniques are sometimes used to support disease diagnosis.^{5-9-11}

In health care, data mining or statistical machine learning plays a vital role in medical applications including diagnosis, prognosis, and therapy.^{12} Clinical data mining involves the conceptualization, extraction, analysis, and interpretation of the available clinical data for practical knowledge-building, clinical decision making, and practitioner reflection.^{12} A medical diagnosis is a classification problem.^{13} In predictive data mining, the data set consists of instances; each instance is characterized by attributes or features, and another special attribute represents the outcome variable or the class.^{14} Often, the goal of a data mining project is to build a model from the available data. Thus, data mining models are objective rather than subjective, since they are driven by the available data.

Data mining (DM) techniques^{15} aim at extracting high-level knowledge from raw data. There are several DM algorithms, each with its own advantages. DM techniques perform regression and classification tasks. In the case of neural networks (NNs), the back propagation algorithm was first introduced in 1974^{16} and later popularized in 1986.^{17} Since then, NNs have become increasingly used. More recently, support vector machines (SVMs) have also been proposed.^{18,19} Due to their higher flexibility and nonlinear learning capabilities, both NNs and SVMs are gaining attention within the DM field, often attaining high predictive performance.^{20,21} SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. In effect, the SVM was recently considered one of the most influential DM algorithms.^{22} Therefore, this paper studies the use of SVM for heart disease diagnosis.

In the statistical analysis of clinical trials and observational studies, the identification of and adjustment for prognostic factors is an important activity for obtaining valid outcomes. The failure to consider important prognostic variables, particularly in observational studies, can lead to errors in estimating treatment differences. In addition, incorrect modeling of prognostic factors can result in the failure to identify nonlinear trends or threshold effects on survival. This article describes flexible statistical methods that may be used to identify and characterize the effect of potential prognostic factors on disease endpoints. These methods are called ‘Generalized Additive Models’ (GAM).^{23} Many mathematical and statistical methodologies for building classification models, from classical statistical methods to machine learning theory to classification trees, have been reviewed and compared.^{24-27} Much research has been done on better and more accurate models for the Heart Disease Dataset. The work^{28} gives a knowledge driven approach. Initially, Logistic Regression was used by Dr. Robert Detrano for heart disease diagnosis.^{29} Newton Cheung utilized the C4.5, Naive Bayes, BNND and BNNF algorithms and reached classification accuracies of 81.11%, 81.48%, 81.11% and 80.96%, respectively.^{30} Another study proposed a method that uses an artificial immune system (AIS) and obtained higher classification accuracy than the previous works.^{31} Comparative results of many studies performed on this heart disease data have also been reported.^{10} In the present article, 10-fold cross-validation with 5 runs in each experiment has been performed to obtain a more stable classification accuracy rate. One aim of the present article is to explore the relationship between a patient's chance of having heart disease and other biomedical parameters used as cofactors. Because of the complex relationship between the cofactors and the response variable, GAM has been introduced here for better accuracy in prediction.
Another aim of this study is to find the best classifier, one which gives good performance evaluation measures, and to find the important input variables for heart disease diagnosis using strong data mining techniques. Many authors have applied various classification techniques to this dataset for heart disease diagnosis,^{5-11} but SVM and MLPE have probably not been used under a proper modeling scheme. This study shows a high classification accuracy rate and presents a variable input importance chart for heart disease diagnosis.

In this research work, we used the heart disease dataset obtained from the UCI Machine Learning Repository to develop intelligent systems using data mining and GAM for the diagnosis of heart disease. The results obtained from these systems were compared, and the system with the highest recognition rate was taken as the best system for diagnosis of heart disease. This system is intended to reduce the problem of misdiagnosis of heart disease and to identify the risk factors or important biomedical parameters responsible for probable heart disease. This can guide doctors about prognostic factors and give patients greater awareness of heart disease.

The present article considers 270 patients with 14 factors or variables. This secondary data set is taken from the report. The data set can be downloaded at

In the present article, data mining techniques with sensitivity analysis are used to diagnose heart disease and to find the factors most responsible for this diagnosis. In addition, generalized additive logistic models are applied to find the risk factors for heart disease. Among data mining techniques, Multi-Layer Perceptron ensembles (MLPE) and Support Vector Machines (SVM) are used for classification, and sensitivity analysis is then performed only on the better of these two classifiers for this heart disease data set.^{20}

The best GAM^{32} can be selected through model checking criteria, namely the R-square value, the AIC or UBRE value, and regression diagnostic plots such as the normal probability plot and the plot of residuals against fitted values.^{14,32} Whether a cofactor is significant is judged through its p-value. For this heart disease data set, the absence or presence of heart disease is taken as the response variable ($Y$), and age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise induced angina, oldpeak, slope of the peak exercise ST segment, number of major vessels and thal (thallium scan) are the cofactors ($X_i$'s).

Data mining techniques aim to classify the data using different classifiers, whereas the GAM aims to identify the risk factors for this disease. Brief descriptions of the methods used are given below.

DM is an iterative process that consists of several steps. CRISP-DM,^{33} a tool-neutral methodology supported by industry (e.g. SPSS, DaimlerChrysler), partitions a DM project into 6 phases: 1. business understanding; 2. data understanding; 3. data preparation; 4. modeling; 5. evaluation; and 6. deployment.

This work addresses steps 4 and 5, with an emphasis on the use of NNs and SVMs to solve classification and regression goals. Both tasks require supervised learning, where a model is adjusted to a dataset of examples that map $I$ inputs into a given target.^{34}

Common metrics for evaluating a classification model are^{35} the ROC area (AUC), the confusion matrix, accuracy (ACC), and the true positive/negative rates (TPR/TNR). A classifier should present high values of ACC, TPR, TNR and AUC. The model’s generalization performance is often estimated by holdout validation (i.e. a train/test split) or by k-fold cross-validation.^{14} The latter is more robust but requires around k times more computation, since k models are fitted.
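The following minimal sketch illustrates how ACC, TPR and TNR follow from the confusion matrix counts. It is not the code used in the study; the function names and the toy labels are hypothetical, and the class coding (1 = absence, 2 = presence) is taken from the variable description table.

```python
def confusion_counts(y_true, y_pred, positive=2):
    """Count TP, TN, FP, FN for a binary problem.

    `positive` is the label treated as 'disease present'
    (2 in this dataset's coding; an assumption of this sketch).
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred, positive=2):
    """Return accuracy, true positive rate and true negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    return {
        "ACC": (tp + tn) / total,
        "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),
        "TNR": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels, for illustration only (not taken from the dataset).
y_true = [2, 2, 2, 1, 1, 1, 1, 2]
y_pred = [2, 2, 1, 1, 1, 2, 1, 2]
metrics = classification_metrics(y_true, y_pred)
```

Reporting TPR and TNR alongside ACC shows whether errors are concentrated in the diseased or the healthy group, which a single accuracy figure hides.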

In DM techniques, NN refers to the popular multilayer perceptron (MLP). A major concern in its use is the difficulty of defining the proper network for a specific application, due to the sensitivity to the initial conditions and to overfitting and underfitting problems which limit its generalization capability. A very promising way to partially overcome such drawbacks is the use of MLP ensembles (MLPE); averaging and voting techniques are widely used in classical statistical pattern recognition and can be fruitfully applied to MLP classifiers. For classification problems MLPE are used, which are combinations of MLP models. Each network includes one hidden layer of $H$ hidden nodes:

$$y_i = f_i\left(w_{i,0} + \sum_{j=I+1}^{I+H} f_j\left(\sum_{n=1}^{I} x_n w_{j,n} + w_{j,0}\right) w_{i,j}\right)$$

where $y_i$ is the output of the network for node $i$, $w_{i,j}$ is the weight of the connection from node $j$ to node $i$, and $f_i$ is the activation function of node $i$. For a task with $N_c$ classes, the outputs are transformed into class probabilities by

$$p(i)=\frac{\exp(y_i)}{\sum_{c=1}^{N_c}\exp(y_c)}$$

where $p(i)$ is the predicted probability and $y_i$ is the NN output for class $i$.^{14} Since NN training is not optimal, the final solution depends on the choice of starting weights. To address this issue, several networks are trained and then either the NN with the lowest error is selected or an ensemble of all NNs is used, whose output is the average of the individual predictions.^{14} In general, ensembles are better than individual learners.^{36} The final NN performance depends crucially on the number of hidden nodes. The simplest NN has
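As an illustration of the network output, the class probability transform and the ensemble average described above, the sketch below runs a forward pass through several small networks and averages their class probabilities. It is not the study's code: the logistic hidden activation, the linear output nodes, the weight layout and all weight values are assumptions of this sketch.

```python
import math


def logistic(z):
    """Logistic activation (assumed here for the hidden nodes)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_outputs(x, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer MLP.

    Each weight row is [bias, w_1, ..., w_k]; output nodes are linear.
    """
    hidden = [
        logistic(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
        for w in hidden_weights
    ]
    return [
        w[0] + sum(wi * hi for wi, hi in zip(w[1:], hidden))
        for w in output_weights
    ]


def softmax(outputs):
    """p(i) = exp(y_i) / sum_c exp(y_c), shifted by the max for stability."""
    shift = max(outputs)
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probabilities(x, networks):
    """Average the class probabilities of several independently trained MLPs."""
    per_net = [softmax(mlp_outputs(x, hw, ow)) for hw, ow in networks]
    n_classes = len(per_net[0])
    return [sum(p[c] for p in per_net) / len(per_net) for c in range(n_classes)]


# Two hypothetical networks (2 inputs, 2 hidden nodes, 2 classes).
networks = [
    ([[0.1, 0.5, -0.3], [-0.2, 0.4, 0.9]], [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]),
    ([[0.3, -0.6, 0.2], [0.0, 0.7, -0.5]], [[0.1, 0.8, -0.4], [-0.1, -0.8, 0.4]]),
]
probs = ensemble_probabilities([1.0, 2.0], networks)
```

Averaging probabilities rather than hard votes keeps the ensemble output usable for threshold-based measures such as the ROC curve.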

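When compared with NNs, SVMs present theoretical advantages, such as the absence of local minima in the learning phase.^{14} The basic idea is to transform the input $x \in \mathbb{R}^{I}$ into a high-dimensional feature space by means of a nonlinear mapping, and then to find the best linear separating hyperplane in that feature space.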

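Here, SVM uses the sequential minimal optimization (SMO) learning algorithm with the popular Gaussian kernel, which has fewer parameters than other kernels (e.g. polynomial): $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^{2})$, $\gamma > 0$.^{37} The probabilistic SVM output is given by

$$\begin{array}{l}f(x_i)=\sum_{j=1}^{m} y_j \alpha_j K\left(x_j, x_i\right)+b\\ p(i)=1/\left(1+\exp\left(A f(x_i)+B\right)\right)\end{array}$$

where $m$ is the number of support vectors, $y_j \in \{-1, 1\}$ is the class label of support vector $x_j$, $\alpha_j$ and $b$ are coefficients of the model, and $A$ and $B$ are parameters estimated by regularized maximum likelihood.^{37}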
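To make the decision function and its probabilistic output concrete, the sketch below evaluates a Gaussian-kernel SVM on a toy problem. It is an illustration under stated assumptions, not the fitted model from this study: the support vectors, multipliers, `gamma`, `A` and `B` are hypothetical values, and no training (SMO) is performed.

```python
import math


def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, b, gamma):
    """f(x) = sum_j y_j * alpha_j * K(x_j, x) + b."""
    return sum(
        y * alpha * gaussian_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b


def sigmoid_probability(f_value, A, B):
    """p = 1 / (1 + exp(A * f + B)); A is negative for a sensible mapping."""
    return 1.0 / (1.0 + math.exp(A * f_value + B))


# Hypothetical model: one support vector per class.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
b, gamma = 0.0, 0.5

f_near_positive = svm_decision([1.0, 1.0], support_vectors, labels, alphas, b, gamma)
f_near_negative = svm_decision([0.0, 0.0], support_vectors, labels, alphas, b, gamma)
p_positive = sigmoid_probability(f_near_positive, A=-2.0, B=0.0)
```

A point close to the positive support vector receives a positive decision value and hence a probability above 0.5 under this choice of `A` and `B`.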

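The sensitivity analysis is a simple procedure that is applied after the training procedure and analyzes the model responses when a given input is changed. Let $\hat{y}_{a,j}$ denote the output obtained by holding all input variables at their average values, except $x_a$, which varies through its entire range over the levels $x_{a,j}$, with $j \in \{1, \ldots, L\}$.^{38} If a given input variable $x_a$ is relevant, the responses $\hat{y}_{a,j}$ should vary strongly across these levels. A high variance ($V_a$) of the responses therefore indicates an important input, and the relative importance ($R_a$) of input $x_a$ is given by

$$R_a=\frac{V_a}{\sum_{i=1}^{I} V_i}\times 100\ (\%)$$

For a more detailed analysis, the variable effect characteristic (VEC) curve of Cortez can be used, which plots the levels $x_{a,j}$ (x-axis) against the responses $\hat{y}_{a,j}$ (y-axis).^{39}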
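The procedure can be sketched as follows for any fitted model exposed as a prediction function. This is a minimal illustration, not the study's implementation: the function name, the number of levels and the toy linear model are assumptions, and only numeric inputs are handled.

```python
from statistics import mean, variance


def sensitivity_importance(predict, data, levels=6):
    """Relative importance R_a (%) of each input from a 1-D sensitivity analysis.

    `predict` maps one input vector to a numeric response; `data` is a list
    of input vectors used only to obtain averages and ranges.
    """
    n_inputs = len(data[0])
    averages = [mean(row[i] for row in data) for i in range(n_inputs)]
    variances = []
    for a in range(n_inputs):
        column = [row[a] for row in data]
        low, high = min(column), max(column)
        responses = []
        for j in range(levels):
            point = list(averages)  # all other inputs held at their average
            point[a] = low + (high - low) * j / (levels - 1)
            responses.append(predict(point))
        variances.append(variance(responses))  # V_a
    total = sum(variances)
    if total == 0:
        return [0.0] * n_inputs
    return [100.0 * v / total for v in variances]


# Hypothetical model: the second input has no effect on the response.
def toy_model(x):
    return 3.0 * x[0] + 0.0 * x[1] + 1.0 * x[2]


toy_data = [[0, 5, 0], [1, 7, 2], [2, 9, 4]]
importance = sensitivity_importance(toy_model, toy_data)
```

Plotting the `responses` collected for one input against its levels gives the corresponding VEC curve.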

GAM^{32,-40} is an extension of the Generalized Linear Model (GLM)^{41} where the modeling of the mean functions relaxes the assumption of linearity, albeit additivity of the mean function pertaining to the covariates is assumed. Whilst the mean functions of some covariates may be assumed to be linear, the non-linear mean functions are modeled using smoothing methods, such as kernel smoothers, lowess, smoothing splines or regression splines. In general, the model has the following structure

$$g(\mu)=\alpha_0+\sum_{j=1}^{p} f_j\left(X_j\right)$$

where $g$ is the link function, $\mu$ is the mean of the response, and $f_j$ is an unspecified smooth function of the covariate $X_j$.

GAMs provide more flexibility than GLMs, as they relax the hypothesis of linear dependence between the covariates and the expected value of the response variable. The main drawback of GAMs lies in the estimation of the smooth functions $f_j$.^{42} Smoothing splines^{43} use as many knots as there are unique values of the covariate $X_j$.^{44,45}

Generalized additive models can be used in virtually any setting where linear models are used. For a single observation (the $i^{th}$), the basic idea is to replace the linear component of the model, $\sum_{j} \beta_j x_{ij}$, with an additive component, $\sum_{j} f_j(x_{ij})$.

In the logistic regression model the outcome $y_i$ is binary, and the generalized additive logistic model assumes

$$\log\frac{p(y_i \mid x_{i1},\ldots,x_{ip})}{1-p(y_i \mid x_{i1},\ldots,x_{ip})}=\beta_0+f_1\left(x_{i1}\right)+\cdots+f_p\left(x_{ip}\right)$$

where $f_1, f_2, \ldots, f_p$ are smooth functions of the covariates.^{23-32}

Classification accuracy refers to the ability of the model to correctly predict the class label of new or previously unseen data. It is the percentage (%) of testing set examples correctly classified by the classifier. The quality of classification can be assessed through overall accuracy, that is,

$$\text{Accuracy}(T)=\frac{\sum_{i=1}^{|T|}\text{assess}(t_i)}{|T|},\quad t_i\in T$$

$$\text{assess}(t)=\begin{cases}1 & \text{if classify}(t)\equiv t.c\\ 0 & \text{otherwise}\end{cases}$$

where $T$ is the set of data items to be classified (the test set in this case), $t \in T$, $t.c$ is the class of item $t$, and classify$(t)$ returns the classification of $t$ by the classifier used (here, SVM and MLPE). Further details are given elsewhere.^{46}

AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate vs. the false positive rate as the threshold value for classifying an item as 0 or 1 is increased from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. One characteristic of the AUC is that it is independent of the fraction of the test population which is class 0 or class 1; this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
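One way to compute the AUC without drawing the curve uses the known equivalence between the area under the ROC curve and the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one. The sketch below follows that route; the function name, the label coding and the toy scores are assumptions made for illustration.

```python
def auc_score(y_true, scores, positive=2):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == positive]
    negatives = [s for t, s in zip(y_true, scores) if t != positive]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one item of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of 'disease present' (class 2).
toy_labels = [1, 1, 2, 2]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_score(toy_labels, toy_scores)
```

Because only the ordering of positive versus negative scores matters, the value does not change with the class proportions, in line with the property noted above.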

Arrange the training examples in a random order.

Divide the training examples into $k$ folds of approximately equal size.

For $i = 1, \ldots, k$:

(i) Train the classifier using all the examples that do not belong to fold $i$.

(ii) Test the classifier on all the examples in fold $i$.

(iii) Compute $n_i$, the number of examples in fold $i$ that were wrongly classified.

Return the following estimate of the classifier error:

$$E=\frac{\sum_{i=1}^{k} n_i}{m}$$

where $m$ is the total number of training examples. To obtain an accurate estimate of the accuracy of a classifier, $k$-fold cross-validation is run several times, each with a different random arrangement of the examples.^{14}
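The cross-validation steps above, repeated over several runs, can be sketched as follows. This is an illustration only, not the code used in the study: `train_fn` is a hypothetical callable that returns a prediction function, and the majority-class learner and toy labels stand in for the real classifiers and data.

```python
import random


def k_fold_error(examples, labels, train_fn, k=10, seed=0):
    """Estimate the classifier error E = sum(n_i) / m by k-fold cross-validation."""
    m = len(examples)
    order = list(range(m))
    random.Random(seed).shuffle(order)           # random arrangement
    folds = [order[i::k] for i in range(k)]      # k folds of near-equal size
    wrong = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in order if j not in held_out]
        model = train_fn([examples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        # n_i: misclassified examples in the held-out fold
        wrong += sum(1 for j in test_idx if model(examples[j]) != labels[j])
    return wrong / m


def repeated_k_fold_error(examples, labels, train_fn, k=10, runs=5):
    """Average the k-fold error over several runs with different shuffles."""
    errors = [k_fold_error(examples, labels, train_fn, k, seed=r) for r in range(runs)]
    return sum(errors) / runs


# Hypothetical learner: always predicts the majority class of its training set.
def majority_trainer(train_x, train_y):
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


toy_examples = list(range(20))
toy_labels = [1] * 15 + [2] * 5
toy_error = repeated_k_fold_error(toy_examples, toy_labels, majority_trainer, k=10, runs=5)
```

With `k=10` and `runs=5` this mirrors the 10-fold, 5-run scheme mentioned in the introduction; the classification accuracy rate is one minus the returned error.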

All GAM regression and data mining work was performed in the R statistical software with the appropriate library packages.^{34,40-47}

^{40,41-49,50}

Heart disease (HD) has a highly significant positive association with the chest pain type of a patient. Of the four types of chest pain, asymptomatic chest pain changes the log odds of HD by 2.7777 with p-value 0.0008. Therefore, a patient has a higher chance of HD if he/she has asymptomatic chest pain.

Variable name | Operationalization | Mean | Standard deviation | Proportion of levels of attributes |
---|---|---|---|---|
Age (Year) | Age at study | 54.43 | 9.10 | --- |
Sex | Gender (1 = Female; 2 = Male) | --- | --- | 1=32.22%; 2=67.78% |
Chest Pain | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic) | --- | --- | 1=7.41%; 2=15.56%; 3=29.26%; 4=47.78% |
Resting BP | Resting blood pressure (in mm Hg on admission to the hospital) | 131.34 | 17.86 | --- |
Cholesterol | Serum cholesterol in mg/dl | 249.66 | 51.69 | --- |
Fasting BS | Fasting blood sugar > 120 mg/dl (1 = False; 2 = True) | --- | --- | 1=85.19%; 2=14.81% |
Resting ECG | Resting electrocardiographic results (1 = Normal; 2 = Having ST-T; 3 = Hypertrophy) | --- | --- | 1=48.52%; 2=0.74%; 3=50.74% |
Max HR | Maximum heart rate achieved | 149.68 | 23.17 | --- |
Exercise Ang | Exercise induced angina (1 = No; 2 = Yes) | --- | --- | 1=67.04%; 2=32.96% |
Oldpeak | ST depression induced by exercise relative to rest | 1.05 | 1.14 | --- |
Slope | The slope of the peak exercise ST segment (1 = Up sloping; 2 = Flat; 3 = Down sloping) | --- | --- | 1=48.15%; 2=45.19%; 3=6.67% |
Vessel | Number of major vessels (0-3) colored by fluoroscopy (treated as a discrete variable) | --- | --- | 0=59.26%; 1=21.48%; 2=12.22%; 3=7.04% |
Thal | Thallium heart scan (1 = Normal; 2 = Fixed defect; 3 = Reversible defect) | --- | --- | 1=56.30%; 2=5.19%; 3=38.52% |
Heart disease | Diagnosis of heart disease (1 = Absence; 2 = Presence) | --- | --- | 1=55.56%; 2=44.44% |

Estimation of parametric coefficients

Covariates | Estimate | Standard Error | Z value | p-value |
---|---|---|---|---|
Intercept | -6.644423 | 2.600914 | -2.555 | 0.010629 |
Chest Pain 2 | 1.498281 | 0.963307 | 1.555 | 0.119862 |
Chest Pain 3 | 0.662778 | 0.824066 | 0.804 | 0.421237 |
Chest Pain 4 | 2.777748 | 0.829641 | 3.348 | 0.000814 |
Cholesterol | 0.009850 | 0.004513 | 2.183 | 0.029053 |
Max. HR | -0.032619 | 0.011325 | -2.880 | 0.003974 |
Old peak | 0.515073 | 0.223007 | 2.310 | 0.020906 |
Resting BP | 0.024378 | 0.011871 | 2.053 | 0.040025 |
Resting ECG 2 | 2.187153 | 3.543705 | 0.617 | 0.537107 |
Resting ECG 3 | 0.768672 | 0.439692 | 1.748 | 0.080429 |
Sex 2 | 2.080282 | 0.624856 | 3.329 | 0.000871 |
Thal 2 | 0.063903 | 0.845742 | 0.076 | 0.939771 |
Thal 3 | 1.693988 | 0.477088 | 3.551 | 0.000384 |
Vessel | 1.263642 | 0.285799 | 4.421 | <0.0001 |

Approximate significance of the smooth term

Smooth term | edf | Ref. df | Chi-square | p-value |
---|---|---|---|---|
Age | 8.1 | 8.593 | 14.18 | 0.0957 |

ACC = classification accuracy rate (in %); AUC = area under the curve (in 0-1).

Model | ACC 1^{st} | ACC 2^{nd} | ACC 3^{rd} | ACC 4^{th} | ACC 5^{th} | ACC Average | AUC 1^{st} | AUC 2^{nd} | AUC 3^{rd} | AUC 4^{th} | AUC 5^{th} | AUC Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SVM | 84.45 | 85.45 | 84.75 | 84.75 | 84.45 | | 0.8968 | 0.9023 | 0.8955 | 0.9028 | 0.8968 | |
MLPE | 82.20 | 80.74 | 82.22 | 81.85 | 82.22 | | 0.8724 | 0.8545 | 0.8622 | 0.8566 | 0.8594 | |

In the GAM fitted model, for every one unit change in Cholesterol the log odds of HD increase by 0.0098, with p-value 0.029. Cholesterol has a significant positive association with HD, which indicates that patients with high Cholesterol have a higher chance of HD.

HD has a highly significant negative association with the maximum heart rate (Max. HR) of a patient. For every one unit change in Max. HR the log odds of HD decrease by 0.0326, with p-value 0.003. That means patients with a higher maximum heart rate have a lower risk of HD.

For one unit change in Old peak the log odds of HD increase by 0.5150, with p-value 0.020. HD is significantly positively associated with Old peak. Therefore, patients with a high Old peak value have a higher risk of HD.

In this GAM fitted model, for every one unit change in Resting BP the log odds of HD increase by 0.0243, with p-value 0.040. Resting BP has a significant positive association with HD, which indicates that patients with high Resting BP have a higher chance of HD.

Heart disease (HD) is positively associated with the Resting ECG result of a patient, although only at the 10% significance level. Of the three types of Resting ECG, hypertrophy changes the log odds of HD by 0.7686, with p-value 0.080. Therefore, patients have a higher chance of HD if their Resting ECG result shows hypertrophy than otherwise.

The sex (gender) of a patient has a highly significant positive association with HD. Being male changes the log odds of HD by 2.0802 relative to being female, with p-value <0.001. This indicates that male patients have a higher chance of HD.

HD has a highly significant positive association with the thallium heart scan (Thal) result. A reversible defect in the thallium heart scan report changes the log odds of HD by 1.6939, with p-value <0.001. This means that a patient has a higher chance of HD if his/her thallium heart scan report shows a reversible defect than otherwise.

The number of major vessels (Vessel), treated as a discrete variable in this GAM fitted model, has a highly significant positive association with HD. Every increase of one in Vessel causes an increment of 1.2636 in the log odds of HD, with p-value <0.001.

In this GAM fitted model only one cofactor, namely Age, is used as a smoothing term. As this is a nonparametric method of estimation, the Chi-square test statistic has been used for testing the hypothesis. From

It also noticed from

From

$$\begin{aligned}\text{log odds(HD)} = {} & -6.64 + 1.49\,\text{Chest Pain 2} + 0.66\,\text{Chest Pain 3} + 2.77\,\text{Chest Pain 4} \\ & + 0.0098\,\text{Cholesterol} - 0.03\,\text{Max. HR} + 0.51\,\text{Old peak} + 0.02\,\text{Resting BP} \\ & + 2.18\,\text{Resting ECG 2} + 0.76\,\text{Resting ECG 3} + 2.08\,\text{Sex 2} + 0.06\,\text{Thal 2} \\ & + 1.69\,\text{Thal 3} + 1.26\,\text{Vessel} + f(\text{Age})\end{aligned}$$

In the above predictive formula, all the cofactors except Age enter the additive model parametrically. Age is the only smoothing term here, and its approximate significance has been judged through a non-parametric method (Chi-square test).
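Because the fitted coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. The short sketch below does this for some of the parametric estimates reported in the coefficient table; the dictionary layout and function name are choices made for this illustration, and the smooth term f(Age) is deliberately left out because its fitted values are not reported in the text.

```python
import math

# Parametric estimates copied from the coefficient table (log-odds scale).
coefficients = {
    "Chest Pain 4": 2.777748,
    "Cholesterol": 0.009850,
    "Max. HR": -0.032619,
    "Old peak": 0.515073,
    "Resting BP": 0.024378,
    "Sex 2": 2.080282,
    "Thal 3": 1.693988,
    "Vessel": 1.263642,
}

# Odds ratio for a one-unit increase (or for the level versus the reference level).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}


def probability_from_log_odds(log_odds):
    """Invert the logit link: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


for name, ratio in sorted(odds_ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: odds ratio {ratio:.3f}")
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of HD), and a ratio below 1, as for Max. HR, corresponds to a negative one.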

In

In

In

The current article considers Heart Disease/HD (whether or not a patient has heart disease) as the response variable. It is a binary variable with values ‘1’ and ‘2’, which stand for the absence and presence of heart disease respectively. HD has been modeled using a generalized additive model. The GAM fitted model results are displayed in

Data Mining Techniques: (a) Multi-Layer Perceptron Neural Network (MLPE); (b) Support Vector Machine (SVM).

Histogram of residuals.

Smoothing term (Age) plot with confidence belt.

Absolute residual plot.

Normal probability plots of residuals.

Input Importance Chart.

Variable effect characteristic (VEC) curve for Max. HR (most important input variable).

The current reported results (^{40-47} Third, the final model is justified based on GAM diagnostic plots. Fourth, the standard errors of the estimates are very small, indicating that the estimates are stable.^{48}

Fifth, the final model of HD is selected by locating the appropriate statistical distribution. The HD distribution is identified here as the binomial distribution. For further details, please see the references.^{49,50}

To the best of our knowledge, the present models (Results & interpretation section) can be considered one of the best first building blocks of a regression analysis. The current models may provide better assistance for treatment decision making using the individual patient risk factors and the benefits of a specific treatment. The current results have led to many interesting conclusions. These findings may help medical practitioners provide better medical treatment. The thallium scan report and chest pain type are highly important for identifying heart disease patients. Male patients in particular are advised to take care of their heart as they grow older.

We would like to acknowledge all the previous authors who have worked on this data set, and the UCI Machine Learning Repository for making this dataset available. Finally, we are very thankful to the reviewers for their valuable comments, which helped to improve this article.

SVM: Support vector machine

MLPE: Multi-layer perceptron ensemble

MLP: Multi-layer perceptron

GAM: Generalized additive model

HD: Heart disease

DM: Data mining

VEC: Variable effect characteristic curve