Data Preprocessing¶

In [395]:
import pandas as pd
In [397]:
df = pd.read_csv('dataset.csv')
In [399]:
df.head()
Out[399]:
age;"job";"marital";"education";"default";"housing";"loan";"contact";"month";"day_of_week";"duration";"campaign";"pdays";"previous";"poutcome";"emp.var.rate";"cons.price.idx";"cons.conf.idx";"euribor3m";"nr.employed";"y"
0 56;"housemaid";"married";"basic.4y";"no";"no";...
1 57;"services";"married";"high.school";"unknown...
2 37;"services";"married";"high.school";"no";"ye...
3 40;"admin.";"married";"basic.6y";"no";"no";"no...
4 56;"services";"married";"high.school";"no";"no...
In [401]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 1 columns):
 #   Column                                                                                                                                                                                                                        Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                                        --------------  ----- 
 0   age;"job";"marital";"education";"default";"housing";"loan";"contact";"month";"day_of_week";"duration";"campaign";"pdays";"previous";"poutcome";"emp.var.rate";"cons.price.idx";"cons.conf.idx";"euribor3m";"nr.employed";"y"  41188 non-null  object
dtypes: object(1)
memory usage: 321.9+ KB

Changing the delimiter -¶

In [54]:
with open("dataset.csv", "r", encoding="utf-8") as f:
    lines = f.readlines()

data = [line.strip().replace('"', '').split(';') for line in lines]

df = pd.DataFrame(data[1:], columns=data[0])

df.to_csv("cleaned_dataset.csv", index=False)
In [448]:
df = pd.read_csv('cleaned_dataset.csv')
In [450]:
df.head()
Out[450]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

5 rows × 21 columns

Dataset Overview -¶

Bank Client Data:¶

  • age (numeric)
  • job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self- employed","services","student","technician","unemployed","unknown")
  • marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
  • education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
  • default: has credit in default? (categorical: "no","yes","unknown")
  • housing: has housing loan? (categorical: "no","yes","unknown")
  • loan: has personal loan? (categorical: "no","yes","unknown")

Related with the last contact of the current campaign:¶

  • contact: contact communication type (categorical: "cellular","telephone")
  • month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  • day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
  • duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:¶

  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • previous: number of contacts performed before this campaign and for this client (numeric)
  • poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")

Social and economic context attributes:¶

  • emp.var.rate: employment variation rate - quarterly indicator (numeric)
  • cons.price.idx: consumer price index - monthly indicator (numeric)
  • cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  • euribor3m: euribor 3 month rate - daily indicator (numeric)
  • nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):¶

  • y: has the client subscribed a term deposit? (binary: "yes","no")

Datatype conversion to numeric -¶

In [454]:
numeric_columns = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)

Handling duplicates -¶

In [457]:
duplicate = df[df.duplicated()]
duplicate
Out[457]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
1266 39 blue-collar married basic.6y no no no telephone may thu ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
12261 36 retired married unknown no no no telephone jul thu ... 1 999 0 nonexistent 1.4 93.918 -42.7 4.966 5228.1 no
14234 27 technician single professional.course no no no cellular jul mon ... 2 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1 no
16956 47 technician divorced high.school no yes no cellular jul thu ... 3 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1 no
18465 32 technician single professional.course no yes no cellular jul thu ... 1 999 0 nonexistent 1.4 93.918 -42.7 4.968 5228.1 no
20216 55 services married high.school unknown no no cellular aug mon ... 1 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1 no
20534 41 technician married professional.course no yes no cellular aug tue ... 1 999 0 nonexistent 1.4 93.444 -36.1 4.966 5228.1 no
25217 39 admin. married university.degree no no no cellular nov tue ... 2 999 0 nonexistent -0.1 93.200 -42.0 4.153 5195.8 no
28477 24 services single high.school no yes no cellular apr tue ... 1 999 0 nonexistent -1.8 93.075 -47.1 1.423 5099.1 no
32516 35 admin. married university.degree no yes no cellular may fri ... 4 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
36951 45 admin. married university.degree no no no cellular jul thu ... 1 999 0 nonexistent -2.9 92.469 -33.6 1.072 5076.2 yes
38281 71 retired single university.degree no no no telephone oct tue ... 1 999 0 nonexistent -3.4 92.431 -26.9 0.742 5017.5 no

12 rows × 21 columns

In [459]:
df = df.drop_duplicates()

Handling missing data values -¶

In [462]:
missing_df = pd.DataFrame(df.isnull().sum())
missing_df
Out[462]:
0
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp.var.rate 0
cons.price.idx 0
cons.conf.idx 0
euribor3m 0
nr.employed 0
y 0

Handling the 'unknown' data values -¶

In [465]:
unknown_df = pd.DataFrame((df == 'unknown').sum() / len(df) * 100)
unknown_df
Out[465]:
0
age 0.000000
job 0.801438
marital 0.194288
education 4.201477
default 20.876239
housing 2.404313
loan 2.404313
contact 0.000000
month 0.000000
day_of_week 0.000000
duration 0.000000
campaign 0.000000
pdays 0.000000
previous 0.000000
poutcome 0.000000
emp.var.rate 0.000000
cons.price.idx 0.000000
cons.conf.idx 0.000000
euribor3m 0.000000
nr.employed 0.000000
y 0.000000
In [467]:
df = df[df['marital'] != 'unknown']
df = df[df['job'] != 'unknown']

Dropping the 'duration' column -¶

In [470]:
df = df.drop('duration', axis=1)

Outlier detection using boxplots -¶

In [473]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['age'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['campaign'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['pdays'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['previous'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['emp.var.rate'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['cons.price.idx'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['cons.conf.idx'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['euribor3m'])
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['nr.employed'])
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Outlier detection using the IQR method -¶

In [476]:
numeric_columns = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

outlier_percentages = {}

for col in numeric_columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    percent_outliers = (len(outliers) / len(df)) * 100
    outlier_percentages[col] = round(percent_outliers, 2)

outlier_df = pd.DataFrame.from_dict(outlier_percentages, orient='index', columns=['Outlier %'])

outlier_df
Out[476]:
Outlier %
age 1.13
campaign 5.82
pdays 3.65
previous 13.66
emp.var.rate 0.00
cons.price.idx 0.00
cons.conf.idx 1.07
euribor3m 0.00
nr.employed 0.00

OHE of categorical features -¶

In [479]:
ohe_cols = ['education', 'default', 'housing', 'loan', 'job', 'marital', 'contact', 'month', 'day_of_week', 'poutcome']

df = pd.get_dummies(df, columns=ohe_cols, drop_first=False)

Performing label encoding and datatype conversion on the target 'y' -¶

In [482]:
df['y'] = df['y'].map({'yes': 1, 'no': 0})

df['y'] = df.pop('y')

df['y'] = df['y'].astype(bool)

Correlation analysis on numeric columns -¶

In [485]:
df_numeric = df[numeric_columns]

corr = df_numeric.corr()
corr.style.background_gradient(cmap='coolwarm') 
Out[485]:
  age campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
age 1.000000 0.003414 -0.034230 0.024465 -0.000586 0.000159 0.127881 0.010448 -0.017610
campaign 0.003414 1.000000 0.052371 -0.078932 0.150160 0.127052 -0.013229 0.134621 0.143470
pdays -0.034230 0.052371 1.000000 -0.586168 0.269956 0.078485 -0.092240 0.295609 0.371244
previous 0.024465 -0.078932 -0.586168 1.000000 -0.419486 -0.202290 -0.051704 -0.453549 -0.500352
emp.var.rate -0.000586 0.150160 0.269956 -0.419486 1.000000 0.775292 0.198133 0.972231 0.906802
cons.price.idx 0.000159 0.127052 0.078485 -0.202290 0.775292 1.000000 0.061180 0.687997 0.521496
cons.conf.idx 0.127881 -0.013229 -0.092240 -0.051704 0.198133 0.061180 1.000000 0.279367 0.102142
euribor3m 0.010448 0.134621 0.295609 -0.453549 0.972231 0.687997 0.279367 1.000000 0.945131
nr.employed -0.017610 0.143470 0.371244 -0.500352 0.906802 0.521496 0.102142 0.945131 1.000000

Using Variation Inflation Factor (VIF) to measure the amount of multicollinearity -¶

In [488]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(df[numeric_columns])  

vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
          feature            VIF
0           const  529178.488786
1             age       1.018425
2        campaign       1.033281
3           pdays       1.608952
4        previous       1.791291
5    emp.var.rate      33.087332
6  cons.price.idx       6.342704
7   cons.conf.idx       2.645638
8       euribor3m      64.279331
9     nr.employed      31.649421

Dropping highly correlated features -¶

In [491]:
df.drop(['euribor3m', 'nr.employed'], axis=1, inplace=True)

Recalculating VIF -¶

In [494]:
numeric_columns = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx']

X = add_constant(df[numeric_columns])  

vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
          feature           VIF
0           const  71863.225169
1             age      1.018222
2        campaign      1.025667
3           pdays      1.601893
4        previous      1.743881
5    emp.var.rate      3.356323
6  cons.price.idx      2.760853
7   cons.conf.idx      1.127746
In [ ]:
 

Predictive Analysis¶

In [498]:
import numpy as np
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

Splitting the dataset -¶

In [501]:
X = df.drop('y', axis=1)
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=25)

Scalling the input data -¶

In [504]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)
X_test_scaled = sc_X.fit_transform(X_test)

1) Logistic Regression¶

In [507]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=2000)
lr.fit(X_train_scaled, y_train)

y_pred = lr.predict(X_test_scaled)
y_prob = lr.predict_proba(X_test_scaled)[:,1]

print("Logistic Regression Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Logistic Regression")
plt.show()

y_pred_proba = lr.predict_proba(X_test_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Logistic Regression Report:
              precision    recall  f1-score   support

       False       0.91      0.98      0.94     10832
        True       0.65      0.24      0.35      1401

    accuracy                           0.90     12233
   macro avg       0.78      0.61      0.65     12233
weighted avg       0.88      0.90      0.88     12233

No description has been provided for this image
No description has been provided for this image

2) KNN¶

In [510]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)
y_prob = knn.predict_proba(X_test_scaled)[:,1]

print("K-Nearest Neighbors Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - KNN")
plt.show()

y_pred_proba = knn.predict_proba(X_test_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
K-Nearest Neighbors Report:
              precision    recall  f1-score   support

       False       0.91      0.97      0.94     10832
        True       0.52      0.24      0.33      1401

    accuracy                           0.89     12233
   macro avg       0.71      0.61      0.63     12233
weighted avg       0.86      0.89      0.87     12233

No description has been provided for this image
No description has been provided for this image

3) Gaussian Naive Bayes¶

In [513]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
y_prob = nb.predict_proba(X_test)[:,1]

print("Gaussian Naive Bayes Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Gaussian Naive Bayes")
plt.show()

y_pred_proba = nb.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Gaussian Naive Bayes Report:
              precision    recall  f1-score   support

       False       0.92      0.92      0.92     10832
        True       0.40      0.42      0.41      1401

    accuracy                           0.86     12233
   macro avg       0.66      0.67      0.66     12233
weighted avg       0.86      0.86      0.86     12233

No description has been provided for this image
No description has been provided for this image

4) Categorical Naive Bayes¶

In [189]:
from sklearn.naive_bayes import CategoricalNB

data = df.copy()

data['age_bin'] = pd.qcut(data['age'], q=4, labels=False)
data['campaign_bin'] = pd.qcut(data['campaign'].rank(method='first'), q=4, labels=False)
data['pdays_bin'] = pd.qcut(data['pdays'].rank(method='first'), q=4, labels=False)
data['previous_bin'] = pd.qcut(data['previous'].rank(method='first'), q=4, labels=False)
data['emp_bin'] = pd.qcut(data['emp.var.rate'], q=2, labels=False)
data['price_bin'] = pd.qcut(data['cons.price.idx'], q=3, labels=False)
data['conf_bin'] = pd.qcut(data['cons.conf.idx'], q=3, labels=False)

data = data.drop(['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx'], axis=1)

for col in data.select_dtypes(include='bool').columns:
    data[col] = data[col].astype(int)

X_cnb = data.drop("y", axis=1)
y_cnb = data["y"]

X_train_cnb, X_test_cnb, y_train_cnb, y_test_cnb = train_test_split(X_cnb, y_cnb, test_size=0.3, random_state=25)

cnb = CategoricalNB()
cnb.fit(X_train_cnb, y_train_cnb)

y_pred_cnb = cnb.predict(X_test_cnb)
y_prob_cnb = cnb.predict_proba(X_test_cnb)[:, 1]

print("Categorical Naive Bayes Report:")
print(classification_report(y_test_cnb, y_pred_cnb))

sns.heatmap(confusion_matrix(y_test_cnb, y_pred_cnb), annot=True, cmap="YlGnBu", fmt='g')
plt.title("Confusion Matrix - Categorical Naive Bayes")
plt.show()

y_pred_proba = cnb.predict_proba(X_test_cnb)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test_cnb,  y_pred_proba)
auc = metrics.roc_auc_score(y_test_cnb, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Categorical Naive Bayes Report:
              precision    recall  f1-score   support

           0       0.94      0.84      0.88     10832
           1       0.31      0.58      0.41      1401

    accuracy                           0.81     12233
   macro avg       0.63      0.71      0.65     12233
weighted avg       0.87      0.81      0.83     12233

No description has been provided for this image
No description has been provided for this image

5) Decision Tree¶

In [251]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)[:,1]

print("Decision Tree Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Decision Tree")
plt.show()

y_pred_proba = dt.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Decision Tree Report:
              precision    recall  f1-score   support

       False       0.91      0.90      0.91     10832
        True       0.30      0.32      0.31      1401

    accuracy                           0.84     12233
   macro avg       0.61      0.61      0.61     12233
weighted avg       0.84      0.84      0.84     12233

No description has been provided for this image
No description has been provided for this image

6) Random Forest¶

In [199]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:,1]

print("Random Forest Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Random Forest")
plt.show()

y_pred_proba = rf.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Random Forest Report:
              precision    recall  f1-score   support

       False       0.91      0.97      0.94     10832
        True       0.53      0.27      0.36      1401

    accuracy                           0.89     12233
   macro avg       0.72      0.62      0.65     12233
weighted avg       0.87      0.89      0.87     12233

No description has been provided for this image
No description has been provided for this image

7) XGBoost¶

In [203]:
from xgboost import XGBClassifier

xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
y_prob = xgb.predict_proba(X_test)[:,1]

print("XGBoost Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - XGBoost")
plt.show()

y_pred_proba = xgb.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
XGBoost Report:
              precision    recall  f1-score   support

       False       0.91      0.97      0.94     10832
        True       0.58      0.27      0.37      1401

    accuracy                           0.89     12233
   macro avg       0.74      0.62      0.65     12233
weighted avg       0.87      0.89      0.88     12233

No description has been provided for this image
No description has been provided for this image
In [ ]:
 

Hyperparameter Tuning of Decision Tree¶

In [253]:
data = df.copy()
In [255]:
for col in data.select_dtypes(include='bool').columns:
    data[col] = data[col].astype(int)
In [257]:
X_tdt = data.drop("y", axis=1)
y_tdt = data["y"]
In [259]:
X_train_tdt, X_test_tdt, y_train_tdt, y_test_tdt = train_test_split(X_tdt, y_tdt, test_size=0.3, random_state=25)
In [261]:
tdt = DecisionTreeClassifier()
tdt.fit(X_train_tdt, y_train_tdt)
Out[261]:
DecisionTreeClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier()

Initial Train and Test accuracy -¶

In [263]:
dt.score(X_train_tdt, y_train_tdt), dt.score(X_test_tdt, y_test_tdt)
Out[263]:
(0.9947796230117021, 0.8363443145589798)

Initial tree visualization -¶

In [269]:
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10)) 
plot_tree(tdt, filled=True, feature_names=X_train_tdt.columns.tolist(), class_names=[str(cls) for cls in tdt.classes_])
plt.show()
No description has been provided for this image

Initial tree structure -¶

In [271]:
tree_info = tdt.tree_
num_nodes = tree_info.node_count
num_leaves = tree_info.n_leaves
num_decision = num_nodes - num_leaves

print("Number of nodes:", num_nodes)
print("Number of leaves:", num_leaves)
print("Number of decision:", num_decision)
Number of nodes: 8963
Number of leaves: 4482
Number of decision: 4481

Tuning the 'max_depth' parameter and finding the accuracy -¶

In [311]:
tdt = DecisionTreeClassifier(max_depth = 3, random_state=0)
tdt.fit(X_train_tdt, y_train_tdt)
tdt.score(X_train_tdt, y_train_tdt), tdt.score(X_test_tdt, y_test_tdt)
Out[311]:
(0.9015836311400742, 0.8969181721572795)

Visualizing the tree -¶

In [313]:
plt.figure(figsize=(20,10)) 
plot_tree(tdt, filled=True, feature_names=X_train_tdt.columns.tolist(), class_names=[str(cls) for cls in tdt.classes_])
plt.show()

tree_info = tdt.tree_
num_nodes = tree_info.node_count
num_leaves = tree_info.n_leaves
num_decision = num_nodes - num_leaves

print("Number of nodes:", num_nodes)
print("Number of leaves:", num_leaves)
print("Number of decision:", num_decision)
No description has been provided for this image
Number of nodes: 15
Number of leaves: 8
Number of decision: 7

Tuning the 'max_depth' parameter and finding the accuracy -¶

In [315]:
tdt = DecisionTreeClassifier(max_depth = 4, random_state=0)
tdt.fit(X_train_tdt, y_train_tdt)
tdt.score(X_train_tdt, y_train_tdt), tdt.score(X_test_tdt, y_test_tdt)
Out[315]:
(0.9020040641861117, 0.8970816643505273)

Visualizing the tree -¶

In [317]:
plt.figure(figsize=(20,10)) 
plot_tree(tdt, filled=True, feature_names=X_train_tdt.columns.tolist(), class_names=[str(cls) for cls in tdt.classes_])
plt.show()

tree_info = tdt.tree_
num_nodes = tree_info.node_count
num_leaves = tree_info.n_leaves
num_decision = num_nodes - num_leaves

print("Number of nodes:", num_nodes)
print("Number of leaves:", num_leaves)
print("Number of decision:", num_decision)
No description has been provided for this image
Number of nodes: 31
Number of leaves: 16
Number of decision: 15

Visualing the effect of different 'max_depth' values on model's performance -¶

In [331]:
depths = [3, 4, 5, 6, 7, 8, 9]
train_accuracies = []
test_accuracies = []

for depth in depths:
    tdt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tdt.fit(X_train_tdt, y_train_tdt)
    train_accuracies.append(tdt.score(X_train_tdt, y_train_tdt))
    test_accuracies.append(tdt.score(X_test_tdt, y_test_tdt))

plt.figure(figsize=(10, 6))
plt.plot(depths, train_accuracies, marker='o', label='Train Accuracy')
plt.plot(depths, test_accuracies, marker='o', label='Test Accuracy')
plt.title('Decision Tree Performance vs Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.xticks(depths)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Introducing another hyperparameter and visualing the effect of different 'min_samples_split' values on model's performance -¶

In [355]:
minSamplesSplits = list(range(10, 50))
train_accuracies = []
test_accuracies = []

for minSamplesSplit in minSamplesSplits:
    tdt = DecisionTreeClassifier(max_depth=6, min_samples_split = minSamplesSplit, random_state=0)
    tdt.fit(X_train_tdt, y_train_tdt)
    train_accuracies.append(tdt.score(X_train_tdt, y_train_tdt))
    test_accuracies.append(tdt.score(X_test_tdt, y_test_tdt))

plt.figure(figsize=(10, 6))
plt.plot(minSamplesSplits, train_accuracies, marker='o', label='Train Accuracy')
plt.plot(minSamplesSplits, test_accuracies, marker='o', label='Test Accuracy')
plt.title('Decision Tree Performance vs minSamplesSplits')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.xticks(minSamplesSplits)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Automating hyperparameter tuning by using Random Search Cross-Validation -¶

In [369]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'max_depth': list(range(2, 15)),
    'min_samples_split': list(range(10, 40)),  
    'min_samples_leaf': list(range(10, 40)),
    'max_features': [None, 'sqrt', 'log2']  
}

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=100, 
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=0,
    verbose=1
)

random_search.fit(X_train_tdt, y_train_tdt)
print("Best Parameters:", random_search.best_params_)

best_model = random_search.best_estimator_
test_score = best_model.score(X_test_tdt, y_test_tdt)
print("Test Set Score:", test_score)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Parameters: {'min_samples_split': 12, 'min_samples_leaf': 11, 'max_features': None, 'max_depth': 3}
Test Set Score: 0.8969181721572795

Evaluating the model after hyperparameter tuning -¶

In [373]:
tdt = DecisionTreeClassifier(min_samples_split=12, min_samples_leaf=11, max_features=None, max_depth=3)
tdt.fit(X_train_tdt, y_train_tdt)
tdt.score(X_train_tdt, y_train_tdt), tdt.score(X_test_tdt, y_test_tdt)
Out[373]:
(0.9015836311400742, 0.8969181721572795)

Visualizing the decision tree after hyperparameter tuning -¶

In [375]:
plt.figure(figsize=(20,10)) 
plot_tree(tdt, filled=True, feature_names=X_train_tdt.columns.tolist(), class_names=[str(cls) for cls in tdt.classes_])
plt.show()

tree_info = tdt.tree_
num_nodes = tree_info.node_count
num_leaves = tree_info.n_leaves
num_decision = num_nodes - num_leaves

print("Number of nodes:", num_nodes)
print("Number of leaves:", num_leaves)
print("Number of decision:", num_decision)
No description has been provided for this image
Number of nodes: 15
Number of leaves: 8
Number of decision: 7

Rerunning the model and evaluating the performance after tuning -¶

In [377]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(min_samples_split=12, min_samples_leaf=11, max_features=None, max_depth=3)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)[:,1]

print("Decision Tree Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Decision Tree")
plt.show()

y_pred_proba = dt.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Decision Tree Report:
              precision    recall  f1-score   support

       False       0.90      0.99      0.94     10832
        True       0.68      0.19      0.30      1401

    accuracy                           0.90     12233
   macro avg       0.79      0.59      0.62     12233
weighted avg       0.88      0.90      0.87     12233

No description has been provided for this image
No description has been provided for this image

Models performance before tuning -¶

In [380]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)[:,1]

print("Decision Tree Report:")
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title("Confusion Matrix - Decision Tree")
plt.show()

y_pred_proba = dt.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Decision Tree Report:
              precision    recall  f1-score   support

       False       0.91      0.90      0.91     10832
        True       0.30      0.32      0.31      1401

    accuracy                           0.84     12233
   macro avg       0.61      0.61      0.61     12233
weighted avg       0.84      0.84      0.84     12233

No description has been provided for this image
No description has been provided for this image
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: