Installing Libraries

In [3]:
pip install ydata-profiling #pandas-profiling was renamed to ydata-profiling
Requirement already satisfied: ydata-profiling in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (0.0.dev0)
Requirement already satisfied: scipy<1.14,>=1.4.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (1.13.1)
Requirement already satisfied: pandas!=1.4.0,<3,>1.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (2.2.2)
Requirement already satisfied: matplotlib<3.9,>=3.2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (3.8.4)
Requirement already satisfied: pydantic>=2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (2.5.3)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (6.0.1)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (3.1.4)
Requirement already satisfied: visions<0.7.7,>=0.7.5 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling) (0.7.6)
Requirement already satisfied: numpy<2,>=1.16.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (1.26.4)
Requirement already satisfied: htmlmin==0.1.12 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (0.1.12)
Requirement already satisfied: phik<0.13,>=0.11.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (0.12.3)
Requirement already satisfied: requests<3,>=2.24.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (2.32.2)
Requirement already satisfied: tqdm<5,>=4.48.2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (4.66.4)
Requirement already satisfied: seaborn<0.14,>=0.10.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (0.13.2)
Requirement already satisfied: multimethod<2,>=1.4 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (1.9.1)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (0.14.2)
Requirement already satisfied: typeguard<5,>=3 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (4.2.1)
Requirement already satisfied: imagehash==4.3.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (4.3.1)
Requirement already satisfied: wordcloud>=1.9.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (1.9.4)
Requirement already satisfied: dacite>=1.8 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (1.8.1)
Requirement already satisfied: numba<1,>=0.56.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from ydata-profiling) (0.59.1)
Requirement already satisfied: pillow in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from imagehash==4.3.1->ydata-profiling) (10.3.0)
Requirement already satisfied: PyWavelets in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from imagehash==4.3.1->ydata-profiling) (1.5.0)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from jinja2<3.2,>=2.11.1->ydata-profiling) (2.1.3)
Requirement already satisfied: contourpy>=1.0.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (23.2)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib<3.9,>=3.2->ydata-profiling) (2.9.0.post0)
Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from numba<1,>=0.56.0->ydata-profiling) (0.42.0)
Requirement already satisfied: pytz>=2020.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pandas!=1.4.0,<3,>1.1->ydata-profiling) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pandas!=1.4.0,<3,>1.1->ydata-profiling) (2023.3)
Requirement already satisfied: joblib>=0.14.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from phik<0.13,>=0.11.1->ydata-profiling) (1.4.2)
Requirement already satisfied: annotated-types>=0.4.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pydantic>=2->ydata-profiling) (0.6.0)
Requirement already satisfied: pydantic-core==2.14.6 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pydantic>=2->ydata-profiling) (2.14.6)
Requirement already satisfied: typing-extensions>=4.6.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pydantic>=2->ydata-profiling) (4.11.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from requests<3,>=2.24.0->ydata-profiling) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from requests<3,>=2.24.0->ydata-profiling) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from requests<3,>=2.24.0->ydata-profiling) (2.2.2)
Requirement already satisfied: certifi>=2017.4.17 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from requests<3,>=2.24.0->ydata-profiling) (2025.1.31)
Requirement already satisfied: patsy>=0.5.6 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from statsmodels<1,>=0.13.2->ydata-profiling) (0.5.6)
Requirement already satisfied: attrs>=19.3.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from visions<0.7.7,>=0.7.5->visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling) (23.1.0)
Requirement already satisfied: networkx>=2.4 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from visions<0.7.7,>=0.7.5->visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling) (3.2.1)
Requirement already satisfied: six in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from patsy>=0.5.6->statsmodels<1,>=0.13.2->ydata-profiling) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [4]:
pip install sweetviz
Requirement already satisfied: sweetviz in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (2.3.1)
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (2.2.2)
Requirement already satisfied: numpy>=1.16.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (1.26.4)
Requirement already satisfied: matplotlib>=3.1.3 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (3.8.4)
Requirement already satisfied: tqdm>=4.43.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (4.66.4)
Requirement already satisfied: scipy>=1.3.2 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (1.13.1)
Requirement already satisfied: jinja2>=2.11.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (3.1.4)
Requirement already satisfied: importlib-resources>=1.2.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from sweetviz) (6.5.2)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from jinja2>=2.11.1->sweetviz) (2.1.3)
Requirement already satisfied: contourpy>=1.0.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (23.2)
Requirement already satisfied: pillow>=8 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from matplotlib>=3.1.3->sweetviz) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3->sweetviz) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3->sweetviz) (2023.3)
Requirement already satisfied: six>=1.5 in /Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib>=3.1.3->sweetviz) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [ ]:
 
In [ ]:
 

Importing The Dataset

In [155]:
import pandas as pd
In [157]:
df = pd.read_csv('dataset.csv')
In [159]:
df.head()
Out[159]:
   customerID  gender  SeniorCitizen Partner Dependents  tenure  \
0  7590-VHVEG  Female              0     Yes         No       1
1  5575-GNVDE    Male              0      No         No      34
2  3668-QPYBK    Male              0      No         No       2
3  7795-CFOCW    Male              0      No         No      45
4  9237-HQITU  Female              0      No         No       2

  PhoneService     MultipleLines InternetService OnlineSecurity  ...  \
0           No  No phone service             DSL             No  ...
1          Yes                No             DSL            Yes  ...
2          Yes                No             DSL            Yes  ...
3           No  No phone service             DSL            Yes  ...
4          Yes                No     Fiber optic             No  ...

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0               No          No          No              No  Month-to-month
1              Yes          No          No              No        One year
2               No          No          No              No  Month-to-month
3              Yes         Yes          No              No        One year
4               No          No          No              No  Month-to-month

  PaperlessBilling              PaymentMethod  MonthlyCharges TotalCharges Churn
0              Yes           Electronic check           29.85        29.85    No
1               No               Mailed check           56.95       1889.5    No
2              Yes               Mailed check           53.85       108.15   Yes
3               No  Bank transfer (automatic)           42.30      1840.75    No
4              Yes           Electronic check           70.70       151.65   Yes

5 rows × 21 columns

In [ ]:
 
In [ ]:
 

Data Preparation

In [164]:
df.describe()
Out[164]:
       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000
In [166]:
# ISSUE: 'TotalCharges' should be a numerical column, but it is missing from the summary statistics.
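A likely explanation, sketched on a toy frame (the values below are made up for illustration and are not read from dataset.csv): by default describe() summarises only numeric columns, so a column stored as strings is left out without any warning.

```python
import pandas as pd

# Toy frame (hypothetical values): 'TotalCharges' is stored as strings.
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# Default behaviour: only numeric columns are summarised.
numeric_summary = toy.describe()

# include="all" also reports non-numeric columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(toy.dtypes)
```

Comparing the two column lists is a quick way to spot columns that were expected to be numeric but were parsed as text.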
In [ ]:
 
In [169]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [171]:
# The datatype of the column 'TotalCharges' is incorrect: it is stored as object instead of float.
In [ ]:
 
In [174]:
df['TotalCharges'] = df['TotalCharges'].astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[174], line 1
----> 1 df['TotalCharges'] = df['TotalCharges'].astype('float')

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
   6637     results = [
   6638         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6639     ]
   6641 else:
   6642     # else, only a single dtype is given
-> 6643     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6644     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6645     return res.__finalize__(self, method="astype")

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
    427 elif using_copy_on_write():
    428     copy = False
--> 430 return self.apply(
    431     "astype",
    432     dtype=dtype,
    433     copy=copy,
    434     errors=errors,
    435     using_cow=using_copy_on_write(),
    436 )

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755         raise ValueError("Can not squeeze with more than one column.")
    756     values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234     dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)
    129     raise ValueError(msg)
    131 if copy or arr.dtype == object or dtype == object:
    132     # Explicit copy, or required since NumPy can't view from / to object.
--> 133     return arr.astype(dtype, copy=True)
    135 return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: ' '
In [176]:
# Unable to fix the datatype because some 'TotalCharges' entries are blank strings (' ').
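As a side note, one commonly used alternative to astype is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN instead of raising. The sketch below uses made-up values and is not the route this notebook takes; the name n_failed is only illustrative.

```python
import pandas as pd

# Hypothetical values, including one blank entry like the one in the traceback.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" converts anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Count how many entries were lost in the conversion, so that genuinely
# malformed values do not disappear unnoticed.
n_failed = int(converted.isna().sum() - raw.isna().sum())

print(converted)
print(f"Entries coerced to NaN: {n_failed}")
```

The trade-off is that coercion hides every kind of bad value, not just blanks, so it is worth inspecting the failing rows before relying on it.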
In [ ]:
 
In [179]:
empty_rows = df[df['TotalCharges'].str.strip() == '']
print(empty_rows)
      customerID  gender  SeniorCitizen Partner Dependents  tenure  \
488   4472-LVYGI  Female              0     Yes        Yes       0   
753   3115-CZMZD    Male              0      No        Yes       0   
936   5709-LVOEQ  Female              0     Yes        Yes       0   
1082  4367-NUYAO    Male              0     Yes        Yes       0   
1340  1371-DWPAZ  Female              0     Yes        Yes       0   
3331  7644-OMVMY    Male              0     Yes        Yes       0   
3826  3213-VVOLG    Male              0     Yes        Yes       0   
4380  2520-SGTTA  Female              0     Yes        Yes       0   
5218  2923-ARZLG    Male              0     Yes        Yes       0   
6670  4075-WKNIU  Female              0     Yes        Yes       0   
6754  2775-SEFEE    Male              0      No        Yes       0   

     PhoneService     MultipleLines InternetService       OnlineSecurity  ...  \
488            No  No phone service             DSL                  Yes  ...   
753           Yes                No              No  No internet service  ...   
936           Yes                No             DSL                  Yes  ...   
1082          Yes               Yes              No  No internet service  ...   
1340           No  No phone service             DSL                  Yes  ...   
3331          Yes                No              No  No internet service  ...   
3826          Yes               Yes              No  No internet service  ...   
4380          Yes                No              No  No internet service  ...   
5218          Yes                No              No  No internet service  ...   
6670          Yes               Yes             DSL                   No  ...   
6754          Yes               Yes             DSL                  Yes  ...   

         DeviceProtection          TechSupport          StreamingTV  \
488                   Yes                  Yes                  Yes   
753   No internet service  No internet service  No internet service   
936                   Yes                   No                  Yes   
1082  No internet service  No internet service  No internet service   
1340                  Yes                  Yes                  Yes   
3331  No internet service  No internet service  No internet service   
3826  No internet service  No internet service  No internet service   
4380  No internet service  No internet service  No internet service   
5218  No internet service  No internet service  No internet service   
6670                  Yes                  Yes                  Yes   
6754                   No                  Yes                   No   

          StreamingMovies  Contract PaperlessBilling  \
488                    No  Two year              Yes   
753   No internet service  Two year               No   
936                   Yes  Two year               No   
1082  No internet service  Two year               No   
1340                   No  Two year               No   
3331  No internet service  Two year               No   
3826  No internet service  Two year               No   
4380  No internet service  Two year               No   
5218  No internet service  One year              Yes   
6670                   No  Two year               No   
6754                   No  Two year              Yes   

                  PaymentMethod MonthlyCharges  TotalCharges Churn  
488   Bank transfer (automatic)          52.55                  No  
753                Mailed check          20.25                  No  
936                Mailed check          80.85                  No  
1082               Mailed check          25.75                  No  
1340    Credit card (automatic)          56.05                  No  
3331               Mailed check          19.85                  No  
3826               Mailed check          25.35                  No  
4380               Mailed check          20.00                  No  
5218               Mailed check          19.70                  No  
6670               Mailed check          73.35                  No  
6754  Bank transfer (automatic)          61.90                  No  

[11 rows x 21 columns]
In [181]:
# Found 11 rows where 'TotalCharges' is blank. All of them have tenure 0.
In [ ]:
 
In [184]:
import numpy as np
df['TotalCharges'] = df['TotalCharges'].replace(r'^\s*$', np.nan, regex=True)
In [186]:
# Replaced the blank (empty or whitespace-only) cells with NaN.
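To make the pattern concrete, here is a small check on made-up strings: r'^\s*$' matches a value only when the whole string is empty or whitespace, so ordinary values that merely contain spaces are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: four blank variants and two real values.
samples = pd.Series(["", " ", "   ", "\t", "20.5", "No phone service"])

# Same replacement as in the cell above, applied to the toy Series.
cleaned = samples.replace(r"^\s*$", np.nan, regex=True)

print(cleaned.isna().tolist())
```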
In [ ]:
 
In [189]:
df['TotalCharges'] = df['TotalCharges'].astype('float')
In [ ]:
 
In [192]:
df.describe()
Out[192]:
       SeniorCitizen       tenure  MonthlyCharges  TotalCharges
count    7043.000000  7043.000000     7043.000000   7032.000000
mean        0.162147    32.371149       64.761692   2283.300441
std         0.368612    24.559481       30.090047   2266.771362
min         0.000000     0.000000       18.250000     18.800000
25%         0.000000     9.000000       35.500000    401.450000
50%         0.000000    29.000000       70.350000   1397.475000
75%         0.000000    55.000000       89.850000   3794.737500
max         1.000000    72.000000      118.750000   8684.800000
In [ ]:
 
In [ ]:
 

Handling Missing Data

In [197]:
missing_values = df.isnull().sum()
missing_values
Out[197]:
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
In [ ]:
 
In [200]:
df = df.dropna(subset=['TotalCharges'])
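For comparison, the sketch below shows on a toy frame what dropna(subset=...) does and what one alternative would look like. The fill-with-zero branch is an assumption for illustration only: it relies on the earlier observation that every blank row has tenure 0, and it is not what this notebook does. All values and the column OtherCol are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tenure": [0, 12, 0, 30],
    "TotalCharges": [np.nan, 350.0, np.nan, 1500.0],
    "OtherCol": [1.0, np.nan, 3.0, 4.0],  # a NaN here must not trigger a drop
})

# Option used in this notebook: drop rows where 'TotalCharges' is missing.
# subset= restricts the check to that one column.
dropped = toy.dropna(subset=["TotalCharges"])

# Alternative (assumption): customers with tenure 0 have not been billed yet,
# so their missing total is set to 0.0 and the rows are kept.
new_customer_gap = toy["TotalCharges"].isna() & toy["tenure"].eq(0)
filled = toy.copy()
filled.loc[new_customer_gap, "TotalCharges"] = 0.0

print(len(toy), len(dropped), len(filled))
```

Dropping is the simpler choice when the affected rows are a tiny fraction of the data, as they are here (11 of 7043).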
In [ ]:
 
In [ ]:
 

Detecting Outliers

In [205]:
import seaborn as sns
import matplotlib.pyplot as plt

# One box plot per numeric column.
for col in ['tenure', 'MonthlyCharges', 'TotalCharges']:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col])
    plt.show()
In [ ]:
 
In [208]:
# Flag values outside the 1.5 * IQR fences for each numeric column.
for col in ['tenure', 'MonthlyCharges', 'TotalCharges']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(outliers)
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
In [210]:
# No outliers were detected in 'tenure', 'MonthlyCharges', or 'TotalCharges' by the 1.5 * IQR rule.
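The same rule can be wrapped in a small helper and sanity-checked on data where an outlier is known to exist. The function name iqr_outlier_mask and the toy values are hypothetical; this is a sketch of the rule used above, not a replacement for it.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data with one obvious outlier (95).
toy = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(toy)

print(toy[mask].tolist())
```

The empty results above are consistent with the summary statistics. For TotalCharges, Q1 = 401.45 and Q3 = 3794.7375 give IQR = 3393.2875, so the upper fence is 3794.7375 + 1.5 * 3393.2875 = 8884.66875, which is above the maximum of 8684.80, and the lower fence is negative, which is below the minimum of 18.80.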
In [ ]:
 
In [ ]:
 

Data Transformation (Converting All Entries To Numeric)

In [215]:
df['Partner'] = df['Partner'].map({'Yes': 1, 'No': 0})
df['gender'] = df['gender'].map({'Female': 1, 'Male': 0})
df['Dependents'] = df['Dependents'].map({'Yes': 1, 'No': 0})
df['PaperlessBilling'] = df['PaperlessBilling'].map({'Yes': 1, 'No': 0}) 
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
In [217]:
# Binary encoding of the two-valued columns (Yes/No and Female/Male).
In [ ]:
 
In [220]:
df['No_Internet'] = df['InternetService'].map({'No': 1, 'DSL': 0, 'Fiber optic': 0})
df['No_Phone'] = df['PhoneService'].map({'No': 1, 'Yes': 0})
In [222]:
# Binary indicator columns for customers with no internet service and no phone service.
In [ ]:
 
In [225]:
df['OnlineSecurity'] = df['OnlineSecurity'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['OnlineBackup'] = df['OnlineBackup'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['DeviceProtection'] = df['DeviceProtection'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['TechSupport'] = df['TechSupport'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['StreamingTV'] = df['StreamingTV'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['StreamingMovies'] = df['StreamingMovies'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
In [227]:
# Binary encoding: 'No internet service' is treated as 'No' (0).
In [ ]:
 
In [230]:
df['MultipleLines'] = df['MultipleLines'].map({'Yes': 1, 'No': 0, 'No phone service': 0})
In [232]:
# Binary encoding: 'No phone service' is treated as 'No' (0).
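One thing to watch with map(): any label that is missing from the dictionary becomes NaN without an error. The helper below is a hypothetical guard (map_strict is not part of pandas) that fails loudly instead, shown on made-up values.

```python
import pandas as pd


def map_strict(column: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to codes and raise if any label is not covered by the mapping."""
    unmapped = set(column.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped labels in {column.name!r}: {sorted(unmapped)}")
    return column.map(mapping)


toy = pd.Series(["Yes", "No", "No phone service"], name="MultipleLines")

# Complete mapping: every label is converted.
encoded = map_strict(toy, {"Yes": 1, "No": 0, "No phone service": 0})

# Plain map() with an incomplete mapping: the third value silently becomes NaN.
silent = toy.map({"Yes": 1, "No": 0})

print(encoded.tolist())
print(int(silent.isna().sum()))
```

A cheaper check with the same intent is to call df.isnull().sum() again right after the encoding cells.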
In [ ]:
 
In [235]:
df = pd.get_dummies(df, columns=['InternetService', 'PaymentMethod', 'Contract'])
In [237]:
# One-hot encoding of the multi-category columns.
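As a small illustration on made-up rows, get_dummies accepts a dtype argument, so the indicator columns can be created as integers directly instead of as bool. This is only a sketch of that option; the notebook keeps the default and casts later.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# dtype=np.int8 yields 0/1 integer columns rather than True/False.
dummies = pd.get_dummies(toy, columns=["Contract"], dtype=np.int8)

print(list(dummies.columns))
print(dummies.dtypes.astype(str).tolist())
```

Each row has exactly one indicator set to 1, which is why one column per group is linearly dependent on the others.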
In [ ]:
 
In [240]:
df.drop(columns=['PhoneService', 'InternetService_No'], inplace=True)
In [242]:
# Dropped redundant columns: 'PhoneService' is covered by 'No_Phone', and 'InternetService_No' duplicates 'No_Internet'.
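The redundancy can be verified before dropping. The sketch below rebuilds the two columns on a toy frame (hypothetical rows) and checks that they carry the same information.

```python
import pandas as pd

toy = pd.DataFrame({"InternetService": ["DSL", "No", "Fiber optic", "No"]})

# Same derivation as above: flag customers without internet service.
toy["No_Internet"] = toy["InternetService"].map({"No": 1, "DSL": 0, "Fiber optic": 0})

# One-hot encode the original column; this also creates 'InternetService_No'.
dummies = pd.get_dummies(toy, columns=["InternetService"])

# The dummy column and the hand-made flag should agree on every row.
is_redundant = bool(
    (dummies["InternetService_No"].astype(int) == dummies["No_Internet"]).all()
)

print(is_redundant)
```

The same kind of check applies to the phone columns, where 'No_Phone' is the exact inverse of the original 'PhoneService'.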
In [ ]:
 
In [ ]:
 
In [246]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   customerID                               7032 non-null   object 
 1   gender                                   7032 non-null   int64  
 2   SeniorCitizen                            7032 non-null   int64  
 3   Partner                                  7032 non-null   int64  
 4   Dependents                               7032 non-null   int64  
 5   tenure                                   7032 non-null   int64  
 6   MultipleLines                            7032 non-null   int64  
 7   OnlineSecurity                           7032 non-null   int64  
 8   OnlineBackup                             7032 non-null   int64  
 9   DeviceProtection                         7032 non-null   int64  
 10  TechSupport                              7032 non-null   int64  
 11  StreamingTV                              7032 non-null   int64  
 12  StreamingMovies                          7032 non-null   int64  
 13  PaperlessBilling                         7032 non-null   int64  
 14  MonthlyCharges                           7032 non-null   float64
 15  TotalCharges                             7032 non-null   float64
 16  Churn                                    7032 non-null   int64  
 17  No_Internet                              7032 non-null   int64  
 18  No_Phone                                 7032 non-null   int64  
 19  InternetService_DSL                      7032 non-null   bool   
 20  InternetService_Fiber optic              7032 non-null   bool   
 21  PaymentMethod_Bank transfer (automatic)  7032 non-null   bool   
 22  PaymentMethod_Credit card (automatic)    7032 non-null   bool   
 23  PaymentMethod_Electronic check           7032 non-null   bool   
 24  PaymentMethod_Mailed check               7032 non-null   bool   
 25  Contract_Month-to-month                  7032 non-null   bool   
 26  Contract_One year                        7032 non-null   bool   
 27  Contract_Two year                        7032 non-null   bool   
dtypes: bool(9), float64(2), int64(16), object(1)
memory usage: 1.1+ MB
In [ ]:
 
In [ ]:
 
In [250]:
df['customerID'] = df['customerID'].astype(str)

dt_to_int = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'MultipleLines', 'OnlineSecurity', 
             'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 
             'PaperlessBilling', 'Churn', 'No_Internet', 'No_Phone', 
             'InternetService_DSL', 'InternetService_Fiber optic', 
             'PaymentMethod_Bank transfer (automatic)', 'PaymentMethod_Credit card (automatic)', 
             'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check', 'Contract_Month-to-month',
             'Contract_One year', 'Contract_Two year']

dt_to_float = ['MonthlyCharges', 'TotalCharges']

df[dt_to_int] = df[dt_to_int].astype('int8')
df[dt_to_float] = df[dt_to_float].astype('float32')
In [252]:
# Standardized the datatypes.
# Downcast to int8 and float32 to reduce memory usage.
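The effect of downcasting can be measured directly. The sketch below uses a synthetic frame (the column names flag and charge are hypothetical) and also checks that the integer values fit into int8, whose range is -128 to 127.

```python
import numpy as np
import pandas as pd

n = 1000
toy = pd.DataFrame({
    "flag": np.zeros(n, dtype="int64"),
    "charge": np.linspace(18.25, 118.75, n),  # float64 by default
})

# Bytes used by the column data, excluding the index.
before = int(toy.memory_usage(index=False).sum())

# Range check before narrowing the integer column.
limits = np.iinfo(np.int8)
fits = bool(toy["flag"].between(limits.min, limits.max).all())

small = toy.astype({"flag": "int8", "charge": "float32"})
after = int(small.memory_usage(index=False).sum())

print(before, after, fits)
```

In this notebook the largest integer value is tenure at 72, so int8 is safe. float32 keeps roughly seven significant digits, which is enough for the charge columns but can show small rounding differences when values are printed.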
In [ ]:
 
In [ ]:
 
In [256]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   customerID                               7032 non-null   object 
 1   gender                                   7032 non-null   int8   
 2   SeniorCitizen                            7032 non-null   int8   
 3   Partner                                  7032 non-null   int8   
 4   Dependents                               7032 non-null   int8   
 5   tenure                                   7032 non-null   int8   
 6   MultipleLines                            7032 non-null   int8   
 7   OnlineSecurity                           7032 non-null   int8   
 8   OnlineBackup                             7032 non-null   int8   
 9   DeviceProtection                         7032 non-null   int8   
 10  TechSupport                              7032 non-null   int8   
 11  StreamingTV                              7032 non-null   int8   
 12  StreamingMovies                          7032 non-null   int8   
 13  PaperlessBilling                         7032 non-null   int8   
 14  MonthlyCharges                           7032 non-null   float32
 15  TotalCharges                             7032 non-null   float32
 16  Churn                                    7032 non-null   int8   
 17  No_Internet                              7032 non-null   int8   
 18  No_Phone                                 7032 non-null   int8   
 19  InternetService_DSL                      7032 non-null   int8   
 20  InternetService_Fiber optic              7032 non-null   int8   
 21  PaymentMethod_Bank transfer (automatic)  7032 non-null   int8   
 22  PaymentMethod_Credit card (automatic)    7032 non-null   int8   
 23  PaymentMethod_Electronic check           7032 non-null   int8   
 24  PaymentMethod_Mailed check               7032 non-null   int8   
 25  Contract_Month-to-month                  7032 non-null   int8   
 26  Contract_One year                        7032 non-null   int8   
 27  Contract_Two year                        7032 non-null   int8   
dtypes: float32(2), int8(25), object(1)
memory usage: 336.5+ KB

Handling Duplicates¶

In [261]:
duplicate = df[df.duplicated()]
duplicate
Out[261]:
customerID gender SeniorCitizen Partner Dependents tenure MultipleLines OnlineSecurity OnlineBackup DeviceProtection ... No_Phone InternetService_DSL InternetService_Fiber optic PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check Contract_Month-to-month Contract_One year Contract_Two year

0 rows × 28 columns

In [265]:
# customerID column:
#
# Checked it for duplicates: none were found.
#
# I did not drop it right away, since the ID could still carry a hidden pattern that correlates with Churn.
# To check for a correlation, it first had to be converted to a numeric value.
#
# I considered two approaches:
#
# 1) Split customerID into its numeric part and its letter part. This only works if the numeric part is unique,
#    and the numeric part turned out to have almost 2,000 duplicates (1,954).
#
# 2) Encode the whole customerID as a number using categorical encoding,
#    then check its correlation with the prediction target, Churn.
#
# The correlation was extremely close to 0 (about -0.018), so I finally decided to drop the column.
In [267]:
duplicate_check = df['customerID'].duplicated().sum()
print(duplicate_check)
0
In [270]:
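# First four characters of the ID: the numeric part.
df['customer_prefix'] = df['customerID'].str[:4]
# Everything from index 5 onward (skipping the separator at index 4): the letter part.
df['customer_suffix'] = df['customerID'].str[5:]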

duplicate_check = df['customer_prefix'].duplicated().sum()
print(f"Number of duplicate prefixes: {duplicate_check}")
Number of duplicate prefixes: 1954
In [273]:
df['customerID_encoded'] = df['customerID'].astype('category').cat.codes
In [276]:
print(df[['customerID_encoded', 'Churn']].corr())
                    customerID_encoded     Churn
customerID_encoded            1.000000 -0.017858
Churn                        -0.017858  1.000000
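A caveat on the encoding step: `astype('category').cat.codes` numbers the IDs by their sorted order, so the correlation above only measures whether churn drifts with the alphabetical rank of the ID. A minimal sketch on made-up data illustrates this (the IDs, churn flags, and the names `toy`, `order`, and `manual` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: IDs and churn flags are invented for illustration.
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "0002-ORFBO", "5575-GNVDE", "3668-QPYBK"],
    "Churn": [0, 0, 1, 1],
})

# Category codes are positions in the sorted list of unique values.
toy["customerID_encoded"] = toy["customerID"].astype("category").cat.codes

# Building the same mapping by hand from the sorted IDs gives identical numbers.
order = {cid: i for i, cid in enumerate(sorted(toy["customerID"].unique()))}
manual = toy["customerID"].map(order)

print(toy[["customerID", "customerID_encoded"]])
print((toy["customerID_encoded"] == manual).all())
```

A near-zero correlation is therefore consistent with the ID being uninformative, but it only rules out a linear trend along that sort order.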
In [279]:
df.drop(columns=['customerID', 'customerID_encoded', 'customer_prefix', 'customer_suffix'], inplace=True)
In [283]:
duplicate = df[df.duplicated()]
duplicate
Out[283]:
gender SeniorCitizen Partner Dependents tenure MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport ... No_Phone InternetService_DSL InternetService_Fiber optic PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check Contract_Month-to-month Contract_One year Contract_Two year
964 0 0 0 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 1 1 0 0
1338 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
1491 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
1739 0 0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 1 0 1 0 0
1932 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
2713 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
2892 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
3301 1 1 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 1 0 1 0 0
3754 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
4098 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
4476 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
5506 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
5736 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
5759 1 0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 1 1 0 0
6267 1 0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 1 0 1 0 0
6499 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
6518 0 0 0 0 1 0 0 0 0 0 ... 0 1 0 0 0 1 0 1 0 0
6609 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
6706 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
6764 1 0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 1 0 1 0 0
6774 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
6924 0 0 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 1 0 1 0 0

22 rows × 27 columns

In [286]:
df.duplicated().sum()
Out[286]:
22
In [289]:
df = df.drop_duplicates()
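Before dropping, it can help to look at every member of each duplicate group, not only the repeated rows that `duplicated()` flags by default. A minimal sketch, assuming a small made-up frame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for the encoded churn data.
toy = pd.DataFrame({
    "tenure": [1, 1, 5, 1, 5],
    "Contract_Month-to-month": [1, 1, 0, 1, 0],
    "Churn": [0, 0, 1, 0, 0],
})

# keep=False flags every member of a duplicate group, including the first occurrence.
all_dupes = toy[toy.duplicated(keep=False)]

# Number of identical rows in each group that occurs more than once.
group_sizes = toy.groupby(list(toy.columns)).size()
group_sizes = group_sizes[group_sizes > 1]

# drop_duplicates keeps the first row of each group by default.
deduped = toy.drop_duplicates()

print(all_dupes)
print(group_sizes)
print(len(toy), "->", len(deduped))
```

Once customerID is gone, identical rows may simply be different customers with the same profile, so whether to drop them is a modelling choice.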
In [293]:
df['Churn'] = df.pop('Churn')
In [295]:
# Moved the 'Churn' column to the end.

Correlation Analysis¶

In [298]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm') 
Out[298]:
  gender SeniorCitizen Partner Dependents tenure MultipleLines OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges No_Internet No_Phone InternetService_DSL InternetService_Fiber optic PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check Contract_Month-to-month Contract_One year Contract_Two year Churn
gender 1.000000 0.001069 0.000583 -0.010912 -0.006370 0.008199 0.015839 0.012523 0.000209 0.007996 0.006488 0.009471 0.011497 0.012361 -0.000879 -0.003164 -0.007799 -0.007607 0.009898 0.015566 -0.002070 -0.001452 -0.011727 0.004008 -0.008196 0.003146 0.008694
SeniorCitizen 0.001069 1.000000 0.016030 -0.211479 0.014456 0.142403 -0.039258 0.066039 0.058881 -0.061293 0.104830 0.119247 0.155922 0.219131 0.101642 -0.181713 -0.008724 -0.108914 0.254556 -0.016781 -0.024909 0.170949 -0.151840 0.138919 -0.047053 -0.116898 0.151270
Partner 0.000583 0.016030 1.000000 0.451254 0.379564 0.140338 0.141722 0.139971 0.151709 0.118518 0.122387 0.115979 -0.014856 0.095277 0.317021 0.002823 -0.019420 -0.002662 0.000212 0.110009 0.080889 -0.083856 -0.093854 -0.278229 0.081661 0.246114 -0.148670
Dependents -0.010912 -0.211479 0.451254 1.000000 0.161288 -0.026103 0.079591 0.022187 0.012436 0.061825 -0.018146 -0.040073 -0.110973 -0.114641 0.062762 0.141100 0.000408 0.050589 -0.165140 0.051341 0.060125 -0.149862 0.059159 -0.228311 0.068243 0.200783 -0.162366
tenure -0.006370 0.014456 0.379564 0.161288 1.000000 0.330194 0.326798 0.359445 0.359833 0.323761 0.278077 0.283212 0.003709 0.244194 0.825293 -0.033641 -0.009217 0.011691 0.016640 0.242424 0.231385 -0.211583 -0.228902 -0.648215 0.200872 0.563273 -0.353339
MultipleLines 0.008199 0.142403 0.140338 -0.026103 0.330194 1.000000 0.097065 0.200679 0.200186 0.098884 0.256229 0.257610 0.163462 0.490016 0.467641 -0.209085 -0.280776 -0.202182 0.366462 0.074125 0.059004 0.083436 -0.225640 -0.086348 -0.004982 0.105286 0.041888
OnlineSecurity 0.015839 -0.039258 0.141722 0.079591 0.326798 0.097065 1.000000 0.282253 0.273833 0.353636 0.174223 0.186144 -0.004622 0.295398 0.411541 -0.332233 0.091098 0.319812 -0.031242 0.093412 0.114550 -0.112793 -0.077911 -0.245518 0.099739 0.190796 -0.170565
OnlineBackup 0.012523 0.066039 0.139971 0.022187 0.359445 0.200679 0.282253 1.000000 0.301907 0.292679 0.280308 0.273207 0.126740 0.440711 0.509049 -0.380417 0.051440 0.155840 0.165547 0.085844 0.089372 -0.000672 -0.172195 -0.162680 0.083045 0.110258 -0.081145
DeviceProtection 0.000209 0.058881 0.151709 0.012436 0.359833 0.200186 0.273833 0.301907 1.000000 0.331883 0.388829 0.401228 0.103705 0.481919 0.521863 -0.379578 0.069401 0.144206 0.175987 0.081946 0.110196 -0.003622 -0.185514 -0.224408 0.101868 0.164188 -0.064978
TechSupport 0.007996 -0.061293 0.118518 0.061825 0.323761 0.098884 0.353636 0.292679 0.331883 1.000000 0.276411 0.279014 0.037060 0.337361 0.431822 -0.335128 0.094559 0.311633 -0.021020 0.099516 0.116094 -0.115315 -0.082626 -0.284226 0.095327 0.240071 -0.163980
StreamingTV 0.006488 0.104830 0.122387 -0.018146 0.278077 0.256229 0.174223 0.280308 0.388829 0.276411 1.000000 0.532456 0.224151 0.629336 0.514548 -0.414390 0.020595 0.013681 0.329704 0.044870 0.038762 0.144760 -0.245983 -0.110561 0.060738 0.070836 0.065058
StreamingMovies 0.009471 0.119247 0.115979 -0.040073 0.283212 0.257610 0.186144 0.273207 0.401228 0.279014 0.532456 1.000000 0.211456 0.626885 0.518704 -0.417891 0.032697 0.024342 0.322398 0.047498 0.047152 0.137415 -0.248553 -0.115873 0.063583 0.074310 0.062670
PaperlessBilling 0.011497 0.155922 -0.014856 -0.110973 0.003709 0.163462 -0.004622 0.126740 0.103705 0.037060 0.224151 0.211456 1.000000 0.350900 0.157449 -0.319082 -0.017017 -0.064091 0.325295 -0.017976 -0.014221 0.207569 -0.202521 0.169603 -0.052846 -0.147104 0.190518
MonthlyCharges 0.012361 0.219131 0.095277 -0.114641 0.244194 0.490016 0.295398 0.440711 0.481919 0.337361 0.629336 0.626885 0.350900 1.000000 0.650540 -0.762181 -0.249625 -0.163695 0.787169 0.040927 0.028552 0.269931 -0.373324 0.061867 0.003271 -0.075152 0.194008
TotalCharges -0.000879 0.101642 0.317021 0.062762 0.825293 0.467641 0.411541 0.509049 0.521863 0.431822 0.514548 0.518704 0.157449 0.650540 1.000000 -0.373655 -0.114222 -0.053986 0.360768 0.184837 0.181387 -0.061060 -0.292598 -0.445223 0.169300 0.357016 -0.198362
No_Internet -0.003164 -0.181713 0.002823 0.141100 -0.033641 -0.209085 -0.332233 -0.380417 -0.379578 -0.335128 -0.414390 -0.417891 -0.319082 -0.762181 -0.373655 1.000000 -0.171445 -0.379098 -0.464418 0.000606 0.003568 -0.282854 0.315183 -0.221836 0.039877 0.220282 -0.228220
No_Phone -0.007799 -0.008724 -0.019420 0.000408 -0.009217 -0.280776 0.091098 0.051440 0.069401 0.094559 0.020595 0.032697 -0.017017 -0.249625 -0.114222 -0.171445 1.000000 0.452245 -0.290997 -0.008821 0.006381 -0.002890 0.005708 0.002172 0.002615 -0.005022 -0.011072
InternetService_DSL -0.007607 -0.108914 -0.002662 0.050589 0.011691 -0.202182 0.319812 0.155840 0.144206 0.311633 0.013681 0.024342 -0.064091 -0.163695 -0.053986 -0.379098 0.452245 1.000000 -0.643450 0.023909 0.050418 -0.105062 0.045291 -0.063866 0.046507 0.030032 -0.124152
InternetService_Fiber optic 0.009898 0.254556 0.000212 -0.165140 0.016640 0.366462 -0.031242 0.165547 0.175987 -0.021020 0.329704 0.322398 0.325295 0.787169 0.360768 -0.464418 -0.290997 -0.643450 1.000000 -0.023384 -0.051204 0.334537 -0.304077 0.244634 -0.077498 -0.210967 0.307612
PaymentMethod_Bank transfer (automatic) 0.015566 -0.016781 0.110009 0.051341 0.242424 0.074125 0.093412 0.085844 0.081946 0.099516 0.044870 0.047498 -0.017976 0.040927 0.184837 0.000606 -0.008821 0.023909 -0.023384 1.000000 -0.279541 -0.378198 -0.287391 -0.178966 0.056822 0.154215 -0.117442
PaymentMethod_Credit card (automatic) -0.002070 -0.024909 0.080889 0.060125 0.231385 0.059004 0.114550 0.089372 0.110196 0.116094 0.038762 0.047152 -0.014221 0.028552 0.181387 0.003568 0.006381 0.050418 -0.051204 -0.279541 1.000000 -0.374894 -0.284881 -0.203821 0.066798 0.173645 -0.134052
PaymentMethod_Electronic check -0.001452 0.170949 -0.083856 -0.149862 -0.211583 0.083436 -0.112793 -0.000672 -0.003622 -0.115315 0.144760 0.137415 0.207569 0.269931 -0.061060 -0.282854 -0.002890 -0.105062 0.334537 -0.378198 -0.374894 1.000000 -0.385422 0.332156 -0.109966 -0.281924 0.301079
PaymentMethod_Mailed check -0.011727 -0.151840 -0.093854 0.059159 -0.228902 -0.225640 -0.077911 -0.172195 -0.185514 -0.082626 -0.245983 -0.248553 -0.202521 -0.373324 -0.292598 0.315183 0.005708 0.045291 -0.304077 -0.287391 -0.284881 -0.385422 1.000000 0.002854 0.002128 -0.005351 -0.091649
Contract_Month-to-month 0.004008 0.138919 -0.278229 -0.228311 -0.648215 -0.086348 -0.245518 -0.162680 -0.224408 -0.284226 -0.110561 -0.115873 0.169603 0.061867 -0.445223 -0.221836 0.002172 -0.063866 0.244634 -0.178966 -0.203821 0.332156 0.002854 1.000000 -0.569560 -0.621445 0.404346
Contract_One year -0.008196 -0.047053 0.081661 0.068243 0.200872 -0.004982 0.099739 0.083045 0.101868 0.095327 0.060738 0.063583 -0.052846 0.003271 0.169300 0.039877 0.002615 0.046507 -0.077498 0.056822 0.066798 -0.109966 0.002128 -0.569560 1.000000 -0.290013 -0.177742
Contract_Two year 0.003146 -0.116898 0.246114 0.200783 0.563273 0.105286 0.190796 0.110258 0.164188 0.240071 0.070836 0.074310 -0.147104 -0.075152 0.357016 0.220282 -0.005022 0.030032 -0.210967 0.154215 0.173645 -0.281924 -0.005351 -0.621445 -0.290013 1.000000 -0.301375
Churn 0.008694 0.151270 -0.148670 -0.162366 -0.353339 0.041888 -0.170565 -0.081145 -0.064978 -0.163980 0.065058 0.062670 0.190518 0.194008 -0.198362 -0.228220 -0.011072 -0.124152 0.307612 -0.117442 -0.134052 0.301079 -0.091649 0.404346 -0.177742 -0.301375 1.000000

Feature Importance¶

In [303]:
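# Rank features by the absolute value of their Pearson correlation with Churn.
# This is a simple linear proxy for importance, not a model-based importance score.
# The first entry is Churn's correlation with itself (1.0).
feature_importance = corr["Churn"].abs().sort_values(ascending=False)
feature_importance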
Out[303]:
Churn                                      1.000000
Contract_Month-to-month                    0.404346
tenure                                     0.353339
InternetService_Fiber optic                0.307612
Contract_Two year                          0.301375
PaymentMethod_Electronic check             0.301079
No_Internet                                0.228220
TotalCharges                               0.198362
MonthlyCharges                             0.194008
PaperlessBilling                           0.190518
Contract_One year                          0.177742
OnlineSecurity                             0.170565
TechSupport                                0.163980
Dependents                                 0.162366
SeniorCitizen                              0.151270
Partner                                    0.148670
PaymentMethod_Credit card (automatic)      0.134052
InternetService_DSL                        0.124152
PaymentMethod_Bank transfer (automatic)    0.117442
PaymentMethod_Mailed check                 0.091649
OnlineBackup                               0.081145
StreamingTV                                0.065058
DeviceProtection                           0.064978
StreamingMovies                            0.062670
MultipleLines                              0.041888
No_Phone                                   0.011072
gender                                     0.008694
Name: Churn, dtype: float64
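Taking the absolute value hides the direction of each relationship, and the ranking includes the target itself. One possible variant drops the target's self-correlation and sorts by magnitude while keeping the sign. This is a sketch on invented data (the frame `toy` and the names `target_corr` and `ranked` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "tenure": [1, 3, 10, 24, 48, 60],
    "MonthlyCharges": [70.0, 85.0, 60.0, 90.0, 40.0, 30.0],
    "gender": [0, 1, 0, 1, 1, 0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

corr = toy.corr()

# Drop the target's correlation with itself.
target_corr = corr["Churn"].drop("Churn")

# Sort by magnitude, but keep the signed values in the result.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

With the sign preserved, it is easier to see that longer tenure goes with less churn while, for example, a month-to-month contract goes with more.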
In [306]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(corr[["Churn"]].abs().sort_values(by="Churn", ascending=False), annot=True, cmap="coolwarm")
plt.title("Feature Importance")
plt.show()

Multicollinearity¶

In [311]:
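threshold = 0.7
# Flatten the correlation matrix into (feature, feature) pairs, largest first.
high_corr_pairs = corr.unstack().sort_values(ascending=False)
# Drop the diagonal (each feature's correlation with itself).
high_corr_pairs = high_corr_pairs[high_corr_pairs != 1]
# Keep pairs whose absolute correlation exceeds the threshold.
# The matrix is symmetric, so every pair is listed twice.
high_corr_pairs = high_corr_pairs[high_corr_pairs.abs() > threshold]

print(high_corr_pairs)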
TotalCharges                 tenure                         0.825293
tenure                       TotalCharges                   0.825293
InternetService_Fiber optic  MonthlyCharges                 0.787169
MonthlyCharges               InternetService_Fiber optic    0.787169
No_Internet                  MonthlyCharges                -0.762181
MonthlyCharges               No_Internet                   -0.762181
dtype: float64
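Each pair appears twice in the output because the correlation matrix is symmetric. A minimal sketch of one way to list each pair once is to mask everything except the entries strictly above the diagonal, which also removes the self-correlations without comparing floats to 1. The data and the names `toy` and `upper` are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately near-collinear pair.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 72, size=200)
toy = pd.DataFrame({
    "tenure": tenure,
    "TotalCharges": tenure * 65.0 + rng.normal(0, 100, size=200),
    "gender": rng.integers(0, 2, size=200),
})

corr = toy.corr()
threshold = 0.7

# Boolean mask that is True only strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out entries become NaN and are dropped after flattening.
upper = corr.where(mask)
pairs = upper.unstack().dropna()

high_corr_pairs = pairs[pairs.abs() > threshold]
print(high_corr_pairs)
```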
In [314]:
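# Mask every cell at or below the threshold, then mask the diagonal,
# so that only the strongly correlated pairs are drawn and annotated.
high_corr_matrix = corr[abs(corr) > threshold]
high_corr_matrix = high_corr_matrix[high_corr_matrix != 1]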
plt.figure(figsize=(15, 8))
sns.heatmap(high_corr_matrix, annot=True, cmap="coolwarm")
plt.title("Highly Correlated Features (Multicollinearity)")
plt.show()
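Pairwise correlations only reveal two-feature relationships. A common complement is the variance inflation factor (VIF), which measures how well each feature is explained by all the others together; for a set of predictors, the VIF of feature j equals the j-th diagonal entry of the inverse of their correlation matrix. The sketch below uses NumPy only and invented data (the frame `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy predictors; TotalCharges is built from the other two on purpose.
rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 72, size=n).astype(float)
monthly = rng.normal(65, 30, size=n)
toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly + rng.normal(0, 200, size=n),
    "gender": rng.integers(0, 2, size=n).astype(float),
})

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
corr_matrix = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr_matrix)), index=toy.columns, name="VIF")

print(vif.sort_values(ascending=False))
```

Applied to the notebook's DataFrame, this would need the target removed and one level dropped from each one-hot group first (for example one of the three Contract_* columns), since a complete set of dummies is perfectly collinear and makes the matrix singular.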

Generating an HTML report contrasting the training and test datasets on the target using SweetViz¶

In [151]:
from sklearn.model_selection import train_test_split
import sweetviz as sv
In [153]:
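# random_state is fixed (42 is an arbitrary choice) so the split is reproducible across runs.
train_df, test_df = train_test_split(df, train_size=0.80, random_state=42)
# Compare the two splits, using Churn as the target feature.
compare = sv.compare([train_df, "Training Data"], [test_df, "Test Data"], "Churn")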
 
compare.show_html('Compare.html')
Report Compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
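A purely random split can leave the churn rate slightly different in the two sets. `train_test_split` accepts a `stratify` argument for this; the sketch below shows the same idea with pandas only, sampling 80% within each class, so that the effect is easy to verify. The frame `toy` and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical imbalanced toy data: 25% churners.
toy = pd.DataFrame({
    "tenure": range(200),
    "Churn": [1] * 50 + [0] * 150,
})

# Sample 80% of the rows within each Churn class.
train_df = toy.groupby("Churn").sample(frac=0.8, random_state=42)

# Everything that was not sampled becomes the test set.
test_df = toy.drop(train_df.index)

# The churn rate should match in both splits.
print(train_df["Churn"].mean(), test_df["Churn"].mean())
```

Whether to stratify here is a judgment call: a stratified split makes the target distributions match by construction, which removes one of the differences the comparison report is meant to surface.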

Generating an HTML report using ydata-profiling (formerly pandas-profiling)¶

In [166]:
from ydata_profiling import ProfileReport
In [168]:
profile = ProfileReport(df)

profile.to_file("ProfileReport.html")
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
/Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages/ydata_profiling/model/pandas/discretize_pandas.py:52: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '[9 0 0 ... 9 0 0]' has dtype incompatible with int8, please explicitly cast to a compatible dtype first.
  discretized_df.loc[:, column] = self._discretize_column(
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
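The FutureWarning above appears to come from the profiler writing discretized values back into columns stored as int8. One possible workaround, not verified against this ydata-profiling version, is to upcast those columns on a copy before profiling. A minimal sketch with invented data (`toy` and `profile_input` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy frame with the same narrow dtypes as the notebook's DataFrame.
toy = pd.DataFrame({
    "gender": pd.Series([0, 1, 1], dtype="int8"),
    "tenure": pd.Series([1, 34, 2], dtype="int8"),
    "MonthlyCharges": pd.Series([29.85, 56.95, 53.85], dtype="float32"),
})

# Upcast only the int8 columns; astype returns a new frame and leaves toy unchanged.
int8_cols = toy.select_dtypes(include="int8").columns
profile_input = toy.astype({col: "int64" for col in int8_cols})

print(profile_input.dtypes)
```

The upcast copy would then be passed to `ProfileReport` in place of `df`, leaving the memory-efficient dtypes intact for the rest of the notebook.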