Import Libraries

Import a few libraries you think you'll need (or just import them as you go along!)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Get the Data

Read in the advertising.csv file and set it to a data frame called ad_data.

In [2]:
ad_data = pd.read_csv('D:/Data Science/Py-DS-ML-Bootcamp-master/Refactored_Py_DS_ML_Bootcamp-master/13-Logistic-Regression/advertising.csv')

Check the head of ad_data

In [3]:
ad_data.head()
Out[3]:
Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad
0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0
1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0
2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0
3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0
4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0

Use info() and describe() on ad_data.

In [4]:
# info() gives an overview of the features present in the data frame: column names, dtypes, and non-null counts
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
In [5]:
# describe() gives summary statistics for the numeric columns in the data frame
ad_data.describe()
Out[5]:
Daily Time Spent on Site Age Area Income Daily Internet Usage Male Clicked on Ad
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000
mean 65.000200 36.009000 55000.000080 180.000100 0.481000 0.50000
std 15.853615 8.785562 13414.634022 43.902339 0.499889 0.50025
min 32.600000 19.000000 13996.500000 104.780000 0.000000 0.00000
25% 51.360000 29.000000 47031.802500 138.830000 0.000000 0.00000
50% 68.215000 35.000000 57012.300000 183.130000 0.000000 0.50000
75% 78.547500 42.000000 65470.635000 218.792500 1.000000 1.00000
max 91.430000 61.000000 79484.800000 269.960000 1.000000 1.00000
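
As a quick sanity check (optional, not part of the original exercise), we can confirm what the info() output already implies: there are no missing values to clean up.

In [ ]:
# Count missing values per column; every column should show 0,
# consistent with the 1000 non-null entries reported by info()
ad_data.isnull().sum()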

Exploratory Data Analysis

Let's use seaborn to explore the data!

I have tried recreating the plots shown below!

Create a histogram of the Age.

In [6]:
sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
Out[6]:
Text(0.5, 0, 'Age')
We can see that Age is approximately normally distributed.
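
To back up that reading with numbers, here's a minimal sketch comparing the mean with the median and checking skewness (values close to each other and near 0, respectively, suggest rough normality):

In [ ]:
# For a roughly normal distribution, mean ≈ median and skewness ≈ 0
print(ad_data['Age'].mean(), ad_data['Age'].median())
print(ad_data['Age'].skew())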

Create a jointplot showing Area Income versus Age.

In [7]:
sns.jointplot(x='Age',y='Area Income',data=ad_data)
Out[7]:
<seaborn.axisgrid.JointGrid at 0x9b13588>

The graph above shows a clear relationship between Age and Area Income: most people start earning around age 20, income tends to rise with age, and it begins to decline around age 50.
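
To check that reading of the plot, here is a small sketch that bins Age into brackets and looks at the mean Area Income per bracket (the bin edges are an arbitrary choice):

In [ ]:
# Average income per age bracket; if the trend is real, income should
# rise through the 20s-40s and dip for the oldest bracket
age_bins = pd.cut(ad_data['Age'], bins=[18, 30, 40, 50, 65])
ad_data.groupby(age_bins)['Area Income'].mean()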

Create a jointplot showing the kde distributions of Daily Time spent on site vs. Age.

In [8]:
sns.jointplot(x='Age',y='Daily Time Spent on Site',data=ad_data,color='red',kind='kde');

Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'

In [9]:
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
Out[9]:
<seaborn.axisgrid.JointGrid at 0x9e99ef0>

As we can see in the graph above, there are two clear clusters in the data: one for users who spend less time on the site and also have low internet usage, and another for users whose daily time on site and internet usage are both high. To understand the relationships in more detail, let's create a pair plot.
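
If you want to make the two-cluster structure explicit (optional, and not part of the original exercise), a quick KMeans sketch on those two features should roughly recover the groups:

In [ ]:
# Cluster on the two usage features and color the points by cluster;
# the split should match the low-usage vs. high-usage groups above
from sklearn.cluster import KMeans

usage = ad_data[['Daily Time Spent on Site', 'Daily Internet Usage']]
labels = KMeans(n_clusters=2, random_state=42).fit_predict(usage)
plt.scatter(usage.iloc[:, 0], usage.iloc[:, 1], c=labels, cmap='coolwarm')
plt.xlabel('Daily Time Spent on Site')
plt.ylabel('Daily Internet Usage')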

Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column feature.

In [10]:
sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr')
Out[10]:
<seaborn.axisgrid.PairGrid at 0x9ce5cc0>

From the pairplot we can spot several relationships between the features, such as Age vs. Area Income and Daily Time Spent on Site, and we can see how cleanly the 'Clicked on Ad' classes separate across them.
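
A correlation heatmap condenses the same pairwise relationships into a single view; a minimal sketch (selecting the numeric columns explicitly so it works across pandas versions):

In [ ]:
# Pairwise correlations between the numeric columns only
numeric_cols = ad_data.select_dtypes(include='number')
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm')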

Logistic Regression

Now it's time to do a train test split, and train our model!

We'll have the freedom here to choose columns that we want to train on!

Split the data into a training set and a testing set using train_test_split.

In [11]:
from sklearn.model_selection import train_test_split
In [12]:
# Define the feature matrix X (numeric columns only) and the target y
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
In [13]:
#Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
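
Optionally, a quick shape check confirms the 67/33 split:

In [ ]:
# With test_size=0.33 and 1000 rows, expect 670 train and 330 test samples
print(X_train.shape, X_test.shape)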

Train and fit a logistic regression model on the training set.

In [14]:
# Import the LogisticRegression estimator from scikit-learn
from sklearn.linear_model import LogisticRegression
In [15]:
# Instantiate and fit a logistic regression model on the training data
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Out[15]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
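
One caveat: these features live on very different scales (Area Income in the tens of thousands vs. a 0/1 Male flag), and newer scikit-learn versions default to the lbfgs solver, which can emit convergence warnings on unscaled data. Here's an optional sketch that standardizes the features first:

In [ ]:
# Optional: scale features before fitting; the pipeline applies the
# same scaling automatically at prediction time
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_logmodel = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logmodel.fit(X_train, y_train)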

Predictions and Evaluations

Now predict values for the testing data.

In [16]:
predictions = logmodel.predict(X_test)

Create a classification report for the model.

In [17]:
from sklearn.metrics import classification_report,confusion_matrix
In [18]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.87      0.96      0.91       162
          1       0.96      0.86      0.91       168

avg / total       0.91      0.91      0.91       330

From the report above we get 91% accuracy, with balanced precision and recall across both classes, which is quite good for an untuned model.
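
The 91% figure can also be computed directly instead of read off the report:

In [ ]:
# Fraction of test samples classified correctly
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))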

Confusion Matrix

In [19]:
print(confusion_matrix(y_test,predictions))
[[156   6]
 [ 24 144]]

We have some misclassified samples (24 false negatives and 6 false positives), but out of 330 test samples that error rate is small enough to accept for a simple model.
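
The raw matrix is easier to read with labels attached; a small sketch that wraps it in a DataFrame (rows are actual classes, columns are predicted):

In [ ]:
# Label the confusion matrix: the 6 is false positives (predicted a
# click that didn't happen), the 24 is false negatives (missed clicks)
pd.DataFrame(confusion_matrix(y_test, predictions),
             index=['Actual: No Click', 'Actual: Click'],
             columns=['Predicted: No Click', 'Predicted: Click'])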

Happy Learning!