Import a few libraries you think you'll need (or just import them as you go along!)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read in the advertising.csv file and set it to a data frame called ad_data.
ad_data = pd.read_csv('D:/Data Science/Py-DS-ML-Bootcamp-master/Refactored_Py_DS_ML_Bootcamp-master/13-Logistic-Regression/advertising.csv')
Check the head of ad_data
ad_data.head()
Use info() and describe() on ad_data
# info() lists each column with its dtype and non-null count
ad_data.info()
# describe() gives summary statistics for the numerical columns
ad_data.describe()
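Beyond info() and describe(), it can help to count missing values and duplicate rows explicitly before plotting. Here is a minimal sketch of that check. It uses a small made-up frame called `demo` (a hypothetical stand-in, not the real advertising.csv) so that it runs on its own; on the real data you would call the same methods on ad_data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ad_data; the values are made up for illustration.
demo = pd.DataFrame({
    'Age': [25, 32, np.nan, 41, 25],
    'Area Income': [55000.0, 61000.0, 48000.0, np.nan, 55000.0],
    'Clicked on Ad': [0, 1, 1, 0, 0],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row
n_duplicates = demo.duplicated().sum()
print(n_duplicates)
```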
Let's use seaborn to explore the data!
I have tried recreating the plots shown below!
sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
sns.jointplot(x='Age',y='Area Income',data=ad_data)
The plot above suggests a trend between age and area income: income appears to rise from around age 20 and then to decline from around age 50, which is what we would expect.
sns.jointplot(x='Age',y='Daily Time Spent on Site',data=ad_data,color='red',kind='kde');
Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
As the plot above shows, there are two clear clusters in the data: users who spend little time on the site and also have low daily internet usage, and users who spend a lot of time on the site and also have high daily internet usage. To understand the relationships in more detail, let's create a pair plot.
sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr')
From the pair plot above, we can see several relationships between the features, such as between age, area income, and daily time spent on site.
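A pair plot shows relationships visually; a correlation matrix puts numbers on them. The sketch below is one way to do that with DataFrame.corr(). The frame `demo` is synthetic and hypothetical (it only borrows the column names), so the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data; not the real advertising dataset.
rng = np.random.default_rng(0)
time_on_site = rng.normal(65, 15, size=200)
demo = pd.DataFrame({
    'Daily Time Spent on Site': time_on_site,
    # Built to move with time on site, plus noise
    'Daily Internet Usage': 2.5 * time_on_site + rng.normal(0, 20, size=200),
    # Drawn independently of the other two columns
    'Age': rng.normal(36, 9, size=200),
})

# Pairwise Pearson correlations between the numeric columns
corr = demo.corr()
print(corr.round(2))
```

On the real data, ad_data.corr(numeric_only=True) should give the same kind of table, and it can be passed to sns.heatmap for a visual version.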
Now it's time to do a train test split, and train our model!
We'll have the freedom here to choose columns that we want to train on!
Split the data into training set and testing set using train_test_split
from sklearn.model_selection import train_test_split
# Define the feature matrix X and the target y
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
# Hold out 33% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
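The split above holds out 33% of the rows at random. One optional refinement, which the notebook itself does not use, is the stratify argument of train_test_split: it keeps the share of each class the same in the training and test sets. The sketch below shows it on toy arrays (`X_demo` and `y_demo` are hypothetical, made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, balanced binary target
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Sizes of the two parts, then the share of class 1 in each
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```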
Train and fit a logistic regression model on the training set.
# Import the logistic regression estimator
from sklearn.linear_model import LogisticRegression
# Create the model and fit it on the training data
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
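The features here sit on quite different scales (an income column next to an age column), and with default settings LogisticRegression may print a convergence warning on unscaled inputs. A common remedy, shown here only as a sketch and not as what the notebook does, is to standardize the features in a pipeline. The data below is synthetic and hypothetical, including the rule used to generate the labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with two features on very different scales
rng = np.random.default_rng(0)
age = rng.normal(36, 9, size=300)
income = rng.normal(55000, 13000, size=300)
X_demo = np.column_stack([age, income])
# Made-up labelling rule: older users click more often
y_demo = (age + rng.normal(0, 5, size=300) > 36).astype(int)

# Standardize each feature, then fit the logistic regression
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# Accuracy on the data the model was fitted on
train_accuracy = scaled_model.score(X_demo, y_demo)
print(train_accuracy)
```

Because the scaler lives inside the pipeline, it is fitted on the training data only and then reused unchanged when predict is called on test data.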
Now predict values for the testing data.
predictions = logmodel.predict(X_test)
Create a classification report for the model.
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
The report above shows an accuracy of about 91%, which is not bad.
print(confusion_matrix(y_test,predictions))
The model mislabeled some observations (the 24 and 6 in the off-diagonal cells), but relative to the size of our test set that is a small number, so we can accept it.
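To read a confusion matrix cell by cell, it helps to unpack it. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's y_test and predictions) and shows how accuracy follows from the four cells.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, made up for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# For binary labels, scikit-learn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Accuracy is the diagonal (correct predictions) over all predictions
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy, accuracy_score(y_true, y_pred))
```

The off-diagonal cells are the two kinds of error: false positives (predicted a click that did not happen) and false negatives (missed a click that did happen).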