Decision Tree
Hyperparameters
1) Depth (max_depth) - a max depth of 20 allows up to 2^20 leaves
2) Min no. of obs in a node (min_samples_split)
3) Min no. of obs in a leaf (min_samples_leaf)
4) Gini / Entropy (criterion) - see the sketch after this list
5) class_weight - a dictionary whose keys are the classes and whose values are the weights. Use only if there is class imbalance in the data
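A minimal sketch (illustrative values, not from the notes) of how the two split criteria are computed from a node's class proportions:
import numpy as np

p = np.array([0.7, 0.3])             # example class proportions in a node
gini = 1 - np.sum(p ** 2)            # Gini impurity: 1 - sum(p_i^2)  -> 0.42
entropy = -np.sum(p * np.log2(p))    # Entropy: -sum(p_i * log2(p_i)) -> ~0.881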
import sklearn.tree as dt
dir(dt)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
DecisionTreeClassifier?
# to know the hyperparameters
# Hyperparameter tuning
import numpy as np

param_grid = {'max_depth': np.arange(3, 20),
              'max_features': np.arange(3, 10),
              'criterion': ['gini', 'entropy'],
              'min_samples_split': np.arange(2, 10)}
tree = GridSearchCV(DecisionTreeClassifier(min_samples_leaf=100), param_grid, cv=10)
tree.fit(train_X, train_y)
tree.best_params_
clf_tree = DecisionTreeClassifier( max_depth = 5,max_features=9 )
clf_tree.fit( train_X, train_y )
Bagging Algorithms
import sklearn.ensemble as en
dir(en)
Bagging
The primary weakness of decision trees is that they don't tend to have the best predictive accuracy. This is partially due to high variance, meaning that different splits in the training data can lead to very different trees.
Bagging is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for bootstrap aggregation, meaning the aggregation of bootstrap samples.
What is a bootstrap sample? A random sample of the training data, drawn with replacement and of the same size as the original data.
How does bagging work (for decision trees)? Draw B bootstrap samples, grow one (deep) tree on each sample, then combine the B trees' predictions by majority vote (classification) or averaging (regression).
Notes:
Bagging increases predictive accuracy by reducing the variance, similar to how cross-validation reduces the variance associated with a single train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.
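A bootstrap sample in one line of numpy (an illustrative sketch; the row indices are made up):
import numpy as np
rows = np.arange(10)
boot = np.random.choice(rows, size=len(rows), replace=True)   # sample with replacement -> some rows repeat
oob = np.setdiff1d(rows, boot)                                 # rows never drawn ("out-of-bag", ~1/3 on average)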
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
param_grid={'n_estimators':np.arange(3,100)}
tree=GridSearchCV(BaggingClassifier(oob_score=False,warm_start=True),param_grid,cv=5,n_jobs=-1)
tree.fit(train_x,train_y.values.ravel())
bagclm = BaggingClassifier(oob_score=True, n_estimators=100)
bagclm.fit(train_X, train_y)
Random Forest
Random forests are a slight variation of bagged trees that typically perform even better.
What's the point? At each split only a random subset of the features is considered, which decorrelates the individual trees and further reduces the variance of the averaged ensemble.
One important tuning parameter is n_estimators, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".
The other important tuning parameter is max_features, which is the number of features that should be considered at each split.
Advantages of Random Forests: high accuracy with little tuning, built-in feature importance estimates, robust to overfitting compared with a single tree.
Disadvantages of Random Forests: less interpretable than a single tree, slower to train and predict, larger models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid={'n_estimators':[100,200,300,400,500,600]}
# param_grid={'n_estimators':np.arange(10,100)}
tree=GridSearchCV(RandomForestClassifier(oob_score=False,warm_start=True),param_grid,cv=5,n_jobs=-1)
tree.fit(train_x,train_y.values.ravel())
tree.best_params_
radm_clf=RandomForestClassifier(oob_score=True,n_estimators=100,n_jobs=-1,random_state=42)
radm_clf.fit(train_x,train_y.values.ravel())
Random forests are quite resistant to overfitting: adding more trees does not make the ensemble overfit, although the model can still overfit very noisy data.
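Random forests also expose per-feature importances (referred to later in the SVM notes); a minimal sketch, assuming train_x is a pandas DataFrame:
import pandas as pd
pd.Series(radm_clf.feature_importances_, index=train_x.columns).sort_values(ascending=False)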
AdaBoost & GradientBoost
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
pargrid_ada = {'n_estimators': [100, 200, 400, 600, 800],
'learning_rate': [10 ** x for x in range(-3, 3)]}
gscv_ada = GridSearchCV(estimator=AdaBoostClassifier(),
param_grid=pargrid_ada,
cv=5,
verbose=True, n_jobs=-1)
gscv_ada.fit(train_X, train_y)
gscv_ada.best_params_
ad = AdaBoostClassifier(learning_rate=0.1, n_estimators=800)  # hyperparameters go in the constructor, not fit()
ad.fit(train_X, train_y)
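GradientBoostingClassifier is imported above but not shown; a minimal sketch with assumed hyperparameter values (tune them the same way as AdaBoost):
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(train_X, train_y)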
Xtreme Gradient Boosting Model
from xgboost import XGBClassifier  # conda install -c anaconda py-xgboost
from sklearn.model_selection import GridSearchCV
pargrid_xgbm = {'n_estimators': [200, 250, 300, 400, 500],
                'learning_rate': [10 ** x for x in range(-3, 1)],  # shrinkage / step size (eta in XGBoost)
                'max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]}  # max_features is not an XGBClassifier parameter
gscv_xgbm = GridSearchCV(estimator=XGBClassifier(),
param_grid=pargrid_xgbm,
cv=5,
verbose=True, n_jobs=-1)
gscv_xgbm.fit(train_X, train_y)
gscv_xgbm.best_params_
xgbm = gscv_xgbm.best_estimator_
gscv_xgbm.best_score_
xgbm.fit(train_X, train_y)
CatBoost
Designed to handle categorical variables much better
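A minimal usage sketch, assuming the catboost package is installed and reusing train_X/train_y from above; the cat_features indices are illustrative:
from catboost import CatBoostClassifier   # pip install catboost
cat_features = [0, 3]                      # assumed: indices of the categorical columns in train_X
cb_clf = CatBoostClassifier(iterations=200, learning_rate=0.1, verbose=False)
cb_clf.fit(train_X, train_y, cat_features=cat_features)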
LightGBM
Designed to train very fast on very large datasets
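A minimal usage sketch, assuming the lightgbm package is installed:
from lightgbm import LGBMClassifier       # pip install lightgbm
lgbm_clf = LGBMClassifier(n_estimators=200, learning_rate=0.1)
lgbm_clf.fit(train_X, train_y)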
Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit( train_x, train_y )
# For multiclass problem
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs',multi_class='auto')
logreg.fit( train_x, train_y )
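C (inverse regularization strength) is the main hyperparameter here and can be tuned with the same GridSearchCV pattern used above; a minimal sketch with assumed values:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
logreg_cv = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=1000), param_grid, cv=5)
logreg_cv.fit(train_x, train_y)
logreg_cv.best_params_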
Naive Bayes
Essentially no hyperparameters to tune
Only use when you are sure that all Xs are independent of each other
Mostly used for text data
It is termed 'naive' because it assumes independence between every pair of features in the data
Output is the predicted class (0 or 1). GaussianNB does expose predict_proba, but the probabilities are usually poorly calibrated, so there is little point in tweaking the cut-off
Generally works well on text data even though the words (which become the variables) have complex relationships with each other
CUT-OFF TWEAKING NOT RECOMMENDED
from sklearn.naive_bayes import GaussianNB
nb_clf=GaussianNB()
nb_clf.fit(train_x,train_y)
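If you do want to look at the (often poorly calibrated) probabilities, predict_proba is available:
nb_probs = nb_clf.predict_proba(test_x)[:, 1]   # probability of class 1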
K-Nearest Neighbor
Hyperparameters - n_neighbors (K - no. of neighbors considered for the vote/average)
Can also set weights (hyperparameter) to one of the following (see the sketch at the end of this section):
a) uniform (default) - All points in each neighborhood are weighted equally
b) distance - weight points by the inverse of their distance
KNeighborsClassifier - for classification
KNeighborsRegressor - for regression problems
By default it uses Euclidean distance
Mostly used for product recommendation
CAN TWEAK CUT-OFF
from sklearn.neighbors import KNeighborsClassifier
# GridSearch to find the best value of K
from sklearn.model_selection import GridSearchCV
#tuned_parameters=[{'n_neighbors':[3,5,7,9,11]}]
# these values should not be more than sqrt(no. of obs.), and also should not be too small
tuned_parameters=[{'n_neighbors':list(range(5,50,2))}]
knn_clf=GridSearchCV(KNeighborsClassifier(),
tuned_parameters,
cv=5,
scoring='accuracy')
knn_clf.fit(train_x,train_y)
knn_clf.best_score_ # this is accuracy on training data
knn_clf.best_params_ # to get the best value of K
# Building the model with best_params_ (say 9)
knn_clf=KNeighborsClassifier(n_neighbors=9)
knn_clf.fit(train_x,train_y)
# Tweak cut-off if there's bias
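The weights hyperparameter mentioned above can be added to the same grid; a minimal sketch:
tuned_parameters = [{'n_neighbors': list(range(5, 50, 2)),
                     'weights': ['uniform', 'distance']}]
knn_clf = GridSearchCV(KNeighborsClassifier(), tuned_parameters, cv=5, scoring='accuracy')
knn_clf.fit(train_x, train_y)
knn_clf.best_params_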
Support Vector Machines - SVM
Hyperparameters -
(1) C - Hyper-parameter for soft margin
(2) gamma - RBF kernel coefficient (how far a single training example's influence reaches)
Cannot tweak the cut-off by default, because SVC does not output probabilities (set probability=True for Platt-scaled estimates, at extra training cost)
Does not give variable importance like RandomForest
Also handles multiclass classification (scikit-learn uses a one-vs-one scheme internally)
Classification only (SVR is the separate algorithm for regression problems)
from sklearn.svm import SVC
svc=SVC(kernel='rbf',class_weight='balanced')
from sklearn.model_selection import GridSearchCV
param_grid = {'C': list(range(1,200,2)), # Hyper-parameter for soft margin
'gamma': [10**x for x in range(-4,2)]} # Hyper-parameter for dimensionality projection
grid = GridSearchCV(svc, param_grid)
%time grid.fit(train_X, train_y)
print(grid.best_params_)
model=grid.best_estimator_
# Accuracy
from sklearn import metrics
metrics.accuracy_score(test_y, model.predict(test_x))
Model Evaluation
1) Accuracy
from sklearn import metrics
test_accuracy=metrics.accuracy_score(test_y,model.predict(test_x)) #accuracy on testing data
train_accuracy=metrics.accuracy_score(train_y,model.predict(train_x)) #accuracy on training data
2) Classification Report
from sklearn.metrics import classification_report
print(classification_report(test_y,model.predict(test_x)))
3) Confusion Matrix
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
cm=metrics.confusion_matrix(test_y, nb_clf.predict(test_x))
sn.heatmap(cm,annot=True,fmt='.2f',xticklabels=['no','yes'],yticklabels=['no','yes'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
4) AUC score
- Only for binary classification
auc_score = metrics.roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])  # use predicted probabilities, not hard labels
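The full ROC curve can also be computed directly (a sketch assuming the model exposes predict_proba):
fpr, tpr, thresholds = metrics.roc_curve(test_y, model.predict_proba(test_x)[:, 1])
metrics.auc(fpr, tpr)   # same value as roc_auc_score above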
Tweaking cut-off
import pandas as pd
import numpy as np

test_predicted_prob = pd.DataFrame(logreg.predict_proba(test_x))[[1]]
test_predicted_prob.columns=['prob']
actual=test_y.reset_index()
actual.drop('index',axis=1,inplace=True)
# making a DataFrame with actual and prob columns
df_test_predict = pd.concat([actual, test_predicted_prob], axis=1)
df_test_predict.columns = ['actual','prob']
df_test_predict.head()
test_roc_like_df = pd.DataFrame()
test_temp = df_test_predict.copy()
for cut_off in np.linspace(0, 1, 50):
    test_temp['predicted'] = test_temp['prob'].apply(lambda x: 0 if x < cut_off else 1)
    test_temp['tp'] = test_temp.apply(lambda x: 1 if x['actual'] == 1 and x['predicted'] == 1 else 0, axis=1)
    test_temp['fp'] = test_temp.apply(lambda x: 1 if x['actual'] == 0 and x['predicted'] == 1 else 0, axis=1)
    test_temp['tn'] = test_temp.apply(lambda x: 1 if x['actual'] == 0 and x['predicted'] == 0 else 0, axis=1)
    test_temp['fn'] = test_temp.apply(lambda x: 1 if x['actual'] == 1 and x['predicted'] == 0 else 0, axis=1)
    sensitivity = test_temp['tp'].sum() / (test_temp['tp'].sum() + test_temp['fn'].sum())
    specificity = test_temp['tn'].sum() / (test_temp['tn'].sum() + test_temp['fp'].sum())
    accuracy = (test_temp['tp'].sum() + test_temp['tn'].sum()) / (test_temp['tp'].sum() + test_temp['fn'].sum() + test_temp['tn'].sum() + test_temp['fp'].sum())
    test_roc_like_table = pd.DataFrame([cut_off, sensitivity, specificity, accuracy]).T
    test_roc_like_table.columns = ['cutoff', 'sensitivity', 'specificity', 'accuracy']
    test_roc_like_df = pd.concat([test_roc_like_df, test_roc_like_table], axis=0)
test_roc_like_df.head()
import matplotlib.pyplot as plt
test_temp.sum()
plt.subplots(figsize=(10,4))
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['sensitivity'], marker='*', label='Sensitivity')
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['specificity'], marker='*', label='Specificity')
#plt.scatter(test_roc_like_df['cutoff'], 1-test_roc_like_df['specificity'], marker='*', label='FPR')
plt.title('Sensitivity and specificity for each cut-off')
plt.legend()
## Finding the ideal cut-off (to check whether it stays the same in out-of-sample validation)
test_roc_like_df['total'] = test_roc_like_df['sensitivity'] + test_roc_like_df['specificity']
test_roc_like_df[test_roc_like_df['total']==test_roc_like_df['total'].max()]
df_test_predict['predicted'] = df_test_predict['prob'].apply(lambda x: 1 if x > 0.408163 else 0)  # cut-off from the max sensitivity + specificity row above
import seaborn as sns
sns.heatmap(pd.crosstab(df_test_predict['actual'], df_test_predict['predicted']), annot=True, fmt='.0f')
accuracy=metrics.accuracy_score(df_test_predict.actual, df_test_predict.predicted)
print('Accuracy: ',round(accuracy,2))
Classification Report
from sklearn.metrics import classification_report
print(classification_report(df_test_predict.actual, df_test_predict.predicted))