Decision Tree

  • Hyperparameters
      1) Depth (max_depth) - a binary tree of depth 20 can have up to 2^20 leaves
      2) Min no. of obs a node needs before it can be split (min_samples_split)
      3) Min no. of obs in a leaf (min_samples_leaf)
      4) Split criterion: Gini / Entropy (criterion)
      5) class_weight={class: weight} - a dictionary whose keys are the classes and whose values are the weights. Use only if there's class imbalance in the data
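A rough sketch of how these hyperparameters map onto scikit-learn. The synthetic `make_classification` data and every parameter value here are assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

clf = DecisionTreeClassifier(
    max_depth=5,                # the tree can have at most 2**5 leaves
    min_samples_split=20,       # a node needs >= 20 observations to be split
    min_samples_leaf=10,        # every leaf keeps >= 10 observations
    criterion="gini",           # or "entropy"
    class_weight={0: 1, 1: 9},  # up-weight the rare class
    random_state=0,
)
clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
print(sorted(clf.get_params()))   # names of all available hyperparameters
```

`get_params()` is one way to list the hyperparameters an estimator accepts.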
In [1]:
import sklearn.tree as dt
In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
In [9]:
# to know the hyperparameters
In [ ]:
# Hyperparameter tuning
In [ ]:
tree = GridSearchCV(DecisionTreeClassifier(min_samples_leaf=100), param_grid, cv=10)
tree.fit(train_X, train_y)
In [ ]:
clf_tree = DecisionTreeClassifier(max_depth=5, max_features=9)
clf_tree.fit(train_X, train_y)
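The `param_grid` used for tuning is not shown in these notes. A minimal end-to-end sketch, assuming a hypothetical grid over `max_depth` and `max_features` and synthetic data in place of the real training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=600, n_features=12, random_state=1)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

# Hypothetical grid; the notebook's own param_grid is not shown
param_grid = {"max_depth": np.arange(2, 8),
              "max_features": np.arange(3, 10)}

tree = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
tree.fit(train_X, train_y)

print(tree.best_params_)            # best combination found by cross-validation
print(tree.best_score_)             # mean cross-validated accuracy for it
print(tree.score(test_X, test_y))   # accuracy of the refit model on held-out data
```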

Bagging Algorithms

In [2]:
import sklearn.ensemble as en


The primary weakness of decision trees is that they don't tend to have the best predictive accuracy. This is partially due to high variance, meaning that different splits in the training data can lead to very different trees.

Bagging is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for bootstrap aggregation, meaning the aggregation of bootstrap samples.

What is a bootstrap sample? A random sample with replacement:
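A small illustration with NumPy; the toy array of 20 numbers is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
nums = np.arange(1, 21)          # a toy "training set" of 20 observations

# Sample with replacement, same size as the original
boot = rng.choice(nums, size=nums.size, replace=True)
print(boot)

# Some observations repeat, others are left out ("out-of-bag")
out_of_bag = np.setdiff1d(nums, boot)
print(out_of_bag)
```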

How does bagging work (for decision trees)?

  1. Grow B trees using B bootstrap samples from the training data.
  2. Train each tree on its bootstrap sample and make predictions.
  3. Combine the predictions:
    • Average the predictions for regression trees
    • Take a vote for classification trees


  • Each bootstrap sample should be the same size as the original training set.
  • B should be a large enough value that the error seems to have "stabilized".
  • The trees are grown deep so that they have low bias/high variance.
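The steps above can be sketched by hand for a binary classification problem. This is only an illustration on synthetic data (B = 50 is an arbitrary choice); in practice `BaggingClassifier` does the same job.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B = 50                                    # number of bootstrap samples / trees
votes = np.zeros((B, len(X_te)), dtype=int)

for b in range(B):
    # 1. bootstrap sample, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # 2. grow a deep (unpruned) tree on it and predict
    t = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    votes[b] = t.predict(X_te)

# 3. majority vote across the B trees (labels are 0/1 here)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (bagged_pred == y_te).mean())
```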

Bagging increases predictive accuracy by reducing the variance, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.

In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
In [ ]:
bagclm = BaggingClassifier(oob_score=True, n_estimators=100)
bagclm.fit(train_X, train_y)

Random Forest

Random Forests are a slight variation of bagged trees that usually perform even better:

  • Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
  • However, when building each tree, each time a split is considered, a random sample of m features is chosen as split candidates from the full set of p features. The split is only allowed to use one of those m features.
    • A new random sample of features is chosen for every single tree at every single split.
    • For classification, m is typically chosen to be the square root of p.
    • For regression, m is typically chosen to be somewhere between p/3 and p.

What's the point?

  • Suppose there is one very strong feature in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.
  • Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
  • By randomly leaving out candidate features from each split, Random Forests "decorrelates" the trees, such that the averaging process can reduce the variance of the resulting model.
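The effect of correlation can be made precise under a simplifying assumption: B trees with identically distributed predictions T_b, each with variance σ² and pairwise correlation ρ (symbols introduced here for illustration). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Growing more trees shrinks only the second term. The first term, ρσ², stays however large B is, which is why lowering the correlation between trees matters.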

Tuning n_estimators

One important tuning parameter is n_estimators, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

Tuning max_features

The other important tuning parameter is max_features, which is the number of features that should be considered at each split.
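One possible way to compare values of max_features is through the out-of-bag score. The sketch below assumes synthetic data with p = 16 features; the candidate values and n_estimators=200 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=16,
                           n_informative=6, random_state=0)

# Out-of-bag accuracy for a few candidate values of max_features
oob = {}
for m in [2, 4, 8, 16]:      # 4 is sqrt(p) for p = 16; 16 is plain bagging
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    oob[m] = rf.oob_score_

print(oob)
best_m = max(oob, key=oob.get)   # value with the highest OOB accuracy
print(best_m)
```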

Comparing Random Forests with decision trees

Advantages of Random Forests:

  • Performance is competitive with the best supervised learning methods
  • Provides a more reliable estimate of feature importance
  • Allows you to estimate out-of-sample error without using train/test split or cross-validation

Disadvantages of Random Forests:

  • Less interpretable
  • Slower to train
  • Slower to predict
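A short sketch of the two advantages that have direct scikit-learn counterparts: `feature_importances_` and `oob_score_`. The data is synthetic and the settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=10,
                           n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

# Out-of-sample accuracy estimate from the rows each tree did not see
print(rf.oob_score_)

# Impurity-based importances, one per feature, summing to 1
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]   # feature indices, most important first
print(ranking)
```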
In [ ]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
In [ ]:

# param_grid={'n_estimators':np.arange(10,100)}
In [ ]:
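Adding more trees does not make a Random Forest overfit: the error levels off as n_estimators grows. Individual trees that are too deep can still overfit noisy data, so max_depth and min_samples_leaf remain worth tuning.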

AdaBoost & GradientBoost

In [8]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
In [ ]:
pargrid_ada = {'n_estimators': [100, 200, 400, 600, 800],
               'learning_rate': [10 ** x for x in range(-3, 3)]}
In [ ]:
gscv_ada = GridSearchCV(estimator=AdaBoostClassifier(),
                        param_grid=pargrid_ada,
                        verbose=True, n_jobs=-1)
gscv_ada.fit(train_X, train_y)
In [ ]:
ad = AdaBoostClassifier(learning_rate=0.1, n_estimators=800)
ad.fit(train_X, train_y)

Extreme Gradient Boosting (XGBoost) Model

In [9]:
from xgboost import XGBClassifier     #conda install -c anaconda py-xgboost 
from sklearn.model_selection import GridSearchCV
In [ ]:
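pargrid_xgbm = {'n_estimators': [200, 250, 300, 400, 500],
                'learning_rate': [10 ** x for x in range(-3, 1)],  # shrinkage (eta in XGBoost)
                # XGBClassifier has no max_features; colsample_bytree (fraction of columns
                # per tree) plays that role. The fractions below are illustrative values.
                'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}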
In [ ]:
gscv_xgbm = GridSearchCV(estimator=XGBClassifier(),
                         param_grid=pargrid_xgbm,
                         verbose=True, n_jobs=-1)
gscv_xgbm.fit(train_X, train_y)
In [ ]:

xgbm = gscv_xgbm.best_estimator_

gscv_xgbm.best_score_
xgbm.fit(train_X, train_y)


Designed to handle categorical variables much better
In [ ]:


Is designed to run on very large data very fast
In [ ]:

Logistic Regression

In [ ]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(train_x, train_y)

# For multiclass problem
# (multi_class is deprecated in recent scikit-learn releases; drop the argument there)

logreg = LogisticRegression(solver='lbfgs', multi_class='auto')
logreg.fit(train_x, train_y)

Naive Bayes

  • Almost no hyperparameters to tune (GaussianNB takes only priors and var_smoothing)
  • It is termed ‘Naive’ because it assumes independence between every pair of features in the data, given the class
  • Best suited to data where the Xs are (close to) independent of each other
  • Mostly used for text data, where it generally works well even though words (which become variables) are not truly independent of each other
  • predict gives the class (0 or 1); predict_proba gives class probabilities, so the cut-off can be tweaked, but these probabilities tend to be poorly calibrated
In [ ]:
from sklearn.naive_bayes import GaussianNB
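A minimal usage sketch, assuming synthetic data in place of the real training set. `GaussianNB` exposes both `predict` and `predict_proba`, so a custom cut-off can be applied to the estimated probabilities. The 0.3 cut-off is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

nb_clf = GaussianNB()
nb_clf.fit(train_x, train_y)

pred_class = nb_clf.predict(test_x)              # hard class labels
pred_prob = nb_clf.predict_proba(test_x)[:, 1]   # estimated P(class 1)

# Apply a custom cut-off to the probabilities (0.3 is arbitrary)
pred_custom = (pred_prob >= 0.3).astype(int)
print(pred_class.sum(), pred_custom.sum())
```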


K-Nearest Neighbor

  • Hyperparameters - n_neighbors (K - no. of neighbors to be considered for the vote or average)
  • Can also set weights (hyperparameter) to
    • a) uniform (default) - All points in each neighborhood are weighted equally
    • b) distance - weight points by the inverse of their distance
  • KNeighborsClassifier - for classification
  • KNeighborsRegressor - for regression problems
  • By default it uses Euclidean distance
  • Often used for product recommendation
  • Can tweak cut-off
In [2]:
from sklearn.neighbors import KNeighborsClassifier
In [19]:
# GridSearch to find the best value of K

from sklearn.model_selection import GridSearchCV

# K should not be more than about sqrt(no. of obs); it should not be very small either


knn.best_score_ # mean cross-validated accuracy for the best K
In [ ]:
knn.best_params_ # to get the best value of K
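The grid itself is missing from these notes. A sketch of what such a search could look like, assuming synthetic data and a hypothetical grid of odd K values capped near sqrt(no. of training rows):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

# Candidate K values: odd numbers up to about sqrt(number of training rows)
max_k = int(np.sqrt(len(train_x)))
param_grid = {"n_neighbors": list(range(3, max_k + 1, 2)),
              "weights": ["uniform", "distance"]}

knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn.fit(train_x, train_y)

print(knn.best_params_)            # best K and weighting scheme
print(knn.best_score_)             # mean cross-validated accuracy
print(knn.score(test_x, test_y))   # accuracy on held-out data
```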
In [ ]:
# Building the model with best_params_ (say 9)


# Tweak cut-off if there's bias

Support Vector Machines - SVM

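  • Hyperparameters -

    • (1) C - Hyper-parameter for soft margin (smaller C means a wider, more tolerant margin)
    • (2) gamma - Kernel coefficient for the RBF kernel (larger gamma means each point has a more local influence)
  • Does not give probabilities by default, so the cut-off cannot be tweaked directly; use decision_function scores or set probability=True

  • Does not give variable importance like RandomForest
  • Is a binary classifier at heart; SVC handles multiclass problems with a one-vs-one scheme
  • SVC is for classification only (SVR is the separate algorithm for regression problems)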
In [1]:
from sklearn.svm import SVC
In [ ]:
from sklearn.model_selection import GridSearchCV
svc = SVC()
param_grid = {'C': list(range(1,200,2)),                      # Hyper-parameter for soft margin
              'gamma': [10**x for x in range(-4,2)]}          # RBF kernel coefficient
grid = GridSearchCV(svc, param_grid)
%time grid.fit(train_x, train_y)
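A self-contained sketch of the same search on synthetic data, with a much smaller (assumed) grid so that it runs quickly. It also shows `decision_function`, whose signed scores can be thresholded in place of probabilities. The threshold of -0.25 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

# Reduced grid for illustration only
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [10 ** x for x in range(-4, 2)]}

grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
grid.fit(train_x, train_y)
print(grid.best_params_)

# Signed distance to the decision boundary; predict() uses 0 as the cut-off
scores = grid.decision_function(test_x)
pred_custom = (scores >= -0.25).astype(int)   # shifted cut-off (arbitrary value)
```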

Model Evaluation

1) Accuracy

In [ ]:
from sklearn import metrics

test_accuracy=metrics.accuracy_score(test_y,model.predict(test_x))    #accuracy on testing data

train_accuracy=metrics.accuracy_score(train_y,model.predict(train_x))  #accuracy on training data

2) Classification Report

In [ ]:
from sklearn.metrics import classification_report

3) Confusion Matrix

In [ ]:
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

cm=metrics.confusion_matrix(test_y, nb_clf.predict(test_x))

sn.heatmap(cm, annot=True, fmt='d')   # rows are true labels, columns are predicted labels
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# from sklearn.metrics import confusion_matrix
# mat = confusion_matrix(ytest, yfit)
# sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
#             xticklabels=faces.target_names,
#             yticklabels=faces.target_names)
# plt.xlabel('true label')
# plt.ylabel('predicted label');

4) AUC score

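- As written, for binary classification (pass the predicted probability of the positive class, not the predicted labels)
In [ ]:
auc_score = metrics.roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])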
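AUC measures how well the model ranks positives above negatives, so it is normally computed from scores or probabilities. The sketch below (synthetic data, logistic regression as an assumed stand-in for `model`) computes it both ways for comparison. Hard labels collapse the ROC curve to a single cut-off.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# AUC from the predicted probability of the positive class
auc_from_prob = roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])

# AUC from hard 0/1 labels: only one operating point is used
auc_from_labels = roc_auc_score(test_y, model.predict(test_x))

print(auc_from_prob, auc_from_labels)
```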

Tweaking cut-off

In [ ]:

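import numpy as np
import pandas as pd

# making a DataFrame with actual and prob columns
# (actual and test_predicted_prob are Series that share the same index)
df_test_predict = pd.concat([actual, test_predicted_prob], axis=1)
df_test_predict.columns = ['actual','prob']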
In [ ]:
test_roc_like_df = pd.DataFrame()
test_temp = df_test_predict.copy()

for cut_off in np.linspace(0,1,50):
    test_temp['predicted'] = test_temp['prob'].apply(lambda x: 0 if x < cut_off else 1)
    test_temp['tp'] = test_temp.apply(lambda x: 1 if x['actual']==1 and x['predicted']==1 else 0, axis=1)
    test_temp['fp'] = test_temp.apply(lambda x: 1 if x['actual']==0 and x['predicted']==1 else 0, axis=1)
    test_temp['tn'] = test_temp.apply(lambda x: 1 if x['actual']==0 and x['predicted']==0 else 0, axis=1)
    test_temp['fn'] = test_temp.apply(lambda x: 1 if x['actual']==1 and x['predicted']==0 else 0, axis=1)
    sensitivity = test_temp['tp'].sum() / (test_temp['tp'].sum() + test_temp['fn'].sum())
    specificity = test_temp['tn'].sum() / (test_temp['tn'].sum() + test_temp['fp'].sum())
    accuracy=(test_temp['tp'].sum()+test_temp['tn'].sum()) / (test_temp['tp'].sum() + test_temp['fn'].sum()+test_temp['tn'].sum() + test_temp['fp'].sum())
    test_roc_like_table = pd.DataFrame([cut_off, sensitivity, specificity,accuracy]).T
    test_roc_like_table.columns = ['cutoff', 'sensitivity', 'specificity','accuracy']
    test_roc_like_df = pd.concat([test_roc_like_df, test_roc_like_table], axis=0)
In [ ]:
import matplotlib.pyplot as plt
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['sensitivity'], marker='*', label='Sensitivity')
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['specificity'], marker='*', label='Specificity')
#plt.scatter(test_roc_like_df['cutoff'], 1-test_roc_like_df['specificity'], marker='*', label='FPR')
plt.xlabel('cutoff')
plt.legend()
plt.title('Sensitivity and specificity at each cutoff')
In [ ]:
## Finding ideal cut-off for checking if this remains same in OOS validation
test_roc_like_df['total'] = test_roc_like_df['sensitivity'] + test_roc_like_df['specificity']
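Maximising sensitivity + specificity is the same as maximising TPR - FPR (Youden's J). As an alternative to the manual loop, `sklearn.metrics.roc_curve` returns the candidate cut-offs directly. The labels and probabilities below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (made up for illustration)
actual = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.30, 0.90])

fpr, tpr, thresholds = roc_curve(actual, prob)

# sensitivity + specificity = tpr + (1 - fpr)
total = tpr + (1 - fpr)
best = np.argmax(total)
best_cutoff = thresholds[best]   # predict 1 when prob >= best_cutoff

print(best_cutoff, tpr[best], 1 - fpr[best])
```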
In [ ]:
df_test_predict['predicted'] = df_test_predict['prob'].apply(lambda x: 1 if x > 0.408163 else 0)

import seaborn as sns
sns.heatmap(pd.crosstab(df_test_predict['actual'], df_test_predict['predicted']), annot=True, fmt='.0f')
In [ ]:
accuracy=metrics.accuracy_score(df_test_predict.actual, df_test_predict.predicted)
print('Accuracy: ',round(accuracy,2))

Classification Report

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(df_test_predict.actual, df_test_predict.predicted))