Decision Tree

  • Hyperparameters
      1) Depth (max_depth) - a max depth of 20 allows up to 2^20 leaves
      2) Min no. of obs in a node (min_samples_split)
      3) Min no. of obs in a leaf (min_samples_leaf)
      4) Gini/ Entropy (criterion)
      5) class_weight - a dictionary mapping each class label to its weight, e.g. {0: 1, 1: 5}. Use only if there's class imbalance in the data (see the sketch below)
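A minimal sketch of setting these hyperparameters (the values below are illustrative, assuming class 1 is the rarer class):
In [ ]:
from sklearn.tree import DecisionTreeClassifier

# illustrative values only: limit depth, require at least 100 obs to split a node,
# at least 50 obs per leaf, and up-weight the rarer class 1 five-fold
clf = DecisionTreeClassifier(max_depth=5,
                             min_samples_split=100,
                             min_samples_leaf=50,
                             criterion='gini',
                             class_weight={0: 1, 1: 5})
# clf.fit(train_X, train_y)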
In [1]:
import sklearn.tree as dt
In [10]:
dir(dt)
Out[10]:
['DecisionTreeClassifier',
 'DecisionTreeRegressor',
 'ExtraTreeClassifier',
 'ExtraTreeRegressor',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_criterion',
 '_reingold_tilford',
 '_splitter',
 '_tree',
 '_utils',
 'export',
 'export_graphviz',
 'export_text',
 'plot_tree',
 'tree']
In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
In [9]:
DecisionTreeClassifier?
# to know the hyperparameters
In [ ]:
# Hyperparameter tuning
import numpy as np

param_grid={'max_depth':np.arange(3,20),
           'max_features':np.arange(3,10),
           'criterion':['gini','entropy'],
           'min_samples_split':np.arange(2,10)}
In [ ]:
tree = GridSearchCV(DecisionTreeClassifier(min_samples_leaf=100), param_grid, cv = 10)
tree.fit( train_X, train_y )
In [ ]:
tree.best_params_   # best hyperparameter combination found by the grid search
In [ ]:
clf_tree = DecisionTreeClassifier( max_depth = 5,max_features=9 )
clf_tree.fit( train_X, train_y )

Bagging Algorithms

In [2]:
import sklearn.ensemble as en
dir(en)
Out[2]:
['AdaBoostClassifier',
 'AdaBoostRegressor',
 'BaggingClassifier',
 'BaggingRegressor',
 'BaseEnsemble',
 'ExtraTreesClassifier',
 'ExtraTreesRegressor',
 'GradientBoostingClassifier',
 'GradientBoostingRegressor',
 'IsolationForest',
 'RandomForestClassifier',
 'RandomForestRegressor',
 'RandomTreesEmbedding',
 'VotingClassifier',
 'VotingRegressor',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_gb_losses',
 '_gradient_boosting',
 'bagging',
 'base',
 'forest',
 'gradient_boosting',
 'iforest',
 'partial_dependence',
 'voting',
 'weight_boosting']

Bagging

The primary weakness of decision trees is that they don't tend to have the best predictive accuracy. This is partially due to high variance, meaning that different splits in the training data can lead to very different trees.

Bagging is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for bootstrap aggregation, meaning the aggregation of bootstrap samples.

What is a bootstrap sample? A random sample with replacement:
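A quick illustration with NumPy (toy data; values are illustrative):
In [ ]:
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# draw a sample of the same size as the original, with replacement;
# some values repeat and some are left out
boot = np.random.choice(data, size=len(data), replace=True)
boot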


How does bagging work (for decision trees)?

  1. Grow B trees using B bootstrap samples from the training data.
  2. Train each tree on its bootstrap sample and make predictions.
  3. Combine the predictions:
    • Average the predictions for regression trees
    • Take a vote for classification trees

Notes:

  • Each bootstrap sample should be the same size as the original training set.
  • B should be a large enough value that the error seems to have "stabilized".
  • The trees are grown deep so that they have low bias/high variance.

Bagging increases predictive accuracy by reducing the variance, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.
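A from-scratch sketch of the procedure above, assuming NumPy arrays and a binary 0/1 target (illustrative only; BaggingClassifier below does this for you):
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, B=50):
    n = len(X_train)
    rng = np.random.RandomState(42)
    votes = []
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap sample, same size as the training set
        tree = DecisionTreeClassifier()             # grown deep by default: low bias / high variance
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # majority vote across the B trees for each test row
    return (np.array(votes).mean(axis=0) >= 0.5).astype(int)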

In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
In [ ]:
param_grid={'n_estimators':np.arange(3,100)}
In [ ]:
tree=GridSearchCV(BaggingClassifier(oob_score=False,warm_start=True),param_grid,cv=5,n_jobs=-1)
tree.fit(train_x,train_y.values.ravel()) 
In [ ]:
bagclm = BaggingClassifier(oob_score=True, n_estimators=100)
bagclm.fit(train_X, train_y)

Random Forest

Random Forests is a slight variation of bagged trees that has even better performance:

  • Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
  • However, when building each tree, each time a split is considered, a random sample of m features is chosen as split candidates from the full set of p features. The split is only allowed to use one of those m features.
    • A new random sample of features is chosen for every single tree at every single split.
    • For classification, m is typically chosen to be the square root of p.
    • For regression, m is typically chosen to be somewhere between p/3 and p.

What's the point?

  • Suppose there is one very strong feature in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.
  • Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
  • By randomly leaving out candidate features from each split, Random Forests "decorrelates" the trees, such that the averaging process can reduce the variance of the resulting model.

Tuning n_estimators

One important tuning parameter is n_estimators, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

Tuning max_features

The other important tuning parameter is max_features, which is the number of features that should be considered at each split.
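A hedged sketch of tuning max_features together with n_estimators (candidate values are illustrative):
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid: the two key Random Forest tuning parameters together
param_grid = {'n_estimators': [100, 200, 300],
              'max_features': ['sqrt', 0.3, 0.5]}   # sqrt(p), or a fraction of the p features

rf_gscv = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
# rf_gscv.fit(train_x, train_y.values.ravel())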

Comparing Random Forests with decision trees

Advantages of Random Forests:

  • Performance is competitive with the best supervised learning methods
  • Provides a more reliable estimate of feature importance
  • Allows you to estimate out-of-sample error without using train/test split or cross-validation

Disadvantages of Random Forests:

  • Less interpretable
  • Slower to train
  • Slower to predict
In [ ]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
In [ ]:
param_grid={'n_estimators':[100,200,300,400,500,600]}

# param_grid={'n_estimators':np.arange(10,100)}
tree=GridSearchCV(RandomForestClassifier(oob_score=False,warm_start=True),param_grid,cv=5,n_jobs=-1)
tree.fit(train_x,train_y.values.ravel())  
In [ ]:
tree.best_params_
In [ ]:
radm_clf=RandomForestClassifier(oob_score=True,n_estimators=100,n_jobs=-1,random_state=42)
radm_clf.fit(train_x,train_y.values.ravel())
Random Forest is much less prone to overfitting than a single decision tree: adding more trees does not cause overfitting, although the model can still overfit noisy data.
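Since oob_score=True was set above, the out-of-bag accuracy (the built-in out-of-sample estimate mentioned earlier) can be read off after fitting:
In [ ]:
radm_clf.oob_score_   # OOB accuracy, estimated without a separate test set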

AdaBoost & GradientBoost

In [8]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
In [ ]:
pargrid_ada = {'n_estimators': [100, 200, 400, 600, 800],
               'learning_rate': [10 ** x for x in range(-3, 3)]}
In [ ]:
gscv_ada = GridSearchCV(estimator=AdaBoostClassifier(), 
                        param_grid=pargrid_ada, 
                        cv=5,
                        verbose=True, n_jobs=-1)
gscv_ada.fit(train_X, train_y)
In [ ]:
gscv_ada.best_params_
In [ ]:
ad = AdaBoostClassifier(learning_rate=0.1, n_estimators=800)   # hyperparameters go to the constructor, not fit()
ad.fit(train_X, train_y)
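GradientBoostingClassifier was imported above but not used; a minimal sketch along the same lines (grid values are illustrative):
In [ ]:
pargrid_gbm = {'n_estimators': [100, 200, 400],
               'learning_rate': [0.01, 0.1, 1.0],
               'max_depth': [2, 3, 4]}

gscv_gbm = GridSearchCV(estimator=GradientBoostingClassifier(),
                        param_grid=pargrid_gbm,
                        cv=5,
                        verbose=True, n_jobs=-1)
# gscv_gbm.fit(train_X, train_y)
# gscv_gbm.best_params_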

Xtreme Gradient Boosting Model

In [9]:
from xgboost import XGBClassifier     #conda install -c anaconda py-xgboost 
from sklearn.model_selection import GridSearchCV
In [ ]:
pargrid_xgbm = {'n_estimators': [200, 250, 300, 400, 500],
               'learning_rate': [10 ** x for x in range(-3, 1)],  # shrinkage (eta in XGBoost)
                'max_depth': [5,6,7,8,9,10,11,12,13,14]}          # XGBoost has no max_features; tune tree depth instead
In [ ]:
gscv_xgbm = GridSearchCV(estimator=XGBClassifier(), 
                        param_grid=pargrid_xgbm, 
                        cv=5,
                        verbose=True, n_jobs=-1)

gscv_xgbm.fit(train_X, train_y)
In [ ]:
gscv_xgbm.best_params_

xgbm = gscv_xgbm.best_estimator_

gscv_xgbm.best_score_

xgbm.fit(train_X, train_y)

CatBoost

Designed to handle categorical variables much better
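A minimal hedged sketch, assuming the catboost package is installed and that cat_cols lists the positions of the categorical columns (both names and values are illustrative):
In [ ]:
from catboost import CatBoostClassifier      # conda install -c conda-forge catboost

cat_cols = [0, 3]    # illustrative: indices of the categorical columns in train_x
cb_clf = CatBoostClassifier(iterations=200, learning_rate=0.1, verbose=False)
# cb_clf.fit(train_x, train_y, cat_features=cat_cols)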

LightGBM

Designed to train very fast on very large datasets
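A minimal hedged sketch, assuming the lightgbm package is installed (parameter values are illustrative):
In [ ]:
from lightgbm import LGBMClassifier          # conda install -c conda-forge lightgbm

lgbm_clf = LGBMClassifier(n_estimators=200, learning_rate=0.1, num_leaves=31)
# lgbm_clf.fit(train_x, train_y)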

Logistic Regression

In [ ]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit( train_x, train_y )


# For multiclass problem

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='lbfgs',multi_class='auto')
logreg.fit( train_x, train_y )

Naive Bayes

  • #### No hyperparameters
  • Only use when you are sure that all Xs are independent of each other
  • Mostly used on text data (see the sketch after this list)
  • It is termed ‘Naive’ because it assumes independence between every pair of features in the data
  • Output is the class (0 or 1); it does not give probabilities, so there is no way of tweaking the cut-off
  • Generally works well on text data, where the words (which become the variables) have relationships too complex to model explicitly, so the naive independence assumption is a practical simplification
  • ##### CANNOT TWEAK CUT-OFF
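A minimal sketch of the text-data use case with a toy corpus (illustrative only; the notebook's own data uses GaussianNB below):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['free offer click now', 'meeting at noon today', 'free money now', 'project status update']
labels = [1, 0, 1, 0]                        # illustrative: 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # each word becomes a count feature (a variable)
text_nb = MultinomialNB().fit(X, labels)
text_nb.predict(vec.transform(['free money offer']))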
In [ ]:
from sklearn.naive_bayes import GaussianNB

nb_clf=GaussianNB()

nb_clf.fit(train_x,train_y)

K-Nearest Neighbor

  • #### Hyperparameters - n_neighbors (K - no. of neighbors considered for the vote/average)
  • Can also set weights (hyperparameter; see the grid sketch after this list) to
    • a) uniform (default) - All points in each neighborhood are weighted equally
    • b) distance - weight points by the inverse of their distance
  • KNeighborsClassifier - for classification
  • KNeighborsRegressor - for regression problems
  • By default it uses euclidean distance
  • Mostly used for product recommendation
  • ##### CAN TWEAK CUT - OFF
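A hedged sketch of tuning weights alongside n_neighbors (grid values are illustrative):
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'n_neighbors': list(range(5, 50, 2)),
                     'weights': ['uniform', 'distance']}]

knn_gscv = GridSearchCV(KNeighborsClassifier(), tuned_parameters, cv=5, scoring='accuracy')
# knn_gscv.fit(train_x, train_y)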
In [2]:
from sklearn.neighbors import KNeighborsClassifier
In [19]:
# GridSearch to find the best value of K

from sklearn.model_selection import GridSearchCV

#tuned_parameters=[{'n_neighbors':[3,5,7,9,11]}] 
#these values should not exceed sqrt(no. of obs); also, they should not be too small
tuned_parameters=[{'n_neighbors':list(range(5,50,2))}]

knn_clf=GridSearchCV(KNeighborsClassifier(),
                    tuned_parameters,
                    cv=5,
                    scoring='accuracy')

knn_clf.fit(train_x,train_y)

knn_clf.best_score_ # mean cross-validated accuracy on the training data
In [ ]:
knn_clf.best_params_ # to get the best value of K
In [ ]:
# Building the model with best_params_ (say 9)

knn_clf=KNeighborsClassifier(n_neighbors=9)
knn_clf.fit(train_x,train_y)

# Tweak cut-off if there's bias

Support Vector Machines - SVM

  • #### Hyperparameters -

    • (1) C - Hyper-parameter for soft margin
    • (2) gamma - Hyper-parameter for dimensionality projection
  • Cannot tweak cut-off as it doesn't give probability

  • Does not give variable importance like RandomForest
  • Works best on multiclass classification problems
  • Classification only (there is a separate algorithm, SVR, for regression problems)
In [1]:
from sklearn.svm import SVC
In [3]:
svc=SVC(kernel='rbf',class_weight='balanced')
In [ ]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': list(range(1,200,2)),                      # Hyper-parameter for soft margin
              'gamma': [10**x for x in range(-4,2)]}          # Hyper-parameter for dimensionality projection
grid = GridSearchCV(svc, param_grid)
          
%time grid.fit(train_X, train_y)
In [ ]:
print(grid.best_params_)
In [ ]:
model=grid.best_estimator_
In [ ]:
#Accuracy
from sklearn import metrics
metrics.accuracy_score(test_y,model.predict(test_x))

Model Evaluation

1) Accuracy

In [ ]:
from sklearn import metrics

test_accuracy=metrics.accuracy_score(test_y,model.predict(test_x))    #accuracy on testing data

train_accuracy=metrics.accuracy_score(train_y,model.predict(train_x))  #accuracy on training data

2) Classification Report

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(test_y,model.predict(test_x)))

3) Confusion Matrix

In [ ]:
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

cm=metrics.confusion_matrix(test_y, nb_clf.predict(test_x))

sn.heatmap(cm,annot=True,fmt='.2f',xticklabels=['no','yes'],yticklabels=['no','yes'])

plt.ylabel('True Label')
plt.xlabel('Predicted Label')


# from sklearn.metrics import confusion_matrix
# mat = confusion_matrix(ytest, yfit)
# sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
#             xticklabels=faces.target_names,
#             yticklabels=faces.target_names)
# plt.xlabel('true label')
# plt.ylabel('predicted label');

4) AUC score

- Only for binary classification
In [ ]:
auc_score = metrics.roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])  # AUC needs scores/probabilities; hard labels from predict() understate it

Tweaking cut-off

In [ ]:
import pandas as pd
import numpy as np

test_predicted_prob=pd.DataFrame(logreg.predict_proba(test_x))[[1]]
test_predicted_prob.columns=['prob']
actual=test_y.reset_index()
actual.drop('index',axis=1,inplace=True)

# making a DataFrame with actual and prob columns
df_test_predict = pd.concat([actual, test_predicted_prob], axis=1)
df_test_predict.columns = ['actual','prob']
df_test_predict.head()
In [ ]:
test_roc_like_df = pd.DataFrame()
test_temp = df_test_predict.copy()

for cut_off in np.linspace(0,1,50):
    test_temp['predicted'] = test_temp['prob'].apply(lambda x: 0 if x < cut_off else 1)
    test_temp['tp'] = test_temp.apply(lambda x: 1 if x['actual']==1 and x['predicted']==1 else 0, axis=1)
    test_temp['fp'] = test_temp.apply(lambda x: 1 if x['actual']==0 and x['predicted']==1 else 0, axis=1)
    test_temp['tn'] = test_temp.apply(lambda x: 1 if x['actual']==0 and x['predicted']==0 else 0, axis=1)
    test_temp['fn'] = test_temp.apply(lambda x: 1 if x['actual']==1 and x['predicted']==0 else 0, axis=1)
    sensitivity = test_temp['tp'].sum() / (test_temp['tp'].sum() + test_temp['fn'].sum())
    specificity = test_temp['tn'].sum() / (test_temp['tn'].sum() + test_temp['fp'].sum())
    
    accuracy=(test_temp['tp'].sum()+test_temp['tn'].sum()) / (test_temp['tp'].sum() + test_temp['fn'].sum()+test_temp['tn'].sum() + test_temp['fp'].sum())
    
    test_roc_like_table = pd.DataFrame([cut_off, sensitivity, specificity,accuracy]).T
    test_roc_like_table.columns = ['cutoff', 'sensitivity', 'specificity','accuracy']
    test_roc_like_df = pd.concat([test_roc_like_df, test_roc_like_table], axis=0)
In [ ]:
test_roc_like_df.head()
In [ ]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(10,4))
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['sensitivity'], marker='*', label='Sensitivity')
plt.scatter(test_roc_like_df['cutoff'], test_roc_like_df['specificity'], marker='*', label='Specificity')
#plt.scatter(test_roc_like_df['cutoff'], 1-test_roc_like_df['specificity'], marker='*', label='FPR')
plt.title('Sensitivity and specificity for each cutoff (uncomment the FPR line for a ROC-style view)')
plt.legend()
In [ ]:
## Finding the ideal cut-off (maximising sensitivity + specificity); check whether it stays the same in out-of-sample validation
test_roc_like_df['total'] = test_roc_like_df['sensitivity'] + test_roc_like_df['specificity']
test_roc_like_df[test_roc_like_df['total']==test_roc_like_df['total'].max()]
In [ ]:
df_test_predict['predicted'] = df_test_predict['prob'].apply(lambda x: 1 if x > 0.408163 else 0)

import seaborn as sns
sns.heatmap(pd.crosstab(df_test_predict['actual'], df_test_predict['predicted']), annot=True, fmt='.0f')
In [ ]:
accuracy=metrics.accuracy_score(df_test_predict.actual, df_test_predict.predicted)
print('Accuracy: ',round(accuracy,2))

Classification Report

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(df_test_predict.actual, df_test_predict.predicted))