
Random Forests using Python for the new year (part 1)

This is a hands-on example of Random Forests in Python, to practice the most common tasks involved in using this popular ML algorithm. In principle I will use scikit-learn, but let’s see how it goes and which other libraries I have to throw into the mix.

Random Forests

This is my checklist for explaining random forests (and sometimes advocating their use) to the non-initiated, trying to convey the intuition behind the algorithm while still using some of the jargon.

  • It is an ensemble learning algorithm

    • It combines multiple models (the ensemble) to come up with a prediction
    • The combination uses a bagging strategy (bootstrap aggregating: many samples drawn with replacement); see the toy sketch after this list
    • It aims to combine multiple low-bias, high-variance models, so as to optimize the bias-variance trade-off, achieving high accuracy while avoiding overfitting
    • In the case of random forests, the combined models are decision trees
    • For regression problems, it combines by averaging the trees’ predictions
    • For classification problems, it combines by taking the most frequent prediction (the trees vote and the majority wins)
  • Compared to other algorithms, it has relatively few hyperparameters to tune (so no huge grid search)

    • The number of trees
    • And the number of features considered at each split (plus, optionally, tree-size controls such as the minimum samples per leaf)
  • Perfectly suitable for parallel processing, because each decision tree is trained independently on its own bootstrap sample

  • Because of the two points above, fitting it can be fast in comparison to other algorithms

  • The algorithm itself makes it easy to select features. Decision trees create, guess what, a tree structure, and the most important features end up towards the top of the tree. This offers a practical and intuitive way to select the most important ones: just prune the tree. The algorithm ranks the features according to their contribution to reducing the impurity measure used to decide each split; that is, at the top of the tree you find the features that best segment the data.

    A quick summary of decision trees: each (non-leaf) node in the tree defines a condition on a single feature that segments the data, trying to get similar values of the target variable into the same partition. The condition is the result of an optimization process on the so-called impurity measure: an information metric that indicates how good or bad the data segmentation is (typically Gini impurity, information gain or entropy-based measures for classification problems, and variance reduction for regression problems).

  • Thanks to these advantages it became very popular, winning several Kaggle competitions
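
To make the bagging idea concrete, here is a minimal toy sketch of bagging by hand on a synthetic dataset (just an illustration with assumed toy data, not how you would use the algorithm in practice): several decision trees are fit on bootstrap samples and their majority vote is the ensemble prediction.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, just for illustration
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement
    idx = rng.randint(0, len(X_tr), len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: each tree votes and the majority wins (classification)
votes = np.array([tree.predict(X_te) for tree in trees])
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (y_hat == y_te).mean())

A real random forest additionally samples a random subset of features at each split, which further de-correlates the trees.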

Data and task

Here I use vital statistics data from Colombia, in particular the deaths database for 2016 (one row per person who died during the year).

Let’s try a classification problem using the data: violent deaths. The idea is to see how well the features in this dataset can predict whether a death was violent or not. The prediction itself may not be very useful, but it would be interesting to see which variables are most relevant. Indeed, Random Forests is a technique frequently used for feature selection, because the algorithm naturally ranks the predictors.

import pandas as pd
deaths_2016 = pd.read_csv(
    "../../Python/deaths_eevv/data_src/Nofetal_2016.txt",
    sep="\t", encoding="WINDOWS-1252"
)

deaths_2016["MUNI"] = deaths_2016["COD_DPTO"]*1000 + deaths_2016["COD_MUNIC"]
deaths_2016 = deaths_2016.loc[deaths_2016["PMAN_MUER"] != 3]
deaths_2016 = deaths_2016[[
    "PMAN_MUER", "MES", "HORA", "MINUTOS", "SEXO", "EST_CIVIL", "GRU_ED2",
    "NIVEL_EDU", "IDPERTET", "SEG_SOCIAL", "MUNI", "A_DEFUN"
]]
deaths_2016 = deaths_2016.dropna()

deaths_2016.head(10)
   PMAN_MUER  MES  HORA  MINUTOS  SEXO  EST_CIVIL  GRU_ED2  NIVEL_EDU  \
0          2    2     0      0.0     1          3        5          4
1          2    2    12      5.0     1          5        4         99
2          1    3     3     40.0     2          5        1         13
3          1    3     4     30.0     2          5        1         13
4          2    2     1      0.0     2          3        5          2
5          1    3    19      0.0     1          5        1         13
6          2    2    20     40.0     1          5        5         13
7          1    2    13     50.0     1          5        1         13
8          1    2     6     43.0     1          5        1         13
9          2    2     1      0.0     1          1        4          2

   IDPERTET  SEG_SOCIAL   MUNI  A_DEFUN
0         6           1  25386        1
1         6           2   5091        3
2         6           2  76001        1
3         6           2  20001        1
4         6           2  41132        1
5         6           2  54001        1
6         6           2   5154        3
7         6           2  41001        1
8         6           1  11001        1
9         6           2  41013        3

Naive model

Let’s first fit a naive model using all relevant variables. It is naive at least because:

  • There is no real preprocessing or feature engineering; it actually passes the categorical predictors as numeric variables.
  • It does nothing to deal with the imbalanced data.

Of course, you should never just throw the data into a model like this. I have already explored this data set (not presented in this post), so I know what to expect. But since this is just to leave (not so) mental notes while I explore random forests in Python, it is fine to simply do it like this.

# Separate the predictors and the target variable
X_all = deaths_2016.drop(columns="PMAN_MUER")
y_all = deaths_2016.loc[:, "PMAN_MUER"]

# Split the data in test and train samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0
)

# Configure the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rforestclf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

# Fit the model
from pytictoc import TicToc  # TicToc just to time it
with TicToc():
    rforestclf.fit(X_train, y_train)

# Predict
with TicToc(): # TicToc just to time it
    y_pred = rforestclf.predict(X_test)
Elapsed time is 9.684633 seconds.
Elapsed time is 0.643846 seconds.

Model evaluation

Now let’s explore the model evaluation metrics that sklearn provides.

from sklearn import metrics

Accuracy Score

Which is simply correctly classified / total obs

metrics.accuracy_score(y_test, y_pred)
0.942251846947834

That looks good, but of course it is inflated because this is a highly imbalanced dataset in terms of our dependent variable. So it is fairly easy to correctly hit the non-violent death category, but relatively tricky to correctly hit the violent deaths, which is what I am actually after.
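
As a quick sanity check of that definition, the same number can be computed by hand (reusing y_test and y_pred from above):

# Fraction of correctly classified observations; should match accuracy_score
(y_test.values == y_pred).mean()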

Balanced Accuracy Score

scikit-learn also provides a balanced accuracy score, which is simply the average of the recall obtained on each class (equivalently, accuracy computed with class-balanced sample weights). The plain accuracy score above implicitly uses equal sample weights (or equivalently sample weights = 1).

metrics.balanced_accuracy_score(y_test, y_pred)
0.827683444232459

Cool, much better (in terms of indicating that the model is not that good).
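
A quick check of that description: the balanced accuracy should equal the unweighted mean of the per-class recalls.

# Recall for each class, then the plain average; should match balanced_accuracy_score
recalls = metrics.recall_score(y_test, y_pred, average=None)
print(recalls, recalls.mean())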

Cohen Kappa Score

It is an inter-rater reliability measure (e.g. between two human annotators). It is interesting to take a look at because, for its intended use, “It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance”. And agreement by chance plays an important role in the case of imbalanced datasets.

metrics.cohen_kappa_score(y_test, y_pred)
0.7155314036982154

“Scores above .8 are generally considered good agreement” so again, it shows the model is not that good.
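
To see where the “agreement by chance” part comes in, here is a sketch computing κ directly from the confusion matrix: observed agreement versus the agreement expected by chance given the marginals.

import numpy as np

cm = metrics.confusion_matrix(y_test, y_pred).astype(float)
n = cm.sum()
p_obs = np.trace(cm) / n                                   # observed agreement (= accuracy)
p_chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # agreement expected by chance
(p_obs - p_chance) / (1 - p_chance)                        # should match cohen_kappa_score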

Confusion Matrix

Cool, it gives you the confusion matrix.

metrics.confusion_matrix(y_test, y_pred)
array([[56936,  1089],
       [ 2749,  5687]], dtype=int64)

The raw array is useful for further processing. However, to actually look at a confusion matrix, I usually prefer more info directly on the table (e.g. percentages, plus row and column totals; the row totals are what scikit-learn calls support).

So perhaps something like this?

pd.crosstab(y_test, y_pred, rownames=["Observed"],
            colnames=["Predicted"], margins=True)
Predicted      1     2    All
Observed
1          56936  1089  58025
2           2749  5687   8436
All        59685  6776  66461

Ok, a bit better. But percentages in addition to counts would be even better. And a visualization with colors and at least basic interactivity (e.g. tooltips) would be great. Is there something out there to do that with one line of code?
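
As a quick aside, the plain percentages part is one crosstab argument away (the row percentages here are the per-class recalls):

# Row percentages: each observed class sums to 1
pd.crosstab(y_test, y_pred, rownames=["Observed"], colnames=["Predicted"],
            normalize="index")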

Well, after googling for a couple of minutes I found several alternatives. Here’s an example using mlxtend:

from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plot_confusion_matrix(
    conf_mat=metrics.confusion_matrix(y_test, y_pred),
    show_absolute=True,
    show_normed=False
)
plt.show()
fig, ax = plot_confusion_matrix(
    conf_mat=metrics.confusion_matrix(y_test, y_pred),
    show_absolute=True,
    show_normed=True
)
plt.show()

[Figures: mlxtend confusion matrix plots, absolute counts and normalized values]

Much better. The first plot immediately reveals that there is an issue with imbalanced data and the need to look at a normalized confusion matrix. In the normalized matrix you can then quickly spot that the model does not perform well for the low-frequency class: out of the 8436 minority-class observations, the model correctly classifies only 5687 (67.4%), and a whopping 32.6% (2749 obs) are wrongly classified.

Classification report

It “builds a text report showing the main classification metrics”. The definitions of the metrics are here, but in short: “precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.”

  • Precision is the percentage correctly classified among all observations predicted in a class (the column percentages of the confusion matrix for the correctly classified cells, not shown above, though). “tp / (tp + fp)”
  • Recall is the percentage correctly classified within each class of observed values (the row percentages of the confusion matrix for the correctly classified cells, which are shown above in the normalized plot). “tp / (tp + fn)”
  • The fbeta-score is a weighted harmonic mean of precision and recall, where recall is weighted beta times as much as precision. The default beta is 1, which gives the f1-score. Which beta would you choose in this example?
print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           1       0.95      0.98      0.97     58025
           2       0.84      0.67      0.75      8436

    accuracy                           0.94     66461
   macro avg       0.90      0.83      0.86     66461
weighted avg       0.94      0.94      0.94     66461
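
To tie these numbers back to the confusion matrix, a quick sanity check treating class 2 (violent death) as the positive label:

# Unpack the 2x2 confusion matrix; rows = observed, columns = predicted
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # the class-2 precision in the report (~0.84)
recall = tp / (tp + fn)     # the class-2 recall in the report (~0.67)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)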

There are a bunch of functions to individually obtain the metrics in the report above. For example:

metrics.precision_score(y_test, y_pred)
metrics.precision_score(y_test, y_pred, pos_label=2)
metrics.recall_score(y_test, y_pred)
metrics.recall_score(y_test, y_pred, pos_label=1)
metrics.recall_score(y_test, y_pred, pos_label=2)
metrics.f1_score(y_test, y_pred)
metrics.f1_score(y_test, y_pred, pos_label=2)
metrics.fbeta_score(y_test, y_pred, beta=0.01)
metrics.fbeta_score(y_test, y_pred, beta=0.5)
metrics.fbeta_score(y_test, y_pred, beta=0.01, pos_label=2)
metrics.fbeta_score(y_test, y_pred, beta=0.5, pos_label=2)
metrics.fbeta_score(y_test, y_pred, beta=1, pos_label=2)
# Since I care here more about recall over precision, perhaps beta=2 makes sense
metrics.fbeta_score(y_test, y_pred, beta=2, pos_label=2)
metrics.average_precision_score(y_test, y_pred)
metrics.precision_recall_fscore_support(y_test, y_pred, beta=1)
(array([0.95394153, 0.83928571]),
 array([0.98123223, 0.67413466]),
 array([0.96739444, 0.74769918]),
 array([58025,  8436], dtype=int64))

Hamming loss

This is the Hamming distance (the number of observations that differ) between the observed and predicted labels, averaged (divided by the number of observations). It’s sort of the counterpart of the accuracy score: in this single-label case it is simply 1 - accuracy.

metrics.hamming_loss(y_test, y_pred)
0.057748153052165935
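
Which can be checked by hand; it is just the mismatch rate:

# Fraction of observations where prediction and observation differ
(y_test.values != y_pred).mean()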

Matthews correlation coefficient

A correlation coefficient that ranges between -1 and 1. (1 means perfect prediction, 0 random prediction and -1 inverse prediction). It is pretty relevant for this case because it “is generally regarded as a balanced measure which can be used even if the classes are of very different sizes."

metrics.matthews_corrcoef(y_test, y_pred)
0.7210096175078716
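
For reference, a sketch of its computation from the 2x2 confusion matrix counts (same tn/fp/fn/tp convention as above, with class 2 as positive):

import numpy as np

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel().astype(float)
# Matthews correlation coefficient from the counts; should match matthews_corrcoef
(tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))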

Jaccard similarity coefficient score

From the documentation … “The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true.”

metrics.jaccard_score(y_test, y_pred)
metrics.jaccard_score(y_test, y_pred, pos_label=2)
0.5970603674540682
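
For the binary case this boils down to tp / (tp + fp + fn) for the chosen positive class, which can be verified with the counts from the confusion matrix (class 2 as positive):

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
tp / (tp + fp + fn)  # should match jaccard_score(..., pos_label=2)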

ROC curve and AUC

The ROC curve and its AUC are well known in many fields and applications. I’d just note here that they are designed to assess the performance of classifiers that produce a probability or a score for every observation and, based on a threshold, decide the classification. Decision Trees and Random Forests are, in principle, discrete classifiers -i.e. they predict the class label and not necessarily a score- but they can be “converted” into score-classifiers (see for example Fawcett, 2006). scikit-learn indeed provides a function to predict probabilities instead of the class labels. predict_proba() predicts class probabilities “computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.”
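
A quick check of that description: the forest’s predicted probabilities should match the average of the individual trees’ predicted probabilities (not needed for the ROC itself, just to confirm the intuition):

import numpy as np

# Average the per-tree class probabilities and compare with the forest's predict_proba
tree_probs = np.mean([tree.predict_proba(X_test) for tree in rforestclf.estimators_], axis=0)
np.allclose(tree_probs, rforestclf.predict_proba(X_test))  # expected to be True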

# Need to predict probabilities instead of class labels directly for the ROC-AUC
y_prob = rforestclf.predict_proba(X_test)
y_prob = y_prob[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob, pos_label=2)
auc = metrics.roc_auc_score(y_test, y_prob)
plt.plot(fpr, tpr, marker=".",
         label="ROC curve (AUC = %0.2f)" % auc)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.legend(loc="lower right")
plt.xlabel("False Positive Rate / 1-specificity")
plt.ylabel("True Positive Rate / hit rate / recall / sensitivity")
plt.show()

[Figure: ROC curve with AUC]

UPDATE: I was not aware of scikitplot that provides convenient plotting functionality for scikit-learn metrics and estimators.

# Note that here you do not have to select just one class like above
y_prob = rforestclf.predict_proba(X_test)
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_test, y_prob, ax=ax) # you can customize it. get rid of microaverages
<matplotlib.axes._subplots.AxesSubplot at 0x26ed625a308>

[Figure: ROC curves from scikitplot]

Precision Recall curve

Precision Recall curve is an alternative to ROC, which could be more appropriate in cases of high class-imbalance -see for example Saito and Rehmsmeier (2015)-. scikit-learn does not let you down here and provides functions for it.

y_prob = rforestclf.predict_proba(X_test)
y_prob = y_prob[:, 1]
precision, recall, threshold = metrics.precision_recall_curve(
    y_test, y_prob, pos_label=2
)
auprc = metrics.auc(recall, precision)
avg_precision = metrics.average_precision_score(y_test, y_prob, pos_label=2)
plt.plot(recall, precision, marker=".",
         label="Precision-Recall curve (AUPRC = %0.2f)" % auprc)
plt.plot([0, 1], [avg_precision, avg_precision], linestyle="--",
         label="Average Precision Score = %0.2f" % avg_precision)
plt.legend(loc="lower left")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

[Figure: Precision-Recall curve]

UPDATE: again, scikitplot provides a convenient one-liner for this.

y_prob = rforestclf.predict_proba(X_test)
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_test, y_prob, ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x26ed53b1bc8>

[Figure: Precision-Recall curves from scikitplot]

KS statistic

UPDATED to use scikitplot

Another interesting statistic to look at is the KS statistic and its plot. It indicates how well the predicted probabilities separate the classes. The KS statistic corresponds to the maximum difference between the TPR and the FPR as you vary the classification threshold.
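
Since fpr, tpr and thresholds are already available from roc_curve above, the KS statistic can be sanity-checked in a couple of lines (a sketch reusing those arrays):

import numpy as np

ks = np.max(tpr - fpr)                           # maximum separation between TPR and FPR
ks_threshold = thresholds[np.argmax(tpr - fpr)]  # score at which it occurs
print(ks, ks_threshold)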

from scikitplot.metrics import plot_ks_statistic
plot_ks_statistic(y_test, y_prob)
<matplotlib.axes._subplots.AxesSubplot at 0x26ed5331048>

[Figure: KS statistic plot]

Interesting. The KS value itself is not that high, again showing that the model is not that good. More interesting, though, is the pretty low threshold at which the maximum difference occurs. So yeah, this is an indication that this might be one of those typical cases where using the default threshold is not a good idea (not that the KS-indicated threshold is necessarily optimal either, for many reasons; one relevant reason for this example is that KS assigns the same importance to both classes, and here I might be more interested in class 2).

But anyway, out of curiosity, let’s take a quick look at what happens if I use this threshold.

y_prob = rforestclf.predict_proba(X_test)

from scikitplot.helpers import binary_ks_curve

thresholds, pct1, pct2, ks_statistic, max_distance_at, classes = binary_ks_curve(
    y_test, np.array(y_prob)[:, 1].ravel())

threshold = max_distance_at
y_pred_custom = np.ndarray(shape=y_prob[:,1].shape)
y_pred_custom[y_prob[:,1] >= threshold] = 2
y_pred_custom[y_prob[:,1] < threshold] = 1

fig, ax = plot_confusion_matrix(
    conf_mat=metrics.confusion_matrix(y_test, y_pred_custom),
    show_absolute=True,
    show_normed=True
)

[Figure: confusion matrix using the KS-based threshold]

Cool, changing the threshold increases recall considerably, of course at the expense of precision. Later I’ll see how this compares with other strategies for dealing with this imbalanced data.

Calibration plot

Alright, when one thinks about the threshold, one most probably also thinks about calibration. Random Forests often produce reasonably calibrated scores, although averaging over trees tends to pull probabilities away from 0 and 1. But let’s take a look anyway.
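
Before the plot, a quick numeric look using scikit-learn’s calibration_curve (a sketch; I binarize the target with class 2 as positive so it lines up with the class-2 probabilities):

from sklearn.calibration import calibration_curve
import numpy as np

# Mean observed frequency vs. mean predicted probability, per bin
frac_pos, mean_pred = calibration_curve((np.asarray(y_test) == 2).astype(int),
                                        y_prob[:, 1], n_bins=10)
print(np.column_stack([mean_pred, frac_pos]))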

from scikitplot.metrics import plot_calibration_curve
plot_calibration_curve(y_test, [y_prob])
<matplotlib.axes._subplots.AxesSubplot at 0x26ed4cc5e48>

[Figure: calibration plot]

Cumulative gain plot

It shows how the fraction of true positives captured changes as you consider an increasing percentage of the sample, after ordering the observations by prediction score. So it is particularly useful when you want to prioritize cases based on the score/probability and want to know how much the model helps with that.
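
Conceptually it is just cumulative counting after sorting by score; a manual sketch for the class-2 curve (reusing the class-2 probabilities):

import numpy as np

score = y_prob[:, 1]                                       # probability of class 2
order = np.argsort(-score)                                 # sort observations by decreasing score
y_sorted = np.asarray(y_test)[order]
gains = np.cumsum(y_sorted == 2) / (y_sorted == 2).sum()   # fraction of positives captured
frac = np.arange(1, len(y_sorted) + 1) / len(y_sorted)     # fraction of the sample considered
# Plotting gains against frac gives the class-2 cumulative gains curve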

from scikitplot.metrics import plot_cumulative_gain
plot_cumulative_gain(y_test, y_prob)
<matplotlib.axes._subplots.AxesSubplot at 0x26ed5378608>

[Figure: cumulative gains plot]

Lift curve

Similar to the cumulative gain plot, the lift curve orders the observations by score, and the lift shows how much better the model does than a no-predictive-model (random ordering) baseline at each fraction of the sample.
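
Reusing gains and frac from the sketch above, the lift is simply their ratio (how many times more positives you capture than a random ordering would):

# Lift at each fraction of the sample; a model with no predictive power hovers around 1
lift = gains / frac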

from scikitplot.metrics import plot_lift_curve
plot_lift_curve(y_test, y_prob)
<matplotlib.axes._subplots.AxesSubplot at 0x26ed62c97c8>

[Figure: lift curve]

Wrapping-up model evaluation

Good, above you have code examples for the most common evaluation metrics, with a short description of each and their pros and cons. Just a reminder to myself: it is often not a good idea to rely on a single metric; rather, analyze several of them, always keeping in mind the issues of the case under study (like the class imbalance here).

Having seen model evaluation metrics, there are at least three issues to deal with:

  • Check feature importance in the model
  • Take a deeper look at the performance of the model, in particular considering the imbalanced data.
  • Do some feature pre-processing, in particular correctly handling categorical variables, because for most variables in this dataset it makes no sense at all to use them as numerical predictors. (Two expected challenges: it will take longer to fit the model, and let’s see how scikit-learn handles feature importance for categorical variables passed to the model, for example, as dummy variables -one-hot encoding in scikit-learn terminology-. I have used H2O’s implementation, for example, and it handles such cases nicely.)

So those are issues to deal with in other posts.

Session Info

from sinfo import sinfo
from IPython.display import display_markdown
sinfo_html = sinfo(html=True, write_req_file=False)
display_markdown(sinfo_html.data, raw=True)
Module versions
-----
eli5        0.10.1
joypy       0.2.2
matplotlib  3.2.1
mlxtend     0.17.2
numpy       1.18.1
pandas      1.0.3
pytictoc    1.4.0
scikitplot  0.3.7
seaborn     0.10.0
sinfo       0.3.1
sklearn     0.22.2.post1
-----
Dependency modules
attr                19.3.0
backcall            0.1.0
colorama            0.4.3
concurrent          NA
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.1
decorator           4.4.2
defusedxml          0.6.0
eli5                0.10.1
encodings           NA
entrypoints         0.3
genericpath         NA
graphviz            0.13.2
importlib_metadata  1.6.0
ipykernel           5.2.0
ipython_genutils    0.2.0
ipywidgets          7.5.1
jedi                0.15.2
jinja2              2.11.1
joblib              0.14.1
joypy               0.2.2
jsonschema          3.2.0
kiwisolver          1.2.0
markupsafe          1.1.1
matplotlib          3.2.1
mistune             0.8.4
mlxtend             0.17.2
mpl_toolkits        NA
nbconvert           5.6.1
nbformat            5.0.4
nt                  NA
ntpath              NA
nturl2path          NA
numpy               1.18.1
opcode              NA
pandas              1.0.3
pandocfilters       NA
parso               0.6.2
pickleshare         0.7.5
posixpath           NA
prompt_toolkit      3.0.5
pvectorc            NA
pweave              0.30.3
pydoc_data          NA
pyexpat             NA
pygments            2.6.1
pyparsing           2.4.7
pyrsistent          NA
pytz                2019.3
scikitplot          0.3.7
scipy               1.3.1
seaborn             0.10.0
sinfo               0.3.1
six                 1.14.0
sklearn             0.22.2.post1
sre_compile         NA
sre_constants       NA
sre_parse           NA
statsmodels         0.11.1
testpath            0.4.4
tornado             6.0.4
traitlets           4.3.3
wcwidth             NA
zipp                NA
zmq                 19.0.0
-----
IPython             7.13.0
jupyter_client      6.1.2
jupyter_core        4.6.3
notebook            6.0.3
-----
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 22:22:21) [MSC v.1916 64 bit (AMD64)]
Windows-10-10.0.18362-SP0
4 logical CPU cores, Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
-----
Session information updated at 2020-04-14 16:42