Evaluating model performance requires both high-level and granular metrics that indicate how a model will perform on new data. We split our balanced dataset into training and testing sets and evaluate our models using accuracy, precision, recall, and F1 score for classification. We also make use of confusion matrices and Receiver Operating Characteristic (ROC) curves. Finally, we cross-validate our models on the unbalanced dataset for more insight into how they perform when presented with imbalanced data.
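To make this procedure concrete, the following is a minimal sketch of a train/test evaluation using scikit-learn; the synthetic data, logistic-regression classifier, and variable names are illustrative placeholders, not the actual features or models used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for a balanced, labeled article dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
```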
When using these tools, we consider the specific successes and errors our models make. Successes include true positives (TP), where an article is correctly assigned to a class, and true negatives (TN), where an article is correctly identified as not belonging to a class. Errors include false positives (FP), where an article is incorrectly assigned to a class, and false negatives (FN), where an article that belongs to a class is missed.
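As a rough illustration of these four outcome types, the sketch below tallies them directly from binary label arrays; the function name and example labels are hypothetical.

```python
import numpy as np

def outcome_counts(y_true, y_pred, positive=1):
    """Count the four prediction outcomes for a binary classifier."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # correct hits
    tn = np.sum((y_pred != positive) & (y_true != positive))  # correct rejections
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false alarms
    fn = np.sum((y_pred != positive) & (y_true == positive))  # misses
    return tp, tn, fp, fn

print(outcome_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```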
Using these categories, we can then calculate and examine the following metrics:
Accuracy is defined as the fraction of all predictions, across all classes, that our model got right. Here, this is:
\[ \frac{TP + TN}{TP + TN + FP + FN} \]
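For example, with hypothetical counts of 40 TP, 45 TN, 5 FP, and 10 FN, accuracy would be:

\[ \frac{40 + 45}{40 + 45 + 5 + 10} = 0.85 \]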
Precision is defined as the fraction of the predictions our model made for a given class that were correct:
\[ \frac{TP}{TP + FP} \]
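With the same hypothetical counts:

\[ \frac{40}{40 + 5} \approx 0.889 \]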
Recall is defined as the fraction of the instances that truly belong to a class that our model correctly identified:
\[ \frac{TP}{TP + FN} \]
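With the same hypothetical counts:

\[ \frac{40}{40 + 10} = 0.80 \]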
The F1 score is another measure of a model's accuracy, calculated as the harmonic mean of its precision and recall. The highest possible F1 score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0:
\[ 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
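Combining the precision and recall obtained from the hypothetical counts above:

\[ 2 \cdot \frac{0.889 \cdot 0.80}{0.889 + 0.80} \approx 0.842 \]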
To better illustrate model performance, we make use of ROC curves and confusion matrices. An ROC curve plots a model's true positive rate against its false positive rate as the classification probability threshold varies. The area under the curve (AUC) associated with an ROC curve summarizes this into a single measure of model performance. The straight diagonal line corresponds to a random 50/50 guess and is used here as a baseline for performance.
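Continuing the evaluation sketch above (and reusing its `model`, `X_test`, and `y_test`), an ROC curve and its AUC could be produced along these lines:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive class, from the earlier sketch.
y_score = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance (AUC = 0.5)")  # 50/50 baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```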
A confusion matrix shows exactly how each model classified articles correctly and incorrectly by tallying predictions by type: TP, TN, FP, and FN.
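A confusion matrix for the same sketch can be computed and displayed as follows; the layout noted in the comment assumes binary labels {0, 1}.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Reusing y_test and y_pred from the evaluation sketch above.
cm = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predictions
print(cm)  # [[TN, FP], [FN, TP]] for binary labels {0, 1}

ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```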