Metrics
Abstract Base Class#
Metric#
The abstract base class for metrics.
Properties should be implemented as class attributes in derived metrics.
The compute_metric method needs to be implemented.
Attributes#
full_name abstractmethod instance-attribute property#
A human-readable name for this metric.
is_multiclass abstractmethod instance-attribute property#
Whether this metric computes a value for each class individually or a single value for all classes at once.
bounds abstractmethod instance-attribute property#
A tuple of the minimum and maximum possible value for this metric to take.
Can be infinite.
dependencies abstractmethod instance-attribute property#
All metrics upon which this metric depends.
Used to generate a computation schedule, such that no metric is calculated before its dependencies.
The dependencies must match the compute_metric signature.
This is checked during class definition.
sklearn_equivalent abstractmethod instance-attribute property#
The sklearn equivalent function, if applicable.
aliases abstractmethod instance-attribute property#
A list of all valid aliases for this metric. Can be used when creating metric syntax strings.
Functions#
compute_metric abstractmethod#
Computes the metric values from its dependencies.
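To make the interface concrete, here is a minimal, self-contained sketch of what a derived metric could look like, based only on the attributes and method described above. The base class, attribute types, and the dependency wiring are illustrative assumptions, not the library's actual code.

```python
# Illustrative sketch only: mirrors the interface described above, but the
# names, types, and dependency wiring are assumptions, not the real library code.
from __future__ import annotations

from abc import ABC, abstractmethod

import numpy as np


class Metric(ABC):
    # Properties are implemented as class attributes in derived metrics.
    full_name: str
    is_multiclass: bool
    bounds: tuple[float, float]
    dependencies: tuple[str, ...]
    sklearn_equivalent: str | None
    aliases: list[str]

    @abstractmethod
    def compute_metric(self, *dependencies: np.ndarray) -> np.ndarray:
        """Computes the metric values from its dependencies."""


class Accuracy(Metric):
    full_name = "Accuracy"
    is_multiclass = True                  # one value over all classes at once
    bounds = (0.0, 1.0)
    dependencies = ("diag_mass",)         # hypothetical dependency name
    sklearn_equivalent = "accuracy_score"
    aliases = ["accuracy", "acc"]

    def compute_metric(self, diag_mass: np.ndarray) -> np.ndarray:
        # Accuracy is the total probability mass on the confusion-matrix diagonal.
        return diag_mass.sum(axis=-1)
```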
AveragedMetric#
The abstract base class for the composition of any instance of Metric with any instance of Averaging.
Attributes#
aliases property#
A list of all valid aliases for this metric.
Constructed from the product of all aliases of the Metric and Averaging methods.
Can be used when creating metric syntax strings.
is_multiclass property#
Whether this metric computes a value for each class individually or a single value for all classes at once.
An AveragedMetric is always multiclass.
bounds property#
A tuple of the minimum and maximum possible value for this metric to take. Can be infinite.
dependencies property#
All metrics upon which this AveragedMetric depends.
Constructed from the union of all Metric and AveragingMethod dependencies.
Used to generate a computation schedule, such that no metric is calculated before its dependencies.
The dependencies must match the compute_metric signature.
This is checked during class definition.
sklearn_equivalent property#
The sklearn equivalent function, if applicable.
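For illustration, the alias product can be sketched as follows. The alias names and the metric@averaging syntax are taken from the examples in this document; the construction itself is an assumption about how the product is formed.

```python
# Sketch of the alias product described above; the alias lists are examples only.
from itertools import product

metric_aliases = ["tpr", "recall"]
averaging_aliases = ["macro"]

averaged_aliases = [f"{m}@{a}" for m, a in product(metric_aliases, averaging_aliases)]
print(averaged_aliases)  # ['tpr@macro', 'recall@macro']
```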
Metric Instances#
DiagMass#
Bases: Metric
Computes the mass on the diagonal of the normalized confusion matrix.
It is defined as the ratio of true positives to all entries:
where \(TP\) are the true positives, and \(N\) is the total number of predictions.
This is a metric primarily used as an intermediate value for other metrics, and says relatively little on its own.
Not to be confused with the True Positive Rate.
Prevalence#
Bases: Metric
Computes the marginal distribution of condition occurrence. Also known as the prevalence.
It can be defined as the ratio of condition positives to all predictions:
where \(P\) is the count of condition positives, and \(N\) is the total number of predictions.
This is a metric primarily used as an intermediate value for other metrics, and says relatively little on its own.
ModelBias#
Bases: Metric
Computes the marginal distribution of prediction occurrence. Also known as the model bias.
It can be defined as the ratio of predicted positives to all predictions:
where \(PP\) is the count of predicted positives, and \(N\) is the total number of predictions.
This is a metric primarily used as an intermediate value for other metrics, and says relatively little on its own.
TruePositiveRate#
Bases: Metric
Computes the True Positive Rate, also known as recall or sensitivity.
It is defined as the ratio of correctly predicted positives to all condition positives:
where \(TP\) are the true positives and \(FN\) are the false negatives.
Essentially, out of all condition positives, how many were correctly predicted. Can be seen as a metric measuring retrieval.
Examples:
tpr
recall@macro
Read more:
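As a minimal sketch (with made-up counts), the per-class true positive rate can be read off a confusion matrix whose rows are the condition (true) classes and whose columns are the predictions:

```python
import numpy as np

# Rows: condition (true) class, columns: predicted class. Counts are made up.
cm = np.array([[50, 10, 0],
               [5, 30, 5],
               [0, 10, 40]])

tpr = np.diag(cm) / cm.sum(axis=1)  # TP / (TP + FN), per class
print(tpr)                          # [0.833... 0.75 0.8]
```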
FalseNegativeRate#
Bases: Metric
Computes the False Negative Rate, also known as the miss-rate.
It is defined as the ratio of false negatives to condition positives:
where \(TP\) are the true positives, and \(FN\) are the false negatives.
Examples:
fnr
false_negative_rate@macro
Read more:
PositivePredictiveValue#
Bases: Metric
Computes the Positive Predictive Value, also known as precision.
It is defined as the ratio of true positives to predicted positives:
where \(TP\) is the count of true positives, and \(FP\) the count of falsely predicted positives.
It is the complement of the False Discovery Rate, \(PPV=1-FDR\).
Examples:
ppv
precision@macro
Read more:
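Using the same (made-up) confusion matrix layout as above, the per-class precision is the diagonal divided by the column sums:

```python
import numpy as np

# Rows: condition (true) class, columns: predicted class. Counts are made up.
cm = np.array([[50, 10, 0],
               [5, 30, 5],
               [0, 10, 40]])

ppv = np.diag(cm) / cm.sum(axis=0)  # TP / (TP + FP), per class
print(ppv)                          # [0.909... 0.6 0.888...]
```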
FalseDiscoveryRate#
Bases: Metric
Computes the False Discovery Rate.
It is defined as the ratio of falsely predicted positives to predicted positives:
where \(TP\) is the count of true positives, and \(FP\) the count of falsely predicted positives.
It is the complement of the Positive Predictive Value, \(FDR=1-PPV\).
Examples:
fdr
false_discovery_rate@macro
Read more:
FalsePositiveRate#
Bases: Metric
Computes the False Positive Rate, the probability of false alarm.
Also known as the fall-out.
It is defined as the ratio of falsely predicted positives to condition negatives:
where \(TN\) is the count of true negatives, and \(FP\) the count of falsely predicted positives.
It is the complement of the True Negative Rate, \(FPR=1-TNR\).
Examples:
fpr
fall-out@macro
Read more:
TrueNegativeRate#
Bases: Metric
Computes the True Negative Rate, also known as specificity or selectivity.
It is defined as the ratio of correctly predicted negatives to condition negatives:
where \(TN\) is the count of true negatives, and \(FP\) the count of falsely predicted positives.
It is the complement of the False Positive Rate, \(TNR=1-FPR\).
Examples:
tnr
selectivity@macro
Read more:
FalseOmissionRate#
Bases: Metric
Computes the False Omission Rate.
It is defined as the ratio of falsely predicted negatives to all predicted negatives:
where \(TN\) is the count of true negatives, and \(FN\) the count of falsely predicted negatives.
It is the complement of the Negative Predictive Value, \(FOR=1-NPV\).
Examples:
for
false_omission_rate@macro
Read more:
NegativePredictiveValue#
Bases: Metric
Computes the Negative Predictive Value.
It is defined as the ratio of true negatives to all predicted negatives:
where \(TN\) are the true negatives, and \(FN\) are the falsely predicted negatives.
It is the complement of the False Omission Rate, \(NPV=1-FOR\).
Examples:
npv
negative_predictive_value@macro
Read more:
Accuracy#
Bases: Metric
Computes the multiclass accuracy score.
It is defined as the rate of correct classifications to all classifications:
where \(TP\) are the true positives, \(TN\) the true negatives and \(N\) the total number of predictions.
Possible values lie in the range [0.0, 1.0], with larger values denoting better performance. The value of a random classifier is dependent on the label distribution, which makes accuracy especially susceptible to class imbalance. It is also not directly comparable across datasets.
Examples:
acc
accuracy@macro
Read more:
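A minimal sketch (with made-up labels): multiclass accuracy is the trace of the confusion matrix over its total, which matches scikit-learn's accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
acc = np.trace(cm) / cm.sum()  # correct predictions / all predictions

print(acc, accuracy_score(y_true, y_pred))  # both ~0.714
```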
BalancedAccuracy#
Bases: Metric
Computes the balanced accuracy score.
It is defined as the arithmetic average of the per-class true positive rate:
where \(TPR\) is the true positive rate (recall).
Possible values lie in the range [0.0, 1.0], with larger values denoting better performance. Unlike accuracy, balanced accuracy can be 'chance corrected', such that random performance yields a score of 0.0. This can be achieved by setting adjusted=True.
Examples:
ba
balanced_accuracy@macro
ba+adjusted=True
Read more:
Parameters:
- adjusted (bool, default: False) – whether the chance-corrected variant is computed. Defaults to False.
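A sketch (same made-up labels as above): balanced accuracy is the mean per-class recall, and the adjusted variant rescales it so that 0.0 corresponds to chance level; both match scikit-learn's balanced_accuracy_score.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
ba = (np.diag(cm) / cm.sum(axis=1)).mean()  # mean per-class recall

n_classes = cm.shape[0]
ba_adjusted = (ba - 1 / n_classes) / (1 - 1 / n_classes)  # 0.0 == chance level

print(ba, balanced_accuracy_score(y_true, y_pred))
print(ba_adjusted, balanced_accuracy_score(y_true, y_pred, adjusted=True))
```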
MatthewsCorrelationCoefficient#
Bases: Metric
Computes the multiclass Matthews Correlation Coefficient (MCC), also known as the phi coefficient.
It goes by a variety of names depending on the application scenario, and holistically combines many different classification metrics.
A perfect classifier scores 1.0, a random classifier 0.0. Values smaller than 0 indicate worse-than-random performance.
Its absolute value is proportional to the square root of the Chi-square test statistic.
Quoting Wikipedia:
Some scientists claim the Matthews correlation coefficient to be the most informative single score to establish the quality of a binary classifier prediction in a confusion matrix context.
Examples:
mcc
phi
Read more:
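A sketch using the scikit-learn equivalent (labels are made up):

```python
from sklearn.metrics import matthews_corrcoef

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# 1.0 == perfect classifier, 0.0 == random, < 0 == worse than random.
print(matthews_corrcoef(y_true, y_pred))
```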
CohensKappa#
Bases: Metric
Computes the multiclass Cohen's Kappa coefficient.
Commonly used to quantify inter-annotator agreement, Cohen's kappa can also be used to quantify the quality of a predictor.
It is defined as
where \(p_o\) is the observed agreement and \(p_e\) the expected agreement due to chance. Perfect agreement yields a score of 1, with a score of 0 corresponding to random performance. Several guidelines exist to interpret the magnitude of the score.
Examples:
kappa
cohen_kappa
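A sketch of the definition (with made-up labels), cross-checked against scikit-learn's cohen_kappa_score:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred).astype(float)
n = cm.sum()
p_o = np.trace(cm) / n                                 # observed agreement
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # expected (chance) agreement
kappa = (p_o - p_e) / (1 - p_e)

print(kappa, cohen_kappa_score(y_true, y_pred))  # identical values
```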
F1#
Bases: Metric
Computes the univariate \(F_{1}\)-score.
It is defined as:
or simply put, the harmonic mean between precision (PPV) and recall (TPR).
It is an exceedingly common metric used to evaluate machine learning performance. It is closely related to the Precision-Recall curve, an analysis with varying decision thresholds.
The 1 in the name comes from an unseen \(\beta\) parameter that weights precision and recall. See the FBeta metric.
The \(F_{1}\)-score is susceptible to class imbalance. Values fall in the range [0, 1]. A random classifier which predicts a class with probability \(p\) achieves a performance of,
Since this value is maximized for \(p=1\), Flach & Kull recommend comparing performance not to a random classifier, but to the 'always-on' classifier (perfect recall but poor precision). See the F1Gain metric.
Examples:
f1
f1@macro
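A sketch (made-up labels): per-class F1 as the harmonic mean of precision and recall, cross-checked against scikit-learn's f1_score.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
tpr = np.diag(cm) / cm.sum(axis=1)  # recall
ppv = np.diag(cm) / cm.sum(axis=0)  # precision
f1 = 2 * ppv * tpr / (ppv + tpr)    # harmonic mean

print(f1)
print(f1_score(y_true, y_pred, average=None))  # same per-class values
```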
FBeta#
Bases: Metric
Computes the univariate \(F_{\beta}\)-score.
It is defined as:
or simply put, the weighted harmonic mean between precision (PPV) and recall (TPR).
The value of \(\beta\) determines to which degree a user deems recall more important than precision. Larger values (\(\beta > 1\)) weight recall more, whereas lower values weight precision more.
A value of 1 corresponds to equal weighting; see the F1 metric.
The \(F_{\beta}\)-score is susceptible to class imbalance. Values fall in the range [0, 1]. A random classifier which predicts a class with a probability \(p\), achieves a performance of,
Since this value is maximized for \(p=1\), Flach & Kull recommend comparing performance not to a random classifier, but to the 'always-on' classifier (perfect recall but poor precision). See the FBetaGain metric.
Examples:
fbeta+beta=2
fbeta+beta=0.5@macro
Informedness#
Bases: Metric
Computes the Informedness metric, also known as Youden's J.
It is defined as:
where sensitivity is the True Positive Rate (TPR), and specificity is the True Negative Rate (TNR).
Values fall in the range [-1, 1], with higher values corresponding to better performance and 0 corresponding to random performance.
In the binary case, this metric is equivalent to the adjusted balanced accuracy, ba+adj=True.
It is commonly used in conjunction with a Receiver Operating Characteristic (ROC) curve analysis.
Examples:
informedness
youdenj@macro
Read more:
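A minimal sketch with made-up binary counts:

```python
# Made-up binary confusion-matrix counts.
tn, fp, fn, tp = 40, 10, 5, 45

tpr = tp / (tp + fn)            # sensitivity
tnr = tn / (tn + fp)            # specificity
informedness = tpr + tnr - 1.0  # 0.0 == random, 1.0 == perfect

print(informedness)             # 0.7
```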
Markedness#
Bases: Metric
Computes the markedness metric, also known as \(\Delta p\).
It is defined as:
where precision is the Positive Predictive Value (PPV).
Values fall in the range [-1, 1], with higher values corresponding to better performance and 0 corresponding to random performance.
It is commonly used in conjunction with a Receiver Operating Characteristic (ROC) curve analysis.
Examples:
markedness
delta_p@macro
Read more:
P4#
Bases: Metric
Computes the P4 metric.
It is defined as:
where precision corresponds to the Positive Predictive Value (PPV), recall to the True Positive Rate (TPR), specificity to the True Negative Rate (TNR), and NPV to the Negative Predictive Value. Put otherwise, it is the harmonic mean of these four metrics.
Introduced in 2022 by Sitarz, it is meant to extend the properties of the F1, Markedness and Informedness metrics. It is one of few defined metrics that incorporates the Negative Predictive Value.
Possible values lie in the range [0, 1], with a score of 0 implying one of the intermediate metrics is 0, and a 1 requiring perfect classification.
Relative to MCC, the author notes different behaviour at extreme values, but otherwise the metrics are meant to provide a similar amount of information with a single value.
Examples:
p4
p4@macro
Read more:
JaccardIndex#
Bases: Metric
Computes the Jaccard Index, also known as the threat score.
It is defined as:
where \(TP\) is the count of true positives, \(FP\) the count of false positives and \(FN\) the count of false negatives.
Alternatively, it may be defined as the area of overlap between the predicted positives and the condition positives, divided by the area of their union (all predicted and condition positives).
Due to the alternative definition, it is commonly used when labels are not readily present, for example in evaluating clustering performance.
Examples:
jaccard
critical_success_index@macro
Read more:
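A sketch from binary counts (made-up labels), cross-checked against scikit-learn's jaccard_score:

```python
from sklearn.metrics import jaccard_score

# Made-up binary labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

jaccard = tp / (tp + fp + fn)                  # intersection over union
print(jaccard, jaccard_score(y_true, y_pred))  # both 0.6
```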
PositiveLikelihoodRatio#
Bases: Metric
Computes the positive likelihood ratio.
It is defined as
where sensitivity is the True Positive Rate (TPR), and specificity is the True Negative Rate (TNR).
Simply put, it is the ratio of the probabilities of the model predicting a positive when the condition is positive and negative, respectively.
Possible values lie in the range [0.0, \(\infty\)], with 0.0 corresponding to no true positives, and infinity corresponding to no false positives. Larger values indicate better performance, with a score of 1 corresponding to random performance.
Examples:
plr
positive_likelihood_ratio@macro
Read more:
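A minimal sketch with made-up binary counts:

```python
# Made-up binary confusion-matrix counts.
tn, fp, fn, tp = 40, 10, 5, 45

tpr = tp / (tp + fn)  # sensitivity
fpr = fp / (fp + tn)  # 1 - specificity
plr = tpr / fpr       # 1.0 == random performance; larger is better

print(plr)            # 4.5
```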
LogPositiveLikelihoodRatio#
Bases: Metric
Computes the logarithm of the positive likelihood ratio.
It is defined as
where sensitivity is the True Positive Rate (TPR), and specificity is the True Negative Rate (TNR).
Simply put, it is the logarithm of the ratio of the probabilities of the model predicting a positive when the condition is positive and negative, respectively.
Possible values lie in the range (\(-\infty\), \(\infty\)), with \(-\infty\) corresponding to no true positives, and infinity corresponding to no false positives. Larger values indicate better performance, with a score of 0 corresponding to random performance.
Examples:
log_plr
lplr
log_positive_likelihood_ratio@macro
Read more:
NegativeLikelihoodRatio#
Bases: Metric
Computes the negative likelihood ratio.
It is defined as
where sensitivity is the True Positive Rate (TPR), and specificity is the True Negative Rate (TNR).
Simply put, it is the ratio of the probabilities of the model predicting a negative when the condition is positive and negative, respectively.
Possible values lie in the range [0.0, \(\infty\)], with 0.0 corresponding to no false negatives, and infinity corresponding to no true negatives. Smaller values indicate better performance, with a score of 1 corresponding to random performance.
Examples:
nlr
negative_likelihood_ratio@macro
Read more:
LogNegativeLikelihoodRatio#
Bases: Metric
Computes the logarithm of the negative likelihood ratio.
It is defined as
where sensitivity is the True Positive Rate (TPR), and specificity is the True Negative Rate (TNR).
Simply put, it is the logarithm of the ratio of the probabilities of the model predicting a negative when the condition is positive and negative, respectively.
Possible values lie in the range (\(-\infty\), \(\infty\)), with \(-\infty\) corresponding to no false negatives, and infinity corresponding to no true negatives. Smaller values indicate better performance, with a score of 0 corresponding to random performance.
Examples:
log_nlr
lnlr
log_negative_likelihood_ratio@macro
Read more:
DiagnosticOddsRatio#
Bases: Metric
Computes the diagnostic odds ratio.
It is defined as:
where \(\mathtt{LR}^{+}\) and \(\mathtt{LR}^{-}\) are the positive and negative likelihood ratios, respectively.
Possible values lie in the range [0.0, \(\infty\)]. Larger values indicate better performance, with a score of 1 corresponding to random performance.
To make experiment aggregation easier, you can log transform this metric by specifying log_transform=true.
This makes the sampling distribution approximately Gaussian.
Examples:
dor
diagnostic_odds_ratio@macro
Read more:
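A minimal sketch with made-up binary counts, including the log transform mentioned above:

```python
import math

# Made-up binary confusion-matrix counts.
tn, fp, fn, tp = 40, 10, 5, 45

plr = (tp / (tp + fn)) / (fp / (fp + tn))  # positive likelihood ratio
nlr = (fn / (tp + fn)) / (tn / (fp + tn))  # negative likelihood ratio
dor = plr / nlr                            # equals (tp * tn) / (fp * fn)

print(dor, math.log(dor))                  # 36.0, ~3.58
```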
LogDiagnosticOddsRatio#
Bases: Metric
Computes the logarithm of the diagnostic odds ratio.
It is defined as:
where \(\mathtt{LR}^{+}\) and \(\mathtt{LR}^{-}\) are the positive and negative likelihood ratios, respectively.
Possible values lie in the range (-\(\infty\), \(\infty\)). Larger values indicate better performance, with a score of 0 corresponding to random performance.
Examples:
log_dor
ldor
log_diagnostic_odds_ratio@macro
Read more:
PrevalenceThreshold#
Bases: Metric
Computes the prevalence threshold.
It is defined as:
where \(\mathtt{TPR}\) and \(\mathtt{TNR}\) are the true positive and negative rates, respectively.
Possible values lie in the range (0, 1). Larger values indicate worse performance, with a score of 0 corresponding to perfect classification, and a score of 1 to perfect misclassification.
It represents the inflection point in a sensitivity-specificity (ROC) curve, beyond which a classifier's positive predictive value drops sharply. See Balayla (2020) for more information.
Examples:
pt
prevalence_threshold
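A minimal sketch, using the commonly cited form \(PT = (\sqrt{TPR \cdot FPR} - FPR) / (TPR - FPR)\) with made-up sensitivity and specificity values:

```python
import math

# Made-up sensitivity and specificity values.
tpr = 0.9          # sensitivity
tnr = 0.8          # specificity
fpr = 1.0 - tnr

pt = (math.sqrt(tpr * fpr) - fpr) / (tpr - fpr)
print(pt)          # ~0.32
```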