Evaluation Metrics

Task Segmentation:

  1. Dice Similarity Coefficient (Dice): Dice measures the region-based overlap between the predicted and ground-truth segmentations. 
  2. Hausdorff distance (HD): HD measures the agreement between the predicted and ground-truth surfaces; because it captures the worst-case boundary deviation, it is sensitive to outliers and complements the overlap-based Dice. A minimal computation sketch for both metrics follows this list.
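
Below is a minimal sketch of both metrics, assuming binary NumPy masks of equal shape and non-empty foregrounds; the function names are illustrative, and the official implementation in the Metrics repository linked below is authoritative (for instance, it may apply physical voxel spacing to HD, which this sketch does not).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the foreground voxels of two
    binary masks, in voxel units (no physical spacing applied here)."""
    pred_pts = np.argwhere(pred.astype(bool))  # (N, ndim) coordinates
    gt_pts = np.argwhere(gt.astype(bool))
    forward = directed_hausdorff(pred_pts, gt_pts)[0]
    backward = directed_hausdorff(gt_pts, pred_pts)[0]
    return max(forward, backward)
```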

Task Classification:

  1. Accuracy: Accuracy measures the overall proportion of correctly classified cases.
  2. AUC: The area under the ROC curve summarizes the trade-off between sensitivity and specificity across all decision thresholds. In clinical application both matter, since missing a cancer (low sensitivity) and overtreatment (low specificity) are equally unacceptable. A minimal computation sketch for both metrics follows this list.
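
Below is a minimal sketch of both metrics using scikit-learn, assuming `y_true` holds the binary case labels and `y_prob` the predicted probability of the malignant class; the 0.5 threshold is an assumption for illustration only.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy at a fixed threshold plus the threshold-free AUC."""
    y_pred = [int(p >= threshold) for p in y_prob]
    acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
    auc = roc_auc_score(y_true, y_prob)   # area under the ROC curve
    return acc, auc
```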

Task Detection:

  1. FROC: Free-Response Receiver Operating Characteristic, which balances sensitivity against the false-positive rate; performance is reported as the average sensitivity at fixed false-positive-per-scan levels (FP = 0.125, 0.25, 0.5, 1, 2, 4, 8). A detected proposal counts as a hit if its Intersection over Union (IoU) with a ground-truth lesion bounding box exceeds 0.3. A minimal computation sketch follows this list.
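
Below is a minimal sketch of the FROC computation, assuming the IoU > 0.3 matching has already produced a true-positive flag per detection (greedy one-to-one matching, so duplicate hits on the same lesion count as false positives, is an assumption here); function and argument names are illustrative.

```python
import numpy as np

def froc(scores, is_tp, num_lesions, num_scans,
         fp_levels=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity at the given false-positives-per-scan levels.

    scores      -- confidence of every detection in the test set
    is_tp       -- True where a detection matched a ground-truth lesion
    num_lesions -- total number of ground-truth lesions
    num_scans   -- number of scans in the test set
    """
    order = np.argsort(scores)[::-1]          # most confident first
    hits = np.asarray(is_tp, dtype=bool)[order]
    sens = np.cumsum(hits) / num_lesions      # sensitivity as threshold drops
    fp_rate = np.cumsum(~hits) / num_scans    # false positives per scan
    # best sensitivity reachable within each FP budget (0 if unreachable)
    per_level = []
    for level in fp_levels:
        within = fp_rate <= level
        per_level.append(sens[within].max() if within.any() else 0.0)
    return float(np.mean(per_level))
```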

The evaluation code can be found in our official GitHub repository: https://github.com/PerceptionComputingLab/TDSC-ABUS2023/tree/main/Metrics

Ranking Method

The ranking scheme includes the following steps:

Task Segmentation:

  1. Calculate the Dice and HD for all cases. 
  2. Rank teams by Dice and by HD separately.
  3. Use (Dice - HD)/2 as the final task score and rank teams by it (see the sketch after this list).
  4. Teams with equal scores are tied.
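
Below is a minimal sketch of this ranking under a literal reading of step 3, where the task score is (Dice - HD)/2 per team (HD enters negatively because lower is better); scipy's `rankdata` assigns tied teams the same average rank, matching step 4.

```python
import numpy as np
from scipy.stats import rankdata

def segmentation_ranking(dice_per_team, hd_per_team):
    """Rank teams by the combined segmentation score; rank 1 is best."""
    dice = np.asarray(dice_per_team, dtype=float)
    hd = np.asarray(hd_per_team, dtype=float)
    task_score = (dice - hd) / 2.0  # higher is better
    return rankdata(-task_score)    # ties receive the same average rank
```

The classification ranking below follows the same pattern with (accuracy + AUC)/2 as the combined score.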

Task Classification:

  1. Calculate the accuracy and AUC over all cases.
  2. Rank teams by accuracy and by AUC separately.
  3. Use the average of accuracy and AUC as the final task score and rank teams by it.
  4. Teams with equal scores are tied.

Task Detection:

  1. Calculate the FROC score over all cases.
  2. Rank teams by FROC score.
  3. Teams with equal scores are tied.

Overall Ranking:

  1. Rank teams by the average score over the three tasks; any unsubmitted task counts as a score of 0 (see the sketch below).
  2. Teams with equal scores are tied.
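
Below is a minimal sketch of the overall ranking, assuming each per-task score is normalized so that higher is better (consistent with unsubmitted tasks counting as 0); the dictionary layout and function name are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata

TASKS = ("segmentation", "classification", "detection")

def overall_ranking(team_scores):
    """team_scores maps team -> {task: score}; absent tasks count as 0."""
    teams = sorted(team_scores)
    averages = [np.mean([team_scores[t].get(task, 0.0) for task in TASKS])
                for t in teams]
    ranks = rankdata(-np.asarray(averages))  # higher average = better rank
    return dict(zip(teams, ranks))           # ties share the same rank
```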