Evaluation Metrics¶
Task Segmentation:
- Dice Similarity Coefficient (Dice): Dice measures the area-based overlap between the predicted and ground-truth segmentations.
- Hausdorff distance (HD): HD measures the agreement of the segmentation surfaces; unlike Dice, it is sensitive to boundary outliers.
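The two segmentation metrics above can be sketched as follows for binary masks given as sets of voxel coordinates. This is a minimal illustration only; the function names (`dice`, `hausdorff`) are ours, and the official implementation is the one in the challenge's Metrics repository.

```python
def dice(pred, gt):
    """Dice = 2|A∩B| / (|A| + |B|) for two voxel-coordinate sets."""
    if not pred and not gt:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & gt) / (len(pred) + len(gt))

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance between two point sets (Euclidean)."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    def directed(a, b):
        # largest distance from any point in a to its nearest point in b
        return max(min(d(p, q) for q in b) for p in a)
    return max(directed(pred, gt), directed(gt, pred))

pred = {(0, 0), (0, 1), (1, 1)}
gt = {(0, 1), (1, 1), (2, 1)}
print(dice(pred, gt))       # 2*2/(3+3) ≈ 0.667
print(hausdorff(pred, gt))  # 1.0
```

A single stray voxel far from the boundary barely moves Dice but dominates HD, which is why the two metrics are reported together.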
Task Classification:
- Accuracy: Accuracy indicates the overall performance of the classifier.
- AUC: In clinical application, both sensitivity and specificity are important, as missing a cancer or overtreating a patient is unacceptable; AUC summarizes this trade-off across decision thresholds.
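The two classification metrics can be sketched as below, with AUC computed via the Mann-Whitney formulation (the probability that a random positive case is scored above a random negative one). The function names are illustrative; a real pipeline would typically use a library implementation such as scikit-learn's.

```python
def accuracy(labels, preds):
    """Fraction of cases where the predicted label matches the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def auc(labels, scores):
    """AUC as the probability a random positive outscores a random negative;
    tied scores count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
preds = [s >= 0.5 for s in scores]
print(accuracy(labels, preds))  # 0.5
print(auc(labels, scores))      # 0.75
```

Note that accuracy depends on the threshold used to binarize scores, while AUC does not, which is why the challenge reports both.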
Task Detection:
- FROC: Free-Response Receiver Operating Characteristic, which balances sensitivity against the false-positive rate, with performance reported as the average sensitivity at various false-positive levels (FP = 0.125, 0.25, 0.5, 1, 2, 4, 8). A detected proposal counts as a hit if its Intersection over Union (IoU) with a ground-truth bounding box of a mediastinal lesion exceeds 0.3.
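The hit criterion above can be sketched as follows, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`. The helper names (`iou`, `is_hit`) are ours, not the repository's API; the official FROC code is in the Metrics repository linked below.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def is_hit(det_box, gt_boxes, thresh=0.3):
    """A detection is a hit if its IoU with any ground-truth box exceeds thresh."""
    return any(iou(det_box, g) > thresh for g in gt_boxes)

det = (0, 0, 10, 10)
gts = [(5, 5, 15, 15), (20, 20, 30, 30)]
print(iou(det, gts[0]))  # 25/175 ≈ 0.143, below the 0.3 threshold
print(is_hit(det, gts))  # False
```

The FROC curve is then obtained by sweeping the detection-score threshold: at each threshold, detections that are not hits count as false positives, and sensitivity is read off at the listed FP levels and averaged.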
Evaluation code can be found in our official GitHub repository: https://github.com/PerceptionComputingLab/TDSC-ABUS2023/tree/main/Metrics
Ranking Method¶
The ranking scheme includes the following steps:
Task Segmentation:
- Calculate Dice and HD for all cases.
- Rank Dice and HD separately.
- Average the two ranks as the final task score.
- Teams with equal scores tie.
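One reading of the segmentation ranking steps, sketched under the assumption that Dice is ranked descending (higher is better), HD ascending (lower is better), and the final task score is the mean of the two per-metric ranks. The `rank` helper is illustrative and uses dense ranking for ties; the official code may break ties differently.

```python
def rank(values, descending=False):
    """1-based dense ranks; tied values share the same rank."""
    order = sorted(set(values), reverse=descending)
    pos = {v: i + 1 for i, v in enumerate(order)}
    return [pos[v] for v in values]

# Hypothetical per-team results for three teams
dice_scores = [0.90, 0.85, 0.90]
hd_scores = [3.0, 2.5, 4.0]

r_dice = rank(dice_scores, descending=True)  # [1, 2, 1]
r_hd = rank(hd_scores)                       # [2, 1, 3]
final = [(a + b) / 2 for a, b in zip(r_dice, r_hd)]
print(final)  # [1.5, 1.5, 2.0] — the first two teams tie
```

The classification task follows the same pattern with accuracy and AUC in place of Dice and HD (both ranked descending).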
Task Classification:
- Calculate accuracy and AUC for all cases.
- Rank accuracy and AUC separately.
- Average the two ranks as the final task score.
- Teams with equal scores tie.
Task Detection:
- Calculate the FROC for all cases.
- Rank by FROC.
- Teams with equal scores tie.
Overall Ranking:
- Rank the average score of the three tasks; unsubmitted tasks are scored as 0.
- Teams with equal scores tie.
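The overall ranking can be sketched as below, assuming each task yields a normalized score where higher is better and a missing submission contributes 0. The function name and the team data are hypothetical.

```python
def overall_score(seg=0.0, cls=0.0, det=0.0):
    """Average of the three task scores; unsubmitted tasks default to 0."""
    return (seg + cls + det) / 3

teams = {
    "team_a": overall_score(seg=0.8, cls=0.7, det=0.6),  # 0.70
    "team_b": overall_score(seg=0.9, cls=0.9),           # detection not submitted -> 0.60
}
ranking = sorted(teams, key=teams.get, reverse=True)
print(ranking)  # ['team_a', 'team_b']
```

Note how skipping a task is costly: team_b leads on both submitted tasks but still ranks below team_a once the missing detection score is counted as 0.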