Evaluation Metrics

Task Segmentation:

  1. Dice Similarity Coefficient (Dice): Dice measures the region-based overlap between the predicted and ground-truth segmentations. 
  2. Hausdorff distance (HD): HD measures the agreement between the predicted and ground-truth surfaces; because it captures the worst-case boundary deviation, it is sensitive to outliers and complements the overlap-based Dice. A minimal computation sketch for both metrics follows this list.
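
Below is a minimal sketch of both metrics, assuming binary NumPy masks of equal shape and non-empty foregrounds; the function names are illustrative, and the official implementation in the Metrics repository linked below is authoritative (for instance, it may apply physical voxel spacing to HD, which this sketch does not).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the foreground voxels of two
    binary masks, in voxel units (no physical spacing applied here)."""
    pred_pts = np.argwhere(pred.astype(bool))  # (N, ndim) coordinates
    gt_pts = np.argwhere(gt.astype(bool))
    forward = directed_hausdorff(pred_pts, gt_pts)[0]
    backward = directed_hausdorff(gt_pts, pred_pts)[0]
    return max(forward, backward)
```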

Task Classification:

  1. Accuracy: Accuracy measures the overall proportion of correctly classified cases.
  2. AUC: The area under the ROC curve summarizes the trade-off between sensitivity and specificity across all decision thresholds. In clinical application both matter, since missing a cancer (low sensitivity) and overtreatment (low specificity) are equally unacceptable. A minimal computation sketch for both metrics follows this list.
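
Below is a minimal sketch of both metrics using scikit-learn, assuming `y_true` holds the binary case labels and `y_prob` the predicted probability of the malignant class; the 0.5 threshold is an assumption for illustration only.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy at a fixed threshold plus the threshold-free AUC."""
    y_pred = [int(p >= threshold) for p in y_prob]
    acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
    auc = roc_auc_score(y_true, y_prob)   # area under the ROC curve
    return acc, auc
```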

Task Detection:

  1. FROC: Free-Response Receiver Operating Characteristic, which balances sensitivity against the false-positive rate; performance is reported as the average sensitivity at fixed false-positive-per-scan levels (FP = 0.125, 0.25, 0.5, 1, 2, 4, 8). A detected proposal counts as a hit if its Intersection over Union (IoU) with a ground-truth lesion bounding box exceeds 0.3. A minimal computation sketch follows this list.
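
Below is a minimal sketch of the FROC computation, assuming the IoU > 0.3 matching has already produced a true-positive flag per detection (greedy one-to-one matching, so duplicate hits on the same lesion count as false positives, is an assumption here); function and argument names are illustrative.

```python
import numpy as np

def froc(scores, is_tp, num_lesions, num_scans,
         fp_levels=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity at the given false-positives-per-scan levels.

    scores      -- confidence of every detection in the test set
    is_tp       -- True where a detection matched a ground-truth lesion
    num_lesions -- total number of ground-truth lesions
    num_scans   -- number of scans in the test set
    """
    order = np.argsort(scores)[::-1]          # most confident first
    hits = np.asarray(is_tp, dtype=bool)[order]
    sens = np.cumsum(hits) / num_lesions      # sensitivity as threshold drops
    fp_rate = np.cumsum(~hits) / num_scans    # false positives per scan
    # best sensitivity reachable within each FP budget (0 if unreachable)
    per_level = []
    for level in fp_levels:
        within = fp_rate <= level
        per_level.append(sens[within].max() if within.any() else 0.0)
    return float(np.mean(per_level))
```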

The evaluation code can be found in our official GitHub repository: https://github.com/PerceptionComputingLab/TDSC-ABUS2023/tree/main/Metrics

Ranking Method

The ranking scheme includes the following steps:

Task Segmentation:

  1. Calculate the Dice and HD for all cases. 
  2. Rank teams by Dice and by HD separately.
  3. Use (Dice - HD)/2 as the final task score and rank teams by it (see the sketch after this list).
  4. Teams with equal scores are tied.
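
Below is a minimal sketch of this ranking under a literal reading of step 3, where the task score is (Dice - HD)/2 per team (HD enters negatively because lower is better); scipy's `rankdata` assigns tied teams the same average rank, matching step 4.

```python
import numpy as np
from scipy.stats import rankdata

def segmentation_ranking(dice_per_team, hd_per_team):
    """Rank teams by the combined segmentation score; rank 1 is best."""
    dice = np.asarray(dice_per_team, dtype=float)
    hd = np.asarray(hd_per_team, dtype=float)
    task_score = (dice - hd) / 2.0  # higher is better
    return rankdata(-task_score)    # ties receive the same average rank
```

The classification ranking below follows the same pattern with (accuracy + AUC)/2 as the combined score.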

Task Classification:

  1. Calculate the accuracy and AUC over all cases.
  2. Rank teams by accuracy and by AUC separately.
  3. Use the average of accuracy and AUC as the final task score and rank teams by it.
  4. Teams with equal scores are tied.

Task Detection:

  1. Calculate the FROC score over all cases.
  2. Rank teams by FROC score.
  3. Teams with equal scores are tied.

Overall Ranking:

  1. Rank teams by the average score over the three tasks; any unsubmitted task counts as a score of 0 (see the sketch below).
  2. Teams with equal scores are tied.
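
Below is a minimal sketch of the overall ranking, assuming each per-task score is normalized so that higher is better (consistent with unsubmitted tasks counting as 0); the dictionary layout and function name are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata

TASKS = ("segmentation", "classification", "detection")

def overall_ranking(team_scores):
    """team_scores maps team -> {task: score}; absent tasks count as 0."""
    teams = sorted(team_scores)
    averages = [np.mean([team_scores[t].get(task, 0.0) for task in TASKS])
                for t in teams]
    ranks = rankdata(-np.asarray(averages))  # higher average = better rank
    return dict(zip(teams, ranks))           # ties share the same rank
```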