Leaderboard

We show the evaluation results for each team, sorted by accuracy (highest first). We plan to publish the results of this shared task; details will follow.

Results Within Topics

We evaluate all submitted predictions using a new, balanced subset of the original test set.

Below each entry, we also show the per-topic results and a short model overview.

# Team University Precision Recall Accuracy
1 ReCAP✝ Trier University 0.85 0.66 0.77
Abortion 0.79 0.59 0.71
Gay Marriage 0.90 0.73 0.83
Model overview: BERT (uncased, sequence length 512), fine-tuned for 3 epochs.
1 ASV Leipzig University 0.79 0.73 0.77
Abortion 0.78 0.68 0.75
Gay Marriage 0.80 0.78 0.79
Model overview: BERT (uncased, sequence length 512, fine-tuned for 5 epochs), loss function: sigmoid_binary_crossentropy.
2 IBM Research IBM Research 0.69 0.59 0.66
Abortion 0.64 0.54 0.62
Gay Marriage 0.73 0.63 0.70
Model overview: Two BERT models fine-tuned in cascade starting from the vanilla BERT model.
3 UKP TU Darmstadt 0.68 0.52 0.64
Abortion 0.63 0.48 0.60
Gay Marriage 0.74 0.56 0.68
Model overview: Microsoft's Multi-Task Deep Neural Network (mt-dnn), with BERT (large) as its basis. No hyper-parameter tuning; trained for 4 epochs.
4 HHU SSSC Düsseldorf University 0.70 0.33 0.60
Abortion 0.65 0.32 0.57
Gay Marriage 0.76 0.35 0.62
Model overview: Manhattan LSTM, a Siamese network that measures the similarity of the two arguments. Document embeddings via BERT (base, uncased, not fine-tuned, sequence length 512 tokens).
5 ReCAP✝ Trier University 0.65 0.24 0.56
Abortion 0.67 0.22 0.56
Gay Marriage 0.64 0.25 0.64
Model overview: BERT (uncased, sequence length 32), fine-tuned for 3 epochs.
6 DBS* LMU 0.53 1.00 0.55
Abortion 0.53 1.00 0.55
Gay Marriage 0.53 1.00 0.55
Model overview: BERT (base). Arguments are organized as a graph whose edges are weighted with the confidence that the two arguments agree and the confidence that they disagree. If the training set already states that two arguments agree or disagree, the confidences are set to 0 and 1, or 1 and 0, accordingly.
7 ACQuA✝ MLU Halle 0.53 0.57 0.54
Abortion 0.53 0.57 0.53
Gay Marriage 0.54 0.57 0.54
Model overview: Rule-based: same-side classification is treated as a sentiment analysis task; two arguments are predicted to be on the same side if both carry negative or both carry positive sentiment (a sketch follows below the table). Vocabulary of positive and negative words from the opinion lexicon of Minqing Hu and Bing Liu.
8 Paderborn University Paderborn University 0.59 0.19 0.53
Abortion 0.62 0.21 0.54
Gay Marriage 0.55 0.17 0.52
Model overview: A Siamese neural network as a benchmark model; embeddings via the Flair library.
9 sam Potsdam University 0.51 0.58 0.51
Abortion 0.56 0.62 0.56
Gay Marriage 0.46 0.54 0.45
Model overview: Bidirectional LSTM with 512 hidden units; the embedded sentences are classified via a two-layer MLP.
10 ACQuA✝ MLU Halle 0.50 0.11 0.50
Abortion 0.54 0.11 0.51
Gay Marriage 0.47 0.11 0.49
Model overview: Rule-based: same-side classification is treated as a sentiment analysis task; two arguments are predicted to be on the same side if both carry negative or both carry positive sentiment. Vocabulary of positive and negative words from the opinion lexicon of Minqing Hu and Bing Liu.

✝ The ReCAP and ACQuA teams submitted eight and six approaches, respectively. For each of these teams, the table shows the best- and the worst-performing model.
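
The ACQuA rule described in the table can be made concrete in a few lines of code. The following is a minimal sketch of the idea only, not the team's actual implementation: the tiny word lists stand in for the full opinion lexicon of Hu and Liu, and the tie-breaking rule is our own assumption.

```python
import re

# Stand-ins for the Hu & Liu opinion lexicon (the real lexicon has
# thousands of entries); these lists are illustrative only.
POSITIVE_WORDS = {"good", "right", "safe", "freedom", "protect", "benefit"}
NEGATIVE_WORDS = {"bad", "wrong", "harm", "dangerous", "kill", "oppose"}

def polarity(text: str) -> int:
    """Lexicon-based polarity: +1 if positive words dominate, else -1."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) \
          - sum(t in NEGATIVE_WORDS for t in tokens)
    return 1 if score >= 0 else -1  # ties counted as positive (an assumption)

def same_side(arg_a: str, arg_b: str) -> bool:
    # Two arguments are predicted to be on the same side iff their
    # sentiment polarities match (both positive or both negative).
    return polarity(arg_a) == polarity(arg_b)

print(same_side("Legal abortion is a basic right and keeps women safe.",
                "Freedom of choice is a good thing."))  # True
```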

* In addition to their BERT model, the DBS team from LMU also exploited the fact that the "same side" relation is an equivalence relation: if a test set contains arguments that also occur in the training set, labels can be deduced via transitivity (among other properties). However, the test set underlying the results shown above does not contain such exploitable algebraic or logical structure, so that the submissions are compared, as intended, by their language processing power. We mention this fact because the DBS team informed us about their use of this possibility beforehand, which speaks for their fair-mindedness, and, not least, because exploiting the algebraic structure is a smart move.
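
For illustration, here is a minimal sketch (our own, not the DBS team's code) of such label deduction: a union-find structure with parities propagates known "same side" (parity 0) and "opposite side" (parity 1) relations transitively, so that some test pairs could be labeled without any language processing at all.

```python
class ParityUnionFind:
    """Union-find over argument ids; each node stores its side parity
    relative to its parent (0 = same side, 1 = opposite side)."""

    def __init__(self):
        self.parent = {}
        self.parity = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x], self.parity[x] = x, 0
        if self.parent[x] != x:
            root, p = self.find(self.parent[x])   # path compression
            self.parent[x] = root
            self.parity[x] ^= p
        return self.parent[x], self.parity[x]

    def union(self, x, y, same_side: bool):
        """Record a pair whose label is known, e.g. from the training set."""
        rx, px = self.find(x)
        ry, py = self.find(y)
        rel = 0 if same_side else 1
        if rx != ry:
            self.parent[rx] = ry
            self.parity[rx] = px ^ py ^ rel

    def deduce(self, x, y):
        """Return True/False if deducible via transitivity, else None."""
        rx, px = self.find(x)
        ry, py = self.find(y)
        if rx != ry:
            return None   # not connected: fall back to the actual model
        return px == py

uf = ParityUnionFind()
uf.union("a1", "a2", same_side=True)    # known: a1 and a2 agree
uf.union("a2", "a3", same_side=False)   # known: a2 and a3 disagree
print(uf.deduce("a1", "a3"))  # False, deduced without reading the texts
```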

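Most of the top-ranked entries fine-tune BERT on argument pairs. For readers unfamiliar with this setup, the following is a minimal sketch of sequence-pair classification, assuming the Hugging Face transformers API; it mirrors the hyper-parameters listed in the table (uncased model, sequence length 512) but is not any team's actual code, and the label coding is our own assumption.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: same side / different side

# The two arguments are encoded as one sequence pair:
# [CLS] argument_a [SEP] argument_b [SEP], truncated to 512 tokens.
batch = tokenizer("Abortion should stay legal because ...",
                  "A woman has the right to choose ...",
                  truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1])  # 1 = same side (label coding assumed)

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in training, an optimizer step would follow
```

Fine-tuning for the listed number of epochs then simply repeats such steps over all training pairs.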

Results Cross Topics

We evaluate all submitted predictions on a balanced subset of the cross-topics test set:

# Team University Precision Recall Accuracy
1 ReCAP✝ Trier University 0.72 0.72 0.73
2 ASV Leipzig University 0.72 0.72 0.72
3 HHU SSSC Düsseldorf University 0.72 0.53 0.66
4 DBS LMU 0.67 0.53 0.63
4 UKP TU Darmstadt 0.64 0.59 0.63
5 IBM Research IBM Research 0.62 0.49 0.60
6 Paderborn University Paderborn University 0.60 0.38 0.56
7 ReCAP✝ Trier University 0.70 0.11 0.53
8 sam Potsdam University 0.51 0.52 0.51
9 ACQuA✝ MLU Halle 0.50 0.57 0.50
9 ACQuA✝ MLU Halle 0.46 0.00 0.50

✝ The ReCAP and ACQuA teams submitted eight and six approaches, respectively. For each of these teams, the table shows the best- and the worst-performing model.