CLTE Benchmark

Cross-Lingual Textual Entailment (CLTE) is the task of identifying multi-directional entailment relations between two sentences, T1 and T2, written in different languages.

Each T1/T2 pair in the dataset is annotated (XML format) with one of the following entailment relations: 

  • Bidirectional (T1 -> T2 & T1 <- T2): the two fragments entail each other (semantic equivalence)
  • Forward (T1 -> T2 & T1 !<- T2): unidirectional entailment from T1 to T2
  • Backward (T1 !-> T2 & T1 <- T2): unidirectional entailment from T2 to T1
  • No Entailment (T1 !-> T2 & T1 !<- T2): there is no entailment between T1 and T2
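The four relations above are the four combinations of two yes/no judgments: whether T1 entails T2, and whether T2 entails T1. The following is a minimal Python sketch of that mapping. The names `Relation` and `clte_relation` and the label strings are illustrative assumptions, not identifiers taken from the dataset or its XML annotation.

```python
from enum import Enum


class Relation(Enum):
    # Label strings are illustrative; the dataset's XML may spell them differently.
    BIDIRECTIONAL = "bidirectional"
    FORWARD = "forward"
    BACKWARD = "backward"
    NO_ENTAILMENT = "no_entailment"


def clte_relation(t1_entails_t2: bool, t2_entails_t1: bool) -> Relation:
    """Combine the two directional judgments into one of the four CLTE relations."""
    if t1_entails_t2 and t2_entails_t1:
        return Relation.BIDIRECTIONAL  # T1 -> T2 & T1 <- T2
    if t1_entails_t2:
        return Relation.FORWARD        # T1 -> T2 & T1 !<- T2
    if t2_entails_t1:
        return Relation.BACKWARD       # T1 !-> T2 & T1 <- T2
    return Relation.NO_ENTAILMENT      # T1 !-> T2 & T1 !<- T2
```

For example, `clte_relation(True, False)` yields `Relation.FORWARD`, since only the T1-to-T2 direction holds.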

Both T1 and T2 are assumed to be TRUE statements; hence in the dataset there are no contradictory pairs. 

The CLTE datasets were created within the EU-funded project Cosyne (Multilingual Content Synchronization with Wikis).

Various CLTE datasets covering different language pairs are available.

CLTE-Semeval Benchmark

The following data was created for the Cross-lingual Textual Entailment (CLTE) for Content Synchronization Task, which was offered at SemEval-2012 and SemEval-2013.

Four language combinations are available, each containing 1,500 CLTE pairs:

  • Spanish/English
  • German/English
  • Italian/English
  • French/English

Additionally, a monolingual English dataset is available as a by-product of the data collection methodology (1,500 pairs).

The CLTE-SemEval dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publications or presentations containing results obtained through the use of CLTE-SemEval should cite the following reference:

Matteo Negri, Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, and Alessandro Marchetti. 2011.
Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora.
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011).

To obtain the data, please click on the following button:

 

CLTE-Cosyne Benchmark

The CLTE-Cosyne Benchmark consists of:

  • 1,518 pairs for the language combinations English/Italian and English/German
  • 800 pairs for the language combination Italian/German

In addition, two monolingual datasets are available: English (1,518 pairs) and Italian (800 pairs).

To get the CLTE-Cosyne benchmark, please contact Matteo Negri.

 
