The TAB corpus
The Text Anonymization Benchmark (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) manually annotated with:
- semantic categories for personal identifiers,
- masking decisions for these identifiers,
- confidential attributes,
- co-reference relations.
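As a rough illustration of how these four annotation layers could fit together on a single text span, here is a minimal Python sketch. The `Mention` class, its field names, the category labels, and both helper functions are hypothetical and chosen only for this example; they are not the corpus's actual file format or schema, which should be checked against the released data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mention:
    """One annotated text span (hypothetical schema, not the TAB format)."""
    start: int                 # character offset where the span begins
    end: int                   # character offset just past the span
    category: str              # semantic category (illustrative label)
    needs_masking: bool        # masking decision for this identifier
    entity_id: str             # mentions sharing an id are co-referent
    confidential: Optional[str] = None  # confidential attribute, if any


def mask_text(text: str, mentions: List[Mention], placeholder: str = "***") -> str:
    """Replace every span marked for masking with a placeholder."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for m in sorted(mentions, key=lambda m: m.start, reverse=True):
        if m.needs_masking:
            text = text[:m.start] + placeholder + text[m.end:]
    return text


def coreference_chains(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group mentions that refer to the same entity."""
    chains: Dict[str, List[Mention]] = {}
    for m in mentions:
        chains.setdefault(m.entity_id, []).append(m)
    return chains


# Invented example sentence; offsets are computed rather than hard-coded.
text = "Mr John Smith was born in 1970. Smith lives in Oslo."
first = text.index("John Smith")
second = text.index("Smith", first + len("John Smith"))
year = text.index("1970")

mentions = [
    Mention(first, first + len("John Smith"), "PERSON", True, "e1"),
    Mention(year, year + len("1970"), "DATETIME", False, "e2"),
    Mention(second, second + len("Smith"), "PERSON", True, "e1"),
]

print(mask_text(text, mentions))
# Mr *** was born in 1970. *** lives in Oslo.
```

Keeping the masking decision separate from the semantic category means a span can be recorded as an identifier without being masked, and the shared `entity_id` lets all mentions of one entity be handled consistently.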
The corpus is available for download here.