The CLEANUP Project

Welcome to the public webpage of the CLEANUP project! CLEANUP was a four-year research project funded by the Research Council of Norway from 2020 to 2024. The goal of CLEANUP was to develop new machine learning methods to automatically anonymise (or at least strongly de-identify) text documents containing personal data, such as electronic health records, court rulings or chat-based interactions with customers.

The project brought together a consortium of researchers from machine learning, natural language processing, computational privacy, statistical modelling, health informatics and IT law. In addition, partners from the Norwegian public and private sectors (covering the fields of insurance, welfare, healthcare and legal publishing) contributed their data and domain knowledge to the project.

Oh, and if you were wondering what CLEANUP stands for: it's "Machine Learning for the Anonymisation of Unstructured Personal Data" (yes, we were a bit creative with the acronym).

News:

[2025-07-01]	Lucas Charpentier, a PhD student at UiO whose work was done in close collaboration with the CLEANUP team, just got a long paper accepted to ACL 2025, the top-tier conference in NLP. Lucas showed how a RAG-inspired approach could be employed to test the robustness of existing de-identification methods by attempting the reverse process of re-identification.
[2024-12-31]	The project has now officially ended. Thank you to everyone who contributed!
[2024-10-01]	For hands-on advice on text de-identification, please see our practical guidebook, written in close collaboration with NAV.
[2023-10-22]	Our latest work, Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis (currently in submission), is now available on arXiv! This journal paper offers an in-depth analysis of our sanitization approach and proposes several privacy risk indicators.
[2022-11-01]	Our paper, Neural Text Sanitization with Explicit Measures of Privacy Risk, has been accepted as a long paper at AACL! The paper presents a novel approach to text sanitization based on estimates of disclosure risk, which allows us to directly control the trade-off between privacy protection and data utility.
[2022-05-01]	We present a new, carefully curated dataset for privacy-enhancing NLP: the Text Anonymization Benchmark (TAB). See our paper recently published in Computational Linguistics for details.
[2021-09-10]	One of our master's students, Torbjørn Dahl, is working on reference resolution on de-identified texts in collaboration with Lovdata.
[2021-05-06]	Our position paper on text anonymisation has been accepted to ACL 2021, one of the top-tier conferences within NLP. See the current version here.
[2020-11-01]	Our PhD research fellow Anthi Papadopolou has just started her PhD on neural models for text anonymisation. Welcome on board!
[2020-04-30]	The official website of the CLEANUP project is now up and running!
[2020-02-01]	The CLEANUP project has now officially started!