
Making Data Sharing Work in the EU: Why Data Provenance and Machine Unlearning Matter


10 min read | April 2nd 2026

AI4POL

The European Union is pursuing an ambitious vision for a competitive and trustworthy data economy. Through initiatives such as the European data strategy, the EU aims to increase the availability and reuse of data across sectors while safeguarding fundamental rights, trade secrets, and intellectual property. Legislative frameworks, including the Data Act and sectoral data spaces such as the European Health Data Space (EHDS) Regulation, are designed to support this objective by enabling secure and controlled data sharing. 

However, as data flows across organisations, infrastructures, and sectors, new governance challenges are emerging. Data ecosystems today involve multiple actors (public institutions, private companies, researchers, and individuals) who contribute to the creation, processing, and reuse of data. As datasets are increasingly reused, combined, and integrated into machine-learning systems, it becomes more difficult to verify whether data circulating within these ecosystems has been collected and processed lawfully. One key challenge arises when unlawful or improperly obtained data propagates through interconnected systems. Once such data is incorporated into datasets or used to train machine-learning models, identifying its origin and mitigating its impact becomes technically complex.

Think of a situation where health data made available to a research organisation under the EHDS Regulation later turns out to be unlawfully processed because certain data was disclosed in violation of the General Data Protection Regulation, intellectual property rights or trade secrets. At that point, this health data may already have been incorporated into analytical pipelines, statistical models, or machine learning systems used for research or innovation purposes. The EHDS Regulation does not provide guidance on how to deal with these consequences once health data has been reused, transformed, or embedded in models. In the absence of technical solutions, the default response may be to invalidate entire datasets or models, thereby undermining the Regulation’s objective of promoting data-driven innovation and evidence-based policymaking.

Situations like these raise important questions for market players and regulators about how to trace back the origin of unlawful data and how to resolve such illegalities in today’s increasingly complex data ecosystems. Two emerging technical approaches — data provenance and machine unlearning — offer promising mechanisms for addressing these challenges. 

Data Provenance: Strengthening Traceability 

Data provenance refers to the ability to document the origin, history, and transformation of data throughout its lifecycle. By recording how data moves through digital infrastructures, provenance mechanisms make it possible to trace where data originated, how it was processed, and which actors interacted with it. 

Embedding provenance mechanisms in data ecosystems can significantly improve transparency and accountability. In particular, provenance systems can help: 

  • Trace the origin of datasets, enabling organisations and regulators to understand where data comes from 

  • Document data transformations, showing how datasets have been modified or combined over time 

  • Identify responsible actors, clarifying which organisations contributed to the creation or processing of data 

  • Support regulatory oversight, making it easier for authorities to investigate potential violations 

By improving traceability, provenance systems help organisations identify problematic data and determine how it has been used across different systems. However, identifying unlawful data is only part of the solution. Once such data has been integrated into machine-learning models, removing its influence can be significantly more difficult. 
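In code, a provenance mechanism can be as simple as an append-only log of dataset fingerprints and operations. The sketch below is a minimal Python illustration (the class and field names are our own, not drawn from any standard; production systems would typically build on a formal model such as W3C PROV). It records each operation together with the responsible actor, and can walk the chain backwards to reconstruct where a dataset came from:

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(data):
    """Content hash that identifies one version of a dataset."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()[:16]

class ProvenanceLog:
    """Append-only log of where a dataset came from and how it changed."""

    def __init__(self):
        self.records = []

    def record(self, operation, actor, inputs, output):
        self.records.append({
            "operation": operation,
            "actor": actor,
            "inputs": [fingerprint(d) for d in inputs],
            "output": fingerprint(output),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def trace(self, data):
        """Walk back from a dataset to every upstream record that produced it."""
        wanted = {fingerprint(data)}
        lineage = []
        for rec in reversed(self.records):
            if rec["output"] in wanted:
                lineage.append(rec)
                wanted.update(rec["inputs"])
        return list(reversed(lineage))

# Hypothetical scenario: a hospital extract is pseudonymised by an
# intermediary, then merged with registry data by a research organisation.
log = ProvenanceLog()
raw = [{"id": 1, "dx": "I10"}, {"id": 2, "dx": "E11"}]
pseudo = [{"pid": "a3f", "dx": "I10"}, {"pid": "9c2", "dx": "E11"}]
registry = [{"pid": "a3f", "region": "NL"}]
merged = [{"pid": "a3f", "dx": "I10", "region": "NL"}]

log.record("extract", "hospital_A", [], raw)
log.record("pseudonymise", "data_intermediary", [raw], pseudo)
log.record("merge", "research_org", [pseudo, registry], merged)

lineage = log.trace(merged)
```

If the hospital extract later turns out to be unlawful, tracing the merged dataset returns the full chain of records (extract, pseudonymise, merge), including which actor performed each step: exactly the traceability the bullet points above describe.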

Machine Unlearning: Removing the Influence of Data 

Machine learning models are typically trained on large datasets, meaning that individual data points can influence model behaviour. When data must later be removed, for example because it was collected unlawfully, its influence may remain embedded in trained models. 

Machine unlearning addresses this problem by enabling organisations to remove the influence of specific training data from a model. Although the field is still developing, machine unlearning techniques offer a promising approach for resolving data illegalities in AI systems. 

In practice, machine unlearning can help organisations: 

  • Remove the impact of specific data points from trained models, without the need to discard or retrain the model entirely 

  • Comply with deletion or correction requirements 

  • Reduce compliance costs by enabling targeted remediation rather than full model replacement 

This approach is particularly relevant in contexts where organisations must respond to legal obligations such as the removal of unlawfully obtained data or the correction of inaccurate information. 
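One family of techniques makes this removal exact by construction: sharded training, as proposed in the SISA approach (Bourtoule et al.). The dataset is split into shards, one sub-model is trained per shard, and predictions are aggregated; unlearning a point then only requires retraining the single shard that contained it. The sketch below is a toy Python illustration under strong simplifying assumptions (the "sub-models" are just mean predictors, and all names are hypothetical), not a production implementation:

```python
def shard_of(point, n_shards):
    """Deterministically assign a training point to a shard."""
    return hash(point) % n_shards

def train_shard(shard):
    """Toy sub-model: predicts the mean label of its shard (None if empty)."""
    if not shard:
        return None
    return sum(y for _, y in shard) / len(shard)

class ShardedModel:
    def __init__(self, data, n_shards=3):
        self.shards = [[] for _ in range(n_shards)]
        for point in data:
            self.shards[shard_of(point, n_shards)].append(point)
        self.models = [train_shard(s) for s in self.shards]

    def predict(self):
        trained = [m for m in self.models if m is not None]
        return sum(trained) / len(trained)

    def unlearn(self, point):
        """Remove one training point and retrain only its shard."""
        i = shard_of(point, len(self.shards))
        self.shards[i].remove(point)
        self.models[i] = train_shard(self.shards[i])

data = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 10.0), (4, 5.0), (5, 6.0)]
model = ShardedModel(list(data))
model.unlearn((3, 10.0))  # point later flagged as unlawfully obtained

# Exact unlearning: the result is identical to a model that was never
# trained on the flagged point in the first place.
clean = ShardedModel([p for p in data if p != (3, 10.0)])
```

Because only one shard is retrained, the cost of honouring a deletion or correction request scales with the shard size rather than the full dataset, which is the cost reduction the third bullet above refers to.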

Why These Approaches Are Complementary

Data provenance and machine unlearning address different stages of the same governance challenge. Their combined use can provide a more comprehensive framework for managing unlawful data in complex digital ecosystems. 

Without provenance systems, organisations may struggle to determine which data must be removed. Without machine unlearning, the impact of unlawful data on downstream systems cannot be eliminated short of discarding or fully retraining the affected models.

Policy Implications 

As the EU continues to promote the exchange of data and to develop large-scale data-sharing infrastructures, policymakers should consider encouraging the further adoption of technical mechanisms such as provenance and machine unlearning to support effective data governance. 

Several policy priorities emerge:

Promote provenance-by-design in data infrastructures
  • Encourage standardised provenance metadata across European data spaces 

  • Integrate traceability mechanisms into data governance frameworks 

Expected outcomes: 

  • Greater transparency in data ecosystems 

  • Stronger accountability across data-sharing environments 

Support the development of machine unlearning techniques 
  • Invest in research and technical standards 

  • Clarify how unlearning may support compliance with data sharing obligations 

Expected outcomes: 

  • More effective implementation of data deletion or correction requirements  

  • Reduced compliance burdens for organisations using AI systems 

Strengthen regulatory technical capacity 
  • Expand technical expertise within supervisory authorities 

  • Develop tools for auditing data ecosystems 

Expected outcomes: 

  • More effective oversight of complex data ecosystems 

  • Improved enforcement of data governance rules 

Building Well-Functioning Data Ecosystems 

The success of Europe’s data strategy depends not only on increasing data availability but also on ensuring that data sharing is governed effectively and responsibly. As data ecosystems become more complex, regulatory frameworks must be supported by technical tools capable of addressing real-world data governance challenges.

Legislative instruments like the Data Act and the EHDS Regulation do not specify how to address situations in which unlawfully exchanged data has already been processed further along the data chain. By integrating data provenance and machine unlearning into emerging data infrastructures, policymakers can strengthen transparency, accountability, and compliance, and avoid situations in which entire datasets or models must be discarded, which would jeopardise the objective underlying the European data strategy.

by Inge Graef & Pratiksha Ashok

This project has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101177455
