
The European Union is pursuing an ambitious vision for a competitive and trustworthy data economy. Through initiatives such as the European data strategy, the EU aims to increase the availability and reuse of data across sectors while safeguarding fundamental rights, trade secrets, and intellectual property. Legislative frameworks, including the Data Act and sectoral data spaces such as the European Health Data Space (EHDS) Regulation, are designed to support this objective by enabling secure and controlled data sharing.
However, as data flows across organisations, infrastructures, and sectors, new governance challenges are emerging. Data ecosystems today involve multiple actors, including public institutions, private companies, researchers, and individuals, who contribute to the creation, processing, and reuse of data. As datasets are increasingly reused, combined, and integrated into machine-learning systems, it becomes more difficult to verify whether data circulating within these ecosystems has been collected and processed lawfully. One key challenge arises when unlawful or improperly obtained data propagates through interconnected systems. Once such data is incorporated into datasets or used to train machine-learning models, identifying its origin and mitigating its impact becomes technically complex.
Consider a situation in which health data made available to a research organisation under the EHDS Regulation later turns out to have been unlawfully processed, because certain data was disclosed in violation of the General Data Protection Regulation, intellectual property rights, or trade secret protection. At that point, this health data may already have been incorporated into analytical pipelines, statistical models, or machine-learning systems used for research or innovation purposes. The EHDS Regulation does not provide guidance on how to deal with these consequences once health data has been reused, transformed, or embedded in models. In the absence of technical solutions, the default response may be to invalidate entire datasets or models, thereby undermining the Regulation’s objective of promoting data-driven innovation and evidence-based policymaking.
Situations like these raise important questions for market players and regulators about how to trace the origin of unlawful data and how to resolve such illegalities in today’s increasingly complex data ecosystems. Two emerging technical approaches, data provenance and machine unlearning, offer promising mechanisms for addressing these challenges.
Data Provenance: Strengthening Traceability
Data provenance refers to the ability to document the origin, history, and transformation of data throughout its lifecycle. By recording how data moves through digital infrastructures, provenance mechanisms make it possible to trace where data originated, how it was processed, and which actors interacted with it.
Embedding provenance mechanisms in data ecosystems can significantly improve transparency and accountability. In particular, provenance systems can help:
Trace the origin of datasets, enabling organisations and regulators to understand where data comes from
Document data transformations, showing how datasets have been modified or combined over time
Identify responsible actors, clarifying which organisations contributed to the creation or processing of data
Support regulatory oversight, making it easier for authorities to investigate potential violations
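The capabilities listed above can be illustrated with a minimal sketch in Python. It assumes a simple append-only log in which every processing step records its output, its inputs, the activity performed, and the responsible actor. The class names, method names, and dataset names are hypothetical and chosen only for illustration; real data spaces would more likely rely on a standardised provenance vocabulary.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """One processing step: who produced which artifact from which inputs."""
    output: str
    inputs: Tuple[str, ...]
    activity: str
    actor: str


class ProvenanceLog:
    """Append-only log of processing steps (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._producer: Dict[str, ProvenanceRecord] = {}
        self._consumers: Dict[str, List[ProvenanceRecord]] = defaultdict(list)

    def record(self, output: str, inputs: List[str], activity: str, actor: str) -> None:
        rec = ProvenanceRecord(output, tuple(inputs), activity, actor)
        self._producer[output] = rec
        for name in rec.inputs:
            self._consumers[name].append(rec)

    def downstream(self, artifact: str) -> Set[str]:
        """Every artifact derived, directly or indirectly, from `artifact`."""
        found: Set[str] = set()
        queue = deque([artifact])
        while queue:
            current = queue.popleft()
            for rec in self._consumers.get(current, []):
                if rec.output not in found:
                    found.add(rec.output)
                    queue.append(rec.output)
        return found

    def actors_for(self, artifact: str) -> Set[str]:
        """Every actor involved in producing `artifact` or any of its ancestors."""
        actors: Set[str] = set()
        seen: Set[str] = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            rec = self._producer.get(current)
            if rec is None or current in seen:
                continue
            seen.add(current)
            actors.add(rec.actor)
            stack.extend(rec.inputs)
        return actors


# Hypothetical example: two hospital exports feed a cohort and a model.
log = ProvenanceLog()
log.record("hospital_a_export", [], "collect", "Hospital A")
log.record("hospital_b_export", [], "collect", "Hospital B")
log.record("merged_cohort", ["hospital_a_export", "hospital_b_export"], "merge", "Research Institute")
log.record("risk_model_v1", ["merged_cohort"], "train", "Research Institute")
log.record("b_only_stats", ["hospital_b_export"], "aggregate", "Statistics Office")

# If Hospital A's export turns out to be unlawful, which artifacts are affected?
affected = log.downstream("hospital_a_export")
# Which actors contributed to the trained model?
responsible = log.actors_for("risk_model_v1")
```

In this sketch, `downstream` answers the question of where problematic data has travelled, while `actors_for` answers the question of who was involved. Both are plain graph traversals, which is why consistent recording of inputs and outputs at every step matters more than the particular storage format.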
By improving traceability, provenance systems help organisations identify problematic data and determine how it has been used across different systems. However, identifying unlawful data is only part of the solution. Once such data has been integrated into machine-learning models, removing its influence can be significantly more difficult.
Machine Unlearning: Removing the Influence of Data
Machine learning models are typically trained on large datasets, meaning that individual data points can influence model behaviour. When data must later be removed, for example because it was collected unlawfully, its influence may remain embedded in trained models.
Machine unlearning addresses this problem by enabling organisations to remove the influence of specific training data from a model. Although the field is still developing, machine unlearning techniques offer a promising approach for resolving data illegalities in AI systems.
In practice, machine unlearning can help organisations:
Remove the impact of specific data points from trained models, without the need to discard or retrain the model entirely
Comply with deletion or correction requirements
Reduce compliance costs by enabling targeted remediation rather than full model replacement
This approach is particularly relevant in contexts where organisations must respond to legal obligations such as the removal of unlawfully obtained data or the correction of inaccurate information.
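To make the idea concrete, the following Python sketch shows unlearning in the simplest possible setting: a least-squares line whose parameters depend only on additive summary statistics. Under that assumption, subtracting a data point's contribution yields the same model as retraining without it. The class and variable names are hypothetical, and the example is deliberately a toy; models such as deep neural networks do not decompose this way, which is one reason the field is still developing.

```python
import math


class UnlearnableLine:
    """Least-squares fit of y = intercept + slope * x from additive statistics."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x * x
        self.sxy = 0.0  # sum of x * y

    def _update(self, x: float, y: float, sign: int) -> None:
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def learn(self, x: float, y: float) -> None:
        """Add one training point."""
        self._update(x, y, +1)

    def unlearn(self, x: float, y: float) -> None:
        """Remove the contribution of a point that was previously learned."""
        self._update(x, y, -1)

    def coefficients(self):
        """Return (intercept, slope) for the points currently in the model."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two points with distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Hypothetical data: four lawful points on y = 2x and one unlawful outlier.
lawful = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
unlawful = (5.0, 30.0)

model = UnlearnableLine()
for x, y in lawful + [unlawful]:
    model.learn(x, y)
model.unlearn(*unlawful)  # targeted removal, no retraining

retrained = UnlearnableLine()  # reference: trained from scratch on lawful data only
for x, y in lawful:
    retrained.learn(x, y)

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(model.coefficients(), retrained.coefficients())
)
```

Here `same` compares the unlearned model with one retrained from scratch on the lawful points only, which is the benchmark any unlearning method is ultimately measured against. For more complex models, that equivalence generally has to be approximated rather than obtained exactly.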
Why These Approaches Are Complementary
Data provenance and machine unlearning address different stages of the same governance challenge. Their combined use can provide a more comprehensive framework for managing unlawful data in complex digital ecosystems.
Without provenance systems, organisations may struggle to determine which data must be removed. Without machine unlearning, the only way to eliminate the impact of unlawful data on downstream models is to discard or retrain them entirely.
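The two-step logic described above, first locating the affected data and then removing its influence, can be sketched as follows. This is a self-contained toy in Python under strong assumptions: each record carries a known source, each model keeps the set of record IDs it was trained on, and the model is a simple mean that supports exact removal. All names (`MeanModel`, `remediate`, the clinic and record identifiers) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MeanModel:
    """Toy model that predicts the mean of its training values."""
    total: float = 0.0
    count: int = 0
    trained_on: Set[str] = field(default_factory=set)  # provenance: record IDs used

    def learn(self, record_id: str, value: float) -> None:
        self.total += value
        self.count += 1
        self.trained_on.add(record_id)

    def unlearn(self, record_id: str, value: float) -> None:
        if record_id not in self.trained_on:
            raise KeyError(record_id)
        self.total -= value
        self.count -= 1
        self.trained_on.remove(record_id)

    def predict(self) -> float:
        return self.total / self.count


def remediate(
    unlawful_source: str,
    record_sources: Dict[str, str],
    record_values: Dict[str, float],
    models: Dict[str, MeanModel],
) -> Dict[str, List[str]]:
    """Step 1 (provenance): find records from the unlawful source.
    Step 2 (unlearning): remove them from every model that used them."""
    tainted = {rid for rid, src in record_sources.items() if src == unlawful_source}
    report: Dict[str, List[str]] = {}
    for name, model in models.items():
        hits = sorted(tainted & model.trained_on)
        for rid in hits:
            model.unlearn(rid, record_values[rid])
        report[name] = hits
    return report


# Hypothetical records, their sources, and two models trained on overlapping data.
record_sources = {"r1": "clinic_a", "r2": "clinic_b", "r3": "clinic_a", "r4": "clinic_c"}
record_values = {"r1": 10.0, "r2": 40.0, "r3": 20.0, "r4": 30.0}

model_a, model_b = MeanModel(), MeanModel()
for rid in ("r1", "r2", "r3"):
    model_a.learn(rid, record_values[rid])
for rid in ("r3", "r4"):
    model_b.learn(rid, record_values[rid])

report = remediate("clinic_b", record_sources, record_values,
                   {"model_a": model_a, "model_b": model_b})
```

The returned `report` shows that only `model_a` needed remediation, while `model_b` is left untouched. Without the provenance lookup, both models would be suspect; without the unlearning step, `model_a` would have to be rebuilt from scratch.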
Policy Implications
As the EU continues to promote the exchange of data and to develop large-scale data-sharing infrastructures, policymakers should consider encouraging the further adoption of technical mechanisms such as provenance and machine unlearning to support effective data governance.
Several policy priorities emerge:
Promote provenance-by-design in data infrastructures
Encourage standardised provenance metadata across European data spaces
Integrate traceability mechanisms into data governance frameworks
Expected outcomes:
Greater transparency in data ecosystems
Stronger accountability across data-sharing environments

Support the development of machine unlearning techniques
Invest in research and technical standards
Clarify how unlearning may support compliance with data sharing obligations
Expected outcomes:
More effective implementation of data deletion or correction requirements
Reduced compliance burdens for organisations using AI systems

Strengthen regulatory technical capacity
Expand technical expertise within supervisory authorities
Develop tools for auditing data ecosystems
Expected outcomes:
More effective oversight of complex data ecosystems
Improved enforcement of data governance rules
Building Well-Functioning Data Ecosystems
The success of Europe’s data strategy depends not only on increasing data availability but also on ensuring that data sharing is governed effectively and responsibly. As data ecosystems become more complex, regulatory frameworks must be supported by technical tools capable of addressing real-world data governance challenges.
Legislative instruments like the Data Act and the EHDS Regulation do not specify how to address situations in which data has been exchanged in ways that do not meet legal requirements and has already been processed further along the data chain. By integrating data provenance and machine unlearning into emerging data infrastructures, policymakers can strengthen transparency, accountability, and compliance, and can avoid situations in which entire datasets can no longer be used, an outcome that would jeopardise the objective underlying the European data strategy.
