Tutorial on Private, Collaborative Learning in Document Analysis

at ICDAR 2024 | 18th International Conference on Document Analysis and Recognition

The issues of privacy and restricted access to documents have always been core problems of document analysis and recognition (DAR), with important repercussions for the way the community does research. One effect is that many models are still trained on private, undisclosed datasets, while public document analysis datasets tend to be small or focused on very specific, narrow domains.

There are important regulatory issues when using contemporary documents, especially in the administrative, fin-tech and insurance-tech domains, that need to be taken into account every time a document analysis application is introduced. Existing privacy protection regulations, like the GDPR in the European Union, impose specific restrictions on the processing of documents with AI models, for example limits on how long a document can be kept in storage, or restrictions on how client data can be used for training new models. It is to be expected that as new regulation on Artificial Intelligence is introduced (e.g. the AI Act in the European Union, or the recent AI executive order in the USA), privacy guarantees will become a requirement for many sensitive applications.

The tutorial motivates and explains an important, emerging DAR topic: privacy-preserving and collaborative learning methods for document analysis. This line of research is already taking shape and advancing fast outside the strict walls of the DAR community, and it presents new opportunities for DAR research.

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.