09/05/2017

Researchers develop a system for automatically hiding confidential information in text documents

The method offers a similar level of accuracy to that provided manually by privacy experts and is even more thorough

The researcher David Sánchez, from the CRISES-UNESCO Chair research group on data privacy at the URV’s Department of Computer Engineering and Mathematics, and the researcher Montserrat Batet, from the KISON research group at the Open University of Catalonia, have developed a system that can automatically detect and hide confidential information in text documents. This allows the documents to be sent to third parties without compromising the privacy or anonymity of the individuals and organisations that the documents refer to.

Nowadays personal data are highly useful in many fields, such as research, business and planning. For example, patients’ data stored in medical records are essential for medical research, banking operations form the basis of financial analysis, and the analysis of commercial transactions helps to improve the personalisation of services. Given that many of these data are confidential, the documents in which they are contained need to be protected before they can be passed on to researchers. Using adequate protection mechanisms is therefore essential for ensuring the privacy or anonymity of individuals.

Although European Union legislation is very strict regarding the use of personal data without the consent of individuals, in other countries such as the United States, private documents are commonly required for legal processes, medical leave, insurance, etc. However, in all cases, such documents must not reveal any confidential information that could be used, for example, to discriminate against the individual.

The system eliminates or replaces sensitive terms

Until now, protecting confidential documents was carried out by one or more experts who would manually identify and eliminate words or sentences that could reveal sensitive or potentially discriminatory information. The process involved identifying sensitive terms, such as the names of contagious diseases or words that could allow sensitive terms to be deduced, such as combinations of medicines or treatments that are only used to treat a specific illness. Identifying these terms is an arduous task whose complexity means that mistakes can occur.

The method developed automates the entire process, which enables efficient handling and protection of the huge volume of data currently used in research. The system works by identifying any information available on the internet that could provide a third party with the knowledge needed to uncover confidential information in a protected document. It then hides any terms that could facilitate such a deduction.
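As a rough illustration of the idea in the previous paragraph, the sketch below flags terms that would make a sensitive term easy to deduce. It assumes that the strength of the link between two terms is estimated with pointwise mutual information computed from page counts. The article does not say which measure the system uses, so this choice, the counts, the threshold and the function names are all hypothetical.

```python
import math

# Hypothetical page counts, standing in for figures a web search might return.
TOTAL_PAGES = 1_000_000
PAGE_COUNTS = {"HIV": 20_000, "zidovudine": 500, "headache": 50_000}
JOINT_COUNTS = {("zidovudine", "HIV"): 400, ("headache", "HIV"): 1_000}


def pmi(term, sensitive):
    """Pointwise mutual information: log2(p(t, s) / (p(t) * p(s)))."""
    joint = JOINT_COUNTS.get((term, sensitive), 0)
    if joint == 0:
        # Terms never seen together give no evidence of a link.
        return float("-inf")
    p_joint = joint / TOTAL_PAGES
    p_term = PAGE_COUNTS[term] / TOTAL_PAGES
    p_sensitive = PAGE_COUNTS[sensitive] / TOTAL_PAGES
    return math.log2(p_joint / (p_term * p_sensitive))


def risky_terms(terms, sensitive, threshold):
    """Return the sensitive term itself plus any term strongly tied to it."""
    return [t for t in terms
            if t == sensitive or pmi(t, sensitive) >= threshold]


if __name__ == "__main__":
    print(risky_terms(["zidovudine", "headache", "HIV"], "HIV", threshold=2.0))
```

With these made-up counts, a medicine used almost exclusively for one illness scores far above the threshold, while a common symptom does not, so only the former would be hidden along with the sensitive term itself.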

Tests have shown that the method is more exhaustive than a human expert and offers a similar level of precision. Furthermore, in contrast to the experts, the system does not merely eliminate sensitive terms but also, as far as possible, tries to replace them with more general and therefore more ambiguous concepts. For example, rather than specifying that a patient has pneumonia, it indicates that the person has a respiratory ailment. This makes a protected document easier to understand and more useful for any subsequent analyses.
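The replacement step can be pictured with a minimal sketch. It assumes a small hand-written table of generalisations, whereas a real system would presumably draw on a full knowledge base. The table, the "[REDACTED]" placeholder and the function name are hypothetical and only illustrate replacing a term where a broader concept is known and removing it otherwise.

```python
import re

# Hypothetical mapping from specific terms (lower case) to broader concepts.
GENERALISATIONS = {
    "pneumonia": "a respiratory ailment",
    "bronchitis": "a respiratory ailment",
}


def sanitise(text, sensitive_terms, mapping=GENERALISATIONS):
    """Generalise each sensitive term if possible, otherwise redact it."""
    # Longest terms first, so multi-word terms win over their substrings.
    terms = sorted(sensitive_terms, key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        return mapping.get(match.group(0).lower(), "[REDACTED]")

    return pattern.sub(replace, text)


if __name__ == "__main__":
    note = "Diagnosed with pneumonia; prescribed zidovudine."
    print(sanitise(note, ["pneumonia", "zidovudine"]))
```

Here the illness is generalised because the table offers a broader concept, while the medicine, which has no entry, is simply removed.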

Implementation in research

Currently, the method has been implemented in prototype software which has been tested on clinical documents in English. The intention in the near future is to apply it to other knowledge areas and to establish it as a professional tool for use in research.

The research that has led to the development of the system has been conducted as part of the European CLARUS project, which investigates data privacy in the cloud. The project is coordinated by the URV and receives funding from the EU’s Horizon 2020 programme for the 2015-2017 period. It is also part of the UOC’s SmartGlacis project: Security and Privacy Technologies for Intelligent Cities, which is funded by the Spanish Ministry for the Economy and Competitiveness.

Reference: David Sánchez and Montserrat Batet: “Toward sensitive document release with privacy guarantees”. Engineering Applications of Artificial Intelligence. Volume 59, March 2017, Pages 23-34. DOI: 10.1016/j.engappai.2016.12.013
