Data Integration Projects

Confidentiality Information Series

Part 1 - What is confidentiality and why is it important?

Agencies collecting information from people and organisations have a legal and ethical responsibility to ensure:

that they respect the privacy of those providing the information; and
that individuals and organisations cannot be identified in a disseminated dataset.

There is a clear relationship between confidentiality and privacy. A breach of confidentiality can result in disclosure of information which might intrude on the privacy of a person or an organisation.

Confidentiality refers to the obligation of data custodians (agencies that collect information) to keep the confidential information they are entrusted with secret.

Why is confidentiality important?

Agencies collecting data often rely on the trust and goodwill of the Australian people to provide information.

Maintaining public trust helps to achieve better quality data and a higher response to data collections.

Protecting confidentiality is a key element in maintaining the trust of data providers.

This leads to reliable data to inform governments, researchers and the community.

Confidentiality and therefore trust can be broken when a person or organisation can be identified in a disseminated dataset, either directly or indirectly.

For example, a person could be directly identified in a dataset if that dataset contains their name and address. However, a person or an organisation could also be indirectly identified if there is a combination of information in the dataset from which their identity can be deduced.

Example: the combination of date of birth and a detailed area code (for example, a town where 300 people live) may enable identification as there will be some unique dates of birth in such a small area.
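
The check described in this example can be sketched in a few lines of Python. This is an illustration only: the records, the area codes and the function name unique_combinations are all made up for the purpose, and a real assessment would consider many more variables.

```python
from collections import Counter

# Hypothetical records holding only a date of birth and a detailed area code.
records = [
    ("1950-03-14", "T001"),
    ("1950-03-14", "T001"),
    ("1987-11-02", "T001"),
    ("1992-06-30", "T002"),
    ("1992-06-30", "T002"),
]


def unique_combinations(rows):
    """Return the combinations of values that occur exactly once in rows."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]


# A combination held by a single record may allow that person to be identified.
print(unique_combinations(records))  # [('1987-11-02', 'T001')]
```

In a small area, many dates of birth will appear only once, so a large share of records would be flagged in this way.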

What does ‘confidentialise’ mean?

The term confidentialise refers to the steps a data custodian must take to mitigate the risk that a particular person or organisation could be identified in a dataset, either directly or indirectly. Confidentialisation requires two key steps:

de-identification of the data, that is, the removal of any direct identifiers (e.g. name and address) from the data; and
assessment and management of the risk of indirect identification occurring in the de-identified dataset.
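
As a rough sketch of the first step only, direct identifiers can be dropped from each record before any further assessment. The field names and the helper de_identify below are assumptions made for illustration and are not taken from any particular system.

```python
# Assumed names of the fields that directly identify a person or organisation.
DIRECT_IDENTIFIERS = {"name", "address"}


def de_identify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of record without the direct identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


record = {
    "name": "Jane Citizen",
    "address": "1 Example Street",
    "age": 86,
    "town_population": 400,
}

# Only the first step is covered here; the remaining fields must still be
# assessed for the risk of indirect identification.
print(de_identify(record))  # {'age': 86, 'town_population': 400}
```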

De-identification alone does not necessarily protect the identity of individuals or organisations.

Removing identifying information such as name and address protects data providers from direct identification.

However, it may still be possible to indirectly identify a person or an organisation in a de-identified dataset. If enough detail is available, the identity of a particular person or organisation may be derived from the presence of a very rare characteristic or the combination of unique or remarkable characteristics.

Example: the identity of a person could be deduced if a dataset indicates the person is over 85 years old, has yearly income of more than one million dollars, and resides in a town of 400 people.

Example: the identity of a person with a very rare disease or health condition could be deduced even in highly aggregated data.

To protect the identity of individuals and organisations, both direct and indirect identification need to be considered.

Confidentialising data involves removing or altering information, or collapsing detail, to ensure that no person or organisation is likely to be identified in the data (either directly or indirectly).
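
Collapsing detail can be illustrated with a short Python sketch. The band width, the age threshold of 80 and the income cap of 200,000 are arbitrary assumptions chosen for the example, not recommended values, and the function names are hypothetical.

```python
def collapse_age(age, top_code=80):
    """Collapse an exact age into a ten-year band.

    All ages at or above top_code fall into a single open-ended group.
    top_code is assumed to be a multiple of ten so the bands do not overlap.
    """
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


def top_code_income(income, cap=200_000):
    """Replace any income above cap with the cap itself."""
    return min(income, cap)


print(collapse_age(34))            # 30-39
print(collapse_age(86))            # 80+
print(top_code_income(1_500_000))  # 200000
```

After these changes, a person aged 86 with an income over one million dollars shares their recorded values with everyone aged 80 or over whose income is at or above the cap, at the cost of some detail for analysis.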

There are various methods used to confidentialise data. These methods aim to protect the identity of individuals and organisations while enabling sufficiently detailed information to be released to make the data useful for statistical and research purposes.

The main techniques for confidentialising data are described below in "How to confidentialise data: the basic principles".

For more information about assessing and managing the risks of indirect identification in microdata, see "Managing the risk of disclosure in the release of microdata" below.

The confidentiality information series

This information sheet is part of a series designed to explain, and provide advice on, a range of issues around confidentialising data. The other sheets are below.

The obligation to protect identity and privacy

Confidentiality refers to the obligation of data custodians (agencies that collect information) to keep the confidential information they are entrusted with secret.

This obligation is recognised in the Privacy Act 1988. The obligation to protect confidential information is also reflected in legislation governing the collection, use and dissemination of information for specific government activities. Examples include the Social Security (Administration) Act 1999, the Taxation Administration Act 1953, and the Census and Statistics Act 1905 (see examples below). Penalties apply if the secrecy provisions set out in these Acts are breached.

As well as the requirements set out in legislation, obligations to protect a person’s or organisation’s identity and privacy are also outlined in government policies and principles. These provide advice on the protocols and procedures required to manage information safely. ‘High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes’ is one example of a set of principle-based obligations for Commonwealth government agencies.

Managing identification risks

Confidentiality, and therefore trust, can be broken when a person or an organisation can be identified in a disseminated dataset, either directly or indirectly.

One of the biggest challenges in making data publicly available is ensuring that no person or organisation is likely to be identified in the data.

Identification (often referred to as disclosure) occurs when someone learns something that they did not already know about another person or organisation through data that has been disseminated. This may be in the form of aggregate data (typically data presented in tables) or microdata (unit record data where each record represents observations for a person or an organisation).

Definitions

Microdata are unit record data containing individual responses to questions on survey questionnaires or administrative forms. For example, one person's answer to the question ‘In what year were you born?’ is an item of microdata.

Aggregate data (or macrodata) are produced by aggregating microdata. For example, a count of the number of people of a particular age (derived from answers to the question ‘In what year were you born?’) is an item of aggregate data.
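
The relationship between the two can be sketched in Python. The records and the helper aggregate_by are hypothetical. Counts are shown by year of birth; an age count would follow in the same way once a reference date is chosen.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent.
microdata = [
    {"respondent": 1, "birth_year": 1980},
    {"respondent": 2, "birth_year": 1980},
    {"respondent": 3, "birth_year": 1975},
    {"respondent": 4, "birth_year": 1980},
]


def aggregate_by(rows, field):
    """Count how many unit records share each value of field."""
    return dict(Counter(row[field] for row in rows))


# Aggregate data: a table of counts derived from the unit records.
print(aggregate_by(microdata, "birth_year"))  # {1980: 3, 1975: 1}
```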

How to confidentialise data: the basic principles

Managing the risks of identification in disseminated data (also called disclosure control) involves taking steps to evaluate and mitigate the risk that the identity of a particular person or organisation may be disclosed.

Risks of identification can be managed by confidentialising data. The aim is to protect the identity of a person or an organisation, while at the same time maximising the usefulness of the data.

In simple cases, data can be manually confidentialised. However, the use of software is sometimes necessary. Specialised skills and knowledge of the data are also required to correctly confidentialise a dataset to minimise the risk of identification.

This information sheet outlines some common techniques for confidentialising data. The information is provided as a guide only and gives simple examples, using data presented in tables, to illustrate the concepts. However, most of the techniques apply to both aggregate and microdata.

Issues specific to the management of the confidentiality of microdata are discussed in Information sheet 5, ‘Managing the risk of disclosure in the release of microdata’.

Managing the risk of disclosure in the release of microdata

"As the world becomes more complex and computing capabilities increasingly advance, the role of microdata in statistics has become much more important. Decision makers are increasingly turning to and requesting access to microdata." Statistical Journal of the International Association for Official Statistics 26 (2009/2010) 57-63

Microdata are unit record data where each record represents observations for a person or an organisation. Microdata contain individual responses to questions on survey questionnaires or administrative forms, and may include identifying information such as name, address, telephone number and age.

Microdata are a valuable resource for researchers and policy makers. The challenge for data custodians is striking the right balance between fulfilling obligations to protect the identity of individuals and organisations, and maximising the information available for statistical and research purposes. This requires careful weighing of the identification risks and benefits.

How confidentiality affects research

Agencies and users should work together to promote legislative, regulatory, and dissemination policies and practices that facilitate timely and cost-effective access to data for statistical research and policy analysis but do not permit full and open access by all of the public for any use. If confidentiality issues are not fully addressed in constructive and proactive ways, users face the very real risk of losing access to high quality data.

Source: Doyle P, Lane JW, Theeuwes JJM, Zayatz LM (2001) Confidentiality, disclosure and data access. Theory and practical applications for statistical agencies. North-Holland.

Confidentialisation techniques are applied to microdata (unit record data where each record represents observations for a person or organisation) to enable them to be made available to analysts and researchers. Without these techniques, access to valuable information for research and analytic purposes would be severely restricted. Although applying confidentialisation techniques generally leads to some loss of information, when confidentialisation is done well and with the key research objectives in mind, the information loss can be minimal. This fact sheet looks at the impact of confidentialisation on information availability for use in research and analysis.

There has always been a debate over the delicate balance between gaining full, unrestricted access to data (for researchers), and the application of confidentiality techniques to protect the privacy of data providers (by data custodians). Selecting the confidentiality techniques to be applied to microdata is a careful balance between fulfilling obligations to protect the identity of individuals and organisations and maximising the information available for statistical and research purposes.

The information required by the research sector is becoming more sophisticated over time. Increases in the use of techniques such as data linkage, data modelling, and data mining mean that researchers’ requirements are more detailed and more varied than ever before. As a consequence, there is more pressure on data custodians to provide greater access to microdata through high quality, detailed unit record files.

Users of microdata may be concerned that any reductions or changes made to datasets during the confidentialisation process may limit their ability to undertake analysis or research using the data, or may affect the results of an analysis. However, generally very few changes need to be made to the dataset, and in most cases these have no impact on statistical analyses.

The goal of confidentialisation is to protect the identity of individual respondents. The main types of data cells affected when confidentialising a dataset are:

rare events or characteristics;
unusual data (extreme high or low reported values); and
low count cross-classification cells.

Generally, statistical analysis is based on observing trends and patterns in data, and most statistical techniques rely on multiple events or individuals with similar characteristics from which to draw inferences. Where there are multiple observations with similar characteristics, the risk of individual identification is low, and the confidentialisation process generally would not result in any change to the data. The low frequency events and unusual values that are targeted in confidentialisation procedures are generally not amenable to statistical analysis.
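
One way to locate low count cross-classification cells is sketched below in Python. The data, the function name low_count_cells and the threshold of 3 are assumptions for illustration only; the appropriate threshold depends on the dataset and the custodian's own rules.

```python
from collections import Counter


def low_count_cells(rows, fields, threshold=3):
    """Return cross-classification cells holding fewer than threshold records.

    threshold is an illustrative assumption, not a recommended value.
    """
    counts = Counter(tuple(row[field] for field in fields) for row in rows)
    return {cell: n for cell, n in counts.items() if n < threshold}


# Hypothetical unit records cross-classified by age group and region.
rows = (
    [{"age_group": "65+", "region": "North"}] * 4
    + [{"age_group": "65+", "region": "South"}] * 1
    + [{"age_group": "25-44", "region": "North"}] * 3
)

print(low_count_cells(rows, ["age_group", "region"]))  # {('65+', 'South'): 1}
```

Cells with many records are left unchanged, which is consistent with the point that confidentialisation mainly affects values that contribute little to statistical inference.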

Introduction

Part 2 - Understanding re-identification