Data Integration Projects - How to determine the risk level - Risk assessment guidelines
The risk assessment guidelines (the Guidelines) help organisations involved in an integration project to assess the risk of a breach. This section outlines key components of the Guidelines, some thresholds of risk and some mitigation strategies. These are designed to be integrated into Commonwealth agency approval processes to assist with decisions about integration projects. A robust and consistent risk assessment process will ultimately increase confidence in data integration as a method of maximising the value of Commonwealth data. However, the Guidelines can be adapted to suit an agency’s context, as risks will vary by situation and organisation. For example, the same project may carry a different level of risk when undertaken by the Department of Social Services than when undertaken by the Department of Health because of the pre-existing conditions within each organisation.
Further research needs to be done to provide better guidance on effective mitigation strategies and measuring these risks. As the Guidelines are designed to evolve, changes will be made as the process matures.
Dimensions of risk
The Oversight Board approved eight dimensions of risk in August 2011. The dimensions are:
- sensitivity,
- size,
- nature of data collection,
- technical complexity,
- managerial complexity,
- duration of project,
- how the data is to be linked, and
- nature of access.
Originally these dimensions were suggested to describe the nature of a project and have since been used to assess the risk of a project. As they describe the whole nature of a project, some are less relevant to the risk of harm to a data provider and of a loss of public trust in Government. The definitions that follow refine these dimensions of risk to account for this redundancy.
- Nature of data collection has been redefined as consent, as this is the most relevant component.
- The definition of technical complexity has been tightened to focus on the challenges of appropriately confidentialising information.
- The size dimension has been split to refer to the number of quasi-identifying variables, and the amount of other, less identifying information about a data provider in a dataset.
- The ‘how the data is to be linked’ dimension has been dropped as it does not significantly affect the likelihood of a breach.
The following sections offer guidance on how to assess risk by:
- providing additional guidelines to determine whether a project can be considered ‘high’, ‘medium’ or ‘low’ risk,
- exploring the importance of each of the dimensions of risk, and
- discussing some additional risks that relate to each dimension.
This guidance is intended to provide a set of ‘rules of thumb’ rather than definitive advice. The data custodian may depart from it using their own judgment, provided they justify the reasons for the departure.
An adverse public perception of a data integration project, regardless of whether or not there is likely to be a data breach, may lead to a considerable loss in public trust in all government data collection activities, having a broad impact on all government departments. The consequence of this systemic risk must be considered before deciding to proceed further with a project.
Consequence of a breach
Sensitivity assesses the effect of a breach on the data providers.
The National Health and Medical Research Council’s National Statement on Ethical Conduct in Human Research (Endnote 13) provides a good framework for assessing the sensitivity of the data in relation to persons. The main elements of a risk of harm are:
- physical harm,
- psychological harm,
- economic harm,
- social harm,
- legal harm, and
- devaluation of personal worth.
The National Statement focuses on individuals, but many of the concepts are relevant to organisations.
Most ethical risk frameworks assess whether the risk might affect populations over which the data user has additional duty of care obligations. These populations are usually children, people with mental illness, and people with cognitive or intellectual impairments. This may also extend to other small population groups where their information may be sensitive, such as Aboriginal or Torres Strait Islander populations or groups from particular ethnic or religious backgrounds. A way to account for this additional duty of care is to increase the risk rating of a project if these populations are likely to be affected. For example, an initial sensitivity rating of ‘low’ may be increased to ‘medium’ if a project included children. However, there are exceptions to this rule. For example, some information about school children may be less sensitive if there is little variation in the dataset, or where the subject/topic is inherently of low sensitivity (for example, participation in sport).
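The duty-of-care adjustment described above can be sketched in code. This is a minimal illustration only, not part of the Guidelines: the function name `uplift_sensitivity` and its parameters are hypothetical, and raising the rating by exactly one step (including from ‘medium’ to ‘high’) is an assumption generalised from the single ‘low’ to ‘medium’ example in the text.

```python
LEVELS = ["low", "medium", "high"]


def uplift_sensitivity(initial: str,
                       vulnerable_population: bool,
                       exception_applies: bool = False) -> str:
    """Raise a sensitivity rating by one step when a project is likely to
    affect a population with additional duty-of-care obligations.

    Assumption: a one-step increase, capped at 'high'. The data custodian
    sets exception_applies when the data are judged inherently low in
    sensitivity (for example, participation in sport).
    """
    if initial not in LEVELS:
        raise ValueError("rating must be 'low', 'medium' or 'high'")
    if vulnerable_population and not exception_applies:
        step_up = min(LEVELS.index(initial) + 1, len(LEVELS) - 1)
        return LEVELS[step_up]
    return initial
```

For instance, `uplift_sensitivity("low", vulnerable_population=True)` gives `"medium"`, matching the example of a project that includes children.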
The Guidelines take a different position on the classification of harm in comparison to the National Statement on Ethical Conduct. The Guidelines classify harm risk ratings as the following:
- ‘High’ consequence involves a foreseeable risk of serious harm to data providers in any of the main elements of harm.
- ‘Medium’ consequence involves a foreseeable risk of any harm to data providers.
- ‘Low’ consequence involves no foreseeable risk of harm.
The reason for this is the need to focus the risk management effort of the Oversight Board on those projects that pose the greatest risk to public trust and thus have the greatest potential to undermine the value of Commonwealth data as a strategic asset. Many projects carry some risk of harm in the event of a breach, but in general this risk must be managed by those directly involved (the data custodians, the integrating authority, and the data users).
Consent is the component of the nature of data collection that impacts on public trust in the Government. If an agency has informed consent from the provider, the consequences of a breach may be lower. Consent may lower the consequence as data providers are aware, or partially aware, of the research being undertaken and the risks of participating in the research. To be able to consent, there must be an option to opt-out for the data provider (Endnote 14).
- ‘High’ consequence involves no consent, or coerced consent.
- ‘Medium’ includes partially-informed consent.
- ‘Low’ consequence exists when informed consent has been obtained and the risks of integration have been explained and are understood (Endnote 15).
There is much literature on the notion of consent. While most of it focuses on individuals, the same concepts can readily be applied to organisations.
Amount of information about a data provider
The number and nature of variables containing information about a data provider in a dataset affect the consequence of a breach. The National Statistical Service’s (NSS) Best Practice Guidelines for Integration will provide a number of examples on how to mitigate this risk. For example, the separation principle limits the number of variables in a given dataset by splitting datasets into linking variables and analysis variables.
Rating consequence risk
A project is assessed as having a ‘high’ consequence risk if:
- The consent risk or the sensitivity of data risk has been rated ‘high’, or
- The consent risk or the sensitivity of data risk has been rated ‘medium’ and the risk due to the amount of information about a data provider has been rated ‘high’.
A project with ‘low’ consequence risk has no dimensions rated ‘high’ and a maximum of one dimension assessed as ‘medium’.
If the amount of personal information is rated as ‘high’, but the other two are ‘low’ then the overall rating is ‘medium’.
All other combinations of risk are rated as ‘medium’.
The sensitivity of data and consent have been weighted equally, as both can have very serious consequences. The amount of personal information is covered to a certain degree by the sensitivity of the data dimension.
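The combination rules for consequence risk can be expressed as a small function. The sketch below is one possible reading of those rules, not an official implementation; the function name `consequence_rating` and the lowercase string ratings are assumptions made for illustration.

```python
def consequence_rating(sensitivity: str, consent: str, amount: str) -> str:
    """Combine the three consequence dimensions into an overall rating.

    Each argument is 'low', 'medium' or 'high'. 'amount' is the rating for
    the amount of information about a data provider.
    """
    ratings = (sensitivity, consent, amount)
    if not all(r in ("low", "medium", "high") for r in ratings):
        raise ValueError("each rating must be 'low', 'medium' or 'high'")

    # 'High' if sensitivity or consent is rated 'high'.
    if "high" in (sensitivity, consent):
        return "high"
    # 'High' if sensitivity or consent is 'medium' and amount is 'high'.
    if amount == "high" and "medium" in (sensitivity, consent):
        return "high"
    # 'Low' if nothing is 'high' and at most one dimension is 'medium'.
    if "high" not in ratings and ratings.count("medium") <= 1:
        return "low"
    # All other combinations, including amount 'high' with the other
    # two dimensions 'low', are 'medium'.
    return "medium"
```

Sensitivity and consent are treated symmetrically here, which reflects their equal weighting; the amount of information can raise the overall rating to ‘high’ only when one of the other two dimensions is at least ‘medium’.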
Likelihood of a breach
Likelihood of identification
Not all variables are equally identifying in nature. Some variables are subjective (for example, have you been depressed most of the time for more than a month). Some have low visibility to others (for example, do you play golf). Other variables may be easy to assess objectively (for example, do you work as a teacher). Some are highly identifying (for example, name and address). The number and nature of the variables contained in a dataset affect the likelihood of identification.
The inclusion of some variables, known as quasi-identifying variables, quickly increases the probability of spontaneous identification of records and identification through list matching (Endnote 16). For example, date of birth and country of birth are quasi-identifying variables. In isolation they are not identifying; however, in combination they may be unique to an individual. In many circumstances, three quasi-identifying variables in combination may be enough to enable identification. For example, the combination of a male working in the Department of Social Services, born on 29 August 1987 and living in Lyneham is likely to be identifying. Additionally, matching these details against a publicly available dataset may provide ample information to identify persons or organisations within the de-identified dataset. However, removing one of these pieces of information may remove the identifying nature.
The likelihood of a breach rises quickly with the number of quasi-identifying variables, especially where datasets are to be released at a unit record level, for example in a confidentialised unit record file (CURF). This risk is:
- ‘High’ where there is a high probability that an individual can be identified using a combination of quasi-identifying and other variables on a dataset.
- ‘Low’ where it is unlikely that quasi-identifying and other variables can be combined to identify an individual.
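One way to get a rough feel for how quickly quasi-identifying variables make records unique is to count how many records have a combination of values that no other record shares. The sketch below is illustrative only: the function name `unique_record_share`, the field names and the sample records are all hypothetical, and the Guidelines do not prescribe this measure or any numeric threshold for ‘high’ or ‘low’.

```python
from collections import Counter


def unique_record_share(records, quasi_identifiers):
    """Return the share of records whose combination of quasi-identifier
    values is not shared with any other record in the dataset."""
    if not records:
        return 0.0
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical records, made up for illustration.
records = [
    {"sex": "M", "birth_year": 1987, "suburb": "Lyneham"},
    {"sex": "M", "birth_year": 1987, "suburb": "Dickson"},
    {"sex": "F", "birth_year": 1990, "suburb": "Lyneham"},
    {"sex": "M", "birth_year": 1987, "suburb": "Lyneham"},
]

one_variable = unique_record_share(records, ["sex"])
three_variables = unique_record_share(records, ["sex", "birth_year", "suburb"])
```

In this made-up sample, a quarter of the records are unique on sex alone, and half are unique once birth year and suburb are added. Uniqueness within a dataset is only a proxy: actual identification also depends on what other information a data user can match against.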
The technical complexity of a project affects the likelihood of a breach. It includes the complexity of confidentialisation and of methodology.
- ‘High’ risk involves complex data that is likely to be published as a CURF, or multiple aggregate tables that require consequential confidentialisation.
- ‘Low’ risk involves publishing simple aggregated data with basic confidentialisation required.
Data governance becomes more complex as the number of people and organisations involved in an integration project increases. This potentially leads to diminishing control over data management practices and therefore increases the risk of a breach.
The more organisations involved, the harder it is for the data custodians to influence the practices of these organisations. The more people involved, the higher the risk of data being leaked or used inappropriately.
- ‘High’ managerial complexity risk would have:
  - four or more agencies involved in the process, or
  - thirty or more staff directly involved in the integration.
- ‘Low’ managerial complexity risk would have:
  - only one agency involved, and
  - fewer than ten staff directly involved in the integration.
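The managerial complexity thresholds can be sketched as a function. The name `managerial_complexity` is hypothetical, and returning ‘medium’ for every case that is neither ‘high’ nor ‘low’ is an assumption, since the thresholds above define only the two ends of the scale.

```python
def managerial_complexity(agencies: int, staff: int) -> str:
    """Rate managerial complexity from the number of agencies involved and
    the number of staff directly involved in the integration."""
    if agencies < 1 or staff < 0:
        raise ValueError("need at least one agency and a non-negative staff count")
    # 'High': four or more agencies, or thirty or more staff.
    if agencies >= 4 or staff >= 30:
        return "high"
    # 'Low': only one agency and fewer than ten staff.
    if agencies == 1 and staff < 10:
        return "low"
    # Assumption: anything in between is treated as 'medium'.
    return "medium"
```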
Duration of project
The longer data are stored following a project, the more likely a breach becomes. Similarly, the longer a project runs, the more likely a breach becomes. There are two reasons for the increased likelihood of a breach: data storage and documentation.
Data are poorly stored when access to them is poorly controlled or when they are exposed to external attack. The longer data remain in such storage, the more likely it is that one of these deficiencies will be exploited and a breach will occur.
Poor metadata, or data documentation, may lead to new staff publishing data without appropriate confidentialisation, or using data inappropriately.
- A project with ‘high’ duration of project risk would:
  - retain data for more than three years, or
  - run for more than three years.
- A project with ‘low’ duration of project risk would:
  - destroy data on the completion of the project, and
  - run for less than one year.
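The duration thresholds can be sketched in the same way. The function name `duration_risk` and its parameters are hypothetical; representing "destroy data on the completion of the project" as a retention period of zero years, and treating all remaining cases as ‘medium’, are assumptions made for this illustration.

```python
def duration_risk(project_years: float, retention_years: float) -> str:
    """Rate duration of project risk.

    project_years: how long the project runs.
    retention_years: how long data are kept after the project completes
    (0 means the data are destroyed on completion).
    """
    if project_years < 0 or retention_years < 0:
        raise ValueError("durations must be non-negative")
    # 'High': data retained, or project running, for more than three years.
    if retention_years > 3 or project_years > 3:
        return "high"
    # 'Low': data destroyed on completion and project under one year.
    if retention_years == 0 and project_years < 1:
        return "low"
    # Assumption: anything in between is treated as 'medium'.
    return "medium"
```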
Nature of access
The nature of access is concerned with the quality, consistency and coverage of governance and controls placed around access to data at all stages of the integration project.
Unrestricted or unaudited access is a ‘high’ risk.
‘Low’ risk requires:
- access to be granted on a demonstrated ‘need to know’ basis, and
- the separation principle to be applied, and
- regularly audited and restricted access.
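The access criteria can be sketched as a checklist. The function name `access_risk` and its boolean parameters are hypothetical, and rating access as ‘medium’ when it is restricted and audited but does not meet every ‘low’ requirement is an assumption.

```python
def access_risk(restricted: bool, audited: bool,
                need_to_know: bool, separation_applied: bool) -> str:
    """Rate nature of access risk from four yes/no governance checks."""
    # 'High': unrestricted or unaudited access.
    if not restricted or not audited:
        return "high"
    # 'Low': restricted and regularly audited access, granted on a
    # demonstrated need-to-know basis, with the separation principle applied.
    if need_to_know and separation_applied:
        return "low"
    # Assumption: controlled access that misses a 'low' requirement is 'medium'.
    return "medium"
```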
Rating likelihood risk
It is difficult to weight the likelihood dimensions of risk, as the importance and impact of these dimensions depend on the context in which they are applied. As a guide, the overall likelihood risk is:
- ‘high’ if three or more likelihood dimensions have been assessed as ‘high’, and
- ‘low’ if no dimensions are rated ‘high’ and fewer than three are rated ‘medium’.
Mitigation strategies, especially for work undertaken internally within a Commonwealth agency, should reduce the likelihood risk considerably.
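The guide for the overall likelihood rating amounts to counting how many dimensions fall at each level. The sketch below assumes that every combination not covered by the ‘high’ or ‘low’ rule is ‘medium’, which the text does not state explicitly; the function name `likelihood_rating` and the dimension keys are hypothetical.

```python
from typing import Mapping


def likelihood_rating(dimension_ratings: Mapping[str, str]) -> str:
    """Combine per-dimension likelihood ratings into an overall rating.

    dimension_ratings maps a dimension name (for example 'duration')
    to 'low', 'medium' or 'high'.
    """
    values = list(dimension_ratings.values())
    if not all(v in ("low", "medium", "high") for v in values):
        raise ValueError("each rating must be 'low', 'medium' or 'high'")
    high = values.count("high")
    medium = values.count("medium")
    # 'High' if three or more dimensions are rated 'high'.
    if high >= 3:
        return "high"
    # 'Low' if none are 'high' and fewer than three are 'medium'.
    if high == 0 and medium < 3:
        return "low"
    # Assumption: every other combination is 'medium'.
    return "medium"


example = likelihood_rating({
    "identification": "medium",
    "technical_complexity": "low",
    "managerial_complexity": "medium",
    "duration": "low",
    "access": "low",
})
```

Because the weighting of dimensions depends on context, a data custodian may reasonably depart from a simple count such as this one.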
Legislation influences the way data are used and shared. Legislation that broadly covers data dissemination includes:
- Privacy Amendment (Enhancing Privacy Protection) Act 2012
- Privacy Act 1988
- Privacy and Personal Information Protection Act 1998 (NSW)
- Health Records and Information Privacy Act 2002 (NSW)
- Information Act 2002 (NT)
- Information Privacy Act 2009 (Qld)
- Personal Information Protection Act 2004 (Tas)
- Information Privacy Act 2000 (Vic)
- Freedom of Information Act 1992 (WA)
- Industry Research and Development Act 1986
- Pooled Development Funds Act 1992
- Venture Capital Act 2002
However, there are many specific pieces of legislation that govern the use of data. An example of specific legislation is the Social Security (Administration) Act 1999 which governs the way data collected by Centrelink is used and shared.
Likelihood and consequence risk matrix
The likelihood and consequence risk matrix assists in determining the overall risk of a project. The overall risk rating is determined by the combination of the likelihood and consequence risk ratings. Once this rating is known, risk mitigation strategies can be identified and applied.
There are many possible mitigation strategies that can be applied to reduce the likelihood risks. However, there are very few mitigation strategies that can be applied to the consequence risk dimensions without changing the scope of the project. For example, a dataset with highly sensitive data can have the sensitive data removed; however, this changes the project as the component data is now different.
Risk mitigation examples
Data labs for external data users
Requiring data users to use secure data labs ensures that the IT environment is more secure, limiting the potential for loss or theft of the data.
The data supplied by data providers may be subject to the Privacy Act and/or protected by one or more confidentiality/secrecy provisions that govern the management of that data. Data custodians, integrating authorities and data users are obliged to comply with the Privacy Act and the confidentiality/secrecy provisions in relevant legislation governing the collection, use and disclosure of the information.
Elements of the integration Best Practice Guidelines
Some elements of the integration Best Practice Guidelines can be applied more easily than others, and they can be applied to different extents. For example, applying the separation principle in full may be costly and may be an operational challenge. However, separating a dataset into linking and analysis variables is relatively straightforward and reduces the size of the datasets. Therefore, if one of the datasets is compromised, only a subset of the information is exposed, and harm may be averted. The consequence of a breach of one of these datasets is therefore lower than that of a breach of the combined dataset.
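Separating a dataset into linking and analysis variables can be sketched as follows. This is a simplified illustration rather than the procedure set out in the Best Practice Guidelines: the function name `separate`, the `link_key` field and the sample record are hypothetical, and a real project would also control who can access each dataset and how the keys are stored.

```python
import secrets


def separate(records, linking_vars, analysis_vars):
    """Split each record into a linking part and an analysis part.

    The two parts share only a random key, so neither dataset on its own
    holds both the identifying variables and the analysis content.
    """
    linking, analysis = [], []
    for record in records:
        key = secrets.token_hex(8)  # random key with no meaning outside the project
        linking.append({"link_key": key, **{v: record[v] for v in linking_vars}})
        analysis.append({"link_key": key, **{v: record[v] for v in analysis_vars}})
    return linking, analysis


# Hypothetical record, made up for illustration.
sample = [{"name": "A. Citizen", "date_of_birth": "1987-08-29", "payment_type": "X"}]
linking, analysis = separate(sample, ["name", "date_of_birth"], ["payment_type"])
```

After the split, the analysis dataset contains no names or dates of birth, and the linking dataset contains no payment information.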
Experienced data integrating staff
Using experienced staff ensures that processes are able to run more smoothly and efficiently. They will also be more likely to have an understanding of the data governance that applies to data integration and the purpose of the governance. Therefore, there is a lower risk of breaches resulting from negligence or ignorance.
This is by no means an exhaustive list of mitigation strategies. However, most mitigation strategies will impact on the IT environment, staff accountability and organisational procedures. As the risk assessment process matures, more mitigation strategies will become apparent. Choosing which to implement will in general be up to the data custodian and integrating authority involved in a data integration project. The key element of the post-mitigation risk assessment is how, and by how much, the proposed mitigation strategies will lower the overall risk of a data integration project. The justification needs to satisfy the data custodians, integrating authority and, ultimately, the Oversight Board.
13 The NHMRC National Statement on Ethical Conduct in Human Research can be found here: https://www.nhmrc.gov.au/about-us/publications/national-statement-ethical-conduct-human-research-2007-updated-2018
14 Opt-out refers to the concept of being able to decline to be involved in a research project without fear of repercussions. For example, those on government payments would have to be reassured that declining consent for their information to be used for research purposes will not jeopardise their current or future ability to claim payments. It is not enough to say that a payment is voluntary and that those who do not want their information used can therefore choose not to receive the payment.
15 The Australian Communications and Media Authority’s (ACMA) paper Community research on informed consent: Qualitative research report (2011) notes that “[customers] often gave ‘consent’ but claimed that in reality it was not always ‘informed consent’, as…they often provided consent without a full understanding and comprehension of the terms and conditions of the agreement.”
16 List matching is where a user has access to another source of data, such as an external administrative dataset, and attempts to match the two datasets using common data items.