Data Integration - Glossary
- 1 Accreditation
- 2 Administrative data
- 3 Agency head
- 4 Analysis data
- 5 Authorisation
- 6 Commonwealth data
- 7 Confidential information
- 8 Confidentialise
- 9 Confidentiality
- 10 Consent
- 11 Content data
- 12 Cross Portfolio Data Integration Oversight Board
- 13 Data custodians
- 14 Data integration
- 15 Data linking
- 16 Data provider
- 17 Data user
- 18 De-identified data
- 19 Deterministic (exact) linking
- 20 End users
- 21 Ethics committee
- 22 Event data
- 23 Exact linking
- 24 Express consent
- 25 Identifiable data
- 26 Identification
- 27 Identifier(s)
- 28 Integrated dataset
- 29 Integrating authority
- 30 Implied consent
- 31 Linkage key
- 32 Linking
- 33 Metadata
- 34 Microdata
- 35 Personal information
- 36 Privacy
- 37 Privacy Act
- 38 Privacy Impact Assessment
- 39 Probabilistic linking
- 40 Re-identifiable data
- 41 Research purposes
- 42 Secretariat to the Cross Portfolio Data Integration Oversight Board
- 43 Separation principle
- 44 Statistical data integration
- 45 Statistical and research purposes
- 46 Statistical disclosure control
- 47 Statistical disclosure control techniques
- 48 Statistical outputs
- 49 Statistical purposes
- 50 Statistical linkage key
- 51 Source data
- 52 Trusted Institution
- 53 Unique identifier
- 54 Unit record data
Accreditation
To gain accreditation, integrating authorities must be approved by the Cross Portfolio Data Integration Oversight Board (the Oversight Board) as having the requisite expertise, skills and knowledge, infrastructure and secure environment to undertake high risk data integration projects involving Commonwealth data for statistical and research purposes.
For more information on the accreditation process, see ‘The interim accreditation process for Integrating Authorities’.
Administrative data
Information (including personal information) collected by agencies for the administration of programs, policies or services (e.g. Medicare data, taxation data). Administrative data is one type of unit record level data.
Agency head
The person legally accountable for the activities of an organisation, and those of its staff and affiliates. For example,
- Government Department – the Secretary of the Department
- Private sector – CEO, Company Secretary or Managing Director
- University – Vice-Chancellor, Pro Vice-Chancellor, Deputy Vice-Chancellor or University Registrar.
Analysis data
Analysis data (also referred to as content data or event data) is survey data or the administrative or clinical information from a record, that may be used for statistical or research purposes.
Examples of analysis data include clinical information, benefit information and company profits. This information does not contain name and address or other information that may identify an individual or an organisation.
Authorisation
A data custodian must be authorised to release identifiable data to the integrating authority, either by the data custodian’s legislation, the legislation of the integrating authority or by consent from the data provider (that is, the person, household, business or other organisation who originally supplied the data for statistical or administrative purposes), where this is not precluded by legislation.
For more information, see Authorisation to release identifiable data.
Commonwealth data
Commonwealth data includes any dataset containing information collected by, or on behalf of, the Australian government or any dataset containing information collected by another Australian jurisdiction and provided to the Australian government for the common good.
Confidential information
Confidential information is any information with restrictions placed on the communication or dissemination of that information. This includes all the different kinds of information that are collected, stored and used by data custodians and which are subject to statutory restrictions and obligations under legislation including personal information protected by the Privacy Act 1988.
Confidentialise
To remove or alter information, suppress or collapse detail within a dataset, to ensure that no person or organisation is likely to be identified in the data (directly or indirectly).
For more information on confidentiality, including information on popular techniques for confidentialising data, see the Confidentiality Information Series on the National Statistical Service website.
Confidentiality
The legal and ethical obligation to the provider of information to maintain their privacy and protect the secrecy of their information. Also see Confidentialise.
For more information on confidentiality, including information on popular techniques for confidentialising data, see the Confidentiality Information Series.
Content data
See Analysis data.
Cross Portfolio Data Integration Oversight Board
This Board was established in 2011 to oversee the development of a cross government environment that is safe and effective for data integration involving Commonwealth data for statistical and research purposes. The Board is chaired by the Australian Statistician and membership includes the Secretaries of the Department of Health, the Department of Human Services and the Department of Social Services.
For more information, see the Cross Portfolio Data Integration Oversight Board Terms of Reference on the National Statistical Service website.
Data custodians
Data custodians collect and hold information on behalf of a data provider. The role of data custodians may also extend to producing source data, in addition to their role as a holder of datasets. Data custodians are responsible for managing the use, disclosure and protection of source data used in a statistical data integration project.
For more information, see the data custodian section in the guide.
Data integration
Data integration involves bringing together multiple datasets, generally at the unit record level (i.e. for a person or organisation) or micro level (e.g. information for a small geographic area), to provide new datasets for statistical or research purposes. Data integration refers to the full range of management and governance practices around the process, including project approval, data transfer, linking and merging the data and dissemination.
Data linking
Data linking is an element in the process of data integration. Data linking creates links between data from different sources based on common features present in those sources. Also known as ‘data linkage’ or ‘data matching’, data are combined at the unit record or micro level.
For more information, see the Data Linking Information Series.
Data provider
An individual, household, business or other organisation which originally supplied data to the data custodian either for statistical or administrative purposes.
Data user
A researcher who accesses and investigates integrated datasets at the unit record or aggregate level for statistical and research purposes. Data users include academics working in research institutions and employees in Commonwealth and State/Territory government agencies undertaking research. Data users are differentiated from end users of the data.
For more information, see the data user section in the guide.
De-identified data
De-identified data is data that has had direct identifiers removed (i.e. information that directly establishes the identity of an individual or organisation, such as name, address, Australian Business Number).
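As a rough illustration of de-identification, the sketch below drops direct identifier fields from a unit record. The field names (name, address, abn) and the sample record are assumptions made for the example, not part of any real dataset. Removing direct identifiers does not by itself rule out indirect identification from the remaining fields.

```python
# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "address", "abn"}


def de_identify(record):
    """Return a copy of the record without its direct identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


# Made-up unit record for illustration only.
record = {
    "name": "John Smith",
    "address": "1 Example St",
    "age_group": "40-49",
    "benefit": "Type A",
}
de_identified = de_identify(record)
```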
Deterministic (exact) linking
Deterministic linking (also called exact linking) is a linking methodology that combines records that match exactly on a unique identifier. This may be an existing identifier (such as a social security number or Australian Business Number) or a created one (such as a linkage key).
For more information, see the Data Linking Information Sheet: Deterministic linking and linkage keys on the National Statistical Service website.
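The idea of exact matching on a created linkage key can be sketched as follows. The key construction here (a SHA-256 hash of a normalised name and date of birth) is an assumption chosen for illustration and is not the method prescribed by any particular linkage key standard. The function names and sample records are likewise hypothetical.

```python
import hashlib


def make_linkage_key(name, date_of_birth):
    """Build an illustrative linkage key from normalised identifying details."""
    normalised = f"{name.strip().lower()}|{date_of_birth}"
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def link_exact(left, right):
    """Pair records from two datasets whose linkage keys match exactly."""
    index = {}
    for rec in right:
        index.setdefault(rec["key"], []).append(rec)
    return [(l, r) for l in left for r in index.get(l["key"], [])]


# Made-up records; only the key and analysis data are carried.
health = [{"key": make_linkage_key("John Smith", "1970-01-31"), "condition": "asthma"}]
benefits = [
    {"key": make_linkage_key(" john smith ", "1970-01-31"), "benefit": "Type A"},
    {"key": make_linkage_key("Jon Smith", "1970-01-31"), "benefit": "Type B"},
]
pairs = link_exact(health, benefits)
```

In this sketch the record with the misspelt name ("Jon Smith") is not linked, because deterministic linking requires the keys to agree exactly.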
End users
A person who examines, uses and undertakes secondary analysis of aggregate statistical output or the research findings of data users. Examples of end users include employees undertaking research in public and private sector organisations, representatives from media outlets and consumer advocacy groups, and members of the wider community.
Ethics committee
Shortened form of Human Research Ethics Committee (HREC). HRECs protect the welfare and rights of participants involved in research. They review proposals for research that involves humans, monitor the conduct of research and deal with complaints that arise from research. In the context of data integration involving Commonwealth data, some data custodians require that an ethics committee approve a data integration project prior to the release of data. Ethics approval does not, however, guarantee that approval for data release will be given. More information on HRECs, including a list of registered HRECs, is available from the National Health and Medical Research Council.
Event data
See Analysis data.
Express consent
Express consent is given explicitly, either orally or in writing, and does not require any inference on the part of the organisation seeking consent. The best evidence of express consent is when a person has to do something deliberate to indicate their consent, such as ticking a consent box or signing a statement giving their consent. The key elements of consent are that it is provided voluntarily, the individual is adequately informed and the individual has the capacity to understand, provide and communicate their consent.
For more information, see Implied consent.
Identifiable data
Identifiable data enables someone to establish the identity of a person or organisation to which some data relate. The identity of a person or organisation could be established directly if the dataset contains identifiers such as name and address, or indirectly if there is a combination of information in the dataset from which their identity can be deduced.
Identification
Identification can occur in two ways. Direct identification occurs when a direct identifier (e.g. name, address, Medicare number, Australian Business Number) is included with the data that can be used to establish the identity of a person or an organisation.
Indirect identification occurs when the identity of an individual or organisation is disclosed, not through the use of direct identifiers, but through a combination of unique characteristics about the person or organisation.
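One simple way to look for indirect identification risk is to count how many records share each combination of characteristics and flag combinations held by a single record. The sketch below assumes hypothetical fields (age_group, sex, postcode) and made-up records. It is an illustrative check only, not a complete risk assessment.

```python
from collections import Counter


def unique_combinations(records, fields):
    """Return the combinations of field values that occur in exactly one record."""
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    return [combo for combo, n in combos.items() if n == 1]


# Made-up records with no direct identifiers.
records = [
    {"age_group": "40-49", "sex": "M", "postcode": "2600"},
    {"age_group": "40-49", "sex": "M", "postcode": "2600"},
    {"age_group": "80-89", "sex": "F", "postcode": "2600"},
]
at_risk = unique_combinations(records, ["age_group", "sex", "postcode"])
```

Here the single record for a woman aged 80-89 is flagged, since that combination of characteristics is unique in the dataset.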
Identifier(s)
Information that directly establishes the identity of an individual or organisation. Examples of identifiers are: name, address, driver's licence number, Medicare number and Australian Business Number. Also known as direct identifiers.
Integrated dataset
A dataset created by bringing together two or more datasets, generally at the unit record level (i.e. at the individual person or business level) or micro level (e.g. information for a small geographic area).
Integrating authority
The integrating authority is the single organisation ultimately accountable for the sound conduct of the statistical data integration project. It is responsible for the implementation of the data integration project and the management of the integrated datasets throughout their entire life cycle, ensuring full compliance with commitments made to data custodians, and in line with the high level principles and supporting governance and institutional arrangements. The integrating authority is also responsible for providing researchers with safe and secure access to the integrated data in line with the requirements of data custodians.
For more information, see integrating authorities in the guide.
Implied consent
Implied consent arises where consent may be reasonably inferred in the circumstances from the conduct of the individual and the agency collecting the data. The key elements of consent are that it is provided voluntarily, the individual is adequately informed and the individual has the capacity to understand, provide and communicate their consent.
For more information, see Express consent.
Linkage key
A code which is created as a unique identifier to replace identifying details for a person, business, or organisation, such as name and address.
Linking
See Data linking.
Metadata
Metadata is information about data. It provides data users with information to help them understand the available data and its limitations. Metadata may include information about concepts, classifications, quality, scope and coverage of the data collections, as well as sources of error and areas where careful interpretation is required when using the data.
For more information, see Providing metadata in the guide.
Personal information
Information or an opinion (including information or an opinion forming part of a database), whether true or not, and whether recorded in a material form or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion. Source: Privacy Act 1988
Privacy
In the context of data integration, privacy refers to the protection of an individual’s personal information as defined by the Privacy Act. The Privacy Act 1988 is an Australian law which regulates how personal information is collected, used, stored and disclosed.
Privacy Act
The Privacy Act 1988 is an Australian law which regulates the handling of personal information about individuals. This includes the collection, use, storage and disclosure of personal information, and access to and correction of that information. Most Australian and Norfolk Island Government agencies and some private sector agencies are bound to privacy protections under the Australian Privacy Principles contained in schedule 1 of the Privacy Act 1988.
Source: www.oaic.gov.au
Privacy Impact Assessment
Privacy Impact Assessments are a useful tool to assess the potential privacy impacts of the project. They are used for mapping the project’s personal (or business) information flows and documenting procedures for collection, use, disclosure and retention of this information, as well as the associated legislative and organisational rules. This provides a basis for analysing risks to privacy and examining how these risks will be managed, avoided or reduced.
For more information on Privacy Impact Assessments, see the Privacy Impact Assessment Guide, Office of the Australian Information Commissioner, www.oaic.gov.au.
Probabilistic linking
Data linking based on the relative likelihood that two records refer to the same entity given a set of similarities/differences between the values of the linking variables (e.g. name, date of birth, sex) on the two records. Complex methods and sophisticated data linking software are used to achieve high quality results.
For more information, see the Data Linking Information Sheet: Probabilistic linking.
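A common way to express "relative likelihood" in probabilistic linking is a match weight that sums log-likelihood ratios over the linking variables, in the style usually attributed to Fellegi and Sunter. The sketch below assumes that approach. The m and u probabilities (the chance that a field agrees for a true match and for a non-match respectively) are invented for illustration, and real linking software uses more refined comparisons than exact equality.

```python
import math

# Illustrative (m, u) probabilities per linking variable; assumed values only.
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b):
    """Sum agreement and disagreement weights across the linking variables."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


# Made-up records for illustration.
a = {"name": "john smith", "date_of_birth": "1970-01-31", "sex": "M"}
b = {"name": "jon smith", "date_of_birth": "1970-01-31", "sex": "M"}
c = {"name": "mary jones", "date_of_birth": "1985-06-02", "sex": "F"}
```

Under these assumed probabilities, the pair (a, b) still receives a positive weight despite the name disagreement, while (a, c) receives a negative weight. In practice a threshold on the weight decides which pairs are accepted as links.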
Re-identifiable data
Data from which identifiers have been removed and replaced by a code (such as a linkage key), but where it remains possible to re-identify a specific individual if there is a combination of information in the dataset from which their identity can be deduced. This may include the code used for linking.
For more information, see Identifiable data.
Research purposes
Activities to investigate or explain phenomena, which result in statistical outputs or conclusions drawn in relation to population groups and not in relation to specific individuals, households, businesses or organisations.
Secretariat to the Cross Portfolio Data Integration Oversight Board
The Secretariat was established in 2011 to provide support to the Board and its ongoing activities. The Secretariat provides a central contact point in government for issues relating to data integration involving Commonwealth data for statistical and research purposes, whether coming from data custodians, integrating authorities, researchers or the public.
Separation principle
The separation principle is one mechanism to protect the identities of individuals and organisations in datasets. The separation principle means that no one individual can see the identifying or demographic information, used to identify which records relate to the same person or organisation (e.g. name, address, date of birth), in conjunction with the analysis data (e.g. clinical information, benefit information, company profits). Instead, staff can see only the information they need to do the linking or analysis. So, rather than someone being able to see that John Smith has a rare medical condition, the person doing the linking sees only the information needed to do the linking (e.g. John Smith's name and address) and the analyst just sees a record, with no identifying information, showing that a person has a rare medical condition together with any other variables needed for analysis (e.g. broad age group, sex).
For more information, see the Separation Principle section in the guide.
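The separation described above can be sketched as splitting each record into a linking view and an analysis view that share only an arbitrary record identifier. The field names, the use of a random token as the shared identifier and the sample record are all assumptions for illustration. Actual implementations also rely on access controls and staff roles that code alone does not capture.

```python
import secrets

# Hypothetical fields used only for linking in this sketch.
IDENTIFYING_FIELDS = ("name", "address", "date_of_birth")


def separate(records):
    """Split records into a linking view and an analysis view."""
    linking_view, analysis_view = [], []
    for rec in records:
        record_id = secrets.token_hex(8)  # arbitrary id with no meaning of its own
        linking_view.append(
            {"record_id": record_id, **{f: rec[f] for f in IDENTIFYING_FIELDS}}
        )
        analysis_view.append(
            {
                "record_id": record_id,
                **{k: v for k, v in rec.items() if k not in IDENTIFYING_FIELDS},
            }
        )
    return linking_view, analysis_view


# Made-up record for illustration.
records = [
    {
        "name": "John Smith",
        "address": "1 Example St",
        "date_of_birth": "1970-01-31",
        "condition": "rare condition",
        "age_group": "40-49",
        "sex": "M",
    }
]
linking_view, analysis_view = separate(records)
```

The person doing the linking would be given only linking_view, and the analyst only analysis_view, so neither sees identifying details alongside analysis data.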
Statistical data integration
Statistical and research purposes
Using data for ‘statistical and research purposes’ means using it to describe characteristics of groups within the population, and relationships that might exist between variables such as social and economic conditions, behaviours and outcomes. Data used for statistical or research purposes cannot be used in a way that has a direct effect on a person, family, household or organisation (e.g. the data cannot be used for detecting fraud nor for ensuring compliance).
Statistical disclosure control
Involves managing the risk of an individual or organisation being identified, either directly or indirectly, through released data. This risk is managed by confidentialising the data.
For more information, see the Confidentiality Information Series on the National Statistical Service website.
Statistical disclosure control techniques
Techniques for confidentialising a dataset to minimise the risk that the identity of a particular individual or organisation may be disclosed. Two broad statistical disclosure control techniques are data reduction methods which aim to control or limit the amount of detail available without compromising the usefulness of the information available for research, and data modification methods (perturbation) which involve changing the data slightly to reduce the risk of disclosure.
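The two broad families of techniques can be illustrated with a small sketch. Collapsing ages into bands and suppressing small cells stand in for data reduction, and rounding counts to a base stands in for data modification. The band width, suppression threshold, rounding base and sample ages are all assumed values, and the techniques used in practice are more sophisticated.

```python
from collections import Counter


def age_band(age, width=10):
    """Data reduction: collapse an exact age into a band such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def suppress_small_cells(counts, threshold=3):
    """Data reduction: replace counts below the threshold with None."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def round_to_base(n, base=3):
    """Data modification: round a count to the nearest multiple of base."""
    return base * round(n / base)


# Made-up ages for illustration.
ages = [23, 27, 29, 31, 45, 47, 48, 49, 52]
counts = Counter(age_band(a) for a in ages)
suppressed = suppress_small_cells(counts)
rounded = {cell: round_to_base(n) for cell, n in counts.items()}
```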
Statistical outputs
The result of any collection, storage, analysis and transformation of data where the individual statistical unit is of no interest in itself, and the results are presented in a form that does not reveal information about identifiable individuals.
Statistical purposes
Purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, and the dissemination of those outputs and information describing them. Statistical purposes include the collection or use of information to provide for the drawing of a sample of statistical units for data collection.
Statistical linkage key
See Linkage key.
Source data
The original data, prior to data integration, held by the data custodians on behalf of the data providers.
Trusted Institution
Trusted institutions are those that have a compatible institutional purpose and are judged by the data custodians to be able to provide a secure environment to ensure the confidentiality of the data. This environment will include the skills, the values, the technical infrastructure and the policy and legislative coverage deemed necessary to provide adequate protection.
Unique identifier
A number or code that uniquely identifies a person, business or organisation, such as passport number, Customer Reference Number or Australian Business Number.
Unit record data
Unit record data refers to data where each record represents observations for an individual or organisation. Unit record data may contain individual responses to questions on a survey questionnaire or administrative forms. For example, a unit record would have one person's answers given to the questions ‘In what year were you born?’, ‘What is your address?’ and ‘What is your employment status?’.