What is anonymized data?

Anonymized data is data that has been stripped of personally identifiable information (PII).

Anonymized data can be helpful for research purposes, as well as for compliance with privacy regulations. Keep in mind that PII comes in many forms. The obvious ones are name, address and social security number, but it also includes things like IP address, biometrics and phone number. Data is only considered anonymized if a user can't be identified from any of this information.

The anonymity of data is important because properly anonymized data can't reasonably be used to identify anyone, even if hackers were to steal it. That makes it useful in situations where you need to analyze large amounts of data but want to protect the privacy of the people involved.

Put another way, anonymized data has been processed so that the original identifying characteristics are removed. The aim is that it can't be linked back to any specific person, even if it's combined with other information sources.

The term "anonymize" is something of a misnomer, because there is no way to guarantee that anonymized data can't be re-identified. However, anonymization techniques can make data less personal and reduce the risk of re-identification.
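One rough way to get a feel for re-identification risk is to count how many records share the same combination of remaining attributes. The sketch below is illustrative only: the field names (`zip`, `birth_year`, `gender`) and the sample records are made up, and a small group size is a warning sign rather than a formal guarantee in either direction.

```python
from collections import Counter


def smallest_group_size(records, fields):
    """Return the size of the smallest group of records that share
    the same values for `fields` (0 if there are no records)."""
    groups = Counter(tuple(record[f] for f in fields) for record in records)
    return min(groups.values(), default=0)


# Hypothetical records with names already removed.
records = [
    {"zip": "90210", "birth_year": 1980, "gender": "F"},
    {"zip": "90210", "birth_year": 1980, "gender": "F"},
    {"zip": "90210", "birth_year": 1975, "gender": "M"},
]

# A result of 1 means at least one record is unique on these fields,
# so it could potentially be matched against another data source.
print(smallest_group_size(records, ["zip", "birth_year", "gender"]))
```

Here the third record is the only one with its combination of values, so removing names alone would not stop someone who already knows those three facts from picking it out.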

How to anonymize data

While many organizations anonymize data at source (e.g. removing names and addresses from forms before they're processed), others choose to do so later in the process. The latter is often preferable because it is more efficient and lets you keep all your information together in one place rather than distributing copies across multiple sources.
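As a minimal sketch of anonymizing at source, the function below drops direct identifiers from a form submission before anything else sees it. The field names in `PII_FIELDS` and the sample submission are assumptions for illustration, not a complete list of PII.

```python
# Hypothetical set of direct identifiers; adjust to the fields your forms collect.
PII_FIELDS = {"name", "address", "ssn", "phone", "ip_address"}


def strip_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of `record` without the listed PII fields.
    The original record is left unchanged."""
    return {key: value for key, value in record.items() if key not in pii_fields}


submission = {"name": "Jane Doe", "address": "1 Main St", "age": 34, "plan": "basic"}
print(strip_pii(submission))  # {'age': 34, 'plan': 'basic'}
```

The same function could just as well run later in a pipeline; the only difference is when the identifying fields stop being stored.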

It's also possible to anonymize data retrospectively by de-identifying it after it's been collected or used for a certain period of time.

Methods of data anonymization

  • Generalization. This is the process of removing or coarsening parts of the data so that identifying someone from it is more difficult or impossible. For example, you might collect someone's postal code or ZIP code but remove the last 3 digits, keeping the data useful at a regional level while maintaining some discretion.
  • Pseudonymization. This is the process of replacing identifying parts of the data with different or private identifiers that are not those of the original source. In contrast to generalization, this method keeps a full picture of the data without exposing the identity of the user who disclosed it.
  • Data masking. This process hides or alters data values. It is a strong safeguard against reverse engineering the information, but the downside is that the original data may be more difficult to access for those who have permission to do so.
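The three methods above can be sketched in a few lines each. This is one possible interpretation under stated assumptions, not a definitive implementation: the function names, the `*` fill character, the choice of HMAC-SHA256 for pseudonyms, and the sample values are all illustrative.

```python
import hashlib
import hmac


def generalize_postal_code(code, drop=3, fill="*"):
    """Generalization: blank out the last `drop` characters of a postal code."""
    kept = max(len(code) - drop, 0)
    return code[:kept] + fill * (len(code) - kept)


def pseudonymize(value, secret_key):
    """Pseudonymization: replace a value with a keyed hash (HMAC-SHA256).
    The same value and key always give the same pseudonym, so records
    belonging to one person can still be grouped together."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value, visible=4, mask_char="*"):
    """Data masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - max(visible, 0), 0)
    return mask_char * hidden + value[hidden:]


print(generalize_postal_code("90210"))  # 90***
print(mask("555-867-5309"))             # ********5309

# Hypothetical key; in practice it would be stored separately from the data.
key = b"example-secret-key"
print(pseudonymize("jane@example.com", key) == pseudonymize("jane@example.com", key))  # True
```

Note that anyone holding the key can recompute the pseudonym for a value they already know, so the key needs the same protection as the original data.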