What is Data Profiling?

Data profiling is the process of discovering, analyzing, and displaying the content, structure, quality, and consistency of a given set of data. It is an exercise in summarizing a dataset into a profile, so that someone can look at it and get a bird's-eye view of what's in it and how it can be used. Profiling is especially useful for data scientists and analysts before they dive into a new dataset for analysis, and for data engineers or analytics engineers who are deciding how to clean a dataset up.

Data Profiling Meaning

Data profiling is a set of processes and tools used to understand the contents of a dataset. A data profiler obtains statistics on the content of a dataset, including basic summaries such as counts, averages, standard deviations, the number of missing and empty fields, and frequency distributions. Some profilers go further and compute more advanced measures, such as the entropy of a data element's values.
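As a minimal sketch of the kind of summary a profiler produces, the function below (a hypothetical helper, not from any particular tool) computes counts, missing values, a frequency distribution, Shannon entropy, and numeric statistics for a single column using only the Python standard library:

```python
import math
import statistics
from collections import Counter

def profile_column(values):
    """Summarize one column: counts, missing values, distinct values,
    a frequency distribution, Shannon entropy, and numeric stats."""
    present = [v for v in values if v is not None and v != ""]
    freq = Counter(present)
    total = len(present)
    # Shannon entropy in bits: higher means a more varied column.
    entropy = (
        -sum((n / total) * math.log2(n / total) for n in freq.values())
        if total else 0.0
    )
    profile = {
        "count": len(values),
        "missing": len(values) - total,
        "distinct": len(freq),
        "frequencies": dict(freq),
        "entropy_bits": round(entropy, 3),
    }
    # Numeric summaries only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["mean"] = statistics.mean(present)
        profile["stdev"] = statistics.stdev(present) if total > 1 else 0.0
    return profile

print(profile_column([10, 12, 12, None, 15, ""]))
```

Running this on the sample column reports 6 values with 2 missing, 3 distinct values, and an entropy of 1.5 bits, alongside the mean and standard deviation.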

Profiling is usually conducted prior to starting a major data migration or integration project. Data profiling helps the data architect understand the environment and make better decisions about how to map from one system to another.

Data profiling can also be included as part of an ongoing data quality monitoring program. A monitoring program will periodically run profiles and compare results with what was previously observed. This can help identify new problems introduced into the system (e.g., by business errors or software upgrades).

Why do you need data profiling?

The main aim of data profiling is to improve data quality. Profiling can also be used to determine whether data has reached its final destination and, if not, what needs to be done before importing it into the target database. It's an essential first step when creating a data governance framework within your organization, and it should be conducted on a regular basis.

Profiling can be applied to the whole set of data in a given source or to a sample of it; the aim in either case is to understand the source's content, structure, and quality.

What is the goal?

The goal of data profiling is to discover "data about the data" – that is, metadata:

  • the kind of analysis performed by a database designer when designing a database schema
  • the kind of analysis performed by an information systems architect when designing an integration between different databases

It is a form of metadata discovery. It helps build a "big picture" understanding of the dataset, and that understanding informs higher-level decisions when making changes to a dataset, a database, or even a warehouse.

How to do data profiling?

There are four common methods for discovering metadata during data profiling:

Automatic data flow analysis: the system automatically determines how data moves through the enterprise by correlating operations on the database to find which operations change the same columns in different tables (e.g., joins). User-specified mapping rules can assist this process where available. The results are shown visually as a series of connected boxes representing tables and columns, with lines indicating joins or other associations between them.

Column profiling: observing how many times each value appears within a column of a dataset.
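Column profiling can be sketched in a few lines with a frequency counter; the "status" column below is a hypothetical example:

```python
from collections import Counter

# Hypothetical "status" column from an orders table.
statuses = ["shipped", "pending", "shipped", "cancelled", "shipped", "pending"]

# Column profiling: count how often each value appears in the column.
value_counts = Counter(statuses)
for value, count in value_counts.most_common():
    print(f"{value}: {count}")
```

The resulting frequency distribution (shipped: 3, pending: 2, cancelled: 1) is exactly the kind of per-column summary a profiler reports.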

Cross-column profiling: observing the values, or how often values appear, across several columns, and drawing conclusions from the comparison.

Cross-table profiling: observing the values or number of values across several tables, and understanding how they compare to each other.

What's included?

  • Tagging: This is the practice of providing further context on data with descriptions, categories, and identifying keywords.
  • Metadata discovery: This is when metadata (data about the data) is collected and stored if it's relevant.
  • Statistics collection: Basic information about the data and metadata, such as counts, sums, and the values themselves, is gathered and indexed.
  • Data quality inspection: The relevance and quality of the data are measured and taken into account.
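The data quality inspection step can be sketched as a small rule-based check. The email column, the format pattern, and the pass/fail thresholds below are all hypothetical assumptions for illustration:

```python
import re

# Hypothetical email column to inspect.
emails = ["a@example.com", None, "not-an-email", "b@example.com", None]

total = len(emails)
nulls = sum(1 for e in emails if e is None)
# A deliberately simple format rule; real email validation is stricter.
valid = sum(1 for e in emails if e and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", e))

report = {
    "null_rate": nulls / total,    # share of missing values
    "valid_rate": valid / total,   # share passing the format rule
    # Assumed thresholds: at most 50% nulls, at least 40% valid.
    "passes": nulls / total <= 0.5 and valid / total >= 0.4,
}
print(report)
```

A monitoring program would run a report like this on a schedule and flag any column whose rates drift from what was previously observed.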
