Data iceberg

What is the Concept of a Data Iceberg?

The term "data iceberg" is not a standard term in data engineering or data science. However, it can be used metaphorically to describe situations where only a small part of a larger issue or phenomenon is visible, while the majority remains hidden or not immediately apparent.

  • Complexity of Data Issues: The visible problems or data points are the ones that are easy to observe, while deeper and more complex problems or data sets stay hidden or are only partially understood.
  • Hidden Insights in Data: An initial analysis may yield some insights, but much more potential for discovery and understanding lies beneath the surface.

What are the Visible Parts of a Data Iceberg?

The visible part of a data iceberg, its tip, consists of the most obvious data quality issues, which are easy to detect. These include misspelled names, duplicates, and missing values.

  • Misspelled Names: This refers to data entries where names have been spelled incorrectly. This can create confusion and inaccuracies in data analysis.
  • Duplicates: Duplicate entries can skew data analysis results, making it appear as though there is more data than there actually is.
  • Missing Values: Missing data can lead to incomplete analysis and potentially inaccurate results.
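
The surface-level checks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample records, the field names (`name`, `email`), the helper function names, and the 0.85 similarity cutoff are all assumptions made for the example, not part of any particular tool.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"name": "Alice Johnson", "email": "alice@example.com"},
    {"name": "Alice Johnson", "email": "alice@example.com"},      # exact duplicate
    {"name": "Alice Jonhson", "email": "a.johnson@example.com"},  # likely misspelling
    {"name": "Bob Smith", "email": None},                         # missing value
]

def find_duplicates(rows):
    """Return one copy of each row that appears more than once (exact match)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]

def find_missing(rows):
    """Return (row_index, field) pairs where the value is None or blank."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, value in row.items()
        if value is None or str(value).strip() == ""
    ]

def find_similar_names(rows, cutoff=0.85):
    """Return pairs of distinct names whose similarity ratio meets the cutoff."""
    names = sorted({r["name"] for r in rows})
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= cutoff:
                pairs.append((a, b))
    return pairs

duplicates = find_duplicates(records)
missing = find_missing(records)
similar = find_similar_names(records)
```

A similarity cutoff only suggests candidate misspellings; two genuinely different people can have near-identical names, so flagged pairs would still need review.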

What are the Hidden Parts of a Data Iceberg?

The hidden part consists of issues that are not immediately apparent, including subtle inconsistencies, latent inaccuracies, and complex relational discrepancies.

  • Subtle Inconsistencies: These are small discrepancies in data that might not be immediately noticeable but can significantly impact analysis.
  • Latent Inaccuracies: These are errors in data that are not immediately apparent but can lead to inaccurate analysis and decision-making.
  • Complex Relational Discrepancies: These are complex issues that arise when data from different sources or systems doesn't match up correctly.
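
Two of the hidden issues above lend themselves to a short sketch: records in one source that point to nothing in another (a relational discrepancy), and the same value spelled several ways (a subtle inconsistency). The tables, field names, helper functions, and the hand-written `COUNTRY_ALIASES` mapping below are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical data from two systems; table and field names are assumptions.
customers = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "usa"},
    {"customer_id": 3, "country": "United States"},
]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 4},  # no matching customer
]

def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose key value has no match in the parent rows."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]

def inconsistent_spellings(rows, field, canonical):
    """Group raw values of a field by the canonical value they map to.

    `canonical` is a hand-written alias mapping (an assumption for this
    sketch). Only canonical values reached by more than one raw spelling
    are returned.
    """
    groups = {}
    for row in rows:
        raw = row[field]
        key = canonical.get(raw.strip().lower(), raw)
        groups.setdefault(key, set()).add(raw)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Assumed alias table mapping lowercase spellings to one canonical code.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "united states": "US"}

orphans = orphaned_rows(orders, customers, "customer_id")
variants = inconsistent_spellings(customers, "country", COUNTRY_ALIASES)
```

Neither check would catch a latent inaccuracy such as a plausible but wrong value; those usually need comparison against a trusted reference.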

What are Some Examples of Data Platform Issues?

Some examples of data platform issues that may be visible at the surface include slow query performance, data inconsistency, and poor data quality.

  • Slow Query Performance: Queries take a long time to retrieve data from a database, which can hinder data analysis and decision-making.
  • Data Inconsistency: This refers to discrepancies in data across different databases or systems.
  • Poor Data Quality: This refers to data that is inaccurate, incomplete, or outdated, which can lead to inaccurate analysis and decision-making.

What are Some Examples of Data Issues?

Some examples of data issues include data incompleteness, data inaccuracy, tables with both raw and transformed data, and tables with both frequently updated data and stale data.

  • Data Incompleteness: This refers to data sets that are missing information, which can lead to incomplete analysis and potentially inaccurate results.
  • Data Inaccuracy: This refers to data that is incorrect or misleading, which can lead to inaccurate analysis and decision-making.
  • Tables with both raw and transformed data: When one table mixes the two, it is unclear which rows have already been processed, which can create confusion and inaccuracies in data analysis.
  • Tables with both frequently updated data and stale data: This can lead to inconsistencies in data analysis and potentially inaccurate results.
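
The last two items can be checked mechanically if a table carries the right metadata. The sketch below assumes each row has an `updated_at` timestamp and a `stage` label; both column names, the sample rows, and the 30-day freshness threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical rows; `updated_at` and `stage` are assumed column names.
rows = [
    {"id": 1, "stage": "raw", "updated_at": datetime(2024, 1, 10)},
    {"id": 2, "stage": "transformed", "updated_at": datetime(2024, 1, 9)},
    {"id": 3, "stage": "raw", "updated_at": datetime(2023, 6, 1)},
]

def stale_rows(table, as_of, max_age):
    """Return rows last updated more than `max_age` before `as_of`."""
    return [r for r in table if as_of - r["updated_at"] > max_age]

def stages_present(table):
    """Return the distinct processing stages found in the table, sorted."""
    return sorted({r["stage"] for r in table})

# A fixed reference time keeps the example deterministic.
as_of = datetime(2024, 1, 11)
stale = stale_rows(rows, as_of, timedelta(days=30))
mixed = stages_present(rows)
```

Here a table reporting more than one stage, or containing both fresh and stale rows, would be a candidate for splitting or for a closer look at how it is loaded.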
