What is a Data Lake?

As the name implies, a data lake is a repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags.
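The flat architecture described above can be sketched in a few lines: every element gets a unique identifier and a set of metadata tags, with no folder hierarchy at all. This is a minimal illustrative sketch, not a real data lake API; the class and method names are assumptions.

```python
import uuid

class FlatStore:
    """A toy flat data store: raw elements keyed by unique ID, tagged with metadata."""

    def __init__(self):
        self.elements = {}  # element id -> (raw bytes, metadata tags)

    def put(self, raw: bytes, tags: dict) -> str:
        # Assign a unique identifier; no path or folder is involved.
        element_id = str(uuid.uuid4())
        self.elements[element_id] = (raw, tags)
        return element_id

    def find(self, **wanted):
        # Locate elements whose metadata tags match every requested key/value.
        return [eid for eid, (_, tags) in self.elements.items()
                if all(tags.get(k) == v for k, v in wanted.items())]

store = FlatStore()
eid = store.put(b'{"widget": 1}', {"source": "pos", "region": "CA"})
matches = store.find(region="CA")  # finds the element by its tags, not by path
```

The point of the sketch is that retrieval works through identifiers and metadata tags rather than through a directory hierarchy.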

What are Data Lakes used for?

When a business question arises, the user can select the appropriate subset of data from the lake. For example, if you want to know how many widgets were sold in California last month, you could query the entire dataset for sales transactions that occurred in California during that time period. You could also pull up retail locations and customer demographics from other sources to gain additional insight into this information.
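The widgets-in-California example boils down to filtering the full dataset on a couple of fields. Here is a hedged sketch using plain Python records; the field names (`state`, `widgets`, `sold_on`) and the sample data are assumptions for illustration.

```python
from datetime import date

# Hypothetical sales transactions pulled from the lake.
sales = [
    {"state": "CA", "widgets": 3, "sold_on": date(2023, 4, 12)},
    {"state": "NY", "widgets": 5, "sold_on": date(2023, 4, 20)},
    {"state": "CA", "widgets": 2, "sold_on": date(2023, 3, 2)},
]

def widgets_sold(records, state, year, month):
    # Select only the subset of transactions relevant to the question.
    return sum(r["widgets"] for r in records
               if r["state"] == state
               and r["sold_on"].year == year
               and r["sold_on"].month == month)

total = widgets_sold(sales, "CA", 2023, 4)  # -> 3
```

In practice the same selection would be expressed as a query against the lake's query engine rather than an in-memory loop, but the shape of the operation is the same.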

Unlike hierarchical databases that require complex queries to extract useful information, data lakes allow users to quickly search for any data element on demand. This makes it easier for businesses to derive value from big data analytics.

In addition to storing structured and semi-structured records, such as database tables and XML files, data lakes can also store unstructured data like social media posts, email messages and documents. Easy access to this type of unstructured information can be particularly beneficial for companies looking to create competitive advantage through advanced analytics.

A data lake can handle multiple types of data simultaneously. Structured data (such as sales transactions) can be stored in a traditional relational database. Unstructured text documents, email messages, video and audio files can all be stored in their native formats.

What's the difference between a Data Lake and a Data Warehouse?

As described above, a data lake uses a flat architecture in which each data element carries a unique identifier and a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.

Data lakes are large, accessible data repositories that collect vast amounts of data from various sources. The key difference between a data lake and a traditional data warehouse is in the way the data is staged, or ingested, for later use. Data warehouses can only store structured data, which requires predefined fields with consistent definitions across the entire database. A data lake, on the other hand, accepts unstructured, semi-structured, and structured data and stores it as-is until it's needed.
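This "store as-is, interpret later" idea is often called schema-on-read. The sketch below illustrates it with stdlib Python: the lake keeps records exactly as they arrived, even with missing fields or inconsistent types, and a schema is applied only at query time. The record fields here are invented for illustration.

```python
import json

# Raw records kept exactly as they arrived -- no upfront schema enforcement.
raw_lake = [
    '{"customer": "Acme", "amount": 12.5}',
    '{"customer": "Globex"}',                  # missing field: still accepted
    '{"customer": "Initech", "amount": "9"}',  # inconsistent type: still accepted
]

def read_with_schema(raw_records):
    # The schema is imposed here, at read time, tolerating gaps and coercing types.
    for line in raw_records:
        rec = json.loads(line)
        yield {"customer": rec.get("customer"),
               "amount": float(rec.get("amount", 0))}

rows = list(read_with_schema(raw_lake))
```

A warehouse would have rejected the second and third records at load time; the lake accepts them and leaves interpretation to each consumer.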

A typical approach to loading data into the repository is to use Extract, Transform and Load (ETL) tools, or Hadoop ecosystem ingestion tools such as Apache Sqoop or Apache Flume that write to the Hadoop Distributed File System (HDFS). As an alternative, some organizations choose to use stream-processing tools such as Apache Kafka or Amazon Kinesis to capture streaming data and load it directly into the repository.
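The streaming-capture pattern can be sketched with the standard library alone: events are appended, unmodified, to a date-partitioned landing area. In a real pipeline the events would come from a Kafka or Kinesis consumer rather than a Python list, and the partition layout shown here (`dt=YYYY-MM-DD`) is just one common convention, not a requirement.

```python
import json
import pathlib
import tempfile
from datetime import datetime, timezone

# A temporary directory stands in for the lake's raw "landing zone".
landing = pathlib.Path(tempfile.mkdtemp())

def ingest(event: dict) -> None:
    # Partition by arrival date and append the raw JSON with no transformation.
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    partition = landing / f"dt={day}"
    partition.mkdir(exist_ok=True)
    with open(partition / "events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

for evt in [{"type": "click", "user": 1}, {"type": "view", "user": 2}]:
    ingest(evt)
```

Because nothing is transformed on the way in, ingestion stays cheap and schema decisions are deferred to whoever reads the data later.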

Using these techniques, an organization can load all of its raw data, structured or not, into the repository without first having to impose a schema.

Data Lake tools and platforms

The information that data lakes store can be sensitive, and ensuring reliable access to it is essential to running DataOps smoothly. Here are some of the most widely used data lake tools on the market.

  1. AWS Data Lake. Many businesses already use a number of products in the AWS suite, so completing the picture by hosting their data on AWS Data Lake makes the most sense. For those who already use AWS products, managing the data flowing into and out of the data lake is easier and can be done natively.
  2. Google Data Lake. For those using BigQuery, Google Data Lake might be the most logical solution, for the same reasons that AWS Data Lake would be a seamless and easy transition for those already in the AWS suite.
  3. Databricks. Databricks offers a platform that combines data engineering capabilities with the ability to store and work on data from AWS. It also provides comprehensive collaboration tools that let multiple people across the data organization build and analyze together.