Your Resource for All Things Apps, Ops, and Infrastructure

Using Data Lakes vs Data Warehouses

With the rise of Big Data, there has been a lot of talk around the best way for an organization to collect and store all their data as well as which data storage strategy is going to allow an organization to best leverage their data for business insights.

Data lakes and data warehouses are two different types of Big Data storage repositories. Data warehouses are nothing new and have long been used by organizations to manage their data. Data lakes are a newer technology that is beginning to grow in popularity. However, it’s not a complete replacement for data warehouses.

Below, we will define and compare data warehouses and data lakes.

Data Warehouses

A data warehouse is a large collection of raw data in a scalable environment that supports a variety of workloads. Data warehouses require structure. This requires the purpose and organization of the data to be thought-out and planned prior to the collection of data but results in high value for the end user.

The data in a data warehouse is in a structured format schema – preprocessed, formatted, indexed, and designed for performance. It is a non-transactional system that is optimized for reads in a column-oriented format often in a large range of rows. It is queried via SQL.

OLTP vs. OLAP

When defining data warehouses, it’s important to note the difference between Online Analytical Processing (OLAP) and Online Transactional Processing (OLTP). OLAP is a category of software tools that are used for the retrieval and analysis of data in data warehouses. Alternatively, OLTP is a transactional system that include sales orders, product inventory, online messaging, and financial transaction systems. It is typically used by core business functions to administer the various transactions of their organization. In simple terms, OLAP is used for data analysis, and OLTP is used for data processing.

An example of an OLTP system would be MySQL. While it can often simplify processes, it can only handle them individually with the ability to only read or write one record at a time.

An example of an OLAP system would be Redshift. This is a system optimized for reads. It is formatted for reading many rows at a time but often not the whole row – just select columns. One primary difference between OLAP and OLTP is ACID transactions.

Data Lakes

A data lake is a large collection of raw data in a scalable environment that supports a variety of workloads. This data is made up of mixed structure data – sometimes unknown schema, flat files, and multiple different formats – as well as very raw data with unknown value. It is accessed through a variety of languages, including SQL, Python, R, and Spark.

Both data lakes and data warehouses are designed to be non-transactional systems for analytics that offer SQL interfaces into the data.

Data lakes tend to consist of more data, like web logs, that seem to have no value to the end users. However, storing data like this offers value if you’re able to correlate it with other data sources.

Often, data lakes are where the data originates, but it is then promoted into a data warehouse once it’s been placed into a pipeline and processed to only extract the necessary fields that offer the most value.

A Closer Look

So…Data Warehouse or Data Lake?

When it comes to data lakes and data warehouses, it’s not an either / or. Ideally, one should start gathering all their data into a data lake for discovery purposes, as it’s a cost-effective method for collecting a large quantity of varying types of data, then they should put some of the data into the data warehouse once it’s determined to be useful.

Data lakes allow organizations to store massive amounts of all data types in a low-cost storage location, like Amazon S3 (and archived in S3 Glacier), where it can be recovered at a later date and allows the organization to avoid throwing away a potentially valuable resource. With this practice, all the data is readily available.

If / when you determine that certain data is valuable, then it can be processed, and pipelines can be written to move the data into the organization’s data warehouse to be leveraged for insights.

Conduct a Health Check on your ransomware resiliency and work fearlessly toward the future.

Subscribe to the AHEAD i/o Newsletter