Raw data is data that has not yet been processed for a specific purpose. Perhaps the greatest difference between data lakes and data warehouses is whether they hold raw or processed data. Data lakes primarily store raw, unprocessed data, while data warehouses store processed and refined data. Because of this, data lakes typically require much more storage capacity than data warehouses.
Additionally, raw, unprocessed data is malleable: it can be reshaped and analyzed for many different purposes, which makes it well suited to machine learning. The risk, however, is that a data lake without appropriate data quality and governance measures can turn into a data swamp. Data warehouses, by storing only processed data, save on pricey storage space because they do not keep data that goes unused.
Processed data is raw data that has been put to a specific use. Since data warehouses only house processed data, all of the data in a data warehouse has been used for a specific purpose within the organization, and a larger audience can easily understand it. The purpose of individual pieces of data in a data lake, by contrast, is not fixed. Raw data flows into a data lake, sometimes with a specific future use in mind and sometimes just to have on hand. As a result, data lakes have less organization and less filtration of data than data warehouses.
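The difference between raw and processed data can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the event records, their field names, and the `process_for_sales_report` function are hypothetical and do not come from any particular data lake or data warehouse product.

```python
import json
from collections import defaultdict

# Hypothetical raw events, as they might land in a data lake:
# loosely structured, mixed in type, and not tied to any single purpose.
RAW_EVENTS = [
    '{"user": "a1", "type": "purchase", "amount": "19.99", "region": "EU"}',
    '{"user": "b2", "type": "page_view", "url": "/pricing"}',
    '{"user": "a1", "type": "purchase", "amount": "5.01", "region": "EU"}',
    '{"user": "c3", "type": "purchase", "amount": "12.50", "region": "US"}',
]


def process_for_sales_report(raw_lines):
    """Filter and aggregate raw events into revenue per region.

    The result is the kind of purpose-built table a data warehouse
    might hold (hypothetical example).
    """
    totals = defaultdict(float)
    for line in raw_lines:
        event = json.loads(line)
        # Events that are not purchases are irrelevant to this report.
        if event.get("type") != "purchase":
            continue
        totals[event["region"]] += float(event["amount"])
    # Round to cents and sort by region for a stable, readable result.
    return {region: round(total, 2) for region, total in sorted(totals.items())}


sales_by_region = process_for_sales_report(RAW_EVENTS)
print(sales_by_region)
```

The raw list still contains the page-view event, which a different analysis could use later. The processed result answers only the one question it was built for, which is what makes it both compact and easy for a wider audience to read.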