DATA LAKE REFERENCE ARCHITECTURE

Kulasangar Gowrisangar
4 min read · Jul 21, 2018

INTRODUCTION

A data lake is a single platform that combines data governance, analytics, and storage. It’s a secure, durable, centralized cloud-based storage platform that lets you ingest and store both structured and unstructured data. It also lets you apply transformations to the raw data assets as needed. A data lake architecture supports a comprehensive portfolio of data exploration, reporting, analytics, machine learning, and visualization on that data.

DATA LAKE VS DATA WAREHOUSE

While a data warehouse can also hold a large collection of data, that data is highly organized and structured. Data doesn’t arrive in a warehouse in its original form; instead, it is transformed and loaded into the organization predefined by the warehouse. As a result, a data warehouse is often highly tuned to a specific set of problems: its structure and organization make those questions easy to query, but make others practically impossible to answer.

A data lake, on the other hand, can be applied to a large number and wide variety of problems. Believe it or not, this is because of the lack of structure and organization in a data lake: without a predefined schema, the data lake is more versatile and flexible. A data lake operates in a broader, more distributed context, where some questions remain ambiguous, with an undefined set of users and a…
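To make the contrast concrete, here is a minimal Python sketch of the two approaches. Everything in it is hypothetical and chosen only for illustration: the sample events, the `WAREHOUSE_SCHEMA` mapping, and the function names are not from any real product. The warehouse side is also simplified: it only filters out records that don't fit the predefined schema, whereas a real load step would typically transform them as well.

```python
import json

# Hypothetical raw events, each arriving with a different shape.
RAW_EVENTS = [
    '{"user": "a", "amount": 10.5}',
    '{"user": "b", "clicks": 3}',
    '{"sensor": "t1", "reading": 21.7}',
]

# Hypothetical warehouse schema: column name -> required type.
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_warehouse(raw_lines):
    """Warehouse style: keep only records that match the predefined schema."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        fits = set(record) == set(WAREHOUSE_SCHEMA) and all(
            isinstance(record[col], typ) for col, typ in WAREHOUSE_SCHEMA.items()
        )
        if fits:
            table.append(record)
    return table


def load_lake(raw_lines):
    """Lake style: store every record in its original form."""
    return list(raw_lines)


def query_lake(lake, field):
    """Interpret the raw data only when a question is asked of it."""
    values = []
    for line in lake:
        record = json.loads(line)
        if field in record:
            values.append(record[field])
    return values


warehouse = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)

print(len(warehouse), "record(s) fit the warehouse schema")
print("sensor readings found in the lake:", query_lake(lake, "reading"))
```

In this sketch the warehouse can only answer questions about `user` and `amount`, because anything else was discarded at load time. The lake kept every record, so a question nobody anticipated (here, sensor readings) can still be asked later, at the cost of interpreting the raw data at query time.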
