Data lake is a single platform which is made up of, a combination of data governance, analytics and storage. It’s a secure, durable and centralized cloud-based storage platform that lets you to ingest and store, structured and unstructured data. It also allows us to make necessary transformations on the raw data assets as needed. A comprehensive portfolio of data exploration, reporting, analytics, machine learning, and visualization on the data can be done by utilizing this data lake architecture.
DATA LAKE VS DATA WAREHOUSE
While a data warehouse can also be a large collection of data, it is highly organized and structured. In a data warehouse, data doesn’t arrive in its original form, but is instead transformed and loaded into the organization predefined in the warehouse. This highly structured approach means that a data warehouse is often highly tuned to solve a specific set of problems, but is unusable for others. The structure and organization make it easy to query for specific problems, but practically impossible for others.
A data lake on the other hand, can be applied to a large number and wide variety of problems. Believe it or not, this is because of the lack of structure and organization in a data lake. The lack of a predefined schema gives the data lake more versatility and flexibility. A Data Lake operates, with a more broad and distributed context, where some questions remain ambiguous, with an undefined set of users and a variety of different data presentations. One of the core capabilities of a data lake architecture is the ability to quickly and easily ingest multiple types of data, such as real-time streaming data and bulk data assets from on-premises storage platforms, as well as data generated and processed by legacy on-premises platforms, such as mainframes and data warehouses.
WHY DO WE NEED A DATA LAKE?
- Managed Unlimited Data
A Data Lake can store unlimited data, in its original formats and fidelity, for as long as you need. As a data staging area, it can not only prepare data quickly and cost effectively for use in downstream systems, but also serves as an automatic compliance archive to satisfy internal and external regulatory demands. Unlike traditional archival storage solutions, a Data Lake is an online system: all data is available for query.
- Accelerate Data Preparation and Reduce Costs
Increasingly, data processing workloads that previously had to run on expensive systems can migrate to a Data Lake, where they run at very low cost, in parallel, much faster than before. Optimizing the placement of these workloads and the data on which they operate frees capacity on high-end data warehouses, making them more valuable by allowing them to concentrate on the business-critical OLAP and other applications for which they were designed.
- Explore and Analyze
Fast Above all else, a Data Lake enables analytic agility. IT can provide analysts and data scientists with a self-service environment to ask new questions and rapidly integrate, combine, and explore any data they need. Structure can be applied incrementally, at the right time, rather than necessarily up front. Not limited to standard SQL, a Data Lake offers options for full-text search, machine learning, scripting, and connectivity to existing business intelligence, data discovery, and analytic platforms. A Data Lake finally makes it cost-effective to run data-driven experiments and analysis over unlimited data.
GENERAL DATA LAKE REFERENCE ARCHITECTURE
Here are the key components in a data lake architecture. We have our data sources which can be structured and unstructured. They all integrate into a raw data store that consumes data in the purest possible form without any transformations. It is a cheap persistent storage that can store data at scale.
Analytical sandboxes are one of the key components in data lake architecture. These are the exploratory areas for data scientists where they can develop and test new hypothesis, mash-up and explore data to form new use-cases, create rapid prototypes to validate these use-cases and realize what can be done to extract value out of the business.
Then we have batch processing engine that processes the raw data into something that can be consumed by the users. In other words, a structure that can be used for reporting to the end users. Then the transformed data would be stored in processed data store. There is a real-time processing engine that takes streaming data and processes it as well.
Finally once the necessary data is being stored, the BI tools would be able to leverage those to show meaningful graphs and numbers to the end customers. For example sales per quarter, growth within a region for the year etc.
AWS DATA LAKE REFERENCE ARCHITECTURE
The aws data lake foundation provides these features:
- Data submission, including batch submissions to Amazon S3 and streaming submissions via Amazon Kinesis Firehose
- Ingest processing, including data validation, metadata extraction, and indexing via Amazon S3 events, Amazon Simple Notification Service (Amazon SNS), AWS Lambda, Amazon Kinesis Analytics, and Amazon ES
- Dataset management through Amazon Redshift transformations and Kinesis Analytics
- Data transformation, aggregation, and analysis through Amazon Athena and Amazon Redshift Spectrum
- Search, by indexing metadata in Amazon ES and exposing it through Kibana dashboards
- Publishing into an S3 bucket for use by visualization tools, and visualization with Amazon QuickSight