Delta Lake is an open-source solution from Databricks, the original creator of the Apache Spark Big Data analytics engine. It is a tool that makes Data Lake data more reliable thanks to an extra storage layer. Find out everything you need to know about it.
A Data Lake is a huge repository in which a company stores all the data it needs to process. This makes it possible to unify multiple data sources. Unfortunately, the data stored within Data Lakes is often unreliable, and the Data Lakes themselves are in disarray.
What is Delta Lake and what is it for?
This is why Databricks, founded by the developers of the famous Apache Spark Big Data analytics engine, decided to open-source its Delta Lake solution. The announcement was made at the Spark + AI Summit in San Francisco, where Databricks CEO Ali Ghodsi called it an even more important innovation than Spark itself.
Delta Lake is a storage layer added on top of a Data Lake to provide reliable data sources for Machine Learning and Data Science. The tool checks all incoming data and makes sure it matches the user's schema. This ensures that the data is reliable and correct.
In addition, every operation is wrapped in an ACID transaction, ensuring that it either completes fully or not at all. It is therefore no longer possible to be confronted with erroneous or incomplete data.
In addition, Delta Lake is compatible with the Apache Spark APIs. In particular, this allows it to manage the Data Lake's metadata.
Delta Lake can run on on-premise servers, in the Cloud, or on devices such as laptops. It is compatible with both batch and streaming data sources.
In the near future, Databricks also plans to add functionality to access older versions of the data, for audits or to reproduce MLflow Machine Learning experiments.
The benefits of Delta Lake, now available to all Data Lakes
For the past year, Delta Lake has already been supplied by Databricks to nearly a thousand large companies, including Viacom, Edmunds, Riot Games, and McGraw Hill. The firm claims that its solution has helped hundreds of companies overcome the challenges associated with traditional Data Lakes.
From now on, all companies will be able to benefit from it. Thanks to Delta Lake's open-sourcing, developers will be able to easily build Data Lakes and turn them into "Delta Lakes" for increased reliability. Indeed, it is worth re-emphasizing that Delta Lake can run on top of any existing Data Lake.
For the future, Databricks is still hesitating over how the project will be governed. It is likely that Delta Lake will be hosted on GitHub, to ensure that it can receive a large number of contributions while remaining properly governed. Whatever governance model is chosen, Databricks' priority is to build a large community around its project in order to maximize the reliability of the data in Data Lakes.
This is why the firm has chosen to release its solution under the Apache License 2.0, just like Apache Spark. This permissive license was preferred over the Commons Clause, which many open-source companies use to prevent major cloud providers from integrating their tools into their own commercial SaaS offerings.
Databricks wants its Delta Lake technology to be used by companies of all sizes, both on-premise and in the Cloud. Its goal is for Delta Lake to become a new standard for Big Data storage.