Data Lakes are increasingly used by businesses for data storage. Learn more about what a Data Lake is, its advantages and disadvantages, and how it differs from a Data Warehouse.
A Data Lake is a repository for storing very large amounts of raw data in its native format for an indefinite period of time. This storage method allows data with different schemas and structural forms, usually object blobs or files, to coexist.
A single Data Lake stores all of a company's data: raw data, including copies of source system data, alongside transformed data. This data is then used for reporting, data visualization, data analysis or Machine Learning.
Data Lake: a repository for every type of data
The Data Lake aggregates structured data from relational row- or column-oriented databases, semi-structured data such as CSV files, logs, XML and JSON, and unstructured data such as emails, documents and PDFs. Even binary data such as images, audio files or videos can be stored there.
The term Data Lake was first coined by James Dixon, CTO of Pentaho, to draw a contrast with the Data Mart, a smaller repository of interesting attributes extracted from raw data. According to him, the Data Mart suffered from several problems, chief among them the information silo phenomenon, and the Data Lake presented itself as the optimal solution. As a PricewaterhouseCoopers study showed, the Data Lake can put an end to this problem, since companies can use it to extract their data and store it within a single Hadoop repository.
According to Dixon, if you think of a Data Mart as a store of bottled water, packaged for easy consumption, the Data Lake is a large body of water in its natural state. Various users can come to examine the lake, dive into it, or take samples from it.
Examples of Data Lakes
One example of a Data Lake is the Apache Hadoop distributed file system. The first version, Hadoop 1.0, had limited capabilities in terms of MapReduce data processing. Interacting with this Data Lake required mastering Java and MapReduce, along with high-level tools like Pig and Hive. The arrival of Hadoop 2.0, with YARN for resource management and new processing paradigms such as streaming, helped overcome these limitations.
Many companies also use cloud storage services such as Amazon S3. More and more educational institutions are taking an interest in Data Lakes as well. For example, Cardiff University has set up the Personal DataLake project to create a new type of Data Lake for managing individual users' Big Data, providing them with a centralized point for collecting, organizing and sharing personal data.
Finally, companies that use the Internet of Things are very fond of the Data Lake model, since they must be able to gather and correlate data from hundreds, even millions, of sensors. This infrastructure is, for example, at the heart of the Linky connected meters being installed throughout France by Enedis. These meters record various types of information on consumption, allocated power and safety faults, and make it easier for maintenance teams to intervene.
What are the disadvantages of a Data Lake?
Technological advances are often seen as an inevitable forward march for the technology industry. However, it is wrong to assume that novelty is always synonymous with improvement. New technologies often bring their own challenges, some predictable, others less so, and these challenges can erode the benefits of innovation.
Data Lakes are an excellent example of this phenomenon. Companies adopted them very quickly, seeing them as a complement to, or even a replacement for, their Data Marts and Data Warehouses, while ignoring their limitations and flaws.
Indeed, Data Lakes can create more problems than they solve. According to Adam Wray, CEO and President of Basho, Data Lakes are simply “evil”. A company is better off viewing its data through the prism of a supply chain with a beginning, a middle and an end: the data needs to be collected, found, explored and transformed according to an organized plan. This approach maximizes the value extracted from the data.
Data Lakes can ruin that approach. They allow any format to be stored with no limit on quantity, which causes many problems and prevents value from being extracted from the data. Without the ability to categorize or prioritize data, chaos quickly sets in.
Data Lakes are being massively adopted because it seems obvious that you cannot take advantage of data that is not at your fingertips. However, value is not automatically derived from the data at hand. Data Lakes do not allow companies to prioritize data or decide how it will be used. The result is a veritable museum of countless works of art, but without a curator's eye to determine which works are worth exhibiting.
For Wray, the main reason Data Lakes are evil is that they are ruleless and incredibly expensive, and the value that can be extracted from them is minimal compared to what is promised. This may seem obvious and inevitable, given the immense amount of data available today. However, for Wray, this phenomenon is linked to other problems inherent in Data Lakes. To avoid them, he believes companies should ask themselves three questions: is the data current, where is it, and what are the risks involved in storing it?
Data must be timely
Data is only useful if it can be used to make good decisions in a timely manner. If a company has to spend a lot of time finding data in the Data Lake and preparing it before analysis, efficiency is greatly reduced.
Concretely, the Data Lake does not let companies prioritize their data, which makes it harder to see how that data can be useful. For Wray, the real question should be different: companies should ask how to provide data in a streamlined, simplified form so that as many people as possible can access it and act on it. Data Lakes do not help answer this question.
In most cases, data should be synthesized or used before being stored in the Data Lake. Data Lake advocates often argue that all data from IoT sensors should be stored there. However, take the example of a temperature sensor fitted to a turbine: if the temperature reaches a certain threshold, measures must be taken to switch the turbine off or send someone to repair it. This is timely information. Even in the long term, there is no point storing every reading if the sensor transmits the temperature every 15 seconds. The data can be summarized as long as the temperature is stable, without losing any useful information. Rather than storing anything and everything, it is better to make decisions based on the necessary data.
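The summarization idea above can be sketched in a few lines of Python. This is a minimal illustration of "deadband" compression, not a real ingestion pipeline; the function name, tolerance value and sample readings are all hypothetical.

```python
# Hypothetical sketch: summarizing a 15-second temperature stream before
# storage, keeping a reading only when it drifts beyond a tolerance.
# Stable stretches collapse to a single stored value.

def summarize(readings, tolerance=0.5):
    """Keep only readings differing from the last stored value by more
    than `tolerance` degrees; the first reading is always kept."""
    stored = []
    for timestamp, temp in readings:
        if not stored or abs(temp - stored[-1][1]) > tolerance:
            stored.append((timestamp, temp))
    return stored

stream = [(0, 70.0), (15, 70.1), (30, 70.2), (45, 75.3), (60, 75.4)]
print(summarize(stream))  # the stable readings are dropped, the jump is kept
```

The useful information, when the temperature changed and by how much, survives, while the redundant every-15-seconds noise never reaches the lake.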
A latency problem to take into account
In an increasingly globalized world, where data can be accessed from anywhere, one might think that the location of the data no longer matters. However, as Wray points out, latency depends on where the data is stored, and this latency can affect its timeliness.
If the Data Lake is physically very far away, retrieving a specific piece of data can take a long time, yet this phenomenon is too often neglected. Latency can slow down the entire enterprise. In addition, the location of the Data Lake can also reduce data security.
With Data Lakes, companies often think they can simply store all their data any way they like, regardless of where it comes from. However, such behaviour exposes the company to many risks, including legal ones.
A risk to data confidentiality
As Wray points out, laws and regulations on data privacy differ from one country to another. It is therefore not possible to transplant a model from one country to another as if nothing had changed. This may seem obvious, but many companies fail to take it into account. The lack of data prioritization within Data Lakes makes it even harder to adapt to regulations: companies may not be aware of all the data they collect, where it comes from, and the risks to which they are exposed.
Leaks of sensitive customer data, such as financial information or private e-mails, are commonplace. That is why protecting consumer data is essential in every business. The lack of constraints in Data Lakes can lead companies to place sensitive data in an insecure location. Nearly 200 million U.S. voters paid the price when a Data Lake was exposed on a public cloud, in what was then the largest data leak in history.
This is why Wray believes Data Lakes should be handled with care. He considers them evil because, metaphorically, they are an all-you-can-eat buffet rather than a meal in a good restaurant: the quantity is enormous, but the quality is poor. According to him, about 95% of the data collected and aggregated within a Data Lake will end up discarded. Yet many companies use Hadoop and Data Lakes thinking that if the service is complex and expensive, it must be high quality. This is an absurd idea.
In reality, Data Lakes create complexity, add burdens, and confuse the business without providing much value. Many companies think they can store data now and extract value later, but never succeed because extracting that value turns out to be too difficult. In the long run, this can lead to real disasters.
Business leaders should stop storing anything and everything in their Data Lakes and instead focus on individual projects and the analytical tools needed to deliver real value, assembling only the data required. This process produces actionable data for the problems the company currently faces. After completing a series of individual projects, it becomes clear which data is used most often and which data is a priority. It is then possible to build a more efficient and effective repository.
The Data Lake: a challenge for companies
According to David Needle, the Data Lake is one of the most controversial ways of managing Big Data. Similarly, PricewaterhouseCoopers tempered its enthusiasm by pointing out that not all Data Lake initiatives succeed.
According to Sean Martin, CTO of Cambridge Semantics, many companies create Big Data cemeteries: they dump all their files into Hadoop hoping to get something out of them by happy coincidence, then forget about the stored data and lose track of it.
The main challenge, then, is not creating a Data Lake but taking advantage of the opportunities it offers. Companies that have successfully deployed Data Lakes improve them gradually, asking themselves which data and metadata matter to their business.
Differences between Data Lake and Data Warehouse
While a Data Warehouse stores data in hierarchical files and folders, a Data Lake is based on a flat architecture. Each data element in the Lake is assigned a unique identifier and tagged with an extended set of metadata keywords. When the company has a question, it can query the Data Lake for relevant data, and the resulting dataset can then be analyzed to answer it.
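The flat identifier-plus-metadata model described above can be sketched in Python. This is an illustrative toy catalog, not a real Data Lake API; the `ingest` and `query` names and the sample payloads are invented for the example.

```python
# Toy sketch of a flat Data Lake catalog: every element gets a unique
# identifier and a set of metadata keywords, and queries filter on tags
# rather than navigating a folder hierarchy.
import uuid

catalog = {}  # identifier -> {"payload": raw bytes, "tags": set of keywords}

def ingest(payload, tags):
    """Store a raw element under a unique id with its metadata keywords."""
    element_id = str(uuid.uuid4())
    catalog[element_id] = {"payload": payload, "tags": set(tags)}
    return element_id

def query(*wanted_tags):
    """Return the payloads of every element carrying all requested tags."""
    return [e["payload"] for e in catalog.values()
            if set(wanted_tags) <= e["tags"]]

ingest(b"...sensor log bytes...", ["iot", "turbine", "2023"])
ingest(b"%PDF-1.7 ...", ["invoice", "pdf", "2023"])
print(query("iot", "2023"))  # only the sensor log matches both tags
```

The point of the flat layout is visible here: nothing about where an element "lives" matters, only its identifier and the metadata attached to it.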
The term Data Lake is often associated with Hadoop-oriented object storage. In such a scenario, an enterprise's data is first loaded onto the Hadoop platform, then business analytics and data mining tools are applied to the data where it resides, on the Hadoop cluster nodes.
Just like “Big Data”, “Data Lake” is sometimes used as a simple marketing label for a product that supports Hadoop. Increasingly, though, the term describes a large pool of data whose schema and requirements are not defined until the data is queried.
A Data Warehouse holds only processed, structured data. In a Data Lake, data can be structured, semi-structured, unstructured or raw. Before loading data into a Data Warehouse, it must be given a shape and structure, i.e. a model. In a Data Lake, raw data can be loaded as-is and given shape and structure only when the time comes to use it.
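The load-now, structure-later idea can be sketched as follows. This is a hedged illustration of the "schema-on-read" pattern using in-memory lists; the record fields and function names are invented for the example.

```python
# Sketch of schema-on-read: raw records enter the lake untouched, and a
# structure is imposed only at query time, when the data is actually used.
import json

raw_lake = []  # stand-in for raw storage: one JSON string per event

def ingest_raw(line):
    raw_lake.append(line)  # no validation, no schema: stored as-is

def read_with_schema(fields):
    """Apply a structure at read time: keep only the requested fields,
    skipping records that cannot be parsed."""
    rows = []
    for line in raw_lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a raw lake tolerates malformed entries
        rows.append({f: record.get(f) for f in fields})
    return rows

ingest_raw('{"user": "alice", "amount": 12.5, "note": "ok"}')
ingest_raw('not json at all')  # accepted anyway: no schema at write time
print(read_with_schema(["user", "amount"]))
```

A Warehouse would have rejected the malformed record at load time (schema-on-write); the lake accepts everything and defers that decision to the reader.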
Storage in a Data Warehouse is very expensive for large volumes of data, while a Data Lake such as Hadoop is designed for low-cost storage, for two reasons. First, Hadoop is open-source software: the license and community support are free. Second, Hadoop is designed to run on inexpensive hardware.
The Warehouse is less agile: its configuration is fixed, and while its structure can be modified, doing so takes time. The Lake, for its part, is very agile and can be configured or reconfigured at will; models, queries and applications can easily be modified by developers and Data Scientists. On the other hand, the Data Warehouse is more mature in terms of security, thanks to decades of existence. The Big Data industry is nevertheless working hard to secure Data Lakes, and the gap is expected to close quickly. Finally, the main users of the Data Lake are Data Scientists, while the Data Warehouse is open to all members of the company.
These differences are important to keep in mind. They show that the Data Lake is not simply an evolution or replacement of the Data Warehouse. The Data Lake and the Data Warehouse are optimized for different uses, and each should be used for what it was designed for.
How do we keep a Data Lake from turning into a Data Swamp?
Widely adopted by companies as part of their analytics strategies, Data Lakes often prove inefficient and costly. To prevent your Data Lake from turning into a “swamp”, it is important to avoid certain mistakes. Here are the five most common ones.
Lack of experience with Data Lakes
Many companies undertake the deployment of a Data Lake without any real experience in the field. IT departments discovering Hadoop for the first time quickly find themselves disoriented by its novelty. As a result, deployment is slow, implementation is difficult, and goals quickly become obsolete. To avoid this problem, it is best to call on experienced leaders or, failing that, to consult experts for advice.
Lack of engineering skills
The lack of qualified engineers is also one of the most common sources of failure in Data Lake projects, and identifying a qualified engineer for such a program is difficult. Mastering technologies such as Spark, Kafka and HBase is essential but not always sufficient for the efficient deployment of a Data Lake. To minimize this risk, it is essential to recruit talented software engineers and then train them on Hadoop if necessary. It is also possible to invest in a Data Lake management platform to avoid having to use Hadoop directly.
An immature operating model
The traditional separation between the IT department and the rest of the business can be a major obstacle, especially in the initial phase. The successful deployment of a Data Lake depends on close collaboration between data scientists and data engineers: the data scientists must use the tools provided by the IT department, and the data engineers must operationalize what the data scientists build. An effective operating model must be in place, along with a robust governance structure.
Poor data governance
Data governance is the set of processes that manage a company's data sets. Governance ensures that the data is trustworthy and that, in the event of a problem, those responsible can be easily identified. Poor governance is behind many failures, so governance should be established early, during the initial implementation phase.
Lack of technical capacity
Many companies underestimate the technical complexity of Data Lake solutions. Hadoop provides tools that cover some, but not all, of the necessary technical capabilities. Before starting data ingestion, it is therefore best to make sure these capabilities are in place.