The way we store and analyse data has evolved impressively. First came the database, then the data warehouse. Now, with cloud computing becoming the default way of storing data, another storage paradigm has emerged: the Data Lake. For the uninitiated, a data lake is a kind of storage into which you can dump all kinds of data (structured, semi-structured and unstructured) and then analyse only the data you actually need. Unlike a data warehouse, where data must be prepped before it becomes useful, a data lake accepts unfiltered, raw data.
Azure offers Data Lake Storage to its cloud customers and has been upgrading it from day one. The current version, Gen2, is built on Azure Blob Storage, which makes the whole thing highly scalable. Combined with Azure's broader ecosystem, big data analytics has become efficient and fast.
Every new technology comes with promises and pitfalls. To avoid the pitfalls, organisations should follow the best practices that come with the technology, and Azure Data Lake is no exception.
Here are the major ADL best practices that you should follow.
Data Lake Governance
Many resources on the web describe a Data Lake as the opposite of "think first, load later" storage. They tell you that with Azure Data Lake you can dump data first and think about how to clean and prepare it afterwards. While this is theoretically true, without proper governance a Data Lake can soon turn into a Data Swamp.
Since all types of data are dumped into the data lake, it is necessary to create separate partitions for each kind of data: raw data, processed data, and data ready for exploration or for use in customer-facing applications. And since ADL Gen2 is built on Blob Storage, you have greater flexibility to assign metadata to each object.
Above all, there should be clear internal guidelines on why and how data is ingested. At AutomationFactory.ai, we still try to leverage ETL instead of ELT, even for Azure Data Lake, as much as possible.
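To make the zone idea concrete, here is a minimal sketch of a path-building helper. The zone names ("raw", "processed", "curated") and the source/date partitioning scheme are assumptions for illustration, not an Azure convention; adapt them to your own governance guidelines.

```python
from datetime import date

# Hypothetical zone names for a governed lake layout; adjust to your
# organisation's own conventions.
ZONES = ("raw", "processed", "curated")

def build_lake_path(zone: str, source: str, ingest_date: date) -> str:
    """Build a partitioned path like 'raw/sales/2024/06/15' so each zone
    keeps its own date-partitioned hierarchy."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{source}/{ingest_date:%Y/%m/%d}"

print(build_lake_path("raw", "sales", date(2024, 6, 15)))
# raw/sales/2024/06/15
```

A consistent, predictable layout like this is what lets analysts find processed data without wading through raw dumps.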
Secure The Data Lake
Data security is one of the most critical aspects of a business operation. Data leakage and improper security configuration can cause long-term damage to an organisation's reputation.
At a high level, like other Azure resources, Azure Data Lake is governed by Azure Active Directory (AAD), Azure's identity and access management service. With AAD, you decide exactly which users can access ADL.
Users who have access to Azure Data Lake have different purposes, so not all authorised users need access to every file or blob. With Role-Based Access Control (RBAC), you can decide which users can access exactly which files or data.
Traditional POSIX-style RWX access control is also available through access control lists (ACLs). Using these, you can set permissions on files and directories so that they are readable, writable and executable only by specific users or groups.
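A quick sketch of the RWX model these ACLs follow. This is a toy permission checker, not the Azure SDK; it only illustrates how a three-character entry such as "r-x" encodes read, write and execute rights.

```python
# Toy illustration of POSIX-style RWX checks, the model ADLS Gen2 ACLs follow.
# An ACL entry is a three-character string such as "rwx", "r-x" or "r--".
def has_permission(acl_entry: str, action: str) -> bool:
    """Check whether an entry like 'r-x' grants 'read', 'write' or 'execute'."""
    flags = {"read": "r", "write": "w", "execute": "x"}
    position = {"read": 0, "write": 1, "execute": 2}
    return acl_entry[position[action]] == flags[action]

print(has_permission("r-x", "read"))   # True
print(has_permission("r-x", "write"))  # False
```

In practice you would set these permissions per user or group on files and directories in the lake; the point is that a missing flag denies the corresponding action.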
Lastly, a firewall protects the Data Lake from prying eyes. This network-level security lets the organisation define the IP ranges from which the data lake can be accessed; users outside those ranges cannot reach it.
Note: Azure Data Lake does not support obfuscation of individual cells in tables. The simplest way to obfuscate data is to mask it before it is ingested. However, using Databricks, data can also be obfuscated once it is inside Azure storage.
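Masking before ingestion can be as simple as replacing sensitive values with salted hashes. A minimal sketch, assuming email addresses are the sensitive column; the salt value here is a placeholder and should live in a secret store, not in code.

```python
import hashlib

def mask_column(values, salt="my-secret-salt"):
    """Irreversibly obfuscate sensitive values (e.g. email addresses) with a
    salted SHA-256 hash before the records are ingested into the lake.
    The salt shown here is a placeholder; keep a real salt in a secret store."""
    return [hashlib.sha256((salt + v).encode()).hexdigest() for v in values]

emails = ["alice@example.com", "bob@example.com"]
masked = mask_column(emails)
print(masked[0][:12])  # a stable, non-reversible token instead of the address
```

Because the hash is deterministic, masked values can still be joined and grouped in analytics; they just can no longer be traced back to the original person.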
Leverage Read Access Geo Redundant Storage
One of the biggest benefits of cloud computing is that you can replicate your business data and keep a backup in another Azure region. This helps your business applications remain highly available.
For ADL Gen2, Azure recommends geo-redundant storage (as opposed to LRS, which replicates data only within a single region) so that the data remains available in a second Azure region. Give your application read access to the data in the secondary region so that, in case of an outage, it can keep using the secondary storage as if nothing had happened.
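The failover logic on the application side can be sketched in a few lines. With RA-GRS, the read-only secondary endpoint uses the account name with a "-secondary" suffix; the account name below ("mylake") and the `flaky_read` function are made up for illustration.

```python
# Hypothetical account name; with RA-GRS the secondary read endpoint adds a
# "-secondary" suffix to the storage account name.
PRIMARY = "https://mylake.dfs.core.windows.net"
SECONDARY = "https://mylake-secondary.dfs.core.windows.net"

def read_with_failover(read_fn):
    """Try the primary endpoint first; on failure, fall back to the
    read-only secondary so the application keeps serving data."""
    try:
        return read_fn(PRIMARY)
    except ConnectionError:
        return read_fn(SECONDARY)

# Usage: simulate a primary-region outage.
def flaky_read(endpoint):
    if "secondary" not in endpoint:
        raise ConnectionError("primary region unavailable")
    return f"data from {endpoint}"

print(read_with_failover(flaky_read))
```

Real client libraries offer similar behaviour as a configuration option, but the principle is the same: reads transparently fall back to the secondary region.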
Merge Small Files To Avoid Slow Processing Speed and Complexities
Since we dump all kinds of data into Azure Data Lake, it ends up holding millions of small files: logs, sensor readings, changelogs and so on. Processing these small files one by one causes performance issues, so it is recommended that you merge them (a process known as compaction). Ideally, you should combine small files into larger files of around 256 MB.
AutomationFactory will help you formulate robust data governance and ensure that ETL or ELT happens with all these best practices in mind. Remember, it is the proper use of data that will differentiate you from your competitors.