Lifting & Shifting Data From RedShift To Azure SQL Using Azure Pipeline
- Azure, Cloud Computing, ETL
- Consulting, Planning, Execution
- Cargo, Logistics, Trucking, Shipping
AWS To Azure Data Transfer: Migrating raw data from AWS Redshift to Azure SQL Data Warehouse.
Easy data migration is one of the key components of Digital Transformation. This case study will investigate how we created an Azure pipeline to orchestrate an automated data migration workflow using ETL logic.
The objective of this workflow was to transfer raw data from Amazon Redshift to Azure SQL Data Warehouse. Additionally, we were required to create logic that would trigger automated data extraction based on conditions such as the arrival of a new file or a specific time of day.
This was not a simple lift-and-shift approach. After extraction, the data was transformed and its schema was changed so that it resembled the other data the client was accustomed to handling.
Technology Used –
- Azure Data Factory for providing ETL logic and for file processing
- Azure SQL Data Warehouse as the final store for the data consumed by the analysts
- Azure Data Lake for long-term file storage
Systematic Approach To Complex Issues
As one of the world’s largest IT service providers, we have over 120 engineers and IT support staff ready to help.
SmartData has been helping organizations throughout the world manage their IT with our unique approach to technology management and consultancy solutions.
We reviewed the challenge and, leveraging the power and scale of the cloud, devised a solution that goes beyond traditional infrastructure.
The client approached us with a data science challenge pertaining to one of their data sets. We provided the client with the data in an AWS environment, in a Redshift data warehouse. This was found to be expensive: storage and compute are coupled together in Redshift, which increased the computing costs. Yes, the speed improved too, but that level of performance was not required.
However, the data was also available in CSV format in an S3 storage bucket, which opened the way to a different approach. The client's entire infrastructure was already deployed and managed by AutomationFactory.ai in Azure, so they opted to consolidate into the existing infrastructure.
The key considerations were as follows. The solution must process a large data set of more than 11,000 files, totaling 2 TB compressed, with additional files arriving every day. Raw files are to be ingested into a database and stored for future requirements. Ingestion should be rate-controlled and parallelizable, so that multiple database connections can be managed and files are ingested in an orderly way.
Every single file has to be accounted for, to ensure that data is moved correctly. Ongoing maintenance must involve minimal effort and cost, and it must be automated and delegated away from the end users.
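The rate-control and accounting requirements above can be illustrated with a minimal Python sketch. It is not the production pipeline, which was built in Azure Data Factory. The function names, the `max_connections` parameter, and the `fake_load` stand-in are all assumptions made for illustration. The idea is that a bounded worker pool caps the number of concurrent database connections, and every file gets an explicit outcome.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def ingest_all(
    files: Iterable[str],
    load_file: Callable[[str], int],
    max_connections: int = 4,
) -> Dict[str, str]:
    """Load files in parallel while capping concurrent database connections.

    Returns an outcome for every input file, so none is silently dropped.
    """
    results: Dict[str, str] = {}
    # The pool size is the rate control: at most `max_connections` loads run at once.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        futures = {pool.submit(load_file, name): name for name in files}
        for future in as_completed(futures):
            name = futures[future]
            try:
                rows = future.result()
                results[name] = f"loaded:{rows}"
            except Exception as exc:
                # Record the failure and keep going with the other files.
                results[name] = f"failed:{exc}"
    return results


def fake_load(name: str) -> int:
    """Hypothetical stand-in for a real database load; fails on one file."""
    if name.endswith("bad.csv"):
        raise ValueError("malformed row")
    return 100


report = ingest_all(["a.csv", "b.csv", "bad.csv"], fake_load, max_connections=2)
```

A failed file does not stop the batch here; it is reported alongside the successful ones, which is what makes per-file follow-up possible.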
Deliver Only Exceptional Quality, And Improve!
AutomationFactory.ai recommended Microsoft’s Extract-Transform-Load (ETL) service for Azure, because it is a native Azure service that integrates with the other Microsoft services.
AutomationFactory.ai built a data pipeline to move the data from AWS to Azure. The initial load was triggered manually; after that, scheduled runs checked for new files at regular intervals.
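The scheduled check for new files might look roughly like the sketch below. It assumes the pipeline can list the files in the source location and knows which files it has already tracked. The function names and file names are hypothetical and are not taken from the actual implementation.

```python
from typing import Iterable, List, Set


def find_new_files(listed: Iterable[str], already_tracked: Set[str]) -> List[str]:
    """Return files present in the source listing but not yet tracked, sorted by name."""
    return sorted(set(listed) - already_tracked)


def scheduled_check(listed: Iterable[str], tracked: Set[str]) -> List[str]:
    """One scheduled run: register any new files and return them for ingestion."""
    new_files = find_new_files(listed, tracked)
    tracked.update(new_files)
    return new_files


# Hypothetical file names; the first file was part of the manual initial load.
tracked_files = {"2020-01-01.csv.gz"}
listing = ["2020-01-01.csv.gz", "2020-01-02.csv.gz"]

first_run = scheduled_check(listing, tracked_files)   # picks up the new file
second_run = scheduled_check(listing, tracked_files)  # nothing new to do
```

Because the check only compares the listing against what is already tracked, running it again with no new files does nothing, so the interval can be chosen freely.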
We created status tables to keep track of all the files. These tables record the status of each file as it passes through the data pipelines. They also give the solution a decoupled structure, so troubleshooting or manual intervention can happen at any stage without creating dependencies. Individual files and steps can be fixed in isolation while the other pipelines and steps continue without interruption. Decoupling also makes it clear where in the process an error occurred, and users are notified so they can investigate further.
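As an illustration of the status-table idea, the following sketch uses an in-memory SQLite database in place of the real warehouse tables. The table layout, stage names, and helper functions are assumptions for the example, not the client's schema. Each file has one row per stage, so a failure in one stage of one file is visible and repairable without touching anything else.

```python
import sqlite3
from typing import List, Optional, Tuple

STAGES = ("copied", "ingested", "transformed")  # hypothetical stage names

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE file_status (
           file_name TEXT NOT NULL,
           stage     TEXT NOT NULL,
           status    TEXT NOT NULL,
           detail    TEXT,
           PRIMARY KEY (file_name, stage)
       )"""
)


def register(file_name: str) -> None:
    """Create one 'pending' row per stage for a newly seen file."""
    conn.executemany(
        "INSERT OR IGNORE INTO file_status VALUES (?, ?, 'pending', NULL)",
        [(file_name, stage) for stage in STAGES],
    )


def mark(file_name: str, stage: str, status: str, detail: Optional[str] = None) -> None:
    """Record the outcome ('done' or 'error') of one stage for one file."""
    conn.execute(
        "UPDATE file_status SET status = ?, detail = ? WHERE file_name = ? AND stage = ?",
        (status, detail, file_name, stage),
    )


def needs_attention() -> List[Tuple[str, str, str]]:
    """Rows in error, for user notification or manual repair."""
    return conn.execute(
        "SELECT file_name, stage, detail FROM file_status "
        "WHERE status = 'error' ORDER BY file_name"
    ).fetchall()


register("a.csv")
register("b.csv")
mark("a.csv", "copied", "done")
mark("b.csv", "copied", "done")
mark("b.csv", "ingested", "error", "bad delimiter")
errors = needs_attention()
```

Here the error on `b.csv` does not affect `a.csv`, whose remaining stages stay pending and can proceed independently.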
All of the data is mapped back to the tables so that it can be reprocessed or cleaned when required. The data is then transformed, with additional schema changes to match the client’s end use and to align it with the traditional trading data.
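A schema change of this kind typically amounts to renaming columns and casting values. The sketch below shows one way to express it as a declarative mapping. The column names and types are invented for illustration, since the source does not describe the actual schemas.

```python
from datetime import date
from typing import Any, Callable, Dict, Tuple

# Hypothetical mapping: raw column -> (target column, converter).
COLUMN_MAP: Dict[str, Tuple[str, Callable[[str], Any]]] = {
    "record_id": ("RecordId", int),
    "event_dt": ("EventDate", date.fromisoformat),
    "amount": ("Amount", float),
}


def to_target_schema(raw_row: Dict[str, str]) -> Dict[str, Any]:
    """Rename and cast one raw CSV row; a missing source column raises KeyError."""
    return {
        target: convert(raw_row[source])
        for source, (target, convert) in COLUMN_MAP.items()
    }


row = to_target_schema({"record_id": "42", "event_dt": "2020-03-01", "amount": "12.5"})
```

Keeping the mapping in data rather than in code means that a schema adjustment is a one-line change to `COLUMN_MAP`.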
The data pipelines were deliberately abstracted so that adding new data sources in the future takes minimal work. The objective was to keep things easy for the client’s end users, letting them carry out the required steps themselves.
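One common way to achieve this kind of abstraction is to keep the pipeline steps generic and describe each source purely through configuration. The sketch below illustrates the pattern only. The `SourceConfig` fields, the registry, and the example values are assumptions rather than the actual design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceConfig:
    """Everything the generic pipeline needs to know about one data source."""
    name: str
    landing_path: str           # where raw files arrive
    target_table: str           # where transformed rows end up
    file_pattern: str = "*.csv"


SOURCES: Dict[str, SourceConfig] = {}


def register_source(config: SourceConfig) -> None:
    """Adding a new source is a single registration, with no new pipeline code."""
    SOURCES[config.name] = config


def plan_steps(source_name: str) -> List[str]:
    """Describe the shared steps for a source; only the configuration differs."""
    cfg = SOURCES[source_name]
    return [
        f"copy {cfg.landing_path}/{cfg.file_pattern} to the data lake",
        f"ingest raw files into staging for {cfg.name}",
        f"transform and load into {cfg.target_table}",
    ]


# Hypothetical source definition.
register_source(SourceConfig("redshift_export", "s3-export", "dbo.RawData"))
steps = plan_steps("redshift_export")
```

With this shape, onboarding another source means adding one more `register_source(...)` call, a step that end users could plausibly perform themselves.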
Brainstorming & Recommendations
- Microsoft ETL For Azure
- Data Cleaning Strategies
- Cost Saving Strategies
- Ease Of Use For End Users
- Building Data Pipeline
- Update Schedule Initialization
- Creation of Status Tables
- Data Pipeline Abstraction
- Seamless Transfer Of Data
- Automatic Data Transfers
- Schematically Changed Data
- Cost Saving