We have been using Azure Data Factory (ADF) at work for a while now. It’s been a great tool for ETL: a serverless service that scales out easily for data integration and data transformation. ADF has worked well for transforming the data from our new POS and loading it into our data warehouse for reporting in Power BI.
We have Dev, QA and Prod environments for our data factories. The main focus of this article is CI/CD for deploying data factories from Dev to QA and finally into Prod. There are a lot of prerequisites to actually do CI/CD for ADF.
Since this blog is rather long, it is being split into three parts:
- Part 1: Lessons Learned, Creating the Data Factory Resource and Configuring Source Control
- Part 2: Setting up Sample Resources, Creating your Pipeline and Publishing it
- Part 3: Configuring your CI/CD Pipeline, Deploying and Running your ADF Pipeline
Let’s get started with some lessons I learned in the process, then we’ll set up a new ADF resource.
CI/CD: Lessons learned for ARM Template deployments
Deploying ADF from Dev to QA and then to Prod can be done manually by exporting templates from Dev and importing them into QA and Prod. I am not a fan of manual deployments. Devs end up skipping vital steps (like checking in code, testing, etc.) when they deploy manually to QA and Prod. Or someone on the team wins the lottery and quits. Now everyone is standing around trying to figure out how to deploy, and they inevitably screw the deployment up before finally figuring it out.
Instead, I strictly deploy to QA and Prod via CI/CD pipelines. However, this can get you in hot water with ADF deployments. ADF is deployed via ARM templates, which are scoped to a Resource Group. This is GREAT except for one little setting on an ARM template deployment.
NOTE: If you follow my blog, some of this will be a repeat of a previous post.
Watch out for one setting called “Deployment Mode”. It has two options in the dropdown: “Incremental” and “Complete”.
See that little “Info” icon? You gotta read that fine print. You can’t screw this one up! Also tell everyone on your team about this.
If you accidentally switch this to “Complete” and then run your CI/CD pipeline, say GOODBYE to the other resources in the resource group that this deployment is scoped to. In Complete mode, the deployment deletes EVERYTHING in that resource group that is not defined in the template.
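To make the difference between the two modes concrete, here is a minimal sketch that models a resource group as a set of resource names. It does not call Azure at all; the function `simulate_deployment` and the resource names are made up for illustration.

```python
def simulate_deployment(existing, template, mode="Incremental"):
    """Return the resource names left in the resource group after a deployment.

    This is a toy model of ARM deployment modes, not a call to Azure.
    """
    existing = set(existing)
    template = set(template)
    if mode == "Incremental":
        # Template resources are created or updated; everything else is left alone.
        return existing | template
    if mode == "Complete":
        # Resources that are not defined in the template are deleted.
        return template
    raise ValueError(f"Unknown deployment mode: {mode!r}")


# Hypothetical resource group that holds more than just the data factory.
resource_group = {"adf-sample-dev", "sql-reporting", "storage-landing"}
adf_template = {"adf-sample-dev"}

print(sorted(simulate_deployment(resource_group, adf_template, "Incremental")))
print(sorted(simulate_deployment(resource_group, adf_template, "Complete")))
```

With “Incremental” all three resources survive. With “Complete” only the data factory is left, because it is the only resource the template defines.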
How do you prevent this?
- Lock your resources because sooner or later someone might screw this up. See: https://doylestowncoder.com/2021/09/06/azure-locking-resources/
- Use deployment credentials that have the “least privileges” to get the job done.
- Limit resources in the Resource Group that the deployment is scoped to.
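As a rough sketch of the first item, the snippet below builds the Azure CLI call that puts a `CanNotDelete` lock on a resource group. It only prints the command rather than running it. The lock name and resource group name are placeholders, not names from this walkthrough.

```python
import shlex


def lock_command(resource_group, lock_name="do-not-delete"):
    """Build the Azure CLI call that places a CanNotDelete lock on a resource group."""
    return [
        "az", "lock", "create",
        "--name", lock_name,              # placeholder lock name
        "--lock-type", "CanNotDelete",    # the other option is ReadOnly
        "--resource-group", resource_group,
    ]


# Print the command so it can be reviewed before anyone runs it.
print(shlex.join(lock_command("rg-adf-sample-prod")))
```

Keeping the command in a script means the lock can be put back the same way after it has been removed temporarily.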
When deploying via ARM Templates, we create one resource group per ADF resource. It seems like overkill, but it is much easier to recover from if someone makes this mistake. Now it could be argued that if you use locks then you are protected if this happens. True! But locks have to be removed temporarily to perform certain tasks, like deleting a database in SQL Server. So why risk it? It doesn’t cost any more to have an extra resource group.
So use the “INCREMENTAL” setting and lock your resources down. Then make sure the account deploying has the least privileges needed to get the job done.
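One way to keep “Complete” out of a pipeline is to build the deployment command in a script that refuses any other mode. This is a minimal sketch, assuming the deployment goes through the Azure CLI’s `az deployment group create`. The helper name, resource group and file names are hypothetical.

```python
import shlex

ALLOWED_MODE = "Incremental"


def deployment_command(resource_group, template_file, parameters_file, mode=ALLOWED_MODE):
    """Build an `az deployment group create` call, rejecting anything but Incremental."""
    if mode != ALLOWED_MODE:
        # Fail loudly instead of letting a Complete deployment reach Azure.
        raise ValueError(f"Refusing to deploy with mode {mode!r}; use {ALLOWED_MODE!r}")
    return [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
        "--parameters", f"@{parameters_file}",
        "--mode", mode,
    ]


# Hypothetical names; the command is printed, not executed.
print(shlex.join(deployment_command(
    "rg-adf-sample-qa",
    "ARMTemplateForFactory.json",
    "ARMTemplateParametersForFactory.json",
)))
```

Passing `mode="Complete"` raises a `ValueError` before any command is built, so the mistake shows up in the pipeline log instead of in the resource group.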
Now it is time to move on to creating the data factory resource…
Creating a Data Factory
In this section we will create our dev resource. This will prepare the environment for CI/CD.
In Azure, go to the “Data Factories” blade.
Once on the “Data Factories” blade, click the “Create” button to start the process. Now, remember that deployments are scoped to an Azure Resource Group, so choose wisely. We create one resource group per resource since the CI/CD will be using ARM Templates. We don’t want to nuke our entire prod environment.
Now, proceed to “Git Configuration”. I check the “Configure Git later” setting. Once the resource is deployed, we’ll configure Git, BUT only for the dev environment.
The next step is configuring Networking. Obviously follow the standards your Azure team uses here. It really depends on the security requirements you need to adhere to. This is just a sample so I am keeping it simple.
After networking, we’ll need to configure the encryption settings. By default, data is encrypted with Microsoft-managed keys. If your data is sensitive (e.g. PII, credit card info) then I would recommend using a customer-managed key instead. That way you control the key that protects your data.
Next are the Tags. Add the appropriate tags for your environment.
Finally, review the configurations you selected. Triple check they are correct before clicking “Create”.
Your ADF will start deploying and should be completed in a few minutes.
Now repeat the process and create another Data Factory for your QA environment. This way you will have a resource to deploy to later when we set up CI/CD.
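If clicking through the portal twice feels error-prone, the same empty factory can be described as a small ARM template. The sketch below generates one per environment. It is an illustration, not what the portal produces: the factory names, region and tag are assumptions, and Git, networking and encryption are left at their defaults.

```python
import json


def factory_template(factory_name, location, tags=None):
    """Return a minimal ARM template that defines one empty data factory."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.DataFactory/factories",
                "apiVersion": "2018-06-01",
                "name": factory_name,
                "location": location,
                "tags": tags or {},
                "identity": {"type": "SystemAssigned"},
                # No repoConfiguration here: Git is configured later, for dev only.
                "properties": {},
            }
        ],
    }


# Hypothetical names and region for the two factories created in this section.
for env in ("dev", "qa"):
    template = factory_template(f"adf-sample-{env}", "eastus", {"environment": env})
    print(json.dumps(template, indent=2))
```

Each template would go to its own resource group, which keeps the one-resource-group-per-factory rule from earlier.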
Source Control Configuration (DEV Only)
Now that your data factories are created, it is time to set up source control for your DEV environment. For this sample, I’ll configure this to use my SampleADF repo in my GitHub account. This is key for CI/CD. Navigate to your DEV Data Factory and open it in Studio.
Next, navigate to “Manage” (the last icon on the left menu). This will open the Manage console. Click “Git Configuration”.
Now click the “Configure” button in the middle of the screen and configure your source control. You’ll need to configure a few things like the “Repository Type”. The remaining settings vary based on the type of repository you select. For the sample, I’ll be using “Azure DevOps Git” as my repository.
On this step, you’ll need to configure your Organization, Project Name, Repository Name, Collaboration Branch and most importantly your Publish Branch. Note: If your branch already contains resources, they will be imported. If you do not want them imported, uncheck the checkbox.
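For reference, these settings end up in the factory’s `repoConfiguration` property. The sketch below shows roughly how the Azure DevOps fields map onto it and keeps the configuration on the dev factory only. The helper names and all values are placeholders. The publish branch is not part of this block (it defaults to `adf_publish`).

```python
def repo_configuration(organization, project, repository,
                       collaboration_branch="main", root_folder="/"):
    """Map the Azure DevOps Git settings onto a factory repoConfiguration block."""
    return {
        "type": "FactoryVSTSConfiguration",   # the Azure DevOps Git repository type
        "accountName": organization,          # Organization
        "projectName": project,               # Project Name
        "repositoryName": repository,         # Repository Name
        "collaborationBranch": collaboration_branch,
        "rootFolder": root_folder,
    }


def factory_properties(environment, repo=None):
    """Only the dev factory is connected to source control."""
    if environment == "dev" and repo is not None:
        return {"repoConfiguration": repo}
    return {}


# Placeholder organization and project names.
repo = repo_configuration("my-org", "my-project", "SampleADF")
print(factory_properties("dev", repo))
print(factory_properties("qa", repo))
```

QA and Prod stay empty here on purpose: they receive changes through the CI/CD pipeline, not through Git.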
Once configured, your Source Control settings should look similar to this:
As you are building your data factories, linked services, pipelines, data flows, etc., you’ll need to publish them. Each time you publish, your changes will be committed to the “publish” branch in your source control (named adf_publish by default). This is the branch you’ll use in your CI/CD pipeline to deploy to QA and PROD.
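Publishing writes an ARM template and a parameters file for the factory (`ARMTemplateForFactory.json` and `ARMTemplateParametersForFactory.json`) to that branch. Deploying to another environment mostly comes down to overriding parameters such as `factoryName`. Here is a minimal sketch of that idea. The helper and the factory names are hypothetical, and a real parameters file will usually contain more entries.

```python
import copy
import json


def override_parameters(parameters_doc, overrides):
    """Return a copy of an ARM parameters document with selected values replaced."""
    result = copy.deepcopy(parameters_doc)
    for name, value in overrides.items():
        if name not in result["parameters"]:
            # Catch typos early instead of silently adding a new parameter.
            raise KeyError(f"Parameter {name!r} is not in the parameters file")
        result["parameters"][name] = {"value": value}
    return result


# Trimmed-down stand-in for the parameters file published from the dev factory.
dev_parameters = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {"factoryName": {"value": "adf-sample-dev"}},
}

qa_parameters = override_parameters(dev_parameters, {"factoryName": "adf-sample-qa"})
print(json.dumps(qa_parameters["parameters"], indent=2))
```

The original document is copied rather than modified, so the same dev parameters can be reused for the Prod override.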
This blog delved into lessons learned from ARM Template deployments. Then it walked through the process of creating an Azure Data Factory for your Dev environment. It asked you to repeat the process since we’ll need two data factories when we build out the CI/CD pipeline. Finally, it walked through setting up source control on the data factory in your dev environment.
Part 2 will set up a sample storage account, show you how to set up and configure your linked services, and then build a sample copy tool in your Dev environment. Once those steps are completed, we’ll be ready to build out the CI/CD in Part 3 of this series.