Building CI/CD Pipelines with Azure Data Factory: Part 1

We have been using Azure Data Factory (ADF) at work for a while now. It’s been a great tool for serverless, scale-out data integration and data transformation. ADF has worked well transforming data from our new POS into our data warehouse for reporting in Power BI.

We have Dev, QA and Prod environments for our data factories. The main focus of this series is CI/CD for deploying data factories from Dev to QA and finally into Prod. There are quite a few prerequisites to actually doing CI/CD for ADF.

Since this blog is rather long, it is being split into three parts:

  • Part 1: Lessons learned, creating the Data Factory resource and configuring source control
  • Part 2: Setting up sample resources, creating your pipeline and publishing it
  • Part 3: Configuring your CI/CD pipeline, deploying and running your ADF pipeline

Let’s get started with some lessons I learned along the way, and then we’ll set up a new ADF resource.

CI/CD: Lessons learned for ARM Template deployments

Deploying ADF from Dev to QA and then to Prod can be done manually by exporting templates from Dev and importing them into QA and Prod. I am not a fan of manual deployments. Devs end up skipping vital processes (like checking in code, testing, etc.) when they manually deploy to QA and Prod. Or someone on the team wins the lottery and quits, and now everyone is standing around trying to figure out how to deploy, inevitably screwing the deployment up before finally figuring it out.

Instead, I strictly deploy to QA and Prod via CI/CD pipelines. However, this can get you in hot water with ADF deployments. ADF is deployed via ARM templates, which are scoped to a Resource Group. This is GREAT except for one little setting on an ARM template deployment.

NOTE: If you follow my blog, some of this will be a repeat of a previous post:

https://doylestowncoder.com/2021/10/06/azure-devops-arm-templates-how-to-ruin-your-day/

Watch out for one setting called “Deployment Mode”. It has two options in the dropdown:

  • Incremental
  • Complete

See that little “Info” icon? You’ve got to read that fine print. You can’t afford to screw this one up! Also, tell everyone on your team about this.

If you accidentally switch this to “Complete” and then run your CI/CD pipeline, say GOODBYE to the resources in the resource group this deployment is scoped to. As soon as the deployment runs, the first thing it does is nuke EVERYTHING in that resource group that is not defined in your template.

COMPLETE SETTING IS LIKE A NUKE

How do you prevent this?

When deploying via ARM templates, we create one resource group per ADF resource. It seems like overkill, but it is much easier to recover from just in case someone makes this mistake. Now, it could be argued that if you use locks then you are protected if this happens. True! But locks have to be removed temporarily to perform certain tasks, like deleting a database in SQL Server. So why risk it? It doesn’t cost any more to have an extra resource group.
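
If you do go the lock route, a CanNotDelete lock at the resource group level is a one-liner with the Azure management SDK. Here is a minimal sketch; the subscription ID, group name and note text are placeholders:

```python
# pip install azure-identity azure-mgmt-resource
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

subscription_id = "<your-subscription-id>"
lock_client = ManagementLockClient(DefaultAzureCredential(), subscription_id)

# CanNotDelete still allows reads and updates but blocks deletes --
# including the mass delete a "Complete" mode deployment would perform.
lock_client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name="rg-adf-prod",      # hypothetical group name
    lock_name="no-delete",
    parameters=ManagementLockObject(
        level="CanNotDelete",
        notes="Remove temporarily only for intentional deletes.",
    ),
)
```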

So use the “Incremental” setting and lock your resources down. Then make sure the account doing the deploying has the least privileges needed to get the job done.
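
For reference, here is a rough sketch of what that deployment looks like when scripted with the Azure Python SDK. The file names, resource group and deployment name are placeholders (assuming a standard exported ARM template and parameters file), but the `mode` property is the exact setting this whole section is warning about:

```python
# pip install azure-identity azure-mgmt-resource
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import (
    Deployment,
    DeploymentMode,
    DeploymentProperties,
)

subscription_id = "<your-subscription-id>"
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

with open("adf_arm_template.json") as f:    # exported ADF ARM template
    template = json.load(f)
with open("adf_arm_parameters.json") as f:  # environment-specific parameters
    parameters = json.load(f)["parameters"]

# INCREMENTAL leaves resources not named in the template alone.
# COMPLETE deletes everything in the resource group that the template
# does not define -- the "nuke" described above.
poller = client.deployments.begin_create_or_update(
    resource_group_name="rg-adf-qa",
    deployment_name="adf-cicd-deploy",
    parameters=Deployment(
        properties=DeploymentProperties(
            mode=DeploymentMode.INCREMENTAL,
            template=template,
            parameters=parameters,
        )
    ),
)
print(poller.result().properties.provisioning_state)
```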

Now it is time to move on to creating the data factory resource…

Creating a Data Factory

In this section we will create our dev resource. This will prepare the environment for CI/CD.

In the Azure portal, go to the “Data Factories” blade.

Once on the “Data Factories” blade, click the “Create” button to start the process. Remember that deployments are scoped to an Azure Resource Group, so choose wisely. We create one resource group per resource since CI/CD will be using ARM templates. We don’t want to nuke our entire Prod environment.

Now, proceed to “Git Configuration”. I check the “Configure Git later” setting. Once the resource is deployed, we’ll configure Git, BUT only for the Dev environment.

The next step is configuring networking. Follow the standards your Azure team uses here; it really varies based on the security requirements you need to adhere to. This is just a sample, so I am keeping it simple.

After networking, we’ll need to configure the encryption settings. By default, data is encrypted with Microsoft-managed keys. If your data is sensitive (i.e., PII or credit card info), then I would recommend using your own customer-managed key. This will prevent the operations team for Azure from looking at your data.

Next are the Tags. Add the appropriate tags for your environment.

Finally, review the configuration you selected. Triple-check it is correct before clicking “Create”.

Your ADF deployment will start and should complete in a few minutes.

Now repeat the process and create another Data Factory for your QA environment. This way you will have a resource to deploy to later when we set up CI/CD.
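
If you would rather script the resource creation than click through the portal, here is a rough sketch with the Azure Python SDKs that builds the one-resource-group-per-factory layout for both environments. The naming convention and region are hypothetical, and factory names must be globally unique:

```python
# pip install azure-identity azure-mgmt-resource azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"
credential = DefaultAzureCredential()
rg_client = ResourceManagementClient(credential, subscription_id)
adf_client = DataFactoryManagementClient(credential, subscription_id)

# One resource group per data factory, so a bad Complete-mode deployment
# can only take out that one factory's group.
for env in ("dev", "qa"):
    rg_name = f"rg-adf-{env}"           # hypothetical naming convention
    factory_name = f"adf-sample-{env}"  # must be globally unique
    rg_client.resource_groups.create_or_update(rg_name, {"location": "eastus"})
    adf_client.factories.create_or_update(rg_name, factory_name, Factory(location="eastus"))
    print(f"created {factory_name} in {rg_name}")
```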

Source Control Configuration (DEV Only)

Now that your data factories are created, it is time to set up source control for your DEV environment. For this sample, I’ll configure it to use my SampleADF repo in Azure DevOps. This is key for CI/CD. Navigate to your DEV Data Factory and open it in Studio.

Next, navigate to “Manage” (the last icon on the left menu). This will open the management console. Click “Git configuration”.

Now click the “Configure” button in the middle of the screen and configure your source control. You’ll need to configure a few things, like the “Repository Type”; the remaining settings will vary based on the type of repository you select. For the sample, I’ll be using “Azure DevOps Git” as my repository type.

On this step, you’ll need to configure your Organization, Project Name, Repository Name, Collaboration Branch and, most importantly, your Publish Branch. Note: if your branch already has existing resources, they will be imported. If you do not want them imported, uncheck the checkbox.

Once configured, your Source Control settings should look similar to this:

As you are building your data factories, linked services, pipelines, data flows, etc., you’ll need to publish them. Each time you publish, your changes will be committed to the publish branch (adf_publish by default) in your source control. This is the branch you’ll use in your CI/CD pipeline to deploy to QA and PROD.
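
For completeness, the same Git configuration can be scripted through the Data Factory management SDK instead of Studio. A minimal sketch with placeholder org/project/factory names follows; note that the publish branch itself defaults to adf_publish and is overridden with a publish_config.json file in the repo rather than through this call:

```python
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FactoryRepoUpdate,
    FactoryVSTSConfiguration,
)

subscription_id = "<your-subscription-id>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Azure DevOps Git settings, mirroring the fields on the Studio form.
repo = FactoryVSTSConfiguration(
    account_name="<your-devops-org>",
    project_name="<your-project>",
    repository_name="SampleADF",
    collaboration_branch="main",
    root_folder="/",
)
adf_client.factories.configure_factory_repo(
    location_id="eastus",  # region of the factory
    factory_repo_update=FactoryRepoUpdate(
        factory_resource_id=(
            f"/subscriptions/{subscription_id}/resourceGroups/rg-adf-dev"
            "/providers/Microsoft.DataFactory/factories/adf-sample-dev"
        ),
        repo_configuration=repo,
    ),
)
```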

Conclusion

This blog delved into lessons learned from ARM Template deployments. Then it walked through the process of creating an Azure Data Factory for your Dev environment. It asked you to repeat the process since we’ll need two data factories when we build out the CI/CD pipeline. Finally, it walked through setting up source control on the data factory in your dev environment.

Part 2 will set up a sample storage account, show you how to set up and configure your linked services, and then build a sample copy tool in your Dev environment. Once those steps are completed, we’ll be ready to build out the CI/CD pipeline in Part 3 of this blog.

FiveTran: Square Data Connector

This blog covers our experiences using FiveTran’s Square Data Connector and syncing to an Azure SQL Database.

We are currently using Square as our POS (Point of Sale). This POS has been phenomenal compared to our experience with NCR Aloha. There are many, many pluses to using Square over Aloha, but that is a different blog for another day. However, Aloha did have SQL Replication as a feature, so we could easily replicate sales data to an on-premises SQL Server or to the cloud using Azure SQL Database. Square does not have this feature out of the box.

If your sales data is small enough, it might be possible to build Power BI reports directly against Square’s APIs or SDK and still get good performance. But if you have a large volume of data or need data transformations, then more than likely you will need a third-party tool or to roll your own solution.

Prior to rolling out Square, I did set up the Azure Square Data Connector (https://docs.microsoft.com/en-us/azure/data-factory/connector-square?tabs=data-factory); however, it did not work well. At the time it was in preview and the database schema was downright terrible. We abandoned that approach and started using the Square APIs to retrieve the data for our reports. This was a band-aid and not a long-term solution. We are a high-volume retail chain with each store averaging 1,500 transactions a day, so this approach was not feasible in the long term.

In addition, we were a small team, so we needed a quick, easy fix. Square recommended FiveTran. I was a little skeptical at first when I checked out FiveTran, but it seemed like the best of the options out there at the time.

Advantages:

  • Extremely easy to set up and configure. You only need your Square API key and the connection info for your destination.
  • The database schema that FiveTran creates is clean and easy to understand.
  • You do not need to learn the intricacies of the Square API or Square SDK.

Disadvantages:

  • You are at the mercy of FiveTran when they introduce bugs. OK, so that’s true with any third-party tool. We had a painful three weeks back in September 2021 when a new enhancement they rolled out did not work so well: constant timeouts with syncs, etc. The solution was to start an entirely new Data Connector, which took about 36 hours to reload a year’s worth of data.
  • Square Offline Mode – oh, what a pain when trying to keep your data synced for reporting purposes.
  • Immature tools – you can’t run a sync for a specific time period to pick up missing data. This is a problem when using Square Offline Mode.
  • Square’s Loyalty, Customer and Catalog APIs do not let you query using a last-updated filter; therefore at least one sync per day is rather long because it has to re-pull all of this data. We have over 200,000 customers and almost 50,000 items in our catalog. Hopefully Square will fix that so FiveTran can adjust their Data Connector.

Configuring FiveTran

FiveTran has plenty of documentation on setting up your Data Connector. Depending on the security setup for your destination, don’t forget to whitelist the APIs as outlined in their documentation.

Once you set up your Data Connector, check out the Settings configuration:

Yea, that’s it. Seriously. There is no tool to rerun a sync for the last week, which would be VERY helpful if some of your registers ran in offline mode. That was a real letdown.

FiveTran Square Data Schema

Here is a quick link to the DB Schema that will be created by FiveTran:

Square Offline

Square Registers can be configured to run in offline mode. Offline mode allows you to take swiped credit card payments when your internet is temporarily unavailable. However, the payments are not processed until the register is back online. See this article for reference:

I highly recommend offline mode if your internet is spotty. For instance, if you have Comcast in your retail locations, then you probably really want offline mode turned on. This means you can still take credit/debit cards up to X dollars, based on how you have offline mode configured. However, there is a catch. If a register is in offline mode, you need to get it back online ASAP. You have 72 hours to get those transactions up to Square for it to process the credit cards; otherwise you run the risk of those transactions being “cancelled”, thus losing sales. In addition, once a payment is actually processed by Square, it might get declined due to insufficient funds. Offline mode is a risk the business owner needs to fully understand and accept before turning it on.

There are other ways around using offline mode, or at least of minimizing its risk. You can keep a 5G MiFi on hand so that when your internet is down you can switch your registers to it. The 5G MiFi works GREAT! We actually ran 10 registers on one for over a month while waiting for Verizon FIOS to be set up at a store. Or you can go a little more high-tech and set up a failover Cradlepoint router for your internet connection. When the internet is out, this will fail over to a wireless data plan until your internet is back up.

Square API Problem

The FiveTran Data Connector has a sweeper. It exists to minimize a Square API problem with Square Offline Mode and any latency in transactions being processed. When transactions are uploaded to Square from the register, Square uses the date/time of the transaction on the register. So if a transaction is processed on Monday, but the register does not get back online until Tuesday, FiveTran will not sync that transaction, since it already synced the timeframe in which the transaction was processed offline.

When the sync runs, it syncs the data for the appropriate timeframe based on your configuration. If your Data Connector is configured to sync every 6 hours, then it syncs the last 6 hours of data as well as a 6-hour timeframe from 72 hours ago. In theory this should work well, BUT during that gap your reports will not match the Square Dashboard’s sales data, and business users will complain…
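
To make the sweeper behavior concrete, here is a small sketch of the window math as I understand it; the 6-hour interval and 72-hour lookback mirror the example above, not FiveTran’s documented internals:

```python
from datetime import datetime, timedelta, timezone

SYNC_INTERVAL = timedelta(hours=6)    # assumed connector schedule
SWEEP_LOOKBACK = timedelta(hours=72)  # re-check window for late offline uploads

def sync_windows(now: datetime) -> list[tuple[datetime, datetime]]:
    """Return the two windows a single sync run covers: the most recent
    interval, plus the same-sized interval ending 72 hours ago."""
    recent = (now - SYNC_INTERVAL, now)
    sweep_end = now - SWEEP_LOOKBACK
    sweep = (sweep_end - SYNC_INTERVAL, sweep_end)
    return [recent, sweep]

for start, end in sync_windows(datetime.now(timezone.utc)):
    print(f"sync {start:%Y-%m-%d %H:%M} -> {end:%Y-%m-%d %H:%M}")
```

A transaction uploaded more than 72 hours after it happened falls outside both windows, which is exactly the gap described next.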

Also, if a register is offline for more than 72 hours (which can happen), those transactions are never synced. EVER! Not until you manually intervene. See the architecture below for how I handled it.

This is an issue with Square. Square should really expose a date in the API for when the transaction is uploaded to the Square cloud, and then allow filtering by that date. The FiveTran Data Connector could then sync on that date, which would catch all offline transactions even if they were uploaded days later.

In addition, FiveTran could build a tool that lets you sync data for a specific timeframe. Then you could easily run it to sync transactions missed due to offline mode. But that would just be a band-aid requiring manual intervention.

Buggy Upgrade

Unfortunately, FiveTran introduced a bug in their Square Data Connector with a release in September 2021. This was a rather painful experience: syncs were timing out, data was missing, reports were wrong, etc. The database schema upgrade was the issue; it would time out, then somehow magically work, time out again, then work, and so on…

This led us to the decision to build our own sync tool as a “safety net” for FiveTran. We built it using the Square API and Azure Data Factory. It took several weeks to build but was well worth the time. As of now, we are still running the FiveTran Square Data Connector, but in addition we run our sync tool throughout the day and use FiveTran as a final sync for the day.
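
The heart of any such safety net is a time-windowed pull from Square’s List Payments endpoint. Here is a minimal Python sketch of that piece; the endpoint, query parameters and headers are Square’s public API, while the token environment variable and the example window are hypothetical:

```python
# pip install requests
import os

import requests

SQUARE_TOKEN = os.environ["SQUARE_ACCESS_TOKEN"]  # hypothetical env var
PAYMENTS_URL = "https://connect.squareup.com/v2/payments"

def fetch_payments(begin_time: str, end_time: str) -> list[dict]:
    """Pull every payment in the given RFC 3339 window, following
    Square's pagination cursor until it is exhausted."""
    payments, cursor = [], None
    while True:
        params = {"begin_time": begin_time, "end_time": end_time, "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(
            PAYMENTS_URL,
            params=params,
            headers={
                "Authorization": f"Bearer {SQUARE_TOKEN}",
                "Square-Version": "2021-10-20",  # pin an API release date
            },
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        payments.extend(body.get("payments", []))
        cursor = body.get("cursor")
        if not cursor:
            return payments

# Re-sync a window that offline registers may have uploaded late:
missing = fetch_payments("2021-09-01T00:00:00Z", "2021-09-02T00:00:00Z")
```

Because you choose begin_time and end_time yourself, you can rerun any window where offline registers uploaded late, which is exactly the manual re-sync FiveTran doesn’t offer.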

After three weeks of FiveTran sync issues, we created a new Square Data Connector pointing to a new Azure SQL Database. This created the new schema and synced to our database within 36 hours. It was then rather simple for us to switch reporting to the new database, which solved the issues we were having with FiveTran syncing.

Architecture

We currently have our architecture implemented as shown below.

It has the following advantages:

  • We are a small team, and we get the best of both worlds: FiveTran syncs everything to our FiveTran Square database while our “Square API Pipeline” syncs just sales data into our Data Warehouse. If either process goes down, we are still getting sales into our Data Warehouse and keeping reports up to date in a timely fashion.
  • We can manually kick off the Square API Sales Pipeline for a specified time period (as in the sketch above), easily cleaning up any missing (offline mode) sales/transactions. Currently we cannot do that using the FiveTran Data Connector.

With a larger team, I would probably change my architecture. I would remove FiveTran completely from the solution to eliminate that expense and probably just use Azure Data Factories. The results from the Square API would be pumped into a Data Lake. Then another ETL tool would transform the data into a Data Warehouse for reporting. This architecture should allow us to easily bolt on additional AI/analytics tools using the raw data in the Data Lake or in the ETL tool. In addition, it would also allow us to reload the Data Warehouse (without reloading data from Square) in the event that was needed for a feature request to the ETL pipeline.

Conclusion

Hopefully you found this useful if you are building your own custom Power BI Dashboards for your Square POS data.
