Building CI/CD Pipelines with Azure Data Factory: Part 3

In Part 1, I delved into lessons learned, creating the Data Factory Resources and configuring Source Control. Here is a link to Part 1.

In Part 2, I covered setting up sample resources, creating your Data Factory Pipeline in a Dev Environment and Publishing it. Here is a link to Part 2.

Now it’s time to create our Release Pipeline in Azure DevOps. This will be the CI/CD Pipelines to easily deploy out Azure Data Factories across our environments!

Note: The following resources should of already been created for your QA environment:

  • Data Factory
  • Key Vault
  • Storage Account

Configure Containers in QA Storage Account

Your storage account will need to mimic your DEV Storage Account. Make sure it is identical. For this sample I had to do the following in the QA Storage Account:

  • Created the “copyto” and “copyfrom” containers
  • Uploaded the products.csv file to the “copyfrom” container.

Add the Secret to your QA Key Vault

You need to setup your connection string in your QA Key Vault. First, grab your connection string from your QA Storage Account. This is under the “Access Keys” menu.

Just in case you are new to Storage Account, take a look at the “Rotate Key” button. Also note that there are two keys: key1 and key2. Why? You can use either connection string for key1 or key2. This allows you to constantly rotate your keys as needed for security purposes. For instance, you can have a two week rotation cycle:

  • Week 1 ->
    • rotate key1
    • update Key Vault with the new connection string for key1
    • wait pre-determined time for key to be propagated to all dependent resources
    • rotate key2 since resources are no longer using it
  • Week 2 ->
    • rotate key2
    • update Key Vault with the new connection string for key2
    • wait pre-determined time for key to be propagated to all dependent resources
    • rotate key1 since resources are no longer using it
  • Repeat

If a key is compromised, it is only a matter of time before it no longer works. This rotation also allows you to quickly change the keys if your system is hacked. This entire process can be scripted and automated using PowerShell.

Now, let’s add the connection string to your QA Key Vault:

The name of the key must be identical to the name in Dev.

Once you have created the key, then you are ready to move on to adding the access policy.

Add Access Policy in Key Vault

Your QA Data Factory needs to have access to your Key Vault. In Part 2, we set this up for Dev. Now, we need to do it for QA. Start by clicking the “Add Access Policy” button.

This is a repeat of what we did before. I am only setting the access to allow my QA Data Factory to “List” secrets and “Get” secrets. It does not need any other access. Next, select the principal that needs access. This principal is your QA Data Factory. Then click “Add”

Finally, click “Save” which will save the new access policy for your Key Vault.

Data Factory Pipeline Triggers

In most real-world applications, your Data Factory Pipelines will be running based on scheduled triggers. When deploying, you need to make sure the triggers are stopped before deployment and started after deployment. Luckily we don’t have to write that logic. Microsoft already did. You just need to add a PowerShell script to your “adf_publish” branch in your repo. See the following link:

This script actually does a lot more than stopping and starting triggers. Directly from the link above:

The following sample script can be used to stop triggers before deployment and restart them afterward. The script also includes code to delete resources that have been removed. Save the script in an Azure DevOps git repository and reference it via an Azure PowerShell task the latest Azure PowerShell version.

Now add the script to your “adf_publish” repo.

Finally we are ready to create our CI/CD Pipeline.

Create your CI/CD Pipeline

Now let’s go setup Azure DevOps to deploy. Start by navigating to “Pipelines” in your project in Azure DevOps. Then head over to “Releases” and click “New Pipeline”.

We need an empty job. This will allow us to configure the pipeline as needed. So click “Empty Job”.

We will need to configure the stage. I kept the defaults since this is just a sample so I just closed the dialog.

Now add the artifact to the stage. We are deploying what was published by the DEV Data Factory. Remember, in DEV, when the Data Factory publishes the changes, it commits the changes to the “adf_publish” branch. We are not deploying/publishing an actual build like you would when deploying an App Service. Instead, we are just deploying ARM templates that are in the Repo branch so we need to pick the Azure Repo Project and Source. Then select the “Default Branch” which will be “adf_publish”. Finally, save the artifact.

Now the artifect is configured so it is time to setup the actual job. Click the “1 Job, 0 Task” link.

Now let’s add three tasks to this job. Here are the following tasks we will create:

  • First Task: Azure PowerShell Script (this will stop the triggers)
  • Second Task: ARM Template Deployment (this will deploy the ARM templates)
  • Third Task: Azure PowerShell Script (this will start the triggers)

First Task

You will need to click the “+” to search for the task to add.

This task of this job will be configured to stop the triggers. You’ll need to add an “Azure PowerShell” task.

Once you have added the task, you will see it added under your job.

Now click on the task and edit it. I changed the “Display Name” to include “Pre-Deployement”. Then you’ll need to select your Azure Subscription. The next step is picking the location of the PowerShell script that you added to your repo. You can easily do this by clicking the eclipse.

After that you need to fill in the script arguments. When running it for pre-deployment, the syntax for the arguments is the following:

-armTemplate "$(System.DefaultWorkingDirectory)/<your-arm-template-location>" -ResourceGroupName <your-resource-group-name> -DataFactoryName <your-data-factory-name> -predeployment $true -deleteDeployment $false

You ARM template location is in the “adf_publish” branch in your repo. You need the path including ARMTemplateForFactory.json file.

Here is the configuration. Also note, that I selected the latest version for the PowerShell version.

Now, save the job and let’s add the next task.

Second Task

We’ll need to add an “ARM template deployment” task.

After clicking add, configure the “Azure Details including the subscription, resource group and location. The action should be set to the default which is “Create or update resource group”.

Now scroll to the template section. Using the eclipse, select the the location of the ARM Template and the Template Parameters. Next click the eclipse for the “Override Parameters”. See the next screen shot.

Here is where the secret sauce happens. This is where everything comes together. You will need to adjust the parameters using the Override Parameters dialog. The factory name will be the resource name of your Data Factory in QA. The Key Vault will be the URL for the QA Key Vault. When it deploys, the ARM template will be deployed to the resource you set for factoryName and point it to the Key Vault you configured. Pretty sweet!

Finally, “Save” the task.

Third Task

This is really just a clone of the first task along with changing the boolean parameters. First, let’s clone the task.

Then move the ARM Template task in between the “Pre-Deployment” and the “Pre-Deployment copy” as below.

Now update the name to be “Post Deployment”. Then change the -predeployment flag to $false and the -deleteDeployment flag to $true.

Save! Your three tasks should look like this:

We are finally ready to deploy to QA!

Deploy

Let’s deploy it. First, I opened up the Azure Data Factory Studio. As you can see there are no pipelines, datasets, etc…

Then I close that tab. Now let’s go deploy! We need to create a release.

You will need to select the “Stage” and the “Version” of the “Artifact”. Basically this git hash of what you want to deploy.

Once you click “Create”, it will queue the release.

Then I go into the “Release” and click “Deploy”.

After a few minutes, your deployment will complete if you did everything properly.

Now open Azure Data Factory Studio (if already open, refresh it) for your QA Data Factory. Notice that your pipeline and datasets were deployed.

Then go check out your Key Vault. Notice that it points to your QA environment so the parameters were properly deployed.

Pretty exciting! We are finally ready to test the Data Factory Pipeline.

Run your QA Data Factory Pipeline

Notice that nothing is in the “copyto” folder in the QA Storage Account.

Back in Azure Data Factory Studio, open up the pipeline and go to your pipeline. Then click “Trigger Now”.

Most pipelines will require parameters however our pipeline does not. It’s just a simple sample. In order to run it, just click “Ok”.

You can verify it ran correctly in the “Pipeline Runs” under “Monitor”. Notice that the pipeline ran just fine.

But the proof is seeing that the file now exists in the “copyto” folder in the QA Storage Account. Nothing like the sweet smell of success!!!

Finally we have built out a DEV and QA environment with CI/CD for your data factory. You’ll have to do that again for your production environment but that should be a lot simpler now that you have QA working.

Conclusion

I really hope you have found this three part series really helpful for setting up your CI/CD Release Pipelines for your Azure Data Factory. I know this has been a lot to read through and setup up.

References

Here is a list of references:

Building CI/CD Pipelines with Azure Data Factory: Part 2

In Part 1, I delved into lessons learned, creating the Data Factory Resources and configuring Source Control.

Part 2 covers setting up sample resources, creating your Data Factory Pipeline in a Dev Environment and Publishing it.

Our Sample

Our Data Factory pipeline is going to be pretty simple. We’re just going to move a file from one location to another location in a storage account. Simple and easy sample but this article is more about DevOps then some really cool Data Factory.

Preparing for CI/CD

You need to think through all your resources your data factory will need access to like connection strings, storage accounts, etc. You could create them as parameters and replace them in the CI/CD pipelines. Not a bad idea but if your secrets change, you’ll have to redeploy.

I have seen a sample of Azure DevOp pipelines that take CSV files of secrets for a Data Factory. It uses that file to replace strings in the ARM Templates. I really hate that idea. I can’t even believe someone is suggesting that. In my opinion, using that solution is storing connection strings and secrets in places you don’t really want them. Next thing you know those CSV files are check in to the repo and then everyone on your team has passwords to prod.

Instead, I use KeyVault. It is much simpler approach. It is rather easy to tell CI/CD which key vault to use instead of storing parameters in a CI/CD Pipeline. It’ll simplify the deployment process and when a connection string or secret changes then you can easily update it in the appropriate key vault. Next time the pipeline is triggered, it’ll get the new secret from Key Vault. Easy to manage!

Setup a Storage Account

So let’s go ahead and setup a storage account. I am not going to go into a lot of details on this. Since this is a sample, we’ll keep it simple. I picked the redundancy to be “Locally-redundant storage” which is the cheapest. Then I clicked through the rest of the steps accepting all the defaults.

For this sample I created a DEV and QA storage account.

Setup a Key Vault

Again I am keeping this pretty simple. I’ll assume the reader knows how to create a Key Vault. I pretty much created the Key Vault with the default config.

For this sample I created a DEV and QA Key Vault. Once you have the Key Vault resource setup for each environment, then configure the secrets (using the same Secret Name) in each Key Vault making sure the secret is configured appropriately for that environment.

We are ready to setup our data factory. The first step is configuring the “Linked Services”

Configure “Linked Services” in ADF

Let’s setup a pipeline in our DEV data factory and publish it. As previously stated, we are not going to do anything exciting. We’re just going to create a pipeline that copies a CSV file from one location to another. I know you’re seriously disappointed but this blog is already long enough.

First let’s add our “Linked Services”. We want to add a linked service for our Dev Key Vault resource. You’ll need to go to “Linked Services” under “Manage” on the left menu.

Next click “+ New” and search for “Key Vault”. Click the icon for “Azure Key Vault” in the search result.

This will open the settings that you will need to configure for your Azure Key Vault. Name is just the name of the linked service. I leave the “Authentication Method” set to “Managed Identity”. Then I select “Enter Manually” for the “Azure key vault selection method”. This “Base URL” is a parameter you setup below. Under parameters, I setup the default value for the parameter. The Azure DevOps pipeline will pass the Key Vault in as this parameter. I’ll show you this in Part 3. But don’t save yet! There is still one critical piece you need to do so read the next section. We’ll come back and save this in a minute.

Now you need to open a new tab and head over to your Dev Key Vault to add an access policy to your Dev ADF. This is necessary for your Dev Data Factory to have access to your secrets. In your Key Vault, click “Access policies”.

The next step is adding an access policy. Click “Add Access Policy”. I only give Data Factory access to get and list secrets. Your scenario may vary but for this sample we are building, we only need to get the secret. Regarding the principal on the left, search for your Data Factory resource, click it and then click the “Select” button. Now you are ready to click the “Add” button on the left.

Wait a second! There is yet another button to click. You MUST click the “Save” button. I forget to do this all the time. The previous page said “Add”. That means add and save it to me. But not here! You still need to click that “Save” button.

Now go back to your Data Factory and test your connection. It should light up green. If so, then click “Create” button.

Let’s add a Linked Service for the Storage Account. This will be a little easier to create based on how the connection string is being handled. In this sample we are linking this to the “Secret” we created in the Key Vault. This is not the most secure way to connect to a storage account but for purpose of the sample it works well. We are linking this to the Azure Key Vault “Linked Service” we just created.

Test your connection. It should light up green. If so, then click “Create” button.

Our linked services are ready! They should look similar to the screen shot below.

Create the Data Factory Pipeline

Let’s start by creating a simple pipeline using the “Copy Data Tool”.

First you have to setup the Source. Notice how we select the linked service for the Storage Account. You’ll need to configure the “File or Folder” location to copy from.

Next we have to pick the Target. We’ll use the same storage account. In addition, we will enter in the “Folder path” to copy too.

The next step requires setting up the file format for the Target.

Next is the settings where we set the Task Name.

Now review the summary…

Then click “Next” which kicks of creating the copy pipeline.

Once the deployment is complete, the pipeline is created. You can click the “Edit Pipeline” button to view the new pipeline.

Here is the critical piece which is publishing the pipeline. This commits the pipeline to source control in the publish branch. This branch is used by Azure DevOps for deploying to the QA and Prod environments.

After clicking “OK”, the pipeline and datasets are published along with the linked services.

Check out your repo and you’ll see the commit for the publish you just completed.

Now let’s trigger the pipeline to run which will copy the file.

After running the trigger the file is copied into the “copyto” folder. So amazing, right! Ok, still just a quick sample…

Conclusion

Now we have a working pipeline in Dev. Part 3 will demonstrate how to setup Azure DevOps and deploy to your QA environment.

Building CI/CD Pipelines with Azure Data Factory: Part 1

We have been using Azure Data Factory (ADF) at work for a while now. It’s been a great tool for ETL services that can easily scale-out serverless data integration and data transformations. ADF has worked well to transform the data of our new POS into our data warehouse for reporting in Power BI.

We have a Dev, QA and Prod environment for our data factories. The main focus of this article is diving into CI/CD for deploying Data Factories from Dev to QA and finally into Prod. There are a lot of pre-requisites to actually do CI/CD for ADF.

Since this blog is rather long, it is being split into three parts:

  • Part 1: Lessons Learned, Creating the Data Factory Resource and Configuring Source Control.
  • Part 2: Setting up sample resources, creating your pipeline and publishing it
  • Part 3: Configuring your CI/CD Pipeline, Deploying and Running your ADF Pipeline

Let’s get started with some lessons I learned in the process then we’ll start setting up a new ADF resource.

CI/CD: Lessons learned for ARM Template deployments

Deploying ADF from Dev to QA then to Prod can be done manually by exporting templates from Dev and then importing them into QA and Prod. I am not a fan of manual deployments. Devs end up skipping vital processes (like checking in code, testing, etc.) when they manually deploy to QA and Prod. Or someone on the team wins the lottery and quits. Now everyone is standing around trying to figure out how to deploy and inevitably screw the deployment up before finally figuring it out.

Instead, I strictly deploy to QA and Prod via CI/CD pipelines. However, this can get you in hot water with ADF deployments. ADF is deployed via ARM templates which are scoped to a Resource Group. This is GREAT except for one little setting on a ARM template deployment.

NOTE: If you follow my blog some of this will be repeat from a previous post:

https://doylestowncoder.com/2021/10/06/azure-devops-arm-templates-how-to-ruin-your-day/

Watch out for one setting called “Deployment Mode”. It has two options in the dropdown:

  • Incremental
  • Complete

See that little “Info” icon. You gotta read that fine print. You can’t screw this one up! Also tell everyone on your team about this.

If you accidentally switch this to “Complete” and then run your CI/CD pipeline, say GOOD BYE to all your resources in the resource group that this deployment is scoped too. Because as soon as you click deploy, the first thing this does is nuke EVERYTHING in that resource group.

COMPLETE SETTING IS LIKE A NUKE

How do you prevent this?

When deploying via ARM Templates, we create one resource group per ADF resource. Seems like an over kill but much easier to recover from just in case someone makes this mistake. Now it could be argued that if you use locks then you are protected if this happened. True! But locks have to be removed temporarily to perform certain tasks like deleting a database in SQL Server. So why risk it. Doesn’t cost any more to have an extra resource group.

So use “INCREMENTAL” setting and lock your resources down. Then make sure the account deploying has the least privileges to get the job done.

Now it is time to move on to creating the data factory resource…

Creating a Data Factory

In this section we will create our dev resource. This will prepare the environment for CI/CD.

In Azure, go to “Data Factories” blade.

Once on “Data Factories” blade, click the “Create” button to start the process. Now, remember that deployments are scoped to an Azure Resource Group so choose wisely. We create one resource group per resource since the CI/CD will be using ARM Templates. We don’t want to nuke our entire prod environment.

Now, proceed to the “Git Configuration”. I check the “Configure Git later” setting. Once the resource is deployed, then we’ll configure git BUT only for the dev environment.

The next step is configuring Networking. Obviously follow the standards your Azure team uses here. It really varies on the security you need to adhere to. This is just a sample so I am keeping it simple.

After networking, we’ll need to configure the encryption settings. By default data is encrypted with Microsoft-managed keys. If your data is is sensitive (ie. PII, Credit Card info) then I would recommend using your own managed key. This will prevent the operations team for Azure from looking at your data.

Next are the Tags. Add the appropriate tags for your environment.

Finally, review the configurations you select. Triple check they are correct before clicking “Create”

Your ADF will start to be deployed and should be completed in a few minutes.

Now repeat the process and create another Data Factory for your QA environment. This way you will have a resource to deploy to later when we setup CI/CD.

Source Control Configuration (DEV Only)

Now that your data factories are created, it is time to setup source control for your DEV environment. For this sample, I’ll configure this to use my SampleADF repo in my GitHub account. This is key for CI/CD. Navigate to your DEV Data Factory and open it in Studio.

Next navigate to “Manage” (last icon on the left menu). This will open up to manage console. Tap “Git Configuration”.

Now click the “Configure” button in the middle of the screen and configure your source control. You’ll need to configure a few thinks like the “Repository Type”. This will vary based on the type of repository you select. For the sample, I’ll be using “Azure DevOps Git” as my repository.

On this step, you’ll need to configure your Organization, Project Name, Repository Name, Collaboration Branch and most importantly your Publish Branch. Note: if your branch already has existing resources then they will be imported. If you do not want them imported then “uncheck” the checkbox.

Once configured, your Source Control settings should look similar to this:

As you are building your data factories, linked services, pipelines, data flows, etc, you’ll need to publish them. Each time you publish, your changes will be committed to the “publish” branch in your source control. This is the branch you’ll use in your CI/CD pipeline to deploy to QA and PROD.

Conclusion

This blog delved into lessons learned from ARM Template deployments. Then it walked through the process of creating an Azure Data Factory for your Dev environment. It asked you to repeat the process since we’ll need two data factories when we build out the CI/CD pipeline. Finally, it walked through setting up source control on the data factory in your dev environment.

Part 2 will setup a sample storage account, show you how to setup and configure your linked services, and then build a sample copy tool in your Dev environment. Once those steps are completed, we’ll be ready to build out the CI/CD in Part 3 of this blog.

FiveTran: Square Data Connector

This blog covers our experiences using FiveTran’s Square Data Connector and syncing to an Azure SQL Database.

We are currently using Square as our POS (Point of Sale). This POS has been phenomenal when compared to our experience with NCR Aloha. Many, many pluses for using Square over Aloha but that is a different blog for another day. However, Aloha did have SQL Replication as a feature so we could easily replicate sales data to an on premise SQL Server or to the cloud using Azure SQL Database. Square does not have this feature out of the box.

It might be possible if your sales data is small enough that you can use Square’s APIs or SDK to build Power BI reports and perform well. But if you have a large volume of data or need data transformations, then more than likely you will need a third party tool or roll your own solution.

Prior to rolling out Square, I did setup the Azure Square Data Connector (https://docs.microsoft.com/en-us/azure/data-factory/connector-square?tabs=data-factory); however it did not work well. At the time it was in preview and the database schema was downright terrible. We abandoned that approach and started using Square APIs to retrieve the data for our reports. This was a band-aid and not a long term solution. We are a high volume retail chain with each store averaging 1500 transactions a days so this approach was not feasible in the long term.

In addition, we were a small team so we needed a quick easy fix. Square recommended FiveTran. I was a little skeptical at first when I checked out FiveTran. But it seemed like the best when compared to other options out there at the time.

Advantages:

  • Extremely easy to setup and configure. You only need your Square API Key and the connection info to your destination.
  • Database Schema that FiveTran creates is clean and easy to understand.
  • You do not need to learn the intricacies of the Square API or Square SDK

Disadvantages:

  • You are at the mercy of FiveTran when they introduce bugs. Ok, so that’s true with any third party tool. We had a painful three weeks back in September 2021 when a new enhancement they rolled out did not work so well. Constant timeouts with syncs, etc… Solution was to start an entire new Data Connector. That took about 36 hours to reload a years worth of data.
  • Square Offline Mode – oh what a pain when trying to keep your data synced for reporting purposes.
  • Immature tools – you can’t run a sync for a specific time period for missing data. This is a problem when using Square Offline Mode.
  • Square’s Loyalty, Customer and Catalog APIs do not let you query using a last updated filter; therefore at least one sync per data is rather long to re-pull all of this data. We have over 200,000 customers and almost 50,000 items in our catalog. Hopefully Square will fix that so FiveTran can adjust their Data connector.

Configuring FiveTran

FiveTran has plenty of documentation on setting up your Data Connector. Depending on your security setup for your destination, don’t forget to whitelist the APIs as outlined in this documentation:

Once you setup your Data Connector, check out the Settings configuration:

Yea, that’s it. Seriously. No tool to rerun a sync for the last week which would be VERY helpful if some of your registers ran in offline mode. That was a real let down.

FiveTran Square Data Schema

Here is a quick link to the DB Schema that will be created by FiveTran:

Square Offline

Square Registers can be configured to run in offline mode. Offline mode allows you to take swiped credit card payments when your internet is temporary unavailable. However, the payments are not processed until the register is back online. See this article for reference:

I highly recommend offline mode if your internet is spotty. For instance, if you have Comcast in your retail locations then you probably really want Offline Mode turned on. This means you can still take credit card/debit cards up to X dollars based on how you have offline mode configured. However, there is a catch. If a register is in offline mode, you need to get it back online ASAP. You have 72 hours to get those transactions up to to Square for it to process the credit cards otherwise you run the risk of those transactions being “cancelled” thus losing sales. In addition, once that payment is actually processed by Square it might get declined due to insufficient funds. Offline is a risk the business owner needs to fully understand and accept before turning it on.

There are other ways around using offline mode or minimizing the risk of offline mode. You can keep a 5G MIFI so when your internet is down you can switch your registers to use the MIFI. The 5G MIFI works GREAT! We actually ran 10 registers on it for over a month while waiting for Verizon FIOS to be setup at a store. Or you can go a little high tech and setup a failover cradle point for your internet provider. When the internet is out, this will failover to a wireless data plan until your internet is back up.

Square API Problem

FiveTran DataConnector has a sweeper. This is to minimize a Square API problem with Square Offline Mode and any latency with transactions being processed. When transactions are uploaded to Square from the register, Square uses the date/time of the transaction on the register. So if a transaction is processed on Monday, but the register does not get back online till Tuesday, FiveTran will not sync that transaction since it already synced for the timeframe when the transactions was processed offline.

When the sync runs, it syncs the data for the appropriate timeframe based on your configuration. If your data connector is configured to sync every 6 hours, then it syncs the last 6 hours of data as well as a 6 hour timeframe 72 hours ago. In theory this should work well BUT during that timeframe, your reports will not match the Square Dashboard regarding sales data. Business users are complaining…

Also, if a register is offline for more than 72 hours which can happen, those transactions are never synced. EVER! Until you manually intervene. See the architecture below on how I handled it.

This is an issue with Square. Square should really expose a date in the API for when the transaction is uploaded into the Square cloud. Then they should allow filtering by that date. Then the FiveTran Data Connector could use that date to sync which will then catch all offline transactions even if uploaded days later.

In addition, FiveTran could build a tool that lets you sync data for a specific timeframe. Then you could easily run this tool to sync missing transactions due to offline mode. But it is just a band-aid requiring intervention.

Buggy Upgrade

Unfortunately FiveTran introduced a bug in their Square Data Connector with a release in September 2021. This was a rather painful experience. Syncs were timing out, data was missing, reports were wrong, etc. The upgrade of the database schema was the issue and would time out, then somehow magically work, timeout, then work, etc…

This led us to a decision to build our own sync tool as a “safety net” to FiveTran. We built this using the Square API and Azure Data Factory. It took several weeks to build but well worth the time. As of now, we are running the FiveTran Square DataConnector. But in addition, we run our sync tool throughout the day and use FiveTran as a final sync for the day.

After three weeks having FiveTran sync issues, we created a new Square Data Connector pointing to a new Azure SQL Database. This created the new schema and synced to our database within 36 hours. Then it was rather simple for use to switch reporting to use the new database and solved the issues we were having with FiveTran syncing.

Architecture

Currently we implemented our architecture as such.

It has the following advantages:

  • We are a small team. We have the best of both worlds where FiveTran syncs everything to our FiveTran Square database while our “Square API Pipeline” syncs just sales data into our Data Warehouse. If either process goes down, we are still getting sales into our DataWarehouse and keeping reports up to date in a timely fashion.
  • We can manually kickoff the Square API Sales Pipeline for a specified time period thus cleaning up any missing (offline mode) sales/transactions easily. Currently we cannot do that using the FiveTran Data Connector.

With a larger team, I would probably change my architecture. I would remove FiveTran completely from the solution in order to eliminate that expense and probably just use Azure Data Factories. The results from the Square API would be pumped into a Data Lake. Then another ETL Tool will transform the data into a Data Warehouse for reporting. This architecture should allow us to easily bolt on additional AI/Analytics tools using the raw data in the Data Lake or in the ETL Tool. In addition, it would also allow us to reload the DataWarehouse (without reloading data from Square) in the event that was needed for a feature requests to the ETL pipeline.

Conclusion

Hopefully you found this useful if you are building your own custom Power BI Dashboards for your Square POS data.

References

Azure DevOps ARM Templates: How to ruin your day!

Love CI/CD in Azure DevOps! But if you make this one innocent mistake while using ARM Templates, you can completely utterly ruin your day. ARM Templates are scoped to a Resource Group. I use ARM Template deployments for deploying Azure Data Factory to QA and Prod

When configuring an ARM Template deployment watch out for this setting called “Deployment Mode”. It has two options in the dropdown:

  • Incremental
  • Complete

The default is incremental.

See that little “Info” icon. You gotta read that fine print. You can’t screw this one up! Also tell everyone on your team about this.

If you accidentally switch this to “Complete” and then run your CI/CD pipeline, say GOOD BYE to all your resources in the resource group that this deployment is scoped too. Because as soon as you click deploy, the first thing this does is nuke EVERYTHING in that resource group.

COMPLETE DEPLOYMENT MODE IS LIKE A NUKE

How do you prevent this?

  • Lock your resources cause sooner or later someone might screw this up. See my previous blog on locking resources.
  • Use deployment credentials that use the “least privileges” to get the job done.
  • Limit resources in the Resource Group that the deployment is scoped to.

My Microsoft rep has commented on how easy it is to accidentally do this or have a junior guy flip the setting thinking it was just resetting a resource and redeploying just the templates.

I have a rule on our team. When deploying via ARM Templates, we create resource groups that only have resources in it that group are deployed via those templates. For ADF Resources, we create one resource group per ADF resource. Seems like an over kill but much easier to recover from just in case someone makes this mistake.

So use “INCREMENTAL” and lock your resources down. Your family will certainly thank me! Cause if you screw this up and nuke your prod environment, you’ll probably spend days rebuilding it.

Security Tip

Did you notice something? Hmm, what happens if your CI/CD account gets hacked and you don’t have locks and you are not using least privileges. So you already made a few bad mistakes… Now what if you only have one or two resource groups for your prod environment. Hmm, now the hacker quickly creates a CI/CD deployment using an ARM Template that is scoped to those groups. More than likely this won’t set off any red flags. Then he deploys. Your day is officially RUINED!

Yet another reason to use MFA on your Azure Accounts.

Conclusion

Pay attention to that setting. Don’t forget about least privilege for your deployment credentials – just as important as least privileges for accessing your databases. And lock those resources.