- Kevin Feasel
- Mala Mahadevan
- Mike Chrestensen
Notes: Questions and Topics
Oops, Microsoft Did it Again
Twice in six months, Azure Active Directory has gone down. Both times, it happened right around an episode of Shop Talk. I think they’re just trying to give us content.
Microsoft has provided a summary of the issue. This was pretty big—bigger than I understood at the time of the episode, as this one did affect services more than the issue six months prior.
Modern Data Warehousing
Fred e-mailed and asked a few questions about modern data warehousing. I gave fairly substantial answers, so check out the video for those details. Here’s the brief summary.
I want to understand how to create the hub and spoke data warehousing architecture.
Hub and spoke has a couple of meanings. The original term came from “departmental data marts,” which is a style popularized by Bill Inmon. I’m not the biggest fan of this approach. Instead, I like the bus architecture by Ralph Kimball. I think this approach makes it much easier for business analysts to get the data they need without needing to know much (if any) SQL.
By the way, because he’s a really smart guy, I want to link to Simon Whiteley’s take on whether the Kimball model fits in the modern data warehouse. My short version is that Kimball is still quite relevant, in part because its intended purpose is ease of use rather than performance, strictly speaking.
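To make the Kimball idea concrete, here is a minimal, hypothetical star schema sketch: one fact table joined to conformed dimensions via surrogate keys. All table and column names are invented for illustration, and note that Synapse dedicated SQL pools do not enforce foreign keys, so the REFERENCES clauses below are purely documentary.

```sql
-- Hypothetical star schema in the Kimball style: a fact table
-- surrounded by conformed dimensions, joined on surrogate keys.
CREATE TABLE dbo.DimDate (
    DateKey     INT          NOT NULL PRIMARY KEY,  -- e.g. 20240315
    [Date]      DATE         NOT NULL,
    [MonthName] NVARCHAR(20) NOT NULL,
    [Year]      INT          NOT NULL
);

CREATE TABLE dbo.DimProduct (
    ProductKey  INT           NOT NULL PRIMARY KEY,  -- surrogate key
    ProductName NVARCHAR(100) NOT NULL,
    Category    NVARCHAR(50)  NOT NULL
);

CREATE TABLE dbo.FactSales (
    DateKey     INT            NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey  INT            NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity    INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
);
```

The payoff for analysts is that reports become simple join-and-aggregate queries against a shape they can reason about, rather than requiring deep knowledge of source systems.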
How will my schema look?
Ellis Butterfield has a presentation from 2017 covering hub-and-spoke in a Massively Parallel Processing (MPP) system, such as Azure Synapse Analytics dedicated SQL pools (née Azure SQL Data Warehouse). The idea is that dedicated SQL pools have concurrency limits, such that only a certain number of queries may run at a time. If you’re hitting those limits and queries are queueing, you might want to offload processing onto different machines.
In short, Ellis’s example uses Azure SQL Database elastic pools as the spokes, creating external tables which point back to the dedicated SQL pool, allowing you to isolate workloads on specific pools.
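From the spoke side, that setup would use Azure SQL Database elastic query. A rough sketch of the pieces, with every server name, database name, and credential being a placeholder:

```sql
-- In the spoke (an Azure SQL Database in the elastic pool):
-- define a remote data source pointing back at the hub,
-- then an external table over one of its tables.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL HubCredential
    WITH IDENTITY = '<sql login>', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE HubPool WITH (
    TYPE = RDBMS,
    LOCATION = 'myhubserver.database.windows.net',  -- placeholder server
    DATABASE_NAME = 'HubWarehouse',                 -- placeholder database
    CREDENTIAL = HubCredential
);

-- The column list must match the remote table's definition.
CREATE EXTERNAL TABLE dbo.FactSales (
    DateKey     INT,
    ProductKey  INT,
    SalesAmount DECIMAL(18, 2)
) WITH (DATA_SOURCE = HubPool);
```

Queries against `dbo.FactSales` in the spoke then execute remotely against the hub, which is what lets you pin each department’s workload to its own pool.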
Another option is to use the serverless SQL pool, creating external tables against pre-cleaned data in the lake. This might help offload queries from a dedicated SQL pool.
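For the serverless route, an ad hoc query over curated Parquet files in the lake might look like this (the storage account and path are placeholders):

```sql
-- Serverless SQL pool: query Parquet files directly in the data lake,
-- paying only for the data scanned rather than for provisioned compute.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/curated/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```

You can also wrap the same source in a `CREATE EXTERNAL TABLE` statement so analysts get a stable, named object instead of a raw file path.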
A third option is to use Azure SQL Databases as the spokes, using either replication or ETL to migrate data from the dedicated SQL pool into SQL databases. Essentially, you’d treat these as data marts: line-of-business specific databases which serve queries for business analysts in those departments. This fits really well with the Kimball approach but does add more in terms of complexity.
A fourth option is to use Power BI as the spoke, migrating data from the dedicated SQL pool into a Power BI data model and using that for reporting. This assumes that your relevant data set can fit into Power BI’s Tabular model or that you’re using Azure Analysis Services to fit it in.
A fifth option would be to use result set caching in the dedicated SQL pool, especially if you have infrequently-changing data and many occurrences of the same queries. The idea here is that if you can make each individual query faster, you reduce concurrency pressure: each query gets in and out sooner, so there’s less chance that many of them will be running at the same time.
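Result set caching is a database-level setting on dedicated SQL pools. Enabling it looks roughly like this (run against the master database of the logical server; the pool name is a placeholder):

```sql
-- Run against the master database; enables result set caching
-- for the named dedicated SQL pool.
ALTER DATABASE [MyDedicatedPool] SET RESULT_SET_CACHING ON;

-- Individual sessions can opt out if a workload shouldn't be cached:
SET RESULT_SET_CACHING OFF;
```

Once enabled, repeated identical queries over unchanged data return from the cache rather than re-executing, which is exactly the get-in-and-out-faster effect described above.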
What do I need to consider when I am transforming the data?
Here are a few things to consider:
- Explore raw flat files using Azure Synapse Studio. This will give you an opportunity to see what the incoming data looks like and get a quick determination on data quality and shape.
- Mapping Data Flows in Azure Data Factory can transform data stored in your data lake. These are great for relatively clean data.
- You can also use Apache Spark and Spark clusters to wrangle nasty data, particularly if you need to do things like parsing regular expressions or fixing file delimiters.
- Store your transformed data in Parquet format. It’s a great format in general and works extremely well with fact-style data, which has a lot of columnar repetition.
- Generally, you’ll have three layers of data in your data lake. You’ll sometimes hear them referred to as bronze, silver, and gold. Occasionally you’ll hear raw, refined, and ready.
- For loading data into clustered columnstore indexes, stage your data in heap tables or temp tables before running those transformations.
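The staging pattern in that last bullet might look something like this in a dedicated SQL pool (table names and the storage path are made up):

```sql
-- Stage into a heap first; loading into a heap avoids the cost of
-- building columnstore segments during ingest.
CREATE TABLE stg.FactSales_Load (
    DateKey     INT,
    ProductKey  INT,
    SalesAmount DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

COPY INTO stg.FactSales_Load
FROM 'https://mylake.dfs.core.windows.net/curated/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- Then load the clustered columnstore fact table in one large batch,
-- which produces well-compressed columnstore segments.
INSERT INTO dbo.FactSales (DateKey, ProductKey, SalesAmount)
SELECT DateKey, ProductKey, SalesAmount
FROM stg.FactSales_Load;
```

Large batch inserts matter here: columnstore compression works best when rowgroups fill up, so one big insert beats many small trickle loads.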
If I am going to use Azure services, is there a cheaper substitute for Azure Synapse Analytics?
If your database is large enough—50-100 terabytes or more—then no, you’re probably not going to find a cheaper alternative. But if it’s small, on the order of 1 TB or less, go with Azure SQL Database. Azure Synapse Analytics dedicated SQL pools don’t really make much sense for a database under 1 TB.
Here are a few tips for saving cash when using Azure Synapse Analytics:
- Keep your dedicated SQL pools as small as possible. This component is less expensive than Azure SQL Data Warehouse was, but it is still costly.
- Use serverless queries for one-off stuff rather than building it into the dedicated pool.
- If you do a good job of laying out your data lake files and folders, serverless queries can be really inexpensive. It’s $5 per terabyte of data scanned, but even if you have 500+ TB of data stored, you might only need to scan a few gigabytes to get your answer.
- Use Power BI Premium and store segments of the data model there. If you use that for reporting and analysis, you can scale down the dedicated SQL pool. And if you’re using Power BI Premium Per User pricing, that’s $20 per user per month. Or $5K a month if you use the Premium SKU.
- Pause dedicated SQL pools if there’s a time frame you’re allowed to do that. For example, if your analysts are only in the office from 6 AM until 6 PM, you may be able to shut off the dedicated SQL pool for 10-12 hours a day.
- If you can’t turn it off, that’s fine. Reserved instance pricing can cut the cost of dedicated SQL pools considerably: a 3-year reservation can save roughly 65% off the pay-as-you-go price. If you do reserve, I’d recommend reserving a bit below your normal usage level. That leaves you room to scale down if you get major query gains, and it’s easier to negotiate a reservation up than down.
- Keep an eye on Spark pools and make sure they run only when needed, as those can also get expensive. Spark pools do have auto-pause capabilities, so if somebody forgets to shut one down, it can pause automatically after a period of inactivity.
Thanks for the great question, Fred! If you have any questions you want us to handle on Shop Talk, hit us up at shoptalk at tripass dot org.