On the other side of support, we talked about a couple of surveys from Brent Ozar and Steve Stedman, respectively, which look at who (among their user bases) is using which version of SQL Server. I also go into some detail about what I mean when I say that the population surveyed may not be the same as the broader population of SQL Server installs.
Print Spoolers? We Don’t Need No Steenkin’ Print Spoolers!
It has not been a good month for print spooling on Windows.
We started off tonight by mentioning that Mike got a promotion at his job thanks, in part, to what he’s been able to learn by being part of the SQL community. We’re all quite happy for Mike and wish him the best, and we’re glad that he’s a part of the community.
Linked Servers and their Alternatives
Mala brought up our first major topic tonight: what are some alternatives to using linked servers? Here’s a brief synopsis, but you’ll definitely want to listen to the video.
First, are you sure you need an alternative? Linked servers do work for specific purposes and they do a pretty good job in that role. The main problem comes when we try to move lots of data over linked servers, especially when we do it over and over.
One idea I hadn’t thought of but makes a lot of sense is using Machine Learning Services, especially if you need to do data analytics from remote data and then store it in your local SQL Server instance.
ML Experiments on Large Data
Our other major topic was from George Walkey, who asked for my preferences around performing machine learning experiments on large data sets, specifically asking about Databricks, SparkML, Azure Synapse Analytics, Azure Machine Learning, and on-premises solutions. Here’s a synopsis of what I wrote back:
My team generally uses on-premises tooling because that’s what we had available. We started out with SQL Server Machine Learning Services and built a few products off of that. However, most of our new products don’t use ML Services and we’ve started to move existing ones away where it doesn’t make sense (mostly, it doesn’t make sense when you need sub-second performance, aren’t using batch operations to generate lots of records at a time, don’t have the data in SQL Server, don’t store the results in SQL Server, and can’t use native scoring and the PREDICT operator). Instead, we are prone to building out Linux-based servers running R and/or Python to host APIs. We also tried out Azure ML and team members really liked it, but internal problems kept us from being able to use it.
As for what I like:
On-prem is easy if you’re on a tight budget. There are a lot of tools available, and if you have knowledge in the ML space, you can do pretty much anything on-prem that you could do in a cloud.
I’m not the biggest fan of SparkML (and by extension, the Databricks ML capabilities). Spark’s machine learning capabilities aren’t fantastic, and they tend to lag way behind native scikit-learn or R packages. Sure, you can run native scikit-learn code on a Spark cluster, but then you don’t get the advantages of scale-out.
Azure ML (and AWS SageMaker) are good options, especially for people without an enormous amount of ML experience. They’re going to constrain you a bit on the algorithms you can choose, but that’s usually okay–and pro users can find their way around that by bringing in Python libraries.
Azure Syanpse Analytics offers ML capabilities, but it’s worth keeping in mind that the Spark pool machine learning has the same SparkML downsides as what you find in Databricks. Also, between Azure Synapse Analytics and Azure ML, Microsoft has two different ML products run by two different teams here and both of them are competing for what is “the right choice.” Last year, the guidance was to use Azure ML for everything machine learning-related in Azure, and Azure Synapse Analytics Spark pools with SparkML only in cases where you already had existing Spark code. I’m not sure what the guidance will look like over the next few months, but that does leave me wondering.
Mala started us off with a discussion of SQL Server and SQL Management Objects (SMO). We spent the first half of the episode discussing SMO as a concept, understanding where and how to use it, and working through some of the foibles associated with it.
Chat did a good job of carrying this episode. Check out things like:
Customer support and how terrible it is, as well as my thoughts on the nature of level 1 & level 2 support versus level 3 and higher.
How containers differ from the Java Virtual Machine and why it makes sense for the former to be completely unrelated to the latter.
Chris Voss e-mailed (shoptalk at tripass dot org) with a great question:
When it comes to deciding what technology or data skill you wish to learn next, what’s your process? Do you base it on what looks great, what could be good for a given task at your job, what tool or language appears to be most in demand? With all that considered, how much time then goes into such external learning? I know it depends, but I wonder what everyone has to say.
I’m going to keep some of my powder dry because this makes up a fair percentage of my upcoming Dataminutes talk, but here are some quick thoughts:
Build a not-to-do list. It’s easy to add things to to-do lists, but a not-to-do list is even more important. Doing this gives you focus. So what goes on your not-to-do list? That’s for Dataminutes…
Think about how good you are at your current job, in terms of knowing what you need to know as well as job security. If you’re at risk, focus your learning time on what makes you better at your current job. Keeping the job you have is important, as it’s much easier to get a new job if you have one than if you’re out of work and looking.
If you’re in a pretty good place job-wise, think about what you want your next job to be and spend some amount of time on that. You shouldn’t spend 100% of your learning on “the next role,” but this is a continuum, in that as you get more comfortable in the current position, you should be able to branch further.
Spend an hour a day at work learning. If you need to (and can!), set up a daily meeting on your calendar during a time frame when you aren’t that busy, and use it to learn. Bias toward things helpful for your current position, so that if pressed, you can bring up specific cases where the time you spent learning directly and positively affected the company.
Don’t forget about learning for fun. That may also be part of your “next job” learning, but it can be enjoyable to learn something simply for the sake of learning it.
Go outside your comfort zone. Nobody respects a Fachidiot.
Also, don’t forget about leaving your normal realm. I’m assuming that most of the Shop Talk audience is technical people, like application developers, database administrators, and the like. Pick up a book on literature, history, or philosophy.
Your brain needs breaks, too. Specifically, physical activity is really good for giving your brain a break. If you spend 8+ hours a day sitting in front of a screen at work and navigating logic problems, it can wear you out to keep trying to solve similar logic problems at night. Switch it up a bit and you’ll find more success long-term.
Tonight’s episode was a fun topic where I essentially gave a survey but without collecting responses. Because that’s how I roll. Here were the questions (including a couple I didn’t ask), along with my original answers.
Best Named SQL feature: Hekaton was, at least until marketing got in the way and made it In-Memory OLTP.
Worst Named SQL feature: The timestamp data type. As a runner-up, Azure Data Studio.
SQL Feature Most Likely To Have Been Named By Marketing: Always On Availability Groups. Or maybe AlwaysOn. I think it might have been Always-On at one point as well, but don’t recall.
Most Accurately Named SQL Feature: I want jokingly to say “Priority boost” here. On the plus side, most features are named clearly enough, even if sometimes they take away nice names and give them marketing-heavy names.
Least Accurately Named SQL Feature: The timestamp data type. As a runner-up, Azure Data Studio.
SQL Feature That Should Throw A Hard 24 Error If You Try to Use It In A New Project (But Continue To Work For All Your Existing Crap): Non-schemabound table-valued functions
SQL Feature That Just Needs A Little Love to Be Great: If this was 2010, I would have said Service Broker. Right now, PolyBase.
SQL Feature That Can Never Be Great Even With Barry White Levels of Love: CLR strict security.
Best SQL Feature: Metadata is also described in terms of SQL.
Suckiest SQL Feature: Stretch DB is too easy. CLR strict security could be on here, but I’ll pick something different: SSIS scale-out. It’s a half-hearted attempt to say that SSIS scales.
Surprisingly Useful SQL Feature: Batch mode processing. To clarify, the surprising part is just how often batch mode processing can speed things up.
Surprisingly Useless SQL Feature: Temporal tables. I think they tried to solve two problems at once and ended up providing a halfway-viable solution to each, but not a complete solution to either. A bit of that Barry White love could fix this.
SQL Feature I’ll Probably Never Use But I Like That It Exists: Query Store plan forcing for cursors. It’s easy to dunk on cursors and performance, but I appreciate that they were able to figure this out.
@rajeshparuchuri asks, “How do you delete old backup files? Or what is your preferred way of deleting old backup files?”
We had some discussions about different methods for deleting, but there are several key methods:
Use xp_cmdshell to delete the files. This wasn’t available to Rajesh, so we had to move on.
Write a PowerShell script to delete the files and call that script in the backup job. Essentially, know where the backups are stored and write a one-liner to get child items (with a .trn or .bak extension), filter to get just those older than your retention period, and delete the objects.
Use a maintenance plan. I include this in the list for the sake of completeness, not because it’s a great idea.
If you’re writing backups to a cloud store like AWS S3 or Azure Blob Storage, you can set a retention plan on the bucket/container and automatically delete files older than a certain age.
Similarly, many newer SANs offer the ability to set a retention policy on a share.
The Downside of Storing VARBINARY in a Table
@rednelo7625 asked: “I have a question about storing Word docs in a document DB as VARBINARY(MAX). I’m limiting my users to an upload size of 3MB. I’m storing them this way so that I can encrypt them and back them up. Do you see any problem with this practice?”
The short answer is yes, there are some risks with this. The biggest problem is that if the database becomes really popular, it can bloat up quickly and make activities like database backups, CHECKDB runs, etc. run a lot slower. And with blob data like this, there’s not much we can do about it—especially something like DOCX files which are already zipped (spoiler: .docx is basically a zipped series of XML files), so you don’t get any real compression benefits from using COMPRESS().
As for the reasoning, it’s to encrypt and back up files. My response was to recommend checking into FileTable functionality that was introduced in SQL Server 2012. FileTable gives you a nice compromise between a desire to keep file metadata in SQL Server while keeping the binary data out. If you take a full backup of your database, it does back up the FileTable data (unless you choose to exclude it), so you do get backup protection that way without needing to store the files directly in the database.
As far as encryption goes, I’d recommend enabling disk-based encryption on the storage where those files are (assuming you use FileTable). I made the guess that rednelo was using Transparent Data Encryption for this, and encrypting the disks provides a similar level of protection—it protects data at rest. For further protection, it would be possible to write an app which accepts those Word documents and saves them in a password-protected archive, or encrypts the binary data and saves it as an encrypted file, decrypting it later for access.
Eventually, we got pulled off onto a sidetrack around IoT, especially in agriculture. This is an area of interest to me because the average person has no clue just how sophisticated modern agriculture is. Anders and @ENHANCE_QC have some knowledge in the area (family members in farming) and added some nice context too.
We covered one of the key problems with IoT: you may own the hardware, but if it has to talk to a remote server to survive, your expensive product can quickly become a brick. This also ties to right-to-repair laws, which I fully support. My philosophy on this is that if I bought the product, I should be able to alter the product, including repairs and modifications.
Twice in six months, Azure Active Directory has gone down. Both times, it happened right around an episode of Shop Talk. I think they’re just trying to give us content.
Microsoft has provided a summary of the issue. This was pretty big—bigger than I understood at the time of the episode, as this one did affect services more than the issue six months prior.
Modern Data Warehousing
Fred e-mailed and asked a few questions about modern data warehousing. I gave fairly substantial answers, so check out the video for those details. Here’s the brief summary.
I want to understand how to create the hub and spoke data warehousing architecture.
Hub and spoke has a couple of meanings. The original term came from “departmental data marts,” which is a style popularized by Bill Inmon. I’m not the biggest fan of this approach. Instead, I like the bus architecture by Ralph Kimball. I think this approach makes it much easier for business analysts to get the data they need without needing to know much (if any) SQL.
Ellis Butterfield has a presentation from 2017 covering hub-and-spoke in a Massive Parallel Processing system, such as Azure Synapse Analytics dedicated SQL pools (nee Azure SQL Data Warehouse). The idea is that dedicated SQL pools have concurrency limits, such that only a certain number of queries may run at a time. If you’re hitting those limits and queries are waiting, you might want to offload processing onto different machines.
In short, Ellis’s example uses Azure SQL Database elastic pools as the spokes, creating external tables which point back to the dedicated SQL pool, allowing you to isolate workloads on specific pools.
Another option is to use the serverless SQL pool, creating external tables against pre-cleaned data in the lake. This might help offload queries from a dedicated SQL pool.
A third option is to use Azure SQL Databases as the spokes, using either replication or ETL to migrate data from the dedicated SQL pool into SQL databases. Essentially, you’d treat these as data marts: line-of-business specific databases which serve queries for business analysts in those departments. This fits really well with the Kimball approach but does add more in terms of complexity.
A fourth option is to use Power BI as the spoke, migrating data from the dedicated SQL pool into a Power BI data model and use that for reporting. This assumes that your relevant data set can fit into Power BI’s Tabular model or that you’re using Azure Analysis Services to fit it in.
A fifth option would be to use result set caching in the dedicated SQL pool, especially if you have infrequently-changing data and many occurrences of the same queries. The idea here is that if you can make each individual query faster, you have less concurrency, as each gets in and out sooner and there’s less chance a lot of them will be hanging around at the same time.
What do I need to consider when I am transforming the data?
Here are a few things to consider:
Explore raw flat files using Azure Synapse Studio. This will give you an opportunity to see what the incoming data looks like and get a quick determination on data quality and shape.
Mapping Data Flows in Azure Data Factory can transform data stored in your data lake. These are great for relatively clean data.
You can also use Apache Spark and Spark clusters to wrangle nasty data, particularly if you need to do things like parsing regular expressions or fixing file delimiters.
Store your transformed data in Parquet format. It’s a great format in general and works extremely well with fact-style data, which has a lot of columnar repetition.
Generally, you’ll have three layers of data in your data lake. You’ll sometimes hear them referred to as bronze, silver, and gold. Occasionally you’ll hear raw, refined, and ready.
For loading data into clustered columnstore indexes, stage your data in heap tables or temp tables before running those transformations.
If I am going to use Azure services, is there a cheaper substitute for Azure Synapse Analytics?
If your database is large enough—50-100 terabytes or more—then no, you’re probably not going to find a cheaper alternative. But if it’s small, on the order of 1 TB or less, go with Azure SQL Database. Azure Synapse Analytics dedicated SQL pools don’t really make much sense for a database under 1 TB.
Here are a few tips for saving cash when using Azure Synapse Analytics:
Keep your dedicated SQL pools as small as possible. This component is less expensive than Azure SQL Data Warehouse was, but it is still costly.
Use serverless queries for one-off stuff rather than building it into the dedicated pool.
If you do a good job of laying out your data lake files and folders, serverless queries can be really inexpensive. It’s $5 per terabyte of data scanned, but even if you have 500+ TB of data stored, you might only need to scan a few gigabytes to get your answer.
Use Power BI Premium and store segments of the data model there. If you use that for reporting and analysis, you can scale down the dedicated SQL pool. And if you’re using Power BI Premium Per User pricing, that’s $20 per user per month. Or $5K a month if you use the Premium SKU.
Pause dedicated SQL pools if there’s a time frame you’re allowed to do that. For example, if your analysts are only in the office from 6 AM until 6 PM, you may be able to shut off the dedicated SQL pool for 10-12 hours a day.
If you can’t turn it off, that’s fine. Reserved instance pricing can cut down on the cost of dedicated SQL pools considerably. A 3-year reservation can save 65% or so off the pay-as-you-go price. If you do reserve, I’d recommend reserving a bit below where you normally are, as then you have the opportunity to scale back some if you get major query gains and you can negotiate up more easily than down.
Keep an eye on Spark pools and make sure they run only when needed. Those can also get expensive. Spark pools have auto-shutdown capabilities so if somebody forgets to turn one off, it can shut down automatically.
Thanks for the great question, Fred! If you have any questions you want us to handle on Shop Talk, hit us up at shoptalk at tripass dot org.
Mala and I spent some time talking about Microsoft Ignite and the data sessions. This ran Tuesday through Thursday, so you can check out highlights, etc.
Did Kevin Defend COBOL?
Look, I’ve made a lot of mistakes in my life. Defending COBOL (as someone who doesn’t even like COBOL!) probably wasn’t one of them. But we spent a lot of time on technology, innovation, and my recognition (admonition?) that there’s nothing new under the sun.