Show notes for the Triangle Area SQL Server User Group's Shop Talk program.
Author: Kevin Feasel
Kevin Feasel is a Microsoft Data Platform MVP and CTO at Envizage, where he specializes in data analytics with T-SQL and R, forcing Spark clusters to do his bidding, fighting with Kafka, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL (https://curatedsql.com) and author of PolyBase Revealed (https://www.apress.com/us/book/9781484254608). A resident of Durham, North Carolina, he can be found cycling the trails around the Triangle whenever the weather's nice enough.
The majority of this episode is dedicated to building a safe sequence for when you absolutely need contiguous numbers for ID values. Identity and sequence can, by design, leave gaps. 99% of the time (or more), that’s not an issue. But in the rare case where numeric IDs are part of an audit process, I show a technique to eliminate those gaps, though at the cost of slowing down insert performance considerably.
The notebook is part of my upcoming post-con for Data Platform Summit, so I’m not releasing it right now. But I might release a version of it in the future.
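In the meantime, here's a minimal sketch of the classic counter-table pattern at the heart of the technique (table, procedure, and column names are mine, not the notebook's):

```sql
CREATE TABLE dbo.AuditCounter
(
    CounterName VARCHAR(50) NOT NULL PRIMARY KEY,
    CurrentValue BIGINT NOT NULL
);

INSERT INTO dbo.AuditCounter(CounterName, CurrentValue)
VALUES ('InvoiceID', 0);
GO
CREATE OR ALTER PROCEDURE dbo.GetNextInvoiceID
    @NextValue BIGINT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- The compound assignment increments and reads the counter in one atomic
    -- statement. The exclusive row lock holds until the caller's transaction
    -- commits, so a rollback also rolls back the increment: no gaps, but
    -- every insert serializes on this one row.
    UPDATE dbo.AuditCounter
    SET @NextValue = CurrentValue = CurrentValue + 1
    WHERE CounterName = 'InvoiceID';
END;
GO
```

Call the procedure inside the same transaction as the insert which uses the value; that single-row serialization is exactly where the insert-performance cost comes from.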
We had one announcement: TriPASS board elections are coming up this Thursday. They will run for the next three weeks. Because we had one nominee for each position, we’ll have an up-or-down vote on the entire slate.
Upcoming Events
Data Platform Summit is coming up, from November 30th through December 8th. Registration is open and you can use the discount code TRIANGLESQL for a nice discount.
PASS Summit Thoughts
Tom and I talked about PASS Summit for the entire episode. We had some great conversations in chat as well.
After we finished recording, I saw Deb Melkin’s blog post summarizing PASS Virtual Summit, as well as Andy Levy’s. Check those out for additional thoughts on Summit.
TriPASS Plans in November
We will not have Shop Talk next Monday, November 23rd, as it is the Monday of Thanksgiving. We also are done with regular data science meetings for the year, so no regular meeting that week.
We had one announcement: TriPASS board elections are coming up in mid-November. We have three positions up for election this year: President, VP of Marketing, and Treasurer. All TriPASS members in good standing are eligible to run, and the election will run from November 19th through December 3rd, for terms running from February 23, 2021 through February 26, 2023. If you’re interested, e-mail us (shoptalk at tripass dot org will work) and I can provide more details as needed.
Conferences
PASS Summit is coming up, from November 10th through the 13th. Registration is still open and you can use the discount code LGDISCWY7 to save some cash. Tom also gave the full-court press on why you should sign up for PASS Virtual Summit this year.
Data Platform Summit is also coming up, from November 30th through December 8th. Registration is open and you can use the discount code TRIANGLESQL for a nice discount.
SSMS and ADS Joined Together
Our big topic for the evening was SQL Server Management Studio 18.7, which now forces installation of Azure Data Studio.
Tom, Mala, and I all share fairly similar opinions, so I’ll lay out mine just a little bit. I don’t hate the idea, but I do see two problems with it. First, this is a required installation. Microsoft’s saying that you can uninstall ADS afterward if you choose, but I’d much rather have the option to install it or not than have to uninstall it later. I use Azure Data Studio pretty regularly and I still don’t like the notion of forcing an installation when all you wanted was SQL Server Management Studio.
Second, I’m a bit curious about release cadence and bloat. SSMS releases less frequently than ADS, so if I subsequently install 18.7.2, will it try to install an old version of Azure Data Studio over my current version? I don’t know the answer to that and figure we’ll have to wait and see. But we do see from Glenn Berry that ADS is installed as a system installation rather than a user installation, so you might end up with two copies as a result—and a third copy if you’ve used the Insiders track. There’s nothing SSMS-specific about this, as you can install multiple copies all on your own, but I’m sure it’ll ruffle a few feathers.
Mala did bring up a good point that I’d like to mention here as well: SQL Server Management Studio is already a collection of installed applications—it brings along with it Profiler, Replication Monitor, Database Tuning Advisor (ugh), the Analysis Services deployment tool, and more. These are all separate applications which already ride along with an SSMS installation.
But that’s not to say I’m dismissive of the people who don’t want to install ADS, as well as the people who want to make ADS an optional installation rather than mandatory. I understand concerns about corporate compliance, potential security risks involved (you are, after all, installing Python and a very extensible IDE), etc. And given that you don’t need ADS to use SSMS, I’d rather they not require installation. Make it an option and I think a lot of the concern within the community simmers down.
This topic also led us down a rabbit hole of installing SSMS on servers. In real life, I don’t push too hard against the idea, because it’s nice to have the tools there in case you need direct server access. But I wanted to push the other side of the argument as hard as I could, and I think SSMS 18.7 makes that “Don’t install SSMS on servers ever” argument a bit easier because it increases the scope of what you need to worry about on a server. Or we just keep copies of 18.6 around for a few years…
TriPASS Plans in November
We will not have Shop Talk next week (November 9th), as this is the start of pre-cons for PASS Summit. We also will not have an Advanced DBA group meeting that week, so expect no TriPASS-specific content unless I decide to do something ad hoc.
We will have Shop Talk on the 16th and Tracy will talk about dbatools & dbachecks; register for that now to get a reminder or just show up when we go live on Twitch.
We will probably not have Shop Talk on November 23rd, as it is the Monday of Thanksgiving. We also are done with regular data science meetings for the year, so no regular meeting that week.
We had two announcements. First, TriPASS board elections are coming up in mid-November. We have three positions up for election this year: President, VP of Marketing, and Treasurer. All TriPASS members in good standing are eligible to run, and the election will run from November 19th through December 3rd, for terms running from February 23, 2021 through February 26, 2023. If you’re interested, e-mail us (shoptalk at tripass dot org will work) and I can provide more details as needed.
Second, Mala gave us an update on the Azure SQL Championship, which is now winding down.
We talked for a little while about SolarWinds acquiring SentryOne, and I’m interested to see where SolarWinds takes the product. SentryOne isn’t perfect by any means—and I kind of threatened to get into why I’m not a fan of its user experience (or the user experience of any other third-party SQL Server monitoring solution). Here are my notes on what would make for a good monitoring product:
The key question: is my system currently experiencing problems? How can I know? What does normal look like versus the current state?
If we are experiencing problems, what is the cause? Or what are the causes?
If I am not particularly familiar with this system but am knowledgeable in maintenance, what are the pain points I can most easily fix?
If I am not very knowledgeable with this platform, what should I look for?
Ultimately, a monitoring solution—like any other dashboard—should evoke an emotion: either anger (because something is broken and I need to fix it now) or happiness (because everything is going great and I can keep doing whatever it is I do). With most products, there are so many things going on that it’s hard to tell what you should feel, even if you are familiar with the environment and the tool.
Anders made the case that third-party tools are generally better than in-house ones, from the standpoint that third-party tools at least have support and (usually) documentation, whereas in-house solutions have nothing if the person who put them together leaves. I don’t completely agree with that notion, but you’ll have to watch the video for my argument against it.
QDS Toolbox
I am excited to be able to promote the QDS Toolbox, a solution to make Query Store better. You can also check out my blog post for my thoughts in detail.
One thing I found interesting is that Software Engineer, Software Developer, and Developer / Programmer Analyst have a fairly wide discrepancy in salary expectations: a Software Engineer at the median salary in the survey earns $123,250, a Software Developer $118,250, and a Developer / Programmer Analyst $109,500. That’s a $14K per year range for people doing ostensibly the same job, but I do have a conjecture regarding why they’re different. My conjecture is that you’re more likely to find people with the title Software Engineer at companies which build software products for customers, whereas you’d find Software Developers at companies which aren’t necessarily selling software to customers; those developers would be working more on back office or internal solutions. Meanwhile, Programmer Analyst is a classic term for employees in the government, where salaries tend to be lower (particularly at the state level). That’s a conjecture, mind, so it could very well be wrong.
On the data side, I was surprised to see the median DBA salary at $100,750; I would have expected DBA salaries to be at least as high as developer salaries. We do see Database Developer roles earning as much as Software Developers, so that’s about what I’d expect. Also surprising is that Data Scientist salaries seem to have dropped considerably, though I surmise the reason here is that there are a lot of newly-minted Data Scientists and people switching from Data Analyst job titles pushing those salaries down.
The salary tables include salaries in USD normalized across the United States, as well as weighting factors for particular cities; for example, Raleigh salaries are about 4% above the quoted norm. After those tables, there are similar tables for Canadians in CAD. What surprised me there is that Canadian salaries in CAD are about 75-85% of US salaries in USD, so there’s a huge pay difference. A DBA at the 50th percentile in the US has an expected salary of $100,750 USD in the survey, while a DBA at the 50th percentile in Canada has an expected salary of $84,000 CAD. Right now, the exchange rate is approximately 4 CAD = 3 USD, so $84,000 CAD * 0.75 USD/CAD = $63,000 USD.
Take all survey numbers with as many grains of salt as you need, but it’s an interesting way of getting a feel for the industry.
Our first topic for the night was a function available only in Azure SQL Edge: DATE_BUCKET(). I walked through a sample of some of the code I plan to show off during my PASS Summit pre-con, and I’m glad I did because I was not very happy with the delivery. The concept feels a little backwards to me, as “date bucket” sounds like it should be numeric: bucket 1, bucket 2, bucket 3. Instead, the function returns the start date of the bucket itself: 2020-09-01, 2020-09-15, 2020-09-29.
I certainly believe it’s useful and would like to see it in SQL Server and Azure SQL Database, but I’ll need to work on my examples to make sure they’re clearer.
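In the meantime, here’s a rough sketch of the behavior, with an explicit origin and dates chosen to line up with the buckets above:

```sql
-- DATE_BUCKET(datepart, number, date, origin) returns the start date of the
-- bucket containing the input date, not a bucket number.
DECLARE @origin DATETIME2(0) = '2020-09-01';

SELECT
    v.d,
    DATE_BUCKET(DAY, 14, v.d, @origin) AS BucketStart
FROM (VALUES
    (CAST('2020-09-05' AS DATETIME2(0))),
    (CAST('2020-09-20' AS DATETIME2(0))),
    (CAST('2020-10-01' AS DATETIME2(0)))
) AS v(d);
-- BucketStart comes back as 2020-09-01, 2020-09-15, and 2020-09-29, respectively.
```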
Thoughts on Graph Databases
The other key topic for the night was a discussion of graph databases. We recently recorded an episode of the SQL Data Partners podcast with Mala as the guest and the two of us had some detailed thoughts on the utility of graph databases. We took some time tonight to flesh out those thoughts a bit further.
The short version of it for me is that graph concepts are extremely useful; graph databases are not, unless your company’s key differentiator requires solving graph-style problems with large amounts of data in near real time. An example I like to give for this is logistics, in which you need to know where people and things are and where they need to go in the most efficient manner. If you’re developing software for logistics companies to use to solve this problem, then a graph database might be the best option. Check out the show (as well as the upcoming podcast episode, #208) for the rest of my thoughts on the topic.
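Worth noting as a middle ground: SQL Server itself has had basic graph table support since 2017, which can cover graph concepts without a dedicated graph database. A tiny sketch (the tables here are hypothetical):

```sql
CREATE TABLE dbo.Person
(
    PersonID INT NOT NULL PRIMARY KEY,
    PersonName NVARCHAR(100) NOT NULL
) AS NODE;

CREATE TABLE dbo.Knows AS EDGE;

-- Find everyone a given person knows directly, using the MATCH syntax.
SELECT p2.PersonName
FROM dbo.Person AS p1,
     dbo.Knows AS k,
     dbo.Person AS p2
WHERE MATCH(p1-(k)->p2)
    AND p1.PersonName = N'Mala';
```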
Mala’s Books & More Kiosk
We’ve upgraded Mala’s Book Corner to include a lot more. This time around, she has one blog post and three podcasts for us.
Our first topic for the night was code review tools, focusing mostly on the database world. Mala led us through some of the research she has done to find an alternative to Crucible and Fisheye, keying us in on several tools with an emphasis on tsqllint.
Xenographics
Over at Curated SQL, I posted about Xenographics, a website dedicated to…uncommon visuals. I enjoyed walking through several of them. This site includes some visuals I like as well as some which I can’t understand even after reviewing them.
Watch the video for the visuals we look at, but I wanted to take a moment and hit six characteristics I think make for a good visual. This list is neither necessary nor sufficient—a good visual need not be all of these at once, and I won’t claim that this is the authoritative list of rules for quality visuals. That said, here they are:
Intuitive — A visual should be easy for a person to understand despite not having much context. In some cases, you have the opportunity to provide additional context, be it in person or in a magazine. That lets you increase the complexity a bit, but some visuals are really difficult to understand and if you don’t have the luxury to provide additional context, it makes your viewer’s job difficult.
Compact — Given two visuals, the one which can put more information into a given space without losing fidelity or intuitiveness is preferable. This lets you save more screen real estate for additional visuals and text. There are certainly limits to this philosophy, so consider it a precept with diminishing marginal returns.
Concise — Remove details other than what helps tell the story. This fits in with compactness: if you have unnecessary visual elements, removing them lets you reclaim that space without losing any fidelity. Also, remove unnecessary coloration, changes in line thickness, and other things which don’t contribute to understanding the story. Please note that this doesn’t mean removing all color—just coloration which doesn’t make it easier for a person to understand what’s happening.
Consistent — By consistency, what I mean is that the meaning of elements on the visual does not change within a run or between runs. Granted, this is more relevant to dashboards than individual visuals, but think about a Reporting Services report which uses default colors for lines on a chart. If you refresh the page and the colors for different indicators change, it’s hard for a person to build that mental link to understand what’s happening.
Glanceable — Concise and consistent visuals tend to be more glanceable than their alternatives. Glanceable means that you are able to pick out key information without needing to stare at the visual. Ideally, a quick glance at a visual tells you enough of what you need to know, especially if you have seen the same visual in prior states.
Informative — This last consideration is critical but often goes overlooked. The data needs to be useful and pertinent to users, describing the situation at the appropriate grain: it includes all of the necessary detail for understanding while eschewing unnecessary detail.
Leslie sent in an e-mail (shoptalk at tripass dot org) asking for guidance or a syllabus for learning to become a database developer, so Mala and I tackled that. We broke it down into five categories and focused mostly on books. One book-buying tip I have: if you’re looking at older books, check out a website like ISBN.nu, as it aggregates across several book sellers to find the best price. I’ve used that site for about 15 years now. Most of the links I provide are from Amazon, but that’s almost never the cheapest price.
First up is T-SQL, as I think you should learn the language before trying to work through the rest of it. Carlos Chacon has a great book called From Zero to SQL in 20 Lessons, which was written for someone with no experience to get started. From there, I’d recommend Itzik Ben-Gan’s T-SQL Fundamentals or Kathi Kellenberger’s Beginning T-SQL. Mala brought up John Deardurff’s series on learning T-SQL as well as the SQL Quickstart Guide. Finally, once you’ve spent a couple of years building up those SQL skills, grab a copy of Grant Fritchey’s SQL Server 2017 Query Performance Tuning book, which is a massive tome full of great info. Or you can get the much smaller version from Red Gate.
After taking on T-SQL, learn about data modeling. The single best reference is the Handbook of Relational Database Design, released in 1989. The first half of the book is gold; the second half is only useful if you’re trying to implement this stuff on a late 1980s Oracle or Informix database… For something a bit more recent, check out Louis Davidson’s Pro SQL Server Relational Database Design. The 6th edition is coming out soon, but if you’re impatient, Louis and Jessica Moss teamed up on the 5th edition. I have an older edition, and I think Louis’s explanation of 4th normal form is the best I’ve ever read. Mala has a couple of recommendations as well: Information Modeling on Relational Databases and Database Design for Mere Mortals.
If you want to move from OLTP-style data modeling and into warehousing, Ralph Kimball’s Data Warehouse Toolkit is the book to get. The Kimball model is everywhere and this is the type of book you can come back to multiple times and learn more and more each time. If you’re interested in the Data Vault approach, check out Building a Scalable Data Warehouse with Data Vault 2.0. I haven’t read it, but people who are into Data Vault have recommended the book.
Of course, once you’ve learned a bit about warehousing, you might want to read up on ways to work with warehouses. If you’re looking at report-writing, Kathi Kellenberger’s Beginning SQL Server Reporting Services is a good start. From there, Paul Turley, et al’s Pro Microsoft SQL Server 2016 Reporting Services is the next step, but I’d probably only go that far if I were writing SSRS reports as a main part of the job.
Meanwhile, if you want to learn about Power BI, Marco Russo and Alberto Ferrari have you covered. They’ve released a number of books and also have plenty of videos available for free. I’d also recommend Phillip Seamark’s Beginning DAX with Power BI, though I don’t think it’s really a beginner book by any stretch of the imagination. Finally, Mala reminded me that I couldn’t finish a list of Power BI resources without mentioning Guy in a Cube.
Career Next Steps
During our coverage, @jpanagi1 asked us about career guidance:
my background – don’t have a compsci degree – engineering background – learned fundamentals of sql – recommended next steps? do i need more college or certs? what are employers looking for?
We spent some time talking through this specific case. To summarize, @jpanagi is currently working at a company which has teams working on interesting projects, and the question is how to move into that kind of position.
The simple form of our advice is, work with the team a bit on projects. If you have official projects, that’s easiest; if you can volunteer to help them out a bit, that can work too. If you have a reasonable manager, bring up that you’d like to do that kind of work in the medium term. A reasonable manager might not want to lose you, but there are several options here.
One is “in-house” work. Suppose that team is a report-writing and dashboarding team and they’re using some nice products in interesting ways. You might be able to build a dashboard specifically for your team with the same products, gaining experience while providing direct benefits to your team and your manager.
Another option is working part-time on projects with members of the report-writing and dashboarding team. It might not be directly for your team, but maybe a time-sharing of 80-20 or 90-10 could get you some experience on projects and try out something new without your manager losing a person.
If your manager isn’t reasonable, things get harder. In that case, you might need to volunteer on the sly and work a few extra hours with that team after you get your regular work done. I’ve known people who were trained up “after hours,” where they’d stay late at work and learn from someone in another department. Smart managers at good companies formalize this kind of training, but don’t let that limit you otherwise.
Also, a bit of freelancing doesn’t hurt. Again, this is off-hours, but if you want to try out a different path, see if there’s something tangentially related to your team. Learn about the products and try them out. Then, if you come up with something interesting, you can bring it up with your manager. If you ultimately get locked out from doing this thing, well, at least you have a project for your resume, and you can decide whether the position you’re at is the right one.
Shortly before the episode began, Azure had a major outage in its Active Directory service. We used this as a springboard for discussion around what happens when you rely on a cloud service for your company’s infrastructure, but that cloud service goes down. The easiest answer is multi-cloud and cloud/on-premises hybrid scenarios, but those can get expensive and also limit you from some of the benefits of Platform-as-a-Service offerings.
SOS_SCHEDULER_YIELD
Our next segment was around a server with issues that I was asked to take a look at. The two biggest waits on the server were CXPACKET and SOS_SCHEDULER_YIELD. The knee-jerk reaction when you see SOS_SCHEDULER_YIELD is to say that it’s a CPU problem, but Paul Randal explains why that is not necessarily the case.
From there, you want to see if the signal wait time is high—the rule of thumb I’ve seen is that if signal waits are more than about 20% of total wait time for SOS_SCHEDULER_YIELD, then there isn’t enough CPU to go around. In my scenario, signal waits were approximately 99.999% of total waits. So the knee-jerk reaction is to blame CPU, the more thoughtful reaction is to investigate further, and the result of the investigation was to blame CPU.
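If you want to check that ratio yourself, it’s a quick query against the wait stats DMV:

```sql
-- Signal wait time is time spent sitting on the runnable queue waiting for
-- CPU after the awaited resource was already available, so a high ratio
-- points at CPU pressure.
SELECT
    wait_type,
    wait_time_ms,
    signal_wait_time_ms,
    100.0 * signal_wait_time_ms / NULLIF(wait_time_ms, 0) AS signal_wait_pct
FROM sys.dm_os_wait_stats
WHERE wait_type = N'SOS_SCHEDULER_YIELD';
```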
If that’s the case, here are a few of the options potentially available:
Tune some queries. Look for high-CPU queries, especially ones which run frequently. There are a few of those in my scenario, so every second of CPU you shave off per query call is a second which can go to another query and lessen the waits. (See the query sketch after this list.)
Drop MAXDOP and the cost threshold for parallelism if they are set too high. I’m not talking about the knee-jerk reaction of “Parallelism is high, so drop MAXDOP to 1,” but if you have 32 cores and your MAXDOP is 32, that probably doesn’t make much sense for an OLTP system. As a warning, though, changing MAXDOP without reviewing poorly-performing queries can lead to bigger problems: those queries are already struggling even with a high MAXDOP, so reducing it without tuning them first can make them slower still.
Increase CPU cores if you can. If your queries are looking good, it may just be that you have too much load on your server and adding more hardware can be the solution.
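For the query-tuning option, a common starting point is the plan cache sorted by total CPU; this is a sketch, so adjust the TOP and add filters to taste:

```sql
-- Top cached statements by cumulative CPU time. Frequent, moderately
-- expensive queries often matter more than rare monsters.
SELECT TOP (20)
    qs.execution_count,
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```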
SSIS and ADO.NET: A Rant from Someone Not Me
Mala spent far too long working through an issue with multi-subnet failover on Availability Groups and SSIS. We use OLEDB drivers very heavily when working with SSIS, as we appreciate work finishing at a reasonable pace. But the OLEDB drivers installed with SQL Server 2017 don’t include support for multi-subnet failover—the ADO.NET drivers, however, do. So Mala spent a lot of time trying to switch over and cataloged some of the issues she ran into.
Microsoft Ignite is this week. Tom and I spent some time talking about the sessions and themes in the data space. Tom mentioned some sessions he’ll attend. I’m going to check out the Azure SQL Edge session.
Kerberos with Tom
Tom is working through some Kerberos double-hop issues, so we talked a bit about SPNs. Tom’s question was whether we needed to register the SPN for just the SQL Server service account, or also for SSRS, SSIS, etc. My recollection was that you only need to register it for the SQL Server service account, and it looks like that’s correct.
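One easy way to verify after registering an SPN is to ask SQL Server which authentication scheme your connection actually negotiated:

```sql
-- KERBEROS means the SPN resolved and the double hop has a chance of working;
-- NTLM after an SPN change usually means the SPN is missing or wrong.
SELECT session_id, auth_scheme
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;
```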
Innovate and Modernize Apps with Data and AI
I wanted to call out a Microsoft Cloud Workshop that I helped build: Innovate and Modernize Apps with Data and AI. It hits on a lot of Azure technologies, so if you’re interested in how some of these technologies can fit together in an event-driven architecture, check it out.
Conference Thoughts
We wrapped up tonight’s show with a quick discussion of virtual conferences.
Tom has spent a lot of time recently with Azure Lab Services, so our first segment is an interview where we cover some of the details of Azure Labs, how it worked for Tom, how much it ended up costing him, and some of the things you can do with the product.
Raymond brought us into the main topic of the evening: practices and planning for refactoring databases.
We go into it in some detail, but I wanted to share my notes as there were a couple of topics I did not have a chance to cover. What follows is a disjointed set of concepts which hopefully resemble advice in the aggregate.
First, refactoring is not performance tuning. Refactoring is a behavior-preserving code change intended to make maintenance easier; by its nature, it is not intended to make things faster or change what the code does. I gave a few classic examples of refactoring in application code, as well as a couple of examples of database refactoring. In code, think about breaking a segment of code out into its own method or function, or separating two things into different classes. This doesn’t necessarily improve application performance, but it does make the code easier for developers to understand and maintain. On the database side, normalizing a set of tables can be an example of refactoring—you’re not necessarily changing the performance of the application, but you are making it easier to manage the queries which insert, update, and retrieve information. Another example of database refactoring is changing deprecated TEXT columns to VARCHAR(MAX).
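That last one is about as mechanical as database refactoring gets. A sketch against a hypothetical table:

```sql
-- TEXT is deprecated and awkward to query; VARCHAR(MAX) holds the same data
-- and supports the full set of string functions. Match the column's existing
-- nullability when you do this for real.
ALTER TABLE dbo.CustomerNote
    ALTER COLUMN NoteText VARCHAR(MAX) NULL;
```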
One key consideration is, do you have the downtime to perform refactoring? In other words, do you need the application running while your changes happen? If so, check out my talk on approaching zero-downtime code deployments.
I think the most important mindset to take when refactoring is to ask why you need to refactor. Make sure you have a good reason and it’s not just “It’s not my code, so it’s bad.” Understand that code has a story and the developer who wrote it knows more about the story than you do—even if you wrote the code, the you of two years ago was closer to the problem then than you are today. Chesterton’s Fence is a good mindset to have: before you change code, understand what it is doing and why it is there. If you can’t explain why the code looks the way that it does, be very hesitant about changing it.
Keep your database changes in line with your application deployment. Working with app devs reduces the risk that you’ll drop procedures in use or make changes which break the code. Also, try to use stored procedures as interfaces whenever possible; they allow you to make changes much more cleanly than writing queries directly in the application or via an ORM.
Another thing to consider is whether to include a rollback process, or if forward is the only direction. This will depend on your (and your company’s) risk appetite. Only moving forward is a lot easier to develop against, but it does require foresight and is much higher risk. Rollback scripts force you to think about how to back out of a problem, but because they are (hopefully) almost never used, it’s a lot of extra development time. This will depend on the company and scenario—if you’re working on a real-time financial system, then you don’t have much of a choice. But if you’re working on a personal website or an in-house product with a small number of users, it may make more sense just to keep moving forward.
Whenever possible, have tests. A thorough set of database integration tests is great. But if all you have is a hand-created workbench of queries to run, that’s still better than nothing and can be sufficient. The goal is to have enough information to determine if there are issues with the rollforward phase as quickly as possible, ideally before users experience them.
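As an example of the hand-created workbench, here’s a pattern I find handy for refactorings which move data around (table and view names are hypothetical):

```sql
-- The refactored structure should return exactly what the old one did.
-- Zero rows back means nothing was lost; run the EXCEPT in both directions
-- to confirm nothing was invented, either.
SELECT OrderID, CustomerID, OrderDate, TotalDue
FROM dbo.Orders_PreRefactor
EXCEPT
SELECT OrderID, CustomerID, OrderDate, TotalDue
FROM dbo.vw_Orders;
```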
Write your scripts to be re-runnable and put them through change management. Store the scripts in source control. There are a couple philosophies on what to store in source control: either the current state of the system (creation scripts for tables, stored procedures, etc.) or the change process (the set of database modification scripts run over time). I like a compromise approach of having the current state plus some history of changes.
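Re-runnable just means guarding each change so that running the script a second time is harmless. A typical guard, with hypothetical object names:

```sql
-- Safe to run any number of times: the column is only added if absent.
IF NOT EXISTS
(
    SELECT 1
    FROM sys.columns
    WHERE object_id = OBJECT_ID(N'dbo.Invoice')
        AND name = N'InvoiceNote'
)
BEGIN
    ALTER TABLE dbo.Invoice ADD InvoiceNote VARCHAR(MAX) NULL;
END;
```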
Finally, early refactoring can be painful, especially because you’re usually trying to untangle years of growth. Choose some easy targets to build momentum. Those tend to be tangential projects or small parts of a major database. Start building some of the muscle memory around rollback scripts (if needed), change control, and testing; that will make subsequent work that much easier.