It’s election season for TriPASS and we have a slate of electors up for vote. If you are a TriPASS member, you’ll have received an e-mail about this.
The Power of Anti-Semi-Joins
I learned something from Erik Darling. I’ve learned many things from Erik Darling, but this one was around NOT EXISTS versus 0 = (SELECT COUNT(*)...). It turns out that SQL Server has an optimization that makes both return the same execution plan. After the recording, I went and checked Postgres, which does not have the same optimization. This is yet another case of how behaviors in one platform may subtly differ from others.
Approximate Functions
Mike successfully side-tracked me and we talked a bit about the three approximation functions in SQL Server: APPROX_COUNT_DISTINCT, APPROX_PERCENTILE_CONT, and APPROX_PERCENTILE_DISC. I explained a bit about how they work, linked to some of the original academic papers, and described the difference between the continuous and discrete functions.
Tenant Switching in Microsoft Fabric
Our final topic of the night was from a Koen Verbeeck blog post, covering tenant switching in Microsoft Fabric and the Power BI Service. This is another example of the classic Microsoft pattern of developing something without designing in the ability to switch tenants / subscriptions / accounts.
It’s election season for TriPASS and we have a slate of electors up for vote. If you are a TriPASS member, you’ll have received an e-mail about this.
Common Table Expressions
A question came in from chat:
Hi Kevin, I got a question as a beginner in SQL. When I am writing queries that requires a combination of joins inside multiple CTE’s, I get a timeout error in my application. Individually, the CTE’s work, and work until a number of CTE’s are joined. Is there a way to optimize this type of query?
Mike and I talked a bit about the topic and dug in further. The good news is that the chatter followed up with me afterward and we did, in fact, help with this problem. There’s a first time for everything, I suppose.
SSMS 21
Our other topic was what’s coming in SQL Server Management Studio 21. We talked about blog posts from Brent Ozar, Reitse Eskens, and Vlad Drumea. Along the way, I outed myself as a Light Mode beacon of brilliance.
The TriPASS 2025 call for speakers is open. We are accepting sessions for all three of our groups: the advanced DBA group (2nd Tuesday of each month), main meeting (3rd Tuesday of each month), and data science & business intelligence group (4th Tuesday of each month).
Why SQL Server vs PostgreSQL?
Our topic for the night was from Alex. Repeating the question:
In a recent shop Talk Mike Chrestensen mentioned that he is working on a project to migrate from Oracle to SQL Server. I’m curious what led him to choose SQL Server instead of PostgreSQL or some other open source ($0 licensing cost) DBMS.
Mike talked about the specific circumstances around his company. From there, I dove into the topic and explained my reasoning for why someone might want to choose SQL Server or PostgreSQL. I didn’t use the obvious “We’re already using this technology” answer, but instead focused on some of the key items in favor of each. I also talked about a bit of history at a prior employer.
An Overview of HammerDB
We also talked about a great tool for comparison testing of hardware & relational database products called HammerDB. This is an open-source solution that uses the TPC-C (OLTP-style queries) and TPC-H (OLAP-style queries) workloads. The UI is a bit rough, but the tool is powerful and is a lot easier than trying to figure out how to replicate your own workload in a non-production environment.
Our topic for the night was from an anonymous user. Summarizing the question:
I am currently an infrastructure DBA for an MSP. I’m wanting to gain experience with Performance Tuning SQL Server. My job is an old-school infrastructure DBA job; we handle Data protection and High availability/DR.
What’s the best next step in my career to gain experience in performance tuning? Would trying to get on a consulting firm be the best bet for gaining experience in performance tuning? Or would I already need to be very good at performance tuning to even get hired on at a consulting firm? Also, what’s life like working for a consulting firm—that is, work-life balance, or working nights and weekends?
We covered quite a few points in this discussion. First up, some recommendations on books that I’d recommend for people really digging into performance tuning, as well as one video series that Mala recommended.
We also covered life in four stereotypical types of roles: full-time employee, contractor, consultant for a large firm, and independent consultant (with some points around boutique consulting). As always, the specifics of any role are critical and we could only cover generic statements of concept.
The next story was an opportunity for me to rant about IoT insecurity and unethical business practices: Deebot robot vacuums collecting photos and audio. Oh, and the vacuum cleaners have a major security vulnerability. Basically, this hit a bunch of hot-button issues for me all at once.
Our last story was me bouncing off of the start of a blog post around transforming the test of a SQL query to match user intent. We didn’t get into the specifics of the post itself, but rather I focused hard on the question of whether the query optimizer should make optimizations to a query or take the query as-is. My answer is a firm ‘yes’ because it is in the nature of a 4th generation language. In a 4GL, users do not write procedural code explaining how the system should operate. Rather, they define the desired end state and leave it to the system to find the best way to get there.
Our primary topic for the evening was an article in the Wall Street Journal (thanks, Marshall) bemoaning the state of searching for a job in the tech space, or in the tech-adjacent space, or in the not-quite-tech-but-also-not-quite-working space. I drop some spicy takes, and if I end up in a ditch somewhere, it’s almost certainly not because of Boeing. Remember: Boeing. They get you places, one piece at a time.
Rate My Dashboard
The other topic was a blog post from Martin Schoombee, which I have been trying to talk about on Shop Talk for a good month. Martin talks about some of the challenges in rating the quality of a dashboard, namely that you have to understand what it is that the user needs to get out of a dashboard. Shiny gee-gaws and pretty widgets are great and all, but at the end of the day, a dashboard’s purpose is to provide relevant information to its users, allowing them to act upon that information in a timely manner. Did you do this? Congratulations: you have at least a minimally successful dashboard. Now, from there, aesthetically pleasing is fine. But even there, a majority of the aesthetic advice has as its fundamental purpose ensuring that users can quickly and easily learn information. That’s why we spend so much time talking about types of graphs, ensuring that we consider accessibility, etc.
Our primary topic for the evening was a review of Red Gate’s State of the Database Landscape 2024 survey. Red Gate received 3849 responses and put together a great deal of information, so we covered a fair amount of it. I also got sidetracked a couple of times on specific choices for visuals, but that complaining aside, I appreciate all of the work Red Gate put into the survey and their findings were interesting enough that we discussed them for quite some time.
OpEx vs CapEx
Along the way, ENHANCE_QC brought up a comment about operating expenditures versus capital expenditures in chat and I dove into the topic a bit, referencing a couple of slides from a talk I deliver on saving money in the cloud.
The first topic of the night was a blog post by Pete Warden, asking why the Internet of Things phenomenon never got past the trough of disillusionment in Gartner-speak. Pete brings up several good points, though I spend a long time railing on one point I didn’t see from Pete: the utter lack of security. But in fairness, I’m also not sure that the utter lack of security is actually a factor in why the mass market generally rejects IoT objects, because I’m very cynical about this topic.
Anders and I also spent a bit of time covering industrial IoT, where there’s a much stronger use case. And I went half-ranty on the desire for Intranet of Things and not Internet of Things.
Goodbye, System.Data.SqlClient Package
Topic number two comes from a David Engel announcement: Microsoft is deprecating the System.Data.SqlClient package. For the most part, this shouldn’t be a big deal to switch over because Microsoft.Data.SqlClient has been around for several years and has significantly grown in capability. It did remind me of FSharp.Data.SqlClient, which needed System.Data.SqlClient back when I used it.
Improving Full-Text Index Performance
Mala wanted us to talk about a Microsoft article on improving full-text index performance. This article brought back memories, many of which were not good. For example, it reminded me that AWE was a thing. For those who are blissfully unaware, prior to SQL Server being a 64-bit installation, the x86 version was limited to 2GB of RAM. Unless, of course, you enabled the Address Windowing Extensions capability, which allowed you to use 3GB of RAM for the database engine. Given that this article prominently talks about AWE, I think it gives you an idea of how quickly advice changes around full-text indexing.
I will give the article major kudos on one point: it provides an explicit formula for “How much memory will I need for this?” Too often, the answer to that question is, “Dunno, how much you got?” And that’s not a particularly satisfying answer. In fairness, the reason for this is there are so many factors that affect memory needs, including database sizes, indexing strategy, expected query response times, other services running on the machine, etc. That makes it really hard to give a satisfying a priori answer. This article does that, at least with respect to how much memory the fdhost.exe service will require. I think it definitely falls short on amount of memory left over, especially for modern servers, but I still applaud its attempt to get to a conclusive answer.
The first topic of the night was EightKB, an online conference for advanced data platform topics. Their annual event happened recently and you can see all of the videos on their YouTube channel. If you want to dig into topics around SQL Server and other data platform technologies, EightKB is one of the best places to go.
Training Resources for R
Our next question came via e-mail and was around good training resources for a person learning R. In this case, the person was not necessarily that technical but did want to learn about data analysis. Mala, Mike, and I all know R and so we put our heads together and came with the following as resources we can recommend:
R for Data Science (2nd Edition) — Free book by one of the core developers of the R programming language
The Epidemiologist R Handbook — Free handbook that specializes in epidemiology but does teach a lot about principles of R as well, so it’s certainly not limited to health topics. I haven’t read this but Mala Mahadevan recommended it.
An Introduction to R — Another free book. I haven’t read this but Mala Mahadevan recommended it.
Practical Data Science with R (2nd Edition) — This is a book I whole-heartedly recommend for people getting into data science. Your wife probably won’t care about some of it (stuff like application lifecycle management) but it does a great job on the topic.
DataCamp — There are 127 separate courses on R currently hosted on DataCamp (as of the time of recording). This is a paid service, but I’ve been paying for it for several years.
Running out of Flight Numbers and other Data Growth Problems
Our third major topic of the night was a post from View from the Wing (via the Brent Ozar Newsletter) around airlines getting closer to having 10,000 flight numbers and some of the challenges around it, especially for places that still have analog boards displaying flight information. None of us hosts has enough knowledge of airline industry software to know how easy or difficult it would be to change (though probably absurdly difficult, given that this seems to be the norm), though Anders in chat did share some interesting thoughts on the topic. We did digress a bit to discuss other problems companies run into, such as running out of identity integers.
The first topic of the night covered the recent CrowdStrike incident. Mala pointed us to one article and I brought up an article in The Register. There’s a fair amount that we know about the story, so we covered what we had, including a brief discussion of travel issues. I was fortunate in that my travel happened after American Airlines figured out their issues, but Friend of the Show Marshall had some tales of woe in Atlanta.
Languages and the Stack Overflow 2024 Survey
Our next topic came from Friend of the Show Mark Gordon, who asked what languages developers are into these days. I shared my list: C# and Java are common but not necessarily popular languages. Python, Rust, and Kotlin are popular languages (as is Go, though I forgot to mention that one), with Python really dominating the mainstream in a bunch of fields. And, of course, for any front-end development, there’s the JavaScript Framework of the Month Club: React, Vue, Express, etc.
This also happened to coincide with Stack Overflow’s 2024 survey of nearly 50,000 developers. We looked at the languages on this list and talked about which languages are over-represented (due to the survey being a sample of Stack Overflow users, who are themselves a biased sample of the overall developer population). A few languages like COBOL are underrepresented, I argued, as people aren’t going to Stack Overflow to answer COBOL questions. We also talked about longevity in systems: C++ and C are both still high on the list, even though I’d imagine that most new development isn’t happening in those languages. But that’s the effect of legacy in programming languages.
GitHub Forks and Repo Privacy
Our final topic was an article from Joe Leon at Truffle Security, looking at accessing deleted and private repo data on GitHub by use of forks. My stance on this is, it’s worth knowing that there’s a risk from forks, but the behavior makes sense and enables open-source projects to offer the ability for people to make pull requests without needing to grant write permissions to randos. GitHub can also rightly say RTFM, where they clearly indicate the consequences of enabling forks on a repository. So if you are administrator of GitHub repos in an enterprise, read the article, understand what it means, and carry on.