Our first topic of the night was LLM Benchmarks, a website aimed at testing a variety of Large Language Models with a series of questions. We saw that some of the answers were fine, though in other cases, the models went a bit off-kilter. One thing I found interesting was the question, “Is Taiwan an independent country?” In fairness, several of the models gave answers which were about as mushy and contradictory as policy is in real life.
SQL Style Guide
The primary topic for this evening was Simon Holywell’s SQL Style Guide. We spent a lot of time on it and my short version: I like that there is a style guide, agree with about 25% of it, don’t care much either way about 25% of it, and hate 50% of it. I don’t think I get into too many proper rants, but I also don’t pull any punches. There was some great chat engagement from Anders and Solomon as well, and we had our own internal debates, showing just how contentious style ideas are.
Row Level Security Performance and Troubleshooting
The next topic was an article on choosing a file format for data. I spent some time digging into the Avro format, as that’s one most data platform people don’t often think about. I also mourned the effective passing of ORC format, as Parquet has completely dominated it.
Our first topic of the night is that Mike Chrestensen will be a future guest on the SQL Data Partners podcast, where we talk a bit about generative AI. Sadly, Mike wasn’t able to make it to this episode, so we had to talk about him rather than with him.
sp_ Procedures Are Bad
The final topic for the evening was an article from Louis Davidson, showing yet another reason why you shouldn’t preface stored procedures with sp_, particularly if you don’t expect to create them in the master database and run them from the context of other databases.
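To make the lookup behavior concrete, here's a minimal sketch; the procedure name and the dbo.Orders table are hypothetical, not from Louis's article. Because the name starts with sp_, SQL Server checks master before the current database on every call, so you pay an extra lookup and risk being shadowed if a same-named procedure ever appears in master.

```sql
-- Hypothetical example: an sp_-prefixed procedure created in a user database.
-- The dbo.Orders table is made up for illustration.
CREATE PROCEDURE dbo.sp_GetRecentOrders
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP (10) OrderID, CustomerID, OrderDate
    FROM dbo.Orders
    ORDER BY OrderDate DESC;
END;
GO

-- SQL Server resolves sp_-prefixed names against master first, so this call
-- pays an extra lookup.  Dropping the prefix avoids the problem entirely:
-- CREATE PROCEDURE dbo.GetRecentOrders ...
EXEC dbo.sp_GetRecentOrders;
```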
They’re spending like a startup expecting constant cash infusions. Microsoft already spent $10 billion to buy 1/3 of the company, so I’m not sure how many more rounds of big money they’d expect to get.
GPUs are inordinately expensive—we’re talking $45K for a single H100 GPU, and training models the size of ChatGPT can take thousands of those GPUs and enormous amounts of time and energy.
With Facebook open sourcing LLaMA, which spawned community fine-tunes like WizardLM and Vicuna, there’s a free alternative which goes beyond hobbyist level and thereby prevents OpenAI from getting an effective monopoly in generative AI models.
Mike brought up a great point during the broadcast, something I’ve mentioned before: OpenAI is the target of several lawsuits, and depending on how those go, litigation and precedent could severely impact their ability to make money long-term, especially if they’re found to have violated the rights of copyright holders in their model training process.
Why it wouldn’t go bankrupt:
ChatGPT is still the big brand name in generative AI and that means a lot.
Microsoft has tethered themselves to the OpenAI mast. They’ve dumped in $10 billion USD and MS Build was all about generative AI + everything, to the point that I found it ridiculous. They’re not going to let their generative AI partner wither away and embarrass Microsoft like that.
If they don’t get a lot of revenue from Azure OpenAI or their own independent work, they can scale down their costs pretty significantly by slowing down new training and new releases. Remember that the debate point is that OpenAI goes bankrupt by the end of 2024, less than 18 months from now. Scaling down costs may end up just kicking the can down the road, but they could remain viable by slowing their costs, not training new models as frequently, and cutting staff.
Patient Matching
The other major topic of the night was around patient matching. I received a question asking how you would determine whether two records refer to the same person, given information like name, address, date of birth, and maybe a patient ID.
Mike, Mala, and I have all worked in the health care sector and had to work on similar problems, so we each discussed techniques we’ve used. My summary is:
Use as many non-ML tricks as you can up-front to simplify the problem, removing obvious duplicates and non-duplicates (there’s a minimal T-SQL sketch of this kind of first pass after this list). Mike recommends starting with the NDI ruleset as a good first-pass approach. The reason we want to start with non-ML approaches is that the ML approaches tend to be O(M*N) or O(N^2) at best and O(N^3) or even O(2^N) at worst. In other words, they get really slow as your datasets grow in size and probably won’t scale to millions of records.
If all you have is names, HMNI is a pretty good package for name matching.
Most likely, you have more than just names, and that’s where a technique known as record linking comes in. You may also see people refer to it as record linkage. For an academic survey of techniques, I can recommend this paper, which is about a decade old—just don’t assume that the products they recommend are still the best.
Spark can be a good platform for large-scale record linkage, and there’s a library called Splink which helps you find linked records in a distributed fashion. Scale-out won’t solve the O(N^2) problem, but it will mitigate it: instead of taking 24 days, an 8-node cluster might require only 4 days. If you’re working with enormous datasets, that time decrease can be worthwhile.
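To make that non-ML first pass concrete, here’s a minimal T-SQL sketch. It assumes a hypothetical dbo.Patients table with PatientID, PatientName, DateOfBirth, and PostalCode columns; none of this comes from the actual discussion, it just illustrates the idea of clearing out exact duplicates and blocking candidate pairs before any expensive fuzzy matching.

```sql
-- Step 1: flag exact duplicates on normalized name + date of birth + ZIP,
-- so the obvious cases never reach the expensive matcher.
WITH Normalized AS
(
    SELECT
        PatientID,
        UPPER(LTRIM(RTRIM(PatientName))) AS NormName,
        DateOfBirth,
        LEFT(PostalCode, 5) AS Zip5,
        ROW_NUMBER() OVER
        (
            PARTITION BY UPPER(LTRIM(RTRIM(PatientName))), DateOfBirth, LEFT(PostalCode, 5)
            ORDER BY PatientID
        ) AS rn
    FROM dbo.Patients
)
SELECT PatientID
FROM Normalized
WHERE rn > 1;  -- obvious duplicates to merge or remove up front

-- Step 2: only generate candidate pairs within a block (here, same date of birth),
-- so the fuzzy matcher compares far fewer than all N^2 possible pairs.
SELECT p1.PatientID AS LeftID, p2.PatientID AS RightID
FROM dbo.Patients AS p1
    INNER JOIN dbo.Patients AS p2
        ON p1.DateOfBirth = p2.DateOfBirth
        AND p1.PatientID < p2.PatientID;
```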
Our first major topic of the night is that the Carolina Code Conference is coming up in just a couple of weeks: August 19th in Greenville, South Carolina. I’m going to be a presenter and am also a sponsor, so I’ll be pushing my wares. Tickets are free, though there’s a limited number for this year’s event.
Will AI Fix Work?
The second topic was one we spent a whole lot of time on, and Betteridge’s Law of Headlines gives my quick answer: no, AI will not fix work. Mala dropped this hot potato in my lap right before showtime, so we talked through it together and a good amount of my cynicism showed. I think there are areas where AI in general—and generative AI in particular—can help individual employees, but we’re riding the hype cycle curve right now, and I doubt these sorts of technologies have massive long-term benefits. I’m willing to be wrong on this, but it feels like stories from 25 years ago about how learning to code would eliminate all of the burdens of the average employee’s life.
The first topic we hit was in-person attendance for our hybrid main meeting. We’ve had a few hybrid meetings so far and the large majority of attendees for these are still remote. Our gracious host (a company which has hosted TriPASS meetings for more than a decade) does need to hit certain thresholds for it to make sense for them to host us, and our current numbers aren’t cutting it. So far, I’ve come up with two ideas:
Make the main meeting in-person only (i.e., no hybrid meeting) and keep the other two remote-only
Go back to remote-only meetings
I’m not particularly satisfied with either of these, honestly, but I’m struggling to find a good option which increases in-person attendance but lets us remain hybrid. Back in 2019, the hybrid component was an “extra” but the large majority of attendees were coming in-person. At this point, the script is flipped.
Planning for a Successful Database Migration on AWS
Mala wanted to talk about this article. She started off not liking the article very much, but after we talked about it, we agreed that it’s pretty good as a really high-level overview, though there’s a fair amount of marketing fluff in it, as well as wishcasting that people will replace a database system that works with one that pays Amazon more money.
Replacing Customer Service with ChatGPT
Another article from Mala, though this time we had some more to sink our sarcastic teeth into. It sounds like they created a reasonably smart chatbot to do level 1 tech support—basically, an interactive FAQ that probably incorporates some billing and account data in the prompt inputs. Ignoring the idiocy of bragging about laying off most of your customer support people, this is probably a reasonable use for generative AI, at least until it lies to your customers and your few remaining humans in customer support need to put out those new fires.
Kusto Detective Agency Season 2
The final topic for today’s show was a quick discussion of Kusto Detective Agency, now that season 2 is in full swing. This is definitely not a site for beginners, but it is a really fun challenge if you have some Kusto chops already. Or you can read one of the many walkthroughs online to get you through it. I got about 2/3 of the way through season 1 without having to rely on a walkthrough, but couldn’t get much further. I haven’t tried season 2 yet, but it’s on my todo list.
Our first topic comes from Tracy, who noticed an odd situation in which using CONVERT() on a dataset ended up about 200ms slower than using CAST(). I didn’t have any good answers, though chat and I speculated on possibilities.
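We never saw Tracy’s actual query, so here is a purely hypothetical illustration of the two forms (dbo.Orders and OrderDate are made-up names). The conversion itself is the same either way, with CONVERT additionally taking a style code, so a consistent gap like that would normally send me looking at the rest of the query and its plan rather than at the function.

```sql
-- Hypothetical comparison of the two forms; table and column names are made up.
-- CAST and CONVERT perform the same conversion; CONVERT also accepts a style
-- code (121 below is the ODBC canonical datetime format with milliseconds).
SELECT
    CAST(o.OrderDate AS varchar(30))       AS CastVersion,
    CONVERT(varchar(30), o.OrderDate, 121) AS ConvertVersion
FROM dbo.Orders AS o;
```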
The Future (and Past) of No-Code Programming Languages
After that, we meandered a bit on no-code programming languages, using this article as a jumping-off point. There’s some value in no-code languages but finding a general-purpose, no-code programming language which doesn’t want to make you tear your hair out in frustration as soon as you leave the happy path is like searching for El Dorado.
If you’re on that happy path, then they can be a good way of getting started. Examples I used include SSIS, Informatica, and the Azure ML designer. These are custom-purpose canvases for specific operations. Trying to turn diagrams into code is something that we’ve seen since Rational Rose (and maybe even earlier). It just doesn’t work.
Using AI for Movie Funding Decisions
Our final topic came from this Hollywood Reporter article about Warner Brothers using an ML model to help determine whether to fund movie production. We talked a bit about it, including some parts without snark. Just a bit.
Our first topic was an update on SQL Server tools. Mala and I talked about our experiences using Azure Data Studio, and I mentioned a three-part series from Erin Stellato on the most recent release of SQL Server Management Studio (part 1, part 2, part 3).
Azure Data Studio has made some great strides, especially in how it displays execution plans. If T-SQL development is your primary use case, give Azure Data Studio a try. If you’re mostly a database administrator, you can also give Azure Data Studio a try, though SQL Server Management Studio is likely to have more of what you need to do your job.
AI Feedback Loops and Costs of Training
Anders pitched me a softball with this article about the AI feedback loop. The idea is that, with the proliferation of Large Language Models (LLMs), these models are generating a large share of the data which will be used to train the next generation of LLMs. Model quality tends to drop when models train on auto-generated gibberish, and the degraded models then produce even more auto-generated gibberish.
I also covered an article on diminishing returns to Deep Learning techniques, indicating that it takes about k^9 times as much computational power to cut errors down to 1/k. For example, if we want to cut training errors in half, that’s k=2, and we need about 500x (2^9 = 512) as much computational power to do that. For this reason, there are practical limits to how much neural network model improvement we can expect from each additional hardware generation.
A Case Study in Insider Risk Management
The last topic was an interesting story I learned about because of a training I helped deliver. The case study involved a woman who stole a formula while working at Coca-Cola and tried to use it to bootstrap a new company in China. The whole story is quite long, but it goes into a good amount of detail and is pretty interesting, especially in how it describes the way she was able to circumvent the Data Loss Prevention software Coca-Cola was using at the time.
Microsoft Build was the day after our prior episode of Shop Talk, so this time around, we dove into Microsoft Fabric. I answered some questions for Mala and gave her a better idea of where this fits in the modern data warehousing world. We spent a bit of time talking about licensing, and I talked a bit about why I really don’t like the name.
OpenAI and Yet Another LLM Rant
One of these days, I’ll stop ranting about LLMs. Probably when the next thing hits the peak of the hype cycle. We spent some more time talking about OpenAI and LLMs, as it seemed that about 2/3 of Build content was geared that way. I was surprisingly less critical of it than I have been over the past few weeks, but I think that’s just an internal attempt not to go too far overboard in my critique—it’s not that there’s zero utility in LLMs; it’s that humans need to be in the loop the whole time, and we need to remember that these are language models and not fact models. I also talked a bit about a great article by Eric Hartford on uncensored models. I completely agree with Eric: it’s fine to add your own bias after the fact—for example, I don’t want my corporate-friendly chat assistant to start engaging in lewd conversation with potential customers—but I also don’t like the idea of some unrelated organization deciding what’s acceptable and what isn’t.
Microsoft Build was the day after this episode of Shop Talk, so we discussed a few items around Microsoft’s then-known emphasis on AI’ing all the things, as well as poking fun at a few deprecated features like StreamInsight.
We had a question come in from chat about how to get started with Azure SQL, getting an idea of what it entails versus on-premises SQL Server. I recommended the Azure SQL Workshop and Mala included a YouTube video from Anna Hoffman explaining when to use each service.
Recently I have taken over the DB tasks at the company I work at. The structure is either defined by the software we use or is laid out logically enough. The only problem I’ve noticed so far, though, is the array of methods by which data is manipulated in SQL Server for user-defined tables. For example, we use:
– SSIS packages in the Integration Services catalog (SSISDB) – found eventually by using a file table, but exported manually right now since FILESTREAM isn’t set up on the server
– SSIS packages in the file system called by SQL Server Agent jobs – parsed with a PowerShell script
– SQL Server Agent T-SQL steps – using T-SQL to parse the system tables and read the step commands for keywords
– Procedures – a T-SQL script searching procedure definitions in all databases for keywords
– Triggers – Similar to procedures
Normally I use all these tools I’ve developed to track where changes are being made, and it has worked really well until the other day 🙁 For the life of me I can’t find where a table is being updated and it’s driving me crazy! I realized I’ve even missed cases like SQL Server Maintenance Plans, Task Scheduler events, software methods/functions that connect to the database, etc. Is there a way to see where and how a DB object is manipulated and have the ability to go modify that code? We would like to stay away from installing 3rd party tools, but I am open to suggestions. The other day I read about SQL Server Profiler, which comes with SSMS. I think that is enabled and I am going to try it on Monday. However, I am curious: will that return every context in which the object is updated? I have zero knowledge about this tool. Also, if there is another built-in tool you’d recommend, I am all ears, as I am new to SQL Server.
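One quick, built-in option, offered here as a sketch rather than as the answer we gave on the show (YourTableName is a placeholder): search the plan cache for statement text that references the table. This catches procedures, triggers, Agent job steps, and ad hoc batches alike, but only for plans still in cache, so for ongoing capture you’d want an Extended Events session (the modern replacement for Profiler) instead.

```sql
-- Find cached statements whose text mentions the table, newest first.
-- YourTableName is a placeholder; only plans still in cache will show up.
SELECT TOP (50)
    DB_NAME(st.dbid)                  AS DatabaseName,
    OBJECT_NAME(st.objectid, st.dbid) AS ObjectName,   -- NULL for ad hoc batches
    st.text                           AS QueryText,
    qs.last_execution_time,
    qs.execution_count
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%YourTableName%'
ORDER BY qs.last_execution_time DESC;
```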
When you are content simply passing back a “bag of attributes” to the application and letting it sort things out, especially if you know the result set will definitely be small.
When building a proof of concept and you are positive (not just hopeful) that you will rebuild the data model before the first actual release.
The second use case is the scary one, as there are many things we tell ourselves to mollify the voices in our heads saying that this is a bad idea.
Are Databases Legacy Technology?
The last topic came from a couple of anecdotes about “kids these days” and coming out of computer science programs not knowing about databases. Most computer science programs are bad at teaching anything database-related—if you want to learn about relational theory, there’s usually at most one course (undergraduate or graduate-level), but if you want to learn practical applications of databases, you’re usually out of luck.
I think that this is a challenge, as it means we have new developers coming out of school who are completely ignorant of how data management works. At the same time, this is a challenge which has been with us for quite some time. When I was an undergraduate, I had…one course offered on relational theory. I don’t think the pattern has gotten worse, but there’s obviously a gap and the question is, is there a broader-based way of filling that gap than trying to introduce a database course one university at a time? I don’t have a good answer for this, though we do talk out the idea a bit and maybe that will help somebody come up with a better solution.