Shop Talk: 2023-10-09

The Recording

The Panelists

  • Kevin Feasel
  • Mike Chrestensen

Notes: Questions and Topics

Data Relay Debrief

Our first topic was Data Relay, where I had spent the prior week. If you’re not familiar with it, Data Relay is a touring conference which spans 5 cities in 5 days across the UK. We talked quite a bit about the first day’s events, as that happened to be the National Museum of Computing in Bletchley Park.

Yes Aliases

Mike was responsible for the next topic, covering whether or not to alias table names in queries. We used this Reddit post to springboard the discussion, but it’s something Mike is dealing with at work as well. We talked about how a “no aliases” policy is a bad idea, but also about rules of thumb which make aliasing easier to understand. I tend not to have hard-and-fast rules around aliasing, and I certainly don’t expect aliases to be globally unique (e.g., a dbo.Patient table always and everywhere aliased as pat, with no other table ever aliased as pat), but aliases should be meaningful enough within the context of a query to make clear which table, view, or expression each one represents.
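To make that rule of thumb concrete, here’s a quick sketch with hypothetical tables: none of these aliases follow a global standard, but each one is obvious within the query.

```sql
-- Hypothetical tables; the point is that each alias is short but
-- unambiguous within this particular query.
SELECT
    p.PatientID,
    p.LastName,
    v.VisitDate,
    d.DiagnosisCode
FROM dbo.Patient p
    INNER JOIN dbo.Visit v
        ON p.PatientID = v.PatientID
    INNER JOIN dbo.VisitDiagnosis d
        ON v.VisitID = d.VisitID;
```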

Fully Qualified Object Names

Mike also had a question about qualifying object names. For example, in SQL Server, you can have up to four-part naming: instance.database.schema.object. We talked a bit about what each of those four parts actually means and why I always include schema and object when querying, almost never include database, and even more rarely include instance.
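As a sketch of the naming levels, with all object names hypothetical:

```sql
-- Two-part naming (schema.object): a sensible default for everyday queries.
SELECT p.PatientID
FROM dbo.Patient p;

-- Three-part naming adds the database; four-part naming adds the instance,
-- which in practice usually means a linked server. Names are hypothetical.
SELECT p.PatientID
FROM SomeLinkedServer.SomeDatabase.dbo.Patient p;
```

Always including the schema keeps name resolution unambiguous, while hard-coding the database or instance ties the query to one environment.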

Shop Talk: 2023-09-25

The Recording

The Panelists

  • Kevin Feasel
  • Mike Chrestensen
  • Mala Mahadevan

Notes: Questions and Topics

READPAST Hint

Mala ran into the READPAST hint in the wild and wanted to know more about it, so that was our first topic of the evening. We walked through an article which provides a good description of what the hint does, although I did point out that it doesn’t always work. The best use for READPAST is in things like queueing systems, where you can safely ignore the locked items and pick up an unlocked item.
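The queueing scenario typically looks something like this sketch, using a hypothetical dbo.WorkQueue table. READPAST tells the session to skip rows which other sessions have locked, so concurrent workers each dequeue a different unlocked item instead of blocking on one another.

```sql
-- Hypothetical table-as-queue dequeue pattern. UPDLOCK + ROWLOCK take a
-- narrow lock on the row we claim; READPAST skips rows already locked by
-- other workers rather than waiting on them.
WITH NextItem AS
(
    SELECT TOP (1)
        q.QueueID,
        q.Payload
    FROM dbo.WorkQueue q WITH (ROWLOCK, UPDLOCK, READPAST)
    ORDER BY q.QueueID
)
DELETE FROM NextItem
OUTPUT deleted.QueueID, deleted.Payload;
```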

Row-Level Security Attacks

The next article we looked at was a good article on row-level security attacks. For the most part, this boiled down to attacks around row-level security rather than through row-level security, although there was an interesting side channel attack involving division by 0.
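The division-by-zero side channel looks roughly like the following sketch, with a hypothetical table and value. The idea is that if the engine evaluates the user’s predicate before the security predicate, an error can fire on a row the user isn’t supposed to see.

```sql
-- Hypothetical: row-level security hides other employees' rows from us.
-- If our predicate runs against a hidden row, a divide-by-zero error can
-- confirm that some row exists with Salary = 100000, even though we could
-- never SELECT that row directly.
SELECT e.EmployeeID
FROM dbo.Employee e
WHERE 1 / (e.Salary - 100000) = 0;
```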

SQL Server 2022 Utilization

We got a question in chat about whether SQL Server 2022 is worth learning, as many companies haven’t moved to it. It’s a fair question, but my short answer is yes, it can make sense to learn about SQL Server 2022 even if your company is using 2016. There are a couple of reasons. First, SQL Server’s database engine doesn’t tend to change radically between versions, so it’s pretty easy to get up to speed on multiple versions of SQL Server. Second, you might find something useful in SQL Server 2022 which isn’t available in prior versions, and that could drive a change.

Shop Talk: 2023-09-11

The Recording

The Panelists

  • Kevin Feasel
  • Mike Chrestensen
  • Rick Pack

Notes: Questions and Topics

LLM Benchmarks

Our first topic of the night was LLM Benchmarks, a website aimed at testing a variety of Large Language Models with a series of questions. We saw that some of the answers were fine, though in other cases, the models went a bit off-kilter. One thing I found interesting was the question, “Is Taiwan an independent country?” In fairness, several of the models gave answers which were about as mushy and contradictory as policy is in real life.

SQL Style Guide

The primary topic for this evening was Simon Holywell’s SQL Style Guide. We spent a lot of time on it and my short version: I like that there is a style guide, agree with about 25% of it, don’t care much either way about 25% of it, and hate 50% of it. I don’t think I get into too many proper rants, but I also don’t pull any punches. There was some great chat engagement from Anders and Solomon as well, and we had our own internal debates, showing just how contentious style ideas are.

Row Level Security Performance and Troubleshooting

Our second topic was a Ben Johnston article on the performance of row level security in SQL Server. This is a really good article.

Choosing a Data Format

The next topic was an article on choosing a file format for data. I spent some time digging into the Avro format, as that’s one most data platform people don’t often think about. I also mourned the effective passing of ORC format, as Parquet has completely dominated it.

SQL Server Security Checklist

The final topic of the night was a SQL Server security checklist from Hemantgiri Goswami. It’s a perfectly reasonable guide if you don’t have anything specific of your own.

Shop Talk: 2023-08-28

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Mike to Be on the SQL Data Partners Podcast

Our first topic of the night was that Mike Chrestensen will be a future guest on the SQL Data Partners podcast, where we talk a bit about generative AI. Sadly, Mike wasn’t able to make it to this episode, so we had to talk about him rather than with him.

sp_ Procedures Are Bad

The final topic for the evening was an article from Louis Davidson, showing yet another reason why you shouldn’t preface stored procedures with sp_, particularly if you don’t expect to create them in the master database and run them from the context of other databases.
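The core of the problem is easy to see for yourself. In this sketch (procedure and database names are hypothetical), SQL Server checks master first when resolving an sp_-prefixed procedure:

```sql
-- The sp_ prefix makes SQL Server look in master before the current
-- database. Procedure and database names here are hypothetical.
USE master;
GO
CREATE PROCEDURE dbo.sp_WhichDatabase
AS
    SELECT DB_NAME() AS CurrentDatabase;
GO
USE SomeUserDatabase;
GO
-- Resolves to the copy in master, even though we never created the
-- procedure in SomeUserDatabase; it still runs in this database's context.
EXEC dbo.sp_WhichDatabase;
```

That master-first lookup is the one legitimate use for the prefix, which is why Louis’s advice is to avoid it everywhere else.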

Shop Talk: 2023-08-14

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen

Notes: Questions and Topics

Debate — Resolved: OpenAI to Go Bankrupt by 2024

Our first major topic of the night was based around an article speculating that OpenAI could go bankrupt by the end of 2024. The authors of the article lay out a few reasons for why they think so and I summarized some thoughts in support of the contention and some in opposition.

Why it could go bankrupt:

  • They’re spending like a startup expecting constant cash infusions. Microsoft already spent $10 billion to buy 1/3 of the company, so I’m not sure how many more rounds of big money they’d expect to get.
  • GPUs are inordinately expensive—we’re talking $45K for a single H100 GPU, and training models the size of ChatGPT can take thousands of those GPUs and enormous amounts of time and energy.
  • With Facebook open sourcing LLaMA, the base model behind community fine-tunes like WizardLM and Vicuna, there’s a free alternative which goes beyond hobbyist level and thereby prevents OpenAI from getting an effective monopoly in generative AI models.
  • Mike brought up a great point during the broadcast, something I’ve mentioned before: OpenAI is the target of several lawsuits, and depending on how those go, litigation and precedent could severely impact their ability to make money long-term, especially if they’re found to have violated the rights of copyright holders in their model training process.

Why it wouldn’t go bankrupt:

  • ChatGPT is still the big brand name in generative AI and that means a lot.
  • Microsoft has tethered themselves to the OpenAI mast. They’ve dumped in $10 billion USD and MS Build was all about generative AI + everything, to the point that I found it ridiculous. They’re not going to let their generative AI partner wither away and embarrass them like that.
  • If they don’t get a lot of revenue from Azure OpenAI or their own independent work, they can scale down their costs pretty significantly by slowing down new training and new releases. Remember that the debate point is that OpenAI goes bankrupt by the end of 2024, less than 18 months from now. Scaling down costs may end up just kicking the can down the road, but they could remain viable by slowing their costs, not training new models as frequently, and cutting staff.

Patient Matching

The other major topic of the night was patient matching. I received a question asking how you would determine whether two people are the same, given information like name, address, date of birth, and maybe a patient ID.

Mike, Mala, and I have all worked in the health care sector and had to work on similar problems, so we each discussed techniques we’ve used. My summary is:

  • Use as many non-ML tricks as you can up-front to simplify the problem, removing obvious duplicates and non-duplicates. Mike recommends starting with the NDI ruleset as a good first pass approach. The reason we want to start with non-ML approaches is that the ML approaches tend to be on the order of O(M*N) or O(N^2) at best and O(N^3) or even O(2^N) at worst. In other words, they get really slow as your datasets grow in size and probably won’t scale to millions of records.
  • If all you have is names, HMNI is a pretty good package for name matching.
  • Most likely, you have more than just names, and that’s where a technique known as record linking comes in. You may also see people refer to it as record linkage. For an academic survey of techniques, I can recommend this paper, which is about a decade old—just don’t assume that the products they recommend are still the best.
  • Spark can be a good platform for large-scale record linkage, and there’s a library called Splink which helps you find linked records in a distributed fashion. Scale-out won’t solve the O(N^2) problem, but it will mitigate it: instead of taking 24 days, an 8-node cluster might require only 4 days. If you’re working with enormous datasets, that time decrease can be worthwhile.
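The non-ML first pass above can be as simple as deterministic matching and blocking in plain SQL. Here’s a hedged sketch with hypothetical table and column names: by only comparing candidate pairs which share a date of birth and a similar-sounding last name, you shrink the O(N^2) comparison space dramatically before any fuzzy or ML-based scoring.

```sql
-- Hypothetical dbo.Patient table. This "blocking" pass generates candidate
-- duplicate pairs cheaply: same date of birth plus a SOUNDEX match on last
-- name. The PatientID inequality avoids self-pairs and mirror duplicates.
SELECT
    p1.PatientID AS PatientID1,
    p2.PatientID AS PatientID2
FROM dbo.Patient p1
    INNER JOIN dbo.Patient p2
        ON p1.DateOfBirth = p2.DateOfBirth
        AND SOUNDEX(p1.LastName) = SOUNDEX(p2.LastName)
WHERE p1.PatientID < p2.PatientID;
```

Only the pairs this query returns would need to go through the slower, smarter matching logic.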

Shop Talk: 2023-07-31

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Carolina Code Conference

Our first major topic of the night is that the Carolina Code Conference is coming up in just a couple of weeks: August 19th in Greenville, South Carolina. I’m going to be a presenter and am also a sponsor, so I’ll be pushing my wares. Tickets are free, though there’s a limited number for this year’s event.

Will AI Fix Work?

The second topic was one we spent a whole lot of time on, and Betteridge’s Law of Headlines gives my quick answer: no, AI will not fix work. Mala dropped this hot potato in my lap right before showtime, so we talked through it together and a good amount of my cynicism showed. I think there are areas where AI in general—and generative AI in particular—can help individual employees, but we’re riding the hype cycle curve right now, and I doubt these sorts of technologies have massive long-term benefits. I’m willing to be wrong on this, but it feels like stories from 25 years ago about how learning to code would eliminate all of the burdens of the average employee’s life.

Shop Talk: 2023-07-17

The Recording

The Panelists

  • Kevin Feasel
  • Tracy Boggiano
  • Mike Chrestensen
  • Mala Mahadevan

Notes: Questions and Topics

The In-Person Component of Hybrid Meetings

The first topic we hit was in-person attendance for our hybrid main meeting. We’ve had a few hybrid meetings so far and the large majority of attendees for these are still remote. Our gracious host (a company which has hosted TriPASS meetings for more than a decade) does need to hit certain thresholds for it to make sense for them to host us, and our current numbers aren’t cutting it. So far, I’ve come up with two ideas:

  • Make the main meeting in-person only (i.e., no hybrid meeting) and keep the other two remote-only
  • Go back to remote-only meetings

I’m not particularly satisfied with either of these, honestly, but I’m struggling to find a good option which increases in-person attendance but lets us remain hybrid. Back in 2019, the hybrid component was an “extra” but the large majority of attendees were coming in-person. At this point, the script is flipped.

Planning for a Successful Database Migration on AWS

Mala wanted to talk about this article. She started off not liking the article very much, but after we talked about it, we came to an agreement that it’s pretty good for a really high-level overview, though there’s a good amount of marketing fluff in it, as well as wishcasting that people replace a database system that works with one that pays Amazon more money.

Replacing Customer Service with ChatGPT

Another article from Mala, though this time we had some more to sink our sarcastic teeth into. What it sounds like is, they created a reasonably smart chatbot to do level 1 tech support—basically, an interactive FAQ and probably incorporating some billing and account data in the prompt inputs. Ignoring the idiocy of bragging about laying off most of your customer support people, this is probably a reasonable use for generative AI, at least until it lies to your customers and your few remaining humans in customer support need to put out those new fires.

Kusto Detective Agency Season 2

The final topic for today’s show was a quick discussion of Kusto Detective Agency, now that season 2 is in full swing. This is definitely not a site for beginners, but it is a really fun challenge if you have some Kusto chops already. Or you can read one of the many walkthroughs online to get you through it. I got about 2/3 of the way through season 1 without having to rely on a walkthrough, but couldn’t get much further. I haven’t tried season 2 yet, but it’s on my to-do list.

Shop Talk: 2023-07-03

The Recording

The Panelists

  • Kevin Feasel
  • Tracy Boggiano
  • Mala Mahadevan

Notes: Questions and Topics

CAST() and CONVERT()

Our first topic comes from Tracy, who noticed an odd situation in which using CONVERT() on a dataset ended up about 200ms slower than using CAST(). I didn’t have any good answers, though chat and I speculated on possibilities.
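For anyone unfamiliar with the two, they’re functionally near-identical for simple conversions, which is what makes the timing difference odd. CAST() is the ANSI-standard form; CONVERT() is T-SQL-specific and adds an optional style parameter, mostly useful for date and string formatting:

```sql
-- CAST and CONVERT doing the same conversion, plus CONVERT's extra style
-- parameter (style 112 formats a date as yyyymmdd).
SELECT
    CAST('2023-07-03' AS date)           AS UsingCast,
    CONVERT(date, '2023-07-03')          AS UsingConvert,
    CONVERT(varchar(8), GETDATE(), 112)  AS YyyyMmDd;
```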

The Future (and Past) of No-Code Programming Languages

After that, we meandered a bit on no-code programming languages, using this article as a jumping-off point. There’s some value in no-code languages but finding a general-purpose, no-code programming language which doesn’t want to make you tear your hair out in frustration as soon as you leave the happy path is like searching for El Dorado.

If you’re on that happy path, then they can be a good way of getting started. Examples I used include SSIS, Informatica, and the Azure ML designer. These are custom-purpose canvases for specific operations. Trying to turn diagrams into code is something that we’ve seen since Rational Rose (and maybe even earlier). It just doesn’t work.

Using AI for Movie Funding Decisions

Our final topic came from this Hollywood Reporter article about Warner Brothers using an ML model to help determine whether to fund movie production. We talked a bit about it, including some parts without snark. Just a bit.

Shop Talk: 2023-06-19

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Azure Data Studio and SSMS 19.1

Our first topic was an update on SQL Server tools. Mala and I talked about our experiences using Azure Data Studio, and I mentioned a three-part series from Erin Stellato on the most recent release for SQL Server Management Studio (part 1, part 2, part 3).

Azure Data Studio has made some great strides, especially in how it displays execution plans. If T-SQL development is your primary use case, give Azure Data Studio a try. If you’re mostly a database administrator, you can also give Azure Data Studio a try, though SQL Server Management Studio is likely to have more of what you need to do your job.

AI Feedback Loops and Costs of Training

Anders pitched me a softball with this article about the AI feedback loop. The idea is that, with the proliferation of Large Language Models (LLMs), they’re generating a large amount of the data which will be used to train the next generation of LLMs, and we tend to see model quality drop with auto-generated gibberish, leading to even more auto-generated gibberish.

I also covered an article on diminishing returns to Deep Learning techniques, indicating that it takes about k^9 times as much computational power to cut errors down to 1/k. For example, if we want to cut training errors in half, that’s k=2, and we need about 500x as much computational power to do that. For this reason, there are some slowly-changing boundaries around how much neural network model improvement we can see with additional hardware generations.

A Case Study in Insider Risk Management

The last topic was an interesting story I learned about because of a training I helped deliver. The case study involved a woman who stole a formula while working at Coca-Cola and tried to use it to bootstrap a new company in China. The whole story is quite long, but it goes into a good amount of detail on the topic and was pretty interesting, especially as it details how she was able to circumvent the Data Loss Prevention software Coca-Cola was using at the time.

Shop Talk: 2023-06-05

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen
  • Tracy Boggiano

Notes: Questions and Topics

Microsoft Fabric

Microsoft Build was the day after our prior episode of Shop Talk, so this time around, we dove into Microsoft Fabric. I answered some questions for Mala and gave her a better idea of where this fits in the modern data warehousing world. We spent a bit of time talking about licensing, and I talked a bit about why I really don’t like the name.

OpenAI and Yet Another LLM Rant

One of these days, I’ll stop ranting about LLMs. Probably when the next thing hits the peak of the hype cycle. We spent some more time talking about OpenAI and LLMs, as it seemed that about 2/3 of Build content was geared that way. I was surprisingly less critical of it than I have been over the past few weeks, but I think that’s just an internal attempt not to go too far overboard in my critique—it’s not that there’s zero utility in LLMs; it’s that humans need to be in the loop the whole time, and we need to remember that these are language models and not fact models. I also talked a bit about a great article by Eric Hartford on uncensored models. I completely agree with Eric: it’s fine to add your own bias after the fact—for example, I don’t want my corporate-friendly chat assistant to start engaging in lewd conversation with potential customers—but I also don’t like the idea of some unrelated organization deciding what’s acceptable and what isn’t.