Shop Talk: 2023-08-14

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen

Notes: Questions and Topics

Debate — Resolved: OpenAI to Go Bankrupt by 2024

Our first major topic of the night centered on an article speculating that OpenAI could go bankrupt by the end of 2024. The authors lay out a few reasons why they think so, and I summarized some thoughts in support of the contention and some in opposition.

Why it could go bankrupt:

  • They’re spending like a startup expecting constant cash infusions. Microsoft already spent $10 billion to buy 1/3 of the company, so I’m not sure how many more rounds of big money they’d expect to get.
  • GPUs are inordinately expensive—we’re talking $45K for a single H100—and training models the size of ChatGPT can take thousands of those GPUs and enormous amounts of time and energy.
  • With Facebook open sourcing LLaMA, which spawned derivative models like WizardLM and Vicuna, there’s a free alternative which goes beyond hobbyist level and thereby prevents OpenAI from getting an effective monopoly in generative AI models.
  • Mike brought up a great point during the broadcast, something I’ve mentioned before: OpenAI is the target of several lawsuits, and depending on how those go, litigation and precedent could severely impact their ability to make money long-term, especially if they’re found to have violated the rights of copyright holders in their model training process.

Why it wouldn’t go bankrupt:

  • ChatGPT is still the big brand name in generative AI and that means a lot.
  • Microsoft has tethered themselves to the OpenAI mast. They’ve dumped in $10 billion USD and MS Build was all about generative AI + everything, to the point that I found it ridiculous. They’re not going to let their generative AI partner wither away and embarrass Microsoft like that.
  • If they don’t get a lot of revenue from Azure OpenAI or their own independent work, they can scale down their costs pretty significantly by slowing new training and new releases. Remember that the debate point is that OpenAI goes bankrupt by the end of 2024, less than 18 months from now. Scaling down may just kick the can down the road, but they could remain viable by trimming costs, training new models less frequently, and cutting staff.

Patient Matching

The other major topic of the night was patient matching. I received a question asking how you would determine whether two records refer to the same person, given information like name, address, date of birth, and maybe a patient ID.

Mike, Mala, and I have all worked in the health care sector and had to work on similar problems, so we each discussed techniques we’ve used. My summary is:

  • Use as many non-ML tricks as you can up-front to simplify the problem, removing obvious duplicates and non-duplicates. Mike recommends starting with the NDI ruleset as a good first-pass approach. The reason we want to start with non-ML approaches is that the ML approaches tend to be on the order of O(M*N) or O(N^2) at best and O(N^3) or even O(2^N) at worst. In other words, they get really slow as your datasets grow in size and probably won’t scale to millions of records. (There’s a quick sketch of this kind of first pass just after this list.)
  • If all you have is names, HMNI is a pretty good package for name matching.
  • Most likely, you have more than just names, and that’s where a technique known as record linking comes in. You may also see people refer to it as record linkage. For an academic survey of techniques, I can recommend this paper, which is about a decade old—just don’t assume that the products they recommend are still the best.
  • Spark can be a good platform for large-scale record linkage, and there’s a library called Splink which helps you find linked records in a distributed fashion. Scale-out won’t solve the O(N^2) problem, but it will mitigate it: instead of taking 24 days, an 8-node cluster might require only 4 days. If you’re working with enormous datasets, that time decrease can be worthwhile.
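To make that first pass concrete, here’s a minimal T-SQL sketch. The dbo.Patient table and its columns are hypothetical; the point is the technique. Joining on blocking keys—exact date of birth plus a phonetic encoding of the last name—restricts comparisons to within-block pairs, which is what keeps you away from the full O(N^2) blowup:

```sql
-- Hedged sketch: dbo.Patient and its columns are hypothetical.
-- Blocking on exact date of birth plus SOUNDEX of the last name finds
-- candidate duplicate pairs without comparing every row to every other row.
SELECT
    p1.PatientID AS PatientID1,
    p2.PatientID AS PatientID2,
    p1.LastName AS LastName1,
    p2.LastName AS LastName2
FROM dbo.Patient p1
    INNER JOIN dbo.Patient p2
        ON p1.DateOfBirth = p2.DateOfBirth
        AND SOUNDEX(p1.LastName) = SOUNDEX(p2.LastName)
        AND p1.PatientID < p2.PatientID;  -- each pair appears once, no self-matches
```

Anything a pass like this flags as an obvious match (or rules out entirely) comes out of the dataset before you hand the remainder to a record linkage library.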

Shop Talk: 2023-07-31

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Carolina Code Conference

Our first major topic of the night is that the Carolina Code Conference is coming up in just a couple of weeks: August 19th in Greenville, South Carolina. I’m going to be a presenter and am also a sponsor, so I’ll be pushing my wares. Tickets are free, though there’s a limited number for this year’s event.

Will AI Fix Work?

The second topic was one we spent a whole lot of time on, and Betteridge’s Law of Headlines gives my quick answer: no, AI will not fix work. Mala dropped this hot potato in my lap right before showtime, so we talked through it together and a good amount of my cynicism showed. I think there are areas where AI in general—and generative AI in particular—can help individual employees, but we’re riding the hype cycle curve right now, and I doubt these sorts of technologies will have massive long-term benefits. I’m willing to be wrong on this, but it feels like stories from 25 years ago about how learning to code would eliminate all of the burdens of the average employee’s life.

Shop Talk: 2023-07-17

The Recording

The Panelists

  • Kevin Feasel
  • Tracy Boggiano
  • Mike Chrestensen
  • Mala Mahadevan

Notes: Questions and Topics

The In-Person Component of Hybrid Meetings

The first topic we hit was in-person attendance for our hybrid main meeting. We’ve had a few hybrid meetings so far and the large majority of attendees for these are still remote. Our gracious host (a company which has hosted TriPASS meetings for more than a decade) does need to hit certain thresholds for it to make sense for them to host us, and our current numbers aren’t cutting it. So far, I’ve come up with two ideas:

  • Make the main meeting in-person only (i.e., no hybrid meeting) and keep the other two remote-only
  • Go back to remote-only meetings

I’m not particularly satisfied with either of these, honestly, but I’m struggling to find a good option which increases in-person attendance but lets us remain hybrid. Back in 2019, the hybrid component was an “extra” but the large majority of attendees were coming in-person. At this point, the script is flipped.

Planning for a Successful Database Migration on AWS

Mala wanted to talk about this article. She started off not liking it very much, but after we talked it through, we agreed that it’s pretty good as a really high-level overview, though there’s a good amount of marketing fluff in it, as well as wishcasting that people will replace a database system that works with one that pays Amazon more money.

Replacing Customer Service with ChatGPT

Another article from Mala, though this time we had some more to sink our sarcastic teeth into. What it sounds like is, they created a reasonably smart chatbot to do level 1 tech support—basically, an interactive FAQ and probably incorporating some billing and account data in the prompt inputs. Ignoring the idiocy of bragging about laying off most of your customer support people, this is probably a reasonable use for generative AI, at least until it lies to your customers and your few remaining humans in customer support need to put out those new fires.

Kusto Detective Agency Season 2

The final topic for today’s show was a quick discussion of Kusto Detective Agency, now that season 2 is in full swing. This is definitely not a site for beginners, but it is a really fun challenge if you have some Kusto chops already. Or you can read one of the many walkthroughs online to get you through it. I got about 2/3 of the way through season 1 without having to rely on a walkthrough, but couldn’t get much further. I haven’t tried season 2 yet, but it’s on my todo list.

Shop Talk: 2023-07-03

The Recording

The Panelists

  • Kevin Feasel
  • Tracy Boggiano
  • Mala Mahadevan

Notes: Questions and Topics

CAST() and CONVERT()

Our first topic comes from Tracy, who noticed an odd situation in which using CONVERT() on a dataset ended up about 200ms slower than using CAST(). I didn’t have any good answers, though chat and I speculated on possibilities.
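For reference, the two functions are interchangeable when you aren’t using CONVERT()’s style parameter, which makes the timing gap all the stranger. A rough harness for comparing them looks something like the following—dbo.Orders is a hypothetical table, and this is not a reproduction of Tracy’s case:

```sql
-- CAST() is ANSI standard; CONVERT() is T-SQL-specific and adds a style
-- parameter for formatting. Semantically, these two queries are identical.
SET STATISTICS TIME ON;

SELECT CAST(o.OrderDate AS DATE) AS OrderDate
FROM dbo.Orders o;

SELECT CONVERT(DATE, o.OrderDate) AS OrderDate
FROM dbo.Orders o;

SET STATISTICS TIME OFF;
```

Your numbers will vary by system, which is part of why we could only speculate on the cause.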

The Future (and Past) of No-Code Programming Languages

After that, we meandered a bit on no-code programming languages, using this article as a jumping-off point. There’s some value in no-code languages but finding a general-purpose, no-code programming language which doesn’t want to make you tear your hair out in frustration as soon as you leave the happy path is like searching for El Dorado.

If you’re on that happy path, then they can be a good way of getting started. Examples I used include SSIS, Informatica, and the Azure ML designer. These are custom-purpose canvases for specific operations. Trying to turn diagrams into code is something that we’ve seen since Rational Rose (and maybe even earlier). It just doesn’t work.

Using AI for Movie Funding Decisions

Our final topic came from this Hollywood Reporter article about Warner Brothers using an ML model to help determine whether to fund movie production. We talked a bit about it, including some parts without snark. Just a bit.

Shop Talk: 2023-06-19

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Azure Data Studio and SSMS 19.1

Our first topic was an update on SQL Server tools. Mala and I talked about our experiences using Azure Data Studio, and I mentioned a three-part series from Erin Stellato on the most recent release for SQL Server Management Studio (part 1, part 2, part 3).

Azure Data Studio has made some great strides, especially in how it displays execution plans. If T-SQL development is your primary use case, give Azure Data Studio a try. If you’re mostly a database administrator, you can also give Azure Data Studio a try, though SQL Server Management Studio is likely to have more of what you need to do your job.

AI Feedback Loops and Costs of Training

Anders pitched me a softball with this article about the AI feedback loop. The idea is that, with the proliferation of Large Language Models (LLMs), they’re generating a large amount of the data which will be used to train the next generation of LLMs, and we tend to see model quality drop with auto-generated gibberish, leading to even more auto-generated gibberish.

I also covered an article on diminishing returns to Deep Learning techniques, indicating that it takes about k^9 computational power to cut errors down to 1/k. For example, if we want to cut training errors in half, that’s k=2, and we need about 500x as much computational power to do that. For this reason, there are some slowly-changing boundaries around how much neural network model improvement we can see with additional hardware generations.
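Spelling out that arithmetic (notation mine, not the article’s):

```latex
% Compute cost C needed to cut error rates to 1/k of their current value,
% under the article's approximation:
C \propto k^{9}
\quad\Rightarrow\quad
\begin{cases}
k = 2: & 2^{9} = 512 \approx 500\times \text{ the compute} \\
k = 3: & 3^{9} = 19{,}683 \approx 20{,}000\times \text{ the compute}
\end{cases}
```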

A Case Study in Insider Risk Management

The last topic was an interesting story I learned about because of a training I helped deliver. The case study involved a woman who stole a formula while working at Coca-Cola and tried to use it to bootstrap a new company in China. The whole story is quite long, but it goes into a good amount of detail and was pretty interesting, especially in how it explains the way she circumvented the Data Loss Prevention software Coca-Cola was using at the time.

Shop Talk: 2023-06-05

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen
  • Tracy Boggiano

Notes: Questions and Topics

Microsoft Fabric

Microsoft Build was the day after our prior episode of Shop Talk, so this time around, we dove into Microsoft Fabric. I answered some questions for Mala and gave her a better idea of where this fits in the modern data warehousing world. We spent a bit of time talking about licensing, and I talked a bit about why I really don’t like the name.

OpenAI and Yet Another LLM Rant

One of these days, I’ll stop ranting about LLMs. Probably when the next thing hits the peak of the hype cycle. We spent some more time talking about OpenAI and LLMs, as it seemed that about 2/3 of Build content was geared that way. I was surprisingly less critical of it than I have been over the past few weeks, but I think that’s just an internal attempt not to go too far overboard in my critique—it’s not that there’s zero utility in LLMs; it’s that humans need to be in the loop the whole time, and we need to remember that these are language models and not fact models. I also talked a bit about a great article by Eric Hartford on uncensored models. I completely agree with Eric: it’s fine to add your own bias after the fact—for example, I don’t want my corporate-friendly chat assistant to start engaging in lewd conversation with potential customers—but I also don’t like the idea of some unrelated organization deciding what’s acceptable and what isn’t.

Shop Talk: 2023-05-22

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen

Notes: Questions and Topics

Pre-Thoughts on Build

Microsoft Build was the day after this episode of Shop Talk, so we discussed a few items around Microsoft’s then-known emphasis on AI’ing all the things, as well as poking fun at a few deprecated features like StreamInsight.

Along the way, we poured a 40 out for CLR and I once again recommended Solomon Rutzky’s series on CLR strict security.

Learning Azure SQL

We had a question come in from chat about how to get started with Azure SQL, getting an idea of what it entails versus on-premises SQL Server. I recommended the Azure SQL Workshop and Mala included a YouTube video from Anna Hoffman explaining when to use each service.

Shop Talk: 2023-05-08

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan

Notes: Questions and Topics

Auditing Table Changes

The first topic came in the form of an e-mail:

Recently I have taken over the DB tasks at the company I work at. The structure is either defined by software we use or logically set up decently. The only problem I’ve noticed so far though is the array of methods how data is manipulated in SQL server for user defined tables. For example, we use:

– SSIS packages in the Integration Catalog (SSISDB) – found eventually by using a file table but manually exporting right now since not setup with FileStream on the server

– SSIS packages in a file system called by SQL Server Agent job – parsed with a PowerShell script

– SQL Server Agent TSQL steps – using TSQL to parse sys tables to read the commands of steps for keywords

– Procedures – TSQL script looking for keyword through procedure definitions in all databases

– Triggers – Similar to procedures

Normally I use all these tools developed to try to track where the changes are being made and it actually works really well until the other day 🙁 For the life of me I can’t find where a table is being updated and it’s driving me crazy! I realized I even missed cases like SQL Server Maintenance Plans, Task Scheduler events, software methods/functions that connect to the database, etc. Is there a way to see where and how a DB object is manipulated and have the ability to go modify that code? We would like to stay away from installing 3rd party tools but I am open to suggestions. The other day I read about SQL Server Profiler that comes with SSMS. I think that is enabled and I am going to try that on Monday. However, I am curious will that return every context of how the object is updated? I have 0 knowledge about this tool. Also, if there is another built-in tool recommended I am all ears as I am new to SQL Server.

Mala and I walked through the recommended answer, which is to use SQL Audit. Mark Gordon had a great presentation on the topic for TriPASS, so I recommend checking it out if you want to get started. If you just want the very specific answer of how to set up a SQL Audit for a single table, Nic Cain has a DBA Stack Exchange answer which does the trick. You can also dig into what the fn_get_audit_file() function returns and see things like application name, which (if your connection strings are configured correctly) will tell you what is updating that table.
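For the impatient, here’s roughly the shape of it. This is a hedged sketch: the audit, database, table, and file path names are all mine, so adjust to your environment.

```sql
USE master;
GO
-- Create and enable a server audit that writes to disk.
CREATE SERVER AUDIT TableChangeAudit
    TO FILE (FILEPATH = N'C:\Audit\');
GO
ALTER SERVER AUDIT TableChangeAudit WITH (STATE = ON);
GO

USE YourDatabase;  -- hypothetical database
GO
-- Audit data modification against one specific table.
CREATE DATABASE AUDIT SPECIFICATION TableChangeAuditSpec
    FOR SERVER AUDIT TableChangeAudit
    ADD (INSERT, UPDATE, DELETE ON OBJECT::dbo.YourTable BY public)
    WITH (STATE = ON);
GO

-- Read the audit output. application_name is the payoff here: with
-- well-crafted connection strings, it identifies the calling process.
SELECT
    event_time,
    action_id,
    session_server_principal_name,
    application_name,
    statement
FROM sys.fn_get_audit_file(N'C:\Audit\*.sqlaudit', DEFAULT, DEFAULT);
```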

Entity-Attribute-Value Tables

The next topic was a bit of a blast from the past: Entity-Attribute-Value tables. This is a pretty good look at what they are, and Phil Factor has the best way of explaining them. I am not entirely anti-EAV, but I am anti-EAV about 90% of the time. The exceptional cases are:

  • When you are content simply passing back a “bag of attributes” to the application and have it sort things out, especially if you know the result set will definitely be small.
  • When building a proof of concept and you are positive (not just hopeful) that you will rebuild the data model before the first actual release.

The second use case is the scary one, as there are many things we tell ourselves to mollify the voices in our heads saying that this is a bad idea.
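If you haven’t run into the pattern before, here’s a minimal sketch (all names are mine) showing both the “bag of attributes” query that makes EAV tolerable and the pivoting gymnastics that make it painful:

```sql
-- One row per entity-attribute pair. Every value is a string, so the
-- database can no longer enforce types, nullability, or much of anything.
CREATE TABLE dbo.EntityAttribute
(
    EntityID INT NOT NULL,
    AttributeName VARCHAR(50) NOT NULL,
    AttributeValue VARCHAR(500) NULL,
    CONSTRAINT PK_EntityAttribute PRIMARY KEY (EntityID, AttributeName)
);

INSERT INTO dbo.EntityAttribute
VALUES
    (1, 'FirstName', 'Ada'),
    (1, 'BirthDate', '1815-12-10'),
    (2, 'FirstName', 'Charles');

-- The tolerable case: pass the bag back and let the app sort it out.
SELECT AttributeName, AttributeValue
FROM dbo.EntityAttribute
WHERE EntityID = 1;

-- The painful alternative: reconstituting a proper row means pivoting
-- and casting every single attribute by hand.
SELECT
    EntityID,
    MAX(CASE WHEN AttributeName = 'FirstName' THEN AttributeValue END) AS FirstName,
    MAX(TRY_CAST(CASE WHEN AttributeName = 'BirthDate' THEN AttributeValue END AS DATE)) AS BirthDate
FROM dbo.EntityAttribute
GROUP BY EntityID;
```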

Are Databases Legacy Technology?

The last topic came from a couple of anecdotes about “kids these days” and coming out of computer science programs not knowing about databases. Most computer science programs are bad at teaching anything database-related—if you want to learn about relational theory, there’s usually at most one course (undergraduate or graduate-level), but if you want to learn practical applications of databases, you’re usually out of luck.

I think that this is a challenge, as it means we have new developers coming out of school who are completely ignorant of how data management works. At the same time, this is a challenge which has been with us for quite some time. When I was an undergraduate, I had…one course offered on relational theory. I don’t think the pattern has gotten worse, but there’s obviously a gap and the question is, is there a broader-based way of filling that gap than trying to introduce a database course one university at a time? I don’t have a good answer for this, though we do talk out the idea a bit and maybe that will help somebody come up with a better solution.

Shop Talk: 2023-04-24

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen

Notes: Questions and Topics

MVP Summit

Last week, Tracy and I were in Redmond for MVP Summit. It was good seeing a bunch of people for the first time in several years.

Cross-Platform Code

We next talked a bit about the notion of cross-platform code, that is, code which runs in Oracle, SQL Server, and other relational databases. There’s some value in the idea but the problem is that you lose out on some of the platform-specific performance and code benefits and end up designing for mediocrity rather than high performance.

What Compatibility Level Does

Our first real topic was a Brent Ozar blog post on compatibility level. The short version is that recent versions of SQL Server have introduced more functionality wrapped around compatibility level, and it’s good to know which things are gated by compatibility level—and importantly, which things aren’t.
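As a quick refresher, compatibility level is a per-database setting. A short sketch (the database name is hypothetical):

```sql
-- See where each database currently sits.
SELECT name, compatibility_level
FROM sys.databases;

-- Opt a database into SQL Server 2022 behavior (160). This gates things
-- like intelligent query processing features, but plenty of engine
-- behavior is tied to the server version regardless of this setting.
ALTER DATABASE YourDatabase
SET COMPATIBILITY_LEVEL = 160;
```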

Another Large Language Model Rant

Chat then baited me into a discussion on large language models and using ChatGPT for coding assistance. Where I’m at with this is: if you’re good enough to know what code you intend to write (and you don’t worry about potential licensing problems), you can probably treat ChatGPT as a drunken intern who writes code that you need to review carefully. Aside from that, I’m rather pessimistic about the whole notion.

Shop Talk: 2023-04-10

The Recording

The Panelists

  • Kevin Feasel
  • Mike Chrestensen

Notes: Questions and Topics

Caring about GCP

I started off with a ramble around a comment I made on the SQL Data Partners podcast: that you’d have to pay me to learn about Google Cloud Platform. Someone reached out to ask for more information and whether that meant I dislike GCP.

The short version is no, I have no ill feelings toward GCP. Neither do I have positive feelings toward it. It fits squarely into a purposeful blind spot for me, which comes about because the opportunity cost of learning something is the value of learning the next-best alternative. In other words, there’s only a certain number of hours in the day to learn things, so I’m going to prioritize things I find interesting, things which are easy for me to pick up, or things which (eventually?) make me money. Azure I know because I get free credits and get paid to know it well. AWS I know because I’ve worked in jobs where they’ve paid me to know enough about it. I’ve never had anyone pay me to learn GCP, so there’s no external incentive. If a customer came to me and said that they were switching to GCP and would like me to learn it, then yeah, I’d pick it up and see how things differ from Azure and AWS. But otherwise, it’s not on my radar.

Now, one thing I didn’t get into is that philosophically, I do find value in the learning equivalent of “wandering aimlessly.” I’m the type of person who would walk up and down the aisles in university libraries, not looking for a specific book but just willing to let whatever drew my attention guide me. This style of learning doesn’t always pay off, though I’ve found its hit rate is a lot higher than you’d first expect. So even if nobody pays me, there is a chance that someday I pick up GCP and try out some things. But the probability is still low—there are a lot of books on those shelves.

Draft Flag Driven Development

Mala pointed out a link to this Alex Bunardzic article on what he calls Draft Flag Driven Development. It took me a bit of discussion in chat and noodling through the idea to figure out the problem that I have with it. I do understand that, for many companies, the signal from code in a test environment succeeding (or failing) is not an extremely strong indicator of production succeeding or failing. But the big concern I have with this style of development is the risk of “not only did I break this thing, but I also broke a bunch of other stuff along the way” problems, where reverting back to the prior release isn’t enough—think catastrophic data failure or sending permanent changes to a third party provider.

Ordered Columnstore Indexes in SQL Server 2022

Ed Pollack has a great article on ordered columnstore indexes in SQL Server 2022. We walked through the article in detail, covering scenarios where columnstore itself works well, where ordered columnstore indexes are useful, and some of the pitfalls you might hit when using this feature.
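The syntax addition in SQL Server 2022 is small. Here’s a hedged sketch with hypothetical table and column names:

```sql
CREATE TABLE dbo.SalesFact
(
    SaleDate DATE NOT NULL,
    StoreID INT NOT NULL,
    Amount DECIMAL(10, 2) NOT NULL
);

-- The ORDER clause soft-sorts rows on SaleDate before building rowgroups,
-- which tightens segment min/max ranges and improves segment elimination
-- for date-filtered queries.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesFact
    ON dbo.SalesFact
    ORDER (SaleDate);
```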