- Kevin Feasel
- Mala Mahadevan
- Mike Chrestensen
Notes: Questions and Topics
Debate — Resolved: OpenAI to Go Bankrupt by 2024
Our first major topic of the night centered on an article speculating that OpenAI could go bankrupt by the end of 2024. The article's authors lay out a few reasons why they think so, and I summarized some thoughts in support of the contention and some in opposition.
Why it could go bankrupt:
- They’re spending like a startup expecting constant cash infusions. Microsoft already spent $10 billion to buy 1/3 of the company, so I’m not sure how many more rounds of big money they’d expect to get.
- GPUs are inordinately expensive: we're talking $45K for a single H100, and training a model the size of ChatGPT can take thousands of those GPUs and enormous amounts of time and energy.
- With Facebook (Meta) open sourcing LLaMA and the community building derivatives like WizardLM and Vicuna on top of it, there's a free alternative that goes beyond the hobbyist level, which prevents OpenAI from getting an effective monopoly in generative AI models.
- Mike brought up a great point during the broadcast, something I’ve mentioned before: OpenAI is the target of several lawsuits, and depending on how those go, litigation and precedent could severely impact their ability to make money long-term, especially if they’re found to have violated the rights of copyright holders in their model training process.
Why it wouldn’t go bankrupt:
- ChatGPT is still the big brand name in generative AI and that means a lot.
- Microsoft has tethered themselves to the OpenAI mast. They’ve dumped in $10 billion USD and MS Build was all about generative AI + everything, to the point that I found it ridiculous. They’re not going to let their generative AI partner wither away and embarrass Microsoft like that.
- If they don’t get a lot of revenue from Azure OpenAI or their own independent work, they can scale down their costs pretty significantly by slowing down new training and new releases. Remember that the debate point is that OpenAI goes bankrupt by the end of 2024, less than 18 months from now. Scaling down costs may end up just kicking the can down the road, but they could remain viable by slowing their costs, not training new models as frequently, and cutting staff.
The other major topic of the night was patient matching. I received a question asking how you would determine whether two records refer to the same person, given information like name, address, date of birth, and perhaps a patient ID.
Mike, Mala, and I have all worked in the health care sector and had to work on similar problems, so we each discussed techniques we’ve used. My summary is:
- Use as many non-ML tricks as you can up front to simplify the problem, removing obvious duplicates and non-duplicates. Mike recommends starting with the NDI ruleset as a good first-pass approach. The reason to start with non-ML approaches is that ML-based matching tends to be O(M*N) or O(N^2) at best and O(N^3) or even O(2^N) at worst. In other words, it gets really slow as your datasets grow and probably won't scale to millions of records.
- If all you have is names, HMNI is a pretty good package for name matching.
- Most likely, you have more than just names, and that's where a technique known as record linking comes in. You may also see people refer to it as record linkage. For an academic survey of techniques, I can recommend this paper, which is about a decade old; just don't assume that the products they recommend are still the best.
- Spark can be a good platform for large-scale record linkage, and there's a library called Splink which helps you find linked records in a distributed fashion. Scale-out won't eliminate the O(N^2) problem, but it will mitigate it: a job that takes 24 days on one machine might finish in 3-4 days on an 8-node cluster (parallelism is never perfectly efficient). If you're working with enormous datasets, that time decrease can be worthwhile.
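To make the first bullet concrete, here's a minimal sketch of the blocking idea using only the Python standard library. The records, field names, blocking key, and similarity threshold are all made up for illustration, and `difflib`'s ratio is a crude stand-in for the string-distance measures (Jaro-Winkler and the like) a real record-linkage library would offer. The point is that records are only compared within a block, not across the whole dataset, which is what avoids the full O(N^2) pairwise pass.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical patient records; the fields are invented for illustration.
records = [
    {"id": 1, "name": "John Smith",  "dob": "1980-04-12", "zip": "27601"},
    {"id": 2, "name": "Jon Smith",   "dob": "1980-04-12", "zip": "27601"},
    {"id": 3, "name": "Jane Doe",    "dob": "1975-09-30", "zip": "27601"},
    {"id": 4, "name": "John Smythe", "dob": "1980-04-12", "zip": "27603"},
]

def blocking_key(rec):
    # A cheap, deterministic key: date of birth plus first letter of name.
    # Records with different keys are never compared at all, so the
    # expensive fuzzy comparison only runs within each (small) block.
    return (rec["dob"], rec["name"][0].upper())

def candidate_pairs(recs):
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

def name_similarity(a, b):
    # Crude string similarity; a real pipeline would use a purpose-built
    # measure and compare multiple fields, not just the name.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

# Threshold chosen arbitrarily for the sketch.
matches = [(a["id"], b["id"])
           for a, b in candidate_pairs(records)
           if name_similarity(a, b) >= 0.8]
print(matches)
```

Here, record 3 lands in its own block, so it is never compared against anyone; the three "John Smith" variants share a block and get fuzzy-matched against each other. The trade-off with any blocking key is that a typo in the blocked field (say, a transposed birth date) can hide a true match, which is why real systems often run multiple passes with different keys.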