Congratulations to Tracy Boggiano for finally getting her MVP. It’s been a long time coming and I’m glad that she is getting the appropriate recognition for her community support.
Azure Data Studio Database Projects
Tom gives us his first thoughts on Azure Data Studio database projects. Tom referenced a blog post by Wolfgang Strasser and walked us through his first thoughts.
We then turned this into an extended discussion on the state of Azure Data Studio today.
A Rant on the XEvent Profiler
Tom and I then discussed Profiler. No, not that profiler; the other one. I hate the fact that Microsoft named this the XEvent Profiler because it really muddies the waters. The product itself is fine and is starting to give Extended Events a reasonable UI. But that name…
The biggest problem I have with the name is that it seems to be intentionally confusing, and as long as there are two tools called Profiler, that ambiguity will lead to confusion. “Oh, I heard from <insert name here> that Profiler is bad, so I’ll avoid this thing called XE Profiler. What’s an XE?” It would have been better to name it something different and make it easier for people to say something like “Avoid Profiler and use the SQL Server Performance Tracker instead.”
The product is fine; the name is not.
We had several questions around licensing, and I’m bundling them here.
First, if you have questions about SQL Server licensing, Thomas Grohser did a talk for our group last month and he explains it better than I ever will.
We also talked about licensing in tools like Visual Studio Code, which has its own license based on MIT. We talked a bit about which licenses tend to pass muster in legal teams at organizations, as well as some of the ones which don’t.
I also talked about why I hate Oracle and the exact amount of my Oracle razzing which is real versus me being a troll.
Mala’s Book Corner, by Kevin
In this week’s edition of Mala’s Book Corner (really Kevin’s Book Hovel this time), I recommended Spark in Action, 2nd Edition by Jean-Georges Perrin. Jean-Georges is local to us in the Triangle area and published a great book on Spark. The examples are a lot better than what I’ve seen in other Spark books and training materials, so if you’re interested in learning about Spark, get this book.
On Shop Talk, Tracy Boggiano, Tom Norman, and I will all be speaking at PASS Summit. Other speakers include friends of the show Chris Voss and Jared Poche.
Angela Henry heads up the group from the Triad area.
Melissa Coates (whose pre-con is on a different day from mine, so you can see both of ours!), Brad Llewellyn, and Ben DeBow round things out from the Charlotte area.
Dealing with XML and JSON
@iconpro555555 asked a question at the very end of last week’s Shop Talk around handling structured data—that is, XML and JSON—in SQL Server.
SQL Server has a long history of dealing with XML data, including an XML data type, XML-specific indexes, and XQuery/XPath support. It’s notorious among DBAs as “something we don’t want,” and Tracy points out the biggest problem: once you start to get a decent amount of data, everything falls apart.
In SQL Server 2016, we got JSON support through OPENJSON, FOR JSON, JSON_VALUE, and related functions. There is no separate JSON data type or direct indexing, though if you want to, you could create a persisted computed column using JSON_VALUE to extract a specific value from the JSON and then index that persisted computed column.
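As a quick sketch of that persisted computed column trick (the table, column, and JSON path names are all invented for illustration):

```sql
-- Promote one scalar out of the JSON document into a persisted
-- computed column...
CREATE TABLE dbo.OrderData
(
    OrderID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OrderJson NVARCHAR(MAX) NOT NULL,
    CustomerID AS CAST(JSON_VALUE(OrderJson, '$.customerId') AS INT) PERSISTED
);

-- ...so we can index it like any other column.
CREATE INDEX [IX_OrderData_CustomerID] ON dbo.OrderData (CustomerID);
```

Queries filtering on `JSON_VALUE(OrderJson, '$.customerId')` can then use the index rather than shredding every document.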
I don’t necessarily mind XML or JSON data in the database and it can be the best solution. But one pre-condition for it being the best solution is that you shouldn’t be shredding that data regularly. If you do, build out a proper, normalized data model and take advantage of what relational databases do best.
Storing and Using Secrets in Powershell
The next topic was around secrets in Powershell. This was a question from the very beginning of Shop Talk’s history, but because Rob Sewell recently blogged about this, I wanted to bring it up.
A Rant on Database Diagramming Tools
My official Rant of the Night was around how weak the database diagramming tool is in SQL Server Management Studio. The thing barely works at all, requires you to create database objects (meaning that developers with read-only permissions can’t even create these diagrams for themselves), provides only a modicum of relevant information, and is a royal pain to disentangle if you have more than just a couple of tables.
Tom defended the honor of SSMS database diagrams, while folks in chat recommended tools like DBeaver and DBDiagram. I brought up Mermaid, as I’ve just started getting into it for presentations and it allows for code-centric modeling, meaning that you write text-based markup to represent your model. I could see that tying in well with queries against system tables to get tables and columns, foreign keys, and the like. I also mentioned SqlDBM in an off-hand fashion because I couldn’t remember the name of it. I’ve used it before to reasonably good effect and the free version works pretty well if you have one fairly stable database. Visio database reverse engineering is excellent as well when your goal is to understand an existing, stable design.
I got things going with a rant about how things in Azure feel like Lego pieces which don’t quite fit together. The first five hours of an Azure project generally involve me enjoying how well all of these services fit together. Then you find a spot where things don’t quite work, and you spend 15 hours trying to bludgeon the thing into submission.
My example was around using Azure Synapse Analytics. I used a Spark pool to train a model, and the notebook worked great, saving my model to Data Lake Storage Gen2 and everything.
Then, I wanted to deploy the model using Azure ML. Now, Azure ML lets you deploy PySpark models, so that’s no problem. The problem was, the model is in HDFS (really Data Lake but let’s not split hairs too much given that Blob Storage is an implementation of WebHDFS) but Azure ML won’t let you use Data Lake Storage for the model location—it needs to be local to the machine. Which means local to the driver. Which is the opposite of what you want with Apache Spark.
The end result was that I could get the pieces to fit, but it involved a bit too much welding, filing, and sanding for my taste.
Dealing with a Production Outage
Tracy tells a story of teamwork during an outage. I’d link to it, but unfortunately she’s dealing with a blog outage.
How is the DBA Role Evolving?
Mala couldn’t make it tonight but she did give us a great topic: how is the DBA role evolving?
I took it three ways, noting that this is focused on “production DBAs” rather than database developers:
Is the DBA role evolving? My short answer is, I can argue either way. On the “yes” side, DBAs are expected to have more development skills than they had before. We want them to know about source control, Powershell, and DevOps. They also, in cloud environments, have fewer knobs to turn and lose ultimate control over the machine. But on the “no” side, the key set of DBA roles hasn’t really changed that much over the past several decades. Think about taking backups and testing restoration, ensuring high availability and/or disaster recovery, preventing or correcting corruption, tuning performance, and monitoring systems. The means have changed but the motivations and goals are the same.
Can a DBA from 1985, 1995, or 2005 be teleported into today’s world and survive? Assume that there’s some time for learning about technological changes. Well, take a look at DBAs around you. Many of them got their starts in the 1990s, so the answer is probably “yes.” Now, the person from 1985 or 1995 would need to unlearn quite a bit and learn quite a bit, but I don’t think this is some crazy sci-fi future scenario where a DBA from then gets dropped in and has absolutely no clue how to be a DBA today.
Will there be DBAs 10 years from now? 20 years from now? I think the answer is absolutely yes, because the things which define a DBA’s role don’t come about because DBAs wormed their way into organizations. They are critical for the business: we have data, data which the company finds valuable. As a result, we need to make sure that the data is available, that it won’t disappear tomorrow, that it won’t become corrupt tomorrow, and that business users will be able to get to it in a reasonable amount of time. There will be people who specialize in this administration of the bases of data as long as people at businesses care about data.
Last night was the “Let’s get a copyright strike from the TV” edition of Shop Talk.
Migrating from Azure VMs to Azure SQL Database
Tom started us off with a discussion of how simple it was to migrate from SQL Server on a virtual machine hosted in Azure to Azure SQL Database via Availability Groups. The one problem he ran into was that things like logins and SQL Agent jobs don’t go over to the AG secondaries, because system databases can’t be part of an Availability Group. This feature was announced for SQL Server 2019 but had to be pulled, and as of CU5 at least, it is not in the product.
Goodbye, Azure Notebooks
I reported the sad news that Azure Notebooks is going away on October 9th. There are several alternatives available. For training people, Azure Lab Services is probably the best option. For personal work, Visual Studio Code and Azure Data Studio offer notebook options, though I’d probably just use Jupyter Notebooks off of Anaconda and be done with it.
Visual Studio Codespaces does look pretty good as well and the pricing isn’t too bad. But none of these have what I really appreciated about Azure Notebooks: it being free.
My last question of the night was, how long should you stay at a job? In this case, I broke it down into two categories: what’s the minimum number of years you should stay at a job to avoid the “job-hopper” label, and what are the key conditions for leaving a job?
As far as years of service go, Tracy said 2-3 years. Anders mentioned that he starts looking at 2 years just to stay in tune with the market. When I started my career, 5+ years was the expectation, though I did get some pushback from folks in the audience and we settled on 5 years being a Midwestern thing. Anyhow, I’d say that the number today is 2 years: we stay at a job that long and it’s “career break-even.”
This ties in with the other side, conditions for leaving. @thedukeny mentioned boredom and the quest for more money, and Anders echoed on the boredom comment. @rporrata mentioned three things: money, family, and (lack of) training. Tom mentioned the company not paying for training and also falling behind on technology. With Tracy, it’s terrible bosses and old systems.
I mentioned one other criterion: moving up. In the Olden Days, you had clearly defined levels where you could move up in a company, from Programmer 1 to Programmer 2 to 3 to 4 to Architect, and so on. Each role had specified tasks and commensurate pay. You could expect some level of growth at a company over the course of a couple decades. But that’s changed radically at most organizations. Now, you have a role: Programmer. In many places, there aren’t even labels for Junior or Senior—you’re just Programmer. This has nice benefits for employees in making it easier to pick up new problems to solve—you aren’t constrained to doing tasks A, B, and C because that’s all a Programmer 1 does. But it also means that your potential for growth is quite limited. There are no new titles to which you can aspire or big salary bumps which go along with it. And companies have this strange habit of assuming that an across-the-board 2% pay increase for employees is fine, regardless of how that person’s market value has changed.
As a result, people leave companies to grow. Each job change is an opportunity to get back up to your market salary level as well as take on more roles and assignments. There’s also much less stigma about people leaving and returning after a few years, as most people recognize that this isn’t the 1950s or the 1980s in terms of working at companies. There are some “lifer” companies out there with a fairly high percentage of extremely long-term employees, but those are rare.
For show canon purposes, Tom was captured in a multi-state illegal fireworks ring, smuggling the good stuff.
Notes: Questions and Topics
I Remember Halloween
Mala started us off with a question about Halloween protection. What is Halloween protection? How can we identify this? What are possible solutions?
Answer: Jared gave us a description of Halloween protection, and I followed up with a quick demo. If you want more thorough answers on Halloween protection and solutions to the performance issues it can cause, I recommend Paul White’s series on the topic and Jared’s blog post, and Mala passed along an article from Itzik Ben-Gan.
Upgrading SQL Server via Distributed Availability Groups
Tracy then told us about a project she had undertaken to upgrade SQL Server using Distributed Availability Groups, a feature introduced in SQL Server 2016 and enhanced with direct seeding in 2017. Tracy has used this technique to perform near-instant upgrades from 2016 to 2017, and points out that it can work for upgrading any instance from 2012 and on. Tracy shares a few important tips, such as changing the listener rather than upgrading connection strings, as that obviates the risk of missing some connection string somewhere.
Tracy should have a blog post forthcoming on the topic as well.
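To give a flavor of the technique, the distributed AG that spans the old and new instances looks roughly like this (AG names and listener URLs are placeholders; this sketch omits the per-instance AG setup that comes first):

```sql
-- Run on the primary replica of the 2016-side AG. Names and endpoints
-- below are made up for illustration.
CREATE AVAILABILITY GROUP [DistAG]
    WITH (DISTRIBUTED)
    AVAILABILITY GROUP ON
        'AG2016' WITH
        (
            LISTENER_URL = 'tcp://ag2016-listener.contoso.com:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC
        ),
        'AG2017' WITH
        (
            LISTENER_URL = 'tcp://ag2017-listener.contoso.com:5022',
            AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
            FAILOVER_MODE = MANUAL,
            SEEDING_MODE = AUTOMATIC
        );
```

Once the 2017-side AG is synchronized, cutover amounts to failing over the distributed AG and repointing the listener, which is why Tracy’s listener tip matters so much.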
The Silliness of Security Compliance
In last night’s Rant of the Evening, I channeled Sean McCown in pointing out that there is nothing inherently wrong with enabling xp_cmdshell. If you’re not a sysadmin, you can’t use xp_cmdshell by default; if you are a sysadmin, you can enable xp_cmdshell whenever you want. So any security check telling you to disable xp_cmdshell is wrong—the correct answer is not to hand out xp_cmdshell rights willy-nilly.
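To the point about sysadmins enabling it whenever they want, the entire “enable” dance is four statements:

```sql
-- Any sysadmin can run this at will, which is why disabling
-- xp_cmdshell is security theater rather than security.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;
```

The real control is not handing out sysadmin (or explicit xp_cmdshell rights) to people who shouldn’t have them.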
All of this comes in the context of security compliance audits. It was Tracy’s topic and she did a good job airing out some of the sillier ones when I wasn’t cutting her off.
As a point of order, the concept of security is by no means silly. Having people who understand the behaviors (including unintended!) of first- and third-party applications and systems is critical, and I support those people however I can.
I was a little ill last night when we recorded, so that’s why I made any mistakes I did. It was a 24-hour thing, though, so I’m going to have to come up with a new excuse for why I make so many mistakes.
Listing System Functions
Nadir hit me up via e-mail and wanted to figure out how to list system functions in T-SQL.
Answer: This script will give you the set of system functions available in SQL Server.
SELECT
    o.name,
    o.type_desc
FROM sys.system_sql_modules m
    INNER JOIN sys.all_objects o ON m.object_id = o.object_id
WHERE
    o.type NOT IN ('P', 'V');
Tom dropped the next question on us: he wants to prevent anybody from modifying a particular table, like adding or removing columns. How can we notify or prevent this sort of modification?
Answer: There are a few techniques available. Tom mentioned using SQL Audit and creating a view with schemabinding against that table—that will prevent table modifications so long as the view is schemabound.
Tracy called out a way of preventing SELECT * operations against a table: add a computed column which divides 1/0 to teach developers not to do that. @srutzky did mention a couple of reasons not to do this, though.
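Here’s a rough sketch of the schemabinding approach, with Tracy’s 1/0 column thrown in (object names are made up):

```sql
CREATE TABLE dbo.ImportantTable
(
    ID INT NOT NULL PRIMARY KEY,
    Payload VARCHAR(100) NOT NULL,
    -- Tracy's trick: SELECT * now throws a divide-by-zero error,
    -- but explicit column lists work fine.
    DoNotSelectStar AS (1 / 0)
);
GO
-- The schemabound view prevents dropping or altering the columns
-- it references.
CREATE VIEW dbo.LockImportantTable
WITH SCHEMABINDING
AS
SELECT t.ID, t.Payload
FROM dbo.ImportantTable t;
GO
-- This now fails until someone drops the view:
-- ALTER TABLE dbo.ImportantTable DROP COLUMN Payload;
```

It’s protection against accident rather than malice: anyone with enough rights to alter the table can also drop the view first.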
A Little Bit on Containers
From there, I started to bring up a question that Mala had asked around why you might want to use containers. We talked about using Docker containers for andragogical purposes, though Tom pushed Azure Lab Services, which is another good option. We also covered Docker containers for automated testing, where I think it makes a huge amount of sense.
During this, Tom also asked the question, will SQL Server be on Windows 202X, say 5 years from now? My answer is an adamant yes. As long as there is a Windows operating system, I’d expect to see SQL Server on Windows. There are too many features only available on Windows today, and even some new features are only partially implemented on Linux.
SQL Server Authentication Options
@iconpro5555 asked, what are the different methods for connecting to SQL Server?
Answer: There are two techniques: Active Directory and SQL authentication. Within Active Directory, we can break it out a bit further. On-premises, we have Windows Integrated Authentication. When dealing with Azure Active Directory, you have three ways of connecting: via multi-factor authentication (MFA), by typing in a username and password, or integrated Azure Active Directory.
As a side note, I wish that we had all of those AAD options for on-prem Active Directory.
Answer: Answering the second question first, Azure SQL Edge runs on 64-bit ARM processors and x64 processors. I’ve used it in private preview and liked what was available. You can also get into the public preview right now.
On the point about Apple, I can see it happening for their lower-end devices, as we’ve seen a cottage industry of ARM-based Chromebooks out there, so we know it’s viable. You wouldn’t use them for heavy gaming, video editing, and the like, but they’d definitely offer battery life improvements over the current generation if you’re a casual user of the product.
Tom dropped the next question on us: he is looking to use table partitioning to improve performance where he only needs recent data. Will this technique help him out?
Mala made me defend goto. Her question: when and where would you expect to use goto statements in code? What was Dijkstra’s argument against them based upon?
Answer: I enjoyed the dive into this topic, as it really tells a story of the history of computer science. If you want, you can easily read Dijkstra’s letter to the editor (and which title Dijkstra really didn’t like; a classic case of the editor getting to choose the title). But temper this with a quotation from Dijkstra a few years later:
Please don’t fall into the trap of believing that I am terribly dogmatical about [the go to statement]. I have the uncomfortable feeling that others are making a religion out of it, as if the conceptual problems of programming could be solved by a single trick, by a simple form of coding discipline!
Summarizing a couple of themes that I spent a lot of time on in the episode, it’s hard for those of us who never really worked with pre-structured programming languages to understand the debate. At the time Dijkstra was writing, common languages didn’t always have structure components like break, continue, return, switch, case, do, while, and sometimes not even else! In lieu of those structural keywords, programmers needed to use the tools available, and the biggest one was goto. Today, we throw the brakes when we see a single goto statement. But Dijkstra wasn’t really concerned about that; he was concerned about it being the blunt instrument programmers used even if there were better options available.
Nowadays, it seems like the generally accepted exceptions to “don’t use goto” are:
switch cases in languages like C#, where you can go to a different case in the switch.
Breaking out of deeply nested loops, though in that case the question might be, why are you nested so deeply?
Ensuring that all code paths reach a certain destination for cleanup steps in languages without finally. For example, this might involve freeing memory, closing connections, and releasing handles.
The reason I’m not automatically critical of GOTO in SQL Server is that there is no FINALLY block in TRY/CATCH. But then again, it’s really uncommon that you’d need that construct.
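As a hypothetical sketch of that FINALLY-style use (the procedure and temp table names are invented), GOTO lets both the success and failure paths converge on one cleanup block:

```sql
DECLARE @msg NVARCHAR(2000);
BEGIN TRY
    EXEC dbo.DoRiskyThing;  -- hypothetical procedure
    GOTO Cleanup;           -- success path jumps straight to cleanup
END TRY
BEGIN CATCH
    SET @msg = ERROR_MESSAGE();
    -- Log the error, decide whether to rethrow, etc.
    GOTO Cleanup;           -- failure path lands in the same place
END CATCH

Cleanup:
    -- Release app locks, close cursors, drop temp objects, etc.
    IF OBJECT_ID('tempdb..#Work') IS NOT NULL
        DROP TABLE #Work;
```

In most procedures the statements after END CATCH run anyway, which is why this construct is so rarely needed.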
I can second both of these recommendations, having the paper copy of Joe’s book and a PDF of Kalen’s. The link to Kalen’s book lets you download it as a PDF for free.
Second Thoughts on Azure ML
The final thing I’m covering here is some second thoughts on Azure Machine Learning. The brief version of it is as follows.
When it came out, Azure ML felt like SSIS for machine learning. You dragged and dropped items, clicked the mouse a whole bunch of times, and ended up with a pretty-looking data flow to build a model. Data scientists tended to complain that they could do most of the work in Azure ML in about 6 lines of R or Python code, and that the visual interface made model comparison really clunky.
Later on, demos were still drag-and-drop, but I remember that instead of dragging and dropping various data cleanup components, they’d drop in an R/Python code block and put in those 6 lines of code. So that was better, but the visual interface was still too constraining and compute was rather expensive once you did the math.
I skipped Azure ML for a while (including the Studio phase) but came back to it as the result of a work project, and I have to say that it looks a lot better. The integration with Azure Container Instances and Azure Kubernetes Service is pretty nice, model registration competes with MLflow (in that it’s easier to maintain, though not as feature-rich), and I approve of AutoML for at least getting you 80% of the way there (though @thedukeny points out that in many cases, AutoML can do at least as well as a data science team). And pricing isn’t too bad—we have a moderately used web service (called approximately 16K times per day, pushing roughly 4 million rows) and would be out approximately $100 per month for 24/7 utilization of ACI. That’s a fair bit less than we’re paying now.
The bottom line is, if you ignored Azure ML over the past couple of years like I did, I recommend giving it another try to see if it might fit some of your needs.
Mala gave me a bomb to throw with this question: what is the best index maintenance routine? Is updating statistics by itself enough, or do we need index rebuilding (and/or reorganization)? Are common strategies really useful? How relevant are the “canned” maintenance routines like Ola Hallengren’s solution or Minion Reindex?
Answer: Given my job as Resident Bomb-Thrower, I needed to hold an unpopular opinion. Don’t worry, I didn’t let you down!
The short version is that all of our panelists and most of chat enjoy using Ola Hallengren’s solutions. I’ve set up Minion Reindex in low-touch DBA places where index maintenance is essentially fire-and-forget.
If I’m a full-time DBA somewhere, I would pay close attention to Jeff Moden’s presentation on index maintenance. Jeff tried out the “never rebuild a thing” route and fine-tuned it considerably from there. This solution has a lot going for it, though it does require understanding your systems and having the time to experiment. On the plus side, you come close to eliminating unnecessary index rebuilds, so it’s got that going for it.
We then got to my unpopular opinion of the night: use NVARCHAR everywhere. The immediate answer from every panelist and several members of chat was some variation of “NO!” That’s how you can tell you’re in a room full of DBAs.
I don’t intend to write up my thoughts on the topic here, so check out the video for my off-the-cuff answer. I am also preparing a full-length video on the topic to flesh out the idea a bit further.
Off-The-Wall Job Opportunities
My first topic of the night was, What are some weird job opportunity e-mails you’ve received? We’ve all received those e-mails from recruiters looking at decade-old resumes or sending out e-mail blasts based on not-quite-calibrated keywords.
Answer: Check the video for everybody’s answers, but I had two of them. First, recruiters asking about technologies which haven’t been on my resume in well over a decade. In my case, I know a recruiter’s desperate when I get e-mails about ColdFusion.
The other example is that I had shared my resume on some job site or another and included the term “data warehousing” on it. I received two separate automated e-mails indicating that I would be a great fit as a shift manager at a warehouse.
Anders and I also spent some time commiserating about the “You would be a great fit for this job which pays you a third of what you’re making now!” e-mails from certain job websites.
My final question is something I’m having trouble answering. What are tips for new graduates trying to find jobs in the world of data science? I extended the question to include development and administration positions as well.
Answer: My normal recommendation would be to send the person to relevant user groups. In our area, that’s Research Triangle Analysts and TriPASS. That way you can meet people in the industry, make contacts, and learn about new job openings. Unfortunately, with user groups being virtual-only experiences, that’s not a great way to make contacts right now. Compounding this, a lot of places have frozen job hiring, so if you’re trying to get in at entry level, it’s especially hard to find a job right now.
Tom’s advice is to look for entry-level jobs and expand your scope beyond data science and into things like report writing or QA. I’d include business analyst roles in there as well. Tom also mentioned looking for internships, though those are harder to get after graduation.
Mala recommended checking the PASS Careers site, as well as the #sqljobs Twitter hashtag.
Unfortunately, I was the only person able to make it on for Shop Talk last night. Fortunately, we had some great folks in chat to break up the monotony of me being me.
Jupyter Books and Calendar Tables
Mala asked (a while ago), what is the difference between a Jupyter Notebook and a Jupyter Book?
Answer: Jupyter Notebooks are individual documents which can contain code, markdown, and documentation. I used the analogy of “looseleaf” notebooks for a bundle of notebooks in the same directory which are somewhat related, but with nothing explicitly tying them together.
Jupyter Books are collections of notebooks with a table of contents and proper “binding.” In Azure Data Studio, you can see the table of contents alongside the notebooks when you open a book.
@thedukeny asked during Data Architecture Day, “One thing I’ve always wondered is how are people fitting subscription data to a star schema? So many business are subscription based with THROUGH and FROM periods and not based on a particular date”
Answer: The key here is to use the subscription as a dimension and include the time frames there. Then, in your ETL process, have the fact look up the right subscription key—it’s a lot better to take that extra time while loading the data and make data retrieval easier and faster.
Here’s the script I used to explain my thoughts:
-- Note: column definitions are reconstructed here from the inserts
-- and queries below, to make the script runnable end to end.
CREATE TABLE #Subscription
(
    SubscriptionKey INT NOT NULL,
    SubscriptionID UNIQUEIDENTIFIER NOT NULL,
    UserKey INT NOT NULL,
    -- Other user information. No worries about dupes...
    [Level] VARCHAR(20) NOT NULL,
    PeriodBeginDate DATE NOT NULL,
    PeriodEndDate DATE NOT NULL,
    IsCurrentRow BIT NOT NULL
);

-- Create a couple of people. We'd have other dimensions too, of course.
INSERT INTO #Subscription
(SubscriptionKey, SubscriptionID, UserKey, [Level], PeriodBeginDate, PeriodEndDate, IsCurrentRow)
VALUES
-- User 1 came in at Gold
(1, NEWID(), 1, 'Gold', '2020-05-01', '9999-12-31', 1),
-- User 2 came in at Talc and upgraded to Zirconium
(2, NEWID(), 2, 'Talc', '2018-04-01', '2018-12-01', 0),
(3, NEWID(), 2, 'Zirconium', '2018-12-01', '9999-12-31', 1),
-- User 3 came in at Quartz, downgraded to Amethyst, and left
(4, NEWID(), 3, 'Quartz', '2017-11-29', '2018-11-28', 0),
(5, NEWID(), 3, 'Amethyst', '2018-11-28', '2019-11-27', 1);

-- Example fact table for watching a video.
CREATE TABLE #CourseView
(
    SubscriptionKey INT NOT NULL,
    -- Could include UserKey if we normally group by that...
    DateKey INT NOT NULL,
    CourseKey INT NOT NULL
    -- Other keys
);

-- User 2 upgrades to Topaz, so that's an SCD2 change.
-- Update the current row and add a new row.
UPDATE #Subscription
SET PeriodEndDate = '2020-06-08', IsCurrentRow = 0
WHERE SubscriptionKey = 3;

INSERT INTO #Subscription
(SubscriptionKey, SubscriptionID, UserKey, [Level], PeriodBeginDate, PeriodEndDate, IsCurrentRow)
VALUES
(6, NEWID(), 2, 'Topaz', '2020-06-08', '9999-12-31', 1);

-- User 1 leaves us. This is changing the end date, so just SCD1.
UPDATE #Subscription
SET PeriodEndDate = '2020-06-30'
WHERE SubscriptionKey = 1;
I also ginned up a sample query against another data set based on some discussion with @thedukeny. This was off the top of my head, but gives an idea of how you can use it.
-- SELECT list inferred from the GROUP BY clause.
SELECT
    c.CalendarYear,
    c.CalendarQuarterName,
    e.FirstName,
    e.LastName,
    s.[Level],
    SUM(er.Amount) AS Total
FROM ExpenseReports.dbo.Employee e
    INNER JOIN #Subscription s ON e.EmployeeID = s.UserKey
    CROSS JOIN ExpenseReports.dbo.Calendar c
    LEFT OUTER JOIN ExpenseReports.dbo.ExpenseReport er
        ON er.EmployeeID = e.EmployeeID
        AND c.[Date] = er.ExpenseDate
WHERE c.[Date] >= s.PeriodBeginDate AND c.[Date] < s.PeriodEndDate
    AND c.[Date] < '2020-01-01'
GROUP BY c.CalendarYear, c.CalendarQuarterName, e.FirstName, e.LastName, s.[Level]
ORDER BY e.FirstName, c.CalendarYear, c.CalendarQuarterName;
This was a great question because modern web-based companies are moving to subscription models for payment, and some of the older books like Kimball’s don’t really cover subscription models that well.
Crazy Job Descriptions
From Mike Chrestensen: “This is the second job I’ve seen. Am I crazy or is this impossible. Full stack + DBA salesforce support + .ASP/.NET/C#, SQL Dev, Reporting and interface with business.”
Answer: I love these descriptions. I think that there are four major reasons why you see crazy job descriptions (though in fairness, the one Mike showed me wasn’t necessarily crazy):
HR got their hands on it. That’s where you start seeing things like one of my favorites, where you need 3 years of SQL Server 2012 experience…in the year 2014. Anders also pointed out a developer job which wanted several years of .NET experience in 1999—mind you, .NET wasn’t even publicly available at that time, and C# didn’t stabilize as a language until after 2000.
Janice just left the company and now we need a new Janice. Here’s all the stuff she did over the past decade. You should be able to do all of the things she did.
Rebecca needs a promotion, but the company requires that we post all jobs to the public. Therefore, we figure out all of the things Rebecca does and has experience with and make them mandatory for this new role. You can tell they’re gaming the system when those descriptions mention experience with internal products.
The company is a startup, or still has that startup mentality.
The fourth one is the one I’m most okay with, especially if you know it’s a startup. One person needs to wear a lot of hats in that kind of role, so it’s good to know what you’re in for.
I don’t want to make it sound like having a brain dump of every potentially-tangential technology is a good thing for a job description—it’s not. My ideal job description clearly delineates the stuff you need to know, the stuff we want you to know, and the stuff which is a bonus. If it’s an app developer role, it’s okay to list SQL Server development skills, SSIS, and SSRS, but make it clear that these are small parts of the job and bonus skills rather than core to the position.
Tom kicked us off with a request for assistance around Kerberos, where he was getting an error message I hadn’t seen before:
Cannot connect to WMI provider. You do not have permission or the server is unreachable. Note that you can only manage SQL Server 2005 and later servers with SQL Server Configuration Manager. Invalid namespace [0x8004100e]
He noticed this Microsoft Support page which explains how to fix it, but that didn’t seem to do the trick. He promises to vanquish this beast and report on it for the rest of us.
Imbalanced Classes and Confusion Matrices
Deepak reached out to me with a question. “I have a labeled data set of about 100,000 employees. I need to predict who will leave or stay with the company. This is highly imbalanced data because most of the employees stay with the company and very few leave. Thus wondering how to prepare the data for training the model? Should I first filter out the data of employees who left the company and then may be use 1000 employees who left the company and may be 200 employees who stayed with the company to train the model? If not, then which AI ML model should I use which can handle highly imbalanced data for classification?”
Answer: This is known as an imbalanced class problem. In this simple scenario, we have two possible classes: left the company (1000) and stayed with the company (99,000). This type of issue explains why accuracy is not the end-all, be-all measure for gauging prediction quality, as I can create a model which is correct 99% of the time by simply saying that everybody will stay with the company.
From there, @thedukeny nudged me into talking about a concept known as the confusion matrix. This leads to four measures in addition to accuracy:
Positive predictive value: given that we said you would stay, how likely were you to stay?
Negative predictive value: given that we said you would go, how likely were you to go?
Sensitivity / Recall: when you actually stay, how often did we predict you would stay?
Specificity: when you actually go, how often did we predict that you would go?
In cases where the classes are pretty well balanced, accuracy (based on the prediction we made, were we correct?) is not a terrible measure, but we want to temper it with these four additional measures to get a more complete look at how we are doing.
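As a hedged illustration, here’s a Python sketch computing those four measures from a hypothetical confusion matrix for the stay/go example (all counts are invented; “stay” is the positive class, matching the definitions above):

```python
# Hypothetical confusion matrix; every count below is made up for illustration.
tp = 97_000  # predicted stay, actually stayed
fp = 400     # predicted stay, actually went
fn = 2_000   # predicted go, actually stayed
tn = 600     # predicted go, actually went

accuracy    = (tp + tn) / (tp + fp + fn + tn)
ppv         = tp / (tp + fp)  # given that we said stay, how likely were you to stay?
npv         = tn / (tn + fn)  # given that we said go, how likely were you to go?
sensitivity = tp / (tp + fn)  # when you actually stayed, how often did we predict stay?
specificity = tn / (tn + fp)  # when you actually went, how often did we predict go?

# Accuracy looks fine, but NPV exposes how bad we are at predicting leavers.
print(f"accuracy={accuracy:.3f} ppv={ppv:.3f} npv={npv:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Note how a model with 97.6% accuracy still has a negative predictive value around 0.23: when it says someone will go, it is wrong more than three times out of four.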
Getting back to Deepak’s direct question, I have one technique and two algorithms which I think work well in these cases. The technique is to oversample (or undersample). When we oversample, we make copies of the lesser-used class in order to balance things out a bit more. Instead of having a 99:1 ratio of stay to go, we might make dozens of copies of the people who go, just to tell the algorithm that it’s not okay simply to avoid that 1% and take the easy 99% victory. The opposite side is undersampling, where we throw away enough of the majority class to get closer to a 1:1 balance. As a note, it doesn’t need to be perfectly even—even a 2:1 or 3:1 ratio is still fine depending on your algorithm and amount of data.
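Here’s a minimal Python sketch of random oversampling under those assumptions; the toy data and the 3:1 target ratio are made up for illustration:

```python
import random

random.seed(42)

# Toy imbalanced dataset: ~99:1 stay-to-go, as in the question.
stay = [("features", "stay")] * 990
go   = [("features", "go")] * 10

# Oversample: duplicate minority-class records (sampling with replacement)
# until we reach roughly a 3:1 ratio -- it doesn't need to be perfectly even.
target = len(stay) // 3
oversampled_go = [random.choice(go) for _ in range(target)]

training = stay + oversampled_go
random.shuffle(training)  # mix the classes before training

ratio = len(stay) / len(oversampled_go)
print(f"stay:go ratio after oversampling = {ratio:.1f}:1")
```

Undersampling is the mirror image: instead of building `oversampled_go`, you would sample a subset of `stay` down toward the same target ratio.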
As far as algorithms go, gradient boosted decision tree algorithms (like xgboost) do a really good job of handling class imbalance. Also, one of my favorite sets of classification algorithms does a great job with this problem: online passive-aggressive algorithms. The way these algorithms work is that, for each record, we determine whether the model we’ve created would have predicted the answer correctly. If so, the model makes no changes to its weights; in other words, it is passive. But if the predicted class is incorrect, the model will move heaven and earth to get things right: it is aggressive in changing weights until the current example is just within the boundary of being correct. It then moves on to the next set of inputs and forgets about the changes it needed to make to get the prior answer correct. In this way, passive-aggressive algorithms focus on the problem at hand, ignore how we got to where we are, and treat each input as vital. That’s why we may not need many examples of a class to get it right as a whole: the algorithm is aggressive enough with every single record that it doesn’t need iteration after iteration of a class to recognize that it’s important to learn something about it.
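If you want to see that update rule in action, here’s a minimal pure-Python sketch of the classic passive-aggressive update (the toy data and epoch count are my own invention; in practice you’d reach for something like scikit-learn’s PassiveAggressiveClassifier rather than rolling your own):

```python
def pa_train(samples, epochs=5):
    """Classic passive-aggressive update for binary classification.

    samples: list of (features, label) pairs with label in {-1, +1}.
    If the current weights score an example with margin >= 1, do nothing
    (passive); otherwise shift the weights just far enough to put the
    example on the right side of the margin (aggressive).
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            loss = max(0.0, 1.0 - y * score)   # hinge loss
            if loss > 0.0:                     # aggressive step
                tau = loss / sum(xi * xi for xi in x)
                w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w

def pa_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Imbalanced toy data: many "stay" (+1) examples, very few "go" (-1).
data = [([1.0, 0.2], 1)] * 20 + [([0.1, 1.0], -1)] * 2
w = pa_train(data)
print(pa_predict(w, [1.0, 0.2]), pa_predict(w, [0.1, 1.0]))
```

Even with a 10:1 imbalance in the toy data, the aggressive update on the rare class is enough to carve out a correct decision boundary for it.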
Presentations and Sessions
@johnfan14 asks, “OK.. You guys attend so many seminars, so how do you manage them? like planning and scheduling.. You use Excel or some kind of calendar or some other tools?”
Answer: Tom, Tracy, and I all gave about the same answer: calendar entries and spreadsheets. I use Google Sheets so that I can access it easily from the road and keep track of when, where, what, and key stats I need for reporting. Here’s a quick sample from 2019:
We also use calendar entries to keep track of when and where, and I also have a spreadsheet with travel plans to ensure that I have airplane, hotel room, and rental car (if necessary) booked. Early on in my speaking career, I accidentally double-booked a flight because I didn’t realize I had already purchased a ticket. That’s when I started keeping the spreadsheet.
Database Comparison Tools
From @jeffalope: “Anyone have any cool ways to compare multiple databases where the expectation is that the schema should be the same but you know there is some drift? I know that sqlpackage (ssdt) and red-gate sql compare exist but just wonder if anyone has knowledge of a turn key solution to generate a “report” for a few dozen instances?”
Answer: Tom gave us a website which has a list of comparison tools, but there are two parts to Jeff’s problem. The first is a product which generates a comparison between a source database and a target database; Tom’s link gives us myriad tools for that job.
But second, the requirement to compare a few dozen instances stresses things a bit. I had put together something in the past for Red Gate schema comparisons, but it turned into a bit of a mess for two reasons: first, schema drift happens a bit too regularly; and second, minor server configuration differences can lead to lots of effectively false-positive results around things like collation. As a result, I don’t have a great solution, but I can confirm that Red Gate’s tool is something you can call from .NET code (I used C#, but you could also use F# or PowerShell) and build an internal product around.
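I don’t have that turnkey report, but to give a flavor of what a home-grown one might look like, here’s a hedged Python sketch (all table and instance names are hypothetical) that compares per-instance schema snapshots, however you collect them (e.g., queries against INFORMATION_SCHEMA), against a reference and flags the drift:

```python
def schema_drift(reference, instances):
    """Compare per-instance schema snapshots against a reference.

    reference: dict of table -> set of column definition strings
    instances: dict of instance name -> snapshot in the same shape
    Returns a report: instance name -> list of human-readable differences.
    """
    report = {}
    for name, snapshot in instances.items():
        diffs = []
        for table in sorted(set(reference) | set(snapshot)):
            if table not in snapshot:
                diffs.append(f"missing table {table}")
            elif table not in reference:
                diffs.append(f"extra table {table}")
            else:
                for col in sorted(reference[table] - snapshot[table]):
                    diffs.append(f"{table}: missing column {col}")
                for col in sorted(snapshot[table] - reference[table]):
                    diffs.append(f"{table}: extra column {col}")
        if diffs:  # only drifted instances make the report
            report[name] = diffs
    return report

# Hypothetical snapshots for two instances.
reference = {"dbo.Orders": {"OrderID int", "Total decimal(10,2)"}}
instances = {
    "PROD01": {"dbo.Orders": {"OrderID int", "Total decimal(10,2)"}},
    "PROD02": {"dbo.Orders": {"OrderID int"}},  # drifted
}
print(schema_drift(reference, instances))
```

The hard part in real life isn’t this diff logic; it’s normalizing the snapshots enough (collations, defaults, and the like) that the report isn’t drowning in false positives.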
We had two similar questions from Data Architecture Day that I’m getting around to answering. First, from @lisanke:
How does the process of DB design change if we tell you, you’ll be working with a group (or groups) of programmers who’ll want to store and retrieve data efficiently from the DB for their tasks (e.g. system test) But they don’t know (yet) what they’ll be storing And that target will always be changing through the development life cycle?? can I create database schema and structure and access procedures that helps the team and won’t be completely inefficient with constantly evolving data design?
And second from @thedukeny: “Tips for balancing normality with development speed?”
Answer: I consider these to be very similar questions with similar answers, so I’ll take them together.
The key question is, do you prototype? I love this story from Phil Factor about Entity-Attribute-Value (EAV) and prototyping. As Phil points out, EAV can work in the early development phase, but as we get closer to a stable product, it needs to be turned into a survivable data model. If you need to get something put together immediately to show off, it’s a workable approach. But the problem I see is that very few companies create proper prototypes, where you have one approach to the problem and then throw it away and start over. That’s a shame, but it is something we have to live with—unless you’re in a lucky situation in which you can create real prototypes and then not immediately ship that code to production.
The other problem we have to live with is that “We’ll fix it in round 2” is a comforting lie we tell ourselves as we write halfway-working code. Round 2 almost never happens because we have so much more stuff to do and by the time we actually have a chance to get back to that code, it’s now “too difficult” to fix.
We also have to keep in mind that databases are stateful and require planning. I can gut a web application and, as long as I haven’t messed with the interfaces, nobody needs to be the wiser. How we get to the current state in a web app is almost unimportant; for a database, it’s the entire story. And this is just as true for non-relational databases as it is for relational!
There are some non-relational databases which might be easier to develop against, so it might make sense to create a prototype in one system and then migrate it to the real system once you’re ready for real development. If you’re in a shop where you can do that, go for it. But there are some risks of performance problems, translation issues, and missing functionality as you make that shift, so build in time for all of that.
I think a better approach is to skip the database while the data model is fluid. If you’re changing your data model on a daily basis, it’s a sign that your product isn’t mature enough for a database. That’s not a bad thing, but save yourself the pain and just use in-memory objects which you can load up from JSON, XML, flat files, or even hard-coded in unit tests if that’s what you really want to do. That way, you can quickly modify the data structure you need in your app as requirements change, and once the model is more stable, then it’s time to negotiate the differences between an application and a properly-normalized relational database.
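As a sketch of what that can look like in practice (names and data invented), here’s a plain Python dataclass loaded from JSON; reshaping it as requirements change is a one-line edit rather than a schema migration:

```python
import json
from dataclasses import dataclass

@dataclass
class Employee:            # reshape this freely while the model is fluid
    name: str
    department: str
    active: bool = True

# Hard-coded sample data standing in for a file, API response, or unit test.
raw = ('[{"name": "Ada", "department": "Engineering"},'
       ' {"name": "Grace", "department": "Research", "active": false}]')

employees = [Employee(**record) for record in json.loads(raw)]
active = [e.name for e in employees if e.active]
print(active)
```

Once the shape of `Employee` stops changing every day, that’s the signal to sit down and negotiate the real, normalized relational design.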
I think that’s reasonable advice early in the development cycle, which is where @lisanke focused. But getting to @thedukeny’s question, how about when the app is mostly stable but you’re doing regular development?
In that case, we want to have a data model planned out. This data model is different from the application’s objects (or record types or whatever), as your data model typically survives the application. I know of databases which have survived three or four versions of an application, and that’s why it’s important to understand the data rules and data model.
Like all other development, expect changes and re-work based on how well the product has been maintained over time. If you have a model which is generally pretty good, you’ll want to keep it in shape but won’t have to spend too much time rearchitecting everything. If it’s a junk design, build that time in when coming up with estimates for doing work. As much as you can, take the time to get it right and work with the dev team. Most developers want to get things right but don’t necessarily know the best way to do it, so they do what they know and move on. I understand that, like any other group, there are developers of varying levels of work ethic and professionalism, but find the people willing and able to help and work with them.
One last piece of advice which bridges both of these questions: use stored procedures as an interface. @jeffalope also mentioned this in the context of a data access layer (DAL). Try to keep your database as removed from the business objects as you can, and stored procedures do a great job of that. This way, you can redesign tables with potentially zero impact on the application, so long as the stored procedure interface does not change. It also allows application objects to exist separately from the data model. It’s extra work, but well worth it in the end.