Show notes for the Triangle Area SQL Server User Group's Shop Talk program.
Author: Kevin Feasel
Kevin Feasel is a Microsoft Data Platform MVP and CTO at Envizage, where he specializes in data analytics with T-SQL and R, forcing Spark clusters to do his bidding, fighting with Kafka, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL (https://curatedsql.com) and author of PolyBase Revealed (https://www.apress.com/us/book/9781484254608). A resident of Durham, North Carolina, he can be found cycling the trails around the Triangle whenever the weather's nice enough.
Answer: Answering the second question first, Azure SQL Edge runs on 64-bit ARM processors and x64 processors. I’ve used it in private preview and liked what was available. You can also get into the public preview right now.
On the point about Apple, I can see it happening for their lower-end devices, as we’ve seen a cottage industry of ARM-based Chromebooks out there, so we know it’s viable. You wouldn’t use them for heavy gaming, video editing, and the like, but they’d definitely offer battery life improvements over the current generation if you’re a casual user of the product.
Partitioned Tables
Tom dropped the next question on us: he is looking to use table partitioning to improve performance where he only needs recent data. Will this technique help him out?
Mala made me defend goto. Her question: when and where would you expect to use goto statements in code? What was Dijkstra’s argument against them based upon?
Answer: I enjoyed the dive into this topic, as it really tells a story of the history of computer science. If you want, you can easily read Dijkstra's letter to the editor (whose published title Dijkstra really didn't like; a classic case of the editor getting to choose the title). But temper this with a quotation from Dijkstra a few years later:
Please don’t fall into the trap of believing that I am terribly dogmatical about [the go to statement]. I have the uncomfortable feeling that others are making a religion out of it, as if the conceptual problems of programming could be solved by a single trick, by a simple form of coding discipline!
Summarizing a couple of themes that I spent a lot of time on in the episode, it's hard for those of us who never really worked with pre-structured-programming languages to understand the debate. At the time Dijkstra was writing, common languages didn't always have structural constructs like break, continue, return, switch, case, do, while, and sometimes not even else! In lieu of those structural keywords, programmers needed to use the tools available, and the biggest one was goto. Today, we slam on the brakes when we see a single goto statement. But Dijkstra wasn't really concerned about that; he was concerned about goto being the blunt instrument programmers reached for even when better options were available.
Nowadays, it seems like the generally accepted exceptions to “don’t use goto” are:
switch cases in languages like C#, where you can go to a different case in the switch.
Breaking out of deeply nested loops, though in that case the question might be, why are you nested so deeply?
Ensuring that all code paths reach a certain destination for cleanup steps in languages without finally. For example, this might involve freeing memory, closing connections, and releasing handles.
The reason I’m not automatically critical of GOTO in SQL Server is that there is no FINALLY block in TRY/CATCH. But then again, it’s really uncommon that you’d need that construct.
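As a minimal sketch of what that looks like (table names are hypothetical), GOTO can route a code path to a shared cleanup label since TRY/CATCH has no FINALLY. Note that GOTO can jump out of a TRY or CATCH block but not into one.
-- A minimal sketch: GOTO as a stand-in for the missing FINALLY. Table names are hypothetical.
CREATE TABLE #WorkingSet (ID INT);
BEGIN TRY
    INSERT INTO #WorkingSet (ID) VALUES (1);
    IF NOT EXISTS (SELECT 1 FROM #WorkingSet WHERE ID = 2)
        GOTO Cleanup;  -- Nothing left to do, so jump straight to the cleanup steps.
    -- The rest of the work would go here.
END TRY
BEGIN CATCH
    -- Error handling (logging, etc.) goes here; execution then falls through to Cleanup.
END CATCH;
Cleanup:
-- Steps which need to run on every code path.
DROP TABLE IF EXISTS #WorkingSet;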
I can second both of these recommendations, having the paper copy of Joe’s book and a PDF of Kalen’s. The link to Kalen’s book lets you download it as a PDF for free.
Second Thoughts on Azure ML
The final thing I’m covering here is some second thoughts on Azure Machine Learning. The brief version of it is as follows.
When it came out, Azure ML felt like SSIS for machine learning. You dragged and dropped items, clicked the mouse a whole bunch of times, and ended up with a pretty-looking data flow to build a model. Data scientists tended to complain that they could do most of the work in Azure ML in about 6 lines of R or Python code, and that the visual interface made model comparison really clunky.
Later on, demos were still drag-and-drop, but I remember that instead of dragging and dropping various data cleanup components, they’d drop in an R/Python code block and put in those 6 lines of code. So that was better, but the visual interface was still too constraining and compute was rather expensive once you did the math.
I skipped Azure ML for a while (including the Studio phase) but came back to it as the result of a work project, and I have to say that it looks a lot better. The integration with Azure Container Instances and Azure Kubernetes Service is pretty nice, model registration competes with MLflow (in that it's easier to maintain, though not as feature-rich), and I approve of AutoML for at least getting you 80% of the way there (though @thedukeny points out that in many cases, AutoML can do at least as well as a data science team). And pricing isn't too bad: we have a moderately used web service (called approximately 16K times per day, pushing roughly 4 million rows) and would pay approximately $100 per month for 24/7 utilization of ACI. That's a fair bit less than we're paying now.
The bottom line is, if you ignored Azure ML over the past couple of years like I did, I recommend giving it another try to see if it might fit some of your needs.
Mala gave me a bomb to throw with this question: what is the best index maintenance routine? Is rebuilding statistics by itself enough, or do we need index rebuilding (and/or reorganization)? Are common strategies really useful? How relevant are the “canned” maintenance routines like Ola Hallengren’s solution or Minion Reindex?
Answer: Given my job as Resident Bomb-Thrower, I needed to hold an unpopular opinion. Don’t worry, I didn’t let you down!
The short version is that all of our panelists and most of chat enjoy using Ola Hallengren’s solutions. I’ve set up Minion Reindex in low-touch DBA places where index maintenance is essentially fire-and-forget.
If I'm a full-time DBA somewhere, I would pay close attention to Jeff Moden's presentation on index maintenance. Jeff tried out the "never rebuild a thing" route and fine-tuned it considerably from there. This solution has a lot going for it, though it does require understanding your systems and having the time to experiment. On the plus side, you come close to eliminating unnecessary index rebuilds altogether.
NVARCHAR Everywhere
We then got to my unpopular opinion of the night: use NVARCHAR everywhere. The immediate answer from every panelist and several members of chat was some variation of “NO!” That’s how you can tell you’re in a room full of DBAs.
I don’t intend to write up my thoughts on the topic here, so check out the video for my off-the-cuff answer. I am also preparing a full-length video on the topic to flesh out the idea a bit further.
Off-The-Wall Job Opportunities
My first topic of the night was, What are some weird job opportunity e-mails you’ve received? We’ve all received those e-mails from recruiters looking at decade-old resumes or sending out e-mail blasts based on not-quite-calibrated keywords.
Answer: Check the video for everybody’s answers, but I had two of them. First, recruiters asking about technologies which haven’t been on my resume in well over a decade. In my case, I know a recruiter’s desperate when I get e-mails about ColdFusion.
The other example is that I had shared my resume on some job site or another and included the term “data warehousing” on it. I received two separate automated e-mails indicating that I would be a great fit as a shift manager at a warehouse.
Anders and I also spent some time commiserating about the “You would be a great fit for this job which pays you a third of what you’re making now!” e-mails from certain job websites.
My final question is something I’m having trouble answering. What are tips for new graduates trying to find jobs in the world of data science? I extended the question to include development and administration positions as well.
Answer: My normal recommendation would be to send the person to relevant user groups. In our area, that’s Research Triangle Analysts and TriPASS. That way you can meet people in the industry, make contacts, and learn about new job openings. Unfortunately, with user groups being virtual-only experiences, that’s not a great way to make contacts right now. Compounding this, a lot of places have frozen job hiring, so if you’re trying to get in at entry level, it’s especially hard to find a job right now.
Tom’s advice is to look for entry-level jobs and expand your scope beyond data science and into things like report writing or QA. I’d include business analyst roles in there as well. Tom also mentioned looking for internships, though those are harder to get after graduation.
Mala recommended checking the PASS Careers site, as well as the #sqljobs Twitter hashtag.
Unfortunately, I was the only person able to make it on for Shop Talk last night. Fortunately, we had some great folks in chat to break up the monotony of me being me.
Jupyter Books and Calendar Tables
Mala asked (a while ago), what is the difference between a Jupyter Notebook and a Jupyter Book?
Answer: Jupyter Notebooks are individual documents which can contain code, markdown, and documentation. I used the analogy of “looseleaf” notebooks for a bundle of notebooks in the same directory which are somewhat related, but with nothing explicitly tying them together.
Jupyter Books are collections of notebooks with a table of contents and proper "binding." In Azure Data Studio, you can open a Jupyter Book and navigate its notebooks as a single, connected unit.
My example was Emanuele Meazzo's SQL Server Diagnostic Book. You can see everything connected together in a way that simply having a folder of "looseleaf" notebooks doesn't give you.
@thedukeny asked during Data Architecture Day, "One thing I've always wondered is how are people fitting subscription data to a star schema? So many businesses are subscription based with THROUGH and FROM periods and not based on a particular date."
Answer: The key here is to use the subscription as a dimension and include the time frames there. Then, in your ETL process, have the fact look up the right subscription key—it’s a lot better to take that extra time while loading the data and make data retrieval easier and faster.
Here’s the script I used to explain my thoughts:
CREATE TABLE #Subscription
(
SubscriptionKey INT,
SourceSubscriptionID UNIQUEIDENTIFIER,
UserKey INT,
Level VARCHAR(30),
PeriodBeginDate DATE,
PeriodEndDate DATE,
IsCurrentRow BIT
);
-- Create a couple of people. We'd have other dimensions too, of course.
INSERT INTO #Subscription
(
SubscriptionKey,
SourceSubscriptionID,
UserKey,
-- Other user information. No worries about dupes...
Level,
PeriodBeginDate,
PeriodEndDate,
IsCurrentRow
)
VALUES
-- User 1 came in at Gold
(1, NEWID(), 1, 'Gold', '2020-05-01', '9999-12-31', 1),
-- User 2 came in at Talc and upgraded to Zirconium
(2, NEWID(), 2, 'Talc', '2018-04-01', '2018-12-01', 0),
(3, NEWID(), 2, 'Zirconium', '2018-12-01', '9999-12-31', 1),
-- User 3 came in at Quartz, downgraded to Amethyst, and left
(4, NEWID(), 3, 'Quartz', '2017-11-29', '2018-11-28', 0),
(5, NEWID(), 3, 'Amethyst', '2018-11-28', '2019-11-27', 1);
-- Example fact table for watching a video.
CREATE TABLE #CourseView
(
DateKey INT,
CourseKey INT,
SubscriptionKey INT,
-- Could include UserKey if we normally group by that...
-- Other keys
NumberOfMinutesViewed INT,
NumberOfSegmentsViewed INT,
FinishedCourse BIT
);
-- User 2 upgrades to Topaz, so that's an SCD2 change.
-- Update the current row and add a new row.
UPDATE #Subscription
SET PeriodEndDate = '2020-06-08', IsCurrentRow = 0
WHERE SubscriptionKey = 3;
INSERT INTO #Subscription
(
SubscriptionKey,
SourceSubscriptionID,
UserKey,
Level,
PeriodBeginDate,
PeriodEndDate,
IsCurrentRow
)
VALUES
(6, NEWID(), 2, 'Topaz', '2020-06-08', '9999-12-31', 1);
-- User 1 leaves us. This is changing the end date, so just SCD1.
UPDATE #Subscription
SET PeriodEndDate = '2020-06-30'
WHERE SubscriptionKey = 1;
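Tying the script back to the answer: the fact load is where the subscription key lookup happens. Here's a minimal sketch, where #CourseViewStaging and its columns (UserKey, ViewDate, and so on) are hypothetical, matching each view to the subscription period in effect at the time.
-- Hypothetical fact load: resolve SubscriptionKey by matching the user and the
-- view date against the subscription's effective period.
INSERT INTO #CourseView
(
    DateKey,
    CourseKey,
    SubscriptionKey,
    NumberOfMinutesViewed,
    NumberOfSegmentsViewed,
    FinishedCourse
)
SELECT
    stg.DateKey,
    stg.CourseKey,
    s.SubscriptionKey,
    stg.NumberOfMinutesViewed,
    stg.NumberOfSegmentsViewed,
    stg.FinishedCourse
FROM #CourseViewStaging stg
    INNER JOIN #Subscription s
        ON s.UserKey = stg.UserKey
        AND stg.ViewDate >= s.PeriodBeginDate
        AND stg.ViewDate < s.PeriodEndDate;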
I also ginned up a sample query against another data set based on some discussion with @thedukeny. This was off the top of my head, but gives an idea of how you can use it.
SELECT
c.CalendarYear,
c.CalendarQuarterName,
e.FirstName,
e.LastName,
s.[Level],
SUM(er.Amount) AS Total
FROM ExpenseReports.dbo.Employee e
INNER JOIN #Subscription s ON e.EmployeeID = s.UserKey
CROSS JOIN ExpenseReports.dbo.Calendar c
LEFT OUTER JOIN ExpenseReports.dbo.ExpenseReport er
ON er.EmployeeID = e.EmployeeID
AND c.Date = er.ExpenseDate
WHERE c.[Date] >= s.PeriodBeginDate AND c.[Date] < s.PeriodEndDate
    AND c.[Date] < '2020-01-01'
GROUP BY c.CalendarYear, c.CalendarQuarterName, e.FirstName, e.LastName, s.[Level]
ORDER BY e.FirstName, c.CalendarYear, c.CalendarQuarterName;
This was a great question because modern web-based companies are moving to subscription models for payment, and some of the older books like Kimball’s don’t really cover subscription models that well.
Crazy Job Descriptions
From Mike Chrestensen: “This is the second job I’ve seen. Am I crazy or is this impossible. Full stack + DBA salesforce support + .ASP/.NET/C#, SQL Dev, Reporting and interface with business.”
Answer: I love these descriptions. I think that there are four major reasons why you see crazy job descriptions (though in fairness, the one Mike showed me wasn’t necessarily crazy):
HR got their hands on it. That’s where you start seeing things like one of my favorites, where you need 3 years of SQL Server 2012 experience…in the year 2014. Anders also pointed out a developer job which wanted several years of .NET experience in 1999—mind you, .NET was in preview at that time and C# didn’t even stabilize as a language until after 2000.
Janice just left the company and now we need a new Janice. Here’s all the stuff she did over the past decade. You should be able to do all of the things she did.
Rebecca needs a promotion, but the company requires that we post all jobs to the public. Therefore, we figure out all of the things Rebecca does and has experience with and make them mandatory for this new role. You can tell they’re gaming the system when those descriptions mention experience with internal products.
The company is a startup, or still has that startup mentality.
The fourth one is the one I’m most okay with, especially if you know it’s a startup. One person needs to wear a lot of hats in that kind of role, so it’s good to know what you’re in for.
I don't want to make it sound like having a brain dump of every potentially-tangential technology is a good thing for a job description—it's not. My ideal job description clearly delineates the stuff you need to know, the stuff we'd like you to know, and the stuff that's a bonus. If it's an app developer role, it's okay to list SQL Server development skills, SSIS, and SSRS, but make it clear that these are small parts of the job and bonus skills rather than core to the position.
Tom kicked us off with a request for assistance around Kerberos, where he was getting an error message I hadn’t seen before:
Cannot connect to WMI provider. You do not have permission or the server is unreachable. Note that you can only manage SQL Server 2005 and later servers with SQL Server Configuration Manager. Invalid namespace [0x8004100e]
He noticed this Microsoft Support page which explains how to fix it, but that didn’t seem to do the trick. He promises to vanquish this beast and report on it for the rest of us.
Imbalanced Classes and Confusion Matrices
Deepak reached out to me with a question. “I have a labeled data set of about 100,000 employees. I need to predict who will leave or stay with the company. This is highly imbalanced data because most of the employees stay with the company and very few leave. Thus wondering how to prepare the data for training the model? Should I first filter out the data of employees who left the company and then may be use 1000 employees who left the company and may be 200 employees who stayed with the company to train the model? If not, then which AI ML model should I use which can handle highly imbalanced data for classification?”
Answer: This is known as an imbalanced class problem. In this simple scenario, we have two possible classes: left the company (1000) and stayed with the company (99,000). This type of issue explains why accuracy is not the end-all, be-all measure for gauging prediction quality, as I can create a model which is correct 99% of the time by simply saying that everybody will stay with the company.
From there, @thedukeny nudged me into talking about a concept known as the confusion matrix. This leads to four measures in addition to accuracy:
Positive predictive value: given that we said you would stay, how likely were you to stay?
Negative predictive value: given that we said you would go, how likely were you to go?
Sensitivity / Recall: when you actually stay, how often did we predict you would stay?
Specificity: when you actually go, how often did we predict that you would go?
In cases where the classes are pretty well balanced, accuracy (based on the prediction we made, were we correct?) is not a terrible measure, but we want to temper it with these four additional measures to get a more complete look at how we are doing.
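If you keep scored predictions in a table, you can compute these measures directly in T-SQL. Here's a minimal sketch against a hypothetical #Predictions table, treating "stays" as the positive class (1 = stays, 0 = goes).
-- Hypothetical table of scored predictions: 1 = stays, 0 = goes.
CREATE TABLE #Predictions
(
    EmployeeID INT,
    ActualLabel BIT,
    PredictedLabel BIT
);
-- Build the confusion matrix cells, then derive the measures from them.
SELECT
    1.0 * (tp + tn) / NULLIF(tp + tn + fp + fn, 0) AS Accuracy,
    1.0 * tp / NULLIF(tp + fp, 0) AS PositivePredictiveValue,
    1.0 * tn / NULLIF(tn + fn, 0) AS NegativePredictiveValue,
    1.0 * tp / NULLIF(tp + fn, 0) AS Sensitivity,
    1.0 * tn / NULLIF(tn + fp, 0) AS Specificity
FROM
(
    SELECT
        SUM(CASE WHEN ActualLabel = 1 AND PredictedLabel = 1 THEN 1 ELSE 0 END) AS tp,
        SUM(CASE WHEN ActualLabel = 0 AND PredictedLabel = 0 THEN 1 ELSE 0 END) AS tn,
        SUM(CASE WHEN ActualLabel = 0 AND PredictedLabel = 1 THEN 1 ELSE 0 END) AS fp,
        SUM(CASE WHEN ActualLabel = 1 AND PredictedLabel = 0 THEN 1 ELSE 0 END) AS fn
    FROM #Predictions
) cm;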
Getting back to Deepak’s direct question, I have one technique and two algorithms which I think work well in these cases. The technique is to oversample (or undersample). When we oversample, we make copies of the lesser-used class in order to balance things out a bit more. Instead of having a 99:1 ratio of stay to go, we might make dozens of copies of the people who go, just to tell the algorithm that it’s not okay simply to avoid that 1% and take the easy 99% victory. The opposite side is undersampling, where we throw away enough of the majority class to get closer to a 1:1 balance. As a note, it doesn’t need to be perfectly even—even a 2:1 or 3:1 ratio is still fine depending on your algorithm and amount of data.
As far as algorithms go, gradient boosted decision tree algorithms (like xgboost) do a really good job of handling class imbalance. Also, one of my favorite sets of classification algorithms does a great job with this problem: online passive-aggressive algorithms. The way these work is that, for each record, we determine whether the model we've created would have predicted the answer correctly. If so, the model makes no changes to its weights—in other words, it is passive. But if the predicted class is incorrect, the model will move heaven and earth to get things right—it is aggressive in changing weights until the current example is just within the boundary of being correct. It then moves on to the next set of inputs and forgets about the changes it needed to make to get the prior answer correct. In this way, passive-aggressive algorithms focus on the problem at hand and ignore how we got to where we are, so they treat each input as vital. That's why we may not necessarily need many examples of a class to get it right as a whole: the algorithm is aggressive enough with every single record that it doesn't need iteration after iteration of a class to recognize that it's important to learn something about it.
Presentations and Sessions
@johnfan14 asks, “OK.. You guys attend so many seminars, so how do you manage them? like planning and scheduling.. You use Excel or some kind of calendar or some other tools?”
Answer: Tom, Tracy, and I all gave about the same answer: calendar entries and spreadsheets. I use Google Sheets so that I can access it easily from the road and keep track of when, where, what, and the key stats I need for reporting.
We also use calendar entries to keep track of when and where, and I also have a spreadsheet with travel plans to ensure that I have airplane, hotel room, and rental car (if necessary) booked. Early on in my speaking career, I accidentally double-booked a flight because I didn’t realize I had already purchased a ticket. That’s when I started keeping the spreadsheet.
Database Comparison Tools
From @jeffalope: “Anyone have any cool ways to compare multiple databases where the expectation is that the schema should be the same but you know there is some drift? I know that sqlpackage (ssdt) and red-gate sql compare exist but just wonder if anyone has knowledge of a turn key solution to generate a “report” for a few dozen instances?”
Answer: Tom gave us a website with a list of comparison tools, but there are two parts to Jeff's problem. First is a product which generates a comparison between a source database and a target database. Tom's link gives us a myriad of tools for that job.
But second, the requirement to compare a few dozen instances stresses things a bit. I had put together something in the past for Red Gate schema comparisons, but it turned into a bit of a mess for two reasons: first, schema drift happens a bit too regularly; and second, minor server configuration differences can lead to lots and lots of what are effectively false positives around things like collation. As a result, I don't have a great solution, but I can confirm that Red Gate's tool is something you can call from .NET code (I used C#, but you could also use F# or PowerShell) and build an internal product around.
We had two similar questions from Data Architecture Day that I’m getting around to answering. First, from @lisanke:
How does the process of DB design change if we tell you, you’ll be working with a group (or groups) of programmers who’ll want to store and retrieve data efficiently from the DB for their tasks (e.g. system test) But they don’t know (yet) what they’ll be storing And that target will always be changing through the development life cycle?? can I create database schema and structure and access procedures that helps the team and won’t be completely inefficient with constantly evolving data design?
And second from @thedukeny: “Tips for balancing normality with development speed?”
Answer: I consider these to be very similar questions with similar answers, so I’ll take them together.
The key question is, do you prototype? I love this story from Phil Factor about Entity-Attribute-Value (EAV) and prototyping. As Phil points out, EAV can work in the early development phase, but as we get closer to a stable product, it needs to be turned into a survivable data model. If you need to get something put together immediately to show off, it’s a workable approach. But the problem I see is that very few companies create proper prototypes, where you have one approach to the problem and then throw it away and start over. That’s a shame, but it is something we have to live with—unless you’re in a lucky situation in which you can create real prototypes and then not immediately ship that code to production.
The other problem we have to live with is that “We’ll fix it in round 2” is a comforting lie we tell ourselves as we write halfway-working code. Round 2 almost never happens because we have so much more stuff to do and by the time we actually have a chance to get back to that code, it’s now “too difficult” to fix.
We also have to keep in mind that databases are stateful and require planning. I can gut a web application and, as long as I haven’t messed with the interfaces, nobody needs to be the wiser. How we get to the current state in a web app is almost unimportant; for a database, it’s the entire story. And this is just as true for non-relational databases as it is for relational!
There are some non-relational databases which might be easier to develop against, so it might make sense to create a prototype in one system and then migrate it to the real system once you’re ready for real development. If you’re in a shop where you can do that, go for it. But there are some risks of performance problems, translation issues, and missing functionality as you make that shift, so build in time for all of that.
I think a better approach is to skip the database while the data model is fluid. If you’re changing your data model on a daily basis, it’s a sign that your product isn’t mature enough for a database. That’s not a bad thing, but save yourself the pain and just use in-memory objects which you can load up from JSON, XML, flat files, or even hard-coded in unit tests if that’s what you really want to do. That way, you can quickly modify the data structure you need in your app as requirements change, and once the model is more stable, then it’s time to negotiate the differences between an application and a properly-normalized relational database.
I think that’s reasonable advice early in the development cycle, which is where @lisanke focused. But getting to @thedukeny’s question, how about when the app is mostly stable but you’re doing regular development?
In that case, we want to have a data model planned out. This data model is different from the application’s objects (or record types or whatever), as your data model typically survives the application. I know of databases which have survived three or four versions of an application, and that’s why it’s important to understand the data rules and data model.
Like all other development, expect changes and re-work based on how well the product has been maintained over time. If you have a model which is generally pretty good, you’ll want to keep it in shape but won’t have to spend too much time rearchitecting everything. If it’s a junk design, build that time in when coming up with estimates for doing work. As much as you can, take the time to get it right and work with the dev team. Most developers want to get things right but don’t necessarily know the best way to do it, so they do what they know and move on. I understand that, like any other group, there are developers of varying levels of work ethic and professionalism, but find the people willing and able to help and work with them.
One last piece of advice which bridges both of these questions is, use stored procedures as an interface. @jeffalope also mentioned this in the context of a data access layer (DAL). Try to keep your database as removed from the business objects as you can, and stored procedures do a great job of that. This way, you can redesign tables with potentially zero impact on the application, as the interface to the stored procedures did not change. It also allows application objects to exist separate from the data model. It’s extra work, but well worth it in the end.
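As a quick sketch of the stored-procedure-as-interface idea (the procedure, table, and column names here are hypothetical), the application only ever calls the procedure and never needs to know how the underlying tables are shaped.
-- The application only knows this signature; the tables behind it can change freely.
CREATE OR ALTER PROCEDURE dbo.GetCustomerOrders
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT
        o.OrderID,
        o.OrderDate,
        o.TotalDueAmount
    FROM dbo.Orders o
    WHERE o.CustomerID = @CustomerID
    ORDER BY o.OrderDate DESC;
END;
GO
-- If we later split dbo.Orders into header and detail tables, we update the
-- procedure body to join them and the application code never changes.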
I spent a little bit of time gleefully recapping Data Architecture Day. We had 376 viewers over the course of the 11 sessions. If you missed any sessions or want to review something, I have put together a playlist for you.
Resource Governor and Machine Learning Services
I ran into some issues around SQL Server Machine Learning Services and external processes (R and Python) running out of memory. The reason is that, by default, SQL Server creates an external resource pool which limits external scripts to 20% of allocated memory. In my talk on Machine Learning Services in production, I cover how to change this value.
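The change itself is a short bit of Resource Governor T-SQL against the default external resource pool; the 50 percent below is just an illustrative value, so pick what makes sense for your workload.
-- Allow external scripts (R / Python) to use up to 50% of memory instead of the default 20%.
ALTER EXTERNAL RESOURCE POOL [default]
    WITH (MAX_MEMORY_PERCENT = 50);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO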
Slowly Changing Dimension Types
@iconpro5555 asked during Data Architecture Day: “what are some approaches to versioning changes? do u mix. type 2 data with type 1?”
Answer: The reference here is to slowly changing dimensions in a Kimball style warehouse. If you want an excellent write-up, I recommend Simon Whiteley’s take on the matter. Here’s the script I used when putting together my answer:
USE tempdb
GO
-- Type 0: Nothing will ever change...right?
CREATE TABLE dbo.Type0Dim
(
Name VARCHAR(50),
DateOfBirth DATE
);
-- To insert: add a new row.
INSERT INTO dbo.Type0Dim(Name, DateOfBirth) VALUES('Tony', '2001-01-01');
SELECT * FROM dbo.Type0Dim;
-- To update: we don't!
GO
-- Type 1: History? Who cares?
CREATE TABLE dbo.Type1Dim
(
Name VARCHAR(50),
DateOfBirth DATE
);
-- To insert: add a new row.
INSERT INTO dbo.Type1Dim(Name, DateOfBirth) VALUES('Tony', NULL);
SELECT * FROM dbo.Type1Dim;
-- To update: simple update statement.
UPDATE dbo.Type1Dim SET DateOfBirth = '2001-01-01' WHERE Name = 'Tony';
SELECT * FROM dbo.Type1Dim;
GO
-- Type 2: History is a new row.
CREATE TABLE dbo.Type2Dim
(
Name VARCHAR(50),
FavoriteColor VARCHAR(30),
IsCurrent BIT,
PeriodBeginDate DATETIME2(3),
PeriodEndDate DATETIME2(3)
);
-- To insert: add a new row.
INSERT INTO dbo.Type2Dim(Name, FavoriteColor, IsCurrent, PeriodBeginDate, PeriodEndDate)
VALUES('Tony', 'Green', 1, GETUTCDATE(), '9999-12-31');
SELECT * FROM dbo.Type2Dim;
-- To update: update the current row and add a new row.
DECLARE @Now DATETIME2(3) = GETUTCDATE();
UPDATE dbo.Type2Dim SET IsCurrent = 0, PeriodEndDate = @Now WHERE Name = 'Tony' and IsCurrent = 1;
INSERT INTO dbo.Type2Dim(Name, FavoriteColor, IsCurrent, PeriodBeginDate, PeriodEndDate)
VALUES('Tony', 'Puce', 1, @Now, '9999-12-31');
SELECT * FROM dbo.Type2Dim;
GO
-- Type 3: History is the prior version.
CREATE TABLE dbo.Type3Dim
(
Name VARCHAR(50),
FavoriteColor VARCHAR(30),
PriorFavoriteColor VARCHAR(30)
);
-- To insert: add a new row WITHOUT prior values.
INSERT INTO dbo.Type3Dim(Name, FavoriteColor)
VALUES('Tony', 'Green');
SELECT * FROM dbo.Type3Dim;
-- To update: set the prior value.
UPDATE dbo.Type3Dim SET PriorFavoriteColor = 'Green', FavoriteColor = 'Puce' WHERE Name = 'Tony';
SELECT * FROM dbo.Type3Dim;
GO
-- Type 4: the Princess is in another castle.
CREATE TABLE dbo.Type4Dim
(
Name VARCHAR(50),
FavoriteColor VARCHAR(30)
);
CREATE TABLE dbo.Type4DimHistory
(
HistoryKey BIGINT IDENTITY(1,1) NOT NULL,
Name VARCHAR(50),
FavoriteColor VARCHAR(30),
CreateDate DATETIME2(3)
);
-- To insert: add a row into the dimension *and* a row into the history.
INSERT INTO dbo.Type4Dim(Name, FavoriteColor) VALUES('Tony', 'Green');
INSERT INTO dbo.Type4DimHistory(Name, FavoriteColor, CreateDate) VALUES('Tony', 'Green', GETUTCDATE());
SELECT * FROM dbo.Type4Dim;
SELECT * FROM dbo.Type4DimHistory;
-- To update: update the dimension *and* add a new row into the history.
UPDATE dbo.Type4Dim SET FavoriteColor = 'Puce' WHERE Name = 'Tony';
INSERT INTO dbo.Type4DimHistory(Name, FavoriteColor, CreateDate) VALUES('Tony', 'Puce', GETUTCDATE());
SELECT * FROM dbo.Type4Dim;
SELECT * FROM dbo.Type4DimHistory;
GO
-- Type 6: I just can't decide!
CREATE TABLE dbo.Type6Dim
(
Name VARCHAR(50),
FavoriteColor VARCHAR(30),
PriorFavoriteColor VARCHAR(30),
IsCurrent BIT,
PeriodBeginDate DATETIME2(3),
PeriodEndDate DATETIME2(3)
);
-- To insert: add a row without prior values.
INSERT INTO dbo.Type6Dim(Name, FavoriteColor, IsCurrent, PeriodBeginDate, PeriodEndDate)
VALUES('Tony', 'Green', 1, GETUTCDATE(), '9999-12-31');
SELECT * FROM dbo.Type6Dim;
-- To update: update the current row and add a new row.
DECLARE @Now DATETIME2(3) = GETUTCDATE();
UPDATE dbo.Type6Dim SET IsCurrent = 0, PeriodEndDate = @Now WHERE Name = 'Tony' and IsCurrent = 1;
INSERT INTO dbo.Type6Dim(Name, FavoriteColor, PriorFavoriteColor, IsCurrent, PeriodBeginDate, PeriodEndDate)
VALUES('Tony', 'Puce', 'Green', 1, @Now, '9999-12-31');
SELECT * FROM dbo.Type6Dim;
GO
-- Clean up.
DROP TABLE IF EXISTS dbo.Type0Dim;
DROP TABLE IF EXISTS dbo.Type1Dim;
DROP TABLE IF EXISTS dbo.Type2Dim;
DROP TABLE IF EXISTS dbo.Type3Dim;
DROP TABLE IF EXISTS dbo.Type4Dim;
DROP TABLE IF EXISTS dbo.Type4DimHistory;
DROP TABLE IF EXISTS dbo.Type6Dim;
GO
One thing that I hit at the end of the discussion but want to mention here is that the types properly refer to attributes, not to dimensions. What I mean by this is, you can definitely have a dimension which combines types 0, 1, and 2 together. You may have some information which will never change, and so those attributes are type 0. Then, you may have some attributes for which we don’t care about history—those would be type 1 attributes. And some attributes might affect our fact tables, so we track the history as type 2.
Let’s say we have a Person dimension and a Sales fact. On the Person dimension, perhaps we have a DateJoined indicator. We never update that indicator, as the person joined on a specific date and that’s it. So DateJoined would be type 0. The person has a telephone number. We care about the current value of the telephone number, but the telephone number itself won’t tell us much about historical sales. At one point in time, a phone number change was typically indicative of moving to a different area, but between area code splits and mobile phones, your area code isn’t really a great indicator of your location any more. So a phone number might be a type 1 attribute: if it changes, we want to keep track of the latest value, but we don’t need to store a history of it. Favorite color, however, might be a type 2 attribute—if a person’s favorite color changes, that might affect their sales behavior as they shift from buying things in one color to buying things in a different color. In this case, we want to know when the person had a particular favorite color.
This starts to sound complex, but it’s the kind of analysis which is critical for a solid data warehouse. This may also be part of why so many data warehousing projects fail!
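To make the Person example concrete, here's a minimal sketch (column names are illustrative) of a single dimension mixing type 0, type 1, and type 2 attributes.
-- One dimension, three kinds of attributes:
--   DateJoined    - type 0: never changes after insert.
--   PhoneNumber   - type 1: overwrite in place, no history kept.
--   FavoriteColor - type 2: track history with period columns and a current-row flag.
CREATE TABLE dbo.PersonDim
(
    PersonKey INT IDENTITY(1,1) NOT NULL,
    SourcePersonID INT NOT NULL,
    DateJoined DATE NOT NULL,          -- Type 0 attribute
    PhoneNumber VARCHAR(20) NULL,      -- Type 1 attribute
    FavoriteColor VARCHAR(30) NULL,    -- Type 2 attribute
    IsCurrentRow BIT NOT NULL,
    PeriodBeginDate DATETIME2(3) NOT NULL,
    PeriodEndDate DATETIME2(3) NOT NULL
);
-- A phone number change is a simple UPDATE to the current row (type 1); a favorite
-- color change closes out the current row and adds a new one (type 2).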
Mala’s Book Corner
Mala has a couple of book recommendations for us this week:
Tom asked for recommendations on hands-on labs. I shared some of my experiences and Neil Hambly brought up ideas as well. I don't have a great answer, especially if you don't do hands-on labs frequently. If you're in the ML space, I like Azure Notebooks because it's free and I don't have to worry about the computers people bring in. We also talked about a few options for when notebooks aren't a good solution: Docker containers, VMs on thumb drives, bringing your own hardware (as a trainer), and Azure VMs all came up. Basically, unless you control the situation really well and pass out the hardware yourself, I would shy away from trying to use attendees' machines directly for a hands-on lab and would virtualize, containerize, or cloudify whatever I could.
Oxford Commas are Great
We had 100% approval of Oxford commas in chat. It was beautiful.
Kevin — Mostly Sourcetree, sometimes TortoiseGit, and sometimes just using the Git bash shell (for things like git stash and git stash pop)
Presentations and Sessions
@johnfan14 asks, “How many presentations you do in a month on average? How many sessions you normally submit to a conference? What is the optimal numbers do you think?”
Answer: This really depends on the person, and we spent some time covering the various criteria. The short answer is that it comes down to your budget, available time, and location. Being based in the southeastern United States and in a city with a reasonably good airport, all four of us have an easy time traveling to a variety of events.
I didn’t really say out loud my answer to the specific questions so I’ll do so here. Prior to this year, my presentation counts were 63 in 2019, 70 in 2018, 50 in 2017, 53 in 2016, and 19 in 2015. Some events had me give multiple presentations, so I wasn’t at 63 events in 2019, for example, but the answer is “quite a few.” But I’m an extreme outlier, so I didn’t want to skew things too much.
As far as session submissions, this depends on the conference rules, but when there is no limit, I submit 4. That way there’s a variety of topics (which makes me more likely to be selected) but not so much noise that an organizer needs to wade through a dozen submissions. If you have one good talk or two good talks, just submit those—you don’t need 4.
Database Documentation Tools
From @rporrata: “DB Documentation what tools have you used especially for redesigning systems outside of the usual suspects erwin/studio.”
Answer: This ended up blending together two sets of tools: database documentation tools and database architecture tools. I'm not a big fan of database architecture tools in general, so without further ado, here are the resources we came up with, with plenty of help from chat.
Architecture tools:
Visio allows you to reverse engineer database diagrams, though watch out: there were some editions which didn’t have this functionality.
SSMS database diagrams are a low-end way of showing relationships. They barely function but work okay with small numbers of tables and columns.
Solomon Rutzky recommended Diagrams.Net, which is an old favorite of mine—it used to be called Draw.io.
I’ve used SqlDBM a few times. It’s fine for building static models, though there are niceties I wish it had (like being able to prevent table overlap).
Neil Hambly brought up DOC xPress by SentryOne (formerly a Pragmatic Works product). Mala mentioned that this product works well for documenting SSIS packages.
Free Databases for Teaching SQL
Mark Hutchinson e-mailed before the show began and had a question for us: “I’m going to teach a couple of friends SQL. Just the DML subset of the language to start (Select, Insert, Update, Delete). Currently, we’re going to use MS Access, since the GUI can help do some of the heavy lifting at the early part of the course. For later, maybe a second course, I was thinking about introducing the students to a large database, such as SQL Server. What is the best free (or damned cheap) database? One of the students is out of work and I’m retired, so money is an issue.”
Answer: We’re SQL Server folks, so we’re going to be biased. But each one of us came up with SQL Server Developer Edition. It is 100% free for non-production use and easy to install. We debated the merits of other editions as well, so here’s the quick +/- on each:
Developer Edition — Installation is really easy (basically next-next-next if you’re installing for a development box and don’t need custom configuration like you would for production) and you get the full SQL Server surface area.
Express Edition — Installation is also easy and you can use Express Edition in production environments for free. It’s limited to 10GB per database, no more than 4 cores on 1 socket, and a limited amount of RAM, but if you wanted a zero-cost SQL Server in production, that’s an option.
LocalDB — SQL Server Express LocalDB integrates really well with Visual Studio and lets you work with database projects. It’s a single-user variant of SQL Server with practically no configuration necessary. Some of the surface area is different in LocalDB, but that’s stuff you wouldn’t notice while learning SQL.
Containerized Developer Edition — You can certainly get SQL Server as a Docker container. If you’re already familiar with Docker and have it set up on the training machines, then this is a good technique as it’s really easy to spin up a new container and blow it away if you mess it up too badly. But if you don’t have things configured already, it can turn into a mess where you spend so much time trying to set up your environment that you lose out on what you wanted to do.
Azure SQL Database — Avoid installation altogether with Azure SQL Database. There is no free tier option anymore, but the Basic tier DTU model is $5 per month and there’s no installation or configuration at all.
My question is when we say “all expressions that appear in the same logical query processing phase are evaluated conceptually at the same point in time,” does that mean the query looks for all instances of FROM then all instances of WHERE, etc.? How would this be processed? SELECT c.CUSTOMERID FROM CUSTOMER c WHERE c.CUSTOMERID NOT IN (SELECT ag.CUSTOMERID FROM ANGRYCUSTOMERS ag )
Answer: Mike is talking about clause ordering, where the FROM clause parses before the WHERE clause, etc. And within the WHERE clause, all predicates are handled at the same time, meaning that if you have a WHERE clause with X=1 AND Y=2 AND Z=3, the database engine can take them in any order (regardless of the order in which you wrote them). From your perspective, all of those filters happen concurrently, so you can write them in any order and the database engine will (hopefully) pick the smartest path. This is why short-circuiting may not work in SQL the way you’d expect it to in an imperative language: because the optimizer doesn’t care about the order in which you write things and can shake up the order if it looks like a better path.
In this particular instance, FROM CUSTOMER c will process first, and then we will get to the WHERE clause. Inside there, we have a NOT IN operator which operates on a sub-query, so we move to FROM ANGRYCUSTOMERS ag and then SELECT ag.CUSTOMERID. After that completes, we finally get to SELECT c.CUSTOMERID.
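Annotating Mike's query with that logical order might make it clearer (the optimizer is free to pick a different physical order):
-- Mike's query, annotated with its logical processing order.
SELECT c.CUSTOMERID          -- 5. The outer SELECT is logically last.
FROM CUSTOMER c              -- 1. The outer FROM resolves first.
WHERE c.CUSTOMERID NOT IN    -- 2. Then the WHERE clause, which contains a sub-query:
(
    SELECT ag.CUSTOMERID     -- 4. ...and then its SELECT.
    FROM ANGRYCUSTOMERS ag   -- 3. The sub-query's FROM comes next...
);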
Events of Note
We wrapped up with one event of note because I forgot about the other one:
Marcin Gminski has put together the Data Community Event Tracker, an indispensable tool for finding online data platform events.
Data Architecture Day is Saturday, May 16th. The schedule is available and it's jam-packed.
EightKB, a conference for internals, is open for registration.
Chris Voss e-mailed us with a great question. "What are the "home lab" setups like? What computers/specs does everyone have, and for what purposes? I'm talking about personal systems rather than work. Part of why I ask is because I'm looking at new computers, so I'm asking basically every tech person to ensure I'm doing this right."
Answer: This will depend on whether you want a desktop or a laptop. Or you could just have a server room in your basement…
Tom and Tracy went over some of the characteristics they look for in laptops, starting with 64 GB of RAM. I mentioned that I’d much rather have extremely fast disk for a single SQL Server installation, though both of them have multiple SQL Server instances running on multiple VMs, so the need for lots of RAM makes perfect sense.
If you want a resource for building a desktop machine, the folks at Logical Increments do an incredible job. The service is entirely free and I used it to build my machine learning and video processing desktop. You pick the price point and they give you several recommendations on hardware choices. One thing I would say is that I’d recommend going up a notch on drives—prosumer grade NVMe (like the Samsung Pro series) over consumer-grade SSD. Fill your motherboard’s NVMe slots first before using SSD or HDD.
Neil Hambly had some nice recommendations as well, including Overclockers for the UK folks and making sure that you swap out SSDs every 18-24 months to reduce the risk of a drive dying on you. Unlike hard disks, an SSD doesn't give you much warning before it dies; it can just suddenly drop off.
SSD vs NVMe
As a bit of kismet, Mark Gordon had e-mailed me earlier with a great follow-on question: “When it comes to storage for SQL Server, does NVMe offer an improvement over SSD?”
NVMe is not constrained by the ATA interface, as it sits directly on the PCI Express bus connected to the CPU. That results in roughly four times the Input/Output Operations Per Second (IOPS), rivaling the fastest SAS options out there, and seek times are roughly ten times faster. NVMe can deliver sustained read/write speeds of 2,000 MB per second, well beyond SATA III SSDs, which top out around 600 MB per second. Here the bottleneck is NAND technology, which is advancing rapidly, so we'll likely see even higher NVMe speeds soon.
With SQL Server, you will notice the difference under load. NVMe is still nowhere near as fast as RAM, but it’s a lot closer than SSD (which is itself way closer than 15K spinning disk).
By the way, for the pedantic-minded, I am aware that NVMe disks are still SSD; when I say SSD, I mean SSD over SATA in the classic 2.5″ form factor.
Mark had some follow-up bits I can hit briefly here. He mentioned tempdb as a good candidate for fast disk and that’s a smart idea: tempdb should be on the fastest disk you can possibly get. Here’s a rough guide that I’m coming up with off the top of my head, ranking things in order of best to worst:
All direct-attached NVMe
All-flash SAN array
Tempdb on NVMe and all other databases on direct-attached SSD
All direct-attached SSD
SSD SAN array
Tempdb on direct-attached SSD and all other databases on HDD
Hard disks everywhere
There are other configurations that nestle in between some of these (e.g., direct-attached SSD for tempdb but SSD SAN for the rest is slightly better than SSD SAN array but slightly worse than all direct-attached SSD), but the general rule of thumb is that direct-attached beats SAN and that NVMe > SSD > HDD.
Learning T-SQL
John Fan Zhang asked for a good book to learn SQL.
My recommendation, to the point where I have purchased this book for one of my employees needing to learn T-SQL, is Itzik Ben-Gan’s T-SQL Fundamentals 3rd Edition. Itzik is brilliant and an outstanding teacher, and even if you have an advanced knowledge of T-SQL, you’ll still pick up things from his beginner-level book.
Gabriel hit me up with this question before the stream began: "Is there really a need to support and maintain RIGHT JOIN?"
Answer: Tracy says no, Tom says no, Mala says no, and Kevin says mostly no.
The thing about RIGHT JOIN is that it is usually confusing to people because it’s backwards from how we want to read. In English, we read left to right, top to bottom. We also work from the assumption that the most important stuff comes first. RIGHT JOIN violates this by making the latter table the “important” one. The other consideration here is that every RIGHT OUTER JOIN operation can be rewritten as a logically equivalent LEFT OUTER JOIN.
That said, I have personally run into a couple of cases where it made sense to use a RIGHT JOIN rather than switching to LEFT JOIN. These cases were mostly around complex join criteria with a combination of LEFT JOIN and INNER JOIN and one last RIGHT OUTER JOIN to catch the “I don’t have anything else” scenario. So I wouldn’t get rid of RIGHT OUTER JOIN, but if I see it in a code review, the first question is asking why this needs to be ROJ and cannot be a LOJ.
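As a quick illustration with hypothetical tables: every RIGHT OUTER JOIN can be rewritten as a LEFT OUTER JOIN by flipping the table order.
-- These two queries are logically equivalent.
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders o
    RIGHT OUTER JOIN dbo.Customer c
        ON c.CustomerID = o.CustomerID;
SELECT o.OrderID, c.CustomerName
FROM dbo.Customer c
    LEFT OUTER JOIN dbo.Orders o
        ON c.CustomerID = o.CustomerID;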
Aliases and Table Names
Finally, chat got off onto the tangent of aliases and table names. On this topic, @iconpro555 tossed us into the briar patch with “why not use 4 letter names because that is what you use for aliases anyway?”
As far as naming goes, my rule of thumb is: make it clear but not overly verbose. 4 characters is fine if a table is called dbo.Home and represents information about a home (location, square footage, tax appraisal, etc.). But don’t be afraid to add a few extra characters to a column name if it clarifies intent. One thing I really like to see is unit of measure. You show me a thing called TotalCPUTime, but is that in seconds? milliseconds? microseconds? This gets really annoying even with SQL Server DMVs because some of them are milliseconds and others microseconds.
Names are for developers, whether that’s the current developer or a future maintainer. Just like with the discussion about RIGHT OUTER JOIN, we are optimizing for developers rather than for the database engine. There are times when you need to optimize for the sake of the database engine rather than the developer, and that’s where you start adding copious notes clarifying your intent.
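As a small illustration of the unit-of-measure point (these names are made up):
-- Ambiguous: is TotalCPUTime in seconds, milliseconds, or microseconds?
-- Clear: the unit is part of the name, so nobody has to guess.
CREATE TABLE dbo.QueryRunLog
(
    QueryRunID BIGINT IDENTITY(1,1) NOT NULL,
    TotalCPUTimeInMilliseconds BIGINT NOT NULL,
    TotalDurationInMilliseconds BIGINT NOT NULL,
    MemoryGrantInKB BIGINT NULL
);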
Mala had a two-part question for us. When do you upgrade? And why do you upgrade?
Answer: Each of us has different opinions.
Mala would regularly wait for SP1 of a product before upgrading. With Microsoft eliminating regular service packs for SQL Server, she’s not quite sure what rule of thumb to follow.
Tom likes to push the envelope, preferring to upgrade quickly. Though he doesn’t hit each version—he tends to skip a version, e.g., 2016 to 2019. He does want to see compelling items in a version before upgrading.
Kevin likes to upgrade for upgrading's sake. Or something like that… I have enjoyed being part of the Early Access Program for SQL Server and getting a chance to try out products under development. I pushed back a bit against the "wait for SP1" argument, but one thing I failed to say at the time is that if everybody waits for SP1, SP1 will still have a bunch of bugs. I am thankful for the people whose philosophy is "Someone's got to find the bugs, and it might as well be me," and everybody who waits to upgrade should be as well.
From there, I derailed things onto my refusal to work for a company stuck on an old version of SQL Server with no plan to upgrade (or a plan but no real desire to upgrade). Tom and Mala made me walk it back a bit.
ANSI SQL
Mike Lisanke wanted to know why we call the language SQL for SQL Server, Oracle, DB2, Postgres, etc., and yet they're all different languages.
We covered a lot in here, but the gist is that ANSI releases versions of the standard which companies subsequently adopt in part (and extend in part). I mentioned that there isn’t “an” ANSI SQL standard and Wikipedia has a nice table (about 2/5 of the way down) showing the different versions of ANSI SQL. I had guessed about the pre-89 versions and wasn’t quite right—there was only one pre-89 version, there wasn’t a 1997 version, and 2000 was 1999. Other than that the answer was fine! But there have been 10 iterations of the ANSI SQL standard.
We also talked about the origin stories of a few platforms, including Sybase/SQL Server, Oracle/Postgres, and MySQL/MariaDB. We also talked about coding for ANSI compliance. Tom likes that idea (or just using PolyBase—which I recommend!). I don’t care much for coding for ANSI compliance for most places because you lose chances to improve performance for a chimerical gain. The exception here is if you must write software which is cross-platform; then you’re stuck.
Tom mentions making use of hierarchyid in SQL Server. Then we started name-dropping books.
First up, I recommend Adam Machanic et al.'s Expert SQL Server 2005 Development. I haven't read it in a while and obviously the development surface area has changed in 15 years, but there is an excellent chapter on trees and hierarchies.
I then pulled out my copy of Candace Fleming and Barbara von Halle's Handbook of Relational Database Design. I consider it the best explanation of normalization I've ever seen in print (and thanks to Grant Fritchey for the recommendation!). Just don't read the second half of the book unless you want a story of how to implement databases on ancient systems.
I started us off with a topic of discussion: working from home. Mala and Tom both have significant experience with the topic and shared their thoughts. Stick around for a bit of ranting about Microsoft Teams. @thedukeny points out this highly-upvoted item to allow for multiple Teams accounts at the same time. Slack does it right, and Teams is painful.
Tom brought up desk-sharing, which I absolutely hate. On the plus side, it did remind me of a Dilbert strip from 25 years ago.
Containers
Chris Voss asked a question a while back and I finally got a chance to answer: our team is starting to use containers for local environments to test our database development before deploying to the shared dev environment. Can anyone share their container strategies, and what are the space considerations for local sandboxes? Would it make sense to put an entire application code base in the same container?
Answer: There are a few questions in here, so let’s take them in turn.
As far as space goes, Tom Norman pointed out that containers won't save you space across machines: if you have a 500GB database you need on every developer's laptop, even if that database is in a container, it'll cost you 500GB of disk space per laptop. Kevin pointed out that the savings come when you can layer your containers: if you have a bunch of applications using .NET Core, for example, a couple dozen .NET Core apps can all reuse the same base layer, so that layer gets stored on disk only once.
Does it make sense to put application code in the same container as database code? No, for the same reason that you wouldn’t put app code on the same server as your database. Keeping components isolated reduces system complexity and makes it easier to upgrade or swap out parts.
The Telemetry Service
Mark Gordon raised a question about the telemetry service which derived from a weird account setup. Mark’s research led him to read up a bit on the telemetry service. We then had a bit of discussion about the telemetry service itself and I referenced a Brent Ozar post on the topic.
My personal opinion is that I’m fine with a telemetry service. I build telemetry in my applications and would expect the same from products like SQL Server. There are differing opinions on the topic, though.
Anders Pedersen starts us off with a doozy. When deleting a large number of rows, should we do this in one transaction or not?
Answer: Nope. Delete in batches, although this can take a while for enormous tables. If you’re retaining a tiny percentage of rows, then it might be easier to create a new table, migrate the data you want to keep to that table, drop the old table, and rename the new table back to the old name.
If you’re using Enterprise Edition, you can partition your tables by date and use partition switching.
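Here's a minimal sketch of the batched approach, assuming a hypothetical log table and a two-year retention rule; tune the batch size to your environment.
-- Delete in batches of 10,000 rows so we don't hold one giant transaction
-- (and don't blow up the transaction log).
DECLARE @BatchSize INT = 10000;
DECLARE @RowsAffected INT = 1;
WHILE @RowsAffected > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.EventLog
    WHERE EventDate < DATEADD(YEAR, -2, GETUTCDATE());
    SET @RowsAffected = @@ROWCOUNT;
END;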
On Long-Term Storage of Data
As part of deleting lots of data, we ended up talking about long-term archival storage of data. Tom brought up Stretch DB and I laughed. I laughed because Stretch DB is dead on arrival as soon as you look at the price.
If you aren’t made of money, there are a few other options. One I like is to use PolyBase for cold storage of data. Solomon Rutzky also recommended storing archival data on slow disk within SQL Server.
Magnetic Storage Has Its Place
Mike Lisanke calls me out and says that magnetic storage has its place in the world.
To that I say, this is true. I want things as fast as possible, though, and faster storage is one of the easiest ways to make your SQL Server a lot faster. Spinning disk and tape are good for long-term backup storage. But they’re generally not for OLTP or even OLAP scenarios. Give me NVMe or even SSD any day of the week.
Why do Databases Not Have Multi-Level Caching?
From Mike Lisanke, why do databases not have the concept of multi-level caching?
Answer: This answer is for SQL Server in particular; it may be different for other database technologies.
SQL Server has a buffer pool, where data is read into memory before it is returned. That’s one level of caching. From there, multi-level caching is more of an architecture decision: adding caching apps like Redis or using in-process cache in your app servers. That’s outside of the database but replaces database calls, so it effectively acts as another layer of caching.
Also, there is a concept of aggregations in SQL Server Analysis Services, where the engine creates pre-computed aggregations of slices of your data. That gives you a performance boost sort of like what caching does, and you can replicate this in the database engine with rollup tables.
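A rollup table is exactly what it sounds like; here's a minimal sketch with hypothetical names, where a scheduled job (SQL Agent, for example) keeps the summary current so reports hit the small rollup table instead of scanning the big fact table.
-- Pre-computed daily rollup of a large fact table (hypothetical names).
CREATE TABLE dbo.SalesDailyRollup
(
    SaleDate DATE NOT NULL,
    ProductID INT NOT NULL,
    TotalQuantity INT NOT NULL,
    TotalAmount DECIMAL(19, 4) NOT NULL
);
-- Rebuild yesterday's slice on a schedule.
DECLARE @Yesterday DATE = DATEADD(DAY, -1, CAST(GETUTCDATE() AS DATE));
DELETE FROM dbo.SalesDailyRollup WHERE SaleDate = @Yesterday;
INSERT INTO dbo.SalesDailyRollup (SaleDate, ProductID, TotalQuantity, TotalAmount)
SELECT
    CAST(s.SaleDate AS DATE),
    s.ProductID,
    SUM(s.Quantity),
    SUM(s.Amount)
FROM dbo.Sales s
WHERE s.SaleDate >= @Yesterday
    AND s.SaleDate < DATEADD(DAY, 1, @Yesterday)
GROUP BY CAST(s.SaleDate AS DATE), s.ProductID;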
Tools for Recording Presentations
Mala recommends Skype, as it is free and lets you save recordings. She also recommended checking out work from Doug Lane (for example, his gear to make technical videos—though that is a few years old) and Erik Darling.
Tom uses GoToWebinar but doesn’t do many recordings.
I use Whereby for streams and you can record on there. I use Camtasia for professional video editing and post-processing. OBS Studio is great for gonzo work when you don’t want post-processing. It’s also the software I use for streaming. Windows Video Editor is a thing but I have no experience with it so I don’t know how well it would work here. Adobe Premiere Pro is great if you can afford it.
Mike Lisanke had a follow-up question here about using transcription services. Anders Pedersen recommended Rev. You can also use things like Azure Speech to Text.
What Are You Working on Now?
Mala is currently going through SSIS training from Andy Leonard. Andy is an outstanding teacher and one of the best at SSIS. If you get a chance to learn from Andy, take it.
Tom is working on building an enclave in his environment so he can use Always Encrypted with enclaves.
On Tuning a Stored Procedure
John Fan Zhang had a lengthy question for us which I'm summarizing as: given a new stored procedure which inserts batches of rows into a table, I am seeing worse database performance as a result. What can I do about this? The table is a heap. Will a unique clustered index help?
Answer: My first thought is, check your storage. If you're on cheap disk, better storage will probably make the problem go away.