Shop Talk: 2021-06-21

The Recording

The Panelists

  • Kevin Feasel
  • Mala Mahadevan
  • Mike Chrestensen

Notes: Questions and Topics

The Value of Community

We started off tonight by mentioning that Mike got a promotion at his job thanks, in part, to what he’s been able to learn by being part of the SQL community. We’re all quite happy for Mike and wish him the best, and we’re glad that he’s a part of the community.

Linked Servers and their Alternatives

Mala brought up our first major topic tonight: what are some alternatives to using linked servers? Here’s a brief synopsis, but you’ll definitely want to watch the video.

  • First, are you sure you need an alternative? Linked servers do work for specific purposes and they do a pretty good job in that role. The main problem comes when we try to move lots of data over linked servers, especially when we do it over and over.
  • PolyBase is an alternative. Of course I’d talk about PolyBase.
  • You can also use OPENROWSET for one-off data queries.
  • If you need to migrate data frequently, perhaps using an ETL tool would be preferable. That could be a product you purchase (SSIS, Informatica, etc.) or an application you develop. The next night, Joshua Higginbotham talked to us about using Python for ETL work, so it’s a good lead-in.
  • Data virtualization solutions are also available.
  • One idea I hadn’t thought of, but which makes a lot of sense, is using Machine Learning Services, especially if you need to do data analytics on remote data and then store the results in your local SQL Server instance.
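
To make the "application you develop" option a little more concrete, here is a minimal sketch of a batch copy job in Python. It is only an illustration under some loud assumptions: two in-memory SQLite databases stand in for the source and destination servers, and the `sales` table and the `copy_in_batches` function are made-up names. Against real SQL Server instances you would swap in a database driver and your own schema.

```python
import sqlite3


def copy_in_batches(source, dest, batch_size=1000):
    """Copy every row of the (hypothetical) sales table from source to dest.

    Rows are pulled and written in fixed-size batches so the whole table
    never has to sit in memory at once.
    """
    dest.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    cursor = source.execute("SELECT id, amount FROM sales ORDER BY id")
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # INSERT OR REPLACE lets the job be re-run without duplicating rows.
        dest.executemany(
            "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)", rows
        )
        dest.commit()
        total += len(rows)
    return total


# Stand-in "remote" database with some sample rows (assumption: SQLite).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany(
    "INSERT INTO sales (id, amount) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 2501)],
)
source.commit()

# Stand-in "local" database that receives the copy.
dest = sqlite3.connect(":memory:")
copied = copy_in_batches(source, dest, batch_size=1000)
print(f"Copied {copied} rows")
```

The point of the batching is the same one made above about linked servers: pulling lots of data over and over is where things hurt, so a dedicated job that moves rows in controlled chunks on a schedule tends to be easier to reason about than repeated cross-server queries.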

ML Experiments on Large Data

Our other major topic was from George Walkey, who asked for my preferences around performing machine learning experiments on large data sets, specifically asking about Databricks, SparkML, Azure Synapse Analytics, Azure Machine Learning, and on-premises solutions. Here’s a synopsis of what I wrote back:

My team generally uses on-premises tooling because that’s what we had available.  We started out with SQL Server Machine Learning Services and built a few products off of that.  However, most of our new products don’t use ML Services and we’ve started to move existing ones away where it doesn’t make sense (mostly, it doesn’t make sense when you need sub-second performance, aren’t using batch operations to generate lots of records at a time, don’t have the data in SQL Server, don’t store the results in SQL Server, and can’t use native scoring and the PREDICT operator).  Instead, we are prone to building out Linux-based servers running R and/or Python to host APIs.  We also tried out Azure ML and team members really liked it, but internal problems kept us from being able to use it.
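
For a rough idea of what "Python hosting an API" can look like, here is a minimal sketch using only the standard library. It is not what my team runs: the `/score` route, the `score` function, and the fixed weights are all hypothetical placeholders for a real trained model, and a production service would normally sit on a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear function.
WEIGHTS = [0.5, -0.25]
INTERCEPT = 1.0


def score(features):
    """Return a prediction for one record (a list of numeric features)."""
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))


class ScoringHandler(BaseHTTPRequestHandler):
    """Answers POST /score with a JSON prediction."""

    def do_POST(self):
        if self.path != "/score":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = {"prediction": score(payload["features"])}
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "invalid request body")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(host="127.0.0.1", port=8080):
    """Build the server without starting it; call serve_forever() to run."""
    return HTTPServer((host, port), ScoringHandler)
```

Calling `make_server().serve_forever()` starts the service, and a client would then POST a body such as `{"features": [2.0, 4.0]}` to `/score`. Scoring one record per request like this fits the sub-second, non-batch cases mentioned above, which is exactly where ML Services is a weaker fit.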

As for what I like:

  • On-prem is easy if you’re on a tight budget.  There are a lot of tools available, and if you have knowledge in the ML space, you can do pretty much anything on-prem that you could do in a cloud.
  • I’m not the biggest fan of SparkML (and by extension, the Databricks ML capabilities).  Spark’s machine learning capabilities aren’t fantastic, and they tend to lag way behind native scikit-learn or R packages.  Sure, you can run native scikit-learn code on a Spark cluster, but then you don’t get the advantages of scale-out.
  • Azure ML (and AWS SageMaker) are good options, especially for people without an enormous amount of ML experience.  They’re going to constrain you a bit on the algorithms you can choose, but that’s usually okay, and pro users can find their way around that by bringing in Python libraries.
  • Azure Synapse Analytics offers ML capabilities, but it’s worth keeping in mind that the Spark pool machine learning has the same SparkML downsides as what you find in Databricks. Also, between Azure Synapse Analytics and Azure ML, Microsoft has two different ML products run by two different teams here and both of them are competing for what is “the right choice.”  Last year, the guidance was to use Azure ML for everything machine learning-related in Azure, and Azure Synapse Analytics Spark pools with SparkML only in cases where you already had existing Spark code.  I’m not sure what the guidance will look like over the next few months, but that does leave me wondering.
