The smart money is on SQL Server 2025 being released to General Availability this month. Next week there are two major conferences in the Microsoft data space, Microsoft Ignite and the PASS Summit. Almost all the recent versions have been released at one of these events, so with both happening in the same week, it’s highly likely to be the week for announcements.
So Steve Jones asks us to write about what is making us excited about this new release. Perhaps ‘excitement’ is the wrong word, but certainly Steve wants to know which features will make our lives easier. He lists a few ideas, but for me the biggest thing is probably the new suite of functions for fuzzy string comparison.
One of the things that I feel has been significant in the last 20 years has been tidying data. In some ways, I feel like Data Quality is less of a thing nowadays, especially with the advent of Big Data – one of the fundamental things about structured data is that the quality can be poor sometimes. AI prompts seem to make us bad at typing because they’re just so forgiving. But if you’re trying to know whether someone’s name is Steven or Stephen, then even though AI might not care, Steve probably does. And even though two names might sound the same, there are no bounds to the ‘creativity’ of parents who name their daughters Khloe, which probably gets written down by call centre employees as Chloe or Clooey or something else entirely. Cafes even get confused with my name, let alone anyone with a name outside the top ten from the last hundred years (fun fact: apparently “Robert” was the number one name in the US from 1924-1939, and then top 10 until 1972. I was born two years later, so clearly my parents weren’t following trends – but I was named after a Scottish King, and was probably lucky that there was already a Bruce in my mum’s family). So even though data is becoming more and more UNstructured, we need to care more and more that our Big data is also Good data. High quality. Spelling things the right way.
I’m very sad that Master Data Services is leaving the product with SQL Server 2025. Master Data is a great way of handling lists of things, such as known verified data from trusted sources. But even though there are alternatives (such as keeping an instance of SQL Server 2022 around, or just using an external product like Profisee, which has some great tools for identifying data matches), I’m not thrilled about that aspect of SQL Server 2025.
However, SQL Server 2025 does bring some great options for doing fuzzy string matches, making custom Data Quality options even richer. I’ve spoken about this at some user groups recently (including tomorrow, remotely for TriPASS, and in a few weeks in Melbourne and Sydney for Difinity), and in that session I go much deeper into how I see data matching going. I’ll also write more about these methods in future posts, but it’ll take a few posts, covering quite a few sub-topics.
In the meantime though, have a look at the Microsoft documentation on Fuzzy String Matching. You’ll see there are four new functions, covering similarity and distance across two algorithms (Edit Distance and Jaro-Winkler).
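To give a feel for what “distance” and “similarity” mean here, below is a minimal Python sketch of a plain Levenshtein edit distance and a similarity score derived from it. It is only an illustration of the idea, not the SQL Server implementation: the function names `edit_distance` and `edit_similarity` are made up for this example, and the built-in functions may count edits differently (for example, by treating a swap of two adjacent characters as a single edit) and may scale their results differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: the minimum number of single-character
    inserts, deletes and substitutions needed to turn a into b."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletes
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical similarity in the range 0..1 (1.0 means identical),
    scaled by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


if __name__ == "__main__":
    for left, right in [("Steven", "Stephen"), ("Chloe", "Khloe")]:
        print(left, right, edit_distance(left, right), edit_similarity(left, right))
```

Steven and Stephen are two edits apart (swap the v for a p, then add an h), while Chloe and Khloe differ by a single substitution. Jaro-Winkler takes a different approach, scoring on shared characters and giving extra weight to a matching prefix, which often suits names better.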
We’ve had Fuzzy Lookup and Fuzzy Grouping in SSIS and Power Query (well, Power Query Online) for over twenty years, but they’ve felt like closed boxes. Sure, you can put all your data through a Fuzzy Grouping transform, but there haven’t been many options for fine-tuning things. Now I can find a bunch of candidates for a match, and then apply my own set of logic using these new functions.
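The “find candidates, then apply my own logic” pattern can be sketched outside the database too. The Python below is a rough illustration under stated assumptions: it uses the standard library’s `difflib` as a stand-in scorer (a different algorithm from the SQL Server functions), and the name list, the first-letter bonus and the 0.75 threshold are all invented for the example.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical list of verified names, standing in for a trusted reference table.
KNOWN_NAMES = ["Chloe", "Stephen", "Steven", "Robert", "Bruce"]


def match_name(raw, known=KNOWN_NAMES):
    """Return the best verified name for raw, or None if nothing is good enough."""
    # Step 1: cast a wide net with a loose cutoff to collect candidates.
    candidates = get_close_matches(raw, known, n=5, cutoff=0.5)

    # Step 2: apply custom rules to the candidates (all values are assumptions).
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, raw.lower(), candidate.lower()).ratio()
        if raw[:1].lower() == candidate[:1].lower():
            score += 0.1  # small bonus when the first letter agrees
        if score > best_score:
            best, best_score = candidate, score

    # Only accept a match that clears a stricter, hand-picked threshold.
    return best if best_score >= 0.75 else None


if __name__ == "__main__":
    for raw in ["Khloe", "Stephan", "Zzzz"]:
        print(raw, "->", match_name(raw))
```

The point of splitting it into two steps is that the first can stay cheap and generous, while the second is where the business rules live, and those rules are the part a closed-box transform never let you adjust.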
Keep your eye out for lengthier, more technical posts in a month or so (after the conferences are done), and for now, take the time to explore where these new functions can fit into your current data matching code. You might not be rolling out SQL Server 2025 for a few months, but you can start using this in Azure SQL already!
@robfarley.com@bluesky (previously @rob_farley@twitter)