PostgreSQL

PostgreSQL: Fixing the Substring Bug That Bit Back

Noah Misch delivered two important commits fixing a tricky bug in PostgreSQL's SUBSTRING() function that was causing errors with multibyte characters in toasted data. The fix improves how PostgreSQL handles character encoding validation and resolves spurious Valgrind errors, making the database more robust for international applications.

Duration: PT4M17S

https://podlog.io/listen/postgresql-9847372b/episode/postgresql-fixing-the-substring-bug-that-bit-back-cd8a501a

Transcript

Hey there, fellow code explorers! Welcome back to another episode of the PostgreSQL podcast. I'm your host, and wow, do we have an interesting debugging story for you today - February 15th, 2026. You know those bugs that seem simple on the surface but reveal layers of complexity once you start digging? That's exactly what we're diving into today.

So here's what happened in the PostgreSQL world today - no merged pull requests, but we got two absolutely crucial commits from Noah Misch that tell a fascinating story about multibyte character handling and the ripple effects of previous changes.

Let me paint you a picture. Imagine you're working with a database that stores text in multiple languages - maybe Japanese, with those beautiful UTF-8 characters that can take up to 3 bytes each. Now, you're using PostgreSQL's SUBSTRING function to extract just a portion of that text. Sounds straightforward, right? Well, not quite.

Here's where it gets interesting. Back in commit 1e7fe06c - and this is a perfect example of how changes in one part of a system can have unexpected consequences elsewhere - the PostgreSQL team made a change to pg_mbstrlen_with_len to make it throw an error if it encounters an incomplete character at the end of input. Most of the time, this is exactly what you want - catch those encoding issues early and report them clearly.

But here's the plot twist: the SUBSTRING function had a different need. When SUBSTRING wants to extract, say, 2 characters from a UTF-8 string, it needs to detoast - that's PostgreSQL's way of uncompressing stored data - up to 8 bytes to be safe. But here's the kicker: in a string of 3-byte UTF-8 characters, those 8 bytes might span 2 complete characters and 1 partial character. And that partial character? It was triggering that new error check.

Noah Misch stepped in with a really elegant solution. Instead of using the standard string length function that now throws errors on incomplete characters, he implemented a custom string traversal that stops exactly when it has found as many characters as the substring could possibly need. It's like the difference between a blunt instrument and a precision tool.

But wait, there's more to this story! The fix also addresses something that's been bugging the team - SUBSTRING was detoasting one extra character because of some ambiguity between exclusive and inclusive end positions. Noah cleaned that up too, making the function more efficient.

And then we have the second commit, which is one of those "attention to detail" moments that I absolutely love. Noah noticed that a couple of functions - pg_mblen_range and pg_mblen_with_len - were causing spurious Valgrind errors. Not real errors, mind you, but false positives that could drive developers crazy during debugging. The fix was simple but important: reorder the validation checks to happen after the Valgrind annotations. Sometimes the smallest changes make the biggest difference in developer experience.

What I love about these commits is how they showcase the ripple effect of software development. You make one improvement - better error handling for encoding issues - and suddenly you discover edge cases in other parts of the system. It's like tuning a guitar; adjust one string and you might need to fine-tune the others.

Today's Focus: If you're working with international text data in your applications, take a moment to test your SUBSTRING operations, especially with multibyte characters. And if you're a PostgreSQL user dealing with UTF-8 or other multibyte encodings, consider testing this fix in your development environment - Noah specifically mentioned that applications using SUBSTRING on non-ASCII column values should consider applying this to their February 12th releases.

That's a wrap on today's PostgreSQL adventure! Remember, every bug fixed is a step toward more robust, reliable software that serves users around the world better. Keep coding, keep exploring, and I'll catch you tomorrow for another dive into the ever-evolving world of PostgreSQL development. Until then, happy coding!