Tuesday, April 7

Using Perl-like Regular Expressions in MySQL

Regular expressions are amazing. They allow you to parse bits of, search for patterns in and validate strings, all with a simple, uh, regular expression. For example the following represents any possible email address:


It's not my expression - Google has expressions for almost anything you want to find, and you can easily figure out your own quite easily for the rest.

There is one pretty fatal flaw with regular expressions though: they're not necessarily universal or platform independent. The above pattern would be used by Perl, whereas the following represents email addresses in a POSIX regex environment, like MySQL:


The things is, how much use are regular expressions if they can't be used across platforms? Sure, they're inherently faster (in development and execution) than hardcoding some pattern matching logic, but for me the real gain is being able to define immutable "facts" using them, and then referring to these facts from anywhere you need to. This requires a universal syntax.

Perl and POSIX are the most popular syntaxes, but there are others. Of the two, the Perl syntax seems to be regarded as the most powerful, and looking above the more concise. It's also more wildly used - many platforms (like .NET and Java) have a compatible syntax, but thanks to an excellent library called PCRE (Perl Compatible Regular Expressions), the Perl syntax has been made available to a wide variety of situations and platforms, including Ruby.

Since it made sense to store regex in the native format of MySQL, my first approach was to look for some kind of conversion library to convert from POSIX to Perl syntax. I found nothing. Nada. Zip. Zilch. Personally I found that pretty strange.

Going the other way was better - almost accidentally I discovered that some bright spark had created lib_mysqludf_preg, a bunch of User Defined Functions which accept Perl regex (once again using the PCRE library). Once installed, these would work in exactly the same way MySQL functions work on POSIX regex - I would then store all regex in the Perl syntax and use them in MySQL (via the new UDFs), Java, Ruby and wherever else I needed to (provided they accepted Perl regex, of course). Hooray!

Of course installing the library wasn't that simple. I was now in Linux world, and was required to, gasp, build my own copy of the library. This wasn't as difficult as I thought it would be - the hardest part was using Synaptic to get the MySQL development libraries. Compared to the rest of the installation, building was a breeze.

Each UDF is created using the following command:

CREATE FUNCTION lib_mysqludf_preg_info RETURNS STRING SONAME 'lib_mysqludf_preg.so';

where lib_mysqludf_preg.so is the name of the newly built library.

In my case, however, this returned the following error:

ERROR 1126 (HY000) at line 3: Can't open shared library 'lib_mysqludf_preg.so' (errno: 22 lib_mysqludf_preg.so: cannot open shared object file: No such file or directory)

The lib_mysqludf_preg readme warned me that I needed to place lib_mysqludf_preg.so in a location covered by LD_LIBRARY_PATH (whatever that was), and I had done this by adding the default build location of /usr/local/lib to ld.so.conf and running ldconfig to refresh the file cache for the linker. Checking the output of that last command confirmed that ld, the system linker, could indeed see the file.

pretending to be some kind of expert with these things, I used strace to figure out where MySQL was looking for lib_mysqludf_preg.so. This totally sent me down the wrong path as it complained about a permission error - as far as I could tell with my not-so-l33t Linux skills the file was readable by anyone.

After half a day of faffing around I finally found a lead on the MySQL forums: it turns out that the file had actually been built into the wrong location. I was running a 64bit MySQL on a 64bit Linux, and for that reason lib_mysqludf_preg.so had to be placed in lib64. After doing this, the CREATE FUNCTION command worked as advertised. I removed the path I added to ld.so.conf.

Any misgivings I had toward anything Linux (this wouldn't have taken a day to figure out on a certain other platform, I reckon) quickly dissipated after I adapted my database to use Perl regex instead of POSIX. I could finally use single regex "facts" across my application - something that was definitely worth the hassle getting it all to finally work.

No comments:

Post a Comment