Unicode Support on CentOS 5.2 with PHP and PCRE
Yesterday, I talked about how to get the most out of running regular expressions in PHP. I needed to dig deep into PHP’s regular expression syntax because I had to write some regular expressions that deal with Unicode characters.
After much reading, I believed that I knew everything that I needed. I started writing some regex strings and testing the code. Unfortunately, every time I ran a test with a string that contained Unicode characters, the match failed. When I removed the Unicode characters from the string and tested again, it would work. I was baffled.
Finding the Problem
I had the Unicode escapes (‘\X’, ‘\pL’, etc.) inside of a character class, such as ‘[\X-]’, since I was creating a regex to test for domains. To narrow things down, I wrote a really simple rule, ‘/^\X$/’, and tested it with a single Unicode character. Amazingly, having the ‘\X’ outside of the square brackets changed everything, as I now received the following very concerning warning:
Since PHP uses the PCRE engine to run regular expressions, I started to dig into it. I found out that I could query PCRE directly with the pcregrep command-line tool, and I ended up with a very similar error:
pcregrep: Error in command-line regex at offset 2: support for \P, \p, and \X has not been compiled
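For readers who want to reproduce this kind of check, here is a minimal sketch of how such a probe might look. It assumes pcregrep is installed and accepts the -u (UTF-8 mode) flag; the function name probe_pcre_unicode is made up for illustration, and this is not necessarily the exact command used in this post.

```shell
# Hypothetical helper: asks pcregrep to compile a pattern that uses \X.
# On a PCRE build without Unicode properties support, compilation fails
# and pcregrep exits with an error instead of reporting a match.
probe_pcre_unicode() {
    if ! command -v pcregrep >/dev/null 2>&1; then
        echo "pcregrep not found"
        return 2
    fi
    # A single ASCII letter is enough; the point is whether \X compiles.
    if printf 'a\n' | pcregrep -u '^\X$' >/dev/null 2>&1; then
        echo "Unicode property escapes are supported"
    else
        echo "Unicode property escapes are NOT supported"
        return 1
    fi
}
```

Calling probe_pcre_unicode on an affected system should print the "NOT supported" message, while a correctly built PCRE prints the "supported" one.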
It looked like the error was coming from PCRE itself. I searched around for a while thinking that I could simply install a new package using yum. I hoped to find something like pcre-utf8, pcre-unicode, php-pcre-unicode, or something to make it simple and quick to add this support since I much prefer using package management tools rather than compiling and installing from source.
Unfortunately, no such package exists. This support must be enabled when PCRE is compiled, and my CentOS repository only has packages that don’t include that support. After much digging around, I found that this isn’t necessarily CentOS’s fault, as this package has carried over from the RHEL (Red Hat Enterprise Linux) side of things.
A great way of checking to see if this is an issue on your system is by running the following:
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
No Unicode properties support
Newline character is LF
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
This is the output that I received. Notice the “UTF-8 support” and “No Unicode properties support” lines. This means that PCRE was compiled with the “--enable-utf8” configure option, which allows PCRE to recognize and work with UTF-8 encoded strings. However, it wasn’t compiled with the “--enable-unicode-properties” configure option, which works in conjunction with --enable-utf8 to add support for the ‘\p’, ‘\P’, and ‘\X’ escape sequences.
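If you want to script this check, the following is a rough sketch. It assumes the output wording shown above for PCRE 6.6 (other PCRE versions may phrase these lines differently), and the function name has_unicode_properties is invented for this example.

```shell
# Hypothetical helper: reads `pcretest -C` output on stdin and succeeds
# only if Unicode properties support is reported as compiled in.
has_unicode_properties() {
    input=$(cat)
    # The negative line contains the positive one as a substring,
    # so it has to be ruled out first.
    if printf '%s\n' "$input" | grep -q 'No Unicode properties support'; then
        return 1
    fi
    printf '%s\n' "$input" | grep -q 'Unicode properties support'
}
```

A possible way to use it is `pcretest -C | has_unicode_properties && echo "OK"`, which prints OK only when the properties support line is present.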
This seems to have been an oversight when the rpm file was first put together. Fortunately, there is a way to fix it.
Fixing the Problem
Since I’m sure that many of you are like me and would rather not manually compile and install software outside of the package management system, the solution is to update the rpm to have the option that it needs and install it.
I had never done this before. Fortunately, I found a very helpful guide that lays out this process very nicely: How to patch and rebuild an RPM package.
I have provided the new rpm file that I have built at the bottom of this post. If you don’t care about all this jibber-jabber, you can skip down there and grab the file. However, if you would like to learn how to address this issue yourself or have a system that my file will not support, please read on to see how I rebuilt the rpm with the new option.
Rebuilding the rpm
- The first thing I did was set up my ~/.rpmmacros file and src/rpm folder structure as detailed in the Setup section of the guide that I’m following. I’ll simply refer you over there as it doesn’t need repeating here.
- I needed to grab the source rpm for the current version of PCRE on my platform. I’m on CentOS 5.2 with version 6.6 of PCRE. I found the matching source rpm file (pcre-6.6-2.el5_1.7.src.rpm) here.
- I then installed the source rpm in order to gain access to its files:
$ rpm -ivh pcre-6.6-2.el5_1.7.src.rpm
This put the necessary files into my ~/src/rpm/SOURCES and ~/src/rpm/SPECS folders.
- I opened up the ~/src/rpm/SPECS/pcre.spec file and found the following line:
%configure --enable-utf8
I changed it to include the Unicode properties option:
%configure --enable-utf8 --enable-unicode-properties
I then saved and closed the file.
- This is the only change that I needed to make. So, now it is time to build the new rpm file. I simply ran the following to build it:
$ rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
Toward the end of the large amount of output, I received the following:
Wrote: ~/src/rpm/SRPMS/pcre-6.6-2.7.src.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-devel-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-debuginfo-6.6-2.7.x86_64.rpm
This tells me exactly where I can find my new source rpm and rpm files.
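The spec file edit in the steps above could also be scripted rather than done by hand. This is only a sketch: it assumes the spec file contains exactly the `%configure --enable-utf8` line shown above, and the function name enable_unicode_properties is made up for illustration.

```shell
# Hypothetical helper: appends --enable-unicode-properties to the
# %configure line of a pcre.spec file, keeping a .bak copy of the original.
enable_unicode_properties() {
    spec_file="$1"
    # Do nothing if the option is already present, so reruns are harmless.
    if grep -q -- '--enable-unicode-properties' "$spec_file"; then
        return 0
    fi
    sed -i.bak \
        's/^%configure --enable-utf8[[:space:]]*$/%configure --enable-utf8 --enable-unicode-properties/' \
        "$spec_file"
}
```

It would be run as `enable_unicode_properties ~/src/rpm/SPECS/pcre.spec` before the rpmbuild step. If your spec file's %configure line carries other options, the pattern will not match and the file is left unchanged, so check the result before building.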
Updated rpm File for CentOS 5.2 64-bit
If you are running a 64-bit version of CentOS 5.2, the following file should work for you. If you have a different architecture, Linux distro, or encounter any errors when trying to install this file, then you should follow the instructions above to build an rpm that is suitable for your distribution.
pcre-6.6-2.7.x86_64.rpm – PCRE 6.6 for CentOS 5.2 64-bit
Installing New rpm
Now that I have my new rpm file, I just need to install it. Since I already have a pcre package installed, I need to tell the rpm command to update rather than install. The following command does this for me:
# rpm -Uvh ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Notice that I need to be root to run this command.
Finally, to verify that everything worked, I ran the pcretest program again:
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
Unicode properties support
Newline character is LF
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
Looks good.
Finally, time to move on with life.
Tags: CentOS, PCRE, PHP, regular expressions, Unicode






Thanks for this – your RPM works perfectly, and right now I just needed to get this working
That’s great news Grant. I’m glad that it helped you out.
Thanks a ton! This is exactly what i needed after about 2 hours of searching the internet.
Good deal Adam. I checked out your sites. What anime is that in your “Awesome” category? I don’t recognize it, but I’m intrigued.
Your walk through rocked, I have the new rpm installed, and get unicode goodness :
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
Unicode properties support
…
But I still get the errors in my PHP scripts. Were there any mods to PHP you made here?
I’m glad that you liked the tutorial and got PCRE to work properly Cameron.
As for the errors in your scripts, my best guess is that you need to restart your server process so that PHP reloads. I’ve run into many situations where modifications of PHP wouldn’t change the behavior until I restarted Apache.
If that doesn’t fix your problem, do your error messages match what I have in the post?
Right you are. I did a graceful restart at the time which didn’t work, but I just did a full stop and start cycle and it works well. Thanks again!
That’s great news Cameron.
Happy UTF-8ing.
Worked for me, thanks!
You’re welcome Sebastiaan. Thanks for the blog link.
Thanks for the RPM, I just received a confirmation from our RHEL sales rep that this feature will be in RHEL 5.4
Thanks for the instructions on how to enable the UTF-8 properties. I had an oddity in Laconica (a microblogging script) where the RSS feed has asterisks in it. It turns out their regex to replace control characters (“\p{Cc}\p{Cs}”) with asterisks wasn’t working right and was replacing all Ps, Cs and Ss with an asterisk!
Just for reference, I’m on CentOS4 and the CentOS 5 RPM rebuilds fine there
Thanks for sharing about CentOS 4. I’m sure that others will find the info helpful.
Hi -
Thanks for the info — everything works well until I try the
install….. (I’m running CentOS 5.3 on Intel 386 platform)
>>>>>>>>>>>>>>>
rpm -Uvh ./src/rpm/RPMS/i386/pcre-6.6-2.7.i386.rpm
error: Failed dependencies:
pcre = 6.6-2.el5_1.7 is needed by (installed) pcre-devel-6.6-2.el5_1.7.i386
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I don’t want to start removing installed RPMs and risk wandering into the woods – any idea what I need to do to get back on the right road?
Sorry for the delay KC.
If you follow my “Rebuilding the rpm” instructions, you will get both the pcre and pcre-devel rpms. Try installing both at the same time:
rpm -Uvh ./src/rpm/RPMS/i386/pcre-6.6-2.7.i386.rpm ./src/rpm/RPMS/i386/pcre-devel-6.6-2.7.i386.rpm
This should upgrade both packages at the same time and should bypass the problem. Let me know if this works for you.
Works like a champ! – Thank you so much.
Good to hear KC. I’m glad that it worked.
[...] I based this short tutorial on the guides Unicode Support on CentOS 5.2 with PHP and PCRE and How to patch and rebuild an RPM package Tagged as: centos, pcre, PHP, unicode [...]
Many thanks,
Worked like a treat.. really appreciate not having to create the RPM!
All the best,
Paul Hudson
Thank you! I am far from an expert on cmd line but i was able to get through this successfully – thanks for taking the time to put it together-
[...] I found a way to solve the problem that might be useful: Unicode Support on CentOS 5.2 with PHP and PCRE | gaarai.com [...]
Thank you SO VERY MUCH for this guide. I would be completely stuck without it.
I’ve uploaded a compiled version for Centos 5.2 on i386 which can be found here:
http://www.ngse.co.uk/pcre-6.6-2.7.i386.rpm
Not sure if you want to host/add this to the article downloads at all.
Thanks again!
I’m glad that it was helpful to you Robin. Thanks for the i386 version. I’ll add it to the post.
Thanks, worked for me as well.
I’m running CentOS 4.7 so I had to track down the source RPM’s. Was able to thanks to Alan Dixon’s post here: http://homeofficekernel.blogspot.com/2009/01/centos4-and-civicrm-21.html
I know there are a number of folks out there still running CentOS 4 – I’d be happy to send you my 4.7 compatible rebuilt rpm’s to post, if you’d like. Just shoot me an email.
Thanks again!
Thanks Chris. I’m adding your RPM to the post.
Thank you very much for your detailed explanation. I am running RHEL5 x86_64 but not an expert at all in this kind of thing. It worked out very well.
Best regards,
Jaap
I’m glad that it worked for you Jaap.
Thank you! Worked perfectly in one step:
rpm -Uvh http://gaarai.com/wp-content/uploads/2009/01/pcre-66-27x86_64.rpm
I learn something new every day. I didn’t realize that you could run RPM using a URL.
Thanks for the tip.