Friday, July 5, 2013

The magic of LC_ALL=C

One fine day, I was assigned a task to search for a particular value, say -99999999, count its occurrences, and then replace it with the Sybase default datetime value 19000101.

The Unix command grep has always been my reliable companion for such searches, and sed for replacements. Without thinking for a second, I fired this:

time(grep -c -e "-99999999" inputFile)

I was pretty sure that I would get the result in a minute or two. But, to my surprise, it ran for an hour or so and still hadn't produced a result. I even tried other versions of grep (egrep, fgrep), but in vain. Then I checked the number of rows as well as the size of the file. It contained millions of records.

cksum inputFile

3604281750    1813912821

Here, the size of the file is 1813912821 bytes, which is around 1.69 GiB.
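
As a quick check of that conversion, here is a small sketch that turns the byte count reported by cksum into GiB. The variable names are mine, and LC_ALL=C is there only so that awk prints a dot as the decimal separator.

```shell
size_bytes=1813912821    # second field of the cksum output above

# 1 GiB = 1024 * 1024 * 1024 bytes; LC_ALL=C keeps "." as the decimal separator.
size_gib=$(LC_ALL=C awk -v b="$size_bytes" 'BEGIN { printf "%.3f", b / (1024 * 1024 * 1024) }')

echo "$size_gib GiB"
```

This prints 1.689 GiB.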

So, I thought of using sed this time. I simply fired the command below:

time(sed -n '/-99999999/p' inputFile | wc -l)

7867816

real     1m30.78s
user     0m0.44s
sys       0m2.44s

So, it took around 1 minute and 30 seconds. 

Then, I thought of checking how awk works for the same task.

time(awk '/-99999999/ { n++ }; END { print n+0 }' inputFile)

7867816

real     1m26.57s
user     1m22.96s
sys       0m1.13s

Marginally better than sed, but still not up to the mark.

Then I came to know about locales in Unix/Linux. A locale defines how a program should display or parse dates, numbers, and other information, as well as what character encoding it should use for reading and writing strings.

You can check values of the locale variables defined on your system as below:

locale

LANG="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=


Here, en is a two-letter language code and GB is the country or territory (Great Britain). For the US, it would be en_US.UTF-8. UTF-8 is the character encoding.

Notice that the last variable, LC_ALL, doesn't have any value specified. When we define a value for this variable, it overrides the values of all the other LC_* variables.
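
A quick way to see the override in action, without touching your session, is to set LC_ALL for a single locale invocation. This is only a sketch: the variable name ctype_line is mine, and it assumes the locale command is available, as it is on typical Linux systems.

```shell
# LC_ALL is set only for this one invocation of locale.
# Whatever LC_CTYPE was before, locale now reports it as "C".
ctype_line=$(LC_ALL=C locale | grep '^LC_CTYPE=')

echo "$ctype_line"
```

Run plain locale afterwards and LC_CTYPE is back to its usual value, because the assignment never reached the shell itself.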

Sometimes programs (or commands) doing string manipulation become painfully slow in a UTF-8 locale, as we witnessed with grep above, because they have to decode multibyte characters as they go. If we switch the locale to C (POSIX), where every byte is treated as a single character, the same commands can run dramatically faster.

time(LC_ALL=C grep -c -e "-99999999" inputFile)

7867816

real    0m3.45s
user   0m2.60s
sys    0m0.85s

It took less than 4 seconds this time, faster than both sed and awk.

Remember, though, that you've changed the locale settings for just one command. They don't apply to subsequent commands you fire. What if you want to do it for your current session?

export LC_ALL=C

Now, if you check values of locale variables again:

locale

LANG="en_GB.UTF-8"
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL="C"


Except for LANG, all the other variables are now set to "C" (POSIX). To make the setting permanent, you can add the export line to your .profile.
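
To make the difference between the per-command form and export concrete, here is a minimal sketch. The variable names are made up for illustration, and the exported case runs inside a subshell so that trying it out doesn't change your own session.

```shell
unset LC_ALL    # start from a known state (drop this line to keep your current setting)

# Per-command form: the assignment reaches only that one child process.
per_command=$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')
after_command=${LC_ALL:-unset}

# Exported form: every later command in the same (sub)shell inherits it.
exported=$(
  export LC_ALL=C
  first=$(sh -c 'printf %s "$LC_ALL"')
  second=$(sh -c 'printf %s "$LC_ALL"')
  printf '%s,%s' "$first" "$second"
)

echo "per-command: $per_command, afterwards: $after_command, exported: $exported"
```

The child started with the LC_ALL=C prefix sees C, the shell itself still has nothing set afterwards, and both children started after export see C.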

Even your sed and awk commands run faster under the POSIX locale.

time(LC_ALL=C sed -n '/-99999999/p' inputFile | wc -l)

7867816

real    0m27.99s
user   0m0.36s
sys    0m20.85s

Comes down to about 28 seconds from 1 minute and 30 seconds.

Let's check the performance of awk. 

time(LC_ALL=C awk '/-99999999/ { n++ }; END { print n+0 }' inputFile)

7867816


real      0m6.78s
user     0m5.62s
sys       0m0.98s

Wow!!! Comes down to almost 7 seconds from 1 minute and 26 seconds. 

Still, it can't match the performance of grep used with LC_ALL=C. 
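
If you want to try the comparison yourself on a smaller scale, here is a rough sketch. It builds a throwaway sample file (a hypothetical stand-in for inputFile, with a made-up row layout) and checks that every variant reports the same count. It only verifies the counts; to get timings, wrap each command in time(...) as above, on a file large enough for the difference to show.

```shell
sample=$(mktemp)    # hypothetical stand-in for inputFile

# 200,000 rows; every fourth row carries the sentinel value.
awk 'BEGIN {
  for (i = 0; i < 200000; i++) {
    v = (i % 4 == 0) ? "-99999999" : i
    print v "|row"
  }
}' > "$sample"

# -e marks the next argument as the pattern, since it begins with a dash.
count_default=$(grep -c -e "-99999999" "$sample")
count_grep=$(LC_ALL=C grep -c -e "-99999999" "$sample")
count_awk=$(LC_ALL=C awk '/-99999999/ { n++ } END { print n+0 }' "$sample")
count_sed=$(LC_ALL=C sed -n '/-99999999/p' "$sample" | wc -l | tr -d ' ')

echo "default grep: $count_default, C grep: $count_grep, C awk: $count_awk, C sed: $count_sed"
rm -f "$sample"
```

All four counts should come out as 50000, which confirms that switching to the C locale changes only the speed here, not the result.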

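It's not advisable, though, to set LC_ALL=C for every program. Under the C locale, sorting falls back to raw byte order and bytes outside ASCII are no longer treated as letters, so programs that handle non-ASCII text may produce different results, and some might even crash. So, test it thoroughly before you apply it.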
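
One concrete thing to test for: under the C locale, sort compares raw bytes, so all uppercase ASCII letters come before the lowercase ones. A minimal sketch, with made-up sample words:

```shell
# In byte order, "A" (65) and "B" (66) sort before "a" (97) and "b" (98).
sorted=$(printf '%s\n' banana Apple apple Banana | LC_ALL=C sort | paste -s -d ' ' -)

echo "$sorted"
```

This prints Apple Banana apple banana. A script that relies on the ordering a language-specific locale would use could quietly break after the switch.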

There is one more flavour of grep: pcregrep. It uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. If you don't want to use LC_ALL=C, you can use pcregrep.

time(pcregrep -c -e "-99999999" inputFile)

7867816

real      0m9.78s
user     0m5.62s
sys       0m3.98s

Took around 10 seconds.  Better than regular grep without using LC_ALL=C, but can't beat the performance of grep used with LC_ALL=C. 

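P.S.: As per the GNU documentation, this problem is fixed in GNU grep 2.7. I couldn't test whether LC_ALL=C is still needed on that release. Keep in mind that the 2.6.* shown below comes from uname -r, which reports the kernel release and not the grep version; grep --version is the command that shows the latter.

uname -r

2.6.32.*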