One fine day, I was assigned a task: search for a particular value, say -99999999, count its occurrences, and then replace it with the Sybase default datetime value, 19000101.
The Unix command grep has always been my reliable companion for such searches, and sed for replacements. Without thinking for a second, I fired this:
time(grep -c -e "-99999999" inputFile)
I was pretty sure I would get the result in a minute or two. To my surprise, it ran for an hour or so without producing a result. I even tried the other variants of grep, egrep and fgrep, but in vain. Then I checked the number of rows and the size of the file. It contained millions of records.
cksum inputFile
3604281750 1813912821
Here, the second field is the size of the file: 1813912821 bytes, which is around 1.69 GiB.
So, I thought of using sed this time. I simply fired the command below:
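Before picking a tool, it helps to know how big the input really is. Here is a minimal sketch on a throwaway file (the variable name sample and its contents are made up for this example; they are not the real inputFile). wc -c gives the size in bytes, wc -l gives the number of rows, and the second field of cksum should agree with wc -c.

```shell
# Build a small throwaway file; "sample" stands in for the real inputFile.
sample=$(mktemp)
printf 'a|-99999999|x\nb|20100101|y\nc|-99999999|z\n' > "$sample"

# Size in bytes and number of rows (tr strips the padding some wc versions add).
bytes=$(wc -c < "$sample" | tr -d ' ')
rows=$(wc -l < "$sample" | tr -d ' ')

# cksum prints: <CRC> <size in bytes> <file name>
cksum_bytes=$(cksum "$sample" | awk '{ print $2 }')

echo "bytes=$bytes rows=$rows cksum_bytes=$cksum_bytes"
rm -f "$sample"
```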
time(sed -n '/-99999999/p' inputFile | wc -l)
7867816
real 1m30.78s
user 0m0.44s
sys 0m2.44s
So, it took around 1 minute and 30 seconds.
Then, I thought of checking how awk works for the same task.
time(awk '/-99999999/ { n++ }; END { print n+0 }' inputFile)
7867816
real 1m26.57s
user 1m22.96s
sys 0m1.13s
Slightly better than sed, but still not up to the mark.
Then, I came to know about locales in Unix/Linux. A locale defines how a program should display or parse dates, numbers, and other information, as well as which character encoding it should use for reading and writing strings.
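For reference, the three counting approaches tried so far can be compared on a small file. This is only a sketch: the sample data is invented, and the -e option is used with grep so that a pattern beginning with a hyphen is not parsed as an option. All three should report the same number of matching lines.

```shell
# Throwaway sample; the real inputFile is not reproduced here.
sample=$(mktemp)
printf 'a|-99999999|x\nb|20100101|y\nc|-99999999|z\n' > "$sample"

# -e marks the next argument as the pattern, so the leading "-" is not
# taken for an option.
count_grep=$(grep -c -e '-99999999' "$sample")

# sed prints the matching lines and wc counts them.
count_sed=$(sed -n '/-99999999/p' "$sample" | wc -l | tr -d ' ')

# awk counts matches itself; n+0 prints 0 when nothing matched.
count_awk=$(awk '/-99999999/ { n++ } END { print n+0 }' "$sample")

echo "grep=$count_grep sed=$count_sed awk=$count_awk"
rm -f "$sample"
```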
You can check values of the locale variables defined on your system as below:
locale
LANG="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
Here, en is the two-letter language code and GB is the country or territory (Great Britain). For the US, it would be en_US.UTF-8. UTF-8 is the character encoding.
Notice that the last variable, LC_ALL, doesn't have any value specified. When a value is defined for this variable, it overrides the values of all the other locale variables.
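The precedence of LC_ALL can be checked with a quick experiment. The sketch below assumes the locale command is available and uses only the C and POSIX locales, since those exist everywhere. LC_CTYPE is set to one value and LC_ALL to another, and the value reported for LC_CTYPE is the one from LC_ALL.

```shell
# LC_ALL takes precedence over the individual LC_* variables.
# LC_CTYPE is set to POSIX here, but LC_ALL=C wins.
ctype_line=$(LC_CTYPE=POSIX LC_ALL=C locale | grep '^LC_CTYPE=')
echo "$ctype_line"
```

The value is shown in double quotes because it is implied by another variable (LC_ALL here) rather than set directly.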
Sometimes programs (or commands) doing string manipulation become painfully slow in a UTF-8 locale, where a single character may span several bytes; grep, as we witnessed above, is one example. If we switch the locale to C (POSIX), in which every byte is treated as one character, such commands can run dramatically faster.
time(LC_ALL=C grep -c -e "-99999999" inputFile)
7867816
real 0m3.45s
user 0m2.60s
sys 0m0.85s
It took less than 4 seconds this time, faster than both sed and awk.
Remember, though, you've changed locale settings for just one command. It's not applicable to subsequent commands you fire. What if you want to do it for your current session?
export LC_ALL=C
Now, if you check values of locale variables again:
locale
LANG="en_GB.UTF-8"
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL="C"
Except for LANG, all the variables are now set to "C" (POSIX). To make the setting permanent, you can put the export line in your .profile.
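The difference between the two forms can be seen in a small sketch. A prefix assignment is visible only to the one command it precedes, while an exported variable is inherited by every later command. The export is done inside a subshell here so that the example does not change the settings of the shell it is run from.

```shell
unset LC_ALL

# Prefix assignment: applies to that single command only.
inside=$(LC_ALL=C sh -c 'echo "${LC_ALL:-unset}"')
after=${LC_ALL:-unset}

# export: inherited by every later command in that (sub)shell.
exported=$( export LC_ALL=C; sh -c 'echo "${LC_ALL:-unset}"' )

echo "inside=$inside after=$after exported=$exported"
# prints: inside=C after=unset exported=C
```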
Even sed and awk run faster under the POSIX locale.
time(LC_ALL=C sed -n '/-99999999/p' inputFile | wc -l)
7867816
real 0m27.99s
user 0m0.36s
sys 0m20.85s
That comes down to about 28 seconds from 1 minute and 30 seconds.
Let's check the performance of awk.
time(LC_ALL=C awk '/-99999999/ { n++ }; END { print n+0 }' inputFile)
7867816
real 0m6.78s
user 0m5.62s
sys 0m0.98s
Wow! That comes down to almost 7 seconds from 1 minute and 26 seconds.
Still, it can't match the performance of grep used with LC_ALL=C.
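It's not advisable, though, to set LC_ALL=C for every program. In the C locale, text is handled byte by byte, so sort order, character classes, and the handling of non-ASCII characters can all change, and some programs might misbehave or even crash. So, test thoroughly before you apply it.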
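Two small experiments hint at what changes under the C locale. This is a sketch with made-up input, not an exhaustive list: sort falls back to plain byte order, and a character that takes two bytes in UTF-8 is counted as two characters.

```shell
# In the C locale, sorting is by byte value: uppercase ASCII before lowercase.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr '\n' ' ')
echo "$sorted"

# \303\251 are the two bytes that encode "é" in UTF-8; under LC_ALL=C
# wc -m treats each byte as a separate character.
chars=$(printf '\303\251' | LC_ALL=C wc -m | tr -d ' ')
echo "$chars"
```

In a UTF-8 locale, the same sort would typically interleave upper and lower case, and wc -m would report one character, which is why scripts that depend on such behaviour need testing before LC_ALL=C is applied to them.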
There is one more flavour of grep: pcregrep. It's a grep that uses the PCRE regular expression library to support patterns compatible with the regular expressions of Perl 5. If you don't want to use LC_ALL=C, you can use pcregrep.
time(pcregrep -c -e "-99999999" inputFile)
7867816
real 0m9.78s
user 0m5.62s
sys 0m3.98s
It took around 10 seconds: better than regular grep without LC_ALL=C, but it can't beat grep with LC_ALL=C.
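The timings above cover only the counting half of the task. The replacement half could be sketched as follows, assuming a plain substitution with sed is acceptable for the data. The file names and sample rows are placeholders, and the output goes to a new file so the original stays untouched until the result has been checked.

```shell
# Throwaway input and output files; both names are placeholders.
in_file=$(mktemp)
out_file=$(mktemp)
printf 'a|-99999999|x\nb|20100101|y\nc|-99999999|z\n' > "$in_file"

# Count the lines that contain the value before replacing it.
before=$(LC_ALL=C grep -c -e '-99999999' "$in_file" || true)

# Replace every occurrence and write the result to a separate file.
LC_ALL=C sed 's/-99999999/19000101/g' "$in_file" > "$out_file"

# Verify: no old value left, and the new value appears on as many lines.
# (grep -c exits non-zero when the count is 0, hence "|| true".)
remaining=$(LC_ALL=C grep -c -e '-99999999' "$out_file" || true)
replaced=$(LC_ALL=C grep -c -e '19000101' "$out_file" || true)

echo "before=$before remaining=$remaining replaced=$replaced"
rm -f "$in_file" "$out_file"
```

Note that a bare pattern like this would also match inside a longer number, so real data may call for anchoring the pattern to the field delimiters.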
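P.S.: As per the GNU documentation, this problem is fixed in GNU grep 2.7. I couldn't test whether LC_ALL=C is still needed on that release. Note that uname -r, shown below, reports the kernel release rather than the grep version; grep --version is the command that shows which grep is installed.
uname -r
2.6.32.*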