Wednesday, March 29, 2006

Advanced use of -exec in the Unix find utility

I was recently tasked with writing a utility to go through a large set of log files and delete files based on some very specific, and relatively complex, rules. Basically, I needed to delete any file that contained one of a couple of specific strings anywhere in the file, and also any file that contained certain other strings in its last few lines.

The file needed to be deleted if it was named *.log, and any of the following were true:
  • contains the string "Termination signal '15'" anywhere in the file
  • contains the string "Begin Stack Backtrace" anywhere in the file
  • contains the string "fatal error terminated partition" in the last few lines
  • contains the string "shutting down partition as requested" in the last few lines
I didn't want to write something that would parse through every file once for each string, and I wanted something that would be extensible if I was asked to search for other things. I decided to use the Unix find command with a small shell script wrapper to improve readability.

# Delete all logs with the following anywhere in the log
fullregex="Termination signal '15'|Begin Stack Backtrace"

# Delete all logs with the following anywhere in the last few lines of the log
tailregex="fatal error terminated partition|Shutting down partition as requested"

find "$FORTE_ROOT/log/" -name \*.log -type f \
    \( \
        -exec sh -c "tail \"\$0\" | grep -qEi \"$tailregex\"" {} \; \
        -o \
        -exec grep -qiE "$fullregex" {} \; \
    \) \
    -exec rm {} \;

There are two main components at work here. The first is grep and regular expressions, and the second is the extended logic and language of find itself.

Many people overlook the -E option of grep, which enables extended regular expressions. Regular expressions can be cryptic and daunting, but their power is indisputable. The regular expressions here (fullregex and tailregex) are pretty simple: each matches either of two strings separated by the regex OR operator '|'. The grep -q option is very useful in scripting because it prints nothing and simply returns an exit status: 0 if the pattern was matched, non-zero if it was not.
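To make the exit-status behavior concrete, here is a minimal sketch. The scratch file and its contents are made up for illustration, and it assumes mktemp is available on your system.

```shell
# Scratch file standing in for a log (hypothetical content).
logfile=$(mktemp)
printf '%s\n' 'startup ok' 'Begin Stack Backtrace' > "$logfile"

# -q prints nothing; only the exit status carries the answer.
if grep -qiE "Termination signal '15'|Begin Stack Backtrace" "$logfile"; then
    match_status="matched"
else
    match_status="no match"
fi
echo "$match_status"

# A pattern that is absent from the file yields a non-zero status.
miss_status=0
grep -qiE 'fatal error terminated partition' "$logfile" || miss_status=$?
echo "exit status for a miss: $miss_status"

rm -f "$logfile"
```

Because the answer lives in the exit status, the same grep call can sit directly in an if statement or, as in this post, inside find's -exec.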

The find command is another powerful tool that is often under-utilized. The real power of find is that its options are themselves a mini programming language. In fact, the options of find form an implicit if statement like you might find in any other language. For each file in the directory it's searching, find evaluates each option in order until it gets a false result or runs out of options.

By using the -exec option, find becomes a very powerful tool for manipulating large sets of files based on arbitrary rules. You can execute any program you like with -exec, and find will interpret its exit code as a true or false value (0 is true, anything else is false) and continue or end processing based on it.
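Here is a small sketch of -exec acting as a true/false test. The directory and file names are invented for the demo, and it assumes mktemp -d is available. It uses -print instead of rm so nothing is deleted.

```shell
# Throwaway directory with two hypothetical logs.
demo=$(mktemp -d)
printf 'all good\n' > "$demo/clean.log"
printf 'Begin Stack Backtrace\n' > "$demo/crash.log"

# grep -q exits 0 only for crash.log, so find reaches -print only for that file.
matches=$(find "$demo" -name '*.log' -type f \
    -exec grep -q 'Begin Stack Backtrace' {} \; -print)
echo "$matches"

rm -rf "$demo"
```

Swapping -print in for a destructive action like this is also a handy way to preview which files a find command would touch.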

In my example, checking the last few lines of a file is much more efficient than scanning the entire file. The reason I point this out is that I want to keep my processing to a minimum. Because find stops processing as soon as it knows a result, I can use that to ensure that I do the minimum amount of work possible. Notice the use of parentheses and the -o. This is a logical grouping, and the -o represents an OR. All of the options of find are implicitly AND'ed unless you use the -o between them. In this case I want to remove the file if either check succeeds. And because an OR is always true if its first part (the tail check) is true, find does not need to run the second grep.
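One way to watch the short-circuit happen is to have each -exec leave a note in a trace file before it runs its check. This is only a sketch: the file names, the trace file, and the shortened patterns are all made up for the demo, and it assumes mktemp -d is available.

```shell
demo=$(mktemp -d)
trace="$demo/trace.txt"
printf 'startup\nfatal error terminated partition\n' > "$demo/tail-hit.log"
printf 'Begin Stack Backtrace\nshutdown\n' > "$demo/full-hit.log"

# Each -exec appends a line to the trace file before running its check.
# $0 is the log file and $1 is the trace file, both passed as arguments.
flagged=$(find "$demo" -name '*.log' -type f \
    \( \
        -exec sh -c 'printf "tail %s\n" "$0" >> "$1"; tail "$0" | grep -qi "fatal error"' {} "$trace" \; \
        -o \
        -exec sh -c 'printf "full %s\n" "$0" >> "$1"; grep -qi "stack backtrace" "$0"' {} "$trace" \; \
    \) -print | sort)

# Count how many times each check actually ran.
tail_checks=$(grep -c '^tail ' "$trace")
full_checks=$(grep -c '^full ' "$trace")
echo "tail checks: $tail_checks, full checks: $full_checks"
echo "$flagged"

rm -rf "$demo"
```

The tail check runs for both files, but the full-file check should only run for full-hit.log, because the tail check already succeeded for tail-hit.log.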

For clarity, here is what the above find command would do for each file if written as a standard shell script:


# find $FORTE_ROOT/log/ -name \*.log -type f
# Assume $filename is replaced by the current file
if [[ "$filename" == *.log ]]; then # -name \*.log ([[ ]] needs bash or ksh)
    if [ -f "$filename" ]; then # -type f
        # \( \
        # -exec sh -c "tail \$0 | grep -qEi \"$tailregex\"" {} \; \
        # -o \
        # -exec grep -qiE "$fullregex" {} \; \) #
        # For the sake of this example, I've broken the or'd
        # -exec's into an if-elif statement.
        if tail "$filename" | grep -qEi "$tailregex"; then
            # -exec rm {} \;
            rm "$filename"
        # Note how if the first grep succeeds, the second
        # never occurs
        elif grep -qiE "$fullregex" "$filename"; then
            # -exec rm {} \;
            rm "$filename"
        fi
    fi
fi

The last thing to talk about is the use of -exec. One thing I've found frustrating about -exec in the past was that I couldn't get command piping (|) to work. I realized that the reason for this is that the -exec command is not processed by a shell; find runs it directly. Once I realized that, the problem could be resolved by introducing a shell:
-exec sh -c "tail \"\$0\" | grep -qEi \"$tailregex\"" {} \;
Note that you have to pass the filename ( {} ) as a parameter to the shell, rather than using it within the shell's command string. Because it is the first argument after the command string, you can then reference the filename with \$0 (the backslash keeps the outer shell from expanding $0 before sh sees it).
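As a variation on the same idea (not what my script above does), here is a sketch that passes the regex as a second argument too, so the command string can be single-quoted and needs no backslash escaping. The placeholder word sh fills $0, the filename arrives as $1, and the regex as $2. The directory and file name are invented for the demo, and mktemp -d is assumed to be available.

```shell
demo=$(mktemp -d)
printf 'startup\nShutting down partition as requested\n' > "$demo/my app.log"

tailregex="fatal error terminated partition|Shutting down partition as requested"

# "sh" fills $0; {} arrives as $1 and the regex as $2, so the outer shell
# expands nothing inside the single-quoted command string.
found=$(find "$demo" -name '*.log' -type f \
    -exec sh -c 'tail "$1" | grep -qEi "$2"' sh {} "$tailregex" \; \
    -print)
echo "$found"

rm -rf "$demo"
```

Quoting "$1" inside the command string also keeps file names with spaces in one piece, as the demo file name shows.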

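The last bit of info is how to terminate -exec. find needs a way to distinguish its own parameters from those of the -exec command. Because of that, you need to terminate the -exec option with a semicolon, escaped as \; so that the shell passes it through to find instead of treating it as a command separator.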
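Since the backslash is only there to protect the semicolon from the shell, quoting it works just as well. A quick sketch, using made-up file names and assuming mktemp -d is available:

```shell
demo=$(mktemp -d)
: > "$demo/a.log"
: > "$demo/b.log"

# Escaped and quoted semicolons both reach find as the same single ";" argument.
escaped=$(find "$demo" -name '*.log' -type f -exec basename {} \; | sort)
quoted=$(find "$demo" -name '*.log' -type f -exec basename {} ';' | sort)
echo "$escaped"

rm -rf "$demo"
```

Both forms should list the same files; which one you use is a matter of taste.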

In the end, find is a powerful tool in the arsenal of a Unix systems admin/engineer. It's a tool that, like grep, sed, awk, etc., can solve a complicated task in relatively simple terms.


