|
Softpanorama |
May the source be with you, but remember the KISS principle ;-)
Due to its size, the mini-tutorial has been moved to a separate page.
Unix grep is an old and rather capricious text-search utility that can search for either fixed strings or regular expressions. I have never had the luck to write a complex regular expression that works the first time in "old" grep (before the introduction of Perl regular expressions), despite using Perl on a daily basis and, thanks to my teaching experience, being reasonably well versed in regex syntax.
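Grep regex syntax has three variations: basic regular expressions (the default, -G), extended regular expressions (-E), and Perl regular expressions (-P).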
This "multiple personalities" behavior is very confusing, and that fact essentially spoils the broth. I hate the fact that nobody has the courage to implement a new standard grep and that the current implementation carries all the warts accumulated during the 30 years of Unix existence. I highly recommend using the -P option (Perl regular expressions). It makes grep's behavior less insane.
The simplest way of using grep is a plain vanilla string search (the fgrep or grep -F invocation): you can select all lines that contain a certain string in one or more files. For example,
fgrep foo file # returns all the lines in the file "file" that contain the string "foo"
Another way of using grep is to have it accept data through STDIN instead of having it search a file. For example,
ls | fgrep blah # lists all files in the current directory whose names contain the string "blah"
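As for regular expressions, grep is very idiosyncratic: in its default (basic) syntax several operators, such as |, +, ?, braces and parentheses, are taken literally unless you put a backslash before them. For example:
grep 'if | while' #-- wrong: | is a literal character here, and the spaces are part of the pattern
grep 'if \|while' #-- will work (matches "if " or "while"); please note the single quotes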
In complex cases it's always easier to use Perl or the grep -P option (the Perl regular expression option is available only in GNU grep) than to explore the intricacies of grep syntax.
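The difference between the dialects is easiest to see by running the same alternation through each of them. The following is a minimal sketch on throwaway data (the sample lines and variable names are made up for illustration); the -P branch runs only if the local grep was built with Perl regex support.

```shell
# Throwaway sample data in a temporary file.
sample=$(mktemp)
printf '%s\n' 'if (x) {' 'while (y) {' 'return z;' > "$sample"

# Basic regex (default): a bare | is a literal character, so no line matches.
bre_plain=$(grep -c 'if|while' "$sample" || true)

# Basic regex with the backslashed form \| (a GNU extension): alternation works.
bre_escaped=$(grep -c 'if\|while' "$sample" || true)

# Extended regex: | means alternation without any backslash.
ere=$(grep -c -E 'if|while' "$sample" || true)

echo "BRE plain: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"

# Perl regex, attempted only when this grep supports -P.
if printf 'x\n' | grep -q -P 'x' 2>/dev/null; then
    grep -c -P '\b(?:if|while)\b' "$sample"
fi

rm -f "$sample"
```

The `|| true` guards are there only because grep exits with status 1 when nothing matches, which would abort a script running under `set -e`.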
Please note that only the latest GNU grep has an option -P that gives you the possibility to use Perl-style regular expressions. Here is the relevant quote from the GNU grep 2.5 documentation:
There are four major variants of grep, controlled by the following options.
- `-G' `--basic-regexp' Interpret the pattern as a basic regular expression. This is the default.
- `-E' `--extended-regexp' Interpret the pattern as an extended regular expression.
- `-F' `--fixed-strings' Interpret the pattern as a list of fixed strings, separated by newlines, any of which is to be matched.
- `-P' `--perl-regexp' Interpret the pattern as a Perl regular expression.
In addition, two shortcuts, egrep and fgrep, are available: egrep is the same as `grep -E', and fgrep is the same as `grep -F'.
There is also a separate implementation of grep that uses Perl regular expressions, called pcregrep.
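To make the variants concrete, here is a small sketch that feeds one pattern containing a dot to each of them. The sample lines are invented; the point is only that -G and -E treat "." as "any character", while -F takes it literally and accepts several newline-separated strings at once.

```shell
data=$(mktemp)
printf '%s\n' 'version 1.5' 'version 105' 'version 1+5' > "$data"

# -G (basic, the default): "." matches any character, so all three lines match.
basic=$(grep -c -G '1.5' "$data" || true)

# -E (extended): "." has the same meaning here.
extended=$(grep -c -E '1.5' "$data" || true)

# -F (fixed strings): "." is an ordinary character, so only one line matches.
fixed=$(grep -c -F '1.5' "$data" || true)

# -F with a newline-separated list: a line matches if it contains any of the strings.
fixed_list=$(grep -c -F "$(printf '1.5\n1+5')" "$data" || true)

echo "basic=$basic extended=$extended fixed=$fixed fixed_list=$fixed_list"

# -P (Perl): the dot has to be escaped to be literal; attempted only if supported.
if printf 'x\n' | grep -q -P 'x' 2>/dev/null; then
    grep -c -P '1\.5' "$data"
fi

rm -f "$data"
```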
The simplest example for grep is to find a word in a file:
grep foo myfile # lists the lines in the file "myfile" that match the regular expression "foo"
The name grep is a combination of editor command characters. It comes from the editor command :g/RE/p, which translates to "global regular expression print". In fgrep the f is usually read as "fixed" (fgrep searches for fixed strings), although "fast" is a common folk etymology. This tutorial is based on the GNU version of grep. Other versions are quite similar and sometimes more powerful, but GNU grep is the de facto standard and currently beats the others by having the -P (Perl regex) option.
The most primitive regular expression is a string. In this case grep returns all matching lines that contain foo as a substring. grep has a special version that does string searching very fast (fgrep, see below).
Another way of using grep is in a pipe, for example,
ls | grep blah # lists all files in the current directory whose names contain the string "blah"
There are also several variants of grep that can search directly in compressed files, for example gzgrep and bzgrep. gzgrep is a wrapper that invokes grep on compressed or gzip'ed files. All options specified are passed directly to grep. If no file is specified, the standard input is decompressed and fed to grep. Otherwise the given files are uncompressed if necessary and fed to grep.
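What such a wrapper does can be reproduced by hand with a pipe. The sketch below assumes gzip is installed and uses a made-up file name. The wrapper call at the end assumes the script is installed under the name zgrep, as GNU gzip ships it, and is skipped otherwise.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta blah' 'gamma' > "$workdir/notes.txt"
gzip "$workdir/notes.txt"    # replaces notes.txt with notes.txt.gz

# Portable form: decompress to standard output and let grep read the pipe.
piped=$(gzip -dc "$workdir/notes.txt.gz" | grep 'blah')
echo "$piped"

# Wrapper form, only if a zgrep script is available on this system.
if command -v zgrep >/dev/null 2>&1; then
    zgrep 'blah' "$workdir/notes.txt.gz"
fi

rm -rf "$workdir"
```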
Dr. Nikolai Bezroukov
glark offers grep-like searching of text files, with very powerful, complex regular expressions (e.g., "/foo\w+/ and /bar[^\d]*baz$/ within 4 lines of each other"). It also highlights the matches, displays context (preceding and succeeding lines), does case-insensitive matches, and automatic exclusion of non-text files. It supports most options from the GNU version of grep.
There are some tools that look like you will never replace them. One of those (for me) is grep. It does what it does very well (remarks about the shortcomings of regexen in general aside). It works reasonably well with Unicode/UTF-8 (a great opportunity to Fail Miserably for any tool, viz. a2ps).
Yet, the other day I read about ack, which claims to be "better than grep, a search tool for programmers". Woo. Better than grep? In what way?
The ack homepage lists the top ten reasons why one should use it instead of grep. Actually, it's thirteen reasons but then some are dupes. So I'd say "about ten reasons". Let's look at them in order.
- It's blazingly fast because it only searches the stuff you want searched.
Wait, how does it know what I want? A DWIM-Interface at last? Not quite. First off, ack is faster than grep for simple searches. Here's an example:
$ time ack 1Jsztn-000647-SL exim_main.log >/dev/null
real 0m3.463s
user 0m3.280s
sys 0m0.180s
$ time grep -F 1Jsztn-000647-SL exim_main.log >/dev/null
real 0m14.957s
user 0m14.770s
sys 0m0.160s
Two notes: first, yes, the file was in the page cache before I ran ack; second, I even made it easy for grep by telling it explicitly I was looking for a fixed string (not that it helped much, the same command without -F was faster by about 0.1s). Oh and for completeness, the exim logfile I searched has about two million lines and is 250M. I've run those tests ten times for each, the times shown above are typical.
So yes, for simple searches, ack is faster than grep. Let's try with a more complicated pattern, then. This time, let's use the pattern (klausman|gentoo) on the same file. Note that we have to use -E for grep to use extended regexen, which ack in turn does not need, since it (almost) always uses them. Here, grep takes its sweet time: 3:56, nearly four minutes. In contrast, ack accomplished the same task in 49 seconds (all times averaged over ten runs, then rounded to integer seconds).
As for the "being clever" side of speed, see below, points 5 and 6.
- ack is pure Perl, so it runs on Windows just fine.
This isn't relevant to me, since I don't use windows for anything where I might need grep. That said, it might be a killer feature for others.
- The standalone version uses no non-standard modules, so you can put it in your ~/bin without fear.
Ok, this is not so much of a feature than a hard criterion. If I needed extra modules for the whole thing to run, that'd be a deal breaker. I already have tons of libraries, I don't need more undergrowth around my dependency tree.
- Searches recursively through directories by default, while ignoring .svn, CVS and other VCS directories.
This is a feature, yet one that wouldn't pry me away from grep: -r is there (though it distinctly feels like an afterthought). Since ack ignores a certain set of files and directories, its recursive capabilities were there from the start, making it feel more seamless.
- ack ignores most of the crap you don't want to search
To be precise:
- VCS directories
- blib, the Perl build directory
- backup files like foo~ and #foo#
- binary files, core dumps, etc.
Most of the time, I don't want to search those (and have to exclude them with grep -v from find results). Of course, this ignore-mode can be switched off with ack (-u). All that said, it sure makes command lines shorter (and easier to read and construct). Also, this is the first spot where ack's Perl-centricism shows. I don't mind, even though I prefer that other language with P.
- Ignoring .svn directories means that ack is faster than grep for searching through trees.
Dupe. See Point 5
- Lets you specify file types to search, as in --perl or --nohtml.
While at first glance, this may seem limited, ack comes with a plethora of definitions (45 if I counted correctly), so it's not as perl-centric as it may seem from the example. This feature saves command-line space (if there's such a thing), since it avoids wild find-constructs. The docs mention that --perl also checks the shebang line of files that don't have a suffix, but make no mention of the other "shipped" file type recognizers doing so.
- File-filtering capabilities usable without searching with ack -f. This lets you create lists of files of a given type.
This mostly is a consequence of the feature above. Even if it weren't there, you could simply search for "."
- Color highlighting of search results.
While I've looked upon color in shells as kinda childish for a while, I wouldn't want to miss syntax highlighting in vim, colors for ls (if they're not as sucky as the defaults we had for years) or match highlighting for grep. It's really neat to see that yes, the pattern you grepped for indeed matches what you think it does. Especially during evolutionary construction of command lines and shell scripts.
- Uses real Perl regular expressions, not a GNU subset
Again, this doesn't bother me much. I use egrep/grep -E all the time, anyway. And I'm no Perl programmer, so I don't get withdrawal symptoms every time I use another regex engine.
- Allows you to specify output using Perl's special variables
This sounds neat, yet I don't really have a use case for it. Also, my perl-fu is weak, so I probably won't use it anyway. Still, might be a killer feature for you.
The docs have an example:
ack '(Mr|Mr?s)\. (Smith|Jones)' --output='$&'
- Many command-line switches are the same as in GNU grep:
Specifically mentioned are -w, -c and -l. It's always nice if you don't have to look up all the flags every time.
- Command name is 25% fewer characters to type! Save days of free-time! Heck, it's 50% shorter compared to grep -r
Okay, now we have proof that not only the ack webmaster can't count, he's also making up reasons for fun. Works for me.
Bottom line: yes, ack is an exciting new tool which partly replaces grep. That said, a drop-in replacement it ain't. While the standalone version of ack needs nothing but a perl interpreter and its standard modules, for embedded systems that may not work out (vs. the binary with no deps beside a libc). This might also be an issue if you need grep early on during boot and /usr (where your perl resides) isn't mounted yet. Also, default behaviour is divergent enough that it might yield nasty surprises if you just drop in ack instead of grep. Still, I recommend giving ack a try if you ever use grep on the command line. If you're a coder who often needs to search through working copies/checkouts, even more so.
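For readers who stay with grep, part of the directory skipping described in points 4 and 5 can be approximated with grep's own options. This is a rough sketch, assuming GNU grep (for --exclude-dir) and an invented directory tree. It is not a reimplementation of ack's file-type logic.

```shell
tree=$(mktemp -d)
mkdir -p "$tree/src" "$tree/.svn"
printf 'needle in source\n'   > "$tree/src/main.pl"
printf 'needle in backup\n'   > "$tree/src/main.pl~"
printf 'needle in metadata\n' > "$tree/.svn/entries"

# -r  recurse into directories
# -I  treat binary files as non-matching
# -l  print only the names of matching files
# --exclude-dir / --exclude drop VCS directories and editor backup files
hits=$(grep -rIl \
        --exclude-dir=.svn --exclude-dir=CVS --exclude-dir=.git \
        --exclude='*~' \
        'needle' "$tree" || true)
echo "$hits"

rm -rf "$tree"
```

Wrapping the option list in a shell alias or function gives a shorter command line, at the cost of the same "divergent defaults" caveat raised above.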
Update
I've written a followup on this, including some tips for day-to-day usage (and an explanation of grep's sucky performance).
Comments
René "Necoro" Neumann writes (in German, translation by me):
Stumbled across your blog entry about "ack" today. I tried it and found it to be cool :). So I created two ebuilds for it:
Just wanted to let you know (there is no comment function on your blog).
If you haven't been paying attention to GNU grep recently, you should be happily surprised by some of the new features and options that have come about with the 2.5 series. They bring it functionality you can't get anywhere else -- including the ability to output only matched patterns (not lines), color output, and new file and directory options.
Granted, the addition of this feature set caused a number of bugs that made it necessary to rewrite part of the code, but the latest 2.5.1a bugfix release is eminently usable.
One highlight of the new version is its ability to output only matched patterns. This is one of the most exciting features, because it adds completely new functionality to the tool. Remember, "grep" is an acronym -- it got its name from a function in the old Unix ed utility, global / regular expression / print -- and its purpose was to output lines from its input that match a given regular expression.
It remains such, but the new -o option (or --only-matching) specifies that only the matched patterns themselves are to be output, and not the entire lines they come on. If more than one match is found on a single line, those matches are output on lines of their own.
With this new option, suddenly GNU grep is transformed from a utility that outputs lines into a tool for harvesting patterns. You can use it to harvest data from input files, such as pulling out referrers from your server logs, or URLs from a file:
egrep -o '(((http(s)?|ftp|telnet|news|gopher)://|mailto:)[^\(\)[:space:]]+)' logfile
Or grab email addresses from a file:
egrep -o '\@/:[:space:]]+\>@[a-zA-Z_\.]+?\.[a-zA-Z]{2,3}' somefile
Use it to pull out all the senders from an email archive and sort into a file of unique addresses:
grep '^From: ' huge-mail-archive | egrep -o '\@/:[:space:]]+\>@[a-zA-Z_\.]+?\.[a-zA-Z]{2,3}' | sort | uniq > email.addresses
New uses for this feature keep popping up. You can use it, for instance, as a tool for testing regular expressions. Say you've whipped up a complicated regexp to do some task. You think it's the world's greatest regexp, it's going to do everything short of solving all the world's problems -- but at runtime, it doesn't seem to go as planned.
Next time this happens, use the -o option when you're in the design stage, and have grep read from the standard input, where you can feed it test data -- you'll see right away whether or not it matches exactly what you think it does. Since grep will be tossing back to you not the matched lines but the actual matches to the expression, it'll give you a pretty good clue how to fix it.
Output matches in color
Use the --color option to display matches in the input in color (red, by default). Color is added via ANSI escape sequences, which don't work in all displays, but grep is smart enough to detect this and won't use color (even if specified) if you're sending the output down a pipeline. Otherwise, if you piped the output to (say) less, the ANSI escape sequences would send garbage to the screen. If, on the other hand, that's really what you want to do, there's a workaround: use --color=always to force it, and call less with the -R flag (which prints all raw control characters). That way, the color codes will escape correctly and you'll page through screens of text with your matched patterns in full color:
grep --color=always "regexp" myfile | less -R
The GREP_COLOR environment variable controls which color is used. To change the color from red to something else, set GREP_COLOR to a numeric value according to this chart:
30 black
31 red
32 green
33 yellow
34 blue
35 purple
36 cyan
37 white
For example, to have matches highlighted in a shade of green:
GREP_COLOR=32; export GREP_COLOR; grep --color pattern myfile
Use Perl regexps
One of the biggest developments in regular expressions to occur in the last few decades has been the Perl programming language, with its own regular expression dialect. GNU grep now takes Perl-style regexps with the -P option. (It's not always compiled in by default, so if you get an error message of "grep: The -P option is not supported" when you try to use it, you'll have to get the sources and recompile.)
To search for a bell character (Ctrl-g), you can now use:
grep -P '\cG' myfile
This is considered a "major variant" of grep, as with the -E and -F options (which are the egrep and fgrep tools, respectively), but it doesn't yet come with an associated program name -- perhaps new versions will have a prep binary (it sounds much better than pgrep) that will mean the same thing as using -P.
Dealing with input
A number of new features have to do with files and input. The new --label option lets you specify a text "label" to standard input. Where it's really useful is when you're grepping a lot of files at once, plus standard input, and you're making use of the labels that grep prefixes its matches with. Normally, standard input would be the only one with a label you couldn't control -- it's always prefixed with "(standard input)" as its label. Now, it can be prefixed with whatever argument you give the --label option.
When searching through multiple files, you can control which files to search with the --include and --exclude options. For example, to search for "linux" only in files with .txt extensions in the /usr/local/src directory tree, use:
grep -r --include='*.txt' linux /usr/local/src
When you're recursively searching directories of files, you'll get errors when grep comes across a device file. With the new --devices option, you can specify what you want it to do on these files, by giving it an optional action. The default action is "read," which means to just read the file as any other file. But you can also specify "skip," which will skip the file entirely. Those are currently the only two methods for handling devices.
To search for "linux" in all files on the system, excluding special device files, use:
grep -r --devices=skip linux /
Finally, the --line-buffered option turns on line buffering, and -m (or --max-count) gives the maximum number of matched lines to show, after which grep will stop searching the given input. For example, this command searches a huge file with line buffering, exiting after at most 10 matched lines occur:
grep --line-buffered -m 10 pattern huge.file
grep changes quick reference
-C x prints context lines before and after matches and must have argument x.
--color outputs matches in color (default red).
-D action specifies an action to take on device files (the default is "read").
--exclude=filespec excludes files matching filespec.
--include=filespec only searches through files matching filespec.
--label=name makes name the new label for stdin.
--line-buffered turns on line buffering.
-m X stops searching input after finding X matched lines.
-o outputs only matched patterns, not entire lines.
-P uses Perl-style regular expressions.
POSIX updates
Some of the other new updates were made so that GNU grep conforms to POSIX.2, including subtle changes in exit status.
One of these changes is that the interpretation of character classes is now locale-dependent. That means that ranges specified in bracketed expressions like [A-Z] don't mean the same thing everywhere. If the system's current locale environment calls for its own characters or sorting, these settings will override any default character range.
Another related update is a change to the old -C option, which outputs a specified number of lines of context before and after matched lines. In the past, when you used -C without an argument, grep would output two lines of before-and-after context, but now you have to give an argument; if you don't, grep will report an error and exit. That's something to look out for if you've got any old shell scripts or routines sitting around that call grep.
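The -o, -m and -C behaviour described in this article can be tried on throwaway data before it goes into a script. The following is a small sketch with invented log lines and a deliberately simple URL pattern (an assumption for illustration, much looser than the one quoted above):

```shell
log=$(mktemp)
printf '%s\n' \
    'see http://example.org/a and ftp://example.org/b' \
    'no links here' \
    'also https://example.org/c' > "$log"

# -o prints every match on a line of its own, even when two share an input line.
urls=$(grep -E -o '(https?|ftp)://[^[:space:]]+' "$log")
echo "$urls"

# -m 1 stops reading after the first matching line.
first=$(grep -m 1 'example' "$log")
echo "$first"

# -C now needs an explicit count: one line of context on each side of the match.
context=$(grep -C 1 'no links' "$log")
echo "$context"

rm -f "$log"
```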
GNU grep comes with a recursive option (-r,-R) that allows you to recursively grep for a pattern through all files and any subdirectories.
But what happens if you aren't using GNU grep? You can use find to assist...
find /path/to/files -exec grep "pattern" {} \;
You can, of course, provide your usual options to grep, e.g.
find /path/to/files -exec grep -li "pattern" {} \;
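A common refinement of the commands above is to hand grep whole batches of files instead of starting it once per file. The sketch below uses a made-up directory tree. The `{} +` form is standard find syntax, while `-print0` and `xargs -0` are widespread extensions (GNU and BSD), so treat the second command as an assumption about the local tools.

```shell
root=$(mktemp -d)
mkdir -p "$root/sub"
printf 'pattern here\n'  > "$root/a.txt"
printf 'nothing\n'       > "$root/sub/b.txt"
printf 'Pattern again\n' > "$root/sub/c.txt"

# {} + passes many file names to a single grep instead of one grep per file (\;).
# -type f keeps directories and device files away from grep.
# /dev/null guarantees at least two file operands, so grep always prints file names.
matches=$(find "$root" -type f -exec grep -i 'pattern' /dev/null {} + | sort)
echo "$matches"

# The same idea through xargs, listing only the names of matching files.
# -print0 and -0 keep file names with spaces or newlines intact.
names=$(find "$root" -type f -print0 | xargs -0 grep -li 'pattern' | sort)
echo "$names"

rm -rf "$root"
```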
pcregrep searches files for character patterns, in the same way as other grep commands do, but it uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. See pcre(3) for a full description of syntax and semantics.
If no files are specified, pcregrep reads the standard input. By default, each line that matches the pattern is copied to the standard output, and if there is more than one file, the file name is printed before each line of output. However, there are options that can change how pcregrep behaves.
Lines are limited to BUFSIZ characters. BUFSIZ is defined in <stdio.h>. The newline character is removed from the end of each line before it is matched against the pattern.
- To: <chris@freebsd.org>
- Subject: Re: Replacing GNU grep revisited
- From: "James P. Howard II" <howardjp@vocito.com>
- Date: Mon, 23 Jun 2003 10:21:53 -0400 (EDT)
- Cc: <freebsd-hackers@freebsd.org>, <tech@openbsd.org>
- References: <20030621103502.K18572@thor.farley.org> <20030622005852.GB59673@HAL9000.homeunix.com> <20030622092848.R28123@thor.farley.org> <20030623104718.GA49264@holly.machined.net>
Chris Costello said:
> On Sunday, June 22, 2003, Sean Farley wrote:
>> Reasons to consider for switching:
>> 1. GNU's grep -r option "is broken" according to the following post.
>> The only thing I have noticed is that FreeGrep has more options for
>> controlling how symbolic links are traversed.
>> http://groups.google.com/groups?hl=en&lr=lang_en&ie=UTF-8&selm=xzp7kchblor.fsf_flood.ping.uio.no%40ns.sol.net
>
> A workaround for this problem in the meantime would be to use
>
> find <directory> -type f | xargs grep EXPR
>
> Just FYI.
Rumors of my demise are greatly exaggerated. And to call myself busy any more is an understatement.
But yes, I got an email from Ted Unangst telling me about the OpenBSD move to FreeGrep and this pleases me greatly. I have been glancing over their CVS tree (via the web) and they have made a number of changes to fix the bugs being discussed here.
Aside from a handful of errors (which are presumably correctable), the speed is still an issue. It is horribly slow when compared to the GNU version. FreeBSD will see better times than OpenBSD due to some changes made to the regex code a few years ago which I adapted from the 4.4BSD-Lite2 code for grep, but it still lags behind GNU in performance.
Jamie
Clearly, grep is a command I can’t live without. I constantly use it on its own and in pipes with other commands. For example:
% ps -aux | egrep 'chavez|PID'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
chavez 14355 0.0 1.6 2556 1792 pts/2 S 10:23 0:00 -tcsh
chavez 18684 89.5 9.6 27680 5280 ? R N Sep25 85:26 /home/j03/l988
I use this command combination often enough with different usernames that I’ve defined an alias for it.
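The author's alias is not shown, so the following is only a guess at its shape, written as a POSIX shell function rather than a csh alias. The name with_header is hypothetical, and the function is exercised on canned text so that the result does not depend on the live process table.

```shell
# Hypothetical helper: keep the header line (it contains "PID")
# plus every line that mentions the name given as $1.
with_header() {
    grep -E "$1|PID"
}

# Typical use on a real system:  ps aux | with_header chavez
sample_ps='USER PID %CPU COMMAND
chavez 14355 0.0 -tcsh
root 1 0.0 init
chavez 18684 89.5 /home/j03/l988'

filtered=$(printf '%s\n' "$sample_ps" | with_header chavez)
echo "$filtered"
```

Because the argument is pasted into a regular expression, a user name that contains regex metacharacters would need escaping.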
Less Well-Known Regular Expression Constructs
Most are familiar with the asterisk, plus sign, and question mark modifiers to regular expression items (match zero or more, one or more, or zero or one of the item, respectively). However, you can specify how many of each item should be matched even more precisely using some extended regular expression constructs (use egrep or grep -E):
Form    Meaning
{n}     Match exactly n of the preceding item.
{n,}    Match n or more of the preceding item.
{n,m}   Match at least n and no more than m of the preceding item.
Here are some simple examples:
% grep -E "t{2}" bio
She has written eight books, including Essential
Cultural Studies from Pitt. When she's not writing
% grep -E "[0-9]{3,}" bio
network of Unix and Windows NT/2000/XP systems. She
% grep -E "(the ){2,}|(and ){2,}" bio
and and creating murder mystery games. She
you'd like to receive the the free newsletter
The first command searches for double t’s; the second command looks for numbers of three or more digits; and the third command searches for two consecutive instances of the words “the” and “and” (it’s a primitive copy editor). You might be tempted to formulate the final item as:
(the |and ){2,}
However, this won’t work, as it will match “and the,” which is not generally an error.
Finally, be aware that the construct {,m}, which might mean “match m or fewer of the preceding item,” is not defined by POSIX. GNU grep accepts it as an extension; the portable spelling is {0,m}.
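The interval forms and the grouping pitfall can be checked on a few lines of test input. The sketch below uses invented sample text modelled on the bio file, since the original file is not available. The last command uses the portable {0,m} spelling.

```shell
bio=$(mktemp)
printf '%s\n' \
    'She has written eight books' \
    'network of Unix and Windows NT/2000/XP systems' \
    'and and creating murder mystery games' \
    'salt and the pepper' > "$bio"

# Exactly two consecutive t's ("written").
double_t=$(grep -c -E 't{2}' "$bio" || true)

# Runs of three or more digits ("2000").
digits=$(grep -c -E '[0-9]{3,}' "$bio" || true)

# Repeated word: each alternative carries its own interval.
repeated=$(grep -c -E '(the ){2,}|(and ){2,}' "$bio" || true)

# Interval applied to the whole group: also flags the harmless "and the ".
too_broad=$(grep -c -E '(the |and ){2,}' "$bio" || true)

# "At most two a's before b", written portably as {0,2}.
at_most_two=$(printf '%s\n' 'b' 'ab' 'aab' 'aaab' | grep -c -E '^a{0,2}b' || true)

echo "double_t=$double_t digits=$digits repeated=$repeated too_broad=$too_broad at_most_two=$at_most_two"
rm -f "$bio"
```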
Copyright © 1996-2008 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.
Last modified: September 11, 2008