Regular Expressions

Authors

Tim Swenson

Publication

QL Hacker's Journal

Pub Details

QL Hacker's Journal 27 Issue: 27

Date

January 1998

In all the years that I’ve been dealing with Unix, one of the things that I have not taken the time to really learn is regular expressions. Regular expressions are a mini-language used for pattern matching in a number of Unix search utilities. The best known of these programs is grep and its variations fgrep and egrep. The name ‘grep’ itself comes from the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines.

No matter what operating system you have used, you have probably run across something like a regular expression. Most operating systems have a way of understanding something like this: “dir *.txt”. In MS-DOS this means to list all files that end with a .txt extension. In QDOS, the equivalent phrase would be “wdir flp1__txt”. The asterisk or star, “*”, is a wild card and means to match all strings. The asterisk is really a metacharacter. Metacharacters are special characters that carry a special meaning in the regular expression language instead of standing for themselves. More experienced users of MS-DOS may have used something like this: “dir *.e??”. This means to match all files whose extension starts with an e. It will match .exe, .efs, .exx, and others. The question mark is a metacharacter that matches any single character.

So what does all this mean to QDOS users? Well, a version of grep has been ported to the QL and comes with the C68 distribution. Grep is a very powerful and popular utility that can fill a number of needs. It is used to extract lines of text from files, and with its handling of regular expressions, it can be very smart about what it extracts. Once you know how grep works and how to use it, you will probably remember a time when it would have been useful to you.

With grep, you can do two things with its output: it can go to standard output or you can redirect it to a file. Since the QL does not have standard output, the QL version of grep opens a window to display its results. It also supports file redirection, which means that you can send the output of grep to a file to be dealt with later.

To demonstrate the file redirection, let’s take a look at a short grep example. In this example we have a text file and we want to find all lines that have the string “ql” in them:

   exec flp1_grep;"ql flp1_file_in > flp1_file_out"

Since we are using arguments, we have to put them in quotes after the grep command. The results of the grep will now be in the file flp1_file_out.
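For readers who think better in code, here is a minimal sketch of what grep does with that command line. It is written in Python, whose re module is only standing in for grep here, and the function name grep and the sample file contents are made up for the illustration.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, in order."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, as grep does
    return [line for line in lines if regex.search(line)]

# Hypothetical contents of flp1_file_in
file_in = [
    "my ql has two drives",
    "unix has grep",
    "c68 runs under qdos",
]

# Plays the role of flp1_file_out
file_out = grep("ql", file_in)
print(file_out)          # ['my ql has two drives']

# Matching is case sensitive, so an upper case pattern finds nothing here
print(grep("QL", file_in))   # []
```

The point of the sketch is only that grep is a line filter: every input line is tested against the pattern, and the lines that match are passed on.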

Before we go too far, let’s talk about three major concepts in regular expressions: characters, metacharacters, and character classes. A character is basically a byte, be it a text byte or binary byte. Metacharacters are a set of characters that are part of the regular expression language. In the examples above, the asterisk is a metacharacter. A character class is a way of matching a group of characters.

Let’s take a look at the metacharacters:

A character matches itself. Any character or string of characters is taken as a literal. If you want to find the string “ing” in a file you would use the regular expression “ing”. Most of the time when I am using grep, I use only literal characters.

A dot (.) matches any character, but only 1 character, similar to the question mark in MS-DOS. If you want to find a word in a text file that has three letters, starts with a B and ends with D, then you would use the regular expression B.D. (Grep is case sensitive; upper case lettering is used here only to highlight the example.)

The caret (^) means the beginning of a line. If you want to find all lines that start with the word “The”, you would use the regular expression “^The”.

The dollar sign ($) means the end of a line. If you want to find all lines that end with the word “end”, you would use the regular expression “end$”.
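The literal, dot, caret, and dollar sign can be tried out with a few lines of Python. This is just a sketch, assuming Python's re module behaves like grep for these four cases; the helper matching and the sample lines are invented for the example.

```python
import re

# Hypothetical lines of a text file
lines = [
    "The bird is singing",
    "a bad bud",
    "bend it",
    "this is the end",
]

def matching(pattern):
    """Return the sample lines that contain a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]

print(matching("ing"))    # literal string:      ['The bird is singing']
print(matching("b.d"))    # b, any one char, d:  ['a bad bud']
print(matching("^The"))   # The at line start:   ['The bird is singing']
print(matching("end$"))   # end at line end:     ['this is the end']
```

Note that “bend it” contains “end” but is not found by “end$”, because the match has to sit at the very end of the line.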

The question mark (?) is used to match an optional character. If you wanted to find the word “color” but don’t know if the British spelling “colour” is used, the regular expression “colou?r” would work. The ? makes the character just before it, here the u, optional.

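The plus (+) is used to match one or more of the item just before it. If you want to find “help” only where at least one more character follows it, as in helper or helps, but not help on its own, you would use the regular expression “help.+”. The dot stands for any character, and the plus must match at least one of them or the match will fail.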

The asterisk (*) is used like +, but it allows a null match: it matches zero or more of the item just before it. To find the words helper, helps and help, the regular expression “help.*” would work. The dot followed by an asterisk allows for no character at all, as in the case of just help.
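The three repetition metacharacters are easy to get wrong, because each one applies only to the single item immediately before it. The sketch below, again using Python's re module as an assumed stand-in for egrep, tests some patterns against single words; the helper name found is made up.

```python
import re

def found(pattern, text):
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# ? : the preceding item (the u) is optional
print(found("colou?r", "color"))    # True
print(found("colou?r", "colour"))   # True

# + : one or more of the preceding item (here the dot, any character)
print(found("help.+", "helper"))    # True
print(found("help.+", "help"))      # False, nothing follows help

# * : zero or more of the preceding item
print(found("help.*", "help"))      # True, zero extra characters is fine
print(found("help.*", "helps"))     # True

# Without the dot, the + applies to the letter p only
print(found("help+", "help"))       # True, hel followed by one p
print(found("help+", "helpp"))      # True, hel followed by two p's
```

The last two lines show why the dot matters: “help+” on its own means “hel followed by one or more p's”, so it still matches plain help.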

To get a little more power out of regular expressions, there is a metacharacter for the logical OR, the pipe symbol (|). Say you have a text file with a bunch of e-mail messages and you want to find all of the From and Subject lines. You would use the regular expression “From|Subject”.

Now that you know how to use the OR metacharacter, you will find that you need to limit the OR. That’s where the parentheses () come in. Using the last example of finding the From and Subject lines from e-mail messages, the regular expression “From|Subject” will also find lines with either word anywhere in them. With e-mails, the From in the From line is always followed by a colon: “From:”. The same goes for Subject. Now how do we write a regular expression for this? One way is this: “From:|Subject:”. This will work, but a “cleaner” approach is this: “(From|Subject):”. Since AND’s are assumed in regular expressions, what you get is this: “( From OR Subject ) AND :”. Just like in math, the parentheses control the bounds of the OR condition.
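Here is the e-mail example as a small Python sketch, assuming Python's re module treats the pipe and parentheses the way egrep does. The message lines and the address are hypothetical.

```python
import re

# Hypothetical lines from a file of e-mail messages
lines = [
    "From: someone@example.com",
    "Subject: grep for the QL",
    "A note From a friend",
    "No Subject was given",
]

def matching(pattern):
    """Return the sample lines that contain a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]

# Plain OR: any line containing either word, headers or not
print(len(matching("From|Subject")))      # 4

# Grouped OR followed by a colon: only the two header lines
print(matching("(From|Subject):"))
# ['From: someone@example.com', 'Subject: grep for the QL']

# The longer spelling finds the same lines
print(matching("From:|Subject:") == matching("(From|Subject):"))   # True
```

Anchoring the pattern with a caret, as in “^(From|Subject):”, would tighten it further to headers that start a line.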

The backslash (\) is used to make a metacharacter a literal. If you want to look for all lines that end with a full sentence, meaning they end with a period, you could try the following regular expression: “.$”. But, since the period is a metacharacter, you will find all lines that end with any character. To get grep to use the period as a period, you need to use the backslash like this: “\.$”. The backslash tells grep to take the next character as a literal and not to interpret it.
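The difference between an escaped and an unescaped period shows up clearly in a short test. This is a sketch in Python with made-up sample lines; the r in front of the pattern is a Python raw string, used so that Python itself leaves the backslash alone and hands it to the regular expression.

```python
import re

# Hypothetical lines of text
lines = [
    "This is a sentence.",
    "This one trails off",
    "Is it a question?",
]

def matching(pattern):
    """Return the sample lines that contain a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]

# Unescaped: the dot matches any last character, so every line is found
print(len(matching(".$")))    # 3

# Escaped: only a real period at the end of the line counts
print(matching(r"\.$"))       # ['This is a sentence.']
```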

Character classes are used as a way to search for groups of characters. Say you wanted to match the digits 1 to 3. You could do this with “(1|2|3)”. Using the brackets, you could also create a character class “[123]”. The true power of the character class comes when using the dash. Between two characters inside the brackets, the dash creates a range of characters (metacharacters mean something else when in a character class). In the last example, the character class could also be written as “[1-3]”, meaning all characters from 1 to 3. To define the lower case letters of the alphabet the character class would be “[a-z]”. Since grep is case sensitive, a better character class would be “[a-zA-Z]”.
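A quick way to see what a character class accepts is to ask for every single-character match in a test string. The sketch below does that with findall from Python's re module, which I am assuming agrees with grep on bracket syntax; the test strings are invented.

```python
import re

digits = "0123456789"

# Three ways of asking for the digits 1 to 3
print(re.findall("(1|2|3)", digits))   # ['1', '2', '3']
print(re.findall("[123]", digits))     # ['1', '2', '3']
print(re.findall("[1-3]", digits))     # ['1', '2', '3']

# A dash between two characters is a range; both cases are listed here
print(re.findall("[a-zA-Z]", "QL 27, Jan 1998"))   # ['Q', 'L', 'J', 'a', 'n']

# A period inside brackets is only a period, it does not make a range
print(re.findall("[1.3]", "1.2.3"))    # ['1', '.', '.', '3']
```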

You can mix up characters in a character class any way you like. Say you have to find all occurrences of numerical dates in a file. Dates could be written as 7-23-97, or 7/23/97, or even 7.23.97. You want to find any dates with a dash, slash, or period. You would create the character class “[-/.]”. Because the dash comes first in the class, it is taken as a literal dash and does not define a range. Remember that the period means only itself when inside a character class and does not mean to match a single character. So to find our dates, we would use the regular expression “7[-/.]23[-/.]97”.
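The date pattern can be checked against a handful of candidate strings. This is a sketch in Python, assuming its re module reads the bracket expression the same way grep does; the list of samples is made up, with two deliberate non-dates at the end.

```python
import re

# The date pattern from the text: separator may be dash, slash, or period
date = re.compile(r"7[-/.]23[-/.]97")

samples = ["7-23-97", "7/23/97", "7.23.97", "7 23 97", "7x23x97"]

hits = [s for s in samples if date.search(s)]
print(hits)   # ['7-23-97', '7/23/97', '7.23.97']
```

Had the pattern been written as “7.23.97” without the brackets, the last two samples would have matched as well, since a bare period accepts any character.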

The caret (^) means something else when used in a character class; it means to negate the class. If you want to match anything but digits, you would create the character class “[^0-9]”. The caret negates the class only when it comes immediately after the opening bracket. If it is used after that, it only means itself. The character class “[-.^]” matches only a dash, period, or caret.
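The two roles of the caret inside brackets can be compared side by side. As before this is only a sketch using Python's re module in place of grep, with test strings chosen for the illustration.

```python
import re

# Caret first: negated class, everything that is NOT a digit
print(re.findall("[^0-9]", "7-23-97"))    # ['-', '-']

# Caret later in the class: just a literal caret
print(re.findall("[-.^]", "2^8 - 1.5"))   # ['^', '-', '.']
```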

If you are interested in learning more, check out the book “Mastering Regular Expressions” by Jeffrey Friedl.
