The Beginner’s Guide to Regular Expressions

Have you ever tried to find a recurring pattern in a piece of text? You might have used something like the search function in your browser or word processor, but when you need to find something more complex, it can be like finding a needle in the proverbial haystack.

Fortunately, there’s a way to pick out precise patterns in text, right down to the character: regular expressions. Once you learn them, you can search through text with far more control.

If you’ve paid any attention to Linux utilities, you’ve probably noticed that they frequently make use of regular expressions. Although Unix and Linux made them popular, regular expressions are available in a variety of other programs, including Microsoft Word.


Regular expressions are used in several well-known Linux programs, including grep (commonly expanded as “Global Regular Expression Print,” after the ed command g/re/p), awk and sed.

It’s best to think of regular expressions as a little language, the basics of which can be described in a small space.

You can search with tools like grep or ack, reading either from standard input or from a text file.

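For example, if you were trying to find the term “firefox” in the output of the ps command, here’s how you would do it. Note that grep is case-sensitive by default, and that plain ps usually lists only the processes attached to your current terminal, so the -e option is used here to list every process:

ps -e | grep firefox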

And here’s how you’d find the term “maketecheasier” in a file:

grep maketecheasier somefile
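The two input styles can be tried side by side. This is only a minimal sketch: the sample file and its contents are made up for the demonstration, and it assumes the mktemp utility is available on your system.

```shell
# Hypothetical sample file, created only for this demonstration
sample=$(mktemp)
printf '%s\n' 'welcome to maketecheasier' 'nothing to see here' > "$sample"

# Search a file named on the command line
from_file=$(grep maketecheasier "$sample")

# Search standard input, fed to grep through a pipe
from_stdin=$(cat "$sample" | grep maketecheasier)

# Both searches print the same matching line
printf '%s\n' "$from_file" "$from_stdin"

rm -f "$sample"
```

Either way, grep prints each whole line that contains a match, not just the matching word.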

Regular expressions also let you match parts of a string without spelling out every character. You do this with special characters called metacharacters, and two of them do most of the work. They resemble the wildcards you might have used in the shell, but they behave differently.

  • The “.” metacharacter stands for any single character. The pattern “c.t” matches the words “cat,” “cut” and “cot,” for example.
  • The “*” metacharacter matches the preceding character zero or more times. The pattern “l.*x” would find “linux,” as well as any other text that has an “l” followed somewhere later by an “x.”
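To see both metacharacters at work, here is a small sketch that runs grep over a made-up word list (the words are purely illustrative):

```shell
# Made-up word list, one word per line
words=$(printf '%s\n' cat cut cot coat ct linux lx latex)

# "c.t": a "c", exactly one character of any kind, then a "t"
dot_matches=$(printf '%s\n' "$words" | grep 'c.t')

# "l.*x": an "l", any run of characters (possibly none), then an "x"
star_matches=$(printf '%s\n' "$words" | grep 'l.*x')

printf 'c.t matches:\n%s\n' "$dot_matches"
printf 'l.*x matches:\n%s\n' "$star_matches"
```

Here “coat” and “ct” are skipped by the first pattern because they have two characters or none between the “c” and the “t,” while “lx” is picked up by the second because “.*” is allowed to match nothing at all.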

You can’t just use something like "l*x" the way you would in the shell, because in a regular expression the “*” applies to the character before it. The pattern "l*x" means zero or more “l” characters followed by an “x,” so it matches any line that contains an “x,” whether or not there is an “l” anywhere in it. That is rarely what you want.
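The difference between the two patterns is easy to check for yourself. The following sketch uses a few made-up lines:

```shell
# Made-up input lines
star_lines=$(printf '%s\n' linux box llx lamp)

# "l*x": zero or more "l" characters, then "x", so any line with an "x" qualifies
loose=$(printf '%s\n' "$star_lines" | grep 'l*x')

# "l.*x": there must be an "l" somewhere before an "x"
strict=$(printf '%s\n' "$star_lines" | grep 'l.*x')

printf 'l*x matches:\n%s\n' "$loose"
printf 'l.*x matches:\n%s\n' "$strict"
```

The first pattern also returns “box,” which has no “l” in it at all, while the second returns only “linux” and “llx.”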

You can also find patterns starting at the beginning or the end of lines.

  • The “^” character matches at the beginning of a line.
  • The “$” character matches at the end of a line.

For example, "sier$" would match the line “Make Tech Easier” because it ends in “sier,” and "^Make" would match any line that starts with “Make.”
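As a quick illustration of the anchors, this sketch searches three made-up lines, two of which contain “Make” and two of which contain “sier”:

```shell
# Made-up input lines
anchor_lines=$(printf '%s\n' 'Make Tech Easier' 'Easier said than done' 'We Make things')

# Without an anchor, "Make" is found anywhere on the line
anywhere=$(printf '%s\n' "$anchor_lines" | grep -c 'Make')

# "^Make" only matches lines that begin with "Make"
starts=$(printf '%s\n' "$anchor_lines" | grep '^Make')

# "sier$" only matches lines that end with "sier"
ends=$(printf '%s\n' "$anchor_lines" | grep 'sier$')

printf 'lines containing Make: %s\n' "$anywhere"
printf 'starts with Make: %s\n' "$starts"
printf 'ends with sier: %s\n' "$ends"
```

Only “Make Tech Easier” satisfies either anchored pattern, even though the unanchored search finds “Make” on two lines.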

You can also match sets of characters. A pair of square brackets matches any single character from the set listed inside it, and a hyphen between two characters defines a range. For example, “[a-z]” matches any one lowercase letter, “[a-zA-Z]” matches any one letter, and “[a-zA-Z0-9]” matches any one alphanumeric character. As the first character inside the brackets, “^” negates the set: “[^a-zA-Z]” matches any character that’s not a letter.
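The bracket patterns can be sketched with a handful of made-up strings. One assumption to flag: the commands set LC_ALL=C so that ranges such as “a-z” follow plain ASCII order, since some locales sort letters differently.

```shell
# Made-up sample strings, one per line
samples=$(printf '%s\n' abc ABC 123 a1 '---')

# Lines containing at least one lowercase letter
lower=$(printf '%s\n' "$samples" | LC_ALL=C grep '[a-z]')

# Lines made up entirely of letters (-x requires the whole line to match)
letters_only=$(printf '%s\n' "$samples" | LC_ALL=C grep -x '[a-zA-Z]*')

# Lines containing at least one character that is not a letter
non_letter=$(printf '%s\n' "$samples" | LC_ALL=C grep '[^a-zA-Z]')

printf 'has lowercase:\n%s\n' "$lower"
printf 'letters only:\n%s\n' "$letters_only"
printf 'has a non-letter:\n%s\n' "$non_letter"
```

Note that a bracket set on its own matches just one character, so the second search adds “*” and the -x option to test the whole line.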

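You can also find word boundaries with “\<” and “\>”, which match the start and the end of a word. (GNU grep supports them; some other tools do not.) The pattern “\<Linux\>” matches “Linux” as a whole word, but not the “Linux” inside “LinuxMint.”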
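Here is a short sketch of the word-boundary markers, assuming GNU grep and using made-up lines:

```shell
# Made-up input lines
boundary_lines=$(printf '%s\n' 'I run Linux at home' 'LinuxMint is a distro' 'Try Linux!')

# Plain "Linux" matches on every line, including inside "LinuxMint"
plain_count=$(printf '%s\n' "$boundary_lines" | grep -c 'Linux')

# "\<Linux\>" only matches "Linux" standing alone as a word
whole_word=$(printf '%s\n' "$boundary_lines" | grep '\<Linux\>')

printf 'lines containing Linux: %s\n' "$plain_count"
printf 'whole-word matches:\n%s\n' "$whole_word"
```

Punctuation counts as a boundary, so “Try Linux!” still matches, while “LinuxMint” does not.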

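You can match something a specific number of times with curly brackets. “{3}” matches the preceding item exactly three times, and “{3,5}” matches it between three and five times. In grep’s default (basic) syntax the brackets have to be escaped, as in “\{3\}”; with grep -E you can write them as shown.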
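Repetition counts can be sketched the same way. The strings below are made up, and the -x option is used so that the count applies to the whole line; without it, “a{3}b” would also be found inside a longer run of “a” characters.

```shell
# Made-up strings with 1, 2, 3, 5 and 7 "a" characters before the "b"
repeat_lines=$(printf '%s\n' ab aab aaab aaaaab aaaaaaab)

# Extended syntax (grep -E): exactly three "a" characters, then "b"
exactly_three=$(printf '%s\n' "$repeat_lines" | grep -E -x 'a{3}b')

# Extended syntax: between three and five "a" characters, then "b"
three_to_five=$(printf '%s\n' "$repeat_lines" | grep -E -x 'a{3,5}b')

# Basic syntax (plain grep): the same count with escaped brackets
basic_three=$(printf '%s\n' "$repeat_lines" | grep -x 'a\{3\}b')

printf 'exactly three: %s\n' "$exactly_three"
printf 'three to five:\n%s\n' "$three_to_five"
printf 'basic syntax: %s\n' "$basic_three"
```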

With these simple building blocks, you can match some pretty complicated stuff. There’s a lot more to regular expressions than can be explained in a short article. If you want a comprehensive book-length treatment of the subject, you should definitely check out Mastering Regular Expressions by Jeffrey E.F. Friedl.

Disclosure: This article contains an affiliate link. While we only write about products we think deserve to be on this site, Make Tech Easier may earn a small commission if you click through and buy the product in question.

Image Credit: xkcd