Created by Zed A. Shaw Updated 2024-12-10 18:57:40

08: Capturing

Capturing is easily the most useful thing about regular expressions. It lets you "capture" the text matched by one part of the regex so you can extract it. This allows you to pull out only that part, or to rewrite a line in a new way while keeping other parts of it.

Capture Example

Imagine you have a CSV file with lines like this:

age,weight,height
10,100,45
25,220,85

Let's say you want to extract only the weight part of each line, but ignore the rest. You do that by putting () (parentheses, or parens) around what you want, like this:

.*,([0-9]+),.*

This will match everything up to a , (comma), then one or more digits, then another , followed by the rest of the line. The ([0-9]+) tells sed to "capture" whatever matches inside those parens for later use. The header line has no digits between its commas, so it doesn't match at all.

How to Use a Capture

How do you then use the capture? Each capture is given a number starting with 1, and you put it into the edit rule with \1. sed only supports nine of these, \1 through \9, so \10 would be read as \1 followed by a literal 0. Here's how you would use this on a file named ex8.csv:

sed -n -E "s:.*,([0-9]+),.*:\1:pg" .\ex8.csv

NOTE: This looks different from our previous sed commands because I'm using the : character to separate each part. This is to show you that sed mostly doesn't care what you use after the s to separate the parts of the command. You can use s/// or s::: or s||| and you need this because sometimes your pattern will have that character in it. For example, if you try to match paths then / is in the path and you should use a form like s||| or s,,,.
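Here's a minimal sketch of why the delimiter choice matters when the pattern contains /. The sample path is made up for illustration, and the single quotes assume a POSIX shell such as bash rather than the Windows shell used in the command above:

```shell
# Hypothetical sample line containing a path; not one of the exercise files.
line='/usr/local/bin/sed'

# With / as the delimiter, every / in the pattern needs a backslash.
with_slash=$(printf '%s\n' "$line" | sed -E 's/\/usr\/local\/(.*)/\1/')

# With | as the delimiter, the same rule needs no escaping at all.
with_pipe=$(printf '%s\n' "$line" | sed -E 's|/usr/local/(.*)|\1|')

echo "$with_slash"
echo "$with_pipe"
```

Both rules print bin/sed, but the second one is much easier to read. The trade-off is that once | is the delimiter you can't also use it for alternating inside the pattern.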

There are a few new things here, so let's break this command down:

sed
You know this, it's the sed command.
-n
The -n option tells sed to be quiet and print nothing until it is told to print the results.
-E
The -E option says to use the "extended" regular expression syntax, which is a more advanced syntax that lets you write the ( ) capture parens and + without backslashes in front of them.
"s
This starts your stream edit rule, and we put " around it so it's not interpreted by the shell.
:
We're using the : character to delimit the parts of the rule so that it's easier to read. Otherwise any / in a pattern would have to be written as \/.
.*,([0-9]+),.*
This is the regular expression with the capture for the middle number.
:
Another delimiter. Make sure you understand this is like the / we normally use, and sed really doesn't care what you put here as long as it matches the first one.
\1
This would normally be your replacement text, but here we're "replacing" the line with the first capture's contents using \1.
:
The last delimiter. What follows it are the options for the rule.
pg"
This uses the p option to "print" the line after the replacement is made, which pairs with the -n command line option (switch) we gave sed. The g option means "replace every match on the line", which changes nothing here because the pattern already matches the whole line. The " simply ends the whole edit rule.

Be sure you understand each part of this and try to write your own before continuing.
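If you want a starting point for writing your own, here's a small sketch that uses three captures at once to rewrite each line with the first two fields swapped. It recreates the sample data in a temporary file (the csv variable is just a name picked for this example), assumes a POSIX shell such as bash, and uses ^ and $ to pin the pattern to the whole line:

```shell
# Recreate the sample data from this exercise in a scratch file.
csv=$(mktemp)
printf 'age,weight,height\n10,100,45\n25,220,85\n' > "$csv"

# Three captures, one per field, printed back out as weight,age,height.
# The header line has no digits, so it doesn't match and -n keeps it quiet.
reordered=$(sed -n -E 's:^([0-9]+),([0-9]+),([0-9]+)$:\2,\1,\3:p' "$csv")
echo "$reordered"

rm -f "$csv"
```

This prints 100,10,45 and then 220,25,85. Each \1, \2, and \3 refers to the parens in the order they open, counting from the left.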

Alternating with | (pipe)

A slight side note is that you can "alternate" between two patterns with | inside parens. Say you want to match only the number 100 or 220. You would change the command to this:

sed -n -E "s:.*,(100|220),.*:\1:pg" .\ex8.csv

The important change here is (100|220) which says "match 100 or 220". If you change this part to (100|320) then you'll only match the line with 100 there and not 220 because you are now saying "match 100 or 320."

You also do not have to use this alternating feature (|) with captures. You can use it to replace text (or grep for lines) and never use \1 or other captures. For example, if we only want to replace lines containing 100 or 320 with the text TEST (without using a capture), it's this:

sed -E "s:.*,(100|320),.*:TEST:pg" .\ex8.csv

Which will print out this:

age,weight,height
TEST
TEST
25,220,85

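Notice that TEST shows up twice even though only one line contains 100. There's no -n this time, so sed prints every line on its own and the p option prints the replaced line a second time. The last line with 220 doesn't match 100 or 320, so it comes through unchanged.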
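Here's a short sketch of alternating without any capture, once with grep and once with sed. It assumes a POSIX shell such as bash and rebuilds the sample data in a temporary file, so the file name differs from the one in the commands above:

```shell
# Recreate the sample data from this exercise in a scratch file.
csv=$(mktemp)
printf 'age,weight,height\n10,100,45\n25,220,85\n' > "$csv"

# grep -E prints whole lines that match; nothing is captured or replaced.
matches=$(grep -E ',(100|220),' "$csv")
echo "$matches"

# With -n and p together, sed prints each replaced line exactly once.
replaced=$(sed -n -E 's:.*,(100|320),.*:TEST:p' "$csv")
echo "$replaced"

rm -f "$csv"
```

The grep prints both data lines, and the sed prints a single TEST for the one line that contains 100.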

Using sed to Explore Captures

In the next exercise we'll get into advanced options for sed but you should take some time now to explore captures with sed. Here's how you can do it:

  1. Find text files that have lines with delimited text. Web server log files are really good for this, but any log file from any server will work.
  2. Take each field in the log file and get a regex that matches it. This will be trial and error but you'll use the captures to pull out each one.
  3. Once you have a regex that works for each field, then create a large sed command that extracts each field and prints it in the new form. For example, you could extract all the fields from a webserver log and output a .csv file.

This will be a fairly long project, but if you can pull it off you will definitely know regular expressions and sed well. You should combine this part of the exercise with the next exercise so you can learn even more about sed.
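To give you an idea of where this project ends up, here's a rough sketch that turns two web server log lines into CSV. The log lines are invented for this example and follow a simplified version of the common access log layout, so your real logs will almost certainly need a different pattern. It also assumes a POSIX shell such as bash:

```shell
# Hypothetical log lines; real server logs may have more or different fields.
log=$(mktemp)
printf '%s\n' \
  '127.0.0.1 - - [10/Dec/2024:18:57:40 +0000] "GET /index.html HTTP/1.1" 200 1043' \
  '10.0.0.7 - - [10/Dec/2024:18:58:02 +0000] "POST /login HTTP/1.1" 302 0' > "$log"

# One capture per field we keep: ip, time, method, path, status, bytes.
# [^ ]+ means "one or more characters that are not a space".
csv_out=$(sed -n -E 's:^([^ ]+) [^ ]+ [^ ]+ \[([^]]+)\] "([A-Z]+) ([^ ]+) [^"]*" ([0-9]+) ([0-9]+)$:\1,\2,\3,\4,\5,\6:p' "$log")

echo 'ip,time,method,path,status,bytes'
echo "$csv_out"

rm -f "$log"
```

The : delimiter works here because the pattern and the replacement never contain a literal : of their own, even though the timestamps in the data do. Build a pattern like this one field at a time, checking each capture before adding the next.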

Further Study
