08: Capturing
Capturing is easily the most useful thing about regular expressions. It lets you "capture" the text matched by one part of the regex so you can extract it. This allows you to pull only that part out, or to rewrite things in new ways while keeping other parts.
Capture Example
Imagine you have a CSV file with lines like this:
age,weight,height
10,100,45
25,220,85
Let's say you want to extract only the weight part of each line, but ignore the rest. You do that by putting () (parentheses, or parens) around what you want like this:
.*,([0-9]+),.*
This will match everything up to the first , (comma), then one or more digits, then another , followed by the rest of the line. The ([0-9]+) tells sed to "capture" whatever matches inside those parens for later use.
How to Use a Capture
How do you then use the capture? Each capture is given a number starting with 1, counted left to right, and you put it into the edit rule with \1. sed only supports \1 through \9, so if you had 9 captures then the last one would be \9. Here's how you would use this on a file named ex8.csv:
sed -n -E "s:.*,([0-9]+),.*:\1:pg" .\ex8.csv
NOTE: This looks different from our previous sed commands because I'm using the : character to separate each part. This is to show you that sed mostly doesn't care what you use after the s to separate the command. You can use s/// or s::: or s||| and you need this because sometimes your pattern will have that character. For example, if you try to match paths then / is in the path and you should use a character like s||| or s,,, instead.
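To see why the delimiter choice matters, here is a small sketch that rewrites a path prefix twice, once with / and once with | as the delimiter. It assumes a POSIX shell (sh or bash) with single quotes around the rule. The sample path and the variable names are made up for illustration.

```shell
# Hypothetical sample input: a path that contains the / character.
path="/usr/local/bin/sed"

# With / as the delimiter, every / inside the pattern and replacement
# needs a backslash in front of it.
with_slash=$(printf '%s\n' "$path" | sed -E 's/\/usr\/local/\/opt/')

# With | as the delimiter, the same rule needs no escaping at all.
with_pipe=$(printf '%s\n' "$path" | sed -E 's|/usr/local|/opt|')

echo "$with_slash"
echo "$with_pipe"
```

Both commands print /opt/bin/sed, but the second rule is much easier to read.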
There are a few new things here, so let's break the sed -n -E command above down:
- sed - You know this, it's the sed command.
- -n - The -n option says be quiet until told to print the results.
- -E - The -E option says to use the "extended" regular expression syntax, a more advanced syntax that lets you write ( ) captures and + without backslashes.
- "s - This starts your stream edit rule, and we put " around it so the shell passes it to sed as a single argument.
- : - We're using the : character to delimit the parts of the rule so that it's easier to read. Otherwise we'd have to deal with / and \ characters.
- .*,([0-9]+),.* - This is the regular expression with the capture for the middle number.
- : - Another delimiter. Make sure you understand this is like the / we normally use, and sed really doesn't care what you put here as long as it matches the first one.
- \1 - This would normally be your replacement, but here we're "replacing" with the first capture's contents using \1.
- : - Ends the replacement and starts the flags.
- pg" - The p flag "prints" whatever it found, which pairs with the -n command line option (switch) we gave sed. The g flag applies the rule to every match on a line, which makes no difference here because the pattern covers the whole line. The " simply terminates the whole edit rule.
Be sure you understand each part of this and try to write your own before continuing.
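If you want something to experiment with, here is a minimal sketch that builds the sample file and runs the capture command against it. It assumes a POSIX shell (sh or bash) rather than the .\ex8.csv style path used above, and the temp directory and variable names are only for illustration. The second command is an extra example, not from the exercise, showing several captures used at once to reorder the fields.

```shell
# Assumption: a POSIX shell; the sample file lives in a throwaway temp directory.
tmpdir=$(mktemp -d)
printf 'age,weight,height\n10,100,45\n25,220,85\n' > "$tmpdir/ex8.csv"

# One capture: print only the weight field of each data line.
# The header line has no digits, so it does not match and -n keeps it quiet.
weights=$(sed -n -E 's:.*,([0-9]+),.*:\1:pg' "$tmpdir/ex8.csv")
echo "$weights"

# Three captures: \1 is age, \2 is weight, \3 is height.
# The replacement writes them back out as height,age,weight.
reordered=$(sed -n -E 's:^([0-9]+),([0-9]+),([0-9]+)$:\3,\1,\2:p' "$tmpdir/ex8.csv")
echo "$reordered"

rm -r "$tmpdir"
```

The first command prints 100 and 220 on separate lines. The second prints 45,10,100 and 85,25,220.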
Alternating with | (pipe)
A slight side note is that you can "alternate" between two patterns with | inside parens. Let's say you want to match only the number 100 or 220. You would change the command to this:
sed -n -E "s:.*,(100|220),.*:\1:pg" .\ex8.csv
The important change here is (100|220) which says "match 100 or 220". If you change this part to (100|320) then you'll only match the line with 100 in it and not the one with 220, because you are now saying "match 100 or 320."
You also do not have to use this alternating feature (|) with captures. You can use it to replace text (or grep for lines) and never use \1 or other captures. For example, if we only want to replace the line containing 100 with fixed text (without using a capture), it's this:
sed -E "s:.*,(100|320),.*:TEST:pg" .\ex8.csv
Which will print out this:
age,weight,height
TEST
TEST
25,220,85
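Notice that only the one line with 100 was replaced, and the last line with 220 was left alone. TEST shows up twice because this command leaves off -n: sed prints every line by default, and the p flag prints the replaced line a second time.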
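In the output above, TEST shows up twice even though only one line contains 100. A small sketch, assuming a POSIX shell and the same three-line sample data, compares the command with and without -n so you can see where the extra line comes from. The temp directory and variable names are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'age,weight,height\n10,100,45\n25,220,85\n' > "$tmpdir/ex8.csv"

# Without -n: sed prints every line on its own, and the p flag prints
# the replaced line one more time, so TEST appears twice.
without_n=$(sed -E 's:.*,(100|320),.*:TEST:p' "$tmpdir/ex8.csv")

# With -n: only the line that matched the alternation is printed, once.
with_n=$(sed -n -E 's:.*,(100|320),.*:TEST:p' "$tmpdir/ex8.csv")

echo "$without_n"
echo "---"
echo "$with_n"

rm -r "$tmpdir"
```

The first result is the four-line output shown above, and the second is a single TEST line.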
Using sed to Explore Captures
In the next exercise we'll get into advanced options for sed but you should take some time now to explore captures with sed. Here's how you can do it:
- Find text files that have lines with delimited text. Web server log files are really good for this, but any log file from any server will work.
- Take each field in the log file and get a regex that matches it. This will be trial and error but you'll use the captures to pull out each one.
- Once you have a regex that works for each field, create a large sed command that extracts each field and prints it in the new form. For example, you could extract all the fields from a webserver log and output a .csv file.
This will be a fairly long project, but if you can pull it off you will definitely know regular expressions and sed well. You should combine this part of the exercise with the next exercise so you can learn even more about sed.
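As a starting point for that project, here is one possible sketch that turns two made-up log lines into CSV rows. It assumes the log is in the common "host ident user [time] "request" status bytes" layout, which your own logs may not follow, and it assumes a POSIX shell. The file name access.log, the sample lines, and the CSV column names are all invented for this example.

```shell
tmpdir=$(mktemp -d)
# Two hypothetical log lines in the assumed layout.
printf '%s\n' \
  '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326' \
  '192.168.0.7 - alice [10/Oct/2000:14:02:11 -0700] "POST /login HTTP/1.1" 302 512' \
  > "$tmpdir/access.log"

# Captures, left to right:
#   \1 host            ([^ ]+)   everything up to the first space
#   \2 time            ([^]]+)   everything between [ and ]
#   \3 method          ([A-Z]+)  first word inside the quotes
#   \4 path            ([^ ]+)   second word inside the quotes
#   \5 status          ([0-9]+)
#   \6 bytes           ([0-9]+)
# The : delimiter is safe here because the pattern itself contains no colon.
csv=$(sed -n -E 's:^([^ ]+) [^ ]+ [^ ]+ \[([^]]+)\] "([A-Z]+) ([^ ]+) [^"]*" ([0-9]+) ([0-9]+)$:\1,\2,\3,\4,\5,\6:p' "$tmpdir/access.log")

echo "ip,time,method,path,status,bytes"
echo "$csv"

rm -r "$tmpdir"
```

Building the regex one field at a time, as the steps above suggest, makes it much easier to find which piece fails when a line does not match.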
Further Study
- Study the POSIX extended regular expression syntax and try as many things as you can. You might want to use grep (ugrep) instead of sed for this. You'll need to use -E in grep to get that syntax as well.
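For trying patterns quickly, a minimal grep sketch might look like the following. It assumes a POSIX shell and reuses the sample data from this exercise. The second pattern and the variable names are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'age,weight,height\n10,100,45\n25,220,85\n' > "$tmpdir/ex8.csv"

# -E turns on the extended syntax, so ( ), | and + work without backslashes.
matches=$(grep -E ',(100|220),' "$tmpdir/ex8.csv")
echo "$matches"

# -c counts the matching lines instead of printing them.
# This pattern matches lines made only of comma-separated numbers.
count=$(grep -c -E '^[0-9]+(,[0-9]+)+$' "$tmpdir/ex8.csv")
echo "$count"

rm -r "$tmpdir"
```

The first command prints the two data lines, and the second prints 2 because the header line contains no digits.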