05: Character Sets
In this exercise you'll learn to match sets of characters rather than just any character or a single character.
What is a Set?
A "set" is a list of items that must match exactly, but not in any particular order. The advantage of a set is that you can give a range of values to match, which makes it easier to write complex regex. For example, if you wanted to match only the lowercase letters of the alphabet you could write [a-z], which I read as "match 'a' through 'z'."
Sets in Regex
We now get into "operators" that are more complex than one character. Other operators like * do a lot with just the one character, but for a set to work you need the [ (left-bracket), the ] (right-bracket) and then some contents. Let's break down the set I mentioned to match the alphabet, [a-z]:
[ -- start the set
a-z -- range of characters from a to z. You can also put in any other characters outside that range.
] -- end the set
When you write this regex it means that grep will match one character that is specified in the set. It does not mean to match multiple characters that can be found in the set. You need more regex to do that, which I'll cover in a bit.
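Here is a minimal sketch of that "one character" rule, not part of the original exercise. It assumes your grep supports the -o option (GNU and BSD grep do), and the sample text "cat 42" is made up for illustration. The -o option prints every match on its own line, so you can see that each match is exactly one character long.

```shell
#!/bin/sh
# Assumption: force the C locale so a-z means exactly the 26 lowercase ASCII letters.
LC_ALL=C; export LC_ALL

# Each match of the set is a single character, so -o prints c, a, t on separate lines.
singles=$(printf 'cat 42\n' | grep -o '[a-z]')
printf '%s\n' "$singles"

# The range is just shorthand for listing every member of the set.
longhand=$(printf 'cat 42\n' | grep -o '[abcdefghijklmnopqrstuvwxyz]')
printf '%s\n' "$longhand"
```

Both commands print the same three lines, and the digits and the space never show up because they are not in the set.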
Try a few of these commands on the last poem:
$ grep "^[A-Z]" ex04.txt
I led you astray with promises
$ grep "^[a-z]" ex04.txt
once burnished and unspoken
yet without you here I apprehend
for everything I may have broken.
NOTE: Remember that the ^ in this case means "anchor to the start" and not how we use it next.
Inverted Sets
You can also add a ^ (caret) to the start of a regex set to say do not match the set. This is an inverted set where you're saying you do not want these characters. We can add a ^ (caret) to the previous example like this: [^a-z].
When we do this the regex will match any character that's not a lowercase letter a-z. So it will match an @ in an email address but not any of the lowercase letters.
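The email example can be sketched like this. The address and the three sample lines are hypothetical, and I'm assuming a grep that supports -o. The second command shows the two meanings of ^ side by side: the first caret anchors to the start of the line, the second inverts the set.

```shell
#!/bin/sh
# Assumption: force the C locale so a-z means exactly the 26 lowercase ASCII letters.
LC_ALL=C; export LC_ALL

# Everything in the address that is NOT a lowercase letter: the @ and the dot.
leftovers=$(printf 'user@example.com\n' | grep -o '[^a-z]')
printf '%s\n' "$leftovers"

# Anchor caret outside the set, inverting caret inside it:
# keep lines whose first character is not a lowercase letter.
starts=$(printf 'hello\nHello\n9 lives\n' | grep '^[^a-z]')
printf '%s\n' "$starts"
```

The first command prints @ and . on separate lines. The second prints "Hello" and "9 lives" but skips "hello".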
Try some of these commands on the last poem:
$ grep "[j-z]$" ex04.txt
I led you astray with promises
once burnished and unspoken
$ grep "[^j-z]$" ex04.txt
yet without you here I apprehend
unyielding vigilance
for everything I may have broken.
In this example I use a normal set, then have you compare it to the inverted version.
Escaping Inside Sets
There may be situations where you have to match the [, -, and ] characters inside the set. To do that just use the \ (backslash) character to make the regex explicitly escape them. You would do it like this: [\[\-\]], which would match any character that a regex uses for sets. Let's break this down to make sure you're reading each part as individual components instead of a big blob of confusing randomness:
[ -- start the set
\[ -- explicitly escape the [ character so that it is not interpreted as part of a regex set, but instead is the actual character [.
\- -- Same thing but for the - character.
\] -- Again, escaping this one so that your regex set doesn't end early.
] -- Finally ending the regex set for real.
One way you can think of the \ (backslash) escape is that it changes an operator into a normal character. If you use it on \* then you change the * operator to just the * character. Another way to think of \ is that it "kills" the next character, turning it into just a boring dead character instead of an alive active operator.
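One caveat worth checking on your own system: the backslash-escaped set is the Perl-style spelling, and POSIX-style tools such as plain grep generally treat a \ inside a set as an ordinary character rather than an escape. A minimal sketch of the POSIX-style alternative follows, which relies on position instead of escapes: put ] first and - last. The sample lines are made up for illustration.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Inside a POSIX set:
#   ] placed first is a literal ]
#   [ is literal anywhere
#   - placed last is a literal -
# So [][-] matches any one of the three characters ] [ -
found=$(printf 'a[b\nc-d\ne]f\nplain\n' | grep '[][-]')
printf '%s\n' "$found"
```

This prints the three lines that contain a bracket or a dash and skips "plain". If your regex engine is Perl-style (for example a scripting language's regex library), the escaped form from this section is the one to use.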
Combinations
A fundamental aspect of computation is combination. Everything you learn is usually designed to be combined with everything else you've learned. In fact, it's so common that if you run into something that can't be combined you'll think it's bizarre. This doesn't mean that every combination you can think of will work, but if you combine two things correctly according to the rules then they should work.
To practice this I want you to take what you know about sets, inverted sets, and everything you've learned so far to search through the two poems for different lines. See how complex you can make the combinations and still have them work. For example, what does this do:
$ grep -E "[aeiou][aeiou]+" ex04.txt
Here I combined sets with the + (plus) operator to find sequences of 2 or more vowels. The -E option matters because without it grep treats + as a literal plus sign. This is what you should be trying to create.
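To see exactly what a combination grabs, here is a small sketch on made-up words instead of the poem, assuming a grep with -o and -E. The second pattern is my own illustration of stacking an anchor, an inverted set, a plain set with *, and another inverted set.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# -E turns on the + operator; -o prints only the matched run of vowels.
runs=$(printf 'queue\nbook\ncat\n' | grep -oE '[aeiou][aeiou]+')
printf '%s\n' "$runs"

# Start of line, one non-vowel, any number of lowercase letters,
# then one character that is not a lowercase letter, then end of line.
shaped=$(printf 'tree.\napple.\ntree\n' | grep -E '^[^aeiou][a-z]*[^a-z]$')
printf '%s\n' "$shaped"
```

The first command prints "ueue" and "oo". The second prints only "tree." because "apple." starts with a vowel and "tree" ends with a letter.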
Operators So Far
As usual, update your flash cards and keep drilling. I promise if you suffer through the pain of memorization you'll learn Regex faster than if you just flap around randomly.
. -- Any one character.
\s -- Any space character (tabs, newlines, spaces).
\t -- A tab explicitly.
\n -- A newline explicitly.
\\ -- A backslash explicitly.
^ -- Match (anchor) the start of a line.
$ -- Match (anchor) the end of a line.
? -- Match zero or one.
* -- Match zero or more.
+ -- Match one or more.
[list] -- sets
[^list] -- inverted sets
You should also be combining all of these in as many ways as possible, and using grep on real data you have or can find.
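If you want to drill the three repetition operators against each other, something like the following sketch may help. The drill function and its word list are hypothetical names I made up, and I'm assuming grep -E with -x, which only accepts a line when the whole line matches.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Hypothetical helper: run one pattern against a fixed word list and
# join the matching words onto a single line.
drill() {
  printf 'ct\ncat\ncaat\n' | grep -xE "$1" | paste -s -d ' ' -
}

zero_or_one=$(drill 'ca?t')    # a may appear 0 or 1 times
zero_or_more=$(drill 'ca*t')   # a may appear 0 or more times
one_or_more=$(drill 'ca+t')    # a must appear at least once

printf '?: %s\n' "$zero_or_one"
printf '*: %s\n' "$zero_or_more"
printf '+: %s\n' "$one_or_more"
```

The ? row lists "ct cat", the * row lists "ct cat caat", and the + row lists "cat caat". Swap the a in each pattern for a set such as [aeiou] and predict the result before you run it.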
Further Study
- Take a look at the regular expression for email. How much of this do you understand? This is only for fun so don't worry if you don't really understand it. I don't think anyone actually does.
- Looking through that monstrosity, can you find anything interesting to research?
- Google "regex to parse html" and have a laugh at the various replies.