06: Advanced Repetition
I'll be honest with you and admit that I almost never use these. It's not that these operations aren't useful, it's more that I find most of the other operations do everything I need. I think you may find the same thing, so consider this more of an "expansion" exercise. You learn this more to stretch your understanding of regex than to learn something you'll use constantly.
New Test File
Make a new file for this exercise named ex06.txt and put this in it:
444-55-5555
22-234567
4444 5555 6666 7777 8888
+91 444 44444
555-624-3476
These are common formats you'll find in business. The first is a US Social Security Number (SSN). The second is an EIN, which is like an SSN for businesses. The third is a credit card number. The last two are phone numbers from different countries.
Exactly i Sequences
If you need to match a fixed length of characters or numbers then this is what you want. You use it the same way you use other repetition (*, +) but instead you write { (left-curly-bracket), a count, and } (right-curly-bracket). For example, if you want to match a US SSN you may try to write something like this:
[0-9]+-[0-9]+-[0-9]+
If you run this on the sample file for this exercise you'll realize this doesn't work:
ugrep "[0-9]+-[0-9]+-[0-9]+" .\ex06.txt
444-55-5555
555-624-3476
See how it also picks up the phone number? You only want the SSN 444-55-5555. This is where the exact sequence comes in:
[0-9]{3}-[0-9]{2}-[0-9]{4}
You replace the + or * in a regex with {3} and it will match exactly 3 of the previous pattern. Running this regex we get the correct answer:
ugrep "[0-9]{3}-[0-9]{2}-[0-9]{4}" .\ex06.txt
444-55-5555
If we break this down we get this:
[0-9] -- You know this as the set syntax which says "only characters 0 through 9", so numbers.
{3} -- This matches those numbers but exactly 3 of them. This will match the 444 part of 444-55-5555.
- (dash) -- Be sure you understand that this dash is not a regex operator, it is an exact character to match. It matches the - at the end of 444-.
[0-9] -- This is again matching only numbers 0-9.
{2} -- Now we match exactly two of those numbers.
- (dash) -- Again this is not a regex operator, just an exact character to match.
[0-9] -- One last time, just match numbers 0-9.
{4} -- And finally match exactly 4 of those numbers.
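If you want to check the two patterns side by side outside of ugrep, here is a rough Python sketch. It assumes Python's re module treats these operators the same way ugrep does for this case, and it hard-codes the sample lines instead of reading ex06.txt. The names SAMPLE_LINES, LOOSE, EXACT, and matching_lines are made up for illustration.

```python
import re

# The contents of ex06.txt, hard-coded so the sketch is self-contained.
SAMPLE_LINES = [
    "444-55-5555",
    "22-234567",
    "4444 5555 6666 7777 8888",
    "+91 444 44444",
    "555-624-3476",
]

LOOSE = r"[0-9]+-[0-9]+-[0-9]+"        # one or more digits in each group
EXACT = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"  # exactly 3, then 2, then 4 digits


def matching_lines(pattern, lines=SAMPLE_LINES):
    """Return the lines that contain a match anywhere, like a grep search."""
    return [line for line in lines if re.search(pattern, line)]


print(matching_lines(LOOSE))  # the SSN and the US phone number
print(matching_lines(EXACT))  # only the SSN
```

The phone number drops out of the second result because its middle group has 3 digits, and {2} followed by a dash can't fit that.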
Make sure you go through and confirm you understand each part I describe, then take the time to write a regex for each line in the file that finds that line but none of the others.
The "What then How Much" Pattern
Hopefully you're understanding a common pattern in regex:
- Write the thing you want to match, or not to match.
- Then write how much of that thing to match.
You can see this pattern in most of the regex you've used so far:
[A-Z]* -- I want to match capital letters [A-Z]. How much I want to match is zero or more *.
0+ -- I want to match the number 0. How much I want to match is one or more +.
[0-9]{3} -- I want to match any number 0 through 9 [0-9]. How much I want is exactly 3 {3}.
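A small Python sketch can make the "what, then how much" split concrete by building each regex from its two halves. The PARTS table, the build helper, and the sample strings are all invented for this illustration.

```python
import re

# Each entry: (what to match, how much of it, a sample that should fit).
PARTS = [
    ("[A-Z]", "*", "ABC"),
    ("0", "+", "000"),
    ("[0-9]", "{3}", "123"),
]


def build(what, how_much):
    """Glue the 'what' and the 'how much' into one regex."""
    return what + how_much


for what, how_much, sample in PARTS:
    pattern = build(what, how_much)
    # fullmatch requires the whole sample to fit the pattern.
    fits = re.fullmatch(pattern, sample) is not None
    print(pattern, sample, fits)
```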
This is effectively backwards from how you might say these kinds of patterns normally. You would say:
"Match 3 numbers."
You wouldn't say:
"Match numbers, 3 only."
Which is one of the reasons why regex are confusing to people. Once you get used to this though it's not too difficult to understand.
Between i and j, Inclusive
It's now fairly easy to understand the next sequence operation, {i,j}. It says "find i through j, inclusive" matches. The word "inclusive" is important. It means that it includes the number for i and the number for j. Another way to say inclusive is, "Up to and including j." For example, if you write {3,4} it will find 3 up to and including 4, so 3 or 4.
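To see the inclusive bounds in action, here is a minimal Python sketch. It uses re.fullmatch so the whole string has to fit the pattern, which makes the lower and upper limits easy to see. The names BETWEEN_3_AND_4 and fits are hypothetical.

```python
import re

BETWEEN_3_AND_4 = r"[0-9]{3,4}"


def fits(pattern, text):
    """True when the whole text matches the pattern, nothing left over."""
    return re.fullmatch(pattern, text) is not None


for text in ["12", "123", "1234", "12345"]:
    print(text, fits(BETWEEN_3_AND_4, text))
# 12 False
# 123 True
# 1234 True
# 12345 False
```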
Using this knowledge try to find two lines at a time with one regex. For example, find SSNs and Phone Numbers.
i or More Sequences
Finally we have the {i,} sequence, which means i or more occurrences. You can think of it like + but with a minimum that you choose. The + will find 1 or more, but {i,} lets you set any minimum number of occurrences. For example, if you write [0-9]{3,} it will match 3 or more numbers.
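The difference between + and {3,} shows up on short inputs. This Python sketch compares the two on a few strings, again using re.fullmatch so the whole string has to fit. The names ONE_OR_MORE, THREE_OR_MORE, and whole_match are made up for the example.

```python
import re

ONE_OR_MORE = r"[0-9]+"
THREE_OR_MORE = r"[0-9]{3,}"


def whole_match(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for text in ["1", "12", "123", "123456"]:
    print(text, whole_match(ONE_OR_MORE, text), whole_match(THREE_OR_MORE, text))
# 1 True False
# 12 True False
# 123 True True
# 123456 True True
```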
Using this knowledge, try to match other lines in more combinations. You should also try adding number formats you know to this file and match those exactly.
Your Operators So Far
Time to update your flash cards. As I mentioned in the beginning, I don't really use these but learn them anyway because they might come up in some rare situations.
. -- Any one character.
\s -- Any space character (tabs, newlines, spaces).
\t -- A tab explicitly.
\n -- A newline explicitly.
\\ -- A backslash explicitly.
^ -- Match (anchor) the start of a line.
$ -- Match (anchor) the end of a line.
? -- Match zero or one.
* -- Match zero or more.
+ -- Match one or more.
[list] -- Sets.
[^list] -- Inverted sets.
{i} -- Exactly i sequences.
{i,j} -- Between i and j, inclusive, sequences.
{i,} -- More than or equal to i sequences.
Some Uses For Sequences
The only place I've really found a use for this is input validation. They aren't as useful for searching, but they do help you confirm that input from users matches a format you need. For example, if you want people's phone numbers on a website then you can use [0-9]{3}-[0-9]{3}-[0-9]{4} to confirm the input fits the US format.
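As a rough sketch of that kind of check in Python, assuming the site only wants the 3-3-4 dashed form: re.fullmatch insists the whole input fits, which is similar in spirit to anchoring the pattern with ^ and $. The names US_PHONE and is_us_phone are hypothetical.

```python
import re

US_PHONE = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"


def is_us_phone(text):
    """Accept the input only if all of it fits the 3-3-4 dashed format."""
    return re.fullmatch(US_PHONE, text) is not None


print(is_us_phone("555-624-3476"))   # True
print(is_us_phone("555-624-34767"))  # False, one digit too many
print(is_us_phone("+91 444 44444"))  # False, different format

# A plain search would wrongly accept the too-long input, because the
# first 12 characters on their own fit the pattern.
print(re.search(US_PHONE, "555-624-34767") is not None)  # True
```

For validation you usually want the whole-input check, since a search only tells you that a valid-looking piece appears somewhere in what the user typed.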
However, keep in mind that it's pretty hard to get validations like this right. No matter how "correct" you think a format is you'll always find some part of the world that surprises you.
Further Study
- As usual, try to drill the operators as much as you can. You can either suffer through the rote memorization now, or bumble around randomly learning the same thing anyway.
- Research all the different credit card formats you can find and see if you can devise patterns to detect them.
- Do the same thing for phone numbers.
- If you wanted to make sure that someone's password had numbers, letters, special characters, and was a certain length could you use a regex?