
Created by Zed A. Shaw Updated 2024-10-08 04:45:56
 

Exercise 32: Scanners

My first book, Learn Python The Hard Way, covered scanners very casually in Exercise 48, but now we're going to get more formal. I'll explain the concept behind scanning text, how that relates to regular expressions, and how you can create a little scanner for a tiny piece of Python code.

Let's use the following Python code as an example to start the discussion:

def hello(x, y):
    print(x + y)

hello(10, 20)

You've been training in Python for a while, so your brain most likely can read this code quickly, but do you really understand what it is? When I (or someone else) taught you Python I had you memorize all the "symbols". The def and () characters are each symbols, but there needs to be a way for Python to process these that is reliable and consistent. Python also needs to be able to read hello and understand it's a "name" for something, and then later know the difference between def hello(x, y) and hello(10, 20). How does it do this?

The first step is scanning the text for "tokens". At the scanning phase a language like Python doesn't yet care what's a symbol (def) and what's a name (hello). It simply tries to break the input text into pieces that match known patterns, and these pieces are called "tokens". It does this by applying a sequence of regular expressions that "match" each of the possible inputs that Python understands. You'll remember from Exercise 31 that a regular expression is a way to tell Python what sequences of characters to match or accept. All the Python interpreter does is use many regular expressions to match every token it understands.

If you look at the code above, you might be able to write a set of regular expressions to process it. You'd need a simple regex for def that's just "def". You'd need more for the ()+:, characters. You'd then be left with how to handle print, hello, 10, and 20.
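Here's a minimal sketch of that first attempt using Python's re module. The names DEF_RE and PUNCT_RE are my own for illustration, and I'm assuming a single character class is good enough for the punctuation at this stage. It shows what the two easy regexes pick up and what is left over afterward:

```python
import re

# Illustrative patterns only; the names are assumptions, not part of any API.
DEF_RE = re.compile(r"def")
PUNCT_RE = re.compile(r"[()+:,]")  # inside [...] these characters need no escaping

code = "def hello(x, y):\n    print(x + y)\n\nhello(10, 20)\n"
line = code.splitlines()[0]

print(DEF_RE.match(line).group())  # def
print(PUNCT_RE.findall(line))      # ['(', ',', ')', ':']

# Blank out everything the two regexes recognize and see what remains.
leftover = DEF_RE.sub(" ", PUNCT_RE.sub(" ", code)).split()
print(leftover)
# ['hello', 'x', 'y', 'print', 'x', 'y', 'hello', '10', '20']
```

The leftover list is exactly the problem described above: words and numbers that still need their own patterns.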

Once you've identified all the symbols in the code sample above you need to name them. You can't just refer to them by their regex, since it's inefficient to look up and also confusing. Later you'll learn that giving each symbol its own name (or number) simplifies parsing, but for now, let's devise some names for these regex patterns. I could say def is simply DEF, then ()+:, can be LPAREN RPAREN PLUS COLON COMMA. After that I can call the regex for words like hello and print simply NAME. By doing this I'm coming up with a way to convert the stream of raw text someone enters into a stream of numbered (or named) tokens to use in later stages.
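One way to sketch this naming step is a list of (pattern, name) pairs. The TOKENS list and the classify function are hypothetical names I made up for this example, and I'm assuming each piece of text has already been cut out of the line:

```python
import re

# Assumed token table. Order matters: "def" also fits the NAME pattern,
# so DEF has to be tried before NAME.
TOKENS = [
    (re.compile(r"def"), "DEF"),
    (re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*"), "NAME"),
    (re.compile(r"\("), "LPAREN"),
    (re.compile(r"\)"), "RPAREN"),
    (re.compile(r"\+"), "PLUS"),
    (re.compile(r":"), "COLON"),
    (re.compile(r","), "COMMA"),
]


def classify(text):
    """Return the name of the first pattern matching all of text, or None."""
    for pattern, name in TOKENS:
        if pattern.fullmatch(text):
            return name
    return None


pieces = ["def", "hello", "(", "x", ",", "y", ")", ":"]
names = [classify(piece) for piece in pieces]
print(names)
# ['DEF', 'NAME', 'LPAREN', 'NAME', 'COMMA', 'NAME', 'RPAREN', 'COLON']
```

If you'd rather have numbers than names, the index of each pair in the list would work as the token number.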

Python is also tricky because it needs a leading whitespace regular expression to handle the indenting and dedenting of the code blocks. For now, let's just use a fairly dumb one of ^\s+ and then pretend that this also captures how many spaces were used at the beginning of the line.
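To make the "pretend" part concrete, here's a small sketch that counts the leading whitespace on a single line. The function name indent_width is hypothetical, and I'm assuming every whitespace character (including a tab) counts as one space:

```python
import re

INDENT_RE = re.compile(r"^\s+")


def indent_width(line):
    """Return how many whitespace characters start the line (0 if none)."""
    match = INDENT_RE.match(line)
    return len(match.group()) if match else 0


print(indent_width("    print(x + y)"))  # 4
print(indent_width("hello(10, 20)"))     # 0
```

This only measures one line at a time. Working out when a block is indented or dedented means comparing the width of each line to the one before it, which this sketch doesn't attempt.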

Eventually you'd have a set of regular expressions that can handle the preceding code and it might look something like this:

| Regex | Token |
|-------|-------|
| def | DEF |
| [a-zA-Z_][a-zA-Z0-9_]* | NAME |
| [0-9]+ | INTEGER |
| \( | LPAREN |
| \) | RPAREN |
| \+ | PLUS |
| : | COLON |
| , | COMMA |
| ^\s+ | INDENT |

The job of a scanner is to take these regular expressions and use them to break the input text into a stream of identified symbols. If I do that to the code example I could produce this:

DEF NAME(hello) LPAREN NAME(x) COMMA NAME(y) RPAREN COLON
INDENT(4) NAME(print) LPAREN NAME(x) PLUS NAME(y) RPAREN
NAME(hello) LPAREN INTEGER(10) COMMA INTEGER(20) RPAREN

Study this transformation by matching each line of the scanner output against the Python code above, using the regular expressions in the table. You'll see that the scanner simply takes the input text, matches each regex to its token name, and then saves any information needed, like hello or the number 10.
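If you want to check your reading of that transformation, here's a rough sketch of a scanner built from the table. It is not the only way to do it. The names TOKEN_PATTERNS, SPACE_RE, scan, and SAMPLE are my own, and I'm assuming that blank lines produce no tokens and that whitespace inside a line is skipped:

```python
import re

# Patterns from the table, tried in this order at the current position.
TOKEN_PATTERNS = [
    (re.compile(r"def"), "DEF"),
    (re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*"), "NAME"),
    (re.compile(r"[0-9]+"), "INTEGER"),
    (re.compile(r"\("), "LPAREN"),
    (re.compile(r"\)"), "RPAREN"),
    (re.compile(r"\+"), "PLUS"),
    (re.compile(r":"), "COLON"),
    (re.compile(r","), "COMMA"),
]
INDENT_RE = re.compile(r"^\s+")
SPACE_RE = re.compile(r"\s+")  # assumption: spaces inside a line are skipped


def scan(code):
    """Return a list of (token_name, value) pairs for the given code."""
    tokens = []
    for line in code.splitlines():
        if not line.strip():
            continue  # assumption: blank lines produce no tokens
        pos = 0
        indent = INDENT_RE.match(line)
        if indent:
            # Save the number of leading whitespace characters.
            tokens.append(("INDENT", len(indent.group())))
            pos = indent.end()
        while pos < len(line):
            space = SPACE_RE.match(line, pos)
            if space:
                pos = space.end()
                continue
            for pattern, name in TOKEN_PATTERNS:
                match = pattern.match(line, pos)
                if match:
                    tokens.append((name, match.group()))
                    pos = match.end()
                    break
            else:
                raise ValueError(f"unexpected character {line[pos]!r}")
    return tokens


SAMPLE = "def hello(x, y):\n    print(x + y)\n\nhello(10, 20)\n"

for name, value in scan(SAMPLE):
    print(name, value)
```

With these exact patterns a name such as define would scan as DEF followed by NAME(ine), because the def pattern is tried first and matches the first three letters. Adding a word boundary (def\b) is one way to avoid that.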
