Summaries/Python/Algemeen/RE.md


title updated created
RE 2022-04-25 11:42:43Z 2022-04-25 11:29:11Z

Giving a group a label and looking at the results as a dictionary is pretty useful. For that we use the syntax (?P<name>), where the parenthesis starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>.

for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['edit_link'])
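As a self-contained illustration of named groups, here is a minimal sketch. The `sample` string is a hypothetical stand-in for the `wiki` text used in these notes, which is not reproduced here.

```python
import re

# Hypothetical stand-in for the wiki text (an assumption, not the real data).
sample = "Overview[edit]\nAccess to education[edit]\n"

# Two named groups: the header title and the literal [edit] marker.
pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"

# groupdict() returns one dictionary per match, keyed by group name.
matches = [m.groupdict() for m in re.finditer(pattern, sample)]
for d in matches:
    print(d["title"], "->", d["edit_link"])
```

Each dictionary holds only the named groups, so unnamed groups in a pattern would not show up in it.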

Look-ahead and Look-behind

One more concept to be familiar with is "look-ahead" and "look-behind" matching. Here the pattern given to the regex engine describes text that comes either before or after the text we are trying to isolate. For example, in our headers we want to isolate the text that comes before the [edit] rendering, but we don't care about the [edit] text itself. Thus far we have been throwing the [edit] away, but if we want to use it to match without capturing it, we can put it in a group and use a look-ahead with the ?= syntax.

for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])",wiki):
    # This regex has two groups. The first is named title and matches one or more spaces or
    # regular word characters. The second is a look-ahead for the characters [edit]: it must
    # follow the title, but it is not consumed and is not part of the output match objects
    print(item)
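The sketch below shows both directions on small inputs. Both strings are hypothetical examples, and the look-behind pattern with `(?<=...)` is an illustration added here, since the notes only show the look-ahead form.

```python
import re

# Hypothetical stand-in for the wiki text (an assumption, not the real data).
sample = "Overview[edit]\nAccess to education[edit]\n"

# Look-ahead: the title must be followed by [edit], but [edit] is not consumed.
first = re.search(r"(?P<title>[\w ]+)(?=\[edit\])", sample)
titles = [m.group("title") for m in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", sample)]
print(first.group(0), first.span())
print(titles)

# Look-behind: digits must be preceded by a dollar sign, which is not returned.
prices = re.findall(r"(?<=\$)\d+", "price: $40, discount: $5")
print(prices)
```

Note that the match span of `first` stops right before `[edit]`, which is what distinguishes a look-ahead from an ordinary group.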

I'll actually use this example to show you the verbose mode of Python regexes. Verbose mode allows you to write multi-line regexes and increases readability. In this mode, whitespace in the pattern is ignored, so we have to explicitly indicate all whitespace characters, either by prepending them with a \ or by using the \s special value. However, this means we can write our regex a bit more like code, and can even include comments with #.

pattern=r"""
(?P<title>.*)        # the university title
(\ located\ in\ )    # an indicator of the location
(?P<city>\w*)        # city the university is in
(,\ )                # separator for the state
(?P<state>\w*)       # the state the city is located in"""

# Now when we call finditer() we just pass the re.VERBOSE flag as the last parameter. This makes it much
# easier to understand large regexes!
for item in re.finditer(pattern,wiki,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())
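To make the verbose pattern runnable on its own, here is a minimal sketch. The `sample` lines are made up for illustration and are not taken from the original data.

```python
import re

# Hypothetical input lines (assumptions for illustration only).
sample = (
    "Example University located in Springfield, Illinois\n"
    "Sample College located in Portland, Oregon"
)

# Raw string so the backslash escapes reach the regex engine unchanged.
pattern = r"""
(?P<title>.*)        # the university title
(\ located\ in\ )    # an indicator of the location
(?P<city>\w*)        # city the university is in
(,\ )                # separator for the state
(?P<state>\w*)       # the state the city is located in"""

# re.VERBOSE makes the engine ignore unescaped whitespace and # comments.
records = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
for record in records:
    print(record)
```

The unnamed groups still take part in the match, but only `title`, `city`, and `state` appear in each dictionary.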

Let's create a pattern. We want to include the hash sign first, then any number of alphanumeric characters, and we end when we see some whitespace.

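pattern = r'#[\w\d]*(?=\s)'
# Notice that the ending is a look-ahead, so the whitespace is required but not returned.
# (\w already covers digits, so \d is redundant but harmless.)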
re.findall(pattern, health)
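A minimal sketch of this pattern on a made-up string follows. The `health` value here is a hypothetical stand-in for the real text, and the `(?=\s|$)` variant is a suggestion added here, not part of the original notes.

```python
import re

# Hypothetical stand-in for the health text (an assumption, not the real data).
health = "Walk more #health #fitness2022 and eat well #diet\n"

pattern = r"#[\w\d]*(?=\s)"
tags = re.findall(pattern, health)
print(tags)

# The look-ahead requires whitespace, so a tag at the very end of a string
# with no trailing whitespace is not matched.
missed = re.findall(pattern, "ends with #last")
print(missed)

# One possible variant: also accept the end of the string after the tag.
caught = re.findall(r"#\w*(?=\s|$)", "ends with #last")
print(caught)
```

Whether the end-of-string case matters depends on the input; text that always ends in a newline is unaffected.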