Hello everybody,
It’s Michael, and today’s post will cover regular expressions in Python. I know I already did a Java lesson on RegEx (the colloquial term for regular expressions; here’s the link to that lesson: Java Lesson 18: RegEx (or Regular Expressions)), but I wanted to cover how to use regular expressions in Python, so here goes.
To start working with regular expressions in Python, import the regular expressions module using this line of code: import re.
Now, here’s a simple example of regular expressions in Python:
text = "Jupyter Notebook is awesome!!!"
print(re.findall('e', text))
['e', 'e', 'e', 'e']
In this example, I have a string that reads Jupyter Notebook is awesome!!!. In the print statement, I’m using the re module’s findall function to find all of the e’s in text. The findall function returns a list of all of the e’s found in text, of which there were four, and print displays that list.
- To use the re module’s search functions (such as findall), you’ll need at least two parameters: the string/character/pattern you want to search for and the string where you want to search for it.
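To make the two-parameter idea concrete, here is a small sketch that reuses the same text string. The names e_matches, oo_matches, and z_matches are just ones I picked for illustration.

```python
import re

text = "Jupyter Notebook is awesome!!!"

# findall returns every non-overlapping match, scanning left to right
e_matches = re.findall("e", text)
print(len(e_matches))   # 4

# the pattern can be longer than a single character
oo_matches = re.findall("oo", text)
print(oo_matches)       # ['oo']

# when nothing matches, findall returns an empty list rather than raising an error
z_matches = re.findall("z", text)
print(z_matches)        # []
```

Since findall always gives back a list, you can pass the result to len to count matches.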
Now, let’s try a more complex RegEx example in Python:
text2 = "Tonight is a very beautiful spring night"
print(re.search("ht$",text2))
<re.Match object; span=(38, 40), match='ht'>
In this example, I am using the search function to find the pattern ht$ in the string. An important thing to note is that, unlike the findall function, the search function only returns the first match it finds in the string being analyzed (in this case, text2).
- Similar to the findall function, you’ll need to include two parameters for the search function: the pattern/string/character you’re searching for and the string where you are looking for it.
Now, you’re probably wondering what ht$ means. The dollar sign ($) is called a metacharacter, and in the context of regular expressions, metacharacters help define more specific search criteria for the pattern/string/character you are looking for. In the case of ht$, the search function checks whether text2 ends with ht; a Match object is returned if it does, and None is returned if it doesn’t. In this case, a Match object was returned; the span shown in the Match object tells you where in the string the match was found (as index positions). For text2, the match starts at index 38 and ends at index 39 (the end value of 40 is exclusive, so index 40 isn’t part of the match); in other words, the match covers the 39th and 40th characters of text2 (recall that string indexing starts with 0).
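Here is a short sketch of how you might pull the pieces out of a Match object instead of just printing it. The variable names match and no_match, and the pattern xyz$, are my own picks for illustration.

```python
import re

text2 = "Tonight is a very beautiful spring night"

match = re.search("ht$", text2)
if match:                      # a Match object is truthy
    print(match.group())       # the matched text: ht
    print(match.start())       # index where the match starts: 38
    print(match.end())         # index just past the match: 40
    print(match.span())        # (38, 40)

# search returns None when the pattern is not found
no_match = re.search("xyz$", text2)
print(no_match)                # None
```

Checking the result with if before calling group is a good habit, since calling group on None raises an AttributeError.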
Here’s a list of other useful regex metacharacters:
- []: find a set of characters
  - example: [aeiouy] can be used to find all of the vowels in text2 (and yes, I’ll consider y a vowel). Don’t put commas or spaces between the characters, since anything inside the brackets is treated as part of the set.
- \: use a special sequence (more on this later)
- .: find any character (except the \n newline character)
  - example: n...t can be used in text2 to find any five-character sequence that starts with n, ends with t, and has any three characters in between (such as night)
- ^: find a match only if the string starts with a certain character/pattern
  - example: ^To can be used to check whether text2 starts with “To”
- $: find a match only if the string ends with a certain character/pattern (I explained this metacharacter in the above example)
- *: find zero or more occurrences of the preceding character
  - example: ni* matches an n followed by any number of i’s (including none), so in text2 it finds “ni” as well as a lone “n”. The * only applies to the i; to repeat the whole pattern “ni”, group it as (ni)*.
- +: find at least one occurrence of the preceding character
  - example: ni+ matches an n followed by one or more i’s, so it finds each “ni” in text2
- {}: find a specific number of occurrences of the preceding character
  - example: ni{1} matches an n followed by one AND ONLY ONE i in the string text2.
- |: find a match that contains either pattern/string
  - example: tonight|today can be used to find out whether text2 contains either tonight or today. Keep in mind that matching is case-sensitive by default, so the lowercase tonight won’t match the capitalized “Tonight” in text2.
  - note: | is the same symbol that Python uses for its bitwise OR operator (conditional logic in Python uses the keyword or), but it takes a slightly different, albeit conceptually similar, meaning when dealing with Python regex.
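To see several of these metacharacters in action, here is a quick sketch that runs them against text2. The variable names are my own, and the vowel set is written without commas or spaces, since those would be matched literally.

```python
import re

text2 = "Tonight is a very beautiful spring night"

vowels = re.findall("[aeiouy]", text2)      # every vowel (counting y) in text2
print(len(vowels))                          # 13

dot_matches = re.findall("n...t", text2)    # n, any three characters, then t
print(dot_matches)                          # ['night', 'night']

starts_with_to = re.search("^To", text2)    # Match object, since text2 starts with "To"
print(starts_with_to.span())                # (0, 2)

star_matches = re.findall("ni*", text2)     # n followed by zero or more i's
print(star_matches)                         # ['ni', 'n', 'ni']

plus_matches = re.findall("ni+", text2)     # n followed by one or more i's
print(plus_matches)                         # ['ni', 'ni']

lower_alt = re.search("tonight|today", text2)   # case-sensitive, so no match
print(lower_alt)                                # None

upper_alt = re.search("Tonight|today", text2)
print(upper_alt.group())                        # Tonight
```

Notice how ni* picks up the lone n in spring, while ni+ does not.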
Now I know I mentioned special sequences in the list above, so let’s see some special sequences in action:
text3 = "Today's date is 04/10/2021"
# the r prefix makes this a raw string, so Python passes \d to the re module unchanged
print(re.split(r"\d{2}/\d{2}/\d{4}", text3))
["Today's date is ", '']
In this example, I’m using the split function to search for any part of the string text3 with the pattern \d{2}/\d{2}/\d{4}. You’re probably wondering what the split function does or what the pattern \d{2}/\d{2}/\d{4} means.
First of all, the split function, like the findall and search functions, takes in a pattern to search for along with the string where the function will look for the specified pattern. However, the split function is different because it doesn’t return the matches themselves; rather, split cuts the string at every match and returns a list of the pieces in between. In the example above, the date sits at the very end of text3, so the last piece in the list is an empty string.
The expression \d{2}/\d{2}/\d{4} uses a group of special sequences. In the context of regular expressions, special sequences are written as a backslash followed by an individual character, and they serve as convenient shorthand for pre-defined character classes. For instance, the special sequence \d looks for digits in the string; when combined with the metacharacter {2}, \d{2} looks for any sequence of two digits in a string. In the expression \d{2}/\d{2}/\d{4}, the split function is looking for a pattern that starts with two digits followed by a forward slash followed by another digit pair followed by another forward slash and ending with a sequence of four digits.
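Here is a small sketch of what split does when the pattern shows up more than once, next to findall with the same pattern. The string sample and the names date_pattern, pieces, and dates are made up for this illustration.

```python
import re

# hypothetical sample string containing two dates
sample = "Start: 04/10/2021, end: 05/15/2021."
date_pattern = r"\d{2}/\d{2}/\d{4}"   # raw string keeps the backslashes intact

# split cuts the string at each date and keeps the text in between
pieces = re.split(date_pattern, sample)
print(pieces)   # ['Start: ', ', end: ', '.']

# findall does the opposite: it keeps the dates and drops the rest
dates = re.findall(date_pattern, sample)
print(dates)    # ['04/10/2021', '05/15/2021']
```

So if you want the dates themselves, findall is probably the function to reach for; split is handy when the pattern is a separator.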
Here is a list of the main special sequences that Python regex uses:
- \A: looks for a match if certain character(s) are at the beginning of the string
  - example: \ATo can be used to see if text3 starts with the characters “To”
- \b: looks for a match if certain character(s) are at the beginning or end of a word
  - example: \bda can be used to see if there are any words in text3 that start with the characters “da”. However, if you want to see if “da” can be found at the end of a word, use the syntax da\b. Write these patterns as raw strings (for example, r"\bda"), because in a regular Python string \b means a backspace character.
- \B: looks for a match if certain character(s) are present BUT NOT at the beginning or end of a word
  - example: \Bda can be used to see if there are any words in text3 that contain but don’t begin with the characters “da”. Likewise, da\B can be used to see if there are any words in text3 that contain but don’t end with the characters “da”.
- \d: looks for digits in the string (I discussed this sequence in the example above)
- \D: looks for non-digits in the string
  - example: \D can be used to return all of the non-digit characters in text3.
  - Yes, whitespace counts as a character too.
- \s: looks for all of the whitespace characters in the string
  - example: \s can be used to return a list of all the whitespace characters in text3, of which there are three.
- \S: looks for all of the non-whitespace characters in the string
  - example: \S can be used to return a list of all the non-whitespace characters in text3
- \w: looks for all of the word characters in the string; in case you’re wondering, the word characters are the letters of the alphabet, the digits 0-9, and the underscore (_)
  - example: \w can be used to return a list of all the word characters in text3
- \W: looks for all of the non-word characters in the string (in other words, anything that’s not a letter, digit, or underscore)
  - example: \W can be used to return a list of all the non-word characters in text3
- \Z: looks for a match if certain character(s) are at the end of the string
  - example: 21\Z can be used to see if text3 ends with the characters “21”.
- Whenever you’re using special sequences, be sure not to mix up letter cases, as capital letters and lowercase letters will do different things in the context of regex special sequences!
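Here is a quick sketch that tries a handful of these special sequences on text3. The variable names are my own, and every pattern is written as a raw string so that Python leaves the backslashes alone.

```python
import re

text3 = "Today's date is 04/10/2021"

starts = re.search(r"\ATo", text3)          # text3 starts with "To"
print(starts.group())                       # To

word_start = re.findall(r"\bda", text3)     # "da" at the start of a word (date)
print(word_start)                           # ['da']

word_inside = re.findall(r"\Bda", text3)    # "da" NOT at the start of a word (Today's)
print(word_inside)                          # ['da']

digits = re.findall(r"\d", text3)           # every digit, one at a time
print(len(digits))                          # 8

spaces = re.findall(r"\s", text3)           # whitespace characters
print(spaces)                               # [' ', ' ', ' ']

non_word = re.findall(r"\W", text3)         # apostrophe, spaces, and slashes
print(non_word)                             # ["'", ' ', ' ', ' ', '/', '/']

ends = re.search(r"21\Z", text3)            # text3 ends with "21"
print(ends.span())                          # (24, 26)
```

Comparing \bda with \Bda shows the upper/lowercase difference nicely: each one finds a different “da” in the same string.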
Now I’ve shown you how to split a string with regex, find all instances of a certain character pattern in a string, and retrieve a match object using regex. However, what if you wanted to replace one character pattern with another? Here’s an example of this:
text4 = "Today was a beautiful Monday afternoon!"
print(re.sub("Mon", "Tues", text4))
Today was a beautiful Tuesday afternoon!
If you want to replace one character pattern with another, use the sub function. The difference between this function and the other three regex functions (findall, search, and split) I discussed earlier is that sub requires three parameters while the other functions only require two; the three parameters sub uses are the character pattern you want to replace, the new character pattern you want to use, and the string where you want to make the switch (in that order). In this example, I’m replacing the character pattern “Mon” with the pattern “Tues” in text4 to change the string from Today was a beautiful Monday afternoon! to Today was a beautiful Tuesday afternoon!
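The first parameter of sub can be any regex pattern, not just plain text. Here is a small sketch of that, along with the optional count parameter that limits how many replacements are made. The placeholder [date], the word banana, and the variable names are just examples I chose.

```python
import re

text3 = "Today's date is 04/10/2021"

# replace anything shaped like a date with a placeholder
masked = re.sub(r"\d{2}/\d{2}/\d{4}", "[date]", text3)
print(masked)       # Today's date is [date]

# by default sub replaces every match...
all_replaced = re.sub("a", "A", "banana")
print(all_replaced)     # bAnAnA

# ...but count=1 stops after the first one
first_only = re.sub("a", "A", "banana", count=1)
print(first_only)       # bAnana
```

Note that sub returns a new string; the original string is left unchanged.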
Now, before I go, I want to discuss one more Python regex concept-sets. In regex, sets are sets of characters inside square brackets with a special meaning.
Let’s check out an example of sets below:
text5 = "His address is 742 Evergreen Terrace. His phone number is 413-234-9080. His date of birth is 09/12/1971"
print(re.findall("[0-9][0-9]", text5))
['74', '41', '23', '90', '80', '09', '12', '19', '71']
In this example, I am using the set [0-9][0-9] to find all two-digit sequences in text5. Each [0-9] matches a single digit between 0 and 9, so two of them back to back match any pair of consecutive digits between 00 and 99. Since findall returns non-overlapping matches from left to right, 742 produces only '74' and 9080 produces '90' and '80'.
One interesting thing about sets is that, of the three regex elements (sets, special sequences, and metacharacters), they are the most customizable (though you could argue that metacharacters are widely customizable as well). In the example above, I could’ve used the set [0-9][0-9][0-9] to look for all three-digit sequences in text5 between 000 and 999. But what if I didn’t want to use the 00-99 digit sequence range? Let’s say I wanted to look for all two-digit sequences in text5 between 00 and 49; all I need to do is specify the set [0-4][0-9] in the first parameter of the findall function.
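Here is a sketch of those two variations on text5. One thing worth seeing for yourself is that these sets match digit pairs anywhere, even inside longer numbers. The last pattern, which adds \b on each side to keep only standalone two-digit numbers, is my own addition for comparison.

```python
import re

text5 = ("His address is 742 Evergreen Terrace. "
         "His phone number is 413-234-9080. "
         "His date of birth is 09/12/1971")

three_digits = re.findall("[0-9][0-9][0-9]", text5)
print(three_digits)     # ['742', '413', '234', '908', '197']

# pairs whose first digit is 0-4; note '42' comes from inside 742
low_pairs = re.findall("[0-4][0-9]", text5)
print(low_pairs)        # ['42', '41', '23', '08', '09', '12', '19']

# word boundaries keep only two-digit numbers that stand on their own
standalone = re.findall(r"\b[0-4][0-9]\b", text5)
print(standalone)       # ['09', '12']
```

If you only care about whole numbers in a certain range, the word-boundary version is probably closer to what you want.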
What other regex sets can you use with Python? Here’s a list of them:
- [ber]: looks for all the lowercase b’s, e’s, and r’s in the string
- [b-r]: looks for all the lowercase letters between b and r in the string
  - If you wanted to modify this search to find capital letters, use the set [B-R], which finds all of the capital letters between B and R in the string.
- [^one]: looks for all of the characters that aren’t o, n, or e.
- [4567]: looks for all of the 4’s, 5’s, 6’s, and 7’s in the string
- [0-9]: looks for any digit between 0 and 9 in the string
- [0-6][0-9]: looks for any two-digit sequence between 00 and 69 in the string (I discussed this set concept in the above example)
- [b-rB-R]: looks for every letter between b and r in the string, both lowercase and uppercase
- [$]: looks for all dollar sign characters ($) in the string; inside a set, $ is treated as a plain character rather than a metacharacter
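Here is a short sketch that tries several of these sets. The string sample and the variable names are made up for this illustration.

```python
import re

# hypothetical sample string
sample = "Blueberry costs $4.75"

lower_ber = re.findall("[ber]", sample)     # lowercase b, e, r only (capital B is skipped)
print(lower_ber)        # ['e', 'b', 'e', 'r', 'r']

both_cases = re.findall("[b-rB-R]", sample) # letters b through r, either case
print(both_cases)       # ['B', 'l', 'e', 'b', 'e', 'r', 'r', 'c', 'o']

some_digits = re.findall("[4567]", sample)
print(some_digits)      # ['4', '7', '5']

dollars = re.findall("[$]", sample)         # $ is a plain character inside a set
print(dollars)          # ['$']

not_one = re.findall("[^one]", "one two")   # everything except o, n, and e
print(not_one)          # [' ', 't', 'w']
```

Notice that [ber] skips the capital B, which is why [b-rB-R] lists both ranges.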
Thanks for reading,
Michael