regex Archives - Michael's Programming Bytes

Python Lesson 16: Regular Expressions in Python


Hello everybody,

It’s Michael, and today’s post will cover regular expressions in Python. I know I already did a Java lesson on RegEx (the colloquial term for regular expressions; here’s the link to that lesson: Java Lesson 18: RegEx (or Regular Expressions)), but I wanted to cover how to use regular expressions in Python, so here goes.

To start working with regular expressions in Python, import the regular expressions module using this line of code: import re.

Now, here’s a simple example of regular expressions in Python:

import re

text = "Jupyter Notebook is awesome!!!"

print(re.findall('e', text))

['e', 'e', 'e', 'e']

In this example, I have a string that reads Jupyter Notebook is awesome!!!. In the print statement, I’m using the re module’s findall function to find all of the e’s in text. findall returns a list of all of the e’s it found in text, of which there were four, and print displays that list.

To use findall (and the search and split functions covered below), you’ll need two parameters: the string/character/pattern you want to search for and the string where you want to search for it. The sub function, which I’ll cover near the end of this post, needs a third.

Now, let’s try a more complex RegEx example in Python:

text2 = "Tonight is a very beautiful spring night"

print(re.search("ht$",text2))

<re.Match object; span=(38, 40), match='ht'>

In this example, I am using the search function to look for the pattern ht$ in the string. An important thing to note is that, unlike the findall function, the search function only returns the first match it finds in the string being analyzed (in this case, text2), or None if there is no match.

Similar to the findall function, you’ll need to include two parameters for the search function: the pattern/string/character you’re searching for and the string where you are looking for it.

Now, you’re probably wondering what ht$ means. The dollar sign ($) is called a metacharacter, and in the context of regular expressions, metacharacters help define more specific search criteria for the pattern/string/character you are looking for. The $ matches only at the end of the string, so ht$ asks whether text2 ends with ht; a Match object is returned if it does, and None is returned if it doesn’t. In this case, a Match object was returned; the span shown in the Match object tells you the index positions in the string where the match was found. For text2, the match starts at index 38 and ends at index 39 (the second number in span, 40, is not part of the match). In other words, the match covers the 39th and 40th characters of text2 (recall that string indexing starts with 0).
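
To make the span numbers concrete, here is a small sketch that pulls the pieces out of the Match object. The variable names match and no_match are my own choices for illustration; group(), span(), start(), and end() are methods of Python's Match objects.

```python
import re

text2 = "Tonight is a very beautiful spring night"

# search returns a Match object for the first match, or None if nothing matches
match = re.search(r"ht$", text2)

if match:
    print(match.group())                      # the matched text: ht
    print(match.span())                       # (38, 40)
    print(text2[match.start():match.end()])   # slicing with the span gives ht again

# $ only matches at the end of the string, so "is$" finds nothing here
no_match = re.search(r"is$", text2)
print(no_match)                               # None
```

Because search can return None, it is usually safest to check the result (as the if statement does) before calling any methods on it.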

Here’s a list of other useful regex metacharacters:

[]-find a set of characters
- example: [aeiouy] can be used to find all of the vowels in text2 (and yes I’ll consider y as a vowel). Don’t separate the characters with commas or spaces, since those would become part of the set too.
\-use a special sequence (more on this later)
.-find any character (except the \n newline character)
- example: n...t can be used in text2 to find any five-character stretch that starts with n, ends with t, and has any three characters in between (such as night)
^-check whether the string starts with a certain character/pattern
- example: ^To can be used to check whether text2 starts with “To”
$-check whether the string ends with a certain character/pattern (I explained this metacharacter in the above example)
*-find zero (or more) occurrences of the character/group right before it
- example: ni* finds an n followed by any number of i’s (including none), so it matches “n”, “ni”, “nii”, and so on. To repeat the whole pattern “ni”, wrap it in parentheses: (ni)*
+-find at least one occurrence of the character/group right before it
- example: ni+ finds an n followed by at least one i. To find at least one occurrence of the whole pattern “ni”, use (ni)+
{}-find a specific number of occurrences of the character/group right before it
- example: ni{2} finds an n followed by exactly two i’s (“nii”). ni{1} is the same as plain ni.
|-find a match that contains either pattern/string
- example: tonight|today can be used to find out whether a string contains either tonight or today. Keep in mind that regex is case-sensitive by default, so this pattern won’t match the capitalized “Tonight” in text2; Tonight|today will.
- note: | is also Python’s bitwise OR operator (the logical OR in conditional statements is the keyword or), and it takes a conceptually similar “either/or” meaning when dealing with Python regex.
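
Here is a quick sketch that runs several of these metacharacters against text2 so you can see what each one actually returns. The variable names are my own, chosen just for this illustration.

```python
import re

text2 = "Tonight is a very beautiful spring night"

vowels = re.findall(r"[aeiouy]", text2)        # every vowel, one list item each
five_chars = re.findall(r"n...t", text2)       # n, any three characters, then t
starts_with_to = re.search(r"^To", text2)      # anchored to the start of the string
n_then_any_i = re.findall(r"ni*", text2)       # n followed by zero or more i's
n_then_some_i = re.findall(r"ni+", text2)      # n followed by one or more i's
lowercase_only = re.findall(r"tonight|today", text2)   # case-sensitive, so no match
capitalized = re.findall(r"Tonight|today", text2)

print(len(vowels))                 # 13
print(five_chars)                  # ['night', 'night']
print(starts_with_to is not None)  # True
print(n_then_any_i)                # ['ni', 'n', 'ni']
print(n_then_some_i)               # ['ni', 'ni']
print(lowercase_only)              # []
print(capitalized)                 # ['Tonight']
```

Notice that ni* also picks up the lone n in spring, because * allows zero i's after the n.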

Now I know I mentioned special sequences in the list above, so let’s see some special sequences in action:

text3 = "Today's date is 04/10/2021"

# the r prefix makes this a raw string, so Python passes the backslashes to re untouched
print(re.split(r"\d{2}/\d{2}/\d{4}", text3))

["Today's date is ", '']

In this example, I’m using the split function to search for any part of the string text3 with the pattern \d{2}/\d{2}/\d{4}. You’re probably wondering what the split function does or what the pattern \d{2}/\d{2}/\d{4} means.

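First of all, the split function, like the findall and search functions, takes in a pattern to search for along with the string where the function will look for the specified pattern. However, the split function is different because it doesn’t return the matches themselves; rather, split cuts the string at every match and returns a list of the pieces in between. That’s why the output above contains Today’s date is followed by an empty string: the date sits at the very end of text3, so nothing is left after it.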

The expression \d{2}/\d{2}/\d{4} uses a group of special sequences. In the context of regular expressions, a special sequence is a backslash followed by an individual character, and it serves as convenient shorthand for a pre-defined character class. For instance, the special sequence \d looks for digits in the string; when combined with the quantifier {2}, \d{2} looks for any sequence of two digits in a string. In the expression \d{2}/\d{2}/\d{4}, the split function is looking for a pattern that starts with two digits followed by a forward slash followed by another digit pair followed by another forward slash and ending with a sequence of four digits.
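
To see the difference between finding the date and splitting around it, here is a short sketch that uses the same pattern with both findall and split. The name date_pattern and the second sentence (with two dates) are made up for illustration.

```python
import re

text3 = "Today's date is 04/10/2021"
date_pattern = r"\d{2}/\d{2}/\d{4}"   # two digits, slash, two digits, slash, four digits

found = re.findall(date_pattern, text3)
pieces = re.split(date_pattern, text3)
more_pieces = re.split(date_pattern, "From 01/02/2020 to 03/04/2021")

print(found)        # ['04/10/2021']
print(pieces)       # ["Today's date is ", '']
print(more_pieces)  # ['From ', ' to ', '']
```

findall hands back the dates themselves, while split hands back everything except the dates.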

Here is a list of the most commonly used special sequences in Python regex:

\A-looks for a match if certain character(s) are at the beginning of the string
- example: \ATo can be used to see if text3 starts with the characters “To”
\b-looks for a match if certain character(s) are at the beginning or end of a word
- example: \bda can be used to see if there are any words in text3 that start with the characters “da”. However, if you want to see if “da” can be found at the end of a word, use the syntax da\b
\B-looks for a match if certain character(s) are present BUT NOT at the beginning or end of a word
- example: \Bda can be used to see if there are any words in text3 that contain but don’t begin with the characters “da”. Likewise, da\B can be used to see if there are any words in text3 that contain but don’t end with the characters “da”.
\d-looks for digits in the string (I discussed this sequence in the example above)
\D-looks for non-digits in the string
- example: \D can be used to return all of the non-digit characters in text3.
- Yes, whitespace counts as a character too.
\s-looks for all of the whitespace characters in the string
- example: \s can be used to return a list of all the whitespace characters in text3, of which there are three.
\S-looks for all of the non-whitespace characters in the string
- example: \S can be used to return a list of all the non-whitespace characters in text3
\w-looks for all of the word characters in the string; in case you’re wondering, the word characters are the letters of the alphabet, the digits 0-9, and the underscore (_)
- example: \w can be used to return a list of all the word characters in text3
\W-looks for all of the non-word characters in the string (in other words, anything that’s not a letter, digit, or underscore)
- example: \W can be used to return a list of all the non-word characters in text3
\Z-looks for a match if certain character(s) are at the end of the string
- example: 21\Z can be used to see if text3 ends with the characters “21”.

Whenever you’re using special sequences, be sure not to mix up letter cases, as capital letters and lowercase letters will do different things in the context of regex special sequences!
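
To show how much the letter case matters, here is a sketch that runs several of the special sequences from the list against text3. The variable names are my own for this example.

```python
import re

text3 = "Today's date is 04/10/2021"

digits = re.findall(r"\d", text3)          # lowercase d: digits only
non_digits = re.findall(r"\D", text3)      # uppercase D: everything except digits
spaces = re.findall(r"\s", text3)
non_word = re.findall(r"\W", text3)        # apostrophe, spaces, and slashes
word_start = re.findall(r"\bda", text3)    # the "da" that begins "date"
word_inside = re.findall(r"\Bda", text3)   # the "da" inside "Today's"

print(digits)                    # ['0', '4', '1', '0', '2', '0', '2', '1']
print(len(non_digits))           # 18
print(len(spaces))               # 3
print(non_word)                  # ["'", ' ', ' ', ' ', '/', '/']
print(word_start, word_inside)   # ['da'] ['da']
print(bool(re.search(r"\ATo", text3)), bool(re.search(r"21\Z", text3)))  # True True
```

\bda and \Bda both return ['da'], but they are matching two different spots in the string: one at the start of date and one in the middle of Today's.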

Now I’ve shown you how to split a string with regex, find all instances of a certain character pattern in a string, and retrieve a match object using regex. However, what if you wanted to replace one character pattern with another? Here’s an example of this:

text4 = "Today was a beautiful Monday afternoon!"

print(re.sub("Mon", "Tues", text4))

Today was a beautiful Tuesday afternoon!

If you want to replace one character pattern with another, use the sub function. The difference between this function and the other three regex functions (findall, search, and split) I discussed earlier is that sub needs three parameters while the others only need two; the three parameters sub uses are the character pattern you want to replace, the new character pattern you want to use, and the string where you want to make the switch (in that order). In this example, I’m replacing the character pattern “Mon” with the pattern “Tues” in text4 to change the string from Today was a beautiful Monday afternoon! to Today was a beautiful Tuesday afternoon!
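
The pattern you give to sub doesn’t have to be plain text. As a rough sketch, here is sub used with a special sequence, plus its optional count parameter, which limits how many replacements are made. The variable names masked and first_only are my own.

```python
import re

text3 = "Today's date is 04/10/2021"
text4 = "Today was a beautiful Monday afternoon!"

# the pattern can be any regex, not just plain text: here every digit becomes #
masked = re.sub(r"\d", "#", text3)
print(masked)        # Today's date is ##/##/####

# count limits how many replacements are made (by default every match is replaced)
first_only = re.sub(r"a", "A", text4, count=1)
print(first_only)    # TodAy was a beautiful Monday afternoon!
```

Like the other functions, sub doesn’t change the original string; it returns a new one.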

Now, before I go, I want to discuss one more Python regex concept: sets. In regex, a set is a group of characters inside square brackets, and it matches any one of those characters.

Let’s check out an example of sets below:

text5 = "His address is 742 Evergreen Terrace. His phone number is 413-234-9080. His date of birth is 09/12/1971"

print(re.findall("[0-9][0-9]", text5))

['74', '41', '23', '90', '80', '09', '12', '19', '71']

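In this example, I am using two sets back to back, [0-9][0-9], to find pairs of digits in text5.

You’re probably wondering what [0-9][0-9] does in the context of regex. Each [0-9] matches a single digit from 0 to 9, so [0-9][0-9] matches any two digits in a row, from 00 to 99. Since findall moves through the string from left to right without overlapping, a three-digit number like 742 only gives you its first two digits (74), and a four-digit number like 9080 gets split into 90 and 80.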

One interesting thing about sets is that, between them, special sequences, and metacharacters, they are the most customizable of the three regex elements (though you could argue that metacharacters are widely customizable as well). In the example above, I could’ve used [0-9][0-9][0-9] to look for all three-digit sequences in text5 between 000 and 999. But what if I didn’t want to use the 00-99 digit sequence range? Let’s say I wanted to look for all two-digit sequences in text5 between 00 and 49; all I need to do is specify [0-4][0-9] in the first parameter of the findall function. Just remember that the pattern looks at pairs of adjacent digits rather than whole numbers, so it would also pick up the 42 inside 742.
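
Here is what those two variations return when run against text5. This is just a quick sketch; the variable names three_digits and low_pairs are my own.

```python
import re

text5 = "His address is 742 Evergreen Terrace. His phone number is 413-234-9080. His date of birth is 09/12/1971"

three_digits = re.findall(r"[0-9][0-9][0-9]", text5)
low_pairs = re.findall(r"[0-4][0-9]", text5)

print(three_digits)   # ['742', '413', '234', '908', '197']
print(low_pairs)      # ['42', '41', '23', '08', '09', '12', '19']
```

The results show that these patterns grab adjacent digits wherever they appear: 908 and 197 are cut out of the longer numbers 9080 and 1971, and 42 and 08 come from the middle of 742 and 9080.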

What other regex sets can you use with Python? Here’s a list of them:

[ber]-looks for all the lowercase b’s, e’s, and r’s in the string
[b-r]-looks for all the lowercase letters between b and r in the string
- If you wanted to modify this search to find capital letters, use the set [B-R], which finds all of the capital letters between B and R in the string.
[^one]-looks for all of the characters that aren’t o, n, or e (a ^ at the start of a set means NOT)
[4567]-looks for all of the 4’s, 5’s, 6’s, and 7’s in the string
[0-9]-looks for any digit between 0 and 9 in the string
[0-6][0-9]-looks for any two-digit sequence between 00 and 69 in the string (I discussed this set concept in the above example)
[b-rB-R]-looks for every letter between b and r in the string, both lowercase and uppercase
[$]-looks for all dollar sign characters ($) in the string; inside a set, $ is treated as a plain character rather than as the end-of-string metacharacter
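
To try a few of these sets out, here is a small sketch. The sentence in sample is made up purely for illustration, and the variable names are my own.

```python
import re

sample = "Bob paid $75 for 6 beers"   # made-up sentence for illustration

lower_ber = re.findall(r"[ber]", sample)       # lowercase b, e, and r only
capitals = re.findall(r"[B-R]", sample)        # capital letters from B to R
some_digits = re.findall(r"[4567]", sample)
dollars = re.findall(r"[$]", sample)           # $ is a plain character inside a set
not_one = re.findall(r"[^one]", "one two")     # everything except o, n, and e

print(lower_ber)     # ['b', 'r', 'b', 'e', 'e', 'r']
print(capitals)      # ['B']
print(some_digits)   # ['7', '5', '6']
print(dollars)       # ['$']
print(not_one)       # [' ', 't', 'w']
```

The capital B in Bob is skipped by [ber] but caught by [B-R], which shows that sets are case-sensitive.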

Thanks for reading,

Michael

Java Lesson 18: RegEx (or Regular Expressions)


Hello everybody,

Michael here, and today’s post will be about RegEx (or regular expressions) in Java.

Now, since this is the first time I’ve discussed regular expressions in any programming language, you’re probably wondering “Well, what the heck are regular expressions?” Regular expressions are sequences of characters that form a search pattern. When you’re looking for certain data in a text, you can use regular expressions to describe what you’re looking for.

Still confused? Let me make it easier for you. Let’s say you had a contact list, filled with names, e-mail addresses, birthdates, and phone numbers. Now let’s say you want to retrieve all of the phone numbers from the contact list. Assuming you’re in the US, phone numbers have 10 digits and are commonly written in the format XXX-XXX-XXXX. A regular expression lets you describe that format so a program can pick out everything that matches it.

Now let’s say you want to retrieve all of the birthdates from the list. Dates, depending on how they’re written, usually follow the format XX/XX/XXXX (this goes for whether you write the day before the month or vice versa; the year would usually have four digits).

Alright then, let’s explore how regular expressions work in Java. Here’s a simple example where the program searches for a word in a String:

package lesson18;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main (String [] args)
    {
        // CASE_INSENSITIVE tells Java to ignore upper/lower case when searching
        Pattern p = Pattern.compile("Michael's Analytics Blog", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("The best programming blog is Michaels Analytics Blog!");
        // find() returns true if the pattern appears anywhere in the text
        boolean matched = m.find();

        if (matched)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }
    }
}

And here’s the sample output:

run:
Didn’t find a match!
BUILD SUCCESSFUL (total time: 0 seconds)

OK, so most of this probably looks unfamiliar to most of you. Here’s a breakdown of all of the important pieces in this code:

To make this program work, there were two classes I needed to import: java.util.regex.Pattern and java.util.regex.Matcher. These are the two classes you will need to import if your program involves regular expressions.
I needed to create an object of the Pattern class and an object of the Matcher class. The Pattern object holds the pattern I want to search for, and the Matcher object holds the text (String or other character sequence) where I want to look for the pattern.
In order to tell Java the pattern I want to look for, I used the compile() method in the Pattern class. The compile() method can take two parameters, the first being the pattern I want to look for and the second being a flag indicating how the search should be performed. In this example, I used CASE_INSENSITIVE as a flag, which indicates that Java should ignore the case of letters when performing the search.
- The second parameter is optional.
To create the Matcher object, I called the matcher() method on the Pattern object p, which tells Java where to look for the pattern Michael's Analytics Blog. The matcher() method takes a single parameter, which is the text where I want to search for the pattern (in this case, The best programming blog is Michaels Analytics Blog!)
It isn’t strictly necessary, since find() already returns true or false and could go straight into the if statement, but I think it’s a good idea to store the result in a boolean variable (such as matched in this example) in any program that uses regular expressions. With a boolean variable (coupled with an if-else statement), the program has an easy way to let the user know whether or not a match was found in the text.
Last but not least, the program will output one of two messages, depending on whether a match was found.

Oh, and one more thing. You guys certainly noticed that the message Didn't find a match! was printed, but only those with a great eye for detail would understand why the message was printed. See, the Matcher class expression I used was The best programming blog is Michaels Analytics Blog! while the pattern I wanted to search for was Michael's Analytics Blog. Since there isn’t an exact match (as Michaels Analytics Blog is missing an apostrophe), the boolean variable matched is false and the message Didn't find a match! was printed.

Alright then, now let’s play around with some RegEx patterns:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main (String [] args)
    {
        // [aeiouy] matches any single vowel (counting y as a vowel)
        Pattern p = Pattern.compile("[aeiouy]", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("Nashville, Tennessee");
        boolean matched = m.find();

        if (matched)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }
    }
}

And here’s the sample output:

run:
Found a match!
BUILD SUCCESSFUL (total time: 13 seconds)

In this example, I used the pattern [aeiouy] to search for any vowels (and yes I counted y as a vowel) in the Matcher class expression Nashville, Tennessee. In this case the program found a match, as there are vowels in Nashville, Tennessee.

Keep in mind that you need to wrap any RegEx patterns (not just words) in double quotes, as the first parameter in the Pattern.compile() method must be a String.
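I could’ve also used the pattern [^aeiouy] to search for characters that aren’t vowels in the Matcher class expression (and there would’ve been matches, since the consonants, the comma, and the space all qualify). When ^ is the first character inside square brackets, it means NOT, as in “search for characters NOT in this set”.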

Alright, now let’s explore meta-characters and quantifiers. In the context of Java regular expressions, meta-characters are characters with a special meaning and quantifiers define quantities in pattern searching. Here’s an example of meta-characters and quantifiers in action:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main (String [] args)
    {
        // ^ and $ anchor the pattern to the start and end of the text
        Pattern p = Pattern.compile("^\\d{2}/\\d{2}/\\d{2}$", Pattern.CASE_INSENSITIVE);
        Matcher m1 = p.matcher("08/05/20");
        Matcher m2 = p.matcher("08/05/2020");
        boolean matched1 = m1.find();
        boolean matched2 = m2.find();

        if (matched1)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }

        if (matched2)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match");
        }
    }
}

And here’s the output:

run:
Found a match!
Didn’t find a match
BUILD SUCCESSFUL (total time: 3 seconds)

In this example, I have two Strings, 08/05/20 and 08/05/2020 (both of which are dates), and I’m checking both of them to see if they follow the pattern ^\\d{2}/\\d{2}/\\d{2}$ (and this time, I created matched variables for both Strings). In plain English, I’m trying to see whether the two dates follow the MM/DD/YY format.

You’re probably wondering what the pattern ^\\d{2}/\\d{2}/\\d{2}$ means. Here’s a breakdown of the pattern:

^ & $ look for matches at the beginning and end of a String, respectively. Including both of these meta-characters in the pattern ensures that the pattern search will look for a String that exactly matches the pattern specified in the Pattern.compile() method.
The three instances of \\d{2} tell Java to look for a sequence of two digits. The main pattern \\d{2}/\\d{2}/\\d{2} tells Java to look for a sequence of two digits followed by a slash followed by a sequence of two digits followed by a slash followed by a sequence of two digits.
- Keep in mind that you need two backslashes by the d, not just one. This is because the pattern is written as a Java String, and inside a String a backslash starts an escape sequence (like \n). \d isn’t a valid String escape, so the code won’t compile with a single backslash; writing \\d puts one real backslash into the String, which is what the regex engine needs to see.

The two boolean variables, matched1 and matched2, then store whether the pattern was found by the two Matcher objects m1 and m2; matched1 holds the result for m1 (08/05/20) while matched2 holds the result for m2 (08/05/2020). The output shows that m1 found a match but m2 didn’t. The reason m2 didn’t find a match is that 08/05/2020 ends with a four-digit year, so it follows the pattern \\d{2}/\\d{2}/\\d{4}, which isn’t the pattern Java was looking for.

Last but not least, here’s an explanation of some of the important meta-characters in Java regex:

|-find a match for any pattern separated by the pipe symbol (|) as in boy|girl|person
.-find any single character (by default, anything except a line break)
^ & $-find a match at the beginning and end of a String, respectively (I discussed these two meta-characters in the example above)
\d-find a digit
\s-find a whitespace character
\b-find a match either at the beginning of a word (as in \bpattern) or at the end of a word (as in pattern\b)
\uxxxx-find a Unicode character with a certain hexadecimal code

Keep in mind that the double backslash rule I discussed with \d also applies to \s, \b, and \uxxxx.

Now let’s discuss the important quantifiers in Java regex:

x+-match any String that contains at least one x
x*-match any String that contains zero or more instances of x
x?-match any String that contains either zero or one instance(s) of x
x{Y}-match any String that contains a sequence of Y xs
x{Y,Z}-match any String that contains a sequence of between Y and Z xs
x{Y,}-match any String that contains a sequence of at least Y xs.

You can use the curly bracket quantifiers in conjunction with some of the meta-characters, as I did in the last example program. In that program, I had three instances of \\d{2}, indicating that I wanted to search for three two-digit sequences. However, if I wanted the String 08/05/2020 to match the pattern, I could’ve altered the pattern to read ^\\d{2}/\\d{2}/\\d{4}$ or ^\\d{2}/\\d{2}/\\d{2,4}$ or ^\\d{2}/\\d{2}/\\d{2,}$. Note that there is no space after the comma inside the curly brackets; Java won’t accept the quantifier if you add one.

Thanks for reading,

Michael