Java Lesson 18: RegEx (or Regular Expressions)

Hello everybody,

Michael here, and today’s post will be about RegEx (or regular expressions) in Java.

Now, since this is the first time I’ll be discussing regular expressions in any programming language, you’re probably wondering “Well, what the heck are regular expressions?” Regular expressions are sequences of characters that form a search pattern. When you’re looking for certain data in a text, you can use regular expressions to describe what you’re looking for.

Still confused? Let me make it easier for you. Let’s say you had a contact list, filled with names, e-mail addresses, birthdates, and phone numbers. Now let’s say you want to retrieve all of the phone numbers from the contact list. Assuming you’re in the US, phone numbers have 10 digits and are commonly written in the format XXX-XXX-XXXX.

Now let’s say you want to retrieve all of the birthdates from the list. Dates, depending on how they’re written, usually follow the format XX/XX/XXXX (this goes for whether you write the day before the month or vice versa; the year would usually have four digits). In both cases, a regular expression lets you describe the format once and then have the program pull out every piece of text that fits it.
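To make the contact list idea concrete, here’s a minimal sketch of pulling phone numbers and dates out of some text. The class name ContactScan, the helper findAll(), and the contact entries are all made up for illustration, and the sketch assumes every phone number is written as XXX-XXX-XXXX and every date as XX/XX/XXXX. The \\d and {3} pieces are meta-characters and quantifiers, which I explain later in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContactScan
{
    // Three digits, a hyphen, three digits, a hyphen, four digits (XXX-XXX-XXXX).
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
    // Two digits, a slash, two digits, a slash, four digits (XX/XX/XXXX).
    static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");

    // Collects every piece of the text that fits the given pattern.
    static List<String> findAll(Pattern pattern, String text)
    {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        // Each call to find() moves on to the next match, if there is one.
        while (m.find())
        {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args)
    {
        // Hypothetical contact entries, for illustration only.
        String contacts = "Jane Doe, jane@example.com, 04/12/1990, 615-555-0147; "
                        + "John Roe, john@example.com, 11/03/1985, 305-555-0199";

        System.out.println(findAll(PHONE, contacts)); // [615-555-0147, 305-555-0199]
        System.out.println(findAll(DATE, contacts));  // [04/12/1990, 11/03/1985]
    }
}
```

Keep in mind that these patterns only check the shape of the text. A String like 99/99/9999 would still count as a date here.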

Alright then, let’s explore how regular expressions work in Java. Here’s a simple example where the program searches for a word in a String:

package lesson18;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main(String[] args)
    {
        // The pattern to search for; the flag makes the search ignore letter case.
        Pattern p = Pattern.compile("Michael's Analytics Blog", Pattern.CASE_INSENSITIVE);
        // The text to search in.
        Matcher m = p.matcher("The best programming blog is Michaels Analytics Blog!");
        // find() returns true if the pattern occurs anywhere in the text.
        boolean matched = m.find();

        if (matched)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }
    }
}

And here’s the sample output:

run:
Didn't find a match!
BUILD SUCCESSFUL (total time: 0 seconds)

OK, so most of this probably looks unfamiliar to most of you. Here’s a breakdown of all of the important pieces in this code:

  • To make this program work, there were two classes I needed to import: java.util.regex.Pattern and java.util.regex.Matcher. These are the two classes you will need to import if your program involves regular expressions.
  • I needed to create an object of the Pattern class and an object of the Matcher class. The Pattern object holds the pattern I want to search for, and the Matcher object holds the text (a String or any other CharSequence) where I want to look for the pattern.
  • In order to tell Java the pattern I want to look for, I used the compile() method in the Pattern class. The compile() method can take two parameters: the first is the pattern I want to look for, and the second is a flag indicating how the search should be performed. In this example, I used CASE_INSENSITIVE as a flag, which indicates that Java should ignore the case of letters when performing the search.
    • The second parameter is optional. If you leave it out, the search is case-sensitive.
  • To create the Matcher object, I called the matcher() method on the Pattern object to tell Java where to look for the pattern Michael's Analytics Blog. The matcher() method takes a single parameter, which is the text where I want to search for the pattern (in this case, the text is The best programming blog is Michaels Analytics Blog!).
  • The find() method returns a boolean, so I stored its result in a boolean variable (matched in this example). The variable isn’t strictly necessary, since you could call m.find() directly inside the if statement, but I think it’s a good idea to have one. With a boolean variable (coupled with an if-else statement), the program has an easy way to let the user know whether or not a match was found.
  • Last but not least, the program will output one of two messages, depending on whether a match was found.

Oh, and one more thing. You guys certainly noticed that the message Didn't find a match! was printed, but only those with a great eye for detail would understand why. See, the text I searched was The best programming blog is Michaels Analytics Blog! while the pattern I wanted to search for was Michael's Analytics Blog. The CASE_INSENSITIVE flag only tells Java to ignore differences in letter case; every other character, including the apostrophe, still has to be present. Since Michaels Analytics Blog is missing the apostrophe, m.find() returns false, so the boolean variable matched is false and the message Didn't find a match! is printed.
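If you’d like to see what the flag does and doesn’t do, here’s a small sketch that compares the one-parameter and two-parameter versions of compile(). The class name FlagDemo and the helper found() are made up for illustration, and the all-caps text is my own test String rather than anything from the program above.

```java
import java.util.regex.Pattern;

class FlagDemo
{
    // Returns true if the pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text)
    {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args)
    {
        String text = "The best programming blog is MICHAEL'S ANALYTICS BLOG!";

        // One-parameter compile(): no flags, so the search is case-sensitive.
        Pattern strict = Pattern.compile("Michael's Analytics Blog");
        // Two-parameter compile(): the flag tells Java to ignore letter case.
        Pattern relaxed = Pattern.compile("Michael's Analytics Blog", Pattern.CASE_INSENSITIVE);

        System.out.println(found(strict, text));  // false
        System.out.println(found(relaxed, text)); // true

        // Even with the flag, a missing apostrophe still prevents a match.
        System.out.println(found(relaxed, "Michaels Analytics Blog")); // false
    }
}
```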

Alright then, now let’s play around with some RegEx patterns:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main(String[] args)
    {
        // [aeiouy] matches any single character listed inside the brackets.
        Pattern p = Pattern.compile("[aeiouy]", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("Nashville, Tennessee");
        boolean matched = m.find();

        if (matched)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }
    }
}

And here’s the sample output:

run:
Found a match!
BUILD SUCCESSFUL (total time: 13 seconds)

In this example, I used the pattern [aeiouy] to search for any vowels (and yes I counted y as a vowel) in the Matcher class expression Nashville, Tennessee. In this case the program found a match, as there are vowels in Nashville, Tennessee.

  • Keep in mind that you need to wrap any RegEx patterns (not just words) in double quotes, as the first parameter in the Pattern.compile() method must be a String.
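  • I could’ve also used the pattern [^aeiouy] to search for anything that isn’t a vowel in Nashville, Tennessee (and there would’ve been matches). Inside square brackets, a leading ^ means NOT, as in “search for characters NOT in this set”. Keep in mind that [^aeiouy] matches more than consonants: the comma and the space aren’t vowels either, so they’d count as matches too.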
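Here’s a short sketch that shows exactly which characters each of these bracket patterns picks up. The class name VowelDemo and the helper collect() are made up for illustration, and the third pattern, which lists the consonant ranges explicitly, is my own addition rather than something from the program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VowelDemo
{
    // Glues together every character in the text that matches the pattern.
    static String collect(String regex, String text)
    {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        StringBuilder found = new StringBuilder();
        // Each call to find() moves on to the next match, if there is one.
        while (m.find())
        {
            found.append(m.group());
        }
        return found.toString();
    }

    public static void main(String[] args)
    {
        String text = "Nashville, Tennessee";

        // Only the vowels.
        System.out.println(collect("[aeiouy]", text));           // aieeeee
        // Everything that is NOT a vowel, including the comma and the space.
        System.out.println(collect("[^aeiouy]", text));          // Nshvll, Tnnss
        // Letters only, skipping a, e, i, o, u, and y.
        System.out.println(collect("[b-df-hj-np-tv-xz]", text)); // NshvllTnnss
    }
}
```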

Alright, now let’s explore meta-characters and quantifiers. In the context of Java regular expressions, meta-characters are characters with a special meaning and quantifiers define quantities in pattern searching. Here’s an example of meta-characters and quantifiers in action:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx
{
    public static void main(String[] args)
    {
        // Two digits, a slash, two digits, a slash, two digits, and nothing else.
        Pattern p = Pattern.compile("^\\d{2}/\\d{2}/\\d{2}$", Pattern.CASE_INSENSITIVE);
        Matcher m1 = p.matcher("08/05/20");
        Matcher m2 = p.matcher("08/05/2020");
        boolean matched1 = m1.find();
        boolean matched2 = m2.find();

        if (matched1)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match!");
        }

        if (matched2)
        {
            System.out.println("Found a match!");
        }
        else
        {
            System.out.println("Didn't find a match");
        }
    }
}

And here’s the output:

run:
Found a match!
Didn't find a match
BUILD SUCCESSFUL (total time: 3 seconds)

In this example, I have two Strings, 08/05/20 and 08/05/2020 (both of which are dates), and I’m checking both of them to see if they follow the pattern ^\\d{2}/\\d{2}/\\d{2}$ (and this time, I created matched variables for both Strings). In plain English, I’m trying to see whether the two dates follow the MM/DD/YY format.

You’re probably wondering what the pattern ^\\d{2}/\\d{2}/\\d{2}$ means. Here’s a breakdown of the pattern:

  • ^ & $ look for matches at the beginning and end of a String, respectively. Including both of these meta-characters in the pattern ensures that the whole String, from start to finish, has to match the pattern specified in the Pattern.compile() method, rather than just some part of it.
  • The three instances of \\d{2} tell Java to look for a sequence of two digits. The main pattern \\d{2}/\\d{2}/\\d{2} tells Java to look for a sequence of two digits followed by a slash followed by a sequence of two digits followed by a slash followed by a sequence of two digits.
    • Keep in mind that you need two backslashes by the d, not just one. This is because inside a Java String the backslash is itself an escape character. "\d" isn’t a valid String escape and won’t compile, so you write "\\d" to pass a single backslash followed by d to the regex engine.

The two boolean variables, matched1 and matched2, store whether the pattern was found in the text held by the two Matcher objects m1 and m2: matched1 holds the result of m1.find() while matched2 holds the result of m2.find(). The output shows that m1 found a match but m2 didn’t. The reason m2 didn’t find a match is that its String, 08/05/2020, follows the pattern \\d{2}/\\d{2}/\\d{4}, and the ^ and $ meta-characters leave no room for the two extra digits at the end.
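To see how much work ^ and $ are doing, here’s a small sketch that runs the same date pattern with and without them. The class name AnchorDemo and the helper found() are made up for illustration. The last line uses Pattern.matches(), a static method I haven’t covered in this post, which tests the whole String against the pattern.

```java
import java.util.regex.Pattern;

class AnchorDemo
{
    // Returns true if the pattern is found anywhere in the text.
    static boolean found(String regex, String text)
    {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args)
    {
        // With ^ and $, the whole String has to fit the pattern.
        System.out.println(found("^\\d{2}/\\d{2}/\\d{2}$", "08/05/20"));   // true
        System.out.println(found("^\\d{2}/\\d{2}/\\d{2}$", "08/05/2020")); // false

        // Without them, find() is satisfied by the 08/05/20 at the start of 08/05/2020.
        System.out.println(found("\\d{2}/\\d{2}/\\d{2}", "08/05/2020"));   // true

        // Pattern.matches() tests the whole String, with or without ^ and $.
        System.out.println(Pattern.matches("\\d{2}/\\d{2}/\\d{2}", "08/05/2020")); // false
    }
}
```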

Last but not least, here’s an explanation of some of the important meta-characters in Java regex:

  • |: find a match for any one of the patterns separated by the pipe symbol (|), as in boy|girl|person
  • .: find any single character (by default, anything except a line break), as in c.t, which matches cat, cut, and c9t
  • ^ & $: find a match at the beginning and end of a String, respectively (I discussed these two meta-characters in the example above)
  • \d: find a digit
  • \s: find a whitespace character
  • \b: find a word boundary, either at the beginning of a word (as in \bpattern) or at the end of a word (as in pattern\b)
  • \uxxxx: find the Unicode character with the hexadecimal code xxxx

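Keep in mind that the double backslash rule I discussed with \d also applies to \s, \b, and \uxxxx, so inside a Java String you’d write \\s, \\b, and \\uxxxx. Be especially careful with \b: "\b" with a single backslash does compile, but Java reads it as a backspace character, so the regex never sees a word boundary.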
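Here’s a quick sketch that tries out several of these meta-characters. The class name MetaDemo, the helper found(), and all of the test Strings are made up for illustration.

```java
import java.util.regex.Pattern;

class MetaDemo
{
    // Returns true if the pattern is found anywhere in the text.
    static boolean found(String regex, String text)
    {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args)
    {
        // | matches any one of the alternatives.
        System.out.println(found("boy|girl|person", "That person left")); // true

        // . stands in for exactly one character.
        System.out.println(found("c.t", "cut")); // true
        System.out.println(found("c.t", "ct"));  // false

        // \s matches whitespace such as a space.
        System.out.println(found("\\s", "two words")); // true
        System.out.println(found("\\s", "no_spaces")); // false

        // \b keeps cat from matching in the middle of a longer word.
        System.out.println(found("\\bcat\\b", "the cat sat")); // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false

        // \u0041 is the Unicode code for a capital A.
        System.out.println(found("\\u0041", "A")); // true

        // A single backslash turns \b into a backspace character, so this finds nothing.
        System.out.println(found("\bcat", "the cat sat")); // false
    }
}
```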

Now let’s discuss the important quantifiers in Java regex:

  • x+: match one or more x’s in a row
  • x*: match zero or more x’s in a row
  • x?: match either zero or one x
  • x{Y}: match a sequence of exactly Y x’s
  • x{Y,Z}: match a sequence of between Y and Z x’s
  • x{Y,}: match a sequence of at least Y x’s
    • Don’t put any spaces inside the curly brackets. Java rejects a pattern like x{2, 4} with a PatternSyntaxException.

You can use the curly bracket quantifiers in conjunction with some of the meta-characters, as I did in the date example above. In that program, I had three instances of \\d{2}, indicating that I wanted to search for three instances of two-digit sequences. However, if I wanted the String 08/05/2020 to match the pattern, I could’ve altered the pattern to read ^\\d{2}/\\d{2}/\\d{4}$ or ^\\d{2}/\\d{2}/\\d{2,4}$ or ^\\d{2}/\\d{2}/\\d{2,}$ (with no spaces inside the curly brackets, since Java won’t compile a quantifier like {2, 4}).
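Here’s a short sketch that tries out the quantifiers, including the altered date patterns. The class name QuantifierDemo, the helpers found() and compiles(), and the test Strings are made up for illustration. The last two lines check whether Java accepts a quantifier with and without a space inside the curly brackets.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuantifierDemo
{
    // Returns true if the pattern is found anywhere in the text.
    static boolean found(String regex, String text)
    {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Returns true if Java accepts the pattern, false if compile() rejects it.
    static boolean compiles(String regex)
    {
        try
        {
            Pattern.compile(regex);
            return true;
        }
        catch (PatternSyntaxException e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(found("^a+$", "aaa"));         // true
        System.out.println(found("^a+$", ""));            // false: + needs at least one a
        System.out.println(found("^a*$", ""));            // true: * accepts zero a's
        System.out.println(found("^colou?r$", "color"));  // true
        System.out.println(found("^colou?r$", "colour")); // true

        // Year of two to four digits.
        String twoToFour = "^\\d{2}/\\d{2}/\\d{2,4}$";
        System.out.println(found(twoToFour, "08/05/20"));    // true
        System.out.println(found(twoToFour, "08/05/2020"));  // true
        System.out.println(found(twoToFour, "08/05/20200")); // false

        // Year of at least two digits.
        System.out.println(found("^\\d{2}/\\d{2}/\\d{2,}$", "08/05/20200")); // true

        System.out.println(compiles("\\d{2,4}"));  // true
        System.out.println(compiles("\\d{2, 4}")); // false: space inside the brackets
    }
}
```

One thing to watch: \\d{2,4} also accepts a three-digit year such as 08/05/202, so ^\\d{2}/\\d{2}/\\d{4}$ is the safer choice if you only want four-digit years.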

Thanks for reading,

Michael