A regular expression (or regex, for short) is a sequence of characters that defines a pattern, so that we can detect that pattern in a text. For instance, an id card made of 8 digits and an uppercase letter, or an e-mail address containing an @. Using regular expressions, we can easily detect these patterns in a given text and also force a text to meet a given pattern when the user enters it.
In order to work with regular expressions in Java, we need to use some classes from the java.util.regex package. To be more precise, we are going to rely on:
- Pattern class, which lets us define a given pattern for a regular expression
- Matcher class, which lets us check if a given text meets a given Pattern

Regarding the Pattern class, there are a couple of interesting methods in it that we need to know:
- compile: it creates a Pattern object for a given regular expression
- matcher: it returns a Matcher object to match a pattern with a text

Inside the Matcher class, we can use these methods:
- find: checks if a given pattern is found in a text
- matches: checks if the whole text meets a pattern (not only a part of it). There is also a static method in the Pattern class, Pattern.matches, that produces the same result.

Let’s check in the following example if a text contains a digit between 0 and 9. This digit can be represented by the symbol \d, so we use it inside the pattern, this way:
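// Both classes need to be imported at the top of the source file
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "Hi, my name is Nacho and I'm 44 years old";
Pattern p = Pattern.compile("\\d");
Matcher m = p.matcher(text);
if (m.find())
{
    System.out.println("The text contains digit(s)");
}
else
{
    System.out.println("The text does not contain any digit");
}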
Note that we need to escape the \ when we place it inside a Java string, so the symbol \d is written as \\d.
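The difference between find and matches is easy to miss, so here is a minimal sketch that compares them on the same pattern. The class name FindVsMatches and its two helper methods are made up for this illustration; only Pattern.compile, matcher, find, matches and Pattern.matches come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches
{
    // True if the pattern is found anywhere in the text
    static boolean foundIn(String regex, String text)
    {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find();
    }

    // True only if the whole text meets the pattern
    static boolean wholeTextMeets(String regex, String text)
    {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.matches();
    }

    public static void main(String[] args)
    {
        System.out.println(foundIn("\\d", "I'm 44"));        // true
        System.out.println(wholeTextMeets("\\d", "I'm 44")); // false
        System.out.println(wholeTextMeets("\\d", "4"));      // true
        // Static shortcut, same result as compile + matcher + matches
        System.out.println(Pattern.matches("\\d", "4"));     // true
    }
}
```

The text "I'm 44" contains a digit, so find succeeds, but it is not a single digit as a whole, so matches fails.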
In this table you can find some basic symbols that we can use inside patterns.
Symbol | Meaning |
---|---|
x | ‘x’ character |
\t | Tabulation |
\n | New line |
[abc] | Character ‘a’, ‘b’ or ‘c’ |
[^abc] | Anything but characters ‘a’, ‘b’ or ‘c’ |
[a-zA-Z] | Range from ‘a’ to ‘z’ or from ‘A’ to ‘Z’ |
. | Any character |
^ | Beginning of line (there’s nothing before) |
$ | End of line (there’s nothing after) |
\d | Digit from 0 to 9 |
\D | Anything but a digit |
\s | Spacing char (white space, tab, new line…) |
\S | Anything but a spacing char |
\w | Alphanumeric char (letters, digits and underscore) |
\W | Anything but an alphanumeric char |
(one\|two) | Text ‘one’ or text ‘two’ |
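With these symbols, we can make expressions such as:
- Check if a text ends with a dot (the dot must be escaped with \):
Pattern p = Pattern.compile("\\.$");
- Check if a text is made of exactly 4 digits:
Pattern p = Pattern.compile("^\\d\\d\\d\\d$");
- Check if a text contains the name of a season:
Pattern p = Pattern.compile("(winter|spring|summer|autumn)");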
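To see these three patterns at work, the following sketch applies each of them with find to a few sample texts. The class name BasicSymbolsDemo, the constant names and the sample texts are assumptions chosen for this example; the patterns themselves are the ones listed above.

```java
import java.util.regex.Pattern;

class BasicSymbolsDemo
{
    // The text ends with a dot (the dot is escaped so it is not "any character")
    static final Pattern ENDS_WITH_DOT = Pattern.compile("\\.$");
    // The text is made of exactly 4 digits
    static final Pattern FOUR_DIGITS = Pattern.compile("^\\d\\d\\d\\d$");
    // The text contains the name of a season
    static final Pattern SEASON = Pattern.compile("(winter|spring|summer|autumn)");

    // True if the pattern is found somewhere in the text
    static boolean found(Pattern p, String text)
    {
        return p.matcher(text).find();
    }

    public static void main(String[] args)
    {
        System.out.println(found(ENDS_WITH_DOT, "See you soon."));  // true
        System.out.println(found(ENDS_WITH_DOT, "See you soon"));   // false
        System.out.println(found(FOUR_DIGITS, "1879"));             // true
        System.out.println(found(FOUR_DIGITS, "18790"));            // false
        System.out.println(found(SEASON, "I love summer days"));    // true
    }
}
```

The ^ and $ symbols in the second pattern are what reject "18790": without them, find would accept any text that contains 4 digits in a row.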
Exercise 1:
Create a project called CarIDCheck that asks the user to enter a car id, and checks if it’s made of 4 digits followed by 3 uppercase letters. We are not going to check if these letters are vowels or not, we just check if they are uppercase.
In order to make more complex expressions, we need some additional symbols in the pattern. To be more precise, we need to specify the cardinality of some parts of the expression (how many times they can appear), in order to shorten it. This is a list of cardinality symbols that you can use in your patterns:
Symbol | Meaning |
---|---|
x? | x symbol appears 0 or 1 times |
x+ | x symbol appears 1 or more times |
x* | x symbol appears 0 or more times |
x{n} | x symbol appears n times |
x{n,} | x symbol appears at least n times |
x{n,m} | x symbol appears between n and m times (both included, with no space after the comma) |
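This way, we can easily check:
- If a text is made of exactly 4 digits:
Pattern p = Pattern.compile("^\\d{4}$");
- If a text is an id card made of 8 digits and an uppercase letter:
Pattern p = Pattern.compile("^\\d{8}[A-Z]$");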
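As a quick illustration of the cardinality symbols, here is a small sketch that checks whole texts with matches. The id card pattern is the one from the text; the INTEGER and SHORT_CODE patterns, like the class name CardinalityDemo, are hypothetical examples added only to show ?, + and {n,m}.

```java
import java.util.regex.Pattern;

class CardinalityDemo
{
    // Pattern from the text: 8 digits followed by one uppercase letter
    static final Pattern ID_CARD = Pattern.compile("^\\d{8}[A-Z]$");

    // Hypothetical extra pattern: optional minus sign, then 1 or more digits
    static final Pattern INTEGER = Pattern.compile("^-?\\d+$");

    // Hypothetical extra pattern: between 2 and 4 uppercase letters
    static final Pattern SHORT_CODE = Pattern.compile("^[A-Z]{2,4}$");

    // True only if the whole text meets the pattern
    static boolean meets(Pattern p, String text)
    {
        return p.matcher(text).matches();
    }

    public static void main(String[] args)
    {
        System.out.println(meets(ID_CARD, "12345678Z"));  // true
        System.out.println(meets(ID_CARD, "1234567Z"));   // false (only 7 digits)
        System.out.println(meets(INTEGER, "-42"));        // true
        System.out.println(meets(INTEGER, "-"));          // false (no digits)
        System.out.println(meets(SHORT_CODE, "AB"));      // true
        System.out.println(meets(SHORT_CODE, "ABCDE"));   // false (5 letters)
    }
}
```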
Exercise 2:
Repeat the previous exercise using cardinality symbols.
Exercise 3:
Create a program called EmailChecker that asks the user to enter an e-mail and checks if it’s valid. We will consider that a valid e-mail is made of one or more alphanumeric characters, followed by a @, one or more alphanumeric characters, a dot and one or more alphanumeric characters. So myEmail@one.com is a valid e-mail, but myOtherMail@aaa is not.
Groups let us isolate some parts of a text that meet a given pattern, so we can treat them later in the code. We can use the group method of the Matcher class to get these groups, and each group must be defined between parentheses (...) in the pattern. Groups are numbered from left to right.
Let’s see how groups work with the following example: we are going to get every sequence of 4 digits in a text:
String text = "Einstein was born in 1879 and Edison in 1847";
Pattern p = Pattern.compile("(\\d{4})");
Matcher m = p.matcher(text);
if (!m.find())
{
    System.out.println("The text has no sequence of 4 digits");
}
else
{
    do
    {
        // group() returns the text matched by the last successful find()
        String data = m.group();
        System.out.println("Found " + data);
    }
    while (m.find());
}
As you can see, every call to the find method moves to the next match in the text, and the group method returns the text of that match. The loop goes on until there are no more matches pending.
We can define more than one group in a single expression. In this case, the group method admits an additional parameter indicating which of the groups we are choosing (starting from 1).
The following example identifies names and surnames in a text:
String text = "Albert Einstein was born in 1879" +
    " and Thomas Edison in 1847";
Pattern p = Pattern.compile("([A-Z][a-z]+) ([A-Z][a-z]+)");
Matcher m = p.matcher(text);
if (!m.find())
{
    System.out.println("The text has no names");
}
else
{
    do
    {
        // Group 1 is the first pair of parentheses, group 2 the second one
        String name = m.group(1);
        String surname = m.group(2);
        System.out.println("Found " + name + " " + surname);
    }
    while (m.find());
}
It would output Found Albert Einstein in the first iteration and Found Thomas Edison in the second one.
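Once the groups are isolated, we can rearrange them freely. The following sketch reuses the same pattern to build a list with each person as "surname, name", and uses a plain while loop as an alternative to the do..while structure when no special message is needed for the empty case. The class name GroupsDemo and the method invertedNames are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupsDemo
{
    static final Pattern FULL_NAME =
        Pattern.compile("([A-Z][a-z]+) ([A-Z][a-z]+)");

    // Returns "surname, name" for every name and surname found in the text
    static List<String> invertedNames(String text)
    {
        List<String> result = new ArrayList<>();
        Matcher m = FULL_NAME.matcher(text);
        while (m.find())
        {
            // group(1) is the name, group(2) the surname;
            // group() with no parameter would return the whole match
            result.add(m.group(2) + ", " + m.group(1));
        }
        return result;
    }

    public static void main(String[] args)
    {
        String text = "Albert Einstein was born in 1879" +
            " and Thomas Edison in 1847";
        for (String person : invertedNames(text))
        {
            System.out.println(person);
        }
    }
}
```

With the sample text, it prints Einstein, Albert in the first line and Edison, Thomas in the second one.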
Exercise 4:
Create a program called HourIdentifier that looks for hours in a text. An hour is made of two digits, followed by :, and two digits. For instance, 08:45. We are not going to check if the hour is valid or not, we just need to identify them. Then, store these hours in a list and show them in ascending order.