Regular Expressions: The Official Regular Expressions Tutorial

Front-End Development, 2024-08-20 22:00:01

http://docs.oracle.com/javase/tutorial/essential/regex/index.html

This lesson explains how to use the java.util.regex API for pattern matching with regular expressions.

Introduction: Provides a general overview of regular expressions. It also introduces the core classes that comprise this API.
Test Harness: Defines a simple application for testing pattern matching with regular expressions.
String Literals: Introduces basic pattern matching, metacharacters, and quoting.
Character Classes: Describes simple character classes, negation, ranges, unions, intersections, and subtraction.
Predefined Character Classes: Describes the basic predefined character classes for whitespace, word, and digit characters.
Quantifiers: Explains greedy, reluctant, and possessive quantifiers for matching a specified expression x number of times.
Capturing Groups: Explains how to treat multiple characters as a single unit.
Boundary Matchers: Describes line, word, and input boundaries.
Methods of the Pattern Class: Examines other useful methods of the Pattern class, and explores advanced features such as compiling with flags and using embedded flag expressions.
Methods of the Matcher Class: Describes the commonly used methods of the Matcher class.
Methods of the PatternSyntaxException Class: Describes how to examine a PatternSyntaxException.
Additional Resources: To read more about regular expressions, consult this section for additional resources.

Introduction

What Are Regular Expressions?

The java.util.regex package consists primarily of three classes: Pattern, Matcher, and PatternSyntaxException.

A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument; the first few lessons of this trail will teach you the required syntax.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
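
As a minimal sketch of how the three classes fit together, the following compiles a pattern, obtains a matcher from it, and catches a syntax error. The class name RegexBasics and its two helper methods are illustrative names, not part of the tutorial or of the java.util.regex API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; only Pattern, Matcher and
// PatternSyntaxException come from java.util.regex.
class RegexBasics {

    // Returns true if the regex matches anywhere in the input.
    static boolean containsMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // no public constructor: use compile
        Matcher matcher = pattern.matcher(input);  // a Matcher is obtained from a Pattern
        return matcher.find();
    }

    // Returns true if the regex compiles, false if it has a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("foo", "xxfooxx")); // true
        System.out.println(isValidRegex("(unclosed"));       // false
    }
}
```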

Test Harness

This section defines a reusable test harness, RegexTestHarness.java, for exploring the regular expression constructs supported by this API. The command to run this code is java RegexTestHarness; no command-line arguments are accepted. The application loops repeatedly, prompting the user for a regular expression and input string. Using this test harness is optional, but you may find it convenient for exploring the test cases discussed in the following pages.

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {

    public static void main(String[] args){
        // System.console() returns null when no interactive console is attached.
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern =
            Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher =
            pattern.matcher(console.readLine("Enter input string to search: "));

            // Report every match in the input, not just the first one.
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

String Literals

The most basic form of pattern matching supported by this API is the match of a string literal. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Try this out with the test harness:

 
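Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.

Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.

This API also supports a number of special characters, called metacharacters, that affect the way a pattern is matched. In the next example the match succeeds even though the input string contains no dot, because the dot is a metacharacter that means "any character":

Enter your regex: cat.
Enter input string to search: cats
I found the text "cats" starting at index 0 and ending at index 4.

The metacharacters supported by this API are: <([{\^-=$!|]})?*+.>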

Note: In certain situations the special characters listed above will not be treated as metacharacters. You'll encounter this as you learn more about how regular expressions are constructed. You can, however, use this list to check whether or not a specific character will ever be considered a metacharacter. For example, the characters @ and # never carry a special meaning.

There are two ways to force a metacharacter to be treated as an ordinary character:

precede the metacharacter with a backslash, or
enclose it within \Q (which starts the quote) and \E (which ends it).
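
The two quoting techniques can be sketched as follows. The class name QuotingDemo and the helper fullMatch are illustrative names. Note that inside a Java string literal each regex backslash must itself be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting an unquoted dot with the two quoting styles.
class QuotingDemo {

    // Returns true if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unquoted: the dot is a metacharacter and matches any character.
        System.out.println(fullMatch("cat.", "cats"));       // true
        // Backslash-escaped: the dot only matches a literal dot.
        System.out.println(fullMatch("cat\\.", "cats"));     // false
        System.out.println(fullMatch("cat\\.", "cat."));     // true
        // Quoted with \Q...\E: everything in between is literal.
        System.out.println(fullMatch("\\Qcat.\\E", "cat.")); // true
    }
}
```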

Character Classes

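If you browse through the Pattern class specification, you'll see tables summarizing the supported regular expression constructs. In the "Character Classes" section you'll find the following:

Construct	Description
[abc]	a, b, or c (simple class)
[^abc]	Any character except a, b, or c (negation)
[a-zA-Z]	a through z, or A through Z, inclusive (range)
[a-d[m-p]]	a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]	d, e, or f (intersection)
[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z] (subtraction)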

Note: The word "class" in the phrase "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.

 
Enter your regex: [bcr]at
Enter input string to search: bat
I found the text "bat" starting at index 0 and ending at index 3.

Negation

 
Enter your regex: [^bcr]at
Enter input string to search: bat
No match found.

Ranges

Enter your regex: [a-c]
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Unions

Enter your regex: [0-4[6-8]]
Enter input string to search: 0
I found the text "0" starting at index 0 and ending at index 1.

Intersections

 
Enter your regex: [0-9&&[345]]
Enter input string to search: 3
I found the text "3" starting at index 0 and ending at index 1.

 
Enter your regex: [2-8&&[4-6]]
Enter input string to search: 3
No match found.

Subtraction

 
Enter your regex: [0-9&&[^345]]
Enter input string to search: 2
I found the text "2" starting at index 0 and ending at index 1.
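
The union, intersection, and subtraction forms shown in the sessions above can also be checked programmatically. This is a small sketch; the class name CharClassDemo and the helper matchesChar are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for nested character classes.
class CharClassDemo {

    // Returns true if the single character c is matched by the character class.
    static boolean matchesChar(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Union: 0 through 4, or 6 through 8.
        System.out.println(matchesChar("[0-4[6-8]]", '7'));    // true
        System.out.println(matchesChar("[0-4[6-8]]", '5'));    // false
        // Intersection: only characters present in both classes.
        System.out.println(matchesChar("[0-9&&[345]]", '3'));  // true
        System.out.println(matchesChar("[0-9&&[345]]", '6'));  // false
        // Subtraction: intersect with a negated class.
        System.out.println(matchesChar("[0-9&&[^345]]", '2')); // true
        System.out.println(matchesChar("[0-9&&[^345]]", '4')); // false
    }
}
```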

Predefined Character Classes

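The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions:

Construct	Description
.	Any character (may or may not match line terminators)
\d	A digit: [0-9]
\D	A non-digit: [^0-9]
\s	A whitespace character: [ \t\n\x0B\f\r]
\S	A non-whitespace character: [^\s]
\w	A word character: [a-zA-Z_0-9]
\W	A non-word character: [^\w]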

private final String REGEX = "\\d"; // a single digit

 
Enter your regex: .
Enter input string to search: @
I found the text "@" starting at index 0 and ending at index 1.

\d matches all digits
\s matches spaces
\w matches word characters

\D matches non-digits
\S matches non-spaces
\W matches non-word characters
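
To see the six shorthands side by side, the sketch below counts how many characters of one sample string each class matches. The class name PredefinedClassDemo, the helper countMatches, and the sample input are illustrative choices. As in the REGEX constant shown earlier, every backslash is doubled inside the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the predefined character classes.
class PredefinedClassDemo {

    // Counts the non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "a1 b2_"; // six characters: a, 1, space, b, 2, underscore
        System.out.println(countMatches("\\d", input)); // 2
        System.out.println(countMatches("\\D", input)); // 4
        System.out.println(countMatches("\\s", input)); // 1
        System.out.println(countMatches("\\S", input)); // 5
        System.out.println(countMatches("\\w", input)); // 5
        System.out.println(countMatches("\\W", input)); // 1
    }
}
```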

Quantifiers

Quantifiers allow you to specify the number of occurrences to match against. For convenience, the three sections of the Pattern API specification describing greedy, reluctant, and possessive quantifiers are presented below. At first glance it may appear that the quantifiers X?, X?? and X?+ do exactly the same thing, since they all promise to match "X, once or not at all". There are subtle implementation differences which will be explained near the end of this section.

Greedy	Reluctant	Possessive	Meaning
X?	X??	X?+	X, once or not at all
X*	X*?	X*+	X, zero or more times
X+	X+?	X++	X, one or more times
X{n}	X{n}?	X{n}+	X, exactly n times
X{n,}	X{n,}?	X{n,}+	X, at least n times
X{n,m}	X{n,m}?	X{n,m}+	X, at least n but not more than m times

 
Enter your regex: a?
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

 
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

 
Enter your regex: a?
Enter input string to search: aaaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

 
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

 
Enter your regex: a{3}
Enter input string to search: aa
No match found.

 
Enter your regex: a{3}
Enter input string to search: aaaaaaaaa
I found the text "aaa" starting at index 0 and ending at index 3.
I found the text "aaa" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

 
Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.

 
Enter your regex: a{3,6} // find at least 3 (but no more than 6) a's in a row
Enter input string to search: aaaaaaaaa
I found the text "aaaaaa" starting at index 0 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

 
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

 
Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
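
The difference between the three quantifier families shows up when the same input is searched with the greedy, reluctant, and possessive forms of .*foo. The sketch below is one way to compare them; the class name QuantifierDemo and the helper findAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive quantifiers.
class QuantifierDemo {

    // Collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes the whole input, then backs off until foo fits.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible before each foo.
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes the whole input and never backs off.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```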

Capturing Groups

In the previous section, we saw how quantifiers attach to one character, character class, or capturing group at a time. But until now, we have not discussed the notion of capturing groups in any detail.

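Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

Group 1: ((A)(B(C)))
Group 2: (A)
Group 3: (B(C))
Group 4: (C)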

public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
public int end(int group): Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
public String group(int group): Returns the input subsequence captured by the given group during the previous match operation.

 
Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.

 
Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.
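
Group numbering and backreferences can be explored in code as well. In this sketch the class name GroupDemo and the helper groups are illustrative names; group 0 always denotes the entire match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group numbering and backreferences.
class GroupDemo {

    // Returns group 0 (the whole match) followed by each capturing group,
    // or an empty array if the regex does not match the entire input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // [ABC, ABC, A, BC, C]

        // \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1212")); // true
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1234")); // false
    }
}
```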

Boundary Matchers

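Until now, we've only been interested in whether or not a match is found at some location within a particular input string. We never cared about where in the string the match was taking place. Boundary matchers make a pattern more precise by specifying that location:

Boundary Construct	Description
^	The beginning of a line
$	The end of a line
\b	A word boundary
\B	A non-word boundary
\A	The beginning of the input
\G	The end of the previous match
\Z	The end of the input but for the final terminator, if any
\z	The end of the input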

 
Enter your regex: ^dog$
Enter input string to search: dog
I found the text "dog" starting at index 0 and ending at index 3.

 
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

 
Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.
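
The effect of \b, \B, and \G is easiest to see by comparing where matches start. The sketch below assumes nothing beyond the standard Pattern constructs; the class name BoundaryDemo and the helper starts are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for boundary matchers.
class BoundaryDemo {

    // Returns the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // \b on both sides: "dog" must be a whole word.
        System.out.println(starts("\\bdog\\b", "The dog plays in the yard."));    // [4]
        // \B on the right: "dog" must be followed by another word character.
        System.out.println(starts("\\bdog\\B", "The dog plays in the yard."));    // []
        System.out.println(starts("\\bdog\\B", "The doggie plays in the yard.")); // [4]
        // Without \G both occurrences are found; with \G a match must begin
        // where the previous one ended, so the second "dog" is rejected.
        System.out.println(starts("dog", "dog dog"));    // [0, 4]
        System.out.println(starts("\\Gdog", "dog dog")); // [0]
    }
}
```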

Methods of the Pattern Class

Until now, we've only used the test harness to create Pattern objects in their most basic form. This section explores advanced techniques such as creating patterns with flags and using embedded flag expressions. It also explores some additional useful methods that we haven't yet discussed.

Pattern.CANON_EQ: Enables canonical equivalence. When this flag is specified, two characters will be considered to match if, and only if, their full canonical decompositions match. The expression "a\u030A", for example, will match the string "\u00E5" when this flag is specified. By default, matching does not take canonical equivalence into account. Specifying this flag may impose a performance penalty.
Pattern.CASE_INSENSITIVE: Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag. Case-insensitive matching can also be enabled via the embedded flag expression (?i). Specifying this flag may impose a slight performance penalty.
Pattern.COMMENTS: Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression (?x).
Pattern.DOTALL: Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
Pattern.LITERAL: Enables literal parsing of the pattern. When this flag is specified, the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning. The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on matching when used in conjunction with this flag. The other flags become superfluous. There is no embedded flag character for enabling literal parsing.
Pattern.MULTILINE: Enables multiline mode. In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence. Multiline mode can also be enabled via the embedded flag expression (?m).
Pattern.UNICODE_CASE: Enables Unicode-aware case folding. When this flag is specified, case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case folding can also be enabled via the embedded flag expression (?u). Specifying this flag may impose a performance penalty.
Pattern.UNIX_LINES: Enables UNIX lines mode. In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $. UNIX lines mode can also be enabled via the embedded flag expression (?d).

Pattern pattern = 
Pattern.compile(console.readLine("%nEnter your regex: "),Pattern.CASE_INSENSITIVE);

 
Enter your regex: dog
Enter input string to search: DoGDOg
I found the text "DoG" starting at index 0 and ending at index 3.
I found the text "DOg" starting at index 3 and ending at index 6.

pattern = Pattern.compile("[az]$",Pattern.MULTILINE | Pattern.UNIX_LINES);

 
final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
Pattern pattern = Pattern.compile("aa",flags);

 
Enter your regex: (?i)foo
Enter input string to search: FOOfooFoOfoO
I found the text "FOO" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "FoO" starting at index 6 and ending at index 9.
I found the text "foO" starting at index 9 and ending at index 12.

Constant	Equivalent Embedded Flag Expression
Pattern.CANON_EQ	None
Pattern.CASE_INSENSITIVE	(?i)
Pattern.COMMENTS	(?x)
Pattern.MULTILINE	(?m)
Pattern.DOTALL	(?s)
Pattern.LITERAL	None
Pattern.UNICODE_CASE	(?u)
Pattern.UNIX_LINES	(?d)
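
A compile-time flag and its embedded flag expression should behave identically. The sketch below checks this for case-insensitive matching and also shows the effect of MULTILINE on ^ and $. The class name FlagDemo and the helper count are illustrative names; passing 0 as the flags argument means "no flags".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing compile flags with embedded flag expressions.
class FlagDemo {

    // Counts matches of regex in input when compiled with the given flags (0 = none).
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The flag constant and the embedded expression (?i) give the same result.
        System.out.println(count("dog", Pattern.CASE_INSENSITIVE, "DoGDOg")); // 2
        System.out.println(count("(?i)dog", 0, "DoGDOg"));                    // 2
        System.out.println(count("dog", 0, "DoGDOg"));                        // 0

        // By default ^ and $ anchor to the whole input; MULTILINE anchors to each line.
        String lines = "dog\ncat\ndog";
        System.out.println(count("^dog$", 0, lines));                 // 0
        System.out.println(count("^dog$", Pattern.MULTILINE, lines)); // 2
        System.out.println(count("(?m)^dog$", 0, lines));             // 2
    }
}
```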

import java.util.regex.Pattern;

public class SplitDemo {

    private static final String REGEX = ":";
    private static final String INPUT =
        "one:two:three:four:five";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Split the input around each match of the delimiter pattern.
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}

OUTPUT:

one
two
three
four
five

import java.util.regex.Pattern;

public class SplitDemo2 {

    private static final String REGEX = "\\d";
    private static final String INPUT =
        "one9two4three7four1five";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Any single digit acts as a delimiter here.
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}

OUTPUT:

one
two
three
four
five

public static String quote(String s): Returns a literal pattern String for the specified String. This method produces a String that can be used to create a Pattern that would match String s as if it were a literal pattern. Metacharacters or escape sequences in the input sequence will be given no special meaning.
public String toString(): Returns the String representation of this pattern. This is the regular expression from which this pattern was compiled.

public boolean matches(String regex): Tells whether or not this string matches the given regular expression. An invocation of this method of the form str.matches(regex) yields exactly the same result as the expression Pattern.matches(regex, str).
public String[] split(String regex, int limit): Splits this string around matches of the given regular expression. An invocation of this method of the form str.split(regex, n) yields the same result as the expression Pattern.compile(regex).split(str, n).
public String[] split(String regex): Splits this string around matches of the given regular expression. This method works the same as if you invoked the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are not included in the resulting array.

public String replace(CharSequence target, CharSequence replacement): Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end; for example, replacing "aa" with "b" in the string "aaa" will result in "ba" rather than "ab".
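
The equivalences stated above between the String methods and their Pattern counterparts can be checked with a short sketch. The class name StringRegexDemo and the helper splitViaPattern are illustrative names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class for the regex-aware methods of java.lang.String.
class StringRegexDemo {

    // The Pattern-based form that str.split(regex, limit) is documented to equal.
    static String[] splitViaPattern(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "one:two:three";

        // str.matches(regex) is equivalent to Pattern.matches(regex, str).
        System.out.println("foo123".matches("[a-z]+\\d+"));          // true
        System.out.println(Pattern.matches("[a-z]+\\d+", "foo123")); // true

        // A limit of 2 yields at most two pieces.
        System.out.println(Arrays.toString(input.split(":", 2)));            // [one, two:three]
        System.out.println(Arrays.toString(splitViaPattern(":", input, 2))); // [one, two:three]

        // The one-argument form drops trailing empty strings.
        System.out.println(Arrays.toString("a:b::".split(":"))); // [a, b]

        // replace works on literal text and proceeds left to right.
        System.out.println("aaa".replace("aa", "b")); // ba
    }
}
```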

Methods of the Matcher Class

This section describes some additional useful methods of the Matcher class. For convenience, the methods listed below are grouped according to functionality.

public int start(): Returns the start index of the previous match.
public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
public int end(): Returns the offset after the last character matched.
public int end(int group): Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.

public boolean lookingAt(): Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
public boolean find(): Attempts to find the next subsequence of the input sequence that matches the pattern.
public boolean find(int start): Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
public boolean matches(): Attempts to match the entire region against the pattern.

public Matcher appendReplacement(StringBuffer sb, String replacement): Implements a non-terminal append-and-replace step.
public StringBuffer appendTail(StringBuffer sb): Implements a terminal append-and-replace step.
public String replaceAll(String replacement): Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
public String replaceFirst(String replacement): Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
public static String quoteReplacement(String s): Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement s in the appendReplacement method of the Matcher class. The String produced will match the sequence of characters in s treated as a literal sequence. Backslashes ('\') and dollar signs ('$') will be given no special meaning.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatcherDemo {

    private static final String REGEX =
        "\\bdog\\b";
    private static final String INPUT =
        "dog dog dog doggie dogg";

    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       //  get a matcher object
       Matcher m = p.matcher(INPUT);
       int count = 0;
       // \b on both sides means only the whole word "dog" is counted
       while(m.find()) {
           count++;
           System.out.println("Match number "
                              + count);
           System.out.println("start(): "
                              + m.start());
           System.out.println("end(): "
                              + m.end());
       }
    }
}

OUTPUT:

Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class MatchesLooking {

    private static final String REGEX = "foo";
    private static final String INPUT =
        "fooooooooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;

    public static void main(String[] args) {

        // Initialize
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);

        System.out.println("Current REGEX is: "
                           + REGEX);
        System.out.println("Current INPUT is: "
                           + INPUT);

        // lookingAt() only requires a match at the start of the input;
        // matches() requires the entire input to match.
        System.out.println("lookingAt(): "
            + matcher.lookingAt());
        System.out.println("matches(): "
            + matcher.matches());
    }
}

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ReplaceDemo {

    private static String REGEX = "dog";
    private static String INPUT =
        "The dog says meow. All dogs say meow.";
    private static String REPLACE = "cat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        // replace every occurrence of the pattern
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

OUTPUT: The cat says meow. All cats say meow.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ReplaceDemo2 {

    private static String REGEX = "a*b";
    private static String INPUT =
        "aabfooaabfooabfoob";
    private static String REPLACE = "-";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        // each run of zero or more a's followed by b becomes one "-"
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

OUTPUT: -foo-foo-foo-

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexDemo {

    private static String REGEX = "a*b";
    private static String INPUT = "aabfooaabfooabfoob";
    private static String REPLACE = "-";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        StringBuffer sb = new StringBuffer();
        // copy the text before each match, then the replacement
        while(m.find()){
            m.appendReplacement(sb, REPLACE);
        }
        // copy whatever remains after the last match
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}

OUTPUT: -foo-foo-foo-

public String replaceFirst(String regex, String replacement): Replaces the first substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceFirst(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceFirst(repl).
public String replaceAll(String regex, String replacement): Replaces each substring of this string that matches the given regular expression with the given replacement. An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression Pattern.compile(regex).matcher(str).replaceAll(repl).
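
Because backslashes and dollar signs are special in a replacement string, a replacement that should be inserted verbatim needs Matcher.quoteReplacement. The sketch below contrasts a replacement that uses group references ($1, $2) with one that is quoted. The class name ReplacementDemo and both helper methods are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group references and literal replacements.
class ReplacementDemo {

    // Swaps the two sides of each key=value pair using group references.
    static String swap(String input) {
        return input.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    // Replaces every match of regex with the replacement text taken literally.
    static String replaceLiteral(String input, String regex, String literal) {
        return Pattern.compile(regex)
                      .matcher(input)
                      .replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(swap("key=value"));                   // value=key
        // Unquoted, "$5" would be read as a reference to group 5.
        System.out.println(replaceLiteral("cost: X", "X", "$5")); // cost: $5
    }
}
```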

Methods of the PatternSyntaxException Class

A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern. The PatternSyntaxException class provides the following methods to help you determine what went wrong:

public String getDescription(): Retrieves the description of the error.
public int getIndex(): Retrieves the error index.
public String getPattern(): Retrieves the erroneous regular expression pattern.
public String getMessage(): Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular-expression pattern, and a visual indication of the error index within the pattern.

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.PatternSyntaxException;

public class RegexTestHarness2 {

    public static void main(String[] args){
        Pattern pattern = null;
        Matcher matcher = null;

        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            try{
                pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));

                matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));
            }
            catch(PatternSyntaxException pse){
                // Report every detail the exception exposes, then quit.
                console.format("There is a problem" +
                               " with the regular expression!%n");
                console.format("The pattern in question is: %s%n", pse.getPattern());
                console.format("The description is: %s%n", pse.getDescription());
                console.format("The message is: %s%n", pse.getMessage());
                console.format("The index is: %s%n", pse.getIndex());
                System.exit(0);
            }
            boolean found = false;
            while (matcher.find()) {
                // The format string has three specifiers, so it needs three arguments.
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

Enter your regex: ?i)
There is a problem with the regular expression!
The pattern in question is: ?i)
The description is: Dangling meta character '?'
The message is: Dangling meta character '?' near index 0
?i)
^
The index is: 0

Unicode Support

As of the JDK 7 release, regular expression pattern matching has expanded functionality to support Unicode 6.0.

String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
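
As a sketch of how such a hex pattern might be used, the following builds the \x{...} construct for a code point and matches it against the corresponding character. The backslash has to be doubled in the Java string literal, since "\x" is not a valid Java escape. The class name CodePointDemo, its helpers, and the sample code point U+2011F are illustrative choices.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for matching a specific code point by hex notation.
class CodePointDemo {

    // Builds the regex \x{h...h} for the given code point.
    static String hexPattern(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Returns true if input consists of exactly the given code point.
    static boolean matchesCodePoint(int codePoint, String input) {
        return Pattern.matches(hexPattern(codePoint), input);
    }

    public static void main(String[] args) {
        int codePoint = 0x2011F; // a supplementary character (outside the BMP)
        String text = new String(Character.toChars(codePoint));

        System.out.println(hexPattern(codePoint));             // \x{2011f}
        System.out.println(matchesCodePoint(codePoint, text)); // true
        System.out.println(matchesCodePoint(codePoint, "A"));  // false
    }
}
```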

Scripts

Blocks

General Category
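
The three kinds of Unicode property named in the headings above can each be tested with the \p{...} construct. The sketch below uses property names documented in the Pattern class specification (IsLatin for a script, InGreek for a block, Lu for the uppercase-letter general category); the class name UnicodePropertyDemo and the helper matches are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for script, block and general-category properties.
class UnicodePropertyDemo {

    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03B1"; // GREEK SMALL LETTER ALPHA

        // Script: \p{IsLatin} matches characters in the Latin script.
        System.out.println(matches("\\p{IsLatin}", "a"));   // true
        System.out.println(matches("\\p{IsLatin}", alpha)); // false

        // Block: \p{InGreek} matches characters in the Greek block.
        System.out.println(matches("\\p{InGreek}", alpha)); // true

        // General category: \p{Lu} matches uppercase letters.
        System.out.println(matches("\\p{Lu}", "A"));        // true
        System.out.println(matches("\\p{Lu}", "a"));        // false
    }
}
```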

Additional Resources

Now that you've completed this lesson on regular expressions, you'll probably find that your main references will be the API documentation for the following classes: Pattern, Matcher, and PatternSyntaxException.

Questions and Exercises: Regular Expressions

Questions

What are the three public classes in the java.util.regex package? Describe the purpose of each.
Consider the string literal "foo". What is the start index? What is the end index? Explain what these numbers mean.
What is the difference between an ordinary character and a metacharacter? Give an example of each.
How do you force a metacharacter to act like an ordinary character?
What do you call a set of characters enclosed in square brackets? What is it for?
Here are three predefined character classes: \d, \s, and \w. Describe each one, and rewrite it using square brackets.
For each of \d, \s, and \w, write two simple expressions that match the opposite set of characters.
Consider the regular expression (dog){3}. Identify the two subexpressions. What string does the expression match?