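Contents

  • Java Regular Expressions, Step by Step
    • 1. Introduction to regular expressions
      • 1.1 What is a regular expression?
      • 1.2 How java.util.regex is organized
      • 1.3 How this article presents regular expressions
    • 2. Test harness
    • 3. String literals
      • 3.1 A first try
      • 3.2 Metacharacters
    • 4. Character classes
      • 4.1 Simple classes
      • 4.2 Negation
      • 4.3 Ranges
      • 4.4 Unions
      • 4.5 Intersections
      • 4.6 Subtraction
    • 5. Predefined character classes
    • 6. Quantifiers
      • 6.1 <=1, >= 0, >= 1 and zero-length matches
      • 6.2 == n
      • 6.3 >= n
      • 6.4 {n, m}
      • 6.5 Differences among the three kinds of quantifier
    • 7. Capturing groups
    • 8. Boundary matchers
      • 8.1 ^, $: beginning and end of a line
      • 8.2 \b, \B: word boundaries
      • 8.3 \G: the end of the previous match
    • 9. Methods of the Pattern class
      • 9.1 Patterns with flags
      • 9.2 Embedded flag expressions
      • 9.3 matches(String,CharSequence)
      • 9.4 split(String)
      • 9.5 Other methods
      • 9.6 Regex support in java.lang.String
    • 10. Methods of the Matcher class
      • 10.1 Index methods
      • 10.2 Study methods
      • 10.3 Replacement methods
      • 10.4 java.lang.String equivalents of Matcher methods
    • 11. PatternSyntaxException
    • 12. Unicode Support
    • 13. Further resources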

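Java Regular Expressions, Step by Step

The java.util.regex package contains the API for pattern matching with regular expressions. This article introduces regular expressions in Java step by step. Only a basic knowledge of Java syntax is assumed.

1. Introduction to regular expressions

1.1 What is a regular expression?

A regular expression is a way to describe a set of strings based on characteristics shared by every string in the set. Regular expressions can be used to search, edit, or manipulate text and data. To write them you must learn a specific syntax, one that goes beyond the normal syntax of the Java language. Regular expressions vary in complexity, but once you understand the basics of how they are constructed, you will be able to read or write any regular expression.

This article covers the regular expression syntax supported by the java.util.regex API and gives many runnable examples that show how its objects interact. In the world of regular expressions there are many flavors to choose from, such as grep, Perl, Python, PHP, and awk. The syntax of java.util.regex is most similar to that of Perl.

1.2 How java.util.regex is organized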

/*
java.util.regex

Interfaces
	MatchResult
Classes
	Matcher
	Pattern
Exceptions
	PatternSyntaxException

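*/

The package consists mainly of three classes: Pattern, Matcher, and PatternSyntaxException.

  • Pattern. A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, call one of its public static compile methods, which return a Pattern object.
  • Matcher. A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like Pattern, it provides no public constructors; you obtain a Matcher by calling the matcher method on a Pattern object.
  • PatternSyntaxException. An unchecked exception that indicates a syntax error in a regular expression pattern.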
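To make the relationship between the two main classes concrete, here is a minimal sketch. The class name FindAllSketch and the helper findAll are made up for this illustration; only Pattern.compile, Pattern.matcher, Matcher.find, and Matcher.group come from the API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllSketch {

    // Returns every substring of input that regex matches, in order.
    // (findAll is a hypothetical helper, not part of java.util.regex.)
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // static factory; there is no public constructor
        Matcher matcher = pattern.matcher(input);   // a Matcher is always obtained from a Pattern
        List<String> found = new ArrayList<>();
        while (matcher.find()) {                    // advance to the next match, if any
            found.add(matcher.group());             // text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("foo", "foofoofoo"));
    }
}
```

A compiled Pattern can be reused to create any number of Matcher objects, one per input string.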

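1.3 How this article presents regular expressions

We first introduce regular expressions themselves and how to construct them. Only afterwards do we look closely at the classes that work with them.

2. Test harness

The test harness is a small, reusable program that we use throughout to explore how regular expressions are constructed. Run it from a terminal: System.console() returns null when no console is attached, in which case the program prints "No console." and exits.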

import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {

    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern = 
            Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher = 
            pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

3. String literals

3.1 A first try

Try the following input to get a feel for the harness.

/* java regex.RegexTestHarness
Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.
*/

Now try a longer input:

/* java regex.RegexTestHarness
Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.
*/

3.2 Metacharacters

Now try this input:

/* java regex.RegexTestHarness
Enter your regex: cat.
Enter input string to search: cats
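I found the text "cats" starting at index 0 and ending at index 4.
*/

The "." clearly did something here: it matched an arbitrary single character. Characters with this kind of special meaning are called metacharacters. The Java regex API supports the following metacharacters:

< ( [ { \ ^ - = $ ! | ] } ) ? * + . >

What if the character we need to match is itself one of these metacharacters? There are two ways to have it treated as an ordinary character:

  • \. Precede the metacharacter with a backslash.
  • \Q and \E. Enclose the text to be treated literally between \Q (start of quote) and \E (end of quote).

Here is an example: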

/* java regex.RegexTestHarness

Enter your regex: .
Enter input string to search: asdasdasd.
I found the text "a" starting at index 0 and ending at index 1.
I found the text "s" starting at index 1 and ending at index 2.
I found the text "d" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "s" starting at index 4 and ending at index 5.
I found the text "d" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "s" starting at index 7 and ending at index 8.
I found the text "d" starting at index 8 and ending at index 9.
I found the text "." starting at index 9 and ending at index 10.

Enter your regex: .
Enter input string to search:
No match found.

Enter your regex: \.
Enter input string to search: asdasdasd.
I found the text "." starting at index 9 and ending at index 10.

Enter your regex: \Qd.\E
Enter input string to search: asdasdasd.
I found the text "d." starting at index 8 and ending at index 10.

*/
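The harness reads the regex from the console, so a single backslash is enough there. In Java source code the backslash is also the string-literal escape character, so each one must be doubled. The sketch below repeats the examples above from code; EscapeSketch and count are illustrative names, and Pattern.quote is the API's way of producing a quoted literal for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeSketch {

    // Counts how many times regex matches in input (hypothetical helper).
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "asdasdasd.";
        System.out.println(count(".", input));                  // metacharacter: every character matches
        System.out.println(count("\\.", input));                // regex \. : only the literal dot
        System.out.println(count("\\Qd.\\E", input));           // regex \Qd.\E : the literal text "d."
        System.out.println(count(Pattern.quote("d."), input));  // same idea, quoting done by the API
    }
}
```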

4. Character classes

Construct         Description
---------         -----------
[abc]             a, b, or c (simple class)
[^abc]            Any character except a, b, or c (negation)
[a-zA-Z]          a through z, or A through Z, inclusive (range)
[a-d[m-p]]        a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]      d, e, or f (intersection)
[a-z&&[^bc]]      a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]     a through z, and not m through p: [a-lq-z] (subtraction)

If you have used glob syntax, most of this will look familiar.

An example of each construct follows. A little set theory goes a long way here.

4.1 Simple classes

The simple class is the most straightforward form: a character matches if it is any one of the characters inside the brackets.

/*

Enter your regex: [bcr]at
Enter input string to search: bat
I found the text "bat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: cat
I found the text "cat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: rat
I found the text "rat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: hat
No match found.

*/

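4.2 Negation

Negation corresponds to the complement in set theory.

It is easy to understand even without set theory: read [^...] as ^ applied to [...], where ^ takes the complement of the class, so it matches any character that is not listed.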

/*

Enter your regex: [^bcr]at
Enter input string to search: bat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: cat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: rat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: hat
I found the text "hat" starting at index 0 and ending at index 3.

*/

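4.3 Ranges

[1-5] matches any single character from '1' through '5'. Ranges can be placed side by side: [a-zA-Z] matches any upper- or lowercase ASCII letter.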

/*

Enter your regex: [a-c]
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: b
I found the text "b" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: c
I found the text "c" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: d
No match found.

Enter your regex: foo[1-5]
Enter input string to search: foo1
I found the text "foo1" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo5
I found the text "foo5" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo6
No match found.

Enter your regex: foo[^1-5]
Enter input string to search: foo1
No match found.

Enter your regex: foo[^1-5]
Enter input string to search: foo6
I found the text "foo6" starting at index 0 and ending at index 4.
    
*/

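4.4 Unions

The same idea applies: nest one class inside another, and a character matches if it belongs to either. The examples below make this clear.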

/*[0-4[6-8]]

Enter your regex: [0-4[6-8]]
Enter input string to search: 0
I found the text "0" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 5
No match found.

Enter your regex: [0-4[6-8]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 8
I found the text "8" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]
Enter input string to search: 9
No match found.
    
*/

4.5 Intersections

/* [0-9&&[345]]

Enter your regex: [0-9&&[345]]
Enter input string to search: 3
I found the text "3" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]
Enter input string to search: 2
No match found.

Enter your regex: [0-9&&[345]]
Enter input string to search: 6
No match found.

*/
/*[2-8&&[4-6]]

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 3
No match found.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]
Enter input string to search: 7
No match found.

*/

4.6 Subtraction

/* [0-9&&[^345]]

Enter your regex: [0-9&&[^345]]
Enter input string to search: 2
I found the text "2" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 3
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 4
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 5
No match found.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]
Enter input string to search: 9
I found the text "9" starting at index 0 and ending at index 1.
    
*/
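The set operations above can be checked in one pass by asking which of the digits 0 through 9 a given character class accepts. This is only a sketch: CharClassSketch and accepted are names invented for the illustration.

```java
import java.util.regex.Pattern;

class CharClassSketch {

    // Returns the digits '0'..'9' that the given character class matches,
    // concatenated into one string (hypothetical helper).
    static String accepted(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (p.matcher(String.valueOf(c)).find()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(accepted("[0-4[6-8]]"));     // union
        System.out.println(accepted("[0-9&&[345]]"));   // intersection
        System.out.println(accepted("[0-9&&[^345]]"));  // subtraction
        System.out.println(accepted("[^0-4]"));         // negation
    }
}
```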

5. Predefined character classes

Construct   Description
---------   -----------
.           Any character (may or may not match line terminators)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]

Use these shorthands wherever you can. They make regular expressions easier to read and eliminate errors introduced by long, hard-to-read character classes.

Examples of each construct:

/*

Enter your regex: .
Enter input string to search: @
I found the text "@" starting at index 0 and ending at index 1.

Enter your regex: . 
Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: .
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \d
Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: \d
Enter input string to search: a
No match found.

Enter your regex: \D
Enter input string to search: 1
No match found.

Enter your regex: \D
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \s
Enter input string to search:  
I found the text " " starting at index 0 and ending at index 1.

Enter your regex: \s
Enter input string to search: a
No match found.

Enter your regex: \S
Enter input string to search:  
No match found.

Enter your regex: \S
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w
Enter input string to search: !
No match found.

Enter your regex: \W
Enter input string to search: a
No match found.

Enter your regex: \W
Enter input string to search: !
I found the text "!" starting at index 0 and ending at index 1.

*/
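As a rough check of the table, the sketch below compares each shorthand with its spelled-out class over the 128 ASCII characters. ShorthandSketch and sameOnAscii are illustrative names. Note the doubled backslashes: inside a Java string literal, the regex \d has to be written "\\d".

```java
import java.util.regex.Pattern;

class ShorthandSketch {

    // True if both single-character patterns accept exactly the same
    // ASCII characters (hypothetical helper; only code points 0..127 are tested).
    static boolean sameOnAscii(String shorthand, String explicit) {
        Pattern a = Pattern.compile(shorthand);
        Pattern b = Pattern.compile(explicit);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            // matches() requires the whole one-character input to match
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOnAscii("\\d", "[0-9]"));
        System.out.println(sameOnAscii("\\D", "[^0-9]"));
        System.out.println(sameOnAscii("\\s", "[ \\t\\n\\x0B\\f\\r]"));
        System.out.println(sameOnAscii("\\w", "[a-zA-Z_0-9]"));
    }
}
```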

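6. Quantifiers

Quantifiers specify how many times a match may occur. A quantifier applies only to the single element immediately before it: in abc+, the + applies only to c. A quantifier can also be attached to a character class, as in [abc]+, or to a group. Note the difference between (dog){3}, which repeats the whole group dog three times, and dog{3}, which repeats only the g.

Greedy     Reluctant   Possessive   Meaning
------     ---------   ----------   -------
X?         X??         X?+          X, once or not at all (<= 1)
X*         X*?         X*+          X, zero or more times (>= 0)
X+         X+?         X++          X, one or more times (>= 1)
X{n}       X{n}?       X{n}+        X, exactly n times (== n)
X{n,}      X{n,}?      X{n,}+       X, at least n times (>= n)
X{n,m}     X{n,m}?     X{n,m}+      X, at least n but not more than m times ([n, m])

Let us start with the greedy quantifiers, using an empty input string: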

/*

Enter your regex: a?
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search: 
No match found.

*/

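6.1 <=1, >= 0, >= 1 and zero-length matches

Note this line in the output above:

I found the text "" starting at index 0 and ending at index 0.

In other words, we got a match of length 0. Both a? and a* allow zero occurrences of a, so they match the empty string; a+ requires at least one a and does not.

Here is another example: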

/*

Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

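*/

Notice that a? and a* each match twice against the input a: once on the letter itself, and once more as a zero-length match at the end of the input (index 1).

One more example: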

/*

Enter your regex: a?
Enter input string to search: aaaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a*
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a+
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
    
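*/

Here we get a first sense of what greedy means: a* and a+ consume as many a's as they can in a single match, while a? can take at most one at a time. The next example reinforces this: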

/*

Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

*/
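The same transcripts can be reproduced from code by recording the start and end index of every match. In this sketch, ZeroLengthSketch and spans are hypothetical names; a zero-length match shows up as a span whose start equals its end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthSketch {

    // Returns each match as "start-end" (hypothetical helper).
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("a*", "ababaaaab")); // includes zero-length spans such as 1-1
        System.out.println(spans("a+", "ababaaaab")); // never zero-length: at least one a is required
    }
}
```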

6.2 == n

Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
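I found the text "aaa" starting at index 0 and ending at index 3.

What happens on a longer input? Each match consumes exactly three a's, and the next search resumes where the previous match ended: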

Enter your regex: a{3}
Enter input string to search: aaaaaaaaa
I found the text "aaa" starting at index 0 and ending at index 3.
I found the text "aaa" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

6.3 >= n

Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.

6.4 {n, m}

Enter your regex: a{3,6} // find at least 3 (but no more than 6) a's in a row
Enter input string to search: aaaaaaaaa
I found the text "aaaaaa" starting at index 0 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.

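6.5 Differences among the three kinds of quantifier

As the table at the start of section 6 shows, there are three kinds of quantifier: greedy, reluctant, and possessive.

An example shows the difference.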

Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

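Let us look at how each one works. In the traces below, the parentheses mark the part of the input consumed by the quantified .* and the rest of the line is what remains for foo to match.

  • Greedy (.*foo). The .* first consumes the entire input, which leaves nothing for foo. The matcher then backs off one character at a time until foo matches:

    (xfooxxxxxxfoo)        remaining ""      foo fails
    (xfooxxxxxxfo)o        remaining "o"     foo fails
    (xfooxxxxxxf)oo        remaining "oo"    foo fails
    (xfooxxxxxx)foo        remaining "foo"   foo matches  √

    "xfooxxxxxxfoo" is the only match.

  • Reluctant (.*?foo). The .*? starts by consuming nothing and takes one more character only when foo fails at the current position:

    ()xfooxxxxxxfoo        foo fails
    (x)fooxxxxxxfoo        foo matches  √

    "xfoo" is a match. The search resumes at index 4:

    ()xxxxxxfoo            foo fails
    (x)xxxxxfoo            foo fails
    (xx)xxxxfoo            foo fails
    (xxx)xxxfoo            foo fails
    (xxxx)xxfoo            foo fails
    (xxxxx)xfoo            foo fails
    (xxxxxx)foo            foo matches  √

    "xxxxxxfoo" is another match.

  • Possessive (.*+foo). The .*+ consumes the entire input and never gives anything back, so nothing is left for foo:

    (xfooxxxxxxfoo)        remaining ""      foo fails

    No match.

Summary:

  • Greedy: consumes as many characters as possible, then backtracks if the rest of the pattern fails.
  • Reluctant: consumes as few characters as possible, extending the match only when the rest of the pattern fails.
  • Possessive: consumes as many characters as possible, like greedy, but never backtracks.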
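The three behaviours can be compared side by side from code. This is a small sketch; QuantifierKindsSketch and firstMatch are made-up names, and the input is the same string used in the transcript.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierKindsSketch {

    // Returns the text of the first match, or null if there is none
    // (hypothetical helper).
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(firstMatch(".*foo", input));   // greedy: longest possible match
        System.out.println(firstMatch(".*?foo", input));  // reluctant: shortest possible match
        System.out.println(firstMatch(".*+foo", input));  // possessive: null, .*+ keeps everything
    }
}
```

Possessive quantifiers are mostly useful when you know backtracking cannot help, since they let the matcher fail quickly instead of retrying shorter alternatives.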

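7. Capturing groups

Capturing groups are a way to treat multiple characters as a single unit. A group is created by placing the characters inside a set of parentheses. For example, (dog) creates a single group containing the letters d, o, and g. The portion of the input that matches a capturing group is saved in memory so that it can be recalled later through backreferences.

How are groups numbered? By counting their opening parentheses from left to right. Take ((A)(B(C))) as an example:

  • 1. ((A)(B(C)))
  • 2. (A)
  • 3. (B(C))
  • 4. (C)

The groupCount method returns the number of capturing groups in the matcher's pattern; for the expression above it returns 4. The following Matcher methods take a group number:

  • public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
  • public int end(int group): Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
  • public String group(int group): Returns the input subsequence captured by the given group during the previous match operation.

Group 0 is special: it always represents the entire expression, and it is not included in the total reported by groupCount. Groups that begin with (? are pure, non-capturing groups: they do not capture text and do not count towards the group total.

Backreferences. A backreference recalls the text saved by a capturing group. It is written as a backslash followed by the number of the group to recall. For example: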

Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.

(\d\d) is a group, numbered 1, and \1 is a reference to it. Once \d\d has matched 12, \1 must match exactly the same text, 12, again.

Changing the digits confirms this:

Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.
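Group numbering and backreferences can also be inspected programmatically. The following sketch lists group 0 through groupCount() for the first match; GroupSketch and groups are names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {

    // Returns group 0 (the whole match) followed by groups 1..groupCount()
    // for the first match, or an empty list if nothing matches
    // (hypothetical helper).
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("((A)(B(C)))", "ABC"));   // groups numbered by opening parenthesis
        System.out.println(groups("(\\d\\d)\\1", "1212"));  // \1 must repeat what group 1 captured
        System.out.println(groups("(\\d\\d)\\1", "1234"));  // no match: 34 differs from 12
        System.out.println(groups("(?:A)(B)", "AB"));       // (?:...) is not counted as a group
    }
}
```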

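8. Boundary matchers

Until now we have only cared whether a match occurs somewhere in the input, never where it occurs.

Boundary matchers make a pattern more precise about position. For example, we may only be interested in a particular word if it appears at the beginning or the end of a line.

Boundary construct   Description
------------------   -----------
^                    The beginning of a line
$                    The end of a line
\b                   A word boundary
\B                   A non-word boundary
\A                   The beginning of the input
\G                   The end of the previous match
\Z                   The end of the input but for the final terminator, if any
\z                   The end of the input

8.1 ^, $: beginning and end of a line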

Enter your regex: ^dog$
Enter input string to search: dog
I found the text "dog" starting at index 0 and ending at index 3.

Enter your regex: ^dog$
Enter input string to search:       dog
No match found.

Enter your regex: \s*dog$
Enter input string to search:             dog
I found the text "            dog" starting at index 0 and ending at index 15.

Enter your regex: ^dog\w*
Enter input string to search: dogblahblah
I found the text "dogblahblah" starting at index 0 and ending at index 11.

8.2 \b, \B: word boundaries

Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

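8.3 \G: the end of the previous match

\G matches only at the position where the previous match ended (or at the start of the input for the first match). In the second example below, the second dog is preceded by a space, so it does not start where the first match ended and is not found.

Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \Gdog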
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
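To see the effect of \G more clearly, the sketch below counts matches with and without it. PreviousMatchSketch and count are hypothetical names. With \G the matches must be contiguous, so counting stops at the first gap.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreviousMatchSketch {

    // Counts the matches of regex in input (hypothetical helper).
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "dogdog dog";
        System.out.println(count("dog", input));     // every occurrence
        System.out.println(count("\\Gdog", input));  // only the contiguous run at the start
    }
}
```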

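9. Methods of the Pattern class

So far we have used the test harness to learn the basics of regular expressions. We now turn to the Pattern class itself for some more advanced techniques, such as compiling a pattern with flags, along with a few other useful methods.

9.1 Patterns with flags

First, look at the compile methods of the Pattern class: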

public static Pattern compile(String regex) {
    return new Pattern(regex, 0);
}
public static Pattern compile(String regex, int flags) {
    return new Pattern(regex, flags);
}

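The flags argument is a bit mask built from the following static constants of the Pattern class:

// Enables canonical equivalence: two characters match if and only if
// their full canonical decompositions match. With this flag, for example,
// the expression "a\u030A" matches the string "\u00E5".
// By default, canonical equivalence is not taken into account.
//
// Canonical decomposition is the Unicode-defined breakdown of a precomposed
// character into a base character plus combining marks.
public static final int CANON_EQ = 0x80;

// Case-insensitive matching (US-ASCII only by default).
// Can also be enabled with the embedded flag (?i).
public static final int CASE_INSENSITIVE = 0x02;

// Permits whitespace and comments in the pattern: whitespace is ignored,
// and a comment runs from # to the end of the line.
// Can also be enabled with the embedded flag (?x).
public static final int COMMENTS = 0x04;

// Makes . match any character, including line terminators.
// By default . does not match line terminators.
// Can also be enabled with the embedded flag (?s).
public static final int DOTALL = 0x20;

// Treats the pattern as a literal string: metacharacters and escape
// sequences have no special meaning.
public static final int LITERAL = 0x10;

// Multiline mode: ^ matches just after a line terminator (or at the start
// of the input) and $ matches just before a line terminator (or at the end
// of the input). By default they match only at the start and end of the
// entire input.
// Can also be enabled with the embedded flag (?m).
public static final int MULTILINE = 0x08;

// Unicode-aware case folding: together with CASE_INSENSITIVE, makes
// case-insensitive matching follow the Unicode Standard.
// Can also be enabled with the embedded flag (?u).
public static final int UNICODE_CASE = 0x40;

// Unix lines mode: only '\n' is recognized as a line terminator
// in the behavior of ., ^, and $.
// Can also be enabled with the embedded flag (?d).
public static final int UNIX_LINES = 0x01;

An example: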

// Change the pattern creation in RegexTestHarness as follows

Pattern pattern = 
	Pattern.compile(console.readLine("%nEnter your regex: "),
	Pattern.CASE_INSENSITIVE);

// Then run it again
Enter your regex: dog
Enter input string to search: DoGDOg
I found the text "DoG" starting at index 0 and ending at index 3.
I found the text "DOg" starting at index 3 and ending at index 6.

Multiple flags can be combined with the bitwise OR operator (|):

pattern = Pattern.compile("[az]$", Pattern.MULTILINE | Pattern.UNIX_LINES);

The combined flags can also be stored in an int variable first:

final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
Pattern pattern = Pattern.compile("aa", flags);
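The effect of a flag is easiest to see by compiling the same expression with and without it. The sketch below does this for MULTILINE and DOTALL and then combines two flags; FlagsSketch and found are names invented here, and a flags value of 0 means that no flag is set.

```java
import java.util.regex.Pattern;

class FlagsSketch {

    // True if regex, compiled with the given flags, matches somewhere in input
    // (hypothetical helper).
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without MULTILINE, ^ matches only at the start of the whole input.
        System.out.println(found("^dog$", 0, "cat\ndog"));
        System.out.println(found("^dog$", Pattern.MULTILINE, "cat\ndog"));

        // Without DOTALL, . does not match the line terminator.
        System.out.println(found("a.b", 0, "a\nb"));
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb"));

        // Flags are combined with the bitwise OR operator.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("DOG$", flags, "dog\ncat"));
    }
}
```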

9.2 Embedded flag expressions

Flags can also be enabled inside the regular expression itself. Using the original RegexTestHarness again, (?i) turns on case-insensitive matching:

Enter your regex: (?i)foo
Enter input string to search: FOOfooFoOfoO
I found the text "FOO" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "FoO" starting at index 6 and ending at index 9.
I found the text "foO" starting at index 9 and ending at index 12.

The embedded flag expressions mentioned above are summarized in this table:

Constant                    Equivalent embedded flag expression
--------                    -----------------------------------
Pattern.CANON_EQ            None
Pattern.CASE_INSENSITIVE    (?i)
Pattern.COMMENTS            (?x)
Pattern.MULTILINE           (?m)
Pattern.DOTALL              (?s)
Pattern.LITERAL             None
Pattern.UNICODE_CASE        (?u)
Pattern.UNIX_LINES          (?d)

9.3 matches(String,CharSequence)

This method is simple. The JDK 8 implementation shows exactly what it does:

// Static convenience method:
// compiles regex and tests whether the entire input matches it
public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}
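Because the implementation calls Matcher.matches(), the static method succeeds only when the entire input matches, unlike the find() loop in the test harness. A minimal sketch of the difference, with WholeInputSketch and its two helpers as illustrative names:

```java
import java.util.regex.Pattern;

class WholeInputSketch {

    // True only if the entire input matches regex.
    static boolean wholeInput(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if regex matches some part of input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("foo", "foofoo"));     // "foo" is not the whole input
        System.out.println(anywhere("foo", "foofoo"));       // but it does occur in it
        System.out.println(wholeInput("(foo)+", "foofoo"));  // this expression covers everything
    }
}
```

Pattern.matches compiles the expression on every call, so when the same expression is applied many times it is usually better to compile it once and reuse the Pattern.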

9.4 split(String)

public String[] split(CharSequence input) {
    return split(input, 0);
}

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }
    
    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Two examples of split follow:

import java.util.regex.Pattern;

public class SplitDemo {

    private static final String REGEX = ":";
    private static final String INPUT =
        "one:two:three:four:five";
    
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}

import java.util.regex.Pattern;

public class SplitDemo2 {

    private static final String REGEX = "\\d"; // matches a single digit
    private static final String INPUT =
        "one9two4three7four1five";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        String[] items = p.split(INPUT);
        for(String s : items) {
            System.out.println(s);
        }
    }
}
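The two demos use the one-argument split. The limit parameter handled in the source above changes the result in three ways: a positive limit caps the number of elements, zero drops trailing empty strings, and a negative limit keeps them. The following sketch shows each case; SplitLimitSketch and its split helper are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitLimitSketch {

    // Splits input around matches of regex with the given limit
    // (hypothetical wrapper around Pattern.split).
    static List<String> split(String regex, String input, int limit) {
        return Arrays.asList(Pattern.compile(regex).split(input, limit));
    }

    public static void main(String[] args) {
        System.out.println(split(":", "one:two:three:four:five", 2)); // at most 2 elements
        System.out.println(split(":", "a::b::", 0));                  // trailing empty strings removed
        System.out.println(split(":", "a::b::", -1));                 // trailing empty strings kept
    }
}
```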

9.5 Other methods

// Returns a literal pattern String for the specified String.
public static String quote(String s)

//  the regular expression from which this pattern was compiled.
public String toString() 

9.6 Regex support in java.lang.String

public boolean matches(String regex)


public String[] split(String regex,
                      int limit)// if positive, the result has at most limit elements

public String[] split(String regex)


public String replace(CharSequence target,
                      CharSequence replacement)// target is literal text, not a regex
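A common pitfall with these String methods is forgetting which ones take a regex. The sketch below contrasts them on a string containing dots; StringRegexSketch and pieces are illustrative names, while the String methods themselves are the ones listed above plus replaceAll.

```java
class StringRegexSketch {

    // Number of elements produced by String.split (hypothetical helper).
    static int pieces(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        String s = "a.b.c";
        System.out.println(pieces(s, "\\."));        // split on a literal dot
        System.out.println(pieces(s, "."));          // "." is a regex: every character is a delimiter
        System.out.println(s.replace(".", "-"));     // replace takes literal text
        System.out.println(s.replaceAll(".", "-"));  // replaceAll takes a regex
        System.out.println("2024".matches("\\d+"));  // matches tests the whole string
        System.out.println("2024x".matches("\\d+"));
    }
}
```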

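10. Methods of the Matcher class

The methods are grouped below by what they do.

10.1 Index methods

Index methods tell us where in the input a match was found.

public int start()// start index of the previous match

public int start(int group)// start index of the given group in the previous match

public int end()// index after the last character matched

public int end(int group)// index after the last character of the given group in the previous match

An example: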

/*
Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
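end(): 11
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherDemo {

    private static final String REGEX = // \b is a word boundary, so doggie and dogg do not match
        "\\bdog\\b";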
    private static final String INPUT =
        "dog dog dog doggie dogg";

    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       //  get a matcher object
       Matcher m = p.matcher(INPUT);
       int count = 0;
       while(m.find()) {
           count++;
           System.out.println("Match number "
                              + count);
           System.out.println("start(): "
                              + m.start());
           System.out.println("end(): "
                              + m.end()); // end() is the index just past the last matched character
      }
   }
}

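10.2 Study methods

public boolean lookingAt()
    
public boolean find()
    
public boolean find(int start)

public boolean matches()

The study methods examine the input and report whether the pattern is found. matches() succeeds only if the entire input matches the pattern. lookingAt() also starts at the beginning of the input but does not require the whole input to match. find() scans for the next subsequence that matches, wherever it starts. The differences are discussed further in the article "Java中正则Matcher类的matches()、lookAt()和find()的区别" (on the differences among matches(), lookingAt(), and find() in the Matcher class).

An example of the lookingAt and matches methods: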

/*
Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesLooking {

    private static final String REGEX = "foo";
    private static final String INPUT =
        "fooooooooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;

    public static void main(String[] args) {
   
        // Initialize
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);

        System.out.println("Current REGEX is: "
                           + REGEX);
        System.out.println("Current INPUT is: "
                           + INPUT);

        System.out.println("lookingAt(): "
            + matcher.lookingAt());
        System.out.println("matches(): "
            + matcher.matches());
    }
}

10.3 Replacement methods

public Matcher appendReplacement(StringBuffer sb,
                                 String replacement)
    
public StringBuffer appendTail(StringBuffer sb)
    
public String replaceAll(String replacement)
    
public String replaceFirst(String replacement)
    
public static String quoteReplacement(String s)

Here is an example of replaceFirst(String) and replaceAll(String):

/*
The cat says meow. All cats say meow.
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {
 
    private static String REGEX = "dog";
    private static String INPUT =
        "The dog says meow. All dogs say meow.";
    private static String REPLACE = "cat";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);// change to replaceFirst to have a try?
        System.out.println(INPUT);
    }
}

Another example:

/*
-foo-foo-foo-
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo2 {
 
    private static String REGEX = "a*b";
    private static String INPUT =
        "aabfooaabfooabfoob";
    private static String REPLACE = "-";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

Now look at appendReplacement(StringBuffer,String) and appendTail(StringBuffer), which build the same result step by step:

/*
-foo-foo-foo- 
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
 
    private static String REGEX = "a*b";
    private static String INPUT = "aabfooaabfooabfoob";
    private static String REPLACE = "-";
 
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        StringBuffer sb = new StringBuffer();
        while(m.find()){
            m.appendReplacement(sb,REPLACE);
        }
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
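The demos above use plain replacement text. In a replacement string, $ followed by a group number refers to the text captured by that group, and Matcher.quoteReplacement turns a string into a literal replacement when it contains $ or \. The sketch below illustrates both; ReplacementSketch, swapDate, and replaceLiterally are names made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {

    // Rewrites yyyy-mm-dd as dd/mm/yyyy using group references $1..$3.
    static String swapDate(String input) {
        return Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})")
                      .matcher(input)
                      .replaceAll("$3/$2/$1");
    }

    // Replaces every match of regex with the replacement taken literally,
    // so that $ and \ in it have no special meaning.
    static String replaceLiterally(String regex, String input, String replacement) {
        return Pattern.compile(regex)
                      .matcher(input)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(swapDate("2024-01-15"));
        System.out.println(replaceLiterally("PRICE", "It costs PRICE.", "$5"));
    }
}
```

Without quoteReplacement, the replacement "$5" would be read as a reference to group 5, which does not exist in the pattern PRICE.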

10.4 java.lang.String equivalents of Matcher methods

public String replaceFirst(String regex,
                           String replacement)
    
public String replaceAll(String regex,
                         String replacement)

11. PatternSyntaxException

package java.util.regex;

// PatternSyntaxException source, with the original comments removed
public class PatternSyntaxException extends IllegalArgumentException {
    private static final long serialVersionUID = -3864639126226059218L;
    private final String desc;// description of the error
    private final String pattern;// the erroneous pattern
    private final int index;// approximate index of the error in the pattern, or -1 if unknown
    
    public PatternSyntaxException(String desc, String regex, int index) {
        this.desc = desc;
        this.pattern = regex;
        this.index = index;
    }

    public int getIndex() {
        return index;
    }

    public String getDescription() {
        return desc;
    }

    public String getPattern() {
        return pattern;
    }

    private static final String nl =
        java.security.AccessController
            .doPrivileged(new GetPropertyAction("line.separator"));

    public String getMessage() {
        StringBuffer sb = new StringBuffer();
        sb.append(desc);
        if (index >= 0) {
            sb.append(" near index ");
            sb.append(index);
        }
        sb.append(nl);
        sb.append(pattern);
        if (index >= 0 && pattern != null && index < pattern.length()) {
            sb.append(nl);
            for (int i = 0; i < index; i++) sb.append(' ');
            sb.append('^');
        }
        return sb.toString();
    }

}

An example:

/*
Enter your regex: ?i)
There is a problem with the regular expression!
The pattern in question is: ?i)
The description is: Dangling meta character '?'
The message is: Dangling meta character '?' near index 0
?i)
^
The index is: 0
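*/

import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexTestHarness2 {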

    public static void main(String[] args){
        Pattern pattern = null;
        Matcher matcher = null;

        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            try{
                pattern = 
                Pattern.compile(console.readLine("%nEnter your regex: "));

                matcher = 
                pattern.matcher(console.readLine("Enter input string to search: "));
            } catch(PatternSyntaxException pse){
                console.format("There is a problem" +
                               " with the regular expression!%n");
                console.format("The pattern in question is: %s%n",
                               pse.getPattern());
                console.format("The description is: %s%n",
                               pse.getDescription());
                console.format("The message is: %s%n",
                               pse.getMessage());
                console.format("The index is: %s%n",
                               pse.getIndex());
                System.exit(0);
            }
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}
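The same exception handling works without a console, for instance when a regex arrives as user input in a larger program. A minimal sketch, in which RegexCheckSketch, describe, and the format of the returned text are assumptions made for the illustration:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexCheckSketch {

    // Returns "ok" if regex compiles, otherwise a short description of the
    // syntax error (hypothetical helper; the text format is arbitrary).
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException e) {
            return "error at index " + e.getIndex() + ": " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("(?i)foo")); // valid embedded flag expression
        System.out.println(describe("?i)"));     // the broken pattern from the transcript above
    }
}
```

Since PatternSyntaxException is unchecked, the compiler does not force this try block; it is worth adding wherever the expression is not a constant in the source code.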

12. Unicode Support

For Unicode support in regular expressions, see the Unicode support section of the Pattern API documentation.

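13. Further resources

  • The primary resource is of course the API documentation for Pattern, Matcher, and PatternSyntaxException.

  • Mastering Regular Expressions by Jeffrey E. F. Friedl. This book clearly explains how regular expressions are constructed and the principles behind them.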
