正则表达式和 re 模块¶

正则表达式¶

正则表达式是用来匹配字符串或者子串的一种模式，匹配的字符串可以很具体，也可以很一般化。

Python 标准库提供了 re 模块。

import re

re.match & re.search¶

在 re 模块中， re.match 和 re.search 是常用的两个方法：

re.match(pattern, string[, flags])
re.search(pattern, string[, flags])

两者都寻找第一个匹配成功的部分，成功则返回一个 match 对象，不成功则返回 None，不同之处在于 re.match 只匹配字符串的开头部分，而 re.search 匹配的则是整个字符串中的子串。

re.findall & re.finditer¶

re.findall(pattern, string) 返回所有匹配的对象， re.finditer 则返回一个迭代器。

re.split¶

re.split(pattern, string[, maxsplit]) 按照 pattern 指定的内容对字符串进行分割。

re.sub¶

re.sub(pattern, repl, string[, count]) 将 pattern 匹配的内容进行替换。

re.compile¶

re.compile(pattern) 生成一个 pattern 对象，这个对象有匹配，替换，分割字符串的方法。

正则表达式规则¶

正则表达式由一些普通字符和一些元字符（metacharacters）组成。普通字符包括大小写的字母和数字，而元字符则具有特殊的含义：

子表达式	匹配内容
`.`	匹配除了换行符之外的内容
`\w`	匹配所有字母和数字字符
`\d`	匹配所有数字，相当于 `[0-9]`
`\s`	匹配空白，相当于 `[\t\n\t\f\v]`
`\W,\D,\S`	匹配对应小写字母形式的补
`[...]`	表示可以匹配的集合，支持范围表示如 `a-z`, `0-9` 等
`(...)`	表示作为一个整体进行匹配
¦	表示逻辑或
`^`	表示匹配后面的子表达式的补
`*`	表示匹配前面的子表达式 0 次或更多次
`+`	表示匹配前面的子表达式 1 次或更多次
`?`	表示匹配前面的子表达式 0 次或 1 次
`{m}`	表示匹配前面的子表达式 m 次
`{m,}`	表示匹配前面的子表达式至少 m 次
`{m,n}`	表示匹配前面的子表达式至少 m 次，至多 n 次

例如：

ca*t 匹配： ct, cat, caaaat, ...
ab\d|ac\d 匹配： ab1, ac9, ...
([^a-q]bd) 匹配： rbd, 5bd, ...

例子¶

假设我们要匹配这样的字符串：

string = 'hello world'
pattern = 'hello (\w+)'

match = re.match(pattern, string)
print match

<_sre.SRE_Match object at 0x0000000003A5DA80>

一旦找到了符合条件的部分，我们便可以使用 group 方法查看匹配的部分：

if match is not None:
    print match.group(0)

hello world

if match is not None:
    print match.group(1)

world

我们可以改变 string 的内容：

string = 'hello there'
pattern = 'hello (\w+)'

match = re.match(pattern, string)
if match is not None:
    print match.group(0)
    print match.group(1)

hello there
there

通常，match.group(0) 匹配整个返回的内容，之后的 1,2,3,... 返回规则中每个括号（按照括号的位置排序）匹配的部分。

如果某个 pattern 需要反复使用，那么我们可以将它预先编译：

pattern1 = re.compile('hello (\w+)')

match = pattern1.match(string)
if match is not None:
    print match.group(1)

there

由于元字符的存在，所以对于一些特殊字符，我们需要使用 '\' 进行逃逸字符的处理，使用表达式 '\\' 来匹配 '\' 。

但事实上，Python 本身对逃逸字符也是这样处理的：

pattern = '\\'
print pattern

\

因为逃逸字符的问题，我们需要使用四个 '\\\\' 来匹配一个单独的 '\'：

pattern = '\\\\'
path = "C:\\foo\\bar\\baz.txt"
print re.split(pattern, path)

['C:', 'foo', 'bar', 'baz.txt']

这样看起来十分麻烦，好在 Python 提供了 raw string 来忽略对逃逸字符串的处理，从而可以这样进行匹配：

pattern = r'\\'
path = r"C:\foo\bar\baz.txt"
print re.split(pattern, path)

['C:', 'foo', 'bar', 'baz.txt']

如果规则太多复杂，正则表达式不一定是个好选择。

Numpy 的 fromregex()¶

%%file test.dat 
1312 foo
1534    bar
444  qux

Writing test.dat

fromregex(file, pattern, dtype)

dtype 中的内容与 pattern 的括号一一对应：

pattern = "(\d+)\s+(...)"
dt = [('num', 'int64'), ('key', 'S3')]

from numpy import fromregex
output = fromregex('test.dat', pattern, dt)
print output

[(1312L, 'foo') (1534L, 'bar') (444L, 'qux')]

显示 num 项：

print output['num']

[1312 1534  444]

import os
os.remove('test.dat')