Advanced Regular Expression Techniques (Python Edition)

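Regular expressions are a Swiss Army knife for searching information for specific patterns. They form a huge toolbox, and some of its features are often overlooked or underused. Today I will show you some advanced uses of regular expressions.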

For example, here is a regular expression we might use to check US phone numbers:

r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and whitespace to make it more readable.

r'^'
r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
r'(\()?'       # optional opening parenthesis
r'\d{3}'       # the area code
r'(?(2)\))'    # if there was an opening parenthesis, close it
r'[-\s.]?'     # optional '-', '.' or space
r'\d{3}'       # first 3 digits
r'[-\s.]?'     # optional '-', '.' or space
r'\d{4}$'      # last 4 digits

Let's put it into a code snippet:

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    match = re.match(r'^'
                     r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                     r'(\()?'       # optional opening parenthesis
                     r'\d{3}'       # the area code
                     r'(?(2)\))'    # if there was an opening parenthesis, close it
                     r'[-\s.]?'     # optional '-', '.' or space
                     r'\d{3}'       # first 3 digits
                     r'[-\s.]?'     # optional '-', '.' or space
                     r'\d{4}$',     # last 4 digits
                     number)
    if match:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Output:

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
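
As an alternative to concatenating string literals, the re.VERBOSE flag lets the comments and whitespace live inside the pattern itself. The following is a minimal sketch of the same phone-number pattern written that way; the name PHONE is only illustrative and does not come from the original article.

```python
import re

# In verbose mode, unescaped whitespace is ignored and '#' starts a comment
PHONE = re.compile(r"""
    ^
    (1[-\s.])?      # optional '1-', '1.' or '1 '
    (\()?           # optional opening parenthesis
    \d{3}           # the area code
    (?(2)\))        # if there was an opening parenthesis, close it
    [-\s.]?         # optional '-', '.' or space
    \d{3}           # first 3 digits
    [-\s.]?         # optional '-', '.' or space
    \d{4}$          # last 4 digits
""", re.VERBOSE)

for number in ["123 555 6789", "(123-555-6789"]:
    print(number, "is valid" if PHONE.match(number) else "is not valid")
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written as \s, as an escaped space, or inside a character class.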

Fortunately, Python can print a regular expression's parse tree when you pass the re.DEBUG flag (which is simply the integer 128) to re.compile or re.match.

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# Compile once; re.DEBUG prints the parse tree when the pattern is compiled
phone = re.compile(r'^'
                   r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                   r'(\()?'       # optional opening parenthesis
                   r'\d{3}'       # the area code
                   r'(?(2)\))'    # if there was an opening parenthesis, close it
                   r'[-\s.]?'     # optional '-', '.' or space
                   r'\d{3}'       # first 3 digits
                   r'[-\s.]?'     # optional '-', '.' or space
                   r'\d{4}$',     # last 4 digits
                   re.DEBUG)

for number in numbers:
    if phone.match(number):
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

The parse tree (shown as printed by Python 2; Python 3 formats it somewhat differently):

at at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space
123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid

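Before I explain this concept (greedy versus non-greedy matching), let me show an example. We want to find the anchor tags in a piece of HTML text: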

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

The result is as expected:

['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

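This time findall returns a single match that runs from the first <a to the last </a>, because .* is greedy and consumes as much as it can. When you add a question mark after it (.*?), it becomes "non-greedy" and matches as little as possible.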

import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')
m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)
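
To make the difference concrete, here is a small sketch that runs both patterns against the same two-anchor input and compares how many matches each one returns. The variable names greedy and lazy are only illustrative.

```python
import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

# Greedy: .* runs to the last possible '>' and '</a>'
greedy = re.findall(r'<a.*>.*</a>', html)
# Non-greedy: .*? stops at the first possible '>' and '</a>'
lazy = re.findall(r'<a.*?>.*?</a>', html)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy[1])      # <a href="http://example.com" title="example">Example</a>
```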

The following pattern first matches foo and then checks whether bar follows (a positive lookahead):

import re

strings = ["hello foo",     # prints False
           "hello foobar"]  # prints True

for string in strings:
    match = re.search(r'foo(?=bar)', string)
    if match:
        print('True')
    else:
        print('False')

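This may not seem very useful, since we could simply test for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.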

import re

strings = ["hello foo",     # prints True
           "hello foobar",  # prints False
           "hello foobaz"]  # prints True

for string in strings:
    match = re.search(r'foo(?!bar)', string)
    if match:
        print('True')
    else:
        print('False')
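
One practical difference from simply matching foobar is that a lookahead is zero-width: it checks what follows without making it part of the match. The sketch below illustrates this with re.sub; the sample text and the replacement 'X' are made up for the illustration.

```python
import re

text = "foobar foobaz"

# The lookahead only checks what follows; "bar" is not part of the match
match = re.search(r'foo(?=bar)', text)
print(match.group())   # foo

# Replace "foo" only where "bar" follows, leaving "bar" itself untouched
followed = re.sub(r'foo(?=bar)', 'X', text)
print(followed)        # Xbar foobaz

# Replace "foo" only where "bar" does not follow
not_followed = re.sub(r'foo(?!bar)', 'X', text)
print(not_followed)    # foobar Xbaz
```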

The following pattern uses a negative lookbehind to match a bar that does not come right after foo.

import re

strings = ["hello bar",     # prints True
           "hello foobar",  # prints False
           "hello bazbar"]  # prints True

for string in strings:
    match = re.search(r'(?<!foo)bar', string)
    if match:
        print('True')
    else:
        print('False')
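
Lookbehind also has a positive form, (?<=...), which is closely related to the negative form shown above. As a hedged illustration, the sketch below pulls out only the numbers that directly follow a dollar sign, and then only the numbers that do not; the sample sentence is invented for the example.

```python
import re

sentence = "The phone costs $500 plus $10 shipping for 3 items"

# Positive lookbehind: digits that come right after a '$'
prices = re.findall(r'(?<=\$)\d+', sentence)
print(prices)  # ['500', '10']

# Negative lookbehind: whole numbers that do not come right after a '$'
counts = re.findall(r'(?<!\$)\b\d+', sentence)
print(counts)  # ['3']
```

Note that Python's re module requires the pattern inside a lookbehind to have a fixed width.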

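For example, we can use this regular expression, which contains a conditional pattern, to check for matching opening and closing angle brackets: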

import re

strings = ["<pypix>",  # prints True
           "<foo",     # prints False
           "bar>",     # prints False
           "hello"]    # prints True

for string in strings:
    match = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if match:
        print('True')
    else:
        print('False')

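In the example above, 1 refers to the group (<), which may match nothing because it is followed by a question mark. The closing angle bracket is required if and only if that group matched.

In Python's re module the condition must be a group number or name. Some other regex flavors, such as PCRE, also accept a lookaround assertion as the condition.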
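
The same idea can be combined with a backreference. The sketch below, which is an illustration rather than part of the original article, accepts a lowercase word that is either unquoted or wrapped in a matching pair of quotes; the name quoted and the sample strings are assumptions made for the example.

```python
import re

# Group 1 optionally captures an opening quote; if it matched,
# the conditional requires the same quote character (\1) at the end
quoted = re.compile(r'''^(["'])?[a-z]+(?(1)\1)$''')

samples = ['hello', '"hello"', "'hello'", '"hello', '"hello\'']
results = [bool(quoted.search(sample)) for sample in samples]
for sample, ok in zip(samples, results):
    print(sample, ok)
```

The first three samples are accepted; the last two are rejected because the opening quote is either never closed or closed with a different character.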

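Non-capturing groups

Groups, enclosed in parentheses, are captured into an array and can be referenced later when needed. But we can also choose not to capture them.

Let's start with a very simple example: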

import re

string = 'Hello foobar'
match = re.search(r'(f.*)(b.*)', string)
print("f* => {0}".format(match.group(1)))  # prints f* => foo
print("b* => {0}".format(match.group(2)))  # prints b* => bar

Now let's change it slightly by adding another group, (H.*), in front:

import re

string = 'Hello foobar'
match = re.search(r'(H.*)(f.*)(b.*)', string)
print("f* => {0}".format(match.group(1)))  # prints f* => Hello
print("b* => {0}".format(match.group(2)))  # prints b* => foo

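The group array has changed. Depending on how we use these values in our code, this may break our script. We now have to find every place in the code that uses the groups and adjust the indices accordingly. If we are really not interested in the contents of the newly added group, we can make it non-capturing, like this: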

import re

string = 'Hello foobar'
match = re.search(r'(?:H.*)(f.*)(b.*)', string)
print("f* => {0}".format(match.group(1)))  # prints f* => foo
print("b* => {0}".format(match.group(2)))  # prints b* => bar

By adding ?: at the start of the group, we no longer capture it, so the other values in the array do not need to move.
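
A quick way to see the effect is to compare what groups() returns for the two patterns. This is a small sketch using the same input string; the variable names are only illustrative.

```python
import re

string = 'Hello foobar'
capturing = re.search(r'(H.*)(f.*)(b.*)', string)
non_capturing = re.search(r'(?:H.*)(f.*)(b.*)', string)

print(capturing.groups())      # ('Hello ', 'foo', 'bar')
print(non_capturing.groups())  # ('foo', 'bar')

# The non-capturing group still takes part in the overall match
print(non_capturing.group(0))  # Hello foobar
```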

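Named groups

Like the previous technique, this is another way to keep us out of a trap. We can give groups names and then refer to them by name instead of by array index. The syntax is (?P<name>pattern). We can rewrite the previous example like this: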

import re

string = 'Hello foobar'
match = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(match.group('fstar')))  # prints f* => foo
print("b* => {0}".format(match.group('bstar')))  # prints b* => bar

Now we can add another group without affecting how the existing groups are referenced:

import re

string = 'Hello foobar'
match = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(match.group('fstar')))  # prints f* => foo
print("b* => {0}".format(match.group('bstar')))  # prints b* => bar
print("h* => {0}".format(match.group('hi')))     # prints h* => Hello
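
Named groups can also be read all at once with groupdict(), and referenced in a replacement string with \g<name>. The following is a minimal sketch of both; the group names first and second in the second pattern are made up for the illustration.

```python
import re

string = 'Hello foobar'
match = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

# All named groups as a dictionary
groups = match.groupdict()
print(groups)  # {'hi': 'Hello ', 'fstar': 'foo', 'bstar': 'bar'}

# Named groups in a replacement: swap the two words
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', string)
print(swapped)  # foobar Hello
```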

Using callback functions

In Python, re.sub() accepts a callback function in place of a fixed replacement string.

Let's look at an example, an e-mail template:

import re
template = "Hello [first_name] [last_name], Thank you for purchasing [product_name] from [store_name]. The total cost of your purchase was [product_price] plus [ship_price] for shipping. You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. Sincerely, [store_manager_name]"
# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
"first_name" : "John",
"last_name" : "Doe",
"product_name" : "iphone",
"store_name" : "Walkers",
"product_price": "$500",
"ship_price": "$10",
"ship_days_min": "1",
"ship_days_max": "5",
"store_manager_name": "DoeJohn"
}
token = re.compile(r'\[(.*?)\]')  # non-greedy, so one token is matched at a time
print(token.sub('John', template, count=1))  # replaces only [first_name]

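Replacing the tokens one by one with fixed strings would take a separate call for every token, so a callback function is a better approach: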

import re
template = "Hello [first_name] [last_name], Thank you for purchasing [product_name] from [store_name]. The total cost of your purchase was [product_price] plus [ship_price] for shipping. You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. Sincerely, [store_manager_name]"
# assume dic has all the replacement data
# such as dic['first_name'] dic['product_price'] etc...
dic = {
"first_name" : "John",
"last_name" : "Doe",
"product_name" : "iphone",
"store_name" : "Walkers",
"product_price": "$500",
"ship_price": "$10",
"ship_days_min": "1",
"ship_days_max": "5",
"store_manager_name": "DoeJohn"
}
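def multiple_replace(dic, text):
    # Build one alternation that matches any "[key]" token literally
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # The callback strips the brackets and looks the key up in dic
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))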
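
A variation on the same idea is to use one generic token pattern and let the callback decide what to do with each match. The sketch below is an assumption-laden illustration, not the author's code: fill_template and lookup are hypothetical names, and unknown tokens are deliberately left unchanged instead of raising a KeyError.

```python
import re

def fill_template(template, values):
    """Replace every [token] in template with values[token] when it is known."""
    def lookup(match):
        key = match.group(1)
        # Fall back to the original text, e.g. "[item]", for unknown tokens
        return values.get(key, match.group(0))
    return re.sub(r'\[(\w+)\]', lookup, template)

result = fill_template("Hello [first_name] [last_name], your [item] shipped.",
                       {"first_name": "John", "last_name": "Doe"})
print(result)  # Hello John Doe, your [item] shipped.
```

Matching \w+ between the brackets means the pattern does not have to be rebuilt when the dictionary changes.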

Don't reinvent the wheel

Perhaps even more important is knowing when not to use regular expressions. In many cases you can find an alternative tool.

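Parsing [X]HTML

An answer on Stack Overflow gives a brilliant explanation of why we should not parse [X]HTML with regular expressions.

You should use an HTML parser instead, and Python offers many options: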

ElementTree is part of the standard library

BeautifulSoup is a popular third-party library

lxml is a full-featured, fast, C-based library

The latter two handle even malformed HTML gracefully, which is a blessing given how many ugly sites are out there.

An ElementTree example:

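from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed XHTML
tree = ElementTree.parse('filename.html')
for element in tree.iter('h1'):  # iter() searches the whole tree
    print(ElementTree.tostring(element, encoding='unicode'))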
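
To tie this back to the anchor-tag example from earlier, here is a small self-contained sketch that extracts links with ElementTree instead of a regular expression. It assumes the input is a well-formed snippet with a single root element; the wrapping <p> element and the variable names are made up for the illustration.

```python
from xml.etree import ElementTree

snippet = ('<p>Hello <a href="http://pypix.com" title="pypix">Pypix</a> and '
           '<a href="http://example.com" title="example">Example</a></p>')

root = ElementTree.fromstring(snippet)

# Collect the href attribute and link text of every <a> element
links = [(a.get('href'), a.text) for a in root.iter('a')]
print(links)  # [('http://pypix.com', 'Pypix'), ('http://example.com', 'Example')]
```

For real-world pages that are not well-formed, one of the more forgiving parsers listed above is likely a better fit.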

Others

There are many other tools worth considering before you reach for a regular expression.


Last updated: 2022-10-19