difflib — Helpers for computing deltas

Source code: Lib/difflib.py


This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
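As a rough illustration of the "unified diff" output mentioned above, the following sketch compares two small lists of lines with difflib.unified_diff. The line contents and the file names old.txt and new.txt are made up for the example.

```python
import difflib

# Two versions of a tiny "file", each given as a list of lines.
old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n"]

# unified_diff() returns a generator of diff lines; the file names
# are labels only and do not need to exist on disk.
diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))

print("".join(diff), end="")
```

Lines removed from the first sequence are prefixed with "-", and lines added in the second are prefixed with "+".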

class difflib.SequenceMatcher

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
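To make the role of junk concrete, here is a small sketch that finds the longest matching block twice: once with no junk function, and once treating the space character as junk. The input strings are arbitrary, and the variable names are chosen only for this example.

```python
from difflib import SequenceMatcher

a = " abcd"
b = "abcd abcd"

# No junk function: a leading space may be part of the match.
plain = SequenceMatcher(None, a, b)
no_junk = plain.find_longest_match(0, len(a), 0, len(b))

# Treat blanks as junk: the match may no longer start on a space.
skip_blanks = SequenceMatcher(lambda ch: ch == " ", a, b)
with_junk = skip_blanks.find_longest_match(0, len(a), 0, len(b))

print(no_junk)    # Match(a=0, b=4, size=5)
print(with_junk)  # Match(a=1, b=0, size=4)
```

Without junk handling, the whole of " abcd" lines up with the tail of the second string. With blanks as junk, the match is anchored on "abcd" alone, which is the kind of result that tends to "look right".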

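Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.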

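Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.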

Changed in version 3.2: Added the autojunk parameter.
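The effect of the heuristic can be seen in a small experiment. The sketch below builds two 301-character strings that differ only in their first character and consist otherwise of a heavily repeated "ab" pattern; the inputs are contrived for illustration. It relies on the bpopular attribute, which holds the items of the second sequence that the heuristic marked as popular.

```python
from difflib import SequenceMatcher

# Long enough (>= 200 items) for the heuristic to apply, with "a" and "b"
# each occurring far more often than 1% of the sequence length.
a = "c" + "ab" * 150
b = "d" + "ab" * 150

with_auto = SequenceMatcher(None, a, b)                     # autojunk defaults to True
without_auto = SequenceMatcher(None, a, b, autojunk=False)  # heuristic disabled

print(sorted(with_auto.bpopular))     # items marked popular in b
print(sorted(without_auto.bpopular))  # empty when the heuristic is off

# Popular items cannot start a match, so similarity is reported much lower.
print(with_auto.ratio(), without_auto.ratio())
```

For highly repetitive input like this, passing autojunk=False may give a similarity score closer to what a reader would expect, at the cost of more work during matching.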