跳至主要內容

字符串哈希

大约 2 分钟

字符串哈希

题目难度
$1062. 最长重复子串open in new window 中等数据范围 1500
1044. 最长重复子串open in new window困难数据范围 3*10^4
718. 最长重复子数组open in new window中等

Rabin-Karp 字符串编码

时间复杂度:O(n)

Hash 函数值一样时原字符串却不一样的现象我们称为哈希碰撞。

public class RabinKarp {
    public static void main(String[] args) {
        // 612244298
        System.out.println(getHash("dacbececeeaebaecbdeecdbaeeeeeadececcdcabbddadecdddabdeeedccabbaebddedecccacdbdcbbcddc"));
        // 612244298
        System.out.println(getHash("dcedaddeaeddecdecbeebeedaadbcbbebcbeecedaabcacaabdbaecaeebeaedadbabecbaabbebadaccadee"));
    }

    private static long getHash(String s) {
        int BASE = 26;
        long MOD = 1000000007;

        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash * BASE + (s.charAt(i) - 'a')) % MOD;
        }
        return hash;
    }
}

若进行 n 次比较,每次错误率 1/M,那么总错误率是 1-(1-1/M)^n。在随机数据下,若 M=10^9+7n=10^6,错误率约为 1/1000,并不是能够完全忽略不计的。

所以,进行字符串哈希时,经常会对两个大质数分别取模,这样的话哈希函数的值域就能扩大到两者之积,错误率就非常小了。

1044. 最长重复子串

import java.util.HashSet;
import java.util.Set;

public class Solution1044 {
    private static final int BASE = 26;
    private static final long MOD1 = 1000000007;
    private static final long MOD2 = 1000000009;
    private int[] nums;

    public String longestDupSubstring(String s) {
        int len = s.length();
        nums = new int[len];
        for (int i = 0; i < len; i++) {
            nums[i] = s.charAt(i) - 'a';
        }

        int left = 1;
        int right = len;
        while (left < right) {
            int mid = left + (right - left) / 2;
            // 边界二分 F, F,..., F, [T, T,..., T]
            // ----------------------^
            if (checkMid(s, mid) == -1) {
                right = mid;
            } else {
                left = mid + 1;
            }
        }
        // 长度为 left-1, 起始下标为 idx
        int idx = checkMid(s, left - 1);
        return s.substring(idx, idx + left - 1);
    }

    // 长度为 mid TTTFFF
    private int checkMid(String s, int mid) {
        long hash1 = 0;
        long hash2 = 0;
        for (int i = 0; i < mid; i++) {
            hash1 = (hash1 * BASE + nums[i]) % MOD1;
            hash2 = (hash2 * BASE + nums[i]) % MOD2;
        }

        Set<Long> hashSet = new HashSet<>();
        long hash = hash1 * MOD2 + hash2;
        hashSet.add(hash);

        // 26^mid % mod
        long aL1 = quickPow(BASE, mid, MOD1);
        long aL2 = quickPow(BASE, mid, MOD2);

        // 滑动窗口
        for (int i = 1; i < s.length() - mid + 1; i++) {
            hash1 = (hash1 * BASE - nums[i - 1] * aL1 % MOD1 + MOD1) % MOD1;
            hash1 = (hash1 + nums[i + mid - 1]) % MOD1;

            hash2 = (hash2 * BASE - nums[i - 1] * aL2 % MOD2 + MOD2) % MOD2;
            hash2 = (hash2 + nums[i + mid - 1]) % MOD2;

            hash = hash1 * MOD2 + hash2;
            if (hashSet.contains(hash)) {
                return i;
            }
            hashSet.add(hash);
        }
        return -1;
    }

    // res = a^b % mod
    private long quickPow(long a, long b, long mod) {
        long res = 1;
        while (b > 0) {
            if (b % 2 == 1) {
                res = res * a % mod;
            }
            a = a * a % mod;
            b /= 2;
        }
        return res;
    }
}

(全文完)