All-Or-Nothing Private Similarity Matching.

The data for socially or economically relevant analyses is often distributed across several entities. For example, health care data and socioeconomic data are often held by different entities, such as hospitals, insurance agencies and tax services. Before it can be used to run a joint analysis, it must be linked across those entities. Even within one entity, there may exist sub-entities, e.g., different customers, that require a separation of data. Personal data is protected by privacy legislation, such as PIPEDA, HIPAA, CCPA, or GDPR. It is therefore necessary to deploy state-of-the-art data protection technologies. Furthermore, it is important to balance the data subject’s perception of privacy and the quality of services offered.

Record linkage is offered by many trusted third parties, e.g. Statistics Canada [1], Ontario Health Study [2], or Population Data BC [3]. However, these entities carry a substantial risk handling this data and are required to deploy strong state-of-the-art data protection technologies.

The ideal protection would be to encrypt the data before sending it to the matching entity. Yet, that’s non-trivial, since standard encryption destroys any similarity between data. Data entries often subtly differ due to data entry errors or missing data fields. Hence, it is important to be able to match not only identical but also similar data entries.`Functional encryption allows a similarity function to be evaluated over encrypted data but is too slow for practical use. Partial encryption of data elements leaks partial matching results that have regularly been shown to break the encryption.

We are developing an open-source software AON-PRISMA based on a novel cryptographic algorithm for similarity matching that is both secure and efficient. AON-PRISMA has three key properties:

- Secure: AON-PRISMA supports all-or-nothing disclosure. This means if two data entries differ more than a tunable similarity threshold, no information is revealed about them, including their distance. This also means that it is not possible to construct a new encrypted data entry from two or more encrypted data entries that can be used to match against other records.
- Efficient: AON-PRISMA only uses secret-sharing and symmetric encryption and no public-key encryption. It is hence much faster than functional or homomorphic encryption scheme. We also have optimized the field operations in our implementation.
- Accurate: AON-PRISMA uses preprocessing of data entries using machine learning. By fine-tuning existing models, e.g., large language models, we can leverage the power of representation learning. We use contrastive learning that keeps distance information to encode data entries before matching them encrypted.

Standard encryption, such AES or RSA, can be ruled out for similarity matching since it destroys any similarity between data entries. This property is necessary since partial encryption leaks sensitive information. Consider the following trivial encryption scheme: For a bitstring, encrypt zeros as “A” and ones as “B”. Two similar strings 010111 and 010110 would become ABABBB and ABABBA. This preserves similarity and one can perform similarity matching on the encrypted data, but it is easy to see that this encryption scheme offers little security.

For AON-PRISMA, we have developed a new encryption algorithm. This new encryption scheme allows comparing encrypted data entries, but also offers all-or-nothing disclosure. This means that if two data entries are not similar, no information about those data entries can be inferred. Encrypted records remain encrypted, even when matched.

Furthermore, AON-PRISMA’s encrypts the entire data entry. Consider two bitstrings 000111 and 111000 matched against 000000. None of the two bitstrings by itself is similar to the matched string, but taking the first three bits of the first string and the last three bits of the second string produces a perfect match. Hence, a malicious entity with those two strings could produce a match despite not having a matching data entry.

AON-PRISMA’s encryption algorithm comes with a security proof that rules out these attacks.

AON-PRISMA’s encryption algorithm encodes each data entry using an error-correcting code, such that the combination of two encrypted data entries can be decoded if they are similar. The decoding uses efficient, polynomial-time algorithms, such as Berlekamp-Welch and list decoding. Symmetric encryption ensures that only similar records can be decoded.

Both – error-correcting codes and symmetric encryption – are fast operations, much faster than functional or homomorphic encryption. We compare the running time of AON-PRISMA’s encryption algorithm to efficient, “practical” function-hiding inner-product encryption. AON-PRISMA is orders of magnitude faster.

We have optimized AON-PRISMA’s encryption algorithm to operate over smaller field sizes that fit into CPU registers and take advantage of improved lookup tables, without sacrificing security. Such algorithmic optimizations provide unmatched speed up compared to naïve implementations.

For comparing data sets, the time to match two records is critical, since the total running time scales linearly in the time for a single matching, i.e., matching two records 2 times faster reduces the total running time by a factor of 2. AON-PRISMA also supports private blocking, e.g., locality-sensitive hash functions, which reduces the number of necessary comparisons from all pairs to a much smaller subset.

AON-PRISMA uses an extended version of the Hamming distance to match records. Each data entry encoding is not composed of bits, but integers which are then compared element-wise and their number of matches comprises the similarity. For similarity matching the threshold to determine a match can be set at encryption time and is tunable for different data domains.

A data entry input is encoded using a Transformer neural network into a fixed-length vector. The goal is to map the input to a feature space, such that the mapping preserve similarity between data entries. AON-PRISMA uses contrastive learning to achieve this encoding. This type of training makes AON-PRISMA very flexible. Depending on the training data set, AON-PRISMA’s encoding can capture syntactic, e.g., spelling errors, missing field, etc., and semantic, e.g., different vocabulary, domains, etc., differences. AON-PRISMA can also leverage existing model snapshots using fine-tuning. This allows to leverage large language models for natural language matching using state-of-the-art performance.

AON-PRISMA's architecture is distributed between clients which encode their data using machine learning and then encrypt it.
The AON-PRISMA server software which can be deployed on-premise or in the cloud then matches *encrypted* entries and reveals their similarity or dissimilarity but nothing else.
The results are reported, potentially in enriched or aggregated from, back to the clients.
You can see the entire process in this figure.

We evaluate AON-PRISMA on several data sets, including the Google-Amazon (semantic differences) and the FEBRL4 (syntactic differences) data sets. Our accuracy and F1 score match the state-of-the-art benchmarks [4]. We perform matching of the entire data sets within 30 – 120 minutes despite encryption providing all-or-nothing disclosure.

University of Waterloo

National Research Council

University of Waterloo

University of Waterloo

University of Waterloo

University of Waterloo