# Graph Encryption: Going Beyond Encrypted Keyword Search

This is a guest post by Xianrui Meng from Boston University about a paper he presented at CCS 2015, written in collaboration with Kobbi Nissim, George Kollios and myself. Note that Xianrui is on the job market.

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

## Graph Databases and Graph Privacy

As today's data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) stores, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph databases have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny's previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today's world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

## Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, {\bf which can lead to many interesting research directions}.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

## Queries on Encrypted Graph Databases

### Neighbor Queries and Adjacency Queries

As I mentioned earlier, [CK10] studied some simple graph queries, such as adjacency queries and neighbor queries. An adjacency query is takes two nodes as input and returns whether they have an edge in common. A neighbor query takes a node as input and returns all the nodes that share an edge with it.

The construction for neighbor queries is mainly based on the searchable symmetric encryption (SSE), where the input graph is viewed as particular kind of document collection. Another novel technique that is proposed in the paper is to use an efficient symmetric non-committing encryption scheme to achieve adaptive security efficiently. The paper also proposes a nice solution for focused subgraph queries, which are an essential part of the seminal HITS ranking algorithm of Kleinberg but are also useful in their own right.

### Approximate Shortest Distance Queries

Shortest distance queries are arguably one of the most fundamental and well-studied graph queries due to their numerous applications. A shortest distance query takes as input two nods and returns the smallest number of edges in the shortest path between them. In social networks these queries allow you to find the smallest number of friends (or collaborators, peers, etc) between two people. So a graph encryption scheme that supports shortest distance queries would potentially have many applications in graph database security, and could be a major building block for other graph encryption schemes. In the following, I briefly give an overview on our solution for approximate shortest distance queries.

As I mentioned, to design a secure yet scalable graph encryption scheme, we have to take into account many things, including the storage space on the server side, the bandwidth for the query, the computational overhead for both client and server, etc. Suppose we are given a graph $$G= (V, E)$$ and let $$n= |V|$$, $$m = |E|$$. If we were to use traditional shortest distance algorithm such as Dijsktra's algorithm, the query time would be $$O(n\log n+m)$$, which can be very slow for large graphs. The benefit of course would be that we would not need extra storage. Another approach is to build an encrypted adjacency matrix (see [CK10]) that somehow supports shortest distance queries. The problem there is that we would need to pay at least $$O(n^2)$$ storage, which is obviously expensive when the $$n$$ is, say, $$1$$ million.

Fortunately, thanks to brilliant algorithmic computer scientists, there exists a really nice and neat data structure called a distance oracle (DO) [TZ05, SGNP10, CDFGGW13]. Using such a structure, one can use much less storage overhead (typically $$O(n \log n)$$) and fast query performance (typically $$O(\log n)$$). However, most distance oracles return the approximate distance rather than the exact one. But one one can tweak the parameters in order to get the best trade-off between performance and approximation. When I first looked at these data structures, I felt that this was a really amazing tool; not only because of its functionality but also due to its simplicity.

There are many ways of generating distance oracles. Some of them offer better approximation while others can have better performance. Here I just describe one kind which are sketch-based distance oracles. In a such an oracle, every node $$v$$ has a sketch, $$Sk_v$$ (normally generated by some randomized algorithm). $$Sk_v$$ is a set containing many node pairs $$\langle w_i,d(v, w_i)\rangle$$, where $$w_i$$ is some node id and $$d(v, w_i)$$ is the distance between $$v$$ and $$w_i$$. For example, the following sketch $$Sk_v$$ consists of three pairs

$Sk_v = \{\langle w_1, d(v, w_1)\rangle, \langle w_2, d(v, w_2)\rangle, \langle w_3, d(v, w_3)\rangle \}.$

Querying the shortest distance between $$u$$ and $$v$$ is quite simple. We only need to retrieve $$Sk_u$$ and $$Sk_v$$, and find the common nodes in both sketches and add up their corresponding distances. We then return the minimum sum as the shortest distance. Formally, let $$I$$ be the common nodes that appear in both $$Sk_u$$ and $$Sk_v$$. Then, the approximate shortest distance between $$u$$ and $$v$$, $$d(u,v)$$, is

$d(u, v) = argmin_{s \in I}\{ d(u, s) + d(v, s)\}$

The design of this distance oracle guarantees that the returned distance is no greater than $$\alpha\times \mathsf{dist}(u,v)$$, where $$\mathsf{dist}(u, v)$$ is the true shortest distance between $$u$$ and $$v$$ and $$\alpha$$ is the approximation ratio. Note that the approximation ration $$\alpha$$ is a function of some parameters of the sketch so to one controls the approximation by tweaking the sketch which in turns effects both setup and query efficiency. In our solution, we leverage sketched-based distance oracles but we have to be very careful not to affect their approximation ration.

A distance oracle encryption scheme $$\mathsf{Graph} = (\mathsf{Setup}, \mathsf{DistQuery})$$ consists of a polynomial-time algorithm and a polynomial-time two-party protocol that work as follows:

• $$(K, \mathsf{EO}) \leftarrow \mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)$$: is a probabilistic algorithm that takes as input a security parameter $$k$$, an oracle
$$\Omega$$, an approximation factor $$\alpha$$, and an error parameter $$\varepsilon$$. It outputs a secret key $$K$$ and an encrypted oracle $$\mathsf{EO}$$.

• item $$(d, \bot) \leftarrow \mathsf{DistQuery}_{C,S}\big((K, q), \mathsf{EO}\big)$$: is a two-party protocol between a client $$C$$ that holds a key $$K$$ and a shortest distance query $$q = (u, v) \in V^2$$ and a server $$S$$ that holds an encrypted oracle $$\mathsf{EO}$$. After executing the protocol, the $$C$$ receives a distance $$d \geq 0$$ and server $$S$$ receives $$\bot$$.

For $$\alpha \geq 1$$ and $$\varepsilon \lt 1$$, we say that $$\mathsf{Graph}$$ is $$(\alpha, \varepsilon)$$-correct if for all $$k \in \mathbb{N}$$, for all $$\Omega$$ and for all $$q = (u, v) \in V^2$$,

$\mbox{Pr}\big[d \leq \alpha\cdot {\sf dist}(u, v)\big] \geq 1 - \varepsilon,$

where the probability is over the randomness in computing $$(K, \mathsf{EO}) \leftarrow \mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)$$ and then $$(d, \bot) \leftarrow \mathsf{DistQuery}\big((K, q), \mathsf{EO}\big)$$. I skip the adaptive security definition as it is similar to adaptive security for SSE and is captured by the general notion of security for structured encryption given in [CK10]. Next, I will go over two solutions for the oracle encryption.

A computationally-efficient solution. This approach is rather straightforward, so here I briefly sketch its description. The $$\mathsf{Setup}$$ algorithm works as follows:

1. For each node $$v \in V$$, generate a token by applying a PRF to $$v$$: $$\mathsf{tk}_v = F_K(v)$$.
2. Pad the sketches to the same length and encrypt each sketch $$Sk_v$$ as $${\sf Enc}_K(Sk_v)$$ using a symmetric encryption scheme.
3. For each node $$v \in V$$, store the pair $$(\mathsf{tk}_v, {\sf Enc}_K(Sk_v))$$ in a dictionary data structure $$\mathsf{DX}$$ (you should do the insertions at random).

The $$\mathsf{DistQuery}$$ algorithm is quite simple: given nodes $$u$$ and $$v$$, the client just computes $$F_K(u)$$ and $$F_K(v)$$ and sends them to the server as the token. After receiving the token, the server just retrieves $$\mathsf{DX}[F_K(u)]$$ and $$\mathsf{DX}[F_K(v)]$$ and sends back the encrypted sketches $${\sf Enc}_K(Sk_u)$$ and $${\sf Enc}_K(Sk_v)$$. Finally, the client decrypts the sketches, and computes the approximate shortest distance as is normally done in sketch-based distance oracles. This approach is efficient and simple since we use symmetric encryption. We show in the paper that this scheme is adaptively secure and that the leakage for this scheme are the size of the graph, maximum size of the distance oracle, and the query pattern (see paper for a precise definition).

Communication-efficient solution. The problem with the scheme described above is that the communication complexity is linear in the maximum sketch size. As I mentioned above, this can be a bottleneck in practice when the graphs are large. Now, at very high level, I briefly discuss how we can achieve a solution with optimal $$O(1)$$ communication complexity. The scheme makes use of a PRF, a degree-$$2$$ somewhat homomorphic encryption scheme $$\mathsf{SHE} = ({\sf Gen}, {\sf Enc}, {\sf Dec})$$, and a hash function $$h: V\to [t]$$.

• $$\mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)$$: Given $$1^k$$, $$\Omega$$, $$\alpha$$, and $$\varepsilon$$ as inputs, it generates a public/secret-key pair $$({\sf pk}, {\sf sk})$$ for $$\mathsf{SHE}$$. Let $$D$$ be the maximum distance over all the sketches and $$S$$ be the maximum sketch size. $$\mathsf{Setup}$$ sets $$N \leftarrow 2\cdot D +1$$ and samples a hash function $$h \leftarrow \mathcal{H}$$ with domain $$V$$ and co-domain $$[t]$$, where $$t = 2\cdot S^2\cdot\varepsilon^{-1}$$. It then creates a hash table for each node $$v \in V$$. More precisely, for each node $$v$$, it processes each pair $$(w_i, d_i) \in sk_v$$ and stores $${\sf Enc}_{pk}(2^{N - \delta_i})$$ at location $$h(w_i)$$ of a $$t$$-size array $$\mathsf{T}_v$$. In other words, for all $$v \in V$$, it creates an array $$\mathsf{T}_v$$ such that for all $$(w_i, \delta_i) \in Sk_v$$, $$\mathsf{T}_v[h(w_i)] \leftarrow {\sf Enc}_{pk}(2^{N - \delta_i})$$. It then fills the empty cells of $$\mathsf{T}_v$$ with homomorphic encryptions of $$0$$ and stores each hash table $$\mathsf{T}_{v_1}$$ through $$\mathsf{T}_{v_n}$$ in a dictionary $$\mathsf{DX}$$ by setting, for all $$v \in V$$, $$\mathsf{DX}[F_K(v)] \leftarrow \mathsf{T}_v$$. Finally, it outputs $$\mathsf{DX}$$ as the encrypted oracle $$\mathsf{EO}$$.

• The $$\mathsf{DistQuery}$$ protocol works as follows. Given a query $$q = (u, v)$$, the client sends tokens $$(\mathsf{tk}_1, \mathsf{tk}_2) = (F_K(u), F_K(v))$$ to the server which uses them to retrieve the hash tables of nodes $$u$$ and $$v$$ by computing $$\mathsf{T}_u := \mathsf{DX}[\mathsf{tk}_1]$$ and $$\mathsf{T}_v := \mathsf{DX}[\mathsf{tk}_2]$$. The server then homomorphically evaluates an inner product over the hash tables. More precisely, it computes $$c := \sum_{i=1}^t \mathsf{T}_u[i]\cdot\mathsf{T}_v[i]$$, where $$\sum$$ and $$\cdot$$ refer to the homomorphic addition and multiplication operations of of the SHE scheme. Finally, the server returns only $$c$$ to the client who decrypts it and outputs $$2N - \log_2 \left({\sf Dec}_{\sf sk}(c)\right)$$.

See the paper for more details and an analysis of the construction. What is important to note is that we can show that the scheme does not affect the quality of underlying oracle's approximation too much and, in fact, in certain cases it improves it!

It is also worth of mentioning that, in the paper, we also propose a third scheme that has $$O(1)$$ communication complexity but with some additional leakage which we call the sketch pattern leakage. This third scheme is far more efficient than the one above. One interesting subtlety is that, unlike more standard encrypted schemes schemes, where the leakage is over a structure that holds all the original data (e.g., an inverted index with full indexing), the leakage in this case is only over a data structure that holds a random subset of the data.

Finally, We implemented all our constructions and verified their efficiency experimentally.

## Conclusions and Future Work

I went over our graph encryption schemes with support for approximate shortest distance queries. The solutions I described are all adaptively-secure. Of course, there are other possible approaches based on ORAM or FHE which can provide stronger security (even hide access pattern!) but at a higher cost. As graph databases become more and more popular, I believe graph encryption will play an increasingly important role in database security. We live in a data-centric world that generates network and graph data of all kinds. There are still more challenging and exciting open problems in graph database security: e.g., how to construct graph encryption schemes for more complex graph queries? Can we support graph mining tasks, e.g., can we construct graph encryption schemes that allow us to detect communities over encrypted social networks? And of course, as is common in encrypted search, how can we quantify the security of our graph encryption schemes? Any briliant ideas? Talk to us! :-)