Graph Encryption: Going Beyond Encrypted Keyword Search
http://senykam.github.io/2016/06/16/graph-encryption-going-beyond-encrypted-keyword-search
Thu, 16 Jun 2016 12:32:12 -0300
<p><em>This is a guest post by <a href="http://www.xianruimeng.org/">Xianrui Meng</a> from
Boston University about a paper he presented at CCS 2015, written in
collaboration with <a href="https://www.cs.bgu.ac.il/~kobbi/">Kobbi Nissim</a>, <a href="http://www.cs.bu.edu/~gkollios/">George
Kollios</a> and myself. Note that Xianrui is on
the job market.</em></p>
<p><img src="http://senykam.github.io/img/graph21.jpg" class="alignright" width="250">
Encrypted search has attracted a lot of attention from practitioners and
researchers in academia and industry. In previous posts, Seny already described
different ways one can search on encrypted data. Here, I would like to discuss
search on encrypted <em>graph</em> databases which are gaining a lot of
popularity.</p>
<h2 id="graph-databases-and-graph-privacy">Graph Databases and Graph Privacy</h2>
<p>As today's data gets bigger and bigger, traditional
relational database management systems (RDBMSs) cannot scale to the massive
amounts of data generated by end users and organizations. In addition, RDBMSs
cannot effectively capture certain data relationships, such as those in the
object-oriented data structures used by many applications. Today,
<a href="http://nosql-database.org/">NoSQL</a> (Not Only SQL) has emerged as a good
alternative to RDBMSs. One of the many advantages of NoSQL systems is that
they are capable of storing, processing, and managing large volumes of
structured, semi-structured, and even unstructured data. NoSQL databases (e.g.,
document stores, wide-column stores, key-value (tuple) stores, object
databases, and graph databases) can provide the scale and availability needed
in cloud environments.</p>
<p>In an Internet-connected world, graph databases have become an increasingly
significant data model among NoSQL technologies. Social networks (e.g.,
Facebook, Twitter, Snapchat), protein networks, electrical grids, the Web, XML
documents, and networked systems can all be modeled as graphs. One nice thing
about graph databases is that they store the relations between entities
(objects) in addition to the entities themselves and their properties. This
allows the search engine to navigate both the data and their relationships
extremely efficiently. Graph databases rely on the node-link-node relationship,
where a node can be a profile or an object and the edge can be any relation
defined by the application. Usually, we are interested in the structural
characteristics of such graph databases.</p>
<p>What do we mean by the confidentiality of a graph? And how do we protect it?
The problem has been studied by both the security and database communities. For
example, in the database and data mining community, many solutions have been
proposed based on <em>graph anonymization</em>. The core idea here is to
anonymize the nodes and edges in the graph so that re-identification is hard.
Although this approach may be efficient, from a security point of view it is hard
to tell exactly what is achieved. Moreover, researchers have shown how to attack
this kind of approach by leveraging auxiliary information. On the other
hand, cryptographers have some really compelling and provably-secure tools such
as ORAM and FHE (mentioned in Seny's previous posts) that can protect all the
information in a graph database. The problem, however, is their performance,
which is crucial for databases. In today's world, efficiency is more than
running in polynomial time; we need solutions that run and scale to massive
volumes of data. Many real world graph datasets, such as biological networks
and social networks, have millions of nodes, some even have billions of nodes
and edges. Therefore, besides security, scalability is one of the main aspects we
have to consider.</p>
<h2 id="graph-encryption">Graph Encryption</h2>
<p>Previous work in encrypted search has focused on how to
search encrypted documents, e.g., doing keyword search, conjunctive queries,
etc. Graph encryption, on the other hand, focuses on performing graph queries
on encrypted graphs rather than keyword search on encrypted documents. In some
cases, this makes the problem harder since some graph queries can be extremely
complex. Another technical challenge is that not only the privacy of nodes and edges
needs to be protected but also the <em>structure</em> of the graph, which
leads to many interesting research directions.</p>
<p>Graph encryption was introduced by Melissa Chase and Seny in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. That paper shows how
to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency
and focused subgraphs) can be performed (though the paper is more general as it
describes <em>structured encryption</em>). Seny and I, together with Kobbi Nissim
and George Kollios, followed this up with a paper last year
[<a href="http://eprint.iacr.org/2015/266.pdf">MKNK15</a>] that showed how to
handle more complex graph queries.</p>
<h2 id="queries-on-encrypted-graph-databases">Queries on Encrypted Graph Databases</h2>
<h3 id="neighbor-queries-and-adjacency-queries">Neighbor Queries and Adjacency Queries</h3>
<p>As I mentioned earlier,
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] studied some simple
graph queries, such as adjacency queries and neighbor queries. An adjacency
query takes two nodes as input and returns whether they share an
edge. A neighbor query takes a node as input and returns all the nodes that
share an edge with it.</p>
<p>The construction for neighbor queries is mainly based on searchable
symmetric encryption (SSE), where the input graph is viewed as a particular kind
of document collection. Another novel technique proposed in the paper
is to use an efficient symmetric non-committing encryption scheme to achieve
adaptive security efficiently. The paper also proposes a nice solution for
focused subgraph queries, which are an essential part of the seminal HITS
ranking algorithm of Kleinberg but are also useful in their own right.</p>
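<p>To make the "graph as document collection" view concrete, here is a minimal plaintext sketch (illustrative only; the encryption layer is what [CK10] adds on top): each node's adjacency list plays the role of a "document", so neighbor queries become document retrieval and adjacency queries become membership tests.</p>

```python
# View a graph as a "document collection": one document per node,
# containing that node's neighbor list. An SSE scheme built over this
# collection answers neighbor queries; adjacency queries follow by
# checking membership.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["alice"],
    "carol": ["alice", "dave"],
    "dave":  ["carol"],
}

def neighbor_query(g, v):
    """Return all nodes sharing an edge with v."""
    return g.get(v, [])

def adjacency_query(g, u, v):
    """Return True iff u and v share an edge."""
    return v in g.get(u, [])

assert neighbor_query(graph, "carol") == ["alice", "dave"]
assert adjacency_query(graph, "alice", "bob")
assert not adjacency_query(graph, "bob", "dave")
```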
<h3 id="approximate-shortest-distance-queries">Approximate Shortest Distance Queries</h3>
<p>Shortest distance queries are arguably one of the most fundamental and
well-studied graph queries due to their numerous applications. A shortest
distance query takes two nodes as input and returns the number of edges
on a shortest path between them. In social networks these queries allow you
to find the smallest number of friends (or collaborators, peers, etc) between
two people. So a graph encryption scheme that supports shortest distance
queries would potentially have many applications in graph database security,
and could be a major building block for other graph encryption schemes. In the
following, I briefly give an overview on our solution for <em>approximate</em>
shortest distance queries.</p>
<p>As I mentioned, to design a secure yet scalable graph encryption
scheme, we have to take into account many things, including the storage space on
the server side, the bandwidth for the query, the computational overhead for
both client and server, etc. Suppose we are given a graph <span class="math">\(G= (V, E)\)</span> and let
<span class="math">\(n= |V|\)</span>, <span class="math">\(m = |E|\)</span>. If we were to use a traditional shortest distance algorithm such as
Dijkstra's algorithm, the query time would be <span class="math">\(O(n\log n+m)\)</span>, which can be very
slow for large graphs. The benefit of course would be that we would not need
extra storage. Another approach is to build an encrypted adjacency matrix (see
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]) that somehow supports
shortest distance queries. The problem there is that we would need to pay at
least <span class="math">\(O(n^2)\)</span> storage, which is obviously expensive when <span class="math">\(n\)</span> is, say, <span class="math">\(1\)</span>
million.</p>
<p>Fortunately, thanks to brilliant algorithmic computer scientists, there exists
a really nice and neat data structure called a <em>distance oracle</em> (DO)
[<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.333&rep=rep1&type=pdf">TZ05</a>,
<a href="http://research.microsoft.com/pubs/115785/wsdm2010.pdf">SGNP10</a>,
<a href="http://research-srv.microsoft.com/pubs/201773/cosn-similarity.pdf">CDFGGW13</a>].
Using such a structure, one needs much less storage (typically <span class="math">\(O(n
\log n)\)</span>) and gets fast query performance (typically <span class="math">\(O(\log n)\)</span>). However, most
distance oracles return an <em>approximate</em> distance rather than the exact
one. But one can tweak the parameters in order to get the best trade-off
between performance and approximation. When I first looked at these data
structures, I felt that this was a really amazing tool; not only because of its
functionality but also due to its simplicity.</p>
<p>There are many ways of generating distance oracles. Some of them offer
better approximation while others can have better performance. Here I just
describe one kind: <em>sketch-based</em> distance oracles. In such an
oracle, every node <span class="math">\(v\)</span> has a sketch, <span class="math">\(Sk_v\)</span> (normally generated by some
randomized algorithm). <span class="math">\(Sk_v\)</span> is a set containing many node pairs <span class="math">\(\langle
w_i,d(v, w_i)\rangle\)</span>, where <span class="math">\(w_i\)</span> is some node id and <span class="math">\(d(v, w_i)\)</span> is the
distance between <span class="math">\(v\)</span> and <span class="math">\(w_i\)</span>. For example, the following sketch <span class="math">\(Sk_v\)</span> consists of
three pairs</p>
<p><span class="math">\[ Sk_v = \{\langle w_1, d(v, w_1)\rangle, \langle w_2, d(v, w_2)\rangle,
\langle w_3, d(v, w_3)\rangle \}.
\]</span></p>
<p>Querying the shortest distance
between <span class="math">\(u\)</span> and <span class="math">\(v\)</span> is quite simple. We only need to retrieve <span class="math">\(Sk_u\)</span> and
<span class="math">\(Sk_v\)</span>, and find the common nodes in both sketches and add up their
corresponding distances. We then return the minimum sum as the
shortest distance. Formally, let <span class="math">\(I\)</span> be the set of nodes that appear in both
<span class="math">\(Sk_u\)</span> and <span class="math">\(Sk_v\)</span>. Then, the approximate shortest distance between <span class="math">\(u\)</span> and <span class="math">\(v\)</span>,
<span class="math">\(d(u,v)\)</span>, is</p>
<p><span class="math">\[d(u, v) = \min_{s \in I}\{ d(u, s) + d(v, s)\}\]</span></p>
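<p>Concretely, answering a query amounts to intersecting the two sketches and taking the minimum sum. A small plaintext illustration (the sketches below are hand-picked for the example, not produced by an actual oracle construction):</p>

```python
# Each sketch maps seed nodes w_i to the distance d(v, w_i). The
# approximate distance between u and v is the minimum of
# d(u, s) + d(v, s) over the seeds s common to both sketches.
def sketch_distance(sk_u, sk_v):
    common = sk_u.keys() & sk_v.keys()
    if not common:
        return None  # the oracle gives no estimate for this pair
    return min(sk_u[s] + sk_v[s] for s in common)

sk_u = {"w1": 2, "w2": 5, "w3": 1}
sk_v = {"w2": 3, "w3": 4, "w4": 1}
# common seeds: w2 (5 + 3 = 8) and w3 (1 + 4 = 5) -> returns 5
assert sketch_distance(sk_u, sk_v) == 5
```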
<p>The design of this distance oracle guarantees that the returned distance is no
greater than <span class="math">\(\alpha\times \mathsf{dist}(u,v)\)</span>, where <span class="math">\(\mathsf{dist}(u, v)\)</span> is
the true shortest distance between <span class="math">\(u\)</span> and <span class="math">\(v\)</span> and <span class="math">\(\alpha\)</span> is the
approximation ratio. Note that the approximation ratio <span class="math">\(\alpha\)</span> is a function
of some parameters of the sketch, so one controls the approximation by
tweaking the sketch, which in turn affects both setup and query efficiency. In
our solution, we leverage sketch-based distance oracles, but we have to be
very careful not to affect their approximation ratio.</p>
<p>A distance oracle encryption scheme
<span class="math">\(\mathsf{Graph} = (\mathsf{Setup}, \mathsf{DistQuery})\)</span> consists of a polynomial-time algorithm
and a polynomial-time two-party protocol that work as follows:</p>
<ul>
<li><p><span class="math">\((K, \mathsf{EO}) \leftarrow \mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span>: is a
probabilistic algorithm that takes as input a security parameter <span class="math">\(k\)</span>, an oracle<br>
<span class="math">\(\Omega\)</span>, an approximation factor <span class="math">\(\alpha\)</span>, and an error parameter <span class="math">\(\varepsilon\)</span>.
It outputs a secret key <span class="math">\(K\)</span> and an encrypted oracle <span class="math">\(\mathsf{EO}\)</span>.</p></li>
<li><p><span class="math">\((d, \bot) \leftarrow \mathsf{DistQuery}_{C,S}\big((K, q), \mathsf{EO}\big)\)</span>: is a
two-party protocol between a client <span class="math">\(C\)</span> that holds a key <span class="math">\(K\)</span> and a shortest
distance query <span class="math">\(q = (u, v) \in V^2\)</span> and a server <span class="math">\(S\)</span> that holds an encrypted
oracle <span class="math">\(\mathsf{EO}\)</span>. After executing the protocol, the client <span class="math">\(C\)</span> receives a distance <span class="math">\(d
\geq 0\)</span> and server <span class="math">\(S\)</span> receives <span class="math">\(\bot\)</span>.</p></li>
</ul>
<p>For <span class="math">\(\alpha \geq 1\)</span> and <span class="math">\(\varepsilon \lt 1\)</span>, we say that <span class="math">\(\mathsf{Graph}\)</span> is
<span class="math">\((\alpha, \varepsilon)\)</span>-correct if for all <span class="math">\(k \in \mathbb{N}\)</span>, for all <span class="math">\(\Omega\)</span>
and for all <span class="math">\(q = (u, v) \in V^2\)</span>,</p>
<p><span class="math">\[
\mbox{Pr}\big[d \leq \alpha\cdot {\sf dist}(u, v)\big] \geq 1 - \varepsilon,
\]</span></p>
<p>where the probability is over the randomness in computing <span class="math">\((K, \mathsf{EO}) \leftarrow
\mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span> and then <span class="math">\((d, \bot) \leftarrow
\mathsf{DistQuery}\big((K, q), \mathsf{EO}\big)\)</span>. I skip the adaptive security definition
as it is similar to adaptive security for SSE and is captured by the general
notion of security for structured encryption given in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. Next, I will go over
two solutions for the oracle encryption.</p>
<p><strong>A computationally-efficient solution.</strong>
This approach is rather straightforward, so here I briefly sketch its description. The <span class="math">\(\mathsf{Setup}\)</span> algorithm works as follows:</p>
<ol>
<li>For each node <span class="math">\(v \in V\)</span>, generate a token by applying a PRF to <span class="math">\(v\)</span>: <span class="math">\(\mathsf{tk}_v = F_K(v)\)</span>.</li>
<li>Pad the sketches to the same length and encrypt each sketch <span class="math">\(Sk_v\)</span> as <span class="math">\({\sf Enc}_K(Sk_v)\)</span> using a symmetric encryption scheme.</li>
<li>For each node <span class="math">\(v \in V\)</span>, store the pair <span class="math">\((\mathsf{tk}_v, {\sf Enc}_K(Sk_v))\)</span> in a <a href="https://en.wikipedia.org/wiki/Associative_array">dictionary data structure</a> <span class="math">\(\mathsf{DX}\)</span> (the insertions should be done in random order).</li>
</ol>
<p>The <span class="math">\(\mathsf{DistQuery}\)</span> algorithm is quite simple: given nodes <span class="math">\(u\)</span> and <span class="math">\(v\)</span>, the
client just computes <span class="math">\(F_K(u)\)</span> and <span class="math">\(F_K(v)\)</span> and sends them to the server as the
token. After receiving the token, the server just retrieves <span class="math">\(\mathsf{DX}[F_K(u)]\)</span> and
<span class="math">\(\mathsf{DX}[F_K(v)]\)</span> and sends back the encrypted sketches <span class="math">\({\sf Enc}_K(Sk_u)\)</span> and
<span class="math">\({\sf Enc}_K(Sk_v)\)</span>. Finally, the client decrypts the sketches, and computes the
approximate shortest distance as is normally done in sketch-based distance
oracles. This approach is efficient and simple since we use symmetric
encryption. We show in the paper that this scheme is adaptively secure and that
its leakage consists of the size of the graph, the maximum size of the
distance oracle, and the query pattern (see the paper for a precise definition).</p>
<p><strong>Communication-efficient solution.</strong>
The problem with the scheme described above is that the communication
complexity is linear in the maximum sketch size. As I mentioned above, this
can be a bottleneck in practice when the graphs are large. Now, at a very high
level, I briefly discuss how we can achieve a solution with optimal <span class="math">\(O(1)\)</span>
communication complexity. The scheme makes use of a PRF, a degree-<span class="math">\(2\)</span> somewhat
homomorphic encryption scheme <span class="math">\(\mathsf{SHE} = ({\sf Gen}, {\sf Enc}, {\sf Dec})\)</span>, and a hash function <span class="math">\(h:
V\to [t]\)</span>.</p>
<ul>
<li><p><span class="math">\(\mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span>: Given <span class="math">\(1^k\)</span>, <span class="math">\(\Omega\)</span>,
<span class="math">\(\alpha\)</span>, and <span class="math">\(\varepsilon\)</span> as inputs, it generates a public/secret-key pair
<span class="math">\(({\sf pk}, {\sf sk})\)</span> for <span class="math">\(\mathsf{SHE}\)</span>. Let <span class="math">\(D\)</span> be the maximum distance over
all the sketches and <span class="math">\(S\)</span> be the maximum sketch size. <span class="math">\(\mathsf{Setup}\)</span> sets <span class="math">\(N
\leftarrow 2\cdot D +1\)</span> and samples a hash function <span class="math">\(h \leftarrow \mathcal{H}\)</span>
with domain <span class="math">\(V\)</span> and co-domain <span class="math">\([t]\)</span>, where <span class="math">\(t = 2\cdot
S^2\cdot\varepsilon^{-1}\)</span>. It then creates a hash table for each node <span class="math">\(v \in
V\)</span>. More precisely, for each node <span class="math">\(v\)</span>, it processes each pair <span class="math">\((w_i, \delta_i) \in
Sk_v\)</span> and stores <span class="math">\({\sf Enc}_{pk}(2^{N - \delta_i})\)</span> at location <span class="math">\(h(w_i)\)</span> of a
<span class="math">\(t\)</span>-size array <span class="math">\(\mathsf{T}_v\)</span>. In other words, for all <span class="math">\(v \in V\)</span>, it creates an
array <span class="math">\(\mathsf{T}_v\)</span> such that for all <span class="math">\((w_i, \delta_i) \in Sk_v\)</span>,
<span class="math">\(\mathsf{T}_v[h(w_i)] \leftarrow {\sf Enc}_{pk}(2^{N - \delta_i})\)</span>. It then fills
the empty cells of <span class="math">\(\mathsf{T}_v\)</span> with homomorphic encryptions of <span class="math">\(0\)</span> and
stores each hash table <span class="math">\(\mathsf{T}_{v_1}\)</span> through <span class="math">\(\mathsf{T}_{v_n}\)</span> in a
dictionary <span class="math">\(\mathsf{DX}\)</span> by setting, for all <span class="math">\(v \in V\)</span>, <span class="math">\(\mathsf{DX}[F_K(v)]
\leftarrow \mathsf{T}_v\)</span>. Finally, it outputs <span class="math">\(\mathsf{DX}\)</span> as the encrypted
oracle <span class="math">\(\mathsf{EO}\)</span>.</p></li>
<li><p>The <span class="math">\(\mathsf{DistQuery}\)</span> protocol works as follows. Given a query <span class="math">\(q = (u,
v)\)</span>, the client sends tokens <span class="math">\((\mathsf{tk}_1, \mathsf{tk}_2) = (F_K(u),
F_K(v))\)</span> to the server which uses them to retrieve the hash tables of nodes
<span class="math">\(u\)</span> and <span class="math">\(v\)</span> by computing <span class="math">\(\mathsf{T}_u := \mathsf{DX}[\mathsf{tk}_1]\)</span> and
<span class="math">\(\mathsf{T}_v := \mathsf{DX}[\mathsf{tk}_2]\)</span>. The server then homomorphically
evaluates an inner product over the hash tables. More precisely, it computes <span class="math">\(c
:= \sum_{i=1}^t \mathsf{T}_u[i]\cdot\mathsf{T}_v[i]\)</span>, where <span class="math">\(\sum\)</span> and <span class="math">\(\cdot\)</span>
refer to the homomorphic addition and multiplication operations of the SHE
scheme. Finally, the server returns only <span class="math">\(c\)</span> to the client who decrypts it and
outputs <span class="math">\(2N - \log_2 \left({\sf Dec}_{\sf sk}(c)\right)\)</span>.</p></li>
</ul>
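<p>The reason the inner product recovers the minimum distance is the power-of-two encoding: multiplying encryptions of <span class="math">\(2^{N-\delta_i}\)</span> and <span class="math">\(2^{N-\delta'_i}\)</span> adds the exponents, so the largest term of the sum corresponds to the smallest <span class="math">\(\delta_i + \delta'_i\)</span>, and computing <span class="math">\(2N - \lfloor\log_2(\cdot)\rfloor\)</span> isolates it. Stripping away the encryption, the arithmetic looks like this (toy parameters of my own choosing):</p>

```python
from math import floor, log2

D = 10            # maximum distance appearing in any sketch
N = 2 * D + 1     # N = 2D + 1, as in Setup

sk_u = {"w1": 2, "w3": 1}           # seed -> distance from u
sk_v = {"w1": 3, "w3": 5, "w4": 1}  # seed -> distance from v

# What the server computes homomorphically over matching hash-table
# cells: sum of 2^(N - d_u) * 2^(N - d_v) = 2^(2N - (d_u + d_v))
c = sum((1 << (N - sk_u[w])) * (1 << (N - sk_v[w]))
        for w in sk_u.keys() & sk_v.keys())

# Client-side decoding: 2N - floor(log2(c)) recovers min(d_u + d_v)
approx = 2 * N - floor(log2(c))
assert approx == 5  # min(2 + 3, 1 + 5) over the shared seeds w1, w3
```

<p>When many pairs tie for the minimum, the lower-order terms of the sum can carry past a power of two, which is one reason the hash table size <span class="math">\(t\)</span> is chosen as a function of the error parameter <span class="math">\(\varepsilon\)</span>.</p>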
<p>See the paper for more details and an analysis of the construction. What is
important to note is that we can show that the scheme does not affect the
quality of underlying oracle's approximation too much and, in fact, in certain
cases it improves it!</p>
<p>It is also worth mentioning that, in the paper, we propose a third
scheme that has <span class="math">\(O(1)\)</span> communication complexity but with some additional
leakage which we call the sketch pattern leakage. This third scheme is far more
efficient than the one above. One interesting subtlety is that, unlike more
standard encrypted search schemes, where the leakage is over a structure that
holds all the original data (e.g., an inverted index with full indexing), the
leakage in this case is only over a data structure that holds a random subset
of the data.</p>
<p>Finally, we implemented all our constructions and verified their efficiency
experimentally.</p>
<h2 id="conclusions-and-future-work">Conclusions and Future Work</h2>
<p>I went over our graph encryption schemes with support for approximate shortest distance
queries. The solutions I described are all adaptively-secure. Of course, there
are other possible approaches based on ORAM or FHE which can provide stronger
security (even hiding the access pattern!) but at a higher cost. As graph databases become more and more
popular, I believe graph encryption will play an increasingly important role in
database security. We live in a data-centric world that generates network and
graph data of all kinds. There are still more challenging and exciting open
problems in graph database security: e.g., how to construct graph encryption
schemes for more complex graph queries? Can we support graph mining tasks,
e.g., can we construct graph encryption schemes that allow us to detect
communities over encrypted social networks? And of course, as is common in
encrypted search, how can we quantify the security of our graph encryption
schemes? Any brilliant ideas? Talk to us! :-)</p>
Applied Crypto Highlights: Searchable Encryption with Ranked Results
http://senykam.github.io/2015/04/15/applied-crypto-highlights-searchable-encryption-with-ranked-results
Wed, 15 Apr 2015 20:57:14 -0300
<p><em>This is the second in a series of guest posts highlighting new research in
applied cryptography. This post is written by <a href="http://www.baldimtsi.com/">Foteini
Baldimtsi</a> who is a postdoc at Boston University and
<a href="http://research.microsoft.com/en-us/people/oohrim/">Olya Ohrimenko</a> who is a
postdoc at Microsoft Research. Note that Olya is on the job market this year.</em></p>
<p><img src="http://senykam.github.io/img/steam.jpg" class="alignright" width="250">
Modern cloud services let their users outsource data as well as request
computations on it. Due to potentially sensitive content of users' data and
distrust in cloud services, it is natural for users to outsource their data
encrypted. It is, however, important for the users to still be able to use
cloud services for performing computations on the encrypted data. In this
article we consider an important class of such computations: search over
outsourced encrypted data. Searchable Encryption has attracted a lot of
attention from the research community and has been thoroughly described by Seny
in <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1">previous blog posts</a>.</p>
<p>Search functionality alone, however, might not be enough when one considers a
large amount of data. Ideally, users would like to not only receive the
matching results, but get them back sorted according to how relevant they are
to their query (just like a search engine does!). In this blog post we describe
our <a href="http://fc15.ifca.ai/preproceedings/paper_89.pdf">recent result</a> from
the conference on Financial Cryptography and Data Security 2015 which builds on
top of searchable encryption techniques to return <em>ranked results</em> to
users' queries. Our goal is to create a scheme that is efficient and achieves a
high level of privacy against a curious cloud server.</p>
<h2 id="ranking-search-results-on-plaintext-data">Ranking search results on plaintext data</h2>
<p>Let us start by briefly describing how ranking would be done if users did not
take into account the privacy of their data and outsourced it in an unencrypted
format. Literature on information retrieval offers an abundance of ranking
methods. For our paper, we chose the $\mbox{tf-idf}$ ranking method due to its
simplicity, popularity and the fact that it supports free text queries. This method
is effective since it is based on term/keyword frequency (tf) and inverse
document frequency (idf).</p>
<p>Let $D=D_1,\dots,D_n$ be a document collection of $n$ documents, in which there
exist $m$ unique terms/keywords $t_1,\dots,t_m$. First, for every term $t$, we
compute its frequency ($\mbox{tf}$) in each document $D_i$ as well as its inverse
document frequency ($\mbox{idf}$), which captures how common the term is
in the whole document collection. Then, for each term and document we compute</p>
<p><span class="math">\[
\mbox{tf-idf}_{t,D_i} = \mbox{tf}_{t,D_i} \times \mbox{idf}_{t}
\]</span></p>
<p>and store the score values in the rank table, $T$:</p>
<p><figure><img src="http://senykam.github.io/img/searchindextable.jpg" alt="" title="$\mbox{tf-idf}$ rank table, $T$, outsourced to the cloud"><figcaption>$\mbox{tf-idf}$ rank table, $T$, outsourced to the cloud</figcaption></figure></p>
<p>Note that if a term does not appear in a document, then we store $0.00$ as its rank.
This table could be either computed by the owner of the document collection and
outsourced to the cloud, or computed by the cloud itself since it
receives the actual document collection $D$ in the clear.</p>
<p>Now suppose that a user wants to query the cloud for the multi-keyword query "searchable
encryption". Then, the cloud first searches for the terms "searchable" and
"encryption" in the table, adds the corresponding rows together to get the
overall score of the query, sorts the scores, and returns the relevant
documents in a sorted order.</p>
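<p>On plaintext data the whole pipeline is a few lines. A minimal sketch (using raw term counts for tf and $\log(n/\mbox{df})$ for idf, which is one common variant among several):</p>

```python
from math import log

docs = {
    "D1": "searchable encryption is encryption you can search",
    "D2": "graph databases store relations",
    "D3": "encryption protects outsourced data",
}

# Build the rank table T: T[term][doc] = tf * idf
words = {d: text.split() for d, text in docs.items()}
vocab = {w for ws in words.values() for w in ws}
n = len(docs)
T = {}
for t in vocab:
    df = sum(1 for ws in words.values() if t in ws)  # document frequency
    idf = log(n / df)
    T[t] = {d: ws.count(t) * idf for d, ws in words.items()}

def ranked_query(query):
    # add the rows for each query term, then sort scores descending
    scores = {d: sum(T[t][d] for t in query.split() if t in T)
              for d in docs}
    return sorted(scores, key=scores.get, reverse=True)

# D1 contains both terms ("encryption" twice), D3 one, D2 none
assert ranked_query("searchable encryption") == ["D1", "D3", "D2"]
```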
<h2 id="ranking-search-results-on-encrypted-data">Ranking search results on encrypted data</h2>
<p>A user that wishes to protect her privacy is likely to outsource her document
collection to the cloud in an encrypted format: $E(D_1),\dots,E(D_n)$. In order
to be able to perform <em>ranked search</em>, the user has to create the rank
table $T$ and send it to the cloud (as opposed to outsourcing plaintext data
where the cloud could also compute the rank table itself). Since the rank table
contains information about the distribution of words in individual documents and
the whole collection, it has to be encrypted as well. However, in order for the
server to be able to return ranked results using the $\mbox{tf-idf}$ method
described above, the encrypted $T$ should be able to support the following
operations:</p>
<ol>
<li>search for terms/keywords</li>
<li>add numerical values</li>
<li>sort a list of numerical values.</li>
</ol>
<p>For the first operation one could simply encrypt the keywords in the table
using a <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1/">searchable
encryption</a>
(SE) scheme. Then, whenever the user wants to search for a phrase, she sends to
the cloud an SE trapdoor for each keyword in the phrase. The server can then
use the trapdoors to locate the keywords in the table.</p>
<p>The next two operations refer to the numerical entries on the table which
should be encrypted in a way that supports addition and sorting. A natural
solution would be to encrypt these values under a <a href="http://outsourcedbits.org/2012/06/26/applying-fully-homomorphic-encryption-part-1/">fully-homomorphic
encryption</a>
scheme that can support any type of computation over encrypted data. However,
the resulting solution would be too inefficient to apply in practice.
Another potential solution would be to encrypt the numerical values under an
<a href="http://www.cc.gatech.edu/aboldyre/papers/bclo.pdf">order-preserving
encryption</a> (OPE) scheme.
However, this would be sufficient only for single-keyword queries, since OPE
schemes cannot support homomorphic addition (and, even if they did, they would
<a href="http://luca-giuzzi.unibs.it/corsi/Support/papers-cryptography/RAD78.pdf">not be
secure</a>).
Note that even for single-keyword queries, OPE might not be ideal since it leaks the
rank order of the documents for each keyword (see also the discussion
<a href="http://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-2/">here</a>).</p>
<p>Given that we aim for an efficient and provably secure
solution, we propose to encrypt the numerical values of the rank table using
the <a href="http://en.wikipedia.org/wiki/Paillier_cryptosystem">Paillier encryption
scheme</a>: a semi-homomorphic
scheme that supports the addition of encrypted values. (For the rest of this
post, we use $[a]$ to denote the encryption of value $a$ under this scheme.)
By the properties of Paillier, the server
can add the corresponding rows of $T$ when a query is received. What is still
left to discuss is how the server can also sort these encrypted values. In
the rest of the post, we describe our private sorting mechanism over encrypted
values.</p>
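<p>The additive property the scheme relies on (the product of two Paillier ciphertexts decrypts to the sum of the plaintexts) can be seen in a toy implementation with tiny hard-coded primes, purely for illustration; real deployments use 2048-bit moduli and a vetted library:</p>

```python
import math, random

# Toy Paillier cryptosystem (illustration only, insecure parameters)
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant

def enc(m):
    r = random.randrange(1, n)       # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = enc(12), enc(30)
assert dec((a * b) % n2) == 42       # [12] * [30] decrypts to 12 + 30
```

<p>In our setting, the server multiplies the Paillier ciphertexts in the corresponding rows of $T$ to obtain encrypted query scores without ever seeing the plaintext ranks.</p>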
<p>Our private sorting mechanism requires equipping the cloud server with a secure
co-processor (e.g., <a href="http://www-03.ibm.com/security/cryptocards/pciecc/overview.shtml">IBM
PCIe</a>, <a href="https://software.intel.com/en-us/blogs/2013/09/26/protecting-application-secrets-with-intel-sgx">Intel
SGX</a>,
<a href="https://technet.microsoft.com/en-us/library/cc749022%28v=ws.10%29.aspx">Windows
TPM</a>).
The secure co-processor is then given the decryption key of the
semi-homomorphic encryption scheme which lets him assist the cloud server in
sorting. For the protocol to proceed, we assume that the co-processor does not
collude with the cloud server and both of them are following the protocol in an
honest-but-curious way. That is, neither of them deviates from the protocol but
both are curious to learn more about the user's data.</p>
<p><figure><img src="http://senykam.github.io/img/introimagesingleuserslim.jpg" alt="" title="An overview of the interactions between the user, the cloud server $S_1$ and the co-processor $S_2$."><figcaption>An overview of the interactions between the user, the cloud server $S_1$ and the co-processor $S_2$.</figcaption></figure></p>
<p>Regarding the privacy of our scheme, we design our protocol in such a way that:
(a) the co-processor learns nothing about the values being sorted and (b) the
cloud server, as in SE, learns the search pattern (i.e., whether a keyword was
queried before or not), but learns nothing about the ranking of the documents.
For example, he does not learn which document ranks higher for the user's query.</p>
<h2 id="private-sort">Private Sort</h2>
<p>We now develop a sorting protocol that the cloud server and the co-processor
can use to jointly sort encrypted ranking data of the documents. From now on we
denote the cloud server by $S_1$ and the co-processor by $S_2$. Our private
sort is a two-party protocol between $S_1$ and $S_2$ where $S_1$ has an
encrypted array of $n$ elements $[A] = \{ [A_1], [A_2], \ldots, [A_n]\}$ and
$S_2$ has the secret key that can decrypt $A$.<br>
By the end of the protocol, $S_1$ should obtain $[B] = \{[B_1],
[B_2], \ldots, [B_n]\}$ where $[B]$ is an encryption of $A$ sorted. Since $S_1$
and $S_2$ are both curious, we are interested in protecting the content of $A$
and $B$ from both of them and we are willing to reveal <em>only</em> the size of
$A$, $n$. Hence, $S_2$ should only assist $S_1$ in sorting without seeing the
encrypted content of $A$ or $B$; otherwise he could trivially decrypt it. On the
other side of the protocol, neither the decryption key nor the plaintext
values of $A$ and $B$ should be leaked to $S_1$. For example, we do not want
to leak to either $S_1$ or $S_2$ the values of elements in $A$, their comparison
results with other elements, or their new locations in $B$ (in the paper we
express these properties using simulation-based security definitions).</p>
<p><strong>Private Sort Construction Overview.</strong>
As can be seen from the definitions, the participation of $S_1$ and $S_2$ in
private sort should not reveal anything about the content of the data to either
of them. Hence, any method we use for comparison and sorting must appear
independent of the data. We note, however, that many sorting algorithms access
the data depending on the comparison result and data content (e.g., quicksort).
This does not fit our model where everything about the data, including
individual comparisons, should be protected from $S_1$ and $S_2$.</p>
<p>Fortunately, there are sorting algorithms where data comparisons are determined
by the size of the data to be sorted, $n$ in our case, and not the data
content. One such algorithm is a
<a href="http://dl.acm.org/citation.cfm?id=1468121">sorting network</a> by K.
Batcher, which relies on a Two-Element Sort circuit. This circuit takes two
elements and outputs them in sorted order.
The network then consists of $O((\log n)^2)$ layers where every layer has $O(n)$ Two-Element Sort circuits,
and the exact wiring of the circuits is determined solely by $n$.
In order to sort the data, one simply passes it through the network.
Hence, moving the data through the network depends only on $n$ and the Two-Element Sort circuit,
and if we develop a private Two-Element Sort, the implementation
of a private Batcher network becomes trivial.</p>
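<p>To make the data-independence concrete, here is a small Python sketch of ours (not code from the paper) that generates the comparator pairs of Batcher's odd-even merge sort and then sorts an array by pushing it through the network. In the actual protocol, each plaintext compare-exchange would be replaced by a call to Private Two-Element Sort.</p>

```python
def batcher_pairs(n):
    """Comparator pairs (i, j) of Batcher's odd-even merge sort.

    The wiring depends only on n (assumed here to be a power of two),
    never on the data being sorted."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare wires that belong to the same merge block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs

def batcher_sort(a):
    """Sort by applying one compare-exchange per comparator of the network."""
    a = list(a)
    for i, j in batcher_pairs(len(a)):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a
```

<p>For four elements, `batcher_pairs(4)` yields five comparators, and `batcher_sort([5, 1, 2, 9])` returns `[1, 2, 5, 9]`.</p>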
<h3 id="private-twoelement-sort">Private Two-Element Sort</h3>
<p>As the name suggests, Private Two-Element Sort is a special case of Private Sort, as defined above, for the case $n=2$. That is, $S_1$ has two encrypted elements $[a]$ and $[b]$ and wishes to obtain $[c]$ and $[d]$ where $c = \min(a,b)$ and $d = \max(a,b)$. Similarly, $S_2$ has the secret key of the encryption scheme. The security definition is also the same and informally states that neither $S_1$ nor $S_2$ learns anything about $a$ and $b$.</p>
<p>We first describe operations that are required to perform Two-Element Private Sort without encryption and then for every operation give its private version. The sorting consists of:</p>
<ol>
<li>$t := a > b$ (Set bit $t$ to the result of comparing $a$ and $b$).</li>
<li>$c := (1-t)a + tb$ (Use $t$ to select the minimum of $a$ and $b$).</li>
<li>$d := ta + (1-t)b$ (Use $t$ to select the maximum of $a$ and $b$).</li>
</ol>
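<p>In plaintext Python, the three steps above are simply (a small sketch of ours):</p>

```python
def two_element_sort(a, b):
    t = int(a > b)               # step 1: comparison bit
    c = (1 - t) * a + t * b      # step 2: t selects the minimum
    d = t * a + (1 - t) * b      # step 3: t selects the maximum
    return c, d
```

<p>The point of what follows is to evaluate exactly these three arithmetic expressions when $a$, $b$, $t$, $c$ and $d$ are all encrypted.</p>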
<p>Note that these three operations have to be performed on encrypted data:
$a$ and $b$ are part of the encrypted input of $S_1$, and
bit $t$ and values $c$ and $d$ should also be encrypted to protect their content from $S_1$.
Moreover, none of these values should be shown to $S_2$ since he can trivially decrypt them,
violating the privacy guarantees against $S_2$.</p>
<p>We show how to perform the above operations over encrypted data one by one, starting with
a <em>Private Comparison</em> protocol for computing $[t]$ and following with
a <em>Private Select</em> protocol for computing $[c]$ and $[d]$.</p>
<p><strong>Private Comparison.</strong>
This protocol is a variation of Andrew Yao's classical <a href="http://research.cs.wisc.edu/areas/sec/yao1982-ocr.pdf">Millionaires' problem</a>:
$S_1$ has $[a]$ and $[b]$ and wishes to obtain $[t]$, where
$t = (a > b)$, and $S_2$ has the private key of the encryption scheme.
Although there is more than one way of doing so, we pick an efficient
algorithm from a recent result by <a href="http://www.internetsociety.org/sites/default/files/04_1_2.pdf">Bost et al.</a>, which is a correction of the original <a href="http://bioinformatics.tudelft.nl/sites/default/files/Comparing%20encrypted%20data.pdf">protocol</a> by T. Veugen.
This algorithm lets $S_1$ and $S_2$ compare $a$ and $b$ using
a number of interactions that is logarithmic in the number of bits of each element.</p>
<p>Note that neither $S_1$ nor $S_2$ learns the values of $a$, $b$, and $t$.
In addition, $S_2$ does not learn the ciphertexts corresponding to these values.</p>
<p><strong>Private Select.</strong>
Given the comparison bit $t$, we now devise a private algorithm for using this
bit to select the minimum and the maximum of $a$ and $b$ (that is,
performing operations 2 and 3 above). Recall that $S_1$ has to obtain $[c]$
and $[d]$ with $S_2$ "blindly" assisting him in the protocol.</p>
<p>We wish to use simple cryptographic operations in order to compute $c$ and $d$.
That is, we use semi-homomorphic cryptographic techniques as opposed to
fully-homomorphic ones. To this end, we use an interesting property of layered
Paillier Encryption. We omit many of the details here and point out only the
features that we need.</p>
<p>We denote messages encrypted using first and second layers of Paillier Encryption as
$[m]$ and $[![m]!]$, respectively.
We recall that Paillier Encryption supports addition of ciphertexts as well
as multiplication by a constant, i.e., $[m_1][m_2] = [m_1+m_2]$ and $[m]^{C} = [Cm]$.
The same operations hold for ciphertexts of the second layer.
However, what is more interesting is that a ciphertext of the first layer
lies in the same domain as the plaintext of the second layer, which
allows $S_1$ to combine singly and doubly encrypted values homomorphically.</p>
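<p>To make the single-layer homomorphic properties concrete, here is a toy textbook Paillier implementation of ours (tiny fixed primes, illustration only, not remotely secure) checking that $[m_1][m_2] = [m_1+m_2]$ and $[m]^{C} = [Cm]$:</p>

```python
import math
import random

# Toy textbook Paillier with tiny fixed primes -- for illustration only.
P, Q = 101, 113
N = P * Q
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # valid decryption constant for the choice g = N + 1

def enc(m):
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:          # r must be invertible mod N
        r = random.randrange(2, N)
    return pow(N + 1, m, N2) * pow(r, N, N2) % N2

def dec(c):
    x = pow(c, LAM, N2)
    return (x - 1) // N * MU % N

# Homomorphic properties used in the post:
a, b = 42, 7
assert dec(enc(a) * enc(b) % N2) == a + b      # [a][b] = [a+b]
assert dec(pow(enc(a), 5, N2)) == 5 * a % N    # [a]^5 = [5a]
```

<p>In the protocol, the same two operations are applied at the second layer as well; the layered details are in the paper.</p>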
<p>This trick allows us to implement the functionality of private select for $c$,
and similarly for $d$, as follows:</p>
<p><span class="math">\[[\![[c]]\!] := [\![[a]]\!]^{[1-t]} [\![[b]]\!]^{[t]} = [\![[(1-t)a + tb]]\!]\,\]</span></p>
<p>where $c$ and $d$ are doubly encrypted.</p>
<p>Recall that the output of Two-Element Private Sort is a building block of the
general sort, where $c$ and $d$ participate in further invocations of
Two-Element Private Sort. To make the values $c$ and $d$ usable in the next
layer of Batcher's network, $S_1$ uses $S_2$ to strip off the extra layer of
encryption. $S_1$ blinds the value he needs to strip via $[![[c+r]]!]$, and
sends it to $S_2$, who decrypts the ciphertext and sends back only $[c+r]$.
Using homomorphic properties of Paillier, $S_1$ subtracts $r$ to get $[c]$.
The similar protocol is executed for $d$. %Note that this protocol requires
one interaction with $S_2$.</p>
<h3 id="private-nelement-sort">Private $n$-Element Sort</h3>
<p>Let us now show how to sort an array of $n$ elements using our Private Two-Element Sort.
$S_1$ executes Batcher's sorting network layer by layer.
For each layer in the network and for every sorting gate in this layer,
he engages with $S_2$ in Private Two-Element Sort.
He then uses the outputs of this layer as inputs to the next layer
of the network. (See the figure below for an illustration.)</p>
<p><figure><img src="http://senykam.github.io/img/batcher1.jpg" alt="" title="Example of privately sorting an encrypted array of four elements $5,1,2,9$ where $[m]$ denotes a Paillier encryption of message $m$ and $\mathsf{pairs}_i$ denotes a pair of elements to be sorted. Note that only $S_1$ stores values in the arrays $A_i$ while $S_2$ blindly assists $S_1$ in sorting the values."><figcaption>Example of privately sorting an encrypted array of four elements $5,1,2,9$ where $[m]$ denotes a Paillier encryption of message $m$ and $\mathsf{pairs}_i$ denotes a pair of elements to be sorted. Note that only $S_1$ stores values in the arrays $A_i$ while $S_2$ blindly assists $S_1$ in sorting the values.</figcaption></figure></p>
<p><strong>Sketch of Privacy Analysis.</strong>
We note that the number of times $S_1$ engages with $S_2$ in the protocol
reveals nothing to either of them about the data content. Each engagement is
an execution of Private Two-Element Sort which, in turn, is a call to Private
Comparison and two calls to Private Select. Private Comparison guarantees
privacy against $S_1$ and $S_2$ as long as they are non-colluding honest-but-curious
adversaries. Private Select relies on the homomorphic properties of Paillier and
requires only the re-encryption step from $S_2$. Since $S_2$ receives a
blinded value, he does not learn the value of $c$ or $d$. Moreover, since the
values of $c$ and $d$ are re-randomized, we can treat the $O(n (\log n)^2)$ calls to
Private Two-Element Sort independently.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We constructed a private sort mechanism that allows a cloud server $S_1$ to sort
a list of encrypted data without learning anything about their order (while
assisted by a non-colluding co-processor $S_2$). As discussed in the beginning
of our post, this tool lets a user store his encrypted documents in
a cloud server and receive ranked results when searching on them.</p>
<p>The method, as described in this post, assumes that the rank table has an entry
for every keyword-document pair: if a keyword does not appear in
a document, a zero is stored.
In the <a href="https://eprint.iacr.org/2014/1017">full version</a> of the paper, we show that
we can relax this requirement and store information only for the documents where a
keyword appears, hence significantly reducing the size of $T$ and the query time for the server.
If we do so, we can add ranking to the optimal SE technique by <a href="http://research.microsoft.com/apps/pubs/?id=102088">Curtmola et al.</a> for single keyword queries or to the technique by <a href="https://eprint.iacr.org/2013/169">Cash et al.</a>
for efficiently answering Boolean queries on encrypted data (see earlier <a href="http://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5/#comment-2512">blog post</a> for more details on each).
Although the resulting scheme gives a significant performance
improvement and protects the ranking of the documents,
it inherits the leakage of the access pattern (i.e., identifiers of the documents where each query keyword appears)
from the corresponding SE technique.</p>
<p>Our work leaves several interesting open questions, including:
how can we efficiently update the collection?
How can a user verify the ranking result it receives?
And is a non-colluding co-processor provably necessary for multi-keyword
ranked search? Any ideas? :)</p>
How to Search on Encrypted Data: Searchable Symmetric Encryption (Part 5)
http://senykam.github.io/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5
Thu, 21 Aug 2014 17:33:58 -0300http://senykam.github.io/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5<p><em>This is the fifth part of a series on searching on encrypted data. See parts <a href="https://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-1/">1</a>, <a href="https://outsourcedbits.org/2013/10/30/how-to-search-on-encrypted-data-part-2/">2</a>, <a href="https://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-3/">3</a> and <a href="https://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-part-4-oblivious-rams/">4</a>.</em></p>
<p><img src="http://senykam.github.io/img/search.jpg" class="alignright" width="250">
In the previous post we covered the most secure way to search on encrypted
data: oblivious RAMs (ORAM). I always recommend ORAM-based solutions for
encrypted search whenever possible; namely, for small- to moderate-size data
<sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup>. Of course, the main limitation of ORAM is efficiency so this motivates us
to keep looking for additional approaches.</p>
<p>The solution I discuss in this post is <em>searchable symmetric encryption</em> (SSE).
For readers who are not familiar with this area, let me stress that this has
<em>nothing</em> to do with CipherCloud's <a href="http://www.ciphercloud.com/company/about-ciphercloud/press-releases/ciphercloud-delivers-breakthrough-searchable-strong-encryption/">searchable strong
encryption</a>.
I don't know why CipherCloud chose to call its "breakthrough" product SSE. No
one knows exactly what CipherCloud does at the crypto level but everything
points to them using some form of tokenization which, as far as I know, is an
industry term for deterministic encryption. This is neither a breakthrough nor
really secure, for that matter, but that's the last thing I'll say about
CipherCloud here; every reference to SSE that follows is to searchable
symmetric encryption.</p>
<p>SSE was first introduced by Song, Wagner and Perrig
[<a href="http://www.cs.berkeley.edu/~dawnsong/papers/se.pdf">SWP00</a>]. SSE tries to
achieve the best of all worlds. It is as efficient as the most efficient
encrypted search solutions (e.g., deterministic encryption) but provides a lot
more security.</p>
<h2 id="the-security-of-encrypted-search">The Security of Encrypted Search</h2>
<p>One of the most interesting aspects of encrypted search from a research point
of view has to do with security definitions; that is, what does it mean for an
encrypted search solution to be secure? This is not an obvious question and I
talked about this a bit in the previous post on
<a href="http://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-4-oblivious-rams/">ORAM</a>.</p>
<p>The first paper to explicitly address this question was an important paper by
Eu-Jin Goh [<a href="https://eprint.iacr.org/2003/216.pdf">Goh03</a>] <sup class="footnote-ref" id="fnref:2"><a class="footnote" href="#fn:2">2</a></sup> who was a
graduate student at Stanford at the time. This paper had many contributions but
one of the most important ones was simply to point out that SSE schemes were
not normal encryption schemes and, therefore, the standard notion of
CPA-security was not meaningful/relevant for SSE. The problem is essentially
that when an adversary interacts with an SSE scheme he has access to more than
an encryption oracle; he also has access to a search oracle. Goh's point was
that this had to be captured in the security definition otherwise it was
meaningless.</p>
<p>To address this, he proposed the first security definition for SSE. Roughly
speaking, the definition guaranteed that given an EDB and the encrypted
documents, the adversary would learn nothing about the underlying documents
beyond the search results <em>even if it had access to a search oracle</em>. Let
me highlight a few things about Goh's definition: (1) it was a game-based
definition; and (2) it did not provide query privacy (i.e., no privacy
guarantees for user queries). <sup class="footnote-ref" id="fnref:3"><a class="footnote" href="#fn:3">3</a></sup> A follow up paper by Chang and Mitzenmacher
[<a href="https://www.eecs.harvard.edu/~michaelm/postscripts/acns2005.pdf">CM05</a>]
proposed a new definition that was simulation-based and that guaranteed query
privacy in addition to data privacy.</p>
<p>I won't go into details, but simulation-based definitions have some
advantages over game-based definitions and, generally speaking, are preferable
and can be easier to work with---especially when composing various primitives to
build larger protocols.</p>
<p>So we're done right? Not exactly.</p>
<p>During this time, Reza Curtmola, Juan Garay, Rafail Ostrovsky and myself were
also thinking about SSE and one of the things we noticed while thinking
about the security of SSE schemes was that the previous security definitions
didn't seem to really capture what was going on. There were primarily two
issues: (1) the definitions were (implicitly) restricting the adversary's
power; and (2) they didn't explicitly capture the fact that the constructions
were leaking information.</p>
<p><strong>Adaptivity.</strong>
The first problem was that in these definitions, the adversary was never given
the search tokens, the EDB or the results of its searches. The implication of
this was that---in the definition---the adversary could not choose its search
oracle queries as a function of the EDB, the tokens or previous search
results. In other words, its behavior was implicitly restricted to
making <em>non-adaptive</em> queries to its search oracle. This was clearly an
issue because in the real world the adversary we are trying to protect against
is a server that stores the EDB, receives tokens from the client and
sees the results of the search. So if we allow this adversary to
query a search oracle, then we also have to allow him to query the oracle as a
function of the EDB, the tokens and previous search results.<br>
More concretely, this captures a form of attack where the server crafts some
clever oracle queries based on the EDB, the tokens or previous search results.</p>
<p>Now let's take a step back. At this point---unless you are a
cryptographer---you are likely thinking something to the effect of: "this
sounds contrived and honestly I can't see how one could craft queries of this
form that would lead to an actual attack of this form. This is all academic!".
I know this because, unfortunately, I've heard this many times over the years.</p>
<p>But this is roughly the reaction people have every time cryptographers point
out that an adversarial model needs to be strengthened. Usually, what happens
is the following: (1) non-cryptographers ignore this and build their systems
using primitives that satisfy the weaker model because they don't believe the
stronger attacks are realistic; (2) someone comes along and carries out some
form of the stronger attack; and (3) the systems need to be re-designed and
patched. This has happened in the cases of encryption (CPA- vs. CCA2-security)
and key exchange.</p>
<p>In any case, having observed this, we wrote about it in the following paper
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] and proposed a new and
stronger definition where the adversary was allowed to generate its queries as
a function of the EDB, the tokens and previous search results. We called this
<em>adaptive</em> security and gave two formulations of this definition: one
game-based and one simulation-based. This turned out to be quite interesting
from a theoretical point of view because the simulation-based formulations were
slightly stronger than the game-based formulations; which is not the case for
the standard notion of CPA-security <sup class="footnote-ref" id="fnref:4"><a class="footnote" href="#fn:4">4</a></sup>.</p>
<p>Now, to be honest, I do not know of an explicit attack on a concrete SSE
construction that takes advantage of adaptivity. But that shouldn't matter anymore
because we now know how to construct adaptively-secure SSE schemes that are as
efficient as non-adaptively-secure ones. So there is no excuse for not using
an adaptively-secure scheme. Another important reason to consider adaptive
security is for situations where SSE schemes are used as building blocks in
larger protocols. In these kinds of situations, the primitive can be used in
unorthodox ways which open up subtle new oracles that one may not have
considered when designing the primitive for its more standard uses.</p>
<p>This exact issue comes up in a paper I wrote recently
[<a href="http://research.microsoft.com/en-us/um/people/senyk/pubs/metacrypt.pdf">K14</a>]
that combines structured encryption (which is a form of SSE) with secure
multi-party computation to design a private alternative to the NSA metadata
program. In this case, it turns out that the adversary for the larger protocol
(i.e., the NSA analyst) can easily influence the inputs to the underlying SSE
scheme and implicitly carry out adaptive attacks on it. So in this case, it is
crucial that whatever structured encryption scheme is used be adaptively-secure.</p>
<p><strong>Leakage.</strong> Another important issue that was overlooked in previous work was leakage. As I've
discussed in previous posts, non-ORAM solutions leak some information.
Everyone was basically aware that SSE revealed the search results
(i.e., the identifiers of the documents that contained the keyword). This was
the whole point of SSE and most people believed that this was why it was more
efficient than ORAM. <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">5</a></sup> But this was not treated appropriately. In addition,
we also pointed out in [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] that all
the known SSE constructions leaked more than the search results. In particular,
they also revealed whether a search query was being repeated. This was very
easy to see by just looking at the constructions: the search tokens were
usually the output of a PRF applied to the keyword being searched for.</p>
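<p>Concretely, such a deterministic token can be sketched with HMAC playing the role of the PRF (a toy sketch of ours, not any particular scheme's token algorithm); since the token is deterministic, the server immediately sees when a query repeats:</p>

```python
import hashlib
import hmac

def trapdoor(key: bytes, keyword: str) -> bytes:
    # Deterministic PRF-based search token: PRF(K, w) instantiated with HMAC.
    return hmac.new(key, keyword.encode(), hashlib.sha256).digest()

k = b"client-secret-key"
# Repeating a query yields an identical token, so the server learns the
# search pattern even though the keyword itself stays hidden.
assert trapdoor(k, "budget") == trapdoor(k, "budget")
assert trapdoor(k, "budget") != trapdoor(k, "payroll")
```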
<p>The main problem was that the definitions did not capture any of this <sup class="footnote-ref" id="fnref:6"><a class="footnote" href="#fn:6">6</a></sup>. To address
it we decided to treat leakage in SSE more formally and to capture it very
explicitly in our security definitions. Our thinking was that leakage was an
integral part of SSE (since it seemed to be one of the reasons why SSE was so
efficient) and that it deserved to be properly studied and understood. At this
stage we only really considered two types of leakage: the access pattern and
the search pattern. The access pattern is basically the search results (the
identifiers of the documents that contain the keyword) and the search pattern is
whether a search query is repeated. At the time these were the only leakages
that had appeared in the literature. In a later paper with Melissa Chase
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>], we generalized the
definitional approach of [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] so
that the definition could include <em>any</em> kind of leakage.</p>
<p>Leakage is of course undesirable from a security point of view, but it is
fascinating from a research point of view. I hope to discuss this further in later
posts. For the purposes of this discussion, I'll just point out that there are
(mostly) two kinds of leakages: setup leakage, which is revealed just by the
EDB; and query leakage, which is revealed by a combination of the EDB and a
token. One of the main issues with any solution based on deterministic
encryption or, more generally, on property-preserving encryption is that they
have a high degree of setup leakage: their EDBs have non-trivial leakage. In
that sense, SSE-based solutions are better because their setup leakage is
usually minimal/trivial and the non-trivial leakage is only query leakage which
is controlled by the client since queries can only be executed with knowledge of
the secret key.</p>
<p><strong>Summing up.</strong>
So in the end, what we tried to argue in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] was that what we should be
asking for from an SSE security definition is a guarantee that:</p>
<blockquote>
<p><em>the adversary cannot learn anything about the data and the queries beyond
the explicitly allowed leakage; even if the adversary can make adaptive
queries to a search oracle.</em></p>
</blockquote>
<p>But once we settled on this definition and formalized it, the following natural
problems came up: (1) how do we distinguish between reasonable and
unreasonable leakage?; and (2) is it even possible to design SSE schemes that
are adaptively-secure? <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">7</a></sup></p>
<p>Initially, the answers to these questions weren't obvious to us. We thought
about them for a while and eventually answered the second question by finding an
SSE construction that was adaptively-secure. Unfortunately, while the scheme had
optimal asymptotic search complexity, it was not really practical. But at least
we knew adaptive security was achievable---though we did not know whether it was
achievable efficiently.</p>
<p>We didn't really have any answer for the first question, however. In fact, we still
don't. We don't really have a good way to understand and analyze the leakage of
SSE schemes. For now, the best we can do is to try and describe it precisely.</p>
<h2 id="searchable-symmetric-encryption">Searchable Symmetric Encryption</h2>
<p>There are many variants of SSE (see this paper
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] for a discussion)
including interactive schemes, where the search operation is interactive (i.e.,
a two-party protocol); and response-hiding schemes, where search results are
not revealed to the server but only to the client. I'll focus on
non-interactive and response-revealing schemes here because they were the first
kind of SSE schemes considered and also because they are very useful as
building blocks for more complex constructions and protocols. It also happens
that they are the most difficult to construct.</p>
<p>In our formulation we will
ignore the document collection itself and just assume that the individual
documents are encrypted using some symmetric encryption scheme and that the
documents each have a unique identifier that is independent of their content (so
that knowing the identifier reveals nothing about a file's contents).</p>
<p>We assume that the client processes the data collection <span class="math">\(\textbf{D} = (D_1,
\dots, D_n)\)</span> and sets up a "database" <span class="math">\({\sf DB}\)</span> that maps every keyword <span class="math">\(w\)</span> in the
collection to the identifiers of the documents that contain it. Recall that in
our context, we use the term database loosely to refer to a data structure
optimized for keyword search (i.e., a search structure). For a keyword
<span class="math">\(w\)</span>, we'll write <span class="math">\({\sf DB}[w]\)</span> to refer to the list of identifiers of documents that
contain <span class="math">\(w\)</span>.</p>
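<p>Building <span class="math">\({\sf DB}\)</span> from a document collection is just a standard inverted index; a minimal toy sketch of ours:</p>

```python
def build_db(docs):
    """Map every keyword to the identifiers of the documents containing it.

    docs: dict mapping document identifier -> document text."""
    db = {}
    for doc_id, text in sorted(docs.items()):
        for w in set(text.lower().split()):
            db.setdefault(w, []).append(doc_id)
    return db
```

<p>For example, `build_db({1: "private search", 2: "private sort"})` maps `"private"` to `[1, 2]`.</p>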
<p>A non-interactive and response-revealing SSE scheme <span class="math">\(({\sf Setup}, {\sf Token}, {\sf Search})\)</span> consists of</p>
<ul>
<li><p>a <span class="math">\({\sf Setup}\)</span> algorithm run by the client that takes as input a security
parameter <span class="math">\(1^k\)</span> and a database <span class="math">\({\sf DB}\)</span>; it returns a secret key <span class="math">\(K\)</span> and an
encrypted database <span class="math">\({\sf EDB}\)</span>;</p></li>
<li><p>a <span class="math">\({\sf Token}\)</span> algorithm also run by the client that takes as input a secret key
<span class="math">\(K\)</span> and a keyword <span class="math">\(w\)</span>; it returns a token <span class="math">\({\sf tk}\)</span>;</p></li>
<li><p>a <span class="math">\({\sf Search}\)</span> algorithm run by the server that takes as input an encrypted
database <span class="math">\({\sf EDB}\)</span> and a token <span class="math">\({\sf tk}\)</span>; it returns a set of identifiers <span class="math">\({\sf DB}[w]\)</span>.</p></li>
</ul>
<p>In addition to security, of course, the most important thing we want from an SSE
solution is low search complexity.<br>
Fast, for our purposes, will mean <em>sub-linear</em> in the number
of documents and, ideally, linear in the number of documents that contain the
search term. Note that the latter is optimal since, at a minimum, the server
needs to fetch the relevant documents just to return them.</p>
<p>Requiring sub-linear search complexity is <em>crucial</em> for practical purposes.
Unless you are working with a very small dataset, linear search is just not
realistic---try to imagine if your desktop search application or email search
function did sequential search over your hard drive or email collection
<em>every time you searched</em>. Or if your favorite search engine sequentially
scanned the entire Web every time you performed a web search <sup class="footnote-ref" id="fnref:7"><a class="footnote" href="#fn:7">8</a></sup>.</p>
<p>The sub-linear requirement has consequences, however. In particular, it means
that we must be willing to work in an offline/online setting where we run a
one-time (linear) pre-processing phase to set up a search structure so that we
can then execute search queries on the structure in sub-linear time.
And this is exactly the approach we'll take.</p>
<h2 id="the-inverted-index-solution">The Inverted Index Solution</h2>
<p>The particular solution I describe here is referred to as the <em>inverted
index solution</em> and was proposed in the same
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] paper in which we studied the
security of encrypted search. This is a good construction to understand for
several reasons: (1) it is the basis of almost all subsequent SSE
constructions; and (2) many of the tricks and techniques that are used in
recent SSE schemes (and the more general setting of structured encryption)
originated in this construction.</p>
<p><strong>Setup.</strong>
The scheme makes use of a symmetric encryption scheme <span class="math">\(({\sf Gen}, {\sf Enc}, {\sf Dec})\)</span>, of a
pseudo-random function (PRF) <span class="math">\(F: \{0,1\}^k \times W \rightarrow \{0,1\}^k\)</span> and
of a pseudo-random permutation (PRP) <span class="math">\(P: \{0,1\}^k \times W \rightarrow \{1,
\dots, |W|\}\)</span>. To set up the EDB, the client first samples two <span class="math">\(k\)</span>-bit keys
<span class="math">\(K_{\sf T}\)</span> and <span class="math">\(K_{\sf R}\)</span> for <span class="math">\(P\)</span> and <span class="math">\(F\)</span>, respectively. It then creates two arrays
<span class="math">\({\sf T}\)</span> and <span class="math">\({\sf RAM}_1\)</span>. For all keywords <span class="math">\(w \in W\)</span>, the client builds a list for
<span class="math">\({\sf DB}[w]\)</span> and stores the nodes in <span class="math">\({\sf RAM}_1\)</span>. More precisely, for every keyword <span class="math">\(w
\in W\)</span> and every <span class="math">\(1 \leq i \leq |{\sf DB}[w]|\)</span>, it stores</p>
<p><span class="math">\[
{\sf N}_{w,i} = \bigg\langle {\sf id}_{w,i}, {\sf ptr}_1(w, i+1) \bigg\rangle
\]</span></p>
<p>in <span class="math">\({\sf RAM}_1\)</span>, where <span class="math">\({\sf id}_{w,i}\)</span> is the <span class="math">\(i\)</span>th identifier in <span class="math">\({\sf DB}[w]\)</span> and
<span class="math">\({\sf ptr}_1(w, i+1)\)</span> is the address (in <span class="math">\({\sf RAM}_1\)</span>) of the <span class="math">\((i+1)\)</span>th identifier in
<span class="math">\({\sf DB}[w]\)</span>. Of course, <span class="math">\({\sf ptr}_1(w, |{\sf DB}[w]| + 1) = \bot\)</span>.</p>
<p>It then randomly permutes the locations of the nodes; that is, it creates
a new array <span class="math">\({\sf RAM}_2\)</span> that stores all the nodes of <span class="math">\({\sf RAM}_1\)</span> but at locations
chosen uniformly at random and with appropriately updated pointers.</p>
<p>After this shuffling step, the client encrypts each node in <span class="math">\({\sf RAM}_2\)</span>; that is,
it creates a new array <span class="math">\({\sf RAM}_3\)</span> such that for all <span class="math">\(w \in W\)</span> and all <span class="math">\(1 \leq i
\leq |{\sf DB}[w]|\)</span>,</p>
<p><span class="math">\[
{\sf RAM}_3\big[{\sf addr}_2({\sf N}_{w,i})\big] =
{\sf Enc}_{K_w}\bigg({\sf RAM}_2\big[{\sf addr}_2({\sf N}_{w,i})\big]\bigg)
\]</span></p>
<p>where <span class="math">\(K_w = F_{K_{\sf R}}(w)\)</span> and <span class="math">\({\sf addr}_2\)</span> is just a function that maps nodes to
their location in <span class="math">\({\sf RAM}_2\)</span> (this just makes notation easier).</p>
<p>Now, for all keywords <span class="math">\(w \in W\)</span>, the client sets<br>
<span class="math">\(
{\sf T}\big[P_{K_{\sf T}}(w) \big] = {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big),
\)</span></p>
<p>where <span class="math">\({\sf addr}_3\)</span> is a function that maps nodes to their locations in <span class="math">\({\sf RAM}_3\)</span>.
Finally, the client sets <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span>.</p>
<p>Now the version I just described is simpler than the one presented in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>]. There are two main
differences. The first has to do with the domain of the pseudo-random
permutation <span class="math">\(P\)</span>. In practice, PRPs have a fixed domain size. For example, if
we view AES as a PRP then it is a PRP that maps 128-bit strings to 128-bit
strings. But in our case we need a PRP that maps keywords in <span class="math">\(W\)</span> to the numbers
<span class="math">\(1\)</span> through <span class="math">\(|W|\)</span>. The problem here is that in practice the size of <span class="math">\(W\)</span> will be
<em>much</em> smaller than <span class="math">\(2^{128}\)</span>. So the question becomes: how can we use a
PRP built for a large domain to build a PRP for a small domain? There are ways
of doing this, but at the time the known solutions had several important
limitations, so we solved the problem using the following approach.</p>
<p>Suppose we used a large-domain PRP. The problem would be that the table <span class="math">\({\sf T}\)</span>
would be large as well, i.e., it would have to hold <span class="math">\(2^{128}\)</span> elements if we
were using a PRP over <span class="math">\(128\)</span>-bit strings (e.g., AES). Obviously this is too large
to be practical. So the idea was to "shrink" <span class="math">\({\sf T}\)</span> by using something called a
Fredman-Komlós-Szemerédi (FKS) table. I won't go into the details, but the
point is that by using FKS tables, we could use a large-domain PRP and
still have a compact table <span class="math">\({\sf T}\)</span>.</p>
<p>The other difference has to do with the symmetric encryption scheme <span class="math">\(({\sf Gen},
{\sf Enc}, {\sf Dec})\)</span> that we use. In the version described here, it is
important for security that the encryption scheme be <em>anonymous</em> which
means that, given two ciphertexts, one cannot tell whether they
were encrypted under the same key or not. Why is this important? Because each
list of nodes <span class="math">\(\{{\sf N}_{w, i}\}_{i \leq |{\sf DB}[w]|}\)</span> is encrypted under the same
key <span class="math">\(K_w\)</span>. And if, given <span class="math">\({\sf RAM}_3\)</span>, the adversary can tell which ciphertexts are
encrypted under the same key, then it can learn the frequency <span class="math">\(|{\sf DB}[w]|\)</span> of each
keyword. Note that this would be revealed by the <span class="math">\({\sf EDB}\)</span> alone, without
the client ever having made any queries.</p>
<p>The problem with anonymity is that it is not implied
by the standard notion of CPA-security. In practice, most block
ciphers (including AES) seem likely to yield anonymous encryption, but this is
not guaranteed. In [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] we
didn't assume that the underlying symmetric encryption scheme was anonymous so
we had to use a different approach. At a high level, what we did was to encrypt
each node under a different key and store that key in its predecessor in the
list. The fact that every node is encrypted under a different key solves our
problem.</p>
<p><strong>Token and search.</strong>
If the client wants to search for keyword <span class="math">\(w\)</span>, he simply generates a token</p>
<p><span class="math">\[
{\sf tk} = ({\sf tk}_1, {\sf tk}_2) = (P_{K_{\sf T}}(w), F_{K_{\sf R}}(w)),
\]</span></p>
<p>which he sends to the server. To query <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span>, the server first
recovers the ciphertext <span class="math">\(c = {\sf T}[{\sf tk}_1]\)</span> which it decrypts to recover address
<span class="math">\(a_1 = {\sf Dec}_{{\sf tk}_2}(c)\)</span>. Then, for all <span class="math">\(i\)</span> until <span class="math">\(a_i = \bot\)</span>, it decrypts the
nodes <span class="math">\(({\sf N}_{w, 1}, \dots, {\sf N}_{w, |{\sf DB}[w]|})\)</span> by computing</p>
<p><span class="math">\[
({\sf id}_i, a_{i+1}) \leftarrow {\sf Dec}_{{\sf tk}_2}\big({\sf RAM}_3[a_i]\big).
\]</span></p>
<p>It then finds and returns the encrypted documents with identifiers <span class="math">\(({\sf id}_1,
\dots, {\sf id}_{|{\sf DB}[w]|})\)</span>.</p>
<p><strong>Efficiency and security.</strong>
To search, the server needs to do one lookup in <span class="math">\({\sf T}\)</span>, which is <span class="math">\(O(1)\)</span> and then
one decryption for each node <span class="math">\(({\sf N}_{w, 1}, \dots, {\sf N}_{w, |{\sf DB}[w]|})\)</span>,
which is <span class="math">\(O(|{\sf DB}[w]|)\)</span>. So the search complexity of this approach is
<span class="math">\(O(|{\sf DB}[w]|)\)</span>, which is optimal since it would take at least that much time just
for the server to send back the relevant documents.</p>
<p>The construction is clearly efficient (asymptotically speaking, as efficient as
possible) but is it secure? Yes and no. The solution (at least
the more complex version) is proved secure in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>], but it is only shown to be
<em>non-adaptively secure</em>, with trivial setup leakage and query leakage that
includes the access pattern (the search results) and the search pattern
(whether a query is repeated).</p>
<p>Intuitively, given <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span> the adversary learns at most the
number of keywords (by the size of <span class="math">\({\sf T}\)</span>) and <span class="math">\(\sum_{w \in W} |{\sf DB}[w]|\)</span> (by the
size of <span class="math">\({\sf RAM}_3\)</span>). So that is the setup leakage. Notice that unlike solutions
based on deterministic encryption, the <span class="math">\({\sf EDB}\)</span> by itself does not leak any
non-trivial information like the frequency of a keyword. At query time, the
server obviously learns the search results <span class="math">\({\sf DB}[w]\)</span> but it also learns whether
the client is repeating a keyword search since in that case the tokens <span class="math">\({\sf tk} =
(P_{K_{\sf T}}(w), F_{K_{\sf R}}(w))\)</span> will be the same.</p>
<p><strong>Improvements.</strong>
The inverted index solution has been improved in several works. Its main
limitations were that: (1) it was only non-adaptively secure; (2) the use of
FKS dictionaries made the solution hard to understand and implement; and (3)
it was a static scheme, in the sense that one could not modify the <span class="math">\({\sf EDB}\)</span> to add
or remove keywords and/or document identifiers <sup class="footnote-ref" id="fnref:8"><a class="footnote" href="#fn:8">8</a></sup>.</p>
<p>The first problem was addressed in a joint paper with my MSR colleague Melissa
Chase [<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. One of the observations
in that work was that the inverted index solution could be made
adaptively-secure by replacing the symmetric encryption scheme by a
non-committing encryption scheme. Non-committing encryption schemes are usually
either very expensive or require very strong assumptions (i.e., random
oracles). Fortunately, in our setting we only need a <em>symmetric</em>
non-committing encryption scheme and such a scheme can be instantiated very
efficiently. In fact, it turns out that the simplest possible symmetric
encryption scheme is non-committing! In retrospect this is a very simple
observation, but it's been a very useful one since it allows us to design
adaptively-secure schemes very efficiently (and under standard assumptions). In
fact, this has been used in most subsequent SSE constructions.</p>
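<p>As a tiny illustration of why the simplest scheme is non-committing, consider the one-time pad: a simulator can hand out a random ciphertext first and only later exhibit a key that opens it to any message of its choice. (The scheme used in practice derives the pad from a PRF or random oracle, but the equivocation idea is the same; this sketch is my own toy, not the construction from [CK10].)</p>

```python
import os

def otp(key: bytes, msg: bytes) -> bytes:
    """One-time pad; encryption and decryption are the same XOR."""
    return bytes(a ^ b for a, b in zip(key, msg))

def explain(ct: bytes, msg: bytes) -> bytes:
    """Simulator's equivocation: the key that opens ct to msg."""
    return bytes(a ^ b for a, b in zip(ct, msg))

ct = os.urandom(8)            # committed before any message is fixed
msg = b"doc-0042"
key = explain(ct, msg)        # chosen after the fact
assert otp(key, msg) == ct    # ct "decrypts" to msg under the explained key
assert otp(key, ct) == msg
```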
<p>The second issue was also addressed in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. Obviously one could just
replace the PRP with a small-domain PRP but the approach taken in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] was different. The idea is to
replace the array <span class="math">\({\sf T}\)</span> with a dictionary <span class="math">\({\sf DX}\)</span>. A dictionary is a data
structure that stores label/value pairs and that supports lookup operations
that map labels to their values. Dictionaries can be instantiated as hash
tables, binary search trees etc. So instead of populating <span class="math">\({\sf T}\)</span> with</p>
<p><span class="math">\[
{\sf T}\big[P_{K_{\sf T}}(w) \big] = {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big)
\]</span></p>
<p>for all <span class="math">\(w \in W\)</span>, we instead use a PRF <span class="math">\(G\)</span> and store the pair</p>
<p><span class="math">\[
\bigg(G_{K_{\sf T}}(w), {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big)\bigg)
\]</span></p>
<p>in <span class="math">\({\sf DX}\)</span> for all <span class="math">\(w \in W\)</span>. With this approach we remove the need for a PRP
altogether and, in turn, the need for either small-domain PRPs or FKS dictionaries.</p>
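<p>In code, the dictionary-based index is as simple as it sounds: a hash table keyed by PRF outputs. A hedged sketch, with HMAC-SHA256 standing in for <span class="math">\(G\)</span> and placeholder byte strings standing in for the encrypted head addresses:</p>

```python
import hashlib
import hmac
import os

def G(key: bytes, w: str) -> bytes:
    """PRF G, instantiated here with HMAC-SHA256."""
    return hmac.new(key, w.encode(), hashlib.sha256).digest()

K_T = os.urandom(32)
# Placeholder values standing in for Enc_{K_w}(addr_3(N_{w,1})).
DX = {G(K_T, w): head for w, head in
      [("crypto", b"enc-addr-0"), ("sse", b"enc-addr-1")]}

# Lookup with token G_{K_T}(w): no PRP and no FKS table needed.
assert DX[G(K_T, "crypto")] == b"enc-addr-0"
```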
<p>The third issue was addressed in a joint paper with Charalampos (Babis)
Papamanthou who was an MSR intern at the time and Tom Roeder who was an MSR
colleague at the time. In this paper
[<a href="http://eprint.iacr.org/2012/530.pdf">KPR12</a>], we show how to make
the inverted index solution dynamic while maintaining its efficiency. The
solution is complex so I won't discuss it here.</p>
<p>In another paper with Babis
[<a href="https://research.microsoft.com/en-us/um/people/senyk/pubs/psse.pdf">KP13</a>]
we propose a much simpler dynamic solution. Our approach here is tree-based and
not based on the inverted index solution at all. Its search complexity,
however, is not optimal but sub-linear; in particular, logarithmic in the number
of documents. It has other good properties, though, like parallelizable search
and good I/O complexity.</p>
<p>In a more recent paper
[<a href="http://www.internetsociety.org/sites/default/files/07_4_1.pdf">CJJJKRS14</a>],
Cash, Jaeger, Jarecki, Jutla, Krawczyk, Rosu and Steiner describe a dynamic
solution that is very simple, has optimal and parallelizable search and has
good I/O complexity.</p>
<p>In another recent paper
[<a href="http://web.engr.illinois.edu/~naveed2/pub/Oakland2014BlindStorage.pdf">NPG14</a>]
Naveed, Prabhakaran and Gunter propose a very interesting dynamic solution
based on the notion of blind storage. In a way, their notion of blind storage
can be viewed as an abstraction of the <span class="math">\({\sf RAM}_3\)</span> structure in the inverted index
solution. What
[<a href="http://web.engr.illinois.edu/~naveed2/pub/Oakland2014BlindStorage.pdf">NPG14</a>]
shows, however, is that there is an alternative---and much better---way of
achieving the properties needed from <span class="math">\({\sf RAM}_3\)</span> than how it is done in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>]. I won't say much else
because this really gets into the weeds of SSE techniques but I recommend the
paper if you're interested in this area.</p>
<p>Finally, the last paper I'll mention is a work by Cash, Jarecki, Jutla,
Krawczyk, Rosu and Steiner [<a href="http://eprint.iacr.org/2013/169">CJJKRS13</a>] that
shows how to extend the inverted index solution to handle <em>boolean</em>
queries while keeping its optimal search complexity. Prior to this work we knew
how to handle conjunctive search queries (i.e., <span class="math">\(w_1 \wedge w_2\)</span>) in linear
time. This paper showed not only how to do it in optimal time but also showed
how to handle disjunctive queries (i.e., <span class="math">\(w_1 \vee w_2\)</span>) and combinations of
conjunctions and disjunctions!</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">I discuss how to use ORAM for encrypted search towards the end of the previous post of this series.
<a class="footnote-return" href="#fnref:1">↩</a></li>
<li id="fn:2">Amazingly, this paper was never accepted for publication, which tells you something about the current state of our publication process.
<a class="footnote-return" href="#fnref:2">↩</a></li>
<li id="fn:3">This wasn't an omission on Goh's part; he defined it this way on purpose. His reasoning was that SSE schemes could have a variety of applications where token privacy was not needed. This made sense but it still left open the question of how one should define security with token privacy.<br>
<a class="footnote-return" href="#fnref:3">↩</a></li>
<li id="fn:4">A similar situation was later observed by Boneh, Sahai and Waters, and by O'Neill, in the setting of functional encryption.
<a class="footnote-return" href="#fnref:4">↩</a></li>
<li id="fn:5">Technically, this is <em>not</em> true! The reason SSE schemes tend to be more efficient than ORAM is not because they reveal the search results (access pattern) but because they reveal whether searches were repeated (search pattern).<br>
<a class="footnote-return" href="#fnref:5">↩</a></li>
<li id="fn:6">At this point you might be wondering how the proofs went through. In the definition of [Goh03], the tokens did not appear at all since he was not considering query privacy. In the case of [CM05], the adversary in the proof is restricted to never repeating queries.
<a class="footnote-return" href="#fnref:6">↩</a></li>
<li id="fn:7">A criticism I often hear from colleagues and reviewers is that SSE constructions are not really <em>searching</em> over data. The underlying issue is that no computation is being performed. In my opinion, this reflects a very uninformed understanding of the real world. Given the amounts of data we currently produce and have to search over, search has become analogous to <em>sub-linear-time search</em> and therefore to some form of indexed-based search. In other words, the kind of scale we now have to deal with has fundamentally changed what we mean by the term search.<br>
<a class="footnote-return" href="#fnref:7">↩</a></li>
<li id="fn:8">Actually, in [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] we describe a way to make our constructions (and any other) dynamic. There are limitations to this approach, however, including the tokens growing in length with the number of updates and interaction. So when we ask for a dynamic SSE scheme we typically want the update process not to affect the token size and, preferably, the update mechanism to be non-interactive---though the latter doesn't matter much from a practical point of view.<br>
<a class="footnote-return" href="#fnref:8">↩</a></li>
</ol>
</div>
Are Compliance and Privacy Always at Odds?
http://senykam.github.io/2013/07/23/are-compliance-and-privacy-always-at-odds
Tue, 23 Jul 2013 22:23:59 -0300http://senykam.github.io/2013/07/23/are-compliance-and-privacy-always-at-odds<p><img src="http://senykam.github.io/img/obey.jpg" class="alignright" width="250">
Chris Soghoian
<a href="https://twitter.com/csoghoian/status/358613839094362112">points</a> to an
interesting
<a href="http://online.wsj.com/article/SB10001424127887324448104578615881436052760.html">article</a>
in the Wall Street Journal. It describes mounting pressure on the NSA to
re-design its phone-data program---the program under which it compels
telecommunications companies (telcos) like Verizon to turn over their phone
record data.</p>
<p>In the article, Timothy Edgar, a former privacy lawyer who served in the Bush
and Obama administrations is quoted as saying:</p>
<blockquote>
<p>Privacy technology under development would allow for anonymous searches of
databases, keeping data out of government hands but also preventing phone
companies from learning the purpose of NSA searches. Overhauling the
surveillance program would provide a reason to speed up the technology's
deployment.</p>
</blockquote>
<p>So this motivates the following interesting technical question:
<em>how would one design such a privacy-preserving phone-data program exactly?</em></p>
<p>The first thing we need is that the telcos keep their data, as opposed to
sending it all to the NSA. The issue with such an approach, of course, is that
the NSA would have to disclose its queries to the telco in order to retrieve
any information---which for obvious reasons is not going to happen.</p>
<p>So what we need is a mechanism with which the telcos can keep their data and
the NSA can access it without disclosing its queries. This might sound
impossible, but it turns out we've known how to do this (in theory at least)
for over <em>15</em> years!</p>
<h2 id="private-information-retrieval">Private Information Retrieval</h2>
<p>One answer to this problem could be to use something called <a href="http://en.wikipedia.org/wiki/Private_information_retrieval">private
information
retrieval</a> (PIR).
With PIR, a client can retrieve information from a server <em>without the server
learning anything about which item is being retrieved</em>. Standard PIR protocols
only allow the client to retrieve information by memory location but there are
more sophisticated variants that also support retrieval based on
<a href="http://eprint.iacr.org/1998/003">keywords</a>.</p>
<p>PIR was first introduced in 1995 in a
<a href="http://people.csail.mit.edu/madhu/papers/1995/pir-journ.pdf">paper</a> by Chor,
Kushilevitz, Goldreich and Sudan. Initially, PIR only worked if the data could
be stored on two (or more) servers that could not collude. In a breakthrough
paper, Kushilevitz and Ostrovsky showed in 1997 that PIR could be achieved even
with a single server. Since then, there has been a lot of work and many
advances on PIR and, recently, Ian Goldberg from the University of Waterloo and
his students have been trying to make PIR practical (improving both efficiency
and functionality). If you are interested in this topic (especially in the
practical aspects) I highly recommend the thesis of
<a href="http://uwspace.uwaterloo.ca/bitstream/10012/6142/1/Olumofin_Femi.pdf">Olumofin</a>.</p>
<p>So a simple idea to solve our problem is to have the telco keep its data and to
have the NSA query it through a PIR protocol. While this might seem like a good
solution, there are two important problems.</p>
<p>The first is that while PIR will protect the query of the NSA (i.e., the telco will not learn anything about the query), it will not necessarily protect the telco's dataset from the NSA; that is, the NSA could learn information about individuals who are not included in its query.</p>
<p>The second problem is that the telco has no way of knowing if the NSA's query is legitimate. What if the NSA keeps submitting queries indiscriminately and eventually just learns the entire database? How does the telco know whether a particular query is even legal?</p>
<p>Fortunately, both problems can be addressed!</p>
<h2 id="oblivious-transfer">Oblivious Transfer</h2>
<p>To handle the first problem, we need a stronger form of PIR called <a href="http://en.wikipedia.org/wiki/Oblivious_transfer">oblivious
transfer</a> (OT). With an OT
protocol, a client can select an item from a server's dataset while maintaining
the following guarantees: (1) the server learns
nothing about the client's query; and (2) the
client learns nothing about the items it does not query. So unlike PIR, OT
protects both parties; which is why it is sometimes called symmetric PIR.</p>
<p>Like PIR, standard OT protocols only allow clients to retrieve items by their
location in memory so, in practice, we would prefer to use a keyword-based OT;
that is, an OT protocol where items can be labeled with keywords and where the
clients can retrieve them based on search terms. Fortunately, we already know
how to design such protocols. The first keyword OT is due to Ogata and Kurosawa
(see this <a href="http://seculab.cis.ibaraki.ac.jp/~kurosawa/2004/OKS.pdf">paper</a>) but
their scheme does not scale very well (each query would require the NSA to do
work that is linear in the size of the dataset). A more efficient approach is
due to Freedman, Ishai, Pinkas and Reingold and is described in this
<a href="https://www.cs.princeton.edu/~mfreed/docs/FIPR05-ks.pdf">paper</a>.</p>
<h2 id="keyword-ot">Keyword OT</h2>
<p>The high-level idea of Freedman et al.'s keyword OT is as follows. As before,
the server is the telco and the client is the NSA. Suppose the telco's dataset
consists of <span class="math">\(n\)</span> pairs <span class="math">\((w_1, d_1), \dots, (w_n, d_n)\)</span>, where <span class="math">\(w_i\)</span> is a keyword
and <span class="math">\(d_i\)</span> is some data associated to <span class="math">\(w_i\)</span>. In practice, the keywords could be
names and the data could be phone records, addresses, etc. The telco starts by encrypting
this dataset by replacing each pair <span class="math">\((w_i, d_i)\)</span> by a label/ciphertext pair
<span class="math">\((\ell_i, d_i \oplus p_i)\)</span>, where the label <span class="math">\(\ell_i\)</span> and the pad <span class="math">\(p_i\)</span> are
(pseudo-)random strings generated from <span class="math">\(w_i\)</span> using a pseudo-random function
with a secret key <span class="math">\(K\)</span>. More formally, we would write that for all <span class="math">\(i\)</span>,</p>
<p><span class="math">\[
F_K(w_i) = (\ell_i, p_i),
\]</span></p>
<p>where <span class="math">\(F\)</span> is the PRF. A PRF is sort of like a keyed
hash. <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup> The main property of PRFs is that if we evaluate them with a random
key <span class="math">\(K\)</span> on any input, they output a random looking
string.</p>
<p>Note that this new encrypted dataset reveals no information about the real
dataset since the <span class="math">\(\ell_i\)</span> values are pseudo-random
(and therefore effectively independent of the
<span class="math">\(w_i\)</span>'s) and because the ciphertexts <span class="math">\(d_i\oplus
p_i\)</span> are effectively one-time pad (OTP) encryptions of the
<span class="math">\(d_i\)</span>'s. <sup class="footnote-ref" id="fnref:2"><a class="footnote" href="#fn:2">2</a></sup> The telco now sends this encrypted
dataset to the NSA who stores it. Remember: it reveals no information
whatsoever about the real dataset so this is OK!</p>
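<p>This encryption step can be sketched in a few lines of Python. It is a toy under assumptions that are mine, not the protocol's: HMAC-SHA256 plays the PRF <span class="math">\(F\)</span>, its 32-byte output is split into a 16-byte label and a 16-byte pad, and records are null-padded to 16 bytes.</p>

```python
import hashlib
import hmac

def F(K: bytes, w: str):
    """PRF: one HMAC-SHA256 output split into a 16-byte label and pad."""
    d = hmac.new(K, w.encode(), hashlib.sha256).digest()
    return d[:16], d[16:]

def encrypt_dataset(K: bytes, records: dict) -> dict:
    """records maps a keyword (e.g. a name) to at most 16 bytes of data."""
    eds = {}
    for w, data in records.items():
        label, p = F(K, w)
        eds[label] = bytes(a ^ b for a, b in zip(p, data.ljust(16, b"\0")))
    return eds

def lookup(eds: dict, label: bytes, p: bytes) -> bytes:
    """What the NSA can do once it holds F_K(w) for its one keyword."""
    return bytes(a ^ b for a, b in zip(p, eds[label])).rstrip(b"\0")

K = b"\x11" * 32                         # the telco's secret PRF key
eds = encrypt_dataset(K, {"alice": b"555-0100", "bob": b"555-0199"})
# In the protocol, (label, p) comes out of the 2PC -- the NSA never sees K.
label, p = F(K, "alice")
assert lookup(eds, label, p) == b"555-0100"
```

<p>The point of the sketch is that the encrypted dataset by itself is a bag of random-looking pairs; without <span class="math">\(F_K(w)\)</span> for a specific keyword, no record can be located or decrypted.</p>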
<p>Now suppose the NSA needs to lookup information related to some keyword <span class="math">\(w\)</span> and
remember that the encrypted dataset it holds consists of labels <span class="math">\(\ell_i\)</span> and
ciphertexts <span class="math">\(d_i \oplus p_i\)</span>. To extract the information it needs from the
encrypted dataset, it therefore needs to figure out: (1) the label for keyword
<span class="math">\(w\)</span> (so it can lookup the appropriate OTP ciphertext); and (2) the pad <span class="math">\(p_i\)</span>
used in the associated ciphertext.</p>
<p>Of course the NSA cannot do this on its own because it does not know the
telco's secret key <span class="math">\(K\)</span> for the PRF used to generate these items. But we have a
problem. If the NSA sends its keyword <span class="math">\(w\)</span> to the telco so that the latter
computes and returns <span class="math">\(F_K(w)\)</span>, the telco will learn the keyword. And if the
telco sends its key <span class="math">\(K\)</span> to the NSA so that it computes <span class="math">\(F_K(w)\)</span> on its own, the
NSA will be able to decrypt the entire dataset.</p>
<p>The solution here is to use another amazing cryptographic technology called
<a href="http://en.wikipedia.org/wiki/Secure_multi-party_computation#Two-party_computation">secure two-party
computation</a>
(2PC). I won't try to explain how 2PC works but if you are interested a good
place to start is the <a href="http://mpclounge.au.dk/">MPC Lounge</a>. The important
thing to know about 2PC is that we can use it to solve our problem. In other
words, the telco and the NSA can execute a 2PC protocol that will result in the
NSA learning <span class="math">\(F_K(w)\)</span> and therefore the label and the pad for <span class="math">\(w\)</span>, without it
learning anything about the telco's key and without the telco learning anything
about <span class="math">\(w\)</span> <sup class="footnote-ref" id="fnref:3"><a class="footnote" href="#fn:3">3</a></sup>.</p>
<h2 id="authorized-queries">Authorized Queries</h2>
<p>Now on to the second problem: how does the telco know if the NSA's query is
legitimate? To address this we first need to incorporate an extra party into
our model that has the power to decide if an NSA query is legitimate or not. In
practice, this would be the <a href="http://en.wikipedia.org/wiki/United_States_Foreign_Intelligence_Surveillance_Court">FISA
court</a>
<sup class="footnote-ref" id="fnref:4"><a class="footnote" href="#fn:4">4</a></sup> and we'll assume this court can digitally sign, i.e., it has a secret
signing key and a public verification key that is known to the telco.</p>
<p>Now suppose the NSA wants to retrieve information about a user Alice from the
telco. It first sends its query to the court. If the court approves the query,
it signs it and returns the signature to the NSA. At this point, we only need
to make a small change to the protocol described above. Instead of executing a
2PC that evaluates the PRF so as to generate a label and pad for the NSA's
query; the parties will execute a 2PC that first verifies the court's signature
and then (if the signature checks out) evaluates the PRF (i.e., generates the
label and pad for the keyword). The properties of the 2PC will hide the
signature and the keyword from the telco, and the secret key
<span class="math">\(K\)</span> from the NSA. <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">5</a></sup></p>
<h2 id="is-this-really-possible">Is this really possible?</h2>
<p>The design described above is possible in theory. But of course the interesting
question is whether something like this could be used in practice.</p>
<p>I don't really know how large telco datasets are but I would guess on the order
of hundreds of millions of users. Encrypting such a dataset and sending it to
the NSA would be expensive but definitely possible as the encryption process
here would consist of relatively cheap operations like PRF evaluations and
XORs. The query stage, however, would be very inefficient due to the execution
of the 2PC protocol. But if we look at things carefully, the bottlenecks would
likely be (1) the verification of the signature (due to the complexity of
signature verification); and (2) the generation of the pads (since they have to
be as long as the data they will be XORed with).</p>
<p>Fortunately there are a few things we can do to mitigate these problems.
Instead of using a signature scheme, we could use a message authentication code
(MAC). This would require the court to share a secret key with the telco but
this doesn't seem like such a severe requirement. MACs are much simpler
computationally than signatures so the 2PC verification would be much faster
<sup class="footnote-ref" id="fnref:6"><a class="footnote" href="#fn:6">6</a></sup>.</p>
<p>With respect to the length of the pads, we could use the PRF to generate a
short string instead (say 128 bits long) and use
that as a seed to a pseudo-random generator to generate a larger pad. This
would change how the telco and NSA encrypt and decrypt items of the dataset but
it is a minor change that would not affect the efficiency of encryption and
decryption much.</p>
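<p>The seed-expansion trick might look as follows, with SHAKE-128 standing in for the pseudo-random generator (an arbitrary choice for this sketch):</p>

```python
import hashlib

def expand(seed: bytes, n: int) -> bytes:
    """Stretch a 128-bit seed into an n-byte pad (SHAKE-128 as the PRG)."""
    return hashlib.shake_128(seed).digest(n)

seed = bytes.fromhex("000102030405060708090a0b0c0d0e0f")  # short PRF output
record = b"name=alice;addr=12 main st"                    # any-length data
ct = bytes(a ^ b for a, b in zip(expand(seed, len(record)), record))
pt = bytes(a ^ b for a, b in zip(expand(seed, len(ct)), ct))
assert pt == record
```

<p>Only the 16-byte seed has to come out of the 2PC; the long pad is derived locally, which is what keeps the circuit small.</p>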
<p>With these changes, the 2PC would only have to compute two PRF evaluations and
one equality check which is definitely within practical reach.</p>
<p><strong>Update:</strong> For a high-level description of the protocol I designed in this
post see
<a href="http://boingboing.net/2014/03/01/trustycon-how-to-redesign-nsa.html">this</a>
great talk by Ed Felten.</p>
<p><em>Thanks to Matt Green and Payman Mohassel for comments on a draft of this post
and to Chris Soghoian for motivating me to think about this problem.</em></p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">PRFs are like keyed hash functions only in idealized models like the random oracle model.
<a class="footnote-return" href="#fnref:1">↩</a></li>
<li id="fn:2">Technically, since the labels and pads are pseudo-random (as opposed to random), <span class="math">\(\ell_i\)</span> is not independent of <span class="math">\(w_i\)</span> and <span class="math">\(d_i \oplus p_i\)</span> is not a one-time pad. More precisely, <span class="math">\(\ell_i\)</span> and <span class="math">\(d_i \oplus p_i\)</span> reveal no partial information about <span class="math">\(w_i\)</span> and <span class="math">\(d_i\)</span> to a computationally-bounded adversary.
<a class="footnote-return" href="#fnref:2">↩</a></li>
<li id="fn:3">Protocols that evaluate PRFs in this manner are usually called oblivious PRF (OPRF) protocols. The 2PC-based OPRF protocol is the simplest to understand conceptually but we know of more efficient OPRF protocols not based on 2PC (e.g., the Freedman et al. paper describes one such construction).
<a class="footnote-return" href="#fnref:3">↩</a></li>
<li id="fn:4">There is debate as to whether the FISA court exercises proper oversight over the NSA or not (for example see <a href="http://www.nytimes.com/2013/07/26/us/politics/robertss-picks-reshaping-secret-surveillance-court.html?_r=0">this article</a> from the New York Times), but for the purpose of this exercise we'll just assume that it does.
<a class="footnote-return" href="#fnref:4">↩</a></li>
<li id="fn:5">The reason we also need to hide the signature from the telco is that signatures can leak information about their message.
<a class="footnote-return" href="#fnref:5">↩</a></li>
<li id="fn:6">Here we also assume the data is hashed with a collision-resistant hash function before being MACed.
<a class="footnote-return" href="#fnref:6">↩</a></li>
</ol>
</div>