Outsourced Bits
http://senykam.github.io/
Graph Encryption: Going Beyond Encrypted Keyword Search
http://senykam.github.io/2016/06/16/graph-encryption-going-beyond-encrypted-keyword-search
Thu, 16 Jun 2016 12:32:12 -0300
<p><em>This is a guest post by <a href="http://www.xianruimeng.org/">Xianrui Meng</a> from
Boston University about a paper he presented at CCS 2015, written in
collaboration with <a href="https://www.cs.bgu.ac.il/~kobbi/">Kobbi Nissim</a>, <a href="http://www.cs.bu.edu/~gkollios/">George
Kollios</a> and myself. Note that Xianrui is on
the job market.</em></p>
<p><img src="http://senykam.github.io/img/graph21.jpg" class="alignright" width="250">
Encrypted search has attracted a lot of attention from practitioners and
researchers in academia and industry. In previous posts, Seny already described
different ways one can search on encrypted data. Here, I would like to discuss
search on encrypted <em>graph</em> databases which are gaining a lot of
popularity.</p>
<h2 id="graph-databases-and-graph-privacy">Graph Databases and Graph Privacy</h2>
<p>As today's data is getting bigger and bigger, traditional
relational database management systems (RDBMS) cannot scale to the massive
amounts of data generated by end users and organizations. In addition, RDBMSs
cannot effectively capture certain data relationships, such as those found in
the object-oriented data structures used by many applications. Today,
<a href="http://nosql-database.org/">NoSQL</a> (Not Only SQL) has emerged as a good
alternative to RDBMSs. One of the many advantages of NoSQL systems is that
they are capable of storing, processing, and managing large volumes of
structured, semi-structured, and even unstructured data. NoSQL databases (e.g.,
document stores, wide-column stores, key-value (tuple) stores, object
databases, and graph databases) can provide the scale and availability needed
in cloud environments.</p>
<p>In an Internet-connected world, graph databases have become an increasingly
significant data model among NoSQL technologies. Social networks (e.g.,
Facebook, Twitter, Snapchat), protein networks, electrical grids, the Web, XML
documents, and networked systems can all be modeled as graphs. One nice thing
about graph databases is that they store the relations between entities
(objects) in addition to the entities themselves and their properties. This
allows the search engine to navigate both the data and their relationships
extremely efficiently. Graph databases rely on the node-link-node relationship,
where a node can be a profile or an object and the edge can be any relation
defined by the application. Usually, we are interested in the structural
characteristics of such graph databases.</p>
<p>What do we mean by the confidentiality of a graph? And how do we protect it?
The problem has been studied by both the security and database communities. For
example, in the database and data mining community, many solutions have been
proposed based on <em>graph anonymization</em>. The core idea here is to
anonymize the nodes and edges in the graph so that re-identification is hard.
Although this approach may be efficient, from a security point of view it is
hard to tell exactly what is achieved. Moreover, researchers have shown how to
attack this kind of approach by leveraging auxiliary information. On the other
hand, cryptographers have some really compelling and provably-secure tools such
as ORAM and FHE (mentioned in Seny's previous posts) that can protect all the
information in a graph database. The problem, however, is their performance,
which is crucial for databases. In today's world, efficiency is more than
running in polynomial time; we need solutions that run and scale to massive
volumes of data. Many real world graph datasets, such as biological networks
and social networks, have millions of nodes, some even have billions of nodes
and edges. Therefore, besides security, scalability is one of the main aspects we
have to consider.</p>
<h2 id="graph-encryption">Graph Encryption</h2>
<p>Previous work in encrypted search has focused on how to
search encrypted documents, e.g., doing keyword search, conjunctive queries,
etc. Graph encryption, on the other hand, focuses on performing graph queries
on encrypted graphs rather than keyword search on encrypted documents. In some
cases, this makes the problem harder since some graph queries can be extremely
complex. Another technical challenge is that not only the privacy of nodes and
edges needs to be protected but also the <em>structure</em> of the graph, which
can lead to many interesting research directions.</p>
<p>Graph encryption was introduced by Melissa Chase and Seny in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. That paper shows how
to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency
and focused subgraphs) can be performed (though the paper is more general as it
describes <em>structured encryption</em>). Seny and I, together with Kobbi Nissim
and George Kollios, followed this up with a paper last year
[<a href="http://eprint.iacr.org/2015/266.pdf">MKNK15</a>] that showed how to
handle more complex graph queries.</p>
<h2 id="queries-on-encrypted-graph-databases">Queries on Encrypted Graph Databases</h2>
<h3 id="neighbor-queries-and-adjacency-queries">Neighbor Queries and Adjacency Queries</h3>
<p>As I mentioned earlier,
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] studied some simple
graph queries, such as adjacency queries and neighbor queries. An adjacency
query takes two nodes as input and returns whether they share an edge. A
neighbor query takes a node as input and returns all the nodes that
share an edge with it.</p>
<p>The construction for neighbor queries is mainly based on searchable
symmetric encryption (SSE), where the input graph is viewed as a particular
kind of document collection. Another novel technique proposed in the paper
is the use of an efficient symmetric non-committing encryption scheme to
achieve adaptive security. The paper also proposes a nice solution for
focused subgraph queries, which are an essential part of the seminal HITS
ranking algorithm of Kleinberg but are also useful in their own right.</p>
<h3 id="approximate-shortest-distance-queries">Approximate Shortest Distance Queries</h3>
<p>Shortest distance queries are arguably one of the most fundamental and
well-studied graph queries due to their numerous applications. A shortest
distance query takes as input two nodes and returns the length (in edges) of
the shortest path between them. In social networks, these queries allow you
to find the smallest number of friends (or collaborators, peers, etc) between
two people. So a graph encryption scheme that supports shortest distance
queries would potentially have many applications in graph database security,
and could be a major building block for other graph encryption schemes. In the
following, I briefly give an overview on our solution for <em>approximate</em>
shortest distance queries.</p>
<p>As I mentioned, to design a secure yet scalable graph encryption
scheme, we have to take into account many things, including the storage space on
the server side, the bandwidth for the query, the computational overhead for
both client and server, etc. Suppose we are given a graph <span class="math">\(G= (V, E)\)</span> and let
<span class="math">\(n= |V|\)</span>, <span class="math">\(m = |E|\)</span>. If we were to use a traditional shortest-distance algorithm such as
Dijkstra's, the query time would be <span class="math">\(O(n\log n+m)\)</span>, which can be very
slow for large graphs. The benefit of course would be that we would not need
extra storage. Another approach is to build an encrypted adjacency matrix (see
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]) that somehow supports
shortest distance queries. The problem there is that we would need to pay at
least <span class="math">\(O(n^2)\)</span> storage, which is obviously expensive when <span class="math">\(n\)</span> is, say, <span class="math">\(1\)</span>
million.</p>
<p>Fortunately, thanks to brilliant algorithmic computer scientists, there exists
a really nice and neat data structure called a <em>distance oracle</em> (DO)
[<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.333&rep=rep1&type=pdf">TZ05</a>,
<a href="http://research.microsoft.com/pubs/115785/wsdm2010.pdf">SGNP10</a>,
<a href="http://research-srv.microsoft.com/pubs/201773/cosn-similarity.pdf">CDFGGW13</a>].
Using such a structure, one incurs much less storage overhead (typically <span class="math">\(O(n
\log n)\)</span>) and gets fast query performance (typically <span class="math">\(O(\log n)\)</span>). However, most
distance oracles return an <em>approximate</em> distance rather than the exact
one. But one can tweak the parameters in order to get the best trade-off
between performance and approximation. When I first looked at these data
structures, I felt that this was a really amazing tool; not only because of its
functionality but also due to its simplicity.</p>
<p>There are many ways of generating distance oracles. Some of them offer
better approximation while others can have better performance. Here I just
describe one kind: <em>sketch-based</em> distance oracles. In such an
oracle, every node <span class="math">\(v\)</span> has a sketch, <span class="math">\(Sk_v\)</span> (normally generated by some
randomized algorithm). <span class="math">\(Sk_v\)</span> is a set containing many node pairs <span class="math">\(\langle
w_i,d(v, w_i)\rangle\)</span>, where <span class="math">\(w_i\)</span> is some node id and <span class="math">\(d(v, w_i)\)</span> is the
distance between <span class="math">\(v\)</span> and <span class="math">\(w_i\)</span>. For example, the following sketch <span class="math">\(Sk_v\)</span> consists of
three pairs</p>
<p><span class="math">\[ Sk_v = \{\langle w_1, d(v, w_1)\rangle, \langle w_2, d(v, w_2)\rangle,
\langle w_3, d(v, w_3)\rangle \}.
\]</span></p>
<p>Querying the shortest distance
between <span class="math">\(u\)</span> and <span class="math">\(v\)</span> is quite simple. We only need to retrieve <span class="math">\(Sk_u\)</span> and
<span class="math">\(Sk_v\)</span>, find the common nodes in both sketches, and add up their
corresponding distances. We then return the minimum sum as the
shortest distance. Formally, let <span class="math">\(I\)</span> be the common nodes that appear in both
<span class="math">\(Sk_u\)</span> and <span class="math">\(Sk_v\)</span>. Then, the approximate shortest distance between <span class="math">\(u\)</span> and <span class="math">\(v\)</span>,
<span class="math">\(d(u,v)\)</span>, is</p>
<p><span class="math">\[d(u, v) = \min_{s \in I}\{ d(u, s) + d(v, s)\}\]</span></p>
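<p>The query procedure above can be sketched in a few lines of Python (a minimal illustration; the seed names and distances are made up):</p>

```python
# Minimal illustration of a sketch-based distance oracle query: each
# sketch maps seed nodes w_i to the distance d(v, w_i).
def approx_distance(sk_u: dict, sk_v: dict):
    """Minimum of d(u, s) + d(v, s) over seeds s common to both sketches."""
    common = sk_u.keys() & sk_v.keys()
    return min(sk_u[s] + sk_v[s] for s in common) if common else None

sk_u = {"w1": 2, "w2": 5, "w3": 1}
sk_v = {"w1": 4, "w3": 3}
print(approx_distance(sk_u, sk_v))  # 4, via seed w3 (1 + 3)
```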
<p>The design of this distance oracle guarantees that the returned distance is no
greater than <span class="math">\(\alpha\times \mathsf{dist}(u,v)\)</span>, where <span class="math">\(\mathsf{dist}(u, v)\)</span> is
the true shortest distance between <span class="math">\(u\)</span> and <span class="math">\(v\)</span> and <span class="math">\(\alpha\)</span> is the
approximation ratio. Note that the approximation ratio <span class="math">\(\alpha\)</span> is a function
of some parameters of the sketch, so one controls the approximation by
tweaking the sketch, which in turn affects both setup and query efficiency. In
our solution, we leverage sketch-based distance oracles but we have to be
very careful not to affect their approximation ratio.</p>
<p>A distance oracle encryption scheme
<span class="math">\(\mathsf{Graph} = (\mathsf{Setup}, \mathsf{DistQuery})\)</span> consists of a polynomial-time algorithm
and a polynomial-time two-party protocol that work as follows:</p>
<ul>
<li><p><span class="math">\((K, \mathsf{EO}) \leftarrow \mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span>: is a
probabilistic algorithm that takes as input a security parameter <span class="math">\(k\)</span>, an oracle
<span class="math">\(\Omega\)</span>, an approximation factor <span class="math">\(\alpha\)</span>, and an error parameter <span class="math">\(\varepsilon\)</span>.
It outputs a secret key <span class="math">\(K\)</span> and an encrypted oracle <span class="math">\(\mathsf{EO}\)</span>.</p></li>
<li><p><span class="math">\((d, \bot) \leftarrow \mathsf{DistQuery}_{C,S}\big((K, q), \mathsf{EO}\big)\)</span>: is a
two-party protocol between a client <span class="math">\(C\)</span> that holds a key <span class="math">\(K\)</span> and a shortest
distance query <span class="math">\(q = (u, v) \in V^2\)</span> and a server <span class="math">\(S\)</span> that holds an encrypted
oracle <span class="math">\(\mathsf{EO}\)</span>. After executing the protocol, the client <span class="math">\(C\)</span> receives a distance <span class="math">\(d
\geq 0\)</span> and server <span class="math">\(S\)</span> receives <span class="math">\(\bot\)</span>.</p></li>
</ul>
<p>For <span class="math">\(\alpha \geq 1\)</span> and <span class="math">\(\varepsilon \lt 1\)</span>, we say that <span class="math">\(\mathsf{Graph}\)</span> is
<span class="math">\((\alpha, \varepsilon)\)</span>-correct if for all <span class="math">\(k \in \mathbb{N}\)</span>, for all <span class="math">\(\Omega\)</span>
and for all <span class="math">\(q = (u, v) \in V^2\)</span>,</p>
<p><span class="math">\[
\mbox{Pr}\big[d \leq \alpha\cdot {\sf dist}(u, v)\big] \geq 1 - \varepsilon,
\]</span></p>
<p>where the probability is over the randomness in computing <span class="math">\((K, \mathsf{EO}) \leftarrow
\mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span> and then <span class="math">\((d, \bot) \leftarrow
\mathsf{DistQuery}\big((K, q), \mathsf{EO}\big)\)</span>. I skip the adaptive security definition
as it is similar to adaptive security for SSE and is captured by the general
notion of security for structured encryption given in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. Next, I will go over
two solutions for the oracle encryption.</p>
<p><strong>A computationally-efficient solution.</strong>
This approach is rather straightforward, so here I briefly sketch its description. The <span class="math">\(\mathsf{Setup}\)</span> algorithm works as follows:</p>
<ol>
<li>For each node <span class="math">\(v \in V\)</span>, generate a token by applying a PRF to <span class="math">\(v\)</span>: <span class="math">\(\mathsf{tk}_v = F_K(v)\)</span>.</li>
<li>Pad the sketches to the same length and encrypt each sketch <span class="math">\(Sk_v\)</span> as <span class="math">\({\sf Enc}_K(Sk_v)\)</span> using a symmetric encryption scheme.</li>
<li>For each node <span class="math">\(v \in V\)</span>, store the pair <span class="math">\((\mathsf{tk}_v, {\sf Enc}_K(Sk_v))\)</span> in a <a href="https://en.wikipedia.org/wiki/Associative_array">dictionary data structure</a> <span class="math">\(\mathsf{DX}\)</span> (the insertions should be done in random order).</li>
</ol>
<p>The <span class="math">\(\mathsf{DistQuery}\)</span> algorithm is quite simple: given nodes <span class="math">\(u\)</span> and <span class="math">\(v\)</span>, the
client just computes <span class="math">\(F_K(u)\)</span> and <span class="math">\(F_K(v)\)</span> and sends them to the server as the
token. After receiving the token, the server just retrieves <span class="math">\(\mathsf{DX}[F_K(u)]\)</span> and
<span class="math">\(\mathsf{DX}[F_K(v)]\)</span> and sends back the encrypted sketches <span class="math">\({\sf Enc}_K(Sk_u)\)</span> and
<span class="math">\({\sf Enc}_K(Sk_v)\)</span>. Finally, the client decrypts the sketches, and computes the
approximate shortest distance as is normally done in sketch-based distance
oracles. This approach is efficient and simple since we use symmetric
encryption. We show in the paper that this scheme is adaptively secure and that
the leakage of this scheme consists of the size of the graph, the maximum size of the
distance oracle, and the query pattern (see the paper for a precise definition).</p>
<p><strong>Communication-efficient solution.</strong>
The problem with the scheme described above is that the communication
complexity is linear in the maximum sketch size. As I mentioned above, this
can be a bottleneck in practice when the graphs are large. Now, at very high
level, I briefly discuss how we can achieve a solution with optimal <span class="math">\(O(1)\)</span>
communication complexity. The scheme makes use of a PRF, a degree-<span class="math">\(2\)</span> somewhat
homomorphic encryption scheme <span class="math">\(\mathsf{SHE} = ({\sf Gen}, {\sf Enc}, {\sf Dec})\)</span>, and a hash function <span class="math">\(h:
V\to [t]\)</span>.</p>
<ul>
<li><p><span class="math">\(\mathsf{Setup}(1^k, \Omega, \alpha, \varepsilon)\)</span>: Given <span class="math">\(1^k\)</span>, <span class="math">\(\Omega\)</span>,
<span class="math">\(\alpha\)</span>, and <span class="math">\(\varepsilon\)</span> as inputs, it generates a public/secret-key pair
<span class="math">\(({\sf pk}, {\sf sk})\)</span> for <span class="math">\(\mathsf{SHE}\)</span>. Let <span class="math">\(D\)</span> be the maximum distance over
all the sketches and <span class="math">\(S\)</span> be the maximum sketch size. <span class="math">\(\mathsf{Setup}\)</span> sets <span class="math">\(N
\leftarrow 2\cdot D +1\)</span> and samples a hash function <span class="math">\(h \leftarrow \mathcal{H}\)</span>
with domain <span class="math">\(V\)</span> and co-domain <span class="math">\([t]\)</span>, where <span class="math">\(t = 2\cdot
S^2\cdot\varepsilon^{-1}\)</span>. It then creates a hash table for each node <span class="math">\(v \in
V\)</span>. More precisely, for each node <span class="math">\(v\)</span>, it processes each pair <span class="math">\((w_i, \delta_i) \in
Sk_v\)</span> and stores <span class="math">\({\sf Enc}_{pk}(2^{N - \delta_i})\)</span> at location <span class="math">\(h(w_i)\)</span> of a
<span class="math">\(t\)</span>-size array <span class="math">\(\mathsf{T}_v\)</span>. In other words, for all <span class="math">\(v \in V\)</span>, it creates an
array <span class="math">\(\mathsf{T}_v\)</span> such that for all <span class="math">\((w_i, \delta_i) \in Sk_v\)</span>,
<span class="math">\(\mathsf{T}_v[h(w_i)] \leftarrow {\sf Enc}_{pk}(2^{N - \delta_i})\)</span>. It then fills
the empty cells of <span class="math">\(\mathsf{T}_v\)</span> with homomorphic encryptions of <span class="math">\(0\)</span> and
stores each hash table <span class="math">\(\mathsf{T}_{v_1}\)</span> through <span class="math">\(\mathsf{T}_{v_n}\)</span> in a
dictionary <span class="math">\(\mathsf{DX}\)</span> by setting, for all <span class="math">\(v \in V\)</span>, <span class="math">\(\mathsf{DX}[F_K(v)]
\leftarrow \mathsf{T}_v\)</span>. Finally, it outputs <span class="math">\(\mathsf{DX}\)</span> as the encrypted
oracle <span class="math">\(\mathsf{EO}\)</span>.</p></li>
<li><p>The <span class="math">\(\mathsf{DistQuery}\)</span> protocol works as follows. Given a query <span class="math">\(q = (u,
v)\)</span>, the client sends tokens <span class="math">\((\mathsf{tk}_1, \mathsf{tk}_2) = (F_K(u),
F_K(v))\)</span> to the server which uses them to retrieve the hash tables of nodes
<span class="math">\(u\)</span> and <span class="math">\(v\)</span> by computing <span class="math">\(\mathsf{T}_u := \mathsf{DX}[\mathsf{tk}_1]\)</span> and
<span class="math">\(\mathsf{T}_v := \mathsf{DX}[\mathsf{tk}_2]\)</span>. The server then homomorphically
evaluates an inner product over the hash tables. More precisely, it computes <span class="math">\(c
:= \sum_{i=1}^t \mathsf{T}_u[i]\cdot\mathsf{T}_v[i]\)</span>, where <span class="math">\(\sum\)</span> and <span class="math">\(\cdot\)</span>
refer to the homomorphic addition and multiplication operations of the SHE
scheme. Finally, the server returns only <span class="math">\(c\)</span> to the client who decrypts it and
outputs <span class="math">\(2N - \log_2 \left({\sf Dec}_{\sf sk}(c)\right)\)</span>.</p></li>
</ul>
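<p>To see why this decoding works, here is a plaintext walk-through of the <span class="math">\(2^{N-\delta}\)</span> encoding with the SHE layer stripped away, so the arithmetic is easy to check. For simplicity, bucketing by <span class="math">\(h\)</span> is replaced by exact seed matching (a collision-free hash), and ties in the minimum sum are ignored; the example values are made up:</p>

```python
# Plaintext walk-through of the 2^{N - delta} encoding used by the
# communication-efficient scheme, with the SHE layer removed so the
# arithmetic is easy to verify.
def approx_dist_from_inner_product(sk_u: dict, sk_v: dict, D: int) -> int:
    N = 2 * D + 1
    eu = {w: 2 ** (N - d) for w, d in sk_u.items()}
    ev = {w: 2 ** (N - d) for w, d in sk_v.items()}
    # Server-side step (done homomorphically in the real scheme):
    # an inner product over the two encoded tables.
    total = sum(eu[w] * ev[w] for w in eu.keys() & ev.keys())
    # Client-side step: the largest term 2^{2N - (d_u + d_v)} dominates,
    # so floor(log2(total)) recovers 2N - min_s (d(u,s) + d(v,s)).
    return 2 * N - (total.bit_length() - 1)

sk_u = {"w1": 2, "w3": 1}
sk_v = {"w1": 4, "w3": 3}
print(approx_dist_from_inner_product(sk_u, sk_v, D=4))  # 4  (via w3: 1 + 3)
```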
<p>See the paper for more details and an analysis of the construction. What is
important to note is that we can show that the scheme does not affect the
quality of underlying oracle's approximation too much and, in fact, in certain
cases it improves it!</p>
<p>It is also worth mentioning that, in the paper, we also propose a third
scheme that has <span class="math">\(O(1)\)</span> communication complexity but with some additional
leakage which we call the sketch pattern leakage. This third scheme is far more
efficient than the one above. One interesting subtlety is that, unlike more
standard encryption schemes, where the leakage is over a structure that
holds all the original data (e.g., an inverted index with full indexing), the
leakage in this case is only over a data structure that holds a random subset
of the data.</p>
<p>Finally, we implemented all our constructions and verified their efficiency
experimentally.</p>
<h2 id="conclusions-and-future-work">Conclusions and Future Work</h2>
<p>I went over our graph encryption schemes with support for approximate shortest distance
queries. The solutions I described are all adaptively-secure. Of course, there
are other possible approaches based on ORAM or FHE which can provide stronger
security (even hide access pattern!) but at a higher cost. As graph databases become more and more
popular, I believe graph encryption will play an increasingly important role in
database security. We live in a data-centric world that generates network and
graph data of all kinds. There are still more challenging and exciting open
problems in graph database security: e.g., how to construct graph encryption
schemes for more complex graph queries? Can we support graph mining tasks,
e.g., can we construct graph encryption schemes that allow us to detect
communities over encrypted social networks? And of course, as is common in
encrypted search, how can we quantify the security of our graph encryption
schemes? Any brilliant ideas? Talk to us! :-)</p>
Attacking Encrypted Database Systems
http://senykam.github.io/2015/09/07/attacking-encrypted-database-systems
Mon, 07 Sep 2015 22:11:02 -0300
<p><img src="http://senykam.github.io/img/edb.png" class="alignright" width="250">
<a href="http://cryptoonline.com/">Muhammad Naveed</a>, <a href="http://web.cecs.pdx.edu/~cvwright/">Charles
Wright</a> and I recently posted a paper that
describes inference attacks on encrypted database (EDB) systems like
<a href="http://css.csail.mit.edu/cryptdb/">CryptDB</a>,
<a href="http://research.microsoft.com/en-us/projects/cipherbase/">Cipherbase</a>,
<a href="https://github.com/google/encrypted-bigquery-client">Google's Encrypted BigQuery demo</a> and <a href="https://msdn.microsoft.com/en-us/library/mt163865.aspx?f=255&MSPPError=-2147217396">Microsoft SQL Server 2016 Always Encrypted</a>.
These systems are based on property-preserving encryption (PPE) schemes which
are a class of encryption schemes that leak certain properties of their
plaintexts. Examples include <a href="https://en.wikipedia.org/wiki/Deterministic_encryption">deterministic
encryption</a> (DTE) and
order-preserving encryption (OPE).</p>
<p>The paper is
<a href="http://research.microsoft.com/en-us/um/people/senyk/pubs/edb.pdf">here</a> and
will be presented in October at the <a href="http://www.sigsac.org/ccs/CCS2015/index.html">ACM Conference on Computer and
Communication Security</a>.</p>
<p>This was an interesting project for several reasons. For one thing, it was fun to do
cryptanalysis! Even though we make use of old and basic ideas like frequency
analysis and even more obvious ideas like plain sorting, we did get the chance
to work on some non-trivial attacks based on combinatorial optimization. We
hope to post more about this in the future and we have upcoming work that
explores new interesting technical questions that came out of this.</p>
<p>We also got to work with real medical data and we believe our results provide
a fair, accurate and realistic security evaluation of these PPE-based EDB
systems. In particular, our work provides a concrete and real-world analysis of the
security one would get if one were to use them in an electronic medical records
(EMR) setting, which is an important motivating scenario for these systems
(see, e.g.,
[<a href="http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf">PRZB11</a>]).</p>
<h2 id="rebuttals">Rebuttals</h2>
<p>The results of the paper were recently covered by the media (e.g.,
<a href="http://www.forbes.com/sites/thomasbrewster/2015/09/03/microsoft-dumb-attacks-cracks-next-gen-cryptography/">Forbes</a>
and <a href="http://arstechnica.com/security/2015/09/ms-researchers-claim-to-crack-encrypted-database-with-old-simple-trick/">Ars
Technica</a>
and on Microsoft Research's own
<a href="http://blogs.technet.com/b/inside_microsoft_research/archive/2015/09/03/database-security-arms-race-researchers-make-advances.aspx">blog</a>).
Some of the designers of these systems found our work important, even saying
that it is</p>
<blockquote>
<p>valuable because it gives database customers a better understanding of the
security precautions they need to consider, especially if they are charged with
handling very sensitive data such as electronic medical records.</p>
</blockquote>
<p>Others, however, dispute our results. The three arguments we have seen so far
(in the Forbes and Ars Technica pieces) are that: (1) we did not break the
systems; (2) we used them in a way they were not intended for; and (3)
that users are warned on how to use them correctly.</p>
<p>The high-level point we
would like to make here is that <em>we absolutely did use them in the way
they were intended</em> and if you read this entire post you will see why. You
will also see why the user warnings mentioned in the articles (which we could not
find in the referenced paper) do not prevent the attacks.</p>
<h2 id="how-we-used-the-systems">How We Used the Systems</h2>
<p>First, we should be clear that we never claimed to have broken anything. But
in our opinion, this is not a meaningful question anyway due to the way these
PPE-based EDB systems and their underlying cryptography phrase their security
claims. Roughly speaking, these systems are usually claimed to be secure under
what I will refer to as <em>functionality assumptions</em>. <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup></p>
<p>The problem here is that these assumptions essentially cripple the
functionality of these systems in a way that make them a lot less useful and
interesting than what most people believe or expect. In fact, in some
cases these assumptions can almost obviate the systems' entire purpose.</p>
<p>For example, PPE-based EDB systems are typically claimed to be secure if a
database administrator labels all "sensitive" fields (for some undefined
notion of sensitivity) so that they are encrypted with standard encryption
schemes. But of course, this also means that these fields then cannot be
queried at all---ever. So this leaves us with an EDB system that only works
over <em>non-sensitive</em> data. If it's non-sensitive, one could ask how much
value we are getting from encrypting it at all. Is
this really the point of an encrypted DB system? To do SQL over encrypted
<em>non-sensitive</em> data? Is this really consistent with how these systems are
motivated and understood?</p>
<p>The point is that to claim that these systems are secure, you effectively have
to cripple them to the point where you are just using regular encryption---at least for
the kinds of data you actually care to protect.</p>
<p>What's going on is that there is an underlying tradeoff in these systems
between security and utility and the <em>technical</em> (i.e., the fine print)
security claims these systems make essentially cripple their utility. Yet,
the crippled version of these systems is not what will be used in practice and
it is not how these systems are understood by most people (engineers and
researchers alike). Indeed, the main claim these systems make is that one can
run real-world applications on top of them. In particular, applications like
electronic medical records (EMR).</p>
<p>So in our work, we were not interested in "breaking" these systems because,
under the right <em>functionality assumptions</em> one can always claim they are
secure. That would be a meaningless exercise.</p>
<p>What we set out to do was to answer a different question that is actually relevant to
practice: "what security do we get when we run an EMR application on top of
these systems?" In our opinion, this is not only a fair question; it is the
most basic question one should ask about these systems. This is a concrete and
real-world question that cannot be dismissed using assumptions and caveats.</p>
<p>So how do we answer such a question? Our approach was to analyze
attributes that are relevant to EMR systems.</p>
<p>To support the EMR system, the EDB needs to support queries on these attributes
which means that they need to be encrypted with the appropriate PPE
scheme (i.e., their onion needs to be peeled to either order-preserving
encryption or deterministic encryption depending on the type of query). And
when this happens, the question is: "what information is leaked?"</p>
<p>That is what our paper explores. This is not only a fair question; it is a
question that has been asked about other encrypted search solutions.</p>
<p>Of course, in choosing attributes to analyze we were limited by our data. That
is, there are many attributes used in EMR systems but we didn't have data for
all of them. We felt it was important to use real data here and not synthetic
data so we preferred to limit the attributes we analyzed than to compromise the
integrity of the experiments. In the end, the set of attributes that are
relevant to EMRs and that we had data for were: sex, race, age, admission
month, whether the patient died, primary payer, length of stay, mortality risk,
disease severity, major diagnostic category, admission type and admission
source.</p>
<p>We believe it is fair to say that any reasonable EMR system would need to query
these attributes. We even confirmed that the
<a href="http://www.open-emr.org/">OpenEMR</a> system (which is used as
motivation in
[<a href="http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf">PRZB11</a>])
queries sex, race, age, admission month, whether the patient died, and primary payer. So the
claim that we are using these systems in a way they were not intended for is
completely unfounded.</p>
<h2 id="how-should-they-be-used">How Should They be Used?</h2>
<p>In the rebuttal from the Forbes article, we also see the following claim:</p>
<blockquote>
<p>OPE encryption should be used for "high-entropy values" where the order does
not reveal much and that CryptDB was still a worthy way to protect information.
"This is how the CryptDB paper says it should be used."</p>
</blockquote>
<p>First, there is no such statement in the CryptDB paper---at least we could not
find one. Second, even if this warning were given to users and DB
administrators, <em>the system would still be vulnerable to the attacks we
describe!</em> In fact, the attacks would still work even if the data had the
<em>maximum</em> possible entropy.</p>
<p>Let's see how. In Section 6.1 of our paper, we describe a really obvious
"attack" we refer to as the <em>sorting attack</em>. Here is how it works.
Suppose we are analyzing some attribute, say age, which can range over a given
number of values. In the case of age, this would be from 0 to 124. Now
suppose I give you an OPE-encrypted column of ages; that is, a set of OPE
encryptions of the ages in the DB. Now further suppose that every possible age
is represented in this column. In such a case, we say that the column is
<em>dense</em>. Now to recover the plaintext values from this OPE-encrypted
column, one simply needs to sort it---which can be done since OPE reveals
order. At this point, it simply follows that the smallest ciphertext will
correspond to <span class="math">\(0\)</span>, the second smallest to <span class="math">\(1\)</span>, and so on.</p>
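<p>To make this concrete, here is a small Python sketch of the sorting attack. The OPE scheme is simulated by a random strictly increasing mapping, which is the only property the attack relies on; the names and parameters are illustrative, not taken from our experiments.</p>

```python
import random

# Simulate an OPE scheme: any strictly increasing random mapping from
# plaintexts to ciphertexts preserves order, which is all the attack needs.
domain = list(range(125))                       # ages 0..124
cts = sorted(random.sample(range(10**9), len(domain)))
ope = dict(zip(domain, cts))                    # age -> ciphertext

# A "dense" encrypted column: every age appears at least once.
column = [ope[age] for age in domain for _ in range(random.randrange(1, 4))]
random.shuffle(column)

# Sorting attack: sort the distinct ciphertexts; the i-th smallest
# ciphertext must encrypt the i-th smallest plaintext.
recovered = {ct: age for age, ct in enumerate(sorted(set(column)))}
assert all(recovered[ope[age]] == age for age in domain)
```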
<p>Of course, the statement "OPE should be used for high-entropy values" is a
bit vague. In fact, it could be interpreted in two different ways but, as we
will see, in each case it leads to problems. Note that by entropy, in
cryptography we usually mean
<a href="https://en.wikipedia.org/wiki/Min_entropy">min-entropy</a>.</p>
<p><strong>Interpretation #1.</strong>
The values themselves are sampled from some arbitrary distribution (perhaps
even with low min-entropy) but the adversary knows nothing about them so we treat them
as high min-entropy. This is completely unrealistic. Designing and using
cryptography against adversaries with no auxiliary information is meaningless
since auxiliary information is almost always available. In particular, our paper implicitly makes this
point since we were able to leverage publicly-available auxiliary information
to mount our attacks.</p>
<p><strong>Interpretation #2.</strong>
Each value in the column is actually sampled from a high min-entropy distribution.
Let's assume this is the case and suppose the range of the attribute space is
<span class="math">\(n\)</span>. Furthermore, let's go as far as saying that the values are sampled
uniformly at random---which has <em>maximum</em> min-entropy.</p>
<p>By the
<a href="https://en.wikipedia.org/wiki/Coupon_collector%27s_problem">coupon collector</a>
problem, we can expect to see all values after <span class="math">\(\Theta(n \log n)\)</span>
samples. More concretely, this means that we can expect the OPE-encrypted
column to be dense after <span class="math">\(\Theta(n \log n)\)</span> patients are registered in the DB.
So for example, for the "age" attribute which has <span class="math">\(n = 125\)</span>, we should expect
the column to be dense---and therefore vulnerable to sorting---after 604
patients appear in the DB. To give you an idea of how small that is, the
largest hospital in our dataset had 121,664 patients and exactly 827 out of
1050 hospitals in our dataset had more than 604 patients (i.e., 79%).</p>
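<p>For the curious, here is a quick Python check of these numbers. The 604 figure above comes from the $n \ln n$ estimate; the exact coupon-collector expectation $n \cdot H_n$ is slightly larger.</p>

```python
import math

def coupon_collector_expected(n):
    """Exact expected number of uniform samples to see all n values: n * H_n."""
    return n * sum(1.0 / i for i in range(1, n + 1))

n = 125                                 # ages 0..124
estimate = n * math.log(n)              # the Theta(n log n) figure used above
exact = coupon_collector_expected(n)

print(round(estimate))   # ~604, the number quoted in the post
print(round(exact))      # ~676 with the exact harmonic-number formula
```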
<!-- % \paragraph{Interpretation $$\#3$$.}
% The histogram of the encrypted column (when viewed as a distribution) has high
% min-entropy. Let's assume, again, the uniform distribution which has maximum
% min-entropy. In this case, this will mean that every element in the attribute
% space appears the exact same number of times. But, as long as they appear at
% least once, the column will be dense, and therefore vulnerable to sorting.
-->
<p>This discussion highlights some of the pitfalls with the kind of functional
assumptions used in the security claims of these systems. They are often vague,
hard to interpret, hard to make actionable and, as above, sometimes do not even provide
security.</p>
<h2 id="are-edb-systems-doomed">Are EDB Systems Doomed?</h2>
<p>We personally do not think so. There are alternative ways of designing EDB
systems that are currently being investigated in various research labs. What
our work shows is that the EDB systems <em>we</em> looked at do not provide the
level of security required for EMR systems and, most likely, other databases
with similar demographic information. More generally, our work does suggest
that <em>PPE-based</em> EDB systems (i.e., based on deterministic and
order-preserving encryption) might not be the way to go but this does not mean
that reasonably secure and efficient EDB systems are impossible to design.</p>
<!--
% \section*{Isn't this Better than Nothing?}
% Since releasing our paper, another argument we have heard (though not in the
% rebuttals mentioned above) is that even if these systems aren't perfect they
% are better than nothing; that it's better to use them than not to.
%
% This is not a clearcut argument in our opinion. The reality is that the use of
% encryption comes with expectations. The standard role of encryption in secure
% system design is, roughly speaking, to reduce the attack surface of the system.
% In other words, instead of having to protect a dataset or DB, one only has to
% worry about protecting the key. When people hear *encrypted* they tend to
% think that their data is secure. As such, they become less careful with it
-->
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">Please note that I am not trying to define a formal concept here, I am only using the term <em>functional assumption</em> for ease of exposition. Also, these assumptions are very different than, e.g., computational assumptions as typically used in cryptography. Computational assumptions usually do not limit the utility of a primitive whereas the kind of assumptions we are talking about here do. So the argument that, e.g., high min-entropy assumptions are perfectly fine because we typically use assumptions anyways does not hold.<br>
<a class="footnote-return" href="#fnref:1">↩</a></li>
</ol>
</div>
Workshop on Encryption for Secure Search and Other Algorithms
http://senykam.github.io/2015/06/30/workshop-on-encryption-for-secure-search-and-other-algorithms
Tue, 30 Jun 2015 21:43:32 -0300http://senykam.github.io/2015/06/30/workshop-on-encryption-for-secure-search-and-other-algorithms<p><img src="http://senykam.github.io/img/bertinoro.gif" class="alignright" width="250">
I just got back from the Workshop on Encryption for Secure Search and other
Algorithms (ESSA) which was held in Bertinoro, Italy, and was organized by
Sasha Boldyreva and Bogdan Warinschi. It was a great event and I'd like to
thank the organizers for putting this together and doing such a great job. It
was really nice to see all the excitement and enthusiasm behind this topic;
both from the research community and from industry.</p>
<p>Since a few people have already asked me for details about the event I figured
I would just write brief summaries of the talks. I think the slides will be
posted soon so if you are interested you should be able to get more details on
the workshop page <a href="http://www.cs.bris.ac.uk/essa/">ESSA</a>.</p>
<p>The first talk was by <em>Christopher Bosch</em>, who gave a survey of encrypted search.
The talk was based on a paper Christopher published last year. This is a really
extensive and thorough survey and a great contribution to the field. The
authors go over a large number of papers and try to organize and categorize
them; drawing conclusions and research directions from the broad perspective
they gained from writing the survey. It is a great reference for anyone
interested in this field.</p>
<p><em>Kaoru Kurosawa</em> gave a talk on two of his papers. In the first paper, the
authors describe a universally composable (UC) variant of adaptive semantic
security (i.e., CKA2-security) for SSE. The main difference from the standard
definition is that the UC variant requires correctness and the simulation is
strengthened by requiring it to be black-box (i.e., a single simulator works
for all adversaries). Kaoru then described a construction that achieves this
strong notion of security. In the second part of the talk Kaoru discussed a
more recent paper of his that describes an SSE scheme that handles very
expressive queries but without revealing the expression (not just the keywords
in the query but the form/structure of the query). This is accomplished using a
new variant of garbled circuits which is very interesting in its own right.</p>
<p><em>Emily Shen</em> talked about her work on substring matching on encrypted data. This
is done using an encrypted suffix tree, i.e., using an (interactive) structured
encryption scheme for suffix trees. In this work, however, she was concerned
with a stronger model of security where the server can be malicious. This last
constraint required her to strengthen the standard definition of security for
structured encryption. The construction was very nice but a bit too involved to
describe here so I recommend reading the paper for more details.</p>
<p><em>Nathan Chenette</em> gave a nice overview of the state of the art of fuzzy
searching on encrypted data in the property-preserving model. After describing
the different approaches he gave stronger definitions for this primitive.
Unfortunately, to achieve this notion there is what seems to be an inherent
ciphertext expansion so he described a weaker notion that allows for more
space-efficient constructions.</p>
<p><em>Kevin Lewi</em> talked about his work on order revealing encryption (ORE). ORE is
similar to order-preserving encryption (OPE) in that it allows for comparisons
over ciphertexts. But there is one important distinction: unlike OPE, ORE does
not require the comparison operation over ciphertexts to be "less than". In
other words, to compare two OPE ciphertexts one can simply execute a
"greater/lesser than" operation whereas with ORE one might have to execute some
other arbitrary operation. This is an important relaxation and allows the
authors to overcome impossibility results for OPE which say that OPE schemes
have to leak more than just the order. The construction Kevin presented is
based on obfuscation techniques but does not require the full power of
obfuscation. In particular it avoids the use of Barrington's theorem (though it
makes use of multi-linear maps) which as Kevin said makes the scheme at least
"implementable" but not practical.</p>
<p><em>Murat Kantarcioglu</em> described a few of his works including a paper that
initiated research on concrete inference attacks; that is, attacks on encrypted
search solutions that use statistical and optimization techniques to exploit
leakage. He described the attack he and his co-authors use to try to recover
the queries a user makes by exploiting the access pattern leakage of SSE. This
attack is known as the IKK attack and is currently the best inference attack we
have against access pattern leakage. The second part of the talk covered ways
to mitigate these kinds of attacks and Murat described a clever way of using
differential privacy for this.</p>
<p><em>David Cash</em> also discussed inference attacks. In addition to standard inference
attacks, however, he also described attacks where the adversary exploits more
than leakage and, in particular, knows or chooses some of the documents. The
findings were very interesting. One thing that came out of this study was that
the IKK attack, while very interesting in theory, is not really practical. There
are several technical reasons for this but I'll leave you to read David's paper
when it appears if you are interested. This study also looked at a new class of
schemes (not SSE/structured schemes) that have appeared in the literature
recently and showed that they were vulnerable to adversaries who know and/or
can choose documents (though to be fair they were not designed with that
adversarial model in mind).</p>
<p>Unfortunately, <em>Hugo Krawczyk</em> couldn't make it to the workshop at the last
minute, so <em>Stas Jarecki</em> gave his talk. This was a nice overview of the work on
SSE done by the IBM/Rutgers/UC Irvine team (from now on referred to as IRI) for
the IARPA SPAR project. It covered a series of papers including their paper
from CRYPTO '13 that shows how to achieve conjunctive queries in sub-linear
time. The talk then continued with more recent papers that focused on schemes
with good I/O complexity and even more expressive queries. The talk had a nice
blend of theory and systems, in particular illustrating how systems constraints
like I/O complexity can sometimes force you to find new and interesting
solutions.</p>
<p><em>Vlad Kolesnikov</em> talked about the system that Columbia and Bell Labs designed
for the IARPA competition. This system, called Blind Seer, even had a cool logo
which we learned was designed by Vlad himself! At a high level, this system
makes use of garbled circuits and bloom filters and is designed to work in a
3/4-party model that includes a data owner, a policy server and an index
server. Vlad described several bottlenecks they encountered and all the clever
optimizations they had to design to make the system perform well. There was some
discussion about how Blind Seer compared to the IRI system. In the end, it
seemed that the two were incomparable and achieved different tradeoffs between
leakage and efficiency.</p>
<p><em>Adam O'Neill</em> presented his recent work on modular OPE (MOPE). MOPE is a variant
of OPE where a random modular shift is applied to the plaintext
before an OPE encryption is done. It turns out this can improve the security of
OPE but not when OPE is used to do range queries. Adam described a few
techniques to address this that didn't seem to affect the efficiency of the
schemes. He also showed experimental results to back this up.</p>
<p><em>Radu Sion</em> talked about the new cloud security startup he's doing. He couldn't
say much about the technical aspects of what they are doing but he went over
some of the services they are providing and showed demos, some of which
included searching on encrypted data. Since this was a "sensitive" talk and
Radu himself had to be careful not to reveal too much I'll stop here at the
risk of revealing things he may not want made public on a larger scale.</p>
<p><em>Paul Grubbs</em> gave a talk that went over what he's been working on at SkyHigh
networks. He talked about ongoing projects SkyHigh was doing with OPE,
deterministic encryption and format-preserving encryption. In addition he
discussed future projects the company was planning on doing with SSE. This talk
was nice in that it provided a different perspective on crypto than what you
typically get in academic settings. In particular, Paul described how the
solutions they considered and worked on had to fit various business and
legal/regulatory constraints. This is something I've been exposed to at MSR and
I definitely think that seeing how technology gets (or doesn't get) deployed in
the real world is very useful in developing and sharpening your intuition about
what research areas are more or less promising in terms of impact.</p>
<p><em>Mayank Varia</em> gave a great talk on the testing framework Lincoln Labs built to
evaluate the encrypted search systems for the IARPA competition. I have to say
this was one of my favorite talks. The scale of what they built was truly
impressive. The system is composed of various frameworks. One part of the
system is just for generating realistic data and queries and they do this using
machine learning techniques on real data. The query generation is very
flexible however, and you can use it to generate data and queries with specific
characteristics for your tests. The second component is a measurement
framework. The third component was an automated system for generating graphs
and visualizations of the experimental results in LaTeX! Overall what they
built sounded very impressive and I think that we should try to adopt it as a
standard way of testing/evaluating encrypted search solutions. I think the
encrypted search community is lucky to have such a framework so we should take
advantage of it. Mayank said that they are working on getting the code up on
GitHub so I'll update this post as soon as it's up.</p>
<p><em>David Wu</em> talked about a new protocol for privacy-preserving location
services. Suppose you want to find out how to get from point A to point B but
don't want to disclose your location to the server that stores the maps and the
server doesn't want to reveal its own data. Without privacy, one can solve this
problem by representing the map as a graph and computing the shortest path so
the problem David was interested in was whether one can design a practical two-party
protocol for shortest paths. David showed how to do this by first proposing a
very nice way to compress the representation of the graph in a way that doesn't
affect the shortest paths and then computing the shortest paths on the new
representation via oblivious transfer. David then presented benchmarks of their
protocol for the city of Los Angeles.</p>
<p><em>Florian Bourse</em> presented new constructions of functional encryption schemes for
inner products. Unlike previous general-purpose FE schemes the goal of this
work was to provide simple and efficient constructions. Florian discussed two
constructions, one based on DDH and another based on LWE. Note that the
functionality considered by Florian is slightly different than "inner product
encryption" of Katz, Sahai, Waters and Shen, Shi and Waters. In the latter
works, the decryption returns one bit of information: whether the inner product
is equal to 0 or not. Here, the decryption returns the actual inner product.</p>
<p><em>Tarik Moataz</em> talked about ORAM with constant bandwidth. What is meant in the
literature by constant-bandwidth ORAM is a bit technical but, roughly speaking,
one can think of it as the requirement that the metadata exchanged with the
server is smaller than the data blocks. Previous work on constant-bandwidth
ORAM had two limitations. The first is that they achieved only amortized
constant-bandwidth. The second is that they only work with very large blocks
and as such only make sense for limited kinds of applications (using standard
parameters, they would have 4MB blocks). Tarik showed how to get around these two
limitations, giving a worst-case constant-bandwidth ORAM with much smaller
block size. In addition, the scheme also improves the computational cost at the
server.</p>
<p><em>Stas Jarecki</em> talked about RAM-based MPC (i.e., MPC protocols that work in the
RAM model as opposed to over circuits). The standard way to do this is to use
two-party computation (2PC) to securely compute the client algorithm of an ORAM
scheme. Roughly speaking, this requires the ORAM client algorithm to be
MPC-friendly so that the resulting solution is efficient. While most schemes
consider only the two-party setting, Stas argued that it is interesting to look
at three parties as well since better efficiency could be achieved in that
setting. In fact, Stas described a protocol for this setting which was a lot
more efficient than protocols for the two-party setting.</p>
<p><em>Leo Reyzin</em> gave a survey of entropy notions in cryptography. Leo went over
Shannon entropy, min-entropy, and average conditional min-entropy, in each case
giving a very nice and intuitive explanation of why and when these notions
should be applied. He also discussed computational variants of entropy
including HILL entropy and what is known and not known about it. Entropy
notions in crypto are a bit subtle and can be hard to work with and
unfortunately there isn't much material to learn from so Leo's survey was
extremely useful.</p>
Applied Crypto Highlights: Searchable Encryption with Ranked Results
http://senykam.github.io/2015/04/15/applied-crypto-highlights-searchable-encryption-with-ranked-results
Wed, 15 Apr 2015 20:57:14 -0300http://senykam.github.io/2015/04/15/applied-crypto-highlights-searchable-encryption-with-ranked-results<p><em>This is the second in a series of guest posts highlighting new research in
applied cryptography. This post is written by <a href="http://www.baldimtsi.com/">Foteini
Baldimtsi</a> who is a postdoc at Boston University and
<a href="http://research.microsoft.com/en-us/people/oohrim/">Olya Ohrimenko</a> who is a
postdoc at Microsoft Research. Note that Olya is on the job market this year.</em></p>
<p><img src="http://senykam.github.io/img/steam.jpg" class="alignright" width="250">
Modern cloud services let their users outsource data as well as request
computations on it. Due to potentially sensitive content of users' data and
distrust in cloud services, it is natural for users to outsource their data
encrypted. It is, however, important for the users to still be able to use
cloud services for performing computations on the encrypted data. In this
article we consider an important class of such computations: search over
outsourced encrypted data. Searchable Encryption has attracted a lot of
attention from the research community and has been thoroughly described by Seny
in <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1">previous blog posts</a>.</p>
<p>Search functionality alone, however, might not be enough when one considers a
large amount of data. Ideally, users would like to not only receive the
matching results, but get them back sorted according to how relevant they are
to their query (just like a search engine does!). In this blog post we describe
our <a href="http://fc15.ifca.ai/preproceedings/paper_89.pdf">recent result</a> from
the conference on Financial Cryptography and Data Security 2015 which builds on
top of searchable encryption techniques to return <em>ranked results</em> to
users' queries. Our goal is to create a scheme that is efficient and achieves a
high level of privacy against a curious cloud server.</p>
<h2 id="ranking-search-results-on-plaintext-data">Ranking search results on plaintext data</h2>
<p>Let us start by briefly describing how ranking would be done if users did not
take into account the privacy of their data and outsourced it in an unencrypted
format. Literature on information retrieval offers an abundance of ranking
methods. For our paper, we chose the $\mbox{tf-idf}$ ranking method due to its
simplicity, popularity and the fact that it supports free text queries. This method
is effective since it is based on term/keyword frequency (tf) and inverse
document frequency (idf).</p>
<p>Let $D=D_1,\dots,D_n$ be a document collection of $n$ documents, in which there
exist $m$ unique terms/keywords $t_1,\dots,t_m$. First, for every term, $t$, we
compute its frequency ($\mbox{tf}$) in each document $D_i$ as well as its inverse
document frequency ($\mbox{idf}$), which captures how common the term is
in the whole document collection. Then, for each term and document we compute</p>
<p><span class="math">\[
\mbox{tf-idf}_{t,D_i} = \mbox{tf}_{t,D_i} \times \mbox{idf}_{t}
\]</span></p>
<p>and store the score values in the rank table, $T$:</p>
<p><figure><img src="http://senykam.github.io/img/searchindextable.jpg" alt="" title="$\mbox{tf-idf}$ rank table, $T$, outsourced to the cloud"><figcaption>$\mbox{tf-idf}$ rank table, $T$, outsourced to the cloud</figcaption></figure></p>
<p>Note that if a term does not appear in a document, then we store $0.00$ as its rank.
This table could be either computed by the owner of the document collection and
outsourced to the cloud, or computed by the cloud itself since it
receives the actual document collection $D$ in the clear.</p>
<p>Now suppose that a user wants to query the cloud for the multi-keyword query "searchable
encryption". Then, the cloud first searches for the terms "searchable" and
"encryption" in the table, adds the corresponding rows together to get the
overall score of the query, sorts the scores, and returns the relevant
documents in a sorted order.</p>
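<p>As a rough illustration, here is a Python sketch of this plaintext ranking pipeline. It uses raw term counts for tf and $\log(n/\mbox{df})$ for idf, which is one common choice among many; the toy documents are of course made up.</p>

```python
import math
from collections import Counter

# Toy tf-idf ranking: raw counts for tf, log(n / df) for idf.
docs = {
    "D1": "searchable encryption on outsourced data",
    "D2": "graph encryption and graph databases",
    "D3": "ranked searchable encryption results",
}
n = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)
idf = {t: math.log(n / df[t]) for t in df}

# Rank table T: the tf-idf score of every term in every document
# (a Counter returns 0 for absent terms, matching the 0.00 entries).
T = {t: {d: tf[d][t] * idf[t] for d in docs} for t in idf}

def ranked_search(query):
    # Add the rows for each query term, then sort documents by score.
    scores = {d: sum(T[t][d] for t in query.split() if t in T) for d in docs}
    return sorted(scores, key=scores.get, reverse=True)

print(ranked_search("searchable encryption"))   # → ['D1', 'D3', 'D2']
```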
<h2 id="ranking-search-results-on-encrypted-data">Ranking search results on encrypted data</h2>
<p>A user that wishes to protect her privacy is likely to outsource her document
collection to the cloud in an encrypted format: $E(D_1),\dots,E(D_n)$. In order
to be able to perform <em>ranked search</em>, the user has to create the rank
table $T$ and send it to the cloud (as opposed to outsourcing plaintext data
where the cloud could also compute the rank table itself). Since the rank table
contains information about the distribution of words in individual documents and
the whole collection, it has to be encrypted as well. However, in order for the
server to be able to return ranked results using the $\mbox{tf-idf}$ method
described above, the encrypted table $T$ should be able to support the following
operations:</p>
<ol>
<li>search for terms/keywords</li>
<li>add numerical values</li>
<li>sort a list of numerical values.</li>
</ol>
<p>For the first operation one could simply encrypt the keywords on the table
using a <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1/">searchable
encryption</a>
(SE) scheme. Then, whenever the user wants to search for a phrase, she sends to
the cloud an SE trapdoor for each keyword in the phrase. The server can then
use the trapdoors to locate the keywords in the table.</p>
<p>The next two operations refer to the numerical entries on the table which
should be encrypted in a way that supports addition and sorting. A natural
solution would be to encrypt these values under a <a href="http://outsourcedbits.org/2012/06/26/applying-fully-homomorphic-encryption-part-1/">fully-homomorphic
encryption</a>
scheme that can support any type of computation over encrypted data. However,
the resulting solution would be too inefficient to apply in practice.
Another potential solution would be to encrypt the numerical values under an
<a href="http://www.cc.gatech.edu/aboldyre/papers/bclo.pdf">order-preserving
encryption</a> (OPE) scheme.
However, this would be sufficient only for single-keyword queries, since OPE
schemes cannot support homomorphic addition (and, even if they did, they would
<a href="http://luca-giuzzi.unibs.it/corsi/Support/papers-cryptography/RAD78.pdf">not be
secure</a>).
Note that even for single-keyword queries, OPE might not be ideal since it leaks the
rank order of the documents for each keyword (see also the discussion
<a href="http://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-2/">here</a>).</p>
<p>Given that we aim for an efficient and provably secure
solution, we propose to encrypt the numerical values of the rank table using
the <a href="http://en.wikipedia.org/wiki/Paillier_cryptosystem">Paillier encryption
scheme</a>: a semi-homomorphic
scheme that supports the addition of encrypted values. (For the rest of this
post, we use $[a]$ to denote the encryption of value $a$ using this scheme.)
By the properties of Paillier, the server
can add the corresponding rows of $T$ when a query is received. What is still
left to discuss is, how the server can also sort these encrypted values. In
the rest of the post, we describe our private sorting mechanism over encrypted
values.</p>
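<p>To give a feel for the homomorphic property we rely on, here is a toy Paillier implementation in Python. The primes are tiny and the code is for illustration only; a real deployment would use a vetted library and proper key sizes.</p>

```python
import math
import random

# Toy Paillier with tiny primes -- for illustration only, wildly insecure.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)    # simple form of mu, valid because g = n + 1

def enc(m):
    """Encrypt m in [0, n): g^m * r^n mod n^2 for random r coprime to n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt: L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# The additive homomorphism: multiplying ciphertexts adds plaintexts,
# which is how the server can add encrypted rows of the rank table.
row1, row2 = enc(12), enc(30)
assert dec((row1 * row2) % n2) == 42
```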
<p>Our private sorting mechanism requires equipping the cloud server with a secure
co-processor (e.g., <a href="http://www-03.ibm.com/security/cryptocards/pciecc/overview.shtml">IBM
PCIe</a>, <a href="https://software.intel.com/en-us/blogs/2013/09/26/protecting-application-secrets-with-intel-sgx">Intel
SGX</a>,
<a href="https://technet.microsoft.com/en-us/library/cc749022%28v=ws.10%29.aspx">Windows
TPM</a>).
The secure co-processor is then given the decryption key of the
semi-homomorphic encryption scheme which lets him assist the cloud server in
sorting. For the protocol to proceed, we assume that the co-processor does not
collude with the cloud server and both of them are following the protocol in an
honest-but-curious way. That is, neither of them deviates from the protocol but
both are curious to learn more about user's data.</p>
<p><figure><img src="http://senykam.github.io/img/introimagesingleuserslim.jpg" alt="" title="An overview of the interactions between the user, the cloud server $S_1$ and the co-processor $S_2$."><figcaption>An overview of the interactions between the user, the cloud server $S_1$ and the co-processor $S_2$.</figcaption></figure></p>
<p>Regarding the privacy of our scheme, we design our protocol in such a way that:
(a) the co-processor learns nothing about the values being sorted and (b) the
cloud server, as in SE, learns the search pattern (i.e., whether a keyword was
queried before or not), but learns nothing about the ranking of the documents.
For example, he does not learn which document ranks higher for the user's query.</p>
<h2 id="private-sort">Private Sort</h2>
<p>We now develop a sorting protocol that the cloud server and the co-processor
can use to jointly sort encrypted ranking data of the documents. From now on we
denote the cloud server by $S_1$ and the co-processor by $S_2$. Our private
sort is a two-party protocol between $S_1$ and $S_2$ where $S_1$ has an
encrypted array of $n$ elements $[A] = { [A_1], [A_2], \ldots, [A_n]}$ and
$S_2$ has the secret key that can decrypt $A$.<br>
By the end of the protocol, $S_1$ should obtain $[B] = {[B_1],
[B_2], \ldots, [B_n]}$ where $[B]$ is an encryption of $A$ sorted. Since $S_1$
and $S_2$ are both curious, we are interested in protecting the content of $A$
and $B$ from both of them and we are willing to reveal <em>only</em> the size of
$A$, $n$. Hence, $S_2$ should only assist $S_1$ in sorting without seeing the
encrypted content of $A$ or $B$, otherwise he can trivially decrypt it. On the
other side of the protocol, nothing about the decryption key nor plaintext
values of $A$ and $B$ should be leaked to $S_1$. For example, we do not want
to reveal to either $S_1$ or $S_2$ the values of elements in $A$, their comparison
results with other elements, or their new locations in $B$ (in the paper we
express these properties using simulation-based security definitions).</p>
<p><strong>Private Sort Construction Overview.</strong>
As can be seen from the definitions, the participation of $S_1$ and $S_2$ in
private sort should not reveal anything about the content of the data to either
of them. Hence, any method we use for comparison and sorting must appear
independent of the data. We note, however, that many sorting algorithms access
the data depending on the comparison result and data content (e.g., quicksort).
This does not fit our model where everything about the data, including
individual comparisons, should be protected from $S_1$ and $S_2$.</p>
<p>Fortunately, there are sorting algorithms where data comparisons are determined
by the size of the data to be sorted, $n$ in our case, and not the data
content. One such algorithm is a
<a href="http://dl.acm.org/citation.cfm?id=1468121">sorting network</a> by K.
Batcher which relies on a Two-Element Sort circuit. This circuit takes two
elements and outputs them in sorted order.
The network consists of $O(\log^2 n)$ layers where every layer has $O(n)$ Two-Element Sort circuits,
and the exact wiring of the network is determined solely by $n$.
In order to sort the data, one simply passes it through the network.
Moving the data through the network depends only on $n$ and the Two-Element Sort.
Hence, if we develop a private Two-Element Sort, the implementation
of private Batcher's network becomes trivial.</p>
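<p>As a sketch, the following Python code generates the comparator positions of Batcher's odd-even mergesort for a power-of-two input size. Note that the sequence of compared positions depends only on the length of the array, never on its contents, which is exactly the property private sorting needs.</p>

```python
def batcher_network(length):
    """Comparator pairs (i, j) for Batcher's odd-even mergesort.

    The pair sequence depends only on `length` (a power of two),
    never on the data being sorted.
    """
    pairs = []

    def merge(lo, n, r):
        step = r * 2
        if step < n:
            merge(lo, n, step)          # merge the even subsequence
            merge(lo + r, n, step)      # merge the odd subsequence
            pairs.extend((i, i + r) for i in range(lo + r, lo + n - r, step))
        else:
            pairs.append((lo, lo + r))

    def sort(lo, n):
        if n > 1:
            sort(lo, n // 2)
            sort(lo + n // 2, n // 2)
            merge(lo, n, 1)

    sort(0, length)
    return pairs

# Sorting = running every comparator (Two-Element Sort) in the fixed order.
data = [5, 1, 7, 3, 8, 2, 6, 4]
for i, j in batcher_network(len(data)):
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]
assert data == [1, 2, 3, 4, 5, 6, 7, 8]
```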
<h3 id="private-twoelement-sort">Private Two-Element Sort</h3>
<p>As the name suggests, Private Two-Element Sort is a special case of Private Sort, as defined above, for the case $n=2$. That is, $S_1$ has two encrypted elements $[a]$ and $[b]$ and wishes to obtain $[c]$ and $[d]$ where $c = \min(a,b)$ and $d = \max(a,b)$. Similarly, $S_2$ has the secret key of the encryption. The security definition is also the same and informally states that neither $S_1$ nor $S_2$ learns anything about $a$ and $b$.</p>
<p>We first describe operations that are required to perform Two-Element Private Sort without encryption and then for every operation give its private version. The sorting consists of:</p>
<ol>
<li>$t := a > b$ (Set bit $t$ to the result of comparing $a$ and $b$).</li>
<li>$c := (1-t)a + tb$ (Use $t$ to select the minimum of $a$ and $b$).</li>
<li>$d := ta + (1-t)b$ (Use $t$ to select the maximum of $a$ and $b$).</li>
</ol>
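<p>On plaintext values, these three steps are just the following (in the actual protocol each step is carried out over Paillier ciphertexts, with $S_2$ blindly assisting):</p>

```python
def two_element_sort(a, b):
    t = int(a > b)              # 1. comparison bit
    c = (1 - t) * a + t * b     # 2. select the minimum
    d = t * a + (1 - t) * b     # 3. select the maximum
    return c, d

assert two_element_sort(9, 4) == (4, 9)
assert two_element_sort(4, 9) == (4, 9)
```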
<p>Note that these three operations have to be performed on encrypted data:
$a$ and $b$ are part of the encrypted input of $S_1$, and
bit $t$ and values $c$ and $d$ should also be encrypted to protect their content from $S_1$.
Moreover, none of these values should be shown to $S_2$ since he can trivially decrypt them,
violating the privacy guarantees against $S_2$.</p>
<p>We show how to perform above operations over encrypted data one by one, starting first with
a <em>Private Comparison</em> protocol for computing $[t]$ and following with
a <em>Private Select</em> protocol for computing $[c]$ and $[d]$.</p>
<p><strong>Private Comparison.</strong>
This protocol is a variation of Andrew Yao's classical <a href="http://research.cs.wisc.edu/areas/sec/yao1982-ocr.pdf">Millionaire's problem</a>:
$S_1$ has $[a]$ and $[b]$ and wishes to obtain $[t]$, where
$t = (a > b)$ and $S_2$ has the private key of the encryption scheme.
Although there is more than one way of doing so, we pick an efficient
algorithm from a recent result by <a href="http://www.internetsociety.org/sites/default/files/04_1_2.pdf">Bost et al.</a>, which is a correction of the original <a href="http://bioinformatics.tudelft.nl/sites/default/files/Comparing%20encrypted%20data.pdf">protocol</a> by T.Veugen.
This algorithm lets $S_1$ and $S_2$ compare $a$ and $b$ using
number of interactions that is logarithmic in the number of bits in each element.</p>
<p>Note that neither $S_1$ nor $S_2$ learn the values of $a$, $b$, and $t$.
In addition, $S_2$ does not learn the ciphertexts corresponding to these values.</p>
<p><strong>Private Select.</strong>
Given the comparison bit $t$, we now devise a private algorithm for using this
bit to select the minimum and the maximum value of $a$ and $b$ (that is
performing operations 2 and 3 above). Recall that $S_1$ has to obtain $[c]$
and $[d]$ with $S_2$ "blindly" assisting him in the protocol.</p>
<p>We wish to use simple cryptographic operations in order to compute $c$ and $d$.
That is, we use semi-homomorphic cryptographic techniques as opposed to
fully-homomorphic ones. To this end, we use an interesting property of layered
Paillier Encryption. We omit many details here and only point out the
features that we need.</p>
<p>We denote messages encrypted using the first and second layers of Paillier Encryption by
$[m]$ and $[\![m]\!]$, respectively.
Recall that Paillier Encryption supports addition of ciphertexts as well
as multiplication by a constant, i.e., $[m_1][m_2] = [m_1+m_2]$ and $[m]^{C} = [Cm]$.
The same operations hold for ciphertexts of the second layer.
However, what is more interesting is that a ciphertext of the first layer
lies in the plaintext domain of the second layer, which allows a doubly
encrypted value to be exponentiated by a layer-one ciphertext.</p>
<p>This trick allows us to implement the functionality of private select for $c$,
and similarly for $d$, as follows:</p>
<p><span class="math">\[[\![[c]]\!] := [\![[a]]\!]^{[1-t]} [\![[b]]\!]^{[t]} = [\![[(1-t)a + tb]]\!]\,\]</span></p>
<p>where $c$ and $d$ are doubly encrypted.</p>
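<p>To make the homomorphic algebra concrete, here is a toy single-layer Paillier sketch. This is my illustration, not the paper's implementation: the primes are demo-sized, and the selection bit $t$ is used in the clear, whereas the actual protocol keeps $t$ encrypted by working at the second layer.</p>

```python
import math
import random

def paillier_keygen(p=293, q=433):
    """Toy Paillier key generation (tiny primes, demo only).
    Uses g = n + 1, so decryption reduces to a division by n."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)
    return n, (lam, mu)            # public key n, secret key (lam, mu)

def encrypt(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def select_min(n, ca, cb, t):
    """[(1-t)a + t*b]: homomorphic blend of two ciphertexts.
    Here t is a plaintext bit; the real protocol keeps it encrypted."""
    return (pow(ca, 1 - t, n * n) * pow(cb, t, n * n)) % (n * n)
```

<p>Exponentiating by $1-t$ and $t$ and multiplying the results yields an encryption of $(1-t)a + tb$ without ever decrypting either input.</p>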
<p>Recall that the output of Two-Element Private Sort is a building block of the
general sort, where $c$ and $d$ participate in further invocations of
Two-Element Private Sort. To make the values $c$ and $d$ usable in the next
layer of Batcher's network, $S_1$ uses $S_2$ to strip off the extra layer of
encryption: $S_1$ blinds the value he needs to strip as $[\![[c+r]]\!]$ and
sends it to $S_2$, who decrypts the outer ciphertext and sends back only $[c+r]$.
Using the homomorphic properties of Paillier, $S_1$ subtracts $r$ to get $[c]$.
A similar protocol is executed for $d$. Note that this protocol requires only
one interaction with $S_2$.</p>
<h3 id="private-nelement-sort">Private $n$-Element Sort</h3>
<p>Let us now show how to sort an array of $n$ elements using our Private Two-Element Sort.
$S_1$ executes Batcher's sorting network layer by layer.
For each layer in the network and for every sorting gate in this layer,
he engages with $S_2$ in Private Two-Element Sort,
and uses the outputs of this layer as inputs to the next layer
of the network. (See the figure below for an illustration.)</p>
<p><figure><img src="http://senykam.github.io/img/batcher1.jpg" alt="" title="Example of privately sorting an encrypted array of four elements $5,1,2,9$ where $[m]$ denotes a Paillier encryption of message $m$ and $\mathsf{pairs}_i$ denotes a pair of elements to be sorted. Note that only $S_1$ stores values in the arrays $A_i$ while $S_2$ blindly assists $S_1$ in sorting the values."><figcaption>Example of privately sorting an encrypted array of four elements $5,1,2,9$ where $[m]$ denotes a Paillier encryption of message $m$ and $\mathsf{pairs}_i$ denotes a pair of elements to be sorted. Note that only $S_1$ stores values in the arrays $A_i$ while $S_2$ blindly assists $S_1$ in sorting the values.</figcaption></figure></p>
<p><strong>Sketch of Privacy Analysis.</strong>
We note that the number of times $S_1$ engages with $S_2$ in the protocol
reveals nothing to either of them about the data content. Each engagement is
an execution of Private Two-Element Sort which, in turn, consists of one call to
Private Comparison and two calls to Private Select. Private Comparison guarantees
privacy against $S_1$ and $S_2$ as long as they are non-colluding semi-honest
adversaries. Private Select relies on the homomorphic properties of Paillier and
requires only the re-encryption step from $S_2$. Since $S_2$ receives a
blinded value, he does not learn the value of $c$ or $d$. Moreover, since the
values of $c$ and $d$ are re-randomized, we can treat the $O(n (\log n)^2)$ calls to
Two-Element Private Sort independently.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We constructed a private sort mechanism that allows a cloud server $S_1$ to sort
a list of encrypted data without learning anything about their order (while
assisted by a non-colluding co-processor $S_2$). As discussed in the beginning
of our post, this tool lets a user store his encrypted documents in
a cloud server and receive ranked results when searching on them.</p>
<p>The method, as described in this post, assumes that the rank table has an entry
for every keyword-document pair: if a keyword does not appear in a document,
a zero is stored.
In the <a href="https://eprint.iacr.org/2014/1017">full version</a> of the paper, we show that
we can relax this requirement and store information only for the documents in which a
keyword appears, significantly reducing the size of $T$ and the query time for the server.
If we do so, we can add ranking to the optimal SE technique by <a href="http://research.microsoft.com/apps/pubs/?id=102088">Curtmola et al.</a> for single keyword queries or to the technique by <a href="https://eprint.iacr.org/2013/169">Cash et al.</a>
for efficiently answering Boolean queries on encrypted data (see earlier <a href="http://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5/#comment-2512">blog post</a> for more details on each).
Although the resulting scheme gives a significant performance
improvement and protects the ranking of the documents,
it inherits the leakage of the access pattern (i.e., identifiers of the documents where each query keyword appears)
from the corresponding SE technique.</p>
<p>Our work leaves several interesting open questions, including:
how can the document collection be updated efficiently?
How can a user verify the ranking results it receives?
And is a non-colluding co-processor provably necessary for multi-keyword
ranked search? Any ideas? :)</p>
Applied Crypto Highlights: Restricted Oblivious RAMs and Hidden Volume Encryption
http://senykam.github.io/2014/12/09/applied-crypto-highlights-restricted-oblivious-rams-and-hidden-volume-encryption
Tue, 09 Dec 2014 20:40:06 -0300http://senykam.github.io/2014/12/09/applied-crypto-highlights-restricted-oblivious-rams-and-hidden-volume-encryption<p><em>This is the first in a series of guest posts highlighting new research in
applied cryptography. This post is written by <a href="http://www.ccs.neu.edu/home/travism/">Travis
Mayberry</a> from Northeastern University.
Note that Travis is graduating this year and will be on the job market.</em></p>
<p><img src="http://senykam.github.io/img/steam.jpg" class="alignright" width="250"></p>
<h2 id="oram-background">ORAM Background</h2>
<p>Oblivious RAM is a very hot research topic right now. As Seny has written
about
<a href="http://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-4-oblivious-rams">here</a>,
it can be used to perform searches over outsourced encrypted data while
maintaining the highest possible levels of security against a malicious storage
provider. As noted in that post, however, in exchange for this security it
imposes a very significant overhead on the client. In contrast, searchable
encryption gives us almost as much security at a much lower cost. So, why
should we care about ORAM then, and why is it so interesting to researchers
right now? In this post I'm going to attempt to answer that question as well
as highlight some advances in ORAM efficiency that I recently presented at
<a href="https://eprint.iacr.org/2014/344.pdf">CCS</a> and a few interesting applications
for it that you may not have seen.</p>
<p>Broadly, the answer to my question above is that the security you give up for
improved efficiency might not be acceptable. The motivating example I usually
give is one of a hospital that wants to outsource its patient records to the
cloud. Of course, they are concerned about patient privacy and so they encrypt
those records to prevent the cloud provider from learning sensitive information
from those records. Unfortunately, beyond the data itself, a careful adversary
can learn a lot of sensitive information from where, when and how often a client
accesses their data. In this case, if the provider sees that a cancer doctor
has accessed my records, they will learn that I have (or at least suspect I
have) cancer, regardless of whether they can decrypt my actual records or not.
The most dangerous aspect of these types of attacks is that they are cumulative.
An adversary may learn only a small amount from any one access, but over time
they can aggregate everything that they have seen with any side knowledge of the
client they might have to reveal a surprising amount of sensitive information.
With more data being outsourced to the cloud every day, this becomes a bigger
and bigger problem.</p>
<p>Here is, of course, where Oblivious RAM comes in. Remember that ORAM can be used
for secure searching, but it is actually a very general tool that can hide
<em>any</em> access pattern a user wishes to perform from the server it is
performing it on. An ORAM scheme provides an interface <span class="math">\((\mathsf{Read},
\mathsf{Write})\)</span>, which guarantees that the addresses a client reads and writes
to are hidden from the server. Specifically, given any two sequences of accesses
<span class="math">\(S_0\)</span> and <span class="math">\(S_1\)</span>, and a random bit <span class="math">\(b \leftarrow_{\$} \{0,1\}\)</span>, there should not
exist any probabilistic polynomial time adversary <span class="math">\(\mathcal{A}\)</span> such that
<span class="math">\(\mathcal{A}(\mathsf{ORAM}(S_b)) \rightarrow b\)</span> with probability non-negligibly
greater than <span class="math">\(1/2\)</span>. Here we use <span class="math">\(\mathsf{ORAM}(S_b)\)</span> to signify the series of
accesses performed on the server by the ORAM algorithm when running <span class="math">\(S_b\)</span>.</p>
<p>This is accomplished by continually shuffling and refreshing the data on the
server so that each individual access is indistinguishable from random. Again,
refer to the previous
<a href="http://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-4-oblivious-rams">post</a>
for a good example of how an ORAM works, but there are a few things worth
restating:</p>
<ol>
<li><p>ORAMs are highly stateful. Every read and write operation will change the
data structure on the server, and every subsequent operation will depend on the
current state of the storage device. This often means that the client has to
keep additional auxiliary information beyond long-term secrets in order to
correctly access the data on the server. We refer to this as the <em>client
memory</em>.</p></li>
<li><p>Each access the client performs will require more than one (often many
more) "raw" accesses on the storage device. This is a consequence of the fact
that the client must hide which block of data they are actually interested in.</p></li>
</ol>
<p><strong>Efficiency.</strong> The main property which we evaluate ORAM algorithms on is
their communication efficiency. For every operation the ORAM performs, how much
data must be transferred to and from the server? This is very important for
cloud scenarios because communication overhead translates directly to cost.
Efficiency is expressed in terms of the number of blocks in the data store, <span class="math">\(n\)</span>,
and the size of each block <span class="math">\(B\)</span>. Usually, however, <span class="math">\(B\)</span> is left out and the cost
is expressed as a multiple of <span class="math">\(B\)</span>. This gives an intuitive notion of
"overhead" compared to a normal access, which simply costs <span class="math">\(B\)</span> communication.
Additionally, we must consider the amount of client memory that a scheme
requires. If the client needs a huge amount of memory then it would be
counterproductive in a scenario meant specifically to alleviate the client's
storage burden.</p>
<p><strong>Existing work.</strong> As I alluded to before, there has been a flurry of
research lately on ORAM. The two most notable papers have been by Shi et al.
[<a href="https://eprint.iacr.org/2011/407.pdf">SCSL11</a>] and Stefanov et al.
[<a href="https://www.cs.umd.edu/~elaine/docs/pathoram.pdf">SDS+13</a>]. The
first paper introduced a new paradigm for ORAMs, constructing a data structure
with a binary tree where each node is itself a smaller Oblivious RAM. This has
inspired much of the subsequent research, and it would require its own post to
do it justice, but suffice it to say that they achieve overhead of
<span class="math">\(O(\log^3{n})\)</span> with only <span class="math">\(O(1)\)</span> client memory. Stefanov et al. introduced an
improved tree construction with <span class="math">\(O(\log^2{n})\)</span> efficiency and <span class="math">\(O(\log{n})\)</span>
client memory. Additionally, in a later revision of the paper, they were able
to introduce a tweak which reduces the overhead of both schemes by a
<span class="math">\(O(\log{n})\)</span> factor. This, finally, gives us a scheme with constant memory and
<span class="math">\(O(\log^2{n})\)</span> overhead, and one with higher, logarithmic memory but only
<span class="math">\(O(\log{n})\)</span> overhead. In terms of efficiency, Path ORAM represents the state
of the art for Oblivious RAM.</p>
<h2 id="writeonly-oram-construction">Write-only ORAM construction</h2>
<p>Although tree-based ORAMs provide drastically improved efficiency over more
traditional hierarchical or square-root schemes, they still impose a non-trivial
overhead that makes many applications cost-prohibitive. As a simple reference,
setting <span class="math">\(n=2^{20}\)</span> (a modestly sized database) in Path ORAM induces an overhead
of at least 80x, and potentially much more depending on the size of <span class="math">\(B\)</span> and the
security parameter. There is little known about ORAM lower bounds, but it has
been shown that under certain conditions the best you can do is an overhead of
<span class="math">\(\Omega(\log{n})\)</span> [<a href="http://dl.acm.org/citation.cfm?id=28416">Goldreich87</a>].
While there are some rather large loopholes in that proof, such as the
requirement that the client have <span class="math">\(O(1)\)</span> memory and the memory blocks be all of
equal size, this level of overhead seems unavoidable when using a tree-based
scheme simply because the height of the tree will be <span class="math">\(O(\log{n})\)</span>.</p>
<p>It is interesting, then, to consider whether a more restricted Oblivious RAM
might achieve better efficiency. Consider an ORAM which attempts to hide not
read and write accesses, but writes alone. Suppose for the time being that
there is an alternative, secure way for the client to read from the storage
device, and it only wants to hide updates. It turns out that this goal is
achievable in a rather simple way that induces only <span class="math">\(O(1)\)</span> overhead!</p>
<p><span class="math">\({\sf Setup}\)</span>: To start with, we initialize an array of size <span class="math">\(2n\)</span> on the storage
device to hold <span class="math">\(n\)</span> logical blocks of data. Every location is initially empty,
and the client has a local data structure which maps a logical block ID in the
range <span class="math">\([0,n)\)</span> to a storage location in the range <span class="math">\([0, 2n)\)</span>. This way, the
client always knows which location a block is in if they want to retrieve it.</p>
<p><span class="math">\({\sf Write}(id, data)\)</span>: Every write operation starts by choosing <span class="math">\(k\)</span> unique
positions in the array uniformly at random, <span class="math">\(P = \{p_1, ..., p_k\}\)</span>. Since
there are <span class="math">\(2n\)</span> "slots" for only <span class="math">\(n\)</span> real blocks, if we choose <span class="math">\(k\)</span> to be
moderately large we will be guaranteed that for at least one <span class="math">\(i \leq k\)</span>, <span class="math">\(p_i\)</span>
will be empty. The client then picks one of these empty blocks and writes
<span class="math">\(data\)</span> into it, re-encrypting all blocks in <span class="math">\(P\)</span> that <em>are</em> full already, and
writing random strings into the free blocks of <span class="math">\(P\)</span> that were not used. Finally,
the client updates their map data structure so that the record for <span class="math">\(id\)</span> points
to the new location. The old location of that block will still hang around,
with "stale" data in it, but if it is ever chosen again in a set <span class="math">\(P\)</span> it will
be considered free for the purposes of storing new data. In that way, we avoid
having to touch the existing location of a block when we are updating its value,
leading to more efficient hiding of the access pattern.</p>
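<p>The setup and write procedure can be sketched as follows. This is my simplification: plaintext stands in for ciphertext (so "re-encrypting" an occupied slot is a no-op here, whereas the real scheme refreshes its randomness under an IND-CPA scheme), and failing to find an empty slot raises an error instead of using the stash optimization discussed later in the post.</p>

```python
import os
import random

class WriteOnlyORAM:
    """Sketch of write-only ORAM: 2n slots hold n logical blocks, and
    every write touches k randomly chosen slots so the server cannot
    tell which logical block was updated."""

    def __init__(self, n, k, block_size=16):
        self.n, self.k, self.B = n, k, block_size
        self.slots = [os.urandom(block_size) for _ in range(2 * n)]  # server side
        self.pos = {}          # client map: block id -> slot (O(n log n) bits)
        self.occupied = set()  # slots holding current (non-stale) data

    def write(self, block_id, data):
        P = random.sample(range(2 * self.n), self.k)
        free = [p for p in P if p not in self.occupied]
        if not free:
            raise RuntimeError("no empty slot; the real scheme uses a stash")
        target = free[0]
        old = self.pos.get(block_id)
        if old is not None:
            self.occupied.discard(old)          # old copy becomes stale
        for p in P:
            if p == target:
                self.slots[p] = data            # Enc(data) in the real scheme
            elif p in self.occupied:
                pass                            # re-encrypt in place (fresh randomness)
            else:
                self.slots[p] = os.urandom(self.B)  # overwrite stale/free slots
        self.occupied.add(target)
        self.pos[block_id] = target

    def read(self, block_id):
        # The client-side map makes reads direct; only writes are hidden.
        return self.slots[self.pos[block_id]]
```

<p>With these toy parameters (<code>n=8</code>, <code>k=8</code>), any $k$ distinct slots are guaranteed to contain an empty one while at most $n-1$ blocks are stored; the post's point is that a small constant $k$ plus a stash suffices in general.</p>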
<p>Security for this scheme follows from the security of the encryption. If it is
indistinguishable from random (IND-CPA), then the adversary sees <span class="math">\(k\)</span> random
blocks being filled with random strings. Since everything is independent of the
IDs which the client is actually writing to, the access pattern is completely
hidden from the server.</p>
<p><strong>Efficiency:</strong> While we can achieve communication overhead of <span class="math">\(O(1)\)</span> with
the above scheme, there are still two problems: 1) the map structure on the
client is very large (<span class="math">\(O(n \log{n})\)</span>) and 2) <span class="math">\(k\)</span> needs to be rather large to
guarantee an empty block is found. The first issue can be neatly solved by
storing the map itself in an ORAM, recursively. This is a relatively common
technique, and with a trick from
[<a href="https://www.cs.umd.edu/~elaine/docs/pathoram.pdf">SDS+13</a>], we can
guarantee that this will induce only an <span class="math">\(O(1)\)</span> overhead.</p>
<p>On the other hand, as described above, <span class="math">\(k\)</span> needs to be <span class="math">\(\Omega(\lambda)\)</span> where
<span class="math">\(\lambda\)</span> is the security parameter. Since any block chosen randomly has
probability at least <span class="math">\(1/2\)</span> of being empty, to make the failure probability
<span class="math">\(O(2^{-\lambda})\)</span>, one must set <span class="math">\(k=\Omega(\lambda)\)</span>. Fortunately, we can take
advantage of the fact that, although failure rate is only low enough when <span class="math">\(k\)</span> is
large, the expected number of empty blocks that the client will find is actually
<span class="math">\(k/2\)</span>. Instead of giving up when we don't find an empty block, we can store a
stash of blocks on the client which did not make it into the array, and when we
find more than one empty block we can write out extras from the stash. This
allows us to set <span class="math">\(k=3\)</span> and, with some queueing theory analysis, maintain a stash
of size <span class="math">\(O(\lambda) = \Theta(\log n)\)</span>.</p>
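<p>A quick Monte-Carlo sketch (my experiment with hypothetical parameters, not the paper's queueing analysis) shows why a small stash suffices with $k=3$: each write queues one pending block and drains one pending block per empty slot found, so the stash size has strong negative drift.</p>

```python
import random

def simulate_stash(n=256, k=3, writes=20000, seed=7):
    """Simulate write-only ORAM writes with a client stash and return
    the high-water mark of the stash size over the run."""
    rng = random.Random(seed)
    pos = {}          # block id -> slot, or None while waiting in the stash
    occupied = set()  # slots holding current data
    stash = []        # block ids waiting for an empty slot
    max_stash = 0
    for _ in range(writes):
        bid = rng.randrange(n)            # overwrite a random logical block
        old = pos.get(bid)
        if old is not None:
            occupied.discard(old)         # old copy becomes stale
        if bid in stash:
            stash.remove(bid)             # superseded pending value
        stash.append(bid)
        pos[bid] = None
        for p in rng.sample(range(2 * n), k):
            if p not in occupied and stash:
                placed = stash.pop(0)     # oldest pending block gets the slot
                occupied.add(p)
                pos[placed] = p
        max_stash = max(max_stash, len(stash))
    return max_stash
```

<p>Since at most $n$ of the $2n$ slots are occupied, each sampled slot is empty with probability at least $1/2$, giving roughly $1.5$ expected drains per write against one arrival; in runs like this the stash stays far below $n$.</p>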
<p>In conclusion, we can achieve write-only secure Oblivious RAM with only <span class="math">\(O(1)\)</span>
overhead, and in practice very small constants. This allows fully practical use
of ORAM for the first time ever, in a reduced set of use cases.</p>
<h2 id="uses-for-writeonly-oram">Uses for write-only ORAM</h2>
<p>Okay, so we have this write-only ORAM that is pretty efficient, but what does
that really get us? In the example I gave above, you clearly need both reading
and writing. Well, this idea is very new, but I do have a few ideas. If you
want to write a lot more than you read, but still occasionally do some reads,
then you can do something like what is suggested in
[<a href="http://eprint.iacr.org/2013/694.pdf">LD13</a>] and actually use PIR to
independently read from the database. Of course, PIR is very inefficient, but
if your writes outnumber your reads by orders of magnitude, then the savings
from more efficient ORAM may outweigh the inefficient PIR. This could be
useful in data warehousing situations where you need to store files in case they
are needed in the future, but the vast majority of them will not be needed.</p>
<p>A more useful, and practical situation would be for online backup or mirroring
services. Consider Dropbox for example. Each client has a local copy of the
storage device, so when they read from it Dropbox does not get to see these
accesses. When they write, however, the client pushes the changes to the
server, which then distributes them to the other clients. The adversary in this
scenario is effectively "write-only". Using this ORAM, you could have access
pattern protection on Dropbox now at very little cost.</p>
<p>I leave the most interesting case for the end, of course. Think about encrypted
hard disks. A common scenario is that a user wishes to encrypt their disk so that if
their machine is ever lost or stolen, sensitive information on it cannot be
retrieved without the encryption key. Just as in the case above, it might be
that you leak more information than you think just through the access pattern
you induce on your hard drive, and not the encrypted data itself. Particularly,
if an adversary is able to compromise your disk on more than one occasion
(every night when you leave your computer at your desk, for instance).</p>
<h2 id="hidden-volume-encryption">Hidden volume encryption</h2>
<p>Now, I know this sounds really paranoid, but stay with me because it is pretty
interesting. There is also a notion of "hidden volume" encryption, introduced
by TrueCrypt. With this type of encryption, a user can have not one encrypted
volume on their disk, but two. The second volume lives on the portions of the
disk which are marked "free" on the first volume. This allows for a user to
actually <em>deny</em> that this second volume even exists. Why would you want to
do that? Imagine someone compromises your machine. They know that you have
<em>something</em> encrypted on there, so they coerce you (through legal or maybe
not so legal means) into revealing your password. If you have all of your
really secret information stored on your hidden volume, you can safely give up
the key to the outer volume and they will have no way of knowing whether you
have any further information to give up or not. If the coercion you are facing
is of a legal sort, they probably can't continue to pressure you with absolutely
no proof that you have any more information to give up at all. If it is of the
less legal variety, then it <em>might</em> help you, depending on the incentives
that they have to keep torturing you. There has been some
<a href="https://defuse.ca/truecrypt-plausible-deniability-useless-by-game-theory.htm">game theory analysis</a> of the situation, but it does not cover many situations.</p>
<p>Getting back to Oblivious RAM, the hidden volume approach that is incorporated
into TrueCrypt fails spectacularly when someone has access to your machine on
more than one occasion. It becomes obvious that the client is writing into a
hidden volume when the adversary sees a "free" area spontaneously and
repeatedly changing its value. The key weakness here is that the main volume
and hidden volume are neatly separated from each other, and writing into a
certain location reliably is a dead giveaway of the existence of another volume.</p>
<p>Fortunately, hiding access patterns is what ORAM was designed for. And, as I
said before, hard drives require just "write-only" security, meaning that we
can use our new, optimally efficient construction. Reference our CCS paper for
the full details, or my
<a href="http://research.microsoft.com/apps/video/default.aspx?id=231705">talk at MSR</a>, but the idea is fairly straightforward. The user initializes a number
of different ORAMs on the disk, one for every potential volume that they might
be using. They choose this number to be sufficiently larger than the number of
volumes they actually want to use, so that there is some uncertainty as to
exactly how many are really in use. For instance, I could choose a maximum
of 10 but only use 4. The goal is that the user can give up <span class="math">\(i \lt max\)</span>
passwords, and an adversary should not be able to guess whether there is another
volume beyond <span class="math">\(i\)</span> in use, or whether <span class="math">\(i\)</span> is the last volume.</p>
<p>On every access, the user writes to the volume that they actually want to change,
and they do a "dummy" operation on the other volumes, which looks identical to
a real operation but does not change anything. So, upon compromising the
machine, an adversary sees the number of operations that have been done, but not
which volumes they were on. There are some subtleties that you will have to
read the full paper for, but any access pattern that may have given away the
existence of a hidden volume is effectively protected by the ORAM.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Hopefully I have convinced you at this point that Oblivious RAM is an
interesting cryptographic primitive, and that it is on the verge of being
practical in some key situations. When it comes down to it, if you don't think
that access pattern security is an issue, then the extra cost associated with
ORAM will never be worth it to you. But if you are worried about adversaries
that could aggregate accesses and potentially learn critical private
information, then I highly encourage you to keep an eye on future research in
the area.</p>
Thoughts on Applied Cryptography Research
http://senykam.github.io/2014/12/05/thoughts-on-applied-cryptography-research
Fri, 05 Dec 2014 20:37:48 -0300http://senykam.github.io/2014/12/05/thoughts-on-applied-cryptography-research<p>If you follow me on Twitter you have no doubt heard my occasional outbursts and
rants on what I perceive to be biases in the current publication model in
cryptography. In short, I think that top cryptography conferences are heavily
biased against certain areas of cryptography and for others.</p>
<p>Some of the areas that I think have a much harder time getting into top-tier
crypto conferences include Applied Cryptography. I don't think this is
particularly controversial and, from what I hear, CRYPTO has even tried to
rectify this recently (e.g., by accepting some applied MPC papers).</p>
<p>Nevertheless, this is a serious problem for applied crypto research since
applied crypto papers don't really have a home. Realistically, your choice of
venues includes CCS, NDSS, Usenix, Oakland and Financial Crypto. Notice that all
these conferences are security conferences and as such they only have a limited
number of places for crypto research. And a consequence of this is that
competition for these few places is very high.</p>
<p>Another consequence of the current situation is that applied crypto papers are
dispersed in many different venues. In addition to the ones listed above you
also find them in Esorics, AsiaCCS, ACNS etc. This makes it very hard to keep
track of new results (having ePrint helps a bit here) and even harder to build
any kind of community since no one is at the same place at the same time. This
impacts possible collaborations, opportunities for students etc.</p>
<p>Another issue is that these papers don't get the visibility they deserve and
this is problematic because some of the work is very strong and, more
importantly, this is the work that has the most potential for impact. Think
about that for a second: the type of work that has the highest chance of having
impact on society has a harder time being accepted at top-tier conferences. How
can this not be a serious problem for the community?</p>
<p>A final consequence is that because these works do not receive the visibility
they deserve, fewer people tend to work on applied problems. I may be cynical,
but unfortunately I think many researchers choose their problems (at least in
part) as a function of the recognition they might receive from their community
for their work. I can't always blame researchers for this since the academic
system is set up to incentivize this (especially when you are a student). With the
current situation, therefore, I suspect that many researchers and students who
might be interested in (or at least open to) applied work may shy away from it
since short-term rewards like top-tier crypto papers and visibility are less
likely to materialize. An alternative way in which this could be affecting
researchers is that the pull from other areas that do get more recognition (in
the form of top-tier papers and visibility) is too hard to compete with for
applied crypto.</p>
<p>Obviously, I would like this to be fixed somehow especially so that applied
crypto can still attract good students. But my sense is that the larger crypto
community doesn't really care that much (modulo, perhaps, the alleged efforts
made by CRYPTO recently).</p>
<p>In the meantime, I thought it might be good to highlight a few applied crypto
papers written by up-and-coming researchers (i.e., mostly students, postdocs
and junior researchers) that I find particularly interesting. So I asked a
handful of young researchers if they would be interested in writing guest posts
summarizing some of their work. I'll be posting some of these summaries in a
new series that will start soon. How long the series will be will obviously
depend on how many people are interested in writing. Hopefully we will have at
least two or three.</p>
<p>Let me stress that this series will obviously be subjective and biased towards
topics that I like and people that I know (though I won't hesitate to invite
people I don't know if I like their work). This is unavoidable since these are
papers that I'm inviting to be summarized on my blog. So please do not complain
in the comments that your paper was not invited to be discussed.</p>
<p>With that out of the way, I hope you enjoy the series.</p>
Workshop on Surveillance and Technology
http://senykam.github.io/2014/11/26/workshop-on-surveillance-and-technology
Wed, 26 Nov 2014 20:22:09 -0300http://senykam.github.io/2014/11/26/workshop-on-surveillance-and-technology<p><em>This is an announcement for a workshop that I am organizing in conjunction with
the <a href="https://petsymposium.org/2015/">Privacy Enhancing Technologies Symposium</a> (PETS).</em></p>
<p><img src="http://senykam.github.io/img/camera.jpg" class="alignright" width="250">
Due to the Snowden disclosures, mass surveillance has become one of the most
highly-discussed and controversial issues in politics, policy, technology and
international affairs. Modern surveillance, however, relies heavily on
technology and, therefore, our community has a unique role to play in not only
understanding surveillance but in mitigating it when excessive and
restraining/limiting it when appropriate.</p>
<p>The <a href="https://satsymposium.org/">Workshop on Surveillance and Technology</a> (SAT)
will aim to bring together researchers and scholars in privacy, security,
cryptography, law, policy, behavioral economics and psychology to discuss all
aspects of surveillance including (but not limited to):</p>
<ul>
<li>Anonymity systems,</li>
<li>Anti-surveillance technologies,</li>
<li>Case studies of real-world surveillance and censorship,</li>
<li>Cryptographic techniques for anti-surveillance,</li>
<li>Cryptographic techniques for privacy-preserving surveillance,</li>
<li>Legal analysis of surveillance,</li>
<li>Measurement studies of surveillance activity,</li>
<li>Privacy-preserving surveillance technologies,</li>
<li>Psychological impacts and aspects of surveillance,</li>
<li>Policy implications of surveillance,</li>
<li>The economics of surveillance,</li>
<li>Surveillance and censorship,</li>
<li>Surveillance and diplomacy,</li>
<li>Surveillance and human rights,</li>
<li>Surveillance and the private sector</li>
</ul>
<p>The workshop will be held on June 29th, 2015 in Philadelphia, PA, USA. We will
have several invited speakers, one of whom is <a href="https://www.dubfire.net/">Christopher
Soghoian</a>, Principal Technologist in the Speech,
Privacy and Technology Project at the ACLU. Other invited speakers will be
announced later.</p>
<p>For more information about the workshop, including important dates, please see:
<a href="http://satsymposium.org">satsymposium.org</a>.</p>
Microsoft Research Internships
http://senykam.github.io/2014/11/13/microsoft-research-internships
Thu, 13 Nov 2014 20:09:06 -0300http://senykam.github.io/2014/11/13/microsoft-research-internships<p>One of the best things about working at MSR is the internship program. For a
sense of what an MSR internship is like, I recommend
<a href="http://www.pgbovine.net/PhD-memoir-intermission.htm">this</a> essay by
<a href="http://www.pgbovine.net/PhD-memoir-intermission.htm">Philip Guo</a>.</p>
<p>In this post, I want to personally reflect on the MSR internship program and
provide some context about how things have worked for me in the past.
<em>Let me stress that this reflects only my personal experience and may not
be representative of other MSR researchers' experiences</em>.</p>
<h2 id="what-i-love-about-the-program">What I Love About the Program</h2>
<p>Through this program, MSR researchers have the opportunity to work with the
best students in the world. With our interns, we spend 12 weeks (usually
during the summer) working hard on research. The internships are intense but
always gratifying and I like to think that most MSR interns are really happy
with their experience.</p>
<h2 id="what-i-hate-about-the-program">What I Hate About the Program</h2>
<p>As much as I love the internship program, there is one aspect of it that I
absolutely hate: namely, sending out rejections to the multitude of incredibly
talented students that applied but did not receive an offer.</p>
<p>What bothers me so much about this is that I know that there is a huge
disconnect between how students understand and interpret rejections and how
rejections actually occur. After all, I was a student not that long ago.</p>
<p>More precisely, the problem is the following. Students think that the
application process is a fair game, a meritocracy. That the "best" candidate
wins and if a candidate doesn't get the offer, it was because there were other
candidates that were "better"; for some vague, undefined notion of "better".</p>
<p>The application process, however, does not work this way. First of all, the set
of all candidates is not totally ordered; that is, most candidates cannot be
ranked in any objective sense. Counting papers is, frankly speaking, idiotic
and focusing on publication venues is meaningless when there are so many biases
in the publication system. Throw in the fact that some candidates have a huge
disadvantage because they are from a smaller university or one that is less
known in the US and you see that there is really no objective measure.</p>
<p>The other thing to realize is that simply hiring the "best" candidate (even if
you could define such a metric) is often a poor strategy from our point of
view. Often (though not always, of course) we have a sense of what
projects/area we want to pursue over the summer. Given that we already know,
roughly speaking, what we would like to do, our best strategy is to choose
someone who has the required background to bring the project to completion. So
being a good "fit" for the project is often the most important criterion; not
some vague and meaningless notion of "being good".</p>
<p>Third, it is important to realize that we make hiring decisions based on a
multitude of factors, some of which are completely outside of a student's
control. Some years, we may have a high-priority project to work on and so the
offer is made to the student with the most aligned background. Other years, we
may make an offer to a student with whom we have an ongoing collaboration so
that we can finish the collaboration. Sometimes, if we cannot decide between
candidates then we may make the offer to the most senior since that student may
be on the job market soon and the internship would be of greater value to
him/her. These are just a handful of reasons but there are more.</p>
<h2 id="what-to-do-if-you-didnt-get-an-offer">What to do if You Didn't Get an Offer</h2>
<p>The first thing is to not take a rejection personally! 90% of the time, it
is not a reflection on the quality of your work or application. The most likely
case is that the offer was made to someone else for one or more of the reasons
outlined above.</p>
<p>The second thing to realize is that having submitted an application is good!
How can it be good if you got no offer? There are at least two reasons.</p>
<p>The first is that often, when we evaluate a candidate that we conclude is not a
good fit for the projects we plan on working on, we forward the application
along to other groups who we think might be a better match. So, for example, if
I come across your application and see that you are very strong in verifiable
computation but I am not planning on working on verifiable computation this
summer, then I will forward your application to Bryan Parno. This happens a
lot! We spend a lot of time trying to find other options for people we don't
make offers to and more than once this has worked in a candidate's favor.</p>
<p>Another reason is that even if you didn't get an offer this year, you will be
on our radar and you'll most likely be considered the following year. I know of
several cases where a candidate did not get an offer one year but got one the
following year (even though they didn't re-apply!).</p>
<p>Does this mean that you will eventually get an offer? Of course not! There are
no guarantees. However, the point is that applying is not a waste of
time---even if you do not get an offer. Also, this clearly illustrates that you
should not take a rejection personally. If we go through this much trouble to
find you an alternative internship or to keep you in mind for the future, then
clearly the rejection is not a reflection of your work.</p>
How Not to Learn Cryptography
http://senykam.github.io/2014/11/11/how-not-to-learn-cryptography
Tue, 11 Nov 2014 20:36:57 -0300http://senykam.github.io/2014/11/11/how-not-to-learn-cryptography<!-- banner = "img/studying.jpg" -->
<!-- ![studying](/img/studying.jpg) -->
<p><img src="http://senykam.github.io/img/studying.jpg" class="alignright" width="250">
People often ask me how to get started in cryptography. What's interesting is
that most of the time they also want to know how I <em>personally</em> got started.
This is interesting to me because it suggests that people are looking for more
than a list of books or papers to read or set of exercises to solve; they're
really looking for a broader <em>strategy</em> on how to learn the subject. In
this post I'll discuss some possible strategies.</p>
<p>First, let me stress that I am only considering strategies for learning crypto
design and theory. Also, what I have in mind when I say "learning crypto" is
not getting to the point of understanding an average paper, but getting to the
point of generating such papers yourself (or at least the ideas in them). If
your end goal is crypto engineering then the strategies may or may not be
helpful---I'm not an expert so I can't really say either way (though I'd like
to think that improving your understanding of how primitives and protocols are
designed can be helpful).</p>
<p>I should say from the outset that the way I personally got started in
cryptography is probably one of the worst possible ways to do it. It was highly
inefficient and had a very low probability of success. This was mainly because
I didn't have the proper background when I started and I didn't have the right
resources at my disposal. Both are very important, and one of two things is
likely to happen if you don't have them: (1) it will take you so
long that you'll get fed up and give up; or (2) you'll become a crank (and
believe me, there are a ton of cranks out there selling crypto products).</p>
<p>When devising and implementing your strategy, you should keep these outcomes in
mind because it will be very important to avoid them at all costs.</p>
<h2 id="how-to-do-it">How to Do It</h2>
<p>The best strategy for learning crypto design and theory is to get a Ph.D. at a
university with a cryptography group. Getting a Ph.D. in some random field like
mechanical engineering or biology does not count! If you are interested in
symmetric cryptography (i.e., block cipher and hash function design and
cryptanalysis), then European universities are a good place to start since a
large fraction of the experts are there. If you're interested in crypto theory
then the US or Israel. Of course there are strong groups in each area
everywhere.</p>
<p>If you have found a university and are trying to evaluate how good the group is, then a
<em>very rough</em> sanity check is to look at their publication record. If this
is a theory group then you should be looking for CRYPTO, Eurocrypt, Asiacrypt,
TCC, FOCS, STOC publications. If this is a more applied group, then you should
be looking for publications at CCS, CHES, IEEE Security and Privacy (also known
as Oakland) and Usenix Security. CRYPTO, Eurocrypt and Asiacrypt are not
particularly good indicators of quality for applied crypto. If this is a
symmetric crypto and cryptanalysis group then you should look for papers at
Fast Software Encryption (FSE) and Selected Areas in Cryptography (SAC).
Similarly to applied crypto, CRYPTO, Eurocrypt and Asiacrypt are not
necessarily good indicators of quality in this area.</p>
<p>You shouldn't get too caught up in this, however. The publication system in
cryptography is screwed up so you shouldn't necessarily dismiss group $A$ because
it has fewer STOC papers than group $B$, or fewer CCS papers than group $C$. This is
just a very coarse metric that---absent any other signals---can be used to
distinguish between very good groups and very bad ones. Another good thing to check
is where the students that graduate from that group end up. Do they end up with
jobs that you would like?</p>
<p>So why is getting a Ph.D. from a good group the best strategy? Simply because
it is the most efficient way to learn the material. The background needed for
crypto is not part of a traditional education, neither in math nor in computer
science, so it's unlikely that you'll have learned what you need in undergrad.
So you have two choices: (1) learn it on your own; or (2) learn it in graduate
school.</p>
<p>In grad school you will have a set of classes carefully chosen and prepared for
you. You'll have an advisor that will guide you through the process, telling
you what you need to learn, what you don't need to learn, what your weaknesses
are, what you need to improve, what problems to work on and the best strategies to
solve those problems. You'll also have fellow students that will help and
motivate you throughout.</p>
<p>Note that for most Ph.D. programs in computer science you don't have to pay
anything. Your tuition is taken care of by the department or by your advisor's
grants. In addition, you receive a stipend which takes care of housing, food
etc. So if you're in a position to devote 5 years of your life to learning
cryptography, then I think grad school in a crypto group is by far the best
strategy.</p>
<h2 id="how-not-to-do-it">How Not to Do It</h2>
<p>So you can't go to grad school or you can but somewhere without a crypto group
and you still really want to learn crypto design and theory. Here is one possible
strategy---the one I used.</p>
<p>I'll assume you have a standard systems-focused computer science undergrad
degree. In my case, for example, I had a strong systems background in undergrad
(e.g., compilers, OS, networking, architecture) and a very weak theory
background (just calculus, intro to algorithms and a linear algebra class so
bad no one ever attended). To be brutally honest, this kind of background is
useless for cryptography, and if this is where you're starting from, then you
have to understand that you'll be starting from scratch.</p>
<p>There are three things you should be shooting for: (1) developing mathematical
maturity; (2) learning how to debug; and (3) acquiring the basics.</p>
<p>By mathematical maturity, I mean the ability to understand and use basic
mathematical language, notation and concepts. It's basically having the right
context in place for doing math: knowing how to parse mathematical statements
and proofs and, generally speaking, knowing how to read between the lines and
fill in the missing pieces.</p>
<p>By debugging, what I mean is that you have to get to a point where you can
reliably tell whether you have fully understood some idea or not. When you are
starting out and working alone, this is extremely difficult especially for an
area like cryptography which can be so subtle. If you don't acquire this skill,
however, you will end up a crank: that is, someone that has read a lot,
understood very little, and is completely unaware of how confused and wrong
they are. Many people who are self-taught end up like this so you have to be
careful.</p>
<p>The problem with most of the advice given for learning a hard subject is that
it focuses on the third item, typically by pointing to papers or books. But
papers and books are useless if you don't have the first two skills.</p>
<h2 id="acquiring-mathematical-maturity">Acquiring Mathematical Maturity</h2>
<p>Of course, the easiest way to acquire mathematical maturity is to get an
undergraduate education in math. <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup></p>
<p>Maturity is probably the skill that takes the longest to acquire. Math and
theoretical areas of computer science are expressed through definitions, theorems
and proofs. A definition is a precise description of some object or process. A
theorem is a precise statement concerning some object or process and a proof is
an argument as to why the statement is true. You should be comfortable with
this paradigm because everything you will see further down the line will be
expressed this way. But understanding this paradigm means you'll have to be
comfortable with basic notions like quantifiers (i.e., existential and universal),
basic proof structures (e.g., direct and by contradiction), basic logic,
elementary probability, etc.</p>
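As a small worked example of the kind of manipulation this involves (standard logic, not anything specific to this post), here is how a negation passes through quantifiers:

```latex
% Negating a doubly quantified statement: each quantifier flips
% and the negation lands on the inner predicate.
\[
  \neg \bigl( \forall x \, \exists y : P(x,y) \bigr)
  \iff
  \exists x \, \forall y : \neg P(x,y)
\]
% Example: the negation of "every x has some y with x < y"
% is "there is an x with x >= y for every y".
```

Being able to carry out this kind of mechanical step without thinking is exactly what makes longer proofs readable.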
<p>By comfortable, I don't mean a casual, superficial understanding of these things.
What I mean is you should be able to properly formulate definitions, theorem
statements and proofs yourself and be able to understand why some formulations are
better than others.</p>
<p>You shouldn't think of mathematical formalisms as pedantic, boring and academic.
Yes, in some cases they can be overkill because you may have a good intuitive
understanding of an idea, but there will be times where your intuition
fails and that's when having a good grasp of the formal approach will help you.
Cryptography, in particular, is very unintuitive so formalism is even more
important---especially when you are starting out.</p>
<p>Most books on cryptography will not help you acquire mathematical maturity
because it is assumed that the reader has it. If you are coming from a purely
systems background though, you may not have had the opportunity to develop it
(as was my case, for example). And reading math books is usually even worse
since mathematicians learn this stuff very early on.</p>
<p>So what can you do? The approach I took was to just read everything I could
find in math, theoretical computer science and cryptography. Once in a while, I
would get lucky and find a paper with a decent explanation of some basic
concept (e.g., some basic probability argument or a slightly more detailed
proof structure) but most of the time I had to reconstruct the missing
pieces and context on my own.</p>
<p>Obviously, this is easy to do when you have the basics but it is incredibly
difficult and frustrating when you don't. As you can imagine it took forever
to fill in the gaps in my knowledge. Therefore, the ideal approach would be
to find a book or lecture notes that focus on this stuff. And---luckily for
you---Timothy Gowers has written an excellent series of blog posts on these
very things so you should read them:</p>
<ol>
<li>Basic Logic
<ol>
<li><a href="http://gowers.wordpress.com/2011/09/25/basic-logic-connectives-and-and-or/">And & Or</a></li>
<li><a href="http://gowers.wordpress.com/2011/09/26/basic-logic-connectives-not/">Not</a></li>
<li><a href="http://gowers.wordpress.com/2011/09/28/basic-logic-connectives-implies/">Implies</a></li>
<li><a href="http://gowers.wordpress.com/2011/09/30/basic-logic-quantifiers/">Quantifiers</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/02/basic-logic-relationships-between-statements-negation/">Negation</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/05/basic-logic-relationships-between-statements-converses-and-contrapositives/">Converse and contrapositive</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/07/basic-logic-tips-for-handling-variables/">Handling variables</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/09/basic-logic-summary/">Summary</a></li>
</ol></li>
<li>Functions
<ol>
<li><a href="http://gowers.wordpress.com/2011/10/11/injections-surjections-and-all-that/">Injections, surjections, etc.</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/13/domains-codomains-ranges-images-preimages-inverse-images/">Co-domains, ranges, images</a></li>
</ol></li>
<li><a href="http://gowers.wordpress.com/2011/10/16/permutations/">Permutations</a></li>
<li>Definitions
<ol>
<li><a href="http://gowers.wordpress.com/2011/10/23/definitions/">Definitions</a></li>
<li><a href="http://gowers.wordpress.com/2011/10/25/alternative-definitions/">Alternative definitions</a></li>
</ol></li>
<li><a href="http://gowers.wordpress.com/2011/10/30/equivalence-relations/">Equivalence relations</a></li>
</ol>
<h2 id="debugging">Debugging</h2>
<p>Being able to detect whether you've made a mistake is an important and difficult
skill to acquire in any subject. This is exacerbated in security and
cryptography since we cannot ascertain the security of something experimentally.
Luckily, in crypto we do have a methodology for debugging: namely,
<em>provable security</em>. The provable security paradigm (or more appropriately,
the reductionist paradigm) consists of the following steps. One first formulates
a security definition that captures the security properties/guarantees that are
expected from the system. Then, one describes a cryptographic scheme/protocol
for the problem at hand. Finally, one proves that the scheme/protocol satisfies
the security definition (usually, under some assumption).</p>
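To make the three steps concrete, here is a toy sketch in Python (the scheme and adversary are made up for illustration and are not from any paper mentioned here) of a game-based definition acting as a debugging tool: an indistinguishability experiment immediately flags a deterministic scheme, while a randomized one passes this particular test.

```python
import secrets

BLOCK = 16  # message and key length in bytes

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def det_encrypt(key, m):
    # A broken, deterministic "cipher": XOR the message with the key.
    # Equal messages always produce equal ciphertexts.
    return xor(key, m)

def rand_encrypt(key, m):
    # A toy randomized scheme: draw a fresh pad r on every call and
    # output (r XOR key, r XOR m). Illustration only, not a real design.
    r = secrets.token_bytes(BLOCK)
    return xor(r, key) + xor(r, m)

def ind_game(encrypt, adversary, trials=200):
    # Indistinguishability experiment: the challenger picks a key and a
    # secret bit b, encrypts m_b, and the adversary (given oracle access)
    # must guess b. Returns the adversary's empirical success rate.
    wins = 0
    for _ in range(trials):
        key = secrets.token_bytes(BLOCK)
        b = secrets.randbelow(2)
        oracle = lambda m: encrypt(key, m)
        m0, m1 = b"A" * BLOCK, b"B" * BLOCK
        challenge = oracle(m1 if b else m0)
        wins += adversary(oracle, m0, m1, challenge) == b
    return wins / trials

def equality_adversary(oracle, m0, m1, challenge):
    # Re-encrypt m0: under a deterministic scheme, a match reveals b = 0.
    return 0 if oracle(m0) == challenge else 1

print(ind_game(det_encrypt, equality_adversary))   # 1.0: the definition catches the flaw
print(ind_game(rand_encrypt, equality_adversary))  # close to 0.5: no better than guessing
```

If you were to attempt a security proof for `det_encrypt`, it would break down exactly where the adversary's oracle answers must look independent of the challenge; this is the kind of subtle weakness the paradigm surfaces.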
<p>The provable security paradigm originated in the 1980s and has been used ever
since in the cryptography community to analyze the security of many primitives.
There are many benefits to this paradigm but one of the main ones is that it is
a great debugging tool. When trying to prove the security of your primitive, you
will sometimes find that the proof will not go through for some reason and, more
often than not, it is because of a subtle weakness in your protocol that you did
not pick up when first designing it.</p>
<p>I want to stress that the provable security paradigm is not foolproof and that
it has its limits. For example, there are entire areas of cryptography like
block cipher and hash function design where its usefulness has, historically,
been very limited. Also, problems can occur if the definition being used is
wrong or too weak for the application being considered. And, of course, there
could be errors in the proofs of security. So the framework should be used
with these limitations in mind because a blind adherence to it could lead you
astray.</p>
<p>In my opinion the best place to start learning the provable security paradigm
(and crypto in general) is the textbook
<a href="http://www.amazon.com/Introduction-Cryptography-Chapman-Network-Security/dp/1466570261/ref=sr_1_1?ie=UTF8&qid=1415685096&sr=8-1&keywords=katz+lindell"><em>Introduction to Modern Cryptography</em></a>
by Jonathan Katz and Yehuda Lindell. I really wish this book was out when I was
learning crypto because it would have saved me a huge amount of time. The book
teaches you all the basics of cryptography while explaining how security
definitions work and how to prove various constructions secure. Unlike many
mathematically-inclined books it goes over the details of proofs and doesn't
just leave everything as an exercise (which can be incredibly frustrating for
people who are trying to learn the material alone and without any background).
After Katz-Lindell, I would recommend <em>Foundations of Cryptography</em> Vol.
<a href="http://www.amazon.com/Foundations-Cryptography-1-Basic-Tools/dp/0521035368/ref=sr_1_2?ie=UTF8&qid=1415685162&sr=8-2&keywords=oded+goldreich">1</a> and
<a href="http://www.amazon.com/Foundations-Cryptography-Volume-Basic-Applications/dp/052111991X/ref=pd_bxgy_b_img_y">2</a>
by Oded Goldreich. These texts, however, are a <em>lot</em> more advanced and you
likely won't need the material unless you are doing research.</p>
<h2 id="learning-the-basics">Learning the Basics</h2>
<p>Of course, another crucial step is learning the basics. The simplest thing to do
here is to just read Katz-Lindell. In addition, you can also watch Jonathan Katz's and
Dan Boneh's MOOCs, which are
<a href="https://www.coursera.org/course/cryptography">here</a> and
<a href="https://www.coursera.org/course/crypto">here</a>, respectively.</p>
<h2 id="putting-it-all-together">Putting it All Together</h2>
<p>So you've read Timothy Gowers' blog posts and acquired the basic mathematical
concepts, you've read Katz-Lindell and understood the basics of provable
security, and you've watched the MOOCs so you know all the basic cryptographic
primitives and what they are used for. At this point you should be able to read
crypto papers and follow along. What you may not be able to do, however, is
design and analyze your own crypto protocols.</p>
<p>To make the jump from understanding other people's work to creating your own, I
think the only thing you can really do is to formulate your own problem and try to
solve it. Whether you succeed is not important; what matters is that you will
be applying everything you learned at once and this will force you to understand how
these ideas relate to each other and interact.</p>
<p>While I think it's a good idea to work on your own problems at this stage to
gain experience in applying what you've learned,
it is very important to keep in mind that <em>you don't know what you're doing yet</em>.
In particular, you may have gained a false sense of confidence after reading the books and
watching the MOOCs so if you're not careful you'll be headed down the
path of crankdom. To avoid this, it is crucial that you get feedback on your
ideas from people who are more experienced than you. This is not optional; it is
crucial! <sup class="footnote-ref" id="fnref:2"><a class="footnote" href="#fn:2">2</a></sup></p>
<p>But how do you get experts to give you feedback if you don't know any? This is a
difficult question that I faced as well at one point. Here's the trick I used. I
basically got to the point where I could hold a semi-intelligent conversation
with a professional cryptographer. This does not mean that I could impress them.
Just that I knew enough of the basic concepts and techniques that I could have a
reasonable 10-minute conversation about some crypto paper I had read. Once I
could do this, I tried my luck. For example, I attended crypto seminars at
universities close by. This led to me talking about research with professors
there and eventually starting to work on projects together.</p>
<p>What is important to realize here is that people---especially successful
people---are very busy and they just don't have the time to teach you
cryptography. If they are professors, then they already have students they are
working with and if they work in industry then they have interns and an
employer they are committed to. So if you want to learn from them you should
have something to offer.</p>
<p>But what can you offer if you are just starting out? Well, if you think about
it you have one thing that they don't: namely, <em>time</em>. Remember that these
experts are very busy so they probably have a ton of project ideas they would
like to work on but that will never see the light of day. What you can offer
to them is your time. You can start by implementing their ideas and evaluating
them experimentally (this is assuming you have a strong engineering
background). By doing this you are providing value to them and, most
importantly, you get a chance to demonstrate that you have a good work ethic,
that you are committed and that you are easy to work with. On your end, you
will learn and internalize their ideas better and put yourself in a position to
possibly improve upon them. Once you have a good working relationship and some
preliminary ideas on how to improve their work, you are well on your way.</p>
<h2 id="conclusions">Conclusions</h2>
<p>So these were my high-level strategies for learning cryptography. If you can,
just get a Ph.D. at a place with a good crypto group (remember that Ph.D.'s in
computer science are effectively free). If you really can't do that for some
reason, then you can try out the second strategy I outlined. But you should
realize that it will be painful.</p>
<p>Good luck!</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">A math education will teach you the building blocks from which most cryptographic protocols are built (e.g., number theory, algebra etc.) but it won't teach you specifically how to design crypto primitives and protocols or how to understand and analyze their security.
<a class="footnote-return" href="#fnref:1">↩</a></li>
<li id="fn:2">At one point, when I was just starting to learn crypto, I wrote up some ideas I had. Someone I knew agreed to make an introduction to a well-known cryptographer so I could send him my ideas. After reading them, he (very politely) told me that what I was doing made no sense, explained why, and then (again very politely) proceeded to explain why working together would be too difficult given the stage I was at. This was (by far) one of the most important moments in my development. The small amount of feedback he provided made me realize that I had acquired a false sense of confidence and that I still had a <em>huge</em> amount of work to do! Looking back, this was invaluable and I'm grateful to him to this day.
<a class="footnote-return" href="#fnref:2">↩</a></li>
</ol>
</div>
Microsoft Research SVC and Applied Theory
http://senykam.github.io/2014/10/02/microsoft-research-svc-and-applied-theory
Thu, 02 Oct 2014 20:34:21 -0300http://senykam.github.io/2014/10/02/microsoft-research-svc-and-applied-theory<p>Most people have heard by now about the closing of the Microsoft Research
Silicon Valley Campus (SVC) Lab. It definitely came as a shock to everyone
(including other MSR researchers) and many people have commented online about
<a href="http://windowsontheory.org/2014/09/19/farewell-microsoft-research-silicon-valley-lab/">what the lab meant to
them</a>
and about all the great research that came out of it.</p>
<p>There is something else about MSR SVC, however, that I have always appreciated
besides its great contributions to distributed systems and privacy. It was a
lab that was incredibly successful at what I would call "applied theory"
research. What I mean by this is research that is motivated by real-world
problems but that addresses these problems by developing theoretical insights,
models and techniques. Note that this is very different (in my mind at least)
from "theoretical theory" which is often more motivated by making advances on
long-standing open and difficult problems (independently of the initial
motivation).</p>
<p>Applied theory is difficult to carry out and there really aren't many people
out there that can do it well. I think the main reason is that it requires
researchers (or at least teams) that are really interdisciplinary and have a
deep understanding of both practice/systems and theory. Finding these people is
very hard and I know this first-hand since I've now been on a non-trivial
number of hiring/interview committees. The amazing thing about SVC is that it
seemed full of applied theory researchers, teams, and projects. This always
surprised and pleased me because I had never seen that anywhere else.</p>
<p>So, all this to say that MSR SVC will be greatly missed and that I hope that
all the brilliant researchers there (applied theorists and otherwise!) will
find great homes.</p>
How to Search on Encrypted Data: Searchable Symmetric Encryption (Part 5)
http://senykam.github.io/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5
Thu, 21 Aug 2014 17:33:58 -0300http://senykam.github.io/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5<p><em>This is the fifth part of a series on searching on encrypted data. See parts <a href="https://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-1/">1</a>, <a href="https://outsourcedbits.org/2013/10/30/how-to-search-on-encrypted-data-part-2/">2</a>, <a href="https://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-3/">3</a> and <a href="https://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-part-4-oblivious-rams/">4</a>.</em></p>
<p><img src="http://senykam.github.io/img/search.jpg" class="alignright" width="250">
In the previous post we covered the most secure way to search on encrypted
data: oblivious RAMs (ORAM). I always recommend ORAM-based solutions for
encrypted search whenever possible; namely, for small- to moderate-size data
<sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup>. Of course, the main limitation of ORAM is efficiency so this motivates us
to keep looking for additional approaches.</p>
<p>The solution I discuss in this post is <em>searchable symmetric encryption</em> (SSE).
For readers who are not familiar with this area, let me stress that this has
<em>nothing</em> to do with CipherCloud's <a href="http://www.ciphercloud.com/company/about-ciphercloud/press-releases/ciphercloud-delivers-breakthrough-searchable-strong-encryption/">searchable strong
encryption</a>.
I don't know why CipherCloud chose to call its "breakthrough" product SSE. No
one knows exactly what CipherCloud does at the crypto level, but everything
points to some form of tokenization which, as far as I know, is an industry
term for deterministic encryption. This is neither a breakthrough nor, for that
matter, really secure. But that's the last thing I'll say about CipherCloud
here; every reference to SSE that follows is to searchable symmetric
encryption.</p>
<p>SSE was first introduced by Song, Wagner and Perrig
[<a href="http://www.cs.berkeley.edu/~dawnsong/papers/se.pdf">SWP00</a>]. SSE tries to
achieve the best of all worlds. It is as efficient as the most efficient
encrypted search solutions (e.g., deterministic encryption) but provides a lot
more security.</p>
<h2 id="the-security-of-encrypted-search">The Security of Encrypted Search</h2>
<p>One of the most interesting aspects of encrypted search from a research point
of view has to do with security definitions; that is, what does it mean for an
encrypted search solution to be secure? This is not an obvious question and I
talked about this a bit in the previous post on
<a href="http://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-4-oblivious-rams/">ORAM</a>.</p>
<p>The first paper to explicitly address this question was an important paper by
Eu-Jin Goh [<a href="https://eprint.iacr.org/2003/216.pdf">Goh03</a>] <sup class="footnote-ref" id="fnref:2"><a class="footnote" href="#fn:2">2</a></sup> who was a
graduate student at Stanford at the time. This paper had many contributions but
one of the most important ones was simply to point out that SSE schemes were
not normal encryption schemes and, therefore, the standard notion of
CPA-security was not meaningful/relevant for SSE. The problem is essentially
that when an adversary interacts with an SSE scheme he has access to more than
an encryption oracle; he also has access to a search oracle. Goh's point was
that this had to be captured in the security definition otherwise it was
meaningless.</p>
<p>To address this, he proposed the first security definition for SSE. Roughly
speaking, the definition guaranteed that given an EDB and the encrypted
documents, the adversary would learn nothing about the underlying documents
beyond the search results <em>even if it had access to a search oracle</em>. Let
me highlight a few things about Goh's definition: (1) it was a game-based
definition; and (2) it did not provide query privacy (i.e., no privacy
guarantees for user queries). <sup class="footnote-ref" id="fnref:3"><a class="footnote" href="#fn:3">3</a></sup> A follow-up paper by Chang and Mitzenmacher
[<a href="https://www.eecs.harvard.edu/~michaelm/postscripts/acns2005.pdf">CM05</a>]
proposed a new definition that was simulation-based and that guaranteed query
privacy in addition to data privacy.</p>
<p>I won't go into details, but simulation-based definitions have some
advantages over game-based definitions and, generally speaking, are preferable
and can be easier to work with---especially when composing various primitives to
build larger protocols.</p>
<p>So we're done right? Not exactly.</p>
<p>During this time, Reza Curtmola, Juan Garay, Rafail Ostrovsky and myself were
also thinking about SSE and one of the things we noticed while thinking
about the security of SSE schemes was that the previous security definitions
didn't seem to really capture what was going on. There were primarily two
issues: (1) the definitions were (implicitly) restricting the adversary's
power; and (2) they didn't explicitly capture the fact that the constructions
were leaking information.</p>
<p><strong>Adaptivity.</strong>
The first problem was that in these definitions, the adversary was never given
the search tokens, the EDB or the results of its searches. The implication of
this was that---in the definition---the adversary could not choose its search
oracle queries as a function of the EDB, the tokens or previous search
results. In other words, its behavior was being implicitly restricted to
making <em>non-adaptive</em> queries to its search oracle. This was clearly an
issue because in the real-world the adversary we are trying to protect against
is a server that stores the EDB, that receives tokens from the client and
that sees the results of the search. So if we allow this adversary to
query a search oracle, then we also have to allow him to query the oracle as a
function of the EDB, the tokens and previous search results.<br>
More concretely, this captures a form of attack where the server crafts some
clever oracle queries based on the EDB, the tokens or previous search results.</p>
<p>Now let's take a step back. At this point---unless you are a
cryptographer---you are likely thinking something to the effect of: "this
sounds contrived and, honestly, I can't see how one could craft queries of this
form that would lead to an actual attack. This is all academic!".
I know this because, unfortunately, I've heard this many times over the years.</p>
<p>But this is roughly the reaction people have every time cryptographers point
out that an adversarial model needs to be strengthened. Usually, what happens
is the following: (1) non-cryptographers ignore this and build their systems
using primitives that satisfy the weaker model because they don't believe the
stronger attacks are realistic; (2) someone comes along and carries out some
form of the stronger attack; and (3) the systems need to be re-designed and
patched. This has happened in the cases of encryption (CPA- vs. CCA2-security)
and key exchange.</p>
<p>In any case, having observed this, we wrote about it in the following paper
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] and proposed a new and
stronger definition where the adversary was allowed to generate its queries as
a function of the EDB, the tokens and previous search results. We called this
<em>adaptive</em> security and gave two formulations of this definition: one
game-based and one simulation-based. This turned out to be quite interesting
from a theoretical point of view because the simulation-based formulations were
slightly stronger than the game-based formulations, which is not the case for
the standard notion of CPA-security <sup class="footnote-ref" id="fnref:4"><a class="footnote" href="#fn:4">4</a></sup>.</p>
<p>Now, to be honest, I do not know of an explicit attack on a concrete SSE
construction that takes advantage of adaptivity. But that shouldn't matter anymore
because we now know how to construct adaptively-secure SSE schemes that are as
efficient as non-adaptively-secure ones. So there is no excuse for not using
an adaptively-secure scheme. Another important reason to consider adaptive
security is for situations where SSE schemes are used as building blocks in
larger protocols. In these kinds of situations, the primitive can be used in
unorthodox ways which open up subtle new oracles that one may not have
considered when designing the primitive for its more standard uses.</p>
<p>This exact issue comes up in a paper I wrote recently
[<a href="http://research.microsoft.com/en-us/um/people/senyk/pubs/metacrypt.pdf">K14</a>]
that combines structured encryption (which is a form of SSE) with secure
multi-party computation to design a private alternative to the NSA metadata
program. In this case, it turns out that the adversary for the larger protocol
(i.e., the NSA analyst) can easily influence the inputs to the underlying SSE
scheme and implicitly carry out adaptive attacks on it. So in this case, it is
crucial that whatever structured encryption scheme is used be adaptively-secure.</p>
<p><strong>Leakage.</strong> Another important issue that was overlooked in previous work was leakage. As I've
discussed in previous posts, non-ORAM solutions leak some information.
Everyone was basically aware that SSE revealed the search results
(i.e., the identifiers of the documents that contained the keyword). This was
the whole point of SSE and most people believed that this was why it was more
efficient than ORAM. <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">5</a></sup> But this was not treated appropriately. In addition,
we also pointed out in [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] that all
the known SSE constructions leaked more than the search results. In particular,
they also revealed whether a search query was being repeated. This was very
easy to see by just looking at the constructions: the search tokens were
usually the output of a PRF applied to the keyword being searched for.</p>
<p>The main problem was that the definitions did not capture any of this <sup class="footnote-ref" id="fnref:6"><a class="footnote" href="#fn:6">6</a></sup>. To address
it we decided to treat leakage in SSE more formally and to capture it very
explicitly in our security definitions. Our thinking was that leakage was an
integral part of SSE (since it seemed to be one of the reasons why SSE was so
efficient) and that it deserved to be properly studied and understood. At this
stage we only really considered two types of leakage: the access pattern and
the search pattern. The access pattern is basically the search results (the
identifiers of the documents that contain the keyword) and the search pattern is
whether a search query is repeated. At the time these were the only leakages
that had appeared in the literature. In a later paper with Melissa Chase
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>], we generalized the
definitional approach of [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] so
that the definition could include <em>any</em> kind of leakage.</p>
<p>Leakage is of course undesirable from a security point of view, but it is
fascinating from a research point of view. I hope to discuss this further in later
posts. For the purposes of this discussion, I'll just point out that there are
(mostly) two kinds of leakages: setup leakage, which is revealed just by the
EDB; and query leakage, which is revealed by a combination of the EDB and a
token. One of the main issues with any solution based on deterministic
encryption or, more generally, on property-preserving encryption is that they
have a high degree of setup leakage: their EDBs have non-trivial leakage. In
that sense, SSE-based solutions are better because their setup leakage is
usually minimal/trivial and the non-trivial leakage is only query leakage which
is controlled by the client since queries can only be executed with knowledge of
the secret key.</p>
<p><strong>Summing up.</strong>
So in the end, what we tried to argue in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] was that what we should be
asking for from an SSE security definition is a guarantee that:</p>
<blockquote>
<p><em>the adversary cannot learn anything about the data and the queries beyond
the explicitly allowed leakage; even if the adversary can make adaptive
queries to a search oracle.</em></p>
</blockquote>
<p>But once we settled on this definition and formalized it, the following natural
problems came up: (1) how do we distinguish between reasonable and
unreasonable leakage?; and (2) is it even possible to design SSE schemes that
are adaptively-secure? <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">7</a></sup></p>
<p>Initially, the answers to these questions weren't obvious to us. We thought
about them for a while and eventually answered the second question by finding an
SSE construction that was adaptively-secure. Unfortunately, while the scheme had
optimal asymptotic search complexity, it was not really practical. But at least
we knew adaptive security was achievable---though we did not know whether it was
achievable efficiently.</p>
<p>We didn't really have any answer for the first question. In fact, we still
don't. We don't really have a good way to understand and analyze the leakage of
SSE schemes. For now, the best we can do is to try and describe it precisely.</p>
<h2 id="searchable-symmetric-encryption">Searchable Symmetric Encryption</h2>
<p>There are many variants of SSE (see this paper
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] for a discussion)
including interactive schemes, where the search operation is interactive (i.e.,
a two-party protocol); and response-hiding schemes, where search results are
not revealed to the server but only to the client. I'll focus on
non-interactive and response-revealing schemes here because they were the first
kind of SSE scheme considered and also because they are very useful as
building blocks for more complex constructions and protocols. It also happens
that they are the most difficult to construct.</p>
<p>In our formulation we will
ignore the document collection itself and just assume that the individual
documents are encrypted using some symmetric encryption scheme and that the
documents each have a unique identifier that is independent of their content (so
that knowing the identifier reveals nothing about a file's contents).</p>
<p>We assume that the client processes the data collection <span class="math">\(\textbf{D} = (D_1,
\dots, D_n)\)</span> and sets up a "database" <span class="math">\({\sf DB}\)</span> that maps every keyword <span class="math">\(w\)</span> in the
collection to the identifiers of the documents that contain it. Recall that in
our context, we use the term database loosely to refer to a data structure
optimized for keyword search (i.e., a search structure). For a keyword
<span class="math">\(w\)</span>, we'll write <span class="math">\({\sf DB}[w]\)</span> to refer to the list of identifiers of documents that
contain <span class="math">\(w\)</span>.</p>
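<p>To make the preprocessing concrete, here is a minimal sketch of how one might build <span class="math">\({\sf DB}\)</span> from a toy document collection. The document identifiers and keywords are, of course, illustrative:</p>

```python
# Sketch: building the (plaintext) inverted index DB from a toy
# collection. DB maps each keyword w to the identifiers DB[w] of the
# documents that contain it.
from collections import defaultdict

def build_db(docs):
    """docs: dict mapping document identifier -> list of keywords."""
    db = defaultdict(list)
    for doc_id, keywords in sorted(docs.items()):   # sorted for determinism
        for w in sorted(set(keywords)):
            db[w].append(doc_id)
    return dict(db)

docs = {
    "id1": ["encrypted", "search"],
    "id2": ["encrypted", "oram"],
    "id3": ["search"],
}
DB = build_db(docs)
```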
<p>A non-interactive and response-revealing SSE scheme <span class="math">\(({\sf Setup}, {\sf Token}, {\sf Search})\)</span> consists of</p>
<ul>
<li><p>a <span class="math">\({\sf Setup}\)</span> algorithm run by the client that takes as input a security
parameter <span class="math">\(1^k\)</span> and a database <span class="math">\({\sf DB}\)</span>; it returns a secret key <span class="math">\(K\)</span> and an
encrypted database <span class="math">\({\sf EDB}\)</span>;</p></li>
<li><p>a <span class="math">\({\sf Token}\)</span> algorithm also run by the client that takes as input a secret key
<span class="math">\(K\)</span> and a keyword <span class="math">\(w\)</span>; it returns a token <span class="math">\({\sf tk}\)</span>;</p></li>
<li><p>a <span class="math">\({\sf Search}\)</span> algorithm run by the server that takes as input an encrypted
database <span class="math">\({\sf EDB}\)</span> and a token <span class="math">\({\sf tk}\)</span>; it returns a set of identifiers <span class="math">\({\sf DB}[w]\)</span>.</p></li>
</ul>
<p>In addition to security, of course, the most important thing we want from an SSE
solution is low search complexity.<br>
Fast, for our purposes, will mean <em>sub-linear</em> in the number
of documents and, ideally, linear in the number of documents that contain the
search term. Note that the latter is optimal since at a minimum the server
needs to fetch the relevant documents just to return them.</p>
<p>Requiring sub-linear search complexity is <em>crucial</em> for practical purposes.
Unless you are working with a very small dataset, linear search is just not
realistic---try to imagine if your desktop search application or email search
function did sequential search over your hard drive or email collection
<em>every time you searched</em>. Or if your favorite search engine sequentially
scanned the entire Web every time you performed a web search <sup class="footnote-ref" id="fnref:7"><a class="footnote" href="#fn:7">8</a></sup>.</p>
<p>The sub-linear requirement has consequences, however. In particular it means
that we must be willing to work in an offline/online setting where we run a
one-time (linear) pre-processing phase to setup a search structure so that we
can then execute search queries on the data structure in sub-linear time.
And this is exactly the approach we'll take.</p>
<h2 id="the-inverted-index-solution">The Inverted Index Solution</h2>
<p>The particular solution I describe here is referred to as the <em>inverted
index solution</em> and was proposed in the same
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] paper in which we studied the
security of encrypted search. This is a good construction to understand for
several reasons: (1) it is the basis of almost all subsequent SSE
constructions; and (2) many of the tricks and techniques that are used in
recent SSE schemes (and the more general setting of structured encryption)
originated in this construction.</p>
<p><strong>Setup.</strong>
The scheme makes use of a symmetric encryption scheme <span class="math">\(({\sf Gen}, {\sf Enc}, {\sf Dec})\)</span>, of a
pseudo-random function (PRF) <span class="math">\(F: \{0,1\}^k \times W \rightarrow \{0,1\}^k\)</span> and
of a pseudo-random permutation (PRP) <span class="math">\(P: \{0,1\}^k \times W \rightarrow \{1,
\dots, |W|\}\)</span>. To setup the EDB, the client first samples two <span class="math">\(k\)</span>-bit keys
<span class="math">\(K_{\sf T}\)</span> and <span class="math">\(K_{\sf R}\)</span> for <span class="math">\(F\)</span> and <span class="math">\(P\)</span>, respectively. It then creates two arrays
<span class="math">\({\sf T}\)</span> and <span class="math">\({\sf RAM}_1\)</span>. For all keywords <span class="math">\(w \in W\)</span>, the client builds a list for
<span class="math">\({\sf DB}[w]\)</span> and stores the nodes in <span class="math">\({\sf RAM}_1\)</span>. More precisely, for every keyword <span class="math">\(w
\in W\)</span> and every <span class="math">\(1 \leq i \leq |{\sf DB}[w]|\)</span>, it stores</p>
<p><span class="math">\[
{\sf N}_{w,i} = \bigg\langle {\sf id}_{w,i}, {\sf ptr}_1(w, i+1) \bigg\rangle
\]</span></p>
<p>in <span class="math">\({\sf RAM}_1\)</span>, where <span class="math">\({\sf id}_{w,i}\)</span> is the <span class="math">\(i\)</span>th identifier in <span class="math">\({\sf DB}[w]\)</span> and
<span class="math">\({\sf ptr}_1(w, i+1)\)</span> is the address (in <span class="math">\({\sf RAM}_1\)</span>) of the <span class="math">\((i+1)\)</span>th identifier in
<span class="math">\({\sf DB}[w]\)</span>. Of course, <span class="math">\({\sf ptr}_1(w, |{\sf DB}[w]| + 1) = \bot\)</span>.</p>
<p>It then randomly permutes the locations of the nodes; that is, it creates
a new array <span class="math">\({\sf RAM}_2\)</span> that stores all the nodes in <span class="math">\({\sf RAM}_1\)</span> but at locations
chosen uniformly at random and with appropriately updated pointers.</p>
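<p>The shuffling step can be sketched in a few lines of Python. This is a toy illustration; the node representation and the returned position map are simplifications of mine, not the exact [CGKO06] procedure:</p>

```python
# Sketch of the RAM_1 -> RAM_2 shuffle: nodes move to uniformly random
# locations and every "next" pointer is rewritten to match.
import random

def shuffle_nodes(ram1):
    """ram1: list of (payload, next_index_or_None) linked-list nodes.
    Returns (ram2, new_pos) where node i of ram1 now lives at
    ram2[new_pos[i]] and all pointers have been updated."""
    new_pos = list(range(len(ram1)))
    random.shuffle(new_pos)                 # old index i -> new_pos[i]
    ram2 = [None] * len(ram1)
    for i, (payload, nxt) in enumerate(ram1):
        ram2[new_pos[i]] = (payload, None if nxt is None else new_pos[nxt])
    return ram2, new_pos

# A two-node list ("id_a" -> "id_b") plus a singleton list ("id_c").
ram1 = [("id_a", 1), ("id_b", None), ("id_c", None)]
ram2, pos = shuffle_nodes(ram1)
```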
<p>After this shuffling step, the client encrypts each node in <span class="math">\({\sf RAM}_2\)</span>; that is,
it creates a new array <span class="math">\({\sf RAM}_3\)</span> such that for all <span class="math">\(w \in W\)</span> and all <span class="math">\(1 \leq i
\leq |{\sf DB}[w]|\)</span>,</p>
<p><span class="math">\[
{\sf RAM}_3\big[{\sf addr}_2({\sf N}_{w,i})\big] =
{\sf Enc}_{K_w}\bigg({\sf RAM}_2\big[{\sf addr}_2({\sf N}_{w,i})\big]\bigg)
\]</span></p>
<p>where <span class="math">\(K_w = F_{K_{\sf R}}(w)\)</span> and <span class="math">\({\sf addr}_2\)</span> is just a function that maps nodes to
their location in <span class="math">\({\sf RAM}_2\)</span> (this just makes notation easier).</p>
<p>Now, for all keywords <span class="math">\(w \in W\)</span>, the client sets<br>
<span class="math">\(
{\sf T}\big[P_{K_{\sf T}}(w) \big] = {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big),
\)</span></p>
<p>where <span class="math">\({\sf addr}_3\)</span> is a function that maps nodes to their locations in <span class="math">\({\sf RAM}_3\)</span>.
Finally, the client sets <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span>.</p>
<p>Now the version I just described is simpler than the one presented in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>]. There are two main
differences. The first has to do with the domain of the pseudo-random
permutation <span class="math">\(P\)</span>. In practice, PRPs have a fixed domain size. For example, if
we view AES as a PRP then it is a PRP that maps 128-bit strings to 128-bit
strings. But in our case we need a PRP that maps keywords in <span class="math">\(W\)</span> to the numbers
<span class="math">\(1\)</span> through <span class="math">\(|W|\)</span>. The problem here is that in practice the size of <span class="math">\(W\)</span> will be
<em>much</em> smaller than <span class="math">\(2^{128}\)</span>. So the question becomes: how can we use a
PRP built for a large domain to build a PRP for a small domain? There are ways
of doing this but at the time the known solutions had several important
limitations. So we solved the problem using the following approach.</p>
<p>Suppose we used a large-domain PRP. The problem would be that the table <span class="math">\({\sf T}\)</span>
would be large as well, i.e., it would have to hold <span class="math">\(2^{128}\)</span> elements if we
were using a PRP over <span class="math">\(128\)</span>-bit strings (e.g., AES). Obviously this is too large
to be practical. So the idea was to "shrink" <span class="math">\({\sf T}\)</span> by using something called a
Fredman-Komlós-Szemerédi (FKS) table. I won't go into the details, but the
point is that by using FKS tables, we could use a large-domain PRP and
still have a compact table <span class="math">\({\sf T}\)</span>.</p>
<p>The other difference has to do with the symmetric encryption scheme <span class="math">\(({\sf Gen},
{\sf Enc}, {\sf Dec})\)</span> that we use. In the version described here, it is
important for security that the encryption scheme be <em>anonymous</em> which
means that, given two ciphertexts, one cannot tell whether they
were encrypted under the same key or not. Why is this important? Because each
list of nodes <span class="math">\(\{{\sf N}_{w, i}\}_{i \leq |{\sf DB}[w]|}\)</span> is encrypted under the same
key <span class="math">\(K_w\)</span>. And if, given <span class="math">\({\sf RAM}_3\)</span>, the adversary can tell which ciphertexts are
encrypted under the same key, then it can learn the frequency <span class="math">\(|{\sf DB}[w]|\)</span> of each
keyword. Note that this would be revealed by the EDB; without the client ever
having made any queries.</p>
<p>The problem with anonymity is that it is not implied
by the standard notion of CPA-security. In practice, it seems that most block
ciphers (including AES) would be anonymous, but there is no guarantee. In [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] we
didn't assume that the underlying symmetric encryption scheme was anonymous so
we had to use a different approach. At a high-level, what we did is to encrypt
each node under a different key and store that key in its predecessor in the
list. The fact that every node is encrypted under a different key solves our
problem.</p>
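<p>The per-node-key trick can be sketched as follows. The fixed-width payload layout and the PRF-based cipher are simplifying assumptions of mine, not the exact [CGKO06] encoding:</p>

```python
# Sketch of the per-node-key trick: node i is encrypted under its own
# key K_i, and K_{i+1} travels inside node i, so no two ciphertexts in
# a list share a key and anonymity of Enc is no longer needed.
import hmac, hashlib, os

def prf(key, msg):
    return hmac.new(key, msg, hashlib.sha512).digest()   # 64-byte output

def enc(key, pt64):
    # Toy encryption: XOR a 64-byte payload with a PRF-derived pad.
    nonce = os.urandom(16)
    pad = prf(key, nonce)
    return nonce + bytes(a ^ b for a, b in zip(pad, pt64))

def dec(key, ct):
    pad = prf(key, ct[:16])
    return bytes(a ^ b for a, b in zip(pad, ct[16:]))

def encrypt_list(ids, k0):
    """Each node holds (identifier, key of the next node). Identifiers
    are assumed to fit in 8 bytes for this sketch."""
    keys = [k0] + [os.urandom(32) for _ in ids[1:]]
    cts = []
    for i, doc_id in enumerate(ids):
        nxt_key = keys[i + 1] if i + 1 < len(ids) else b"\0" * 32
        payload = doc_id.encode().ljust(8, b"\0") + nxt_key
        cts.append(enc(keys[i], payload.ljust(64, b"\0")))
    return cts

def decrypt_list(cts, k0):
    k, ids = k0, []
    for ct in cts:
        pt = dec(k, ct)
        ids.append(pt[:8].rstrip(b"\0").decode())
        k = pt[8:40]                          # key for the next node
    return ids

k0 = os.urandom(32)
cts = encrypt_list(["id1", "id2", "id3"], k0)
```

<p>Given only <code>k0</code> (delivered via the token), the server can unwind the whole list, yet every ciphertext is under an independent key.</p>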
<p><strong>Token and search.</strong>
If the client wants to search for keyword <span class="math">\(w\)</span>, he simply generates a token</p>
<p><span class="math">\[
{\sf tk} = ({\sf tk}_1, {\sf tk}_2) = (P_{K_{\sf T}}(w), F_{K_{\sf R}}(w)),
\]</span></p>
<p>which he sends to the server. To query <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span>, the server first
recovers the ciphertext <span class="math">\(c = {\sf T}[{\sf tk}_1]\)</span> which it decrypts to recover address
<span class="math">\(a_1 = {\sf Dec}_{{\sf tk}_2}(c)\)</span>. Then, for all <span class="math">\(i\)</span> until <span class="math">\(a_i = \bot\)</span>, it decrypts the
nodes <span class="math">\(({\sf N}_{w, 1}, \dots, {\sf N}_{w, |{\sf DB}[w]|})\)</span> by computing</p>
<p><span class="math">\[
({\sf id}_i, a_{i+1}) \leftarrow {\sf Dec}_{{\sf tk}_2}\big({\sf RAM}_3[a_i]\big).
\]</span></p>
<p>It then finds and returns the encrypted documents with identifiers <span class="math">\(({\sf id}_1,
\dots, {\sf id}_{|{\sf DB}[w]|})\)</span>.</p>
<p><strong>Efficiency and security.</strong>
To search, the server needs to do one lookup in <span class="math">\(T\)</span>, which is <span class="math">\(O(1)\)</span> and then
one decryption for each node <span class="math">\(({\sf N}_{w, 1}, \dots, {\sf N}_{w, |{\sf DB}[w]|})\)</span>,
which is <span class="math">\(O(|{\sf DB}[w]|)\)</span>. So the search complexity of this approach is
<span class="math">\(O(|{\sf DB}[w]|)\)</span>, which is optimal since it would take at least that much time just
for the server to send back the relevant documents.</p>
<p>The construction is clearly efficient (asymptotically speaking, as efficient as
possible) but is it secure? Yes and no. The solution (at least
the more complex version) is proven secure in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>], but it is only shown to be
<em>non-adaptively-secure</em> with trivial setup leakage and query leakage that
includes the access pattern (the search results) and the search pattern
(whether a query is repeated).</p>
<p>Intuitively, given <span class="math">\({\sf EDB} = ({\sf T}, {\sf RAM}_3)\)</span> the adversary learns at most the
number of keywords (by the size of <span class="math">\({\sf T}\)</span>) and <span class="math">\(\sum_{w \in W} |{\sf DB}[w]|\)</span> (by the
size of <span class="math">\({\sf RAM}_3\)</span>). So that is the setup leakage. Notice that unlike solutions
based on deterministic encryption, the <span class="math">\({\sf EDB}\)</span> by itself does not leak any
non-trivial information like the frequency of a keyword. At query time, the
server obviously learns the search results <span class="math">\({\sf DB}[w]\)</span> but it also learns whether
the client is repeating a keyword search since in that case the tokens <span class="math">\({\sf tk} =
(P_{K_{\sf T}}(w), F_{K_{\sf R}}(w))\)</span> will be the same.</p>
<p><strong>Improvements.</strong>
The inverted index solution has been improved over several works. Its main
limitations were that: (1) it was only non-adaptively secure; (2) the use of
FKS dictionaries made the solution hard to understand and implement; and (3)
it was a static scheme, in the sense that one could not modify the <span class="math">\({\sf EDB}\)</span> to add
or remove keywords and/or document identifiers <sup class="footnote-ref" id="fnref:8"><a class="footnote" href="#fn:8">9</a></sup>.</p>
<p>The first problem was addressed in a joint paper with my MSR colleague Melissa
Chase [<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. One of the observations
in that work was that the inverted index solution could be made
adaptively-secure by replacing the symmetric encryption scheme by a
non-committing encryption scheme. Non-committing encryption schemes are usually
either very expensive or require very strong assumptions (i.e., random
oracles). Fortunately, in our setting we only need a <em>symmetric</em>
non-committing encryption scheme and such a scheme can be instantiated very
efficiently. In fact, it turns out that the simplest possible symmetric
encryption scheme is non-committing! In retrospect this is a very simple
observation, but it's been a very useful one since it allows us to design
adaptively-secure schemes very efficiently (and under standard assumptions). In
fact, this has been used in most subsequent SSE constructions.</p>
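<p>One way to see the idea (my own illustration, not necessarily the exact scheme meant here): with one-time-pad-style encryption, a simulator can hand out a ciphertext first and later "open" it to <em>any</em> message of the same length by choosing the pad after the fact:</p>

```python
# Sketch of the non-committing property of XOR encryption: a ciphertext
# does not commit the encryptor to a particular plaintext, because a
# suitable "key" can be produced for any claimed message.
import os

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

m = b"real message 001"
pad = os.urandom(len(m))            # the key / one-time pad
c = xor(pad, m)                     # honest encryption

# Equivocation: open c to a different message by back-computing the pad.
m_fake = b"other message 42"        # same length as m
pad_fake = xor(c, m_fake)
```

<p>This is exactly what an adaptive simulator needs: it can publish ciphertexts before knowing the plaintexts and explain them consistently later.</p>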
<p>The second issue was also addressed in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>]. Obviously one could just
replace the PRP with a small-domain PRP but the approach taken in
[<a href="http://eprint.iacr.org/2011/010.pdf">CK10</a>] was different. The idea is to
replace the array <span class="math">\({\sf T}\)</span> with a dictionary <span class="math">\({\sf DX}\)</span>. A dictionary is a data
structure that stores label/value pairs and that supports lookup operations
that map labels to their values. Dictionaries can be instantiated as hash
tables, binary search trees etc. So instead of populating <span class="math">\({\sf T}\)</span> with</p>
<p><span class="math">\[
{\sf T}\big[P_{K_{\sf T}}(w) \big] = {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big)
\]</span></p>
<p>for all <span class="math">\(w \in W\)</span>, we instead use a PRF <span class="math">\(G\)</span> and store the pair</p>
<p><span class="math">\[
\bigg(G_{K_{\sf T}}(w), {\sf Enc}_{K_w}\big({\sf addr}_3({\sf N}_{w, 1})\big)\bigg)
\]</span></p>
<p>in <span class="math">\({\sf DX}\)</span> for all <span class="math">\(w \in W\)</span>. With this approach we remove the need for a PRP
altogether and, in turn, the need for either small-domain PRPs or FKS dictionaries.</p>
<p>The third issue was addressed in a joint paper with Charalampos (Babis)
Papamanthou who was an MSR intern at the time and Tom Roeder who was an MSR
colleague at the time. In this paper
[<a href="http://eprint.iacr.org/2012/530.pdf">KPR12</a>], we show how to make
the inverted index solution dynamic while maintaining its efficiency. The
solution is complex so I won't discuss it here.</p>
<p>In another paper with Babis
[<a href="https://research.microsoft.com/en-us/um/people/senyk/pubs/psse.pdf">KP13</a>]
we propose a much simpler dynamic solution. Our approach here is tree-based and
not based on the inverted index solution at all. Its search complexity is not
optimal but it is sub-linear; in particular, logarithmic in the number of
documents. It has other good properties, however, like parallelizable search
and good I/O complexity.</p>
<p>In a more recent paper
[<a href="http://www.internetsociety.org/sites/default/files/07_4_1.pdf">CJJJKRS14</a>],
Cash, Jarecki, Jaeger, Jutla, Krawczyk, Steiner and Rosu describe a dynamic
solution that is very simple, has optimal and parallelizable search and has
good I/O complexity.</p>
<p>In another recent paper
[<a href="http://web.engr.illinois.edu/~naveed2/pub/Oakland2014BlindStorage.pdf">NPG14</a>]
Naveed, Prabhakaran and Gunter propose a very interesting dynamic solution
based on the notion of blind storage. In a way, their notion of blind storage
can be viewed as an abstraction of the <span class="math">\({\sf RAM}_3\)</span> structure in the inverted index
solution. What
[<a href="http://web.engr.illinois.edu/~naveed2/pub/Oakland2014BlindStorage.pdf">NPG14</a>]
shows, however, is that there is an alternative---and much better---way of
achieving the properties needed from <span class="math">\({\sf RAM}_3\)</span> than how it is done in
[<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>]. I won't say much else
because this really gets into the weeds of SSE techniques but I recommend the
paper if you're interested in this area.</p>
<p>Finally, the last paper I'll mention is a work by Cash, Jarecki, Jutla,
Krawczyk, Rosu and Steiner [<a href="http://eprint.iacr.org/2013/169">CJJKRS13</a>] that
shows how to extend the inverted index solution to handle <em>boolean</em>
queries while keeping its optimal search complexity. Prior to this work we knew
how to handle conjunctive search queries (i.e., <span class="math">\(w_1 \wedge w_2\)</span>) in linear
time. This paper showed not only how to do it in optimal time but also showed
how to handle disjunctive queries (i.e., <span class="math">\(w_1 \vee w_2\)</span>) and combinations of
conjunctions and disjunctions!</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">I discuss how to use ORAM for encrypted search towards the end of the previous post of this series.
<a class="footnote-return" href="#fnref:1">↩</a></li>
<li id="fn:2">Amazingly, this paper was never accepted for publication; which tells you something about the current state of our publication process.
<a class="footnote-return" href="#fnref:2">↩</a></li>
<li id="fn:3">This wasn't an omission on Goh's part; he defined it this way on purpose. His reasoning was that SSE schemes could have a variety of applications where token privacy was not needed. This made sense but it still left open the question of how one should define security with token privacy.<br>
<a class="footnote-return" href="#fnref:3">↩</a></li>
<li id="fn:4">A similar situation was later observed by Boneh, Sahai and Waters, and by O'Neill, in the setting of functional encryption.
<a class="footnote-return" href="#fnref:4">↩</a></li>
<li id="fn:5">Technically, this is <em>not</em> true! The reason SSE schemes tend to be more efficient than ORAM is not because they reveal the search results (access pattern) but because they reveal whether searches were repeated (search pattern).<br>
<a class="footnote-return" href="#fnref:5">↩</a></li>
<li id="fn:6">At this point you might be wondering how the proofs went through. In the definition of [Goh03], the tokens did not appear at all since he was not considering query privacy. In the case of [CM05], the adversary in the proof is restricted to never repeating queries.
<a class="footnote-return" href="#fnref:6">↩</a></li>
<li id="fn:7">A criticism I often hear from colleagues and reviewers is that SSE constructions are not really <em>searching</em> over data. The underlying issue is that no computation is being performed. In my opinion, this reflects a very uninformed understanding of the real world. Given the amounts of data we currently produce and have to search over, search has become analogous to <em>sub-linear-time search</em> and therefore to some form of indexed-based search. In other words, the kind of scale we now have to deal with has fundamentally changed what we mean by the term search.<br>
<a class="footnote-return" href="#fnref:7">↩</a></li>
<li id="fn:8">Actually, in [<a href="http://eprint.iacr.org/2006/210.pdf">CGKO06</a>] we describe a way to make our constructions (and any other) dynamic. There are limitations to this approach, however, including the tokens growing in length with the number of updates and interaction. So when we ask for a dynamic SSE scheme we typically want the update process not to affect the token size and, preferably, the update mechanism to be non-interactive---though the latter doesn't matter much from a practical point of view.<br>
<a class="footnote-return" href="#fnref:8">↩</a></li>
</ol>
</div>
Is the NSA Metadata Program Legal?
http://senykam.github.io/2014/04/29/is-the-nsa-metadata-program-legal
Tue, 29 Apr 2014 17:21:47 -0300http://senykam.github.io/2014/04/29/is-the-nsa-metadata-program-legal<p><img src="http://senykam.github.io/img/justice.jpg" class="alignright" width="200">
One of the most interesting aspects of the NSA metadata program is whether it
is legal or not. Unlike the questions we usually think about in computer
science, this question has no definitive answer. The program is legal in some
sense, but the logic needed for the argument to go through is so questionable
that you could just as well say that it's not.</p>
<p>Recall that the program requires telephone providers to hand to the NSA (each
day) the metadata of every US-to-foreign, foreign-to-US and US-to-US call. This
metadata consists of the origin and destination numbers, the time and duration
of the call, the international mobile subscriber identity (IMSI) number, the
trunk identifier and telephone calling card numbers. This data is stored and
queried by the NSA and each record has to be deleted after 5 years.</p>
<p>I had decided to include a high-level overview of this question in an invited
paper I wrote for the
<a href="http://www.dcsec.uni-hannover.de/wahc14.html">Workshop on Applied Homomorphic Cryptography</a>
but I had to take it out due to space restrictions.
Even though I'll eventually include the overview in the full version of the
paper, in the meantime, I thought it might make for a useful blog post;
especially for computer scientists who are curious about the legal aspects of
this issue but don't have the background to make sense of it.</p>
<p>The question of whether the NSA metadata program is legal reduces to the
following two questions: (1) does the program violate the Fourth Amendment of
the US Constitution?; and (2) is the program compliant with the amendments to
the Foreign Intelligence Surveillance Act (FISA) put forth in the USA PATRIOT
Act of 2001, which, in the Government's view, authorize this program?</p>
<h2 id="does-it-violate-the-fourth-amendment">Does it Violate the Fourth Amendment?</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Fourth_Amendment_to_the_United_States_Constitution">Fourth
Amendment</a>
protects the privacy of American citizens. The Amendment states:</p>
<blockquote>
<p>The right of the people to be secure in their persons, houses, papers, and
effects, against unreasonable searches and seizures, shall not be violated,
and no warrants shall issue, but upon probable cause, supported by oath or
affirmation, and particularly describing the place to be searched, and the
persons or things to be seized.</p>
</blockquote>
<p>Roughly speaking, it protects citizens against unreasonable searches and seizures
by requiring the Government to obtain a warrant supported by probable cause
from an independent
<a href="https://en.wikipedia.org/wiki/Magistrate#United_States">magistrate</a>.</p>
<p>Historically, courts tended to limit the protections of the Fourth Amendment to
a person's physical property, including their home and personal effects, but in
1967, in <a href="https://en.wikipedia.org/wiki/Katz_v_United_States">Katz v. United
States</a>, the Supreme Court
greatly widened the scope of the Fourth Amendment. In particular, the Court
decided that Fourth Amendment protections apply to "people, not places" and
that a person is afforded protection as long as they have a "reasonable
expectation of privacy" in the items or location to be searched.</p>
<p>Warrantless searches are per se unreasonable under the Fourth Amendment, unless
they fall within a recognized exception. Given that a warrant embodies the
notion of "reasonableness" under the Fourth Amendment, the Government would
bear a heavy burden to explain exactly how the metadata program---which collects
metadata on individuals without a warrant---involves a "reasonable" search under
the Fourth Amendment.</p>
<p>Following Katz v. United States, however, there were two important cases that
solidified what is now known as the third party doctrine, which holds that
Fourth Amendment protections do not apply to information that is voluntarily
disclosed to a third party since there is then <em>no reasonable expectation
of privacy with respect to such information</em>. These Supreme Court decisions
were <a href="https://en.wikipedia.org/wiki/United_States_v_Miller">United States v. Miller</a>
in 1976 and
<a href="https://en.wikipedia.org/wiki/Smith_v_Maryland">Smith v. Maryland</a>
in 1979. In the Miller decision, the Supreme Court ruled that the Government did
not violate the Fourth Amendment by obtaining Miller's bank records without a
warrant. In the Smith decision, the Court found no Fourth Amendment violation
when the Government obtained records of phone numbers dialed by Smith from the
phone company without a warrant. Though these decisions were issued in the
1970's and concerned bank records and telephone companies, an argument could be
made that the third party doctrine also extends to ISPs, mobile networks and
cloud providers.</p>
<p>So in light of the third party doctrine, the warrantless collection of metadata
on (potentially) every American may not violate the Fourth Amendment, because
such metadata has been voluntarily provided by users to their service providers.</p>
<p>This issue is highly controversial, however, and at least
<a href="http://en.wikipedia.org/wiki/Richard_J._Leon">one judge</a> has found
that the scale of the program and the massive technological changes that have
occurred in the last 30 years mean that the Miller and Smith decisions are not
necessarily controlling (i.e., completely binding).</p>
<h2 id="is-it-compliant-with-the-fisapatriot-act">Is it Compliant with the FISA/PATRIOT Act?</h2>
<p>The second question is whether the program exceeds the scope of <a href="http://en.wikipedia.org/wiki/Section_summary_of_the_Patriot_Act,_Title_II#Section_215:_Access_to_records_and_other_items_under_FISA">Section
215</a>
of the <a href="http://en.wikipedia.org/wiki/Patriot_act">PATRIOT Act</a>, which amended
Section 501 of the <a href="http://en.wikipedia.org/wiki/FISA">Foreign Intelligence Surveillance
Act</a> (FISA).</p>
<p>FISA is a law from 1978 that prescribes how the Government can conduct domestic
surveillance for national security-related investigations. The law was initially
passed to curb the domestic surveillance activities of the Government which
included abuses such as
<a href="http://en.wikipedia.org/wiki/Watergate">Watergate</a>, as well as the
FBI <a href="http://en.wikipedia.org/wiki/Cointel_Pro">COINTELPRO</a> and the
NSA <a href="https://en.wikipedia.org/wiki/Project_MINARET">MINARET</a> programs.</p>
<p>One of the law's provisions was to create a court, referred to as the
<a href="http://en.wikipedia.org/wiki/FISA_Court">FISA Court</a>, which was assigned the role of
providing judicial oversight over the Government's domestic surveillance
activities in national security-related investigations. To protect national
security, public visibility into the court's activities was limited. <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup></p>
<p>FISA has been amended several times since its introduction but perhaps the most
controversial amendment was in Section 215 of the USA PATRIOT Act of 2001.
Roughly speaking, Section 215 allows the Government to compel a third-party
provider---without a warrant---to hand over "business records" about a customer
"if there are reasonable grounds to believe that the tangible things sought are
relevant to an authorized investigation".</p>
<p>Part of the argument for the legality of the NSA metadata program rests on the
meaning of the term "relevant". Indeed, as explained in a
<a href="https://www.aclu.org/files/natsec/nsa/br13-09-primary-order.pdf">declassified opinion</a>
from the FISA Court, the court decided to interpret the term
"relevant" to mean bearing upon or being pertinent to an investigation, as
opposed to directly related to a specific investigation target (e.g., the
records of the specific individual being investigated).</p>
<p>The former interpretation of "relevant" combined with submissions from the NSA
that their investigative tools required the metadata of all customers in order
to work properly in any investigation, led the FISA court to hold that the
program is legal.</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">If you're curious about the FISA court and, especially, about what its door looks like, see this <a href="https://konklone.com/post/the-door-to-the-fisa-court">post</a> by Eric Mill.
<a class="footnote-return" href="#fnref:1">↩</a></li>
</ol>
</div>
Restructuring the NSA Metadata Program
http://senykam.github.io/2014/03/10/restructuring-the-nsa-metadata-program
Mon, 10 Mar 2014 12:05:03 -0300http://senykam.github.io/2014/03/10/restructuring-the-nsa-metadata-program<p><img src="http://senykam.github.io/img/design.jpg" class="alignright" width="280">
I just got back from Barbados where I attended the <a href="http://fc14.ifca.ai/">Financial Cryptography and
Data Security</a> conference. It was a great event overall with many interesting
talks and two great workshops.</p>
<p>One workshop was on <a href="http://fc14.ifca.ai/bitcoin/index.html">Bitcoin</a> and was the most successful Financial Crypto
workshop in history! Though I haven't personally worked on Bitcoin, one of the
things I enjoyed most about the conference and workshops was the presence of
the Bitcoin community. The interaction between the academic and Bitcoin
communities led to some very interesting discussions and ideas. I really hope
the two communities keep interacting.</p>
<p>The other workshop was on <a href="https://www.dcsec.uni-hannover.de/4556.html">applied homomorphic
cryptography</a>. Homomorphic in the
context of this workshop is to be understood broadly and is meant to include
all cryptographic technologies that allow for some form of computation on
encrypted data. As such this includes secure multi-party computation and
encrypted search.</p>
<p>I was invited to give the keynote at this workshop and I chose to talk about
how to restructure the NSA metadata program. My slides are <a href="http://research.microsoft.com/en-us/um/people/senyk/slides/metacrypt.pdf">here</a>. They
describe---at a very high-level---a new design I refer to as MetaCrypt whose
goal is to enable some of the functionality the current NSA metadata program
supports but in a privacy-preserving manner. I first started thinking about
this problem in July 2013 when I wrote <a href="http://outsourcedbits.org/2013/07/23/are-compliance-and-privacy-always-at-odds/">this blog post</a>.</p>
<p>Since I only had one hour, there are many details missing in the talk. Also,
since this was a talk aimed at a general crypto audience, <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup> I included a
variant of the protocol that is easy to describe as opposed to variants that
are perhaps more efficient and/or provide stronger security guarantees. The
details and alternative designs will appear later in an accompanying paper but
I hope that even this high-level description is interesting.</p>
<p><strong>Update:</strong> my talk and the MetaCrypt project was recently covered by MIT Tech
Review. See
<a href="http://www.technologyreview.com/news/526121/cryptography-could-add-privacy-protections-to-nsa-phone-surveillance/">here</a>
for the article.</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">The audience included people who focus on number-theoretic primitives, hardware crypto implementations, lattice-based cryptography etc. The ideas I described in the talk, however, required mostly background on secure multi-party computation, <a href="http://eprint.iacr.org/2006/210.pdf">searchable symmetric encryption</a> and especially <a href="http://eprint.iacr.org/2011/010.pdf">structured encryption</a>, which are more recent and not as well-known.<br>
<a class="footnote-return" href="#fnref:1">↩</a></li>
</ol>
</div>
How to Search on Encrypted Data: Oblivious RAMs (Part 4)
http://senykam.github.io/2013/12/20/how-to-search-on-encrypted-data-oblivious-rams-part-4
Fri, 20 Dec 2013 11:23:34 -0300http://senykam.github.io/2013/12/20/how-to-search-on-encrypted-data-oblivious-rams-part-4<p><em>This is the fourth part of a series on searching on encrypted data. See parts <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1/">1</a>, <a href="https://outsourcedbits.org/2013/10/30/how-to-search-on-encrypted-data-part-2/">2</a>, <a href="https://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-3-oblivious-rams/">3</a> and <a href="https://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5/">5</a>.</em></p>
<p><img src="http://senykam.github.io/img/search.jpg" class="alignright" width="250">
In the previous posts we covered two different ways to search on encrypted data.
The first was based on property-preserving encryption (in particular, on
deterministic encryption), achieved <em>sub-linear</em> search time but had weak
security properties. The second was based on functional encryption, achieved
<em>linear</em> search time but provided stronger security guarantees.</p>
<p>We'll now see another approach that achieves the strongest possible levels of
security! But first, we need to discuss what we mean by security.</p>
<h2 id="security">Security</h2>
<p>So far, I have discussed the security of the encrypted search solutions
informally---mostly providing intuition and describing possible
attacks. This is partly because I'd like this blog to remain comprehensible to
readers who are not cryptographers but also because formally defining the
security properties of encrypted search is a bit messy.</p>
<p>So, which security properties should we expect from an encrypted search solution? What about the following:</p>
<ol>
<li>the encrypted database <span class="math">\({\sf EDB}\)</span> generated by the scheme should not leak any
information about the database <span class="math">\({\sf DB}\)</span> of the user;<br></li>
<li>the tokens <span class="math">\({\sf tk}_w\)</span> generated by the user should not leak any information
about the underlying search term <span class="math">\(w\)</span> to the server.</li>
</ol>
<p>This sounds reasonable but there are several issues. First, this intuition is
not precise enough to be meaningful. What I mean is that there are many details
that impact security that are not taken into account in this high-level
intuition (e.g., what does it mean not to leak, how are the search terms chosen
exactly). This is why cryptographers are so pedantic about security
definitions---the details really do matter.</p>
<p>Putting aside the issue of formality, another problem with this intuition
is that it says nothing about the search results. More precisely, it does not
specify whether it is appropriate or not for an encrypted search solution to
reveal to the server which encrypted documents match the search term. We
usually refer to this information as the client's <em>access pattern</em> and for
concreteness you can think of it as the (matching) encrypted documents'
identifiers or their locations in memory. All we really need as an
identifier is some per-document unique string that is independent of the
contents of the document and of the keywords associated with it.</p>
<p>So the question is:</p>
<blockquote>
<p>Is it appropriate to reveal the access pattern?</p>
</blockquote>
<p>There are two possible answers to this question. On one hand, we could argue
that it is fine to reveal the access pattern since the whole point of using
encrypted search is so that the server can return the encrypted documents that
match the query. And if we expect the server to return those encrypted
documents then it clearly has to know which ones to return (though it does not
necessarily need to know the contents).</p>
<p>On the other hand, one could argue that, in theory, the access pattern reveals
some information to the server. In fact, by observing enough search results the
server could use some sophisticated statistical attack to infer something about
the client's queries and data. Note that such attacks are not completely
theoretical and in a future post we'll discuss work that tries to make them
practical. Furthermore, the argument that the server needs to know which
encrypted documents match the query in order to return the desired documents is
not technically true. In fact, we know how to design cryptographic protocols
that allow one party to send items to another without knowing which item it is
sending (see, e.g., private information retrieval and oblivious transfer).</p>
<p>Similarly, we know how to design systems that allow us to read and write to
memory without the memory device knowing which locations are being accessed.
The latter are called <em>oblivious RAMs</em> (ORAM) and we could use them to
search on encrypted data <em>without revealing the access pattern to the
server</em>. The issue, of course, is that using ORAM will slow things down.</p>
<p>So really the answer to our question depends on what kind of tradeoff we are
willing to make between efficiency and security. If efficiency is the priority,
then revealing the access pattern might not be too much to give up in terms of
security for certain applications. On the other hand, if we can tolerate some
inefficiency, then it's always best to be conservative and not reveal anything
if possible.</p>
<p>In the rest of this post we'll explore ORAMs, see how to construct one and how
to use it to search on encrypted data.</p>
<h2 id="oblivious-ram">Oblivious RAM</h2>
<p>ORAM was first proposed in a paper by Goldreich and Ostrovsky
[<a href="http://www.cs.ucla.edu/~rafail/PUBLIC/09.pdf">GO96</a>] (the link is
actually Ostrovsky's thesis which has the same content as the journal paper) on
software protection. That work was well ahead of its time, as several ideas explored in it later proved relevant to more modern topics like cloud storage.</p>
<p>An ORAM scheme <span class="math">\(({\sf Setup}, {\sf Read}, {\sf Write})\)</span> consists of:</p>
<ul>
<li><p>A setup algorithm <span class="math">\({\sf Setup}\)</span> that takes as input a security parameter
<span class="math">\(1^k\)</span> and a memory (array) <span class="math">\({\sf RAM}\)</span> of <span class="math">\(N\)</span> items; it outputs a secret key
<span class="math">\(K\)</span> and an oblivious memory <span class="math">\({\sf ORAM}\)</span>.</p></li>
<li><p>A two-party protocol <span class="math">\({\sf Read}\)</span> executed between a client and a server
that works as follows. The client runs the protocol with a secret key <span class="math">\(K\)</span> and
an index <span class="math">\(i\)</span> as input while the server runs the protocol with an oblivious
memory <span class="math">\({\sf ORAM}\)</span> as input. At the end of the protocol, the client receives
<span class="math">\({\sf RAM}[i]\)</span> while the server receives <span class="math">\(\bot\)</span>, i.e., nothing. We'll write this
sometimes as <span class="math">\({\sf Read}((K, i), {\sf ORAM}) = ({\sf RAM}[i], \bot)\)</span>.</p></li>
<li><p>A two-party protocol <span class="math">\({\sf Write}\)</span> executed between a client and a server
that works as follows. The client runs the protocol with a key <span class="math">\(K\)</span>, an index
<span class="math">\(i\)</span> and a value <span class="math">\(v\)</span> as input and the server runs the protocol with an oblivious
memory <span class="math">\({\sf ORAM}\)</span> as input. At the end of the protocol, the client receives nothing
(again denoted as <span class="math">\(\bot\)</span>) and the server receives an updated oblivious memory
<span class="math">\({\sf ORAM}'\)</span> such that the <span class="math">\(i\)</span>th location now holds the value <span class="math">\(v\)</span>. We write this as
<span class="math">\({\sf Write}((K, i, v), {\sf ORAM}) = (\bot, {\sf ORAM}')\)</span>.</p></li>
</ul>
<h2 id="oblivious-ram-via-fhe">Oblivious RAM via FHE</h2>
<p>The simplest way to design an ORAM is to use fully-homomorphic encryption
(FHE). For an overview of FHE see my
previous posts
<a href="http://outsourcedbits.org/2012/06/26/applying-fully-homomorphic-encryption-part-1/">here</a>
and
<a href="http://outsourcedbits.org/2012/09/29/applying-fully-homomorphic-encryption-part-2/">here</a>.</p>
<p>Suppose we have an FHE scheme <span class="math">\({\sf FHE} = ({\sf Gen}, {\sf Enc}, {\sf Eval},
{\sf Dec})\)</span>. Then we can easily construct an ORAM as follows <sup class="footnote-ref" id="fnref:1"><a class="footnote" href="#fn:1">1</a></sup>:</p>
<ul>
<li><p><span class="math">\({\sf Setup}(1^k, {\sf RAM})\)</span>: generate a key for the FHE scheme by
computing <span class="math">\(K = {\sf FHE}.{\sf Gen}(1^k)\)</span> and encrypt <span class="math">\({\sf RAM}\)</span> as <span class="math">\(c =
{\sf FHE}.{\sf Enc}_K({\sf RAM})\)</span>. Output <span class="math">\(c\)</span> as the oblivious memory
<span class="math">\({\sf ORAM}\)</span>.</p></li>
<li><p><span class="math">\({\sf Read}\big((K, i), {\sf ORAM}\big)\)</span>: the client encrypts its index <span class="math">\(i\)</span> as
<span class="math">\(c_i = {\sf FHE}.{\sf Enc}_K(i)\)</span> and sends <span class="math">\(c_i\)</span> to the server. The server computes</p></li>
</ul>
<p><span class="math">\[
c' = {\sf FHE}.{\sf Eval}(f, {\sf ORAM}, c_i),
\]</span></p>
<p>where <span class="math">\(f\)</span> is a function that takes as input an array
and an index <span class="math">\(i\)</span> and returns the <span class="math">\(i\)</span>th element of the array. The server returns
<span class="math">\(c'\)</span> to the client who decrypts it to recover <span class="math">\({\sf RAM}[i]\)</span>.</p>
<ul>
<li><span class="math">\({\sf Write}\big((K, i, v), {\sf ORAM}\big)\)</span>: the client encrypts its index
<span class="math">\(i\)</span> as <span class="math">\(c_i = {\sf FHE}.{\sf Enc}_K(i)\)</span> and its value as <span class="math">\(c_v = {\sf FHE}.{\sf Enc}_K(v)\)</span> and
sends them both to the server. The server computes</li>
</ul>
<p><span class="math">\[
c' = {\sf FHE}.{\sf Eval}(g, {\sf ORAM}, c_i, c_v),
\]</span></p>
<p>where <span class="math">\(g\)</span> is a function that takes as input an array, an
index <span class="math">\(i\)</span> and a value <span class="math">\(v\)</span> and returns the same array with the <span class="math">\(i\)</span>th element
updated to <span class="math">\(v\)</span>.</p>
<p>The security properties of FHE will guarantee that <span class="math">\({\sf ORAM}\)</span> leaks no information
about <span class="math">\({\sf RAM}\)</span> to the server and that the <span class="math">\({\sf Read}\)</span> and <span class="math">\({\sf Write}\)</span> protocols reveal
no information about the index and values either.</p>
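<p>The flow of this construction can be mocked up as follows. This is a toy sketch only: <code>Ct</code> simply wraps the plaintext, so there is no actual encryption or security here; the point is the shape of the protocol and the linear-time scan inside the evaluation step:</p>

```python
from dataclasses import dataclass
from typing import Any

# Toy mock of the FHE-based ORAM. Ct is a plaintext wrapper standing
# in for a real FHE ciphertext. Eval's selection function
# f(RAM, i) = RAM[i] is written as an oblivious multiplexer that
# touches every slot -- exactly why this construction is O(N) per access.

@dataclass
class Ct:
    pt: Any

def fhe_enc(_key: str, x: Any) -> Ct:
    return Ct(x)          # placeholder; NOT real encryption

def fhe_dec(_key: str, c: Ct) -> Any:
    return c.pt

def fhe_eval_read(oram: Ct, c_i: Ct) -> Ct:
    ram, i = oram.pt, c_i.pt
    out = 0
    for j, x in enumerate(ram):           # scans all N slots
        out += x * (1 if j == i else 0)   # selects RAM[i] without branching
    return Ct(out)

def fhe_eval_write(oram: Ct, c_i: Ct, c_v: Ct) -> Ct:
    ram, i, v = oram.pt, c_i.pt, c_v.pt
    return Ct([v if j == i else x for j, x in enumerate(ram)])
```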
<p>The obvious downside of this FHE-based ORAM is efficiency. Let's forget for a second
that FHE is not practical yet and let's suppose we had a very fast FHE scheme.
This ORAM would still be too slow simply because the homomorphic evaluation
steps in the <span class="math">\({\sf Read}\)</span> and <span class="math">\({\sf Write}\)</span> protocols require <span class="math">\(O(N)\)</span> time, i.e.,
<em>time linear in the size of the memory</em>. Again, assuming we had a
super-fast FHE scheme, this would only be usable for small memories.</p>
<h2 id="oblivious-ram-via-symmetric-encryption">Oblivious RAM via Symmetric Encryption</h2>
<p>Fortunately, we also know how to design ORAMs using standard encryption schemes
and, in particular, using symmetric encryption like AES. ORAM is a
very active area of research and we now have many constructions, optimizations
and even implementations (e.g., see Emil Stefanov's
<a href="http://www.emilstefanov.net/Research/ObliviousRam/">implementation</a>).
Because research is moving so fast, however, there really isn't a good overview of
the state-of-the-art.</p>
<p>Since ORAMs are fairly complicated, I'll describe here the simplest
(non-FHE-based) construction which is due to Goldreich and Ostrovsky
[<a href="http://www.cs.ucla.edu/~rafail/PUBLIC/09.pdf">GO96</a>]. This
particular ORAM construction is known as the Square-Root solution and it
requires just a symmetric encryption scheme <span class="math">\({\sf SKE} = ({\sf Gen}, {\sf Enc}, {\sf Dec})\)</span>, and a
pseudo-random function <span class="math">\(F\)</span> that maps <span class="math">\(\log N\)</span> bits to <span class="math">\(2\log N\)</span> bits.</p>
<p><strong>Setup.</strong>
To setup the ORAM, the client generates two secret keys <span class="math">\(K_1\)</span> and <span class="math">\(K_2\)</span> for
the symmetric encryption scheme and for the pseudo-random function <span class="math">\(F\)</span>,
respectively. It then augments each item in <span class="math">\({\sf RAM}\)</span> by appending its address and
a random tag to it. We'll refer to the address embedded with the item as its
<em>virtual</em> address. More precisely, it creates a new memory <span class="math">\({\sf RAM}_2\)</span> such that
for all <span class="math">\(1 \leq i \leq N\)</span>,</p>
<p><span class="math">\[
{\sf RAM}_2[i] = \big\langle{\sf RAM}[i], i, {\sf tag}_i \big\rangle,
\]</span></p>
<p>where <span class="math">\(\langle \cdot, \cdot, \cdot \rangle\)</span> denotes concatenation and <span class="math">\({\sf tag}_i =
F_{K_2}(i)\)</span>. It then adds <span class="math">\(\sqrt{N}\)</span> <em>dummy</em> items to <span class="math">\({\sf RAM}_2\)</span>, i.e.,
it creates a new memory <span class="math">\({\sf RAM}_3\)</span> such that for all <span class="math">\(1 \leq i \leq N\)</span>,
<span class="math">\({\sf RAM}_3[i] = {\sf RAM}_2[i]\)</span> and such that for all <span class="math">\(N+1 \leq i \leq
N+\sqrt{N}\)</span>,</p>
<p><span class="math">\[
{\sf RAM}_3[i] = \big\langle 0, \infty_1, {\sf tag}_i \big\rangle,
\]</span></p>
<p>where <span class="math">\(\infty_1\)</span> is some number larger than <span class="math">\(N + 2\sqrt{N}\)</span>.
It then sorts <span class="math">\({\sf RAM}_3\)</span> according to the tags. Notice that the effect of
this sorting will be to permute <span class="math">\({\sf RAM}_3\)</span> since the tags are (pseudo-)random. It
then encrypts each item in <span class="math">\({\sf RAM}_3\)</span> using <span class="math">\({\sf SKE}\)</span>. In other words, it generates a
new memory <span class="math">\({\sf RAM}_4\)</span> such that, for all <span class="math">\(1 \leq i \leq N + \sqrt{N}\)</span>,</p>
<p><span class="math">\[
{\sf RAM}_4[i] = {\sf Enc}_{K_1}({\sf RAM}_3[i]).
\]</span></p>
<p>Finally, it appends <span class="math">\(\sqrt{N}\)</span> elements to <span class="math">\({\sf RAM}_4\)</span>, each of which contains an
<span class="math">\({\sf SKE}\)</span> encryption of <span class="math">\(0\)</span> under key <span class="math">\(K_1\)</span>. Needless to say, all the ciphertexts
generated in this process need to be of the same size so the items need to be
padded appropriately. The result of this, i.e., the combination of <span class="math">\({\sf RAM}_4\)</span> and
the encryptions of <span class="math">\(0\)</span>, is the oblivious memory <span class="math">\({\sf ORAM}\)</span> which is sent to the
server.</p>
<p>It will be useful for us to distinguish between the two parts of <span class="math">\({\sf ORAM}\)</span> so
we'll refer to the second part (i.e., the encryptions of <span class="math">\(0\)</span>) as the <em>cache</em>.</p>
<p><strong>Read & write.</strong>
Now we'll see how to read and write to <span class="math">\({\sf ORAM}\)</span> <em>obliviously</em>, i.e., without
the server knowing which memory locations we're accessing. First we have to
define two basic operations: <span class="math">\({\sf Get}\)</span> and <span class="math">\({\sf Put}\)</span>.</p>
<p>The <span class="math">\({\sf Get}\)</span> operation takes an index <span class="math">\(1 \leq i \leq N\)</span> as input and works as
follows:</p>
<ol>
<li><p>the client requests from the server the item at virtual address <span class="math">\(i\)</span> in
<span class="math">\({\sf ORAM}\)</span>. To do this it first re-generates the item's tag <span class="math">\({\sf tag}_i =
F_{K_2}(i)\)</span>. It then does an (interactive) binary search to find the item with
virtual address <span class="math">\(i\)</span>. In other words, it asks the server for the item stored at
location <span class="math">\(N/2\)</span> (let's assume <span class="math">\(N\)</span> is even), decrypts it and compares its
tag with <span class="math">\({\sf tag}_i\)</span>. If <span class="math">\({\sf tag}_i\)</span> is less than the tag of item <span class="math">\({\sf ORAM}[N/2]\)</span>,
then it asks for the item at location <span class="math">\(N/4\)</span>; else it asks for the item at
location <span class="math">\(3N/4\)</span>; and so on.</p></li>
<li><p>it decrypts the item with <span class="math">\({\sf tag}_i\)</span> to recover <span class="math">\({\sf RAM}[i]\)</span>,</p></li>
<li><p>it then re-encrypts <span class="math">\({\sf RAM}[i]\)</span> (using new randomness) and asks the server to
store it back where it was found.</p></li>
</ol>
<p>The <span class="math">\({\sf Put}\)</span> operation takes an index <span class="math">\(1 \leq i \leq N\)</span> and a value <span class="math">\(v\)</span> as inputs
and works as follows:</p>
<ol>
<li><p>the client requests from the server the item with <span class="math">\({\sf tag}_i\)</span> (as above);</p></li>
<li><p>it then encrypts <span class="math">\(v\)</span> and asks the server to store it back at the
location where the previous item (i.e., the one with <span class="math">\({\sf tag}_i\)</span>) was found.</p></li>
</ol>
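<p>To make the mechanics concrete, here is a minimal Python sketch of the two operations. The PRF <span class="math">\(F\)</span> is replaced by a hash-based stand-in, and encryption and the per-probe client/server round trips are elided, so this captures only the access pattern, not the security:</p>

```python
import hashlib

def prf(key, i):
    # Toy stand-in for the PRF F_K2 (NOT cryptographically secure).
    return hashlib.sha256(f"{key}|{i}".encode()).hexdigest()

K2, N = "k2", 8
# Server-side array: items <RAM[i], i, tag_i> sorted by tag; encryption and
# the per-probe round trips to the server are elided.
server = sorted(((f"value-{i}", i, prf(K2, i)) for i in range(1, N + 1)),
                key=lambda item: item[2])

def get(i):
    """Re-generate tag_i and binary-search the server's array for it."""
    tag = prf(K2, i)
    lo, hi = 0, len(server)
    while lo < hi:
        mid = (lo + hi) // 2
        value, vaddr, t = server[mid]        # one round trip per probe
        if t == tag:
            server[mid] = (value, vaddr, t)  # "re-encrypt" and store back
            return mid, value
        lo, hi = (mid + 1, hi) if tag > t else (lo, mid)
    raise KeyError(i)

def put(i, v):
    """Identical access pattern to get; only the written-back value differs."""
    loc, _ = get(i)
    _, vaddr, tag = server[loc]
    server[loc] = (v, vaddr, tag)
```

<p>Since <code>put</code> performs exactly the same binary search followed by a write-back, the server's view of the two operations is identical, as noted below.</p>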
<p>Notice that from the server's point of view the two operations look the same.
In other words, the server cannot tell whether the client is executing a <span class="math">\({\sf Get}\)</span>
or a <span class="math">\({\sf Put}\)</span> operation since in either case all it sees is a binary search
followed by a request to store a new ciphertext at the same location.</p>
<p>Now suppose for a second that <span class="math">\({\sf ORAM}\)</span> only consisted of <span class="math">\({\sf RAM}_4\)</span>.
If that were the case then <span class="math">\({\sf ORAM}\)</span> would be one-time
oblivious in the sense that we could use it to read or write only once by executing
either a <span class="math">\({\sf Get}\)</span> or a <span class="math">\({\sf Put}\)</span> operation. Why is this the case? Remember that we
randomly permuted and encrypted our memory before sending it to the
server. This means that asking the server for the item at location <span class="math">\(j\)</span> reveals
nothing about that item's real/virtual address <span class="math">\(i\)</span>. Furthermore, the binary
search we do when looking for the item with virtual address <span class="math">\(i\)</span> depends only on
<span class="math">\({\sf tag}_i\)</span>, which is random and therefore reveals nothing about <span class="math">\(i\)</span>.</p>
<p>Of course, this only works once because if we want to access <span class="math">\(i\)</span> again then
we'll ask the server for the same location, which immediately tells it
something: namely, that we asked for the same thing twice.</p>
<p>So how do we hide the fact that we're asking for the same thing twice?
This is really the core difficulty in designing ORAMs and this is where the
cache will come in.</p>
<p>We start by initializing a counter <span class="math">\({\sf ctr} = 1\)</span>. To read location <span class="math">\(i\)</span> we
execute the following
<span class="math">\({\sf Read}\)</span> protocol:</p>
<ol>
<li><p>We <span class="math">\({\sf Get}\)</span> the entire cache. In other
words, we execute <span class="math">\({\sf Get}(j)\)</span> for all
<span class="math">\(N + \sqrt{N} + 1 \leq j \leq N + 2\cdot\sqrt{N}\)</span>;</p></li>
<li><p>If any of the <span class="math">\({\sf Get}\)</span> operations above result in the <span class="math">\(i\)</span>th item (i.e.,
if we get an item with virtual address <span class="math">\(i\)</span>) then we <span class="math">\({\sf Get}\)</span> a dummy item by executing
<span class="math">\({\sf Get}(N+{\sf ctr})\)</span>. Also, we set <span class="math">\(z\)</span> to be the item we found in the cache
and <span class="math">\(\ell\)</span> to be the cache location where we found it.</p></li>
<li><p>If none of the <span class="math">\({\sf Get}\)</span> operations above resulted in the <span class="math">\(i\)</span>th item, we
execute a <em>modified</em> <span class="math">\({\sf Get}(i)\)</span> and set <span class="math">\(z\)</span> to be the result and <span class="math">\(\ell = N +
\sqrt{N} + {\sf ctr}\)</span>. The modified version of <span class="math">\({\sf Get}(i)\)</span> works like a regular
<span class="math">\({\sf Get}(i)\)</span> operation, except that we update the item's virtual address to
<span class="math">\(\infty_2\)</span>, where <span class="math">\(\infty_2 > \infty_1\)</span>. In other words, we store an encryption
of <span class="math">\(\langle {\sf RAM}[i], \infty_2, {\sf tag}_i\rangle\)</span> back where we found it. This
will be useful for us later when we'll need to re-structure <span class="math">\({\sf ORAM}\)</span>.</p></li>
<li><p>We then process the entire cache again but slightly differently than
before (we do this so that we can store the item in the cache for future
accesses). In particular, for all <span class="math">\(N + \sqrt{N} + 1 \leq j \leq N +
2\cdot\sqrt{N}\)</span>,</p>
<ul>
<li>if <span class="math">\(j \neq \ell\)</span> we execute a <span class="math">\({\sf Get}(j)\)</span> operation</li>
<li>if <span class="math">\(j = \ell\)</span> we execute a <span class="math">\({\sf Put}(j, z)\)</span>.</li>
</ul></li>
<li><p>We increase <span class="math">\({\sf ctr}\)</span> by <span class="math">\(1\)</span>.</p></li>
</ol>
<p>The first thing to notice is that this is correct in the sense that by executing
this operation the client will indeed receive <span class="math">\({\sf RAM}[i]\)</span>.</p>
<p>The more interesting question, however, is why is this oblivious and, in
particular, why is this more than one-time oblivious? To see why this is
oblivious it helps to think of things from the server's perspective and see
why its view of the execution is independent of (i.e., not affected by) <span class="math">\(i\)</span>.</p>
<p>First, no matter what <span class="math">\(i\)</span> the client is looking for, it always <span class="math">\({\sf Get}\)</span>s the
entire cache so Step <span class="math">\(1\)</span> reveals no information about <span class="math">\(i\)</span> to the server. We then
have two possible cases:</p>
<ol>
<li><p>If the <span class="math">\(i\)</span>th item is in the cache (at location <span class="math">\(\ell\)</span>), we <span class="math">\({\sf Get}\)</span> a
dummy item; and <span class="math">\({\sf Put}\)</span> the <span class="math">\(i\)</span>th item at location <span class="math">\(\ell\)</span> while we re-process the
entire cache (in Step <span class="math">\(4\)</span>).</p></li>
<li><p>If the <span class="math">\(i\)</span>th item is not in the cache, we <span class="math">\({\sf Get}\)</span> the
<span class="math">\(i\)</span>th item and <span class="math">\({\sf Put}\)</span> it in the next open location in the cache while we re-process
the entire cache.</p></li>
</ol>
<p>In either case, the server sees the same thing: a <span class="math">\({\sf Get}\)</span> for an item at some
location between <span class="math">\(1\)</span> and <span class="math">\(N+\sqrt{N}\)</span>, and a sequence of <span class="math">\({\sf Get}/{\sf Put}\)</span> operations for
all addresses in the cache, i.e., between <span class="math">\(N+\sqrt{N}+1\)</span> and <span class="math">\(N+2\cdot\sqrt{N}\)</span>.
Recall that the server cannot distinguish between <span class="math">\({\sf Get}\)</span> and <span class="math">\({\sf Put}\)</span> operations.</p>
<p>The <span class="math">\({\sf Write}\)</span> protocol is similar to the <span class="math">\({\sf Read}\)</span> protocol. The only difference
is that in Step <span class="math">\(2\)</span>, we set <span class="math">\(z = v\)</span> if the <span class="math">\(i\)</span>th item is in the cache and in
Step <span class="math">\(3\)</span> we execute <span class="math">\({\sf Put}(i, v)\)</span> and set <span class="math">\(z = v\)</span>. Notice, however, that the
<span class="math">\({\sf Write}\)</span> protocol can introduce inconsistencies between the cache and
<span class="math">\({\sf RAM}_4\)</span>. More precisely, if the item has been accessed before (say, due to a
<span class="math">\({\sf Read}\)</span> operation), then a <span class="math">\({\sf Write}\)</span> operation will update the cache but not
the item in <span class="math">\({\sf RAM}_4\)</span>. This is OK, however, as it will be taken care of
in the re-structuring step, which we'll describe below.</p>
<p>So we can now read and write to memory without revealing which location we're
accessing and we can do this more than once! The problem, however, is that we
can do it at most <span class="math">\(\sqrt{N}\)</span> times because after that the cache is full so we
have to stop.</p>
<p><strong>Re-structuring.</strong>
So what if we want to do more than <span class="math">\(\sqrt{N}\)</span> reads? In that case we need to
<em>re-structure</em> our ORAM. By this, I mean that we have to re-encrypt
and re-permute all the items in <span class="math">\({\sf ORAM}\)</span> and reset our counter <span class="math">\({\sf ctr}\)</span> to <span class="math">\(1\)</span>.</p>
<p>If the client has enough space to store <span class="math">\({\sf ORAM}\)</span> locally then the easiest thing
to do is just to download <span class="math">\({\sf ORAM}\)</span>, decrypt it locally to recover <span class="math">\({\sf RAM}\)</span>, update
it (in case there were any inconsistencies) and setup a new ORAM from scratch.</p>
<p>If, on the other hand, the client does not have enough local storage then the
problem becomes harder. Here we'll assume the client only has <span class="math">\(O(1)\)</span> storage so
it can store, e.g., only two items.</p>
<p>Recall that in order to re-structure <span class="math">\({\sf ORAM}\)</span>, the client needs to re-permute
<span class="math">\({\sf RAM}_4\)</span> and re-encrypt everything obliviously while using only <span class="math">\(O(1)\)</span> space.
Also, the client needs to do this in a way that updates the elements that are in
an inconsistent state due to <span class="math">\({\sf Write}\)</span> operations. The key to doing all this
will be to figure out a way for the client to sort elements obliviously while
using <span class="math">\(O(1)\)</span> space. Once we can obliviously sort, the rest will follow
relatively easily.</p>
<p>To do this, Goldreich and Ostrovsky proposed to use a <a href="http://en.wikipedia.org/wiki/Sorting_network">sorting
network</a> like Batcher's <a href="http://en.wikipedia.org/wiki/Batcher's_sort">Bitonic
network</a>. Think of a sorting
network as a circuit composed of comparison gates. The gates take two inputs
<span class="math">\(x\)</span> and <span class="math">\(y\)</span> and output the pair <span class="math">\((x, y)\)</span> if <span class="math">\(x \lt y\)</span> and the pair <span class="math">\((y, x)\)</span> if <span class="math">\(x
\geq y\)</span>. Given a set of input values, the sorting network outputs the items in
sorted order. Sorting networks have two interesting properties: <span class="math">\((1)\)</span> the
comparisons they perform are independent of the input sequence; and <span class="math">\((2)\)</span> each
gate in the network is a binary operation (i.e., takes only two inputs). Of
course, there is an overhead to sorting obliviously: Batcher's network requires
<span class="math">\(O(N\log^2 N)\)</span> work as opposed to the traditional <span class="math">\(O(N\log N)\)</span> for sorting.</p>
<p>So to obliviously sort a set of ciphertexts <span class="math">\((c_1, \dots, c_{N+2\sqrt{N}})\)</span>
stored at the server, the client will start executing the sorting network and
whenever it reaches a comparison gate between the <span class="math">\(i\)</span>th and <span class="math">\(j\)</span>th item, it will
just request the <span class="math">\(i\)</span>th and <span class="math">\(j\)</span>th ciphertexts, decrypt them, compare them, and
store them back re-encrypted in the appropriate order. Note that by the first
property above, the client's access pattern reveals nothing to the server; and
by the second property the client will never need to store more than two items
at the same time.</p>
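<p>Here is a sketch of oblivious sorting with Batcher's bitonic network for a power-of-two number of items. The <code>trace</code> list plays the role of the server's view: it records only the pair of locations touched by each gate, which depends on the array length alone and never on the data:</p>

```python
def bitonic_pairs(n):
    """Compare-exchange schedule of Batcher's bitonic network for n = 2^k.
    The pairs depend only on n, never on the values being sorted."""
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    yield i, partner, (i & k) == 0   # True = sort ascending
            j //= 2
        k *= 2

def oblivious_sort(server, trace):
    """Client walks the network: fetch two items, compare locally, write
    both back in the right order.  `trace` is what the server observes."""
    for i, j, ascending in bitonic_pairs(len(server)):
        trace.append((i, j))                 # the server sees only indices
        if (server[i] > server[j]) == ascending:
            server[i], server[j] = server[j], server[i]
```

<p>Running this on two different inputs of the same length produces the exact same trace, which is precisely the first property above; and each gate touches only two items, which is the second.</p>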
<p>Now that we can sort obliviously, let's see how to re-structure the ORAM. We
will do it in two phases. In the first phase, we sort all the items in <span class="math">\({\sf ORAM}\)</span>
according to their virtual addresses. This is how we will get rid of
inconsistencies. Remember that the items in <span class="math">\({\sf RAM}_3\)</span> are augmented to have the
form <span class="math">\(\langle {\sf RAM}[i], i, {\sf tag}_i\rangle\)</span> for real items and <span class="math">\(\langle 0,
\infty_1, {\sf tag}_i\rangle\)</span> for dummy items. It follows that all items in the cache
have the first form since they are either copies or updates of real items
put there during <span class="math">\({\sf Read}\)</span> and <span class="math">\({\sf Write}\)</span> operations.</p>
<p>So we just execute the sorting network and, for each comparison gate,
retrieve the appropriate items, decrypt them, compare their virtual addresses and
return them re-encrypted in the appropriate order. The result of this process
is that <span class="math">\({\sf ORAM}\)</span> will now have the following form:</p>
<ol>
<li>the first <span class="math">\(N\)</span> items will consist of the most recent versions of the
real items, i.e., all the items with virtual addresses <em>other</em> than
<span class="math">\(\infty_1\)</span> and <span class="math">\(\infty_2\)</span>;</li>
<li>the next <span class="math">\(\sqrt{N}\)</span> items will consist of dummy items, i.e., all items
with virtual address <span class="math">\(\infty_1\)</span>;</li>
<li>the final <span class="math">\(\sqrt{N}\)</span> items will consist of the old/inconsistent
versions of the real items, i.e., all items with virtual address <span class="math">\(\infty_2\)</span>
(remember that in Step <span class="math">\(3\)</span> of <span class="math">\({\sf Read}\)</span> and <span class="math">\({\sf Write}\)</span> we executed a modified
<span class="math">\({\sf Get}(i)\)</span> that updated the item's virtual address to <span class="math">\(\infty_2\)</span>).</li>
</ol>
<p>In the second phase, we randomly permute and re-encrypt the first <span class="math">\(N+\sqrt{N}\)</span>
items of <span class="math">\({\sf ORAM}\)</span>. We first choose a new key <span class="math">\(K_3\)</span> for <span class="math">\(F\)</span>. We then access each
item from location <span class="math">\(1\)</span> to <span class="math">\(N+\sqrt{N}\)</span> and update their tags to <span class="math">\(F_{K_3}(i)\)</span>.<br>
Once we've updated the tags, we sort all the items according to their tags.
The result will be a new random permutation of items. Note that we don't
technically have to do this in two passes; but it's easier to explain this way.</p>
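<p>A compact sketch of the two-phase re-structuring, where Python's built-in sort stands in for the oblivious sorting network (a real client would run Batcher's network with <span class="math">\(O(1)\)</span> storage):</p>

```python
import hashlib

def prf(key, i):
    # Toy stand-in for the PRF F_K (NOT cryptographically secure).
    return hashlib.sha256(f"{key}|{i}".encode()).hexdigest()

INF1, INF2 = 10**9, 10**9 + 1   # concrete stand-ins, infinity_1 < infinity_2

def restructure(oram, n, root, new_key):
    """Rebuild an ORAM of n real items with a size-root cache."""
    # Phase 1: sort by virtual address.  The current items come first, then
    # the inf_1 dummies, then the inf_2 stale copies, which are discarded.
    oram = sorted(oram, key=lambda item: item[1])[:n + root]
    # Phase 2: walk locations 1..n+root, re-tag each item under the new key
    # K3, then sort by tag -- a fresh pseudorandom permutation of the items.
    oram = [(value, vaddr, prf(new_key, pos))
            for pos, (value, vaddr, _) in enumerate(oram, start=1)]
    oram.sort(key=lambda item: item[2])
    return oram, [None] * root   # new ORAM plus a fresh, empty cache
```

<p>Note how phase 1 resolves inconsistencies for free: the stale <span class="math">\(\infty_2\)</span> copy of an updated item sorts to the end and is dropped, while the fresh copy from the cache survives.</p>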
<p>At this point, we're done! <span class="math">\({\sf ORAM}\)</span> is as good as new and we can start accessing
it again safely.</p>
<p><strong>Efficiency.</strong>
So what is the efficiency of the Square-Root solution? Setup is <span class="math">\(O(N\log^2N)\)</span>:
<span class="math">\(O(N)\)</span> to construct the real, dummy and cache items and <span class="math">\(O(N\log^2 N)\)</span> to
permute everything through sorting.</p>
<p>Each access operation (i.e., <span class="math">\({\sf Read}\)</span> or <span class="math">\({\sf Write}\)</span>) is <span class="math">\(O(\sqrt{N}\cdot\log N)\)</span>:
<span class="math">\(O(\sqrt{N})\)</span> total get/put operations to process the cache twice, each costing
<span class="math">\(O(\log N)\)</span> due to binary search.</p>
<p>Restructuring is <span class="math">\(O(N\log^2 N)\)</span>: <span class="math">\(O(N\log^2 N)\)</span> to sort by virtual address and
<span class="math">\(O(N\log^2N)\)</span> to sort by tag. Restructuring, however, only occurs once every
<span class="math">\(\sqrt{N}\)</span> accesses. Because of this, we usually average the cost of
re-structuring over the number of read/write operations supported to give an
amortized access cost. In our case, the amortized access cost is then</p>
<p><span class="math">\[
O\left(\sqrt{N}\cdot\log N + \frac{N\log^2 N}{\sqrt{N}}\right)
\]</span></p>
<p>which is <span class="math">\(O(\sqrt{N}\cdot\log^2 N)\)</span>.</p>
<h2 id="orambased-encrypted-search">ORAM-Based Encrypted Search</h2>
<p>So now that we know how to build an ORAM, we'll see how to use it for encrypted
search. There are two possible ways to do this.</p>
<p><strong>A naive approach.</strong>
The first is for the client to just dump all the <span class="math">\(n\)</span> documents <span class="math">\(\textbf{D} =
(D_1, \dots, D_n)\)</span> in an array <span class="math">\({\sf RAM}\)</span>, setup an ORAM <span class="math">\((K, {\sf ORAM}) = {\sf Setup}(1^k,
{\sf RAM})\)</span> and send <span class="math">\({\sf ORAM}\)</span> to the server. To search, the client can just simulate a
sequential search algorithm via the <span class="math">\({\sf Read}\)</span> protocol; that is, replace every
read operation of the search algorithm with an execution of the <span class="math">\({\sf Read}\)</span>
protocol. To update the documents the client can similarly simulate an update
algorithm using the <span class="math">\({\sf Write}\)</span> protocol.</p>
<p>This will obviously be slow. Let's assume all the documents have bit-length <span class="math">\(d\)</span>
and that <span class="math">\({\sf RAM}\)</span> has a block size of <span class="math">\(B\)</span> bits. The document collection will then
fit in (approximately) <span class="math">\(N = n\cdot d\cdot B^{-1}\)</span> blocks. The sequential scan
algorithm is itself <span class="math">\(O(N)\)</span>, but on top of that we'll have to execute an entire
<span class="math">\({\sf Read}\)</span> protocol for every address of memory read.</p>
<p>Remember that if we're using the Square-Root solution as our ORAM then the
<span class="math">\({\sf Read}\)</span> protocol requires <span class="math">\(O(\sqrt{N}\cdot\log^2 N)\)</span> <em>amortized</em> work. So
in total, search would be <span class="math">\(O(N^{3/2}\cdot\log^2 N)\)</span> which would not scale. Now
imagine for a second if we were using the FHE-based ORAM described above which
requires <span class="math">\(O(N)\)</span> work for each <span class="math">\({\sf Read}\)</span> and <span class="math">\({\sf Write}\)</span>. In this scenario, a single
search would take <span class="math">\(O(N^2)\)</span> time!</p>
<p><strong>A better approach.</strong><sup class="footnote-ref" id="fnref:2"><a class="footnote" href="#fn:2">2</a></sup>
A better idea is for the client to build two arrays <span class="math">\({\sf RAM}_1\)</span> and <span class="math">\({\sf RAM}_2\)</span>. <sup class="footnote-ref" id="fnref:3"><a class="footnote" href="#fn:3">3</a></sup>
In <span class="math">\({\sf RAM}_1\)</span> it will store a data structure that supports fast searches on the
document collection (e.g., an
<a href="http://en.wikipedia.org/wiki/Inverted_index">inverted index</a>) and in
<span class="math">\({\sf RAM}_2\)</span> it will store the documents <span class="math">\(\textbf{D}\)</span> themselves. It then builds and
sends <span class="math">\({\sf ORAM}_1 = {\sf Setup}(1^k, {\sf RAM}_1)\)</span> and <span class="math">\({\sf ORAM}_2 = {\sf Setup}(1^k, {\sf RAM}_2)\)</span> to the
server. To search, the client simulates a query to the data structure in
<span class="math">\({\sf ORAM}_1\)</span> via the <span class="math">\({\sf Read}\)</span> protocol (i.e., it replaces each read operation in
the data structure's query algorithm with an execution of <span class="math">\({\sf Read}\)</span>). From this,
the client will recover the identifiers of the documents that contain the
keyword and with this information it can just read those documents from
<span class="math">\({\sf ORAM}_2\)</span>.</p>
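<p>Schematically, the two-RAM layout looks as follows. The <code>oram_read</code> helper is a hypothetical stand-in that here is just a plain dictionary lookup; in the real construction every such call would be a full execution of the <span class="math">\({\sf Read}\)</span> protocol against the corresponding ORAM:</p>

```python
# RAM_2: the document collection itself, keyed by document identifier.
documents = {1: "the cat sat", 2: "crypto is fun", 3: "the dog ran"}

# RAM_1: an inverted index mapping each keyword to its document identifiers.
index = {}
for doc_id, text in documents.items():
    for word in text.split():
        index.setdefault(word, []).append(doc_id)

def oram_read(ram, key):
    # Stand-in: a plain lookup here, but in the construction each call is
    # an oblivious Read against ORAM_1 or ORAM_2.
    return ram[key]

def search(keyword):
    ids = oram_read(index, keyword)                  # query ORAM_1
    return [oram_read(documents, i) for i in ids]    # m fetches from ORAM_2
```

<p>The point of the split is visible here: the number of oblivious reads is driven by <span class="math">\(m\)</span>, the number of matching documents, rather than by the total collection size.</p>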
<p>Now suppose there are <span class="math">\(m\)</span> documents that contain the keyword and that we're
using an optimal-time data structure (i.e., a structure with a query algorithm
that runs in <span class="math">\(O(m)\)</span> time like an inverted index). Also, assume that the data
structure fits in <span class="math">\(N_1\)</span> blocks of <span class="math">\(B\)</span> bits and that the data collection
fits in <span class="math">\(N_2 = n\cdot d/B\)</span> blocks.</p>
<p>Again, if we were using the Square-Root solution for our ORAMs, then the first
step would take <span class="math">\(O(m\cdot\sqrt{N_1}\cdot\log^2 N_1)\)</span> time and the second step will take</p>
<p><span class="math">\[
O\left( \frac{m\cdot d}{B}\cdot\sqrt{N_2}\cdot\log^2 N_2 \right).
\]</span></p>
<p>In practice, the size of a fast data structure for keyword search can be large.
A very conservative estimate for an inverted index, for example, would be that
it is roughly the size of the data collection. <sup class="footnote-ref" id="fnref:4"><a class="footnote" href="#fn:4">4</a></sup> Setting <span class="math">\(N = N_1 = N_2\)</span>, the
total search time would be</p>
<p><span class="math">\[
O\left( (1+d/B)\cdot m \cdot\sqrt{N}\cdot\log^2 N\right)
\]</span></p>
<p>which is <span class="math">\(O(m\cdot d\cdot B^{-1} \cdot \sqrt{N}\cdot \log^2 N)\)</span> (since <span class="math">\(d \gg
B\)</span>) compared to the previous approach's <span class="math">\(O(n\cdot d
\cdot B^{-1} \cdot \sqrt{N}\cdot\log^2N)\)</span>.</p>
<p>In cases where the search term appears in <span class="math">\(m \ll n\)</span> documents, this can be a
substantial improvement.</p>
<h2 id="is-this-practical">Is This Practical?</h2>
<p>If one were to look only at the asymptotics, one might conclude that the
two-RAM solution described above is reasonably efficient. After all, it
would take at least <span class="math">\(O(m\cdot d \cdot B^{-1})\)</span> time just to retrieve the
matching files from (unencrypted) memory, so the two-RAM solution adds roughly a
<span class="math">\(\sqrt{N}\)</span> multiplicative factor (ignoring log factors) over the minimum retrieval time.</p>
<p>Also, there are much more efficient ORAM constructions than the Square-Root
solution. In fact, in their paper, Goldreich and Ostrovsky also proposed the
Hierarchical solution, which achieves <span class="math">\(O(\log^3 N)\)</span> amortized access cost.
Goodrich and Mitzenmacher
[<a href="http://arxiv.org/pdf/1007.1259v2.pdf">GM11</a>] gave a solution with
<span class="math">\(O(\log^2 N)\)</span> amortized access cost and, recently, Kushilevitz, Lu and
Ostrovsky [<a href="http://eprint.iacr.org/2011/327.pdf">KLO12</a>] gave a solution
with <span class="math">\(O(\log^2N/\log\log N)\)</span> amortized cost (and there are even more recent
papers that improve on this under certain conditions). There are also works
that trade off client storage for access efficiency. For example, Williams, Sion
and Carbunar
[<a href="http://digitalpiglet.org/research/sion2008pir-ccs.pdf">WSC08</a>]
propose a solution with <span class="math">\(O(\log N\cdot\log\log N)\)</span> amortized access cost and
<span class="math">\(O(\sqrt{N})\)</span> client storage while Stefanov, Shi and Song
[<a href="http://arxiv.org/pdf/1106.3652.pdf">SSS12</a>] propose a solution with
<span class="math">\(O(\log N)\)</span> amortized overhead for clients that have <span class="math">\(O(N)\)</span> local storage, where
the underlying constant is very small. There is also a line of work that tries
to de-amortize ORAM in the sense that it splits the re-structuring operation so
that it happens progressively over each access. This was first considered by
Ostrovsky and Shoup in
[<a href="http://www.cs.ucla.edu/~rafail/PUBLIC/28.pdf">OS97</a>] and was
further studied by Goodrich, Mitzenmacher, Ohrimenko and Tamassia
[<a href="http://arxiv.org/pdf/1107.5093.pdf">GMOT11</a>] and by Shi, Chan,
Stefanov and Li [<a href="http://eprint.iacr.org/2011/407.pdf">SSSL11</a>].</p>
<p>All in all, this may not seem that bad and, intuitively, the two-RAM solution might
actually be reasonably practical for small to moderate-scale data
collections---especially considering all the recent improvements in efficiency
that have been proposed. For large- or massive-scale collections, however, I'd
be surprised <sup class="footnote-ref" id="fnref:5"><a class="footnote" href="#fn:5">5</a></sup>.</p>
<h2 id="conclusions">Conclusions</h2>
<p>In this post we went over the ORAM-based solution to encrypted search which
provides the most secure solution to our problem since it hides
everything---even the access pattern!</p>
<p>In the next post we'll cover an approach that tries to strike a balance between
efficiency and security. In particular, this solution is as efficient as the
deterministic-encryption-based solution while being only slightly less secure
than the ORAM-based solution.</p>
<div class="footnotes">
<hr>
<ol>
<li id="fn:1">I haven't seen this construction written down anywhere. It's fairly obvious, however, so I suspect it's been mentioned somewhere. If anyone knows of a reference, please let me know.
<a class="footnote-return" href="#fnref:1">↩</a></li>
<li id="fn:2">Like the FHE-based ORAM, I have not seen this construction written down anywhere so if anyone knows of a reference, please let me know.
<a class="footnote-return" href="#fnref:2">↩</a></li>
<li id="fn:3">Of course, the following could be done using a single RAM, but splitting into two makes things easier to explain.<br>
<a class="footnote-return" href="#fnref:3">↩</a></li>
<li id="fn:4">In practice, this would <em>not</em> be the case and, in addition, we could make use of index compression techniques.
<a class="footnote-return" href="#fnref:4">↩</a></li>
<li id="fn:5">I won't attempt to draw exact lines between what's small-, moderate- and large-scale since I think that's a question best answered by experimental results.<br>
<a class="footnote-return" href="#fnref:5">↩</a></li>
</ol>
</div>
How to Search on Encrypted Data: Functional Encryption (Part 3)
http://senykam.github.io/2013/10/30/how-to-search-on-encrypted-data-functional-encryption-part-3
Wed, 30 Oct 2013 11:02:40 -0300http://senykam.github.io/2013/10/30/how-to-search-on-encrypted-data-functional-encryption-part-3<p><em>This is the third part of a series on searching on encrypted data. See parts <a href="http://outsourcedbits.org/2013/10/06/how-to-search-on-encrypted-data-part-1/">1</a>, <a href="https://outsourcedbits.org/2013/10/30/how-to-search-on-encrypted-data-part-2/">2</a>, <a href="https://outsourcedbits.org/2013/12/20/how-to-search-on-encrypted-data-part-4-oblivious-rams/">4</a> and <a href="https://outsourcedbits.org/2014/08/21/how-to-search-on-encrypted-data-searchable-symmetric-encryption-part-5/">5</a>.</em></p>
<p><img src="http://senykam.github.io/img/search.jpg" class="alignright" width="250">
Previously, we covered the simplest solution for encrypted search which
consisted of using a deterministic encryption scheme (more generally, using a
property-preserving encryption scheme) to encrypt keywords. This resulted in an
encrypted search solution with sub-linear (in <span class="math">\(n\)</span>) search time but that leaked
quite a bit of information to the server.</p>
<p>We'll now describe a different approach that provides the opposite properties:
slow search but better security. At a high-level, one can view this approach
as simply replacing the PPE scheme in the previous solution with a <em>functional
encryption</em> (FE) scheme.</p>
<h2 id="functional-and-identitybased-encryption">Functional and Identity-Based Encryption</h2>
<p>The notion of FE was first described by Sahai and Waters in a talk
[<a href="http://www.cs.utexas.edu/~bwaters/presentations/files/functional.ppt">SW09</a>]
and later formalized by Boneh, Sahai and Waters
[<a href="http://eprint.iacr.org/2010/543.pdf">BSW10</a>] and by O'Neill
[<a href="http://eprint.iacr.org/2010/556">O10</a>]. Starting with the work of Boneh
and Franklin on <a href="http://en.wikipedia.org/wiki/ID-based_encryption">identity-based
encryption</a>, there was a slew
of new encryption schemes achieving various properties (e.g., attribute-based
encryption, hidden vector encryption, predicate encryption). Many of these
constructions felt loosely related so the idea behind FE was to capture all
these schemes under a single framework.</p>
<p>Though everything we'll cover can be done with FE, for concreteness, we'll
consider the special case of IBE, which was first suggested by Shamir
[<a href="http://discovery.csc.ncsu.edu/Courses/csc774-S07/shamir84.pdf">Shamir84</a>]
and realized by Boneh and Franklin
[<a href="http://crypto.stanford.edu/~dabo/pubs/papers/bfibe.pdf">BF01</a>].</p>
<p>A public-key IBE scheme consists of four algorithms:</p>
<ul>
<li>A setup algorithm <span class="math">\({\sf Setup}\)</span> used to generate a master secret and public key
pair <span class="math">\((msk, mpk)\)</span>.</li>
<li>An encryption algorithm <span class="math">\({\sf Enc}\)</span> that takes as input the master public-key
<span class="math">\(mpk\)</span>, an identity <span class="math">\(id\)</span> and a message <span class="math">\(m\)</span> as input and returns a ciphertext <span class="math">\(c\)</span>.</li>
<li>A key generation algorithm <span class="math">\({\sf Keygen}\)</span> that takes as input the master secret
key <span class="math">\(msk\)</span> and an identity <span class="math">\(id\)</span> and returns a secret key <span class="math">\(sk_{id}\)</span>.</li>
<li>And finally a decryption algorithm <span class="math">\({\sf Dec}\)</span> that takes as input a secret key
<span class="math">\(sk_{id}\)</span> and a ciphertext <span class="math">\(c\)</span> and returns a message <span class="math">\(m\)</span> or a failure symbol <span class="math">\(\bot\)</span>.</li>
</ul>
<p>The motivation behind IBE is key distribution. In particular, using an IBE
scheme should be easier than using a standard (public-key) encryption scheme
where public keys have to be certified, revoked and verified.</p>
<p>Let's consider a concrete example. Suppose Alice wants to send an encrypted
message to Bob who works at Microsoft. The idea is that Microsoft would first
generate a pair of master keys <span class="math">\((msk, mpk)\)</span> and distribute <span class="math">\(mpk\)</span> together with a
certificate. To send her message <span class="math">\(m\)</span>, Alice would retrieve Microsoft's master
public key <span class="math">\(mpk\)</span>, verify its certificate and then encrypt <span class="math">\(m\)</span> under Bob's
identity by computing:</p>
<p><span class="math">\[
c = {\sf Enc}(mpk, "\texttt{bob@microsoft.com}", m).
\]</span></p>
<p>To decrypt the ciphertext <span class="math">\(c\)</span>, Bob needs to hold a secret key for his identity
under Microsoft's master key:</p>
<p><span class="math">\[
sk = {\sf Keygen}(msk, "\texttt{bob@microsoft.com}").
\]</span></p>
<p>Given this key, he can then recover the message by computing <span class="math">\(m = {\sf Dec}(sk, c)\)</span>.</p>
<p>Notice that Alice never needed to know what Bob's public key was or to verify
any certificate for his key. The only certificate she had to verify was for
Microsoft's master public key but once that key is authenticated she can send
email to anyone at Microsoft without any additional work.</p>
<h2 id="publickey-encrypted-search">Public-Key Encrypted Search</h2>
<p>We are now ready to see how (anonymous) IBE can be used to search over
encrypted data. This idea was first proposed by Boneh, Di Crescenzo, Ostrovsky
and Persiano
[<a href="http://crypto.stanford.edu/~dabo/pubs/papers/encsearch.pdf">BCOP04</a>] and is
best explained by considering the following email scenario where Alice wants to
send an encrypted email to Bob.</p>
<p>Bob first generates a master secret and public key pair for the IBE scheme
<span class="math">\((msk, mpk)\)</span> and a secret and public key pair for a standard public-key
encryption scheme <span class="math">\((sk, pk)\)</span>. He then makes the public keys <span class="math">\((mpk, pk)\)</span> public
and keeps the secret keys <span class="math">\((msk, sk)\)</span> private. Alice encrypts her message under
<span class="math">\(pk\)</span> using the standard public-key encryption scheme, resulting in a ciphertext
<span class="math">\(c\)</span>. She then attaches IBE encryptions of "1" under Bob's master public key
<span class="math">\(mpk\)</span> with the keywords as the identity. This results in a set of IBE encryptions
<span class="math">\((e_1, \dots, e_m)\)</span> where each <span class="math">\(e_j\)</span> (for <span class="math">\(1 \leq j \leq m\)</span>) is defined as</p>
<p><span class="math">\[
e_j = {\sf Enc}(mpk, w_j),
\]</span></p>
<p>where <span class="math">\((w_1, \dots, w_m)\)</span> are the keywords.</p>
<p>Let's suppose Bob's email server has received <span class="math">\(n\)</span> emails of this form, so that
it now holds a set of encrypted emails <span class="math">\((c_1, \dots, c_n)\)</span> and an encrypted
database</p>
<p><span class="math">\[
{\sf EDB} = \bigg(\big(e_{1, 1}, \dots, e_{1, m}, {\sf ptr}(c_1)\big), \dots,
\big(e_{n,1}, \dots, e_{n,m}, {\sf ptr}(c_n)\big)\bigg).
\]</span></p>
<p>Now, if Bob wants to retrieve the emails with keyword <span class="math">\(w\)</span>, he just needs to
generate a secret IBE key as <span class="math">\(sk_w = {\sf Keygen}(msk, w)\)</span> and send it as the token
to the server. The server then tries to decrypt each IBE ciphertext in
<span class="math">\({\sf EDB}\)</span> and if successful follows the associated pointer to return the
appropriate encrypted email.</p>
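The whole pipeline (building <span class="math">\({\sf EDB}\)</span> and the server's trial-decryption search) can be sketched as below. The IBE is again a toy stand-in (its master public key equals the master secret key, so it is not secure and only shows the data flow), and the "pointers" are just string labels for the encrypted emails.

```python
import hashlib, hmac, os

# Sketch of BCOP-style public-key encrypted search with a toy stand-in
# for an anonymous IBE scheme. INSECURE: mpk == msk in this mock; it
# only illustrates the data flow of the construction.

def keygen(msk, identity):
    return hmac.new(msk, identity.encode(), hashlib.sha256).digest()

def enc(mpk, identity, msg):
    k = hmac.new(mpk, identity.encode(), hashlib.sha256).digest()
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(msg, hashlib.sha256(k + nonce).digest()))
    tag = hmac.new(k, nonce + ct, hashlib.sha256).digest()
    return nonce, ct, tag

def dec(sk, c):
    nonce, ct, tag = c
    if not hmac.compare_digest(tag, hmac.new(sk, nonce + ct, hashlib.sha256).digest()):
        return None                     # wrong identity key: decryption fails
    return bytes(a ^ b for a, b in zip(ct, hashlib.sha256(sk + nonce).digest()))

def build_edb(mpk, emails):
    # emails: list of (pointer, keyword list). Each EDB row holds IBE
    # encryptions of "1" under each keyword-as-identity, plus the pointer.
    return [([enc(mpk, w, b"1") for w in kws], ptr) for ptr, kws in emails]

def search(edb, token):
    # O(n*m): try the token against every encrypted keyword in every row
    return [ptr for encs, ptr in edb
            if any(dec(token, e) == b"1" for e in encs)]

msk = mpk = os.urandom(16)              # mock shortcut (see caveat above)
edb = build_edb(mpk, [("c1", ["crypto", "meeting"]),
                      ("c2", ["lunch"]),
                      ("c3", ["crypto"])])
token = keygen(msk, "crypto")           # Bob's token for keyword "crypto"
assert search(edb, token) == ["c1", "c3"]
```

The server never sees the keywords, only ciphertexts and tokens; it just observes which trial decryptions succeed.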
<p>An important observation is that a standard IBE scheme here will not be enough.
The problem is that the notion of IBE does not necessarily guarantee that a
ciphertext hides information about the identity used to create it.
This means that if we were to use a standard IBE scheme, <span class="math">\({\sf EDB}\)</span> could leak the
keywords to the server. To address this, Boneh et al. observe that what you
actually need is an <em>anonymous</em> IBE scheme which essentially means that the
ciphertexts do not reveal information about the identities. Fortunately, we
know how to construct such schemes efficiently so this is not a major concern
from a practical point of view (e.g., the Boneh-Franklin IBE scheme is
anonymous).</p>
<p><strong>Efficiency.</strong>
Search time for the server is <span class="math">\(O(nm)\)</span> since it has to try to decrypt each
ciphertext in the <span class="math">\({\sf EDB}\)</span>. Assuming <span class="math">\(m \ll n\)</span>, this is <span class="math">\(O(n)\)</span> which is a <em>a
lot</em> slower than the solution based on deterministic encryption described in the
<a href="http://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-2/">previous post</a>
which required time <span class="math">\(o(n)\)</span> (i.e., sub-linear in <span class="math">\(n\)</span>).</p>
<h2 id="is-this-secure">Is this Secure?</h2>
<p>While this approach is slower than the PPE-based approach, it has better security
properties. First, the encrypted database by itself does not reveal much useful
information to the server since---unlike the deterministic approach---keywords
are encrypted using a <em>randomized</em> (identity-based) encryption scheme. So
even if two documents have keywords in common, the encrypted keywords in <span class="math">\({\sf EDB}\)</span>
will be different. This means that we don't have to make unnatural assumptions
about the data (e.g., that it has high entropy) to use it safely.</p>
<p>There is an issue, however, with this approach: <em>it
does not protect the search terms</em>. In particular, the server could mount the
following attack to figure out which keyword the client is searching for.</p>
<p>Suppose the server has some dictionary <span class="math">\(W\)</span> of <span class="math">\(d\)</span> words. For each keyword <span class="math">\(w \in
W\)</span> it encrypts "1" with key <span class="math">\(mpk\)</span> and identity <span class="math">\(w\)</span>.
This results in a set of <span class="math">\(d\)</span> (identity-based) encryptions <span class="math">\((e'_1, \dots, e'_d)\)</span>.
Now, given some token <span class="math">\(sk_w\)</span>, the server can learn <span class="math">\(w\)</span> by simply trying to
decrypt each of the ciphertexts <span class="math">\(e'_i\)</span> with <span class="math">\(sk_w\)</span>. If the decryption succeeds for
some <span class="math">\(e'_i\)</span>, then the server knows that <span class="math">\(sk_w\)</span> is for the identity used to
generate <span class="math">\(e'_i\)</span>.</p>
<p>Notice that the attack does not result from a deficiency of any particular IBE
scheme but that it applies to <em>any</em> public-key encrypted search solution.
The fundamental problem is that the server has both the ability to create EDBs
(since it has the public key) and to search over them. So what this tells us is
that, as defined, public-key encrypted search cannot protect search terms.
<p>So what can we do about this? Recently, Boneh, Raghunathan and Segev
[<a href="http://eprint.iacr.org/2013/283.pdf">BRS13</a>] and Ariaga and Tang
[<a href="http://eprint.iacr.org/2013/330.pdf">AT13</a>] set out to design
public-key encrypted search solutions that achieved the best possible level of
confidentiality for search terms. Roughly speaking, what this means is that if
the search terms are hard enough to guess, then the schemes proposed will
protect them.</p>
<p>But what do we do if (as in most cases) our search terms are not hard to guess?
Well, we don't really have a good answer, except to note that this problem does not
occur in the symmetric setting, where only the client can generate EDBs. So,
depending on the application, a symmetric solution might be preferable.</p>
<h2 id="conclusions">Conclusions</h2>
<p>So far, we've seen two approaches to searching on encrypted data. The first,
the
<a href="http://outsourcedbits.org/2013/10/14/how-to-search-on-encrypted-data-part-2/">PPE-based
approach</a>,
resulted in schemes with fast search (sub-linear in <span class="math">\(n\)</span>) but with relatively
weak security guarantees. The second, the FE-based approach (of which the
anonymous-IBE construction above is an instance), resulted in
schemes with slow search (linear in <span class="math">\(n\)</span>) but with better security guarantees.</p>
<p>In the next post, we'll go over solutions that are even slower, but that achieve
the strongest possible levels of security!</p>