docs/abstract/abstract.tex

   1 \documentclass[conference]{IEEEtran}
   2
   3 \usepackage[noadjust]{cite}
   4
   5 \usepackage[dvips]{graphicx}
   6
   7 \usepackage{url}
   8 \usepackage[cmex10]{amsmath}
   9 \interdisplaylinepenalty=2500
  10 \usepackage{threeparttable}
  11 \usepackage{multirow}
  12
  13 \begin{document}
  14
  15 \title{Using Altruistic Peers to Help With Free Content Downloads}
  16 \author{\IEEEauthorblockN{Cameron Dale}
  17 \IEEEauthorblockA{School of Computing Science\\
  18 Simon Fraser University\\
  19 Burnaby, British Columbia, Canada\\
  20 Email: camerond@cs.sfu.ca}
  21 \and
  22 \IEEEauthorblockN{Jiangchuan Liu}
  23 \IEEEauthorblockA{School of Computing Science\\
  24 Simon Fraser University\\
  25 Burnaby, British Columbia, Canada\\
  26 Email: jcliu@cs.sfu.ca}}
  27
  28 \maketitle
  29
  30 % \begin{abstract}
  31 % A large amount of free content is available over the Internet from
  32 % many different distributors. Most of this content uses the traditional
  33 % client-server model to handle requests from users. However, there is an
  34 % opportunity to use peer-to-peer techniques to reduce the cost of
  35 % much of this distribution, especially due to the altruistic nature
  36 % of many of these users. We present a general technique for
  37 % satisfying the needs of this P2P distribution, which is applicable
  38 % to many of these distributors systems. Our method makes use of a DHT
  39 % for storing the location of peers, using the cryptographic hash of
  40 % the file as a key. We then go further and implement a solution for
  41 % the distribution of Debian software packages.
  42 %
  43 % We have designed a generally applicable peer-to-peer content
  44 % distribution model to be used by many of the free content
  45 % distributors operating today. It makes use of already existing
  46 % features of most package management systems, to create a
  47 % peer-to-peer distribution that should substantially reduce the costs
  48 % of hosting the content.
  49 %
  50 % We have also implemented our design in software to be used in
  51 % conjuction with Debian-based distribution of Linux software
  52 % packages. It will soon be in use by many users of the Debian
  53 % project's distribution, and so serves as an example of the
  54 % opportunities that exist.
  55 % \end{abstract}
  56
  57 \section{Introduction}
  58 \label{intro}
  59
  60 There are a large number of free content distributors using custom
  61 package distribution systems over the Internet to distribute content
  62 to their users. These distributors have developed many different
  63 methods for this distribution, but they almost exclusively use a
  64 client-server model to satisfy user requests. The contents' size,
  65 popularity and number of users results in a large number of
  66 requests, which usually requires a network of mirrors to handle. Due
  67 to the free nature of this content, many users are willing and able
  68 to contribute upload bandwidth to this distribution, but currently
  69 have no way to do this.
  70
  71 There are many peer-to-peer implementations available today, but
  72 none is very well suited to this specific problem. Many distributors
  73 make their content available in some form using BitTorrent, but it
  74 is not ideally suited to an application such as this one. BitTorrent
  75 is designed for large static collections of files, in which all
  76 peers are interested in most or all of the content. But the packages
  77 in this case are small, constantly being updated, and users are only
  78 interested in a small subset of them. Other proposed implementations
  79 suffer from some of the same, and other, problems, and so none has
  80 been well-used to address this problem.
  81
  82 We present a new peer-to-peer distribution model to meet these
  83 demands. It is based on the lessons learned from many previous
  84 implementations of successful peer-to-peer protocols, and makes use
  85 of a distributed hash table (DHT) and borrows many ideas from
  86 BitTorrent. The model relies on the pre-existence of cryptographic
  87 hashes of the content, which uniquely identify it for a request from
  88 other peers. If the peer-to-peer download fails, then the original
  89 request to the server is used as a fallback to prevent dissatisfied
  90 users. The peer can then share this new content with others through
  91 the P2P system. This system would efficiently satisfy the demands of
  92 a large number of users, and should significantly reduce the
  93 currently substantial bandwidth costs of hosting this content.
  94
  95 We also present a sample implementation based on the Debian package
  96 distribution system. This implementation is simple, makes use of
  97 mostly well-known technologies, and so serves as an example for
  98 other free content distributors of the opportunity that can be met
  99 with such a system.
 100
 101 \section{Problem}
 102
 103 There are a large number of groups using the Internet to distribute
 104 their free content. This content is made available from a free
 105 download site, which usually requires a network of mirrors to
 106 support the large number of requests. This content almost always
 107 supports some kind of file verification, using cryptographic hashes,
 108 to verify the completed downloads accuracy or authenticity.
 109
 110 Most Linux distributions use a software package management system
 111 that fetches packages to be installed from a network of mirrors.
 112 Debian-based distributions uses the \texttt{apt} program, which
 113 downloads Debian packages from one of many mirrors. RPM-based
 114 distributions use \texttt{yum}, and Gentoo uses \texttt{portage},
 115 which both operate in a similar way. Other free software
 116 distributors also use a similar system: CPAN distributes files
 117 containing software packages for the PERL programming language,
 118 using SOAP RPC requests to find and download files; Cygwin provides
 119 many of the standard Unix/Linux tools in a Windows environment,
 120 using a package management tool that requests packages from
 121 websites; there are two software distribution systems for Mac OSX,
 122 fink and MacPorts, that also retrieve packages in this way. Also,
 123 some systems use direct web downloads, but with a hash verification
 124 file also available for download next to the desired file. These
 125 hash files usually have the same file name, but with an added
 126 extension identifying the hash used (e.g. \texttt{.md5} for the MD5
 127 hash). Finally, there are other programs that make use of
 128 cryptographic hashing to identify files: e.g. Git is a version
 129 control system in which all files and changes are identified by
 130 their SHA1 hash.
 131
 132 These systems all share some commonalities: the content is avaliable
 133 for anyone to freely download, the content is divided into distinct
 134 units (packages), users are typically not interested in downloading
 135 all of the content available, and hashes of the content are
 136 available before the download is attempted. We also expect that
 137 there are a number of users of these systems that are motivated by
 138 altruism to want to help out with this distribution. This is common
 139 in these systems due to the free nature of the content being
 140 delivered.
 141
 142 % Although at first it might seem that a single all-reaching solution
 143 % is possible for this problem, there are some differences in each
 144 % system that require independent solutions. The systems all use
 145 % different tools for their distribution, so any implementation would
 146 % have to be specific to the tool it is to integrate with. In
 147 % particular, how to obtain requests from the tool or user, and how to
 148 % find a hash that corresponds to the file being requested, is very
 149 % specific to each system. Also, there may be no benefit in creating a
 150 % single large solution to integrate all these systems since, even
 151 % though they sometimes distribute nearly identical content, it is not
 152 % identical as it has been tailored to the system, which will change
 153 % the hash of the files and so will make it impossible to distribute
 154 % similar content across systems.
 155
 156 \section{Solution}
 157
 158 This situation presents a clear opportunity to use some form of
 159 peer-to-peer file-sharing protocol. The sparse interest in a large
 160 number of packages undergoing constant updating is well suited to
 161 the functionality provided by a Distributed Hash Table (DHT). DHTs
 162 require unique keys to store and retrieve strings of data, which the
 163 cryptographic hashes used by these package management systems are
 164 perfect for. The stored and retrieved strings can then be pointers
 165 to the peers that have the content package that hashes to that key.
 166 A downloading peer can lookup the package hash in the DHT and, if it
 167 is found, download the file from those peers and verify the content.
 168 If the package hash can not be found in the DHT, the peer will
 169 fallback to downloading from the original content location, and once
 170 complete will add a new entry to the DHT indicating that it now has
 171 the content. The servers or mirrors thus act as \emph{seeds} for the
 172 P2P system without any modification to them.
 173
 174 This desired P2P functionality could be integrated into the existing
 175 package management software in 2 ways. The functionality can be
 176 directly integrated into the software, though this could be
 177 difficult as the DHT should be running at all times, whereas the
 178 tools typically only run until the download request is complete.
 179 Alternatively, since many of the package management software
 180 implementations use HTTP requests to download the files, this makes
 181 it possible to implement the P2P aspect as a standard HTTP caching
 182 proxy, which will get uncached requests first from the P2P system,
 183 and then fallback to the normal HTTP request from a server. For
 184 methods that don't use HTTP requests, other types of proxies may
 185 also be possible.
 186
 187 Downloading a large file efficiently from a number of peers is where
 188 BitTorrent shines as a peer-to-peer application. Its method of
 189 breaking up larger files into pieces, each with its own hash, makes
 190 it very easy to parallelize the downloading process and maximize the
 191 download speed. Since the package management system only makes
 192 available a hash of the entire content, the piece hashes will have
 193 to be found through the P2P protocol. These hashes can be added to
 194 the DHT, or requested from other peers.
 195
 196 For files smaller than the piece size, no piece hashes are
 197 necessary. For medium size files of only a few pieces, the piece
 198 hashes are short enough to be stored in the DHT with the pointer to
 199 the peer to download from. For large files of 10's of pieces, the
 200 piece hashes are generally too long to store with the pointer to
 201 download peers. Instead, the peers will store the piece hashes for
 202 large files separately in the DHT using as a key the hash of the
 203 piece hashes, and include this key in their stored value for the
 204 hash of the file. Peers downloading these large files can then
 205 retrieve the piece hashes using a second DHT request. For very large
 206 files for which the piece hashes are too large to store at all in
 207 the DHT, a request to the peer for the hash (using the same method
 208 as file downloads) should return the piece hashes.
 209
 210 \section{Sample Implementation}
 211
 212 To demonstrate the ease and effectiveness of this solution we have
 213 created a sample implementation called \texttt{apt-p2p} which
 214 interacts with the \texttt{apt} tool found in most Debian-based
 215 Linux distributions. \texttt{apt} uses SHA1 hashes to verify most
 216 downloaded files, including the large index files that contain
 217 information such as the hashes of the individual packages. Since all
 218 requests from \texttt{apt} are in the form of HTTP downloads from a
 219 server, the implementation takes the form of a caching HTTP proxy.
 220
 221 The DHT used is based on Khashmir, which is an implementation of the
 222 Kademlia DHT \cite{kademlia} used by most of the existing BitTorrent
 223 clients to implement trackerless operation. It has been modified to
 224 better support the retrieval of multiple values per key, and to
 225 improve the response time so as to satisfy users' demands for speed.
 226 The values stored in the DHT are all bencoded dictionaries similar
 227 to torrent files. Peers store their download location (IP address
 228 and TCP port), as well as the piece strings (or hashes of piece
 229 strings) as described in the previous section. Downloading is
 230 accomplished by simple HTTP requests to the peers' identified to
 231 have the desired file. All peers support HTTP/1.1, both in the
 232 server and the client, which allows for pipelining of multiple
 233 requests to a peer, and the requesting of smaller pieces of a large
 234 file using the Range request header.
 235
 236 We have chosen a piece size of 512 kB, which means that 17,515
 237 (79\%) of the almost 23 thousand packages will not require piece
 238 information as they are smaller than a single piece. There are 3054
 239 packages (13\%) that will require 2 to 4 pieces, for which the piece
 240 hashes are stored directly with the peer download information in the
 241 DHT. There are 1667 packages (7\%) that will require 5 to 70 pieces,
 242 and so use a separate lookup in the DHT for the piece hashes.
 243 Finally, there are only 62 packages (0.3\%) that require more than
 244 70 pieces, and so for them a separate request to a peer for the
 245 piece hashes is needed.
 246
 247 We have deployed the DHT from this implementation on PlanetLab to
 248 test it's speed and effectiveness at delivering results. We find
 249 that, even on the overloaded PlanetLab network, the responses to
 250 lookups of keys take less than 10 seconds on average. Since each
 251 package download requires a lookup before it can proceed, it may
 252 take at least 10 seconds to process a request. However, due to
 253 \texttt{apt} pipelining up to 10 requests at a time to our proxy,
 254 the user will see an average response time (after a short startup)
 255 of under a second, which should be responsive enough to satisfy most
 256 users.
 257
 258 This sample implementation has also been made available to all
 259 Debian users as an installable software package. Once it becomes
 260 more widely used, we plan to collect more usage information on the
 261 system through crawls and data analysis.
 262
 263 \bibliographystyle{IEEEtran}
 264 \bibliography{./IEEEabrv,./all}
 265
 266 \end{document}