docs/abstract/abstract.tex

   1 \documentclass[conference]{IEEEtran}
   2
   3 \usepackage[noadjust]{cite}
   4
   5 \usepackage[dvips]{graphicx}
   6
   7 \usepackage{url}
   8 \usepackage[cmex10]{amsmath}
   9 \interdisplaylinepenalty=2500
  10 \usepackage{threeparttable}
  11 \usepackage{multirow}
  12
  13 \begin{document}
  14
  15 \title{Leveraging Altruistic Peers to Reduce the Bandwidth Costs of Free Content Downloads}
  16 \author{\IEEEauthorblockN{Cameron Dale}
  17 \IEEEauthorblockA{School of Computing Science (student)\\
  18 Simon Fraser University\\
  19 Burnaby, British Columbia, Canada\\
  20 Email: camerond@cs.sfu.ca}
  21 \and
  22 \IEEEauthorblockN{Jiangchuan Liu}
  23 \IEEEauthorblockA{School of Computing Science (faculty)\\
  24 Simon Fraser University\\
  25 Burnaby, British Columbia, Canada\\
  26 Email: jcliu@cs.sfu.ca}}
  27
  28 \maketitle
  29
  30 \section{Introduction}
  31 \label{intro}
  32
  33 There are a large number of free content distributors using custom
  34 package distribution systems over the Internet to distribute content
  35 to their users: most Linux distributions (e.g. Debian, Red Hat,
  36 Gentoo), Cygwin, CPAN, etc.
  37 These distributors have developed many different
  38 methods for this distribution, but they almost exclusively use a
  39 client-server model to satisfy user requests. The contents' size,
  40 popularity and number of users results in a large number of
  41 requests, which usually requires a complicated network of mirrors to handle. Due
  42 to the free nature of this content, many users are willing and able
  43 to contribute upload bandwidth to this distribution, but currently
  44 have no way to do this.
  45
  46 There are many peer-to-peer implementations available today, but
  47 none is very well suited to this specific problem. Many distributors
  48 make their content available in some form using BitTorrent \cite{COHEN03}, but it
  49 is not ideally suited to an application such as this one. BitTorrent
  50 is designed for large static collections of files, in which all
  51 peers are interested in most or all of the content. But the packages
  52 in this case are small, constantly being updated, and users are only
  53 interested in a small subset of them. Other proposed implementations
  54 suffer from some of the same, and other, problems and so none has
  55 been well-used to address this problem.
  56
  57 We present a new peer-to-peer distribution model to meet these
  58 demands. It is based on the lessons learned from many previous
  59 implementations of successful peer-to-peer protocols, such as
  60 Distributed Hash Tables and BitTorrent.
  61 The model relies on the pre-existence of cryptographic
  62 hashes of the content, which uniquely identify it for a request from
  63 other peers. If the peer-to-peer download fails, then the original
  64 request to the server is used as a fallback to prevent dissatisfied
  65 users. The peer can then share this new content with others through
  66 the P2P system. This system would efficiently satisfy the demands of
  67 a large number of users, and should significantly reduce the
  68 currently substantial bandwidth costs of hosting this content.
  69
  70 We also present a sample implementation based on the Debian package
  71 distribution system. This implementation is simple, makes use of
  72 mostly well-known technologies, and so serves as an example for
  73 other free content distributors of the opportunity that can be easily met
  74 with such a system.
  75
  76 \section{Problem}
  77
  78 There are a large number of groups using the Internet to distribute
  79 their free content. This content is made available from a free
  80 download site, which usually requires a network of mirrors to
  81 support the large number of requests. This content almost always
  82 supports some kind of file verification using cryptographic hashes
  83 to verify the download's accuracy or authenticity.
  84
  85 Most Linux distributions use a software package management system
  86 that fetches packages to be installed from a network of mirrors.
  87 Debian-based distributions uses the \texttt{apt} program, which
  88 downloads Debian packages from one of many mirrors. RPM-based
  89 distributions use \texttt{yum}, and Gentoo uses \texttt{portage},
  90 which both operate in a similar way. Other free software
  91 distributors also use a similar system: CPAN distributes files
  92 containing software packages for the PERL programming language,
  93 using SOAP RPC requests to find and download files; Cygwin provides
  94 many of the standard Unix/Linux tools in a Windows environment,
  95 using a package management tool that requests packages from
  96 websites; two software distribution systems exist for Mac OSX,
  97 fink and MacPorts, that also retrieve packages in this way. Also,
  98 some systems use direct web downloads, but with a hash verification
  99 file also available for download next to the desired file. These
 100 hash files usually have the same file name, but with an added
 101 extension identifying the hash used (e.g. \texttt{.md5} for the MD5
 102 hash). Finally, there are other programs that make use of
 103 cryptographic hashing to identify files: e.g. Git is a version
 104 control system in which all files and changes are identified by
 105 their SHA1 hash.
 106
 107 These systems all share some commonalities: the content is avaliable
 108 for anyone to freely download, the content is divided into distinct
 109 units (packages), users are typically not interested in downloading
 110 all of the content available, and hashes of the content are
 111 available before the download is attempted. We also expect that
 112 there are many users of these systems that are motivated by
 113 altruism to want to help out with this distribution. This is common
 114 in these systems due to the free nature of the content being
 115 delivered.
 116
 117 % Although at first it might seem that a single all-reaching solution
 118 % is possible for this problem, there are some differences in each
 119 % system that require independent solutions. The systems all use
 120 % different tools for their distribution, so any implementation would
 121 % have to be specific to the tool it is to integrate with. In
 122 % particular, how to obtain requests from the tool or user, and how to
 123 % find a hash that corresponds to the file being requested, is very
 124 % specific to each system. Also, there may be no benefit in creating a
 125 % single large solution to integrate all these systems since, even
 126 % though they sometimes distribute nearly identical content, it is not
 127 % identical as it has been tailored to the system, which will change
 128 % the hash of the files and so will make it impossible to distribute
 129 % similar content across systems.
 130
 131 \section{Solution}
 132
 133 This situation presents a clear opportunity to use a new
 134 peer-to-peer file sharing protocol.
 135 A downloading peer can lookup the package hash using the protocol and, if it
 136 is found, download the file from those peers and verify the content.
 137 If the package hash can not be found, the peer will
 138 fallback to downloading from the original content location, and once
 139 complete will announce to other peers indicating that it now has
 140 the content. The original servers or mirrors thus act as \emph{seeds} for the
 141 P2P system without any modification to them.
 142
 143 This functionality could be directly integrated into the
 144 package management software, although this would be
 145 difficult as the protocol should be running at all times, whereas the
 146 package software typically only runs until the download request is complete.
 147 Alternatively, since many of the package management software
 148 implementations use HTTP requests to download the files, it
 149 it possible to implement the P2P aspect as as HTTP caching
 150 proxy. The proxy will get uncached requests first from the P2P system,
 151 and then fallback to the normal HTTP request from a server. For
 152 methods that don't use HTTP requests, other types of proxies
 153 (possibly custom-made) may also be possible.
 154
 155 The sparse interest in a large
 156 number of packages undergoing constant updating is well suited to
 157 the functionality provided by a Distributed Hash Table (DHT). DHTs
 158 require unique keys to store and retrieve strings of data, which the
 159 cryptographic hashes used by the package management systems are
 160 perfect for. The stored and retrieved strings can then be pointers
 161 to the peers that have the content package that hashes to that key,
 162 as well as any other information needed by the protocol to facilitate
 163 the download.
 164
 165 To download larger files efficiently from a number of peers,
 166 the file needs to be broken up into pieces, each with its own hash.
 167 This method, inspired by BitTorrent, makes it very easy to
 168 parallelize the downloading process and maximize the download speed.
 169 Since the package management systems only make
 170 available a hash of the entire content, the piece hashes will have
 171 to be found through the P2P protocol.
 172
 173 For files smaller than the piece size, no piece hashes are
 174 necessary. For medium size files of only a few pieces, the piece
 175 hashes are short enough to be stored in the DHT with the pointer to
 176 the peer to download from. For large files of 10's of pieces, the
 177 piece hashes are generally too long to store with the pointer to
 178 download peers. Instead, the peers will store the piece hashes for
 179 large files separately in the DHT using as a key the hash of the
 180 piece hashes, and include this key in their stored value for the
 181 hash of the file. Peers downloading these large files can then
 182 retrieve the piece hashes using a second DHT request. For very large
 183 files for which the piece hashes are too large to store at all in
 184 the DHT, a request to the peer for the hash (using the same method
 185 as file downloads) should return the piece hashes.
 186
 187 \section{Sample Implementation}
 188
 189 To demonstrate the ease and effectiveness of this solution we have
 190 created a sample implementation called \texttt{apt-p2p} which
 191 interacts with the \texttt{apt} tool found in most Debian-based
 192 Linux distributions. \texttt{apt} uses SHA1 hashes to verify most
 193 downloaded files, including the large index files that contain
 194 information such as the hashes of the individual packages. Since all
 195 requests from \texttt{apt} are in the form of HTTP downloads from a
 196 server, the implementation takes the form of a caching HTTP proxy.
 197
 198 The DHT used is based on Khashmir, which is an implementation of
 199 Kademlia \cite{kademlia} used by most of the existing BitTorrent
 200 clients to implement trackerless operation. It has been modified to
 201 better support the retrieval of multiple values (peers) per key, and to
 202 improve the response time so as to satisfy users' demands for speed.
 203 The values stored in the DHT are all bencoded dictionaries similar
 204 to torrent files. Peers store their download location (IP address
 205 and TCP port), as well as the piece strings (or hashes of piece
 206 strings) as described in the previous section. Downloading is
 207 accomplished by simple HTTP requests for the hash of the file,
 208 which is sent to the peers identified from the DHT lookup to
 209 have the desired file. All peers support HTTP/1.1, both as
 210 servers and clients, which allows for pipelining of multiple
 211 requests to a peer, and the requesting of smaller pieces of a large
 212 file using the Range request header.
 213
 214 We have chosen a piece size of 512 kB, which means that 17,515
 215 (79\%) of the almost 23,000 Debian packages will not require piece
 216 information as they are smaller than a single piece. There are 3054
 217 packages (13\%) that will require 2 to 4 pieces, for which the piece
 218 hashes are stored directly with the peer download information in the
 219 DHT. There are 1667 packages (7\%) that will require 5 to 70 pieces,
 220 and so use a separate lookup in the DHT for the piece hashes.
 221 Finally, there are only 62 packages (0.3\%) that require more than
 222 70 pieces, and so need a separate request to a peer for the
 223 piece hashes.
 224
 225 We have deployed the DHT part of this implementation on PlanetLab to
 226 test it's speed and effectiveness at delivering results. We find
 227 that, even on the overloaded PlanetLab network, the responses to
 228 lookups of keys take less than 10 seconds on average. Since each
 229 package download requires a lookup before it can proceed, it may
 230 take at least 10 seconds to process a request. However, due to
 231 \texttt{apt} pipelining up to 10 requests at a time to our proxy,
 232 the user will see an average response time (after a short startup)
 233 of under a second, which should be responsive enough to satisfy most
 234 users.
 235
 236 This sample implementation has also been made available to all
 237 Debian users as an installable software package. Once it becomes
 238 more widely used, we plan to collect more usage information on the
 239 system through crawls and data analysis.
 240
 241 \bibliographystyle{IEEEtran}
 242 \bibliography{./IEEEabrv,./all}
 243
 244 \end{document}