docs/abstract/abstract.tex

   1 \documentclass{sig-alt-release2}
   2
   3 \usepackage[noadjust]{cite}
   4 \usepackage{url}
   5 \hyphenation{dis-tri-bu-ted}
   6
   7 \begin{document}
   8
   9 \conferenceinfo{SIGCOMM'08,} {August 17--22, 2008, Seattle, Washington, USA.}
  10 \CopyrightYear{2008}
  11 \crdata{978-1-60558-175-0/08/08}
  12
  13 \title{Leveraging Altruistic Peers to Reduce the Bandwidth Costs of Free Content Downloads}
  14 \numberofauthors{2}
  15 \author{
  16 \alignauthor Cameron Dale\\
  17         \affaddr{School of Computing Science (student)}\\
  18         \affaddr{Simon Fraser University}\\
  19         \affaddr{Burnaby, British Columbia, Canada}\\
  20         \email{Email: camerond@cs.sfu.ca}
  21 %\and
  22 \alignauthor Jiangchuan Liu\\
  23         \affaddr{School of Computing Science (faculty)}\\
  24         \affaddr{Simon Fraser University}\\
  25         \affaddr{Burnaby, British Columbia, Canada}\\
  26         \email{Email: jcliu@cs.sfu.ca}
  27 }
  28
  29 \maketitle
  30
  31 \begin{abstract}
  32 We introduce the opportunity many free content distributors have to
  33 reduce their bandwidth costs using a peer-to-peer download
  34 mechanism. Our model for this peer-to-peer distribution is simple to
  35 implement, backwards-compatible with the existing infrastructure,
  36 and can be deployed incrementally. It makes wide use of a customized
  37 distributed hash table (DHT) to facilitate the finding of peers and
  38 the efficient downloading of files. We also present the example
  39 solution we created for one of these distributors, the Debian
  40 project, which is already in use by some users.
  41 \end{abstract}
  42
  43 % A category with the (minimum) three required fields
  44 %\category{H.4}{Information Systems Applications}{Miscellaneous}
  45 %A category including the fourth, optional field follows...
  46 \category{C.2.4}{Computer-Communication Networks}{Distributed Systems}[Distributed applications]
  47 \terms{Design, Performance, Reliability}
  48 \keywords{BitTorrent, P2P}
  49
  50 \section{Introduction}
  51 \label{intro}
  52
  53 There are a large number of free content distributors using custom
  54 package distribution systems over the Internet to distribute content
  55 to their users: most Linux distributions (e.g. Debian, Red Hat,
  56 Gentoo), Cygwin, CPAN, etc.
  57 These distributors have developed many different
  58 methods for this distribution, but they almost exclusively use a
  59 client-server model to satisfy user requests. The contents' size,
  60 popularity and number of users results in a large number of
  61 requests, which usually requires a complicated network of mirrors to handle. Due
  62 to the free nature of this content, many users are willing and able
  63 to contribute upload bandwidth to this distribution, but currently
  64 have no way to do this.
  65
  66 There are many peer-to-peer implementations available today, but
  67 none is very well suited to this specific problem. Many distributors
  68 make their content available in some form using BitTorrent \cite{COHEN03}, but it
  69 is not ideally suited to an application such as this one. BitTorrent
  70 is designed for large static collections of files, in which all
  71 peers are interested in most or all of the content. But the packages
  72 in this case are small, constantly being updated, and users are only
  73 interested in a small subset of them. Other proposed implementations
  74 suffer from some of the same, and other, problems and so none has
  75 been well-used to address this problem.
  76
  77 We present a new peer-to-peer distribution model to meet these
  78 demands, based on other successful peer-to-peer protocols such as
  79 Distributed Hash Tables and BitTorrent.
  80 The model relies on the pre-existence of cryptographic
  81 hashes of the content, which uniquely identify it for a request from
  82 other peers. If the peer-to-peer download fails, then the original
  83 request to the server is used as a fallback to prevent dissatisfied
  84 users. The peer can then share this new content with others through
  85 the P2P system. This system would efficiently satisfy the demands of
  86 a large number of users, and should significantly reduce the
  87 currently substantial bandwidth costs of hosting this content.
  88
  89 We also present a sample implementation based on the Debian package
  90 distribution system. This implementation is simple, makes use of
  91 mostly well-known technologies, and so serves as an example for
  92 other free content distributors of the opportunity that can be easily met
  93 with such a system.
  94
  95 \section{Problem}
  96
  97 There are a large number of groups using the Internet to distribute
  98 their free content. This content is made available from a free
  99 download site, which usually requires a network of mirrors to
 100 support the large number of requests. This content almost always
 101 supports some kind of file verification using cryptographic hashes
 102 to verify the download's accuracy or authenticity.
 103
 104 Most Linux distributions use a software package management system
 105 that fetches packages to be installed from a network of mirrors.
 106 Debian-based distributions uses the \texttt{apt} program, which
 107 downloads Debian packages from one of many mirrors. RPM-based
 108 distributions use \texttt{yum}, and Gentoo uses \texttt{portage},
 109 which both operate in a similar way. Other free software
 110 distributors also use a similar system: CPAN distributes files
 111 containing software packages for the PERL programming language,
 112 using SOAP RPC requests to find and download files; Cygwin provides
 113 many of the standard Unix/Linux tools in a Windows environment,
 114 using a package management tool that requests \texttt{tar} files from
 115 websites; two software distribution systems exist for Mac OSX,
 116 fink and MacPorts, that also retrieve packages in this way. Also,
 117 some systems use direct web downloads, but with a hash verification
 118 file also available. These
 119 hash files usually have the same file name, but with an added
 120 extension identifying the hash used (e.g. \texttt{.md5} for the MD5
 121 hash). Finally, there are other programs that make use of
 122 cryptographic hashing to identify files: e.g. Git is a version
 123 control system in which all files and changes are identified by
 124 their SHA1 hash.
 125
 126 These systems all share some commonalities: the content is avaliable
 127 for anyone to freely download, the content is divided into distinct
 128 units (packages), users are typically not interested in downloading
 129 all of the content available, and hashes of the content are
 130 available before the download is attempted. We also expect that
 131 there are many users of these systems that are motivated by
 132 altruism to want to help out with this distribution. This is common
 133 in these systems due to the free nature of the content being
 134 delivered.
 135
 136 % Although at first it might seem that a single all-reaching solution
 137 % is possible for this problem, there are some differences in each
 138 % system that require independent solutions. The systems all use
 139 % different tools for their distribution, so any implementation would
 140 % have to be specific to the tool it is to integrate with. In
 141 % particular, how to obtain requests from the tool or user, and how to
 142 % find a hash that corresponds to the file being requested, is very
 143 % specific to each system. Also, there may be no benefit in creating a
 144 % single large solution to integrate all these systems since, even
 145 % though they sometimes distribute nearly identical content, it is not
 146 % identical as it has been tailored to the system, which will change
 147 % the hash of the files and so will make it impossible to distribute
 148 % similar content across systems.
 149
 150 \section{Solution}
 151
 152 This situation presents a clear opportunity to use a new
 153 peer-to-peer file sharing protocol.
 154 A downloading peer can lookup the package hash using the protocol and, if it
 155 is found, download the file from those peers and verify the content.
 156 If the package hash can not be found, the peer will
 157 fallback to downloading from the original content location, and once
 158 complete will announce to other peers indicating that it now has
 159 the content. The original servers or mirrors thus act as \emph{seeds} for the
 160 P2P system without any modification to them, and users are satisfied even when
 161 there are no peers which allows the system to be deployed incrementally.
 162
 163 This functionality could be directly integrated into the
 164 package management software, although this would be
 165 difficult as the protocol should be running at all times, whereas the
 166 package software typically only runs until the download request is complete.
 167 Alternatively, it
 168 it possible to implement the P2P aspect as a caching
 169 proxy. The proxy will get uncached requests first from the P2P system,
 170 and then fallback to the normal request from a server.
 171
 172 The sparse interest in a large
 173 number of packages undergoing constant updating is well suited to
 174 the functionality provided by a Distributed Hash Table (DHT). DHTs
 175 require unique keys to store and retrieve strings of data, which the
 176 cryptographic hashes used by the package management systems are
 177 perfect for. The stored and retrieved strings can then be pointers
 178 to the peers that have the content package that hashes to that key,
 179 as well as any other information needed by the protocol to facilitate
 180 the download.
 181
 182 To download larger files efficiently from a number of peers,
 183 the file needs to be broken up into pieces, each with its own hash.
 184 This method, inspired by BitTorrent, makes it very easy to
 185 parallelize the downloading process and maximize the download speed.
 186 Since the package management systems only make
 187 available a hash of the entire content, the piece hashes will have
 188 to be found through the P2P protocol.
 189
 190 For files smaller than the piece size, no piece hashes are
 191 necessary. For medium size files of only a few pieces, the piece
 192 hashes are short enough to be stored in the DHT with the pointer to
 193 the peer to download from. For large files of 10's of pieces, the
 194 piece hashes are generally too long to store with the pointer to
 195 download peers. Instead, the peers will store the piece hashes for
 196 large files separately in the DHT using as a key the hash of the
 197 piece hashes, and include this key in their stored value for the
 198 hash of the file. Peers downloading these large files can then
 199 retrieve the piece hashes using a second DHT request. For very large
 200 files for which the piece hashes are too large to store at all in
 201 the DHT, a request to the peer for the hash (using the same method
 202 as file downloads) should return the piece hashes.
 203
 204 \section{Sample Implementation}
 205
 206 To demonstrate the ease and effectiveness of this solution we have
 207 created a sample implementation called \texttt{apt-p2p} which
 208 interacts with the \texttt{apt} tool found in most Debian-based
 209 Linux distributions. \texttt{apt} uses SHA1 hashes to verify most
 210 downloaded files, including the large index files that contains detailed
 211 information such as the hashes of the individual packages. Since all
 212 requests from \texttt{apt} are in the form of HTTP downloads from a
 213 server, the implementation takes the form of a caching HTTP proxy.
 214
 215 The DHT used is based on Khashmir, which is an implementation of
 216 Kademlia \cite{kademlia} used by most of the existing BitTorrent
 217 clients to implement trackerless operation. It has been modified to
 218 better support the retrieval of multiple values (peers) per key, and to
 219 improve the response time so as to satisfy users' demands for speed.
 220 The values stored in the DHT are all bencoded dictionaries similar
 221 to torrent files. Peers store their download location (IP address
 222 and TCP port), as well as the piece strings (or hashes of piece
 223 strings) as described in the previous section. Downloading is
 224 accomplished by simple HTTP requests for the hash of the file,
 225 which is sent to the peers identified in a DHT lookup to
 226 have the desired file. All peers support HTTP/1.1, both as
 227 servers and clients, which allows for pipelining of multiple
 228 requests to a peer, and the requesting of smaller pieces of a large
 229 file using the Range request header.
 230
 231 We have chosen a piece size of 512 kB, which means that 17,515
 232 (79\%) of the almost 23,000 Debian packages will not require piece
 233 information as they are smaller than a single piece. There are 3054
 234 packages (13\%) that will require 2 to 4 pieces, for which the piece
 235 hashes are stored directly with the peer download information in the
 236 DHT. There are 1667 packages (7\%) that will require 5 to 70 pieces,
 237 and so use a separate lookup in the DHT for the piece hashes.
 238 Finally, there are only 62 packages (0.3\%) that require more than
 239 70 pieces, and so need a separate request to a peer for the
 240 piece hashes.
 241
 242 We have deployed the DHT part of this implementation on PlanetLab to
 243 test it's speed and effectiveness at delivering results. We find
 244 that, even on the overloaded PlanetLab network, the responses to
 245 lookups of keys take less than 10 seconds on average. Since each
 246 package download requires a lookup before it can proceed, it may
 247 take at least 10 seconds to process a request. However, due to
 248 \texttt{apt} pipelining up to 10 requests at a time to our proxy,
 249 the user will see an average response time (after a short startup)
 250 of under a second, which should be responsive enough to satisfy most
 251 users.
 252
 253 This sample implementation has also been made available to all
 254 Debian users as an installable software package. Once it becomes
 255 more widely used, we plan to collect more usage information on the
 256 system through crawls and data analysis.
 257
 258 \bibliographystyle{abbrv}
 259 \bibliography{./all}
 260
 261 \end{document}