docs/paper/paper.tex

   1 \documentclass[conference]{IEEEtran}\r
   2 \r
   3 %% INFOCOM addition:\r
   4 \makeatletter\r
   5 \def\ps@headings{%\r
   6 \def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%\r
   7 \def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%\r
   8 \def\@oddfoot{}%\r
   9 \def\@evenfoot{}}\r
  10 \makeatother\r
  11 \pagestyle{headings}\r
  12 \r
  13 \usepackage[noadjust]{cite}\r
  14 \r
  15 \usepackage[dvips]{graphicx}\r
  16 \r
  17 \usepackage{url}\r
  18 \r
  19 \begin{document}\r
  20 \r
  21 \title{\texttt{apt-p2p}: A Peer-to-Peer Distribution System for Software Package Releases and Updates}\r
  22 \author{\IEEEauthorblockN{Cameron Dale}\r
  23 \IEEEauthorblockA{School of Computing Science\\\r
  24 Simon Fraser University\\\r
  25 Burnaby, British Columbia, Canada\\\r
  26 Email: camerond@cs.sfu.ca}\r
  27 \and\r
  28 \IEEEauthorblockN{Jiangchuan Liu}\r
  29 \IEEEauthorblockA{School of Computing Science\\\r
  30 Simon Fraser University\\\r
  31 Burnaby, British Columbia, Canada\\\r
  32 Email: jcliu@cs.sfu.ca}}\r
  33 \r
  34 \maketitle\r
  35 \r
  36 \begin{abstract}\r
  37 A large amount of free software packages are available over the\r
  38 Internet from many different distributors. Most of these\r
  39 distributors use the traditional client-server model to handle\r
  40 requests from users. However, there is an excellent opportunity to\r
  41 use peer-to-peer techniques to reduce the cost of much of this\r
  42 distribution, especially due to the altruistic nature of many of\r
  43 these users. There are no existing solutions suitable for this\r
  44 situation, so we present a new technique for satisfying the needs of\r
  45 this P2P distribution which is generally applicable to many of\r
  46 these distributors' systems. Our method makes use of a DHT for\r
  47 storing the location of peers, using the cryptographic hash of the\r
  48 package as a key. To show the simplicity and functionality of our model, we\r
  49 implement a solution for the distribution of Debian software\r
  50 packages, including details on the DHT customizations and system optimizations needed. Finally, we\r
  51 analyze our system to determine how it is performing and what effect\r
  52 it is having.\r
  53 \end{abstract}\r
  54 \r
  55 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
  56 \r
  57 \section{Introduction}\r
  58 \label{intro}\r
  59 \r
  60 There are a large number of free software package distributors using\r
  61 package distribution systems over the Internet to distribute\r
  62 software to their users. These distributors have developed many\r
  63 different methods for this distribution, but they almost exclusively\r
  64 use a client-server model to satisfy user requests. The popularity\r
  65 and number of users results in a large number of requests, which\r
  66 usually requires a network of mirrors to handle. Due to the free\r
  67 nature of this software, many users are willing and able to\r
  68 contribute upload bandwidth to this distribution, but have no\r
  69 current way to do this.\r
  70 \r
  71 We present a new peer-to-peer distribution model to meet these\r
  72 demands. It is based on many previous implementations of successful\r
  73 peer-to-peer protocols, especially distributed hash tables (DHT) and\r
  74 BitTorrent. The model relies on the pre-existence of cryptographic\r
  75 hashes of the packages, which uniquely identify it for a\r
  76 request from other peers. If the peer-to-peer download fails, then\r
  77 the original request to the server is used as a fallback to prevent\r
  78 any dissatisfaction from users. The peer can then share this new\r
  79 package with others through the P2P system.\r
  80 \r
  81 First, we examine the opportunity that is available for many of\r
  82 these free software package distributors. We present an overview of\r
  83 a system that would efficiently satisfy the demands of a large\r
  84 number of users, and should significantly reduce the currently\r
  85 substantial bandwidth requirements of hosting this software. We then\r
  86 present an example implementation based on the Debian package\r
  87 distribution system. This implementation will be used by a large\r
  88 number of users, and serves as an example for other free software\r
  89 distributors of the opportunity that can be met with such a system.\r
  90 Finally, we analyze our currently deployed implementation to\r
  91 determine how effective it is at meeting our goals, and to see what\r
  92 effect it is having on the Debian package distribution system.\r
  93 \r
  94 The rest of this paper is organized as follows. The background and motivation are presented in Section~\ref{situation}, including an analysis of BitTorrent's use for this purpose in Section~\ref{bittorrent}. We propose\r
  95 our solution in Section~\ref{opportunity}. We then detail our sample\r
  96 implementation for Debian-based distributions in Section~\ref{implementation},\r
  97 including an in-depth look at our DHT\r
  98 customizations in Section~\ref{custom_dht}. The performance of our implementation is evaluated in Section~\ref{analysis}. We examine some related work in Section~\ref{related}, and then\r
  99 Section~\ref{conclusions} concludes the paper and offers some future directions.\r
 100 \r
 101 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 102 \r
 103 \section{Background and Motivations}\r
 104 \label{situation}\r
 105 \r
 106 In the free software society, there are a large number of groups using the Internet to \r
 107 collaboratively develop and release their software. The ever increasing power of\r
 108 modern programming languages and operating systems has made this software, like commercial software, extremely large and complex, though it is often\r
 109 distributed in many small units (packages). Together with their popularity among users, \r
 110 an efficient and reliable management and distribution of these packages over the Internet has become a daunting task. In this section, we offer concrete examples illustrating the \r
 111 unique challenges in this context. \r
 112 \r
 113 \subsection{Free Software Package Distributors}\r
 114 \label{examples}\r
 115 \r
 116 Most Linux distributions use a software package management system\r
 117 that fetches packages to be installed from an archive of packages\r
 118 hosted on a network of mirrors. The Debian project, and other\r
 119 Debian-based distributions such as Ubuntu and Knoppix, use the\r
 120 \texttt{apt} (Advanced Package Tool) program, which downloads Debian\r
 121 packages in the \texttt{.deb} format from one of many HTTP mirrors.\r
 122 The program will first download index files that contain a listing\r
 123 of which packages are available, as well as important information\r
 124 such as their size, location, and a hash of their content. The user\r
 125 can then select which packages to install or upgrade, and\r
 126 \texttt{apt} will download and verify them before installing them.\r
 127 \r
 128 There are also several similar frontends for the RPM-based\r
 129 distributions. Red Hat's Fedora project uses the \texttt{yum}\r
 130 program, SUSE uses \texttt{YAST}, while Mandriva has\r
 131 \texttt{Rpmdrake}, all of which are used to obtain RPMs from\r
 132 mirrors. Other distributions use tarballs (\texttt{.tar.gz} or\r
 133 \texttt{.tar.bz2}) to contain their packages. Gentoo's package\r
 134 manager is called \texttt{portage}, SlackWare Linux uses\r
 135 \texttt{pkgtools}, and FreeBSD has a suite of command-line tools,\r
 136 all of which download these tarballs from web servers.\r
 137 \r
 138 Similar tools have been used for other software packages. CPAN\r
 139 distributes software packages for the PERL\r
 140 programming language, using SOAP RPC requests to find and download\r
 141 files. Cygwin provides many of the\r
 142 standard Unix/Linux tools in a Windows environment, using a\r
 143 package management tool that requests packages from websites. There\r
 144 are two software distribution systems for software that runs on the Macintosh OS, fink and\r
 145 MacPorts, that also retrieve packages in this way.\r
 146 \r
 147 Direct web downloading is also a common way, often coupled with a hash\r
 148 verification file to be downloaded next to the desired\r
 149 file. The hash file usually has the same file name, but with an\r
 150 added extension identifying the hash used (e.g. \texttt{.md5} for\r
 151 the MD5 hash). This type of file downloading and verification is\r
 152 typical of free software hosting facilities that are open to anyone\r
 153 to use, such as SourceForge.\r
 154 \r
 155 \r
 156 Given the free nature of this software, there are often a number of users \r
 157 motivated by altruism to want to help out with the distribution.\r
 158 This is particularly true considering that many of this software is used by\r
 159 groups that are staffed mostly, or sometimes completely, by\r
 160 volunteers. They are thus motivated to contribute their network resources, so as to promote the healthy development \r
 161 of the volunteer community that released the software.\r
 162 We also naturally expect that peer-to-peer distribution can be implementation in \r
 163 this context, which scale well with large user bases and can easily explore the network resources made available by\r
 164 the volunteers. \r
 165 \r
 166 \r
 167 \r
 168 \subsection{Unique Characteristics}\r
 169 \label{problems}\r
 170 \r
 171 While it seems straightforward to use an existing peer-to-peer file sharing tool like BitTorrent for\r
 172 free software package distribution, there are indeed a series of new challenges in this unique scenario:\r
 173 \r
 174 \subsubsection{Archive Dimensions}\r
 175 \r
 176 While most of the packages of a software release are very small in\r
 177 size, there are some that are quite large. There are too many\r
 178 packages to distribute each individually, but the archive is also\r
 179 too large to distribute in its entirety. In some archives there are\r
 180 also divisions of the archive into sections, e.g. by the operating system (OS) or computer\r
 181 architecture that the package is intended for.\r
 182 \r
 183 \begin{figure}\r
 184 \centering\r
 185 \includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps}\r
 186 \caption{The CDF of the size of packages in a Debian system, both\r
 187 for the actual size and adjusted size based on the popularity of\r
 188 the package.}\r
 189 \label{size_CDF}\r
 190 \end{figure}\r
 191 \r
 192 For example, Figure~\ref{size_CDF} shows the size of the packages in the\r
 193 current Debian distribution. While 80\% of the packages are less than\r
 194 512~KB, some of the packages are hundreds of megabytes. The entire\r
 195 archive consists of 22,298 packages and is approximately 119,000 MB\r
 196 in size. Most of the packages are to be installed in any computer environment, but there are \r
 197 also OS- or architecture-specific packages.\r
 198 \r
 199 \subsubsection{Package Updates}\r
 200 \r
 201 The software packages being distributed are being constantly\r
 202 updated. These updates could be the result of the software creators\r
 203 releasing a new version with improved functionality,\r
 204 or the distributor updating their packaging of the\r
 205 software to meet new requirements. Even if the distributor\r
 206 periodically makes \emph{stable} releases, which are snapshots of\r
 207 all the packages in the archive at a certain time, updates are still\r
 208 released for security issues or serious bugs.\r
 209 \r
 210 \begin{figure}\r
 211 \centering\r
 212 \includegraphics[width=\columnwidth]{size-quarter.eps}\r
 213 \caption{The amount of data in the 119,000 MB Debian archive that is\r
 214 updated each day, broken down by architecture.}\r
 215 \label{update_size}\r
 216 \end{figure}\r
 217 \r
 218 For example, Figure~\ref{update_size} shows the amount of data in\r
 219 the Debian archive that was updated each day over a period of 3\r
 220 months. Every single day, approximately 1.5\% of the 119,000 MB archive is\r
 221 updated with new versions of packages. This frequency is much higher than\r
 222 that of most commercial software, mainly because much of free software is\r
 223 developed in a loosely managed environment of developers working\r
 224 asynchronously on a worldwide scale.\r
 225 \r
 226 \subsubsection{Limited Interest}\r
 227 \r
 228 Finally, though there are a large number of packages and a large number of\r
 229 users, the interest in a particular version of a package can be very \r
 230 limited. Specifically, there are core packages that every user has to download, but most\r
 231 packages would fall in the category of optional or extra, and so are\r
 232 interesting to only a limited number of people.\r
 233 \r
 234 \begin{figure}\r
 235 \centering\r
 236 \includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}\r
 237 \caption{The CDF of the popularity of packages in a Debian system.}\r
 238 \label{popularity_CDF}\r
 239 \end{figure}\r
 240 \r
 241 For example, the Debian distribution tracks the popularity of its\r
 242 packages using popcon \cite{popcon}. Figure~\ref{popularity_CDF}\r
 243 shows the cumulative distribution function of the percentage of all\r
 244 users who install each package. Though some packages are installed\r
 245 by everyone, 80\% of the packages are installed by less than 1\% of\r
 246 users.\r
 247 \r
 248 \subsubsection{Interactive Users}\r
 249 \r
 250 The package management software that downloads packages displays\r
 251 some kind of indication of speed and completeness for users to\r
 252 watch. Since previous client-server downloads occurred in a sequential\r
 253 fashion, the package management software also measures the speed\r
 254 based on sequential downloading. This requires the peer-to-peer\r
 255 solution to be reasonably responsive at retrieving packages\r
 256 sequentially.\r
 257 \r
 258 \subsection{Why BitTorrent Doesn't Work Well}\r
 259 \label{bittorrent}\r
 260 \r
 261 Many distributors make their software available using\r
 262 BitTorrent \cite{COHEN03}, in particular for the distribution of CD\r
 263 images. This straightforward use however can be very ineffective, as it requires the\r
 264 peers to download large numbers of packages that they are not\r
 265 interested in, and prevents them from updating to new packages\r
 266 without downloading another image containing a lot of the same\r
 267 packages they already have. \r
 268 \r
 269 An alternative is to create torrents tracking smaller groups of packages. Unfortunately, we find that this enhancement can be\r
 270 quite difficult given the unique characteristic of free software packages. \r
 271 First, there is no obvious way to divide the packages into torrents.\r
 272 Most of the packages are too small, and there are too many packages\r
 273 in the entire archive to create individual torrents for each one.\r
 274 On the other hand, all the packages together are too large to track\r
 275 efficiently as a single torrent. Hence, some division of the archive's\r
 276 packages into torrents is obviously necessary, but wherever that\r
 277 split occurs it will cause either some duplication of connections,\r
 278 or prevent some peers from connecting to others who do have the same\r
 279 content. In addition, a small number of the packages can be updated every\r
 280 day which would add new files to the torrent, thereby changing its\r
 281 \emph{infohash} identifier and making it a new torrent. This will\r
 282 severely fracture the download population, even though peers in the\r
 283 new torrent may share 99\% of the packages in common with peers in the\r
 284 old torrent.\r
 285 \r
 286 Other issues also prevent BitTorrent from being a good solution to\r
 287 this problem. In particular, BitTorrent's fixed piece sizes (usually 512 KB) that disregard file\r
 288 boundaries are bigger than many of the packages in the archive. This\r
 289 will waste peers' downloading bandwidth as they will end up\r
 290 downloading parts of other packages just to get the piece that\r
 291 contains the package they do want. \r
 292 Finally, note that BitTorrent downloads files\r
 293 randomly, which does not work well with the interactive package\r
 294 management tools expectation of sequential downloads.\r
 295 \r
 296 On the other hand, there are aspects of BitTorrent that are no\r
 297 longer needed in a system such as this one. With altruistic peers\r
 298 and all files being available without uploading, incentives to share\r
 299 become a less important issue. Also, the availability of seeds are\r
 300 not critical either, as the mirror sites serve in that capacity\r
 301 already.\r
 302 \r
 303 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 304 \r
 305 \section{Peer-to-Peer Assisted Distributor: An Overview}\r
 306 \label{opportunity}\r
 307 \r
 308 We assume that the user is attempting to download packages from a\r
 309 server, using package management software to find and retrieve the\r
 310 packages. Further, we assume that the requests from the user to the\r
 311 server are able to be proxied by our peer-to-peer program, and that\r
 312 the server is always available and has all of the package files.\r
 313 Finally, the cryptographic hash of the packages must be available\r
 314 separately from the package itself, and is usually contained in an\r
 315 \emph{index} file which also contains all the packages' names,\r
 316 locations and sizes.\r
 317 \r
 318 \begin{figure}\r
 319 \centering\r
 320 \includegraphics[width=\columnwidth]{model_simple.eps}\r
 321 \caption{The different phases of functionality of our peer-to-peer distribution model.}\r
 322 \label{model}\r
 323 \end{figure}\r
 324 \r
 325 Our model for using P2P to enhance such a system is shown in\r
 326 Figure~\ref{model}. As shown in Phase~1, our program will act as a\r
 327 proxy (1,2), downloading (3) and caching all files communicated\r
 328 between the user and the server (4). It will therefore also have\r
 329 available the index files containing the cryptographic hashes all\r
 330 packages. Later, in Phase~2, upon receiving a request from the user\r
 331 to download a package (5), our program will search the index files\r
 332 for the package being requested and find its hash (6). This hash can\r
 333 then be looked up recursively in the DHT (7), which will return a\r
 334 list of peers that have the package already (8). The package can\r
 335 then be downloaded from the peers (11,12), it can be verified using\r
 336 the hash (13), and if valid can be returned to the user (14). The\r
 337 current nodes location is also added to the DHT for that hash (15),\r
 338 as it is now a source for others to download from.\r
 339 \r
 340 In steps (11,12), the fact that this package is also available to download for free\r
 341 from a server is very important to our proposed model. If the package hash\r
 342 can not be found in the DHT, the peer can then fallback to\r
 343 downloading from the original location (i.e. the network of\r
 344 mirrors). The mirrors thus, with no modification to their\r
 345 functionality, serve as seeds for the packages in the peer-to-peer\r
 346 system. Any packages that have just been updated, or that are very\r
 347 rare, and so don't have any peers available, can always be found on\r
 348 the mirror. Once the peer has completed the download from the mirror\r
 349 and verified the package, it can then add itself to the DHT as the\r
 350 first peer for the new package, so that future requests for the package\r
 351 will not need the mirror.\r
 352 \r
 353 This sparse\r
 354 interest in a large number of packages undergoing constant updating\r
 355 is well suited to the functionality provided by a Distributed Hash\r
 356 Table (DHT). DHTs require unique keys to store and retrieve strings\r
 357 of data, for which the cryptographic hashes used by these package\r
 358 management systems are perfect for. The stored and retrieved strings\r
 359 can then be pointers to the peers that have the package that hashes\r
 360 to that key. A downloading peer can lookup the package hash in the\r
 361 DHT and, if it is found, download the file from those peers and\r
 362 verify the package with the hash. Once the download is complete, the\r
 363 peer will add its entry to the DHT indicating that it now has the\r
 364 package.\r
 365 \r
 366 Note that, despite downloading the package from untrustworthy peers,\r
 367 the trust of the package is always guaranteed through the use\r
 368 of the cryptographic hashes. Nothing can be downloaded from a peer\r
 369 until the hash is looked up in the DHT, so a hash must first come\r
 370 from a trusted source (i.e. the distributors server). Most distributors use index\r
 371 files that contain hashes for a large number of the packages in\r
 372 their archive, and which are also hashed. After retrieving the\r
 373 index's hash from the mirror, the index file can be downloaded from\r
 374 peers and verified. Then the program has access to all the hashes of\r
 375 the packages it will be downloading, all of which can be verified\r
 376 with a \emph{chain of trust} that stretches back to the original\r
 377 distributor's server.\r
 378 \r
 379 \subsection{Implementation Options}\r
 380 \label{imp_options}\r
 381 \r
 382 There are several ways to implement the desired P2P functionality\r
 383 into the existing package management software. The functionality can\r
 384 be directly integrated through modifications to the software, though\r
 385 this could be difficult as the P2P functionality should be running\r
 386 at all times. This is needed both for efficient lookups and to\r
 387 support uploading of already downloaded packages. Unfortunately, the\r
 388 package management tools typically only run until the download and\r
 389 install request is complete.\r
 390 \r
 391 Many of the package management software implementations use HTTP\r
 392 requests from web servers to download the packages, which makes it\r
 393 possible to implement the P2P aspect as an almost standard HTTP\r
 394 caching proxy. This proxy will run as a daemon in the background,\r
 395 listening for requests from the package management tool for package\r
 396 files, as well as serving (uploading) cached package to other peers.\r
 397 It will get uncached requests first from the P2P system, or\r
 398 fall back to the normal HTTP request from a server should it not\r
 399 be found. For methods that don't use HTTP requests, other types of\r
 400 proxies may also be possible.\r
 401 \r
 402 \subsection{Downloading From Peers}\r
 403 \label{downloading}\r
 404 \r
 405 Although not necessary, we recommend implementing a download\r
 406 protocol that is similar to the protocol used to fetch packages from\r
 407 the distributor's servers. This simplifies the P2P program, as it\r
 408 can then treat peers and mirrors almost identically when requesting\r
 409 packages. In fact, the mirrors can be used when there are only a few\r
 410 slow peers available for a file to help speed up the download\r
 411 process.\r
 412 \r
 413 Downloading a file efficiently from a number of peers is where\r
 414 BitTorrent shines as a peer-to-peer application. Its method of\r
 415 breaking up larger files into pieces, each with its own hash,\r
 416 makes it very easy to parallelize the downloading process and\r
 417 maximize the download speed. For very small packages (i.e. less than\r
 418 the piece size), this parallel downloading is not necessary, or\r
 419 even desirable. However, this method should still be used, in\r
 420 conjunction with the DHT, for the larger packages that are\r
 421 available.\r
 422 \r
 423 Since the package management system only stores a hash of the entire\r
 424 package, and not of pieces of that package, we will need to be able\r
 425 to store and retrieve these piece hashes using the P2P protocol. In\r
 426 addition to storing the file download location in the DHT (which\r
 427 would still be used for small files), a peer will store a\r
 428 \emph{torrent string} containing the peer's hashes of the pieces of\r
 429 the larger files. These piece hashes are compared ahead of time\r
 430 to determine which peers have the same piece hashes (they all\r
 431 should), and then used during the download to verify the pieces of\r
 432 the downloaded package.\r
 433 \r
 434 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 435 \r
 436 \section{\texttt{apt-p2p}: A Practical Implementation}\r
 437 \label{implementation}\r
 438 \r
 439 We have created a sample implementation that functions as described\r
 440 in section \ref{opportunity}, and is freely available for other\r
 441 distributors to download and modify \cite{apt-p2p}. This software,\r
 442 called \texttt{apt-p2p}, interacts with the \texttt{apt} tool which\r
 443 is found in most Debian-based Linux distributions. \texttt{apt} uses\r
 444 SHA1 hashes to verify most downloaded files, including the large\r
 445 index files that contain the hashes of the individual packages. We\r
 446 chose this distribution system as it is familiar to us, it allows\r
 447 software contributions, and there are interesting statistics\r
 448 available for analyzing the popularity of the software packages\r
 449 \cite{popcon}.\r
 450 \r
 451 Since all requests from \texttt{apt} are in the form of HTTP downloads from a\r
 452 server, the implementation takes the form of a caching HTTP proxy.\r
 453 Making a standard \texttt{apt} implementation use the proxy is then\r
 454 as simple as prepending the proxy location and port to the front of\r
 455 the mirror name in \texttt{apt}'s configuration file (i.e.\r
 456 ``http://localhost:9977/mirrorname.debian.org/\ldots'').\r
 457 \r
 458 We created a customized DHT based on Khashmir \cite{khashmir}, which\r
 459 is an implementation of Kademlia \cite{kademlia} using methods\r
 460 familiar to BitTorrent developers. Khashmir is also the same DHT\r
 461 implementation used by most of the existing BitTorrent clients to\r
 462 implement trackerless operation. The communication is all handled by\r
 463 UDP messages, and RPC (remote procedure call) requests and responses\r
 464 are all \emph{bencoded} in the same way as BitTorrent's\r
 465 \texttt{.torrent} files. Khashmir uses the high-level Twisted\r
 466 event-driven networking engine \cite{twisted}, so we also use\r
 467 Twisted in our sample implementation for all other networking needs.\r
 468 More details of this customized DHT can be found below in\r
 469 Section~\ref{custom_dht}.\r
 470 \r
 471 Downloading is accomplished by sending simple HTTP requests to the\r
 472 peers identified by lookups in the DHT to have the desired file.\r
 473 Requests for a package are made using the package's hash, properly\r
 474 encoded, as the URL to request from the peer. The HTTP server used\r
 475 for the proxy also doubles as the server listening for requests for\r
 476 downloads from other peers. All peers support HTTP/1.1, both in the\r
 477 server and the client, which allows for pipelining of multiple\r
 478 requests to a peer, and the requesting of smaller pieces of a large\r
 479 file using the HTTP Range request header.\r
 480 \r
 481 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 482 \r
 483 \section{Customized DHT / System Optimization}\r
 484 \label{custom_dht}\r
 485 \r
 486 A large contribution of our work is in the customization and use of\r
 487 a Distributed Hash Table (DHT). Although our DHT is based on\r
 488 Kademlia, we have made many improvements to it to make it suitable\r
 489 for this application. In addition to a novel storage technique to\r
 490 support piece hashes, we have improved the response time of looking\r
 491 up queries, allowed the storage of multiple values for each key, and\r
 492 incorporated some improvements from BitTorrent's tracker-less DHT\r
 493 implementation.\r
 494 \r
 495 \subsection{DHT Background}\r
 496 \label{dht}\r
 497 \r
 498 DHT's operate by storing (\emph{key}, \emph{value}) pairs in a\r
 499 distributed fashion such that no node will, on average, store more\r
 500 than any other node. They support two primitive operations:\r
 501 \emph{put}, which takes a key and a value and stores it in the DHT;\r
 502 and \emph{get}, which takes a key and returns a value (or values)\r
 503 that was previously stored with that key. These operations are\r
 504 recursive, as each node does not know about all the other nodes in a\r
 505 DHT, and so must recursively search for the correct node to put to\r
 506 or get from.\r
 507 \r
 508 The Kademlia DHT, like most other DHTs, assigns IDs to peers from\r
 509 the same space that is used for keys. The peers with IDs closest to\r
 510 the desired key will then store the values for that key. Lookups are\r
 511 recursive, as nodes are queried in each step that are exponentially\r
 512 closer to the key than in the previous step.\r
 513 \r
 514 Nodes in a Kademlia system support four primitive requests.\r
 515 \texttt{ping} will cause a peer to return nothing, and is only used\r
 516 to determine if a node is still alive, while \texttt{store} tells a\r
 517 node to store a value associated with a given key. The most\r
 518 important primitives are \texttt{find\_node} and\r
 519 \texttt{find\_value}, which both function recursively to find nodes\r
 520 close to a key. The queried nodes will return a list of the nodes\r
 521 they know about that are closest to the key, allowing the querying\r
 522 node to quickly traverse the DHT to find the nodes close to the\r
 523 desired key. The only difference between \texttt{find\_node} and \texttt{find\_value} is that the\r
 524 \texttt{find\_value} query will cause a node to return a value, if\r
 525 it has one for that key, instead of a list of nodes.\r
 526 \r
 527 \subsection{Piece Hash Storage}\r
 528 \label{pieces}\r
 529 \r
 530 Hashes of pieces of the larger package files are needed to support\r
 531 their efficient downloading from multiple peers.\r
 532 For large files (5 or more pieces), the torrent strings described in\r
 533 Section~\ref{downloading}\r
 534 are too long to store with the peer's download info in the DHT. This\r
 535 is due to the limitation that a single UDP packet should be less\r
 536 than 1472 bytes to avoid fragmentation.\r
 537 % For example, for a file of 4 pieces, the peer would need to store a\r
 538 % value 120 bytes in size (IP address, port, four 20-byte pieces and\r
 539 % some overhead), which would limit a return value from including more\r
 540 % than 10 values to a request.\r
 541 \r
 542 Instead, the peers will store the torrent string for large files\r
 543 separately in the DHT, and only contain a reference to it in their\r
 544 stored value for the hash of the file. The reference is an SHA1 hash\r
 545 of the entire concatenated length of the torrent string. If the\r
 546 torrent string is short enough to store in the DHT (i.e. less than\r
 547 1472 bytes, or about 70 pieces for the SHA1 hash), then a lookup of\r
 548 that hash in the DHT will return the torrent string. Otherwise, a\r
 549 request to the peer for the hash (using the same method as file\r
 550 downloads, i.e. HTTP), will cause the peer to return the torrent\r
 551 string.\r
 552 \r
 553 Figure \ref{size_CDF} shows the size of the 22,298 packages\r
 554 available in Debian in January 2008. We can see that most of the\r
 555 packages are quite small, and so most will therefore not require\r
 556 piece hash information to download. We have chosen a piece\r
 557 size of 512 kB, which means that 17,515 (78\%) of the packages will\r
 558 not require this information. There are 3054 packages that will\r
 559 require 2 to 4 pieces, for which the torrent string can be stored\r
 560 directly with the package hash in the DHT. There are 1667 packages\r
 561 that will require a separate lookup in the DHT for the longer\r
 562 torrent string as they require 5 to 70 pieces. Finally, there are\r
 563 only 62 packages that require more than 70 pieces, and so will\r
 564 require a separate request to a peer for the torrent string.\r
 565 \r
 566 \subsection{Response Time}\r
 567 \label{response_time}\r
 568 \r
 569 Many of our customizations to the DHT have been to try and improve\r
 570 the time of the recursive \texttt{find\_value} requests, as this can\r
 571 cause long delays for the user waiting for a package download to\r
 572 begin. The one problem that slows down such requests is waiting for\r
 573 timeouts to occur before marking the node as failed and moving on.\r
 574 \r
 575 Our first improvement is to retransmit a request multiple times\r
 576 before a timeout occurs, in case the original request or its\r
 577 response was lost by the unreliable UDP protocol. If it does not\r
 578 receive a response, the requesting node will retransmit the request\r
 579 after a short delay. This delay will increase exponentially for\r
 580 later retransmissions, should the request again fail. Our current\r
 581 implementation will retransmit the request after 2 seconds and 6\r
 582 seconds (4 seconds after the first retransmission), and then timeout\r
 583 after 9 seconds.\r
 584 \r
 585 We have also added some improvements to the recursive\r
 586 \texttt{find\_node} and \texttt{find\_value} queries to speed up the\r
 587 process when nodes fail. If enough nodes have responded to the\r
 588 current query such that there are many new nodes to query that are\r
 589 closer to the desired key, then a stalled request to a node further\r
 590 away will be dropped in favor of a new request to a closer node.\r
 591 This has the effect of leap-frogging unresponsive nodes and\r
 592 focussing attention on the nodes that do respond. We will also\r
 593 prematurely abort a query while there are still oustanding requests,\r
 594 if enough of the closest nodes have responded and there are no\r
 595 closer nodes found. This prevents a far away unresponsive node from\r
 596 making the query's completion wait for it to timeout.\r
 597 \r
 598 Finally, we made all attempts possible to prevent firewalled and\r
 599 NATted nodes from being added to the routing table for future\r
 600 requests. Only a node that has responded to a request from us will\r
 601 be added to the table. If a node has only sent us a request, we\r
 602 attempt to send a \texttt{ping} to the node to determine if it is\r
 603 NATted or not. Unfortunately, due to the delays used by NATs in\r
 604 allowing UDP packets for a short time if one was recently sent by\r
 605 the NATted host, the ping is likely to succeed even if the node is\r
 606 NATted. We therefore also schedule a future ping to the node to make\r
 607 sure it is still reachable after the NATs delay has hopefully\r
 608 elapsed. We also schedule future pings of nodes that fail once to\r
 609 respond to a request, as it takes multiple failures (currently 3)\r
 610 before a node is removed from the routing table.\r
 611 \r
 612 \begin{figure}\r
 613 \centering\r
 614 \includegraphics[width=\columnwidth]{apt_p2p_improvements-find_value.eps}\r
 615 \caption{The distribution of average response times PlanetLab nodes\r
 616 experience for \texttt{find\_value} queries. The original DHT\r
 617 implementation results are shown, as well as the successive\r
 618 improvements that we made to reduce the response time.}\r
 619 \label{improvements}\r
 620 \end{figure}\r
 621 \r
 622 To test our changes during development, we ran the customized DHT\r
 623 for several hours after each major change on over 300 PlanetLab nodes\r
 624 \cite{planetlab}. Though the nodes are not expected to be firewalled\r
 625 or NATted, some can be quite overloaded and so consistently fail to\r
 626 respond within a timeout period, similar to NATted peers. The\r
 627 resulting distribution of the nodes' average response times is shown\r
 628 in Figure~\ref{improvements}. Each improvement successfully reduced\r
 629 the response time, for a total reduction of more than half. The\r
 630 final distribution is also narrower, as the improvements make the\r
 631 system more predictable. However, there are still a large number of\r
 632 outliers with higher average response times, which are the\r
 633 overloaded nodes on PlanetLab. This was confirmed by examining the\r
 634 average time it took for a timeout to occur, which should be\r
 635 constant as it is a configuration option, but can be much larger if\r
 636 the node is too overloaded for the program to be able to check for a\r
 637 timeout very often.\r
 638 \r
 639 \subsection{Multiple Values}\r
 640 \label{multiple_values}\r
 641 \r
 642 The original design of Kademlia specified that each key would have\r
 643 only a single value associated with it. The RPC to find this value\r
 644 was called \texttt{find\_value} and worked similarly to\r
 645 \texttt{find\_node}, iteratively finding nodes with ID's closer to\r
 646 the desired key. However, if a node had a value stored associated\r
 647 with the searched for key, it would respond to the request with that\r
 648 value instead of the list of nodes it knows about that are closer.\r
 649 \r
 650 While this works well for single values, it can cause a problem when\r
 651 there are multiple values. If the responding node is no longer one\r
 652 of the closest to the key being searched for, then the values it is\r
 653 returning will probably be the staler ones in the system, as it\r
 654 will not have the latest stored values. However, the search for\r
 655 closer nodes will stop here, as the queried node only returned\r
 656 values and not a list of nodes to recursively query. We could have\r
 657 the request return both the values and the list of nodes, but that\r
 658 would severely limit the size and number of the values that could be\r
 659 returned in a single UDP packet.\r
 660 \r
 661 Instead, we have broken up the original \texttt{find\_value}\r
 662 operation into two parts. The new \texttt{find\_value} request\r
 663 always returns a list of nodes that the node believes are closest to\r
 664 the key, as well as a number indicating the number of values that\r
 665 this node has for the key. Once a querying node has finished its\r
 666 search for nodes and found the closest ones to the key, it can issue\r
 667 \texttt{get\_value} requests to some nodes to actually retrieve the\r
 668 values they have. This allows for much more control of when and how\r
 669 many nodes to query for values. For example, a querying node could\r
 670 abort the search once it has found enough values in some nodes, or\r
 671 it could choose to only request values from the nodes that are\r
 672 closest to the key being searched for.\r
 673 \r
 674 \subsection{\textbf{OPTIONAL}: BitTorrent's Improvements}\r
 675 \label{bittorrent_dht}\r
 676 \r
 677 In the several years that some BitTorrent clients have been using a\r
 678 Kademlia-based DHT for tracker-less operation, the developers have\r
 679 made many enhancements which we can take advantage of. One of the\r
 680 most important is a security feature added to stop malicious nodes\r
 681 from subscribing other nodes as downloaders. When a node issues a\r
 682 request to another node to store its download info for that key, it\r
 683 must include a \emph{token} returned by the storing node in a recent\r
 684 \texttt{find\_node} request. The token is created by hashing the IP\r
 685 address of the requesting peer with a temporary secret that expires\r
 686 after several minutes. This prevents the requesting peer from faking\r
 687 its IP address in the store request, since it must first receive a\r
 688 response from a \texttt{find\_node} on that IP.\r
 689 \r
 690 We also made some BitTorrent-inspired changes to the parameters of\r
 691 the DHT originally specified by the authors of Kademlia. Since, as\r
 692 we will show later, peers stay online in our system for much longer\r
 693 periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.\r
 694 The value is supposed to be large enough such that any given\r
 695 \emph{k} nodes are unlikely to fail within an hour of each other,\r
 696 which is very unlikely in our system given the long uptimes of\r
 697 nodes (see Figure~\ref{duration_online_1} in Section~\ref{analysis}.\r
 698 We also increased the number of concurrent outstanding\r
 699 requests allowed from 3 to 6 to speed up the recursive key finding\r
 700 processes.\r
 701 \r
 702 \subsection{\textbf{OPTIONAL}: Other Changes}\r
 703 \label{other_changes}\r
 704 \r
 705 We added one other new RPC request that nodes can make:\r
 706 \texttt{join}. This request is only sent on first loading the DHT,\r
 707 and is usually only sent to the bootstrap nodes that are listed for\r
 708 the DHT. These bootstrap nodes will respond to the request with the\r
 709 requesting peer's IP and port, so that the peer can determine what\r
 710 its oustide IP address is and whether port translation is being\r
 711 used. In the future, we hope to add functionality similar to STUN\r
 712 \cite{STUN}, so that nodes can detect whether they are NATted and\r
 713 take appropriate steps to circumvent it.\r
 714 \r
 715 In addition, we have allowed peers to store values in the DHT, even\r
 716 if the hash they are using is not the correct length. Most of the\r
 717 keys used in the DHT are based on the SHA1 hash, and so they are 20\r
 718 bytes in length. However, some files in the targetted Debian package\r
 719 system are only hashed using the MD5 algorithm, and so the hashes\r
 720 retrieved from the server are 16 bytes in length. The DHT will still\r
 721 store values using these keys by padding the end of them with zeroes\r
 722 to the proper length, as the loss of uniqueness from the 4 less\r
 723 bytes is not an issue. Also, the DHT will happily store hashes that\r
 724 are longer than 20 bytes, such as those from the 32 byte SHA256\r
 725 algorithm, by truncating them to the correct length. Since the\r
 726 requesting peer will have the full length of the hash, this will not\r
 727 affect its attempt to verify the downloaded file.\r
 728 \r
 729 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 730 \r
 731 \section{Performance Evaluation}\r
 732 \label{analysis}\r
 733 \r
 734 Our \texttt{apt-p2p} implementation supporting the Debian package\r
 735 distribution system has been available to all Debian users since May\r
 736 3rd, 2008 \cite{apt-p2p-debian}, and will also be available in the\r
 737 next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since created\r
 738 a \emph{walker} that will navigate the DHT and find all the peers\r
 739 currently connected to it. This allows us to analyze many aspects of\r
 740 our implementation.\r
 741 \r
 742 \subsection{Peer Lifetimes}\r
 743 \label{peer_life}\r
 744 \r
 745 \begin{figure}\r
 746 \centering\r
 747 \includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}\r
 748 \caption{The number of peers found in the system, and how many are\r
 749 behind a firewall or NAT.}\r
 750 \label{walker_peers}\r
 751 \end{figure}\r
 752 \r
 753 We first began analyzing the DHT on June 24th, and continue to this\r
 754 day, giving us 2 months of data so far. Figure~\ref{walker_peers}\r
 755 shows the number of peers we have seen in the DHT during this time.\r
 756 The peer population is very steady, with just over 50 regular users\r
 757 participating in the DHT at any time. We also note that we find 100\r
 758 users who connect regularly (weekly), and we have found 186 unique\r
 759 users in the 2 months of our analysis.\r
 760 \r
 761 We also determined which users are\r
 762 behind a firewall or NAT, which is one of the main problems of\r
 763 implementing a peer-to-peer network. These peers will be\r
 764 unresponsive to DHT requests from peers they have not contacted\r
 765 recently, which will cause the peer to wait for a timeout to occur\r
 766 (currently 9 seconds) before moving on. They will also be\r
 767 unable to contribute any upload bandwidth to other peers, as all\r
 768 requests for packages from them will also timeout. From\r
 769 Figure~\ref{walker_peers}, we see that approximately half of all\r
 770 peers suffered from this restriction.\r
 771 \r
 772 \begin{figure}\r
 773 \centering\r
 774 \includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}\r
 775 \caption{The CDF of how long an average session will last.}\r
 776 \label{duration_peers}\r
 777 \end{figure}\r
 778 \r
 779 Figure~\ref{duration_peers} shows the cumulative distribution of how\r
 780 long a connection from a peer can be expected to last. Due to our\r
 781 software being installed as a daemon that is started by default\r
 782 every time their computer boots up, peers are expected to stay for a\r
 783 long period in the system.\r
 784 Indeed, we find that 50\% of connections last longer than 5\r
 785 hours, and 20\% last longer than 10 hours. These connections are\r
 786 much longer than those reported by Saroiu et. al. \cite{saroiu2001}\r
 787 for other P2P systems, which had 50\% of Napster and Gnutella\r
 788 sessions lasting only 1 hour.\r
 789 \r
 790 \begin{figure}\r
 791 \centering\r
 792 \includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}\r
 793 \caption{\textbf{OPTIONAL}: The CDF of the average time individual peers stay in the\r
 794 system.}\r
 795 \label{duration_ind_peers}\r
 796 \end{figure}\r
 797 \r
 798 \textbf{OPTIONAL}: We also examined the average time each individual peer spends in the\r
 799 system. Figure~\ref{duration_peers} shows the cumulative\r
 800 distribution of how long each individual peer remains in the system.\r
 801 Here we see that 50\% of peers have average stays in the system\r
 802 longer than 10 hours.\r
 803 \r
 804 \begin{figure}\r
 805 \centering\r
 806 \includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}\r
 807 \caption{The fraction of peers that, given their current duration in\r
 808 the system, will stay online for another hour.}\r
 809 \label{duration_online_1}\r
 810 \end{figure}\r
 811 \r
 812 Since our DHT is based on Kademlia, which was designed based on the\r
 813 probability that a node will remain up another hour, we also\r
 814 analyzed our system for this parameter.\r
 815 Figure~\ref{duration_online_1} shows the fraction of peers that will\r
 816 remain online for another hour, as a function of how long they have\r
 817 been online so far. Maymounkov and Mazieres found that the longer a\r
 818 node has been online, the higher the probability that it will stay\r
 819 online \cite{kademlia}. Our results also show this behavior. In\r
 820 addition, similar to the Gnutella peers, over 90\% of our peers that\r
 821 have been online for 10 hours, will remain online for another hour.\r
 822 Our results also show that, for our system, over 80\% of all peers\r
 823 will remain online another hour, compared with around 50\% for\r
 824 Gnutella.\r
 825 \r
 826 \begin{figure}\r
 827 \centering\r
 828 \includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}\r
 829 \caption{\textbf{OPTIONAL}: The fraction of peers that, given their current duration in\r
 830 the system, will stay online for another 6 hours.}\r
 831 \label{duration_online_6}\r
 832 \end{figure}\r
 833 \r
 834 \textbf{OPTIONAL}: Since our peers are much longer-lived than other P2P systems, we\r
 835 also looked at the fraction of peers that stay online for another 6\r
 836 hours. Figure~\ref{duration_online_6} shows that over 60\% of peers\r
 837 that are online for 10 hours will stay online for another 6.\r
 838 However, we see an interesting decrease in this fraction between 8\r
 839 and 12 hours, which can also be seen in\r
 840 Figure~\ref{duration_online_1}. We believe this to be due to desktop\r
 841 users, who regularly turn off their computers at night.\r
 842 \r
 843 \subsection{Peer Statistics}\r
 844 \label{peer_stats}\r
 845 \r
 846 On July 31st we enhanced our walker to retrieve additional\r
 847 information from each contacted peer. The peers are configured, by\r
 848 default, to publish some statistics on how much they are downloading\r
 849 and uploading, and their measured response times for DHT queries.\r
 850 Our walker can extract this information if the peer is not\r
 851 firewalled or NATted, it has not disabled this functionality, and if\r
 852 it uses the same port for both its DHT (UDP) requests and download\r
 853 (TCP) requests (which is also the default configuration parameter).\r
 854 \r
 855 \begin{figure}\r
 856 \centering\r
 857 \includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}\r
 858 \caption{The number of peers that were contacted to determine their\r
 859 bandwidth, and the total number of peers in the system.}\r
 860 \label{down_peers}\r
 861 \end{figure}\r
 862 \r
 863 Figure~\ref{down_peers} shows the total number of peers we have been\r
 864 able to contact since starting to gather this additional\r
 865 information, as well as how many total peers were found. We were\r
 866 only able to contact 30\% of all the peers that connected to the\r
 867 system during this time.\r
 868 \r
 869 \begin{figure}\r
 870 \centering\r
 871 \includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}\r
 872 \caption{\textbf{OPTIONAL}: The bandwidth of data that the contacted peers have\r
 873 downloaded and uploaded.}\r
 874 \label{down_bw}\r
 875 \end{figure}\r
 876 \r
 877 Figure~\ref{down_bw} shows the amount of data the peers we were able\r
 878 to contact have downloaded. Peers measure their downloads from other\r
 879 peers and mirrors separately, so we are able to get an idea of how\r
 880 much savings our system is generating for the mirrors. We see that\r
 881 the peers are downloading approximately 20\% of their package data\r
 882 from other peers, which is saving the mirrors from supplying that\r
 883 bandwidth. The actual numbers are only a lower bound, since we have\r
 884 only contacted 30\% of the peers in the system, but we can estimate\r
 885 that \texttt{apt-p2p} has already saved the mirrors 15 GB of\r
 886 bandwidth, or 1 GB per day.\r
 887 \r
 888 We also collected the statistics on the measured response time peers\r
 889 were experiencing when sending requests to the DHT. We found that\r
 890 the recursive \texttt{find\_value} query, which is necessary before\r
 891 a download can occur, is taking 17 seconds on average. This\r
 892 indicates that, on average, requests are experiencing almost 2\r
 893 stalls while waiting for the 9 second timeouts to occur on\r
 894 unresponsive peers. This is longer than our target of 10 seconds,\r
 895 although it will only lead to a slight average delay in downloading\r
 896 of 1.7 seconds when the default 10 concurrent downloads are\r
 897 occurring.This increased response time is due to the number of peers\r
 898 that were behind firewalls or NATs, which was much higher than we\r
 899 anticipated. We do have plans to improve this through better\r
 900 informing of users of their NATted status, the use of STUN\r
 901 \cite{STUN} to circumvent the NATs, and by better exclusion of\r
 902 NATted peers from the DHT (which does not prevent them from using\r
 903 the system).\r
 904 \r
 905 We were also concerned that the constant DHT requests and responses,\r
 906 even while not downloading, would overwhelm some peers' network\r
 907 connections. However, we found that peers are using 200 to 300 bytes\r
 908 per second of bandwidth in servicing the DHT. These numbers are\r
 909 small enough to not affect any other network services the peer would\r
 910 be running.\r
 911 \r
 912 \r
 913 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 914 \r
 915 \section{Related Work}\r
 916 \label{related}\r
 917 \r
 918 There have been other preliminary attempts to implement peer-to-peer distributors for\r
 919 software packages. apt-torrent \cite{apttorrent} creates torrents\r
 920 for some of the larger packages available, but it ignores the\r
 921 smaller packages, which are often the most popular. DebTorrent\r
 922 \cite{debtorrent} makes widespread modifications to a traditional\r
 923 BitTorrent client, to try and fix the drawbacks mentioned in\r
 924 Section~\ref{bittorrent}. However, these changes also require some\r
 925 modifications to the distribution system to support it. Our system\r
 926 considers all the files available to users to download, and makes\r
 927 use of the existing infrastructure unmodified.\r
 928 \r
 929 \textbf{OPTIONAL}: Others have also used DHTs to support this type of functionality.\r
 930 Kenosis \cite{kenosis} is a P2P Remote Procedure Call\r
 931 client also based on the Kademlia DHT, but it is meant as a P2P\r
 932 primitive system on which other tools can be built, and so it has no\r
 933 file sharing functionality at all. Many have used a DHT as a drop-in\r
 934 replacement for the tracker in a BitTorrent system\r
 935 \cite{bittorrent-dht, azureus-dht}, but such systems only use the\r
 936 DHT to find peers for the same torrent, so the file sharing uses\r
 937 traditional BitTorrent and so is not ideal for the reasons listed\r
 938 in Section~\ref{bittorrent}.\r
 939 \r
 940 % \cite{deltacast}\r
 941 \r
 942 There are a number of works dedicated to developing a collaborative\r
 943 content distribution network (CDN) using peer-to-peer techniques.\r
 944 Freedman et. al. developed Coral \cite{coral} using a distributed\r
 945 \emph{sloppy} hash table to speed request times. Pierre and van\r
 946 Steen developed Globule \cite{globule} which uses typical DNS and\r
 947 HTTP redirection techniques to serve requests from a network of\r
 948 replica servers, which in turn draw their content from the original\r
 949 location (or a backup). Shah et. al. \cite{shah08} analyze an\r
 950 existing software delivery system and use the results to design a\r
 951 peer-to-peer content distribution network that makes use of\r
 952 volunteer servers to help with the load. None of these systems meets\r
 953 our goal of an even distribution of load amongst the users of the\r
 954 system. Not all users of the systems become peers, and so are not\r
 955 able to contribute back to the system after downloading. The\r
 956 volunteers that do contribute as servers are required to contribute\r
 957 larger amounts of bandwidth, both for uploading to others, and in\r
 958 downloading content they are not in need of in order to share them\r
 959 with other users. Our system treats all users equally, requiring all\r
 960 to become peers in the system, sharing the uploading load equally\r
 961 amongst all, but doesn't require any user to download files they\r
 962 would not otherwise need.\r
 963 \r
 964 The most similar works to ours are by Shah et. al. \cite{shah08} and\r
 965 Shark by Annapureddy et. al. \cite{shark}.\r
 966 Shah's system, in addition to the drawbacks mentioned above,\r
 967 is not focused on the interactivity of downloads, as\r
 968 half of all requests were required ``to wait between 8 and 15\r
 969 minutes.'' In contrast, lookups in our system take only seconds to\r
 970 complete, and all requests can be completed in under a minute.\r
 971 Shark makes use of Coral's distributed sloppy hash table to speed\r
 972 the lookup time, but their system is more suited to its intended use\r
 973 as a distributed file server. It doesn't make use of authoritative\r
 974 copies of the original files, allowing instead any users in the\r
 975 system to update files and propagate those changes to others. Our\r
 976 system is well-tailored to the application of disseminating the\r
 977 unchanging software packages from the authoritative sources to all\r
 978 users.\r
 979 \r
 980 \r
 981 %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
 982 \r
 983 \section{Conclusion and Future Work}\r
 984 \label{conclusions}\r
 985 \r
 986 We have designed a generally applicable peer-to-peer content\r
 987 distribution model to be used by many of the free software\r
 988 distributors operating today. It makes use of already existing\r
 989 features of most package management systems to create a\r
 990 peer-to-peer distribution that should substantially reduce the costs\r
 991 of hosting the software packages.\r
 992 \r
 993 We have also implemented our design in freely available software to\r
 994 be used in conjuction with Debian-based distribution of Linux\r
 995 software packages. It is currently in use by some users of the\r
 996 Debian project's distribution, and so serves as an example of the\r
 997 possibilities that exist.\r
 998 \r
 999 We feel that our P2P software package distribution model is\r
1000 well-suited to be used by many free software distributors. We hope\r
1001 to convince them to adopt such a model in their distribution, or we\r
1002 may port our existing system to some of the other groups for them to\r
1003 try.\r
1004 \r
1005 \textbf{OPTIONAL}: One aspect missing from our model is the removal of old packages\r
1006 from the cache. Since our implementation is still relatively young,\r
1007 we have not had to deal with the problems of a growing cache of\r
1008 obsolete packages consuming all of a user's hard drive. We plan to\r
1009 implement some form of least recently used (LRU) cache removal\r
1010 technique, in which packages that are no longer available on the\r
1011 server, no longer requested by peers, or simply are the oldest in\r
1012 the cache, will be removed.\r
1013 \r
1014 The most significant area of improvement still needed in our sample\r
1015 implementation is to further speed up some of the slower recursive\r
1016 DHT requests. We hope to accomplish this by further tuning the\r
1017 parameters of our current system, better exclusion of NATted peers\r
1018 from the routing tables, and through the use of STUN \cite{STUN} to\r
1019 circumvent the NATs of the 50\% of the peers that have not\r
1020 configured port forwarding.\r
1021 \r
1022 \bibliographystyle{IEEEtran}\r
1023 \bibliography{./IEEEabrv,./all}\r
1024 \r
1025 \end{document}\r