WIP on final version of accepted INFOCOM paper.

[quix0rs-apt-p2p.git] / docs / paper / paper.tex
diff --git a/docs/paper/paper.tex b/docs/paper/paper.tex

index 8423d82a69a173791bb0b0f6ad79acaa746fd30c..a1ca47a6b9de533455949b13190e367450c14b12 100644 (file)
--- a/docs/paper/paper.tex
+++ b/docs/paper/paper.tex
@@ -1,14 +1,14 @@
  \documentclass[conference]{IEEEtran}\r
  \r
  %% INFOCOM addition:\r
-\makeatletter\r
-\def\ps@headings{%\r
-\def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%\r
-\def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%\r
-\def\@oddfoot{}%\r
-\def\@evenfoot{}}\r
-\makeatother\r
-\pagestyle{headings}\r
+% \makeatletter\r
+% \def\ps@headings{%\r
+% \def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%\r
+% \def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%\r
+% \def\@oddfoot{}%\r
+% \def\@evenfoot{}}\r
+% \makeatother\r
+% \pagestyle{headings}\r
  \r
  \usepackage[noadjust]{cite}\r
  \r
@@ -18,7 +18,7 @@
  \r
  \begin{document}\r
  \r
-\title{\texttt{apt-p2p}: A Peer-to-Peer Distributor for Free Software Package Release and Update}\r
+\title{\texttt{apt-p2p}: A Peer-to-Peer Distribution System for Software Package Releases and Updates}\r
  \author{\IEEEauthorblockN{Cameron Dale}\r
  \IEEEauthorblockA{School of Computing Science\\\r
  Simon Fraser University\\\r
@@ -34,22 +34,28 @@ Email: jcliu@cs.sfu.ca}}
  \maketitle\r
  \r
  \begin{abstract}\r
-A large amount of free software packages are available over the\r
-Internet from many different distributors. Most of these\r
-distributors use the traditional client-server model to handle\r
-requests from users. However, there is an excellent opportunity to\r
-use peer-to-peer techniques to reduce the cost of much of this\r
-distribution, especially due to the altruistic nature of many of\r
-these users. There are no existing solution suitable for this\r
-situation, so we present a new technique for satisfying the needs of\r
-this P2P distribution, which is generally applicable to many of\r
-these distributors' systems. Our method makes use of a DHT for\r
-storing the location of peers, using the cryptographic hash of the\r
-package as a key. To show the simplicity and functionality, we\r
-implement a solution for the distribution of Debian software\r
-packages, including the many DHT customizations needed. Finally, we\r
-analyze our system to determine how it is performing and what effect\r
-it is having.\r
+The Internet has become a cost-effective\r
+vehicle for software development and release, particular in the free software community.\r
+Given the free nature of these software, there are often a number of users\r
+motivated by altruism to help out with the distribution, so as to promote the healthy development\r
+of this voluntary society. It is thus naturally expected that peer-to-peer distribution can be implemented,\r
+which scale well with large user bases and can easily explore the network resources made available by\r
+the volunteers.\r
+\r
+Unfortunately, this application scenario has many unique characteristics, which\r
+make a straightforward adoption of existing peer-to-peer systems for file sharing such as BitTorrent suboptimal. In particular,\r
+a software release often consists of a large number of packages, which are difficult to be distributed individually, but the archive is\r
+too large to be distributed in its entirety. The packages are also being constantly\r
+updated by the loosely-managed developers, and the interest in a particular version of a package can be very\r
+limited depending on the computer platforms and operating systems used.\r
+\r
+In this paper, we propose a novel peer-to-peer assisted distribution system design that\r
+addresses the above challenges. It enhances the existing distribution systems by providing compatible and yet more efficient downloading and updating services\r
+for software packages. Our design leads to \texttt{apt-p2p}, a practical implementation that extends the popular \texttt{apt} distributor.  \texttt{apt-p2p}  has been used in conjuction with Debian-based distribution of Linux\r
+software packages and will also be available in the next release of Ubuntu. We have addressed the key design issues in \texttt{apt-p2p}, including indexing table customization,\r
+response time reduction, and multi-value extension. They together ensure\r
+that the altruistic users' resources are effectively utilized and thus significantly reduces the currently\r
+large bandwidth requirements of hosting the software, as confirmed by our existing real user statistics over the Internet.\r
  \end{abstract}\r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
@@ -57,60 +63,67 @@ it is having.
  \section{Introduction}\r
  \label{intro}\r
  \r
-There are a large number of free software package distributors using\r
-package distribution systems over the Internet to distribute\r
-software to their users. These distributors have developed many\r
-different methods for this distribution, but they almost exclusively\r
-use a client-server model to satisfy user requests. The popularity\r
-and number of users results in a large number of requests, which\r
-usually requires a network of mirrors to handle. Due to the free\r
-nature of this software, many users are willing and able to\r
-contribute upload bandwidth to this distribution, but have no\r
-current way to do this.\r
-\r
-We present a new peer-to-peer distribution model to meet these\r
-demands. It is based on many previous implementations of successful\r
-peer-to-peer protocols, especially distributed hash tables (DHT) and\r
-BitTorrent. The model relies on the pre-existence of cryptographic\r
-hashes of the packages, which should uniquely identify it for a\r
-request from other peers. If the peer-to-peer download fails, then\r
-the original request to the server is used as a fallback to prevent\r
-any dissatisfaction from users. The peer can then share this new\r
-package with others through the P2P system.\r
-\r
-First, we examine the opportunity that is available for many of\r
-these free software package distributors. We present an overview of\r
-a system that would efficiently satisfy the demands of a large\r
-number of users, and should significantly reduce the currently\r
-substantial bandwidth requirements of hosting this software. We then\r
-present an example implementation based on the Debian package\r
-distribution system. This implementation will be used by a large\r
-number of users, and serves as an example for other free software\r
-distributors of the opportunity that can be met with such a system.\r
-\r
-The rest of this paper is organized as follows. The background and motivation are presented in Section \ref{situation}, together with related works in  Section \ref{related}. We propose\r
-our solution in Section \ref{opportunity}. We then detail our sample\r
-implementation for Debian-based distributions in Section\r
-\ref{implementation}, including an in-depth look at our DHT\r
-customizations in Section \ref{custom_dht}. Its performance is evaluated in Section \ref{analysis}. Finally, \r
-Section \ref{conclusions} concludes the paper and offers some future directions.\r
+With the widespread penetration of broadband access, the Internet has become a cost-effective\r
+vehicle for software development and release \cite{feller2000fao}. This is particularly true for the free software\r
+community whose developers and users are distributed worldwide and work asynchronously. The ever increasing power of\r
+modern programming languages, computer platforms, and operating systems has made this software extremely large and complex,\r
+though it is often divided into a huge number of small packages.\r
+Together with their popularity among users,\r
+an efficient and reliable management and distribution of these software packages over the Internet has become a challenging task \cite{ubuntu-blueprint}.\r
+\r
+The existing distribution for free software is mostly based on the client/server model, e.g.,\r
+the Advanced Package Tool (\texttt{apt}) for Linux \cite{apt}, which suffers from the well-known bottleneck problem.\r
+Given the free nature of these software, there are often a number of users\r
+motivated by altruism to help out with the distribution, so as to promote the healthy development\r
+of this voluntary society.\r
+We thus naturally expect that peer-to-peer distribution can be implemented in\r
+this context, which will scale well with the currently large user bases and can easily explore the resources made available by\r
+the volunteers.\r
+\r
+Unfortunately, this application scenario has many unique characteristics, which\r
+make a straightforward adoption of existing peer-to-peer systems for file sharing such as BitTorrent suboptimal. In particular,\r
+there are too many packages to distribute each individually, but the archive is\r
+too large to distribute in its entirety. The packages are also being constantly\r
+updated by the loosely-managed developers, and the interest in a particular version of a package can be very\r
+limited. They together make it very difficult to efficiently create and manage torrents/trackers. The random downloading nature of BitTorrent-like systems is also different from the\r
+sequential order used in existing software package distributions. This in turn suppresses interaction with users\r
+given the difficulty in tracking speed and downloading progress.\r
+\r
+In this paper, we propose a novel peer-to-peer assisted distribution system design that\r
+addresses the above challenges. It enhances the existing distribution systems by providing compatible and yet more efficient downloading and updating services\r
+for software packages. Our design leads to the development of \texttt{apt-p2p}, a practical implementation based on the Debian \footnote{Debian - The Universal Operating System: {\it http://www.debian.org/}} package\r
+distribution system.  We have addressed the key design issues in \texttt{apt-p2p}, including indexing table customization,\r
+response time reduction, and multi-value extension. They together ensure\r
+that the altruistic users' resources are effectively utilized and thus significantly reduces the currently\r
+large bandwidth requirements of hosting the software.\r
+\r
+\texttt{apt-p2p}  has been used in conjuction with Debian-based distribution of Linux\r
+software packages and will also be available in the next release of Ubuntu. We have evaluated our current deployment  to\r
+determine how effective it is at meeting our goals, and to see what\r
+effect it is having on the Debian package distribution system. In particular, our existing real user statistics\r
+have suggested that it responsively interacts with clients and substantially reduces server cost.\r
+\r
+The rest of this paper is organized as follows. The background and motivation are presented in Section~\ref{situation}, including an analysis of BitTorrent's use for this purpose in Section~\ref{bittorrent}. We propose\r
+our solution in Section~\ref{opportunity}. We then detail our sample\r
+implementation for Debian-based distributions in Section~\ref{implementation},\r
+including an in-depth look at our system optimization\r
+in Section~\ref{custom_dht}. The performance of our implementation is evaluated\r
+in Section~\ref{analysis}. We examine some related work in Section~\ref{related}, and then\r
+Section~\ref{conclusions} concludes the paper and offers some future directions.\r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
  \r
  \section{Background and Motivations}\r
  \label{situation}\r
  \r
-In the free software society, there are a large number of groups using the Internet to \r
-collaboratively develop and release their software. The ever increasing power of\r
-modern programming languages and operating systems has made these software, like commercial software, extremely large and complex, which often\r
-consists of many small units (packages). Together with their popularity among users, \r
-an efficient and reliable management and distribution of these packages over the Internet has become an daunting task. In this section, we offer concrete examples illustrating the \r
-unique challenges in this context. \r
+In the free software community, there are a large number of groups using the Internet to\r
+collaboratively develop and release their software. Efficient and reliable management and distribution \r
+of these software packages over the Internet thus has become a critical task. In this section, we offer concrete examples illustrating the\r
+unique challenges in this context.\r
  \r
-\subsection{Free Software Package Distributor: Examples}\r
+\subsection{Free Software Package Distributors}\r
  \label{examples}\r
  \r
-\r
  Most Linux distributions use a software package management system\r
  that fetches packages to be installed from an archive of packages\r
  hosted on a network of mirrors. The Debian project, and other\r
@@ -133,34 +146,34 @@ manager is called \texttt{portage}, SlackWare Linux uses
  \texttt{pkgtools}, and FreeBSD has a suite of command-line tools,\r
  all of which download these tarballs from web servers.\r
  \r
-Similar tools have been used for other software packages. CPAN\r
-distributes software packages for the PERL\r
+Similar tools have been used for other types of software packages. CPAN\r
+distributes packaged software for the PERL\r
  programming language, using SOAP RPC requests to find and download\r
  files. Cygwin provides many of the\r
  standard Unix/Linux tools in a Windows environment, using a\r
  package management tool that requests packages from websites. There\r
-are two software distribution systems for Mac OSX, fink and\r
+are two software distribution systems for software that runs on the Macintosh OS, fink and\r
  MacPorts, that also retrieve packages in this way.\r
  \r
  Direct web downloading is also a common way, often coupled with a hash\r
  verification file to be downloaded next to the desired\r
-file. The hash file usually have the same file name, but with an\r
+file. The hash file usually has the same file name, but with an\r
  added extension identifying the hash used (e.g. \texttt{.md5} for\r
  the MD5 hash). This type of file downloading and verification is\r
  typical of free software hosting facilities that are open to anyone\r
  to use, such as SourceForge.\r
  \r
  \r
-Given the free nature of these software, there are often a number of users \r
-motivated by altruism to want to help out with their distribution. \r
-This is particularly true considering that many of these software are used by\r
+Given the free nature of this software, there are often a number of users\r
+motivated by altruism to want to help out with the distribution.\r
+This is particularly true considering that much of this software is used by\r
  groups that are staffed mostly, or sometimes completely, by\r
-volunteers. They are thus motivated to contribute their network resources, so as to promote the healthy development \r
-of the volunteer community that released the software. As a matter of fact,\r
-we have see many free mirror sites hosting these software packages for downloading. \r
-We also naturally expect that peer-to-peer distribution can be implementation in \r
-this context, which scale well with large user bases and can easily explore the network resources made available by\r
-the volunteers. \r
+volunteers. They are thus motivated to contribute their network resources, so as to promote the healthy development\r
+of the volunteer community that released the software.\r
+We also naturally expect that peer-to-peer distribution can be implemented in\r
+this context, which will scale well with the currently large user bases\r
+and can easily explore the network resources made available by\r
+the volunteers.\r
  \r
  \r
  \r
@@ -181,7 +194,7 @@ architecture that the package is intended for.
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps}\r
+\includegraphics[width=0.80\columnwidth]{apt_p2p_simulation-size_CDF.eps}\r
  \caption{The CDF of the size of packages in a Debian system, both\r
  for the actual size and adjusted size based on the popularity of\r
  the package.}\r
@@ -192,8 +205,8 @@ For example, Figure~\ref{size_CDF} shows the size of the packages in the
  current Debian distribution. While 80\% of the packages are less than\r
  512~KB, some of the packages are hundreds of megabytes. The entire\r
  archive consists of 22,298 packages and is approximately 119,000 MB\r
-in size. Most of the packages are to be installed in any computer environment, but there are \r
-also OS- or architecture-specific packages, as shown by the adjusted sizes based on popularity of the packages. ((((more words on how the size is adjusted by popularity))))\r
+in size. Many of the packages are to be installed in any computer environment, but there are\r
+also OS- or architecture-specific packages.\r
  \r
  \subsubsection{Package Updates}\r
  \r
@@ -203,36 +216,36 @@ releasing a new version with improved functionality,
  or the distributor updating their packaging of the\r
  software to meet new requirements. Even if the distributor\r
  periodically makes \emph{stable} releases, which are snapshots of\r
-all the packages in the archive at a certain time, updates are still\r
+all the packages in the archive at a certain time, many later updates are still\r
  released for security issues or serious bugs.\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{size-quarter.eps}\r
-\caption{The amount of data in the Debian archive that is updated\r
-each day, broken down by architecture.}\r
+\includegraphics[width=0.9\columnwidth]{size-quarter.eps}\r
+\caption{The amount of data in the 119,000 MB Debian archive that is\r
+updated each day, broken down by architecture.}\r
  \label{update_size}\r
  \end{figure}\r
  \r
  For example, Figure~\ref{update_size} shows the amount of data in\r
  the Debian archive that was updated each day over a period of 3\r
-months. In every single day, approximately 1.5\% of the 119,000 MB archive is\r
-updated with new versions of packages. Note that this frequency is much higher than\r
-that of most commercial software, mainly because many free software are\r
-developed in a loosely management environment with developers working\r
-asynchronously from worldwide. \r
+months. Every single day, approximately 1.5\% of the 119,000 MB archive is\r
+updated with new versions of packages. This frequency is much higher than\r
+that of most commercial software \cite{microsoft-update}, mainly because much of free software is\r
+developed in a loosely managed environment of developers working\r
+asynchronously on a worldwide scale.\r
  \r
  \subsubsection{Limited Interest}\r
  \r
-Finally, though there are a large number of packages and a large number of\r
-users, the interest in a particular version of a package can be very \r
+Though there are a large number of packages and a large number of\r
+users, the interest in a particular package, or version of a package, can be very\r
  limited. Specifically, there are core packages that every user has to download, but most\r
  packages would fall in the category of optional or extra, and so are\r
-interesting to only a limited number of people.\r
+interesting to only a limited number of users.\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}\r
+\includegraphics[width=0.80\columnwidth]{apt_p2p_popularity-cdf.eps}\r
  \caption{The CDF of the popularity of packages in a Debian system.}\r
  \label{popularity_CDF}\r
  \end{figure}\r
@@ -244,26 +257,32 @@ users who install each package. Though some packages are installed
  by everyone, 80\% of the packages are installed by less than 1\% of\r
  users.\r
  \r
+\subsubsection{Interactive Users}\r
  \r
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+Finally, given the relatively long time for software package downloading, existing package management systems generally display\r
+some kind of indication of speed and completeness for users to\r
+monitor. Since previous client-server downloads occurred in a sequential\r
+fashion, the package management software also measures the speed\r
+based on sequential downloading. To offer comparable user experience, it is natural to expect that\r
+the new peer-to-peer solution be reasonably responsive at retrieving packages,\r
+preferably in a sequential downloading order too. \r
  \r
-\subsection{Why BitTorrent Doesn't Work Well}\r
-\label{related}\r
+\subsection{Why BitTorrent Does Not Work Well}\r
+\label{bittorrent}\r
  \r
-Recently, many distributors make their software available using\r
-BitTorrent \cite{COHEN03}, in particular, for the distribution of CD\r
-images. This straightforward use however is far ineffective, as it requires the\r
+Many distributors make their software available using\r
+BitTorrent \cite{COHEN03}, in particular for the distribution of CD\r
+images. This straightforward use however can be very ineffective, as it requires the\r
  peers to download large numbers of packages that they are not\r
  interested in, and prevents them from updating to new packages\r
  without downloading another image containing a lot of the same\r
-packages they already have. \r
-\r
-An alternative is to create torrents tracking individual packages. Unfortunately, we find that this enhancement can be\r
-quite difficult given the unique characteristic of free software packages. \r
+packages they already have.\r
  \r
+An alternative is to create torrents tracking smaller groups of packages. Unfortunately, we find that this enhancement can be\r
+quite difficult given the unique characteristic of free software packages.\r
  First, there is no obvious way to divide the packages into torrents.\r
  Most of the packages are too small, and there are too many packages\r
-in the entire archive to create individual torrents for each one.\r
+in the entire archive, to create individual torrents for each one.\r
  On the other hand, all the packages together are too large to track\r
  efficiently as a single torrent. Hence, some division of the archive's\r
  packages into torrents is obviously necessary, but wherever that\r
@@ -272,100 +291,131 @@ or prevent some peers from connecting to others who do have the same
  content. In addition, a small number of the packages can be updated every\r
  day which would add new files to the torrent, thereby changing its\r
  \emph{infohash} identifier and making it a new torrent. This will\r
-severely fracture the download population, even though peers in the\r
+severely fracture the download population, since even though peers in the\r
  new torrent may share 99\% of the packages in common with peers in the\r
-old torrent.\r
+old torrent, they will be unable to communicate.\r
  \r
  Other issues also prevent BitTorrent from being a good solution to\r
-this problem. In particular, BitTorrent's fixed piece sizes (?KB) that disregard file\r
+this problem. In particular, BitTorrent's fixed piece sizes (usually 512 KB) that disregard file\r
  boundaries are bigger than many of the packages in the archive. This\r
  will waste peers' downloading bandwidth as they will end up\r
  downloading parts of other packages just to get the piece that\r
-contains the package they do want. \r
-\r
+contains the package they do want.\r
  Finally, note that BitTorrent downloads files\r
  randomly, which does not work well with the interactive package\r
-management tools expectation of sequential downloads. On the other hand, with altruistic peers, incentives to share (upload) \r
-become a less important issue, and the availability of seeds are not critical, either, as the mirror sites \r
-can serve in that capacity.\r
-\r
+management tools expectation of sequential downloads.\r
  \r
+On the other hand, there are aspects of BitTorrent that are no\r
+longer critical. Specifically, with altruistic peers\r
+and all files being available to download without uploading, incentives to share\r
+become a less important issue. Also, the availability of seeders is\r
+not critical either, as the servers are already available to serve in\r
+that capacity.\r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
  \r
  \section{Peer-to-Peer Assisted Distributor: An Overview}\r
  \label{opportunity}\r
  \r
-The situation described in section \ref{situation} presents a clear\r
-opportunity to use some form of peer-to-peer file-sharing protocol\r
-to allow willing users to contribute upload bandwidth. This sparse\r
+We now present the design of our peer-to-peer assisted distribution system for free software package releases and\r
+updates. A key principle in our design is that the new functionalities implemented in our distributor should be transparent to users,\r
+thus offering the same experience as using conventional software management systems, but with enhanced efficiency.\r
+That said, we assume that the user is still attempting to download packages from a\r
+server, but the requests will be proxied by our peer-to-peer program.\r
+We further assume that the server is always available and has all of the package files.\r
+In addition, the cryptographic hash of the packages will be available\r
+separately from the package itself, and is usually contained in an\r
+\emph{index} file which also contains all the packages' names,\r
+locations and sizes.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=0.9\columnwidth]{model_simple.eps}\r
+\caption{The different phases of functionality of our peer-to-peer distribution model.}\r
+\label{model}\r
+\end{figure}\r
+\r
+\subsection{System Overview}\r
+Our model for using peer-to-peer to enhance package distribution is shown in\r
+Figure~\ref{model}. As shown in Phase~1, our program will act as a\r
+proxy (1,2), downloading (3) and caching all files communicated\r
+between the user and the server (4). It will therefore also have\r
+available the index files containing the cryptographic hashes all\r
+packages. Later, in Phase~2, upon receiving a request from the user\r
+to download a package (5), our program will search the index files\r
+for the package being requested and find its hash (6). This hash can\r
+then be looked up recursively in an indexing structure (a Distributed Hash Table, or DHT \cite{kademlia}, in our implementation) (7), which will return a\r
+list of peers that have the package already (8). The package can\r
+then be downloaded from the peers (11,12), it can be verified using\r
+the hash (13), and if valid can be returned to the user (14). The\r
+current nodes location is also added to the DHT for that hash (15),\r
+as it is now a source for others to download from.\r
+\r
+In steps (11,12), the fact that this package is also available to download for free\r
+from a server is very important to our proposed model. If the package hash\r
+can not be found in the DHT, the peer can then fallback to\r
+downloading from the original location (i.e. the server).\r
+The server thus, with no modification to its\r
+functionality, serves as a seed for the packages in the peer-to-peer\r
+system. Any packages that have just been updated, or that are very\r
+rare, and so do not have any peers available, can always be found on\r
+the server. Once the peer has completed the download from the server\r
+and verified the package, it can then add itself to the DHT as the\r
+first peer for the new package, so that future requests for the package\r
+will not need the server.\r
+\r
+This sparse\r
  interest in a large number of packages undergoing constant updating\r
-is well suited to the functionality provided by a Distributed Hash\r
-Table (DHT). DHTs require unique keys to store and retrieve strings\r
+is well suited to the functionality provided by a DHT. A DHT requires unique keys to store and retrieve strings\r
  of data, for which the cryptographic hashes used by these package\r
  management systems are perfect for. The stored and retrieved strings\r
  can then be pointers to the peers that have the package that hashes\r
-to that key. A downloading peer can lookup the package hash in the\r
-DHT and, if it is found, download the file from those peers and\r
-verify the package with the hash. Once the download is complete, the\r
-peer will add its entry to the DHT indicating that it now has the\r
-package.\r
-\r
-The fact that this package is also available to download for free\r
-from a server is very important to our proposal. If the package hash\r
-can not be found in the DHT, the peer can then fallback to\r
-downloading from the original location (i.e. the network of\r
-mirrors). The mirrors thus, with no modification to their\r
-functionality, serve as seeds for the packages in the peer-to-peer\r
-system. Any packages that have just been updated, or that are very\r
-rare, and so don't have any peers available can always be found on\r
-the mirror. Once the peer has completed the download from the mirror\r
-and verified the package, it can then add itself to the DHT as the\r
-first peer for the new package, so that future requests from peers\r
-will not need the mirror.\r
+to that key.\r
  \r
-The trust of the package is also always guaranteed through the use\r
+Note that, despite downloading the package from untrustworthy peers,\r
+the trust of the package is always guaranteed through the use\r
  of the cryptographic hashes. Nothing can be downloaded from a peer\r
  until the hash is looked up in the DHT, so a hash must first come\r
-from a trusted source (i.e. a mirror). Most distributors use index\r
+from a trusted source (i.e. the distributors server). Most distributors use index\r
  files that contain hashes for a large number of the packages in\r
  their archive, and which are also hashed. After retrieving the\r
-index's hash from the mirror, the index file can be downloaded from\r
+index's hash from the server, the index file can be downloaded from\r
  peers and verified. Then the program has access to all the hashes of\r
  the packages it will be downloading, all of which can be verified\r
  with a \emph{chain of trust} that stretches back to the original\r
  distributor's server.\r
  \r
-\subsection{Implementation Options}\r
-\label{imp_options}\r
-\r
-There are several ways to implement the desired P2P functionality\r
-into the existing package management software. The functionality can\r
-be directly integrated through modifications to the software, though\r
-this could be difficult as the P2P functionality should be running\r
-at all times. This is needed both for efficient lookups and to\r
-support uploading of already downloaded packages. Unfortunately, the\r
-package management tools typically only run until the download and\r
-install request is complete.\r
-\r
-Many of the package management software implementations use HTTP\r
-requests from web servers to download the packages, which makes it\r
-possible to implement the P2P aspect as an almost standard HTTP\r
-caching proxy. This proxy will run as a daemon in the background,\r
-listening for requests from the package management tool for package\r
-files. It will get uncached requests first from the P2P system, or\r
-falling back to the normal HTTP request from a server should it not\r
-be found. For methods that don't use HTTP requests, other types of\r
-proxies may also be possible.\r
-\r
-\subsection{Downloading From Peers}\r
+% \subsection{Implementation Options}\r
+% \label{imp_options}\r
+%\r
+% There are several ways to implement the desired peer-to-peer functionality\r
+% into the existing package management software. The functionality can\r
+% be directly integrated through modifications to the software, though\r
+% this could be difficult as the peer-to-peer functionality should be running\r
+% at all times. This is needed both for efficient lookups and to\r
+% support uploading of already downloaded packages. Unfortunately, the\r
+% package management tools typically only run until the download and\r
+% install request is complete.\r
+%\r
+% Many of the package management software implementations use HTTP\r
+% requests from web servers to download the packages, which makes it\r
+% possible to implement the peer-to-peer aspect as an almost standard HTTP\r
+% caching proxy. This proxy will run as a daemon in the background,\r
+% listening for requests from the package management tool for package\r
+% files, as well as serving (uploading) cached package to other peers.\r
+% It will get uncached requests first from the peer-to-peer system, or\r
+% fall back to the normal HTTP request from a server should it not\r
+% be found. For methods that do not use HTTP requests, other types of\r
+% proxies may also be possible.\r
+\r
+\subsection{Peer Downloading Protocol}\r
  \label{downloading}\r
  \r
  Although not necessary, we recommend implementing a download\r
  protocol that is similar to the protocol used to fetch packages from\r
-the distributor's servers. This simplifies the P2P program, as it\r
-can then treat peers and mirrors almost identically when requesting\r
-packages. In fact, the mirrors can be used when there are only a few\r
+the distributor's server. This simplifies the peer-to-peer program, as it\r
+can then treat peers and the server almost identically when requesting\r
+packages. In fact, the server can be used when there are only a few\r
  slow peers available for a file to help speed up the download\r
  process.\r
  \r
@@ -381,11 +431,13 @@ available.
  \r
  Since the package management system only stores a hash of the entire\r
  package, and not of pieces of that package, we will need to be able\r
-to store and retrieve these piece hashes using the P2P protocol. In\r
+to store and retrieve these piece hashes using the peer-to-peer protocol. In\r
  addition to storing the file download location in the DHT (which\r
  would still be used for small files), a peer will store a\r
  \emph{torrent string} containing the peer's hashes of the pieces of\r
-the larger files. These piece hashes could be compared ahead of time\r
+the larger files (similar to (15) in Phase 3 of Figure~\ref{model}).\r
+These piece hashes will be retrieved and compared ahead of time by\r
+the downloading peer ((9,10) in Phase 2 of Figure~\ref{model})\r
  to determine which peers have the same piece hashes (they all\r
  should), and then used during the download to verify the pieces of\r
  the downloaded package.\r
@@ -398,34 +450,30 @@ the downloaded package.
  We have created a sample implementation that functions as described\r
  in section \ref{opportunity}, and is freely available for other\r
  distributors to download and modify \cite{apt-p2p}. This software,\r
-called \texttt{apt-p2p}, interacts with the \texttt{apt} tool, which\r
-is found in most Debian-based Linux distributions. \texttt{apt} uses\r
-SHA1 hashes to verify most downloaded files, including the large\r
-index files that contain the hashes of the individual packages. We\r
-chose this distribution system as it is familiar to us, it allows\r
-software contributions, and there are interesting statistics\r
+called \texttt{apt-p2p}, interacts with the popular \texttt{apt} tool. This tool\r
+is found in most Debian-based Linux distributions, with related statistics\r
  available for analyzing the popularity of the software packages\r
  \cite{popcon}.\r
  \r
-Since all requests from apt are in the form of HTTP downloads from a\r
-server, the implementation takes the form of a caching HTTP proxy.\r
+Since all requests from \texttt{apt} are in the form of HTTP downloads from a\r
+server, our implementation takes the form of a caching HTTP proxy.\r
  Making a standard \texttt{apt} implementation use the proxy is then\r
  as simple as prepending the proxy location and port to the front of\r
  the mirror name in \texttt{apt}'s configuration file (i.e.\r
-``localhost:9977/mirrorname.debian.org/\ldots'').\r
+``http://localhost:9977/mirrorname.debian.org/\ldots'').\r
  \r
  We created a customized DHT based on Khashmir \cite{khashmir}, which\r
-is an implementation of Kademlia \cite{kademlia} using methods\r
-familiar to BitTorrent developers. Khashmir is also the same DHT\r
+is an implementation of Kademlia \cite{kademlia}. Khashmir is also the same DHT\r
  implementation used by most of the existing BitTorrent clients to\r
  implement trackerless operation. The communication is all handled by\r
  UDP messages, and RPC (remote procedure call) requests and responses\r
  are all \emph{bencoded} in the same way as BitTorrent's\r
-\texttt{.torrent} files. Khashmir uses the high-level Twisted\r
-event-driven networking engine \cite{twisted}, so we also use\r
-Twisted in our sample implementation for all other networking needs.\r
-More details of this customized DHT can be found below in section\r
-\ref{custom_dht}.\r
+\texttt{.torrent} files.\r
+% Khashmir uses the high-level Twisted\r
+% event-driven networking engine \cite{twisted}, so we also use\r
+% Twisted in our sample implementation for all other networking needs.\r
+More details of this customized DHT can be found below in\r
+Section~\ref{custom_dht}.\r
  \r
  Downloading is accomplished by sending simple HTTP requests to the\r
  peers identified by lookups in the DHT to have the desired file.\r
@@ -435,14 +483,16 @@ for the proxy also doubles as the server listening for requests for
  downloads from other peers. All peers support HTTP/1.1, both in the\r
  server and the client, which allows for pipelining of multiple\r
  requests to a peer, and the requesting of smaller pieces of a large\r
-file using the Range request header.\r
+file using the HTTP Range request header. Like in \texttt{apt},\r
+SHA1 hashes are then used to verify downloaded files, including the large\r
+index files that contain the hashes of the individual packages.\r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
  \r
-\section{Customized DHT}\r
+\section{System Optimization}\r
  \label{custom_dht}\r
  \r
-A large contribution of our work is in the customization and use of\r
+Another contribution of our work is in the customization and use of\r
  a Distributed Hash Table (DHT). Although our DHT is based on\r
  Kademlia, we have made many improvements to it to make it suitable\r
  for this application. In addition to a novel storage technique to\r
@@ -451,25 +501,32 @@ up queries, allowed the storage of multiple values for each key, and
  incorporated some improvements from BitTorrent's tracker-less DHT\r
  implementation.\r
  \r
-\subsection{Kademlia Background}\r
-\label{kademlia}\r
+\subsection{DHT Details}\r
+\label{dht}\r
  \r
-The Kademlia DHT, like most other DHTs, assigns IDs to peers from\r
-the same space that is used for keys. The peers with IDs closest to\r
-the desired key will then store the values for that key. Lookups are\r
-recursive, as nodes are queried in each step that are exponentially\r
-closer to the key than in the previous step.\r
+DHTs operate by storing (\emph{key}, \emph{value}) pairs in a\r
+distributed fashion such that no node will, on average, store more or have to work harder\r
+than any other node. They support two primitive operations:\r
+\emph{put}, which takes a key and a value and stores it in the DHT;\r
+and \emph{get}, which takes a key and returns a value (or values)\r
+that was previously stored with that key. These operations are\r
+recursive, as each node does not know about all the other nodes in the\r
+DHT, and so must recursively search for the correct node to put to\r
+or get from.\r
  \r
-Nodes in a Kademlia system support four primitive requests.\r
+The Kademlia DHT, like most other DHTs, assigns IDs to peers randomly from\r
+the same space that is used for keys. The peers with IDs closest to\r
+the desired key will then store the values for that key.\r
+Nodes support four primitive requests.\r
  \texttt{ping} will cause a peer to return nothing, and is only used\r
-to determine if a node is still alive, while \texttt{store} tells a\r
+to determine if a node is still alive. \texttt{store} tells a\r
  node to store a value associated with a given key. The most\r
  important primitives are \texttt{find\_node} and\r
  \texttt{find\_value}, which both function recursively to find nodes\r
  close to a key. The queried nodes will return a list of the nodes\r
  they know about that are closest to the key, allowing the querying\r
-node to quickly traverse the DHT to find the nodes close to the\r
-desired key. The only difference between them is that the\r
+node to quickly traverse the DHT to find the nodes closest to the\r
+desired key. The only difference between \texttt{find\_node} and \texttt{find\_value} is that the\r
  \texttt{find\_value} query will cause a node to return a value, if\r
  it has one for that key, instead of a list of nodes.\r
  \r
@@ -479,7 +536,7 @@ it has one for that key, instead of a list of nodes.
  Hashes of pieces of the larger package files are needed to support\r
  their efficient downloading from multiple peers.\r
  For large files (5 or more pieces), the torrent strings described in\r
-section \ref{downloading}\r
+Section~\ref{downloading}\r
  are too long to store with the peer's download info in the DHT. This\r
  is due to the limitation that a single UDP packet should be less\r
  than 1472 bytes to avoid fragmentation.\r
@@ -499,7 +556,7 @@ request to the peer for the hash (using the same method as file
  downloads, i.e. HTTP), will cause the peer to return the torrent\r
  string.\r
  \r
-Figure \ref{size_CDF} shows the package size of the 22,298 packages\r
+Figure \ref{size_CDF} shows the size of the 22,298 packages\r
  available in Debian in January 2008. We can see that most of the\r
  packages are quite small, and so most will therefore not require\r
  piece hash information to download. We have chosen a piece\r
@@ -512,10 +569,10 @@ torrent string as they require 5 to 70 pieces. Finally, there are
  only 62 packages that require more than 70 pieces, and so will\r
  require a separate request to a peer for the torrent string.\r
  \r
-\subsection{Response Time}\r
+\subsection{Response Time Optimization}\r
  \label{response_time}\r
  \r
-Most of our customizations to the DHT have been to try and improve\r
+Many of our customizations to the DHT have been to try and improve\r
  the time of the recursive \texttt{find\_value} requests, as this can\r
  cause long delays for the user waiting for a package download to\r
  begin. The one problem that slows down such requests is waiting for\r
@@ -558,10 +615,37 @@ elapsed. We also schedule future pings of nodes that fail once to
  respond to a request, as it takes multiple failures (currently 3)\r
  before a node is removed from the routing table.\r
  \r
-\subsection{Multiple Values}\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=0.80\columnwidth]{apt_p2p_improvements-find_value.eps}\r
+\caption{The distribution of average response times PlanetLab nodes\r
+experience for \texttt{find\_value} queries. The original DHT\r
+implementation results are shown, as well as the successive\r
+improvements that we made to reduce the response time.}\r
+\label{improvements}\r
+\end{figure}\r
+\r
+To test our changes during development, we ran our customized DHT\r
+for several hours after each major change on over 300 PlanetLab nodes\r
+\cite{planetlab}. Though the nodes are not expected to be firewalled\r
+or NATted, some can be quite overloaded and so consistently fail to\r
+respond within a timeout period, similar to NATted peers. The\r
+resulting distribution of the nodes' average response times is shown\r
+in Figure~\ref{improvements}. Each improvement successfully reduced\r
+the response time, for a total reduction of more than 50\%. The\r
+final distribution is also narrower, as the improvements make the\r
+system more predictable. However, there are still a large number of\r
+outliers with higher average response times, which are the\r
+overloaded nodes on PlanetLab. This was confirmed by examining the\r
+average time it took for a timeout to occur, which should be\r
+constant as it is a configuration option, but can be much larger if\r
+the node is too overloaded for the program to be able to check for a\r
+timeout very often.\r
+\r
+\subsection{Multiple Values Extension}\r
  \label{multiple_values}\r
  \r
-The original design of Kademlia specified that each keywould have\r
+The original design of Kademlia specified that each key would have\r
  only a single value associated with it. The RPC to find this value\r
  was called \texttt{find\_value} and worked similarly to\r
  \texttt{find\_node}, iteratively finding nodes with ID's closer to\r
@@ -572,13 +656,13 @@ value instead of the list of nodes it knows about that are closer.
  While this works well for single values, it can cause a problem when\r
  there are multiple values. If the responding node is no longer one\r
  of the closest to the key being searched for, then the values it is\r
-returning will probably be the staler ones in the system, and it\r
+returning will probably be the staler ones in the system, as it\r
  will not have the latest stored values. However, the search for\r
-closer nodes will stop here, as the queried node only returned the\r
+closer nodes will stop here, as the queried node only returned\r
  values and not a list of nodes to recursively query. We could have\r
  the request return both the values and the list of nodes, but that\r
  would severely limit the size and number of the values that could be\r
-returned.\r
+returned in a single UDP packet.\r
  \r
  Instead, we have broken up the original \texttt{find\_value}\r
  operation into two parts. The new \texttt{find\_value} request\r
@@ -593,59 +677,50 @@ abort the search once it has found enough values in some nodes, or
  it could choose to only request values from the nodes that are\r
  closest to the key being searched for.\r
  \r
-\subsection{BitTorrent's Improvements}\r
-\label{bittorrent_dht}\r
-\r
-In the many years that some BitTorrent clients have been using a\r
-Kademlia based DHT for tracker-less operation, the developers have\r
-made many enhancements which we can take advantage of. One of the\r
-most important is a security feature added to stop malicious nodes\r
-from subscribing other nodes as downloaders. When a node issues a\r
-request to another node to store its download info for that key, it\r
-must include a \emph{token} returned by the storing node in a recent\r
-\texttt{find\_node} request. The token is created by hashing the IP\r
-address of the requesting peer with a temporary secret that expires\r
-after several minutes. This prevents the requesting peer from faking\r
-its IP address in the store request, since it must first receive a\r
-response from a \texttt{find\_node} on that IP.\r
-\r
-We also made some BitTorrent-inspired changes to the parameters of\r
-the DHT originally specified by the authors of Kademlia. Since, as\r
-we will show later, peers stay online in our system for much longer\r
-periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.\r
-The value is supposed to be large enough such that any given\r
-\emph{k} nodes are unlikely to fail within an hour of each other,\r
-which is very unlikely in our system given the long uptimes of\r
-nodes. We also increased the number of concurrent outstanding\r
-requests allowed from 3 to 6 to speed up the recursive key finding\r
-processes.\r
-\r
-\subsection{Other Changes}\r
-\label{other_changes}\r
-\r
-We added one other new RPC request that nodes can make:\r
-\texttt{join}. This request is only sent on first loading the DHT,\r
-and is usually only sent to the bootstrap nodes that are listed for\r
-the DHT. These bootstrap nodes will respond to the request with the\r
-requesting peer's IP and port, so that the peer can determine what\r
-its oustide IP address is and whether port translation is being\r
-used. In the future, we hope to add functionality similar to STUN\r
-\cite{STUN}, so that nodes can detect whether they are NATted and\r
-take appropriate steps to circumvent it.\r
-\r
-In addition, we have allowed peers to store values in the DHT, even\r
-if the hash they are using is not the correct length. Most of the\r
-keys used in the DHT are based on the SHA1 hash, and so they are 20\r
-bytes in length. However, some files in the targetted Debian package\r
-system are only hashed using the MD5 algorithm, and so the hashes\r
-retrieved from the server are 16 bytes in length. The DHT will still\r
-store values using these keys by padding the end of them with zeroes\r
-to the proper length, as the loss of uniqueness from the 4 less\r
-bytes is not an issue. Also, the DHT will happily store hashes that\r
-are longer than 20 bytes, such as those from the 32 byte SHA256\r
-algorithm, by truncating them to the correct length. Since the\r
-requesting peer will have the full length of the hash, this will not\r
-affect its attempt to verify the downloaded file.\r
+% \subsection{\textbf{OPTIONAL}: BitTorrent's Improvements}\r
+% \label{bittorrent_dht}\r
+%\r
+% In the several years that some BitTorrent clients have been using a\r
+% Kademlia-based DHT for tracker-less operation, the developers have\r
+% made many enhancements which we can take advantage of. One of the\r
+% most important is a security feature added to stop malicious nodes\r
+% from subscribing other nodes as downloaders. When a node issues a\r
+% request to another node to store its download info for that key, it\r
+% must include a \emph{token} returned by the storing node in a recent\r
+% \texttt{find\_node} request. The token is created by hashing the IP\r
+% address of the requesting peer with a temporary secret that expires\r
+% after several minutes. This prevents the requesting peer from faking\r
+% its IP address in the store request, since it must first receive a\r
+% response from a \texttt{find\_node} on that IP.\r
+%\r
+% We also made some BitTorrent-inspired changes to the parameters of\r
+% the DHT originally specified by the authors of Kademlia. Since, as\r
+% we will show later, peers stay online in our system for much longer\r
+% periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.\r
+% The value is supposed to be large enough such that any given\r
+% \emph{k} nodes are unlikely to fail within an hour of each other,\r
+% which is very unlikely in our system given the long uptimes of\r
+% nodes (see Figure~\ref{duration_online_1} in Section~\ref{analysis}.\r
+% We also increased the number of concurrent outstanding\r
+% requests allowed from 3 to 6 to speed up the recursive key finding\r
+% processes.\r
+%\r
+% \subsection{\textbf{OPTIONAL}: Other Changes}\r
+% \label{other_changes}\r
+%\r
+% In addition, we have allowed peers to store values in the DHT, even\r
+% if the hash they are using is not the correct length. Most of the\r
+% keys used in the DHT are based on the SHA1 hash, and so they are 20\r
+% bytes in length. However, some files in the targetted Debian package\r
+% system are only hashed using the MD5 algorithm, and so the hashes\r
+% retrieved from the server are 16 bytes in length. The DHT will still\r
+% store values using these keys by padding the end of them with zeroes\r
+% to the proper length, as the loss of uniqueness from the 4 less\r
+% bytes is not an issue. Also, the DHT will happily store hashes that\r
+% are longer than 20 bytes, such as those from the 32 byte SHA256\r
+% algorithm, by truncating them to the correct length. Since the\r
+% requesting peer will have the full length of the hash, this will not\r
+% affect its attempt to verify the downloaded file.\r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
  \r
@@ -658,14 +733,14 @@ distribution system has been available to all Debian users since May
  next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since created\r
  a \emph{walker} that will navigate the DHT and find all the peers\r
  currently connected to it. This allows us to analyze many aspects of\r
-our implementation.\r
+our implementation in the real Internet environment.\r
  \r
  \subsection{Peer Lifetimes}\r
  \label{peer_life}\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}\r
+\includegraphics[width=0.80\columnwidth]{AptP2PWalker-peers.eps}\r
  \caption{The number of peers found in the system, and how many are\r
  behind a firewall or NAT.}\r
  \label{walker_peers}\r
@@ -677,20 +752,31 @@ shows the number of peers we have seen in the DHT during this time.
  The peer population is very steady, with just over 50 regular users\r
  participating in the DHT at any time. We also note that we find 100\r
  users who connect regularly (weekly), and we have found 186 unique\r
-users in the 2 months of our analysis. We determined which users are\r
+users in the 2 months of our analysis.\r
+\r
+We also determined which users are\r
  behind a firewall or NAT, which is one of the main problems of\r
  implementing a peer-to-peer network. These peers will be\r
  unresponsive to DHT requests from peers they have not contacted\r
  recently, which will cause the peer to wait for a timeout to occur\r
-(currently set at 9 seconds) before moving on. They will also be\r
+(currently 9 seconds) before moving on. They will also be\r
  unable to contribute any upload bandwidth to other peers, as all\r
  requests for packages from them will also timeout. From\r
  Figure~\ref{walker_peers}, we see that approximately half of all\r
-peers suffered from this restriction.\r
+peers suffered from this restriction. To address this problem, we added one other new RPC request that nodes can make:\r
+\texttt{join}. This request is only sent on first loading the DHT,\r
+and is usually only sent to the bootstrap nodes that are listed for\r
+the DHT. These bootstrap nodes will respond to the request with the\r
+requesting peer's IP and port, so that the peer can determine what\r
+its oustide IP address is and whether port translation is being\r
+used. In the future, we hope to add functionality similar to STUN\r
+\cite{STUN}, so that nodes can detect whether they are NATted and\r
+take appropriate steps to circumvent it.\r
+\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}\r
+\includegraphics[width=0.80\columnwidth]{AptP2PDuration-peers.eps}\r
  \caption{The CDF of how long an average session will last.}\r
  \label{duration_peers}\r
  \end{figure}\r
@@ -699,29 +785,30 @@ Figure~\ref{duration_peers} shows the cumulative distribution of how
  long a connection from a peer can be expected to last. Due to our\r
  software being installed as a daemon that is started by default\r
  every time their computer boots up, peers are expected to stay for a\r
-long period in the system. 50\% of connections last longer than 5\r
+long period in the system.\r
+Indeed, we find that 50\% of connections last longer than 5\r
  hours, and 20\% last longer than 10 hours. These connections are\r
-much longer than those reported by Saroiu et. al. \cite{saroiu2001}\r
-for other P2P systems, which had 50\% of Napster and Gnutella\r
+much longer than those reported by Saroiu et al. \cite{saroiu2001}\r
+for other peer-to-peer systems, which had 50\% of Napster and Gnutella\r
  sessions lasting only 1 hour.\r
  \r
-\begin{figure}\r
-\centering\r
-\includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}\r
-\caption{The CDF of the average time individual peers stay in the\r
-system.}\r
-\label{duration_ind_peers}\r
-\end{figure}\r
-\r
-We also examined the average time each individual peer spends in the\r
-system. Figure~\ref{duration_peers} shows the cumulative\r
-distribution of how long each individual peer remains in the system.\r
-Here we see that 50\% of peers have average stays in the system\r
-longer than 10 hours.\r
+% \begin{figure}\r
+% \centering\r
+% \includegraphics[width=0.80\columnwidth]{AptP2PDuration-ind_peers.eps}\r
+% \caption{\textbf{OPTIONAL}: The CDF of the average time individual peers stay in the\r
+% system.}\r
+% \label{duration_ind_peers}\r
+% \end{figure}\r
+%\r
+% \textbf{OPTIONAL}: We also examined the average time each individual peer spends in the\r
+% system. Figure~\ref{duration_peers} shows the cumulative\r
+% distribution of how long each individual peer remains in the system.\r
+% Here we see that 50\% of peers have average stays in the system\r
+% longer than 10 hours.\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}\r
+\includegraphics[width=0.80\columnwidth]{AptP2PDuration-online_1.eps}\r
  \caption{The fraction of peers that, given their current duration in\r
  the system, will stay online for another hour.}\r
  \label{duration_online_1}\r
@@ -730,7 +817,7 @@ the system, will stay online for another hour.}
  Since our DHT is based on Kademlia, which was designed based on the\r
  probability that a node will remain up another hour, we also\r
  analyzed our system for this parameter.\r
-Figure~\ref{duration_online_1} shows the fraction of peers will\r
+Figure~\ref{duration_online_1} shows the fraction of peers that will\r
  remain online for another hour, as a function of how long they have\r
  been online so far. Maymounkov and Mazieres found that the longer a\r
  node has been online, the higher the probability that it will stay\r
@@ -741,22 +828,22 @@ Our results also show that, for our system, over 80\% of all peers
  will remain online another hour, compared with around 50\% for\r
  Gnutella.\r
  \r
-\begin{figure}\r
-\centering\r
-\includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}\r
-\caption{The fraction of peers that, given their current duration in\r
-the system, will stay online for another 6 hours.}\r
-\label{duration_online_6}\r
-\end{figure}\r
-\r
-Since our peers are much longer-lived than other P2P systems, we\r
-also looked at the fraction of peers that stay online for another 6\r
-hours. Figure~\ref{duration_online_6} shows that over 60\% of peers\r
-that are online for 10 hours will stay online for another 6.\r
-However, we see an interesting decrease in this fraction between 8\r
-and 12 hours, which can also be seen in\r
-Figure~\ref{duration_online_1}. We believe this to be due to desktop\r
-users, who regularly turn off their computers at night.\r
+% \begin{figure}\r
+% \centering\r
+% \includegraphics[width=0.80\columnwidth]{AptP2PDuration-online_6.eps}\r
+% \caption{\textbf{OPTIONAL}: The fraction of peers that, given their current duration in\r
+% the system, will stay online for another 6 hours.}\r
+% \label{duration_online_6}\r
+% \end{figure}\r
+%\r
+% \textbf{OPTIONAL}: Since our peers are much longer-lived than other peer-to-peer systems, we\r
+% also looked at the fraction of peers that stay online for another 6\r
+% hours. Figure~\ref{duration_online_6} shows that over 60\% of peers\r
+% that are online for 10 hours will stay online for another 6.\r
+% However, we see an interesting decrease in this fraction between 8\r
+% and 12 hours, which can also be seen in\r
+% Figure~\ref{duration_online_1}. We believe this to be due to desktop\r
+% users, who regularly turn off their computers at night.\r
  \r
  \subsection{Peer Statistics}\r
  \label{peer_stats}\r
@@ -768,11 +855,11 @@ and uploading, and their measured response times for DHT queries.
  Our walker can extract this information if the peer is not\r
  firewalled or NATted, it has not disabled this functionality, and if\r
  it uses the same port for both its DHT (UDP) requests and download\r
-(TCP) requests (which is also a configuration parameter).\r
+(TCP) requests (which is also the default configuration parameter).\r
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}\r
+\includegraphics[width=0.80\columnwidth]{AptP2PDownloaded-peers.eps}\r
  \caption{The number of peers that were contacted to determine their\r
  bandwidth, and the total number of peers in the system.}\r
  \label{down_peers}\r
@@ -786,8 +873,8 @@ system during this time.
  \r
  \begin{figure}\r
  \centering\r
-\includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}\r
-\caption{The bandwidth of data that the contacted peers have\r
+\includegraphics[width=0.80\columnwidth]{AptP2PDownloaded-bw.eps}\r
+\caption{The bandwidth of data (total number of bytes) that the contacted peers have\r
  downloaded and uploaded.}\r
  \label{down_bw}\r
  \end{figure}\r
@@ -801,15 +888,17 @@ from other peers, which is saving the mirrors from supplying that
  bandwidth. The actual numbers are only a lower bound, since we have\r
  only contacted 30\% of the peers in the system, but we can estimate\r
  that \texttt{apt-p2p} has already saved the mirrors 15 GB of\r
-bandwidth, or 1 GB per day.\r
+bandwidth, or 1 GB per day. Considering the current small number of\r
+users this savings is quite large, and is expected to grow\r
+considerably as more users participate in the P2P system.\r
  \r
  We also collected the statistics on the measured response time peers\r
  were experiencing when sending requests to the DHT. We found that\r
  the recursive \texttt{find\_value} query, which is necessary before\r
  a download can occur, is taking 17 seconds on average. This\r
-indicates that, on average, requests are experiencing almost 2\r
+indicates that, on average, requests are experiencing almost 2 full\r
  stalls while waiting for the 9 second timeouts to occur on\r
-unresponsive peers. This is longer than our target of 10 seconds,\r
+unresponsive peers. This time is longer than our target of 10 seconds,\r
  although it will only lead to a slight average delay in downloading\r
  of 1.7 seconds when the default 10 concurrent downloads are\r
  occurring.This increased response time is due to the number of peers\r
@@ -817,41 +906,83 @@ that were behind firewalls or NATs, which was much higher than we
  anticipated. We do have plans to improve this through better\r
  informing of users of their NATted status, the use of STUN\r
  \cite{STUN} to circumvent the NATs, and by better exclusion of\r
-NATted peers from the DHT (which will not prevent them from using\r
+NATted peers from the DHT (which does not prevent them from using\r
  the system).\r
  \r
  We were also concerned that the constant DHT requests and responses,\r
  even while not downloading, would overwhelm some peers' network\r
-connections. However, we found that peers are using 200 to 300 bytes\r
-per second of bandwidth in servicing the DHT. These numbers are\r
+connections. However, we found that peers are using 200 to 300 bytes/sec\r
+of bandwidth in servicing the DHT. These numbers are\r
  small enough to not affect any other network services the peer would\r
  be running.\r
  \r
  \r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
  \section{Related Work}\r
-\label{others}\r
+\label{related}\r
  \r
-There have also been preliminary attempts to implement peer-to-peer distributors for\r
+There have been other preliminary attempts to implement peer-to-peer distributors for\r
  software packages. apt-torrent \cite{apttorrent} creates torrents\r
  for some of the larger packages available, but it ignores the\r
  smaller packages, which are often the most popular. DebTorrent\r
  \cite{debtorrent} makes widespread modifications to a traditional\r
-BitTorrent client, to try and fix the drawbacks mentioned in section\r
-\ref{bittorrent}. However, these changes also require some\r
-modifications to the distribution system to support it.\r
-\r
-Others have also used DHTs to support this type of functionality.\r
-Kenosis \cite{kenosis} is a P2P Remote Procedure Call\r
-client also based on the Kademlia DHT, but it is meant as a P2P\r
-primitive system on which other tools can be built, and so it has no\r
-file sharing functionality at all. Many have used a DHT as a drop-in\r
-replacement for the tracker in a BitTorrent system\r
-\cite{bittorrent-dht, azureus-dht}, but such systems only use the\r
-DHT to find peers for the same torrent, so the file sharing uses\r
-traditional BitTorrent and so is not ideal for the reasons listed\r
-above.\r
-\r
-Our solution differs from them in that ...\r
+BitTorrent client, to try and fix the drawbacks mentioned in\r
+Section~\ref{bittorrent}. However, these changes also require some\r
+modifications to the distribution system to support it. Our system\r
+considers all the files available to users to download, and makes\r
+use of the existing infrastructure unmodified.\r
+\r
+% \textbf{OPTIONAL}: Others have also used DHTs to support this type of functionality.\r
+% Kenosis \cite{kenosis} is a peer-to-peer Remote Procedure Call\r
+% client also based on the Kademlia DHT, but it is meant as a peer-to-peer\r
+% primitive system on which other tools can be built, and so it has no\r
+% file sharing functionality at all. Many have used a DHT as a drop-in\r
+% replacement for the tracker in a BitTorrent system\r
+% \cite{bittorrent-dht, azureus-dht}, but such systems only use the\r
+% DHT to find peers for the same torrent, so the file sharing uses\r
+% traditional BitTorrent and so is not ideal for the reasons listed\r
+% in Section~\ref{bittorrent}.\r
+\r
+% \cite{deltacast}\r
+\r
+There are a number of works dedicated to developing a collaborative\r
+content distribution network (CDN) using peer-to-peer techniques.\r
+Freedman et al. developed Coral \cite{coral} using a distributed\r
+\emph{sloppy} hash table to speed request times. Pierre and van\r
+Steen developed Globule \cite{globule} which uses typical DNS and\r
+HTTP redirection techniques to serve requests from a network of\r
+replica servers, which in turn draw their content from the original\r
+location (or a backup). Shah et al. \cite{shah08} analyze an\r
+existing software delivery system and use the results to design a\r
+peer-to-peer content distribution network that makes use of\r
+volunteer servers to help with the load. None of these systems meets\r
+our goal of an even distribution of load amongst the users of the\r
+system. Not all users of the systems become peers, and so are not\r
+able to contribute back to the system after downloading. The\r
+volunteers that do contribute as servers are required to contribute\r
+larger amounts of bandwidth, both for uploading to others, and in\r
+downloading content they are not in need of in order to share them\r
+with other users. Our system treats all users equally, requiring all\r
+to become peers in the system, sharing the uploading load equally\r
+amongst all, but does not require any user to download files they\r
+would not otherwise need.\r
+\r
+The most similar works to ours are by Shah et al. \cite{shah08} and\r
+Shark by Annapureddy et al. \cite{shark}.\r
+Shah's system, in addition to the drawbacks mentioned above,\r
+is not focused on the interactivity of downloads, as\r
+half of all requests were required ``to wait between 8 and 15\r
+minutes.'' In contrast, lookups in our system take only seconds to\r
+complete, and all requests can be completed in under a minute.\r
+Shark makes use of Coral's distributed sloppy hash table to speed\r
+the lookup time, but their system is more suited to its intended use\r
+as a distributed file server. It does not make use of authoritative\r
+copies of the original files, allowing instead any users in the\r
+system to update files and propagate those changes to others. Our\r
+system is well-tailored to the application of disseminating the\r
+unchanging software packages from the authoritative sources to all\r
+users.\r
  \r
  \r
  %%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
@@ -859,24 +990,23 @@ Our solution differs from them in that ...
  \section{Conclusion and Future Work}\r
  \label{conclusions}\r
  \r
-We have designed a generally applicable peer-to-peer content\r
-distribution model to be used by many of the free software\r
-distributors operating today. It makes use of already existing\r
-features of most package management systems to create a\r
-peer-to-peer distribution that should substantially reduce the costs\r
-of hosting the software packages.\r
-\r
-We have also implemented our design in freely available software to\r
-be used in conjuction with Debian-based distribution of Linux\r
-software packages. It is currently in use by some users of the\r
-Debian project's distribution, and so serves as an example of the\r
-possibilities that exist.\r
-\r
-We feel that our P2P software package distribution model is\r
-well-suited to be used by many free software distributors. We hope\r
-to convince them to adopt such a model in their distribution, or we\r
-may port our existing system to some of the other groups for them to\r
-try.\r
+In this paper, we have provided strong evidence that free software package distribution\r
+and update exhibit many distinct characteristics, which call for new designs other\r
+than the existing peer-to-peer systems for file sharing. To this end, we have\r
+present \texttt{apt-p2p}, a novel peer-to-peer distributor that sits between\r
+client and server, providing efficient and transparent downloading and updating services\r
+for software packages. We have addressed the key design issues in \texttt{apt-p2p}, including DHT customization,\r
+response time reduction, and multi-value extension. \texttt{apt-p2p}  has been used in conjuction with Debian-based distribution of Linux\r
+software packages and will also be available in the next release of Ubuntu. Existing real user statistics\r
+have suggested that it interacts well with clients and substantially reduces server cost.\r
+\r
+There are many future avenues toward improving our implementation. Besides\r
+evaluating its performance in larger scales, we are particularly interest in further speeding up some of the slower recursive\r
+DHT requests. We expect to accomplish this by fine tuning the\r
+parameters of our current system, better exclusion of NATted peers\r
+from the routing tables, and through the use of STUN \cite{STUN} to\r
+circumvent the NATs of the 50\% of the peers that have not\r
+configured port forwarding.\r
  \r
  One aspect missing from our model is the removal of old packages\r
  from the cache. Since our implementation is still relatively young,\r
@@ -887,13 +1017,7 @@ technique, in which packages that are no longer available on the
  server, no longer requested by peers, or simply are the oldest in\r
  the cache, will be removed.\r
  \r
-The most significant area of improvement still needed in our sample\r
-implementation is to further speed up some of the slower recursive\r
-DHT requests. We hope to accomplish this by further tuning the\r
-parameters of our current system, better exclusion of NATted peers\r
-from the routing tables, and through the use of STUN \cite{STUN} to\r
-circumvent the NATs of the 50\% of the peers that have not\r
-configured port forwarding.\r
+\r
  \r
  \bibliographystyle{IEEEtran}\r
  \bibliography{./IEEEabrv,./all}\r