\documentclass[conference]{IEEEtran}

%% INFOCOM addition:
\makeatletter
\def\ps@headings{%
\def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%
\def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%
\def\@oddfoot{}%
\def\@evenfoot{}}
\makeatother
\pagestyle{headings}

\usepackage[noadjust]{cite}

\usepackage[dvips]{graphicx}

\usepackage{url}

\begin{document}

\title{\texttt{apt-p2p}: A Peer-to-Peer Distribution System for Software Package Releases and Updates}
\author{\IEEEauthorblockN{Cameron Dale}
\IEEEauthorblockA{School of Computing Science\\
Simon Fraser University\\
Burnaby, British Columbia, Canada\\
Email: camerond@cs.sfu.ca}
\and
\IEEEauthorblockN{Jiangchuan Liu}
\IEEEauthorblockA{School of Computing Science\\
Simon Fraser University\\
Burnaby, British Columbia, Canada\\
Email: jcliu@cs.sfu.ca}}

\maketitle

\begin{abstract}
A large number of free software packages are available over the
Internet from many different distributors. Most of these
distributors use the traditional client-server model to handle
requests from users. However, there is an excellent opportunity to
use peer-to-peer techniques to reduce the cost of much of this
distribution, especially given the altruistic nature of many of
these users. There are no existing solutions suitable for this
situation, so we present a new technique for satisfying the needs of
this P2P distribution, one that is generally applicable to many of
these distributors' systems. Our method makes use of a DHT for
storing the locations of peers, using the cryptographic hash of the
package as a key. To show the simplicity and functionality of our
model, we implement a solution for the distribution of Debian
software packages, including details on the DHT customizations and
system optimizations needed. Finally, we analyze our deployed system
to determine how it is performing and what effect it is having.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}
\label{intro}

There are a large number of free software package distributors using
package distribution systems over the Internet to distribute
software to their users. These distributors have developed many
different methods for this distribution, but they almost exclusively
use a client-server model to satisfy user requests. The popularity
and number of users result in a large number of requests, which
usually requires a network of mirrors to handle. Due to the free
nature of this software, many users are willing and able to
contribute upload bandwidth to this distribution, but have no
current way to do this.

We present a new peer-to-peer distribution model to meet these
demands. It is based on many previous implementations of successful
peer-to-peer protocols, especially distributed hash tables (DHTs) and
BitTorrent. The model relies on the pre-existence of cryptographic
hashes of the packages, which uniquely identify them for requests
from other peers. If the peer-to-peer download fails, then
the original request to the server is used as a fallback to prevent
any dissatisfaction from users.
The peer can then share this new
package with others through the P2P system.

First, we examine the opportunity that is available for many of
these free software package distributors. We present an overview of
a system that would efficiently satisfy the demands of a large
number of users, and should significantly reduce the currently
substantial bandwidth requirements of hosting this software. We then
present an example implementation based on the Debian package
distribution system. This implementation will be used by a large
number of users, and serves as an example for other free software
distributors of the opportunity that can be met with such a system.
Finally, we analyze our deployed implementation to determine how
effective it is at meeting our goals, and to see what effect it is
having on the Debian package distribution system.

This paper is organized as follows. The background and motivation
are presented in Section~\ref{situation}. We analyze some related
solutions in Section~\ref{related}, and then present our solution in
Section~\ref{opportunity}. We then detail our sample implementation
for Debian-based distributions in Section~\ref{implementation},
including an in-depth look at our DHT customizations in
Section~\ref{custom_dht}. We evaluate our deployed sample
implementation in Section~\ref{analysis}, suggest some future
enhancements in Section~\ref{future}, and conclude the paper in
Section~\ref{conclusions}.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Background and Motivations}
\label{situation}

In the free software community, a large number of groups use the
Internet to collaboratively develop and release their software. The
ever increasing power of modern programming languages and operating
systems has made this software, like commercial software, extremely
large and complex, though it is often distributed in many small
units (packages). These packages are usually available from a free
download site, which usually requires a network of mirrors to
support the large number of requests, and the distribution systems
almost always support some kind of file verification, usually
cryptographic hashes, to verify a completed download's accuracy and
authenticity. Given the popularity of this software among users,
managing and distributing these packages efficiently and reliably
over the Internet has become a daunting task. In this section, we
offer concrete examples illustrating the unique challenges in this
context.

\subsection{Free Software Package Distributors}
\label{examples}

Most Linux distributions use a software package management system
that fetches packages to be installed from an archive of packages
hosted on a network of mirrors. The Debian project, and other
Debian-based distributions such as Ubuntu and Knoppix, use the
\texttt{apt} (Advanced Package Tool) program, which downloads Debian
packages in the \texttt{.deb} format from one of many HTTP mirrors.
The program will first download index files that contain a listing
of which packages are available, as well as important information
such as their size, location, and a hash of their content. The user
can then select which packages to install or upgrade, and
\texttt{apt} will download and verify them before installing them.

There are also several similar frontends for the RPM-based
distributions. Red Hat's Fedora project uses the \texttt{yum}
program, SUSE uses \texttt{YAST}, while Mandriva has
\texttt{Rpmdrake}, all of which are used to obtain RPMs from
mirrors. Other distributions use tarballs (\texttt{.tar.gz} or
\texttt{.tar.bz2}) to contain their packages. Gentoo's package
manager is called \texttt{portage}, Slackware Linux uses
\texttt{pkgtools}, and FreeBSD has a suite of command-line tools,
all of which download these tarballs from web servers.

Similar tools have been used for other software packages. CPAN
distributes software packages for the Perl
programming language, using SOAP RPC requests to find and download
files. Cygwin provides many of the
standard Unix/Linux tools in a Windows environment, using a
package management tool that requests packages from websites.
There
are two software distribution systems for Mac OS~X, fink and
MacPorts, that also retrieve packages in this way.

Direct web downloading is also common, often coupled with a hash
verification file available for download next to the desired
file. The hash file usually has the same file name, but with an
added extension identifying the hash used (e.g. \texttt{.md5} for
the MD5 hash). This type of file downloading and verification is
typical of free software hosting facilities that are open to anyone
to use, such as SourceForge.
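
As a concrete illustration, the verification step common to all of
these systems amounts to hashing the downloaded file and comparing
the digest against the one published in the index or hash file. A
minimal sketch in Python (the file name and digest in the comment
are illustrative, not taken from a real archive):

\begin{verbatim}
import hashlib

def verify_download(path, expected_sha1):
    # Hash the downloaded file and compare against the
    # digest published in the index or hash file.
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            sha1.update(block)
    return sha1.hexdigest() == expected_sha1

# verify_download('gzip_1.3.12-6_i386.deb',
#                 '<digest from the index>')
\end{verbatim}
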
\subsection{Similarities}
\label{similarities}

The important thing to note is that the systems mentioned in
Section~\ref{examples} all have the following in common:
\begin{itemize}
  \item The content is divided into distinct units (packages), each
        of which contains a small independent part of all the
        content.
  \item The packages are available for anyone to freely download.
  \item Users are typically not interested in downloading all of the
        packages available.
  \item Hashes of the packages are available before the download is
        attempted.
  \item Requests to download the packages are sent by a tool, not
        directly by the user (though the tool is responding to
        requests from the user).
\end{itemize}

We also expect that a number of users of these systems are
motivated by altruism to want to help out with this
distribution. This is common in these systems due to the free nature
of the software being delivered, which encourages some to want to
help out in some way. A number of the systems are also used by
groups that are staffed mostly, or sometimes completely, by
volunteers. This also encourages users to want to give back
to the volunteer community that has created the software they are
using. We therefore naturally expect that peer-to-peer distribution
can be implemented in this context, since it scales well with large
user bases and can easily use the network resources made available
by the volunteers.

\subsection{Differences}
\label{differences}

Although at first it might seem that a single all-reaching solution
is possible for this situation, there are some differences in each
system that require independent solutions. The systems all use
different tools for their distribution, so any implementation would
have to be specific to the tool it is to integrate with. In
particular, how to obtain requests from the tool or user, and how to
find a hash that corresponds to the file being requested, is very
specific to each system.

Also, there may be no benefit in creating a single large solution to
integrate all these systems. Although they sometimes distribute
nearly identical packages (e.g. the same software available in
multiple Linux distributions), the packages are not exactly
identical and have usually been tailored to the system by the
distributor. The small differences change the hashes of the files,
and so make it impossible to distribute similar packages across
systems. And, although peer-to-peer systems can scale very well with
the number of peers in the system, there is some overhead involved,
so a much larger system of peers would mean that requests could take
longer to complete.

\subsection{Problems}
\label{problems}

There are some attributes of software package distribution that make
it difficult to use an existing peer-to-peer solution, or to design
a new one, for this need.

\subsubsection{Archive Dimensions}

The dimensions of the software archive make it difficult to
distribute efficiently. While most of the packages are very small in
size, some are quite large. There are too many
packages to distribute each individually, but the archive is also
too large to distribute in its entirety. In some archives there are
also divisions of the archive into sections, e.g. by the operating
system or computer architecture that the package is intended for.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps}
\caption{The CDF of the size of packages in a Debian system, both
for the actual size and adjusted size based on the popularity of
the package.}
\label{size_CDF}
\end{figure}

For example, Figure~\ref{size_CDF} shows the size of packages in the
Debian distribution. While 80\% of the packages are less than
512~KB, some of the packages are hundreds of megabytes. The entire
archive consists of 22,298 packages and is approximately 119,000~MB
in size. Some packages are divided by architecture, but there are a
large number of packages that are not architecture-specific and so
can be installed on any computer architecture.

\subsubsection{Package Updates}

The software packages being distributed are constantly
updated. These updates could be the result of the software creators
releasing a new version of the software with improved functionality,
or the result of the distributor updating their packaging of the
software to meet new requirements. Even if the distributor
periodically makes \emph{stable} releases, which are snapshots of
all the packages in the archive at a certain time, updates are still
released for security issues or serious bugs.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{size-quarter.eps}
\caption{The amount of data in the 119,000~MB Debian archive that is
updated each day, broken down by architecture.}
\label{update_size}
\end{figure}

For example, Figure~\ref{update_size} shows the amount of data in
the Debian archive that was updated each day over a period of 3
months. Each day, approximately 1.5\% of the 119,000~MB archive is
updated with new versions of packages. This frequency is much higher
than that of most commercial software, mainly because much free
software is developed in a loosely managed environment of developers
working asynchronously on a worldwide scale.

\subsubsection{Limited Interest}

Though there are a large number of packages and a large number of
users, interest in a given version of a package can be very
limited. If the package is necessary for the system (e.g. the
package distribution software itself), then it is likely that
everyone will have it installed, but there are very few such
packages. Most packages fall in the category of optional or extra,
and so are interesting to only a limited number of people.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}
\caption{The CDF of the popularity of packages in a Debian system.}
\label{popularity_CDF}
\end{figure}

For example, the Debian distribution tracks the popularity of its
packages using popcon \cite{popcon}. Figure~\ref{popularity_CDF}
shows the cumulative distribution function of the percentage of all
users who install each package. Though some packages are installed
by everyone, 80\% of the packages are installed by less than 1\% of
users.

\subsubsection{Interactive Users}

The package management software that downloads packages displays
some kind of indication of speed and completeness for users to
watch. Since client-server downloads occur in a sequential
fashion, the package management software also measures the speed
based on sequential downloading. This requires the peer-to-peer
solution to be reasonably responsive at retrieving packages
sequentially.
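
Statistics such as those in Figure~\ref{size_CDF} can be gathered
directly from the index files that the package tools already
download. As a sketch, the fraction of packages that fit within a
single 512~KB piece (a size we return to in
Section~\ref{custom_dht}) can be computed from an uncompressed
Debian \texttt{Packages} index, which lists each package's size on a
\texttt{Size:} line (names here are illustrative):

\begin{verbatim}
def size_stats(packages_file, piece_size=512 * 1024):
    # Fraction of packages in a Debian Packages index
    # that are no larger than one piece.
    sizes = []
    with open(packages_file) as f:
        for line in f:
            if line.startswith('Size:'):
                sizes.append(int(line.split(':', 1)[1]))
    return sum(1 for s in sizes if s <= piece_size) / len(sizes)
\end{verbatim}
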
%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Related Work}
\label{related}

There are many peer-to-peer implementations available today, but
none is well suited to this specific problem.

\subsection{BitTorrent}
\label{bittorrent}

Many distributors make their software available in some form using
BitTorrent \cite{COHEN03}, e.g. for the distribution of CD
images. This is not an ideal situation though, as it requires the
peers to download large numbers of packages that they are not
interested in, and prevents them from updating to new packages
without downloading another image containing many of the same
packages they already have. Unfortunately, BitTorrent is not well
suited to an application such as this one.

First, there is no obvious way to divide the packages into torrents.
Most of the packages are too small, and there are too many packages
in the entire archive, to create individual torrents for each one.
However, all the packages together are too large to track
efficiently as a single torrent. Some division of the archive's
packages into torrents is obviously necessary, but wherever that
split occurs it will cause either some duplication of connections,
or prevent some peers from connecting to others who do have the same
content. Also, a small number of the packages can be updated every
day, which would add new files to the torrent, thereby changing its
\emph{infohash} identifier and making it a new torrent. This will
severely fracture the download population, even though peers in the
new torrent may share 99\% of the packages in common with peers in
the old torrent.

Other issues also prevent BitTorrent from being a good solution to
this problem. BitTorrent's fixed piece sizes (usually 512~KB), which
disregard file boundaries, are bigger than many of the packages in
the archive. This will waste peers' downloading bandwidth, as they
will end up downloading parts of other packages just to get the
piece that contains the package they do want. Finally, BitTorrent
downloads files in a random order, which does not work well with the
interactive package management tool's expectation of sequential
downloads.

On the other hand, there are aspects of BitTorrent that are no
longer needed in a system such as this one. With altruistic peers,
and all files being freely available without uploading, incentives
to share become a less important issue. The availability of seeds is
not critical either, as the mirrors can serve in that capacity.
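
The infohash fragility can be seen directly: a torrent's identifier
is the SHA1 hash of its bencoded \emph{info} dictionary, so any
change to the file list yields a new identifier. A small
self-contained sketch (the bencoder is minimal and the file entries
are illustrative, not a valid complete torrent):

\begin{verbatim}
import hashlib

def bencode(x):
    # Minimal bencoder, sufficient for an info dictionary.
    if isinstance(x, int):
        return b'i%de' % x
    if isinstance(x, bytes):
        return b'%d:%s' % (len(x), x)
    if isinstance(x, list):
        return b'l' + b''.join(bencode(v) for v in x) + b'e'
    if isinstance(x, dict):
        return b'd' + b''.join(bencode(k) + bencode(v)
                               for k, v in sorted(x.items())) + b'e'
    raise TypeError(x)

def infohash(info):
    return hashlib.sha1(bencode(info)).hexdigest()

info = {b'name': b'debian-main', b'piece length': 524288,
        b'pieces': b'', b'files':
        [{b'length': 63338, b'path': [b'gzip_1.3.12-5.deb']}]}
before = infohash(info)
# a single updated package is added to the archive...
info[b'files'].append({b'length': 63412,
                       b'path': [b'gzip_1.3.12-6.deb']})
print(before != infohash(info))   # True: a brand new torrent
\end{verbatim}
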
\subsection{Other Solutions}
\label{others}

There have also been other attempted implementations, usually based
on some form of BitTorrent, to address this problem. Some
implementations have used BitTorrent almost unmodified, while others
have only looked at using DHTs to replace the tracker in a
BitTorrent system. apt-torrent \cite{apttorrent} creates torrents
for some of the larger packages available, but it ignores the
smaller packages, which are often the most popular. DebTorrent
\cite{debtorrent} makes widespread modifications to a traditional
BitTorrent client to try to fix the drawbacks mentioned in
Section~\ref{bittorrent}. However, these changes also require some
modifications to the distribution system to support it.

Others have also used DHTs to support this type of functionality.
Kenosis \cite{kenosis} is a P2P Remote Procedure Call
client also based on the Kademlia DHT, but it is meant as a P2P
primitive system on which other tools can be built, and so it has no
file sharing functionality at all. Many have used a DHT as a drop-in
replacement for the tracker in a BitTorrent system
\cite{bittorrent-dht, azureus-dht}, but such systems only use the
DHT to find peers for the same torrent; the file sharing uses
traditional BitTorrent, and so is not ideal for the reasons listed
above.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Peer-to-Peer Assisted Distributor: An Overview}
\label{opportunity}

The situation described in Section~\ref{situation} presents a clear
opportunity to use some form of peer-to-peer file-sharing protocol
to allow willing users to contribute upload bandwidth. We assume
that the user is attempting to download packages from a server,
using package management software to find and retrieve them.
Further, we assume that the requests from the user to the server can
be proxied by our peer-to-peer program, and that the server is
always available and has all of the package files. Finally, the
cryptographic hash of each package must be available separately from
the package itself, and is usually contained in an \emph{index} file
that also contains all the packages' names, locations, and sizes.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{model_simple.eps}
\caption{The different phases of functionality of our peer-to-peer
distribution model.}
\label{model}
\end{figure}

Our model for using P2P to enhance such a system is shown in
Figure~\ref{model}. As shown in Phase~1, our program will act as a
proxy (1,2), downloading (3) and caching all files communicated
between the user and the server (4). It will therefore also have
available the index files containing the cryptographic hashes of all
packages. Later, in Phase~2, upon receiving a request from the user
to download a package (5), our program will search the index files
for the package being requested and find its hash (6). This hash can
then be looked up recursively in the DHT (7), which will return a
list of peers that already have the package (8). The package can
then be downloaded from the peers (11,12), verified using the hash
(13), and, if valid, returned to the user (14). The current node's
location is also added to the DHT for that hash (15), as it is now a
source for others to download from.

This sparse interest in a large number of packages undergoing
constant updating is well suited to the functionality provided by a
Distributed Hash Table (DHT). DHTs require unique keys to store and
retrieve strings of data, and the cryptographic hashes used by these
package management systems are perfect for this purpose. The stored
and retrieved strings can then be pointers to the peers that have
the package that hashes to that key. A downloading peer can look up
the package hash in the DHT and, if it is found, download the file
from those peers and verify the package with the hash. Once the
download is complete, the peer will add its entry to the DHT
indicating that it now has the package.

In steps (11,12), the fact that the package is also available to
download for free from a server is very important to our proposed
model. If the package hash can not be found in the DHT, the peer can
fall back to downloading from the original location (i.e. the
network of mirrors). The mirrors thus, with no modification to their
functionality, serve as seeds for the packages in the peer-to-peer
system. Any packages that have just been updated, or that are very
rare, and so don't have any peers available, can always be found on
the mirror. Once the peer has completed the download from the mirror
and verified the package, it can then add itself to the DHT as the
first peer for the new package, so that future requests for the
package will not need the mirror.

The trust of the package is also always guaranteed, despite
downloading from untrustworthy peers, through the use of the
cryptographic hashes. Nothing can be downloaded from a peer
until the hash is looked up in the DHT, so a hash must first come
from a trusted source (i.e. a mirror). Most distributors use index
files that contain hashes for a large number of the packages in
their archive, and which are themselves hashed. After retrieving the
index's hash from the mirror, the index file can be downloaded from
peers and verified. The program then has access to all the hashes of
the packages it will be downloading, all of which can be verified
with a \emph{chain of trust} that stretches back to the original
distributor's server.
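
To make the flow concrete, the sketch below walks through the steps
of Figure~\ref{model}. It is a simplification under stated
assumptions: a plain dictionary stands in for the recursive DHT
lookups and stores, peers are assumed to serve files over HTTP at a
URL derived from the hash (as our implementation in
Section~\ref{implementation} does), and all names are illustrative:

\begin{verbatim}
import hashlib, urllib.request

def fetch_package(sha1_hex, mirror_url, dht, my_url):
    # Try peers found in the DHT, fall back to the mirror,
    # verify the result, then announce ourselves as a source.
    data = None
    for peer in dht.get(sha1_hex, []):          # steps (7,8)
        try:                                    # steps (11,12)
            url = peer + '/' + sha1_hex
            with urllib.request.urlopen(url) as r:
                data = r.read()
            break
        except OSError:
            continue
    if data is None:                            # mirror fallback
        with urllib.request.urlopen(mirror_url) as r:
            data = r.read()
    if hashlib.sha1(data).hexdigest() != sha1_hex:  # step (13)
        raise ValueError('hash mismatch')
    dht.setdefault(sha1_hex, []).append(my_url)     # step (15)
    return data                                     # step (14)
\end{verbatim}
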
\subsection{Implementation Options}
\label{imp_options}

There are several ways to implement the desired P2P functionality
in the existing package management software. The functionality could
be directly integrated through modifications to the software, though
this could be difficult, as the P2P functionality needs to be
running at all times, both for efficient lookups and to support
uploading of already downloaded packages. Unfortunately, the package
management tools typically only run until the download and install
request is complete.

Many of the package management software implementations use HTTP
requests to download the packages from web servers, which makes it
possible to implement the P2P aspect as an almost standard HTTP
caching proxy. This proxy will run as a daemon in the background,
listening for requests from the package management tool for package
files. It will satisfy uncached requests first from the P2P system,
falling back to a normal HTTP request to a server should the file
not be found there. For methods that don't use HTTP requests, other
types of proxies may also be possible.

\subsection{Downloading From Peers}
\label{downloading}

Although not necessary, we recommend implementing a download
protocol that is similar to the protocol used to fetch packages from
the distributor's servers. This simplifies the P2P program, as it
can then treat peers and mirrors almost identically when requesting
packages. In fact, the mirrors can be used when there are only a few
slow peers available for a file, to help speed up the download
process.

Downloading a file efficiently from a number of peers is where
BitTorrent shines as a peer-to-peer application. Its method of
breaking up larger files into pieces, each with its own hash,
makes it very easy to parallelize the downloading process and
maximize the download speed. For very small packages (i.e. less than
the piece size), this parallel downloading is not necessary, or
even desirable. However, this method should still be used, in
conjunction with the DHT, for the larger packages that are
available.

Since the package management system only stores a hash of the entire
package, and not of pieces of that package, we will need to be able
to store and retrieve these piece hashes using the P2P protocol. In
addition to storing the file download location in the DHT (which
would still be used for small files), a peer will store a
\emph{torrent string} containing the peer's hashes of the pieces of
the larger files. These piece hashes could be compared ahead of time
to determine which peers have the same piece hashes (they all
should), and then used during the download to verify the pieces of
the downloaded package.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Sample Implementation}
\label{implementation}

We have created a sample implementation that functions as described
in Section~\ref{opportunity}, and it is freely available for other
distributors to download and modify \cite{apt-p2p}. This software,
called \texttt{apt-p2p}, interacts with the \texttt{apt} tool found
in most Debian-based Linux distributions. \texttt{apt} uses
SHA1 hashes to verify most downloaded files, including the large
index files that contain the hashes of the individual packages. We
chose this distribution system as it is familiar to us, it allows
software contributions, and there are interesting statistics
available for analyzing the popularity of the software packages
\cite{popcon}.

Since all requests from \texttt{apt} are in the form of HTTP
downloads from a server, the implementation takes the form of a
caching HTTP proxy. Making a standard \texttt{apt} implementation
use the proxy is then as simple as prepending the proxy location and
port to the front of the mirror name in \texttt{apt}'s configuration
file (i.e. ``localhost:9977/mirrorname.debian.org/\ldots'').
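
For example, a single \texttt{sources.list} line is all that needs
to change (assuming the default port of 9977, and with
\texttt{ftp.us.debian.org} as an illustrative mirror):

\begin{verbatim}
deb http://localhost:9977/ftp.us.debian.org/debian stable main
\end{verbatim}
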
We created a customized DHT based on Khashmir \cite{khashmir}, which
is an implementation of Kademlia \cite{kademlia} using methods
familiar to BitTorrent developers. Khashmir is also the DHT
implementation used by most of the existing BitTorrent clients to
implement trackerless operation. The communication is all handled by
UDP messages, and RPC (remote procedure call) requests and responses
are all \emph{bencoded} in the same way as BitTorrent's
\texttt{.torrent} files. Khashmir uses the high-level Twisted
event-driven networking engine \cite{twisted}, so we also use
Twisted in our sample implementation for all other networking needs.
More details of this customized DHT can be found below in
Section~\ref{custom_dht}.

Downloading is accomplished by sending simple HTTP requests to the
peers identified by lookups in the DHT to have the desired file.
Requests for a package are made using the package's hash, properly
encoded, as the URL to request from the peer. The HTTP server used
for the proxy also doubles as the server listening for download
requests from other peers. All peers support HTTP/1.1, both in the
server and the client, which allows for pipelining of multiple
requests to a peer, and the requesting of smaller pieces of a large
file using the Range request header.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Customized DHT}
\label{custom_dht}

A large contribution of our work is in the customization and use of
a Distributed Hash Table (DHT). Although our DHT is based on
Kademlia, we have made many improvements to make it suitable
for this application. In addition to a novel storage technique to
support piece hashes, we have improved the response time of lookup
queries, allowed the storage of multiple values for each key, and
incorporated some improvements from BitTorrent's trackerless DHT
implementation.

\subsection{Kademlia Background}
\label{kademlia}

The Kademlia DHT, like most other DHTs, assigns IDs to peers from
the same space that is used for keys. The peers with IDs closest to
the desired key will then store the values for that key. Lookups are
recursive: in each step, nodes are queried that are exponentially
closer to the key than in the previous step.

Nodes in a Kademlia system support four primitive requests.
\texttt{ping} will cause a peer to return nothing, and is only used
to determine if a node is still alive, while \texttt{store} tells a
node to store a value associated with a given key. The most
important primitives are \texttt{find\_node} and
\texttt{find\_value}, which both function recursively to find nodes
close to a key. The queried nodes will return a list of the nodes
they know about that are closest to the key, allowing the querying
node to quickly traverse the DHT to find the nodes close to the
desired key. The only difference between them is that the
\texttt{find\_value} query will cause a node to return a value, if
it has one for that key, instead of a list of nodes.
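
Closeness in Kademlia is measured by the XOR metric: the distance
between two IDs is their bitwise exclusive-or, interpreted as an
integer. A minimal sketch (IDs shown as integers; real IDs are
160-bit SHA1 values):

\begin{verbatim}
def xor_distance(a, b):
    # Kademlia's distance metric between two IDs.
    return a ^ b

def closest(node_ids, key, k=8):
    # The k known nodes closest to the key; each lookup step
    # queries these and asks them for still-closer nodes.
    return sorted(node_ids,
                  key=lambda n: xor_distance(n, key))[:k]
\end{verbatim}

The default of \texttt{k=8} here reflects our parameter choice
discussed in Section~\ref{bittorrent_dht}.
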
\subsection{Piece Hash Storage}
\label{pieces}

Hashes of the pieces of the larger package files are needed to
support their efficient downloading from multiple peers.
For large files (5 or more pieces), the torrent strings described in
Section~\ref{downloading}
are too long to store with the peer's download info in the DHT. This
is due to the limitation that a single UDP packet should be less
than 1472 bytes to avoid fragmentation.
% For example, for a file of 4 pieces, the peer would need to store a
% value 120 bytes in size (IP address, port, four 20-byte pieces and
% some overhead), which would limit a return value from including more
% than 10 values to a request.

Instead, the peers will store the torrent string for large files
separately in the DHT, and only include a reference to it in their
stored value for the hash of the file. The reference is a SHA1 hash
of the entire torrent string. If the
torrent string is short enough to store in the DHT (i.e. less than
1472 bytes, or about 70 pieces for the SHA1 hash), then a lookup of
that hash in the DHT will return the torrent string. Otherwise, a
request to the peer for the hash (using the same method as file
downloads, i.e. HTTP) will cause the peer to return the torrent
string.

Figure~\ref{size_CDF} shows the package sizes of the 22,298 packages
available in Debian in January 2008. We can see that most of the
packages are quite small, and so most will therefore not require
piece hash information to download. We have chosen a piece
size of 512~KB, which means that 17,515 (78\%) of the packages will
not require this information. There are 3054 packages that will
require 2 to 4 pieces, for which the torrent string can be stored
directly with the package hash in the DHT. There are 1667 packages
that will require a separate lookup in the DHT for the longer
torrent string, as they require 5 to 70 pieces. Finally, there are
only 62 packages that require more than 70 pieces, and so will
require a separate request to a peer for the torrent string.
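
Putting these thresholds together, the choice of where the piece
hashes live is a simple function of package size. A sketch (the
returned strings are descriptive only):

\begin{verbatim}
PIECE = 512 * 1024   # chosen piece size

def torrent_strategy(size):
    # Where the piece hashes for a package of this size live
    # (thresholds follow the 1472-byte UDP packet limit,
    #  i.e. about 70 SHA1 hashes).
    pieces = -(-size // PIECE)        # ceiling division
    if pieces < 2:
        return 'no piece hashes needed'
    if pieces <= 4:
        return 'stored with the peer download info'
    if pieces <= 70:
        return 'stored as a separate DHT value'
    return 'fetched from the peer by HTTP'
\end{verbatim}
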
\subsection{Response Time}
\label{response_time}

Most of our customizations to the DHT have been made to improve the
response time of the recursive \texttt{find\_value} requests, as
these can cause long delays for a user waiting for a package
download to begin. The main problem that slows down such requests is
waiting for timeouts to occur before marking a node as failed and
moving on.

Our first improvement is to retransmit a request multiple times
before a timeout occurs, in case the original request or its
response was lost by the unreliable UDP protocol. If it does not
receive a response, the requesting node will retransmit the request
after a short delay. This delay increases exponentially for
later retransmissions, should the request again fail. Our current
implementation will retransmit the request after 2 seconds and 6
seconds (4 seconds after the first retransmission), and then time
out after 9 seconds.

We have also added some improvements to the recursive
\texttt{find\_node} and \texttt{find\_value} queries to speed up the
process when nodes fail. If enough nodes have responded to the
current query such that there are many new nodes to query that are
closer to the desired key, then a stalled request to a node further
away will be dropped in favor of a new request to a closer node.
This has the effect of leap-frogging unresponsive nodes and
focusing attention on the nodes that do respond. We will also
prematurely abort a query while there are still outstanding
requests, if enough of the closest nodes have responded and there
are no closer nodes found. This prevents a far away unresponsive
node from making the query's completion wait for it to time out.

Finally, we made all attempts possible to prevent firewalled and
NATted nodes from being added to the routing table for future
requests. Only a node that has responded to one of our requests will
be added to the table. If a node has only sent us a request, we
attempt to send a \texttt{ping} to the node to determine if it is
NATted or not. Unfortunately, since NATs allow incoming UDP packets
for a short time after one was recently sent by the NATted host, the
ping is likely to succeed even if the node is NATted. We therefore
also schedule a future ping to the node to make sure it is still
reachable after the NAT's delay has hopefully elapsed. We also
schedule future pings of nodes that fail once to respond to a
request, as it takes multiple failures (currently 3) before a node
is removed from the routing table.

\subsection{Multiple Values}
\label{multiple_values}

The original design of Kademlia specified that each key would have
only a single value associated with it. The RPC to find this value
was called \texttt{find\_value} and worked similarly to
\texttt{find\_node}, iteratively finding nodes with IDs closer to
the desired key. However, if a node had a value stored for the
searched-for key, it would respond to the request with that value
instead of the list of nodes it knows about that are closer.

While this works well for single values, it can cause a problem when
there are multiple values. If the responding node is no longer one
of the closest to the key being searched for, then the values it
returns will probably be the staler ones in the system, and it
will not have the latest stored values. However, the search for
closer nodes will stop here, as the queried node only returned
values and not a list of nodes to recursively query. We could have
the request return both the values and the list of nodes, but that
would severely limit the size and number of the values that could be
returned.

Instead, we have broken the original \texttt{find\_value}
operation into two parts. The new \texttt{find\_value} request
always returns a list of nodes that the node believes are closest to
the key, as well as a number indicating the number of values that
this node has for the key. Once a querying node has finished its
search for nodes and found the closest ones to the key, it can issue
\texttt{get\_value} requests to some nodes to actually retrieve the
values they have. This allows for much more control of when and how
many nodes to query for values. For example, a querying node could
abort the search once it has found enough values in some nodes, or
it could choose to only request values from the nodes that are
closest to the key being searched for.

\subsection{BitTorrent's Improvements}
\label{bittorrent_dht}

In the many years that some BitTorrent clients have been using a
Kademlia-based DHT for trackerless operation, the developers have
made many enhancements that we can take advantage of. One of the
most important is a security feature added to stop malicious nodes
from subscribing other nodes as downloaders. When a node issues a
request to another node to store its download info for a key, it
must include a \emph{token} returned by the storing node in a recent
\texttt{find\_node} request. The token is created by hashing the IP
address of the requesting peer with a temporary secret that expires
after several minutes. This prevents the requesting peer from faking
its IP address in the store request, since it must first receive a
response from a \texttt{find\_node} on that IP.
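
A sketch of this token mechanism (the secret handling and function
names are illustrative; BitTorrent clients differ in the details):

\begin{verbatim}
import hashlib, os

_secrets = [os.urandom(16)]   # rotated every few minutes;
                              # recent old secrets stay valid

def make_token(ip):
    # Returned with a find_node response; must be echoed
    # back in a later store request from the same IP.
    return hashlib.sha1(ip.encode() + _secrets[0]).digest()

def valid_token(ip, token):
    return any(hashlib.sha1(ip.encode() + s).digest() == token
               for s in _secrets)
\end{verbatim}
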
We also made some BitTorrent-inspired changes to the parameters of
the DHT originally specified by the authors of Kademlia. Since, as
we will show later, peers stay online in our system for much longer
periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.
The value is supposed to be large enough such that any given
\emph{k} nodes are unlikely to all fail within an hour of each
other, which is very unlikely in our system given the long uptimes
of nodes. We also increased the number of concurrent outstanding
requests allowed from 3 to 6, to speed up the recursive key finding
processes.

\subsection{Other Changes}
\label{other_changes}

We added one other new RPC request that nodes can make:
\texttt{join}. This request is only sent on first loading the DHT,
and is usually only sent to the bootstrap nodes that are listed for
the DHT. These bootstrap nodes will respond to the request with the
requesting peer's IP and port, so that the peer can determine what
its outside IP address is and whether port translation is being
used. In the future, we hope to add functionality similar to STUN
\cite{STUN}, so that nodes can detect whether they are NATted and
take appropriate steps to circumvent it.

In addition, we allow peers to store values in the DHT even
if the hash they are using is not the correct length. Most of the
keys used in the DHT are based on the SHA1 hash, and so are 20
bytes in length. However, some files in the targeted Debian package
system are hashed using only the MD5 algorithm, and so the hashes
retrieved from the server are 16 bytes in length. The DHT will still
store values using these keys, by padding the end of them with
zeroes to the proper length, as the loss of uniqueness from the 4
fewer bytes is not an issue. Also, the DHT will happily store hashes
that are longer than 20 bytes, such as those from the 32-byte SHA256
algorithm, by truncating them to the correct length. Since the
requesting peer will have the full length of the hash, this will not
affect its attempt to verify the downloaded file.
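
This key normalization amounts to a one-line sketch (the commented
call shows an assumed 16-byte MD5 digest):

\begin{verbatim}
def dht_key(file_hash):
    # Pad short (MD5) hashes with zeroes, and truncate
    # longer (SHA256) ones, to the 20-byte SHA1 key length.
    return file_hash.ljust(20, b'\x00')[:20]

# dht_key(bytes.fromhex('00112233445566778899aabbccddeeff'))
\end{verbatim}
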
%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Analysis}
\label{analysis}

Our \texttt{apt-p2p} implementation supporting the Debian package
distribution system has been available to all Debian users since May
3rd, 2008 \cite{apt-p2p-debian}, and will also be available in the
next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since created
a \emph{walker} that will navigate the DHT and find all the peers
currently connected to it. This allows us to analyze many aspects of
our implementation.

\subsection{Peer Lifetimes}
\label{peer_life}

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}
\caption{The number of peers found in the system, and how many are
behind a firewall or NAT.}
\label{walker_peers}
\end{figure}

We first began analyzing the DHT on June 24th, and continue to this
day, giving us 2 months of data so far. Figure~\ref{walker_peers}
shows the number of peers we have seen in the DHT during this time.
The peer population is very steady, with just over 50 regular users
participating in the DHT at any time. We also find 100 users who
connect regularly (weekly), and we have found 186 unique users in
the 2 months of our analysis. We determined which users are behind a
firewall or NAT, which is one of the main problems of implementing a
peer-to-peer network. These peers will be unresponsive to DHT
requests from peers they have not contacted recently, which will
cause the requesting peer to wait for a timeout to occur (currently
set at 9 seconds) before moving on. They will also be unable to
contribute any upload bandwidth to other peers, as all requests for
packages from them will also time out. From
Figure~\ref{walker_peers}, we see that approximately half of all
peers suffered from this restriction.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}
\caption{The CDF of how long an average session will last.}
\label{duration_peers}
\end{figure}

Figure~\ref{duration_peers} shows the cumulative distribution of how
long a connection from a peer can be expected to last. Because our
software is installed as a daemon that is started by default every
time the computer boots up, peers are expected to stay in the system
for long periods. 50\% of connections last longer than 5 hours, and
20\% last longer than 10 hours. These connections are much longer
than those reported by Saroiu et al. \cite{saroiu2001} for other P2P
systems, which found 50\% of Napster and Gnutella sessions lasting
only 1 hour.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}
\caption{The CDF of the average time individual peers stay in the
system.}
\label{duration_ind_peers}
\end{figure}

We also examined the average time each individual peer spends in the
system. Figure~\ref{duration_ind_peers} shows the cumulative
distribution of how long each individual peer remains in the system.
Here we see that 50\% of peers have average stays in the system
longer than 10 hours.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}
\caption{The fraction of peers that, given their current duration in
the system, will stay online for another hour.}
\label{duration_online_1}
\end{figure}

Since our DHT is based on Kademlia, which was designed around the
probability that a node will remain up for another hour, we also
analyzed our system for this parameter.
Figure~\ref{duration_online_1} shows the fraction of peers that will
remain online for another hour, as a function of how long they have
been online so far. Maymounkov and Mazi\`eres found that the longer a
node has been online, the higher the probability that it will stay
online \cite{kademlia}. Our results also show this behavior. In
addition, similar to Gnutella peers, over 90\% of our peers that
have been online for 10 hours will remain online for another hour.
Our results also show that, for our system, over 80\% of all peers
will remain online another hour, compared with around 50\% for
Gnutella.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}
\caption{The fraction of peers that, given their current duration in
the system, will stay online for another 6 hours.}
\label{duration_online_6}
\end{figure}

Since our peers are much longer-lived than those of other P2P
systems, we also looked at the fraction of peers that stay online
for another 6 hours. Figure~\ref{duration_online_6} shows that over
60\% of peers that are online for 10 hours will stay online for
another 6. However, we see an interesting decrease in this fraction
between 8 and 12 hours, which can also be seen in
Figure~\ref{duration_online_1}. We believe this to be due to desktop
users, who regularly turn off their computers at night.
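
For reference, the remaining-uptime statistic plotted in
Figures~\ref{duration_online_1} and \ref{duration_online_6} can be
computed from the observed session durations as follows (a sketch;
durations in hours):

\begin{verbatim}
def p_stay_online(durations, t, extra=1.0):
    # Empirical P(session > t + extra | session > t).
    alive = [d for d in durations if d > t]
    return sum(d > t + extra for d in alive) / len(alive)
\end{verbatim}
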
\subsection{Peer Statistics}
\label{peer_stats}

On July 31st we enhanced our walker to retrieve additional
information from each contacted peer. The peers are configured, by
default, to publish some statistics on how much they are downloading
and uploading, and their measured response times for DHT queries.
Our walker can extract this information if the peer is not
firewalled or NATted, has not disabled this functionality, and uses
the same port for both its DHT (UDP) requests and its download (TCP)
requests (which is also a configuration parameter).

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}
\caption{The number of peers that were contacted to determine their
bandwidth, and the total number of peers in the system.}
\label{down_peers}
\end{figure}

Figure~\ref{down_peers} shows the total number of peers we have been
able to contact since we started gathering this additional
information, as well as how many total peers were found. We were
only able to contact 30\% of all the peers that connected to the
system during this time.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}
\caption{The bandwidth of data that the contacted peers have
downloaded and uploaded.}
\label{down_bw}
\end{figure}

Figure~\ref{down_bw} shows the amount of data the peers we were able
to contact have downloaded. Peers measure their downloads from other
peers and from mirrors separately, so we are able to get an idea of
how much savings our system is generating for the mirrors. We see
that the peers are downloading approximately 20\% of their package
data from other peers, which is saving the mirrors from supplying
that bandwidth. The actual numbers are only a lower bound, since we
have only contacted 30\% of the peers in the system, but we can
estimate that \texttt{apt-p2p} has already saved the mirrors 15 GB
of bandwidth, or 1 GB per day.

We also collected statistics on the response times peers were
experiencing when sending requests to the DHT. We found that the
recursive \texttt{find\_value} query, which is necessary before a
download can occur, takes 17 seconds on average. This indicates
that, on average, requests are experiencing almost 2 stalls while
waiting for the 9-second timeouts to occur on unresponsive peers.
This is longer than our target of 10 seconds, although it will only
lead to a slight average delay in downloading of 1.7 seconds when
the default 10 concurrent downloads are occurring. This increased
response time is due to the number of peers that were behind
firewalls or NATs, which was much higher than we anticipated. We
have plans to improve this by better informing users of their NATted
status, by using STUN \cite{STUN} to circumvent the NATs, and by
better excluding NATted peers from the DHT (which will not prevent
them from using the system).

We were also concerned that the constant DHT requests and responses,
even while not downloading, would overwhelm some peers' network
connections. However, we found that peers use only 200 to 300 bytes
per second of bandwidth in servicing the DHT. These numbers are
small enough not to affect any other network services the peer would
be running.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Future Work}
\label{future}

We feel that our P2P software package distribution model is
well suited to be used by many free software distributors. We hope
to convince them to adopt such a model in their distribution, or we
may port our existing system to some of the other groups for them to
try.

One aspect missing from our model is the removal of old packages
from the cache. Since our implementation is still relatively young,
we have not yet had to deal with the problems of a growing cache of
obsolete packages consuming all of a user's hard drive. We plan to
implement some form of least recently used (LRU) cache removal
technique, in which packages that are no longer available on the
server, are no longer requested by peers, or simply are the oldest
in the cache, will be removed.

The most significant area of improvement still needed in our sample
implementation is to further speed up some of the slower recursive
DHT requests. We hope to accomplish this by further tuning the
parameters of our current system, better excluding NATted peers
from the routing tables, and using STUN \cite{STUN} to
circumvent the NATs of the 50\% of peers that have not
configured port forwarding.

%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Conclusions}
\label{conclusions}

We have designed a generally applicable peer-to-peer content
distribution model that can be used by many of the free software
distributors operating today. It makes use of already existing
features of most package management systems to create a
peer-to-peer distribution that should substantially reduce the costs
of hosting the software packages.

We have also implemented our design in freely available software to
be used in conjunction with Debian-based distributions of Linux
software packages. It is currently in use by some users of the
Debian project's distribution, and so serves as an example of the
possibilities that exist.

\bibliographystyle{IEEEtran}
\bibliography{./IEEEabrv,./all}

\end{document}
There are no existing solutions suitable for this +situation, so we present a new technique for satisfying the needs of +this P2P distribution which is generally applicable to many of +these distributors' systems. Our method makes use of a DHT for +storing the location of peers, using the cryptographic hash of the +package as a key. To show the simplicity and functionality of our model, we +implement a solution for the distribution of Debian software +packages, including details on the DHT customizations and system optimizations needed. Finally, we +analyze our system to determine how it is performing and what effect +it is having. +\end{abstract} + +%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Introduction} +\label{intro} + +There are a large number of free software package distributors using +package distribution systems over the Internet to distribute +software to their users. These distributors have developed many +different methods for this distribution, but they almost exclusively +use a client-server model to satisfy user requests. The popularity +and number of users results in a large number of requests, which +usually requires a network of mirrors to handle. Due to the free +nature of this software, many users are willing and able to +contribute upload bandwidth to this distribution, but have no +current way to do this. + +We present a new peer-to-peer distribution model to meet these +demands. It is based on many previous implementations of successful +peer-to-peer protocols, especially distributed hash tables (DHT) and +BitTorrent. The model relies on the pre-existence of cryptographic +hashes of the packages, which uniquely identify it for a +request from other peers. If the peer-to-peer download fails, then +the original request to the server is used as a fallback to prevent +any dissatisfaction from users. The peer can then share this new +package with others through the P2P system. + +First, we examine the opportunity that is available for many of +these free software package distributors. We present an overview of +a system that would efficiently satisfy the demands of a large +number of users, and should significantly reduce the currently +substantial bandwidth requirements of hosting this software. We then +present an example implementation based on the Debian package +distribution system. This implementation will be used by a large +number of users, and serves as an example for other free software +distributors of the opportunity that can be met with such a system. +Finally, we analyze our currently deployed implementation to +determine how effective it is at meeting our goals, and to see what +effect it is having on the Debian package distribution system. + +The rest of this paper is organized as follows. The background and motivation are presented in Section~\ref{situation}, including an analysis of BitTorrent's use for this purpose in Section~\ref{bittorrent}. We propose +our solution in Section~\ref{opportunity}. We then detail our sample +implementation for Debian-based distributions in Section~\ref{implementation}, +including an in-depth look at our DHT +customizations in Section~\ref{custom_dht}. The performance of our implementation is evaluated in Section~\ref{analysis}. We examine some related work in Section~\ref{related}, and then +Section~\ref{conclusions} concludes the paper and offers some future directions. 
+ +%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Background and Motivations} +\label{situation} + +In the free software society, there are a large number of groups using the Internet to +collaboratively develop and release their software. The ever increasing power of +modern programming languages and operating systems has made this software, like commercial software, extremely large and complex, though it is often +distributed in many small units (packages). Together with their popularity among users, +an efficient and reliable management and distribution of these packages over the Internet has become a daunting task. In this section, we offer concrete examples illustrating the +unique challenges in this context. + +\subsection{Free Software Package Distributors} +\label{examples} + +Most Linux distributions use a software package management system +that fetches packages to be installed from an archive of packages +hosted on a network of mirrors. The Debian project, and other +Debian-based distributions such as Ubuntu and Knoppix, use the +\texttt{apt} (Advanced Package Tool) program, which downloads Debian +packages in the \texttt{.deb} format from one of many HTTP mirrors. +The program will first download index files that contain a listing +of which packages are available, as well as important information +such as their size, location, and a hash of their content. The user +can then select which packages to install or upgrade, and +\texttt{apt} will download and verify them before installing them. + +There are also several similar frontends for the RPM-based +distributions. Red Hat's Fedora project uses the \texttt{yum} +program, SUSE uses \texttt{YAST}, while Mandriva has +\texttt{Rpmdrake}, all of which are used to obtain RPMs from +mirrors. Other distributions use tarballs (\texttt{.tar.gz} or +\texttt{.tar.bz2}) to contain their packages. Gentoo's package +manager is called \texttt{portage}, SlackWare Linux uses +\texttt{pkgtools}, and FreeBSD has a suite of command-line tools, +all of which download these tarballs from web servers. + +Similar tools have been used for other software packages. CPAN +distributes software packages for the PERL +programming language, using SOAP RPC requests to find and download +files. Cygwin provides many of the +standard Unix/Linux tools in a Windows environment, using a +package management tool that requests packages from websites. There +are two software distribution systems for software that runs on the Macintosh OS, fink and +MacPorts, that also retrieve packages in this way. + +Direct web downloading is also a common way, often coupled with a hash +verification file to be downloaded next to the desired +file. The hash file usually has the same file name, but with an +added extension identifying the hash used (e.g. \texttt{.md5} for +the MD5 hash). This type of file downloading and verification is +typical of free software hosting facilities that are open to anyone +to use, such as SourceForge. + + +Given the free nature of this software, there are often a number of users +motivated by altruism to want to help out with the distribution. +This is particularly true considering that many of this software is used by +groups that are staffed mostly, or sometimes completely, by +volunteers. They are thus motivated to contribute their network resources, so as to promote the healthy development +of the volunteer community that released the software. 
+We also naturally expect that peer-to-peer distribution can be implementation in +this context, which scale well with large user bases and can easily explore the network resources made available by +the volunteers. + + + +\subsection{Unique Characteristics} +\label{problems} + +While it seems straightforward to use an existing peer-to-peer file sharing tool like BitTorrent for +free software package distribution, there are indeed a series of new challenges in this unique scenario: + +\subsubsection{Archive Dimensions} + +While most of the packages of a software release are very small in +size, there are some that are quite large. There are too many +packages to distribute each individually, but the archive is also +too large to distribute in its entirety. In some archives there are +also divisions of the archive into sections, e.g. by the operating system (OS) or computer +architecture that the package is intended for. + +\begin{figure} +\centering +\includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps} +\caption{The CDF of the size of packages in a Debian system, both +for the actual size and adjusted size based on the popularity of +the package.} +\label{size_CDF} +\end{figure} + +For example, Figure~\ref{size_CDF} shows the size of the packages in the +current Debian distribution. While 80\% of the packages are less than +512~KB, some of the packages are hundreds of megabytes. The entire +archive consists of 22,298 packages and is approximately 119,000 MB +in size. Most of the packages are to be installed in any computer environment, but there are +also OS- or architecture-specific packages. + +\subsubsection{Package Updates} + +The software packages being distributed are being constantly +updated. These updates could be the result of the software creators +releasing a new version with improved functionality, +or the distributor updating their packaging of the +software to meet new requirements. Even if the distributor +periodically makes \emph{stable} releases, which are snapshots of +all the packages in the archive at a certain time, updates are still +released for security issues or serious bugs. + +\begin{figure} +\centering +\includegraphics[width=\columnwidth]{size-quarter.eps} +\caption{The amount of data in the 119,000 MB Debian archive that is +updated each day, broken down by architecture.} +\label{update_size} +\end{figure} + +For example, Figure~\ref{update_size} shows the amount of data in +the Debian archive that was updated each day over a period of 3 +months. Every single day, approximately 1.5\% of the 119,000 MB archive is +updated with new versions of packages. This frequency is much higher than +that of most commercial software, mainly because much of free software is +developed in a loosely managed environment of developers working +asynchronously on a worldwide scale. + +\subsubsection{Limited Interest} + +Finally, though there are a large number of packages and a large number of +users, the interest in a particular version of a package can be very +limited. Specifically, there are core packages that every user has to download, but most +packages would fall in the category of optional or extra, and so are +interesting to only a limited number of people. 
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}
+\caption{The CDF of the popularity of packages in a Debian system.}
+\label{popularity_CDF}
+\end{figure}
+
+For example, the Debian distribution tracks the popularity of its
+packages using popcon \cite{popcon}. Figure~\ref{popularity_CDF}
+shows the cumulative distribution function of the percentage of all
+users who install each package. Though some packages are installed
+by everyone, 80\% of the packages are installed by less than 1\% of
+users.
+
+\subsubsection{Interactive Users}
+
+The package management software that downloads packages displays
+some kind of indication of speed and completeness for users to
+watch. Since client-server downloads have traditionally occurred in
+a sequential fashion, the package management software also measures
+the speed based on sequential downloading. This requires the
+peer-to-peer solution to be reasonably responsive at retrieving
+packages sequentially.
+
+\subsection{Why BitTorrent Doesn't Work Well}
+\label{bittorrent}
+
+Many distributors make their software available using BitTorrent
+\cite{COHEN03}, in particular for the distribution of CD images.
+This straightforward use, however, can be very inefficient, as it
+requires peers to download large numbers of packages that they are
+not interested in, and prevents them from updating to new packages
+without downloading another image containing many of the same
+packages they already have.
+
+An alternative is to create torrents tracking smaller groups of
+packages. Unfortunately, we find that this enhancement can be quite
+difficult given the unique characteristics of free software
+packages. First, there is no obvious way to divide the packages
+into torrents. Most of the packages are too small, and there are
+too many packages in the entire archive, to create individual
+torrents for each one. On the other hand, all the packages together
+are too large to track efficiently as a single torrent. Hence, some
+division of the archive's packages into torrents is necessary, but
+wherever that split occurs it will either cause some duplication of
+connections, or prevent some peers from connecting to others who do
+have the same content. In addition, a small number of the packages
+can be updated every day, which adds new files to the torrent,
+thereby changing its \emph{infohash} identifier and making it a new
+torrent. This will severely fracture the download population, even
+though peers in the new torrent may share 99\% of their packages in
+common with peers in the old torrent.
+
+Other issues also prevent BitTorrent from being a good solution to
+this problem. In particular, BitTorrent's fixed piece sizes
+(usually 512 KB) disregard file boundaries and are bigger than many
+of the packages in the archive. This wastes peers' downloading
+bandwidth, as they will end up downloading parts of other packages
+just to get the piece that contains the package they do want.
+Finally, note that BitTorrent downloads files randomly, which does
+not work well with the interactive package management tools'
+expectation of sequential downloads.
+
+On the other hand, there are aspects of BitTorrent that are no
+longer needed in a system such as this one. With altruistic peers,
+and all files being available from mirrors without uploading,
+incentives to share become a less important issue. The availability
+of seeds is not critical either, as the mirror sites already serve
+in that capacity.
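+
+To make the infohash fragility concrete: a torrent's identifier is
+the SHA1 hash of its bencoded \emph{info} dictionary, so adding
+even a single updated package to a torrent produces a completely
+new identifier, and thus a new, disjoint swarm. The following
+minimal sketch (with a simplified bencoder and hypothetical file
+entries) illustrates this:
+
+\begin{verbatim}
+import hashlib
+
+def bencode(x):
+    # Minimal bencoder, sufficient for this example.
+    if isinstance(x, int):
+        return b'i%de' % x
+    if isinstance(x, bytes):
+        return b'%d:%s' % (len(x), x)
+    if isinstance(x, list):
+        return b'l' + b''.join(map(bencode, x)) + b'e'
+    if isinstance(x, dict):  # keys must be sorted
+        return b'd' + b''.join(
+            bencode(k) + bencode(v)
+            for k, v in sorted(x.items())) + b'e'
+    raise TypeError(x)
+
+info = {b'name': b'archive', b'piece length': 524288,
+        b'files': [{b'path': [b'pkg_1.0.deb'],
+                    b'length': 1234}]}
+old_id = hashlib.sha1(bencode(info)).hexdigest()
+# One updated package is added to the archive...
+info[b'files'].append({b'path': [b'pkg_1.1.deb'],
+                       b'length': 1250})
+new_id = hashlib.sha1(bencode(info)).hexdigest()
+assert old_id != new_id  # ...and it is now a new torrent
+\end{verbatim}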
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Peer-to-Peer Assisted Distributor: An Overview}
+\label{opportunity}
+
+We assume that the user is attempting to download packages from a
+server, using package management software to find and retrieve the
+packages. Further, we assume that the requests from the user to the
+server can be proxied by our peer-to-peer program, and that the
+server is always available and has all of the package files.
+Finally, the cryptographic hash of each package must be available
+separately from the package itself, and is usually contained in an
+\emph{index} file which also contains all the packages' names,
+locations and sizes.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{model_simple.eps}
+\caption{The different phases of functionality of our peer-to-peer distribution model.}
+\label{model}
+\end{figure}
+
+Our model for using P2P to enhance such a system is shown in
+Figure~\ref{model}. As shown in Phase~1, our program will act as a
+proxy (1,2), downloading (3) and caching all files communicated
+between the user and the server (4). It will therefore also have
+available the index files containing the cryptographic hashes of
+all packages. Later, in Phase~2, upon receiving a request from the
+user to download a package (5), our program will search the index
+files for the package being requested and find its hash (6). This
+hash can then be looked up recursively in the DHT (7), which will
+return a list of peers that have the package already (8). The
+package can then be downloaded from the peers (11,12), verified
+using the hash (13), and, if valid, returned to the user (14). The
+current node's location is also added to the DHT for that hash
+(15), as it is now a source for others to download from.
+
+In steps (11,12), the fact that this package is also available to
+download for free from a server is very important to our proposed
+model. If the package hash cannot be found in the DHT, the peer can
+then fall back to downloading from the original location (i.e. the
+network of mirrors). The mirrors thus, with no modification to
+their functionality, serve as seeds for the packages in the
+peer-to-peer system. Any packages that have just been updated, or
+that are very rare, and so don't have any peers available, can
+always be found on the mirror. Once the peer has completed the
+download from the mirror and verified the package, it can then add
+itself to the DHT as the first peer for the new package, so that
+future requests for the package will not need the mirror.
+
+This sparse interest in a large number of packages undergoing
+constant updating is well suited to the functionality provided by a
+Distributed Hash Table (DHT). DHTs require unique keys to store and
+retrieve strings of data, for which the cryptographic hashes used
+by these package management systems are perfect. The stored and
+retrieved strings can then be pointers to the peers that have the
+package that hashes to that key. A downloading peer can look up the
+package hash in the DHT and, if it is found, download the file from
+those peers and verify the package with the hash. Once the download
+is complete, the peer will add its entry to the DHT indicating that
+it now has the package.
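+
+The steps of Phase~2 can be summarized in the following simplified
+control-flow sketch. The \texttt{dht}, \texttt{download}, and
+address parameters are hypothetical placeholders rather than the
+actual interfaces of our implementation; the numbers refer to the
+steps in Figure~\ref{model}:
+
+\begin{verbatim}
+import hashlib
+
+def fetch_package(pkg, dht, download,
+                  mirror_url, my_addr):
+    # pkg.sha1 comes from the verified index (step 6)
+    peers = dht.get(pkg.sha1)            # steps 7-8
+    for source in peers + [mirror_url]:  # mirror fallback
+        data = download(source, pkg)     # steps 11-12
+        if hashlib.sha1(data).hexdigest() \
+                == pkg.sha1:             # step 13
+            dht.put(pkg.sha1, my_addr)   # step 15
+            return data                  # step 14
+    raise IOError('all sources failed verification')
+\end{verbatim}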
+
+Note that, despite downloading the package from untrustworthy
+peers, the trustworthiness of the package is always guaranteed
+through the use of the cryptographic hashes. Nothing can be
+downloaded from a peer until its hash is looked up in the DHT, so a
+hash must first come from a trusted source (i.e. the distributor's
+server). Most distributors use index files that contain hashes for
+a large number of the packages in their archive, and which are
+themselves hashed. After retrieving the index's hash from the
+mirror, the index file can be downloaded from peers and verified.
+Then the program has access to all the hashes of the packages it
+will be downloading, all of which can be verified with a
+\emph{chain of trust} that stretches back to the original
+distributor's server.
+
+\subsection{Implementation Options}
+\label{imp_options}
+
+There are several ways to implement the desired P2P functionality
+into the existing package management software. The functionality
+can be directly integrated through modifications to the software,
+though this could be difficult as the P2P functionality should be
+running at all times, both for efficient lookups and to support
+uploading of already downloaded packages. Unfortunately, the
+package management tools typically only run until the download and
+install request is complete.
+
+Many of the package management software implementations use HTTP
+requests to download the packages from web servers, which makes it
+possible to implement the P2P aspect as an almost standard HTTP
+caching proxy. This proxy will run as a daemon in the background,
+listening for requests from the package management tool for package
+files, as well as serving (uploading) cached packages to other
+peers. It will satisfy uncached requests first from the P2P system,
+falling back to the normal HTTP request to a server should the
+package not be found there. For methods that don't use HTTP
+requests, other types of proxies may also be possible.
+
+\subsection{Downloading From Peers}
+\label{downloading}
+
+Although not necessary, we recommend implementing a download
+protocol that is similar to the protocol used to fetch packages
+from the distributor's servers. This simplifies the P2P program, as
+it can then treat peers and mirrors almost identically when
+requesting packages. In fact, the mirrors can be used when there
+are only a few slow peers available for a file, to help speed up
+the download process.
+
+Downloading a file efficiently from a number of peers is where
+BitTorrent shines as a peer-to-peer application. Its method of
+breaking up larger files into pieces, each with its own hash, makes
+it very easy to parallelize the downloading process and maximize
+the download speed. For very small packages (i.e. less than the
+piece size), this parallel downloading is not necessary, or even
+desirable. However, this method should still be used, in
+conjunction with the DHT, for the larger packages that are
+available.
+
+Since the package management system only stores a hash of the
+entire package, and not of pieces of that package, we need to be
+able to store and retrieve these piece hashes using the P2P
+protocol. In addition to storing the file download location in the
+DHT (which would still be used for small files), a peer will store
+a \emph{torrent string} containing the peer's hashes of the pieces
+of the larger files. These piece hashes are compared ahead of time
+to determine which peers have the same piece hashes (they all
+should), and are then used during the download to verify the pieces
+of the downloaded package.
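+
+One plausible encoding of such a torrent string, assuming SHA1
+piece hashes and the 512~KB piece size we adopt later, is simply
+the concatenation of the 20-byte piece hashes, which also makes
+per-piece verification straightforward. This is a sketch of the
+idea, not our exact wire format:
+
+\begin{verbatim}
+import hashlib
+
+PIECE_SIZE = 512 * 1024
+
+def torrent_string(data):
+    # One 20-byte SHA1 hash per piece, concatenated.
+    return b''.join(
+        hashlib.sha1(data[i:i + PIECE_SIZE]).digest()
+        for i in range(0, len(data), PIECE_SIZE))
+
+def piece_ok(tstring, index, piece):
+    # Check one downloaded piece against the
+    # advertised piece hashes.
+    return hashlib.sha1(piece).digest() == \
+        tstring[20 * index:20 * (index + 1)]
+\end{verbatim}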
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{\texttt{apt-p2p}: A Practical Implementation}
+\label{implementation}
+
+We have created a sample implementation that functions as described
+in Section~\ref{opportunity}, and is freely available for other
+distributors to download and modify \cite{apt-p2p}. This software,
+called \texttt{apt-p2p}, interacts with the \texttt{apt} tool,
+which is found in most Debian-based Linux distributions.
+\texttt{apt} uses SHA1 hashes to verify most downloaded files,
+including the large index files that contain the hashes of the
+individual packages. We chose this distribution system as it is
+familiar to us, it allows software contributions, and there are
+interesting statistics available for analyzing the popularity of
+the software packages \cite{popcon}.
+
+Since all requests from \texttt{apt} are in the form of HTTP
+downloads from a server, the implementation takes the form of a
+caching HTTP proxy. Making a standard \texttt{apt} installation use
+the proxy is then as simple as prepending the proxy location and
+port to the front of the mirror name in \texttt{apt}'s
+configuration file (i.e.
+``http://localhost:9977/mirrorname.debian.org/\ldots'').
+
+We created a customized DHT based on Khashmir \cite{khashmir},
+which is an implementation of Kademlia \cite{kademlia} using
+methods familiar to BitTorrent developers. Khashmir is also the DHT
+implementation used by most of the existing BitTorrent clients to
+implement trackerless operation. The communication is all handled
+by UDP messages, and RPC (remote procedure call) requests and
+responses are all \emph{bencoded} in the same way as BitTorrent's
+\texttt{.torrent} files. Khashmir uses the high-level Twisted
+event-driven networking engine \cite{twisted}, so we also use
+Twisted in our sample implementation for all other networking
+needs. More details of this customized DHT can be found below in
+Section~\ref{custom_dht}.
+
+Downloading is accomplished by sending simple HTTP requests to the
+peers identified by lookups in the DHT to have the desired file.
+Requests for a package are made using the package's hash, properly
+encoded, as the URL to request from the peer. The HTTP server used
+for the proxy also doubles as the server listening for download
+requests from other peers. All peers support HTTP/1.1, both in the
+server and the client, which allows for pipelining of multiple
+requests to a peer, and for requesting smaller pieces of a large
+file using the HTTP Range request header.
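+
+For instance, a peer might request the second 512~KB piece of a
+large package from another peer as follows. The URL scheme shown
+(the hex-encoded package hash as the path) is only illustrative of
+the idea described above, and the peer address is hypothetical:
+
+\begin{verbatim}
+import http.client
+
+pkg_hash = 'e0c9035898dd52fc65c41454cec9c4d2611bfb37'
+
+# Ask the peer for only the second 512 KB piece of
+# the package, using a standard HTTP/1.1 Range request.
+conn = http.client.HTTPConnection('peer.example.org',
+                                  9977)
+conn.request('GET', '/' + pkg_hash,
+             headers={'Range': 'bytes=524288-1048575'})
+piece = conn.getresponse().read()
+\end{verbatim}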
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Customized DHT / System Optimization}
+\label{custom_dht}
+
+A large contribution of our work is in the customization and use of
+a Distributed Hash Table (DHT). Although our DHT is based on
+Kademlia, we have made many improvements to make it suitable for
+this application. In addition to a novel storage technique to
+support piece hashes, we have improved the response time of lookup
+queries, allowed the storage of multiple values for each key, and
+incorporated some improvements from BitTorrent's tracker-less DHT
+implementation.
+
+\subsection{DHT Background}
+\label{dht}
+
+DHTs operate by storing (\emph{key}, \emph{value}) pairs in a
+distributed fashion such that no node will, on average, store more
+than any other node. They support two primitive operations:
+\emph{put}, which takes a key and a value and stores them in the
+DHT; and \emph{get}, which takes a key and returns a value (or
+values) that was previously stored with that key. These operations
+are recursive, as each node does not know about all the other nodes
+in a DHT, and so must recursively search for the correct node to
+put to or get from.
+
+The Kademlia DHT, like most other DHTs, assigns IDs to peers from
+the same space that is used for keys. The peers with IDs closest to
+the desired key will then store the values for that key. Lookups
+are recursive: in each step, nodes are queried that are
+exponentially closer to the key than in the previous step.
+
+Nodes in a Kademlia system support four primitive requests.
+\texttt{ping} will cause a peer to return nothing, and is only used
+to determine if a node is still alive, while \texttt{store} tells a
+node to store a value associated with a given key. The most
+important primitives are \texttt{find\_node} and
+\texttt{find\_value}, which both function recursively to find nodes
+close to a key. The queried nodes will return a list of the nodes
+they know about that are closest to the key, allowing the querying
+node to quickly traverse the DHT to find the nodes closest to the
+desired key. The only difference between \texttt{find\_node} and
+\texttt{find\_value} is that the \texttt{find\_value} query will
+cause a node to return a value, if it has one for that key, instead
+of a list of nodes.
+
+\subsection{Piece Hash Storage}
+\label{pieces}
+
+Hashes of pieces of the larger package files are needed to support
+their efficient downloading from multiple peers. For large files (5
+or more pieces), the torrent strings described in
+Section~\ref{downloading} are too long to store with the peer's
+download info in the DHT. This is due to the limitation that a
+single UDP packet should be less than 1472 bytes to avoid
+fragmentation.
+% For example, for a file of 4 pieces, the peer would need to store a
+% value 120 bytes in size (IP address, port, four 20-byte pieces and
+% some overhead), which would limit a return value from including more
+% than 10 values to a request.
+
+Instead, the peers will store the torrent string for large files
+separately in the DHT, and only include a reference to it in their
+stored value for the hash of the file. The reference is an SHA1
+hash of the entire torrent string. If the torrent string is short
+enough to store in the DHT (i.e. less than 1472 bytes, or about 70
+pieces for the SHA1 hash), then a lookup of that hash in the DHT
+will return the torrent string. Otherwise, a request to the peer
+for the hash (using the same method as file downloads, i.e. HTTP)
+will cause the peer to return the torrent string.
+
+Figure~\ref{size_CDF} shows the size of the 22,298 packages
+available in Debian in January 2008. We can see that most of the
+packages are quite small, and so most will not require piece hash
+information to download. We have chosen a piece size of 512~KB,
+which means that 17,515 (78\%) of the packages will not require
+this information. There are 3054 packages that will require 2 to 4
+pieces, for which the torrent string can be stored directly with
+the package hash in the DHT. There are 1667 packages that will
+require a separate lookup in the DHT for the longer torrent string,
+as they require 5 to 70 pieces. Finally, there are only 62 packages
+that require more than 70 pieces, and so will require a separate
+request to a peer for the torrent string.
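+
+The resulting three-tier storage decision can be sketched as
+follows; the \texttt{dht.put} interface and the names are
+illustrative assumptions, not our exact implementation:
+
+\begin{verbatim}
+import hashlib
+
+MAX_UDP_VALUE = 1400  # ~70 20-byte SHA1 piece hashes
+
+def store_package(dht, file_hash, tstring, my_addr):
+    # tstring: the torrent string (concatenated
+    # piece hashes) for this package
+    if len(tstring) <= 4 * 20:
+        # 2 to 4 pieces: store inline with download info
+        dht.put(file_hash, (my_addr, tstring))
+    else:
+        ref = hashlib.sha1(tstring).digest()
+        dht.put(file_hash, (my_addr, ref))
+        if len(tstring) <= MAX_UDP_VALUE:
+            # 5 to ~70 pieces: separate DHT entry,
+            # keyed by the hash of the torrent string
+            dht.put(ref, tstring)
+        # more than ~70 pieces: peers instead fetch
+        # the torrent string from us over HTTP
+\end{verbatim}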
+
+\subsection{Response Time}
+\label{response_time}
+
+Many of our customizations to the DHT aim to improve the response
+time of the recursive \texttt{find\_value} requests, as these can
+cause long delays for the user waiting for a package download to
+begin. The main problem that slows down such requests is waiting
+for timeouts to occur before marking a node as failed and moving
+on.
+
+Our first improvement is to retransmit a request multiple times
+before a timeout occurs, in case the original request or its
+response was lost by the unreliable UDP protocol. If it does not
+receive a response, the requesting node will retransmit the request
+after a short delay. This delay will increase exponentially for
+later retransmissions, should the request again fail. Our current
+implementation will retransmit the request after 2 seconds and 6
+seconds (4 seconds after the first retransmission), and then time
+out after 9 seconds.
+
+We have also added some improvements to the recursive
+\texttt{find\_node} and \texttt{find\_value} queries to speed up
+the process when nodes fail. If enough nodes have responded to the
+current query such that there are many new nodes to query that are
+closer to the desired key, then a stalled request to a node further
+away will be dropped in favor of a new request to a closer node.
+This has the effect of leap-frogging unresponsive nodes and
+focusing attention on the nodes that do respond. We will also
+prematurely abort a query while there are still outstanding
+requests, if enough of the closest nodes have responded and no
+closer nodes have been found. This prevents a far away unresponsive
+node from making the query's completion wait for it to time out.
+
+Finally, we made all attempts possible to prevent firewalled and
+NATted nodes from being added to the routing table for future
+requests. Only a node that has responded to a request from us will
+be added to the table. If a node has only sent us a request, we
+attempt to send a \texttt{ping} to the node to determine if it is
+NATted or not. Unfortunately, since NATs typically allow incoming
+UDP packets for a short time after the NATted host has sent one,
+the ping is likely to succeed even if the node is NATted. We
+therefore also schedule a future ping to the node to make sure it
+is still reachable after the NAT's window has hopefully elapsed. We
+also schedule future pings of nodes that fail once to respond to a
+request, as it takes multiple failures (currently 3) before a node
+is removed from the routing table.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{apt_p2p_improvements-find_value.eps}
+\caption{The distribution of average response times PlanetLab nodes
+experience for \texttt{find\_value} queries. The original DHT
+implementation results are shown, as well as the successive
+improvements that we made to reduce the response time.}
+\label{improvements}
+\end{figure}
+
+To test our changes during development, we ran the customized DHT
+for several hours after each major change on over 300 PlanetLab
+nodes \cite{planetlab}. Though the nodes are not expected to be
+firewalled or NATted, some can be quite overloaded and so
+consistently fail to respond within a timeout period, similar to
+NATted peers. The resulting distribution of the nodes' average
+response times is shown in Figure~\ref{improvements}. Each
+improvement successfully reduced the response time, for a total
+reduction of more than half. The final distribution is also
+narrower, as the improvements make the system more predictable.
+However, there are still a large number of outliers with higher
+average response times, which are the overloaded nodes on
+PlanetLab. This was confirmed by examining the average time it took
+for a timeout to occur, which should be constant as it is a
+configuration option, but can be much larger if the node is too
+overloaded for the program to be able to check for a timeout often
+enough.
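+
+Our implementation schedules these retransmissions with Twisted's
+event loop; the following blocking sketch merely illustrates the
+schedule (send at 0, 2, and 6 seconds, give up at 9 seconds), with
+\texttt{send} and \texttt{wait\_response} as hypothetical
+callbacks:
+
+\begin{verbatim}
+import time
+
+RETRY_AT = (2, 6)  # seconds after the original send
+TIMEOUT = 9
+
+def send_with_retries(send, wait_response):
+    start = time.time()
+    retries = iter(RETRY_AT)
+    next_retry = next(retries, None)
+    send()                        # original request
+    while time.time() - start < TIMEOUT:
+        if wait_response(0.1):    # poll briefly
+            return True           # node responded
+        if next_retry and \
+                time.time() - start >= next_retry:
+            send()                # retransmit
+            next_retry = next(retries, None)
+    return False                  # mark the node as failed
+\end{verbatim}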
+
+\subsection{Multiple Values}
+\label{multiple_values}
+
+The original design of Kademlia specified that each key would have
+only a single value associated with it. The RPC to find this value
+was called \texttt{find\_value} and worked similarly to
+\texttt{find\_node}, iteratively finding nodes with IDs closer to
+the desired key. However, if a node had a value stored associated
+with the searched-for key, it would respond to the request with
+that value instead of the list of nodes it knows about that are
+closer.
+
+While this works well for single values, it can cause a problem
+when there are multiple values. If the responding node is no longer
+one of the closest to the key being searched for, then the values
+it returns will probably be the staler ones in the system, as it
+will not have the latest stored values. However, the search for
+closer nodes will stop there, as the queried node only returned
+values and not a list of nodes to recursively query. We could have
+the request return both the values and the list of nodes, but that
+would severely limit the size and number of the values that could
+be returned in a single UDP packet.
+
+Instead, we have broken up the original \texttt{find\_value}
+operation into two parts. The new \texttt{find\_value} request
+always returns a list of nodes that the node believes are closest
+to the key, as well as a number indicating how many values this
+node has for the key. Once a querying node has finished its search
+for nodes and found the closest ones to the key, it can issue
+\texttt{get\_value} requests to some nodes to actually retrieve the
+values they have. This allows for much more control of when and how
+many nodes to query for values. For example, a querying node could
+abort the search once it has found enough values in some nodes, or
+it could choose to only request values from the nodes that are
+closest to the key being searched for.
+
+\subsection{\textbf{OPTIONAL}: BitTorrent's Improvements}
+\label{bittorrent_dht}
+
+In the several years that some BitTorrent clients have been using a
+Kademlia-based DHT for tracker-less operation, the developers have
+made many enhancements which we can take advantage of. One of the
+most important is a security feature added to stop malicious nodes
+from subscribing other nodes as downloaders. When a node issues a
+request to another node to store its download info for that key, it
+must include a \emph{token} returned by the storing node in a
+recent \texttt{find\_node} request. The token is created by hashing
+the IP address of the requesting peer with a temporary secret that
+expires after several minutes. This prevents the requesting peer
+from faking its IP address in the store request, since it must
+first receive a response from a \texttt{find\_node} on that IP.
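+
+One plausible realization of this token scheme, with a secret that
+rotates every few minutes and the previous secret still accepted
+(mirroring the behavior of BitTorrent's DHT), is sketched below:
+
+\begin{verbatim}
+import hashlib, os, time
+
+SECRET_LIFETIME = 5 * 60    # rotate every five minutes
+secrets = [os.urandom(20)]  # newest secret first
+last_rotation = time.time()
+
+def rotate():
+    global last_rotation
+    if time.time() - last_rotation > SECRET_LIFETIME:
+        secrets.insert(0, os.urandom(20))
+        del secrets[2:]     # keep current and previous
+        last_rotation = time.time()
+
+def make_token(ip):
+    # Returned with find_node responses.
+    rotate()
+    return hashlib.sha1(ip.encode() + secrets[0]).digest()
+
+def token_ok(ip, token):
+    # Checked when a store request arrives.
+    rotate()
+    return any(
+        hashlib.sha1(ip.encode() + s).digest() == token
+        for s in secrets)
+\end{verbatim}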
+
+We also made some BitTorrent-inspired changes to the parameters of
+the DHT originally specified by the authors of Kademlia. Since, as
+we will show later, peers stay online in our system for much longer
+periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.
+The value is supposed to be large enough such that any given
+\emph{k} nodes are unlikely to fail within an hour of each other,
+and such failures are very unlikely in our system given the long
+uptimes of nodes (see Figure~\ref{duration_online_1} in
+Section~\ref{analysis}). We also increased the number of concurrent
+outstanding requests allowed from 3 to 6, to speed up the recursive
+key-finding processes.
+
+\subsection{\textbf{OPTIONAL}: Other Changes}
+\label{other_changes}
+
+We added one other new RPC request that nodes can make:
+\texttt{join}. This request is only sent on first loading the DHT,
+and is usually only sent to the bootstrap nodes that are listed for
+the DHT. These bootstrap nodes will respond to the request with the
+requesting peer's IP and port, so that the peer can determine what
+its outside IP address is and whether port translation is being
+used. In the future, we hope to add functionality similar to STUN
+\cite{STUN}, so that nodes can detect whether they are NATted and
+take appropriate steps to circumvent it.
+
+In addition, we allow peers to store values in the DHT even if the
+hash they are using is not the correct length. Most of the keys
+used in the DHT are based on the SHA1 hash, and so are 20 bytes in
+length. However, some files in the targeted Debian package system
+are only hashed using the MD5 algorithm, and so the hashes
+retrieved from the server are 16 bytes in length. The DHT will
+still store values using these keys by padding the end of them with
+zeroes to the proper length, as the loss of uniqueness from the 4
+fewer bytes is not an issue. Also, the DHT will happily store
+hashes that are longer than 20 bytes, such as those from the
+32-byte SHA256 algorithm, by truncating them to the correct length.
+Since the requesting peer will have the full length of the hash,
+this will not affect its attempt to verify the downloaded file.
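+
+This key normalization amounts to a one-line rule; the constant
+and function name below are ours, for illustration only:
+
+\begin{verbatim}
+KEY_LEN = 20  # SHA1 length, the DHT's native key size
+
+def normalize_key(h):
+    # Pad a short hash (e.g. 16-byte MD5) with zero
+    # bytes, or truncate a long one (e.g. 32-byte
+    # SHA256) to the native key size.
+    return h.ljust(KEY_LEN, b'\x00')[:KEY_LEN]
+\end{verbatim}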
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Performance Evaluation}
+\label{analysis}
+
+Our \texttt{apt-p2p} implementation supporting the Debian package
+distribution system has been available to all Debian users since
+May 3rd, 2008 \cite{apt-p2p-debian}, and will also be available in
+the next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since
+created a \emph{walker} that navigates the DHT and finds all the
+peers currently connected to it. This allows us to analyze many
+aspects of our implementation.
+
+\subsection{Peer Lifetimes}
+\label{peer_life}
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}
+\caption{The number of peers found in the system, and how many are
+behind a firewall or NAT.}
+\label{walker_peers}
+\end{figure}
+
+We first began analyzing the DHT on June 24th, and continue to this
+day, giving us 2 months of data so far. Figure~\ref{walker_peers}
+shows the number of peers we have seen in the DHT during this time.
+The peer population is very steady, with just over 50 regular users
+participating in the DHT at any time. We also find 100 users who
+connect regularly (weekly), and we have found 186 unique users in
+the 2 months of our analysis.
+
+We also determined which users are behind a firewall or NAT, which
+is one of the main problems of implementing a peer-to-peer network.
+These peers will be unresponsive to DHT requests from peers they
+have not contacted recently, causing the requesting peer to wait
+for a timeout to occur (currently 9 seconds) before moving on. They
+will also be unable to contribute any upload bandwidth to other
+peers, as all requests for packages from them will also time out.
+From Figure~\ref{walker_peers}, we see that approximately half of
+all peers suffered from this restriction.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}
+\caption{The CDF of how long an average session will last.}
+\label{duration_peers}
+\end{figure}
+
+Figure~\ref{duration_peers} shows the cumulative distribution of
+how long a connection from a peer can be expected to last. Because
+our software is installed as a daemon that is started by default
+every time the computer boots, peers are expected to stay in the
+system for long periods. Indeed, we find that 50\% of connections
+last longer than 5 hours, and 20\% last longer than 10 hours. These
+connections are much longer than those reported by Saroiu et
+al.~\cite{saroiu2001} for other P2P systems, which found 50\% of
+Napster and Gnutella sessions lasting only 1 hour.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}
+\caption{\textbf{OPTIONAL}: The CDF of the average time individual peers stay in the
+system.}
+\label{duration_ind_peers}
+\end{figure}
+
+\textbf{OPTIONAL}: We also examined the average time each
+individual peer spends in the system.
+Figure~\ref{duration_ind_peers} shows the cumulative distribution
+of how long each individual peer remains in the system. Here we see
+that 50\% of peers have average stays in the system longer than 10
+hours.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}
+\caption{The fraction of peers that, given their current duration in
+the system, will stay online for another hour.}
+\label{duration_online_1}
+\end{figure}
+
+Since our DHT is based on Kademlia, which was designed around the
+probability that a node will remain up another hour, we also
+analyzed our system for this parameter.
+Figure~\ref{duration_online_1} shows the fraction of peers that
+will remain online for another hour, as a function of how long they
+have been online so far. Maymounkov and Mazi\`{e}res found that the
+longer a node has been online, the higher the probability that it
+will stay online \cite{kademlia}. Our results also show this
+behavior. In addition, similar to the Gnutella peers, over 90\% of
+our peers that have been online for 10 hours will remain online for
+another hour. Our results also show that, for our system, over 80\%
+of all peers will remain online another hour, compared with around
+50\% for Gnutella.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}
+\caption{\textbf{OPTIONAL}: The fraction of peers that, given their current duration in
+the system, will stay online for another 6 hours.}
+\label{duration_online_6}
+\end{figure}
+
+\textbf{OPTIONAL}: Since our peers are much longer-lived than those
+of other P2P systems, we also looked at the fraction of peers that
+stay online for another 6 hours. Figure~\ref{duration_online_6}
+shows that over 60\% of peers that are online for 10 hours will
+stay online for another 6. However, we see an interesting decrease
+in this fraction between 8 and 12 hours, which can also be seen in
+Figure~\ref{duration_online_1}.
+We believe this to be due to desktop users, who regularly turn off
+their computers at night.
+
+\subsection{Peer Statistics}
+\label{peer_stats}
+
+On July 31st we enhanced our walker to retrieve additional
+information from each contacted peer. The peers are configured, by
+default, to publish some statistics on how much they are
+downloading and uploading, and their measured response times for
+DHT queries. Our walker can extract this information only if the
+peer is not firewalled or NATted, has not disabled this
+functionality, and uses the same port for both its DHT (UDP)
+requests and its download (TCP) requests (which is the default
+configuration).
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}
+\caption{The number of peers that were contacted to determine their
+bandwidth, and the total number of peers in the system.}
+\label{down_peers}
+\end{figure}
+
+Figure~\ref{down_peers} shows the total number of peers we have
+been able to contact since starting to gather this additional
+information, as well as how many total peers were found. We were
+only able to contact 30\% of all the peers that connected to the
+system during this time.
+
+\begin{figure}
+\centering
+\includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}
+\caption{\textbf{OPTIONAL}: The bandwidth of data that the contacted peers have
+downloaded and uploaded.}
+\label{down_bw}
+\end{figure}
+
+Figure~\ref{down_bw} shows the amount of data that the peers we
+were able to contact have downloaded. Peers measure their downloads
+from other peers and from mirrors separately, so we are able to get
+an idea of how much savings our system is generating for the
+mirrors. We see that the peers are downloading approximately 20\%
+of their package data from other peers, which saves the mirrors
+from supplying that bandwidth. The actual numbers are only a lower
+bound, since we have only contacted 30\% of the peers in the
+system, but we can estimate that \texttt{apt-p2p} has already saved
+the mirrors 15 GB of bandwidth, or 1 GB per day.
+
+We also collected statistics on the response times peers were
+experiencing when sending requests to the DHT. We found that the
+recursive \texttt{find\_value} query, which is necessary before a
+download can occur, takes 17 seconds on average. This indicates
+that, on average, requests are experiencing almost 2 stalls while
+waiting for the 9 second timeouts to occur on unresponsive peers.
+This is longer than our target of 10 seconds, although it will only
+lead to a slight average delay in downloading of $17/10 = 1.7$
+seconds when the default 10 concurrent downloads are occurring, as
+the lookup delays overlap with ongoing downloads. This increased
+response time is due to the number of peers that were behind
+firewalls or NATs, which was much higher than we anticipated. We
+plan to improve this by better informing users of their NATted
+status, by using STUN \cite{STUN} to circumvent the NATs, and by
+better excluding NATted peers from the DHT (which does not prevent
+them from using the system).
+
+We were also concerned that the constant DHT requests and
+responses, even while not downloading, would overwhelm some peers'
+network connections. However, we found that peers use only 200 to
+300 bytes per second of bandwidth in servicing the DHT. These
+numbers are small enough not to affect any other network services
+the peer would be running.
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Related Work}
+\label{related}
+
+There have been other preliminary attempts to implement
+peer-to-peer distributors for software packages. apt-torrent
+\cite{apttorrent} creates torrents for some of the larger packages
+available, but it ignores the smaller packages, which are often the
+most popular. DebTorrent \cite{debtorrent} makes widespread
+modifications to a traditional BitTorrent client to try to fix the
+drawbacks mentioned in Section~\ref{bittorrent}. However, these
+changes also require some modifications to the distribution system
+to support them. Our system considers all the files available to
+users to download, and makes use of the existing infrastructure
+unmodified.
+
+\textbf{OPTIONAL}: Others have also used DHTs to support this type
+of functionality. Kenosis \cite{kenosis} is a P2P Remote Procedure
+Call client also based on the Kademlia DHT, but it is meant as a
+P2P primitive system on which other tools can be built, and so it
+has no file sharing functionality at all. Many have used a DHT as a
+drop-in replacement for the tracker in a BitTorrent system
+\cite{bittorrent-dht, azureus-dht}, but such systems only use the
+DHT to find peers for the same torrent; the file sharing still uses
+traditional BitTorrent, and so is not ideal for the reasons listed
+in Section~\ref{bittorrent}.
+
+% \cite{deltacast}
+
+There are a number of works dedicated to developing a collaborative
+content distribution network (CDN) using peer-to-peer techniques.
+Freedman et al. developed Coral \cite{coral}, which uses a
+distributed \emph{sloppy} hash table to speed request times. Pierre
+and van Steen developed Globule \cite{globule}, which uses typical
+DNS and HTTP redirection techniques to serve requests from a
+network of replica servers, which in turn draw their content from
+the original location (or a backup). Shah et al. \cite{shah08}
+analyze an existing software delivery system and use the results to
+design a peer-to-peer content distribution network that makes use
+of volunteer servers to help with the load. None of these systems
+meets our goal of an even distribution of load amongst the users of
+the system. Not all users of those systems become peers, and so
+they cannot contribute back to the system after downloading. The
+volunteers that do contribute as servers are required to contribute
+larger amounts of bandwidth, both for uploading to others and for
+downloading content they do not need, in order to share it with
+other users. Our system treats all users equally, requiring all to
+become peers in the system and sharing the uploading load equally
+amongst all, but it does not require any user to download files
+they would not otherwise need.
+
+The most similar works to ours are by Shah et al. \cite{shah08} and
+Shark by Annapureddy et al. \cite{shark}. Shah's system, in
+addition to the drawbacks mentioned above, is not focused on the
+interactivity of downloads, as half of all requests were required
+``to wait between 8 and 15 minutes.'' In contrast, lookups in our
+system take only seconds to complete, and all requests can be
+completed in under a minute. Shark makes use of Coral's distributed
+sloppy hash table to speed the lookup time, but their system is
+more suited to its intended use as a distributed file server. It
+does not make use of authoritative copies of the original files,
+instead allowing any user in the system to update files and
+propagate those changes to others.
+Our system is well-tailored to the application of disseminating
+the unchanging software packages from the authoritative sources to
+all users.
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Conclusion and Future Work}
+\label{conclusions}
+
+We have designed a generally applicable peer-to-peer content
+distribution model to be used by many of the free software
+distributors operating today. It makes use of already existing
+features of most package management systems to create a
+peer-to-peer distribution that should substantially reduce the
+costs of hosting the software packages.
+
+We have also implemented our design in freely available software
+to be used in conjunction with the Debian-based distribution of
+Linux software packages. It is currently in use by some users of
+the Debian project's distribution, and so serves as an example of
+the possibilities that exist.
+
+We feel that our P2P software package distribution model is
+well-suited to be used by many free software distributors. We hope
+to convince them to adopt such a model in their distribution, or we
+may port our existing system to some of the other groups for them
+to try.
+
+\textbf{OPTIONAL}: One aspect missing from our model is the
+removal of old packages from the cache. Since our implementation is
+still relatively young, we have not had to deal with the problems
+of a growing cache of obsolete packages consuming all of a user's
+hard drive. We plan to implement some form of least recently used
+(LRU) cache removal technique, in which packages that are no longer
+available on the server, are no longer requested by peers, or
+simply are the oldest in the cache, will be removed.
+
+The most significant area of improvement still needed in our sample
+implementation is to further speed up some of the slower recursive
+DHT requests. We hope to accomplish this by further tuning the
+parameters of our current system, better excluding NATted peers
+from the routing tables, and using STUN \cite{STUN} to circumvent
+the NATs of the 50\% of peers that have not configured port
+forwarding.
+
+\bibliographystyle{IEEEtran}
+\bibliography{./IEEEabrv,./all}
+
+\end{document}