[Yum] Re: yum mirroring

Robert G. Brown rgb at phy.duke.edu
Wed Jul 30 13:09:24 UTC 2003


On 29 Jul 2003, Aleksander Demko wrote:

> On Mon, 2003-07-28 at 16:17, Robert G. Brown wrote:
> ...
> > wget is pretty simple as well, but you have to tell it to descend
> > recursively to an appropriate depth or use the --mirror option.
> > Something like:
> > 
> >   wget --mirror http://whatever.repository.youlike.org -o
> > /tmp/mirror_log
> 
> Yeah, that doesn't work quite right. Without parameter fiddling, I get
> non-repository files (.html, etc) as well as it may go UP the url and
> continue to suck down. I didn't realize this until I pulled down 3+ gig
> from Duke... had stuff like 7.x updates to (useless to me, as we have no
> <8 machines)... luckily we get like a megabyte a second from you guys.

I know you have, but RFM a few more times (it is pretty long and
complicated:-).  I recall that there are options for controlling
whether and/or how far up and down it extends recursively.  I also
wasn't clear -- you probably wanted something more like:

 wget --mirror \
  http://whatever.repository.youlike.org/pub/linux/distro_9 \
   -o /tmp/mirror_log

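which is meant to mirror only the distro_9 part rather than the WHOLE
repository, although without --no-parent wget can still follow "Parent
Directory" links back up the tree.  By default it also recreates the
host name and the full remote path under your current directory, so you
end up with something like
./whatever.repository.youlike.org/pub/linux/distro_9.  It doesn't really
behave like a cp.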
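
For what it is worth, a minimal sketch of a confined mirror might look
like the following.  The options are standard wget ones; the URL in the
usage comment is the placeholder from above, the function name
mirror_distro and the WGET override are hypothetical conveniences for
this sketch, and the --cut-dirs count assumes that the two leading path
components (pub/linux) are the ones you want dropped locally.

```shell
# Sketch only: mirror one distro subtree without climbing to the parent.
# mirror_distro and the WGET variable are names made up for this example.
mirror_distro() {
    url=$1                       # should end in "/" so --no-parent treats it as a directory
    log=${2:-/tmp/mirror_log}    # log file, as in the command above
    # --no-parent            never ascend above the starting directory
    # --no-host-directories  do not create a directory named after the host
    # --cut-dirs=2           drop the leading pub/linux from local paths
    # --reject               discard generated index pages after link extraction
    "${WGET:-wget}" --mirror --no-parent --no-host-directories --cut-dirs=2 \
        --reject 'index.html*' -o "$log" "$url"
}

# Usage (placeholder URL):
#   mirror_distro http://whatever.repository.youlike.org/pub/linux/distro_9/
```

Setting WGET=echo prints the command instead of running it, which is a
cheap way to check the arguments before pulling down gigabytes.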

> Also, I don't think wget has the guts to actually remove files that are
> no longer on the repository.

Ya, wget is an adequate tool but hardly sparkly or exciting, because it
uses httpd itself to deliver the files.  That is, it doesn't really
mirror anything -- it is a very specialized scripted browser that
connects to a server and retrieves every file it finds, recursively, in
a tree.  Of course, a lot of "files" it might find could be active/cgi
files.  The best it can do is save whatever it is that it was presented
with, which is probably not the cgi source.  Not a copy at all -- more
like a browser "save" feature, recursive, with path, and as you note
conservative to a fault.

> > should do it.  You have to look to see if rsync works for each
> > repository you might want to mirror.  Where it works it is "better".  It
> > is also reasonable to ask permission before mirroring regularly from a
> > public repository that doesn't already grant it openly.  Some sites have
> > spare bandwidth and a public-spiritedness, others don't.
> 
> I've never used rsync, but I don't think I can use it here. Our heavily
> DMZ'ed public http server can really only do HTTP requests, and even
> those I back tunnel over ssh to a proxy. I think I'm restricted to
> HTTP-pure mirroring techniques.

rsync is actually by far the preferred tool.  It is designed to do
precisely what you need (synchronize two images, perfectly), efficiently
(transferring only what has changed, compressed if you ask for it), and
safely (you can select whether or not to delete files that are no longer
in the images being sync'd).  The issue of whether or not they support it on
the repository you're trying to mirror is a policy issue, of course, and
you may or may not have any control or voice there, but it is certainly
worth opening a discussion with the owners and asking for it.  Here are
the arguments:

rsync on top of ssh CAN be fully authenticated with really strong
authentication.  rsync can run over rsh, over ssh, against rsync(d)
itself as a server, or against rsyncd through a web proxy.  Of these,
ssh provides extremely strong host/user-level authentication and is fully access
controllable -- the issue there isn't whether or not ssh is a secure
mechanism, it is whether or not they'll give >>you<< ssh access.  This
depends on who you are and so forth, usually.

I have no idea how strong rsyncd is as a secure transport mechanism, but
for unauthenticated anonymous access to a selected tree it is probably
secure to within stupidity in setting up the tree and the eternal
possibility of e.g. buffer overflow attacks in any daemon listening on
an open port.  It does have a slew of options for authenticated access
(including host/domain authentication), chroot on the provided tree, and
so forth, and is likely comparable to httpd itself in overall security.
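
To make those options concrete, here is a hedged sketch of a read-only,
chrooted, host-restricted module for an rsync daemon, written out from
the shell so it can be reviewed before anyone installs it.  The
parameter names are standard rsyncd.conf ones; the module name
distro_9, the path, the allowed network, and the scratch file location
are all assumptions for illustration, not what any particular site uses.

```shell
# Sketch only: write a sample rsyncd.conf to a scratch file for review.
# Module name, path, and network below are hypothetical.
CONF=${CONF:-/tmp/rsyncd.conf.sample}

cat > "$CONF" <<'EOF'
# Confine the daemon to the exported tree and refuse all writes.
use chroot = yes
read only = yes

# Host-based access control: only the listed network may connect.
[distro_9]
    path = /var/ftp/pub/linux/distro_9
    comment = distro_9 tree (example module)
    hosts allow = 192.0.2.0/24
    hosts deny = *
EOF

echo "wrote $CONF"
```

A client inside the allowed network would then pull from the module
with something like rsync -av host::distro_9/ . (double colon, module
name instead of a path).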

Web proxy is weak/stupid authentication in cleartext and hence probably
not a great idea in any event, either to support wget or to support
rsyncd.  I used it for the first time yesterday and learned to my
chagrin that it doesn't run on top of ssl, which means that used over
broadband networks it is just an open invitation to password snoops.
Nobody (intelligent) permits telnet or rsh access anymore because
password snooping used to be the number one security risk of nearly any
unixoid LAN.  Somehow web proxies have escaped that, but they should and
will follow unless they are ssl-ified so that no cleartext passwords are
ever used.

The decision to support (anonymous or other) rsync access is a serious
one, of course, but lots of very paranoid repositories permit rsync one
way or another -- sometimes several ways, for different parts of the
tree.  Even Seth permits it, sometimes, and we have to regularly
medicate him so that he doesn't jump on people passing in the hall and
beat them with a sucker rod while screaming "Crackers! Crackers!  Stay
away from my servers!" (which in the South is likely to be
misunderstood:-).

[In truth, a standalone webserver that is properly backed up and
monitored isn't that big a deal if it IS cracked -- shut it down,
restore it, bring it up -- as little as an hour of downtime total,
provided of course that you can determine and shut down the cracker's
point of egress... at worst a momentary embarrassment.]

In fact, if you can talk the owners of the repository into giving you
ssh access (because you are a systems person and because your
rsync-maintained mirror REDUCES load on their server) you don't NEED any
sort of daemon -- I use rsync as a command line tool on top of ssh for
all of my quotidian mirroring needs:

  rsync -avz host:path .

to fetch all changed files verbosely, compressed in transit, while
leaving alone any local files that no longer exist on the source, or

  rsync -avz --delete host:path .

to make a "perfect" mirror, which includes deleting any local files that
have been removed from the source.  rsync rocks.  Note that you also
need to pass --rsh=ssh or set RSYNC_RSH to ssh.
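
Put together, and with a dry run first so that --delete cannot surprise
you, the ssh variant might be scripted as below.  The options are
standard rsync ones; sync_mirror, the RSYNC override, and host:path are
placeholders of mine for this sketch.

```shell
# Sketch only: mirror host:path into a local directory over ssh.
# sync_mirror and the RSYNC variable are names made up for this example.
sync_mirror() {
    src=$1          # e.g. host:path/ (a trailing slash copies the contents)
    dest=$2         # local target directory
    shift 2         # anything left over is passed through, e.g. --dry-run
    # -a archive mode, -v verbose, -z compress in transit,
    # --delete   remove local files that are gone from the source,
    # --rsh=ssh  use ssh as the transport (same effect as RSYNC_RSH=ssh)
    "${RSYNC:-rsync}" -avz --delete --rsh=ssh "$@" "$src" "$dest"
}

# Usage (placeholders):
#   sync_mirror host:path/ . --dry-run   # list what would change or be deleted
#   sync_mirror host:path/ .             # then do it for real
```

Running with --dry-run first is cheap insurance: a mistyped source path
combined with --delete will otherwise happily empty your mirror.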

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





