[Yum-devel] [PATCH 6/6] Implement parallel downloads for regular RPMs

James Antill james at fedoraproject.org
Mon Jul 18 14:50:19 UTC 2011


On Fri, 2011-07-15 at 07:50 -0400, Zdenek Pavlas wrote:
> > "James Antill" <james at fedoraproject.org> wrote:
> 
> > 1. rpmdb/sqlite/etc. are now fork()d in N processes, and we have to make
> > sure that all the code within downloadProcess() doesn't do anything
> > weird with them. This scares me a lot.
> 
> Agreed.  We can't even make sure that accessing it produces an error,
> because we can't close it.
> 
> The solution is to close all DBs before the download, and reopen them later.
> But I'm afraid it's quite costly.

 No, that's more of a workaround ... and while, in theory, we could do
it for the core yum rpmdb/sqlite (maybe), it would mean adding a bunch of
APIs to different parts of yum to close/reopen stuff. And we have no
control over plugins, and some things (like NSS) just don't like that
model anyway.
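
 Just to show the scale of it, a rough sketch of that close-everything /
reopen-everything cycle (rpm's closeDB()/openDB() and sqlite3's close() are
real calls, but how we'd enumerate every open connection, plugins included,
is exactly the missing API):

import contextlib
import rpm

@contextlib.contextmanager
def dbs_closed(ts, sqlite_conns):
    # Close the rpmdb and any open sqlite metadata connections around a
    # fork-heavy section; enumerating sqlite_conns is hand-waved here.
    for conn in sqlite_conns:
        conn.close()
    ts.closeDB()
    try:
        yield
    finally:
        ts.openDB()
        # sqlite connections can't be reused after close(); callers have to
        # recreate them, which is part of why this is costly.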

>   Otoh, we could probably safely release
> the lock at the same time.

 I'm not sure what lock you mean here, but in yum the big locking
problem is that we can't "change" the repos. after they've been
accessed. And that means we can't unlock them (or another proc. could
change them).
 The one exception here is if we did something specifically for
"background downloading", where we'd get a list of packages to download
and then drop the locks and download them (because we'd then immediately
exit after doing that thing). Even then it might be easier to make this
a multi-stage process (Eg. download to tmp path, and then lock+rename).
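
 To make the two-stage idea concrete, a minimal sketch (grab(), acquire_lock()
and release_lock() are stand-ins for urlgrabber and yum's global lock, not
real APIs):

import os
import tempfile

def background_download(pkg_url, cachedir, grab, acquire_lock, release_lock):
    # Stage 1: download to a temporary path without holding any yum locks.
    fd, tmp_path = tempfile.mkstemp(dir=cachedir, suffix='.part')
    os.close(fd)
    grab(pkg_url, tmp_path)
    # Stage 2: take the lock only long enough for an atomic rename.
    final_path = os.path.join(cachedir, os.path.basename(pkg_url))
    acquire_lock()
    try:
        os.rename(tmp_path, final_path)  # same filesystem, so this is atomic
    finally:
        release_lock()
    return final_path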

> > 2. Any global resources, like fd's open or what happens at signal time
> > will need to be dealt with. This is almost certainly more pain than is
> > wanted.
> 
> Besides the rpmdb and sqlite, I don't know of any other resources
> the downloader could possibly touch.  Well, I haven't checked how
> signals are handled in yum yet, but IMHO it's not an obstacle.

 The two big ones are:

 Signals.
 Exceptions.

...esp. as both of those interact with code we'll be calling (Eg. both
rpmdb and NSS play with SIGINT).
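
 With the fork() model that means every child has to pin down its own signal
handling and must never let an exception escape back into the forked copy of
yum's state. Roughly (download_one() is just a placeholder):

import os
import signal

def download_child(download_one, job):
    # Make signal handling predictable before running any download code;
    # whatever the parent (or rpmdb/NSS) installed for SIGINT would otherwise
    # run in every forked copy on ^C.
    signal.signal(signal.SIGINT, signal.SIG_IGN)  # let the parent decide
    try:
        download_one(job)
        os._exit(0)
    except BaseException:
        os._exit(1)  # never propagate back into forked yum internals

def spawn(download_one, job):
    pid = os.fork()
    if pid == 0:
        download_child(download_one, job)
    return pid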

> > 3. We have to make sure that all the python code in yum/urlgrabber/etc.
> > below downloadProcess() doesn't do anything weird due to running in N
> > procs. at once. This is almost certainly more pain than is wanted.
> 
> It goes pretty much straight to urlgrabber.  Progress callbacks are a bit
> of a pain to look up, but the rest is quite predictable, I think.

 The huge elephant in the room here is the rhnplugin, but even without
that I'd be worried.

> > 4. SELinux does have setcontext() but would _really_ prefer to have an
> > exec() instead ... and we still have a huge amount of extra code in
> > core, even if it's running in a restricted context.
> 
> I'm not sure exec() would be a win.
> 
> 1) you'd have to re-implement all the setup code.
>
> 2) or we could "import yum" and friends, but then we're back at
>    "huge amount of extra code in core".
> 3) you'd exec() the very same /usr/bin/python anyway.
 
 What setup code? The "download helper" only needs to know the
information we are passing to urlgrabber ... it's not like we'd need to
read yum repo files (in fact we can't, as they might not exist).

 Yes, at least for version #1, I assume we'll still be running python in
the download helper ... but that's still a lot better than forking yum
after it has imported 666 things. Think of it like this (a rough sketch of
the fork+exec helper follows the list):

1. fork+exec(python app).
2. fork+exec(python app)+drop privs.
3. fork+exec(python app)+chroot+drop privs.
4. fork+exec(C app)+chroot+drop privs.
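
 For option 2, the yum side is roughly this (the helper path, its argument
convention and the 'nobody' account are all assumptions):

import os
import pwd

HELPER = '/usr/libexec/yum-download-helper'   # hypothetical helper

def spawn_helper(url, destpath):
    pid = os.fork()
    if pid == 0:
        try:
            nobody = pwd.getpwnam('nobody')
            os.setgid(nobody.pw_gid)
            os.setuid(nobody.pw_uid)      # drop privs before exec
            os.execv(HELPER, [HELPER, url, destpath])
        finally:
            os._exit(127)                 # exec failed; never fall back into yum
    return pid

def helper_ok(pid):
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

The helper itself only has to parse url/destpath and hand them to urlgrabber
(or curl), which is why it doesn't need any of yum's setup code.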

> > 5. This is pretty package specific ... we'd need a bigger, and scarier,
> > patch if we want to do anything else.
> 
> I've read the drpm download code; it's very similar.  Maybe we should
> merge and clean up the downloading paths first?

 There's drpm and metadata. For drpm, the end result we want is a merged
download path ... Nils said he'd look at it, but he hasn't had much time,
so speak to him if you want to look at doing that too.
 For metadata we really need good APIs from urlgrabber, and the
experience of doing it for packages ... and a bunch of work :).

> > 6. We inherit the memory resources of yum, for all the downloaders. COW
> > might help a bit here ... but this is python, not C, so I could see us
> > churning through COW pages a lot more than we might expect.
> 
> Correct me if I'm wrong, but there's actually very little work done
> between downloadProcess(), getPackage(), urlgrabber and pycurl.
> It's all just wrapper code: a few extra stack frames, with no huge
> lists or dicts to touch along the way.
> 
> It's so thin that I'm considering throwing it away entirely
> and using pycurl directly.  CurlMulti() looks really great and
> is async-based, so instead of N processes we'd fork just one.
> What do you think?

 If we don't fork+exec we lose the containment/SELinux features. We also
lose the ability to properly work around the NSS client cert. problems.
 If you want the download helpers to use CurlMulti(), I'm fine with
that ... although it might be significantly easier to not do that.
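
 FWIW the core of a CurlMulti()-based helper is pretty small; this is the
real pycurl API, but with everything urlgrabber adds (mirrors, progress,
throttling, checksums) left out:

import pycurl

def fetch_all(jobs):
    # jobs is a list of (url, destpath) pairs; all transfers run
    # asynchronously inside this one process.
    multi = pycurl.CurlMulti()
    handles = []
    for url, dest in jobs:
        fp = open(dest, 'wb')
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, fp)
        c.setopt(pycurl.FOLLOWLOCATION, 1)
        multi.add_handle(c)
        handles.append((c, fp))

    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)          # wait for activity instead of spinning

    for c, fp in handles:
        multi.remove_handle(c)
        c.close()
        fp.close()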


> > But you want to look at the fork()+exec() model inside urlgrabber,
> > next. And then we can look at some APIs for "sane users" ... and then
> > see what we need to make it not suck to integrate it into yum.
> 
> The problem is urlgrabber does not provide any "bulk" or "async" API.
> That has to be defined first, I think.

 Yes, we'll probably need to add APIs on both sides.
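
 Purely as a strawman for what that bulk API could look like (every name here
is made up; nothing like it exists in urlgrabber yet):

class AsyncGrabber(object):
    # Strawman only: queue requests, then run them all at once with whatever
    # backend we settle on (forked helpers, CurlMulti, ...).
    def __init__(self, max_connections=5, **grabber_opts):
        self.max_connections = max_connections
        self.opts = grabber_opts
        self._queue = []

    def urlgrab_async(self, url, filename, checkfunc=None, **kwargs):
        # Nothing is downloaded yet; just record what the caller wants.
        self._queue.append((url, filename, checkfunc, kwargs))

    def parallel_wait(self, failure_cb=None):
        # Run everything queued so far, block until all transfers finish,
        # and call failure_cb(request, exception) for each failure.
        raise NotImplementedError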


