[Yum-devel] [PATCH 6/6] Implement parallel downloads for regular RPMs

Zdenek Pavlas zpavlas at redhat.com
Fri Jul 15 11:50:46 UTC 2011


> "James Antill" <james at fedoraproject.org> wrote:

> 1. rpmdb/sqlite/etc. are now fork()d in N processes, and we have to make
> sure that all the code within downloadProcess() doesn't do anything
> weird with them. This scares me a lot.

Agreed.  We can't even make sure that accessing them produces an error,
because we can't close them.

The solution is to close all DBs before the download and reopen them
afterwards, but I'm afraid that's quite costly.  On the other hand, we
could probably release the yum lock safely at the same time.
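Roughly something like this -- a minimal sketch assuming YumBase's
closeRpmDB()/doLock()/doUnlock(), with fork_downloaders() as a purely
hypothetical stand-in for whatever spawns the download children:

    def parallel_download(yb, pkgs):
        # Children must not inherit live rpmdb/sqlite handles, so drop
        # them in the parent first; yum can reopen them lazily later.
        yb.closeRpmDB()
        yb.doUnlock()     # downloads never touch the rpmdb, so the
                          # global yum lock can go too
        try:
            fork_downloaders(pkgs)   # hypothetical parallel-download step
        finally:
            yb.doLock()   # re-acquire before any transaction work resumes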

> 2. Any global resources, like fd's open or what happens at signal time
> will need to be dealt with. This is almost certainly more pain than is
> wanted.

Besides the rpmdb and sqlite, I don't know of any other resources
the downloader could possibly touch.  I admit I haven't checked yet
how yum handles signals, but IMHO it's not an obstacle.
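On the signal side, the child-side setup is probably just a few lines;
a sketch of the minimum I'd expect (the exact handler set is a guess):

    import signal

    def reset_signals_in_child():
        # Restore default dispositions so a Ctrl-C in the parent does
        # not run yum's cleanup handlers once per downloader process.
        for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT):
            signal.signal(sig, signal.SIG_DFL)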

> 3. We have to make sure that all the python code in yum/urlgrabber/etc.
> below downloadProcess() doesn't do anything weird due to running in N
> procs. at once. This is almost certainly more pain than is wanted.

It goes quite directly down to urlgrabber.  The progress callbacks are
a bit of a pain to trace, but the rest is quite predictable, I think.
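One way to keep the callbacks out of the children is to forward updates
over a pipe and let the parent own the terminal.  A rough sketch; the
start/update/end methods mimic urlgrabber's progress-object interface,
but treat the exact signatures as an assumption:

    class PipeProgress(object):
        # Runs in the child; the parent selects on the pipe and draws.
        def __init__(self, wfile, name):
            self.wfile, self.name = wfile, name

        def start(self, *args, **kwargs):
            pass

        def update(self, amount_read, *args):
            self.wfile.write('%s %d\n' % (self.name, amount_read))
            self.wfile.flush()

        def end(self, amount_read, *args):
            self.update(amount_read)

The child would then pass progress_obj=PipeProgress(w, name) to its
urlgrab() call instead of a real terminal meter.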

> 4. SELinux does have setcontext() but would _really_ prefer to have an
> exec() instead ... and we still have a huge amount of extra code in
> core, even if it's running in a restricted context.

I'm not sure exec() would be a win:

1) you'd have to re-implement all the setup code;
2) or we could "import yum" and friends, but then we're back to
   the "huge amount of extra code in core";
3) and you'd exec() the very same /usr/bin/python anyway.

> 5. This is pretty package specific ... we'd need a bigger, and scarier,
> patch if we want to do anything else.

I've read the drpm download code; it's very similar.  Maybe we should
merge and clean up the download paths first?

> 6. We inherit the memory resources of yum, for all the downloaders. COW
> might help a bit here ... but this is python, not C, so I could see us
> churning through COW pages a lot more than we might expect.

Correct me if I'm wrong, but there's actually very little work done
between downloadProcess(), getPackage(), urlgrabber and pycurl.
It's all just thin wrappers: a few extra stack frames, with no huge
lists or dicts touched along the way.

It's so thin that I'm considering throwing it away entirely and using
pycurl directly.  CurlMulti() looks really great and is async-based,
so instead of N processes we'd fork just one (see the sketch below).
What do you think?
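For concreteness, here's the standard CurlMulti loop, close to pycurl's
retriever-multi example; a minimal sketch with no error handling or
progress reporting:

    import pycurl

    def fetch_all(jobs):
        # jobs: list of (url, local filename) pairs, all fetched from
        # a single process, multiplexed by libcurl.
        multi = pycurl.CurlMulti()
        handles = []
        for url, path in jobs:
            c = pycurl.Curl()
            c.fp = open(path, 'wb')
            c.setopt(pycurl.URL, url)
            c.setopt(pycurl.WRITEFUNCTION, c.fp.write)
            multi.add_handle(c)
            handles.append(c)
        active = len(handles)
        while active:
            while True:
                ret, active = multi.perform()
                if ret != pycurl.E_CALL_MULTI_PERFORM:
                    break
            multi.select(1.0)   # sleep until libcurl has work to do
        for c in handles:
            multi.remove_handle(c)
            c.fp.close()
            c.close()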

> ...so as I said, I think it's a good POC ... you have something where
> you can measure the impact of the change, do speed tests etc.

Not yet; that's still on my todo list.

> But you want to look at the fork()+exec() model inside urlgrabber,
> next. And then we can look at some APIs for "sane users" ... and then
> see what we need to make it not suck to integrate it into yum.

The problem is that urlgrabber does not provide any "bulk" or "async"
API.  That has to be defined first, I think.
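For discussion, the shape I'd start from is roughly this; the class and
method names are pure invention, nothing urlgrabber provides today:

    class AsyncGrabber(object):
        # Illustrative interface only: queue transfers, then drive them.
        def __init__(self):
            self._queue = []

        def request(self, url, filename, progress_obj=None):
            # Returns immediately; nothing is transferred yet.
            self._queue.append((url, filename, progress_obj))

        def perform(self):
            # Run all queued transfers (e.g. on top of CurlMulti) and
            # return a list of (url, exception-or-None) results.
            raise NotImplementedError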

--
Zdenek

