[Yum-devel] [PATCH] implement reget=check_timestamp

Tue Jun 26 18:54:09 UTC 2012

On Tue, 2012-06-26 at 04:54 -0400, Zdenek Pavlas wrote:
> >  Ok, so what is the desire for both cases here. Above that we say:
> > 
> >   reget = None   [None|'simple'|'check_timestamp']
> > 
> >     whether to attempt to reget a partially-downloaded file.  Reget
> >     only applies to .urlgrab and (obviously) only if there is a
> >     partially downloaded file.  Reget has two modes:
> > 
> > ...which implies this is getting extra data or all data, the above
> > kind of implies we are getting extra data or nothing (or maybe all data or
> > nothing).  But there is also a problem with the idea...
> 
> Yes, the idea is to use reget=simple when we have unique URLs,
> and reget=check_timestamp for URLs where content changes over time.

 But if we use this to get "primary" when it's not using unique names,
people are not going to be happy if their 8MB of 12MB download restarts.
 I understand that you are thinking of this in the context of repomd.xml
but checking the timestamps/ETags/whatever and dealing with small files
(where it doesn't matter if you skip "resume" and just re-download
everything) are distinct things.

> >  This timestamp is going to be one of three things:
> > 
> > 1. The timestamp we last tried to download FOO, and stopped before we
> > got it all.
> > 
> > 2. The timestamp we last downloaded all of FOO, but didn't have a
> > last-modified.
> > 
> > 3. The timestamp of the server last-modified when we last downloaded
> > all of FOO and had a last-modified so urlgrabber used utimes().
> > 
> > ...which is problematic.
> 
> 1. This implies timestamp check fails for every partially downloaded
> file.  That's why I ignore opts.range unless reget==simple.
>
> 2. We'd always reget the whole file (it's a special case of 1).
> 
> 3. Yes, I rely on utime() being used on completed files only.
> Why is that problematic?

 Not sure what you mean by #1 but a _partial_ download will always have
a newer timestamp than the timestamp on the server, and...

        The If-Modified-Since request-header field is used with a method
        to make it conditional: if the requested variant has not been
        modified since the time specified in this field, an entity will
        not be returned from the server; instead, a 304 (not modified)
        response will be returned without any message-body.

...so we'll fail to verify that the data is good, but urlgrabber will
fail to (re)download anything because the timestamp is newer and it just
gets 304s. The problem with #3 is the same ... servers are not
_required_ to return Last-Modified, and if they don't we can't use
utime() and if we haven't used utime() we really shouldn't be passing
the mtime we do have to the server.
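
 A minimal sketch of the 304 problem described above. It models only the
If-Modified-Since rule quoted from the HTTP spec; server_status() and the
two dates are made up for illustration and are not part of urlgrabber.

```python
from datetime import datetime, timezone

def server_status(last_modified, if_modified_since):
    """Hypothetical model of a server applying the If-Modified-Since rule."""
    # Not modified since the time the client sent: no body, just a 304.
    if if_modified_since is not None and last_modified <= if_modified_since:
        return 304
    return 200

# Assumed example values: the server copy is older than our partial file,
# because the partial file's mtime is simply "when we stopped writing".
server_mtime = datetime(2012, 6, 20, 12, 0, tzinfo=timezone.utc)
partial_mtime = datetime(2012, 6, 26, 8, 0, tzinfo=timezone.utc)

# Sending the partial file's mtime gets a 304, so nothing is re-downloaded
# even though the local file is incomplete.
status = server_status(server_mtime, partial_mtime)
print(status)
```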
 With multiple servers we can also be downloading from ftp one day, and
then downloading from http the next ... and we shouldn't be using the
timestamps in those cases either.
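
 Put together, the conditions under which the local mtime is safe to send
could be sketched roughly like this. The function and its arguments are
hypothetical (urlgrabber does not record any of this today), and comparing
scheme plus host is just one conservative assumption for "same source".

```python
from urllib.parse import urlsplit

def may_send_if_modified_since(download_complete, mtime_set_by_utime,
                               previous_url, next_url):
    """Hypothetical check: is the local mtime meaningful to this server?"""
    # Case 1: a partial file's mtime is only the time we stopped writing.
    if not download_complete:
        return False
    # Case 2: no Last-Modified was received, so utime() was never applied.
    if not mtime_set_by_utime:
        return False
    # ftp one day, http the next (or another mirror): don't trust it.
    prev, nxt = urlsplit(previous_url), urlsplit(next_url)
    return (prev.scheme, prev.netloc) == (nxt.scheme, nxt.netloc)

# Assumed example URLs, not real mirrors.
http_url = "http://mirror.example/repodata/repomd.xml"
ftp_url = "ftp://mirror.example/repodata/repomd.xml"
print(may_send_if_modified_since(True, True, http_url, http_url))
print(may_send_if_modified_since(True, True, ftp_url, http_url))
```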

 This is probably really hard (if not impossible) to trigger with
repomd.xml ... because it's so small, but then as you said it's not a
measurable improvement even if it works ... because it's so small.
 Also there's the problem that anything using metalink files implies
that checking the timestamps is a noop anyway (or should be; in some
weird server failure cases it could cause problems).