[Yum] new urlgrabber design
Michael Stenner
mstenner at phy.duke.edu
Mon Oct 13 21:20:30 UTC 2003
Again, if you don't know what urlgrabber is, you don't need to read
this. I am actively requesting input from Jeremy, Seth, and Icon. I
would love input from others as well (Ryan?), but these are the
ones that will get the beatings.
Here is the basic design that I have in mind. This (intentionally)
has no mention of internal workings. It only discusses things that
matter to someone that would USE the module. Internal design is
certainly open for discussion, but I only want to talk about it now to
the extent that it affects interface.
-Michael
=======================================================================
MAIN FUNCTIONS:
urlgrab   -- Fetch a url and make a local copy. Return the filename.
urlopen   -- Return a file object for the specified url.
urlread   -- Read the specified url into a string and return it.
retrygrab -- Wrapper for urlgrab that retries given certain errors.
retryopen -- Wrapper for urlopen that retries given certain errors.
retryread -- Wrapper for urlread that retries given certain errors.
NOTE: retryopen can't protect you from errors that occur AFTER the
connection is made. It can only retry setting up the connection.
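As a rough illustration only (this sketch is built on the standard
urllib machinery in modern Python, not the proposed module itself, and
the default-filename rule is my invention), the three main functions
might behave like:

```python
import shutil
import urllib.request
from urllib.parse import urlparse

def urlopen(url):
    """Return a file object for the specified url."""
    return urllib.request.urlopen(url)

def urlread(url):
    """Read the specified url into a string (bytes here) and return it."""
    fo = urlopen(url)
    try:
        return fo.read()
    finally:
        fo.close()

def urlgrab(url, filename=None):
    """Fetch a url and make a local copy.  Return the filename."""
    if filename is None:
        # illustrative default: the last path component of the url
        filename = urlparse(url).path.split('/')[-1] or 'index.html'
    fo = urlopen(url)
    try:
        with open(filename, 'wb') as out:
            shutil.copyfileobj(fo, out)
    finally:
        fo.close()
    return filename
```

Note that because urllib handles http, ftp, and file urls through the
same interface, all three functions behave identically across
protocols, which is the first feature below.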
FEATURES:
* identical behavior for http, ftp, and file
Options that change the behavior for one protocol (like
copy_local) are OK as long as they don't affect the other
protocols. However, something like byte-ranges MUST work for
all protocols. These are different because byte-ranges CHANGE
the return value for a given input. copy_local only modifies
the internal behavior.
All options must be syntactically legal for ALL urls. The whole
point is to have the library not care what sort of url is passed
in.
* smart url interpretation
- handle "normal local filenames" also
- handle url-encoded username/password for ftp and http (and file? smb?)
* byte ranges
* reget support
- internally supported via byte ranges
- several reget modes
+ never: always start from the beginning
+ force: always pick up from the end of the local file
+ smart: check timestamps, length, etc.
* throttling
* progress meter
* i18n support (if the calling application provides translations)
* settable User-Agent
* http keepalive (via the keepalive module)
* timestamp preservation
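To make the reget modes concrete: internally, each mode reduces to
choosing a byte offset to resume from, which then becomes a byte
range (e.g. an HTTP "Range: bytes=<offset>-" header). This sketch is
illustrative; the function name is invented, and the "smart" check
here compares only sizes, whereas the design above also mentions
timestamps:

```python
import os

def reget_offset(localfile, mode, remote_size=None):
    """Return the byte offset a reget should resume from.

    mode is one of 'never', 'force', 'smart' (see the modes above).
    remote_size is the size the server reports, if known.
    """
    if mode == 'never' or not os.path.exists(localfile):
        return 0                      # always start from the beginning
    have = os.path.getsize(localfile)
    if mode == 'force':
        return have                   # always pick up from the end
    # 'smart': resume only if the local copy looks like a plausible
    # partial download; the real check would also compare timestamps
    if remote_size is not None and have < remote_size:
        return have
    return 0
```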
INTERFACE:
I'm considering changing the function interface a little. There is
just getting to be an insane number of options, and I'm not sure how
to deal with it. There is also the issue of passing options through
the retry* wrappers.
Option 1 (the way it is now, everything is a kwarg)
def urlgrab(url, filename=None, copy_local=0, close_connection=0,
progress_obj=None, throttle=None, bandwidth=None):
def retrygrab(url, filename=None, copy_local=0, close_connection=0,
progress_obj=None, throttle=None, bandwidth=None,
numtries=3, retrycodes=[-1,2,4,5,6,7], checkfunc=None):
This is REALLY ugly and it makes it very hard to cleanly add
options. Specifically, what if someone does:
retrygrab(url, fn, 1, 0, None, None, None, 5) # the last is numtries
and then we later add more options to urlgrab? Sure, it's not
likely, and sure, I put a warning to only use these as kwargs in
the doc, but still. It's very icky. However, it is very clear
and very normal.
Option 2
def urlgrab(url, filename=None, **kwargs):
def retrygrab(url, filename=None, **kwargs):
retrygrab could then strip out the options it cares about and pass
on the rest. This makes the function definition very clean, but
completely useless to look at. The legal args would have to go in
the docs. One of the up-sides is that things could ONLY be called as
keyword args so the ordering is irrelevant.
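The stripping under option 2 could look something like this (urlgrab
here is a stub that just records what it received, and URLGrabError
with numeric errnos is assumed for illustration; the option names are
from the current interface):

```python
class URLGrabError(Exception):
    def __init__(self, errno, msg=''):
        Exception.__init__(self, msg)
        self.errno = errno

def urlgrab(url, filename=None, **kwargs):
    # stub standing in for the real grab; records what it received
    return (url, filename, kwargs)

def retrygrab(url, filename=None, **kwargs):
    # pop the options retrygrab itself cares about ...
    numtries   = kwargs.pop('numtries', 3)
    retrycodes = kwargs.pop('retrycodes', [-1, 2, 4, 5, 6, 7])
    # ... and pass everything else through to urlgrab untouched
    for tries in range(numtries):
        try:
            return urlgrab(url, filename, **kwargs)
        except URLGrabError as e:
            if e.errno not in retrycodes or tries == numtries - 1:
                raise
```

This keeps retrygrab ignorant of urlgrab's option list, so adding an
option to urlgrab never requires touching the retry* wrappers.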
Option 3
def urlgrab(url, filename=None, options=None):
def retrygrab(url, filename=None, options=None):
Same as 2, but instead of calling as:
urlgrab(url, copy_local=1)
it must be
urlgrab(url, options={'copy_local':1})
I don't really like this option. It's just a step on the way to
the next one :)
Option 4
def urlgrab(url, filename=None, options=None):
def retrygrab(url, filename=None, options=None, retry_options=None):
Here, the options arg to retrygrab would get passed through
untouched, and retry_options would be ONLY for options related to
the retry process.
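A sketch of option 4 (again, urlgrab is a stub and URLGrabError is an
assumed error class; the dict keys simply mirror the kwargs above):

```python
class URLGrabError(Exception):
    def __init__(self, errno, msg=''):
        Exception.__init__(self, msg)
        self.errno = errno

def urlgrab(url, filename=None, options=None):
    # stub standing in for the real grab; records what it received
    return (url, filename, options or {})

def retrygrab(url, filename=None, options=None, retry_options=None):
    retry_options = retry_options or {}
    numtries   = retry_options.get('numtries', 3)
    retrycodes = retry_options.get('retrycodes', [-1, 2, 4, 5, 6, 7])
    for tries in range(numtries):
        try:
            # the options dict passes through to urlgrab untouched
            return urlgrab(url, filename, options)
        except URLGrabError as e:
            if e.errno not in retrycodes or tries == numtries - 1:
                raise
```

The call site would then be
retrygrab(url, options={'copy_local': 1}, retry_options={'numtries': 5}).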
I'm open to other ideas... If I had to pick now, I'd probably go
with (2), but I'm still quite open.
STRUCTURE:
Because urlgrabber already consists of at least two files
(urlgrabber.py and keepalive.py), I'm thinking of making it a
"package" (directory with sub-modules inside). One might argue that
this is the only sane way to go if it's going to be a tidy library.
This will also make life much easier if we need to do "parallel
installs" farther down the road.
Then again, maybe keepalive.py and progress_meter.py should be
separate!
--
Michael Stenner Office Phone: 919-660-2513
Duke University, Dept. of Physics mstenner at phy.duke.edu
Box 90305, Durham N.C. 27708-0305