Web monitor should use modified headers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
specto (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
Binary package hint: specto
I'm using Specto 0.2.2 in Ubuntu Gutsy.
According to the Google Specto docs, the web monitor uses file size (presumably the "Content-Length" HTTP header, though perhaps it checks the string length internally) to know whether a page has changed. Ideally, the "Last-Modified" HTTP header should be used to determine whether a page has been modified. For example, a page containing a table of data whose values change but whose size does not will report the same content length, yet it should still be returned with an updated "Last-Modified" header.
Further, it is recommended to use the "If-Modified-Since" header to minimize data transfer. Use of the ETag header may be necessary for cases involving proxies and caching servers.
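As an illustration of the request flow suggested above, here is a minimal sketch of a conditional GET in Python 3 using only the standard library. It is not Specto's code: the function names `build_conditional_request` and `fetch_if_changed`, and the idea of passing the previously seen headers in as a plain dict, are assumptions made for the example.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, cached_headers):
    """Build a GET request that revalidates against previously seen headers.

    cached_headers is a hypothetical dict of the response headers saved
    from the last successful fetch (header names matched case-insensitively).
    """
    lowered = {name.lower(): value for name, value in cached_headers.items()}
    request = urllib.request.Request(url)
    if "last-modified" in lowered:
        # Ask the server to reply 304 if the page is unchanged since then.
        request.add_header("If-Modified-Since", lowered["last-modified"])
    if "etag" in lowered:
        # Ask the server to reply 304 if the entity tag still matches.
        request.add_header("If-None-Match", lowered["etag"])
    return request


def fetch_if_changed(url, cached_headers):
    """Return (changed, body, headers); a 304 reply means nothing changed."""
    request = build_conditional_request(url, cached_headers)
    try:
        with urllib.request.urlopen(request) as response:
            return True, response.read(), dict(response.headers)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the cached headers, no body was transferred.
            return False, None, cached_headers
        raise
```

With this shape, an unchanged page costs only a header exchange, and a changed page returns fresh "Last-Modified" and "ETag" values to store for the next check.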
References:
http://
http://
http://
Hi! For what it's worth, Specto actually already uses this, I *think*. In the 0.2.2 series, if you look at the code in watch_web_static.py, you can see in lines 109-121:
    if (self.cached == 1) or (os.path.exists(self.cacheFullPath_)):
        self.cached = 1
        f = file(self.cacheFullPath_, "r")# Load up the cached version
        self.infoB_ = HTTPMessage(f)
        if self.infoB_.has_key('last-modified'):
            request.add_header("If-Modified-Since", self.infoB_['last-modified'])
        if self.infoB_.has_key('ETag'):
            request.add_header("If-None-Match", self.infoB_['ETag'])
    try:
        response = urllib2.urlopen(request)
    except (urllib2.URLError, BadStatusLine), e:
        self.error = True
        self.specto.logger.log(_("Watch: \"%s\" has error: ") % self.name + str(e), "error", self.__class__)
However,
- the code might not be elegant
- the code/logic might be wrong (after all, I took a long time doing it and I'm not sure I did it properly)
I don't know whether that conforms to your suggestions, or whether you meant that
- some piece is missing?
- something is not working properly?
Also, I think the ETag headers may not work properly with websites that use advertising/dynamic content, so, if I remember my own code correctly, the "error margin" (difference percentage based on file sizes) would override it.
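To make the "error margin" idea concrete, here is a rough sketch of a size-based percentage check. It is only a guess at the general approach, not Specto's actual logic; the function name `exceeds_error_margin` and its parameters are hypothetical.

```python
def exceeds_error_margin(old_size, new_size, margin_percent):
    """Return True if the size difference is larger than the allowed margin.

    Sizes are byte counts of the cached and freshly fetched page;
    margin_percent is the tolerated change, e.g. 2.0 for 2%.
    """
    if old_size == 0:
        # No meaningful percentage for an empty cached page:
        # treat any non-empty result as a change.
        return new_size != 0
    difference_percent = abs(new_size - old_size) / old_size * 100.0
    return difference_percent > margin_percent
```

A check like this tolerates small fluctuations such as rotating ads, but by construction it cannot see a change that leaves the size the same, which is the case the original report is concerned with.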