Bug#792025: [uscan] Doesn't strip terminal download.html, fails fetching .../download.html/download/...

Adam D. Barratt adam at adam-barratt.org.uk
Fri Jul 10 12:58:31 UTC 2015


On Fri, 2015-07-10 at 12:05 +0100, Barak A. Pearlmutter wrote:
> $ cat debian/watch
> version=3
> https://www.fossil-scm.org/download.html \
>  .*/fossil-src-(\d*\.\d*)\.(?:zip|tgz|tbz|txz|(?:tar\.(?:gz|bz2|xz)))
[...]
> Note the URL it tries to fetch is
> https://www.fossil-scm.org/download.html/download/fossil-src-1.33.tar.gz
> which does not have the "download.html" stripped out.  The URL looks
> fine in a browser, and (as also shown in the above transcript) the
> page source reads href="download/fossil-src-..."
> 
> Mystified!

Inspection of the download page reveals:

    <base href="https://www.fossil-scm.org/download.html" />

uscan(1) says:

    If any of the hrefs in the homepage which match the (anchored) pattern
    are relative URLs, they will be taken as being relative to the base URL
    of the homepage (i.e., with everything after the trailing slash
    removed), or relative to the base URL specified in the homepage itself
    with a <base href="..."> tag.

I think the behaviour is arguably slightly broken here:
https://www.w3.org/wiki/HTML/Elements/base implies that the last
component of the base URL shouldn't be included if it's a document
rather than a directory, although I couldn't spot that being explicitly
specified on that page at least.
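
To make the difference concrete, here is a minimal Python sketch that
contrasts two ways of resolving the relative href against the <base>
URL. The helper naive_join is hypothetical: it simply reproduces the
URL from the report by treating the whole base as a directory, and is
not taken from uscan's source. The second helper uses the standard
library's urljoin, which drops the last path component of the base
before appending the relative reference.

```python
from urllib.parse import urljoin

# Values taken from the bug report.
BASE = "https://www.fossil-scm.org/download.html"
HREF = "download/fossil-src-1.33.tar.gz"


def naive_join(base: str, href: str) -> str:
    """Hypothetical stand-in: treat the entire base URL as a directory."""
    return base.rstrip("/") + "/" + href


def standard_join(base: str, href: str) -> str:
    """Resolve href against base, dropping the base's last path component."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(naive_join(BASE, HREF))     # the URL uscan tried to fetch
    print(standard_join(BASE, HREF))  # the URL a browser would fetch
```

The first result matches the failing URL quoted above; the second is
the one that works in a browser.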

In any case, using the documented mangle facilities seems to work okay,
giving:

version=3
opts="downloadurlmangle=s#/download.html##" \
https://www.fossil-scm.org/download.html \
 .*/fossil-src-(\d*\.\d*)\.(?:zip|tgz|tbz|txz|(?:tar\.(?:gz|bz2|xz)))
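
As a rough illustration of what that rule does, the following Python
sketch applies an equivalent substitution to the URL from the report
and checks that the watch pattern extracts the version. The function
names are made up for this example, and re.sub only approximates the
Perl-style s#...## rule; it is not how uscan itself is implemented.

```python
import re

# The URL uscan constructed, as quoted in the bug report.
BAD_URL = "https://www.fossil-scm.org/download.html/download/fossil-src-1.33.tar.gz"

# The matching pattern from the watch file, copied unchanged.
WATCH_PATTERN = r".*/fossil-src-(\d*\.\d*)\.(?:zip|tgz|tbz|txz|(?:tar\.(?:gz|bz2|xz)))"


def mangle_download_url(url: str) -> str:
    """Approximate s#/download.html## by removing the first match only."""
    return re.sub(r"/download.html", "", url, count=1)


def extract_version(href: str):
    """Return the version captured by the anchored watch pattern, or None."""
    match = re.fullmatch(WATCH_PATTERN, href)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(mangle_download_url(BAD_URL))
    print(extract_version("download/fossil-src-1.33.tar.gz"))
```

The mangled URL no longer contains the stray "download.html"
component, and the version captured from the href is 1.33.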

Regards,

Adam



More information about the devscripts-devel mailing list