Ubuntu
urlscan package

urlscan does not work on HTML fragments

Bug #1930437 reported by Bill Yikes on 2021-06-01

This bug affects 1 person

Affects           Status  Importance  Assigned to  Milestone
urlscan (Ubuntu)  New     Undecided   Unassigned

Bug Description

This yields no output:

curl -s 'https://www.veridiancu.org' | sed -ne '/<form/,/<\/form/p' | urlscan -n

Without the sed filter, urlscan works, but it then dumps every URL in the whole document. It seems urlscan was designed to work only on whole documents, so perhaps this is a feature request rather than a "bug".

The usual workaround would be to use urlview instead, but urlview only works interactively. Perhaps the fix here is for urlscan to add an option such as --fuzzyhtml and reuse the guts of urlview to do the processing.

(edit)

This workaround works for urlscan:

curl -s 'https://www.veridiancu.org' | python -c 'from bs4 import BeautifulSoup; import sys; print(BeautifulSoup(sys.stdin.read()).form)' | urlscan -n

which might give a clue about what the problem is.
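
For anyone who wants the same effect without installing bs4, the workaround above can be approximated with the standard library alone. The following is only a rough sketch, not how urlscan or urlview work internally: the names FormURLCollector, form_urls and URL_ATTRS are made up for illustration, and it assumes that the URLs of interest live in href, src and action attributes inside a form element. Python's html.parser tolerates fragments, so it does not need a complete document.

```python
from html.parser import HTMLParser

# Assumption: these are the attributes worth reporting as URLs.
URL_ATTRS = {"href", "src", "action"}


class FormURLCollector(HTMLParser):
    """Collect URL-bearing attribute values found inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <form> tags are currently open
        self.urls = []   # attribute values collected so far, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.depth += 1
        # Only record attributes while we are inside at least one <form>.
        if self.depth > 0:
            for name, value in attrs:
                if name in URL_ATTRS and value:
                    self.urls.append(value)

    def handle_endtag(self, tag):
        if tag == "form" and self.depth > 0:
            self.depth -= 1


def form_urls(html_text):
    """Return the URLs found inside <form> elements of an HTML fragment."""
    parser = FormURLCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

Calling form_urls(sys.stdin.read()) and printing one value per line would give a non-interactive filter comparable to the pipeline above, although relative URLs are returned as written and are not resolved against the page address.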

Bill Yikes (yik3s) on 2021-06-01: description updated
