How to archive a website in a future-proof way (involves PDF hybrid)

evenwicht · edit-2 5 months ago

evenwicht · 5 months ago

It’s perhaps the best way for someone who has a good handle on it. The docs say it “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like video and audio. If you get a well-tuned command worked out, that would be quite useful. But I do see a couple of shortcomings nonetheless:

If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.

But those issues aside I like the fact that wget does not rely on a plugin.
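
For what it’s worth, a tuned command along those lines might look like the sketch below. It only builds and prints the wget argument list, so nothing is downloaded until you run the result yourself. The function name, the example URL, the list of rejected suffixes, and the one-second wait are all my own assumptions, not something tested against a particular site.

```python
import shlex

# Assumed list of media suffixes to skip; adjust to taste.
MEDIA_SUFFIXES = ["mp4", "webm", "mkv", "avi", "mov", "mp3", "ogg", "flac", "wav"]


def mirror_command(url, reject=MEDIA_SUFFIXES, wait_seconds=1):
    """Return a wget argument list for a polite, page-focused mirror."""
    return [
        "wget",
        "--mirror",             # shorthand for -r -N -l inf --no-remove-listing
        "--page-requisites",    # also fetch the CSS, images, and scripts a page needs
        "--convert-links",      # rewrite links so the copy can be browsed offline
        "--adjust-extension",   # add .html/.css suffixes where the content type calls for it
        "--no-parent",          # never climb above the starting directory
        "--reject=" + ",".join(reject),  # skip files with these suffixes
        f"--wait={wait_seconds}",        # pause between requests
        "--random-wait",                 # vary the pause a little
        url,
    ]


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(shlex.join(mirror_command("https://example.org/blog/")))
```

The printed line can be pasted into a shell as-is. Dropping --no-parent or shrinking the reject list is the obvious tuning if pages come out missing pieces.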

CanadaPlus · edit-2 5 months ago

I find that the things most likely to disappear (like a tinkerer’s web 1.0 homepage) tend to have limited recursion depth anyway.

A Tumblr blog takes an awfully long time to crawl politely, IIRC, but the end result wasn’t too big on disk. Now I’m wondering how you would pass a cookie to wget, and how you might set a data cap so you can stop and wait for the month to be up before you call it again. I kind of feel like I’ve done a cookie before to get around a captcha or something…

Edit: There are a couple of ideas for limiting size on StackOverflow. The wget-specific one is -Q for quota, which you’d want to set conservatively in case there’s one huge file somewhere, since it only checks between individual downloads.

Looks like there’s a --load-cookies option that will read a browser export of cookies from a file, as well as options to send POST data and save cookies if you want to do something interactive that way.

Edit edit: What I’m remembering is actually adding headers, like this.
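
Putting the quota, cookie, and header ideas together, a rough sketch could look like this. It just assembles the argument list. The function name, the 200m quota, the cookie file name, and the header value are placeholders I made up, so treat it as a starting point rather than a tested recipe.

```python
import shlex


def capped_wget(url, quota="200m", cookie_file=None, headers=()):
    """Build a wget argument list with a download quota and optional session data."""
    args = [
        "wget",
        "--recursive",
        "--timestamping",     # on a rerun, skip files that are not newer on the server
        f"--quota={quota}",   # only checked between files, so one big file can overshoot
    ]
    if cookie_file is not None:
        # wget expects the Netscape cookie file format here.
        args.append(f"--load-cookies={cookie_file}")
    for header in headers:
        # Each entry is a full header line such as "Name: value".
        args.append(f"--header={header}")
    args.append(url)
    return args


if __name__ == "__main__":
    # All values below are hypothetical.
    cmd = capped_wget(
        "https://example.org/",
        quota="50m",
        cookie_file="cookies.txt",
        headers=["Accept-Language: en"],
    )
    print(shlex.join(cmd))
```

Since the quota is per invocation, the stop-and-wait approach would be to rerun the same command next month and let --timestamping skip what is already on disk and unchanged.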

evenwicht · edit-2 5 months ago

wget has a --load-cookies file option. It wants the original Netscape cookie file format. Depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file then build the expected format around it. I don’t recall the circumstances.
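
Building the expected format around a session ID can be sketched roughly like this. The Netscape format is one cookie per line with seven tab-separated fields: domain, a subdomain flag, path, a secure flag, expiry as a Unix timestamp, name, and value. The function names, the cookie name, and the one-day lifetime below are my own assumptions for illustration.

```python
import time


def netscape_cookie_line(domain, name, value, path="/", secure=True,
                         lifetime_seconds=86400):
    """Format one cookie as a line of a Netscape cookie file."""
    # A leading dot on the domain means the cookie also applies to subdomains.
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    expires = int(time.time()) + lifetime_seconds
    return "\t".join([
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(expires),
        name,
        value,
    ])


def write_cookie_file(filename, lines):
    """Write cookie lines under the header that marks the Netscape format."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write("# Netscape HTTP Cookie File\n")
        for line in lines:
            handle.write(line + "\n")
```

For example, write_cookie_file("cookies.txt", [netscape_cookie_line(".example.org", "sessionid", "abc123")]) should give a file that can be passed to --load-cookies, with the real cookie name and value copied out of the browser.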

Another problem: some anti-bot mechanisms crudely look at user-agent headers and block curl attempts on that basis alone.

(edit) when cookies are not an issue, wkhtmltopdf is a good way to get a PDF of a webpage. So you could have a script do a wget to get the HTML faithfully, and wkhtmltopdf to get a PDF, then pdfattach to put the HTML inside the PDF.
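
A rough sketch of such a script, assuming wget, wkhtmltopdf, and pdfattach (from poppler-utils) are all on the PATH. The function names and the output file naming are my own choices, and the wget step here saves only the page’s HTML, not its images or stylesheets.

```python
import subprocess


def hybrid_commands(url, stem="page"):
    """Return the three commands that produce a PDF with the HTML attached."""
    html = f"{stem}.html"
    pdf = f"{stem}.pdf"
    hybrid = f"{stem}-hybrid.pdf"
    return [
        # 1. Save the HTML as served.
        ["wget", f"--output-document={html}", url],
        # 2. Render the same URL to a PDF.
        ["wkhtmltopdf", url, pdf],
        # 3. Embed the HTML in the PDF: input PDF, file to attach, output PDF.
        ["pdfattach", pdf, html, hybrid],
    ]


def build_hybrid(url, stem="page"):
    """Run the commands in order, stopping at the first failure."""
    for command in hybrid_commands(url, stem):
        subprocess.run(command, check=True)
```

Calling build_hybrid("https://example.org/", stem="example") would leave example.html, example.pdf, and example-hybrid.pdf in the current directory. The attachment can later be pulled back out with a tool like pdfdetach.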

(edit2) It’s worth noting there is a project called curl-impersonate which makes curl look more like a GUI browser to get more equal treatment. As far as I know it does this by matching a browser’s TLS and HTTP handshake rather than by adding a JavaScript engine.

CanadaPlus · 5 months ago

Ah, looks like you beat my edit by a few seconds.

Good to know about the Netscape thing. It looks like Firefox (still, being a successor to NS) does it that way, and Chrome can do it that way. If you’re using a true third option you probably don’t need my help.

For the sake of completeness, on Tor Browser you have to copy the SQLite database from the browser directory, since it’s too locked down to just export the normal way. Then I’d try just subbing it in on an offline Firefox instance and proceeding the normal way. And obviously, use wget over torsocks as well.

MAFF (a shit-show, unsustained)

MHTML (shit-show due to non-portable browser-dependency)

PDF (lossy)

PDF+MHTML hybrid

We need to evolve

(update) The goals