Web archiving
As we all know, human civilization must be digitized so that, with one stroke, it can be deleted.
– Anon
Why
On the Importance of Web Archiving
https://items.ssrc.org/parameters/on-the-importance-of-web-archiving
Where incredible journeys end…
Site deaths are when sites go offline, taking content and permalinks with them, and breaking the web accordingly.
Site deaths are one of the big reasons why you should own your own identity and content on the web:
https://indieweb.org/site-deaths
How to Properly Archive Your Digital Files
Posted Jul 14 2024
https://wired.com/story/how-to-properly-archive-your-digital-files
Will you be able to open today’s Word docs in 20 years? Probably not, unless you take some necessary steps to give those digital files an extra-long shelf life.
Updates to software like Microsoft Word mean that files that opened fine in the '80s don’t open in the 2020s. Part of the problem: Microsoft, and only Microsoft, controls the file format, or even knows how it works. For this reason, Stuchell says he encourages people to export files in an open file format—especially files they want to keep accessible for the long term.
Basically, if a file on your computer can only be opened by a specific piece of software, and that software is controlled by a single company, you should probably export it to an open format. It’s the only way to future-proof it.
Keep in mind that some photos, especially RAW files from your camera, might be in a proprietary format.
“You have to be careful because many cameras default to their own version of RAW, which is highly proprietary,” Stuchell says. He recommends exporting such photos to an open format called Digital Negative (DNG), which is a safer format to use for preserving RAW files.
Inside the online communities trying to preserve our digital memories
Text Günseli Yalcinkaya, posted 8th January 2024
https://dazeddigital.com/life-culture/article/61689/1/youtube-comments-online-communities-preserve-digital-memories-checkpoints
Ruby Justice Thelot’s A Cyberarchaeology of Checkpoints pays homage to the personal stories shared in the comments section of YouTube videos
The online data that’s being deleted. For years, we were encouraged to store our data online…
By Chris Baraniuk, posted 15th July 2021
https://bbc.com/future/article/20210715-the-online-data-thats-being-deleted
But it’s become increasingly clear that this won’t last forever - and now the race is on to stop our memories being deleted
Delete Never: The digital hoarders who collect Tumblrs, medieval manuscripts, and Terabytes of text files
By Steven Melendez, published March 4 2019
https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
We are not alone: Progress in the digital preservation community
By Ricc Ferrante, posted October 2 2018
https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community
File formats
EPUB
Out of all the formats, this is probably the least appropriate for storing web pages, but it’s useful for storing long articles or documentation.
EPUB is an e-book file format supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
The EPUB format is implemented as an archive file consisting of XHTML files carrying the content, along with images and other supporting files.
EPUB is the most widely supported vendor-independent, XML-based e-book format (as opposed to PDF); it is supported by almost all hardware readers, except the Kindle.
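Because an EPUB is just a ZIP archive with a fixed layout, a minimal one can be assembled with nothing but the standard library. A sketch under stated assumptions: the `make_minimal_epub` name, the hard-coded identifier, and the metadata values are illustrative, not taken from any tool; only the container layout (an uncompressed `mimetype` entry first, then `META-INF/container.xml`, a package document, and the XHTML content) reflects the format itself.

```python
import zipfile

def make_minimal_epub(path, title, xhtml_body):
    """Write a minimal EPUB: a ZIP whose first entry is an uncompressed
    'mimetype' file, plus container.xml, an OPF package doc, and the content."""
    container = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""
    opf = f"""<?xml version="1.0"?>
<package version="3.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">urn:uuid:demo</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine><itemref idref="ch1"/></spine>
</package>"""
    xhtml = f"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>{title}</title></head>
<body>{xhtml_body}</body></html>"""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and must be stored uncompressed,
        # so readers can sniff the file type without unzipping.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", container)
        z.writestr("OEBPS/content.opf", opf)
        z.writestr("OEBPS/ch1.xhtml", xhtml)
```

Real tools (and the spec) add a navigation document and richer metadata; this only shows why the format is so easy to inspect: unzip it and you get plain XHTML.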
HAR
HAR is a JSON-formatted archive file format for logging of a web browser’s interaction with a site.
Because it is used for troubleshooting website issues (slow page load, timeout when performing certain tasks, page rendering issues) the format is quite verbose, because it contains a lot of metadata not related to the actual page content.
HAR can be exported from any web browser, from “Web developer tools > Network”. Browsers also allow importing HAR files to see the request/ response log.
HAR is suitable for archiving web pages ONLY if it contains the actual response content, not just the headers and metadata!
A huge CON of the HAR format is that I couldn’t find any viewer that renders the archived page with its resources, the way you can with HTML, WARC, or EPUB.
The format tree looks something like:
- log
- browser = {'name':'Firefox', 'version':'90.0'} - only in Firefox
- creator = {'name':'Firefox', 'version':'90.0'}
- entries
- cache
- connection
- pageref = 'page_X'
- request
- bodySize, cookies, headers, headersSize, httpVersion, method, queryString, url, ...
- response
- bodySize, content, cookies, headers, headersSize, httpVersion, redirectURL, status, statusText,...
- serverIPAddress = '[xxxx:xx:x:x:x:xxxx:xxxx:x]'
- startedDateTime = '2021-07-26Txx:...'
- time = 6.715
- timings
- pages
- id = 'page_X'
- pageTimings = {'onContentLoad': 2518, 'onLoad': 3018}}
- startedDateTime = '2021-07-26Txx:...'
- title = 'Some home page'
- version = 1.2
File size for page https://crlf.link:
- Firefox: 518,665 (crlf.link-ff.har)
- Chromium: 522,162 (crlf.link-chr.har)
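Since HAR is plain JSON following the tree above, checking whether an export actually captured the response bodies (and pulling them out) needs only the standard library. A minimal sketch; the `extract_har_bodies` name and the example file name are hypothetical:

```python
import base64
import json

def extract_har_bodies(har_path):
    """Yield (url, bytes) for every HAR entry whose response body was
    actually captured; entries with only headers/metadata are skipped."""
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # body not exported -- useless for archiving
        if content.get("encoding") == "base64":
            # binary resources (images, fonts) are base64-encoded in HAR
            yield entry["request"]["url"], base64.b64decode(text)
        else:
            yield entry["request"]["url"], text.encode("utf-8")

# usage sketch: dict(extract_har_bodies("crlf.link-ff.har"))
```

If this yields nothing for a page you just saved, the browser exported a metadata-only HAR and it is not archive-worthy.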
- http://softwareishard.com/blog/har-viewer
- https://ourpcgeek.com/how-to-generate-har-file
- https://w3c.github.io/web-performance/specs/HAR/Overview.html
- https://wikipedia.org/wiki/HAR_(file_format)
HAR tools:
- https://github.com/exogen/node-fetch-har
- https://github.com/janodvarko/harviewer
- https://github.com/mrichman/hargo – Go library and CLI that dumps and loads HAR files
- https://github.com/thameera/harcleaner – clean noisy requests from HAR files
- https://micmro.github.io/PerfCascade
- https://toolbox.googleapps.com/apps/har_analyzer
HTML ⭐
All browsers support saving a web page as HTML, but they save all the resources (CSS, JS, images) in a separate folder, which is pretty inconvenient.
There are lots of tools that export HTML pages by inlining all resources inside the HTML, which makes the file portable.
This is probably the best format for archiving a single page.
Tools:
- https://github.com/danburzo/percollate – CLI app for turning web pages into beautiful, readable PDF, EPUB, or HTML
- https://github.com/gildas-lormeau/SingleFile – Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file
- https://github.com/go-shiori/obelisk – CLI app, inspired by Monolith, for saving web pages as single HTML file
- https://github.com/wabarc/cairn – Node.js CLI tool for saving web pages, inspired by Obelisk/ Monolith
- https://github.com/WebMemex/freeze-dry – store a web page as shown in the browser, after having inlined all external resources
- https://github.com/Y2Z/monolith – CLI app for saving complete web pages as a single HTML file
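The core trick all these single-file savers share is replacing each external resource reference with an inline `data:` URI. A toy sketch of that idea, assuming a caller-supplied `fetch(url)` that returns `(mime_type, bytes)`; the `inline_images` name is hypothetical, and real tools like SingleFile handle far more than `<img>` tags (CSS, fonts, iframes, srcset):

```python
import base64
import re

def inline_images(html: str, fetch) -> str:
    """Replace each <img src="..."> URL with a base64 data: URI,
    making the HTML self-contained. fetch(url) -> (mime_type, bytes)."""
    def repl(match):
        url = match.group(1)
        mime, data = fetch(url)
        b64 = base64.b64encode(data).decode("ascii")
        # swap the URL inside the matched tag for the inline data: URI
        return match.group(0).replace(url, f"data:{mime};base64,{b64}")
    return re.sub(r'<img[^>]+src="([^"]+)"', repl, html)
```

A regex is enough for a sketch; production tools parse the DOM instead, precisely because HTML in the wild doesn’t cooperate with regexes.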
MAFF
MAFF files are standard ZIP files containing one or more web pages, images, or other downloadable content.
Additional metadata, like the original page address, is saved along with the content.
Unlike the related MHTML format, MAFF is compressed and particularly suited for large media files.
MAFF files no longer open in newer (post-2017) versions of Firefox, but they are supported in Cyberfox and Waterfox, forks of Firefox that try to keep features removed from Firefox, like the traditional extension API.
Features:
- store web content in a single file
- store multiple independent pages in the same archive, eg: all open tabs in a single file
- based on ZIP / JAR format - pages in MAFF files are compressed, not encoded
- suited for video and audio files
- store metadata about the saved resources, like the original location from which the page was saved, as well as the date and time of the save operation
- store arbitrary extended metadata, like scroll position in the page, text zoom level, etc
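Because MAFF is a plain ZIP with one top-level folder per saved page (each holding an `index.html` plus its resources and metadata), listing the pages in an archive takes only the standard library. A sketch; the `list_maff_pages` name is hypothetical:

```python
import zipfile

def list_maff_pages(maff_file):
    """Return the top-level folder names in a MAFF archive;
    each folder corresponds to one saved page (e.g. one browser tab)."""
    with zipfile.ZipFile(maff_file) as z:
        return sorted({name.split("/")[0]
                       for name in z.namelist() if "/" in name})
```

To actually read a page you would then extract `<folder>/index.html` (and `<folder>/index.rdf` for the saved metadata) with `ZipFile.read`.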
- https://amadzone.org/mozilla-archive-format
- https://wikipedia.org/wiki/Mozilla_Archive_Format
MHTML
MHTML combines, in a single file, the HTML code and its companion resources (such as images, audio, and video files) that are referenced by external links in the web page’s HTML code.
The content of an MHTML file is encoded using the same techniques that were first developed for HTML email messages, using the MIME content type multipart/related.
The .mhtml (web archive) and .eml (email) filename extensions are interchangeable: either can be renamed to the other. An .eml message can be sent by email and displayed by an email client. Conversely, an email message can be saved with a .mhtml or .mht extension, then opened in a web browser for display or in other programs for editing.
Some browsers support the MHTML format, directly or through third-party extensions, but the process for saving a web page along with its resources as an MHTML file is not standardized. This means a web page saved as an MHTML file in one browser may render differently in another.
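Since MHTML reuses the MIME machinery of HTML email, Python’s `email` module can assemble one. A minimal sketch, assuming a caller-supplied map of resource URLs to their MIME type and bytes; the `make_mhtml` name is hypothetical, and real savers add more headers (e.g. a `Content-Location` on the root part, `Snapshot-Content-Location`) than shown here:

```python
from email.message import EmailMessage

def make_mhtml(html: str, resources: dict) -> bytes:
    """Bundle a page and its resources into one multipart/related MIME
    message -- the structure MHTML uses.
    resources maps original URL -> (maintype, subtype, raw bytes)."""
    msg = EmailMessage()
    msg["Subject"] = "Archived page"
    msg.set_content(html, subtype="html")
    for location, (maintype, subtype, data) in resources.items():
        # Content-Location ties each part back to the URL it was fetched from,
        # so relative links in the HTML resolve against the archived parts.
        msg.add_related(data, maintype=maintype, subtype=subtype,
                        headers=[f"Content-Location: {location}"])
    return bytes(msg)
```

The output can be saved with a `.mhtml` extension and opened in a browser that supports the format, or with `.eml` and opened in a mail client, which is exactly the interchangeability described above.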
- https://docs.fileformat.com/web/mhtml
- https://fileinfo.com/extension/eml – EML files
- https://github.com/erikbrinkman/mhtml-stream – parsing MHTML file streams
- https://github.com/nodemailer/mailparser – EML/ MHTML parser
- https://github.com/testimio/mhtml-parser – fast MHTML parser in Node.js
- https://lifewire.com/mht-file-4140714 – MHT files
- https://mhonarc.org – mail-to-HTML converter
- https://wikipedia.org/wiki/MHTML
PDF
Saving web pages as PDF is best for long articles or documents.
Not optimal for other websites, because the page is not interactive anymore, so you can’t click on things, play videos, play audio, etc.
Besides that, PDFs are split into pages, which interrupts the flow of the content and can make it harder to understand.
Regular PDFs are also harder to search on disk, because the text inside the file is laid out for printing, not for searching.
- https://addons.mozilla.org/en-US/firefox/addon/save-pdf – Save page as PDF, Firefox add-on
- https://hanzo.co/blog/web-archiving-for-compliance-101-the-pros-and-cons-of-pdfs
Replay.io
Replay is a browser that records all of the context you need to fix bugs faster
Replays include the entire session and are typically smaller than a video
CON: it’s a technical tool, not designed for users
- https://replay.io
- https://twitter.com/replayio
- https://github.com/RecordReplay – some of the tools are open-source
rrWeb
Record and replay a web session
Open-source web session replay library, which provides easy-to-use APIs to record a user’s interactions and replay them remotely
The format is JSON, a very deep tree of nodes and properties
CON: it’s a very technical tool, not user friendly
- https://rrweb.io
- https://github.com/rrweb-io/rrweb – record + play
- https://github.com/rrweb-io/rrvideo – convert record into video
WARC
The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
WARC was developed as an extension of the ARC format, in part to provide better capabilities for managing web archives over the long term, allowing the capture of more metadata about the circumstances of archiving.
It generalizes older formats to better support the harvesting, access, and exchange needs of archiving organizations.
The WARC format is inspired by HTTP/1.0 streams, with a similar header and the use of CRLFs as delimiters, making it very conducive to crawler implementations.
Files are often compressed using GZIP, resulting in a .warc.gz extension.
A WARC is a container file that includes and wraps around other files like JS, CSS, PDF or MP3, along with some additional information and formatting.
It concatenates several files into one object, like other container formats (eg: TAR, ZIP, or RAR).
The WARC container also contextualizes those files with technical and provenance metadata about how they were collected and arranged, so sites can be read and rendered in live web browsing experiences as they were at the time of their collection.
The ARC file was the Internet Archive’s original container file for web-native resources. The WARC standard was formalized in 2009 to include more detailed technical metadata.
The international standard defines 8 WARC record types:
- warcinfo
- response
- resource
- request
- metadata
- revisit
- conversion
- continuation
A WARC record contains:
- header
- content block
- two newlines
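The record layout above (HTTP/1.0-style header, content block, two CRLF pairs as separator) can be serialized by hand to see why the format is so crawler-friendly. A minimal sketch of a single `response` record; the `write_warc_response` name is hypothetical, and production code should use a library like `warcio` from the list below:

```python
import uuid
from datetime import datetime, timezone

def write_warc_response(url: str, http_payload: bytes) -> bytes:
    """Serialize one WARC 1.1 'response' record: a header of named fields,
    the content block, then two CRLF pairs ending the record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = (
        "WARC/1.1\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {now}\r\n"
        f"WARC-Target-URI: {url}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"  # blank line separates the header from the content block
    )
    return header.encode("utf-8") + http_payload + b"\r\n\r\n"
```

A `.warc` file is just a concatenation of such records (usually starting with a `warcinfo` record), which is also why per-record GZIP compression to `.warc.gz` works so naturally.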
Libraries and apps
- https://github.com/alard/megawarc – Python nondestructive warc-in-tar to warc conversion
- https://github.com/ArchiveTeam/wpull – Python Wget-compatible web downloader and crawler that saves WARC
- https://github.com/chatnoir-eu/chatnoir-resiliparse – FastWARC high-performance WARC parsing library written in C++/Cython
- https://github.com/chfoo/warcat – Python handle WARC files: list, concat, split, extract
- https://github.com/datacoon/metawarc – CLI Python metadata extraction from files from WARC
- https://github.com/eugeneware/warc – Node.js WARC parser
- https://github.com/internetarchive/scrapy-warcio – Scrapy WARC I/O
- https://github.com/internetarchive/warc – Python reading and writing WARC files
- https://github.com/internetarchive/warctools – CLI & libs for handling and manipulating WARC
- https://github.com/jsvine/waybackpack – download the entire Wayback Machine archive for a given URL
- https://github.com/machawk1/warcreate – Chrome extension to create WARC files from any webpage
- https://github.com/N0taN3rd/node-warc – Node.js parse and create WARC
- https://github.com/netarchivesuite/solrwayback – search interface and wayback machine for the UKWA Solr based WARC-indexer (Java)
- https://github.com/odie5533/WarcMiddleware – Scrapy WarcMiddleware to seamlessly download a mirror copy of a website
- https://github.com/richardlehane/warcmount – mount a WARC file in a virtual FS for browsing
- https://github.com/sepastian/warc2corpus – Python extract structured data from HTML pages in WARCs w CSS selectors
- https://github.com/steffenfritz/html2warc – Python convert web resources to a single WARC file
- https://github.com/tomnomnom/waybackurls – fetch all URLs that Wayback Machine knows about for a domain
- https://github.com/Vikasg7/warc-reader – Node.js ES6 .warc or .warc.gz file reader
- https://github.com/Vikasg7/warc-stream – Node.js ES6 warc stream
- https://github.com/webrecorder/har2warc – Python HAR to WARC
- https://github.com/webrecorder/replayweb.page – Web Archive replay in the browser
- https://github.com/webrecorder/wacz-format – WACZ format
- https://github.com/webrecorder/warcio.js – WARC IO optimized for browser and Node.js
- https://github.com/webrecorder/warcio – Python WARC (and ARC) read and write library
- https://github.com/webrecorder/warcit – CLI convert directories of web documents into WARC
- https://warcreate.com – WARCreate Chrome extension
- https://webrecorder.net/2021/01/18/archiveweb-page-extension.html – Chrome extension to record and save WACZ (compressed warcs collection)
- wget – saves pages as WARC, but it doesn’t save the resources
- etc, etc.
Links
- https://archive-it.org/blog/post/the-stack-warc-file
- https://commoncrawl.org/2014/04/navigating-the-warc-file-format
- https://fileformats.archiveteam.org/wiki/WARC
- https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1
- https://inkdroid.org/2016/04/14/warc-work – working with WARC
- https://taricorp.net/2016/web-history-warc
- https://webrecorder.net/2021/01/18/wacz-format-1-0.html – WACZ format = Web Archive Collection Zipped
- https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
- https://wiki.archiveteam.org/index.php/Wget_with_WARC_output
- https://wikipedia.org/wiki/Internet_Archive
- https://wikipedia.org/wiki/Web_ARChive
- https://wikipedia.org/wiki/Web_archiving
ZIM
The ZIM file format is used to store massive websites, like Wikipedia; 10GB+ files are fine.
A ZIM archive can then be opened in a reader and browsed offline,
eg from: https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
- https://chrome.kiwix.org – Chrome offline wikipedia.zim reader
- https://firefox.kiwix.org – Firefox offline wikipedia.zim reader
- https://github.com/birros/web-archives – web archives reader
- https://github.com/kimbauters/ZIMply – offline reader for ZIM file, through ordinary browsers
- https://github.com/kiwix/kiwix-desktop – viewer/ manager of ZIM files
- https://github.com/kiwix/kiwix-tools – collection of tools
- https://github.com/kymeria/pyzim-tools – introspect ZIM files
- https://github.com/mkiol/Zimpedia – ZIM repository offline reader
- https://github.com/openzim/node-libzim – Node.js read & write ZIM files
- https://github.com/openzim/python-libzim – Python read & write ZIM files
- https://github.com/openzim/warc2zim – Python WARC files to ZIM
- https://github.com/openzim/zim-tools – ZIM check, dump, export
- https://gitlab.com/jojolebarjos/zimscan – minimal ZIM file reader, designed for article streaming
- https://openzim.org/wiki/Build_your_ZIM_file
- https://openzim.org/wiki/Readers
- https://openzim.org/wiki/ZIM_file_format