Web archiving
As we all know, human civilization must be digitized so that, with one stroke, it can be deleted.
– Anon
Why
On the Importance of Web Archiving
https://items.ssrc.org/parameters/on-the-importance-of-web-archiving
Where incredible journeys end…
Site deaths are when sites go offline, taking content and permalinks with them, and breaking the web accordingly.
Site deaths are one of the big reasons why you should own your own identity and content on the web:
https://indieweb.org/site-deaths
How to Properly Archive Your Digital Files
Posted Jul 14 2024
https://wired.com/story/how-to-properly-archive-your-digital-files
Will you be able to open today’s Word docs in 20 years? Probably not, unless you take some necessary steps to give those digital files an extra-long shelf life.
Updates to software like Microsoft Word mean that files that opened fine in the '80s don’t open in the 2020s. Part of the problem: Microsoft, and only Microsoft, controls the file format, or even knows how it works. For this reason, Stuchell says he encourages people to export files in an open file format—especially files they want to keep accessible for the long term.
Basically, if a file on your computer can only be opened by a specific piece of software, and that software is controlled by a single company, you should probably export it to an open format. It’s the only way to future-proof it.
Keep in mind that some photos, especially RAW files from your camera, might be in a proprietary format.
“You have to be careful because many cameras default to their own version of RAW, which is highly proprietary,” Stuchell says. He recommends exporting such photos to an open format called Digital Negative (DNG), which is a safer format to use for preserving RAW files.
Inside the online communities trying to preserve our digital memories
Text Günseli Yalcinkaya, posted 8th January 2024
https://dazeddigital.com/life-culture/article/61689/1/youtube-comments-online-communities-preserve-digital-memories-checkpoints
Ruby Justice Thelot’s A Cyberarchaeology of Checkpoints pays homage to the personal stories shared in the comments section of YouTube videos
The online data that’s being deleted. For years, we were encouraged to store our data online…
By Chris Baraniuk, posted 15th July 2021
https://bbc.com/future/article/20210715-the-online-data-thats-being-deleted
But it’s become increasingly clear that this won’t last forever - and now the race is on to stop our memories being deleted
Delete Never: The digital hoarders who collect Tumblrs, medieval manuscripts, and Terabytes of text files
By Steven Melendez, published March 4 2019
https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
We are not alone: Progress in the digital preservation community
By Ricc Ferrante, posted October 2 2018
https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community
File formats
EPUB
Out of all the formats, this is probably the least appropriate for storing web pages, but it’s useful for storing long articles or documentation.
EPUB is an e-book file format supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
The EPUB format is implemented as an archive file consisting of XHTML files carrying the content, along with images and other supporting files.
EPUB is the most widely supported vendor-independent, XML-based e-book format (as opposed to PDF); it is supported by almost all hardware readers, except the Kindle.
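Because an EPUB is just a ZIP archive with a fixed layout, a minimal one can be assembled with nothing but the standard library. A sketch under stated assumptions: the `make_minimal_epub` name, the hard-coded identifier, and the metadata values are illustrative, not taken from any tool; only the container layout (an uncompressed `mimetype` entry first, then `META-INF/container.xml`, a package document, and the XHTML content) reflects the format itself.

```python
import zipfile

def make_minimal_epub(path, title, xhtml_body):
    """Write a minimal EPUB: a ZIP whose first entry is an uncompressed
    'mimetype' file, plus container.xml, an OPF package doc, and the content."""
    container = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""
    opf = f"""<?xml version="1.0"?>
<package version="3.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">urn:uuid:demo</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine><itemref idref="ch1"/></spine>
</package>"""
    xhtml = f"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>{title}</title></head>
<body>{xhtml_body}</body></html>"""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and must be stored uncompressed,
        # so readers can sniff the file type without unzipping.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", container)
        z.writestr("OEBPS/content.opf", opf)
        z.writestr("OEBPS/ch1.xhtml", xhtml)
```

Real tools (and the spec) add a navigation document and richer metadata; this only shows why the format is so easy to inspect: unzip it and you get plain XHTML.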
HAR
HAR is a JSON-formatted archive file format for logging of a web browser’s interaction with a site.
Because it is used for troubleshooting website issues (slow page load, timeout when performing certain tasks, page rendering issues) the format is quite verbose, because it contains a lot of metadata not related to the actual page content.
HAR can be exported from any web browser, from “Web developer tools > Network”. Browsers also allow importing HAR files to see the request/ response log.
HAR is suitable for archiving web pages ONLY if it contains the actual response content, not just the headers and metadata!
A huge CON of the HAR format is that I couldn’t find any viewer that renders the archived page with its resources, the way you can with HTML, WARC, or EPUB.
The format tree looks something like:
- log
- browser = {'name':'Firefox', 'version':'90.0'} - only in Firefox
- creator = {'name':'Firefox', 'version':'90.0'}
- entries
- cache
- connection
- pageref = 'page_X'
- request
- bodySize, cookies, headers, headersSize, httpVersion, method, queryString, url, ...
- response
- bodySize, content, cookies, headers, headersSize, httpVersion, redirectURL, status, statusText,...
- serverIPAddress = '[xxxx:xx:x:x:x:xxxx:xxxx:x]'
- startedDateTime = '2021-07-26Txx:...'
- time = 6.715
- timings
- pages
- id = 'page_X'
- pageTimings = {'onContentLoad': 2518, 'onLoad': 3018}}
- startedDateTime = '2021-07-26Txx:...'
- title = 'Some home page'
- version = 1.2
File size for page https://crlf.link:
- Firefox: 518,665 (crlf.link-ff.har)
- Chromium: 522,162 (crlf.link-chr.har)
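Since HAR is plain JSON following the tree above, checking whether an export actually captured the response bodies (and pulling them out) needs only the standard library. A minimal sketch; the `extract_har_bodies` name and the example file name are hypothetical:

```python
import base64
import json

def extract_har_bodies(har_path):
    """Yield (url, bytes) for every HAR entry whose response body was
    actually captured; entries with only headers/metadata are skipped."""
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # body not exported -- useless for archiving
        if content.get("encoding") == "base64":
            # binary resources (images, fonts) are base64-encoded in HAR
            yield entry["request"]["url"], base64.b64decode(text)
        else:
            yield entry["request"]["url"], text.encode("utf-8")

# usage sketch: dict(extract_har_bodies("crlf.link-ff.har"))
```

If this yields nothing for a page you just saved, the browser exported a metadata-only HAR and it is not archive-worthy.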
- http://softwareishard.com/blog/har-viewer
- https://ourpcgeek.com/how-to-generate-har-file
- https://w3c.github.io/web-performance/specs/HAR/Overview.html
- https://wikipedia.org/wiki/HAR_(file_format)
HAR tools:
- https://github.com/exogen/node-fetch-har
- https://github.com/janodvarko/harviewer
- https://github.com/mrichman/hargo – Go library and CLI that dumps and loads HAR files
- https://github.com/thameera/harcleaner – clean noisy requests from HAR files
- https://micmro.github.io/PerfCascade
- https://toolbox.googleapps.com/apps/har_analyzer
HTML ⭐
All browsers support saving a web page as HTML, but they save all the resources (CSS, JS, images) in a separate folder, which is pretty inconvenient.
There are lots of tools that export HTML pages by inlining all resources inside the HTML, which makes the file portable.
This is probably the best format for archiving a single page.
Tools:
- https://github.com/danburzo/percollate – CLI app for turning web pages into beautiful, readable PDF, EPUB, or HTML
- https://github.com/gildas-lormeau/SingleFile – Extension for Firefox/Chrome/MS Edge and CLI tool to save a faithful copy of an entire web page in a single HTML file
- https://github.com/go-shiori/obelisk – CLI app, inspired by Monolith, for saving web pages as single HTML file
- https://github.com/wabarc/cairn – Node.js CLI tool for saving web pages, inspired by Obelisk/ Monolith
- https://github.com/WebMemex/freeze-dry – store a web page as shown in the browser, after having inlined all external resources
- https://github.com/Y2Z/monolith – CLI app for saving complete web pages as a single HTML file
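The core trick all these single-file savers share is replacing each external resource reference with an inline `data:` URI. A toy sketch of that idea, assuming a caller-supplied `fetch(url)` that returns `(mime_type, bytes)`; the `inline_images` name is hypothetical, and real tools like SingleFile handle far more than `<img>` tags (CSS, fonts, iframes, srcset):

```python
import base64
import re

def inline_images(html: str, fetch) -> str:
    """Replace each <img src="..."> URL with a base64 data: URI,
    making the HTML self-contained. fetch(url) -> (mime_type, bytes)."""
    def repl(match):
        url = match.group(1)
        mime, data = fetch(url)
        b64 = base64.b64encode(data).decode("ascii")
        # swap the URL inside the matched tag for the inline data: URI
        return match.group(0).replace(url, f"data:{mime};base64,{b64}")
    return re.sub(r'<img[^>]+src="([^"]+)"', repl, html)
```

A regex is enough for a sketch; production tools parse the DOM instead, precisely because HTML in the wild doesn’t cooperate with regexes.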
MAFF
MAFF files are standard ZIP files containing one or more web pages, images, or other downloadable content.
Additional metadata, like the original page address, is saved along with the content.
Unlike the related MHTML format, MAFF is compressed and particularly suited for large media files.
MAFF files no longer open in newer (post-2017) versions of Firefox, but they are supported in Cyberfox and Waterfox, forks of Firefox that try to keep features removed from Firefox, like the traditional extension API.
Features:
- store web content in a single file
- store multiple independent pages in the same archive, eg: all open tabs in a single file
- based on ZIP / JAR format - pages in MAFF files are compressed, not encoded
- suited for video and audio files
- store metadata about the saved resources, like the original location from which the page was saved, as well as the date and time of the save operation
- store arbitrary extended metadata, like scroll position in the page, text zoom level, etc
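Because MAFF is a plain ZIP with one top-level folder per saved page (each holding an `index.html` plus its resources and metadata), listing the pages in an archive takes only the standard library. A sketch; the `list_maff_pages` name is hypothetical:

```python
import zipfile

def list_maff_pages(maff_file):
    """Return the top-level folder names in a MAFF archive;
    each folder corresponds to one saved page (e.g. one browser tab)."""
    with zipfile.ZipFile(maff_file) as z:
        return sorted({name.split("/")[0]
                       for name in z.namelist() if "/" in name})
```

To actually read a page you would then extract `<folder>/index.html` (and `<folder>/index.rdf` for the saved metadata) with `ZipFile.read`.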
- https://amadzone.org/mozilla-archive-format
- https://wikipedia.org/wiki/Mozilla_Archive_Format
MHTML
MHTML combines, in a single file, the HTML code and its companion resources (such as images, audio, and video files) that are referenced by external links in the web page’s HTML code.
The content of an MHTML file is encoded using the same techniques that were first developed for HTML email messages, using the MIME content type multipart/related.
The .mhtml (web archive) and .eml (email) filename extensions are interchangeable: either can be renamed to the other. An .eml message can be sent by email and displayed by an email client. Conversely, an email message can be saved with a .mhtml or .mht extension, then opened in a web browser for display or in other programs for editing.
Some browsers support the MHTML format, directly or through third-party extensions, but the process for saving a web page along with its resources as an MHTML file is not standardized. This means a web page saved as an MHTML file in one browser may render differently in another.
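Since MHTML reuses the MIME machinery of HTML email, Python’s `email` module can assemble one. A minimal sketch, assuming a caller-supplied map of resource URLs to their MIME type and bytes; the `make_mhtml` name is hypothetical, and real savers add more headers (e.g. a `Content-Location` on the root part, `Snapshot-Content-Location`) than shown here:

```python
from email.message import EmailMessage

def make_mhtml(html: str, resources: dict) -> bytes:
    """Bundle a page and its resources into one multipart/related MIME
    message -- the structure MHTML uses.
    resources maps original URL -> (maintype, subtype, raw bytes)."""
    msg = EmailMessage()
    msg["Subject"] = "Archived page"
    msg.set_content(html, subtype="html")
    for location, (maintype, subtype, data) in resources.items():
        # Content-Location ties each part back to the URL it was fetched from,
        # so relative links in the HTML resolve against the archived parts.
        msg.add_related(data, maintype=maintype, subtype=subtype,
                        headers=[f"Content-Location: {location}"])
    return bytes(msg)
```

The output can be saved with a `.mhtml` extension and opened in a browser that supports the format, or with `.eml` and opened in a mail client, which is exactly the interchangeability described above.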
- https://docs.fileformat.com/web/mhtml
- https://fileinfo.com/extension/eml – EML files
- https://github.com/erikbrinkman/mhtml-stream – parsing MHTML file streams
- https://github.com/nodemailer/mailparser – EML/ MHTML parser
- https://github.com/testimio/mhtml-parser – fast MHTML parser in Node.js
- https://lifewire.com/mht-file-4140714 – MHT files
- https://mhonarc.org – mail-to-HTML converter
- https://wikipedia.org/wiki/MHTML
PDF
Saving web pages as PDF is best for long articles or documents.
Not optimal for other websites, because the page is not interactive anymore, so you can’t click on things, play videos, play audio, etc.
Besides that, PDFs are split into pages, which interrupts the flow of the content and can make it harder to understand.
Regular PDFs are also harder to search on disk, because the text inside the file is laid out for printing, not for searching.
- https://addons.mozilla.org/en-US/firefox/addon/save-pdf – Save page as PDF, Firefox add-on
- https://hanzo.co/blog/web-archiving-for-compliance-101-the-pros-and-cons-of-pdfs
Replay.io
Replay is a browser that records all of the context you need to fix bugs faster
Replays include the entire session and are typically smaller than a video
CON: it’s a technical tool, not designed for users
- https://replay.io
- https://twitter.com/replayio
- https://github.com/RecordReplay – some of the tools are open-source
rrWeb
Record and replay a web session
Open-source web session replay library, which provides easy-to-use APIs to record a user’s interactions and replay them remotely
The format is JSON, a very deep tree of nodes and properties
CON: it’s a very technical tool, not user friendly
- https://rrweb.io
- https://github.com/rrweb-io/rrweb – record + play
- https://github.com/rrweb-io/rrvideo – convert record into video
WARC
The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
WARC was developed as an extension of the ARC format, in part to provide better capabilities for managing web archives over the long term, allowing the capture of more metadata about the circumstances of archiving.
It generalizes older formats to better support the harvesting, access, and exchange needs of archiving organizations.
The WARC format is inspired by HTTP/1.0 streams, with a similar header and the use of CRLFs as delimiters, making it very conducive to crawler implementations.
Files are often compressed using GZIP, resulting in a .warc.gz extension.
A WARC is a container file that includes and wraps around other files like JS, CSS, PDF or MP3, along with some additional information and formatting.
It concatenates several files into one object, like other container formats (eg: TAR, ZIP, or RAR).
The WARC container also contextualizes those files with technical and provenance metadata about how they were collected and arranged, so sites can be read and rendered in live web browsing experiences as they were at the time of their collection.
The ARC file was the Internet Archive’s original container file for web-native resources. The WARC standard was formalized in 2009 to include more detailed technical metadata.
The international standard defines 8 WARC record types:
- warcinfo
- response
- resource
- request
- metadata
- revisit
- conversion
- continuation
A WARC record contains:
- header
- content block
- two newlines
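The record layout above (HTTP/1.0-style header, content block, two CRLF pairs as separator) can be serialized by hand to see why the format is so crawler-friendly. A minimal sketch of a single `response` record; the `write_warc_response` name is hypothetical, and production code should use a library like `warcio` from the list below:

```python
import uuid
from datetime import datetime, timezone

def write_warc_response(url: str, http_payload: bytes) -> bytes:
    """Serialize one WARC 1.1 'response' record: a header of named fields,
    the content block, then two CRLF pairs ending the record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = (
        "WARC/1.1\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {now}\r\n"
        f"WARC-Target-URI: {url}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"  # blank line separates the header from the content block
    )
    return header.encode("utf-8") + http_payload + b"\r\n\r\n"
```

A `.warc` file is just a concatenation of such records (usually starting with a `warcinfo` record), which is also why per-record GZIP compression to `.warc.gz` works so naturally.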
Libraries and apps
- https://github.com/alard/megawarc – Python nondestructive warc-in-tar to warc conversion
- https://github.com/ArchiveTeam/wpull – Python Wget-compatible web downloader and crawler that saves WARC
- https://github.com/chatnoir-eu/chatnoir-resiliparse – FastWARC high-performance WARC parsing library written in C++/Cython
- https://github.com/chfoo/warcat – Python handle WARC files: list, concat, split, extract
- https://github.com/datacoon/metawarc – CLI Python metadata extraction from files from WARC
- https://github.com/eugeneware/warc – Node.js WARC parser
- https://github.com/internetarchive/scrapy-warcio – Scrapy WARC I/O
- https://github.com/internetarchive/warc – Python reading and writing WARC files
- https://github.com/internetarchive/warctools – CLI & libs for handling and manipulating WARC
- https://github.com/jsvine/waybackpack – download the entire Wayback Machine archive for a given URL
- https://github.com/machawk1/warcreate – Chrome extension to create WARC files from any webpage
- https://github.com/N0taN3rd/node-warc – Node.js parse and create WARC
- https://github.com/netarchivesuite/solrwayback – search interface and wayback machine for the UKWA Solr based WARC-indexer (Java)
- https://github.com/odie5533/WarcMiddleware – Scrapy WarcMiddleware to seamlessly download a mirror copy of a website
- https://github.com/richardlehane/warcmount – mount a WARC file in a virtual FS for browsing
- https://github.com/sepastian/warc2corpus – Python extract structured data from HTML pages in WARCs w CSS selectors
- https://github.com/steffenfritz/html2warc – Python convert web resources to a single WARC file
- https://github.com/tomnomnom/waybackurls – fetch all URLs that Wayback Machine knows about for a domain
- https://github.com/Vikasg7/warc-reader – Node.js ES6 .warc or .warc.gz file reader
- https://github.com/Vikasg7/warc-stream – Node.js ES6 warc stream
- https://github.com/webrecorder/har2warc – Python HAR to WARC
- https://github.com/webrecorder/replayweb.page – Web Archive replay in the browser
- https://github.com/webrecorder/wacz-format – WACZ format
- https://github.com/webrecorder/warcio.js – WARC IO optimized for browser and Node.js
- https://github.com/webrecorder/warcio – Python WARC (and ARC) read and write library
- https://github.com/webrecorder/warcit – CLI convert directories of web documents into WARC
- https://warcreate.com – WARCreate Chrome extension
- https://webrecorder.net/2021/01/18/archiveweb-page-extension.html – Chrome extension to record and save WACZ (compressed warcs collection)
- wget – saves pages as WARC, but it doesn’t save the resources
- etc, etc.
Links
- https://archive-it.org/blog/post/the-stack-warc-file
- https://commoncrawl.org/2014/04/navigating-the-warc-file-format
- https://fileformats.archiveteam.org/wiki/WARC
- https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1
- https://inkdroid.org/2016/04/14/warc-work – working with WARC
- https://taricorp.net/2016/web-history-warc
- https://webrecorder.net/2021/01/18/wacz-format-1-0.html – WACZ format = Web Archive Collection Zipped
- https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
- https://wiki.archiveteam.org/index.php/Wget_with_WARC_output
- https://wikipedia.org/wiki/Internet_Archive
- https://wikipedia.org/wiki/Web_ARChive
- https://wikipedia.org/wiki/Web_archiving
ZIM
The ZIM file format is used to store massive websites, like Wikipedia; 10GB+ files are fine.
A ZIM archive can then be opened in a reader and browsed offline,
eg from: https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
- https://chrome.kiwix.org – Chrome offline wikipedia.zim reader
- https://firefox.kiwix.org – Firefox offline wikipedia.zim reader
- https://github.com/birros/web-archives – web archives reader
- https://github.com/kimbauters/ZIMply – offline reader for ZIM file, through ordinary browsers
- https://github.com/kiwix/kiwix-desktop – viewer/ manager of ZIM files
- https://github.com/kiwix/kiwix-tools – collection of tools
- https://github.com/kymeria/pyzim-tools – introspect ZIM files
- https://github.com/mkiol/Zimpedia – ZIM repository offline reader
- https://github.com/openzim/node-libzim – Node.js read & write ZIM files
- https://github.com/openzim/python-libzim – Python read & write ZIM files
- https://github.com/openzim/warc2zim – Python WARC files to ZIM
- https://github.com/openzim/zim-tools – ZIM check, dump, export
- https://gitlab.com/jojolebarjos/zimscan – minimal ZIM file reader, designed for article streaming
- https://openzim.org/wiki/Build_your_ZIM_file
- https://openzim.org/wiki/Readers
- https://openzim.org/wiki/ZIM_file_format