
Web archiving

As we all know, human civilization must be digitized so that, with one stroke, it can be deleted.
– Anon

Why

On the Importance of Web Archiving

https://items.ssrc.org/parameters/on-the-importance-of-web-archiving

Where incredible journeys end…

Site deaths are when sites go offline, taking content and permalinks with them, and breaking the web accordingly.
Site deaths are one of the big reasons why you should own your own identity and content on the web:
https://indieweb.org/site-deaths

How to Properly Archive Your Digital Files

Posted Jul 14 2024
https://wired.com/story/how-to-properly-archive-your-digital-files
Will you be able to open today’s Word docs in 20 years? Probably not, unless you take some necessary steps to give those digital files an extra-long shelf life.
Updates to software like Microsoft Word mean that files that opened fine in the '80s don’t open in the 2020s. Part of the problem: Microsoft, and only Microsoft, controls the file format, or even knows how it works. For this reason, Stuchell says he encourages people to export files in an open file format—especially files they want to keep accessible for the long term.
Basically, if a file on your computer can only be opened by a specific piece of software, and that software is controlled by a single company, you should probably export it to an open format. It’s the only way to future-proof it.
Keep in mind that some photos, especially RAW files from your camera, might be in a proprietary format.
“You have to be careful because many cameras default to their own version of RAW, which is highly proprietary,” Stuchell says. He recommends exporting such photos to an open format called Digital Negative (DNG), which is a safer format to use for preserving RAW files.

Inside the online communities trying to preserve our digital memories

Text Günseli Yalcinkaya, posted 8th January 2024
https://dazeddigital.com/life-culture/article/61689/1/youtube-comments-online-communities-preserve-digital-memories-checkpoints
Ruby Justice Thelot’s A Cyberarchaeology of Checkpoints pays homage to the personal stories shared in the comments section of YouTube videos

The online data that’s being deleted. For years, we were encouraged to store our data online…

By Chris Baraniuk, posted 15th July 2021
https://bbc.com/future/article/20210715-the-online-data-thats-being-deleted
But it’s become increasingly clear that this won’t last forever - and now the race is on to stop our memories being deleted

Delete Never: The digital hoarders who collect Tumblrs, medieval manuscripts, and Terabytes of text files

By Steven Melendez, published March 4 2019
https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423

We are not alone: Progress in the digital preservation community

By Ricc Ferrante, posted October 2 2018
https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community


File formats

EPUB

Out of all the formats, this is probably the least appropriate for storing web pages, but it’s useful for storing long articles, or documentation.
EPUB is an e-book file format supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
The EPUB format is implemented as an archive file consisting of XHTML files carrying the content, along with images and other supporting files.
EPUB is the most widely supported vendor-independent XML-based (as opposed to PDF) e-book format; it is supported by almost all hardware readers, except for Kindle.
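
Since an EPUB is a ZIP-based archive, Python's standard zipfile module is enough to peek inside one. A minimal sketch, assuming you only want the names of the (X)HTML content documents; the function name list_epub_documents is made up for this example:

```python
import zipfile


def list_epub_documents(path_or_file):
    """Return the names of the (X)HTML content documents inside an EPUB.

    Accepts a file path or a binary file-like object, because
    zipfile.ZipFile handles both.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]
```

Calling list_epub_documents("book.epub") (a placeholder path) would list the chapters without extracting anything to disk.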

HAR

HAR is a JSON-formatted archive file format for logging of a web browser’s interaction with a site.
Because it is meant for troubleshooting website issues (slow page load, timeout when performing certain tasks, page rendering issues), the format is quite verbose and contains a lot of metadata unrelated to the actual page content.
HAR can be exported from any web browser, from “Web developer tools > Network”. Browsers also allow importing HAR files to see the request/response log.

HAR is suitable for archiving web pages ONLY if it contains the actual response content, not just the headers and metadata!
A huge CON of the HAR format is that I couldn’t find any reader of the page and resources, like you have for HTML, WARC, or EPUB.

The format tree looks something like:

- log
  - browser = {'name':'Firefox', 'version':'90.0'} - only in Firefox
  - creator = {'name':'Firefox', 'version':'90.0'}
  - entries
    - cache
    - connection
    - pageref = 'page_X'
    - request
      - bodySize, cookies, headers, headersSize, httpVersion, method, queryString, url, ...
    - response
      - bodySize, content, cookies, headers, headersSize, httpVersion, redirectURL, status, statusText,...
    - serverIPAddress = '[xxxx:xx:x:x:x:xxxx:xxxx:x]'
    - startedDateTime = '2021-07-26Txx:...'
    - time = 6.715
    - timings
  - pages
    - id = 'page_X'
    - pageTimings = {'onContentLoad': 2518, 'onLoad': 3018}
    - startedDateTime = '2021-07-26Txx:...'
    - title = 'Some home page'
  - version = 1.2
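
Whether a HAR file really holds the page content can be checked by walking log > entries and looking at each response's content. A rough sketch in Python, assuming the body is stored under content.text as in HAR 1.2; the function names are made up for this example:

```python
import json


def load_har(path):
    """Parse a .har file (plain JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def urls_without_body(har):
    """Return the request URLs whose response has no stored body text."""
    missing = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        # An absent or empty "text" field means only headers and
        # metadata were saved for this request.
        if not content.get("text"):
            missing.append(entry["request"]["url"])
    return missing
```

Responses that legitimately have no body (redirects, 204, 304) are reported too, so the list is a hint, not a verdict.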

File size for page: https://crlf.link
Firefox: 518,665 crlf.link-ff.har
Chromium: 522,162 crlf.link-chr.har

HAR tools:

HTML ⭐

All browsers support saving a web page as HTML, but they save all the resources (CSS, JS, images) in a separate folder, which is pretty inconvenient.
There are lots of tools that export HTML pages by inlining all resources inside the HTML, which makes the file portable.
This is probably the best format for archiving a single page.
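
To give an idea of what "inlining" means, here is a small Python sketch that replaces local img sources with base64 data URIs. It is only an illustration under simplifying assumptions (double-quoted src attributes, images already on disk, a regex instead of a real HTML parser), and it ignores CSS and JS; the names inline_images and to_data_uri are made up for this example:

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> with a double-quoted src attribute.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(path):
    """Encode a local file as a data: URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html, base_dir):
    """Replace local image references in html with data: URIs."""
    base = Path(base_dir)

    def replace(match):
        src = match.group(2)
        target = base / src
        # Leave remote, already-inlined, or missing images untouched.
        if src.startswith(("http:", "https:", "data:")) or not target.is_file():
            return match.group(0)
        return match.group(1) + to_data_uri(target) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

The price of a portable single file is size: base64 makes every embedded resource roughly a third larger.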

Tools:

MAFF

MAFF files are standard ZIP files containing one or more web pages, images, or other downloadable content.
Additional metadata, like the original page address, is saved along with the content.
Unlike the related MHTML format, MAFF is compressed and particularly suited for large media files.
.MAFF extension no longer works on newer (as of 2017) versions of Firefox, but it is supported in Cyberfox and Waterfox, forks of Firefox that try to keep features removed from Firefox like the traditional extension API.

Features:

MHTML

MHTML combines, in a single file, the HTML code and its companion resources (such as images, audio and video files) that are represented by external links in the web page’s HTML code.
The content of an MHTML file is encoded using the same techniques that were first developed for HTML email messages, using the MIME content type multipart/related.
The .mhtml (Web archive) and .eml (email) filename extensions are interchangeable: either filename extension can be changed from one to the other. An .eml message can be sent by e-mail, and it can be displayed by an email client. An email message can be saved using a .mhtml or .mht filename extension and then opened for display in a web browser or for editing in other programs.
Some browsers support the MHTML format, directly or through third-party extensions, but the process for saving a web page along with its resources as an MHTML file is not standardized. This means a web page saved as an MHTML file using one browser may render differently in another.
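
Because MHTML reuses the MIME encoding of email, Python's standard email package can open such a file. A minimal sketch, assuming each part carries a Content-Location header with its original URL (browsers usually write one, but it is not guaranteed); the function name list_mhtml_parts is made up for this example:

```python
from email import message_from_bytes


def list_mhtml_parts(raw):
    """Return (content type, Content-Location) for each part of an MHTML file.

    raw is the whole file as bytes; Content-Location is None when a part
    does not declare it.
    """
    message = message_from_bytes(raw)
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart/related wrapper itself
    ]
```

For a saved page this typically gives one text/html part followed by the images, stylesheets, and scripts it references.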

PDF

Saving web pages as PDF is best for long articles or documents.
It is not optimal for other kinds of pages, because the page is no longer interactive: you can’t click on things, play videos, play audio, etc.
Besides that, PDFs are split into pages, which interrupts the flow of the content and can make it harder to understand.
Regular PDFs are also harder to search on disk, because the text inside the file is made to be printed, not searched.

Replay.io

Replay is a browser that records all of the context you need to fix bugs faster
Replays include the entire session and are typically smaller than a video
CON: it’s a technical tool, not designed for users

rrWeb

Record and replay a web session
Open-source web session replay library, which provides easy-to-use APIs to record users’ interactions and replay them remotely
The format is JSON, a very deep tree of nodes and properties
CON: it’s a very technical tool, not user friendly

WARC

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
It was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term, allowing for capture of more metadata about the circumstances of archiving.
It generalizes older formats to better support the harvesting, access, and exchange needs of archiving organizations.
The WARC format is inspired by HTTP/1.0 streams, with a similar header and the use of CRLFs as delimiters, making it very conducive to crawler implementations.
Files are often compressed using GZIP, resulting in a .warc.gz extension.
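
The HTTP-like layout makes a basic reader short to write. A simplified Python sketch, assuming well-formed records (a version line, "Name: value" header lines, an empty line, then exactly Content-Length bytes of payload); it ignores folded header lines and keeps only the last value of a repeated header. The function names are made up for this example:

```python
import gzip


def open_warc(path):
    """Open a .warc or .warc.gz file for binary reading."""
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rb")


def read_warc_records(stream):
    """Yield (version, headers, payload) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # blank separator lines between records
        headers = {}
        for line in iter(stream.readline, b""):
            line = line.rstrip(b"\r\n")
            if not line:
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield version.strip().decode("ascii"), headers, payload
```

Looping over read_warc_records(open_warc("site.warc.gz")) (a placeholder file name) and printing the WARC-Type and WARC-Target-URI headers gives a quick inventory of an archive. For real work, a dedicated WARC library is the safer choice.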

A WARC is a container file that includes and wraps around other files like JS, CSS, PDF or MP3, along with some additional information and formatting.
It concatenates several files into one object, like other container formats (eg: TAR, ZIP, or RAR).
A WARC container can also contextualize those files by holding technical and provenance metadata about the collection and arrangement of their media, so sites can be read and represented in live web browsing experiences like they were at the time of their collection.
The ARC file was the Internet Archive’s original container file for web-native resources. The WARC standard was formalized in 2009 to include more detailed technical metadata.

International Standard defines 8 WARC record types:

A WARC record contains:

Libraries and apps

Links

ZIM

The ZIM file format is used to save massive websites, like Wikipedia; files of 10 GB and more are OK.
A ZIM archive can then be opened in a reader and browsed offline,
eg one from: https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/

Wiki
