Web snapshots? The what, the why and the how

This article is the conclusion of 6+ months of work spent trying to find the best way to create “perfect” snapshots of web pages.

Intro

What does “web snapshots” mean?

Simply put, it means using an application that creates an archive of a live page in such a way that you can perfectly restore it later, offline. From the snapshot, you can extract text or HTML, or take screenshots, completely offline.

There are a few reasons why this is needed:

  • link rot: pages are removed from websites all the time, websites go offline and domains disappear
  • websites are changing all the time and it can be useful to take snapshots over time and observe the differences
  • ensure the information is preserved in an archive for future research, history, or the public
  • save long articles for reading in places where the Internet is not available (such as a plane, or train)
  • fight censorship by making sure that sensitive information and vital proof is not lost

My reason for looking into this problem is that at Zyte, we are developing an automated data extraction API that uses AI to recognize parts of the page, such as the title, images, price, etc.
To do this, we have a few tens of thousands of snapshots of pages that we use for training and testing the model, and they need to capture the original state of the page as closely as possible.

Creating perfect snapshots is not a simple problem, because web pages are complicated, and of course, I’m not the first one to try this.
I also have an older project, https://github.com/croqaz/clean-mark, which converts articles into clean text; it doesn’t work with other kinds of pages.

Let’s look at the current methods and formats:

WARC

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file, together with related information.
The WARC format is inspired by HTTP/1.0 streams, with similar headers, and it uses CRLFs as delimiters, which makes it very convenient for crawler implementations.
Files are often compressed using GZIP, resulting in a .warc.gz extension.
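To make the structure concrete, here is a rough sketch of what a single WARC “response” record looks like when written by hand. This is only an illustration: the exact header set varies between tools, and real crawlers should use a proper WARC library instead.

    // Illustrative only: building one WARC "response" record by hand,
    // to show the HTTP-like header block and the CRLF delimiters.
    import { randomUUID } from "node:crypto";

    function warcResponseRecord(targetUri: string, httpPayload: Buffer): Buffer {
      const CRLF = "\r\n";
      const headers = [
        "WARC/1.1",
        "WARC-Type: response",
        `WARC-Target-URI: ${targetUri}`,
        `WARC-Date: ${new Date().toISOString()}`,
        `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
        "Content-Type: application/http; msgtype=response",
        `Content-Length: ${httpPayload.length}`,
      ].join(CRLF);
      // Header block, blank line, raw HTTP payload, then two CRLFs close the record.
      return Buffer.concat([
        Buffer.from(headers + CRLF + CRLF),
        httpPayload,
        Buffer.from(CRLF + CRLF),
      ]);
    }

    // The payload is the raw HTTP response, exactly as received from the server.
    const http = Buffer.from("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>");
    console.log(warcResponseRecord("https://example.com/", http).toString());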

WARC is considered the gold standard for archiving, used by all serious archiving initiatives.
Because of that, there are lots and lots of applications and programming libraries to parse this format.

The format is definitely very elegant and the WARC tools generally work.
Where it fails is on websites with lots of Javascript, especially if the Javascript is non-deterministic.

Unfortunately, it’s easy to break the restore of a page with just one line of Javascript. If the page creation date is saved in the HTML as a tag, or in a Javascript constant, or extracted from the headers of a resource, and there’s an if condition checking that date against the current date, the page can refuse to render. This means the restored page will be blank, or will show an error.
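As a hypothetical example, a check like this is enough to ruin the replay:

    // Hypothetical page script: refuse to render if the data looks "stale".
    const generatedAt = new Date("2023-03-01T10:00:00Z"); // baked in at crawl time
    const oneDay = 24 * 60 * 60 * 1000;

    if (Date.now() - generatedAt.getTime() > oneDay) {
      // On replay, Date.now() is the *current* date, so this branch always runs
      // and the restored page shows an error instead of the content.
      document.body.innerHTML = "<p>This page has expired, please refresh.</p>";
    }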

Another big problem is that after you create a WARC file, it’s very hard to find a reader to actually look at it… People capturing WARC files are mostly focused on archiving pages, not restoring them.

There’s https://replayweb.page which is nice and anyone can use it without having to run lots of scary CLI commands in the terminal.
And there’s https://github.com/webrecorder/pywb which is also nice, but needs some setup and technical know-how.
That’s basically it. Both of them have their problems: some important shopping, real-estate and job-posting domains can’t be saved or restored, for different reasons.
In their defence, the archiving solutions only care about preserving the knowledge of humanity, not the prices of random products, so they at least make sure that articles, forum posts and Twitter threads can be saved.

In conclusion: WARC captures only resources, so it must re-execute all JS on replay and this can lead to inconsistencies.
Of course, some of these limitations may change in the future.

rrWeb

Links:

  • https://github.com/rrweb-io/rrweb
  • https://www.rrweb.io

Unfortunately, rrWeb is a very technical programming library, not a final product. There’s no easy way to capture a page, or to look at the result; you need to know exactly what to do.
The documentation for getting started is not that helpful. I had to dig into the source code, the tests and the issues, and try different approaches to work out the best way to capture and restore pages.
It is written in Typescript and it involves calling the snapshot(document, { ... }) function when the page is completely loaded… Sounds easy enough!
In reality, you need to set up a browser (eg: Puppeteer) and call that function when all the resources you want to capture have finished loading.
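A minimal sketch of what that looks like with Puppeteer, assuming a UMD build of rrweb-snapshot is available to inject; the exact bundle path, global name, options and return shape depend on the version you install:

    // Sketch: take an rrWeb DOM snapshot from Puppeteer, once the page has settled.
    import * as fs from "node:fs";
    import puppeteer from "puppeteer";

    async function capture(url: string, outFile: string) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Wait until the network is (mostly) quiet, so images and CSS have loaded.
      await page.goto(url, { waitUntil: "networkidle0", timeout: 60_000 });

      // Inject the rrweb-snapshot browser bundle into the live page.
      await page.addScriptTag({ path: "./node_modules/rrweb-snapshot/dist/rrweb-snapshot.js" });

      // Call snapshot() inside the page, when everything we care about has settled.
      const snap = await page.evaluate(() => {
        // @ts-ignore: injected global from the bundle above
        return rrwebSnapshot.snapshot(document, { inlineStylesheet: true });
      });

      fs.writeFileSync(outFile, JSON.stringify(snap));
      await browser.close();
    }

    capture("https://example.com", "snapshot.json").catch(console.error);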

I got really excited about this format once I spent some time understanding it. It has the potential to be what PDF is for document archival: basically a perfect copy of the structure of the DOM.

Then I started using it, and I noticed that very important resources were missing from the rrWeb snapshot:
the images were not captured. This means the snapshot was not completely offline; if the domain goes dark, your snapshot will lose all its images.
I made a PR to implement capturing images, and a series of fixes after that.

Capturing images is not hard per se, if you can easily get to the image data… There are two ways to do that:

  • capture the binary stream of data by making a request with the image URL
  • create a Canvas object, draw the image into it and call the canvas .toDataURL() function

Both methods are tricky to do…

  • capturing the binary stream requires the rrWeb snapshot function to be async, or at least callback based – and it’s not, as I’m writing this article, so you would have to guess when the request has finished
  • calling .toDataURL() requires the images to be from the same domain as the website, otherwise a normal browser will complain that “Tainted canvases may not be exported” and the function will fail

Of course, there are plenty of hacks you can do to overcome that: start the browser with --disable-web-security and apply .crossOrigin = "anonymous" to all images.
See: https://developer.mozilla.org/en-US/docs/Web/HTML/CORS_enabled_image
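For reference, a rough sketch of the canvas variant; it only works when the image is same-origin or served with CORS headers, and crossOrigin must be set before the image starts loading:

    // Sketch: read an image's pixels back as a data URI via a canvas.
    function imageToDataURL(img: HTMLImageElement): string {
      const canvas = document.createElement("canvas");
      canvas.width = img.naturalWidth;
      canvas.height = img.naturalHeight;
      const ctx = canvas.getContext("2d");
      if (!ctx) throw new Error("2D context not available");
      ctx.drawImage(img, 0, 0);
      // Throws "Tainted canvases may not be exported" for cross-origin images
      // that were not loaded with CORS enabled.
      return canvas.toDataURL("image/png");
    }

    const img = new Image();
    img.crossOrigin = "anonymous"; // must be set before src
    img.onload = () => console.log(imageToDataURL(img).slice(0, 60) + "...");
    img.src = "https://example.com/product.jpg";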

Then, I discovered that the background images from the CSS stylesheets are not captured in the rrWeb snapshot…
The background images can be very important for some pages and if they are missing, some buttons or product pictures will become invisible.
There are two types of background images: normal ones, applied on a real DOM node; and ones applied on ::before or ::after pseudo-elements, which basically don’t exist in the DOM and cannot be accessed from Javascript.

To overcome this, I had to post-process the pages before calling rrWeb, replacing all background image URLs with their base64 representation, so that they are included in the snapshot, offline.
And I had to create new CSS classes, derived from the URLs, to apply on the actual nodes, because you can’t hack the styles of pseudo-elements that don’t exist in the DOM.
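A simplified sketch of that idea, running in the page before the snapshot is taken; the function and class names here are illustrative, not from the actual code:

    // Sketch: inline a ::before/::after background image as a data URI,
    // by adding a derived CSS class to the real node that owns the pseudo-element.
    async function toDataURI(url: string): Promise<string> {
      const blob = await (await fetch(url)).blob();
      return new Promise((resolve) => {
        const reader = new FileReader();
        reader.onload = () => resolve(reader.result as string);
        reader.readAsDataURL(blob);
      });
    }

    function classNameFor(url: string): string {
      // Derive a stable class name from the URL (illustrative hash).
      let h = 0;
      for (let i = 0; i < url.length; i++) h = (h * 31 + url.charCodeAt(i)) | 0;
      return "snap-bg-" + Math.abs(h).toString(36);
    }

    async function inlinePseudoBackground(el: Element, pseudo: "::before" | "::after") {
      const bg = getComputedStyle(el, pseudo).backgroundImage;
      const match = /url\(["']?(.+?)["']?\)/.exec(bg);
      if (!match || match[1].startsWith("data:")) return;

      const dataUri = await toDataURI(match[1]);
      const cls = classNameFor(match[1]);
      const style = document.createElement("style");
      style.textContent = `.${cls}${pseudo} { background-image: url("${dataUri}") !important; }`;
      document.head.appendChild(style);
      el.classList.add(cls); // pseudo-elements can't be styled inline, real nodes can
    }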

Then, I discovered that the web-fonts are not captured in the rrWeb snapshot…
Web-fonts are not that important for styling the text content; you can definitely have a working page with just the “Times Roman” font, so that’s not the issue.
The problem is that web-fonts with icons like Font Awesome, Material Design Icons, or Bootstrap Icons are so important that some pages look completely broken without them.

To overcome this, I had to post-process the pages again, replacing all WOFF2, WOFF, OTF and TTF URLs from all the CSS styles with their base64 representation.
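The same trick, sketched for fonts; this version assumes it runs in Node 18+ (where fetch is global) and uses a simplified regex, so treat it as an outline rather than the real implementation:

    // Sketch: rewrite font URLs inside CSS text as base64 data URIs.
    const FONT_RE = /url\(["']?([^"')]+\.(woff2?|otf|ttf))["']?\)/gi;
    const FONT_MIME: Record<string, string> = {
      woff2: "font/woff2",
      woff: "font/woff",
      otf: "font/otf",
      ttf: "font/ttf",
    };

    async function inlineFonts(cssText: string, baseUrl: string): Promise<string> {
      let out = cssText;
      for (const m of cssText.matchAll(FONT_RE)) {
        const absolute = new URL(m[1], baseUrl).href;     // resolve relative URLs
        const bytes = await (await fetch(absolute)).arrayBuffer();
        const b64 = Buffer.from(bytes).toString("base64");
        const ext = m[2].toLowerCase();
        out = out.replace(m[0], `url("data:${FONT_MIME[ext]};base64,${b64}")`);
      }
      return out;
    }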

Then, I realized that rrWeb doesn’t store the Javascript from the page. At all… It is completely lost, replaced with a ‘SCRIPT_PLACEHOLDER’ string.
This is more of a feature than a bug: the snapshot already contains the whole DOM tree and doesn’t need Javascript to restore the structure, and running the Javascript twice can break the page, so it makes sense.
However, in our case, keeping some of the Javascript, specifically the type="application/ld+json" scripts, is critical.
So this was the last straw… The series of workarounds needed to make all of this work had grown too big, practically a whole new library.
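Just to illustrate the kind of extra glue this requires: the structured data has to be collected separately, before the snapshot, because rrWeb throws the script contents away (the snippet below is purely illustrative, not part of rrWeb):

    // Sketch: grab the JSON-LD blocks before snapshotting, store them alongside.
    const ldJsonBlocks = Array.from(
      document.querySelectorAll('script[type="application/ld+json"]')
    ).map((script) => script.textContent ?? "");
    // ...save ldJsonBlocks next to the rrWeb snapshot, since the snapshot
    // itself will only contain SCRIPT_PLACEHOLDER strings.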

Another potential issue with the rrWeb format is that it produces very, very deeply nested objects. So deep that Python’s built-in JSON library hits the recursion limit, although you can at least work around it by raising sys.setrecursionlimit(10**5). Libraries like UJSON and ORJSON will just crash loudly, without giving you any option to handle it.
To overcome this, I just keep the JSON data as a binary blob and don’t parse it in Python at all, unless I really have to.

After all of these issues, there were still edge cases and some domains were still consistently broken.

Despite all the problems I encountered, I have to say that rrWeb is impressive. It has tremendous potential and it seems to be moving in the right direction, eg: the developers are thinking about making the API async.
In conclusion: rrWeb captures the DOM, but doesn’t embed all resources by default, such as images and fonts.

HTML

I have to mention this format and a few tools, even if it doesn’t work well for many pages. It’s generally more suited for articles, forum posts or documentation.

If you don’t care about maintaining the original page structure and resources, HTML can be the best option, because web pages are by definition HTML. However, you can’t simply download the HTML plus all the resources and expect perfect results; the Javascript will most likely break after some time, or it may already be broken as soon as it’s downloaded.
In other words, the HTML and the resources must be processed before saving for archival. That’s why people are trying to create other kinds of formats, to solve this exact problem.

A few tools:

All these tools have the exact same problems as the WARC tools, in the sense that a simple Javascript if condition checking the date can completely break a page, even if all the resources are present and could potentially be restored.
Another easy way to break pages, which doesn’t happen with WARC, is that all Javascript requests fail, because the portable HTML is just a plain file and there’s no server to handle the requests.
Some of the tools above have the option to ignore all Javascript when recording the page, and that generally works, because articles don’t usually have too much Javascript, which is great.

In conclusion: the HTML format is a great idea, but the tools I checked so far are not working with many modern pages.

Enter the “recorded” format.

The “recorded” format

This was the format that we used initially at Zyte, but it was generated with an old browser called Splash and the pages were not complete.
The name is pretty uninspired… but it wasn’t meant to be public; that’s just what we called it internally.

This format is similar to WARC and HAR, but with a simple twist: the HTML saved in the snapshot is the final, settled HTML, after the page has fully loaded and the Javascript has run.
I made a public repository with an open-source implementation of this, similar to what we use internally: https://github.com/croqaz/web-snap
When recording a page, a browser window opens and you can take as much time as you need to interact with the page: close popups, hover over some images, scroll the page, even inspect the HTML and delete parts of it. In the end, web-snap captures the document’s HTML as is and saves it in the snapshot.
On restore, a browser window will also open, showing the captured page so you can interact with it, completely offline. Of course, you can’t navigate to links that were never captured, but you can still interact with popups, buttons and hover images.
Web-snap is written in Node.js as a command line app. It’s easier to use than webrecorder/pywb and much easier than the rrWeb library for sure.
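A stripped-down sketch of the recording side, using Puppeteer; the JSON layout below is illustrative, not web-snap’s actual format:

    // Sketch: intercept every response, keep the bodies, save the settled HTML.
    import * as fs from "node:fs";
    import puppeteer from "puppeteer";

    async function record(url: string, outFile: string) {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();

      const resources: Record<string, { headers: Record<string, string>; body: string }> = {};
      page.on("response", async (res) => {
        try {
          const body = await res.buffer();
          resources[res.url()] = { headers: res.headers(), body: body.toString("base64") };
        } catch {
          // Some responses (redirects, aborted requests) have no body.
        }
      });

      await page.goto(url, { waitUntil: "networkidle2" });
      // Leave time to close popups, scroll, hover, etc. before capturing.
      await new Promise((resolve) => setTimeout(resolve, 15_000));

      const html = await page.content(); // the settled DOM, after Javascript has run
      fs.writeFileSync(outFile, JSON.stringify({ url, html, resources }));
      await browser.close();
    }

    record("https://example.com", "page.json").catch(console.error);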

The captured page could very easily be saved in the WARC file format, and that would be an elegant implementation!
The reason I decided to use JSON instead of WARC is that the current WARC players don’t give you the option to enable or disable different features on replay, eg: you can’t restore a page with Javascript disabled, which is vital.
With web-snap, running the Javascript on restore is optional, and you also have the option to go online if you want.

This format works really well and it can solve the problems of WARC and rrWeb:

  • it captures absolutely all resources: CSS, JS, web fonts and images
  • the resources can optionally be post-processed, to decrease their size
  • generally the pages can be restored with JS disabled, but if that doesn’t work, JS can be run without many issues
  • a lot of pages from important domains can be captured this way

We use this format to capture live pages, and we can later apply different methods to extract text, HTML, screenshots, etc.

All that being said, this is still not perfect. I think the idea is great and it has huge potential, but the current implementation can still be improved; for example, we don’t capture iframes yet.
In conclusion: the “recorded” format captures the settled HTML and all the resources, so the JS doesn’t need to be re-executed, and the implementation is simple as long as we can intercept requests and replay the responses.

Others

The above formats are by no means all of them! To name a few: HAR, EPUB, HTML, MAFF, MHTML, PDF, screenshots, ZIM…
These formats are generally focused either on capturing pages for debugging (HAR), or on long documents like books.
They don’t handle Javascript, so it’s impossible to capture many modern web pages.

Conclusion

There’s no silver bullet solution. We still have work to do. The archival of web pages is important, I’m passionate about the subject and I will personally continue to think about it and work on it.


Article also posted on:

https://zyte.com/blog/web-snapshots

Also check:

@notes #software #programming #archive #project