Thursday, January 23, 2020

How the Image Scanning and Blocking Works

In this post, I wanted to briefly discuss how the image blocking mechanism works. Notably, this post is not about how the model works - just the actual mechanics of blocking the image. I believe the first part of the post is relevant to all users, while the second part of the post may be of academic interest to other developers - a little something for everybody.

For everybody:

The image blocking works roughly like this. A web page usually downloads the "outline" of everything first, and then downloads the images and other resources after that. The image blocking ties into the step where the images are downloaded, which means that the image as seen by the web page has itself been replaced. This has some important implications. If you right-click and save a blocked image, you will save what the web page received: a placeholder image that does not contain the original image. It also means that even if you turn the plugin off, the image will stay replaced until you do a full page refresh.

There are two interesting side notes.

First, if you visit a bad image's URL directly, the image can't be replaced in the normal way due to technical limitations. Instead, you will get a note that says something like "The image <your web address here> cannot be displayed because it contains errors." This is because internally the plugin has aborted the image download.

Second, some web pages work slightly differently: rather than downloading the "outline" and then the images after that, they may include the images directly as part of the web page using something called a "data URL". Notably, Google Images does this for the first "page" of search results, and I believe it contributes to how quickly they load. To handle this, the plugin scans the web page and replaces these images as well. This works reasonably well, but occasionally a site may embed images in a new way that I haven't encountered yet, and the plugin then fails to scan (and thus block) any of those images. If you suspect this, please comment, or file a bug on GitHub so I can start tracking it right away: https://github.com/wingman-jr-addon/wingman_jr/issues

One last thing: the way that the scanning/filtering works is not available on Chrome, and sadly it is not likely to become available.


Additional technical details for developers:

This uses the webRequest.filterResponseData() interface internally, which has worked well. As noted, that is not available on Chrome, and the existing partial functionality there is actively being neutered in the name of performance and security, so a Chrome port as such is not likely to be possible. I am actively tracking potential ways to do this, however; see https://github.com/wingman-jr-addon/wingman_jr/issues/2 and please comment if you know ways to work with this. I've also seen this project, which would be great if I wanted to switch methods. However, I'm just not convinced that scanning at a DOM level is going to be as robust.
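For developers who haven't used this interface, here is a minimal sketch of the general pattern: buffer the image bytes as they stream in, then write either the original bytes or a placeholder. It is a simplified illustration rather than the plugin's actual code. The names installImageFilter, browserApi, shouldBlock, and placeholderBytes are made up for this example, and the decision callback is synchronous here only to keep the sketch short; a real model call would most likely be asynchronous.

```javascript
// Sketch only: installImageFilter, browserApi, shouldBlock and
// placeholderBytes are hypothetical names, not taken from the plugin.
// In a Firefox extension, browserApi would be the global `browser` object.
function installImageFilter(browserApi, shouldBlock, placeholderBytes) {
  function listener(details) {
    // Attach a stream filter to this request's response body.
    const filter = browserApi.webRequest.filterResponseData(details.requestId);
    const chunks = [];

    // Collect each chunk instead of passing it through immediately.
    filter.ondata = (event) => {
      chunks.push(new Uint8Array(event.data));
    };

    // Once the whole image has arrived, decide what the page receives.
    filter.onstop = () => {
      if (shouldBlock(chunks, details.url)) {
        filter.write(placeholderBytes);
      } else {
        for (const chunk of chunks) {
          filter.write(chunk);
        }
      }
      filter.close();
    };

    return {};
  }

  // Only image requests are intercepted; "blocking" is required
  // in order to use filterResponseData on the request.
  browserApi.webRequest.onBeforeRequest.addListener(
    listener,
    { urls: ["<all_urls>"], types: ["image"] },
    ["blocking"]
  );
  return listener;
}
```

Because the page only ever receives what the filter writes, the replacement is invisible to the page itself, which matches the "save image" behavior described earlier in the post.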

Internally, I serve up SVGs as the placeholders. This means I need to change the MIME type, which normally works great - but not for directly visited URLs, interestingly - which is why those produce an error message instead of the placeholder SVG.
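To make the MIME type point concrete, here is a rough sketch of the two pieces involved: generating a plain SVG placeholder and rewriting a response's Content-Type header so the browser treats the replaced body as SVG. The function names and the gray rectangle are assumptions for illustration, not the plugin's real placeholder. The header array uses the same { name, value } shape that the webRequest API hands to an onHeadersReceived listener.

```javascript
// Sketch only: buildPlaceholderSvg and withSvgContentType are
// hypothetical helpers, not the plugin's own functions.

// Build a simple gray SVG of the requested size.
function buildPlaceholderSvg(width, height) {
  return '<svg xmlns="http://www.w3.org/2000/svg"' +
    ' width="' + width + '" height="' + height + '"' +
    ' viewBox="0 0 ' + width + ' ' + height + '">' +
    '<rect width="100%" height="100%" fill="#cccccc"/>' +
    '</svg>';
}

// Return a copy of the headers with any existing Content-Type
// (matched case-insensitively) replaced by image/svg+xml.
function withSvgContentType(responseHeaders) {
  const kept = responseHeaders.filter(
    (header) => header.name.toLowerCase() !== "content-type"
  );
  kept.push({ name: "Content-Type", value: "image/svg+xml" });
  return kept;
}
```

In an extension, a blocking onHeadersReceived listener could return { responseHeaders: withSvgContentType(details.responseHeaders) } for requests it intends to replace.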

Note that the review form wraps the original image in the SVG and adds a translucent overlay.
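As a rough illustration of that wrapping idea (not the plugin's actual markup), an SVG can embed the original image with an <image> element and then draw a partly opaque rectangle on top of it. The names buildReviewSvg and escapeAttribute, the white fill, and the default opacity below are all assumptions made for this sketch.

```javascript
// Sketch only: buildReviewSvg and escapeAttribute are hypothetical helpers.

// Escape characters that would break out of a double-quoted attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Wrap an image (given as a URL or data URL) in an SVG and cover it
// with a translucent rectangle. overlayOpacity is between 0 and 1.
function buildReviewSvg(imageUrl, width, height, overlayOpacity) {
  const opacity = overlayOpacity === undefined ? 0.8 : overlayOpacity;
  const size = ' width="' + width + '" height="' + height + '"';
  return '<svg xmlns="http://www.w3.org/2000/svg"' + size + '>' +
    '<image href="' + escapeAttribute(imageUrl) + '"' + size + '/>' +
    '<rect' + size + ' fill="#ffffff" fill-opacity="' + opacity + '"/>' +
    '</svg>';
}
```

Drawing the overlay after the image matters, since SVG paints elements in document order and the rectangle needs to end up on top.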

Along the way, I discovered that data URLs are unfortunately not hooked by the webRequest API. There is a bug filed for it, but I think it is unlikely to be fixed anytime soon. Regrettably, that leaves me only with the hack of scanning every HTML document for base64 data URLs. Fun times with regex!
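To give a flavor of what that scanning can look like, here is a small sketch that finds base64 image data URLs in an HTML string and swaps out the ones a callback flags. The pattern and the function names are assumptions for illustration; this simple regex only covers the plain data:image/...;base64,... form and will miss data URLs that a site has escaped or split up in some other way.

```javascript
// Sketch only: the pattern and helper names are hypothetical.
// Matches data URLs such as data:image/png;base64,iVBORw0KGgo=
const DATA_IMAGE_URL = /data:image\/[a-z0-9.+-]+;base64,[A-Za-z0-9+\/]+=*/gi;

// Return every base64 image data URL found in the HTML text.
function findDataImageUrls(html) {
  return html.match(DATA_IMAGE_URL) || [];
}

// Replace each data URL that shouldBlock(url) flags with placeholderUrl,
// leaving all other text (including ordinary image URLs) untouched.
function replaceDataImageUrls(html, shouldBlock, placeholderUrl) {
  return html.replace(DATA_IMAGE_URL, (url) =>
    shouldBlock(url) ? placeholderUrl : url
  );
}
```

For example, replaceDataImageUrls(pageHtml, isBad, placeholder) would leave an ordinary <img src="https://..."> alone, since those requests are already covered by the webRequest hook.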

Google's habit of embedding data URLs in special ways has caused more than one issue in the past; see [1] and [2]. If you see an issue with Google or another site, please add a bug over at https://github.com/wingman-jr-addon/wingman_jr/issues

Let me know if you have any questions on this API or just go look at my source. Thanks for reading to the end!
