Tuesday, August 6, 2024

Release 3.4.0 Better Non-English Webpage Support

This release is especially for those whose native language is not English, or who frequently browse international websites. If you've been keeping up with the releases, you will have noticed that international support has featured in quite a few of them, with notes about character encoding etc. being fixed for a few characters. So you'd think that by now the main problems would all be fixed, right? How could this release help if so much work has already been done?

Well, it has to do with how web pages get served up. Handling international text is actually quite a difficult problem with many nuances. From a historical perspective, it wasn't solved well right away, and the early approach was often to use "code pages". How do these work?

Most languages - but not all, particularly excluding East Asian languages - could generally get by with a relatively limited number of characters. In particular, the vast majority of languages could use 256 or fewer characters, which meant that since one byte can represent the numbers 0-255, you could use one byte to represent one character. Simple! This mapping is called a "code page".

So what's the catch? Well, it didn't work as well for East Asian languages for one thing - but if you used two bytes you could make it work. It also meant that you couldn't use just one code page - you had to use one for each language. This creates new difficulties - for example, what if you want to show two different languages on the same page?

As a result, a new system was born - Unicode. The most popular encoding for Unicode is now UTF-8. This can use multiple bytes in a fancy system to represent an arbitrary number of "code points" - basically numbers to represent characters. (There is certainly more nuance here but this is roughly the idea.) This system has become ubiquitous, but it should be noted that web pages can take up a bit more room to support multiple languages.

Prior to release 3.4.0, only international pages using UTF-8 were really supported well, with a bit of extra support for the most common code page. But many web pages - particularly those with a focus on serving a single country - would use the older code page approach. Unfortunately this ran into a slightly tricky problem. To scan web pages containing embedded Base64 images, the web page itself is decoded and re-encoded. This means that the addon has to turn bytes into characters, check things, then turn the characters back into bytes. Decoding the bytes into characters is easy - just use TextDecoder. So you might be thinking that there would be a TextEncoder, too ... and you'd be right, but there's a catch. The built-in TextDecoder can decode basically anything, but the TextEncoder can only turn characters back into bytes using UTF-8. So, that means that out of the box, there is literally no way to support re-encoding text back into the served up code page. Ideally this would just exist in the main API.
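
For the technically curious, the asymmetry described above can be seen in a few lines. This is a minimal sketch, not the addon's code; the sample bytes and variable names are made up for illustration, and it assumes a runtime where TextDecoder knows the legacy encodings (browsers do, as do standard Node.js builds).

```javascript
// Bytes for "café" as a Windows-1252 page would serve them (é is the single byte 0xE9).
const servedBytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Decoding: TextDecoder accepts a label for the legacy encoding.
const text = new TextDecoder("windows-1252").decode(servedBytes);

// Encoding: TextEncoder takes no encoding argument and always emits UTF-8,
// so é comes back as the two bytes 0xC3 0xA9 instead of the original 0xE9.
const encoder = new TextEncoder();
const reencoded = encoder.encode(text);

console.log(text);                                 // café
console.log(encoder.encoding);                     // utf-8
console.log(servedBytes.length, reencoded.length); // 4 5
```

The page went in as 4 bytes and came out as 5, which is exactly the mismatch that breaks a page still declaring itself as Windows-1252.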

But, to handle this I've now created a program that helps! Basically it creates a map containing all this code page information so that it can be used in place of TextEncoder. Is it perfect or complete? No. But there is a good chance your home language may now be supported for national websites - so, I hope this release works well for you and happy browsing!
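
To give a feel for the idea, here is one way such a map could be built and used. This is a sketch under assumptions, not the program mentioned above: the function names buildCodePageMap and encodeWithCodePage are hypothetical, the map is derived at runtime by running TextDecoder "backwards" over all 256 byte values, and it only covers single-byte code pages (the two-byte East Asian encodings would need more work).

```javascript
// Hypothetical helper: build a character -> byte map for a single-byte code page
// by asking TextDecoder what each of the 256 possible bytes decodes to.
function buildCodePageMap(label) {
  const decoder = new TextDecoder(label);
  const charToByte = new Map();
  for (let byte = 0; byte < 256; byte++) {
    const char = decoder.decode(new Uint8Array([byte]));
    // U+FFFD marks bytes the code page leaves undefined; skip those.
    if (char !== "\ufffd") {
      charToByte.set(char, byte);
    }
  }
  return charToByte;
}

// Hypothetical encoder: look each character up, falling back to "?" (0x3F)
// for anything the code page cannot represent.
function encodeWithCodePage(text, charToByte) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const out = new Uint8Array(chars.length);
  chars.forEach((char, i) => {
    out[i] = charToByte.has(char) ? charToByte.get(char) : 0x3f;
  });
  return out;
}

const cyrillicMap = buildCodePageMap("windows-1251");
const original = "Привет";
const bytes = encodeWithCodePage(original, cyrillicMap);
const roundTripped = new TextDecoder("windows-1251").decode(bytes);
console.log(bytes.length, roundTripped === original); // 6 true
```

The "?" fallback is a design choice for the sketch; a character outside the code page has no correct byte, so something lossy has to happen at that point.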

Special thanks to Ayaya and Dragodraki for continuing to provide feedback on international language support!


Tuesday, January 30, 2024

Mailbag (Later 2023-Jan 2024)

I've been sifting through user feedback from the last couple of months, and I've received a decent bit of it.

From Dragodraki

Dragodraki continues to be a star bug reporter! I've been able to fix many issues with international support due to their reports. Recent release 3.3.6 features another fix related to proper charset handling of Windows-1252.

From Ayaya - Cyrillic

Ayaya sent in feedback that Cyrillic was still not working fully correctly. See issue https://github.com/wingman-jr-addon/wingman_jr/issues/201

From SplinterCell

I got some constructive criticism from user SplinterCell mixed in with some other positive feedback (slightly edited for clarity):

  1. It is unclear to me what the numbers mean on the images
  2. Why is there no option to blur images - this way users can recognize false-positives more easily
  3. The UI is ugly and it is not as self-explanatory as you think -> Do the buttons work per site, per browsing sessions, what do the buttons do, etc. it's wholly unclear.

Good questions all, so I'll take a bit of time on each.

First, the numbers relate to the score that the image filter model returns. Basically, the higher the number, the more likely it is to be an NSFW image. This isn't quite the same as saying that it's a more NSFW image if it's a higher number, but there is often a correlation. For the technically-minded, it finds the model's confidence score, maps that on the ROC curve and returns 1.0-TPR at that point; not the most well-founded way but a confidence indicator.
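
As a rough illustration of that last sentence, the lookup could be sketched like this. Everything here is an assumption made for the example: the ROC table values are invented, the function name displayedNumber is hypothetical, and the real addon may interpolate or store the curve differently.

```javascript
// Hypothetical ROC table: each entry says "at this score threshold, this fraction
// of truly NSFW validation images scored at or above it" (the true positive rate).
// The numbers are made up for illustration and sorted by ascending threshold.
const rocPoints = [
  { threshold: 0.0, tpr: 1.0 },
  { threshold: 0.25, tpr: 0.75 },
  { threshold: 0.5, tpr: 0.5 },
  { threshold: 0.75, tpr: 0.25 },
];

function displayedNumber(score, points) {
  // Find the last ROC point whose threshold the score reaches.
  let match = points[0];
  for (const point of points) {
    if (point.threshold <= score) {
      match = point;
    }
  }
  // A high score sits where few true positives remain above it,
  // so 1.0 - TPR grows as the score grows.
  return 1.0 - match.tpr;
}

console.log(displayedNumber(0.3, rocPoints)); // 0.25
console.log(displayedNumber(0.9, rocPoints)); // 0.75
```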

Second, regarding blurring - there are a couple of reasons. The first is that I try to think through the psychology of the addon a bit as well, and while blurring images allows false-positives to be picked out more easily, it also allows true-positives to be observed a bit better. Could it be an option for some? Yes, but probably not a default. The second reason is that blur effects are fairly computationally expensive, and I've tried to avoid incurring that cost so that pages with large numbers of images will still be speedy. In practice, it could be that this wouldn't be an issue. So - head on over to GitHub if you feel strongly about it and enter an issue.

Third - yes, I agree the UI is ugly and a bit clunky. As the main developer on the project, I have to choose where to put my time and this just hasn't been a focus. Here are a few notes:

  • Image filtering currently works at a global level, so the buttons are not per domain or per web page. However, this is something I've pondered changing.  (See related: https://github.com/wingman-jr-addon/wingman_jr/issues/184 and https://github.com/wingman-jr-addon/wingman_jr/issues/168)
  • The basic way that it works is there are different zones based on how sensitive the model is configured to be: use Trusted for sites without much chance of bad content, Neutral on sites where there's a chance some questionable content will pop up but rarely, and Untrusted on sketchy sites. You can switch between the zones to kick it into manual mode, otherwise it'll try to flip back and forth automatically when Automatic is selected.

Whitelisting

A couple users (Opensourcerer and happydev) wrote in regarding a whitelisting feature. This is being tracked but hasn't seen action in a while - see https://github.com/wingman-jr-addon/wingman_jr/issues/184 from above.

Mobile - Future?

I got an unexpected PR from one ArthurMelton that helps support use on mobile. The addon doesn't officially support that, but this pushes it closer! Thanks Arthur!

Conclusion and Next Steps

Thanks for the feedback! Lately I've been working on reviewing machine learning research over the past 2 or so years to check for possible advancements to improve the base model, particularly those related to the explosion of growth brought about by the cross-pollination of transformers to image classification. You can see some of the experiments here: https://github.com/wingman-jr-addon/model/issues/7

Thursday, May 18, 2023

Release 3.3.4 - Revenge of the Character Encoding

Previously, in The Case of the Distorted Symbols I talked a bit about some improvements being made to better handle character encoding detection - this is the followup. If you're a non-technical reader, just know that some sites should hopefully soon do a better job of displaying accented characters etc. as they ought to be rather than as symbols. If, however, you're a technical reader, read on for some interesting notes about handling character encoding on the web.

I've had at least one dedicated international user helping report bugs. To that end, I'd like to thank Drago for the helpful feedback in reviews. Recently, Drago reported that a specific site wasn't working, so it gave me an opportunity to debug further and nail down the specific problems.

In the first round, I took a naive approach to detecting character encoding and was able to pass most of the test suites found here: https://www.w3.org/2006/11/mwbp-tests/index.xhtml 

However, I had some interesting problems:

  • My original approach would read in bytes and output them through the TextEncoder as utf-8. This is problematic because the input bytes could actually have been in iso-8859-1.
  • True character set detection is quite difficult: the headers are not enough to definitively determine the character set, so you have to actually sniff the response contents.

By default, the new implementation starts in iso-8859-1 and then "upgrades" to utf-8 if any of a variety of conditions are encountered:

  1. Headers: Content-Type has a charset
  2. Content sniffing: starts with BOM
  3. Content sniffing: XML encoding indicates utf-8
  4. Content sniffing: meta http-equiv Content-Type indicates utf-8

Content sniffing uses the first 512 bytes currently, and the specific upgrade types have quite narrow search patterns - e.g. different ordering of http-equiv could cause non-detection etc.
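
The upgrade rules above can be sketched roughly as follows. This is an illustration, not the addon's implementation: the function name detectCharset and the regular expressions are hypothetical, and it assumes that rule 1 only upgrades when the header's charset is utf-8. The patterns are deliberately narrow, so the ordering caveat just mentioned shows up here too.

```javascript
// Hypothetical helper: names and regexes are illustrative only.
const SNIFF_LENGTH = 512;

function detectCharset(contentTypeHeader, bodyBytes) {
  // 1. Headers: Content-Type declares utf-8 (assumption: only utf-8 upgrades).
  if (contentTypeHeader && /charset\s*=\s*"?utf-?8/i.test(contentTypeHeader)) {
    return "utf-8";
  }
  const head = bodyBytes.subarray(0, SNIFF_LENGTH);
  // 2. Content sniffing: UTF-8 byte order mark EF BB BF.
  if (head.length >= 3 && head[0] === 0xef && head[1] === 0xbb && head[2] === 0xbf) {
    return "utf-8";
  }
  // Treat each byte as one Latin-1 character so the markup can be searched as text.
  const text = String.fromCharCode(...head);
  // 3. Content sniffing: XML declaration names utf-8.
  if (/<\?xml[^>]*encoding\s*=\s*["']utf-?8["']/i.test(text)) {
    return "utf-8";
  }
  // 4. Content sniffing: <meta http-equiv="Content-Type" ... charset=utf-8>.
  if (/<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*utf-?8/i.test(text)) {
    return "utf-8";
  }
  // Default when nothing triggers an upgrade.
  return "iso-8859-1";
}

const toBytes = (s) => new TextEncoder().encode(s);
const meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
const reordered = '<meta content="text/html; charset=utf-8" http-equiv="Content-Type">';
console.log(detectCharset("text/html", toBytes(meta)));      // utf-8
console.log(detectCharset("text/html", toBytes(reordered))); // iso-8859-1
```

The second call shows the kind of non-detection a narrow pattern produces when the attributes appear in a different order.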

All of the tests in the W3C test suite now pass!

With the improved approach, I'm hopeful that >90% of international pages will be correctly handled now, but we'll see what folks like you encounter - let me know if you encounter any bugs via the feedback link in the addon or via https://github.com/wingman-jr-addon/wingman_jr!

Thursday, May 4, 2023

The "Hidden Tabs Keeps Appearing" Problem

Summary: Hopefully this is fixed in 3.3.3, but see Note 2 at bottom if you still have problems!

I've been hearing scattered reports for quite some time about how the hidden tabs prompt keeps appearing; enough that I added a dedicated category for it to the exit survey so I could see if it was a major issue or not.

And the answer was a resounding "yes". I also started seeing the occasional review that discussed it, too.

Unfortunately, as is often the case, I could not reproduce the issue, and I did not have a clear way to find out what was causing it. Fortunately I was able to respond to one of the reviewers and they worked with me to capture logs from their system. So if this fix works for you, please tell Umbrella123 thanks over on the GitHub issue! (See https://github.com/wingman-jr-addon/wingman_jr/issues/185)

So what was the problem and why did the "Hidden Tabs" prompt keep appearing?

First, it's important to keep in mind that originally the addon did not need these hidden tabs. But unfortunately some slight changes in how Firefox and the machine learning framework Tensorflow.js interacted caused the old approach to become unbearably slow on one Firefox release a couple years ago. This forced a massive rewrite. As part of the rewrite, I needed to basically separate the addon into two parts: the original core part, and a satellite "web page" that can access the graphics card more effectively. Having these separate parts has two important implications: 1) the "Hidden Tabs" prompt shows up to hide the satellite "web page" and 2) there is increased complexity to get the two parts to talk to each other.

Second, the increased complexity of the system drove me to rely on a "watchdog" - the addon will send itself a known test image every few seconds and check that the result is as expected and consistent. This makes sure data is flowing through the system OK. This is quite important because generally people expect the addon to work even after scanning tens of thousands of images, just like the browser does. I'd also seen sometimes in the past where Tensorflow.js eventually could become unstable so it helps guard against that. If the "watchdog" self-test fails, the addon will assume that something has gone wrong with itself and restart after a certain threshold.
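
A watchdog along these lines might be sketched as below. The class name, its parameters, and the choice to reset the failure count after a passing check are all assumptions for illustration; the addon's real logic and threshold may differ. In the addon the check would be driven by a timer every few seconds, while here it is called by hand.

```javascript
// Hypothetical watchdog: all names and the threshold are illustrative.
class Watchdog {
  constructor(expected, isConsistent, maxFailures, restart) {
    this.expected = expected;         // known-good output for the test image
    this.isConsistent = isConsistent; // comparison function (expected, actual) => boolean
    this.maxFailures = maxFailures;   // failures tolerated before restarting
    this.restart = restart;           // callback that restarts the addon
    this.failures = 0;
  }

  // Call with the model's output for the known test image.
  check(actual) {
    if (this.isConsistent(this.expected, actual)) {
      this.failures = 0; // assumption: a passing check resets the count
      return;
    }
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.failures = 0;
      this.restart();
    }
  }
}

let restarts = 0;
const strictlyEqual = (a, b) => a.every((value, i) => value === b[i]);
const watchdog = new Watchdog([0.9, 0.1], strictlyEqual, 3, () => { restarts += 1; });

watchdog.check([0.9, 0.1]); // passes
for (let i = 0; i < 3; i++) {
  watchdog.check([0.9001, 0.0999]); // fails the strict comparison
}
console.log(restarts); // 1
```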

You can probably see where this is going now: when you see the "Hidden Tabs" prompt, it is the addon restarting itself because it thinks it is malfunctioning.

For a while I assumed this was probably due to some bad interaction with another addon that also did filtering. But it turns out that the true cause of the problem is that Tensorflow.js on certain Linux systems seems to give inconsistent results from the first prediction to later predictions for the same test image - the values can be close, but still different. So the self-checker would bail after a few self-checks and the "Hidden Tabs" prompt would appear.

I would argue that this seems to be a bug in Tensorflow.js on Linux, but I decided to make the check more robust by doing an "approximately equal" check instead and now it is working. Hooray! Special thanks to Umbrella123 for helping find the issue, Opensourcerer for confirming the fix, and all the users reporting they were having this problem. Please let me know on GitHub or in reviews if you are still seeing issues, but hopefully it's just gone now.
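
The difference between the two kinds of check is small in code. The following is a minimal sketch with made-up prediction values and a made-up tolerance; the function name approximatelyEqual is hypothetical and the addon's actual tolerance is not stated here.

```javascript
// Compare two prediction vectors element by element, allowing each value
// to drift by at most `tolerance`.
function approximatelyEqual(expected, actual, tolerance) {
  if (expected.length !== actual.length) {
    return false;
  }
  return expected.every((value, i) => Math.abs(value - actual[i]) <= tolerance);
}

// Illustrative values: close, but not bit-for-bit identical.
const firstPrediction = [0.9, 0.1];
const laterPrediction = [0.9001, 0.0999];

const exactMatch = firstPrediction.every((value, i) => value === laterPrediction[i]);
console.log(exactMatch);                                                 // false
console.log(approximatelyEqual(firstPrediction, laterPrediction, 0.01)); // true
console.log(approximatelyEqual(firstPrediction, [0.5, 0.5], 0.01));      // false
```

Picking the tolerance is the hard part: too tight and healthy systems restart, too loose and a genuinely broken model slips through.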

Note! Umbrella123 also pointed out that they needed to change two settings to get it working - in about:config, set webgl.out-of-process to true and webgl.force-enabled to true. Please try these if you are having issues.

Note 2! Codedotexe reported that they needed to increase the tolerance for their computer - I'll plan to widen it for the next release as per https://github.com/wingman-jr-addon/wingman_jr/issues/191

Saturday, January 14, 2023

Feedback from 2022

Over the last year or so, I've gotten a decent bit of feedback across a number of different channels. I thought I'd share that here so folks can get a sense of what I'm hearing. This blog post is long, but there are three main categories of feedback covered here if you'd like to skip around: current user feedback, reviews, and exit poll feedback. If you wrote in, please check it out - I've likely responded to your input. I'd like to thank everyone for their feedback, as it has been generally constructive and helpful!

Sunday, February 13, 2022

Silent Mode - Custom Images?

This post is for folks who use silent mode - I've gotten some feedback from a couple helpful users:

qvim's hope: "...change the silent mode images, it would be my preference for me to choose my own images for this mode. there's nothing wrong with the images just that i would like to have the ones i prefer"

Cranky's hope: "...Ability to add custom images for silent mode from an online database of SFW images (you can memorize the stock images pretty quickly, defeating the purpose)", "Ability to lighten/darken/remove the watermark in silent mode"

Silent mode is one of my personal favorite features, so it makes me happy to hear that others are using it - and using it enough that they'd like to see it expanded somehow.

So first, let's talk just a bit about how silent mode works right now. I've gotten a number of images from Unsplash, downsized them heavily, and included them directly in the addon (~100 images in ~2MB!). I've also created an index of metadata for attribution as well as some data for image similarity, so that in theory the images being replaced seem less out of place. Honestly, I don't think that it works particularly well in that regard, but there is a rudimentary method in there right now that does that.

So, what about adding your own images? To do so I think there would be a couple approaches:

  1. Drag and drop your images onto the addon settings page.
  2. Provide a way to get placeholder images from the web.

Since the image similarity code is done as a pre-build step, the logic isn't even in the JS of the addon, so both potential methods might lack a bit there.

Option #1 wouldn't be too hard but I'd have to look into local storage limits. I think we might hit some issues there, particularly if folks aren't resizing the images as heavily as I did. Additionally, if images are handpicked, Cranky's note about memorizing the stock images still is unsolved unless many, many images are put in - which will definitely have problems with local storage limits.

Option #2 provides more flexibility - for example, see this list of placeholder image services. Most of these image services tend to provide a URL where the desired width and height is supplied, which would mesh well with the addon's current functionality.

I think option #2 comes with fewer headaches, but let me know over on GitHub!

(Update: the adventurous may wish to watch this branch...)

Thursday, January 20, 2022

Block/Allow Images?

Recently I got some friendly feedback from "qvim". There were multiple parts to it, but I wanted to highlight one of the suggestions. The desired feature would be to ...

block with click, sometimes it doesn't work and if possible I would like to click on the image with the right button and be able to block or allow. (very important)

Now, I'm actually a bit surprised that I haven't gotten more feedback around this type of request in the past. It's popped up maybe once or twice. There are a couple of interesting points about this, one technical and one more based on use cases.

First, the technical reason. For those who haven't peered at the source code, you may be interested to learn that the addon does not actually interact directly with the HTML elements at all. Nope. The filtering actually ties in at a lower level - it is basically processing the network requests for different media types. When I started the project, I thought about taking an approach that would interact with the HTML for each image, but decided against it because I thought working with streams would do a better job of catching every single image going through the system. Several years in, I think I made the right choice but there are still some exceptions - Base64 images and a few others delivered in special ways. However, I don't have to deal with endless hacks against the HTML DOM to try to catch every element change or to try to make images invisible until the addon has scanned them.
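
To make the stream-based idea concrete, here is a rough sketch of filtering at the network level. Firefox offers browser.webRequest.filterResponseData for intercepting response bodies; whether the addon wires things up exactly like this is an assumption, and attachImageScanner, isBlocked, and placeholderBytes are hypothetical names. The classifier is synchronous only to keep the sketch short; a real model call would be asynchronous.

```javascript
// Hypothetical sketch: buffer a response as it streams past, classify the whole
// image, then forward either the original bytes or a placeholder.
// `filter` has the shape of Firefox's StreamFilter (ondata, onstop, write, close).
function attachImageScanner(filter, isBlocked, placeholderBytes) {
  const chunks = [];
  filter.ondata = (event) => {
    // Hold each chunk back instead of passing it straight to the page.
    chunks.push(new Uint8Array(event.data));
  };
  filter.onstop = () => {
    // Stitch the chunks into one buffer holding the complete image.
    const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    const image = new Uint8Array(total);
    let offset = 0;
    for (const chunk of chunks) {
      image.set(chunk, offset);
      offset += chunk.length;
    }
    filter.write(isBlocked(image) ? placeholderBytes : image);
    filter.close();
  };
}

// In an addon's background script the wiring might look like this (not run here):
// browser.webRequest.onBeforeRequest.addListener(
//   (details) => {
//     const filter = browser.webRequest.filterResponseData(details.requestId);
//     attachImageScanner(filter, myClassifier, myPlaceholder);
//   },
//   { urls: ["<all_urls>"], types: ["image"] },
//   ["blocking"]
// );
```

Note that nothing in this flow knows which HTML element will eventually display the bytes, which is the limitation discussed next.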

So why does this matter? Well, it means that once the data has been scanned, I don't have an easy way to know where it actually appears on the page. The image data is essentially read-only. Similarly, I certainly could create something to interact with the images as HTML elements, but there just hasn't been a strong need.

Second, the use cases. Folks like to use this addon in a variety of ways. In some cases, the user is an adult and would like to filter out the junk. But in other cases, parents are using this as a basic defense for their kids. While it's not easy to lock down addons as a parental control, I do try to avoid showing one-click options to see images. There's also a third use case, where the user is an adult but is trying to overcome pornography. In that case, it may also not be the best option to make it too easy to view a blocked image.

So where does that leave things? Well, since the data is already delivered and read-only, it is not technically easy to go back and show an already blocked image (possible, yes, but definitely not trivial). However, hiding an offensive image that the filter missed is easier because it does not require this coordination - for example, the HTML element can just be altered. Similarly, from a use case perspective, it is likely good to allow blocking an image, but not showing one. The easiest way to do something like this would be to expose it as a right-click context menu, but if you follow that link you may notice that the user experience could get a little clunky.

This also ties in with another request: can users report that images are either being blocked incorrectly or that they should be blocked? This one is a bit more spicy to deal with.

I'd like to hear more feedback about if you'd be interested in this blocking feature and/or if you'd be interested in an image reporting feature. Let me know using the feedback link in the addon, or head on over to GitHub!


Update: you can see a rudimentary version of this here: https://github.com/wingman-jr-addon/wingman_jr/tree/hide-images with relevant issue at https://github.com/wingman-jr-addon/wingman_jr/issues/162