Wednesday, November 4, 2020

Release 1.2.1 - The Case of the Distorted Symbols

International users - this release is a bug fix release for you!
One of you kindly reported seeing special characters such as "ä", "ö", "ü", "ß", and "€" showing up incorrectly as "ï¿½". This release should fix most instances of that, but please comment at https://github.com/wingman-jr-addon/wingman_jr/issues/70 if you are still seeing problems. Thanks!
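To see the failure mode concretely, here is a minimal sketch (assuming a runtime with the standard TextDecoder, such as Firefox or Node): decoding single-byte-encoded characters with a UTF-8 decoder produces U+FFFD, the replacement character users were seeing.

```javascript
const aUmlaut = new Uint8Array([0xE4]); // "ä" in ISO-8859-1 / windows-1252
const euro = new Uint8Array([0x80]);    // "€" in windows-1252

// Decoding with the wrong label: UTF-8 sees an invalid byte sequence and
// substitutes U+FFFD, the replacement character, for it.
console.log(new TextDecoder('utf-8').decode(aUmlaut)); // "\uFFFD"
console.log(new TextDecoder('utf-8').decode(euro));    // "\uFFFD"

// Decoding with the correct label recovers the original characters.
console.log(new TextDecoder('windows-1252').decode(aUmlaut)); // "ä"
console.log(new TextDecoder('windows-1252').decode(euro));    // "€"
```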

For the technically curious (or perhaps those who are having trouble falling asleep at night and need something boring to read), here's what was happening. In order to scan images that have been encoded as Base64 data URIs, I fully scan all documents with Content-Type text/html and perform search-and-replace as necessary. However, the document arrives as bytes, so I need to handle the decoding from bytes into text myself. All the examples out there just use UTF-8 for the TextDecoder, but alas, real life is a bit more complex: this issue was caused by incorrectly decoding non-UTF-8 documents as UTF-8. So now I do rudimentary encoding detection based on the "charset" parameter of the Content-Type header. An interesting follow-up is that when I turn the text back into bytes, I use TextEncoder, which at present only supports UTF-8, so I need to make sure the Content-Type gets set appropriately to match.
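The flow above can be sketched roughly as follows. This is a simplified illustration, not the add-on's actual code: `parseCharset` and `rewriteHtml` are hypothetical helper names, and the real charset handling is more involved.

```javascript
// Hypothetical helper: pull the charset label out of a Content-Type value,
// falling back to UTF-8 when none is declared.
function parseCharset(contentType) {
  // e.g. 'text/html; charset=ISO-8859-1' -> 'iso-8859-1'
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Hypothetical pipeline: decode with the detected charset, scan/replace,
// then re-encode. TextEncoder only supports UTF-8, so the rewritten
// response must declare charset=utf-8 in its Content-Type.
function rewriteHtml(bytes, contentType) {
  const charset = parseCharset(contentType);
  const html = new TextDecoder(charset).decode(bytes);

  const scanned = html; // ... scan/replace Base64 image data URIs here ...

  return {
    bytes: new TextEncoder().encode(scanned),
    contentType: 'text/html; charset=utf-8'
  };
}
```

The key point is the asymmetry: decoding must honor whatever encoding the server declared, but encoding always produces UTF-8, so the outgoing header has to be rewritten to agree with the bytes.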

Note that using only Content-Type for character encoding detection is considerably simpler than the mechanism browsers actually use, but it still covers the vast majority of use cases even though it is not quite accurate. You can see how it fares against a selection of standardized tests by the W3C. Character encoding detection is exceedingly sophisticated; if I still haven't bored you with the details, I recommend checking out the spec for those facing truly persistent insomnia.
