Thursday, May 18, 2023

Release 3.3.4 - Revenge of the Character Encoding

Previously, in The Case of the Distorted Symbols, I talked a bit about some improvements being made to better handle character encoding detection - this is the followup. If you're a non-technical reader, just know that some sites should hopefully soon do a better job of displaying accented characters etc. as they ought to be, rather than as symbols. If, however, you're a technical reader, read on for some interesting notes about handling character encoding on the web.

I've had at least one dedicated international user helping report bugs. To that end, I'd like to thank Drago for the helpful feedback in reviews. Recently, Drago reported that a specific site wasn't working, so it gave me an opportunity to debug further and nail down the specific problems.

In the first round, I took a naive approach to detecting character encoding and was able to pass most of the test suites found here: https://www.w3.org/2006/11/mwbp-tests/index.xhtml 

However, I had some interesting problems:

  • My original approach would read in bytes and output them through the TextEncoder as utf-8. This is problematic because the input bytes could actually have been in iso-8859-1.
  • True character set detection is quite difficult: the headers alone are not enough to definitively determine the character set, so you have to actually sniff the contents of the response.
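
To make the first problem concrete, here is a minimal sketch (not the addon's actual code) of what happens when iso-8859-1 bytes are treated as utf-8. The variable names are made up for illustration; TextDecoder is the standard web API.

```typescript
// "café" encoded as iso-8859-1: the é is the single byte 0xE9.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Decoding with the wrong charset: 0xE9 on its own is not valid utf-8,
// so the decoder substitutes the replacement character U+FFFD.
const asUtf8 = new TextDecoder("utf-8").decode(latin1Bytes);

// Decoding with the charset the bytes were actually written in.
const asLatin1 = new TextDecoder("iso-8859-1").decode(latin1Bytes);

console.log(asUtf8);   // "caf" followed by a replacement symbol
console.log(asLatin1); // "café"
```

The bytes are identical in both cases; only the assumed charset differs, which is why guessing wrong shows up as symbols on the page.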
     

By default, the new implementation starts in iso-8859-1 and then "upgrades" to utf-8 if any of a variety of conditions are encountered:

  1. Headers: Content-Type has a charset
  2. Content sniffing: starts with BOM
  3. Content sniffing: XML encoding indicates utf-8
  4. Content sniffing: meta http-equiv Content-Type indicates utf-8

Content sniffing currently looks only at the first 512 bytes, and the specific upgrade checks use quite narrow search patterns - e.g. a different ordering around http-equiv could cause non-detection.
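
As a rough illustration of the rules above, here is a minimal sketch of the "start in iso-8859-1, upgrade to utf-8" idea. It is not the addon's real implementation: detectEncoding, SNIFF_LENGTH, and the regular expressions are all hypothetical, and I'm assuming the header rule means a charset of utf-8 in Content-Type.

```typescript
// Hypothetical names throughout; a sketch of the upgrade rules, not the addon's code.
type Encoding = "iso-8859-1" | "utf-8";

const SNIFF_LENGTH = 512; // only the first 512 bytes are examined

function detectEncoding(contentType: string | null, body: Uint8Array): Encoding {
  // 1. Headers: assumed here to mean a utf-8 charset in Content-Type.
  if (contentType !== null && /charset\s*=\s*"?utf-8/i.test(contentType)) {
    return "utf-8";
  }

  // 2. Content sniffing: utf-8 byte order mark (EF BB BF).
  if (body.length >= 3 && body[0] === 0xef && body[1] === 0xbb && body[2] === 0xbf) {
    return "utf-8";
  }

  // Map each sniffed byte to one character so ASCII markup can be searched.
  let head = "";
  const limit = Math.min(body.length, SNIFF_LENGTH);
  for (let i = 0; i < limit; i++) {
    head += String.fromCharCode(body[i]);
  }

  // 3. Content sniffing: XML declaration naming utf-8.
  if (/<\?xml[^>]*encoding\s*=\s*["']utf-8["']/i.test(head)) {
    return "utf-8";
  }

  // 4. Content sniffing: meta http-equiv Content-Type naming utf-8.
  //    Deliberately narrow: http-equiv must come before the charset.
  if (/<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*utf-8/i.test(head)) {
    return "utf-8";
  }

  // No upgrade condition met: stay with the default.
  return "iso-8859-1";
}
```

Note that the pattern in step 4 misses a meta tag whose content attribute comes before http-equiv, which is exactly the kind of non-detection that narrow search patterns can cause.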

All of the tests in the W3C test suite now pass!

With the improved approach, I'm hopeful that >90% of international pages will be correctly handled now, but we'll see what folks like you encounter - let me know if you encounter any bugs via the feedback link in the addon or via https://github.com/wingman-jr-addon/wingman_jr!
