Tuesday, August 6, 2024

Release 3.4.0: Better Non-English Webpage Support

This release is especially for those whose native language is not English, or who frequently browse international websites. If you've been keeping up with the releases, you'll have noticed that international support has featured in quite a few of them, with notes about character encoding fixes for a handful of characters. So you'd think that by now the main problems would all be fixed, right? How could this release help if so much work has already been done?

Well, it has to do with how web pages get served up. Handling international text is actually quite a difficult problem with many nuances. Historically, it wasn't solved well right away, and the early approach was often to use "code pages". How do these work?

Most languages - though not all, East Asian languages being the notable exception - can be written with a relatively limited number of characters. In particular, the vast majority of languages need 256 or fewer characters, and since one byte can represent the numbers 0-255, you can use one byte to represent one character. Simple! This byte-to-character mapping is called a "code page".
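To make that concrete, here's a tiny sketch (in TypeScript, using the browser's built-in TextDecoder, which understands many legacy code pages by their labels) of one byte becoming one character:

```typescript
// In the windows-1252 code page, byte 0xE9 is defined to mean "é".
const bytes = new Uint8Array([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0xE9]);

// TextDecoder accepts legacy code page labels, not just "utf-8".
const decoder = new TextDecoder("windows-1252");
console.log(decoder.decode(bytes)); // "Hello, é"
```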

So what's the catch? For one thing, it didn't work as well for East Asian languages - though if you used two bytes per character you could make it work. More importantly, no single code page could cover everything - you had to use a different one for each language. This creates new difficulties - for example, what if you want to show two different languages on the same page?
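You can see the problem directly: the very same byte means completely different characters depending on which code page you assume. A quick illustration, again using TextDecoder's legacy labels:

```typescript
// One byte, two interpretations, depending on the assumed code page.
const byte = new Uint8Array([0xE9]);

console.log(new TextDecoder("windows-1252").decode(byte)); // "é" (Western European)
console.log(new TextDecoder("windows-1251").decode(byte)); // "й" (Cyrillic)
```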

As a result, a new system was born: Unicode. The most popular encoding for Unicode is now UTF-8, which uses a variable number of bytes in a fancy scheme to represent an essentially unlimited number of "code points" - basically, numbers that stand for characters. (There is certainly more nuance here, but this is roughly the idea.) This system has become ubiquitous, though it's worth noting that non-English text often takes a few more bytes per character in UTF-8 than it would in a dedicated code page.
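You can see the variable-length behavior with the built-in TextEncoder (which, as we'll get to, always produces UTF-8):

```typescript
const encoder = new TextEncoder(); // always encodes to UTF-8

console.log(encoder.encode("e"));  // [0x65]                   - 1 byte
console.log(encoder.encode("é"));  // [0xC3, 0xA9]             - 2 bytes
console.log(encoder.encode("語")); // [0xE8, 0xAA, 0x9E]       - 3 bytes
console.log(encoder.encode("😀")); // [0xF0, 0x9F, 0x98, 0x80] - 4 bytes
```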

Prior to release 3.4.0, only international pages using UTF-8 were really supported well, with a bit of extra support for the most common code page. But many web pages - particularly those focused on serving a single country - still use the older code page approach, and that runs into a slightly tricky problem.

To scan web pages containing embedded Base64 images, the web page itself is decoded and re-encoded: the addon has to turn bytes into characters, check things, then turn the characters back into bytes. Decoding the bytes into characters is easy - just use TextDecoder. So you might be thinking that there would be a TextEncoder, too ... and you'd be right, but there's a catch. The built-in TextDecoder can decode basically anything, but TextEncoder can only turn characters back into bytes as UTF-8. So, out of the box, there is literally no way to re-encode text back into the code page it was served in. Ideally this would just exist in the main API.
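Here's the asymmetry in a nutshell - decoding from a legacy code page is one line, but TextEncoder has no knob at all for getting the matching bytes back:

```typescript
// Decoding from a legacy code page: easy.
const decoder = new TextDecoder("windows-1251");
console.log(decoder.decode(new Uint8Array([0xEF, 0xF0, 0xE8]))); // "при"

// Re-encoding: TextEncoder's constructor takes no encoding label,
// so there is no way to ask it for windows-1251 bytes.
const encoder = new TextEncoder();
console.log(encoder.encoding);      // "utf-8" - the only option
console.log(encoder.encode("при")); // [0xD0, 0xBF, 0xD1, 0x80, 0xD0, 0xB8] - UTF-8, not the bytes we started with
```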

But to handle this, I've now created a program that helps! Basically, it builds a map containing all this code page information so that the addon can use it in place of TextEncoder. Is it perfect or complete? No. But there's a good chance your home language is now supported on national websites - so I hope this release works well for you, and happy browsing!
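To give a flavor of the idea (this is a simplified sketch, not the actual addon code - the real program generates its tables ahead of time), a single-byte code page can be inverted by decoding each of the 256 possible bytes once and remembering which character came out:

```typescript
// Build a character -> byte map for a single-byte code page by
// decoding every possible byte and inverting the result.
function buildEncoderMap(label: string): Map<string, number> {
  const decoder = new TextDecoder(label);
  const map = new Map<string, number>();
  for (let byte = 0; byte < 256; byte++) {
    const char = decoder.decode(new Uint8Array([byte]));
    if (char !== "\uFFFD") { // skip bytes the code page leaves undefined
      map.set(char, byte);
    }
  }
  return map;
}

// Use the map in place of TextEncoder to get the original bytes back.
function encodeWithMap(text: string, map: Map<string, number>): Uint8Array {
  const out = new Uint8Array(text.length);
  let i = 0;
  for (const char of text) {
    const byte = map.get(char);
    if (byte === undefined) {
      throw new Error(`"${char}" has no byte in this code page`);
    }
    out[i++] = byte;
  }
  return out.subarray(0, i);
}

// Round-trip Cyrillic text through windows-1251.
const cp1251 = buildEncoderMap("windows-1251");
console.log(encodeWithMap("при", cp1251)); // Uint8Array [0xEF, 0xF0, 0xE8]
```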

Special thanks to Ayaya and Dragodraki for continuing to provide feedback on international language support!

