Subj : Need volunteers to test another patch
To   : Nicholas Boel
From : Michiel van der Vlist
Date : Sun Mar 03 2024 04:45 pm

Hello Nicholas,

On Sunday March 03 2024 08:46, you wrote to Vitaliy Aksyonov:

 NB> As for the pseudo-graphics wrapped to the next line, I have a
 NB> (probably dumb) question about this: If the pseudo graphics were
 NB> originally cp437 (single byte) and translated to utf-8, once they are
 NB> translated are they now multiple bytes per character?

I prefer dumb questions, they are easier to answer... ;-)

Yes, they are translated to multi-byte characters: two bytes for most of the accented and Cyrillic letters used in Fidonet, three for the box-drawing pseudo-graphics. Only the ASCII characters (0-127) are not translated and so remain one byte.
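
If you want to check the byte counts yourself, here is a small Python sketch. It relies only on the "cp437" and "utf-8" codecs that ship with Python; the helper name utf8_length and the three sample bytes are my own picks for illustration, not anything from the patch under test.

```python
# Sample CP437 bytes: 'A' (plain ASCII), e-acute, and a double-line box corner.
samples = [b"\x41", b"\x82", b"\xc9"]


def utf8_length(cp437_byte: bytes) -> int:
    """Decode one CP437 byte and return its size in bytes once encoded as UTF-8."""
    return len(cp437_byte.decode("cp437").encode("utf-8"))


for raw in samples:
    char = raw.decode("cp437")
    print(f"CP437 0x{raw[0]:02X} -> U+{ord(char):04X} -> {utf8_length(raw)} byte(s) in UTF-8")
```

The ASCII letter stays at one byte, the accented letter becomes two, and the box corner becomes three, which is why a line of pseudo-graphics can grow well past its original byte length.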

 NB> If "UTF-8 uses 1 to 4 bytes to encode a single character", I guess
 NB> what I'm wondering is if the character was 1 byte to begin with, why
 NB> wouldn't it stay 1 byte when translated to utf-8? Or is it because
 NB> those _specific_ characters when in utf-8 are already multiple bytes?

A non-ASCII character cannot be translated to one byte for the simple reason that the remaining 128 byte values with the highest bit set are not enough to encode ALL the characters in ALL the single-byte character sets. The whole idea of Unicode is to encode ALL the characters of ALL those character sets, CP437, CP850, CP866, CP1250, etc., into ONE encoding scheme. One byte is just not enough for all.

To put it simply: if you want to encode both CP437 and CP866, you could put a CP437 OR a CP866 character in the first byte, but you need at least one more bit of information to say which one it is: CP437 or CP866. That is not exactly how UTF-8 works, but it should give you an idea of why just one byte cannot be enough.
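
To make that concrete, here is a minimal Python sketch using the standard "cp437" and "cp866" codecs. The byte 0x80 is just an example I picked; the variable names are mine.

```python
# One and the same byte value, read through two different code pages.
raw = b"\x80"

as_cp437 = raw.decode("cp437")  # Latin capital C with cedilla
as_cp866 = raw.decode("cp866")  # Cyrillic capital A

for name, char in (("CP437", as_cp437), ("CP866", as_cp866)):
    utf8 = char.encode("utf-8")
    print(f"{name}: 0x80 -> U+{ord(char):04X} -> UTF-8 bytes {utf8.hex(' ')}")
```

The single byte 0x80 stands for two different characters depending on the code page, so Unicode has to give each one its own code point, and UTF-8 then needs two bytes for either of them.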


Cheers, Michiel

--- GoldED+/W32-MSVC 1.1.5-b20170303
 * Origin: Nieuw Schnøørd (2:280/5555)
