Talk:UTF-8

From Wikipedia, the free encyclopedia

Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? this has been done before, and is *better* so that everyone can clearly see the different parts of the code. Relying on color alone is not good, because of color vision deficiencies and varying color rendition across devices. — Preceding unsigned comment added by 88.219.179.109 (talk • contribs) 02:26, 17 April 2020 (UTC)

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad
   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.
   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)

That text, and that link, appear to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)

Should "The Manifesto" be mentioned somewhere?

More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk • contribs) 21:52, 12 July 2024 (UTC)

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)
It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)

The number of 3 byte encodings is incorrect

This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)

You forgot to subtract the 2,048 surrogates in the D800–DFFF range, which leaves 63,488 − 2,048 = 61,440. – MwGamera (talk) 08:58, 23 August 2024 (UTC)
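
For anyone who wants to check the arithmetic in this thread, here is a minimal Python sketch. The constant and function names are made up for illustration; the cross-check relies on Python's strict `utf-8` codec refusing to encode surrogates.

```python
# Count the code points that UTF-8 encodes in exactly three bytes.
FIRST_THREE_BYTE = 0x0800   # first code point needing three bytes
LAST_THREE_BYTE = 0xFFFF    # last code point needing three bytes
SURROGATE_FIRST = 0xD800    # surrogates are not encodable in UTF-8
SURROGATE_LAST = 0xDFFF

raw_range = LAST_THREE_BYTE - FIRST_THREE_BYTE + 1
surrogates = SURROGATE_LAST - SURROGATE_FIRST + 1
three_byte_code_points = raw_range - surrogates


def encodes_to_three_bytes(cp):
    """Return True if Python's strict UTF-8 encoder emits three bytes."""
    try:
        return len(chr(cp).encode("utf-8")) == 3
    except UnicodeEncodeError:
        # Raised for surrogates, which strict UTF-8 does not allow.
        return False


# Independent cross-check over the whole Unicode code space.
counted = sum(encodes_to_three_bytes(cp) for cp in range(0x110000))

print(raw_range, surrogates, three_byte_code_points, counted)
```

The first figure is the 63,488 from the original objection; subtracting the surrogate block gives the 61,440 stated in the article.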

Multi-point flags

I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)

Yes, it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)
Better, more in-depth explanations of the flags can be found in the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think this belongs in articles about specific character encodings like UTF-8 at all.
The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It is a more fundamental property of the text itself: any encoding that can represent a given string of characters yields the same characters when decoded back from the binary representation. Although very popular, UTF-8 is just one of numerous ways to encode text to binary and back.
I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)
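
To illustrate the point about encoding independence, here is a small Python sketch. The rainbow flag ZWJ sequence (U+1F3F3 U+FE0F U+200D U+1F308) is my own choice of example, not necessarily the flag used in the article, and the variable names are arbitrary.

```python
# One flag on screen, four code points in the text, whatever the encoding.
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"  # white flag, VS16, ZWJ, rainbow

code_points = [f"U+{ord(ch):04X}" for ch in flag]

sizes = {}
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = flag.encode(name)
    # Every encoding round-trips to the same sequence of code points.
    assert encoded.decode(name) == flag
    sizes[name] = len(encoded)

print(code_points)  # four code points
print(sizes)        # byte counts differ, code point count does not
```

The byte counts differ between encodings (14, 12 and 16 bytes here), but the number of code points making up the single visible flag is four in all of them, which is why the multi-code-point issue is a Unicode matter rather than a UTF-8 one.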

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?

NOTE: The section seems to have been renamed (UTF-8#Encoding -> UTF-8#Description) in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then this edit (edit B) (diff after both edits) instead of either:

  • Adding a note about parts of it being written poorly.
  • Rewriting some of it. (the best and the most difficult option)
  • Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

Both of the edits removed a separate, and quite a well-written example (at least for my brain, these very examples made understanding UTF-8 require significantly less effort spent thinking). I don't think removing them was a good decision. Yes, you could explain basically anything without using examples, but in my experience an example is usually the easiest and fastest way for someone to understand almost any concept, especially when the examples were so visual and beautifully simple. I see it in the same category as a lecturer speaking with his hands and writing+drawing relevant things on a whiteboard versus having to hold the lecture by speaking over the phone.

The 1st, edit A

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

To me, this initially read as if UTF-8 had been accidentally conflated with Unicode, so that the parts were mistakenly removed from the wrong article. Having thought about it more, I now think it is a severe disagreement over article design and presentation style.
(I still think edit notes asking for rewrites would have been the way to go instead of nuking the information, and that for some of the items, an article-like rewrite would be the wrong choice: some data is far more enjoyable and simple to read visually from a table than to glean from written or spoken word and, as such, should be visualized in a table.)

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, that must definitely be included as they are the very core of the article:

  1. The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a character encoding. This part was in my opinion also designed, formatted and written exemplarily well here. The colour palette could be adjusted accordingly if it's a problem for the colour-blind.
    - Precedents/Examples in other articles about specific character encodings:
  2. The first list (numbered 1..7) of UTF-8#Examples (old revision) that clearly, by simple example demonstrates how UTF-8 works. (I agree it could be rewritten, the language used is quite verbose)

The 2nd, edit B

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

This edit removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

  1. The example removed in this edit was a clear and easy to understand way of explaining what an overlong encoding means.
  2. I don't understand what the deleted text is supposed to have contradicted, unless it is something like the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8", which uses an overlong encoding for the null character. Overlong encodings aren't merely "unnecessary", they are *utterly forbidden*/invalid/illegal.
    • Apart from the lacking citation, which probably should have been rfc3629 § 3, I don't understand what was wrong with the second paragraph. I also consider the information presented in it essential for the article. (A simple decoder implementation could easily just pass the overlong encodings as if they were single-byte characters, or choose to simplify encoding by using a fixed length. The paragraph gives two good reasons why such encodings are illegal, that are now completely gone from the article.)
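
To make the overlong case concrete, here is a minimal Python sketch. `try_decode` and `naive_decode_two_byte` are hypothetical helpers written for this comment; the second one deliberately mimics a careless decoder and is not how any real library behaves. The byte sequences are overlong forms of "/" (U+002F).

```python
def try_decode(raw):
    """Decode with Python's strict UTF-8 codec; return None if invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def naive_decode_two_byte(raw):
    """Hypothetical careless decoder: merges the payload bits of a
    two-byte sequence without checking for the shortest form."""
    lead, cont = raw
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


shortest_slash = b"\x2f"             # the only valid encoding of "/"
overlong_slash_2 = b"\xc0\xaf"       # 110_00000 10_101111
overlong_slash_3 = b"\xe0\x80\xaf"   # 1110_0000 10_000000 10_101111

print(try_decode(shortest_slash))               # valid
print(try_decode(overlong_slash_2))             # rejected
print(try_decode(overlong_slash_3))             # rejected
print(naive_decode_two_byte(overlong_slash_2))  # careless decoder yields "/"
```

A conforming decoder rejects both overlong forms, while the careless one silently produces "/" from bytes that a filter scanning for 0x2F would never have flagged. That is one commonly cited reason for forbidding such sequences; I am not claiming it is the exact wording of the removed paragraph.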

The 3rd, edit C

This is about font colouring in UTF-8#Encoding (old version); it reverts this edit by User:Nsmeds. The textual information is the same in both revisions; the edit only removes the custom colours.

→Encoding: fix the colour blindness issue
— user:Thumperward, (edit C)

This attempts to fix a potential issue for the colour-blind, but I think it unfortunately only ends up denying everyone, colour-blind or not, the help the colour was there to provide. The colouring was there to help the reader parse the information: you didn't even need to know that a hex digit covers 4 bits, or that 0x7 corresponds to 0b111. Whether you know that or not, when the relationships between the columns and data formats are clearly indicated by something like colouring, you don't have to count or deliberately deduce anything to determine what corresponds to what; you just instantly see it. This is obviously highly desirable in data visualization.
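
For readers without the old table in front of them, the correspondence the colours highlighted can be sketched in Python. The helper names are hypothetical and the euro sign (U+20AC) is just a convenient three-byte example of my own choosing.

```python
def utf8_bit_layout(ch):
    """Show a character's code point and its UTF-8 bytes in binary."""
    cp = ord(ch)
    return {
        "hex": f"U+{cp:04X}",
        "code_point_bits": f"{cp:b}",
        "byte_bits": [f"{b:08b}" for b in ch.encode("utf-8")],
    }


def payload_bits(ch):
    """Strip the length marker and continuation markers from each byte
    and concatenate what is left: the code point's own bits."""
    encoded = ch.encode("utf-8")
    if len(encoded) == 1:
        return f"{encoded[0]:07b}"           # 0xxxxxxx
    # An n-byte lead byte starts with n ones and a zero.
    lead = f"{encoded[0]:08b}"[len(encoded) + 1:]
    # Each continuation byte starts with "10".
    return lead + "".join(f"{b:08b}"[2:] for b in encoded[1:])


layout = utf8_bit_layout("\u20ac")
print(layout["hex"], layout["byte_bits"])
print(payload_bits("\u20ac"))
print(f"{0x7:03b}")  # one hex digit's low three bits
```

The payload bits, read left to right across the three bytes, spell out 0x20AC again. The colouring in the old table let you see that mapping at a glance instead of working it out like this.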

The problems I take edit C to be addressing are that, if you are colour-blind, you may not be able to:

  1. Differentiate the text from the background, making the text hard or impossible to read because the contrast is insufficient. This is a serious issue that must at least have a workaround available.
  2. Differentiate the colours used from each other. This is not a problem worth removing the colours for; it is a slight inconvenience, far smaller than the inconvenience of removing ALL of the colour for everyone, colour-blind or not. The final fix is to use a colour-blind-friendly palette.

I think we should consider these before removing:

  • Do the colours used here even have a problem with contrast with the background, (or only amongst themselves and they are non-essential)? Maybe it's just that we should avoid the potential low-contrast combinations even for those with normal vision, such as:
    • Overly bright colours, such as bright yellow (after switching to light background, I really struggle to read "bright yellow" there)
    • Overly dark colours, such as deep blue (after switching to dark background, I struggle to read "deep blue" there)
    • Colours whose brightness is close to that of the light-mode or dark-mode background, or of their respective overlay backgrounds, like this one of <code>
  • How used to this problem and ready/proactive for situations like this should we assume the colour-blind are? For example when reading this, having already enabled:
    • A colour-blindness correction mode (which remaps the screen's colours so that colours that look clearly distinct to normal colour vision stay as distinct as possible for the selected type of colour blindness).
    • A high-contrast mode (disables custom text and background colouring altogether and maps both to predetermined colours)
  • How easy should we assume that other workarounds, such as selecting text (which tends to bypass the issue), are for the reader? I do this all the time if the contrast is bad.

I think the least total effort catch-all long-term solution would be to provide a site-wide toggle on the side that disables all custom text and background colouring when you want (or adjusts it somehow, more complicated), then we could always have both.

Three sequential colormaps that have been designed to be accessible to the color blind

The other solution to everything edit C attempted to fix (and the one applicable right here and now) would be to use a palette that is also readable for the colour-blind, such as the three palettes found at Color_blindness#Ordered_Information, which can be used to produce distinct colours that work regardless of the type of colour blindness.

NOTE: They ALL work for ALL types of colour blindness, it's just a choice of which looks the nicest.
Do keep in mind however that all of the selected colours still need to be clearly distinct in brightness from both light and dark backgrounds, so maybe the colours from the very edges of these aren't usable, like how I attempted to demonstrate above with blue and yellow.

Other issues in the article

The UTF-8 article talks about generic aspects of Unicode quite a bit more than I think it should, possibly adding to the likelihood of misunderstandings like the one I think ultimately led to edits A and B. (Now I think it was a disagreement of style, not a misunderstanding.)
An example of this:

  • Explaining how some "graphical characters can be more than 4 bytes in UTF-8". This is because Unicode (and by extension UTF-8) does not deal in graphemes in the first place, but in code points (essentially just numbers that index into Unicode), which can correspond to valid Unicode characters, which in turn can directly correspond to a grapheme. Some characters don't correspond to a grapheme at all (control characters), and some combine or join with other characters to produce a combination grapheme (combining/joining characters), or do something akin to the latter two, such as the formatting tag characters used in the flag example, which didn't seem to produce graphemes on their own or if used incorrectly.
    The possibility of needing to use multiple code points for one grapheme like that is a direct consequence of these types of characters in general and isn't caused by UTF-8 or any other encoding, and can happen through ANY and all encodings capable of encoding such code points, not just UTF-8.
    In short: The issue has nothing to do with UTF-8.


Maybe some of those explanations should be moved to the Unicode article or other appropriate articles instead and/or drastically shortened and replaced with links to the longer explanations?

Mossymountain (talk) 05:09, 20 September 2024 (UTC) Mossymountain (talk) 17:10, 20 September 2024 (UTC)

Because the editor was offended that that section used color. Akeosnhaoe (talk) 08:56, 20 September 2024 (UTC)
It's pretty important that we not communicate information solely through color, but I wonder how we could better do something like that. Remsense ‥  09:02, 20 September 2024 (UTC)
Most of the information wasn't in the color; it was in the text, which is readable without formatting in monochrome. The color was there just to make it easier to quickly identify which is which.
If what Akeosnhaoe said is the case (which I don't think it is; I think this was an honest misunderstanding with good intentions), obviously the colors should be changed to meet the intended visibility standard, not the information removed. Mossymountain (talk) 10:17, 20 September 2024 (UTC)

IMHO the edits made by user:thumperward were a good and powerful attempt to remove the obscene bloat of this article. The enormous, complex "examples" with color did not provide any information, and it is quite impossible to figure out what the colors mean without already knowing how UTF-8 works. Elimination of the "code page" is IMHO a good and daring decision, one I may not have made, and I'm glad he tried it. I'd like to continue, pretty much removing the bloated mess of "comparisons" that are either obvious or that nobody cares about; the few useful bits of info there can be merged into the description. Spitzak (talk) 18:13, 20 September 2024 (UTC)