Talk:UTF-8

This is the talk page for discussing improvements to the UTF-8 article.
This is not a forum for general discussion of the article's subject.




Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment (https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table?), this has been done before, and it is *better* that way, so that everyone can clearly see the different parts of the code. Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices. -- Preceding unsigned comment added by 88.219.179.109 (talk • contribs) 02:26, 17 April 2020 (UTC)

Microsoft script dead link

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad

   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.

   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How can it be fixed? -- Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)

That text, and that link, appear to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)

Should "The Manifesto" be mentioned somewhere?

More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk • contribs) 21:52, 12 July 2024 (UTC)

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)

It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)

The number of 3 byte encodings is incorrect

This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)

You forgot to subtract 2048 surrogates in the D800–DFFF range. – MwGamera (talk) 08:58, 23 August 2024 (UTC)
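
For anyone wanting to check the arithmetic in this thread, here is a minimal Python sketch. The variable names are purely illustrative; the range boundaries are the ones quoted above, and the final line cross-checks the result against Python's built-in encoder rather than against the article.

```python
# Code points U+0800..U+FFFF are the ones whose UTF-8 form needs three bytes.
three_byte_range = 0xFFFF - 0x0800 + 1   # 0xF800

# Surrogates U+D800..U+DFFF lie inside that range but are not encodable.
surrogates = 0xDFFF - 0xD800 + 1

three_byte_codepoints = three_byte_range - surrogates
print(three_byte_range, surrogates, three_byte_codepoints)  # 63488 2048 61440

# Cross-check: count every non-surrogate code point that encodes to 3 bytes.
counted = sum(
    1
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF and len(chr(cp).encode("utf-8")) == 3
)
print(counted == three_byte_codepoints)
```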

Multi-point flags

I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)

Yes it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)

Better, more in-depth explanations of the flags can be found in the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think this material belongs in articles about specific character encodings like UTF-8 at all.

The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It's a more fundamental property of the text itself: any encoding that can represent a given string of characters yields the same characters again when its binary representation is decoded. Although very popular, UTF-8 is just one of numerous ways to encode text to binary and back.
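
To illustrate the point, here is a small Python sketch. It assumes the rainbow ("pride") flag mentioned earlier in this thread as the example string, written out as its four code points, and the three encodings were picked arbitrarily. The code point count is the same whichever encoding is used; only the byte counts differ.

```python
# Rainbow flag: WHITE FLAG, VARIATION SELECTOR-16, ZERO WIDTH JOINER, RAINBOW.
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# One visible flag, but several code points, before any encoding is involved.
code_point_count = len(flag)
print(code_point_count)

byte_lengths = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = flag.encode(encoding)
    # Every encoding round-trips to exactly the same code points.
    assert data.decode(encoding) == flag
    byte_lengths[encoding] = len(data)

print(byte_lengths)
```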

I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?

NOTE: The section seems to have been renamed (UTF-8#Encoding -> UTF-8#Description) in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then this edit (edit B) (diff after both edits) instead of either:

Adding a note about parts of it being written poorly.
Rewriting some of it. (the best and the most difficult option)
Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

Both of the edits removed a separate, and quite a well-written example (at least for my brain, these very examples made understanding UTF-8 require significantly less effort spent thinking). I don't think removing them was a good decision. Yes, you could explain basically anything without using examples, but in my experience an example is usually the easiest and fastest way for someone to understand almost any concept, especially when the examples were so visual and beautifully simple. I see it in the same category as a lecturer speaking with his hands and writing+drawing relevant things on a whiteboard versus having to hold the lecture by speaking over the phone.

The 1st, edit A

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

At first this read to me as if UTF-8 had been accidentally conflated with Unicode, so that the parts were mistakenly removed from the wrong article. Having thought about it more, I now think it is a severe disagreement over article design and presentation style.
(I still think edit notes asking for rewrites would have been the way to go instead of nuking the information, and that for some of the items, an article-like rewrite would be the wrong choice: Some data is way more enjoyable and simple to read visually from a table than it is to glean from written or spoken word and, as such, should be visualized in a table.)

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, that must definitely be included as they are the very core of the article:

The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a character encoding. This part was in my opinion also designed, formatted and written exemplarily well here. The colour palette could be adjusted accordingly if it's a problem for the colour-blind.
- Precedents/Examples in other articles about specific character encodings:
- Variable multi-byte (like UTF-8):
  - Shift_JIS#Shift_JIS_byte_map
  - GBK_(character_encoding)#Encoding
- single byte:
  - ASCII#Character_set (a strict subset of UTF-8)
  - Code_page_437#Character_set
  - ISO/IEC_8859-1#Code_page_layout
The first list (numbered 1..7) of UTF-8#Examples (old revision), which clearly demonstrates, by simple example, how UTF-8 works. (I agree it could be rewritten; the language used is quite verbose.)
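
For reference, the kind of step-by-step walk-through being discussed can also be sketched in a few lines of Python. This is only an illustration of the standard bit layout, not a reconstruction of the removed list; `utf8_encode` is a hypothetical helper name, and the euro sign is an arbitrary sample.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# U+20AC (euro sign) as a three-byte example.
print(utf8_encode(0x20AC).hex(" "))  # e2 82 ac
```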

The 2nd, edit B

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

This edit removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

The example removed in this edit was a clear and easy to understand way of explaining what an overlong encoding means.
I don't understand what the deleted text is supposed to have contradicted, unless it is something like the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8", which uses an overlong encoding for the null character. Overlong encodings aren't merely "unnecessary", they are *utterly forbidden*/invalid/illegal.
- Apart from the missing citation, which probably should have been RFC 3629 § 3, I don't understand what was wrong with the second paragraph. I also consider the information presented in it essential for the article. (A simple decoder implementation could easily just pass the overlong encodings through as if they were single-byte characters, or choose to simplify encoding by using a fixed length. The paragraph gave two good reasons why such encodings are illegal, and those reasons are now completely gone from the article.)
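
To make the concern concrete, here is a minimal Python sketch. `naive_decode_two_bytes` and `is_valid_utf8` are hypothetical helpers written for this comment, and the two-byte form of "/" is just a common textbook example. A lax decoder that only combines payload bits happily turns the overlong bytes into U+002F, whereas Python's built-in strict decoder refuses them.

```python
def naive_decode_two_bytes(b: bytes) -> int:
    """Hypothetical lax decoder: combines payload bits with no range check."""
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)

def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_slash = bytes([0xC0, 0xAF])   # overlong two-byte form of "/"

# The lax decoder yields U+002F even though no 0x2F byte is present.
print(hex(naive_decode_two_bytes(overlong_slash)))  # 0x2f
print(0x2F in overlong_slash)                       # False

# A conforming decoder must reject overlong forms outright.
print(is_valid_utf8(overlong_slash))                # False
print(is_valid_utf8(b"/"))                          # True
```

A check that scans the raw bytes for 0x2F would therefore miss the overlong form, while the lax decoder would still produce a slash from it.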

The 3rd, edit C

This is about font colouring on UTF-8#Encoding (old version); it reverts this edit by User:Nsmeds. The textual information stays the same between the two; the edit only removes the custom colours.

→Encoding: fix the colour blindness issue
— user:Thumperward, (edit C)

This attempts to fix a potential issue for the colour-blind, but I think it unfortunately ends up denying everyone, colour-blind or not, the help the colour was there to provide. The colouring was there to help the reader parse the information: you didn't even need to know that a hex digit covers 4 bits, or that 0x7 corresponds to 0b111. Whether you know that or not, when the relationships between the columns and data formats are clearly indicated by something like colouring, you don't have to count or deliberately deduce anything to determine what corresponds to what; you just instantly see it. This is obviously highly desirable in data visualization.
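
For readers unfamiliar with the correspondence the colouring made visible, here is a tiny Python sketch of it. The code point is an arbitrary sample and `hex_digit_bits` is a hypothetical helper; it only shows that each hex digit maps to one four-bit group.

```python
def hex_digit_bits(code_point: int) -> list[tuple[str, str]]:
    """Pair each hex digit of a code point with its four-bit pattern."""
    return [(d, format(int(d, 16), "04b")) for d in f"{code_point:04X}"]

for digit, bits in hex_digit_bits(0x20AC):   # sample value: euro sign
    print(digit, bits)

# The case mentioned above: 0x7 is the three set bits 0b111.
print(format(0x7, "03b"))
```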

The problems I take edit C to be attempting to fix are that, when you have colour-blindness, you may not be able to:

Differentiate the text from the background, which makes reading the text hard or impossible because the contrast against the background is insufficient. This is a serious issue that must at least have a workaround available.
Differentiate the colours used from each other. This is not a problem worth removing the colours over: it is a slight inconvenience, far less of one than removing ALL of the colour for everyone, colour-blind or not. The proper fix is to use a colour-blind-friendly palette.

I think we should consider these before removing:

Do the colours used here even have a contrast problem against the background, or only amongst themselves (where they are non-essential)? Maybe it's just that we should avoid the potential low-contrast combinations even for those with normal vision, such as:
- Overly bright colours, such as bright yellow (after switching to light background, I really struggle to read "bright yellow" there)
- Overly dark colours, such as deep blue (after switching to dark background, I struggle to read "deep blue" there)
- Colours whose brightness is close to that of either the light-mode or dark-mode background, or of their respective overlay backgrounds, like this one of <code>
How used to this problem, and how ready/proactive for situations like this, should we assume the colour-blind are? For example, when reading this, they may already have enabled:
- A colour-blindness correction mode (one that remaps the screen's colours so that colours which are clearly distinct to normal colour vision become as distinct as possible for the selected type of colour-blindness).
- A high-contrast mode (disables custom text and background colouring altogether and maps both to predetermined colours)
How easy should we assume that other workarounds, such as selecting text (which tends to bypass the issue), are for the reader? I do this all the time if the contrast is bad.

I think the least total effort catch-all long-term solution would be to provide a site-wide toggle on the side that disables all custom text and background colouring when you want (or adjusts it somehow, more complicated), then we could always have both.

The other solution to everything edit C attempted to fix (and the one applicable right here and now) would be to use a palette that is also readable for the colour-blind, such as the three palettes found at Color_blindness#Ordered_Information, which can be used to produce distinct colours that work regardless of the type of colour-blindness.

NOTE: They ALL work for ALL types of colour blindness, it's just a choice of which looks the nicest.
Do keep in mind however that all of the selected colours still need to be clearly distinct in brightness from both light and dark backgrounds, so maybe the colours from the very edges of these aren't usable, like how I attempted to demonstrate above with blue and yellow.
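
One way to put numbers on "clearly distinct in brightness" is a contrast-ratio calculation. The Python sketch below assumes the WCAG 2.x relative-luminance formula as the metric (my choice, nobody above proposed it), pure white and pure black as stand-ins for the light and dark backgrounds, and fully saturated yellow and blue as stand-ins for the two colours I struggled with.

```python
def relative_luminance(rgb):
    """Relative luminance of an sRGB colour (0-255 ints), WCAG 2.x formula."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(colour_a, colour_b):
    """Contrast ratio between two colours, from 1 (identical) to 21."""
    lum_a, lum_b = relative_luminance(colour_a), relative_luminance(colour_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)

WHITE, BLACK = (255, 255, 255), (0, 0, 0)   # assumed light/dark backgrounds
YELLOW, BLUE = (255, 255, 0), (0, 0, 255)   # assumed sample text colours

for name, colour in (("yellow", YELLOW), ("blue", BLUE)):
    print(name,
          "on white:", round(contrast_ratio(colour, WHITE), 2),
          "on black:", round(contrast_ratio(colour, BLACK), 2))
```

Under these assumptions yellow on white comes out close to 1:1 and blue on black well under 3:1, while the opposite pairings are comfortably high. That matches what I described above, and a candidate palette could be screened the same way against both backgrounds.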