Discussion:
Test Latin-1 in Google Groups.
Ruud Harmsen
2018-11-02 17:05:49 UTC
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252

Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç

Uppercase:
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç



àèìòùáéíóúýâêôîûäëïöüÿãõç
--
Ruud Harmsen, http://rudhar.com
Ruud Harmsen
2018-11-02 17:30:45 UTC
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Same problem here. The message arrives in my Usenet program intact, it
is in GG intact, as a raw message. But the normal GG display changes
it to Cyrillic.

Compare:
https://groups.google.com/d/msg/sci.lang/xx6jGN64dB0/yiYGYa2UAwAJ
(Cyrillic) with:
https://groups.google.com/forum/#!original/sci.lang/xx6jGN64dB0/yiYGYa2UAwAJ
--
Ruud Harmsen, http://rudhar.com
Christian Weisgerber
2018-11-02 17:40:44 UTC
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
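
The point about byte 0x9F can be checked with Python's built-in codecs. This is only a minimal sketch; the variable names are illustrative and not taken from any posting in this thread.

```python
# Byte 0x9F under the two charsets discussed above.
raw = b"\x9f"

as_latin1 = raw.decode("latin-1")   # C1 control character U+009F, not printable
as_cp1252 = raw.decode("cp1252")    # capital Y with diaeresis, U+0178

print(repr(as_latin1), as_latin1.isprintable())
print(repr(as_cp1252), as_cp1252.isprintable())

# Lowercase y with diaeresis fits in ISO 8859-1 (byte 0xFF); the capital does not.
print("ÿ".encode("latin-1"))
try:
    "Ÿ".encode("latin-1")
except UnicodeEncodeError as exc:
    print("not in ISO 8859-1:", exc.reason)
```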
--
Christian "naddy" Weisgerber ***@mips.inka.de
Ruud Harmsen
2018-11-02 18:53:49 UTC
Fri, 2 Nov 2018 17:40:44 -0000 (UTC): Christian Weisgerber
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
Bingo, that must be it! Thanks.

So I am lying (and Agent 1.93 lets me!), I claim to post in 8859-1 but
I don't, because I also post outside that range. That confuses Google:
it doesn't know how to interpret that character. (So it switches to
Cyrillic, which is a weird thing to do, in my opinion; even if I
myself caused it.)

OK, so now I'll repeat the test, but with those funny dotted y's
(which nobody ever needs anyway) removed.
--
Ruud Harmsen, http://rudhar.com
Peter T. Daniels
2018-11-02 19:15:30 UTC
Post by Ruud Harmsen
Fri, 2 Nov 2018 17:40:44 -0000 (UTC): Christian Weisgerber
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: длпцья
Diaeresis: ДЛПЦЬ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
But here, after it passed through Ruud, I see cyrillic again.
Post by Ruud Harmsen
Bingo, that must be it! Thanks.
So I am lying (and Agent 1.93 lets me!), I claim to post in 8859-1 but
I don't, because I also post outside that range. That confuses Google:
it doesn't know how to interpret that character. (So it switches to
Cyrillic, which is a weird thing to do, in my opinion; even if I
myself caused it.)
OK, so now I'll repeat the test, but with those funny dotted y's
(which nobody ever needs anyway) removed.
Nobody? You can't discuss French decadence without it! or Belgian violinists!
António Marques
2018-11-02 19:20:25 UTC
Post by Ruud Harmsen
Fri, 2 Nov 2018 17:40:44 -0000 (UTC): Christian Weisgerber
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
Bingo, that must be it! Thanks.
So I am lying (and Agent 1.93 lets me!), I claim to post in 8859-1 but
I don't, because I also post outside that range. That confuses Google:
it doesn't know how to interpret that character. (So it switches to
Cyrillic, which is a weird thing to do, in my opinion; even if I
myself caused it.)
OK, so now I'll repeat the test, but with those funny dotted y's
(which nobody ever needs anyway) removed.
What your program is using is probably Windows-1252 or something like that
- it has the same code points as Latin-1 but has additional characters
where Latin-1 has a reserved area for control codes.
Do you have the option to send Windows-1252? Maybe then GG will accept it
for what it is.
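
To see exactly where the two charsets part ways, one can compare them byte by byte. The sketch below relies on Python's own codec tables (an assumption about the tooling, not something used in this thread); Python leaves five bytes of cp1252 unassigned, which is why they are collected separately.

```python
# Compare ISO 8859-1 and Windows-1252 byte by byte.
differences = {}
undefined_in_cp1252 = []
for value in range(256):
    byte = bytes([value])
    latin1_char = byte.decode("latin-1")      # always succeeds
    try:
        cp1252_char = byte.decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)     # unassigned in Python's cp1252 table
        continue
    if cp1252_char != latin1_char:
        differences[value] = cp1252_char

print(f"first differing byte: 0x{min(differences):02X}")
print(f"last differing byte:  0x{max(differences):02X}")
print("".join(differences.values()))
```

Every difference falls in the range 0x80..0x9F, the area Latin-1 reserves for control codes.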
Ruud Harmsen
2018-11-02 20:23:21 UTC
Fri, 2 Nov 2018 19:20:25 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
Fri, 2 Nov 2018 17:40:44 -0000 (UTC): Christian Weisgerber
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
Bingo, that must be it! Thanks.
So I am lying (and Agent 1.93 lets me!), I claim to post in 8859-1 but
I don't, because I also post outside that range. That confuses Google:
it doesn't know how to interpret that character. (So it switches to
Cyrillic, which is a weird thing to do, in my opinion; even if I
myself caused it.)
OK, so now I'll repeat the test, but with those funny dotted y's
(which nobody ever needs anyway) removed.
What your program is using is probably Windows-1252 or something like that
- it has the same code points as Latin-1 but has additional characters
where Latin-1 has a reserved area for control codes.
Do you have the option to send Windows-1252? Maybe then GG will accept it
for what it is.
Yes, the Code Page is Windows-1252, but it says "Post Usenet as"
ISO-8859-1. Can't change that. Agent 1.93.

Anyway, if as an interface designer you are confronted with a message
claiming to be in ISO-8859-1, but actually it contains CP1252 (code
point 9F is an umlauted Y, in CP1252 but not 8859-1), what do you do?

1) Assume Windows Cp1252? (Makes sense, right?)

2) Assume Cyrillic, https://en.wikipedia.org/wiki/ISO/IEC_8859-5,
which ALSO has no defined character for 9F?
(Wouldn't that be, to put it mildly, SLIGHTLY silly?)

Well, GG does 2). What's more, it also does 2) when the posted message
IS fully in ISO-8859-1, as later tests reveal.
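
Option 1) could be sketched roughly as follows. The function name is hypothetical and this is not a description of what any real newsreader or Google Groups does; it only illustrates the fallback being argued for. For comparison, web browsers that follow the WHATWG Encoding Standard treat the label iso-8859-1 as windows-1252 outright.

```python
def decode_declared_latin1(body: bytes) -> str:
    """Decode a body declared as ISO-8859-1, tolerating CP1252 bytes."""
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in body)
    if not has_c1_bytes:
        # Nothing in the disputed range: the header can be taken at face value.
        return body.decode("latin-1")
    try:
        # Bytes in 0x80..0x9F: assume the sender really meant Windows-1252.
        return body.decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are unassigned in Python's cp1252 table.
        return body.decode("latin-1")


print(decode_declared_latin1(b"\xc4\xcb\xcf\xd6\xdc\x9f"))
```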

GG is broken, and never was whole.
--
Ruud Harmsen, http://rudhar.com
António Marques
2018-11-02 22:29:02 UTC
Post by Ruud Harmsen
Fri, 2 Nov 2018 19:20:25 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
Fri, 2 Nov 2018 17:40:44 -0000 (UTC): Christian Weisgerber
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
Bingo, that must be it! Thanks.
So I am lying (and Agent 1.93 lets me!), I claim to post in 8859-1 but
I don't, because I also post outside that range. That confuses Google:
it doesn't know how to interpret that character. (So it switches to
Cyrillic, which is a weird thing to do, in my opinion; even if I
myself caused it.)
OK, so now I'll repeat the test, but with those funny dotted y's
(which nobody ever needs anyway) removed.
What your program is using is probably Windows-1252 or something like that
- it has the same code points as Latin-1 but has additional characters
where Latin-1 has a reserved area for control codes.
Do you have the option to send Windows-1252? Maybe then GG will accept it
for what it is.
Yes, the Code Page is Windows-1252, but it says "Post Usenet as"
ISO-8859-1. Can't change that. Agent 1.93.
Anyway, if as an interface designer you are confronted with a message
claiming to be in ISO-8859-1, but actually it contains CP1252 (code
point 9F is an umlauted Y, in CP1252 but not 8859-1), what do you do?
1) Assume Windows Cp1252? (Makes sense, right?)
No. How does it know it’s 1252? There’s nothing special about 9F. A lot of
encodings use it for characters. Either it trusts the charset header, or
anything goes.
They’re only bytes.
Post by Ruud Harmsen
2) Assume Cyrillic, https://en.wikipedia.org/wiki/ISO/IEC_8859-5,
which ALSO has no defined character for 9F?
(Wouldn't that be, to put it mildly, SLIGHTLY silly?)
Well, GG does 2). What's more, it also does 2) when the posted message
IS fully in ISO-8859-1, as later tests reveal.
But do they? My single a-grave went through ok.

You don’t know that it’s assuming a specific Cyrillic encoding. If it’s not
able to trust the header, it’s reasonable to try to be useful to the
largest amount of people. Who tells you there was no Cyrillic encoding in
widespread use whose agents often misidentified as Latin-1?
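
Which Cyrillic encoding is in play can at least be narrowed down from the outside. The sketch below decodes the Latin-1 bytes of the diaeresis line under a few Cyrillic charsets, using Python's codec names. For these particular characters, the Cyrillic quoted earlier in this thread matches Windows-1251 and not ISO 8859-5. That is only consistent with a Windows-1251 interpretation; it does not prove what GG does internally.

```python
sample = "äëïöüÿ ÄËÏÖÜ"
raw = sample.encode("latin-1")   # the bytes a Latin-1 posting would contain

# The same bytes, read under different 8-bit charsets.
for charset in ("latin-1", "cp1251", "iso8859_5", "koi8_r"):
    print(f"{charset:10} {raw.decode(charset)}")
```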
Ruud Harmsen
2018-11-03 06:01:29 UTC
Fri, 2 Nov 2018 22:29:02 -0000 (UTC): António Marques
Post by Ruud Harmsen
Anyway, if as an interface designer you are confronted with a message
claiming to be in ISO-8859-1, but actually it contains CP1252 (code
point 9F is an umlauted Y, in CP1252 but not 8859-1), what do you do?
1) Assume Windows Cp1252? (Makes sense, right?)
No. How does it know it’s 1252?
All characters within 8859-1, only some without, but those are within
1252. So guess what?
There’s nothing special about 9F.
Yes there is, it's in that (in)famous range that is defined for 1252
but not for 8859-1.
A lot of
encodings use it for characters. Either it trusts the charset header, or
anything goes.
They’re only bytes.
And people having sent them.
Post by Ruud Harmsen
2) Assume Cyrillic, https://en.wikipedia.org/wiki/ISO/IEC_8859-5,
which ALSO has no defined character for 9F?
(Wouldn't that be, to put it mildly, SLIGHTLY silly?)
Well, GG does 2). What's more, it also does 2) when the posted message
IS fully in ISO-8859-1, as later tests reveal.
But do they? My single a-grave went through ok.
Yes, that remains the riddle. My later conforming messages on the other
hand were also mangled.
You don’t know that it’s assuming a specific Cyrillic encoding. If it’s not
able to trust the header, it’s reasonable to try to be useful to the
largest amount of people. Who tells you there was no Cyrillic encoding in
widespread use whose agents often misidentified as Latin-1?
http://czyborra.com/charsets/cyrillic.html
"Windows-1251" which is not to be mistaken as a 13th century precursor
of today's Windows95®"

Nice.
António Marques
2018-11-03 13:08:44 UTC
Post by Ruud Harmsen
Fri, 2 Nov 2018 22:29:02 -0000 (UTC): António Marques
Post by Ruud Harmsen
Anyway, if as an interface designer you are confronted with a message
claiming to be in ISO-8859-1, but actually it contains CP1252 (code
point 9F is an umlauted Y, in CP1252 but not 8859-1), what do you do?
1) Assume Windows Cp1252? (Makes sense, right?)
No. How does it know it’s 1252?
All characters within 8859-1, only some without, but those are within
1252. So guess what?
They are as well within a number of other charsets.

What you wanted was ‘treat Latin-1 as if it were w1252’. While that would
solve this case, it would wreak havoc with a lot of others.
Post by Ruud Harmsen
There’s nothing special about 9F.
Yes there is, it's in that (in)famous range that is defined for 1252
but not for 8859-1.
It’s also defined for most other charsets. The various ISO-Latin are pretty
much alone in leaving those precious 32 bytes to control codes nobody uses.
Post by Ruud Harmsen
A lot of
encodings use it for characters. Either it trusts the charset header, or
anything goes.
They’re only bytes.
And people having sent them.
Post by Ruud Harmsen
2) Assume Cyrillic, https://en.wikipedia.org/wiki/ISO/IEC_8859-5,
which ALSO has no defined character for 9F?
(Wouldn't that be, to put it mildly, SLIGHTLY silly?)
Well, GG does 2). What's more, it also does 2) when the posted message
IS fully in ISO-8859-1, as later tests reveal.
But do they? My single a-grave went through ok.
Yes, that remains the riddle. My later conforming messages on the other
hand were also mangled.
You don’t know that it’s assuming a specific Cyrillic encoding. If it’s not
able to trust the header, it’s reasonable to try to be useful to the
largest amount of people. Who tells you there was no Cyrillic encoding in
widespread use whose agents often misidentified as Latin-1?
http://czyborra.com/charsets/cyrillic.html
"Windows-1251" which is not to be mistaken as a 13th century precursor
of today's Windows95®"
‘Today’s’? Are you telling us Windows 95’s development and architecture
don’t date from the 1200s? (The century marked by intelligent decisions,
such as having a ‘Christian’ army attack the centre of Eastern
Christianity.)

It’s probably doing heuristics. It’s maybe the only way to display most of
the old content correctly, short of having an option to choose display
encoding for each message. That it breaks with compliant messages - if it
really does - is unfortunate, but then nobody is supposed to be using 8-bit
single byte encodings these days. That’s almost dirty.
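
To make the idea of heuristics concrete, here is a purely hypothetical sketch. Nothing in this thread shows how Google Groups actually decides; the function names, the run-length rule, and the threshold are all invented for illustration. The assumption is that Cyrillic text in an 8-bit charset consists almost entirely of high bytes, whereas Western European text has only isolated accented letters.

```python
def longest_high_byte_run(data: bytes) -> int:
    """Length of the longest run of consecutive bytes >= 0x80."""
    longest = current = 0
    for b in data:
        if b >= 0x80:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def guess_charset(data: bytes, declared: str = "iso-8859-1",
                  run_threshold: int = 4) -> str:
    """Hypothetical rule: long runs of high bytes 'look Cyrillic'."""
    if longest_high_byte_run(data) >= run_threshold:
        return "windows-1251"
    return declared


print(guess_charset("à".encode("latin-1")))        # a single accented letter
print(guess_charset("äëïöüÿ".encode("latin-1")))   # a run of accented letters
```

A rule of this kind would let a single a-grave through while mangling a test string made only of accented letters, which is the pattern reported in this thread. Whether GG works this way is unknown.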
Christian Weisgerber
2018-11-03 16:30:11 UTC
Post by António Marques
Post by Ruud Harmsen
Yes there is, it's in that (in)famous range that is defined for 1252
but not for 8859-1.
It’s also defined for most other charsets. The various ISO-Latin are pretty
much alone in leaving those precious 32 bytes to control codes nobody uses.
You have to consider the time and context when these standards were
created. ISO 8859-1, from the mid-1980s, is a slightly modified
version of the DEC Multinational Character Set that had been
introduced in 1983 with the highly influential DEC VT220 terminal.
The VT220 and its successors could be configured to use 8-bit control
characters (0x80..0x9F). For instance, 0x9B could be used instead
of the sequence 0x1B 0x5B. This optimized the transmission from
host to terminal over the slow EIA-232 serial connection (typically
9600 bit/s). Clearly, 8-bit control codes were the future.
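
The saving is one byte per control sequence. As a small illustration (the helper name is made up here; "CSI 2 J" is the standard erase-display sequence), the two forms look like this:

```python
ESC = b"\x1b"
CSI_7BIT = ESC + b"["      # 0x1B 0x5B, the two-byte introducer
CSI_8BIT = b"\x9b"         # the same control as a single C1 byte


def clear_screen(eight_bit: bool) -> bytes:
    """Build the 'erase display' sequence (CSI 2 J) in either form."""
    csi = CSI_8BIT if eight_bit else CSI_7BIT
    return csi + b"2J"


print(clear_screen(False).hex(" "))
print(clear_screen(True).hex(" "))
```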

The last time I ran into 8-bit control codes was on the Remote
Management Console of an AlphaServer 800 (introduced in 1997), which
was hard-coded to produce such terminal control output. It was at
that point that I noticed that 8-bit control codes are fundamentally
incompatible with UTF-8.
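
The clash can be shown in a few lines. In UTF-8, bytes in the range 0x80..0xBF are continuation bytes, so a lone 0x9B is not valid UTF-8, and the same byte value also turns up inside ordinary multi-byte characters. The example character below is chosen only because its encoding happens to contain 0x9B.

```python
csi = "\x9b"   # U+009B, the 8-bit CSI control

latin1_bytes = csi.encode("latin-1")   # a single byte 0x9B
utf8_bytes = csi.encode("utf-8")       # two bytes, 0xC2 0x9B

try:
    b"\x9b2J".decode("utf-8")
    raw_csi_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_csi_is_valid_utf8 = False

# An ordinary letter whose UTF-8 form ends in the byte 0x9B.
t_comma = "ț".encode("utf-8")

print(latin1_bytes, utf8_bytes, raw_csi_is_valid_utf8, t_comma)
```

A terminal that acts on every raw 0x9B byte would therefore misread such letters as the start of a control sequence.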
--
Christian "naddy" Weisgerber ***@mips.inka.de
António Marques
2018-11-12 15:36:59 UTC
Post by Christian Weisgerber
Post by António Marques
Post by Ruud Harmsen
Yes there is, it's in that (in)famous range that is defined for 1252
but not for 8859-1.
It’s also defined for most other charsets. The various ISO-Latin are pretty
much alone in leaving those precious 32 bytes to control codes nobody uses.
You have to consider the time and context when these standards were
created. ISO 8859-1, from the mid-1980s, is a slightly modified
version of the DEC Multinational Character Set that had been
introduced in 1983 with the highly influential DEC VT220 terminal.
The VT220 and its successors could be configured to use 8-bit control
characters (0x80..0x9F). For instance, 0x9B could be used instead
of the sequence 0x1B 0x5B. This optimized the transmission from
host to terminal over the slow EIA-232 serial connection (typically
9600 bit/s). Clearly, 8-bit control codes were the future.
The last time I ran into 8-bit control codes was on the Remote
Management Console of an AlphaServer 800 (introduced in 1997), which
was hard-coded to produce such terminal control output. It was at
that point that I noticed that 8-bit control codes are fundamentally
incompatible with UTF-8.
I had no idea the ISO charsets were so old. I thought they dated from the
late 90s.
Christian Weisgerber
2018-11-12 17:48:26 UTC
Post by António Marques
I had no idea the ISO charsets were so old. I thought they dated from the
late 90s.
ISO 8859-1 was already the default on the Commodore Amiga.
--
Christian "naddy" Weisgerber ***@mips.inka.de
Ruud Harmsen
2018-11-03 23:09:01 UTC
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
It’s probably doing heuristics. It’s maybe the only way to display most of
the old content correctly, short of having an option to choose display
encoding for each message. That it breaks with compliant messages - if it
really does - is unfortunate, but then nobody is supposed to be using 8-bit
single byte encodings these days. That’s almost dirty.
Nonsense. GG is buggy and does stupid things, period. Admit it.
Ruud Harmsen
2018-11-03 23:06:47 UTC
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
http://czyborra.com/charsets/cyrillic.html
"Windows-1251" which is not to be mistaken as a 13th century precursor
of today's Windows95®"
‘Today’s’? Are you telling us Windows 95’s development and architecture
don’t date from the 1200s?
You missed the quotation marks.
_I_ am not telling you that, Roman Czyborra did, some 20 years ago.
António Marques
2018-11-03 23:43:48 UTC
Post by Ruud Harmsen
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
http://czyborra.com/charsets/cyrillic.html
"Windows-1251" which is not to be mistaken as a 13th century precursor
of today's Windows95®"
‘Today’s’? Are you telling us Windows 95’s development and architecture
don’t date from the 1200s?
You missed the quotation marks.
_I_ am not telling you that, Roman Czyborra did, some 20 years ago.
So what?
Ruud Harmsen
2018-11-04 06:49:13 UTC
Sat, 3 Nov 2018 23:43:48 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
http://czyborra.com/charsets/cyrillic.html
"Windows-1251" which is not to be mistaken as a 13th century precursor
of today's Windows95®"
‘Today’s’? Are you telling us Windows 95’s development and architecture
don’t date from the 1200s?
You missed the quotation marks.
_I_ am not telling you that, Roman Czyborra did, some 20 years ago.
So what?
So that. It's funny. Or I find it funny.
--
Ruud Harmsen, http://rudhar.com
Ruud Harmsen
2018-11-03 23:08:10 UTC
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
It’s probably doing heuristics. It’s maybe the only way to display most of
the old content correctly, short of having an option to choose display
encoding for each message.
It's doing heuristics also for correctly marked messages that are
completely in the encoding in the header. So yes, GG is non-compliant
and buggy.
António Marques
2018-11-03 23:20:48 UTC
Post by Ruud Harmsen
Sat, 3 Nov 2018 13:08:44 -0000 (UTC): António Marques
Post by António Marques
It’s probably doing heuristics. It’s maybe the only way to display most of
the old content correctly, short of having an option to choose display
encoding for each message.
It's doing heuristics also for correctly marked messages that are
completely in the encoding in the header. So yes, GG is non-compliant
and buggy.
How can you _know_ the message is correctly marked? How can you know the
message is ‘in’ the encoding of the header? The encoding is what declares
the meaning of the bytes. If you don’t trust the header, there’s nothing
else you can resort to other than heuristics.
And old clients were known for saying one thing and doing another. Yours,
apparently, is one of those.
Ruud Harmsen
2018-11-04 06:59:55 UTC
Sat, 3 Nov 2018 23:20:48 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
It's doing heuristics also for correctly marked messages that are
completely in the encoding in the header. So yes, GG is non-compliant
and buggy.
How can you _know_ the message is correctly marked? How can you know the
message is ‘in’ the encoding of the header?
I know what I posted in which test messages. I can see the header by
pressing H in Agent or by looking at the "Original message" in GG. The
header, sent by Agent, says the message is in ISO-8859-1. And it is:
everything in that message is valid in ISO-8859-1.
Post by António Marques
The encoding is what declares
the meaning of the bytes. If you don’t trust the header, there’s nothing
else you can resort to other than heuristics.
There is no reason for GG not to trust that header. The header says
ISO-8859-1 and all the characters in the message are meaningful in
that encoding. So what GG should do is display it as what the
message says it is.
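
Taking the header at its word is straightforward with a standard MIME parser. The sketch below uses Python's `email` package on a made-up article (the headers and body are invented for the example, not copied from the test postings) and decodes strictly by the declared charset, with no guessing.

```python
from email import message_from_bytes

# A minimal, invented article: two MIME headers and an 8-bit Latin-1 body.
raw_article = (
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Accent aigu: áéíóú".encode("latin-1")
    + b"\r\n"
)


def decode_trusting_header(article: bytes) -> str:
    """Decode the body strictly according to the declared charset."""
    msg = message_from_bytes(article)
    charset = msg.get_content_charset() or "us-ascii"
    body = msg.get_payload(decode=True)   # raw body bytes
    return body.decode(charset)


print(decode_trusting_header(raw_article))
```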

There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
Post by António Marques
And old clients were known for saying one thing and doing another. Yours,
apparently, is one of those.
No. Not in those several test messages that did not contain Y umlaut.
--
Ruud Harmsen, http://rudhar.com
Ruud Harmsen
2018-11-04 07:42:03 UTC
Post by Ruud Harmsen
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
We could also have analysed this technical phenomenon in concerted
action in a peaceful and friendly manner. But no, _every_ occasion is
taken, in several places in Usenet, to create conflicts where none need
to exist.
--
Ruud Harmsen, http://rudhar.com
Rein
2018-11-11 14:43:02 UTC
Post by Ruud Harmsen
Post by Ruud Harmsen
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
We could also have analysed this technical phenomenon in concerted
action in a peaceful and friendly manner. But no, _every_ occasion is
taken, in several places in Usenet, to create conflicts where none need
to exist.
[Translated from Dutch] Look at those crocodile tears. I don't see anyone
being dehumanized, hypocrite.
Ruud Harmsen
2018-11-11 22:04:10 UTC
Post by Rein
Post by Ruud Harmsen
Post by Ruud Harmsen
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
We could also have analysed this technical phenomenon in concerted
action in a peaceful and friendly manner. But no, _every_ occasion is
taken, in several places in Usenet, to create conflicts where none need
to exist.
[Translated from Dutch] Look at those crocodile tears. I don't see anyone
being dehumanized, hypocrite.
You are writing in Dutch in an English language news group. Please
stop that behaviour, it is offensive.
--
Ruud Harmsen, http://rudhar.com
António Marques
2018-11-12 14:46:32 UTC
Post by Ruud Harmsen
(...)
You are writing in Dutch in an English language news group.
Is it, though? And if it is, should it be? I don’t think so.

On a separate note, I’ve read somewhere that your blessed Eudora was once
widely used by Russians, sending KOI8 or whatever identified as ISO-8859-1.
That would explain the use of heuristics by GG rather than trusting the
headers - which is preferable, making a lot of Russians able to
communicate, or letting one guy say ‘áéíóú’?

Have you tried sending valid ISO-8859-1 from a compliant agent such as TB?
Maybe the heuristics only kick in for old clients.
Ruud Harmsen
2018-11-13 06:09:56 UTC
Mon, 12 Nov 2018 14:46:32 -0000 (UTC): António Marques
Post by António Marques
On a separate note, I’ve read somewhere that your blessed Eudora was once
widely used by Russians, sending KOI8 or whatever identified as ISO-8859-1.
That would explain the use of heuristics by GG rather than trusting the
headers - which is preferable, making a lot of Russians able to
communicate, or letting one guy say ‘áéíóú’?
I don't use Eudora for Usenet and it cannot be used for that. It is an
e-mail program.
Post by António Marques
Have you tried sending valid ISO-8859-1 from a compliant agent such as TB?
Free Agent is also compliant.
Post by António Marques
Maybe the heuristics only kick in for old clients.
--
Ruud Harmsen, http://rudhar.com
António Marques
2018-11-13 14:04:51 UTC
Reply
Permalink
Post by Ruud Harmsen
Mon, 12 Nov 2018 14:46:32 -0000 (UTC): António Marques
Post by António Marques
On a separate note, I’ve read somewhere that your blessed Eudora was once
widely used by Russians, sending KOI8 or whatever identified as ISO-8859-1.
That would explain the use of heuristics by GG rather than trusting the
headers - which is preferable, making a lot of Russians able to
communicate, or letting one guy say ‘áéíóú’?
I don't use Eudora for Usenet and it cannot be used for that. It is an
e-mail program.
Eudora, Forte, whatever. It’s all the same unsupported buggy stuff from a
bygone era of ill repute. The internets say your FA used to send KOI8
identified as 8859-1.
Post by Ruud Harmsen
Post by António Marques
Have you tried sending valid ISO-8859-1 from a compliant agent such as TB?
Free Agent is also compliant.
God forbid you should test 8859-1 with TB and it turned out OK in GG.
Ruud Harmsen
2018-11-13 18:15:15 UTC
Reply
Permalink
Tue, 13 Nov 2018 14:04:51 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
Mon, 12 Nov 2018 14:46:32 -0000 (UTC): António Marques
Post by António Marques
On a separate note, I’ve read somewhere that your blessed Eudora was once
widely used by Russians, sending KOI8 or whatever identified as ISO-8859-1.
That would explain the use of heuristics by GG rather than trusting the
headers - which is preferable, making a lot of Russians able to
communicate, or letting one guy say ‘áéíóú’?
I don't use Eudora for Usenet and it cannot be used for that. It is an
e-mail program.
Eudora, Forte, whatever. It’s all the same unsupported buggy stuff from a
bygone era of ill repute. The internets say your FA used to send KOI8
identified as 8859-1.
Post by Ruud Harmsen
Post by António Marques
Have you tried sending valid ISO-8859-1 from a compliant agent such as TB?
Free Agent is also compliant.
God forbid you should test 8859-1 with TB and it turned out OK in GG.
1) You can test it yourself. Did you?

2) It's too much of a nuisance, I don't have TB installed on this
computer.

3) FreeAgent 1.93 can also speak UTF-8, and can always post in it. I
didn't know that. However, that doesn't mean I can post the true names
of the late Mr. Khashoggi, because the screen interface still only
supports Windows-1252. And that is quite enough.
--
Ruud Harmsen, http://rudhar.com
Ruud Harmsen
2018-11-11 22:01:45 UTC
Post by Rein
Post by Ruud Harmsen
Post by Ruud Harmsen
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
We could also have analysed this technical phenomenon in concerted
action in a peaceful and friendly manner. But no, _every_ occasion is
taken, in several places in Usenet, to create conflicts where none need
to exist.
[Translated from Dutch] Look at those crocodile tears. I don't see anyone
being dehumanized, hypocrite.
[Translated from Dutch] Do you ever read along in sci.lang? I do, and have
for decades.

If not, you should keep quiet; you don't know what you're talking about.

The criticism I voice there applies just as much to 30 years of nl.taal,
by the way, yes. And to you personally. All you do there is try to turn
everything into personal conflicts, whatever it is.

And once again, Address, if you don't want to be suspected of being a
bot, you shouldn't behave like a bot. Simple.
--
Ruud Harmsen, http://rudhar.com
Rein
2018-11-12 11:57:15 UTC
Post by Ruud Harmsen
Post by Rein
Post by Ruud Harmsen
Post by Ruud Harmsen
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
We could also have analysed this technical phenomenon in concerted
action in a peaceful and friendly manner. But no, _every_ occasion is
taken, in several places in Usenet, to create conflicts where none need
to exist.
[Translated from Dutch] Look at those crocodile tears. I don't see anyone
being dehumanized, hypocrite.
[Translated from Dutch] Do you ever read along in sci.lang? I do, and have
for decades.
If not, you should keep quiet; you don't know what you're talking about.
The criticism I voice there applies just as much to 30 years of nl.taal,
by the way, yes. And to you personally. All you do there is try to turn
everything into personal conflicts, whatever it is.
And once again, Address, if you don't want to be suspected of being a
bot, you shouldn't behave like a bot. Simple.
"You are writing in Dutch in an English language news group.
Please stop that behaviour, it is offensive." (RH)
Peter T. Daniels
2018-11-04 14:37:00 UTC
Post by Ruud Harmsen
Sat, 3 Nov 2018 23:20:48 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
It's doing heuristics also for correctly marked messages that are
completely in the encoding in the header. So yes, GG is non-compliant
and buggy.
How can you _know_ the message is correctly marked? How can you know the
message is ‘in’ the encoding of the header?
I know what I posted in which test messages. I can see the header by
pressing H in Agent or by looking at the "Original message" in GG. The
header, sent by Agent, says the message is in ISO-8859-1. And it is:
everything in that message is valid in ISO-8859-1.
Post by António Marques
The encoding is what declares
the meaning of the bytes. If you don’t trust the header, there’s nothing
else you can resort to other than heuristics.
There is no reason for GG not to trust that header. The header says
ISO-8859-1 and all the characters in the message are meaningful in
that encoding. So what GG should do is display it as what the
message says it is.
There is NO reason whatsoever to display parts of it as Russian or
Hebrew. Doing that anyway is simply smartass and buggy behaviour. It
is arrogantly applying Artificial Intelligence where none is needed,
because there is an unambiguous and correct specification of how the
text is encoded.
Post by António Marques
And old clients were known for saying one thing and doing another. Yours,
apparently, is one of those.
No. Not in those several test messages that did not contain Y umlaut.
Why is it _only_ messages from Ruud Harmsen in which that particular fault
occurs? In occasional other messages, GG replaces characters with question
marks, or rarely it interprets pairs of characters as Chinese characters,
and very rarely rectangles or diamond-question-marks.

Never before have I seen characters turn into Cyrillic or Hebrew.
Ruud Harmsen
2018-11-04 15:05:14 UTC
Sun, 4 Nov 2018 06:37:00 -0800 (PST): "Peter T. Daniels"
Post by Peter T. Daniels
Post by Ruud Harmsen
Post by António Marques
And old clients were known for saying one thing and doing another. Yours,
apparently, is one of those.
No. Not in those several test messages that did not contain Y umlaut.
Why is it _only_ messages from Ruud Harmsen in which that particular fault
occurs?
1) I don't know. I fed back the error to Google, perhaps they'll
investigate the error and respond.

2) Until now nobody posted a string of accented letters that does not
look like a word in any language, in ISO-8859-1 and marked as such.
Post by Peter T. Daniels
In occasional other messages, GG replaces characters with question
marks, or rarely it interprets pairs of characters as Chinese characters,
and very rarely rectangles or diamond-question-marks.
Never before have I seen characters turn into Cyrillic or Hebrew.
Neither did I. Apparently the introduction of the GG bug is recent.
--
Ruud Harmsen, http://rudhar.com
António Marques
2018-11-12 15:35:04 UTC
Post by Ruud Harmsen
The header says
ISO-8859-1 and all the characters in the message are meaningful in
that encoding
Once more: all the 256 bytes are meaningful in all the 8-bit encodings (bar
some irrelevant old one). That’s not something that can be generally used
to validate them. An encoding is a declaration and not, per se, subject to
validation. It’s what it is. Even in the ISO charsets, the control codes
aren’t any less valid. They’re just not printable characters, or characters
at all, if you prefer.
Now, whether the bytes translate to a meaningful human message when using a
given charset, is another matter, and it’s a matter for AI, not classical
algorithms.

As to why not trust the header, I’ve explained in the other thread.
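
The claim about 8-bit encodings can be checked with a small Python
sketch. It assumes Python's built-in codec tables as the reference, and
the helper name undecodable_bytes is made up for this illustration.

```python
def undecodable_bytes(codec: str) -> list:
    """Return every single-byte value that the given codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


# ISO-8859-1 accepts all 256 byte values (0x80-0x9F become C1 control codes),
# so a successful decode says nothing about whether the label was right.
print([hex(v) for v in undecodable_bytes("iso-8859-1")])

# Python's Windows-1252 table leaves a handful of positions undefined.
print([hex(v) for v in undecodable_bytes("cp1252")])
```

With these tables, ISO-8859-1 rejects nothing, while Python's cp1252
codec rejects five positions (0x81, 0x8D, 0x8F, 0x90, 0x9D). So "does
it decode?" cannot validate a Latin-1 label, which is the point made
above.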
O. Udeman
2018-11-12 16:04:35 UTC
Post by António Marques
Post by Ruud Harmsen
The header says
ISO-8859-1 and all the characters in the message are meaningful in
that encoding
Once more: all the 256 bytes are meaningful in all the 8-bit encodings (bar
some irrelevant old one). That’s not something that can be generally used
to validate them. An encoding is a declaration and not, per se, subject to
validation. It’s what it is. Even in the ISO charsets, the control codes
aren’t any less valid. They’re just not printable characters, or characters
at all, if you prefer.
Now, whether the bytes translate to a meaningful human message when using a
given charset, is another matter, and it’s a matter for AI, not classical
algorithms.
As to why not trust the header, I’ve explained in the other thread.
Bla, bla, bla.
Ruud Harmsen
2018-11-13 06:15:55 UTC
Mon, 12 Nov 2018 15:35:04 -0000 (UTC): António Marques
Post by António Marques
Now, whether the bytes translate to a meaningful human message when using a
given charset, is another matter, and it’s a matter for AI, not classical
algorithms.
Interpreting announced ISO-8859-1 as ISO-8859-1 is a classical
algorithm.
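
As a rough sketch of what "interpreting the announced charset" could
look like, here is a Python version using the standard email package.
It assumes a single-part text article with an 8bit body; the function
name decode_as_declared and the sample article are invented for this
example and say nothing about how GG or Agent actually work.

```python
from email import message_from_bytes


def decode_as_declared(raw_article: bytes) -> str:
    """Decode the body strictly by the charset named in Content-Type."""
    msg = message_from_bytes(raw_article)
    # Fall back to US-ASCII when no charset parameter is present.
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes the transfer encoding and returns raw bytes
    # (None for multipart messages, which this sketch does not handle).
    body = msg.get_payload(decode=True) or b""
    # An unknown charset name would raise LookupError here.
    return body.decode(charset, errors="replace")


sample_article = (
    b"Content-Type: text/plain; charset=ISO-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Accent grave: \xe0\xe8\xec\xf2\xf9\n"
)
print(decode_as_declared(sample_article))
```

No guessing is involved: the header names the charset and the bytes are
decoded accordingly.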

Anyway, the point is, the Russian and Hebrew was displayed by GG, not
sent by me, something of which I have been falsely accused at length.
--
Ruud Harmsen, http://rudhar.com
António Marques
2018-11-13 14:04:52 UTC
Post by Ruud Harmsen
Mon, 12 Nov 2018 15:35:04 -0000 (UTC): António Marques
Post by António Marques
Now, whether the bytes translate to a meaningful human message when using a
given charset, is another matter, and it’s a matter for AI, not classical
algorithms.
Interpreting announced ISO-8859-1 as ISO-8859-1 is a classical
algorithm.
That’s not an algorithm and it won’t work reliably given the sheer amount
of content that was produced by buggy clients such as yours back in the
day.
Post by Ruud Harmsen
Anyway, the point is, the Russian and Hebrew was displayed by GG, not
sent by me, something of which I have been falsely accused at length.
Not by me. I’ve only accused you of doing the unethical thing of using
buggy, unsupported software to communicate with others.
Ruud Harmsen
2018-11-13 18:18:40 UTC
Tue, 13 Nov 2018 14:04:52 -0000 (UTC): António Marques
Post by António Marques
Post by Ruud Harmsen
Mon, 12 Nov 2018 15:35:04 -0000 (UTC): António Marques
Post by António Marques
Now, whether the bytes translate to a meaningful human message when using a
given charset, is another matter, and it’s a matter for AI, not classical
algorithms.
Interpreting announced ISO-8859-1 as ISO-8859-1 is a classical
algorithm.
That’s not an algorithm and it won’t work reliably given the sheer amount
of content that was produced by buggy clients such as yours back in the
day.
Agent 1.93 isn't buggy, except that it sometimes sends Windows-1252
while the header says ISO-8859-1. A very minor offence.
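
To illustrate the mislabelling in Python: capital Y with diaeresis
exists in Windows-1252 (as byte 0x9F) but not in ISO-8859-1. The helper
decode_latin1_leniently below is a hypothetical name. It sketches one
possible reader-side workaround, similar in spirit to how web browsers
treat the "iso-8859-1" label as windows-1252, and is not a description
of any particular newsreader.

```python
capital_y_diaeresis = "\u0178"  # the letter that triggered the problem

# Windows-1252 has a slot for it: the single byte 0x9F.
raw = capital_y_diaeresis.encode("cp1252")
print(raw)

# ISO-8859-1 has no such letter, so encoding it fails outright.
try:
    capital_y_diaeresis.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("not in ISO-8859-1:", exc.reason)

# Read back under the declared label, 0x9F is a C1 control code.
print(repr(raw.decode("iso-8859-1")))


def decode_latin1_leniently(data: bytes) -> str:
    """Decode bytes labelled ISO-8859-1, reading each byte as Windows-1252
    where that table defines it and as ISO-8859-1 otherwise."""
    chars = []
    for value in data:
        single = bytes([value])
        try:
            chars.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(single.decode("iso-8859-1"))
    return "".join(chars)


print(decode_latin1_leniently(b"\xc4\xcb\xcf\xd6\xdc\x9f"))
```

Both tables agree on 0xA0-0xFF, so the lenient reading only changes the
0x80-0x9F range.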

GG, however, is buggy, user-unfriendly, and bad at searching text
(which was GG's strong point par excellence?!?!?)

And your smartphone does smart curly quotes without you even knowing
about it.
Post by António Marques
Post by Ruud Harmsen
Anyway, the point is, the Russian and Hebrew was displayed by GG, not
sent by me, something of which I have been falsely accused at length.
Not by me. I’ve only accused you of doing the unethical thing of using
buggy, unsupported software to communicate with others.
Unjustified. Software need not be supported, standards are. Agent 1.93
is compliant.
--
Ruud Harmsen, http://rudhar.com
Ruud Harmsen
2018-11-04 07:44:10 UTC
Sat, 3 Nov 2018 23:20:48 -0000 (UTC): António Marques
Post by António Marques
And old clients were known for saying one thing and doing another. Yours,
apparently, is one of those.
There has been a habit, or so I hear, of sending Russian _without_ any
indication of the encoding. But not of indicating Latin-1 and then
really sending Russian.
--
Ruud Harmsen, http://rudhar.com
Peter T. Daniels
2018-11-02 19:13:23 UTC
Post by Christian Weisgerber
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Looking at the raw article in my INN server's newsspool...
Post by Ruud Harmsen
Diaeresis: äëïöüÿ
Diaeresis: ÄËÏÖÜ<9F>
In Ruud's first message above I again see cyrillic characters. But in
the two lines above, I see letters with dieresis. Is some sort of
control code interfering?
Post by Christian Weisgerber
There was an illegal character there, byte 0x9F, which is not a
printable character in ISO 8859-1. Also, capital Y with diaeresis
is not part of the 8859-1 set. (Lowercase y with diaeresis is.)
I see the capital Y as <9F> (lessthan nine EFF greaterthan)
peteolcott
2018-11-12 14:58:08 UTC
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
O. Udeman
2018-11-12 15:30:01 UTC
Post by peteolcott
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
Piss off, you bunch of idiots.
Ruud Harmsen
2018-11-13 06:17:21 UTC
Post by peteolcott
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
The problem was not in the sender's computer, not in the reader's
computer or smartphone, but in Google Groups, as has been amply
investigated and proven.
--
Ruud Harmsen, http://rudhar.com
Peter T. Daniels
2018-11-13 14:30:47 UTC
Post by Ruud Harmsen
Post by peteolcott
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
The problem was not in the sender's computer, not in the reader's
computer or smartphone, but in Google Groups, as has been amply
investigated and proven.
Stop with your "phantasy," as you call it. I saw several lines of
perfectly clear Cyrillic letters in the initial message, and when
the complainer copied the lines, they were the question-marks that he saw.

Incidentally, all the accented letters above, including the meaningless
string of them, are fine, so you seem to have fixed whatever was wrong
with your system.
António Marques
2018-11-13 14:42:52 UTC
Post by Peter T. Daniels
Post by Ruud Harmsen
Post by peteolcott
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
The problem was not in the sender's computer, not in the reader's
computer or smartphone, but in Google Groups, as has been amply
investigated and proven.
Stop with your "phantasy," as you call it. I saw several lines of
perfectly clear Cyrillic letters in the initial message, and when
the complainer copied the lines, they were the question-marks that he saw.
Incidentally, all the accented letters above, including the meaningless
string of them, are fine, so you seem to have fixed whatever was wrong
with your system.
What was wrong with it was insisting on using technology that was already
old when Yeltsin was president.

It's been established that GG uses heuristics to determine a message's
charset rather than trusting the declared value (unless the message
declares UTF-8, the current standard). It may or may not use a whitelist or
blacklist of sender software when doing that.
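
What GG's heuristics actually do is not known from this thread. The
Python sketch below only shows why a wrong guess would produce exactly
the symptoms reported: the same bytes, read under a different 8-bit
charset, come out as Cyrillic or Hebrew letters. The function name
reinterpret and the choice of windows-1251 and ISO-8859-8 as the
"guessed" charsets are assumptions made for the illustration.

```python
def reinterpret(text: str, declared: str, guessed: str) -> str:
    """Encode text under the declared charset, then decode the same bytes
    under a different, guessed charset."""
    return text.encode(declared).decode(guessed, errors="replace")


sample = "\u00e0\u00e8\u00ec\u00f2\u00f9"  # a, e, i, o, u with grave accent

# The bytes 0xE0 0xE8 0xEC 0xF2 0xF9 are lowercase Cyrillic letters
# in windows-1251 ...
print(reinterpret(sample, "iso-8859-1", "cp1251"))

# ... and Hebrew letters in ISO-8859-8.
print(reinterpret(sample, "iso-8859-1", "iso-8859-8"))
```

A detector that scores a run of bytes in the 0xC0-0xFF range as
"probably Russian" would therefore turn a string of accented Latin
letters into Cyrillic, without the sender having sent any Cyrillic.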

That much is not Ruud's fault, but neither can it be called a bug in GG.
It's unexpected behaviour, which prima facie is antisocial, but may be
justified on further analysis.
Ruud Harmsen
2018-11-13 18:38:21 UTC
Tue, 13 Nov 2018 06:30:47 -0800 (PST): "Peter T. Daniels"
Post by Peter T. Daniels
Post by Ruud Harmsen
Post by peteolcott
Post by Ruud Harmsen
Testing accented letters, encoded in Latin-1, ISO8859-1, Windows
1252/CP1252
Accent grave: àèìòù
Accent aigu: áéíóúý
Accent circonflex: âêôîû
Diaeresis: äëïöüÿ
Tilde: ãõ
Cedilla: ç
Accent grave: ÀÈÌÒÙ
Accent aigu: ÁÉÍÓÚÝ
Accent circonflex: ÂÊÔÎÛ
Diaeresis: ÄËÏÖÜŸ
Tilde: ÃÕ
Cedilla: Ç
àèìòùáéíóúýâêôîûäëïöüÿãõç
Google Groups does almost full UTF-8 Unicode depending on the
fonts stored on the user's local machine. Even old smart phones
should be able to do Latin-1.
The problem was not in the sender's computer, not in the reader's
computer or smartphone, but in Google Groups, as has been amply
investigated and proven.
Stop with your "phantasy," as you call it. I saw several lines of
perfectly clear Cyrillic letters in the initial message, and when
the complainer copied the lines, they were the question-marks that he saw.
Sigh. You're technically incompetent, there is no longer any doubt
now. I gave you all the pointers but you are unable to look at them
and judge their meaning.

That's not so bad, not everybody needs to have the programming and
debugging experience that I have. But then please refrain from posting
unfounded accusations.
Post by Peter T. Daniels
Incidentally, all the accented letters above, including the meaningless
string of them, are fine, so you seem to have fixed whatever was wrong
with your system.
As announced, I now have Agent post in UTF-8, but you missed that too
or you didn't understand what it means.
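
For anyone wondering what posting in UTF-8 changes, a short Python
sketch follows; the helper name is_valid_utf8 is invented here. Unlike
the 8-bit charsets, UTF-8 can represent capital Y with diaeresis, and
its decoding is strict, so bytes in another encoding are usually
detected as invalid rather than silently misread.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True when the bytes form well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


uppercase_diaeresis = "\u00c4\u00cb\u00cf\u00d6\u00dc\u0178"  # includes Y with diaeresis

# Every letter in the line round-trips, two bytes per letter.
utf8_bytes = uppercase_diaeresis.encode("utf-8")
print(len(utf8_bytes), utf8_bytes.decode("utf-8") == uppercase_diaeresis)

# The same line as sent in Windows-1252 is not valid UTF-8.
print(is_valid_utf8(uppercase_diaeresis.encode("cp1252")))
```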
--
Ruud Harmsen, http://rudhar.com