**whenever** · Sep. 21, 2008, 02:29 PM

I don't think it makes sense to force a UTF-8 encoded page to be rendered as ISO-8859-1. The page should be rendered in the encoding it was originally written in.

For English-speaking people there might be no big difference, because ASCII characters are encoded identically in ISO-8859-1 and UTF-8, that is, one character takes one byte. For multi-byte characters like Chinese, however, one character takes 2 bytes in GB2312 but 3 bytes in UTF-8. That's why English pages that contain only ASCII may render well as both ISO-8859-1 and UTF-8, but for Chinese pages you have to select the encoding the page was originally written in.
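
The byte counts mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the sample characters and the helper name `encoded_length` are chosen here and do not come from the thread.

```python
# Compare how many bytes the same character needs in different encodings.
# "A" stands in for plain ASCII text, "中" for a common Chinese character.
samples = {"ascii": "A", "chinese": "中"}


def encoded_length(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies when stored in `encoding`."""
    return len(text.encode(encoding))


for label, ch in samples.items():
    for enc in ("iso-8859-1", "gb2312", "utf-8"):
        try:
            print(label, enc, encoded_length(ch, enc))
        except UnicodeEncodeError:
            # ISO-8859-1 has no code point for Chinese characters at all.
            print(label, enc, "not representable")
```

The ASCII letter comes out as one byte everywhere, which is why a pure-ASCII page survives being labelled with the wrong one of these charsets.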

Oddysey · Nov. 16, 2008, 05:32 AM

BUMP (or REVIVED, your choice)

I've got the definitive answer, but let me first take care of some old business:

whenever;

Quote:I don't think it makes sense to force a UTF-8 encoded page to be rendered as ISO-8859-1. The page should be rendered in the encoding it was originally written in.

Well, that would certainly fly in the face of The Proxomitron's very reason for existence - to modify the incoming data to display as you wish, not as the originator wishes. Nope, gotta disagree with you on that one, particularly since it's your very suggestion that is causing me so much aggravation (i.e. seeing question marks where they shouldn't be seen).

Now, as promised, I'll reveal how Proxo solved my problem(s). To recap, so you don't have to go back a page, I originally posted:

Quote:Many sites on the web, if not most of them, are now serving their pages with strange characters, or just as bad, with normal characters where they don't belong. The best example is a question mark being used to replace an apostrophe, or even just stuck wherever the page's author felt like putting one. (But in checking the source code, I see that sometimes a "page generator" is used, so I can't always blame the author.)

I have a filter that more or less converts those questionable question marks into apostrophes, which is what I'd usually expect to find in that position. Sadly, as often as not, the bogus question mark occurs in the middle of a link. As you can guess, this screws up the link, big time. When I click it, I get what amounts to a 404 error. I check the address bar, and sure enough, bold as brass, there are a bunch of apostrophes, and no question marks.
......

After several suggestions and some deep experimentation, I found that I was getting a rather unsatisfactory browsing experience, but that it wasn't too bad, so I'd just live with it. As you might expect, if you knew me, that didn't last long...... Dead

This morning I got fed up, and sat down to get to the bottom of it all. After much googling, I found a possible answer: we should have expected all along that it would be a header issue. Essentially, if there's no specific header instruction, then the default value is used. For character sets, it's whatever the page's server sends as a Reply header for Content-Type. So I opened the Header filters, and lo and behold, Scott had already anticipated this very scenario..... there's a default filter that normally removes the charset declaration, when active. I merely activated it, after suitable modification, like so:

Code:

In = TRUE
Out = FALSE
Key = "Content-Type: character set filter (in)"
Match = "text/html;*charset*"
Replace = "text/html;charset=us-ascii"

For our purposes, the magic takes place at the end of the Replace line: charset=us-ascii. You can of course substitute whatever you wish, be it a name (like I did) or an ISO number (like I first proposed). The problem I noted with ISO-8859-1 is that it would sometimes (very rarely) falsely render the odd character or two, so I opted for the most basic charset of all. Indeed, the W3C's very own RFC-2616 (Header Specifications) actually says 'use of this charset is encouraged throughout the web'. Doubtless, they're aiming for a world-wide acceptance, and recognition of, one alphabet.
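
For readers who don't run Proxomitron, here is a rough Python approximation of what that header filter does to an incoming Content-Type value. It is a sketch only: the function name `force_charset` is made up for illustration, and the regular expression is an assumed stand-in for the wildcard pattern, not Proxomitron's actual matching engine.

```python
import re

# Assumed approximation of Match = "text/html;*charset*": only HTML
# responses that already declare a charset are touched.
_MATCH = re.compile(r"^\s*text/html\s*;.*charset", re.IGNORECASE)


def force_charset(content_type: str, charset: str = "us-ascii") -> str:
    """Return the Content-Type value with its charset declaration replaced."""
    if _MATCH.search(content_type):
        # Mirrors Replace = "text/html;charset=us-ascii"
        return f"text/html;charset={charset}"
    # Anything else (images, HTML without a charset) passes through unchanged.
    return content_type


print(force_charset("text/html; charset=UTF-8"))
print(force_charset("image/gif"))
```

Note that this only relabels the response. The bytes of the page body are not converted in any way.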

BTW, there is a header that's called, wondrously enough, "accept-charset". That was abandoned early on (in web-years), and is fully ignored by all current browsers and servers. Ah, but if only they had employed hindsight back then, solving problems like this now would be so much easier..... Sad

One other thing..... http://www.unicode.org recommends that you configure your browser and OS to install all possible language packs (character/font sets), before anything else. I did that, and as you might have guessed, that didn't do the trick for me. Ah well, what's another 10 or 20 megs of fonts lying unused on my boot drive, eh? Whistling

Problem solved, and shared with my buds! Cheers

HTH

Oddysey

**whenever** · Nov. 16, 2008, 11:18 AM

Quote:Well, that would certainly fly in the face of The Proxomitron's very reason for existence - to modify the incoming data to present as you wish, not as the originator wishes.

I really love Proxomitron's ability to modify data on the fly, but that doesn't mean it is good at encoding conversion or binary filtering (I know we have a dirty trick, [%xx]). For example, I don't think it can convert a Chinese character from its 3-byte UTF-8 form to its 2-byte GB2312 form.

Oddysey, you didn't get what I wanted to say in the previous post.

As to your problem, I think it is because your browser is not selecting the proper encoding for the page. If you view a UTF-8 page as ISO-8859-1, the ASCII characters will render fine, but characters that exist in UTF-8 and not in ISO-8859-1 will not render correctly and may turn into question marks. So, if you encounter a page with strange question marks, would you please check the page source to find the charset setting and manually set the browser to use that encoding? If that still doesn't help, would you please post the URL so that I can have a look?
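
The effect described here is easy to reproduce outside a browser. The sketch below encodes a word containing a curly apostrophe as UTF-8 and then decodes the bytes with the wrong charset; the helper name `misread` and the sample word are illustrative choices, not anything from the thread.

```python
def misread(text: str, wrong_encoding: str) -> str:
    """Encode `text` as UTF-8, then decode those bytes with another charset."""
    raw = text.encode("utf-8")
    # errors="replace" swaps every undecodable byte for U+FFFD, the
    # replacement character that is often drawn as a question mark.
    return raw.decode(wrong_encoding, errors="replace")


sample = "don\u2019t"  # "don't" written with a curly apostrophe (U+2019)

print(repr(misread(sample, "iso-8859-1")))   # three wrong characters, no error
print(repr(misread(sample, "us-ascii")))     # three replacement characters
print(repr(misread("plain ASCII", "us-ascii")))  # unaffected
```

The single apostrophe occupies three bytes in UTF-8, so a wrong single-byte charset turns it into three junk characters, while the surrounding ASCII letters come through untouched.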

Oddysey · Nov. 16, 2008, 07:47 PM

whenever;

Your explanation in Post 18, of your statement in Post 16, was exactly what I had thought you meant - I did get your intended meaning.

Proxo doesn't do any charset conversions at all, it only sets a variable (a known header) to some value that the user wants. In this case, you're correct, my browser is not properly rendering the incoming datastream in the charset desired by the page's author. That's why I did what they said to do at http://www.unicode.org, to install and configure all those extra fonts, language packs, and other stuff.... and why I was disappointed (and not a little disgusted) at the results. I mean, this is an almost pristine installation, I've not really taken the time to modify IE with BHOs, registry borks, and other assorted mayhem. It should all go according to Hoyle, but it's not. Cry

That said, there are at least two considerations here, and possibly more. One is that I don't view sites that use Chinese alphabets. Yes, you were just using that as an example, but I also mean that as an example - I don't go surfing around the web to sites in other languages, I stick to those I can read without metaphysical intervention. Wink

Thus, it's all right for ME to use a charset that you consider restrictive, but note that I'm not advocating that all other web surfers do the same. What I've done here is merely present a solution to a given problem, in the hopes that other Proxo users might see a way to adapt my solution to their needs.... that was all, I swear.

The other consideration in my mind at the moment is the validity of using UTF-8 in the first place. It sounds like a good idea, but like most such, I can poke some rather uncomfortable holes in it. But that's not why we're here, so I'll leave off of that, and get on with the rest of it.....

You asked for an example of how UTF-8 encoded pages appear in my browser. Well, when I visit eBay, I get sent a page that eBay generated in UTF-8, and it looks like this:

[Image: ebay-ugly.gif]

That's a result of eBay's attempt to appeal to a broad audience, instead of making sure that their pages don't go all sideways on at least a few of their viewers - they set "Content-type: charset=UTF-8", and hope for the best. But by forcing the incoming datastream to be "Content-type: charset=us-ascii", I tell the browser to ignore what eBay wants, and to do what I want instead. And it looks just peachy-keen, you can take that to the bank. Smile!

I can't speak for other browsers on the market, but in IE, you don't get to easily command what charset will be used. You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration. That's pretty user-hostile, if you ask me. Banging Head

Thus, we resort to munging the incoming datastream with Proxo, which is why we worship at the altar of Scott, each and every day. Hail

HTH

Oddysey

***z12*** · Nov. 16, 2008, 09:40 PM

Oddysey Wrote:You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration. That's pretty user-hostile, if you ask me.

From: http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2

Quote:To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings.
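
The first two priorities in that list can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `declared_charset` is a hypothetical helper, the regular expressions are deliberately loose, and priority 3 plus the browser heuristics and user settings mentioned in the quote are not modelled at all.

```python
import re
from typing import Optional

# Loose patterns, good enough for illustration only.
_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META = re.compile(
    r"<meta[^>]+http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*>",
    re.IGNORECASE,
)


def declared_charset(content_type_header: Optional[str], html: str) -> Optional[str]:
    """Return the declared charset, honouring the HTTP header before META."""
    # Priority 1: charset parameter of the HTTP Content-Type header.
    if content_type_header:
        found = _CHARSET.search(content_type_header)
        if found:
            return found.group(1).lower()
    # Priority 2: META declaration with http-equiv="Content-Type".
    meta = _META.search(html)
    if meta:
        found = _CHARSET.search(meta.group(0))
        if found:
            return found.group(1).lower()
    # Nothing declared: a real browser would fall back to its own guessing.
    return None


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>'
print(declared_charset("text/html; charset=UTF-8", page))
print(declared_charset("text/html", page))
```

When the header and the META tag disagree, the header wins, which is why rewriting or removing the header changes what the browser does with the same page bytes.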

The 800lb gorilla in the room is browser heuristics.
While the following link refers to HTML 5, it's still an interesting read, if you're into technical stuff:
http://www.whatwg.org/specs/web-apps/cur...r-encoding

Anyway, as you know, changing the charset for the Content-Type header does not change how the page was encoded.
It might change how the page is rendered, for better or worse.

Occasionally, I see some oddball question marks out of context.
I've also seen quite a few pages where the meta charset is different from the header charset.
Up to now, I haven't really paid attention, but I can't help but wonder if the two are related.

To resolve this, I'm thinking about stripping the charset from the Content-Type header.
I figure I can always re-inject the charset via a meta tag if it's not already present in the page header.

If the browsers follow the rules above, this should fix it if the server isn't sending the right charset.
I'm not sure about the pros & cons of injecting a meta charset if both are missing.
My main concern with this is IE related.

I've seen IE6 bork my injected js.
This happens, sometimes, when a meta charset element is encountered after my header js is injected.
If IE decides to re-parse the document based on the new charset, I get an error and my js is hosed.
I think I've resolved the issue, but my filter needs more testing with IE.
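
The strip-and-re-inject idea described above could look roughly like this in Python. It is a hedged sketch, not z12's actual filter: both function names are hypothetical, the charset to inject is simply passed in by the caller, and pages without a `<head>` tag are left untouched.

```python
import re

_CHARSET_PARAM = re.compile(r"\s*;\s*charset\s*=\s*[^;]*", re.IGNORECASE)
_META_CHARSET = re.compile(r"<meta[^>]*charset\s*=", re.IGNORECASE)
_HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def strip_charset(content_type: str) -> str:
    """Remove the charset parameter from a Content-Type header value."""
    return _CHARSET_PARAM.sub("", content_type).strip()


def inject_meta_charset(html: str, charset: str) -> str:
    """Add a META charset declaration unless the page already carries one."""
    if _META_CHARSET.search(html):
        return html  # keep whatever the author declared
    tag = f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
    # Insert right after the opening <head> tag; no <head> means no change.
    return _HEAD_OPEN.sub(lambda m: m.group(0) + tag, html, count=1)


print(strip_charset("text/html; charset=UTF-8"))
print(inject_meta_charset("<html><head><title>x</title></head></html>", "utf-8"))
```

Placing the tag immediately after `<head>` keeps it ahead of any other injected content, which seems relevant to the IE re-parse problem mentioned above, though that has not been tested here.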

whenever Wrote:So, if you encounter a page with strange question mark, would you please check the page source to find the charset setting and manually set the browser to use that encoding? If that still doesn't help, would you please post the url so that I can have a look?

I'd be interested in seeing links to pages like that also.

Oddysey, FYI, Firefox sends the "accept-charset" header.

z12

Graycode · Nov. 17, 2008, 06:28 PM

(Nov. 16, 2008 09:40 PM)z12 Wrote: The 800lb gorilla in the room is browser heuristics.

As long as idiots keep producing stupid headers then the 800lb gorillas and 799lb proxies will have to stay alert.

Code:

www.mcafee.com/us/images/buy_links_icon.gif
==========================================================
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Content-Type: text/html; charset=UTF-8
X-Powered-By: ASP.NET
Date: Mon, 17 Nov 2008 17:29:58 GMT
Accept-Ranges: bytes
Last-Modified: Tue, 17 Jul 2007 13:17:59 GMT
ETag: "805d2ce574c8c71:e60"
Content-Length: 1674
==========================================================
offset | _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F |
000000 | 47 49 46 38 39 61 1E 00 1E 00 F7 FF 00 E4 F2 F9 | GIF89a          
000010 | 74 A0 DA 45 6E B2 AA D1 F9 94 A4 B6 C4 C4 C6 94 | t  En           
000020 | A4 C8 FF FE FE 68 97 E3 79 7A 77 78 98 CA 5A 9C |      h  yzwx  Z 
000030 | E9 67 68 66 DC DA DB 8B 89 82 FE FD F2 5D 81 B4 |  ghf         ]  
000040 | 62 98 DB D1 CD C8 68 8A B8 88 8D 95 E6 E5 DB AB | b     h         
000050 | B2 BA D3 E1 F8 9D 9D 9E 78 A6 F5 DA F3 FC 95 B9 |         x       
000060 | FF C3 DC F6 C5 CB DC A6 A9 AE 6B 77 87 4D 8D DA |           kw M  
000070 | 9E C4 F2 B4 CB F2 BB BF C5 A4 B5 D2 A9 B4 C4 ED |                 
000080 | ED EA B7 D6 EA BC CB E2 D6 D5 D4 E2 DF D8 96 AC |                 
000090 | E6 96 9D A8 DE DB D4 4C 84 CC 83 9C CD A1 CA F6 |        L        
0000A0 | BB BC BB BD CB CC CB D2 D3 86 AB DE BC D7 F4 C2 |                 
0000B0 | E2 EF 6A 92 CC 69 87 A5 C1 D7 EF 84 AB E6 EA E6 |   j  i

Shown is the top of a perfectly normal GIF file that the server falsely claims to be HTML encoded as UTF-8. They even do that to their own logo ( http://www.mcafee.com/us/images/common/logo.gif )
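
A proxy or script can catch this kind of mislabelling by looking at the first bytes of the body instead of trusting the header. The sketch below uses the first row of the dump above; the function names are made up for illustration, and only the GIF signature is checked, not any other file type.

```python
import struct

GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def looks_like_gif(body: bytes) -> bool:
    """True if the body starts with one of the two GIF signatures."""
    return body[:6] in GIF_SIGNATURES


def mislabelled_as_html(content_type: str, body: bytes) -> bool:
    """True if the header claims HTML while the body is really a GIF."""
    return content_type.lower().startswith("text/html") and looks_like_gif(body)


def gif_size(body: bytes) -> tuple:
    """Read width and height (little-endian 16-bit values after the signature)."""
    return struct.unpack("<HH", body[6:10])


# First 16 bytes of the response body, copied from the hex dump.
first_row = bytes.fromhex("47 49 46 38 39 61 1E 00 1E 00 F7 FF 00 E4 F2 F9")

print(mislabelled_as_html("text/html; charset=UTF-8", first_row))
print(gif_size(first_row))
```

The size fields decode to a 30 by 30 pixel image, which fits a small icon and confirms that the body really is image data.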

Many more examples exist all over the web, some done by less reputable idiots Wink

***ProxRocks*** · Nov. 17, 2008, 07:18 PM

"hilarity ensues", lol... "their own logo", too d@mn funny...

Oddysey · Nov. 18, 2008, 02:40 AM

z12;

Yes, I think that most if not all major browsers send the accept-charset header, but they don't mean it - they'll receive whatever the server decides to send, and they'll render the page in that sent declaration, come Hell or high water. I tend to think that they send the header out of sheer stubborn adherence to a once-powerful but now-forgotten (or at least ignored) specification. It's sort of like, they keep preaching about CSS, but they all still render tables, iframes and a host of deprecated tags. That's because if they didn't, people would shy away from a product that didn't properly render a page out of The Wayback Machine.

Graycode's example speaks volumes about how holy the standards are considered. Talk about speaking with a forked tongue. Liar

Oddysey

**whenever** · Nov. 18, 2008, 03:12 AM

(Nov. 16, 2008 07:47 PM)Oddysey Wrote: That's why I did what they said to do at http://www.unicode.org, to install and configure all those extra fonts, language packs, and other stuff....

Are you using Windows 98? I am not sure about that OS, but under Windows XP a fresh installation should render UTF-8 encoded English pages well without needing extra fonts or language packs.

(Nov. 16, 2008 07:47 PM)Oddysey Wrote: they set "Content-type: charset=UTF-8", and hope for the best. But by forcing the incoming datastream to be "Content-type: charset=us-ascii", I tell the browser to ignore what eBay wants, and to do what I want instead. And it looks just peachy-keen, you can take that to the bank.

z12 explained this issue well:

z12 Wrote:Anyway, as you know, changing the charset for the Content-Type header does not change how the page was encoded.
It might change how the page is rendered, for better or worse.

(Nov. 16, 2008 07:47 PM)Oddysey Wrote: I can't speak for other browsers on the market, but in IE, you don't get to easily command what charset will be used. You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration.

What version of IE are you using? I just checked IE6 & IE7 and the "auto-select" feature could be turned on or off.
[Image: attachment.php?aid=193]

Oddysey · Nov. 18, 2008, 08:08 PM

whenever;

Yes, Mike (z12) did explain what happens in regards to rendering versus encoding. But the fact of the matter is, when all is said and done, it doesn't matter squat how the page was encoded by the author (or page generator application), it only matters how my browser reacts to that encoding, and thus, how it renders on my screen - not on the author's screen.

I thought I mentioned this, but perhaps I didn't..... I switched over (finally!) to XP just this last August. Prior to that I ran a W98SE box that was sorely pressed to resemble anything approved by Microsoft when it left the factory. Big Teeth

My XPSP2 installation is almost pristine, the only thing I've done that has had any adverse consequences is to move the UserShellFolders to my data partition. (Done so that I could do an easy backup and/or restore.) That's actually caused me some grief in the Temporary Internet Files, but other than that, all seems to be according to Hoyle/Microsoft. Which is why I'm so perplexed about this whole UTF-8 thing...... Sad

As for turning on or off the Charset Auto-Detect feature, you're deluding yourself. (Sorry, but there it is....) Go ahead, switch it back and forth, and watch what happens as you reload a site/page. Be sure to turn on Proxo's Log screen so you can see the results each time you refresh the page after the switch. If your system is like mine, you'll note what z12 noted - if the page is sent with a Content-type declaration, that will override the Auto-Detect feature. Said feature is meant to work only when there is no declaration at all, from the server.

All of which is illustrated by yet another "nugget" I found this morning. Go here:

http://www.dailyrotation.com

It's an RSS feed aggregator, I use it daily for my compu-news fix. Wink

They don't send a charset declaration, so my browser (IE6SP1) auto-detects that it should render the page in Western European (Windows). That's not so bad, nearly everything shows correctly. But.......

Strictly speaking, the page wasn't encoded in a vacuum, it must have had some charset applied, right? Only there's no corresponding declaration, so the browser is left to its own devices, and sure enough, there are some question marks scattered around the place (albeit not nearly so intrusively as that eBay fiasco). So, what I've done is two things: One, I've let the forced-declaration header remain active - the charset=us-ascii thing isn't doing any harm or good, so why monkey with it? And two, I've added a generic (for me, that's rare!) web filter that looks for question marks in pairs, and assumes they are meant to be quote marks.... Voilà, the page appears just fine. (Going back and testing the web filter alone without the header is no better/no worse, so again, I'll let things lay as they are.)

Now, I'm not saying that the problem is solved, at least not permanently. But for now, I can live with it. A very few oddities are not gonna bake my cookies, so to speak, it's only when they spew out all over the page that I get upset. Until I see that happening again, I'm gonna leave well enough alone.

But I do thank all of you for your input and help!!

Oddysey

**whenever** · Nov. 19, 2008, 02:59 AM

Thanks for the URL. I checked the page source and the question mark was indeed there, but I think it's there just because of some flaw in their content system and has nothing to do with page encoding or charset. So, I agree with your solution of having a web filter correct that, but I don't think the header filter to force the charset is necessary.

Oddysey Wrote:the charset=us-ascii thing isn't doing any harm or good

It doesn't have to be there if it isn't doing any harm or good, right? Smile!

In fact I think it is doing harm when the page encoding is not us-ascii. It will interfere with the browser selecting the proper encoding. Of course, you will probably not notice that when you spend most of your time viewing English pages.

Oddysey · Nov. 19, 2008, 07:02 AM

whenever;

Sometimes I view a page that was written in English by a person whose mother tongue is not English. In his/her country, they are probably going to be using a charset that's either native to them, or perhaps some form of unicode. You're correct to guess that on such pages my forced 'us-ascii' header will fail miserably, but if I need/want to see the page, I can disable the filter and refresh the page.

It's useful much more often than not, so it stays enabled. And if I hit a page written with charset=unihan, I'll just Babelfish it! Big Teeth

Oddysey

lnminente · Nov. 19, 2008, 12:37 PM

Oddysey, if you plan to disable your filter often, I suggest adding a $KEYCHECK to disable the filter and a $LOG to see when your filter is working. If you want the log window to appear automatically, then add $LOG(!
Wink

Oddysey · Nov. 21, 2008, 07:35 AM

(Nov. 19, 2008 12:37 PM)lnminente Wrote: Oddysey, if you plan to disable your filter often, I suggest adding a $KEYCHECK to disable the filter and a $LOG to see when your filter is working. If you want the log window to appear automatically, then add $LOG(!

Thanks, In, that's good to know.

Oddysey

lnminente · Nov. 21, 2008, 09:29 PM

Glad to help you Mr. Oddysey Wink

There's an example of their use here: http://prxbx.com/forums/showthread.php?tid=1137