A simple reminder?
Sep. 21, 2008, 02:29 PM
Post: #16
RE: A simple reminder?
I don't think it makes sense to force a UTF-8 encoded page to be rendered as ISO-8859-1. A page should be rendered in the encoding it was originally written in.
For English speakers there may be no big difference, because ASCII characters are encoded identically in ISO-8859-1 and UTF-8: one character takes one byte. For double-byte characters like Chinese, however, one character takes 2 bytes in GB2312 but 3 bytes in UTF-8. That's why English pages that contain only ASCII may render well in both ISO-8859-1 and UTF-8, but for Chinese pages you have to select the encoding the page was originally written in.
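To make the byte counts above concrete, here is a minimal Python sketch. It is only an illustration: the helper name `byte_length` and the sample characters are assumptions chosen for the example, not anything from the thread.

```python
def byte_length(ch: str, encoding: str) -> int:
    """Number of bytes a single character occupies in the given encoding."""
    return len(ch.encode(encoding))

letter = "A"        # plain ASCII letter
hanzi = "\u4e2d"    # a common Chinese character (illustrative choice)

for ch in (letter, hanzi):
    for enc in ("iso-8859-1", "gb2312", "utf-8"):
        try:
            # !a prints an ASCII-safe representation of the character
            print(f"{ch!a} in {enc}: {byte_length(ch, enc)} byte(s)")
        except UnicodeEncodeError:
            # ISO-8859-1 has no slot for Chinese characters at all
            print(f"{ch!a} cannot be represented in {enc}")
```

The ASCII letter is one byte everywhere, while the Chinese character needs two bytes in GB2312, three in UTF-8, and cannot be written in ISO-8859-1 at all.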
Nov. 16, 2008, 05:32 AM
Post: #17
RE: A simple reminder?
BUMP (or REVIVED, your choice)
I've got the definitive answer, but let me first take care of some old business:
whenever;
Quote:I don't think it makes sense to force a UTF-8 encoded page to be rendered as ISO-8859-1. A page should be rendered in the encoding it was originally written in.
Well, that would certainly fly in the face of The Proxomitron's very reason for existence - to modify the incoming data to display as you wish, not as the originator wishes. Nope, gotta disagree with you on that one, particularly since it's your very suggestion that is causing me so much aggravation (i.e. seeing question marks where they shouldn't be seen).
Now, as promised, I'll reveal how Proxo solved my problem(s). To recap, so you don't have to go back a page, I originally posted:
Quote:Many sites on the web, if not most of them, are now serving their pages with strange characters, or just as bad, with normal characters where they don't belong. The best example is a question mark being used to replace an apostrophe, or even just stuck wherever the page's author felt like putting one. (But in checking the source code, I see that sometimes a "page generator" is used, so I can't always blame the author.)
After several suggestions and some deep experimentation, I found that I was getting a rather unsatisfactory browsing experience, but that it wasn't too bad, so I'd just live with it. As you might expect, if you knew me, that didn't last long......
This morning I got fed up, and sat down to get to the bottom of it all. After much googling, I found a possible answer: we should have expected all along that it would be a header issue. Essentially, if there's no specific header instruction, then the default value is used. For character sets, it's whatever the page's server sends as a Reply header for Content-Type. So I opened the Header filters, and lo and behold, Scott had already anticipated this very scenario..... there's a default filter that, when active, removes the charset declaration.
I merely activated it, after suitable modification, like so:
Code:
In = TRUE
For our purposes, the magic takes place at the end of the Match line: charset=us-ascii. You can of course substitute whatever you wish, be it a name (like I did) or an ISO number (like I first proposed). The problem I noted with ISO-8859-1 is that it would sometimes (very rarely) falsely render the odd character or two, so I opted for the most basic charset of all. Indeed, the W3C's very own RFC-2616 (Header Specifications) actually says 'use of this charset is encouraged throughout the web'. Doubtless, they're aiming for a world-wide acceptance, and recognition of, one alphabet.
BTW, there is a header that's called, wondrously enough, "accept-charset". That was abandoned early on (in web-years), and is fully ignored by all current browsers and servers. Ah, but if only they had employed hindsight back then, solving problems like this now would be so much easier.....
One other thing..... http://www.unicode.org recommends that you configure your browser and OS to install all possible language packs (character/font sets), before anything else. I did that, and as you might have guessed, that didn't do the trick for me. Ah well, what's another 10 or 20 megs of fonts lying unused on my boot drive, eh?
Problem solved, and shared with my buds!
HTH
Oddysey
I'm no longer in the rat race - the rats won't have me!
Nov. 16, 2008, 11:18 AM
Post: #18
RE: A simple reminder?
Quote:Well, that would certainly fly in the face of The Proxomitron's very reason for existence - to modify the incoming data to present as you wish, not as the originator wishes.
I really love Proxomitron's ability to modify data on the fly, but that doesn't mean it is good at encoding conversion or binary filtering (I know we have a dirty trick, [%xx]). For example, I don't think it can convert a Chinese character from 3-byte UTF-8 to 2-byte GB2312.
Oddysey, you didn't get what I wanted to say in the previous post. As to your problem, I think it is caused by your browser not selecting the proper encoding for the page. If you view a UTF-8 page as ISO-8859-1, most characters will render well, but characters that exist in UTF-8 and not in ISO-8859-1 will not render correctly and may turn into question marks.
So, if you encounter a page with strange question marks, would you please check the page source to find the charset setting and manually set the browser to use that encoding? If that still doesn't help, would you please post the URL so that I can have a look?
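The effect described above can be reproduced outside a browser. The following Python sketch is an illustration under assumptions: `view_as` is a hypothetical helper standing in for "the browser was told to use this encoding", and the sample text is made up. Real browsers apply their own fallbacks, so the exact glyphs on screen may differ.

```python
def view_as(raw: bytes, assumed_encoding: str) -> str:
    """Decode bytes the way a viewer would if told to use assumed_encoding."""
    return raw.decode(assumed_encoding, errors="replace")

# Sample text with a curly apostrophe and curly quotes (none of them are ASCII).
page_text = "It\u2019s \u201cquoted\u201d"
raw = page_text.encode("utf-8")   # the bytes a UTF-8 page would carry

print(ascii(view_as(raw, "utf-8")))       # round-trips to the original text
print(ascii(view_as(raw, "iso-8859-1")))  # each curly mark turns into three wrong characters
print(ascii(view_as(raw, "ascii")))       # each non-ASCII byte becomes U+FFFD

# Literal question marks can also be baked in before the page is ever sent,
# for example by a lossy conversion to ASCII on the publishing side:
print(page_text.encode("ascii", errors="replace"))
```

The last line shows one way a page can end up with real `?` characters in its source, in which case no choice of encoding in the browser will bring the original punctuation back.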
Nov. 16, 2008, 07:47 PM
Post: #19
RE: A simple reminder?
whenever;
Your explanation in Post 18 of your statement in Post 16 was exactly what I had thought you meant - I did get your intended meaning. Proxo doesn't do any charset conversions at all, it only sets a variable (a known header) to some value that the user wants. In this case, you're correct, my browser is not properly rendering the incoming datastream in the charset desired by the page's author. That's why I did what they said to do at http://www.unicode.org, to install and configure all those extra fonts, language packs, and other stuff.... and why I was disappointed (and not a little disgusted) at the results. I mean, this is an almost pristine installation, I've not really taken the time to modify IE with BHOs, registry borks, and other assorted mayhem. It should all go according to Hoyle, but it's not.
That said, there are at least two considerations here, and possibly more. One is that I don't view sites that use Chinese alphabets. Yes, you were just using that as an example, but I also mean that as an example - I don't go surfing around the web to sites in other languages, I stick to those I can read without metaphysical intervention. Thus, it's all right for ME to use a charset that you consider restrictive, but note that I'm not advocating that all other web surfers do the same. What I've done here is merely present a solution to a given problem, in the hopes that other Proxo users might see a way to adapt my solution to their needs.... that was all, I swear.
The other consideration in my mind at the moment is the validity of using UTF-8 in the first place. It sounds like a good idea, but like most such, I can poke some rather uncomfortable holes in it. But that's not why we're here, so I'll leave off of that, and get on with the rest of it.....
You asked for an example of how UTF-8 encoded pages appear in my browser. Well, when I visit eBay, I get sent a page that eBay generated in UTF-8, and it looks like this:
![Image: ebay-ugly.gif](http://i173.photobucket.com/albums/w51/fnulnu/ebay-ugly.gif)
That's a result of eBay's attempt to appeal to a broad audience, instead of making sure that their pages don't go all sideways on at least a few of their viewers - they set "Content-type: charset=UTF-8", and hope for the best. But by forcing the incoming datastream to be "Content-type: charset=us-ascii", I tell the browser to ignore what eBay wants, and to do what I want instead. And it looks just peachy-keen, you can take that to the bank.
I can't speak for other browsers on the market, but in IE, you don't get to easily command what charset will be used. You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration. That's pretty user-hostile, if you ask me. Thus, we resort to munging the incoming datastream with Proxo, which is why we worship at the altar of Scott, each and every day.
HTH
Oddysey
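For readers who want to see what "forcing the charset" amounts to at the header level, here is a rough Python sketch. It is not Proxomitron filter syntax and not the filter from the post; `force_charset` is a hypothetical helper, and the full header value `text/html; charset=UTF-8` is an assumed example of what a server might send.

```python
import re

# Matches an existing "; charset=..." parameter, quoted or unquoted.
_CHARSET_PARAM = re.compile(r";\s*charset\s*=\s*\"?[^\";\s]*\"?", re.IGNORECASE)

def force_charset(content_type: str, charset: str) -> str:
    """Return the Content-Type value with its charset parameter replaced or added."""
    without_charset = _CHARSET_PARAM.sub("", content_type).strip()
    return f"{without_charset}; charset={charset}"

print(force_charset("text/html; charset=UTF-8", "us-ascii"))
print(force_charset("text/html", "us-ascii"))
```

Only the label changes here. The bytes of the page body are left exactly as the server sent them, so the result on screen depends entirely on how the browser interprets those bytes under the new label.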
Nov. 16, 2008, 09:40 PM
Post: #20
RE: A simple reminder?
Oddysey Wrote:You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration. That's pretty user-hostile, if you ask me.
From: http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2
Quote:To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
The 800lb gorilla in the room is browser heuristics. While the following link refers to HTML 5, it's still an interesting read, if you're into technical stuff: http://www.whatwg.org/specs/web-apps/cur...r-encoding
Anyway, as you know, changing the charset for the Content-Type header does not change how the page was encoded. It might change how the page is rendered, for better or worse.
Occasionally, I see some oddball question marks out of context. I've also seen quite a few pages where the meta charset is different from the header charset. Up to now, I haven't really paid attention, but I can't help but wonder if the two are related.
To resolve this, I'm thinking about stripping the charset from the Content-Type header. I figure I can always re-inject the charset via a meta tag if it's not already present in the page header. If the browsers follow the rules above, this should fix it if the server isn't sending the right charset. I'm not sure about the pros & cons of injecting a meta charset if both are missing.
My main concern with this is IE related. I've seen IE6 bork my injected js. This happens, sometimes, when a meta charset element is encountered after my header js is injected. If IE decides to re-parse the document based on the new charset, I get an error and my js is hosed. I think I've resolved the issue, but my filter needs more testing with IE.
whenever Wrote:So, if you encounter a page with strange question mark, would you please check the page source to find the charset setting and manually set the browser to use that encoding? If that still doesn't help, would you please post the url so that I can have a look?
I'd be interested in seeing links to pages like that also.
Oddysey, FYI, Firefox sends the "accept-charset" header.
z12
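The priority rule and the "strip the header charset" idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any browser is implemented: as I read HTML 4.01 section 5.2.2, the HTTP `charset` parameter comes first and a META declaration second (the spec's third item, a `charset` attribute on a linking element, is left out here). The function name `pick_charset`, the regexes, and the `windows-1252` fallback standing in for browser heuristics are all hypothetical.

```python
import re
from typing import Optional

HEADER_CHARSET = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

def pick_charset(content_type: Optional[str], body: bytes,
                 fallback: str = "windows-1252") -> str:
    """Pick an encoding: HTTP header first, then META, then a stand-in fallback."""
    if content_type:
        header_match = HEADER_CHARSET.search(content_type)
        if header_match:
            return header_match.group(1).lower()
    # Only look near the top of the document, where META declarations live.
    meta_match = META_CHARSET.search(body[:1024])
    if meta_match:
        return meta_match.group(1).decode("ascii").lower()
    return fallback  # assumption: real browsers use heuristics and user settings here

page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'

print(pick_charset("text/html; charset=ISO-8859-1", page))  # header wins over META
print(pick_charset("text/html", page))                      # header charset stripped: META is used
print(pick_charset(None, b"<html></html>"))                 # nothing declared: fallback
```

Under this model, removing the charset from the Content-Type header lets a correct META declaration take over, which is the behaviour the proposed filter relies on.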
Nov. 17, 2008, 06:28 PM
Post: #21
RE: A simple reminder?
(Nov. 16, 2008 09:40 PM)z12 Wrote: The 800lb gorilla in the room is browser heuristics.
As long as idiots keep producing stupid headers then the 800lb gorillas and 799lb proxies will have to stay alert.
Code:
www.mcafee.com/us/images/buy_links_icon.gif
Many more examples exist all over the web, some done by less reputable idiots
Nov. 17, 2008, 07:18 PM
Post: #22
RE: A simple reminder?
"hilarity ensues", lol... "their own logo", too d@mn funny...
Nov. 18, 2008, 02:40 AM
Post: #23
RE: A simple reminder?
z12;
Yes, I think that most if not all major browsers send the accept-charset header, but they don't mean it - they'll receive whatever the server decides to send, and they'll render the page in that sent declaration, come Hell or high water. I tend to think that they send the header out of sheer stubborn adherence to a once-powerful but now-forgotten (or at least ignored) specification. It's sort of like, they keep preaching about CSS, but they all still render tables, iframes and a host of deprecated tags. That's because if they didn't, people would shy away from a product that didn't properly render a page out of The Wayback Machine.
Graycode's example speaks volumes about how holy the standards are considered. Talk about speaking with a forked tongue.
Oddysey
Nov. 18, 2008, 03:12 AM
Post: #24
RE: A simple reminder?
(Nov. 16, 2008 07:47 PM)Oddysey Wrote: That's why I did what they said to do at http://www.unicode.org, to install and configure all those extra fonts, language packs, and other stuff....
Are you using Windows 98? I am not sure about that OS, but under Windows XP a fresh installation should render a UTF-8 encoded English page well without needing extra fonts or language packs.
(Nov. 16, 2008 07:47 PM)Oddysey Wrote: they set "Content-type: charset=UTF-8", and hope for the best. But by forcing the incoming datastream to be "Content-type: charset=us-ascii", I tell the browser to ignore what eBay wants, and to do what I want instead. And it looks just peachy-keen, you can take that to the bank.
z12 explained this issue well:
z12 Wrote:Anyway, as you know, changing the charset for the Content-Type header does not change how the page was encoded.
(Nov. 16, 2008 07:47 PM)Oddysey Wrote: I can't speak for other browsers on the market, but in IE, you don't get to easily command what charset will be used. You can't turn on or off the "auto-detect" feature, no matter what they say, either choice is automatically over-ridden by any "Content-type" declaration.
What version of IE are you using? I just checked IE6 & IE7 and the "auto-select" feature could be turned on or off.
Nov. 18, 2008, 08:08 PM
Post: #25
RE: A simple reminder?
whenever;
Yes, Mike (z12) did explain what happens in regards to rendering versus encoding. But the fact of the matter is, when all is said and done, it doesn't matter squat how the page was encoded by the author (or page generator application), it only matters how my browser reacts to that encoding, and thus, how it renders on my screen - not on the author's screen.
I thought I mentioned this, but perhaps I didn't..... I switched over (finally!) to XP just this last August. Prior to that I ran a W98SE box that was sorely pressed to resemble anything approved by Microsoft when it left the factory. My XPSP2 installation is almost pristine, the only thing I've done that has had any adverse consequences is to move the UserShellFolders to my data partition. (Done so that I could do an easy backup and/or restore.) That's actually caused me some grief in the Temporary Internet Files, but other than that, all seems to be according to Hoyle/Microsoft. Which is why I'm so perplexed about this whole UTF-8 thing......
As for turning on or off the Charset Auto-Detect feature, you're deluding yourself. (Sorry, but there it is....) Go ahead, switch it back and forth, and watch what happens as you reload a site/page. Be sure to turn on Proxo's Log screen so you can see the results each time you refresh the page after the switch. If your system is like mine, you'll note what z12 noted - if the page is sent with a Content-type declaration, that will override the Auto-Detect feature. Said feature is meant to work only when there is no declaration at all from the server.
All of which is illustrated by yet another "nugget" I found this morning. Go here: http://www.dailyrotation.com
It's an RSS feed aggregator, I use it daily for my compu-news fix. They don't send a charset declaration, so my browser (IE6SP1) auto-detects that it should render the page in Western Europe Windows. That's not so bad, nearly everything shows correctly. But.......
Strictly speaking, the page wasn't encoded in a vacuum, it must have had some charset applied, right? Only there's no corresponding declaration, so the browser is left to its own devices, and sure enough, there are some question marks scattered around the place (albeit not nearly so intrusively as that eBay fiasco).
So, what I've done is two things: One, I've let the forced-declaration header remain active - the charset=us-ascii thing isn't doing any harm or good, so why monkey with it? And two, I've added a generic (for me, that's rare!) web filter that looks for question marks in pairs, and assumes they are meant to be quote marks.... Voilà, the page appears just fine. (Going back and testing the web filter alone without the header is no better/no worse, so again, I'll let things lie as they are.)
Now, I'm not saying that the problem is solved, at least not permanently. But for now, I can live with it. A very few oddities are not gonna bake my cookies, so to speak, it's only when they spew out all over the page that I get upset. Until I see that happening again, I'm gonna leave well enough alone.
But I do thank all of you for your input and help!!
Oddysey
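The actual web filter was not posted, so the following Python sketch only illustrates the general idea of treating a pair of question marks around a word or phrase as stand-ins for quote marks. The function name `fix_quote_marks` and the regular expression are assumptions made for this example; a different reading of "question marks in pairs" would need a different pattern.

```python
import re

# An opening "?" must start a word (preceded by whitespace or start of text);
# the closing "?" must end one (followed by whitespace, punctuation, or end of text).
_PAIRED_MARKS = re.compile(r"(?<!\S)\?(\S(?:[^?\n]*\S)?)\?(?=[\s.,;:!)]|$)")

def fix_quote_marks(text: str) -> str:
    """Replace ?phrase? with "phrase", leaving ordinary question marks alone."""
    return _PAIRED_MARKS.sub(r'"\1"', text)

print(fix_quote_marks("He said ?hello there? and left. Really?"))
print(fix_quote_marks("What? Why?"))
```

A guess like this is bound to misfire now and then, for instance when an ordinary question sits between two damaged marks, so it suits the "good enough for my own browsing" spirit of the post rather than general use.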
Nov. 19, 2008, 02:59 AM
Post: #26
RE: A simple reminder?
Thanks for the URL. I checked the page source and the question mark was indeed there, but I think it's there just because of some flaw in their content system and has nothing to do with page encoding or charset. So, I agree with your solution of having a web filter correct that, but I don't think the header filter to force the charset is necessary.
Oddysey Wrote:the charset=us-ascii thing isn't doing any harm or good
It doesn't have to be there if it isn't doing any harm or good, right?
In fact I think it is doing harm when the page encoding is not us-ascii. It will interfere with the browser selecting the proper encoding. Of course you will probably not notice that when you spend most of your time viewing English pages.
Nov. 19, 2008, 07:02 AM
Post: #27
RE: A simple reminder?
whenever;
Sometimes I view a page that was written in English by a person whose mother tongue is not English. In his/her country, they are probably going to be using a charset that's either native to them, or perhaps some form of unicode. You're correct to guess that on such pages my forced 'us-ascii' header will fail miserably, but if I need/want to see the page, I can disable the filter and refresh the page. It's useful much more often than not, so it stays enabled.
And if I hit a page written with charset=unihan, I'll just Babelfish it!
Oddysey
Nov. 19, 2008, 12:37 PM
Post: #28
RE: A simple reminder?
Oddysey, if you plan to disable your filter often, I suggest adding a $KEYCHECK to disable the filter and a $LOG to see when your filter is working. If you want the log window to appear automatically, then add $LOG(!
Nov. 21, 2008, 07:35 AM
Post: #29
RE: A simple reminder?
(Nov. 19, 2008 12:37 PM)lnminente Wrote: Oddysey, if you plan to disable your filter often, I suggest adding a $KEYCHECK to disable the filter and a $LOG to see when your filter is working. If you want the log window to appear automatically, then add $LOG(!
Thanks, In, that's good to know.
Oddysey
Nov. 21, 2008, 09:29 PM
Post: #30
RE: A simple reminder?
Glad to help you, Mr. Oddysey.
An example of their use is here: http://prxbx.com/forums/showthread.php?tid=1137