Converting non-Latin characters to UTF-8, as Proxomitron can't read Unicode UTF-16
Jun. 12, 2009, 01:00 AM
(This post was last modified: Jun. 12, 2009 05:27 PM by Graycode.)
Post: #6
RE: Proxomitron cannot read Unicode UTF-16 Hebrew
Be warned: I don't view international sites, and I'm not even a Proxo user. But I am pretty good at HTTP and at guessing about things.

See the first attached TXT file named '_Raw_Bytes', which shows the raw byte dump of the page's headers and data. Ignore the "===" lines; they were put there by my own proxy.

You were right earlier about it being UTF-16 encoded as Little Endian. It even starts with the BOM (Byte-Order-Mark) for that: the first 2 bytes, FF FE, are the 16-bit Little Endian BOM.

I'm guessing that if you did a View-Source of the Proxo result of your page, the HTML portions would appear correct but the text shown to the user would be corrupted.

As I understand it, Proxo has some rudimentary filter capability for UTF detection and conversion. However, I don't think the UTF capability is "real"; it may be just a filter mechanism. I think that once it detects UTF it may assume Latin or ASCII, in that it may look at every other byte for a binary zero and wipe it out. That semi-UTF conversion (if in fact that's what it's doing) works fine on the many pages that are UTF-encoded but probably didn't need to be.

When you get into 'real' UTF, things may break down quickly. In the attached illustration that starts to happen at offset hex 30, when the paired bytes no longer have 00. That begins a region of data (not HTML tags) that seems to be a form of Unicode within the transport's UTF encoding.

Perhaps more robust UTF decoding would look like the 2nd attached TXT file. That one represents the decoding of the raw transport's encoding, and within that you can see the Unicode data portions. I'm not sure how browsers are able to differentiate the character sets of the HTML vs. the Unicode data without a DOCTYPE or other indicator.

If my assumptions are true, then hopefully someone with more Proxo knowledge will know of a resolution for you. My limited-knowledge suggestion would be to bypass that page from Proxo filtering or other transport modification: try to leave the stream as-is if possible, and let the browsers sort it out per each viewer's international settings.
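To make the difference concrete, here is a rough Python sketch of the two behaviours. It is not taken from Proxo's source: `strip_zero_high_bytes` is only my guess at the shortcut described above, and the function names and the sample string are made up for illustration. The second function does what a real decoder would do: check the BOM, decode the 16-bit units, and re-encode as UTF-8.

```python
import codecs


def strip_zero_high_bytes(raw: bytes) -> bytes:
    """Guessed shortcut: skip the FF FE BOM, then drop the second byte
    of each pair whenever it is a binary zero."""
    body = raw[2:] if raw.startswith(codecs.BOM_UTF16_LE) else raw
    out = bytearray()
    for i in range(0, len(body) - 1, 2):
        low, high = body[i], body[i + 1]
        out.append(low)
        if high != 0:
            # Non-Latin text lands here: both bytes survive as junk.
            out.append(high)
    return bytes(out)


def decode_to_utf8(raw: bytes) -> bytes:
    """Decode according to the BOM and re-encode as UTF-8.
    Data without a UTF-16 BOM is passed through untouched."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[2:].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[2:].decode("utf-16-be")
    else:
        return raw
    return text.encode("utf-8")


# Hypothetical sample: an HTML tag around a short Hebrew word.
sample = codecs.BOM_UTF16_LE + "<b>\u05e9\u05dc\u05d5\u05dd</b>".encode("utf-16-le")

print(strip_zero_high_bytes(sample))  # tags readable, Hebrew bytes are not valid text
print(decode_to_utf8(sample))         # valid UTF-8 for the whole string
```

With the shortcut, the tags come out as plain ASCII while the Hebrew bytes pass through as raw 16-bit pairs, which matches the "HTML looks fine, text is corrupted" symptom I guessed at above.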