Help in filtering Cyrillic letters
Jan. 10, 2006, 01:46 PM
Post: #1
One of my favorite sites has a nagvertisement (asking for money) which is posted in Cyrillic letters (Russian) and English.
I've unsuccessfully attempted to filter this section out; any tips appreciated. Below, the * marks where a half-page-long advertisement in Cyrillic letters starts:
$NEST(<table width="100%"> <tr> <td class="block_title"><center> <b><font color="blue">,*, </tr> </table> </center> <center> <table width="100%"> <tr> <td class="block_title"><center>)
Jan. 10, 2006, 06:41 PM
Post: #2
My experience with Proxo is that you can't enter any Cyrillic text into the filters, as the characters become question marks like ???????
But I recognize you did not do this and instead tried to use a $NEST to exclude the block, which apparently didn't work. Sorry, I don't know either and have wondered the same thing.
Jan. 10, 2006, 07:06 PM
Post: #3
The $NEST doesn't look correct.
Why not post a URL? If it is a question of content, maybe you could upload the complete HTML of the ad somewhere? Kinda hard for us to help from what has been posted.
Jan. 10, 2006, 08:14 PM
Post: #4
The HTML section looks like this, but of course the Cyrillic posts as question marks.
The goal (I think) would be to allow everything else to show and just kill the pieces that make up the unwanted section. There is no way to filter based on a Cyrillic word or word combination, so the $NEST should work, but how? Take the section below as a "generic sample": how could it be made to filter anything that comes after the <td class="content"> where the Cyrillic text starts, and then to stop at the </td> where the Cyrillic text stops?
<table width="100%"> <tr> <td class="block_title"><center> <b><font color="blue">????? ??? ??????? ???????? !</font></b></center></td> </tr> </table>
<table width="100%"> <tr> <td class="content"> ??????????? ??? ????, ????????? ???????. [about 50 more lines of ???? deleted from this post for brevity] </td> </tr> </table> </center> <center>
Jan. 10, 2006, 10:05 PM
Post: #5
Is it all Cyrillic after <td class="content">? It would be much easier if there were some code or A-Z letter sequence in there.
The matching expression could then be something like:
$NEST(<td class="content">,*TEXT*,</td>)
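For anyone who wants to try the idea outside Proxomitron first, here is a rough Python sketch of the same approach: drop a content cell only when it contains some ASCII marker. It uses an ordinary regular expression, not Proxomitron's matching language, and unlike $NEST it does not handle nested cells. The sample HTML and the marker TEXT are made up for illustration.

```python
import re

# Sample modelled on the HTML posted above; "TEXT" is a stand-in for any
# ASCII marker that reliably appears inside the unwanted cell (assumption).
SAMPLE = (
    '<table width="100%"> <tr> <td class="content"> ??? TEXT ??? </td> </tr> </table>'
    '<table width="100%"> <tr> <td class="content"> keep me </td> </tr> </table>'
)

# Opening tag, then anything that does not cross a </td>, then the marker,
# then everything up to the first closing </td>.
CELL = re.compile(
    r'<td class="content">(?:(?!</td>).)*?TEXT.*?</td>',
    re.DOTALL,
)

def strip_marked_cells(html: str) -> str:
    """Remove every content cell that contains the marker."""
    return CELL.sub("", html)

cleaned = strip_marked_cells(SAMPLE)
print(cleaned)
```

The cell without the marker is left alone, which is the point of anchoring on something inside the ad rather than on the tags alone.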
Jan. 11, 2006, 05:17 AM
Post: #6
Guest Wrote: there is no way to filter based on a cyrillic word or word combination
I wouldn't say that. (I hope the forum code doesn't screw this up.) Consider the word название, which I found on http://www.google.com/search?q=proxomitron.ru
Quote: Vote and review this url название:
I then wrote:
Code: [Patterns]
The escape filter 'escaped' название to
Code: %D0%BD%D0%B0%D0%B7%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5
I then wrote:
Code: [Patterns]
This filter removes the first occurrence of the word from the Google page.
Guest Wrote: how could it be made to filter anything that comes after the
That could be as simple as:
Code: [Patterns]
Just increase the Limit until the cell is removed.
Now, some things to consider:
Each letter of название is more than one byte (two bytes each in UTF-8, so 16 for the whole word). It could be coded as & #1085;& #1072;& #1079;& #1074;& #1072;& #1085;& #1080;& #1077; minus the spaces or maybe ... So you'll need a Limit of at least 16 to match it.
I try to use $NEST only when there is or may be a nest. More at http://www.proxomitron.info/45/help/Matc....html#NEST and http://www.proxomitron.info/45/help/Matc...html#INEST
HTH
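The encodings mentioned in the post above can be double-checked with a few lines of Python. This is only an illustration of the byte counts and escaped forms, not the escape filter from the post; it assumes the page is served as UTF-8.

```python
from urllib.parse import quote

word = "название"

# Raw bytes a byte-oriented filter would see on a UTF-8 page.
utf8_bytes = word.encode("utf-8")

# Percent-escaped form of those UTF-8 bytes.
percent_escaped = quote(word)

# HTML decimal character references, one per letter.
numeric_entities = "".join(f"&#{ord(ch)};" for ch in word)

print(len(word), "letters,", len(utf8_bytes), "bytes")
print(percent_escaped)
print(numeric_entities)
```

If the site sends the word as numeric entities instead of raw UTF-8, the matched text is longer than 16 bytes, so the Limit would need to be raised accordingly.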