Matching text that's NOT in a tag
Nov. 04, 2004, 10:04 AM
Post: #1
 
Hope I didn't miss a FAQ or so...

I'm searching the text of an HTML page for items that are defined in a list file. If I find them I want them to appear bold. Basically it's something like
Matching: ($LST(Items))\1
Replacement: <b>\1</b>

Unfortunately elements of tags -- e.g. parts of the URL of a link -- might also match. And putting <b> and </b> into that link makes it somewhat dysfunctional :-). So I would like the replacement to work only on the plain text of the web page but not inside tags. Any ideas?

Thanks,
Wolfram
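
To make the failure concrete, here is a minimal Python sketch (not Proxomitron syntax) of the naive replacement described above. The list entry and the URL are made up for illustration only.

```python
import re

# Hypothetical list entry and page fragment, for illustration only.
items = ["1234-5678"]
html = '<a href="http://example.com/part?id=1234-5678">1234-5678</a>'

# Naive approach: wrap every occurrence in <b>...</b>, wherever it appears.
pattern = re.compile("|".join(re.escape(item) for item in items))
naive = pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", html)

# The href attribute now contains <b>...</b>, so the link is broken.
print(naive)
```

Only the second occurrence (the link text) should have been wrapped; the one inside the href value should have been left alone.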
Nov. 05, 2004, 07:16 AM
Post: #2
 
Hi Wolfram,

Hmm...that seems a bit more difficult than I thought at first glance.

I suppose that you would want to make sure that you don't match anything inside certain tags, such as the script or style tags. Also, matching anything before the body tag wouldn't be desirable either.

I'm not sure about embedding a bold tag inside other tags such as the anchor, strong or option tags.

To make sure you don't match inside certain tags, you could add something like the following near the bottom of your web filters, after your regular filters:

Code:
[Patterns]

Name = "Match First <body>"
Active = TRUE
URL = "$IHDR(Content-Type: (*html*))"
Limit = 1024
Match = "(^(^<body[^<>]+>))"
Replace = "$SET(bBODY=1)$STOP()"

Name = "Tag Skipper"
Active = TRUE
URL = "$IHDR(Content-Type: (*html*))"
Limit = 8192
Match = "("
"<style[^<>]+>*</style>|"
"<script[^<>]+>*</script>|"
"</+[a-z]+{1,*}[^<>]+>"
")\0"
Replace = "\0"

Name = "Word HighLiter"
Active = TRUE
URL = "$IHDR(Content-Type: (*html*))"
Limit = 40
Match = "\s($LST(Words))\0(^(^\s|.|,))$TST(bBODY=1)"
Replace = "\s<b>\0</b>"

I haven't tried this; I didn't feel like creating a new list to test it. :-)
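
For readers who want to see the "tag skipper" idea outside Proxomitron, here is a rough Python sketch of the same logic: consume style blocks, script blocks and ordinary tags first, and only then try the word list. It is an illustration under assumptions, not a translation of the filters above; the word list and the sample page are hypothetical, and the "first body tag" flag is left out.

```python
import re

# Hypothetical word list, for illustration only.
words = ["proxy", "filter"]

# Alternatives are tried left to right, so style blocks, script blocks and
# ordinary tags are consumed whole before the word alternative gets a chance.
pattern = re.compile(
    r"<style\b[^>]*>.*?</style>"
    r"|<script\b[^>]*>.*?</script>"
    r"|<[^>]*>"
    r"|\b(" + "|".join(re.escape(w) for w in words) + r")\b",
    re.IGNORECASE | re.DOTALL,
)

def highlight(html):
    # Group 1 is set only when the word alternative matched; tags, script
    # blocks and style blocks are passed through unchanged.
    return pattern.sub(
        lambda m: "<b>" + m.group(1) + "</b>" if m.group(1) else m.group(0),
        html,
    )

page = '<a href="/filter/list">filter list</a><script>var proxy = 1;</script> proxy'
print(highlight(page))
```

The word inside the href and the one inside the script block are skipped, while the two occurrences in plain text are wrapped.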

Mike

PS: Welcome to the forum.
Nov. 05, 2004, 08:00 AM
Post: #3
 
Quote:I'm searching the text of a HTML page for items that are defined in a list file. If I find them I want them to appear bold.
You could include spaces around the words. That way, you would avoid tags and words within words. For example, if you wanted to match the word "tell", the word "constellation" would match as well, unless you included spaces before and after "tell".
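
A small Python sketch of that suggestion, using the "tell" example; the sample sentence is made up. Note that requiring a space on both sides would also miss a word that is followed by punctuation or sits directly between two tags.

```python
import re

word = "tell"
text = "Please tell me about the constellation."

# A plain substring search also hits the "tell" inside "constellation".
plain_hits = [m.start() for m in re.finditer(re.escape(word), text)]

# Requiring whitespace on both sides leaves only the standalone word.
spaced_hits = [
    m.start(1) for m in re.finditer(r"\s(" + re.escape(word) + r")\s", text)
]

print(len(plain_hits), len(spaced_hits))
```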
Nov. 05, 2004, 02:18 PM
Post: #4
 
Hi z12,

your proposal gave me some ideas of my own. I'm currently testing these:
(Note: The list I'm testing against contains part numbers of the form 1234-5678 -- i.e. four digits, a dash, another four digits)

Code:
Name = "Inside a Tag - Start"
Active = TRUE
Multi = TRUE
URL = "OnlyCertainURLs/*"
Limit = 256
Match = "<"
Replace = "<$SET(InsideTag=1)"

Name = "Inside a Tag - Stop"
Active = TRUE
Multi = TRUE
URL = "OnlyCertainURLs/*"
Limit = 256
Match = ">"
Replace = ">$SET(InsideTag=0)"

Name = "Parts-List"
Active = TRUE
URL = "OnlyCertainURLs/*"
Bounds = "[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Limit = 256
Match = "($LST(Parts))\1$TST(InsideTag=0)"
Replace = "<b>\1</b>"

My main concern is performance: since the first two filters match so often, speed goes down considerably if I have the log window open. Without the log window it's not so obvious. Fortunately I can limit the range of relevant URLs fairly well.
Also, the "Bounds" may be important for my performance because the list is really long: about 4000 entries.
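
The logic of these three filters can be sketched in Python as a single scan that keeps an "inside tag" flag. This is only an illustration of the idea, not how Proxomitron evaluates filters; the part list is hypothetical, and the cheap shape check before the list lookup stands in for the role "Bounds" plays.

```python
import re

# Hypothetical part list; the real one has about 4000 entries.
parts = {"1234-5678", "1234-6789"}

# Cheap shape check first (four digits, a dash, four digits), then the lookup.
candidate = re.compile(r"\d{4}-\d{4}")

def bold_parts(html):
    out = []
    inside_tag = False
    i = 0
    while i < len(html):
        ch = html[i]
        if ch == "<":
            inside_tag = True       # like "Inside a Tag - Start"
        elif ch == ">":
            inside_tag = False      # like "Inside a Tag - Stop"
        elif not inside_tag:
            m = candidate.match(html, i)
            if m and m.group(0) in parts:
                out.append("<b>" + m.group(0) + "</b>")
                i = m.end()
                continue
        out.append(ch)
        i += 1
    return "".join(out)

sample = '<a href="http://example.com/x.pl?partnumber=1234-5678">1234-5678</a>'
print(bold_parts(sample))
```

Using a set for the list keeps the lookup cheap even with thousands of entries, because only text that already has the right shape is checked against it.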

Comments? Improvements?

Wolfram
Nov. 05, 2004, 03:09 PM
Post: #5
 
Hi Wolfram,

Yeah, my tag skipper would have filled the log window too.

Since you have a specific pattern to match, maybe we can try something a little different to avoid checking if we're in a tag.

Most tag attributes don't contain space characters. If the part numbers you're trying to match have a leading space, that would be good. Also, some sort of trailing delimiter would be really helpful.

How about something like this:

Code:
Name = "Parts-List"
Active = TRUE
URL = "OnlyCertainURLs/*"
Bounds = "(\s)\0([0-9]+{4}-[0-9]+{4})\1(^(^[^<>"'a-z0-9.-_]))"
Limit = 256
Match = "\s($LST(Parts))\1"
Replace = "\0<b>\1</b>"

You might have to play around a bit with the leading and trailing delimiters:

Code:
leading delimiter = \s
trailing delimiter = [^<>"'a-z0-9.-_]

The good thing about this is it wouldn't fill the log window or use $TST, which is a bit slow.

HTH
Mike

edit: forgot limit
Nov. 05, 2004, 04:35 PM
Post: #6
 
Hi z12,

I tried things like that but it ain't easy :-). Look at:

Code:
<a target=_top HREF="http://somesite.com/perl/somescript.pl?partnumber=1234-5678">
<font face = 'arial, helvetica' size="-1">1234-5678</font></a>

The second occurrence of "1234-5678" should be replaced (if found in the list); the first one must not be replaced. I've also seen the part number appearing as the first thing after a CR-LF...

Edit:
I might have a chance to identify most of the currently used delimiters by analyzing many pages currently containing these partnumbers but I would prefer a general solution.
Nov. 05, 2004, 08:46 PM
Post: #7
 
Hi Wolfram,

Well, I modified the delimiters a bit; the biggest change is to the leading delimiter. It now looks for the tag close character followed by no more than 6 non-alphanumeric characters before looking for the part number.

Code:
Name = "Parts-List"
Active = TRUE
URL = "OnlyCertainURLs/*"
Limit = 256
Bounds = "(>[^<="'a-z0-9.-_]+{0,6})([0-9]+{4}-[0-9]+{4})\1(^(^[^>="'a-z0-9.-_]))"
Match = "(>[^<="'a-z0-9.-_]+{0,6})\0($LST(Parts))\1"
Replace = "\0<b>\1</b>"

I'm running out of ideas.

Mike
Nov. 08, 2004, 01:36 PM
Post: #8
 
First of all a big thanks for all the ideas you provided. Many of them are reusable elsewhere and I might not have had them on my own -- e.g. the double negation "(^(^" for non-consuming test.
One big insight is that the problem is indeed non-trivial and I have not missed an existing command or a simple solution already available.

By the way: your last idea is unfortunately defeated by pages with simple formatting for export to Excel:

Code:
<b>Save this screen as a .txt file using the web browser
then import file into Excel (delimiter = ",")</b>

<b>Your search returned 270 results:</b>
1234-3456,3,00  ,3 ,0,0,0,0.85
1234-4567,3,00  ,? ,0,0,0,0
1234-5678,4V,00  ,4V,0,0,0,0.4814
1234-6789,4V,00  ,4V,0,0,0,3.0248
...

Thanks again,
Wolfram
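
To show why the export page defeats the delimiter idea, here is a loose Python approximation of it: a tag close, up to six non-alphanumeric characters, then the part number. It is not the exact Proxomitron pattern from the earlier post (the character classes are my own reading of it), so treat it as an assumption-laden sketch.

```python
import re

# Approximation of the "close of a tag, then at most six filler characters"
# delimiter idea; not the exact Proxomitron expression.
after_tag = re.compile(
    r""">[^<="'A-Za-z0-9._-]{0,6}(\d{4}-\d{4})(?![A-Za-z0-9._-])"""
)

link_text = """<font face='arial, helvetica' size="-1">1234-5678</font></a>"""
export_rows = (
    "<b>Your search returned 270 results:</b>\n"
    "1234-3456,3,00  ,3 ,0,0,0,0.85\n"
    "1234-4567,3,00  ,? ,0,0,0,0\n"
)

found_in_link = after_tag.findall(link_text)
found_in_export = after_tag.findall(export_rows)

print(found_in_link)
print(found_in_export)
```

In the export sample only the first row sits close enough to a closing tag to be found; every later row follows plain text and is missed.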
Nov. 09, 2004, 03:01 PM
Post: #9
 
Hi Wolfram,

I was hoping someone else might jump in with a solution.

Unfortunately, without consistent leading/trailing delimiters to match on, I can't think of any generic filter that would do the job.

It looks like you'll need site specific filters to do what you need.

Mike
Nov. 12, 2004, 08:41 AM
Post: #10
 
Mike and Wolf;

Come on you guys, you both have the right idea, but you're trying to clear the forest because it's blocking your view of the trees. <_<

I know, bold words from someone who's been silent until now. Well, ordinarily, I'm the know-it-all answer demon, so this time I thought I'd sit back and see what came up. You two have taken a good stab at it, but I can see that you need a "little" direction. ;)

In a nutshell, the Match string can't do an "on-the-fly" If-Then-Else test. Sadly, that's what would be needed in a single-pass system. Thinking like that, I realized that there are programming languages that are also bereft of that logic branching construct. In those languages, you must use a two-pass system to get the job done. And that's when the lightbulb burned out... er, turned on. We can certainly do that here: put in a two-pass system, I mean.

This is where you two got the horse after the cart. You were so wrapped up in trying to prevent the links from being bolded in the first place, that you overlooked the simplest solution - merely bold everything (that matches) on pass #1, and remove the bold tags from any links on pass #2.

And there you have it. The actual implementation of such a filter set is an exercise best left to the student. And don't forget to turn Multi on. B)
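
As a rough illustration of the two-pass idea in Python (not as Proxomitron filters): pass 1 bolds every match, pass 2 strips the bold tags again wherever they ended up inside a quoted attribute value. The part list, the sample link and the assumption that attribute values are always quoted are mine, not from the thread.

```python
import re

parts = ["1234-5678"]  # hypothetical list entry
part_re = re.compile("|".join(re.escape(p) for p in parts))

def pass_one(html):
    # Pass 1: bold every match, including those inside attribute values.
    return part_re.sub(lambda m: "<b>" + m.group(0) + "</b>", html)

# Assumption: attribute values are quoted and contain no quote characters.
quoted_value = re.compile(r"""=\s*("[^"]*"|'[^']*')""")

def pass_two(html):
    # Pass 2: remove the bold tags again wherever they landed in a quoted value.
    return quoted_value.sub(
        lambda m: m.group(0).replace("<b>", "").replace("</b>", ""),
        html,
    )

sample = '<a href="http://example.com/x.pl?partnumber=1234-5678">1234-5678</a>'
print(pass_two(pass_one(sample)))
```

The sketch would miss unquoted attribute values, and it would also touch anything in the page text that happens to look like a quoted value after an equals sign, so it is a starting point rather than a general solution.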


Oddysey

I'm no longer in the rat race - the rats won't have me!
Nov. 12, 2004, 12:38 PM
Post: #11
 
Hi Oddeysey,

I wondered where you were. :-)

It seems to me there's still a bounds problem with that scheme.

First, you have to make a match before you can replace it.

If the list matches inside a tag, that would require the second filter to "look backwards" to see if your first filter matched inside a tag.

Since you can't do that, it would seem that you have to ensure you're not in a tag before you match.

But perhaps I'm still missing the obvious, it wouldn't be the first time.

As for not being able to do If...Then...Else, you might want to check out "Proxomitron filters tutorial #2" which can be downloaded from here:

http://computercops.biz/downloads-cat-19.html

Mike
Nov. 12, 2004, 05:40 PM
Post: #12
 
Mike;

Far be it from me to ever put down anything said by altosax - he's in the same league as sidki3003, JD5000, our own Kye-U..... a rare strata, indeed.

However, his tutorial (which I had not seen, BTW, so thanks for the tip) approached the problem in a slightly different manner. He reduced his example problems to the If-Then-ElseIf construct, which is a different beast altogether from If-Then-Else. The difference between the two constructs is this: ElseIf lends itself well to OR operations; a bare Else does not. The former introduces a new test; the latter does not. For the Else statement to execute, the test has already taken place, so if Else is executing, it's an absolute, not a "maybe". When ElseIf executes, it might or might not produce a "True", which would go on to the Then statement.

Did all that make sense?

That being said, you, as a budding programmer yourself, should never lose sight of the fact that we all do things differently. With the possible exception of the ubiquitous "Hello World" program, there are as many ways to do any one thing as there are programmers. B)

For that reason, I chose to stay with what I know works in the Proxo world, the two-pass system. In prior versions, Scott told us that he couldn't guarantee the loading order of each filter in a config set. In 4.5, he was able to fix that little discrepancy, and the results are more than satisfactory for many of us. :P
[/ lecture mode]

To address your questions directly, let us suppose that Filter #1 is set up to bold everything that matches on the targeted list. Now, if everything went according to Hoyle, we should have a page that bolds every part number. Of course, the links are no longer operable, which is why Wolfram came to us for an answer in the first place. My contention is, we've solved one problem, but created another. Why not let Filter #2 solve the second problem? In point of fact, Filter #2 only needs to bounds check for <a*</a>, right? Simply set the match to pull out the <b> and </b> tags, and we're home free. If we tell the filter to check every <a .... tag on the page, there's no harm done, is there?

And as for "looking backwards", well, I'd say that's exactly what Multi=True does. You'll recall that Multi allows something to be looked at again (and again, and....) The danger there, of course, is that you could introduce an inordinate amount of processing time to complete the filtering actions. As Scott said, Multi is not a toy, it should always be used with adult supervision. [rolleyes]

As for missing the obvious, why do you think you were "wondering where I was"? I too had to think about it for awhile, only my molasses container ("brain pan", for the allegory-challenged) had to do a hard-reset before I could come up with an acceptable answer. I suppose I could apply that much vaunted system of logic that people keep accusing me of possessing, and see if I couldn't do the job in a single-pass system with the OR (or even the AND) construct, but I don't see the need. I'm not lazy, I just don't like to re-solve a problem, once I've solved it satisfactorily. Hope you understand. <_<


Oddysey

Nov. 12, 2004, 09:58 PM
Post: #13
 
Hi Oddeysey,

Well, the concern is with all tags, not just anchor tags.

I have to disagree that multi allows going backwards. It allows the same match to be checked again by other filters.

To allow the 2nd filter to match, then the first filter would have to match back at the start of the tag.

But since we're trying to avoid matching inside a tag.....

hmm

Mike
Nov. 13, 2004, 04:44 AM
Post: #14
 
Mike;
Quote:Well, the concern is with all tags, not just anchor tags.
Wolfram's first post was indeed generic in his reference to tags. However, he did use the anchor as an example. Furthermore, his later posts refer to the anchor tag exclusively, hence I have made an arbitrary design decision.

Now if an image tag were to use the same text in the name of its image to display, then we'll have to use the OR operator in a regexp to allow either <a or <img to match. Other than that, I can't think of anywhere else the part number would be used within the HTML code, can you?

Quote:I have to disagree that multi allows going backwards. It allows the same match to be checked again by other filters.
Well, I took a bit of literary license there, didn't I? What I meant is that if we can re-examine a piece of text with a filter to be named in the future, then I'd equate that to having gone back in time. So to speak. Too far out, eh? Oh well, back to the drawing board. <mumbles to self on way back to laboratory....>

Quote:To allow the 2nd filter to match, then the first filter would have to match back at the start of the tag.
Why? They are two different scopes, with correspondingly different Bounds. I think that the narrower scope is permissible within the broader scope, with Multi turned on. I suppose I could be wrong, though.

Hmmmmm.....
Let's break the desired action down to a logic puzzle, shall we? After much headscratching, I came up with this final result:

If (^within a tag && match on $LST) then
.... bold the text

If we are within a tag, then the first condition fails, and we move on smartly. If we are not within a tag, and we find the subject text, then we proceed to bold it. If we reverse the terms, the test still works, but I've always liked to deal with the more likely possibilities first. It is much more likely that you're not within a tag, than otherwise.
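
Outside Proxomitron, that test is easy to express. The Python sketch below splits the page into tags and text, then applies the list lookup only to the text pieces. The part list and the sample are hypothetical, and the sketch assumes that no attribute value contains a ">" character.

```python
import re

parts = {"1234-5678"}  # hypothetical list entries
tag_re = re.compile(r"(<[^>]*>)")
part_re = re.compile(r"\d{4}-\d{4}")

def bold_outside_tags(html):
    # re.split with a capturing group keeps the tags in the result:
    # even indexes hold plain text, odd indexes hold tags.
    pieces = tag_re.split(html)
    for i in range(0, len(pieces), 2):
        # "not within a tag" is true for every even piece, so only the
        # "match on the list" half of the test is left to check here.
        pieces[i] = part_re.sub(
            lambda m: "<b>" + m.group(0) + "</b>"
            if m.group(0) in parts
            else m.group(0),
            pieces[i],
        )
    return "".join(pieces)

sample = '<a href="x.pl?partnumber=1234-5678">1234-5678</a>\n1234-5678,4V,00'
print(bold_outside_tags(sample))
```

Because no delimiters around the part number are needed, the comma-separated export rows are handled the same way as the link text.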

Now that we've defined the Holy Grail, can it be done? I would assume so, but again, this is for minds that can better tie themselves up in knots while building regexp's. I'm fresh outta headroom, sorry!

And, I think I can safely retract my diatribe about how a two-filter setup is necessary. I think this is doable, don't you?


Oddysey

Nov. 13, 2004, 09:46 AM
Post: #15
 
Hi Oddeysey,

Oddeysey Wrote:I think that the narrower scope is permissible within the broader scope, with Multi turned on.
I agree with that.

Oddeysey Wrote:If (^within a tag && match on $LST) then
.... bold the text
I agree with this also, however, I don't think you can reverse this test. Doing so would reverse the scope.

Oddeysey Wrote:And, I think I can safely retract my diatribe about how a two-filter setup is necessary.
Looking back at my first post, I had 3 filters! :-)

Oddeysey Wrote:I think this is doable, don't you?
Yes I do. It just might take more than 1 "non-proxo-guru" to do it.



One more quote...
Oddeysey Wrote:If (^within a tag && match on $LST) then
.... bold the text
That's it. Since matching the $LST is easy, the main problem is determining when you're inside a tag.

I was looking over JD's config set the other day, and I saw something in there that gives me an idea. I need to take a closer look at it.

Mike