Matching text that's NOT in a tag
Nov. 04, 2004, 10:04 AM
Post: #1
Hope I didn't miss a FAQ or so...
I'm searching the text of an HTML page for items that are defined in a list file. If I find them, I want them to appear bold. Basically it's something like:
Matching: ($LST(Items))\1
Replacement: <b>\1</b>
Unfortunately, elements of tags (e.g. parts of the URL of a link) might also match, and putting <b> and </b> into that link makes it somewhat dysfunctional :-). So I would like the replacement to work only on the plain text of the web page, not inside tags.
Any ideas?
Thanks,
Wolfram
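To make the problem concrete, here is a minimal sketch in Python rather than Proxomitron filter syntax. The list contents, the sample HTML, and the function name are made up for illustration; the point is only that a naive "bold every list item" replacement also wraps the match inside the href.

```python
import re

# Hypothetical stand-in for the list file behind $LST(Items).
ITEMS = ["1234-5678", "2345-6789"]

# One alternation over all list entries, escaped so they match literally.
ITEM_RE = re.compile("|".join(re.escape(item) for item in ITEMS))

def bold_naively(html: str) -> str:
    """Wrap every list match in <b>...</b>, with no regard for tags."""
    return ITEM_RE.sub(lambda m: f"<b>{m.group(0)}</b>", html)

# Made-up sample: the same item appears in the URL and in the link text.
sample = '<a href="show.pl?partnumber=1234-5678">1234-5678</a>'
result = bold_naively(sample)
print(result)
# <a href="show.pl?partnumber=<b>1234-5678</b>"><b>1234-5678</b></a>
```

The link text is bolded as intended, but the URL now contains markup, which is exactly the breakage described in the post.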
|||
Nov. 05, 2004, 07:16 AM
Post: #2
Hi Wolfram,
Hmm... that seems a bit more difficult than I thought at first glance. I suppose you would want to make sure that you don't match anything inside certain tags, such as the script or style tags. Matching anything before the body tag wouldn't be desirable either. I'm not sure about embedding a bold tag inside other tags such as the anchor, strong, or option tags.
To make sure you don't match inside certain tags, you could add something like the following near the bottom of your web filters, after your regular filters:
Code:
[Patterns]
I haven't tried this; I didn't feel like creating a new list to test.
Mike
PS: Welcome to the forum.
|||
Nov. 05, 2004, 08:00 AM
Post: #3
Quote: I'm searching the text of a HTML page for items that are defined in a list file. If I find them I want them to appear bold.
You could include spaces around the words. That way, you would avoid tags and words within words. For example, if you wanted to match the word "tell", the word "constellation" would match as well, unless you included spaces before and after "tell".
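A small Python sketch of the "tell" versus "constellation" example. The sample sentence is invented, and the lookarounds are just one way to express "a space on each side"; in a list file the spaces would presumably be part of the entry itself.

```python
import re

text = "Please tell me about the constellation Orion."

# Bare word: also hits the "tell" inside "constellation".
bare = [m.start() for m in re.finditer(r"tell", text)]

# Word with a required space on each side: only the standalone "tell" matches.
spaced = [m.start() for m in re.finditer(r"(?<= )tell(?= )", text)]

# The same pattern misses a word that is followed by punctuation.
missed = re.search(r"(?<= )tell(?= )", "Do tell.")

print(len(bare), len(spaced), missed)
```

The last line hints at the cost of this approach: a word directly followed by punctuation, or directly preceded by a tag, is no longer matched.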
|||
Nov. 05, 2004, 02:18 PM
Post: #4
Hi z12,
Your proposal gave me some ideas of my own. I'm currently testing these. (Note: the list I'm testing against contains part numbers of the form 1234-5678, i.e. four digits, a dash, and another four digits.)
Code:
Name = "Inside a Tag - Start"
My main concern is performance: since the first two filters match so often, speed goes down considerably if I have the log window open. Without the log window it's not so obvious. Fortunately, I can limit the range of relevant URLs fairly well. Also, the "Bounds" may be important for performance because the list is really long: about 4000 entries.
Comments? Improvements?
Wolfram
|||
Nov. 05, 2004, 03:09 PM
Post: #5
Hi Wolfram,
Yeah, my tag skipper would have filled the log window too. Since you have a specific pattern to match, maybe we can try something a little different to avoid checking whether we're in a tag. Most tag attributes don't contain space characters, so if the part numbers you're trying to match have a leading space, that would be good. Some sort of trailing delimiter would also be really helpful. How about something like this:
Code:
Name = "Parts-List"
You might have to play around a bit with the leading and trailing delimiters:
Code:
leading delimiter = \s
The good thing about this is that it wouldn't fill the log window or use $TST, which is a bit slow.
HTH
Mike
edit: forgot limit
|||
Nov. 05, 2004, 04:35 PM
Post: #6
Hi z12,
I tried things like that, but it ain't easy :-). Look at:
Code:
<a target=_top HREF="http://somesite.com/perl/somescript.pl?partnumber=1234-5678">
The second occurrence of "1234-5678" should be replaced (if found in the list); the first one must not be replaced. I've also seen the part number appearing as the first thing after a CR-LF...
Edit: I might have a chance to identify most of the currently used delimiters by analyzing many pages that contain these part numbers, but I would prefer a general solution.
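A rough Python illustration of why a whitespace leading delimiter does not help with this link. The opening tag is taken from the post; the link text and closing tag are an assumption added here so that there is a second occurrence to test against.

```python
import re

# Opening tag from the post, completed with an ASSUMED link text and </a>.
html = ('<a target=_top HREF="http://somesite.com/perl/somescript.pl'
        '?partnumber=1234-5678">1234-5678</a>')

# No delimiter at all: both occurrences match.
plain = re.compile(r"\d{4}-\d{4}")

# Leading delimiter = one whitespace character, along the lines of \s.
space_led = re.compile(r"(?<=\s)\d{4}-\d{4}")

plain_hits = [m.start() for m in plain.finditer(html)]
space_hits = [m.start() for m in space_led.finditer(html)]

print(len(plain_hits), len(space_hits))
```

Under that assumption the URL occurrence is preceded by "=" and the visible one by ">", so the whitespace-led pattern skips the one in the tag but also misses the one that should be bolded.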
|||
Nov. 05, 2004, 08:46 PM
Post: #7
Hi Wolfram,
Well, I modified the delimiters a bit; the biggest change is to the leading delimiter. It now looks for the tag close character followed by not more than 6 non-alphanumeric characters before looking for the part number.
Code:
Name = "Parts-List"
I'm running out of ideas.
Mike
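The leading delimiter described here can be approximated in Python as follows. This is only a sketch of the idea, not the filter itself; the pattern, function name, and sample strings are assumptions.

```python
import re

# ">" closing a tag, then at most 6 non-alphanumeric characters,
# then a part number of the form 1234-5678.
# Group 1 is the delimiter (kept unchanged), group 2 the part number (bolded).
AFTER_TAG = re.compile(r"(>[^A-Za-z0-9]{0,6})(\d{4}-\d{4})")

def bold_after_tag(html: str) -> str:
    return AFTER_TAG.sub(r"\1<b>\2</b>", html)

# Made-up link: the number in the URL is skipped, the link text is bolded.
link = '<a href="x.pl?partnumber=1234-5678">1234-5678</a>'
print(bold_after_tag(link))
# <a href="x.pl?partnumber=1234-5678"><b>1234-5678</b></a>

# A number that follows ordinary text is not close enough to a ">".
cell = "<td>Part 1234-5678</td>"
print(bold_after_tag(cell))
# <td>Part 1234-5678</td>
```

The second sample shows the trade-off: anchoring on the tag close keeps the match out of attributes, but any part number preceded by ordinary words is left alone.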
|||
Nov. 08, 2004, 01:36 PM
Post: #8
First of all, a big thanks for all the ideas you provided. Many of them are reusable elsewhere, and I might not have come up with them on my own, e.g. the double negation "(^(^" for a non-consuming test.
One big insight is that the problem is indeed non-trivial, and that I have not missed an existing command or a simple solution that is already available. By the way, your last idea is unfortunately defeated by pages with simple formatting for export to Excel:
Code:
<b>Save this screen as a .txt file using the web browser
Thanks again,
Wolfram
|||
Nov. 09, 2004, 03:01 PM
Post: #9
Hi Wolfram,
I was hoping someone else might jump in with a solution. Unfortunately, without consistent leading/trailing delimiters to match on, I can't think of any generic filter that would do the job. It looks like you'll need site-specific filters to do what you need.
Mike
|||
Nov. 12, 2004, 08:41 AM
Post: #10
Mike and Wolf;
Come on, you guys, you both have the right idea, but you're trying to clear the forest because it's blocking your view of the trees. <_<
I know, bold words from someone who's been silent until now. Well, ordinarily I'm the know-it-all answer demon, so this time I thought I'd sit back and see what came up. You two have taken a good stab at it, but I can see that you need a "little" direction.
In a nutshell, the Match string can't do an "on-the-fly" If-Then-Else test. Sadly, that's what would be needed in a single-pass system. Thinking like that, I realized that there are programming languages that are also bereft of that logic-branching construct. In those languages, you must use a two-pass system to get the job done. And that's when the lightbulb went on.
This is where you two got the horse after the cart. You were so wrapped up in trying to prevent the links from being bolded in the first place that you overlooked the simplest solution: merely bold everything (that matches) on pass #1, and remove the bold tags from any links on pass #2.
And there you have it. The actual implementation of such a filter set is an exercise best left to the student. And don't forget to turn Multi on. B)
Oddysey
I'm no longer in the rat race - the rats won't have me!
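The two-pass idea can be sketched in Python like this. It runs two complete passes over an in-memory string, which may not map directly onto how Proxomitron applies filters to a stream; the patterns and names below are assumptions, and pass two is written to cover every tag rather than only anchors.

```python
import re

# Assumed part-number shape: four digits, a dash, four digits.
PART_RE = re.compile(r"\d{4}-\d{4}")

# A tag other than <b> or </b> itself, possibly with <b>/</b> injected into it.
TAG_WITH_BOLD_RE = re.compile(r"<(?!/?b>)[^<>]*(?:</?b>[^<>]*)*>")
BOLD_RE = re.compile(r"</?b>")

def pass_one(html: str) -> str:
    """Pass 1: bold every match, including the ones inside tags."""
    return PART_RE.sub(lambda m: f"<b>{m.group(0)}</b>", html)

def pass_two(html: str) -> str:
    """Pass 2: strip the bold markup that ended up inside a tag."""
    return TAG_WITH_BOLD_RE.sub(lambda m: BOLD_RE.sub("", m.group(0)), html)

page = '<a href="x.pl?partnumber=1234-5678">1234-5678</a>'
after_one = pass_one(page)
after_two = pass_two(after_one)
print(after_one)
# <a href="x.pl?partnumber=<b>1234-5678</b>"><b>1234-5678</b></a>
print(after_two)
# <a href="x.pl?partnumber=1234-5678"><b>1234-5678</b></a>
```

Note that pass two has to recognize a tag that now contains ">" characters of its own, which is why its pattern is more involved than a plain "<...>" match. It would also strip bold markup that a page had, for whatever reason, already placed inside a tag.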
|||
Nov. 12, 2004, 12:38 PM
Post: #11
Hi Oddeysey,
I wondered where you were.
It seems to me there's still a bounds problem with that scheme. First, you have to make a match before you can replace it. If the list matches inside a tag, the second filter would have to "look backwards" to see whether your first filter matched inside a tag. Since you can't do that, it would seem that you have to ensure you're not in a tag before you match. But perhaps I'm still missing the obvious; it wouldn't be the first time.
As for not being able to do If...Then...Else, you might want to check out "Proxomitron filters tutorial #2", which can be downloaded from here: http://computercops.biz/downloads-cat-19.html
Mike
|||
Nov. 12, 2004, 05:40 PM
Post: #12
Mike;
Far be it from me to ever put down anything said by altosax - he's in the same league as sidki3003, JD5000, our own Kye-U..... a rare stratum, indeed. However, his tutorial (which I had not seen, BTW, so thanks for the tip) approached the problem in a slightly different manner. He reduced his example problems to the If-Then-ElseIf construct, which is a different beast altogether from If-Then-Else.
The difference between the two constructs is this: the ElseIf lends itself well to OR operations; the mere Else does not. The former introduces a new element of test; the latter does not. For the Else statement to execute, the test has already taken place, so if Else is executing, it's an absolute, not a "maybe". When ElseIf executes, it might or might not produce a "True", which would go on to the Then statement. Did all that make sense?
That being said, you, as a budding programmer yourself, should never lose sight of the fact that we all do things differently. With the possible exception of the ubiquitous "Hello World" program, there are as many ways to do any one thing as there are programmers. B)
For that reason, I chose to stay with what I know works in the Proxo world, the two-pass system. In prior versions, Scott told us that he couldn't guarantee the loading order of each filter in a config set. In 4.5, he was able to fix that little discrepancy, and the results are more than satisfactory for many of us. :P
[/ lecture mode]
To address your questions directly, let us suppose that Filter #1 is set up to bold everything that matches on the targeted list. Now, if everything went according to Hoyle, we should have a page that bolds every part number. Of course, the links are no longer operable, which is why Wolfram came to us for an answer in the first place. My contention is that we've solved one problem but created another. Why not let Filter #2 solve the second problem?
In point of fact, Filter #2 only needs to bounds check for <a*[/url], right? Simply set the match to pull out the <b> and </b> tags, and we're home free. If we tell the filter to check every <a .... tag on the page, there's no harm done, is there?
And as for "looking backwards", well, I'd say that's exactly what Multi=True does. You'll recall that Multi allows something to be looked at again (and again, and....). The danger there, of course, is that you could introduce an inordinate amount of processing time to complete the filtering actions. As Scott said, Multi is not a toy; it should always be used with adult supervision. [rolleyes]
As for missing the obvious, why do you think you were "wondering where I was"? I too had to think about it for a while, only my molasses container ("brain pan", for the allegory-challenged) had to do a hard reset before I could come up with an acceptable answer.
I suppose I could apply that much-vaunted system of logic that people keep accusing me of possessing, and see if I couldn't do the job in a single-pass system with the OR (or even the AND) construct, but I don't see the need. I'm not lazy; I just don't like to re-solve a problem once I've solved it satisfactorily. Hope you understand. <_<
Oddysey
|||
Nov. 12, 2004, 09:58 PM
Post: #13
Hi Oddeysey,
Well, the concern is with all tags, not just anchor tags.
I have to disagree that Multi allows going backwards. It allows the same match to be checked again by other filters. To allow the second filter to match, the first filter would have to match back at the start of the tag. But since we're trying to avoid matching inside a tag..... hmm
Mike
|||
Nov. 13, 2004, 04:44 AM
Post: #14
Mike;
Quote: Well, the concern is with all tags, not just anchor tags.
Wolfram's first post was indeed generic in its reference to tags. However, he did use the anchor as an example. Furthermore, his later posts refer to the anchor tag exclusively, hence I have made an arbitrary design decision. Now, if an image tag were to use the same text in the name of its image to display, then we'd have to use the OR operator in a regexp to allow either <a or <img to match. Other than that, I can't think of any place where the part number would be used within the HTML code, can you?
Quote: I have to disagree that multi allows going backwards. It allows the same match to be checked again by other filters.
Well, I took a bit of literary license there, didn't I? What I meant is that if we can re-examine a piece of text with a filter to be named in the future, then I'd equate that to having gone back in time. So to speak. Too far out, eh? Oh well, back to the drawing board. <mumbles to self on way back to laboratory....>
Quote: To allow the 2nd filter to match, then the first filter would have to match back at the start of the tag.
Why? They are two different scopes, with correspondingly different Bounds. I think that the narrower scope is permissible within the broader scope, with Multi turned on. I suppose I could be wrong, though. Hmmmmm.....
Let's break the desired action down to a logic puzzle, shall we? After much headscratching, I came up with this final result:
If (^within a tag && match on $LST) then .... bold the text
If we are within a tag, then the first condition fails, and we move on smartly. If we are not within a tag, and we find the subject text, then we proceed to bold it. If we reverse the terms, the test still works, but I've always liked to deal with the more likely possibilities first. It is much more likely that you're not within a tag than otherwise.
Now that we've defined the Holy Grail, can it be done?
I would assume so, but again, this is for minds that can better tie themselves up in knots while building regexps. I'm fresh outta headroom, sorry! And I think I can safely retract my diatribe about how a two-filter setup is necessary. I think this is doable, don't you?
Oddysey
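For comparison, the "not within a tag AND on the list" test can be sketched as a single pass in Python. This is not a Proxomitron filter, and whether the same trick carries over to its matching language is left open here; the list contents and names are hypothetical.

```python
import re

# Hypothetical list contents standing in for $LST.
PARTS = {"1234-5678"}

# Alternative 1 consumes a whole tag; alternative 2 matches a candidate number.
# Because a tag is always consumed in one piece, alternative 2 can only ever
# match text that lies outside a tag.
TOKEN_RE = re.compile(r"<[^>]*>|\d{4}-\d{4}")

def bold_outside_tags(html: str) -> str:
    def repl(m: re.Match) -> str:
        token = m.group(0)
        if token.startswith("<"):   # within a tag: leave it untouched
            return token
        if token in PARTS:          # outside a tag and on the list: bold it
            return f"<b>{token}</b>"
        return token                # outside a tag but not on the list
    return TOKEN_RE.sub(repl, html)

page = '<a href="x.pl?partnumber=1234-5678">1234-5678</a> and 9999-0000'
print(bold_outside_tags(page))
# <a href="x.pl?partnumber=1234-5678"><b>1234-5678</b></a> and 9999-0000
```

The sketch treats everything between tags as plain text, so the contents of script or style elements would still be touched; handling those would need extra alternatives in the pattern.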
|||
Nov. 13, 2004, 09:46 AM
Post: #15
Hi Oddeysey,
Oddeysey Wrote: I think that the narrower scope is permissible within the broader scope, with Multi turned on.
I agree with that.
Oddeysey Wrote: If (^within a tag && match on $LST) then
I agree with this also; however, I don't think you can reverse this test. Doing so would reverse the scope.
Oddeysey Wrote: And, I think I can safely retract my diatribe about how a two-filter setup is necessary.
Looking back at my first post, I had 3 filters!
Oddeysey Wrote: I think this is doable, don't you?
Yes, I do. It just might take more than one "non-proxo-guru" to do it.
One more quote...
Oddeysey Wrote: If (^within a tag && match on $LST) then
That's it. Since matching the $LST is easy, the main problem is determining when you're inside a tag. I was looking over JD's config set the other day, and I saw something in there that gives me an idea. I need to take a closer look at it.
Mike
|||