PHP help? match "between-the-dots" hostname patterns
May. 27, 2009, 12:30 AM
(This post was last modified: May. 27, 2009 12:45 AM by xartica.)
Post: #1
Background:
I use Proxomitron to proxy overall HTTP traffic, and run a local DNS proxy app, DNSKong ( http://www.pyrenean.com ). Each time I merge "domain blacklists" provided by various sources (malwaredomains.com, forum.hosts-file.net, emergingthreats.net) into a blocklist for DNSKong, the combined list contains LOTS of overlapping / redundant patterns. EXACT matching lines are easily removed using TextPad32 (Sort; remove duplicate lines), but the logic necessary for coding a script which is able to match "between the dots" has me stumped.

Within the supplied lists, each given pattern might represent "an entire domain" (bfast.com), a "complete" hostname (ads.360.yahoo.com), a partial hostname (tracker.ebay), or just a non-dotted label pattern (teensxxx).

For the sake of example, suppose the existing blocklist already contains:
bfast.com
ads.360.yahoo.com
myyahoohoo
tracker.ebay
teensxxx

When comparing a new pattern against those already in the blocklist, the script would discard the to-be-merged entry
teensxxx.smegerator.com
because it's redundant, i.e. it would already be matched by 'teensxxx'. If the to-be-merged list contains an entry
my.yahoo.com
the script WOULD accept this as a new, non-redundant pattern ('my.yahoo' is not a between-the-dots match with 'myyahoohoo').

As a given, the input has been sanitized elsewhere and exists in "one blocklist pattern per line" format. Initially, the input is the content from the existing list (which definitely already contains some redundant patterns, and I've grown weary of trying to weed them out manually).

Here's the gist of what I wound up coding so far. The PHP script works... but even though it doesn't echo anything to the page until matching has finished, it is TERRIBLY slow. During testing, I had to override the max_execution_time setting in php.ini to avoid a script timeout when handling only 5k patterns.
trial execution times:
500 items = 1.7 sec
1K items = 6.9 sec
2K items = 36 sec
4K items = 136 sec
5K items = timeout (1800+ sec)

Code:
$mystringin = trim($_POST['mytextareacontent']);

remaining issues:
-- Speed. The nested loops compare each pattern against every other pattern, so the number of iterations grows roughly with the square of the list length (on the order of length * (length-1)), which is consistent with the timings above. I'll need to modify the script so that it processes the patterns in batches.
-- Apparently the underscore character is valid within third-level domain labels. My reliance on the \b regex word-boundary assertion does not accommodate this.
-- Even though I expect the input fed to the script will already have been sanitized... as is, the script won't properly handle any spaces and/or blank lines contained within the input.

Thanks in advance for any suggestions you can provide for altering the code to speed up the execution time.
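
One possible way around the pairwise comparison, sketched here under stated assumptions rather than taken from the original script: split each pattern into its dot-separated labels, process the patterns with the fewest labels first, and look up every contiguous run of labels in a hash of already-kept patterns. A pattern is dropped as soon as one of its runs is already kept. Because hostnames have only a handful of labels, this needs a few lookups per pattern instead of a comparison against every other pattern. The function name dedupe_between_dots is hypothetical, the code assumes PHP 7 or later, it treats patterns as case-insensitive, and it assumes no pattern starts or ends with a dot.

```php
<?php
// Hypothetical helper (not the original poster's code): returns the patterns
// that are not already covered "between the dots" by a shorter or equal one.
function dedupe_between_dots(array $patterns): array
{
    // Normalise: strip surrounding whitespace (including "\r"), lowercase,
    // and skip blank lines.
    $clean = [];
    foreach ($patterns as $p) {
        $p = strtolower(trim($p));
        if ($p !== '') {
            $clean[] = $p;
        }
    }

    // Fewest labels first: only a pattern with fewer or equally many labels
    // can cover another one, so every possible "coverer" is seen earlier.
    usort($clean, function (string $a, string $b): int {
        return substr_count($a, '.') <=> substr_count($b, '.');
    });

    $kept = []; // pattern => true, used as a hash set
    foreach ($clean as $pattern) {
        $labels = explode('.', $pattern);
        $count = count($labels);
        $redundant = false;

        // Try every contiguous run of labels, e.g. for a.b.c:
        // a, a.b, a.b.c, b, b.c, c
        for ($start = 0; $start < $count && !$redundant; $start++) {
            for ($len = 1; $start + $len <= $count; $len++) {
                $run = implode('.', array_slice($labels, $start, $len));
                if (isset($kept[$run])) {
                    $redundant = true;
                    break;
                }
            }
        }

        if (!$redundant) {
            $kept[$pattern] = true;
        }
    }

    // PHP turns purely numeric keys into integers; convert back to strings.
    return array_map('strval', array_keys($kept));
}
```

With the form field from the script above, the input could be fed in as dedupe_between_dots(explode("\n", $mystringin)). Since whole labels are compared instead of relying on \b, underscores and hyphens inside a label need no special handling, and blank lines or stray spaces are dropped during normalisation. The order of the returned patterns follows the sort by label count, not the original list order.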
May. 27, 2009, 12:57 AM
Post: #2
RE: PHP help? match "between-the-dots" hostname patterns
Not PHP, but related: if you would prefer to block the requests to these sites, you might be interested in reading some code I wrote here: http://prxbx.com/forums/showthread.php?tid=1277
May. 27, 2010, 11:22 PM
Post: #3
RE: PHP help? match "between-the-dots" hostname patterns
(a year later) quick followup to mention that I found the unix diff command to be the right tool for the job of merging the lists
http://www.jasspa.com/me/m3mac035.html