PHP help? match "between-the-dots" hostname patterns
May. 27, 2009, 12:30 AM (This post was last modified: May. 27, 2009 12:45 AM by xartica.)
Post: #1
Background:
I use Proxomitron to proxy overall HTTP traffic and run a local DNS proxy app, DNSKong (http://www.pyrenean.com). Each time I merge "domain blacklists" provided by various sources (malwaredomains.com, forum.hosts-file.net, emergingthreats.net) into a blocklist for DNSKong, the combined list contains LOTS of overlapping / redundant patterns.

EXACT matching lines are easily removed using TextPad32 (Sort;remove duplicate lines) but the logic necessary for coding a script which is able to match "between the dots" has me stumped.

Within the supplied lists, each given pattern might represent "an entire domain"
(bfast.com) or a "complete" hostname (ads.360.yahoo.com) or a partial hostname
(tracker.ebay) or just a non-dotted label pattern (teensxxx).

For the sake of example, if the existing blocklist already contains:
bfast.com
ads.360.yahoo.com
myyahoohoo
tracker.ebay
teensxxx
when comparing a new pattern against those already in the blocklist,
the script would discard the to-be-merged entry
teensxxx.smegerator.com
because it's redundant, i.e. it would already be matched by 'teensxxx'.

If the to-be-merged list contains an entry
my.yahoo.com
the script WOULD accept this as a new, non-redundant pattern
('my.yahoo' is not a between-the-dots match with 'myyahoohoo')
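The rule in the two examples above can be sketched as a plain string test: wrap both strings in dots, then look for the pattern inside the candidate. This is only a sketch; the function name matches_between_dots is hypothetical, and it assumes matching should be case-insensitive and that a pattern must line up with whole labels. Note that it treats a hyphenated label as a single label, so 'tracker.ebay' would not match 'my-tracker.ebay.com', whereas a \b-based regex would.

```php
<?php
// Hypothetical helper: true when $pattern occurs in $candidate as a run of whole labels.
function matches_between_dots(string $pattern, string $candidate): bool
{
    // Strip whitespace and stray leading/trailing dots, then delimit with dots
    // so that every label has a dot on both sides.
    $p = '.' . strtolower(trim($pattern, ". \t\r\n")) . '.';
    $c = '.' . strtolower(trim($candidate, ". \t\r\n")) . '.';

    // An empty pattern would collapse to '..'; treat it as "no match".
    return $p !== '..' && strpos($c, $p) !== false;
}

var_dump(matches_between_dots('teensxxx', 'teensxxx.smegerator.com')); // redundant entry
var_dump(matches_between_dots('myyahoohoo', 'my.yahoo.com'));          // new entry
```

With the sample blocklist, the first call reports a match and the second does not, which agrees with the accept/discard decisions described above.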

As a given, the input has been sanitized elsewhere and exists in "one blocklist pattern per line" format. Initially, the input is the content from the existing list (which definitely already contains some redundant patterns, and I've grown weary of trying to weed them out manually).

Here's the gist of what I wound up coding so far.
The PHP script works... but even though it doesn't echo anything to the page until matching has finished, it is TERRIBLY slow.

During testing, I had to override the max_execution_time setting in php.ini to avoid script timeout when handling only 5k patterns.
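For a one-off long run, the limit can also be raised from inside the script instead of editing php.ini. A minimal sketch: the helper name raise_time_limit and the 600-second value are arbitrary assumptions, and some hosts disable set_time_limit() altogether.

```php
<?php
// Hypothetical helper: raise the execution limit for the current request only.
function raise_time_limit(int $seconds): bool
{
    // set_time_limit() restarts the timeout counter from zero and
    // overrides max_execution_time for this script run.
    return set_time_limit($seconds);
}

raise_time_limit(600); // 600 is an arbitrary example value
```

This only hides the symptom; the run time itself still depends on how many comparisons the script performs.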

trial execution times:
500 items = 1.7 sec
1K items = 6.9 sec
2K items = 36 sec
4K items = 136 sec
5K items = timeout (1800+ sec)
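A quick back-of-the-envelope check of these timings, assuming the script compares every pattern against every other pattern (so at most n × (n-1) regex tests for n items). The helper name comparison_count is hypothetical; the timings are copied from the list above.

```php
<?php
// Hypothetical helper: upper bound on pairwise tests for a list of $n patterns.
function comparison_count(int $n): int
{
    return $n * ($n - 1); // each pattern against every other pattern
}

// Timings copied from the trial runs above (items => seconds).
$trials = [500 => 1.7, 1000 => 6.9, 2000 => 36, 4000 => 136];

foreach ($trials as $n => $seconds) {
    $comparisons = comparison_count($n);
    // Average cost of one test, in microseconds.
    printf("%5d items: %9d tests, %.1f us per test\n",
        $n, $comparisons, $seconds / $comparisons * 1e6);
}
```

The cost per test comes out at roughly 7 to 9 microseconds in all four runs, which suggests the total time grows with the square of the list size rather than anything worse.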

Code:
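$mystringin = trim($_POST['mytextareacontent']);
$myarray = explode("\n", $mystringin);
// array_unique() keeps the original keys, so re-index before looping by position
$myarray = array_values(array_unique($myarray));

$mytemparray = $myarray;
$howmany = count($myarray);
$rejects = array();

// compare every pattern (haystack) against every other pattern (needle)
for ($i = 0; $i < $howmany; $i++) {
    for ($j = 0; $j < $howmany; $j++) {
        $haystack = trim($myarray[$i]);
        $needle = trim($myarray[$j]);

        // pass '/' to preg_quote() so the regex delimiter is escaped too
        if ($i != $j && $haystack != '' && $needle != ''
            && preg_match("/\b" . preg_quote($needle, '/') . "\b/i", $haystack) > 0)
        {
            $rejects[] = '<b>' . $haystack . '</b> obviated by <b>' . $needle . '</b><br />';
            unset($mytemparray[$i]);
            $myarray[$i] = ''; // an emptied entry can no longer match or be matched
        }
    }
}
$mytemparray = array_unique($mytemparray);
// implode() takes the separator first; the reversed order was removed in PHP 8
$myout = trim(implode("\n", $mytemparray));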

// plus additional lines to display $myout and $rejects to the page
I've attached a zipped copy of the full (3Kb) script, for reference

remaining issues:

-- Speed. The nested loops perform length × (length-1) comparisons, so the run time grows with the square of the list size. I'll need to modify the script so that it processes the patterns in batches.

-- Apparently the underscore character is valid within third-level domain labels. My reliance on the \b regex assertion does not accommodate this.

-- Even though I expect the input fed to script will already have been sanitized... as is, the script won't properly handle any spaces and/or blank lines contained within the input.
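One possible way to address all three issues at once is to avoid the pairwise regex tests: store the patterns as array keys, then for each pattern look up every contiguous run of its labels. A hostname rarely has more than a handful of labels, so the work per pattern stays small. This is a sketch under assumptions, not a drop-in replacement: the function name remove_redundant_patterns and its return shape are hypothetical, matching is done on whole dot-separated labels (so hyphens and underscores are ordinary label characters), and patterns are lowercased.

```php
<?php
/**
 * Sketch: drop every pattern that contains another pattern as a run of whole labels.
 * Returns [kept patterns, map of rejected pattern => pattern that obviates it].
 */
function remove_redundant_patterns(array $lines): array
{
    // Normalise: trim whitespace and stray dots, lowercase, skip blank lines.
    $set = [];
    foreach ($lines as $line) {
        $p = strtolower(trim($line, ". \t\r\n"));
        if ($p !== '') {
            $set[$p] = true; // array keys give de-duplication and fast lookup
        }
    }

    $kept = [];
    $rejects = [];
    foreach (array_keys($set) as $key) {
        $p = (string) $key; // purely numeric keys come back as integers
        $labels = explode('.', $p);
        $count = count($labels);
        $obviatedBy = null;

        // Try every contiguous run of labels that is shorter than the full pattern.
        for ($len = 1; $len < $count && $obviatedBy === null; $len++) {
            for ($start = 0; $start + $len <= $count; $start++) {
                $run = implode('.', array_slice($labels, $start, $len));
                if (isset($set[$run])) {
                    $obviatedBy = $run;
                    break;
                }
            }
        }

        if ($obviatedBy === null) {
            $kept[] = $p;
        } else {
            $rejects[$p] = $obviatedBy;
        }
    }

    return [$kept, $rejects];
}

// Usage with the sample entries from this post, plus a blank line.
[$kept, $rejects] = remove_redundant_patterns([
    'bfast.com', 'ads.360.yahoo.com', 'myyahoohoo', 'tracker.ebay',
    'teensxxx', 'teensxxx.smegerator.com', 'my.yahoo.com', '',
]);
echo implode("\n", $kept), "\n";
```

With the sample entries, only teensxxx.smegerator.com is rejected (obviated by teensxxx), and my.yahoo.com is kept. One consequence of the whole-label rule worth checking against the real lists: a single-label pattern such as 'com' would obviate every entry that contains that label.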

Thanks in advance for any suggestions you can provide for altering the code to speed up the execution time.


Attached File(s)
.zip  _remove_redundant_hostname_patterns.zip (Size: 1.41 KB / Downloads: 812)
May. 27, 2009, 12:57 AM
Post: #2
RE: PHP help? match "between-the-dots" hostname patterns
Not PHP, but related: if you would prefer to block the requests to these sites, you might be interested in some code I wrote here: http://prxbx.com/forums/showthread.php?tid=1277 ;)
May. 27, 2010, 11:22 PM
Post: #3
RE: PHP help? match "between-the-dots" hostname patterns
(a year later) quick followup to mention that I found the unix diff command to be the right tool for the job of merging the lists

http://www.jasspa.com/me/m3mac035.html