From: tedd on 14 Jun 2010 09:14

Hi gang:

Considering all the recent parsing, here's another problem to consider -- given any text, parse the domain-names out of it.

You may limit the parsing to the most popular TLDs, such as .com, .net, and .org, but the finished result should be an array containing all the domain-names found in a text file.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: Ashley Sheridan on 14 Jun 2010 09:18

On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to
> consider -- given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com,
> .net, and .org, but the finished result should be an array containing
> all the domain-names found in a text file.

I'm assuming it won't be anything as simple as assuming all the domains begin with the http:// prefix? :p

Thanks,
Ash
http://www.ashleysheridan.co.uk

From: tedd on 14 Jun 2010 09:23

At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
>I'm assuming it won't be anything as simple as assuming all the
>domains begin with the http:// prefix? :p
>
>Thanks,
>Ash

Ash:

Nope, just a text file containing whatever and domain-names. The only domain-name indicator would be the period followed by an approved TLD, such as .com, .net, or .org.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: Robert Cummings on 14 Jun 2010 09:57

tedd wrote:
> Nope, just a text file containing whatever and domain-names. The only
> domain-name indicator would be the period followed by an approved
> TLD, such as .com, .net, or .org.

<?php

function rip_domains( $text )
{
    // Returns false when nothing matches, otherwise an array of unique names.
    $domains = false;

    $pattern =
        '[^-[:alnum:]]*'                   // skip any leading non-name characters
       .'('
       .    '[-[:alnum:]][-.[:alnum:]]*'   // labels: letters, digits, hyphens, dots
       .    '\.(com|net|org)'              // must end in an approved TLD
       .')'
       .'[^-_[:alnum:]]*';                 // skip any trailing non-name characters

    if( preg_match_all( "#$pattern#", $text, $matches ) )
    {
        // Use the domains as array keys to drop duplicates.
        $domains = array();
        foreach( $matches[1] as $domain )
        {
            $domains[$domain] = true;
        }

        $domains = array_keys( $domains );
    }

    return $domains;
}

?>

Naive implementation. I'm sure I've missed edge cases someplace.

Cheers,
Rob.

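One edge case the pattern above appears to leave open: the leading and trailing character classes may match zero characters, so a word such as example.community can yield example.com, and fake_underscored_domain.net can yield domain.net. The following is only a sketch of one way to tighten that, assuming lookaround assertions are acceptable. The name rip_domains_strict is hypothetical and does not come from the thread, and returning an empty array instead of false is a choice made here for illustration.

```php
<?php
// Hypothetical variant of rip_domains(); not from the original post.
function rip_domains_strict( $text )
{
    // (?<![-._[:alnum:]]) : the match may not begin in the middle of a name
    // (?![-_[:alnum:]])   : the TLD may not be followed by more name characters
    $pattern = '(?<![-._[:alnum:]])'
        . '([[:alnum:]][-.[:alnum:]]*\.(?:com|net|org))'
        . '(?![-_[:alnum:]])';

    // The "i" modifier is an assumption so that upper-case names also match.
    if( !preg_match_all( "#$pattern#i", $text, $matches ) )
    {
        return array();
    }

    // Lower-case and de-duplicate, keeping first-seen order.
    return array_values( array_unique( array_map( 'strtolower', $matches[1] ) ) );
}
```

Lower-casing before de-duplication treats PHP.net and php.net as the same name, which seems to fit the "array of domain-names" request but is a judgment call.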
From: "Daniel P. Brown" on 14 Jun 2010 10:08

On Mon, Jun 14, 2010 at 09:14, tedd <tedd(a)sperling.com> wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider --
> given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net, and
> .org, but the finished result should be an array containing all the
> domain-names found in a text file.

<?php

$text =<<<TXT
To test example.com and www.php.net and other domain names such as
january.pilotpig.net and ca2.php.parasane.net, we need a reliable
method of checking. We don't want to match on regular periods, nor
on the 2.2million or 2.2 million or just 2,200,000 other potential
matches. And not when we are double-spacing or single-spacing, just
when oidk.net and similar domains are found. We'll match hyphen
domains like l-i-e.com, but not fake_underscored_domain.net. We also
want to match http://-fronted domains like http://php1.net/, which
also contains a number. If we wanted to match domains plus paths, but
there was no leading http:// to indicate that it should be a URL, we
could extend this to grab things like www.facebook.com/parasane, so
long as we don't ignore the rare one-character SLDs like x.com, as
well as the domains in email addresses like danbrown(a)php.net

So if everything works as expected, we should see eleven domains
matched here, because ccTLDs like guthr.ie should be matched as well.
TXT;

/**
 * $fromText can be defined via a file_get_contents() or
 * similar function, while $fullLink should be anything
 * but false to enable link-matching, which will return
 * only link-like domains with paths attached.
 */
function extract_domains($fromText,$fullLink=false) {

    // If we only want to match the domain names.
    if ($fullLink === false) {
        preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/',$fromText,$matches);
        return $matches[1];
    }

    // If we want to match just domain names with trailing paths.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/',$fromText,$matches);
    return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;
echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));
echo PHP_EOL;
echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));
echo "</pre>".PHP_EOL;
?>

--
</Daniel P. Brown>
daniel.brown(a)parasane.net || danbrown(a)php.net
http://www.parasane.net/ || http://www.pilotpig.net/
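
To tie this back to the original request (an array of all the domain names found in a text file), here is a minimal sketch of running a file through the domain-only pattern from extract_domains(). The helper name domains_from_file and the file name in the usage note are hypothetical. The "i" modifier, the lower-casing, and the de-duplication are assumptions added here, not part of the posted code.

```php
<?php
// Hypothetical helper; not part of the original post.
function domains_from_file( $path )
{
    // Suppress the warning and handle a missing or unreadable file ourselves.
    $text = @file_get_contents( $path );
    if ( $text === false ) {
        return array();
    }

    // Same domain-only pattern as extract_domains(), plus the "i" modifier (an assumption).
    preg_match_all( '/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/i', $text, $matches );

    // Normalise case and drop repeats so each domain appears once.
    return array_values( array_unique( array_map( 'strtolower', $matches[1] ) ) );
}
```

It could be called as $domains = domains_from_file('sample.txt'); where sample.txt stands in for whatever text file is being scanned.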