From: PerlFAQ Server on 4 Jun 2010 00:00

This is an excerpt from the latest version of perlfaq9.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

9.5: How do I extract URLs?

    You can easily extract all sorts of URLs from HTML with
    "HTML::SimpleLinkExtor", which handles anchors, images, objects,
    frames, and many other tags that can contain a URL. If you need
    anything more complex, you can create your own subclass of
    "HTML::LinkExtor" or "HTML::Parser". You might even use
    "HTML::SimpleLinkExtor" as an example for something specifically
    suited to your needs.

    You can use "URI::Find" to extract URLs from an arbitrary text
    document.

    Less complete solutions involving regular expressions can save you a
    lot of processing time if you know that the input is simple. One
    solution from Tom Christiansen runs 100 times faster than most
    module-based approaches but only extracts URLs from anchors where the
    first attribute is HREF and there are no other attributes.

            #!/usr/bin/perl -n00
            # qxurl - tchrist(a)perl.com
            print "$2\n" while m{
                < \s*
                  A \s+ HREF \s* = \s* (["']) (.*?) \1
                \s* >
            }gsix;

--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.
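
To see what the regular expression in the answer to 9.5 does and does not catch, here is a small sketch that wraps the same pattern in a sub and runs it over a made-up HTML snippet. The sub name "extract_anchor_urls" and the sample input are assumptions for illustration only and are not part of the FAQ; the pattern itself is the one quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the FAQ): applies the FAQ's pattern
# to a string and returns every URL it captures.
sub extract_anchor_urls {
    my ($html) = @_;
    my @urls;
    while ($html =~ m{ < \s* A \s+ HREF \s* = \s* (["']) (.*?) \1 \s* > }gsix) {
        push @urls, $2;    # $2 is the text between the matching quotes
    }
    return @urls;
}

# Made-up sample input: two simple anchors, and one anchor whose
# first attribute is not HREF.
my $html = join "\n",
    q{<a href="http://example.com/">one</a>},
    q{<A HREF='http://example.org/x'>two</A>},
    q{<a class="nav" href="http://example.net/">three</a>};

print "$_\n" for extract_anchor_urls($html);
```

With this input the script prints the first two URLs and skips the third anchor, because HREF is not its first attribute. That is the trade-off the answer describes: the pattern is fast but only suited to input you know is simple.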