Prev: FAQ 4.34 How do I extract selected columns from a string?
Next: FAQ 5.30 How can I read in a file by paragraphs?
From: PerlFAQ Server on 28 Jun 2010 18:00 This is an excerpt from the latest version perlfaq9.pod, which comes with the standard Perl distribution. These postings aim to reduce the number of repeated questions as well as allow the community to review and update the answers. The latest version of the complete perlfaq is at http://faq.perl.org . -------------------------------------------------------------------- 9.4: How do I remove HTML from a string? The most correct way (albeit not the fastest) is to use "HTML::Parser" from CPAN. Another mostly correct way is to use "HTML::FormatText" which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text. Many folks attempt a simple-minded regular expression approach, like "s/<.*?>//g", but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comment may be present. Plus, folks forget to convert entities--like "<" for example. Here's one "simple-minded" approach, that works for most files: #!/usr/bin/perl -p0777 s/<(?:[^>'"]*|(['"]).*?\1)*>//gs If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz . Here are some tricky cases that you should think about when picking a solution: <IMG SRC = "foo.gif" ALT = "A > B"> <IMG SRC = "foo.gif" ALT = "A > B"> <!-- <A comment> --> <script>if (a<b && a>c)</script> <# Just data #> <![INCLUDE CDATA [ >>>>>>>>>>>> ]]> If HTML comments include other tags, those solutions would also break on text like this: <!-- This section commented out. <B>You can't see me!</B> --> -------------------------------------------------------------------- The perlfaq-workers, a group of volunteers, maintain the perlfaq. They are not necessarily experts in every domain where Perl might show up, so please include as much information as possible and relevant in any corrections. The perlfaq-workers also don't have access to every operating system or platform, so please include relevant details for corrections to examples that do not work on particular platforms. Working code is greatly appreciated. If you'd like to help maintain the perlfaq, see the details in perlfaq.pod. |