From: "Geoffrey van Wyk" on 18 Sep 2010 02:21

Hi All,

I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it using the DOMDocument class, but, because the HTML files I work with are prepared in MS Word, the DOMDocument's loadHTMLFile() function gives this exception: "Namespaces are not defined".

This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:

<?php
/* Using the DOMDocument class */

/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");

/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");

/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");

echo "The initial number of paragraphs is " . $pars->length . ".<br />";

/* The trim() function is used to remove leading and trailing spaces as well as
 * newline characters. */
for ($i = 0; $i < $pars->length; $i++){
    if (trim($pars->item($i)->textContent == "")){
        $pars->item($i)->parentNode->removeChild($pars->item($i));
        $i--;
    }
}

echo "The final number of paragraphs is " . $pars->length . ".<br />";

// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>

This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:

<?php
/* Using simple_html_dom.php */

include("simple_html_dom.php");

$html = file_get_html("HTML File With Empty Paragraphs.html");

$pars = $html->find("p");

for ($i = 0; $i < count($pars); $i++) {
    if (trim($pars[$i]->plaintext == "")) {
        unset($pars[$i]);
        $i--;
    }
}

$html->save("HTML File without Empty Paragraphs.html");
?>

It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext == "")) {".

Does anyone know how I can fix this?

Thank you.

Geoffrey van Wyk
From: Simon J Welsh on 18 Sep 2010 04:24

On 18/09/2010, at 6:21 PM, Geoffrey van Wyk wrote:

> Hi All,
>
> I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it using the DOMDocument class, but, because the HTML files I work with are prepared in MS Word, the DOMDocument's loadHTMLFile() function gives this exception: "Namespaces are not defined".
>
> This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:
>
> <?php
> /* Using the DOMDocument class */
>
> /* Create a new DOMDocument object. */
> $html = new DOMDocument("1.0", "UTF-8");
>
> /* Load HTML code from an HTML file into the DOMDocument. */
> $html->loadHTMLFile("HTML File With Empty Paragraphs.html");
>
> /* Assign all the <p> elements into the $pars DOMNodeList object. */
> $pars = $html->getElementsByTagName("p");
>
> echo "The initial number of paragraphs is " . $pars->length . ".<br />";
>
> /* The trim() function is used to remove leading and trailing spaces as well as
>  * newline characters. */
> for ($i = 0; $i < $pars->length; $i++){
>     if (trim($pars->item($i)->textContent == "")){
>         $pars->item($i)->parentNode->removeChild($pars->item($i));
>         $i--;
>     }
> }
>
> echo "The final number of paragraphs is " . $pars->length . ".<br />";
>
> // Write the HTML code back into an HTML file.
> $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
> ?>
>
> This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:
>
> <?php
> /* Using simple_html_dom.php */
>
> include("simple_html_dom.php");
>
> $html = file_get_html("HTML File With Empty Paragraphs.html");
>
> $pars = $html->find("p");
>
> for ($i = 0; $i < count($pars); $i++) {
>     if (trim($pars[$i]->plaintext == "")) {
>         unset($pars[$i]);
>         $i--;
>     }
> }
>
> $html->save("HTML File without Empty Paragraphs.html");
> ?>
>
> It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext == "")) {".
>
> Does anyone know how I can fix this?
>
> Thank you.
>
> Geoffrey van Wyk

Personally, I'd just use a regex to do it. Something like

preg_replace('#<p[^>]*?>\s*</p>#m', '', $html)

should do it.

Otherwise, you've got trim($pars[$i]->plaintext == "") instead of trim($pars[$i]->plaintext) == "".

---
Simon Welsh
Admin of http://simon.geek.nz/

Who said Microsoft never created a bug-free program? The blue screen never, ever crashes!

http://www.thinkgeek.com/brain/gimme.cgi?wid=81d520e5e
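
To make the two suggestions in the reply above concrete, here is a minimal sketch in plain PHP. The helper names remove_empty_paragraphs() and is_blank() are hypothetical and exist only for this illustration. The pattern is a slightly adjusted variant of the one in the reply, not the poster's exact regex: it adds \b so that tags such as <pre> are not matched, and it assumes that paragraphs holding only &nbsp; entities should also count as empty, which may or may not suit a given Word export.

```php
<?php
// Hypothetical helper: strip <p> elements that contain nothing but
// whitespace or &nbsp; entities. Works on an HTML string, for example
// one obtained with file_get_contents().
function remove_empty_paragraphs($html)
{
    // <p\b      matches "<p" only as a whole tag name (not <pre>, <param>)
    // [^>]*     skips any attributes, such as class="MsoNormal"
    // (?:\s|&nbsp;)*  assumption: whitespace and &nbsp; both mean "empty"
    $result = preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $html);

    // preg_replace() returns null on failure; fall back to the input.
    return $result === null ? $html : $result;
}

// Hypothetical helper showing the corrected parenthesis placement:
// trim() runs first, and its result is then compared.
function is_blank($text)
{
    return trim($text) === '';
}

$sample = "<p>Keep me</p><p class=\"MsoNormal\">  \n </p><pre></pre><p>&nbsp;</p>";
echo remove_empty_paragraphs($sample), "\n";

// With the parenthesis misplaced, the comparison runs first and trim()
// receives a boolean, so a whitespace-only string is not detected.
var_dump(trim("  \n" == ""));  // trim() applied to the result of ==
var_dump(is_blank("  \n"));    // trim() applied to the text itself
```

Two further points are worth checking against the original loop, neither of which is exercised by the sketch. First, unset($pars[$i]) leaves a hole in the array rather than renumbering it, so the following $i-- and $i++ land on the same missing index again; that is a likely cause of both the "Undefined offset" notice and the long run time. Second, unset() only changes the local array returned by find(), not the document. The simple_html_dom manual describes assigning an empty string to an element's outertext as the way to remove it, so a foreach over $html->find("p") that sets $p->outertext = "" for blank paragraphs is probably closer to what was intended.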