From: Ashley Sheridan on 26 Apr 2010 06:52

I've been thinking about this problem for a little while, and the thing is, I can think of ways of doing it, but they're not very nice, and I don't think they're going to be fast.

Basically, I have a load of HTML formatted content in a database that gets displayed on the site. It's part of a rudimentary CMS.

Currently, the titles for each article are displayed on a page, and each title links to the full article. However, that leaves me with a page which is essentially a list of links, and that's not ideal for SEO. What I wanted to do to enhance the page is to have a short excerpt of x number of words/characters beneath each article title. The idea being that search engines will find the page as more than a link farm, and visitors won't have to rely on the title alone for the content.

Here's the rub though. As the content is in HTML form, I can't just grab the first 100 characters and display them, as that could leave an open tag without a closing one, potentially breaking the page. I could use strip_tags on the 100-character excerpt, but what if the excerpt itself broke a tag in half (e.g. <acronym title="something"> could become <acron )

The only solutions I can see are:

* retrieve the entire article, perform a strip_tags and then take the excerpt
* use a regex inside of mysql to pull out only the text

The thing is, neither of these seems particularly pretty, and I am sure there's a better way, but it's too early in the week for my brain to be fully functional I think!

Does anyone have any ideas about what I could do, or do you think I'm seeing problems where there are none?

Thanks,
Ash
http://www.ashleysheridan.co.uk
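For reference, the first option in the list above (strip the tags from the whole article, then cut) could look roughly like the sketch below. The function name `make_excerpt` and its parameters are made up for illustration, and the code counts bytes rather than characters, so it assumes mostly single-byte text; multi-byte content would need the mb_* functions instead.

```php
<?php
/**
 * Plain-text excerpt: strip the markup from the whole article first,
 * then cut, so no tag can ever be split in half.
 * (Hypothetical helper, not part of any library.)
 */
function make_excerpt(string $html, int $maxChars = 100): string
{
    // Remove all tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse the runs of whitespace that block-level tags leave behind.
    $text = trim(preg_replace('/\s+/', ' ', $text) ?? $text);

    if (strlen($text) <= $maxChars) {
        return $text;
    }

    $cut = substr($text, 0, $maxChars);
    // Back up to the last space so the excerpt does not end mid-word.
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}

$sample = '<p>This is some sentence, with an '
        . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';
// Entities were decoded above, so escape again before printing into a page.
echo htmlspecialchars(make_excerpt($sample, 50)), "\n";
// prints: This is some sentence, with an abbr in the middle...
```

Because the excerpt is plain text, it can also be computed once when the article is saved and stored in its own column, which avoids fetching and stripping every full article on the listing page.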
From: Peter Lind on 26 Apr 2010 07:20

On 26 April 2010 12:52, Ashley Sheridan <ash(a)ashleysheridan.co.uk> wrote:
> Does anyone have any ideas about what I could do, or do you think I'm
> seeing problems where there are none?

Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount of content you want, then use one of the tools to repair and clean the html.

Regards
Peter

--
<hype>
WWW: http://plphp.dk / http://plind.dk
LinkedIn: http://www.linkedin.com/in/plind
Flickr: http://www.flickr.com/photos/fake51
BeWelcome: Fake51
Couchsurfing: Fake51
</hype>
From: Per Jessen on 26 Apr 2010 07:24

Ashley Sheridan wrote:
> Here's the rub though. As the content is in HTML form, I can't just
> grab the first 100 characters and display them as that could leave an
> open tag without a closing one, potentially breaking the page. I
> could use strip_tags on the 100-character excerpt, but what if the
> excerpt itself broke a tag in half (i.e. <acronym title="something">
> could become <acron )
>
> The only solutions I can see are:
>
> * retrieve the entire article, perform a strip_tags and then
>   take the excerpt
> * use a regex inside of mysql to pull out only the text
>

- parse the HTML and extract the text elements.

If the HTML is well-formed, this is relatively easily done with XSL, if not, you might need to use Beautiful Soup or similar.

--
Per Jessen, Zürich (16.1°C)
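Staying in PHP, the "parse the HTML and extract the text" idea could be sketched with the bundled DOM extension rather than XSL or Beautiful Soup (this is a swapped-in technique, not what was suggested above). The helper name `html_to_text` is invented for the example, and it assumes the DOM extension is installed and the content is UTF-8.

```php
<?php
/**
 * Parse an HTML fragment and return only its text.
 * (Hypothetical helper; assumes ext-dom is available.)
 */
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // libxml complains about sloppy markup; collect the warnings quietly
    // instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    // The leading XML declaration hints the parser that the input is UTF-8;
    // the wrapper <div> guarantees the document is never empty.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // textContent concatenates every text node below the root element.
    $text = $doc->documentElement->textContent;
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

echo html_to_text('<p>Unclosed <b>bold text'), "\n";
```

The parser is forgiving about unclosed tags, so the input does not have to be well-formed. Note that the text inside any script or style elements would be included too, so those would need removing first if the articles can contain them.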
From: Ashley Sheridan on 26 Apr 2010 07:23

On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
> of content you want, then use one of the tools to repair and clean the
> html.

Would that work on content that stopped mid-tag? Assuming the original copy is:

<p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>

If I was asking for only the first 50 characters, I'd get this:

<p>This is some sentence, with an <abbr title="Abb

Would either htmltidy or htmlpurifier be able to handle that? I don't mind whether it tries to repair the tag or remove it completely, as long as it does something to it.

Thanks,
Ash
http://www.ashleysheridan.co.uk
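One way to sidestep the mid-tag question is to drop a dangling partial tag before handing the fragment to a repair tool. The sketch below does that with a regex and then calls `tidy_repair_string()` from PHP's tidy extension if it is loaded. The helper names `drop_partial_tag` and `repair_fragment` are invented here, and the `strip_tags` fallback is just an assumption about what would be acceptable when tidy is missing. What tidy does with any particular fragment is not shown, since that depends on its version and configuration.

```php
<?php
/**
 * Remove a trailing "<..." that never reaches its closing ">".
 * Limitation: a ">" inside a quoted attribute value of the cut-off tag
 * defeats this pattern, and the partial tag is then left in place.
 */
function drop_partial_tag(string $fragment): string
{
    return preg_replace('/<[^>]*$/', '', $fragment) ?? $fragment;
}

/**
 * Cut-off HTML in, balanced HTML (or plain text as a fallback) out.
 */
function repair_fragment(string $fragment): string
{
    $fragment = drop_partial_tag($fragment);

    if (!function_exists('tidy_repair_string')) {
        // No tidy extension: fall back to a plain-text excerpt.
        return trim(strip_tags($fragment));
    }

    // 'show-body-only' asks tidy for just the repaired fragment,
    // without the <html>/<head>/<body> wrapper it normally adds.
    $fixed = tidy_repair_string($fragment, ['show-body-only' => true], 'utf8');
    return $fixed === false ? trim(strip_tags($fragment)) : trim($fixed);
}

$original = '<p>This is some sentence, with an '
          . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';
$cut = substr($original, 0, 50);   // ends in: <abbr title="Abb
echo repair_fragment($cut), "\n";
```

With the partial `<abbr` removed up front, the repair step only has to close the tags that are still open, here the `<p>`.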
From: Peter Lind on 26 Apr 2010 07:34

On 26 April 2010 13:23, Ashley Sheridan <ash(a)ashleysheridan.co.uk> wrote:
> Would that work on content that stopped mid-tag? Assuming the original copy is:
>
> <p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>
>
> If I was asking for only the first 50 characters, I'd get this:
>
> <p>This is some sentence, with an <abbr title="Abb
>
> Would either htmltidy or htmlpurifier be able to handle that? I don't mind whether it tries to repair the tag or remove it completely, as long as it does something to it.

HTMLTidy should definitely do something to it, pretty sure it's able to clean that up so you get working html. Same for HTMLPurifier (the latter is not as much what you're looking for, it protects against injections more than validating html - so disregard that I mentioned that one for now :) ).

Regards
Peter