From: Phpster on 26 Apr 2010 07:58

On Apr 26, 2010, at 7:23 AM, Ashley Sheridan <ash(a)ashleysheridan.co.uk> wrote:

> On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
>
>> On 26 April 2010 12:52, Ashley Sheridan <ash(a)ashleysheridan.co.uk> wrote:
>>> I've been thinking about this problem for a little while, and the thing
>>> is, I can think of ways of doing it, but they're not very nice, and I
>>> don't think they're going to be fast.
>>>
>>> Basically, I have a load of HTML formatted content in a database that
>>> gets displayed onto the site. It's part of a rudimentary CMS.
>>>
>>> Currently, the titles for each article are displayed on a page, and each
>>> title links to the full article. However, that leaves me with a page
>>> which is essentially a list of links, and that's not ideal for SEO. What
>>> I wanted to do to enhance the page is to have a short excerpt of x
>>> number of words/characters beneath each article title. The idea being
>>> that search engines will find the page as more than a link farm, and
>>> visitors won't have to just rely on the title alone for the content.
>>>
>>> Here's the rub though. As the content is in HTML form, I can't just grab
>>> the first 100 characters and display them as that could leave an open
>>> tag without a closing one, potentially breaking the page. I could use
>>> strip_tags on the 100-character excerpt, but what if the excerpt itself
>>> broke a tag in half (i.e. <acronym title="something"> could become
>>> <acron )
>>>
>>> The only solutions I can see are:
>>>
>>>   * retrieve the entire article, perform a strip_tags and then take
>>>     the excerpt
>>>   * use a regex inside of mysql to pull out only the text
>>>
>>> The thing is, neither of these seems particularly pretty, and I am sure
>>> there's a better way, but it's too early in the week for my brain to be
>>> fully functional I think!
>>>
>>> Does anyone have any ideas about what I could do, or do you think I'm
>>> seeing problems where there are none?
>>
>> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
>> of content you want, then use one of the tools to repair and clean the
>> html.
>>
>> Regards
>> Peter
>>
>> --
>> <hype>
>> WWW: http://plphp.dk / http://plind.dk
>> LinkedIn: http://www.linkedin.com/in/plind
>> Flickr: http://www.flickr.com/photos/fake51
>> BeWelcome: Fake51
>> Couchsurfing: Fake51
>> </hype>
>
> Would that work on content that stopped mid-tag? Assuming the original
> copy is:
>
> <p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
> in the middle of it.</p>
>
> If I was asking for only the first 50 characters, I'd get this:
>
> <p>This is some sentence, with an <abbr title="Abb
>
> Would either htmltidy or htmlpurifier be able to handle that? I don't
> mind whether it tries to repair the tag or remove it completely, as long
> as it does something to it.
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk

When looking at the performance side of things, couldn't you add another column to the table and do this work to tidy / strip tags during the insert going forward?

Any current data would need a one-time script to clean / tidy it. You could run this on a nightly cron (depending on how much data there is) until the new column is filled with clean data.

Bastien

Sent from my iPod
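Bastien's suggestion of doing the work once, at insert time, might look roughly like the sketch below. It combines it with the first option from the original post (strip the tags from the whole article, then take the excerpt), so a tag can never be cut in half. The helper name `makeExcerpt` and the column names `body` and `excerpt` are assumptions made for illustration; nothing in the thread specifies them.

```php
<?php
/**
 * Build a plain-text excerpt from an HTML article body.
 * Hypothetical helper for illustration; not code from the thread.
 */
function makeExcerpt(string $html, int $maxChars = 100): string
{
    // Strip tags from the whole article first, so truncation can never
    // cut a tag in half, then decode entities and collapse whitespace.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // strlen/substr count bytes; multibyte content would want the
    // mb_* equivalents instead.
    if (strlen($text) <= $maxChars) {
        return $text;
    }

    // Cut to the limit, then back up to the last space so the excerpt
    // ends on a whole word.
    $cut = substr($text, 0, $maxChars);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}

// At save time the excerpt is computed once and stored next to the body.
// The column names `body` and `excerpt` are assumptions.
$html = '<p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';
$row = array(
    'body'    => $html,
    'excerpt' => makeExcerpt($html, 50),
);
echo $row['excerpt'], "\n";
```

Because the entities are decoded, the stored excerpt is plain text and should go through htmlspecialchars() when it is printed. The same function could be called from the one-time or nightly script that fills the new column for existing rows.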
From: Ashley Sheridan on 26 Apr 2010 07:54

On Mon, 2010-04-26 at 07:58 -0400, Phpster wrote:

> When looking at the performance side of things, couldn't you add
> another column to the table and do this work to tidy / strip tags
> during the insert going forward?
>
> Any current data would need a one-time script to clean / tidy it.
> You could run this on a nightly cron (depending on how much data
> there is) until the new column is filled with clean data.
>
> Bastien
>
> Sent from my iPod

That's not a bad idea actually, I hadn't thought of it! I'm kicking myself now, because it's such an obvious solution!

Thanks,
Ash
http://www.ashleysheridan.co.uk
From: Phpster on 26 Apr 2010 09:17

On Apr 26, 2010, at 7:54 AM, Ashley Sheridan <ash(a)ashleysheridan.co.uk> wrote:

> That's not a bad idea actually, I hadn't thought of it! I'm kicking
> myself now, because it's such an obvious solution!
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk

I always prefer simple solutions! It keeps things easy!

Bastien

Sent from my iPod
From: tedd on 26 Apr 2010 09:26

At 11:52 AM +0100 4/26/10, Ashley Sheridan wrote:
>-snip- SEO concerns
>
>Does anyone have any ideas about what I could do, or do you think I'm
>seeing problems where there are none?
>
>Thanks,
>Ash

Ash:

Not only do you have to consider SEO for content, but what about content for an internal site search?

I was confronted with the same problem (links to lots of PDF files) and created a brief description of each article (PDF) that would be provided to search engines and used for internal searches. Sure, it's another field, but it works.

Not that it's bad, but I do everything I can to keep HTML out of my database. In my view, the database is there to deliver content, not code. I have entire sites that spring from a single index.php page that is loaded with different content depending upon what the user wants -- the site looks big, but consists of a single page.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
From: Nathan Rixham on 26 Apr 2010 14:38

Ashley Sheridan wrote:
> I've been thinking about this problem for a little while, and the thing
> is, I can think of ways of doing it, but they're not very nice, and I
> don't think they're going to be fast.
>
> Basically, I have a load of HTML formatted content in a database that
> gets displayed onto the site. It's part of a rudimentary CMS.
>
> Currently, the titles for each article are displayed on a page, and each
> title links to the full article. However, that leaves me with a page
> which is essentially a list of links, and that's not ideal for SEO. What
> I wanted to do to enhance the page is to have a short excerpt of x
> number of words/characters beneath each article title. The idea being
> that search engines will find the page as more than a link farm, and
> visitors won't have to just rely on the title alone for the content.
>
> Here's the rub though. As the content is in HTML form, I can't just grab
> the first 100 characters and display them as that could leave an open
> tag without a closing one, potentially breaking the page. I could use
> strip_tags on the 100-character excerpt, but what if the excerpt itself
> broke a tag in half (i.e. <acronym title="something"> could become
> <acron )
>
> The only solutions I can see are:
>
>   * retrieve the entire article, perform a strip_tags and then take
>     the excerpt
>   * use a regex inside of mysql to pull out only the text
>
> The thing is, neither of these seems particularly pretty, and I am sure
> there's a better way, but it's too early in the week for my brain to be
> fully functional I think!
>
> Does anyone have any ideas about what I could do, or do you think I'm
> seeing problems where there are none?
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk

/**
 * Creates an abstract from any string, a nice one that stops at a full
 * stop or at the end of a word, somewhere between roughly 120 and 180 chars.
 */
function createAbstract( $string )
{
    // If there are several lines and the first is long enough, use only that line.
    $lines = explode( "\n" , $string );
    if( count($lines) > 1 && strlen($lines[0]) > 140 ) {
        $string = $lines[0];
    }
    // Already short enough: return it untouched.
    if( strlen($string) < 180 ) {
        return $string;
    }
    $string = substr( $string , 0 , 180 );
    $chars = str_split( $string );
    // First choice: stop at the first full stop after 120 chars.
    $string = '';
    foreach( $chars as $char ) {
        $string .= $char;
        if( $char == '.' && strlen($string) > 120 ) {
            return $string;
        }
    }
    // Otherwise: stop at the first space after 140 chars.
    $string = '';
    foreach( $chars as $char ) {
        $string .= $char;
        if( $char == ' ' && strlen($string) > 140 ) {
            return trim( $string ) . '...';
        }
    }
    // No suitable break found: return the plain 180-char cut.
    return $string;
}

/**
 * Given HTML (or a fragment), tidy it into usable HTML
 * and strip it back to text, new lines intact.
 */
function htmlToText( $html )
{
    // Normalise ampersands so none end up double-encoded. The entity
    // literals were lost in the archived copy; '&amp;' is an assumed
    // reconstruction here and in the decode step below.
    $html = str_replace( '&' , '&amp;' , str_replace( '&amp;' , '&' , $html ) );
    $config = array(
        'clean' => true,
        'drop-proprietary-attributes' => true,
        'output-xhtml' => true,
        'show-body-only' => true,
        'word-2000' => true,
        'wrap' => '0'
    );
    $tidy = new tidy();
    $tidy->parseString( $html , $config , 'utf8' );
    $tidy->cleanRepair();
    $html = tidy_get_output( $tidy );
    // Decode the ampersands again. This reads from $html; the posted
    // version read from $text, which was never set.
    $text = str_replace( '&amp;' , '&' , $html );
    return strip_tags( $text );
}

Using those two together should do it; they're pretty basic and could do with a tidy, but they get the job done (you'll probably want to change the 140 chars to something different).

Best,

Nathan
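Where the tidy extension is not installed, roughly the same HTML-to-text step can be sketched with PHP's DOM extension. This is a swapped-in technique, not what Nathan posted, and the function name `fragmentToText` is hypothetical. It relies on the parser closing tags that were left open; how it treats a tag cut off mid-attribute (the `<abbr title="Abb` case from earlier in the thread) is not something this sketch guarantees, which is one more reason to strip tags before truncating rather than after.

```php
<?php
/**
 * Convert an HTML fragment to plain text using the DOM extension.
 * Hypothetical alternative for illustration; not code from the thread.
 */
function fragmentToText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world fragments are rarely valid, so collect parser warnings
    // quietly instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Wrapping in a div guarantees a non-empty document with a body.
    // Without an encoding declaration libxml may not treat the input
    // as UTF-8, so non-ASCII text needs extra care.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : $body->textContent;

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Unclosed <p> and <b> tags are closed by the parser before the text is read.
echo fragmentToText('<p>Hello <b>world'), "\n";
```

The result could then be passed to an excerpt function such as createAbstract() in place of the output of htmlToText().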