From: Colin Guthrie on 20 Aug 2010 05:00

Hi,

OK, this is really just a sounding board for a couple of ideas I'm mulling over regarding a pseudo-randomisation system for some websites I'm doing. Any thoughts on the subject greatly appreciated!

Back Story:

We have a system that lists things. The things are broken down by category, but you can still end up at the leaf of the category with a couple of hundred things to list, which is done via a pagination system (let's say 50 per page).

Now, the people who own the things pay to have their things on the site. Let's say there are three levels of option for listing: gold, silver, bronze. The default order is gold things, then silver things, then bronze things. Within each level, the things are listed alphabetically (again, this is just the default).

Now if 100 things in one category have a gold level listing, those in the second half of the alphabet will be on page two by default. They don't like this, and they question why they are paying for gold at all.

My client would like to present things in a more random way, to give all gold level things a chance to be on the first page of results in a fairer way than just what they happen to be named.

Right, that's the back story. It's more complex than that, but the above is a nice and simple abstraction.

Problems:

There are numerous problems with randomised listings: you can't actually truly randomise results, otherwise pagination breaks. Server-side caching/denormalisation is affected, as there is no longer "one listing" but "many random listings". Discussing a link with a friend over IM or email and saying things like "the third one down looks best" is obviously broken too, but this is something my client accepts and can live with. Also, if the intention is to reassure the thing owners that their listing will appear further up the listings at times, the fact that a simple refresh will not reorder things for a given session will make that point harder to get across to less web-educated clients (that's a nice way of saying it!). Caching proxies and other similar things after the webserver will also come into play.

So to me there are only really two options:

1. Random-per-user (or session): Each user session gets some kind of randomisation key, and a fresh set of random numbers is generated for each thing. They can then be reliably "randomised" for a given user. The fact that each user has their own unique randomisation is good, but it doesn't help things like server-side full page caching, and thus more "work" needs to be done to support this approach.

2. Random-bank + user/session assignment: With this approach we have a simple table of numbers. The first column is an id and is sequential from 1 to <very big number>. This table has lots of columns: say 32. These columns store random numbers. Once generated, this table acts as an orderer: it can be joined into our thing lookup query and the results can be ordered by one of the columns. Which column to use for ordering is picked by a cookie stored on the user's machine. That way the user will always get the same random result, even if they revisit the site some time later. (Users not accepting cookies is not a huge deal, but I would suggest the "pick a random column" algorithm, used to set the cookie initially, is actually based on source IP address. That way even cookieless folks should get a consistent listing unless they change their IP.)
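A minimal PHP sketch of how option 2 might hang together. All table and column names here (random_bank, rand_1 .. rand_32, things, listing_level) are invented for illustration:

    <?php
    // Sketch of the random-bank idea (option 2): pick one of 32
    // pre-randomised columns via a cookie, falling back to the
    // source IP for cookieless visitors.
    $numColumns = 32;

    if (isset($_COOKIE['rand_col'])) {
        $col = (int) $_COOKIE['rand_col'];
    } else {
        // Cookieless fallback: derive a stable column from the IP
        // so repeat visitors still see a consistent ordering.
        $col = (abs(crc32($_SERVER['REMOTE_ADDR'])) % $numColumns) + 1;
        setcookie('rand_col', (string) $col, time() + 30 * 86400);
    }
    $col = max(1, min($numColumns, $col)); // never trust the cookie value

    // $col is a clamped integer, so interpolating it here is safe.
    $sql = "SELECT t.*
              FROM things t
              JOIN random_bank rb ON rb.id = t.id
             WHERE t.category_id = ?
             ORDER BY t.listing_level, rb.rand_$col
             LIMIT 50 OFFSET ?";

The cookie only records which of the 32 orderings the visitor sees, so the same listing comes back on a revisit, as described above.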
I'm obviously leaning towards the second approach. If I have 32 "pre-randomised" columns, this would get a pretty good end result, I think. If we re-randomise periodically (e.g. once a week or month) then this can be extended further (or simply more columns can be added).

I think it's the lowest-impact option, but there are still some concerns:

Server-side caching is still problematic. Instead of storing one page per "result", I now have to store 32. This will significantly lower the cache hit rate and perhaps make full result caching somewhat redundant. If that is the case, then so be it, but load will have to be managed.

So my question for the lazy-web:

Are there any other approaches I've missed? Is there some cunning cleverness that eludes me?

Are there any problems with the above approach? Would a caching proxy ultimately cause problems for some users (i.e. storing a cache for page 1 and page 2 of the same listing but with different randomisations)? And if so, can this be mitigated?

Thanks for reading and any insights you may have!

Col

--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job: Tribalogic Limited [http://www.tribalogic.net/]
Open Source: Mandriva Linux Contributor [http://www.mandriva.com/]
PulseAudio Hacker [http://www.pulseaudio.org/]
Trac Hacker [http://trac.edgewall.org/]
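To make the caching concern above concrete: once the randomisation column becomes part of the full-page cache key, each category/page combination fans out into 32 cache entries. A tiny hypothetical sketch, with example values standing in for the real request handling:

    <?php
    // Hypothetical cache-key sketch; values are illustrative.
    $categoryId = 7;  // example category
    $page       = 1;  // example page
    $col        = 17; // the user's randomisation column, 1..32

    // Without randomisation: one entry per category+page.
    // With it: 32 entries per category+page, so hit rates drop.
    $cacheKey = sprintf('listing:%d:page:%d:col:%d', $categoryId, $page, $col);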
From: "Jon Haworth" on 20 Aug 2010 07:11

Hi Col,

Interesting problem.

> Are there any other approaches I've missed?

Off the top of my head, how about this:

1. Add a new unsigned int column called "SortOrder" to the table of widgets or whatever it is you're listing.

2. Fill this column with randomly-generated numbers between 0 and whatever the unsigned int max is (can't remember exactly, but 4.2 billion ish).

3. Add the SortOrder column to the end of all your ORDER BY clauses: SELECT foo ORDER BY TypeOfListing, SortOrder will give you widgets sorted by Gold/Silver/Bronze type, but in a random order within each type.

4. Every hour/day/week/whatever, update this column with different random numbers.

Advantages: practically no hassle/overhead/maintenance for you; provides the same ordering sequence for all users at the same time; only breaks "third one down"-type references when you refresh the SortOrder column, rather than on each session or page view; reasonably proxy- and cache-friendly, especially if you send a meaningful HTTP Expires header.

Disadvantages: breaks user persistence if they visit before and after a SortOrder refresh ("I'm sure the one I wanted was at the top of the list yesterday..."); more effort to demonstrate randomness to the client ("OK, see how you're in ninety-third place today? Well, check again tomorrow and you should be somewhere else on the list").

Hopefully food for thought anyway.

Cheers
Jon
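A rough PHP/MySQL sketch of Jon's four steps, assuming a hypothetical widgets table; the connection details are invented, and the refresh would typically live in a cron script:

    <?php
    // Sketch of Jon's SortOrder suggestion.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

    // Step 1 (one-off):
    // ALTER TABLE widgets ADD COLUMN SortOrder INT UNSIGNED NOT NULL DEFAULT 0;

    // Steps 2 and 4: MySQL evaluates RAND() once per row in an UPDATE,
    // so every widget gets its own number up to the unsigned int max.
    $pdo->exec("UPDATE widgets SET SortOrder = FLOOR(RAND() * 4294967295)");

    // Step 3: append SortOrder to the end of the existing ORDER BY.
    $stmt = $pdo->query("SELECT * FROM widgets
                          ORDER BY TypeOfListing, SortOrder
                          LIMIT 50 OFFSET 0");

Re-running just the UPDATE on whatever schedule suits (step 4) re-rolls the ordering for everyone at once.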
From: Nathan Rixham on 20 Aug 2010 08:17

Colin Guthrie wrote:
> Are there any other approaches I've missed? Is there some cunning
> cleverness that eludes me?
[snip]
If you use MySQL you can seed RAND() with a number to get the same random results out each time (for that seed number):

SELECT * FROM table ORDER BY RAND(234)

Then just use LIMIT and OFFSET as normal. Thus, assign each user/session a simple random int and use it in the query.

On a semi-related note: if you need real random data, then you'll be wanting random.org.

Best,
Nathan
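A short PHP sketch of this seeded-RAND() approach, with invented table and column names, showing how a per-session seed keeps pagination stable across pages:

    <?php
    // Pin one random seed to the session so every page of the
    // listing shares the same ordering for that visitor.
    session_start();

    if (!isset($_SESSION['rand_seed'])) {
        $_SESSION['rand_seed'] = mt_rand(1, 1000000);
    }
    $seed = (int) $_SESSION['rand_seed'];

    $page   = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
    $offset = ($page - 1) * 50;

    // $seed and $offset are integers, so interpolation is safe here.
    $sql = "SELECT * FROM things
             ORDER BY listing_level, RAND($seed)
             LIMIT 50 OFFSET $offset";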
From: Colin Guthrie on 20 Aug 2010 09:05

Thanks everyone for the responses.

'Twas brillig, and Nathan Rixham at 20/08/10 13:17 did gyre and gimble:
> if you use mysql you can seed rand() with a number to get the same
> random results out each time (for that seed number)
>
> SELECT * from table ORDER BY RAND(234)
>
> Then just use limit and offset as normal.

This is a neat trick! Yeah, that will avoid the need for the static lookup table with 32 randomised columns.

Jon's strategy is more or less a simplified version of my 32-column randomising table (i.e. just have one column of random data rather than 32). I would personally prefer to reduce the refresh frequency of this data, as I don't like to annoy people when the changeover day happens.

The RAND(seed) approach will probably work well (not sure of the performance versus an indexed table, but I can easily experiment with this). If I use the numbers 1..32 as my seed, then I still get the same net result as a 32-column table. If I just change my "seed offset" then I get the same result as regenerating my random data tables.

From an operational perspective, RAND(seed) is certainly easier. I'll certainly look into this.

Many thanks.

Col

--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job: Tribalogic Limited [http://www.tribalogic.net/]
Open Source: Mandriva Linux Contributor [http://www.mandriva.com/]
PulseAudio Hacker [http://www.pulseaudio.org/]
Trac Hacker [http://trac.edgewall.org/]
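A sketch of that "seed offset" idea, with illustrative numbers: the IP-derived bucket plays the role of the 32 pre-randomised columns, and bumping the base offset is the equivalent of regenerating the whole random bank at once. Table and column names are invented:

    <?php
    // 32 fixed seeds stand in for 32 pre-randomised columns.
    $base   = 1000; // bump this (say, monthly) to re-roll all orderings
    $bucket = (abs(crc32($_SERVER['REMOTE_ADDR'])) % 32) + 1; // 1..32

    $seed = $base + $bucket;
    $sql  = "SELECT * FROM things
              ORDER BY listing_level, RAND($seed)
              LIMIT 50";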
From: Colin Guthrie on 20 Aug 2010 09:31
'Twas brillig, and Andrew Ballard at 20/08/10 14:24 did gyre and gimble:
> Would it work to return a list of some limited number of randomly
> ordered "featured" listings/items on the page, while leaving the full
> list ordered by whatever natural ordering (by date, order entered,
> alphabetical, etc.)? That gives every owner a chance to appear in a
> prominent spot on the page while solving the issue you cited about
> page breaks (and SEO if that is a concern). You can still use any of
> the suggestions that have been discussed to determine how frequently
> the featured items list is reseeded to help make caching practical.

Yeah, we've tried to push this as an option too, but so far our clients are not biting on this suggestion. They like the idea... but in addition to randomised listings too!

Speaking of SEO, that was one of our concerns about randomising listings as well. What impact do you think such randomised listings will have on SEO? Obviously if a search term matches a listing page that contains a thing, and the thing itself is no longer in the listing when the user visits that page, then the user will be disappointed, but will this actually result in SEO penalties?

Col

--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job: Tribalogic Limited [http://www.tribalogic.net/]
Open Source: Mandriva Linux Contributor [http://www.mandriva.com/]
PulseAudio Hacker [http://www.pulseaudio.org/]
Trac Hacker [http://trac.edgewall.org/]