Final pre-bias data download
Today I delivered the final pre-bias testing data to the collaboration. In other words, for some time now the website has been serving only images for the purpose of bias testing: the mirrored, black and white, etc. images. Therefore, the standard data are as complete as they will ever be. The process of getting the data into a form suitable for processing for individual science projects is beautifully inefficient and convoluted! Below is a somewhat technical description of what I do.
First I log in to a database server at the Johns Hopkins University and run an SQL query that dumps the entire live database into a text file, which I then compress and FTP over to my workstation at the Lawrence Berkeley Laboratory. This is to bridge the gap between the computer science world (pretty ASP.NET code and an SQL backend) and the science world (spaghetti FORTRAN code on UNIX and binary files).
The data is then reduced in a series of steps. First, the data is organized and sorted by galaxies, and usernames are converted into consecutive numbers (so that the usernames are anonymous in the final database). Second, the data from various downloads are combined into one big dataset. Third, bad data are weeded out (misconfigured browsers, bots and similar). Finally, the reduced “histograms” for each galaxy are produced. These correspond to our final state of knowledge about each galaxy.
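For the curious, the reduction steps above can be sketched in a few lines of Python. This is only an illustration, not the actual pipeline (which is FORTRAN): the function name `reduce_clicks`, the `(username, galaxy_id, label)` tuple layout and the `is_bad_click` filter are all assumptions made up for this example, and it further assumes that the downloads do not overlap.

```python
from collections import Counter, defaultdict


def reduce_clicks(downloads, is_bad_click):
    """Illustrative reduction of raw clicks into per-galaxy histograms.

    downloads    -- iterable of downloads, each a list of
                    (username, galaxy_id, label) tuples (hypothetical layout)
    is_bad_click -- hypothetical predicate taking an anonymised
                    (user_number, galaxy_id, label) tuple
    """
    user_numbers = {}  # username -> consecutive anonymous number
    clicks = []

    # Anonymise usernames and combine all downloads into one dataset.
    for download in downloads:
        for username, galaxy_id, label in download:
            number = user_numbers.setdefault(username, len(user_numbers))
            clicks.append((number, galaxy_id, label))

    # Weed out bad data (bots, misconfigured browsers and similar).
    good = [click for click in clicks if not is_bad_click(click)]

    # Organise the clicks by galaxy (the sort is stable).
    good.sort(key=lambda click: click[1])

    # Build the "histogram" of labels for each galaxy.
    histograms = defaultdict(Counter)
    for _, galaxy_id, label in good:
        histograms[galaxy_id][label] += 1

    return good, histograms
```

Each galaxy ends up with a simple count of how many clicks went to each label, which is what the later steps work from.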
There are four ways of doing this: spirals can be combined or separate, and users can be reweighted or not (and two times two makes four). In the combined spirals sample, we combine all three spiral subsamples (clockwise, anti-clockwise, and edge-on) into a single spiral category: science projects that are interested purely in galaxy evolution don’t care about the orientation of a given galaxy. In the reweighted sample, we try to improve the sample by essentially comparing the agreement between users: the idea is that if ten users claim that a certain galaxy is a spiral and the eleventh user says it is an elliptical, it is likely that the eleventh user got it wrong. Users who commonly disagree with everyone else get down-weighted and those who always agree get up-weighted.
It is a purely statistical exercise meant to remove pranksters who click randomly and up-weight careful users. In practice, we can check how well it works. We do this (well, Steven does it) by looking at galaxies that have the same absolute luminosity and size and shouldn’t evolve over the small redshift range probed by the SDSS. The upshot is that it doesn’t work as well as initially anticipated. As an old English proverb goes, if one million French believe in something, it doesn’t make it right. And so we also produce the unweighted sample, in which all users are given the same weight. It is up to individual science projects to decide which combination to use.
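To make the four combinations described above a bit more concrete, here is a rough Python sketch. It is not the weighting scheme we actually use: as a stand-in, it assumes each user's weight is simply the fraction of their clicks that match the most popular label for that galaxy. The label strings and the function names (`combine_spirals`, `user_weights`, `galaxy_histograms`) are likewise made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical label names for the three spiral subsamples.
SPIRAL_LABELS = {"clockwise", "anticlockwise", "edge_on"}


def combine_spirals(label):
    """Map the three spiral subsamples onto a single 'spiral' category."""
    return "spiral" if label in SPIRAL_LABELS else label


def user_weights(clicks):
    """Assumed scheme: weight = fraction of a user's clicks that agree
    with the most popular label for the galaxy in question."""
    votes = defaultdict(Counter)
    for _, galaxy, label in clicks:
        votes[galaxy][label] += 1
    top_label = {g: c.most_common(1)[0][0] for g, c in votes.items()}

    agreed, total = Counter(), Counter()
    for user, galaxy, label in clicks:
        total[user] += 1
        agreed[user] += int(label == top_label[galaxy])
    return {user: agreed[user] / total[user] for user in total}


def galaxy_histograms(clicks, combined=False, reweighted=False):
    """Build per-galaxy histograms for one of the four combinations.

    clicks -- list of (user_number, galaxy_id, label) tuples
    """
    if combined:
        clicks = [(u, g, combine_spirals(l)) for u, g, l in clicks]
    weights = user_weights(clicks) if reweighted else {}

    histograms = defaultdict(lambda: defaultdict(float))
    for user, galaxy, label in clicks:
        # In the unweighted sample every user counts as 1.0.
        histograms[galaxy][label] += weights.get(user, 1.0)
    return histograms
```

Calling `galaxy_histograms` with each pairing of `combined` and `reweighted` gives the four datasets. Note that in this sketch the agreement is measured after the spirals have been combined, so two users who disagree only about rotation direction still count as agreeing in the combined sample.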
Finally, the reduced data is uploaded to a super-secret web server where other collaborators can download it.
The final datasets contain 34,617,406 clicks made by 82,931 users. Hooray for all of you! However, the previous downloads had already gone over 30 million, and hence this will make only small improvements to our science results. The important task now is to gather enough information about biases in our datasets, so keep clicking, please!
If we are now analysing images only for the purpose of bias testing and the standard data are complete, it’s not really the same project, is it? Do you think newcomers are aware of this? (Not that it’s gonna stop me cracking on, mind…)
There’s a note on the front page that odd things are going on; Kate will confirm this but I think some of the images you’re seeing are still being classified. We needed to pin down what we’re thinking of as a final data set for the first papers, but every click will go into the final classification. Finally, the bias images are as important as the initial classifications; they’re essential in understanding what lies behind our data. So people are still contributing to the same project…
We were planning on having some original images up at the same time too (as control), but these slipped through the net at the last minute. Might get them back up if we have time… I don’t think it is a different project, as they are still SDSS galaxies, and all the current classifications are still going to be used to produce a (second) classification for the galaxies. We will then compare these classifications from the last month or so with those from earlier, and hopefully they agree! But all the classifications are just as important as each other!
I think it is interesting that perhaps psychologically it makes a difference to people’s motivation if they know they have a flipped image. But hopefully it won’t put people off… we could have given you randomly transformed images from the start…
Is “if one million French believe in something, it doesn’t make it right.” a scientifical statement? 😉
As far as personal bias goes, I have two problems that may be shared by others.
Firstly, I find that the order of the ACW/CW “buttons” (left to right, in alphabetical order) is anti-intuitive for some reason, and I occasionally choose the wrong one. I feel that CW is “normal” and should have been first in line, left to right. I would have preferred them to have been reversed, but it is too late now for such subtlety!
Secondly, having chosen the wrong “button”, I can’t correct the situation!
Cheers – Fermats Brother