Final pre-bias data download
Today I delivered the final pre-bias-testing data to the collaboration. For some time now, the website has been serving only images meant for bias testing (the mirrored, black-and-white, and similar images), so the standard data are as complete as they will ever be. The process of getting the data into a form suitable for individual science projects is beautifully inefficient and convoluted! Below is a somewhat technical description of what I do.
First I log in to a database server at the Johns Hopkins University and run an SQL query that dumps the entire live database into a text file, which I then compress and FTP over to my workstation at the Lawrence Berkeley Laboratory. This bridges the gap between the computer science world (pretty ASP.NET code and an SQL backend) and the science world (spaghetti FORTRAN code on UNIX and binary files).
The data are then reduced in a series of steps. First, the data are organized and sorted by galaxy, and usernames are converted into consecutive numbers (so that users are anonymous in the final database). Second, the data from the various downloads are combined into one big dataset. Third, bad data are weeded out (misconfigured browsers, bots and similar). Finally, the reduced “histograms” for each galaxy are produced. These correspond to our final state of knowledge about each galaxy.
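For the curious, the reduction steps above can be sketched in a few lines of Python. This is only an illustration, not the actual pipeline (which is FORTRAN): the `(username, galaxy, label)` record layout, the function names and the `bad_users` set are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def anonymize(clicks):
    """Replace each username with a consecutive integer, in order of first appearance.

    `clicks` is a list of (username, galaxy, label) tuples; this layout is assumed.
    """
    ids = {}
    out = []
    for user, galaxy, label in clicks:
        uid = ids.setdefault(user, len(ids))  # new users get the next free number
        out.append((uid, galaxy, label))
    return out

def combine(*downloads):
    """Merge several downloads into one dataset, sorted by galaxy."""
    merged = [click for download in downloads for click in download]
    return sorted(merged, key=lambda click: click[1])

def histograms(clicks, bad_users=frozenset()):
    """Count votes per label for each galaxy, skipping users flagged as bad."""
    hist = defaultdict(Counter)
    for uid, galaxy, label in clicks:
        if uid in bad_users:
            continue
        hist[galaxy][label] += 1
    return hist
```

Anonymizing after combining the downloads, rather than before, keeps one number per user across the whole dataset.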
There are four ways of doing this: spirals can be combined or separate, and users can be reweighted or not (and two times two makes four). In the combined spirals sample, we merge all three spiral subsamples (clockwise, anti-clockwise, and edge-on) into a single spiral category: science projects that are interested purely in galaxy evolution don’t care about the orientation of a given galaxy. In the reweighted sample, we try to improve the sample by essentially comparing the agreement between users: the idea is that if ten users claim that a certain galaxy is a spiral and the eleventh user says it is an elliptical, it is likely that the eleventh user got it wrong. Users who commonly disagree with everyone else get down-weighted and those who always agree get up-weighted.
It is a purely statistical exercise meant to remove pranksters who click randomly and to up-weight careful users. In practice, we can check how well it works. We do this (well, Steven does it) by looking at galaxies that have the same absolute luminosity and size and shouldn’t evolve over the small redshift range probed by the SDSS. The upshot is that it doesn’t work as well as initially anticipated. As an old English proverb goes, if one million French believe in something, that doesn’t make it right. And so we also produce the unweighted sample, in which all users are given the same weight. It is up to individual science projects to decide which combination to use.
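To make the four combinations concrete, here is a minimal Python sketch. The exact weighting scheme is not spelled out here, so the rule below (a user's weight is the fraction of their clicks that match the weighted majority, repeated a few times) is an assumption chosen for illustration, as are the label names and function names.

```python
from collections import defaultdict

# Hypothetical label names for the three spiral subsamples.
SPIRAL_LABELS = {"clockwise", "anti-clockwise", "edge-on"}

def combine_spirals(clicks):
    """Map all three spiral subsamples onto a single 'spiral' category."""
    return [(u, g, "spiral" if label in SPIRAL_LABELS else label)
            for u, g, label in clicks]

def weighted_histograms(clicks, weights=None):
    """Sum votes per galaxy and label; every user counts as 1.0 if unweighted."""
    hist = defaultdict(lambda: defaultdict(float))
    for user, galaxy, label in clicks:
        hist[galaxy][label] += 1.0 if weights is None else weights[user]
    return hist

def agreement_weights(clicks, iterations=3):
    """Assumed scheme: weight = fraction of a user's clicks matching the majority."""
    users = {u for u, _, _ in clicks}
    weights = {u: 1.0 for u in users}
    for _ in range(iterations):
        hist = weighted_histograms(clicks, weights)
        majority = {g: max(votes, key=votes.get) for g, votes in hist.items()}
        agree, total = defaultdict(int), defaultdict(int)
        for user, galaxy, label in clicks:
            total[user] += 1
            agree[user] += int(label == majority[galaxy])
        weights = {u: agree[u] / total[u] for u in users}
    return weights
```

The four samples then come from calling `weighted_histograms` on either the original or the `combine_spirals` clicks, with `weights` either left as `None` or set to the output of `agreement_weights`.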
Finally, the reduced data is uploaded to a super-secret web server where other collaborators can download it.
The final datasets contain 34,617,406 clicks made by 82,931 users. Hooray for all of you! However, the previous downloads had already passed 30 million clicks, so this one will make only small improvements to our science results. The important task now is to gather enough information about biases in our datasets, so keep clicking, please!