New paper: Galaxy Zoo and machine learning
I’m really happy to announce that a new paper based on Galaxy Zoo data has just been accepted for publication. This one is different from many of our previous works: it focuses on the science of machine learning, and on how we’re improving the ability of computers to identify galaxy morphologies after being trained on the classifications you’ve provided in Galaxy Zoo. The paper was led by Sander Dieleman, a PhD student at Ghent University in Belgium.
This work began in early 2014, when we ran an online competition on the Kaggle data platform called “The Galaxy Challenge”. The premise was fairly simple: we took the classifications provided by citizen scientists for the Galaxy Zoo 2 project and challenged computer scientists to write an algorithm that matched those classifications as closely as possible. We provided about 75,000 anonymized images and classifications as a training set for participants, and kept the same amount of data secret; solutions submitted by competitors were scored against this held-out set. More than 300 teams participated, and we awarded prizes for the top three scores. You can see more details on the competition site.
Since completing the competition, Sander has been working on writing up his solution as an academic paper, which has just been accepted by Monthly Notices of the Royal Astronomical Society (MNRAS). The method he developed relies on a technique known as a neural network: a statistical model whose many parameters adjust as it learns, allowing it to capture “non-linear” relationships between its inputs and outputs. The name and design of many neural networks are inspired by the way that neurons function in the brain.
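To make the idea concrete, here is a toy sketch of a neural network in Python. This is purely illustrative and far simpler than the convolutional network in Sander’s paper: the layer sizes, the ReLU activation, and the random weights are all assumptions for the example. The non-linear activation between the layers is what lets such a model fit non-linear relationships.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Non-linear activation: without this, stacking layers would
    # still only produce a linear function of the inputs.
    return np.maximum(0.0, x)

# Randomly initialised parameters; training would adjust these to
# match the citizen-science classifications as closely as possible.
W1 = rng.normal(size=(4, 3))   # input (3 features) -> hidden (4 units)
W2 = rng.normal(size=(1, 4))   # hidden (4 units) -> output (1 score)

def forward(x):
    # Pass an input vector through both layers.
    hidden = relu(W1 @ x)
    return W2 @ hidden

x = np.array([0.2, -0.5, 1.0])   # a made-up 3-number "image summary"
print(forward(x))                # one untrained output value
```

In the real model the inputs are image pixels rather than three hand-picked numbers, and there are many more layers and parameters, but the principle of fitting parameters to reproduce the volunteers’ answers is the same.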
One of the innovative techniques in Sander’s work has been to use a model that makes use of the symmetry in the galaxy images. Consider the pictures of the same galaxy below:
From the classifications in GZ, we’d expect the answers for these two images to be identical; it’s the same galaxy, after all, no matter which way we look at it. For a computer program, however, these images would ordinarily need to be analyzed and classified separately. Sander’s work exploits this symmetry in two ways:
- The size of the training data can be dramatically increased by including multiple, rotated versions of the different images. More training data typically results in a better-performing algorithm.
- Since the morphological classification for the two galaxies should be the same, we can apply the same feature detectors to the rotated images and thus share parameters in the model. This makes the model more general and improves the overall performance.
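Both ideas above can be sketched in a few lines of Python. This is an illustration of the principle, not the paper’s actual code: it uses exact 90-degree rotations (the real model also uses flips and other rotations), and the 2x2 “feature detector” applied at a single position is a miniature stand-in for a convolutional layer whose weights are shared across all the rotated views.

```python
import numpy as np

def augment_with_rotations(image):
    # Each training image yields several rotated copies, all of which
    # should receive the same morphological label.
    return [np.rot90(image, k) for k in range(4)]

def apply_shared_filter(views, filt):
    # The SAME filter weights are applied to every rotated view --
    # parameter sharing in miniature.
    return [float((v[:2, :2] * filt).sum()) for v in views]

image = np.arange(9, dtype=float).reshape(3, 3)  # stand-in galaxy image
filt = np.ones((2, 2)) / 4.0                     # one shared feature detector

views = augment_with_rotations(image)
responses = apply_shared_filter(views, filt)
print(len(views))   # 4 views from one original image
```

One original image has become four training examples, and because the filter is shared, every parameter in it is trained on all four views at once.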
Once all of the training data is in, Sander’s model takes images and provides very precise classifications of morphology. I think one of the neatest visualizations is this one: galaxies in the top row versus the bottom row are considered “most dissimilar” by the feature maps in the model. You can see that it’s doing well: for example, it groups all the loose spiral galaxies together and predicts that they form a class distinct from edge-on spirals.
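The idea behind that visualization can be sketched as follows. Each galaxy is represented by a feature vector (in the paper these come from the network’s learned maps; here they are random stand-ins), and pairs are ranked by their distance in feature space, with the farthest-apart pair being “most dissimilar”. The vector size and distance measure below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(5, 8))   # 5 fake "galaxies", 8 features each

def most_dissimilar_pair(feats):
    # Compare every pair of feature vectors and return the indices of
    # the pair separated by the largest Euclidean distance.
    n = len(feats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return max(pairs, key=lambda p: np.linalg.norm(feats[p[0]] - feats[p[1]]))

i, j = most_dissimilar_pair(features)
print(i, j)   # indices of the two most dissimilar "galaxies"
```

With real learned features, galaxies of the same morphological type end up close together in this space, which is why the model can separate loose spirals from edge-on disks.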
For more details on Sander’s work, he has an excellent blog post on his own site, much of which is accessible even to a non-expert.
While there are many applications for these sorts of algorithms, we’re particularly interested in how they will help us select future datasets for Galaxy Zoo and similar projects. For future surveys like LSST, which will contain many millions of images, we want to efficiently select the images where citizen scientists can contribute the most, whether because the galaxies are unusual or because they offer the chance of serendipitous discoveries. Your classifications are what make innovations like this possible, and we’re looking forward to seeing how they can be applied to new scientific problems.