# Explaining clustering statistics we use to study the distribution of Galaxy Zoo galaxies

I’ve used some statistical tools to analyze the spatial distribution of Galaxy Zoo galaxies and to see whether we find galaxies with particular classifications in more dense environments or less dense ones. By “environment” I’m referring to the kinds of regions in which these galaxies tend to be found: for example, galaxies in dense environments are usually strongly clustered in groups and clusters of many galaxies. In particular, I’ve used what we call “marked correlation functions,” which I’ve found are very sensitive statistics for identifying and quantifying trends between objects and their environments. Environment is also important from the perspective of models, since we think that massive clumps of dark matter are in the same regions as massive galaxy groups.

We’ve mainly used them in two papers, where we analyzed the environmental dependence of morphology and color and where we analyzed the environmental dependence of barred galaxies. These papers have been described a bit in this post and this post. We’ve also had other Galaxy Zoo papers about similar subjects, especially this paper by Steven Bamford and this one by Kevin Casteels.

What I loved about these projects is that we obtained impressive results that nobody else had seen before, and it’s all thanks to the many many classifications that the citizen scientists have contributed. These statistics are useful only when one has large catalogs, and that’s exactly what we had in Galaxy Zoo 1 and 2. We have catalogs with visual classifications and type likelihoods that are ten times as large as ones other astronomers have used.

What are these “marked correlation functions”, you ask? Traditional correlation functions tell us about how objects are clustered relative to random clustering, and we usually write this as 1+ξ. But we have lots of information about these galaxies, more than just their spatial positions. So we can weight the galaxies by a particular property, such as the elliptical galaxy likelihood, and then measure the clustering signal. We usually write this as 1+W. Then the ratio (1+W)/(1+ξ), which is the marked correlation function M(r), tells us whether galaxies with high values of the weight are in more dense or less dense environments on average. And if 1+W=1+ξ, or in other words M=1, then the weight is not correlated with the environment at all.
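To make the ratio concrete, here is a minimal Python sketch of one simple way to estimate M(r). This is not the code used for the papers; it rests on several assumptions of mine. It uses the pair-count form M(r) ≈ WW(r)/DD(r), where DD counts galaxy pairs in a separation bin and WW sums the products of their weights, with the weights normalised to a mean of 1. It uses plain 3D Cartesian separations rather than projected separations, a brute-force pair search that only suits small samples, and it ignores survey edges and error estimates. The function name `marked_correlation` and its arguments are hypothetical.

```python
import numpy as np


def marked_correlation(positions, marks, bin_edges):
    """Estimate M(r) = (1 + W) / (1 + xi) as WW(r) / DD(r).

    positions : (N, D) array of Cartesian coordinates (assumed input format)
    marks     : (N,) array of weights, e.g. elliptical likelihoods
    bin_edges : edges of the separation bins
    """
    positions = np.asarray(positions, dtype=float)
    marks = np.asarray(marks, dtype=float)

    # Normalise the marks so that their mean is 1; then M = 1 means
    # the mark is uncorrelated with environment.
    w = marks / marks.mean()

    # Brute-force pairwise separations (O(N^2) memory, small samples only).
    diff = positions[:, None, :] - positions[None, :, :]
    sep = np.sqrt((diff ** 2).sum(axis=-1))

    # Keep each pair once (upper triangle, no self-pairs).
    i, j = np.triu_indices(len(positions), k=1)
    sep = sep[i, j]
    pair_weights = w[i] * w[j]

    # DD: unweighted pair counts; WW: pair counts weighted by the mark product.
    dd, _ = np.histogram(sep, bins=bin_edges)
    ww, _ = np.histogram(sep, bins=bin_edges, weights=pair_weights)

    # Empty bins give 0/0, which is returned as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return ww / dd


# Usage with made-up data: marks drawn independently of position,
# so M(r) should scatter around 1.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 100.0, size=(500, 3))
p_el = rng.uniform(0.0, 1.0, size=500)
m_of_r = marked_correlation(pos, p_el, np.linspace(1.0, 30.0, 8))
```

Because the random-clustering term cancels in the ratio, this form needs no random catalog, which is one reason the statistic is convenient. Feeding in vote fractions directly as the marks also avoids having to split the sample at a likelihood threshold.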

First, I’ll show you one of our main results from the morphology and color paper, using Galaxy Zoo 1 data. The upper panel shows the clustering of galaxies in the sample we selected, as a function of projected galaxy separation (rp). This is something other people have measured before, and we already knew that galaxies are clustered more than random clustering. But then we weighted the galaxies by the GZ elliptical likelihood (based on the fraction of classifiers identifying the galaxies as ellipticals) and then took the (1+W)/(1+ξ) ratio, which is M(rp), and that’s shown by the red squares in the lower panel. When we use the spiral likelihoods, the blue squares are the result. This means that elliptical galaxies tend to be found in dense environments, since they have an M(rp) ratio that’s greater than 1, and spiral galaxies, with a ratio below 1, are in less dense environments than average. When I first ran these measurements, I expected kind of noisy results, but the measurements are very precise and they far exceeded my expectations. Without many visual classifications of every galaxy, this wouldn’t be possible.

Second, using Galaxy Zoo 2 data, we measured the clustering of disc galaxies, and that’s shown in the upper panel of the plot above. Then we weighted the galaxies by their bar likelihoods (based on the fractions of people who classified them as having a stellar bar) and measured the same statistic as before. The result is shown in the lower panel, and it shows that barred disc galaxies tend to be found in denser environments than average disc galaxies! This is a completely new result. Astronomers had not detected this signal before mainly because their samples were too small, but we were able to do better with the classifications provided by Zooites. We argued that barred galaxies often reside in galaxy groups and that a minor merger or interaction with a neighboring galaxy can trigger disc instabilities that produce bars.

What kinds of science shall we use these great datasets and statistics for next? My next priority with Galaxy Zoo is to develop dark matter halo models of the environmental dependence of galaxy morphology. Our measurements are definitely good enough to tell us how spiral and elliptical morphologies are related to the masses of the dark matter haloes that host the galaxies, and these relations would be an excellent and new way to test models and simulations of galaxy formation. And I’m sure there are many other exciting things we can do too.

…One more thing: if you’re interested, you’re welcome to check out my own blog, where I occasionally write posts about citizen science.

I'm a science writer and journalist and a former astrophysicist. Check out excerpts of my work for Nature and other magazines, as well as my blog, Science Political, here: http://raminskibba.net/.

### 11 responses to “Explaining clustering statistics we use to study the distribution of Galaxy Zoo galaxies”

1. Michael Peck says :

Hi Ramin:

Thanks for the post. Do you have a reference for a good cookbook discussion (or publicly available code) of how to calculate “marked correlation functions”?

Earlier this year I did a number of posts on the environment of galaxies that were selected as recently quenched for the currently dormant Quench project.

I used density measures from the group catalog of Tempel et al. 2014, but I’m concerned about some systematics that I don’t understand in their density estimates.

Also, it might be useful (a) to eliminate binning as much as possible, by for example using vote fractions for morphological features instead of thresholds, and (b) to integrate the final product into previous GZ related publications as much as possible.

Mike Peck

• raminskibba says :

Hi Mike,

Thanks for the reply, and thanks for your interest. Yes, there are many different ways to probe galaxy environments and environmental correlations. If you’re interested, I recommend checking out this detailed comparison paper by Muldrew et al.: http://arxiv.org/abs/1109.6328

The Tempel et al. group catalog should be pretty good, but any density estimate will involve some assumptions and systematics. I agree that it’s best to reduce or eliminate binning when possible, and mark correlation functions, or mark statistics in general, are useful for that. If you look in my Galaxy Zoo papers that I linked to in the post, you’ll find some details about how to calculate these statistics. You might also be interested in this paper I wrote last year: http://arxiv.org/abs/1211.0287

-Ramin

2. Rudolf Baer says :

I have done a large number of classifications on projects like Galaxy Zoo, Milky Way & Radio Galaxy and earlier projects. Are the data accessible anywhere? (On one project the data were available, but only for a very short period.) I am sure there are many contributors who could assist or actually perform data analysis. R. Baer

• raminskibba says :

That’s great. Thanks for your contributions! Galaxy Zoo catalogues are publicly available if you’re interested, but I’m not sure about the status of the Milky Way and Radio Galaxy projects’ data.

• zutopian says :

@Rudolf Baer :

GZ Quench is the project where volunteers can analyze data.
Here is the related blog post, which is dated 2nd Aug 2013: https://blog.galaxyzoo.org/2013/08/02/gz-quench-classification-complete-now-the-real-fun-begins/
The project isn’t completed yet. So you might want to participate.

• raminskibba says :

Thanks, I appreciate it. It definitely sounds like an interesting project. I’ve heard of GZ Quench from Laura Trouille, and I’d be happy to participate in it later on. (The project is paused until the end of July.)

3. Michael Peck says :

Ramin:

Thanks. The book “Machine Learning…” by Ivezic et al. has a brief introduction to n-point correlation functions plus working Python code. Looks like as good a place as any to start learning about them (plus your references for a deeper look).

Mike

• raminskibba says :

Mike, yes, that new book by Ivezic et al. is an excellent reference, and it’s certainly a good place to start. I don’t recall whether they included a description of marked 2-point correlation functions in the book, though. My papers and the references within them should give you the information you need about these statistics.

4. Sociology Essay says :

I used density measures from the group catalog of Tempel et al. 2014, but I’m concerned about some systematics that I don’t understand in their density estimates.

• raminskibba says :

Every group catalog (including Tempel et al.) and every type of density measure has some uncertainties and systematics; there is no such thing as a perfect one. I’m not familiar with the details of the Tempel et al. density measures. For a comparison of some density measures, I recommend looking at this paper: http://arxiv.org/abs/1109.6328

Galaxy clustering statistics (including mark correlation functions) are usually very robust, though when one uses models to interpret them, systematics are a concern there as well.