Mining For Galaxies Among Millions Of Galaxy Images

Published by Lior Shamir

Lawrence Technological University

These findings are described in the article entitled A hybrid approach to machine learning annotation of large galaxy image databases, recently published in the journal Astronomy and Computing (Astronomy and Computing 25 (2018) 257-269). This work was conducted by E. Kuminski and L. Shamir from Lawrence Technological University.

The invention of the telescope is arguably the most important milestone in astronomy research: the ability to see deeper into the night sky allowed astronomers to gain a much better understanding of astronomical phenomena. For instance, observing the moons of Jupiter orbiting their planet showed Galileo that not every celestial body orbits the Earth, which supported the idea that the Earth orbits the Sun.


Since the time of Galileo, telescopes have changed significantly, improving their ability to collect light and see much deeper into space. In addition to gaining size and magnification power, some modern telescopes have changed in a different way: they have turned into robots. These robotic telescopes scan the sky autonomously every night, collecting information about a huge number of astronomical objects. For instance, the Sloan Digital Sky Survey (SDSS) was launched in 2000 and covered 35% of the entire sky, collecting information for over 500 million astronomical objects. While that might seem like a lot of information, SDSS is in fact tiny compared to the far more powerful telescopes that are currently being built. For instance, the Large Synoptic Survey Telescope (LSST), which is expected to see first light in 2022, will collect as much information as SDSS collected in total about once every three days. It is expected to operate for 10 years and to generate the world’s largest public database.

These new astronomical instruments provide a completely new way of observing the universe, with opportunities for discoveries that were never possible before. In fact, we can reasonably assume that many discoveries of paramount scientific interest are already hidden inside the large databases generated by existing robotic telescopes, and more discoveries will surely be made as robotic telescopes become even more powerful. But to make these discoveries, we first need to be able to analyze these databases. Because they are so big, manual analysis is not possible, so we need computers and artificial intelligence for that purpose.

One of the more challenging tasks for which we can use machines is the automatic identification of galaxies that are visually similar to a particular galaxy in which an astronomer is interested. If astronomers identify a rare type of galaxy, they might want to study it by finding more galaxies like it, so that they can profile that galaxy type by observing and comparing multiple instances. For instance, if an astronomer has just one instance of a rare galaxy type and identifies a certain feature in it, they cannot know whether that feature is characteristic of the galaxy type or just happens to be there by chance. With many instances to compare, they can check whether that feature also recurs in the other galaxies of that rare type. In a database of 10 billion galaxies, such as that of LSST, we can assume that even an extremely rare one-in-a-million type of galaxy will appear about 10,000 times. The problem is finding them: astronomers clearly cannot search a database of that size manually, so computers are needed.
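The estimate of about 10,000 instances is a simple expected-count calculation: the size of the database divided by how rare the galaxy type is. A minimal sketch in Python, using only the numbers quoted in the text (the function name is hypothetical and chosen for illustration):

```python
def expected_instances(database_size: int, one_in: int) -> float:
    """Expected number of objects of a type that occurs once per `one_in` objects."""
    return database_size / one_in


lsst_galaxies = 10_000_000_000  # about 10 billion galaxies, as quoted in the text
rarity = 1_000_000              # a one-in-a-million galaxy type

print(expected_instances(lsst_galaxies, rarity))  # prints 10000.0
```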

For that purpose, we developed a machine vision algorithm that can search these databases. The algorithm takes a galaxy image as its input, scans through a large number of galaxies, and identifies those that are most similar to the galaxy in which the astronomer is interested. It can also use other information, such as the galaxy's light measurements (photometry), to refine the search.
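To make the idea of a query-by-example search concrete, here is a minimal sketch in Python. It is not the algorithm from the paper. It assumes each galaxy image has already been reduced to a list of numbers (a feature vector), measures similarity by plain Euclidean distance, and optionally discards galaxies whose photometric color differs too much from the query. The function name, the parameters, and the randomly generated catalog are all hypothetical and used only for illustration.

```python
import math
import random


def rank_by_similarity(query, catalog, top_k=10,
                       query_color=None, catalog_colors=None, max_color_diff=None):
    """Return the indices of the top_k catalog entries closest to the query.

    Illustrative only: similarity is Euclidean distance between feature
    vectors, and the color cut is a stand-in for a photometric refinement.
    """
    use_color = (query_color is not None and catalog_colors is not None
                 and max_color_diff is not None)
    candidates = []
    for index, features in enumerate(catalog):
        # Optional refinement: skip galaxies whose color is too different.
        if use_color and abs(catalog_colors[index] - query_color) > max_color_diff:
            continue
        candidates.append((math.dist(query, features), index))
    # Smallest distance first, i.e. most similar galaxies first.
    candidates.sort()
    return [index for _, index in candidates[:top_k]]


# Made-up data: 2,000 "galaxies", each described by 16 random features.
random.seed(0)
catalog = [[random.gauss(0, 1) for _ in range(16)] for _ in range(2000)]

# A query that is almost identical to galaxy number 42.
query = [value + 0.01 for value in catalog[42]]

print(rank_by_similarity(query, catalog, top_k=3))
```

In this toy setup the first index returned is 42, the galaxy the query was derived from. In a real search the feature vectors would come from image analysis, and the shortlist would still need to be inspected by an astronomer.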


Machine vision algorithms are still not perfect, so the galaxies the algorithm returns often include some that are not actually similar to the query galaxy. But even with this “noise,” it reduces the amount of data by a factor of about 1,000, which makes a manual search far more practical. For instance, instead of searching among two million images, the astronomer needs to search among only two thousand. It is expected that the algorithm will improve in the future and become more accurate and, therefore, more useful for very large databases of galaxies.

About The Author

Lior Shamir is an associate professor of computer science at Lawrence Technological University.