Unsupervised Outlier Detection Methods

Outliers could be satellites, imaging artifacts, interacting galaxies, lensed objects, or other unexpected entities. Machine learning methods are broadly split into supervised and unsupervised algorithms. Supervised algorithms require labelled data; in the context of outlier detection this means a dataset in which the outliers are known beforehand, so that a model can be trained to detect outliers in novel data. Since we do not have this information, we are forced to use unsupervised algorithms.

There exist a number of unsupervised outlier detection methods, and in this work we compare the performance and results of a selection of them, namely the Local Outlier Factor , Isolation Forest , k-means, a recently introduced novelty measure  (hereinafter called new novelty or NN), and both a standard and a convolutional autoencoder . We aim to find out how they compare to one another, whether they consider the same objects outlying, and where and why they differ.

These methods can be used in two distinct ways. The first is detecting the most abnormal points in the dataset with which the models are trained. These can be singular observations unlike any other in the dataset, or small clusters of relatively normal data that lie far from the majority of the dataset in feature space; both types are considered outliers. The second way is to treat the dataset on which the methods are trained as normal data, and to use the trained models to find the observations in a new set of testing data that are the most outlying with respect to the training data. The majority of this work focuses on finding outliers in the dataset used to train the models, but at the end a model is presented that uses the previously trained models to detect outliers in new data.
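The two usage modes can be illustrated with the scikit-learn implementations of two of the methods named above. This is a minimal sketch, not the pipeline of this work: the random feature matrices stand in for the actual (image-derived) features, and the hyperparameters shown are defaults rather than the values used here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))  # stand-in for the training features
X_new = rng.normal(size=(100, 5))    # stand-in for new, unseen data

# Mode 1: rank the most abnormal points within the training set itself.
iso = IsolationForest(random_state=0).fit(X_train)
train_scores = iso.score_samples(X_train)  # lower score = more outlying
worst_ten = np.argsort(train_scores)[:10]  # indices of the 10 strongest outliers

# Mode 2: treat the training set as "normal" and score new observations
# relative to it (novelty detection).
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
new_scores = lof.score_samples(X_new)      # lower = more outlying w.r.t. X_train
```

Note that `LocalOutlierFactor` must be constructed with `novelty=True` to be used in the second mode; with the default `novelty=False` it only supports the first mode, via `fit_predict` on the training data.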

While there are a few papers in the literature on detecting outliers in sky surveys, these often operate on spectroscopic or tabular data. Examples include Chaudhary et al., who propose a new outlier detection method and use 5-dimensional SDSS data as a real-world application. In Fustes et al., self-organizing maps are used to find outlying spectra. Giles et al. use a variant of the DBSCAN clustering algorithm to detect outliers in derived lightcurve features. The work most closely related to ours is Baron et al., in which an unsupervised random forest is used to detect the most outlying galaxy spectra of the SDSS, and its results are compared with a standard random forest, a one-class support vector machine and an isolation forest. The key differences are that we look for outliers with respect to the image data instead of the spectra, and that we use multiple kinds of objects instead of only galaxies.

The remainder of this work is organized as follows. We begin in Sec. with an exploration of the dataset used and the steps taken to prepare it for the outlier detection methods under consideration. Then, in Sec. we describe these methods and assess their performance. The results are compared and analyzed in Sec. , followed by a description of the model for finding outliers in new data in Sec. Finally, directions for future work are provided and a conclusion is given.