Prediction of stellar temperature and metallicity through machine learning
(This project was carried out under the AI Saturdays Madrid programme, by Guillermo Ayllón Pérez, David García Rodríguez, Jorge Rodríguez López and Alonso J. Campos Hdez.)
In this machine learning project we sought to predict several properties of stars by analysing their electromagnetic spectra with machine learning and data science methods.
The electromagnetic spectrum of a star (or any other body) indicates the intensity of light emission at each wavelength. Wavelengths denote the specific “colour” of each small amount of light emitted by the body, and each wavelength has a different intensity of emission: the higher the intensity at a wavelength, the greater the amount of light arriving at that wavelength.
Electromagnetic spectra, and specifically the intensities of the light emitted at each wavelength, are known to change strongly with the temperature of the emitting body, as described by Black-Body Radiation. Deviations from the ideal Black-Body Radiation manifest as lowered intensities at certain wavelengths or regions: these are known as absorption peaks (or lines) or regions, where certain chemical elements in the star, or along the path the light has travelled, have absorbed the light at those wavelengths.
Some of that absorption is caused by interstellar dust, which can decrease the quality of the data. More importantly, chemical elements present in the star also absorb some wavelengths, and the amount of absorption can be used, though with great difficulty, to estimate the abundance of those elements. The most important of these absorbing elements is arguably iron, as it is one of the main elements produced by stars, and it is thus used as an overall estimate of the total content of metals found in them. This overall content is known as the metallicity, and it is important in studying topics such as stellar aging or exoplanet formation.
Thus, in this project we aim to predict the temperature and metallicity of stars based on their electromagnetic spectra, as both are encoded inside them through the Black-Body Radiation and the iron absorption peaks.
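To make the temperature dependence concrete, here is a minimal sketch of Planck's law for an ideal black body, evaluated with NumPy. The wavelength grid and the three temperatures are illustrative choices, not values taken from the project, and real stellar spectra deviate from this ideal curve through the absorption features discussed above.

```python
import numpy as np

# Physical constants (SI units).
H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m / s
K_B = 1.380649e-23   # Boltzmann constant, J / K


def planck_spectral_radiance(wavelength_m, temperature_k):
    """Black-body spectral radiance B_lambda(T) in W sr^-1 m^-3."""
    wavelength_m = np.asarray(wavelength_m, dtype=float)
    exponent = H * C / (wavelength_m * K_B * temperature_k)
    return (2.0 * H * C**2 / wavelength_m**5) / np.expm1(exponent)


# Wavelength grid from 100 nm to 3000 nm (illustrative range).
wavelengths = np.linspace(100e-9, 3000e-9, 30_000)

# Locate the emission peak for a few illustrative temperatures.
peaks_nm = {}
for temperature in (3500.0, 5800.0, 10000.0):
    radiance = planck_spectral_radiance(wavelengths, temperature)
    peaks_nm[temperature] = wavelengths[np.argmax(radiance)] * 1e9
    print(f"T = {temperature:7.0f} K -> peak near {peaks_nm[temperature]:.0f} nm")
```

The peak moves to shorter wavelengths as the temperature rises (Wien's displacement law), which is the kind of global change in spectral shape a regressor can exploit when predicting temperature.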
Data download, EDA and preprocessing
The data were gathered from the SDSS database of stellar spectra. The downloaded data were of two kinds, optical and infrared. Optical spectra correspond to those wavelengths which we can see with our eyes, while infrared spectra correspond to the region of the electromagnetic spectrum just less energetic than the optical region. Each dataset had advantages and drawbacks: the optical spectra ought to be better for temperature and metallicity prediction, but the infrared spectra should have less noise, as they are less vulnerable to contamination from interstellar dust, etc. The datasets comprised 50,000 stellar spectra each.
Apart from the spectra (which are to be the data to process), several labels were obtained for each star: temperature and metallicity values (the two variables to be predicted), the stars’ gravity (as an indicator of star size, and measured as the logarithm of the gravity in units of the Sun’s gravity) and the stars’ spectral type, which is useful in classifying the stars to spot patterns.
The exploratory data analysis (EDA) of the datasets revealed several deficiencies to be addressed in the data preprocessing. For example, a small number of stars had negative temperatures (which is impossible), and the spectral types were too detailed, separating the stars into subtypes. Additionally, the infrared spectra showed intervals where only values of zero intensity were recorded. This was due to the mode of operation of the telescope which gathered the data, which had to shut down periodically.
Thus, the entries with nonsensical values for the temperature and other labels were eliminated, and the spectral subtypes were condensed into the seven standard types (O, B, A, F, G, K, M). The regions of the infrared spectra with no intensity were removed so as to clean the data. To finish the preprocessing, the optical spectra were also normalized between 0 and 1 (the infrared spectra were already normalized on download).
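The cleaning steps above can be sketched with NumPy on a toy array. This is only an illustration under stated assumptions: the function name, the array layout, the per-star min-max scaling and the rule of taking the first character of the subtype are all hypothetical, and the sketch applies every step to a single array, whereas in the project the zero-intensity removal concerned the infrared spectra and the normalization concerned the optical ones.

```python
import numpy as np


def clean_and_normalise(spectra, temperatures, spectral_types):
    """Apply the cleaning steps described above to toy arrays.

    spectra:        (n_stars, n_wavelengths) array of intensities
    temperatures:   (n_stars,) array of temperature labels in kelvin
    spectral_types: list of subtype strings such as "G2" or "M5"
    """
    spectra = np.asarray(spectra, dtype=float)
    temperatures = np.asarray(temperatures, dtype=float)

    # 1. Drop stars whose temperature label is physically impossible.
    valid = temperatures > 0
    spectra, temperatures = spectra[valid], temperatures[valid]

    # 2. Condense subtypes to the main class (assumes the first character
    #    of each label is one of O, B, A, F, G, K, M).
    main_types = [t[0] for t, keep in zip(spectral_types, valid) if keep]

    # 3. Remove wavelength bins where every remaining star has zero intensity.
    recorded = ~np.all(spectra == 0, axis=0)
    spectra = spectra[:, recorded]

    # 4. Min-max scale each spectrum to [0, 1] (per-star scaling is an assumption).
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    spectra = (spectra - lo) / span

    return spectra, temperatures, main_types, recorded


# Toy data: three stars, four wavelength bins, one dead bin, one bad label.
toy_spectra = np.array([
    [2.0, 0.0, 4.0, 6.0],
    [1.0, 0.0, 3.0, 2.0],
    [5.0, 0.0, 5.0, 9.0],
])
toy_temperatures = np.array([5800.0, -9999.0, 3600.0])  # second value is a made-up invalid label
toy_types = ["G2", "K1", "M5"]

clean_spectra, clean_temps, clean_types, kept_bins = clean_and_normalise(
    toy_spectra, toy_temperatures, toy_types
)
print(clean_spectra)
print(clean_types)
```

The boolean mask of kept bins is returned so that the same wavelength selection can be applied to the wavelength axis itself.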
With this, the preprocessing was considered satisfactory, and we proceeded to the data analysis and modelling, after making some new plots to visualize the cleaned data. In the pairplots of Fig. 3, the correlations between the various labels available can be seen. Particularly interesting is the temperature vs gravity pairplot, as it shows two groups of M-type stars separated by a huge difference in gravity: these are the dwarf and giant M stars (or red dwarf and giant stars, since M-types look red to the eye).
Data modelling and machine learning
After data preprocessing and EDA, we proceeded to analyse our data using several machine learning techniques. We focused on the use of random forest (RF) regressors and neural networks (NN) in order to accurately predict the temperature and metallicity of the stars. We chose these methods because they are among the most common and flexible for regression problems, as they are capable of dealing with multiple kinds of data, handling non-linearities, etc. The standard Python libraries scikit-learn and Keras were used for this.
We ran random search cross-validation on the parameters of our random forest regressor so as to optimize its performance on our data. Similarly, after much trial and error we finally settled on a neural network consisting of four “Dense” layers, with 128 neurons per layer and Leaky ReLU as the activation for all, followed by a one-neuron final layer with no activation. The Adam optimizer was used to run the gradient descent. The neural network and random forest models were then run using both the infrared and optical datasets.
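The network described above could be written in Keras roughly as follows. This is a sketch, not the project's actual code: the function name, the input size of 64, the mean-squared-error loss and the default Leaky ReLU slope are assumptions, since the text only specifies the layer sizes, the activation and the optimizer.

```python
import keras
from keras import layers


def build_regressor(n_inputs, n_hidden_layers=4, units=128):
    """Dense network: four Leaky ReLU hidden layers, one linear output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(units))   # no activation here...
        model.add(layers.LeakyReLU())    # ...Leaky ReLU applied as its own layer
    model.add(layers.Dense(1))           # single output neuron, no activation
    # Adam as in the text; the MSE loss is an assumption.
    model.compile(optimizer="adam", loss="mse")
    return model


# n_inputs is a placeholder for the number of wavelength bins per spectrum.
model = build_regressor(n_inputs=64)
model.summary()
```

Leaving the last layer without an activation lets the network output any real value, which suits a regression target such as temperature in kelvin.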
For the prediction of the stars’ temperature, we found the best results when working with the optical spectra and the neural network, compared with the other combinations. This combination had an R-squared value of 0.987 and a root mean squared error (RMSE) of 114.10 K. The results obtained using the random forest regressor were only slightly worse.
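For the random forest side, the random-search tuning and the two metrics quoted above can be sketched with scikit-learn. The data here are synthetic stand-ins generated on the spot, and the search space, number of iterations and number of folds are hypothetical, since the original hyperparameter ranges are not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for spectra: 400 "stars", 20 "wavelength bins".
X = rng.normal(size=(400, 20))
# Synthetic "temperature" depending non-linearly on two of the bins.
y = 5000.0 + 800.0 * X[:, 0] + 300.0 * X[:, 1] ** 2 + rng.normal(scale=50.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space (not the one used in the project).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [0.3, 0.6, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                                # number of random configurations tried
    cv=3,                                    # 3-fold cross-validation per configuration
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on held-out data.
y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"best params: {search.best_params_}")
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.1f}")
```

R-squared compares the model against always predicting the mean of the targets (which scores 0), while the RMSE is expressed in the units of the target, so for temperature it reads directly in kelvin.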
As for the metallicity, we only considered using the optical dataset, since the metallicity is measured using the iron absorption lines, and particularly Fe I, at around 438.36 nm. This line is within the optical region of the electromagnetic spectrum and therefore we only used those data. The deeper the absorption line, the bigger the iron (and thus metal) content of the star.
In order to remove all unnecessary data and noise, we cut the optical spectra around the Fe I line, removing all other wavelengths. This greatly reduced the number of variables to process, speeding up computations and eliminating irrelevant information from the problem. We also artificially increased the standard deviation of the cut spectra so as to magnify the depth of the iron absorption line relative to the baseline, with the aim of helping the NN and RF carry out the regression.
The transformation applied consisted of subtracting the average value of the spectrum from all its datapoints and then multiplying the resulting values by a given factor (values from 10 to 1000 were tried) to artificially increase the standard deviation.
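The two steps just described, cutting a window around the Fe I line and then mean-centring and scaling, can be sketched as follows. The window half-width of 1 nm, the helper names and the toy Gaussian dip are assumptions made for illustration; only the line position and the range of scaling factors come from the text.

```python
import numpy as np

FE_I_NM = 438.36  # Fe I line position quoted in the text


def cut_window(wavelengths_nm, flux, centre_nm=FE_I_NM, half_width_nm=1.0):
    """Keep only the bins within half_width_nm of the line centre."""
    mask = np.abs(wavelengths_nm - centre_nm) <= half_width_nm
    return wavelengths_nm[mask], flux[..., mask]


def amplify(flux, factor):
    """Subtract each spectrum's mean, then multiply by a constant factor."""
    centred = flux - flux.mean(axis=-1, keepdims=True)
    return centred * factor


# Toy spectrum: flat continuum with a Gaussian dip at the Fe I position.
wavelengths_nm = np.linspace(430.0, 446.0, 1601)
flux = 1.0 - 0.2 * np.exp(-0.5 * ((wavelengths_nm - FE_I_NM) / 0.1) ** 2)

window_nm, window_flux = cut_window(wavelengths_nm, flux)
scaled = amplify(window_flux, factor=100.0)

print(f"bins kept: {window_nm.size} of {wavelengths_nm.size}")
print(f"std before: {window_flux.std():.4f}, after: {scaled.std():.4f}")
```

Multiplying by a constant stretches the standard deviation by exactly that factor but adds no new information. Note also that tree-based models split on the ordering of values within each feature, so a global multiplicative factor on its own leaves a random forest's splits unchanged.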
However, this did not work out, for either the random forest or the neural networks. RMSEs were massive, and the values of R-squared obtained were around -0.01 (i.e., slightly worse than predicting by taking the mean of all values).
With some scrutiny, it became clear that the metallicity data are not good enough for carrying out the regression. The metallicity values of the stars had essentially no correlation to the depth of the absorption lines (their correlation coefficient was -0.0869), which helps explain the extremely poor results obtained with the random forest and neural network.
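A check of this kind can be sketched with NumPy's Pearson correlation. The depth definition used here (median continuum minus the minimum inside the line) is an assumption, as the original definition is not stated, and the toy spectra are constructed so that depth follows a made-up metallicity label.

```python
import numpy as np


def line_depth(flux, line_mask):
    """Depth of the line: continuum level minus the minimum inside the line.

    The continuum is estimated as the median flux outside the line
    (an assumption; the original depth definition is not stated).
    """
    continuum = np.median(flux[:, ~line_mask], axis=1)
    return continuum - flux[:, line_mask].min(axis=1)


rng = np.random.default_rng(0)
n_stars, n_bins = 200, 41
line_mask = np.zeros(n_bins, dtype=bool)
line_mask[18:23] = True  # bins covering the toy absorption line

# Toy spectra whose line depth is set by a made-up "metallicity" label.
metallicity = rng.normal(loc=-0.5, scale=0.4, size=n_stars)
true_depth = 0.3 + 0.2 * metallicity
flux = 1.0 + rng.normal(scale=0.01, size=(n_stars, n_bins))
flux[:, line_mask] -= true_depth[:, None]

depths = line_depth(flux, line_mask)
r = np.corrcoef(metallicity, depths)[0, 1]
print(f"Pearson r = {r:.3f}")
```

In this toy setup the correlation is strong by construction. A coefficient close to zero on the real data, as reported above, indicates that the measured depths carry almost no linear information about the labels, so running such a check before training is a cheap way to spot the problem early.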
In conclusion, the prediction of the temperatures of the stars was successful, as an extremely high R-squared value (0.987) was obtained. However, the attempts to predict the metallicity failed, despite the data processing and transformation operations that were applied. This was explained by the near-total lack of correlation between the metallicity values and the absorption peak depths.