Abstract:
Given the adverse effects of pollutants on the environment and human health, the analysis of air quality data is of great importance for protecting the environment and confronting air pollution problems. Missing data in time series, especially in air pollution data, pose a particular challenge for analysis and make imputation methods necessary. Missing values reduce the volume of data, alter the temporal patterns present in the data, and lead to wrong conclusions in data analysis. In this study, to impute missing values in pollutant concentration time series from 12 pollution monitoring stations in Tehran, a hybrid method based on regression imputation is introduced that accounts for spatial and temporal dependence and similarity between stations by means of the dynamic time warping algorithm. To evaluate the performance of the imputation models, data sets with 10, 15 and 20 percent missing values were simulated, following a pattern similar to that of the original data. The proposed method was then run in combination with different multiple imputation methods, such as classification and regression trees, random sample, and predictive mean matching, and the results were compared with single imputation methods. The results indicate that the proposed method combined with tree regression outperforms the other multiple and single imputation methods.
Introduction: With the growing industrialization of cities, air pollution has become one of the most serious environmental hazards in the world's largest cities, including Tehran. Because pollutants harm both the environment and human health, the analysis of air quality data plays an important role in protecting the environment and tackling air pollution problems. Over the last decade, pollution monitoring stations in different cities of the country have collected a large volume of air quality data, including the concentrations of pollutants in the atmosphere. For various reasons, such as calibration, maintenance, device errors, and processing errors, these data contain missing values at different intervals. The missing values cause problems in data analysis and make decisions based on these data more difficult. Missing data are a common problem in time series analysis, and introducing efficient models and methods for handling them is an effective step towards reducing bias and increasing the power of air pollution models.
Materials and Methods: This paper uses pollutant concentration data recorded at 12 air quality monitoring stations, which are operated by the air quality control company. Data were collected on an hourly basis from Dec. 7, 2016 to Feb. 27, 2019 through the air quality control site. The purpose of this paper is to introduce an innovative method that brings spatial correlations into the imputation of missing values at each pollution monitoring station, using the time series of stations that behave similarly. In the first step, the spatio-temporal similarity between the pollutant concentration time series of the stations is calculated pairwise with dynamic time warping.
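The pairwise similarity step can be illustrated with a minimal Python sketch of dynamic time warping. The study itself reports using R, so this is not the authors' code. The function names `dtw_distance` and `most_similar`, the absolute-difference cost, the station labels, and the numbers are all assumptions made for illustration, and the series are assumed to contain no missing values.

```python
from math import inf


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def most_similar(target, stations, k=1):
    """Names of the k stations with the smallest DTW distance to `target`."""
    ranked = sorted(
        (dtw_distance(stations[target], series), name)
        for name, series in stations.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]


# Hypothetical hourly concentrations (made-up numbers, not from the study).
stations = {
    "A": [10, 12, 15, 14, 11],
    "B": [10, 10, 12, 15, 14],  # same shape as A, shifted by one hour
    "C": [30, 28, 25, 27, 31],
}

print(dtw_distance(stations["A"], stations["B"]))  # 3.0
print(most_similar("A", stations))                 # ['B']
```

Station B has the same shape as A shifted by one hour, so its DTW distance to A is far smaller than that of C, and B would be selected as the predictor station for A.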
For imputation at each target station, the stations most similar to it are then used as predictors. In the second step, an initial complete data set is formed by deleting the missing values at each station. In the third step, missing values are introduced into this complete data set at rates of 10, 15 and 20%, following a pattern similar to that of the original missing data. The fourth step involves implementing and comparing different multiple and single imputation algorithms to fill in the missing data. Finally, the performance of the imputation methods is evaluated with the introduced indicators.
Results and Discussion: In this study, the R programming language was used to implement multiple imputation algorithms (predictive mean matching, classification and regression trees (CART), and random sample) as well as single imputation algorithms such as interpolation methods and last observation carried forward. Among the multiple imputation methods, CART imputation performed best, with an R-squared of 0.66 and a correlation coefficient of 0.8 at 10% missing values, 0.6 and 0.76 at 15%, and 0.58 and 0.75 at 20%. As the percentage of missing values increases, the imputation accuracy measured by these criteria decreases. Predictive mean matching and the random sample method performed similarly to each other and worse than the tree regression method. Based on all three evaluation criteria, linear interpolation was better than the other single imputation methods introduced and is therefore the most appropriate single method for the given data. Spline interpolation showed the weakest performance among all multiple and single imputation methods.
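The mask-impute-evaluate loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used R, the abstract does not give the exact formulas of its evaluation criteria, and RMSE and the Pearson correlation are used here as plausible stand-ins. The helper names, the single contiguous gap, and the numbers are hypothetical.

```python
from math import sqrt


def linear_interpolate(series):
    """Fill None gaps on a straight line between the nearest observed values.
    Leading or trailing gaps take the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def locf(series):
    """Last observation carried forward; a leading gap stays None."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def rmse(truth, pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical complete series; hours 2-4 are masked to simulate a gap.
complete = [10, 12, 18, 25, 19, 14, 13, 12]
gap = [2, 3, 4]
masked = [None if i in gap else v for i, v in enumerate(complete)]

truth = [complete[i] for i in gap]
for name, method in [("linear", linear_interpolate), ("locf", locf)]:
    filled = method(masked)
    print(name, round(rmse(truth, [filled[i] for i in gap]), 2))
```

Both single methods miss the peak inside the gap, because neither has any information about what happened during the masked hours. Only the masked positions are scored, since the observed values are reproduced unchanged by construction.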
On data with 10% missing values, linear interpolation has a higher coefficient of determination and correlation and a lower error than the tree regression method. However, linear interpolation performs well only when the gaps are short. When the missing interval grows, for example at 20% missing values, interpolation methods cannot provide a good imputation, because they assign a fixed rate of change, or one with little variation, to all missing values in each gap.
Conclusion: Missing data in pollutant concentration time series degrade the performance of data analysis with machine learning algorithms and cause bias. The results show that determining the spatio-temporal similarity of stations with the dynamic time warping algorithm and using the pattern of similar stations in combination with regression-based methods improves model performance when missing intervals are long, and that the tree regression model is the most suitable method for multiple imputation. Single imputation methods, though fast and simple, depend on the length of the missing interval, and their performance depends on the variable under study. Their use on air pollution data with long missing intervals is therefore not recommended. Because other factors, such as meteorological parameters, affect air pollution, future studies could improve the accuracy of the model by adding these parameters.
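The idea of borrowing the pattern of a similar station can be illustrated with a deliberately simplified Python sketch. It swaps in a single-predictor ordinary least squares line for the tree regression and multiple imputation used in the study, so it shows the principle only, not the reported method. The names `fit_line` and `impute_from_neighbor` and all numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def impute_from_neighbor(target, neighbor):
    """Fill None entries of `target` from a line fitted on the hours where
    both the target and the neighbor station are observed."""
    pairs = [(n, t) for t, n in zip(target, neighbor)
             if t is not None and n is not None]
    a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [a + b * n if t is None and n is not None else t
            for t, n in zip(target, neighbor)]


# Hypothetical data: the neighbor is assumed to be the station that a
# similarity measure such as DTW ranked closest to the target.
neighbor = [10, 20, 30, 40, 50, 60]
target = [12, 22, None, None, 52, 62]

print(impute_from_neighbor(target, neighbor))
# [12, 22, 32.0, 42.0, 52, 62]
```

Unlike interpolation within a single series, the filled values follow the variation of the neighbor station inside the gap, which is why this family of methods is less sensitive to the length of the missing interval.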