Article: Confronting the hazards arising from PM2.5 pollutant concentration by applying regression methods and spatio-temporal similarity, and estimating missing values in their time series (case study: Tehran)

مدیریت مخاطرات محیطی (Environmental Hazards Management), Autumn 1399, Volume 7, Issue 3, Rank B (Ministry of Science/ISC) (14 pages, from 299 to 312)

Keywords: PM2.5 pollutant, PM2.5 concentration, single and multiple imputation, missing data, missing values, DTW similarity criterion, hazards

Abstract:

Given the adverse effects of pollutants on the environment and human health, the analysis of air quality data is of great importance for protecting the environment and confronting air pollution problems. Missing data in time series, particularly in air pollution data, pose a special challenge for the analysis of these data, which makes evident the need for methods known as imputation to deal with this phenomenon. Missing values reduce the volume of data, alter the temporal patterns present in the data, and lead to erroneous conclusions in data analysis. In this study, to impute the missing values in the pollutant concentration time series from 12 pollution monitoring stations in Tehran, a hybrid method is introduced that is based on regression imputation and takes into account the spatial and temporal dependence and similarity between stations by means of the dynamic time warping algorithm. To evaluate the performance of the imputation models, data with missing values following a pattern similar to that of the original data were simulated at missingness levels of 10, 15, and 20 percent. The proposed method was then run in combination with various multiple imputation methods, such as classification and regression trees, random sample, and predictive mean matching, and the results were compared with single imputation methods. The results indicate the superiority of the introduced method in combination with tree regression over the other multiple and single imputation methods.

Introduction: With the growing industrialization of cities, air pollution has become one of the most serious environmental hazards in the world's largest cities, including Tehran. Because pollutants harm the environment and human health, the analysis of air quality data plays an important role in protecting the environment and tackling air pollution problems. Over the last decade, pollution monitoring stations in different cities of the country have collected a large volume of air quality data on the concentration of pollutants in the atmosphere. For reasons such as calibration, maintenance, device errors, and processing errors, these data contain missing values at different intervals. The missing values cause problems in data analysis and make decisions based on the data more difficult. Missing data is a common problem in time series, and introducing efficient models and methods for handling it is an effective step towards reducing bias and increasing the power of air pollution models.

Materials and Methods: This paper uses pollutant concentration data recorded at 12 air quality monitoring stations, which are operated by the air quality control company. Data were collected on an hourly basis from Dec. 7, 2016 to Feb. 27, 2019 through the air quality control site. The purpose of this paper is to introduce an innovative method that, when imputing the missing values at each pollution monitoring station, exploits the spatial correlation between the time series of stations that behave similarly. In the first step, the spatio-temporal similarity between the pollutant concentration time series of the stations is calculated pairwise using dynamic time warping.
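The pairwise similarity step can be illustrated with a minimal sketch of the classic dynamic time warping recurrence. This is a plain-Python illustration and not the authors' code (the study reports using R). The station names, the toy values, the absolute-difference cost, and the helper `rank_by_similarity` are all assumptions made for the example, and the series are assumed to contain no gaps.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance with an absolute-difference local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rank_by_similarity(target, stations):
    """Hypothetical helper: other station names, most similar (smallest DTW) first."""
    others = (name for name in stations if name != target)
    return sorted(others, key=lambda name: dtw_distance(stations[target], stations[name]))


# Toy hourly concentrations for three hypothetical stations.
stations = {
    "A": [10, 12, 15, 14, 11],
    "B": [10, 10, 12, 15, 14],  # same shape as A, shifted by one hour
    "C": [30, 28, 25, 27, 31],
}
print(rank_by_similarity("A", stations))
```

Because DTW allows one series to be stretched against the other, station B, which follows the same pattern as A with a one-hour lag, ranks as more similar to A than station C.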
Then, for imputation at each target station, the dependence on the stations most similar to that station is used. In the second step, an initial complete data set is formed by deleting the missing values at each station. In the third step, missing values are introduced into the complete data at rates of 10, 15, and 20%, following a pattern similar to that of the original missing data. The fourth step involves implementing and comparing different multiple and single imputation algorithms to fill in the missing data. Finally, the performance of the imputation methods is evaluated using the introduced indicators.

Results and Discussion: The multiple imputation algorithms (predictive mean matching, classification and regression trees, and random sample) and the single imputation algorithms (such as interpolation methods and last observation carried forward) were implemented in the R programming language. The CART imputation method showed the best performance among the multiple imputation methods, with an R-squared of 0.66 and a correlation coefficient of 0.8 at 10% missing values, 0.6 and 0.76 at 15%, and 0.58 and 0.75 at 20%. As the percentage of missing values increases, the evaluation criteria deteriorate. The predictive mean matching method and the random sample method performed similarly to each other and worse than the tree regression method. Based on all three evaluation criteria, the linear interpolation method outperformed the other methods introduced, so among the single methods it is the most appropriate for these data. The spline interpolation method showed the weakest performance among all multiple and single imputation methods.
At 10% missing data, the linear interpolation method has a higher coefficient of determination and correlation and a lower error than the tree regression method. However, linear interpolation performs very well only when the missing intervals are short. When the missing interval grows, for example at 20% missing data, such methods cannot impute the lost data well: they assign a fixed value, or a value with little variation, to all the missing values in each interval.

Conclusion: Missing data in the pollutant concentration time series degrade the performance of data analysis in machine learning algorithms and introduce bias. The results show that determining the spatio-temporal similarity of stations with the dynamic time warping algorithm, and using the pattern of similar stations in combination with regression-based methods, improves model performance for long missing intervals, and that the tree regression model is the most suitable method for multiple imputation. Single imputation methods, though fast and simple, depend on the length of the missing interval, and their performance depends on the variable under study. Therefore, single methods are not recommended for air pollution data with long missing intervals. Because other factors such as meteorological parameters affect air pollution, future studies could increase the accuracy of the model by adding these parameters.
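The idea of filling a target station's gaps from a similar station can be sketched with a deliberately simplified stand-in. The example below fits an ordinary least-squares line on the hours where both stations are observed and uses it to predict the target's missing hours. It is not the tree-based (CART) multiple imputation that the study found best, and it is not the authors' code; the function names and toy values are assumptions for illustration only.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def regression_impute(target, neighbor):
    """Fill None entries of `target` from `neighbor` via a line fitted on jointly observed hours."""
    pairs = [(n, t) for t, n in zip(target, neighbor)
             if t is not None and n is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = []
    for t, n in zip(target, neighbor):
        if t is not None:
            filled.append(t)                      # keep observed values untouched
        elif n is not None:
            filled.append(slope * n + intercept)  # predict from the similar station
        else:
            filled.append(None)                   # both missing: leave the gap
    return filled


# Hypothetical target station with a three-hour gap and its most similar neighbor.
target = [20.0, 24.0, None, None, None, 30.0, 26.0]
neighbor = [10.0, 12.0, 16.0, 18.0, 17.0, 15.0, 13.0]
print(regression_impute(target, neighbor))
```

Unlike interpolation within a single series, the imputed values inside a long gap follow the neighbor's variation rather than a straight line between the gap's endpoints, which is the behavior the conclusion attributes to the regression-based approach.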
