Abstract:
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator, which is used to link unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. Its two most critical parameters for obtaining the best linking results are the value of the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The research results presented here demonstrate a method for using statistics from the token frequency distribution to estimate the threshold value and stop word selection likely to give the best linking results. The model was built using linear regression and validated with independent datasets.
Machine summary:
Keywords: Entity resolution, Record linking, Matrix comparator, Stop words, Token frequency, F-measure / DOI: 10.
The organization of this paper is as follows: Section II describes the logic of the Matrix Comparator for performing entity resolution (ER) on unstandardized and heterogeneously standardized references; Section III summarizes previous research on the effectiveness of using stop words to improve the quality of ER results produced by the matrix comparator; Section IV describes new research on predicting the values of the critical parameters of the matrix comparator, in particular the matching threshold value and the list of stop words; Section V assesses the prediction model; and Section VI presents the conclusion and future work.
Logic of the Matrix Comparator
The matrix comparator is an ER method that avoids the need to transform references into a standard layout before processing (Li, Talburt, & Li, 2018).
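To make the idea of an aggregate token similarity concrete, the following is a minimal sketch and not the published algorithm. It assumes whitespace tokenization, uses Python's `difflib.SequenceMatcher` as a stand-in for whatever token-level measure the paper uses, pairs tokens greedily from the similarity matrix, and normalizes by the longer token list. The function names (`tokenize`, `token_similarity`, `matrix_similarity`, `is_link`) are hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(reference, stop_words=frozenset()):
    """Split a reference into lowercase tokens and drop stop words."""
    return [t for t in reference.lower().split() if t not in stop_words]


def token_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def matrix_similarity(ref_a, ref_b, stop_words=frozenset()):
    """Aggregate token similarity via greedy one-to-one matching (assumption)."""
    tokens_a = tokenize(ref_a, stop_words)
    tokens_b = tokenize(ref_b, stop_words)
    if not tokens_a or not tokens_b:
        return 0.0
    # Every cell of the token-by-token matrix, sorted best-first.
    cells = sorted(
        ((token_similarity(a, b), i, j)
         for i, a in enumerate(tokens_a)
         for j, b in enumerate(tokens_b)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for score, i, j in cells:
        # Each token may be paired at most once.
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += score
    # Normalize by the longer token list so unmatched tokens lower the score.
    return total / max(len(tokens_a), len(tokens_b))


def is_link(ref_a, ref_b, threshold, stop_words=frozenset()):
    """Link two references when their aggregate similarity reaches the threshold."""
    return matrix_similarity(ref_a, ref_b, stop_words) >= threshold
```

Because tokens are compared regardless of position, two references holding the same words in a different order score the same as identical references, which is why no standard layout is needed. The sketch also exposes the two parameters discussed in this paper: `threshold` and `stop_words`.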
Given that the token frequency distribution of a set of references has a significant standard deviation and top ratio, what matching threshold and how many stop words should be used to obtain the best linking results?
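The statistics named in this question can be computed directly from the references. The sketch below is illustrative only. It assumes whitespace tokenization and a population standard deviation, and because the text here does not define "top ratio", it assumes it to be the share of all token occurrences taken by the `top_k` most frequent tokens. The function name `token_frequency_stats` and the parameter `top_k` are hypothetical.

```python
from collections import Counter
from statistics import pstdev


def token_frequency_stats(references, top_k=1):
    """Summarize the token frequency distribution of a set of references."""
    counts = Counter(tok for ref in references for tok in ref.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    if not freqs:
        return {"std_dev": 0.0, "top_ratio": 0.0, "ranked_tokens": []}
    return {
        # Spread of token frequencies (population standard deviation).
        "std_dev": pstdev(freqs),
        # Assumed definition: share of occurrences held by the top_k tokens.
        "top_ratio": sum(freqs[:top_k]) / sum(freqs),
        # Tokens ordered from most to least frequent.
        "ranked_tokens": [tok for tok, _ in counts.most_common()],
    }
```

If stop words are drawn from the most frequent tokens (also an assumption), a predicted stop word count `n` would select `ranked_tokens[:n]`. The regression model that maps these statistics to a threshold and a stop word count is not reproduced here.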