Thursday, December 10, 2009


RANKING ASSOCIATION RULES BY TOPSIS METHOD WITH COMBINATION WEIGHTS
ZHEN ZHANG, CHONGHUI GUO
Institute of Systems Engineering, Dalian University of Technology
Dalian, 116023, People’s Republic of China
E-mail: zzupchd@gmail.com; guochonghui@tsinghua.org.cn
Various rules can be generated from databases by association rule algorithms, but only a small number of these rules may be selected for implementation due to limited resources. Thus, evaluating the quality of these rules has become a hot topic in the data mining field. Based on multiple criteria decision making theory, a framework is proposed for ranking mined association rules using the TOPSIS method with combination weights, considering both objective interestingness measures and the users' domain information. An example of market basket analysis is applied to illustrate the applicability of this method.
1. Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. To extract useful information from large databases, data mining has become an essential tool for data analysis. In recent years, data mining has been widely applied in many fields, such as business, science and engineering (Tan, Steinbach, & Kumar, 2006).
Association rule mining, one of the most important techniques in the data mining field, is useful for discovering interesting relationships hidden in large data sets. Various rules can be mined from databases using association rule algorithms (Srikant, Vu, & Agrawal, 1997), but only a small number of these rules may be selected for implementation due to limited business resources (Choi, Ahn, & Kim, 2005). Thus, evaluating the quality of these rules has become a hot topic in the data mining field.
Generally speaking, rule quality can be evaluated according to interestingness measures, which fall into objective measures and subjective measures (Geng & Hamilton, 2006). Objective measures, mostly based on probability, statistics, or information theory, consider only the raw data; examples are the support and confidence used in association rule mining. Subjective measures take into account both the data and the users of the data: they require users either to input domain knowledge manually as constraints or to label rules as interesting or uninteresting by interacting with the data mining system.
Most association rule mining algorithms treat these interestingness measures as constraints for obtaining interesting patterns, but it may be difficult to select appropriate measures before mining is performed. It is therefore more appropriate to post-process the mined association rules. An efficient approach is to rank the mined rules by a set of interestingness measures and subjective criteria so as to obtain the rules that are most interesting to users. Multiple criteria decision making methods can be used to solve such problems. Choi et al. (2005) used the ELECTRE-II approach combined with AHP to prioritize association rules, considering both objective criteria and the subjective preferences of users; they argued that the proposed method creates synergy between decision analysis techniques and data mining. However, this method may be inefficient when a large number of association rules are generated. Chen (2007) extended their work and proposed a DEA-based method to rank association rules: a DEA model first identifies the efficient association rules, and a discriminant model then ranks these efficient rules. Obviously, this method requires solving a considerable number of linear programming models and includes redundant computations. Based on Chen's work, Toloo, Sohrabi, & Nalchigar (2009) used an integrated DEA model that identifies the most efficient association rule by solving just one mixed integer linear program, and proposed a new method for ranking association rules with multiple criteria. Neither of the two DEA-based methods can rank all the association rules, and neither can reflect the importance of the criteria quantitatively, which is important for multiple criteria decision making. Moreover, the computation may be complex when the number of association rules is very large.
Thus, based on the previous work, this paper uses TOPSIS method with combination weights to rank the mined association rules synthetically and proposes a new framework for ranking association rules.
The rest of this paper is organized as follows. In section 2, association rule mining is described briefly. The TOPSIS method with combination weights is presented in section 3, and section 4 introduces a framework for ranking association rules. Next, an example of market basket analysis is applied to illustrate the applicability of our method in section 5, and section 6 concludes the paper.
2. Association Rule Mining
Association rule mining, proposed by Agrawal, Imielinski, & Swami (1993), is a useful technique for discovering relationships between items in databases. It has been widely used in business applications such as cross-marketing, attached mailing, and catalog design (Choi et al., 2005).
Let I = {i1, i2, …, im} denote a set of items, and let D denote a set of transactions, where each transaction t is a set of items such that t ⊆ I. A unique identifier, called TID, is associated with each transaction. An association rule can be formulated as the following implication relationship:
X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Φ.
The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y, and the rule has support s in D if s% of the transactions in D contain X ∪ Y.
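As a concrete illustration, both measures can be computed directly from a transaction set. The sketch below uses a small hypothetical basket of five transactions (not data from any cited paper):

```python
# Support and confidence for a rule X => Y over a toy transaction set.

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
s = support(transactions, {"bread", "milk"})       # 3 of 5 transactions -> 0.6
c = confidence(transactions, {"bread"}, {"milk"})  # 3 of 4 bread-buyers -> 0.75
```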
The Apriori algorithm, proposed by Agrawal & Srikant (1994), is one of the most widely used techniques for finding association rules; it restricts the search space by setting minimum support and confidence thresholds and obtains relatively strong rules. The algorithm consists of two phases.
Firstly, all itemsets with minimum support, called frequent itemsets, are generated as follows:
(1) L1 = find_large_1-itemsets;
(2) for (k = 2; Lk-1 ≠ Φ; k++) do begin
(3)   Ck = apriori_gen(Lk-1); // generate candidate k-itemsets from Lk-1
(4)   for all transactions t ∈ D do begin
(5)     Ct = subset(Ck, t); // Ct is the set of candidate itemsets contained in t
(6)     for all candidates c ∈ Ct do c.count++;
(7)   end
(8)   Lk = { c ∈ Ck | c.count ≥ minsup_count };
(9) end
(10) L = ∪k Lk;
After that, association rules can be generated from the set of all frequent itemsets by setting minimum confidence threshold.
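The frequent-itemset phase above can be sketched in Python. This is a minimal, unoptimized illustration in which `min_sup` is an absolute support count; the candidate generation and pruning mirror `apriori_gen` in the pseudocode:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets with absolute support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= min_sup}
    L = set(Lk)
    k = 2
    while Lk:
        # apriori_gen: join L(k-1) with itself, keep size-k unions
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune candidates that have an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        L |= Lk
        k += 1
    return L

baskets = [{"bread", "milk"}, {"bread", "beer"},
           {"bread", "milk", "beer"}, {"milk", "beer"}]
freq = apriori(baskets, min_sup=2)  # 3 frequent items and 3 frequent pairs
```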
3. TOPSIS Method with Combination Weights
TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), initially proposed by Hwang and Yoon (1981), is one of the classical multiple criteria decision making methods commonly used to rank and select among a finite number of alternatives by measuring Euclidean distances. The idea of TOPSIS is that the most preferred alternative should have not only the shortest distance from the positive ideal solution, but also the longest distance from the negative ideal solution. The basic TOPSIS methodology can be described as follows.
First of all, the decision matrix is built by aggregating the selected criteria values; then, based on the weighted normalized decision matrix, the positive ideal solution (PIS) and the negative ideal solution (NIS) are determined from the finite set of alternatives. After that, the distances from each alternative to the PIS and NIS are computed. Finally, the closeness coefficient of each alternative is obtained, and the alternatives are ranked accordingly. The rest of this section describes the method in detail.



3.1. Data Preprocessing
Assume that there are n alternatives to be evaluated with regard to m selected criteria. The criteria are assumed to be benefit criteria, as TOPSIS requires the utility of each criterion to be monotonic; this assumption causes no loss of generality, since other types of criteria can easily be transformed into benefit criteria. The criterion value of the i th alternative for the j th criterion is denoted vij (i = 1, 2, …, n; j = 1, 2, …, m). The initial decision matrix can therefore be written as
V = (vij)n×m. (1)
To ensure that the selected criteria are comparable, the initial decision matrix must be normalized so that incomparability among criteria is avoided. Here we use the following normalization:
xij = vij / max_i{vij} (for benefit criteria); xij = min_i{vij} / vij (for cost criteria), (2)
where xij is the normalized criterion value for vij. The normalized decision matrix is then
X = (xij)n×m. (3)
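A sketch of this preprocessing step, assuming a linear ratio normalization (benefit: vij divided by the column maximum; cost: column minimum divided by vij) so that all normalized values lie in (0, 1]; the exact form used in the original paper may differ:

```python
def normalize(V, is_benefit):
    """V: list of rows (alternatives x criteria); is_benefit: per-column flags."""
    n, m = len(V), len(V[0])
    X = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [V[i][j] for i in range(n)]
        vmax, vmin = max(col), min(col)
        for i in range(n):
            # ratio normalization: larger is better after this step
            X[i][j] = V[i][j] / vmax if is_benefit[j] else vmin / V[i][j]
    return X

V = [[2.0, 8.0], [4.0, 4.0]]
X = normalize(V, [True, False])  # second column treated as a cost criterion
# X == [[0.5, 0.5], [1.0, 1.0]]
```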
3.2. Determining Criteria Weights
Criteria weights reflect the decision makers' subjective preferences as well as the objective characteristics of the criteria themselves (Zeleny, 1982). Whether the weights are reasonable directly influences the reliability and effectiveness of the decision result. Many weighting methods have been developed, and they can be divided into subjective and objective ones. Subjective methods, which reflect the decision makers' preferences, are usually based on the decision makers' own experience, knowledge, and perception of the problem, but they lack objectivity. Objective methods take into account the information implicit in the data and reflect the characteristics of the criteria themselves, but the weights they produce may sometimes deviate from the actual situation and cannot be interpreted easily. Thus, a method that combines the two may be more reasonable.
There are many subjective weighting methods, such as AHP, the Delphi method, and the DARE method. In this paper, group AHP is used to determine the subjective weights. AHP, developed by Saaty (1980), addresses how to determine the relative importance of a set of criteria in a multiple criteria decision making problem. The method requires each expert to make pairwise comparisons of the selected criteria, yielding a judgment matrix per expert. The relative weights are then given by the right eigenvector corresponding to the largest eigenvalue of each judgment matrix, after which the consistency of the pairwise comparisons is examined; if the consistency check fails, the corresponding judgment may be discarded or revised. Afterwards, hierarchical clustering is used to generate a weight for each expert in the group, and the ultimate subjective weights p = (p1, p2, …, pm) are obtained by weighted aggregation. More details can be found in Wu, Hua, & Zha (2003).
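The eigenvector step for a single expert can be sketched as follows. The judgment matrix `A` here is a hypothetical, perfectly consistent 3×3 example, and the random-index table is Saaty's, reproduced only for n = 3, 4, 5:

```python
def ahp_weights(A, iters=100):
    """Priority weights via power iteration for the principal right
    eigenvector of judgment matrix A, plus the consistency ratio CR = CI/RI."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [vi / s for vi in v]
    # estimate lambda_max from A w = lambda w
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    ri = {3: 0.58, 4: 0.90, 5: 1.12}.get(n, 1.0)  # Saaty's random index (partial)
    return w, ci / ri

# Consistent judgments: criterion 1 is twice criterion 2, four times criterion 3.
A = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
w, cr = ahp_weights(A)
# w approaches (4/7, 2/7, 1/7) and CR is ~0, so the comparisons pass the check
```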
As with subjective methods, many objective methods are used to determine criteria weights, such as principal component analysis and the entropy method. This paper uses the entropy method (Deng, Yeh, & Willis, 2000), which measures the information content of the data for each criterion so that as much of the information in the initial data as possible is retained. The weights are obtained by this method as follows:
(i) Compute the entropy value of each criterion: the entropy value ej of the j th criterion is
ej = −(1/ln n) Σi fij ln fij, where fij = xij / Σi xij. (4)
(ii) Compute the weight of each criterion: the weight uj of the j th criterion is
uj = (1 − ej) / Σj (1 − ej). (5)
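The two entropy-weight steps can be sketched directly, using the standard form ej = −(1/ln n) Σi fij ln fij with fij = xij / Σi xij, and uj proportional to 1 − ej:

```python
import math

def entropy_weights(X):
    """Objective weights from the entropy method, X an n x m normalized matrix."""
    n, m = len(X), len(X[0])
    k = 1.0 / math.log(n)
    d = []
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        s = sum(col)
        f = [x / s for x in col]
        e = -k * sum(fi * math.log(fi) for fi in f if fi > 0)  # entropy e_j
        d.append(1.0 - e)  # degree of diversification
    total = sum(d)
    return [dj / total for dj in d]

u = entropy_weights([[0.5, 1.0], [1.0, 1.0]])
```

A criterion whose values are identical across all alternatives (the second column above) carries no information and receives zero weight, which is exactly the behavior Eqs. (4) and (5) encode.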
After that, the final criteria weights are obtained by the convex combination approach; that is, the final weight of the j th criterion is wj = αpj + (1 − α)uj (0 ≤ α ≤ 1). The value of α can be set according to the users' preferences.
Then, the weight matrix can be written as
W = diag(w1, w2, …, wm). (6)


3.3. Procedure for TOPSIS Method
After data preprocessing and weight determination, we can use the TOPSIS method to rank the finite set of alternatives. The TOPSIS method consists of the following steps (Hwang & Yoon, 1981; Deng et al., 2000):
(i) According to the normalized decision matrix and the weight matrix, the weighted normalized decision matrix can be obtained as Z = XW = (zij)n×m.
(ii) Compute the PIS and NIS of each criterion from matrix Z as follows:
zj+ = max_i zij (j ∈ J*), min_i zij (j ∈ J′); zj− = min_i zij (j ∈ J*), max_i zij (j ∈ J′), (7)
where zj+ and zj− denote the PIS and NIS of the j th criterion, and J* and J′ denote the sets of benefit criteria and cost criteria, respectively.
(iii) Compute the Euclidean distances from each alternative to PIS and NIS.
di+ = √(Σj (zij − zj+)²), di− = √(Σj (zij − zj−)²), (8)
where di+ and di− denote the distances from the i th alternative to the PIS and NIS, respectively.
(iv) Compute the closeness coefficients for all the alternatives. Let Ci denote the closeness coefficient for the i th alternative, and Ci can be computed as
Ci = di− / (di+ + di−). (9)
(v) All the alternatives can be ranked with regard to the closeness coefficients computed above, and the decision makers can make decisions based on the ranking result.
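Steps (i)–(v) can be sketched as below for an already-normalized decision matrix, with all criteria assumed to have been converted to benefit criteria as in Sec. 3.1:

```python
import math

def topsis_rank(X, w):
    """Rank alternatives by closeness coefficient; X normalized, benefit criteria."""
    n, m = len(X), len(w)
    Z = [[X[i][j] * w[j] for j in range(m)] for i in range(n)]   # step (i)
    z_pos = [max(Z[i][j] for i in range(n)) for j in range(m)]   # step (ii): PIS
    z_neg = [min(Z[i][j] for i in range(n)) for j in range(m)]   #           NIS
    C = []
    for i in range(n):                                           # steps (iii)-(iv)
        d_pos = math.sqrt(sum((Z[i][j] - z_pos[j]) ** 2 for j in range(m)))
        d_neg = math.sqrt(sum((Z[i][j] - z_neg[j]) ** 2 for j in range(m)))
        C.append(d_neg / (d_pos + d_neg))
    # step (v): rank alternatives by decreasing closeness coefficient
    order = sorted(range(n), key=lambda i: C[i], reverse=True)
    return order, C

order, C = topsis_rank([[1.0, 1.0], [0.2, 0.4], [0.6, 0.8]], [0.5, 0.5])
# alternative 0 dominates on both criteria, so it ranks first with C = 1.0
```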
4. A Framework for Ranking Association Rules
Traditionally, rules are selected only according to support and confidence thresholds, which considers the database perspective alone. In fact, the interestingness of an association rule is application-dependent (Srikant et al., 1997). Domain information from the application area can provide useful criteria for selecting rules that are more interesting to the users. In this section, a new framework for ranking association rules that considers both objective interestingness measures and the users' domain information is proposed.
The proposed framework is illustrated by the flow chart in Fig. 1. Firstly, the transaction database is constructed, and association rules are generated by the Apriori algorithm with minimum support and confidence thresholds. The users are then required to select criteria, taking into account both objective interestingness measures and their domain information, and to determine the weights of the selected criteria using the method proposed in this paper. Finally, the TOPSIS method is used to rank the mined association rules, and the rules at the top of the ranking list can be selected for implementation.

Fig. 1. Framework for ranking association rules
5. Illustrative Example
To show the applicability of their approaches, Chen (2007) and Toloo et al. (2009) present an example of market basket analysis. In this example, association rules are discovered by the Apriori algorithm with minimum support and confidence set to 1.0% and 10.0%, respectively, so that some infrequent but interesting rules are retained. Forty-six association rules are mined from the market basket database.
The criteria selected for ranking are support, confidence, itemset value, and cross-selling profit. For an association rule X ⇒ Y, support is the percentage of transactions that contain X ∪ Y, and confidence is the ratio of the percentage of transactions that contain X ∪ Y to the percentage that contain X. Itemset value is the sum of the values of all items in X ∪ Y. Cross-selling profit is the profit obtained by recommending that customers who have bought X also purchase Y.
Next, we can use the proposed TOPSIS method to rank these 46 mined association rules. Firstly, the subjective weights determined by using group AHP can be denoted as p= (0.118, 0.167, 0.262, 0.453).
Using the above mentioned entropy method, the objective weights can be obtained as u= (0.196, 0.259, 0.122, 0.423).
For convenience, we set α = 0.5, and the weight vector obtained by the convex combination approach is w = (0.157, 0.213, 0.192, 0.438).
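The reported weight vector can be reproduced directly from the subjective and objective weights given above:

```python
# Convex combination w_j = a*p_j + (1 - a)*u_j with a = 0.5,
# using the weight vectors reported in this section.
p = [0.118, 0.167, 0.262, 0.453]  # group AHP (subjective)
u = [0.196, 0.259, 0.122, 0.423]  # entropy method (objective)
a = 0.5
w = [round(a * pj + (1 - a) * uj, 3) for pj, uj in zip(p, u)]
# w == [0.157, 0.213, 0.192, 0.438], matching the paper's weight vector
```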
Finally, the TOPSIS method is applied to rank the 46 mined association rules; the closeness coefficients and ranking results for these rules are shown in Table 1.
Table 1. Ranking result and closeness coefficient for each association rule

Table 2. Ranking results of Chen's method and Toloo's method

Ranking   Rule no. (Chen's method)   Rule no. (Toloo's method)
1         26                         18
2         22                         23
3         18                         26
4         17                         12
5         7                          31
6         23                         43
7         6                          22
8         43                         6
9         31                         17
10        12                         1
From Table 1, we can see that the top 10 association rules are rules 17, 18, 12, 31, 28, 29, 27, 34, 33, and 46. This ranking differs from the results of Chen's method and Toloo's method, which are shown in Table 2. It should be noted, however, that rules 17, 18, 12, and 31 rank at the top of the list under our method as well as under the other two methods. We can therefore conclude that these are strong association rules that can be selected for implementation.
6. Conclusion
Association rule mining algorithms can generate large numbers of rules, but only a small number of them may be selected for implementation, so techniques are needed to obtain the rules that are most interesting to the users. Based on multiple criteria decision making theory, this paper has presented a framework for ranking mined association rules using the proposed TOPSIS method with combination weights, which considers both objective interestingness measures and the users' domain information, and has applied a market basket analysis example to illustrate the applicability of the method. The results show that our method is computationally efficient and applicable. Moreover, our method takes into account both the objective and the subjective weights of the selected criteria, so the ranking result should be more credible.
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data, 254–259.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the international conference on very large data bases, 407-419.
Chen, M. C. (2007). Ranking discovered rules from data mining with multiple criteria by data envelopment analysis. Expert Systems with Applications, 33, 1110–1116.
Choi, D. H., Ahn, B. S., & Kim, S. H. (2005). Prioritization of association rules in data mining: multiple criteria decision approach. Expert Systems with Applications, 29, 876–878.
Deng, H., Yeh, C. H., & Willis, R. J. (2000). Inter-company comparison using modified TOPSIS with objective weights. Computers & Operations Research, 27, 963-973.
Geng, L., & Hamilton, H. J. (2006). Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3), Article 9, 1-32.
Hwang, C. L., & Yoon, K. (1981). Multiple attribute decision making methods and applications. Berlin, Heidelberg: Springer.
Saaty, T. L. (1980). The analytic hierarchy process. New York: McGraw-Hill.
Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. In Proceedings of the third international conference on knowledge discovery and data mining, KDD-97 (pp. 67–73).
Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Beijing: Post & Telecom Press.
Toloo, M., Sohrabi, B., & Nalchigar, S. (2009). A new method for ranking discovered rules from data mining by DEA. Expert Systems with Applications, 36, 8503-8508.
Wu, Y. Y., Hua, Z. S., & Zha, Y. (2003). Calculation of the weights and the amalgamation of the matrixes in AHP group decision. Operations Research and Management Science, 12, 16-21.
Zeleny, M. (1982). Multiple criteria decision making. New York: McGraw-Hill.
