USE OF TEXT MINING IN MANAGING KNOWLEDGE IN HELP DESK
DINESH RATHI
School of Library and Information Studies, University of Alberta,
3-20 Rutherford South, Edmonton, AB, T6G 2J4, Canada
E-mail: drathi@ualberta.ca
LINDA C. SMITH
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel, Champaign, IL 61820, USA
E-mail: lcsmith@illinois.edu
One of the challenges for a help desk is to identify the problem domains requiring support so that an expert can be assigned to each domain and email routed directly to that expert. Text mining techniques such as clustering have been used for knowledge and pattern extraction from text data. This study presents a methodology for extracting knowledge from a historical data set containing the problem requests received by a help desk, in order to identify the problem domains using a machine-expert hybrid approach. The paper also outlines a two-stage problem domain discovery process for email-based help desk data. The first stage involves identifying the best set of clusters from several sets of clustering results. The second stage involves help desk experts identifying the problem domains and labeling the clusters with problem domain names. The application of the machine-expert approach to problem domain identification in the help desk environment is novel, and the findings from the study should help advance the use of this hybrid methodology for knowledge extraction from large historical datasets.
1. Introduction
In many organizations, the help desk provides support for a large number of technologies (Gonzalez et al., 2005), so it is important for the help desk team to understand what types of technologies and problems they support. The identification of problem domains at the help desk matters because efficient help support is considered critical for high technology (El Sawy and Bowles, 1997) such as computers and information technology. In addition, the provision of help support is beset by issues such as high employee turnover, scarcity of skilled staff, high costs and large call volumes (Chan et al., 2000; Gonzalez et al., 2005; Nenkova and Bagga, 2003; Yang et al., 1997). This calls for the effective utilization of resources.
The identification of problem domains will help in the effective utilization of resources: experts' time can be used well by making them owners of specific problem domains, so that they can apply their domain skills to resolve problems effectively and efficiently. Gormly (2003) argued that help desk personnel should have the skills to resolve a problem at the first contact. Having an expert or "specialist" increases the ability to resolve problems on the first call, reduces the time spent on problem resolution and minimizes escalation to higher levels. This can be achieved by directing each problem to the right expert, which in turn requires knowledge of the problem domains so that an expert for each domain can be designated.
It is practically difficult for humans to peruse large quantities of historical data, such as email-based help desk data, and identify problem domains or categories. In contrast, large databases are easy to handle and manipulate by machine. Machine learning techniques such as clustering can be used to group similar documents/problems together (Jain and Dubes, 1988). One of the challenges in clustering is finding the natural number of clusters in a given dataset; this is a complex and still unsolved problem (see the Literature Overview section), hence an alternate approach is required to extract knowledge from a large dataset and identify problem domains from help desk data.
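As a minimal illustration of this point, the sketch below (our illustration with scikit-learn, not the algorithm used in this study) partitions email texts into groups. Note that the number of clusters k must be supplied by the caller, which is exactly the difficulty just described.

```python
# Illustrative sketch only: TF-IDF vectorization plus k-means clustering.
# The study uses a different, theme-based algorithm (Section 3.2.1); this
# merely shows that partitioning algorithms require k as an input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_emails(emails, k):
    """Partition a list of email texts into k groups; k must be chosen up front."""
    X = TfidfVectorizer(stop_words="english").fit_transform(emails)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```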
This study explores the potential of alternate approaches to knowledge extraction from a large dataset. It presents a methodology to extract knowledge from a historical data set containing the problem requests received by the help desk, in order to identify the problem domains using a machine-expert hybrid approach. The machine-expert hybrid approach uses the machine's computational power to generate clusters and the experts' skill to identify problems based on the clustering results. The paper also outlines the step-by-step process of identifying problem domains from historical help desk data using a machine-expert technique, and identifies a few weaknesses of the technique (see the Discussion section). The next section provides an overview of related work in three domains, i.e., help desk, knowledge management (KM) and clustering, the key concepts associated with this research. Section 3 provides the experiment details. Section 4 presents the findings of the study along with a discussion of the results. The final section presents the conclusion and planned future work to extend this study.
2. Literature Overview: Help Desk, Knowledge Management and Clustering
There are many published studies that discuss issues in the help desk domain, but most of the discussion centers on knowledge base development or automatic classification of call problems. Sinnett and Barr (2004) described the development of a knowledge base system for the help desk using pre-defined categories. Kriegsman and Barletta (1993), Watson and Marir (1994), and Yang et al. (1997) proposed different techniques for improving the efficiency and effectiveness of the help desk using either case-based reasoning (CBR) or rule-based systems. These approaches focus on developing or managing a knowledge base of solved problems and its subsequent use in problem resolution, but they do not explore how to create categories of problems from historical data, which could help in organizing knowledge in the knowledge base or in classifying new problems into an identified category.
The review of KM literature reflects the use of text mining tools and techniques such as clustering, classification and summarization for managing knowledge. Delic and Dayal (2000) suggest the use of data mining and knowledge discovery techniques to disseminate and share knowledge with workers for decision making. Antonova et al. (2006), in their review paper, state that data mining tools can be used to extract "meaningful relationships from usually very large quantities of seemingly unrelated data", which is the objective of the present study. Luan and Serban (2002) described a three-tiered 'Knowledge Management Model (TKKM)' which proposed the use of mining techniques to identify patterns and relationships in data. Marwick (2001) highlights the use of different machine learning techniques, such as classification, clustering and summarization, in managing knowledge.
Clustering is a text mining technique for extracting knowledge from data. It is the process of partitioning similar documents or data into groups: a "fully automatic and unsupervised process of grouping (or partitioning) of data (or documents)" (Dhillon et al., 2003; Jain et al., 1999; Jain and Dubes, 1988). Clustering is useful for knowledge extraction, pattern analysis, pattern classification and grouping (Marwick, 2001; Qi et al., 2005). Qi et al. (2005) used a clustering technique to extract tacit knowledge from the email threads of the pediatric pain mailing list (PPML) archives. However, there are limitations to the use of clustering: "not all clustering techniques can uncover all the clusters present with equal facility", and subjectivity is introduced in partitioning the data into 'k' clusters, because the clustering algorithm requires the number (k) of clusters as an input and domain knowledge is needed to make that decision (Jain et al., 1999). Two major approaches have emerged to overcome the issue of supplying an appropriate number (k) of clusters to the partitioning algorithm.
The first approach is for users to find the natural number of clusters (k) using mathematical techniques and then use that value to partition the data into k clusters. The argument for this approach is that users need no detailed pre-existing knowledge to determine the k clusters in the data. However, identifying the natural number of clusters in data is difficult and still under investigation. Dubes (1987) concluded that estimating the number of clusters in multidimensional data remains a difficult and challenging problem. Similarly, Gordon (1999), as cited by Nilsson (2002), states that "deciding the natural number of clusters (including no clustering, i.e., a single cluster) in a data set is in fact one of the hardest problems in cluster analysis".
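One family of such mathematical techniques computes a cluster-validity index over a range of candidate k values and picks the best-scoring one. Below is a minimal sketch (our illustration, assuming scikit-learn and a TF-IDF document matrix X); in practice, different validity indices often disagree, which reflects the difficulty noted above.

```python
# Illustrative sketch: estimate k by maximizing the silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(X, k_range=range(2, 50)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # higher = better-separated clusters
    return max(scores, key=scores.get), scores
```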
The second approach, when the natural number of clusters is difficult to estimate, is to adopt a trial-and-error approach to determine a suitable number of clusters, which is time-consuming (Li et al., 2007; Salvador and Chan, 2004). A modified version of the trial-and-error approach is a mixed-initiative approach in which the machine and experts (users) interact to generate the clusters. The machine-expert approach has been used predominantly in medical informatics, as highlighted by Agrawal et al. (2000), Huang and Mitchell (2006), Taxt and Lundervold (1994) and Weng and Liu (2004). To the best of our knowledge, its application to identifying problem domains from help desk emails is novel.
Huang and Mitchell (2006) proposed "mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm". Their focus was to "cluster text documents based on the meanings of categories a user understands or wants", and their experiment gives evidence of improved clusters when user feedback is incorporated into the clustering process. Agrawal et al. (2000) incorporated user feedback into a clustering model for topic discovery within an unorganized collection of documents. They "developed an interactive approach to clustering which involves iteratively presenting the user with a perusable number of related documents" (known as a cluster digest) and allowing users to feed their responses back into the clustering model. This is an incremental approach to clustering, and users are not required to "specify up-front an exact number of clusters to discover". Their results show that their "interactive approach to clustering leads to substantial improvement in accuracy over standard clustering technique" in assigning documents to their dominant class. They used this approach to develop better clusters, not to identify document categories.
The present study used the machine-expert framework to develop problem domain categories in the help desk area. The objective of the mixed approach in this research was not to improve cluster quality but to extract problem domains by combining a clustering technique with expert skills; the approach was therefore partially modified to meet the research need. The human experts were involved at two stages: identifying the best cluster results out of multiple result sets, and developing the cluster labels that identify the problem domains of the help desk. The approach also incorporated an interview session with the experts involved in the labeling process to reconcile label names.
The next section provides the study details which include dataset overview, clustering algorithm and its parameter settings, and the problem domain discovery process.
3. The Study
3.1. Dataset for Research
The study used data from the help desk ticketing system, known as the Request Tracker (RT), of the Graduate School of Library and Information Science (GSLIS), University of Illinois at Urbana-Champaign. The system allows a user to send problems (requests) via email to the help desk support staff. The emails received by the RT ticketing system are stored in a central MySQL database. The RT system has been operational since March 2003. The data corpus collected for the research covered March 2003 to March 2007 and contained over 20,000 emails.
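Because the tickets live in MySQL, the corpus for the study period can be exported with a simple query. The sketch below is purely hypothetical: the table and column names are placeholders, not the actual RT schema.

```python
# Hypothetical sketch of exporting email bodies for the March 2003 -
# March 2007 period. Table/column names are illustrative placeholders.
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="rt_user",
                               password="***", database="rt")
cur = conn.cursor()
cur.execute("SELECT body FROM email_messages "
            "WHERE created BETWEEN '2003-03-01' AND '2007-03-31'")
emails = [row[0] for row in cur.fetchall()]
cur.close()
conn.close()
```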
3.2. Experiment Setup Details
The machine-expert approach is a mixed-initiative approach in which clustering was used to generate the clusters, and experts with domain knowledge evaluated the content and quality of the cluster results and labeled the best identified set.
3.2.1. Clustering Algorithm, Data Preparation and Algorithm Parameter Setting
The clustering algorithm used in this study, based on the Expectation Maximization (EM) concept, was drawn from Mei and Zhai's (2005) work. The algorithm uses the concept of a theme to generate clusters. Themes are defined as a "probability distribution of words that characterizes a semantically coherent topic or subtopics". The clusters of themes are extracted using a simple probabilistic mixture model in which "words are regarded as data drawn from a mixture model with component models for the theme word distributions and a background word distribution" and each word in a document has equal weight (Mei and Zhai, 2005). The process of generating problem domains from the data involved multiple steps. The first step was data formatting as required by the algorithm: all emails had to be placed in a single file, with each email enclosed in XML start and end tags and each message within an email tagged similarly (the final message in an email was closed with a distinct end tag). The second step was pre-processing. The data was pre-processed by removing noise elements, spam emails, attachments, headers and signatures, so that clusters were generated only from the content of the emails. The terms in the data were stemmed using the Krovetz stemmer, and stopwords were removed after stemming. A customized stopword list was used; it contained the stopwords from the SMART experiments (http://www.lextek.com/manuals/onix/stopwords2.html), alpha-numeric terms, names and aliases, and high-frequency terms occurring in the dataset. The third step was to set the clustering algorithm parameters (Figure 1). Among the parameters, lambdaB controls the appearance of non-informative stopwords. Several pilot experiments led to setting lambdaB between 0.90 and 0.92; these values gave better theme generation and are in line with the experimental work of Mei and Zhai (2005). The parameter 'cluster' (Figure 1) is the number of clusters that the algorithm should generate; it is discussed in the next sub-section.
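The sketch below is our simplified reading of such a theme mixture model, not the authors' implementation: every word occurrence is explained either by a shared background word distribution (with weight lambdaB) or by one of k theme word distributions, and EM alternates between attributing word counts to themes and re-estimating the theme distributions. All names here (term_doc, extract_themes) are illustrative assumptions.

```python
# Minimal EM sketch for a theme mixture with a background distribution.
import numpy as np

def extract_themes(term_doc, k=30, lambda_b=0.90, iters=50, seed=0):
    """term_doc: (n_terms, n_docs) array of raw term counts."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = term_doc.shape
    # Background model: the collection-wide word distribution.
    p_bg = term_doc.sum(axis=1) / term_doc.sum()
    # Random theme word distributions and uniform per-document theme weights.
    p_theme = rng.random((n_terms, k))
    p_theme /= p_theme.sum(axis=0)
    pi = np.full((n_docs, k), 1.0 / k)
    for _ in range(iters):
        # E-step: share of each (term, doc) count attributed to each theme,
        # with the remainder absorbed by the background model.
        theme_part = (1.0 - lambda_b) * p_theme[:, None, :] * pi[None, :, :]
        denom = lambda_b * p_bg[:, None, None] + theme_part.sum(axis=2, keepdims=True)
        z = theme_part / np.maximum(denom, 1e-300)
        # M-step: re-estimate theme word distributions and mixing weights.
        weighted = term_doc[:, :, None] * z
        p_theme = weighted.sum(axis=1)
        p_theme /= np.maximum(p_theme.sum(axis=0), 1e-300)
        pi = weighted.sum(axis=0)
        pi /= np.maximum(pi.sum(axis=1, keepdims=True), 1e-300)
    return p_theme, pi
```

With lambdaB near 0.90, most occurrences of common, non-informative words are absorbed by the background model, so the learned themes concentrate on discriminative terms, matching the parameter's role described above.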
3.2.2. Number of Clusters
As argued in Section 2, partitioning the data into 'k' clusters is difficult because it involves subjectivity and requires domain knowledge to make the decision about 'k'. Multiple experiments were conducted, starting with k=20 and incrementing by 5 in each experiment; in a few cases lambdaB was varied between 0.90 and 0.92 (see Table 1). The decision on the range of the number of clusters was based on one of the authors' domain knowledge (as argued by Jain et al., 1999) and a few pilot runs on a sample dataset (Salvador and Chan, 2004; Li et al., 2007).
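The experimental sweep described above can be expressed as a simple loop over the parameter grid. This sketch reuses the hypothetical extract_themes function from the previous sub-section; note that the study ran seven selected configurations rather than the full grid.

```python
# Illustrative: generate cluster result sets for k = 20, 25, ..., 45,
# trying lambdaB values of 0.90 and 0.92, for later screening by raters.
results = {}
for k in range(20, 50, 5):
    for lambda_b in (0.90, 0.92):
        results[(k, lambda_b)] = extract_themes(term_doc, k=k, lambda_b=lambda_b)
```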
3.2.3. Problem Domain Discovery Process
The problem domain discovery process had two stages. The first stage was to identify the best set of clusters from the seven key cluster result sets. The second stage was to give the best set of cluster results to domain experts, who identified the problem domains by labeling the clusters with problem domain names. This was followed by an interview to reconcile the label names.
In both stages, each cluster was represented by its top 20 terms. In the first stage, two independent raters (judges) screened the multiple sets of clusters, generated using different parameter settings of the algorithm (Figure 1), in order to select the most promising set for content labeling by the help desk domain experts. The two raters were graduate students at GSLIS and had a reasonable understanding of the help desk domains (Jain et al., 1999). The raters were shown the top 20 terms of each cluster and judged whether those terms reflected a theme or themes, using a three-point scale: 'clear' (the top 20 terms clearly reflect a coherent theme), 'partially clear' (the top 20 terms reflect some theme, or possibly more than one theme) and 'not clear' (no theme at all).
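For such screening, each cluster can be reduced to its 20 highest-probability terms. A sketch under the assumptions of the earlier theme-extraction code (p_theme holds per-theme word probabilities; vocab is an assumed array mapping row indices to terms):

```python
import numpy as np

def top_terms(p_theme, vocab, n=20):
    """Return the n highest-probability terms for each theme/cluster."""
    return [[vocab[i] for i in np.argsort(-p_theme[:, j])[:n]]
            for j in range(p_theme.shape[1])]
```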
The best set of cluster results from the above process was given to two independent domain experts who work at the help desk. The experts' task was to identify the problem domain reflected in each cluster and label the cluster with the best name. This exercise was conducted through a simple questionnaire containing a set of instructions, a few questions and the cluster results, with each cluster represented by its top 20 terms (Figure 2). In addition, the experts were given verbal instructions on the intent of the exercise. The questionnaire was pre-tested with one independent rater to ensure that the questions were clearly written and conveyed the purpose of the study.
The cluster names obtained from the help desk employees were reconciled in order to assign one common name to each cluster based on their feedback. Where the assigned names differed, a group interview in an informal setting was conducted with both participants to gather their detailed feedback, narrow the semantic differences and develop a unified domain taxonomy.
4. Results and Discussion
4.1. Identification of the Best Cluster Results
The first stage of problem domain identification and labeling was to select the best of the cluster results to be given to the experts. In Section 2, we argued that identifying the natural number of clusters is difficult and remains an open problem under investigation; therefore a mixed-initiative approach was adopted. The best cluster results were selected by evaluating multiple cluster results obtained by varying the number of clusters. A set of seven cluster results was evaluated by the two independent raters, who were graduate students at GSLIS. Table 1 shows the results of this evaluation.
Table 1. Stage I - Cluster Results Evaluation

Experiment #   No. of Clusters   lambdaB      Clear         Partially Clear     Not Clear
                                           R1      R2        R1      R2        R1      R2
1              20                0.90      13      12         6       3         1       5
2              25                0.90      17      18         4       5         4       2
3              30                0.90      25      24         3       4         2       2
4              30                0.92      25      21         2       7         3       2
5              35                0.92      23      19         6      12         6       4
6              40                0.92      27      24         5      10         8       6
7              45                0.92      24      22         6      11        15      12

(R1 = Rater 1, R2 = Rater 2)
The cluster set from experiment #3 (k=30) was selected for the next step, for the following reasons. This set of results had a higher level of agreement between the two raters on all three parameters (i.e., 'clear', 'partially clear' and 'not clear') than the other results. It also had a higher percentage of 'clear' clusters (i.e., the ratio of clear clusters to the total number of clusters in the set): averaging the two raters' judgments, 81.7% of the clusters were 'clear'. Although experiment #6 had a marginally better average number of clear clusters than experiment #3, there were larger differences between the raters' opinions on the clear and partially clear parameters (rater 1 and rater 2 judged 27 and 24 clusters as clear, and 5 and 10 as partially clear, respectively). In addition, experiment #3 had only two 'not clear' clusters, while the other cluster results had more. The higher percentage of 'clear' clusters and the low number of 'not clear' clusters indicate that this was a comparatively better set of cluster results, which led to its selection for the next step. The results of the problem domain identification and labeling process are described in the next sub-section.
4.2. Problem Domain Identification and Clusters Labeling Results
Table 2. Stage II - Cluster Labels by the Experts (30 clusters)

Same (SS)     Nearly Same (VS)     Not Same (NS)
14*           8                    8
46.67%        26.67%               26.67%

*None of the 'Same' clusters were labeled CBN (Cannot Be Named)
Two experts from the help desk helped identify the problem domains reflected in the thirty clusters. The experts labeled the clusters independently, and their feedback was analyzed (Table 2) on a three-point scale: "same" (SS), "nearly same" (VS) and "not same" (NS). When both experts gave a cluster exactly the same label, it was classified as 'SS'; a slight but close variation was classified as 'VS'; and a totally different label was classified as 'NS'. The labeling process was analyzed quantitatively as well as qualitatively.
First, intercoder reliability between the two experts on the labeling results was calculated using Krippendorff's alpha. The alpha was 0.69 (0.6887), indicating that the two raters had a fairly high level of agreement in assigning the same label names to the thirty clusters; alpha values between 0.61 and 0.80 can be interpreted as 'substantial agreement' (Castillo et al., 2006). The analysis on the three-point scale shows that 14 of the 30 clusters (46.67%) were labeled 'SS', 8 of 30 (26.67%) were labeled 'VS' and 8 of 30 (26.67%) were labeled differently ('NS') (summary statistics are presented in Table 2). Thus, nearly three-quarters of the clusters were labeled the same or nearly the same. For example (refer to Figure 1), cluster 1 was labeled 'Email', cluster 15 was labeled 'Printing' and cluster 27 was labeled 'Audio video support/equipment' by both experts.
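For reference, Krippendorff's alpha for two coders assigning nominal labels with no missing values can be computed as in the sketch below. This is the standard formulation, not code from the study; each cluster is a unit and each expert's label is that unit's value.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    # Coincidence matrix: each unit contributes both ordered label pairs.
    coincidence = Counter()
    for a, b in zip(coder1, coder2):
        coincidence[(a, b)] += 1
        coincidence[(b, a)] += 1
    n_c = Counter()
    for (a, _), cnt in coincidence.items():
        n_c[a] += cnt
    n = sum(n_c.values())  # = 2 * number of units
    # Observed vs. expected disagreement.
    d_obs = sum(cnt for (a, b), cnt in coincidence.items() if a != b) / n
    d_exp = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - d_obs / d_exp
```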
4.2.1. Additional Feedback from the Experts
The analysis of the qualitative feedback given by the experts presented some interesting insights. One of the two experts did not find anything surprising in the cluster results but the other expert found two clusters surprising.
Surprising Clusters: One of the clusters, labeled 'Library Issues', was surprising to the expert because a library-related cluster was discovered in the help desk data. The expert said, "it is surprising to me because these problems are related to the UIUC library – that users do not realize that they need to contact UIUC library systems". Another cluster, labeled 'Audio Video Calendar' (different from the 'audio video equipment' cluster), was found surprising because the terms in the cluster dealt mostly with reminders for setting up equipment at a particular date, time and location, and were "not the problems" from users. This calendar is mainly for internal use and is filled in by the help desk employees as a reminder to set up equipment when an audio-video setup request is received.
Cannot be Named (CBN) Clusters: One expert labeled two clusters as CBN and one cluster as spam. For one cluster, the expert said "there were too many personal names, mixed with course number and other misc IT terms", so the cluster did not exhibit any coherent theme. Another cluster was labeled CBN because a few terms reflected a LEEP support theme (LEEP is an online program name and thus an acronym for internal use) but the other terms "were all over the support map", and the expert got "confused with too many themes being reflected in a single cluster". The other expert labeled only one cluster as CBN because the terms in that cluster were "stemming from different domains".
Missing Problem Domains: The experts who evaluated these categories felt that there were “no glaring omissions” and the cluster results did reflect major categories.
4.2.2. Reconciliation of Cluster Labels
Since the two experts differed on the label names of a few clusters, an interview was conducted to understand and reconcile the semantic differences. This also included clarifying the label names of the clusters labeled 'VS' (nearly same). Interviewing the two experts and narrowing the differences produced the reconciled, final list of cluster labels.
Four labels (categories) occurred more than once: 'Mailing Lists or List Serves' (twice), 'Email' (twice), 'Printing' (three times) and 'Audio-Visual Support' (twice). One of the clusters reflected two categories, 'Job' and 'Instructional Resource'. Labels occurring more than once were merged, and only unique names were collated to create the final unified label list. A total of 25 categories were identified from the thirty clusters: 24 unique problem domains and one miscellaneous category. The miscellaneous category was labeled "spam / non-identifiable / CBN". When the experts labeled the clusters independently, there were three CBN observations (combined across both raters), but during reconciliation consensus emerged and only one cluster (cluster 2) could not be associated with any problem domain; it was labeled 'CBN'. The miscellaneous category was treated as a single category combining the spam and non-identifiable (CBN) labels. In practice these could be considered separate, but for this research they were combined into one category.
4.3. Discussion
The results of the study show that knowledge can be extracted from a large historical data set using a mixed initiative of a clustering technique and experts' domain knowledge to identify the problem categories of the help desk. The machine-expert technique is novel in its application to identifying problem domains in the help desk area, and the approach could also be applied to other text-based or historical data to identify document categories. Finding the natural number of clusters (k) using mathematical techniques is very difficult and of limited use to practitioners, while manually perusing the historical data set (in this case over 15,000 emails) to identify the problem domains is practically impossible for large data volumes. This mixed approach is therefore a better fit for developing a categorization schema. It has several advantages:
First, the approach combines the benefits of computational power for handling large data (Huang and Mitchell, 2006) with the experts' understanding of the themes in the clusters, and it reduces the manual labor of perusing a large database. An internal study at the GSLIS help desk to identify the problems being supported was a longitudinal study conducted over 21 days on new incoming emails; its disadvantage was that the scope of the identified problem domains was limited to the problems arriving during those days, compared to the much longer time period covered in this study. By contrast, the time spent on cluster labeling by each help desk expert here was just over two hours. We have not compared the outcomes of these two alternatives but will do so in a future study. Second, the mixed initiative eliminates the need to identify a natural number of clusters in the data: "learning the natural number of clusters still remains an open problem", so it is better to adopt a user-identified number of clusters (Bekkerman et al., 2007). Third, the mixed approach ensures that problem domains are identified based on the users' understanding of the clusters.
However, a weakness of this approach is that some domains will not emerge if they have a smaller set of problems, as there are not enough data points for the clustering algorithm to generate separate clusters for them. This includes newly adopted technologies. For example, the new teaching software "Moodle" did not emerge as a separate category because there were few problem cases in the help desk data used in the research; "Moodle" was a newly introduced application in the school. The technique also requires experts with excellent domain knowledge, since a lack of domain knowledge can result in poor labeling of the clusters. This was partially evident in the study: one of the experts was relatively less experienced (i.e., had been with the help desk for a shorter period) than the other (who had worked at the help desk since its inception), and found some of the terms in the clusters difficult to understand. Thus the less experienced expert could not associate one of the clusters with a problem domain and labeled it CBN, while the more experienced expert could.
We believe this technique could be used to extract problem domains at different levels of granularity, i.e., extracting sub-problem domains. For example, 'Access Issues' could be sub-divided into 'System Access', 'Building Access', etc. This could possibly be achieved by increasing the number of clusters, as suggested by the experiment with k=40, where there were more clear clusters (27 and 24 by rater 1 and rater 2, respectively) and additional labels (sub-problem domains) could likely have been identified. However, this requires further investigation and will be addressed in future work.
5. Conclusion and Future Work
The identification of problem domains can be done either by manually developing the problem domain categories through perusal of email data or by using machine learning techniques. A manual process of perusing very large historical data would be practically difficult, or would require a time-consuming longitudinal study, as highlighted in the Discussion section. An alternative is a machine-based approach such as the machine-expert hybrid technique used in this study. The findings illustrate that the machine-expert mixed initiative can be used to extract categories or problem domains from historical data and that the major problem domains can be identified. This result is validated by the qualitative data gathered by interviewing the experts who work at the help desk: they found "no glaring omissions" of any major problem domain category, except for a few problem domains that were new and had limited data in the dataset, such as the domain related to Moodle. This is one limitation of the study.
The results of this work should be of interest to knowledge management (KM) researchers and practitioners in help desk environments or customer support systems. For example, the identified problem domains could be used as categories for developing a knowledge base for each problem domain. In addition, this work can provide insight to researchers or users interested in developing a categorization schema for historical text data to archive documents, or to researchers interested in evaluating automatic cluster labeling algorithms. The clustering algorithm adopted in this study was a partitioning algorithm, which does not automatically give hierarchical levels of problem domains, i.e., a main problem domain and its sub-problem domains. In future work, we would like to conduct a study to generate hierarchical problem domains, i.e., main problem domains and their sub-domains, using both partitioning and hierarchical algorithms.
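As a pointer to that future work, a hierarchical alternative could look like the sketch below (our illustration using SciPy's agglomerative clustering, not an algorithm from this study); cutting the dendrogram at different heights would yield main domains and sub-domains.

```python
# Illustrative: agglomerative clustering of a dense document-feature
# matrix X, cut at two levels for coarse (main) and finer (sub-) groups.
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="ward")                              # X: docs x features
main_domains = fcluster(Z, t=25, criterion="maxclust")     # coarse cut
sub_domains = fcluster(Z, t=60, criterion="maxclust")      # finer cut
```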
6. Acknowledgments
We would like to thank the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign for providing us with the opportunity to study the help desk system, and the help desk employees for their valuable contributions to this research.
References
Agrawal, R., Bayardo, R. and Srikant, R. (2000) “Athena: Mining-Based Interactive Management of Text Databases”, Advances in Database Technology, 365–379.
Antonova, A., Gourova, E. and Nikolov, R. (2006) “Review of Technology Solutions for Knowledge Management”, 2nd IET International Conference on Intelligent Environments, 2: 39-44.
Bekkerman, R., Raghavan, H., Allan, J. and Eguchi, K. (2007) “Interactive Clustering of Text Collections According to a User-Specified Criterion”, Proceedings of International Joint Conference on Artificial Intelligence (IJCAI-07), 684-689.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M. and Vigna, S. (2006) “A Reference Collection for Web Spam”, ACM SIGIR Forum, ACM Publication, 40(2): 11-24.
Chan, C. W., Chen, L. L. and Geng, L. (2000) “Knowledge Engineering for an Intelligent Case-Based System for Help Desk Operations”, Expert Systems with Applications, 18(2): 125-132.
Delic, K. A. and Dayal, U. (2000) “Knowledge Management in the Service and Support Business”, Proceedings of the Third International Conference on Practical Aspects of Knowledge Management (PAKM2000).
Dhillon, I. S., Mallela, S. and Modha, D. S. (2003) “Information-Theoretic Co-clustering”, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 89-98.
Dubes, R. C. (1987) “How Many Clusters are Best? - An Experiment”, Pattern Recognition, 20(6): 645-663.
Gonzalez, L. M., Giachetti, R. E. and Ramirez, G. (2005) “Knowledge Management-Centric Help Desk: Specification and Performance Evaluation”, Decision Support Systems, 40: 389-405.
Gormly, J. (2003) “Rapid Help Desk Revitalization”, Proceedings of the 31st Annual ACM SIGUCCS Conference on User Services.
Huang, Y. and Mitchell, T. M. (2006) “Text Clustering with Extended User Feedback”, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 413-420.
Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice-Hall Inc.
Jain, A. K., Murty, M. N. and Flynn, P. J. (1999) “Data Clustering: A Review”, ACM Computing Surveys, 31(3): 264-323.
Kriegsman, M. and Barletta, R. (1993) “Building a Case-Based Help Desk Application”, IEEE Expert, 8(6): 18-26.
Li, W., Ng, W., Liu, Y. and Ong, K. (2007) “Enhancing the Effectiveness of Clustering with Spectra Analysis”, IEEE Transactions on Knowledge and Data Engineering, 19(7): 887-902.
Luan, J. and Serban, A. M. (2002) “Technologies, Products, and Models Supporting Knowledge Management”, New Directions for Institutional Research, 113: 85-104.
Marwick, A. D. (2001) “Knowledge Management Technology”, IBM Systems Journal, 40(4): 814-830.
Mei, Q. and Zhai, C. (2005) “Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining”, Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198-207.
Nenkova, A. and Bagga, A. (2003) “Email Classification for Contact Centers”, Proceedings of the 2003 ACM Symposium on Applied Computing, 789-792.
Nilsson, M. (2002) “Hierarchical Clustering Using Non-Greedy Principal Direction Divisive Partitioning”, Information Retrieval, 5(4): 311-321.
Qi, Q., Gao, Q. and Shepherd, M. (2005) “Accessing Tacit Knowledge in the Pediatric Pain Email Archives”, Proceedings of the 38th Hawaii International Conference on System Sciences.
Salvador, S. and Chan, P. (2004) “Determining the Number of Clusters/Segments in Hierarchical Clustering/ Segmentation Algorithms”, Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’04), 576-584.
El Sawy, O. A. and Bowles, G. (1997) “Redesigning the Customer Support Process for the Electronic Economy: Insights from Storage Dimensions”, MIS Quarterly, 21(4): 457.
Sinnett, C. J. and Barr, T. (2004) “OSU Helpdesk: A Cost-Effective Helpdesk Solution for Everyone”, Proceedings of the 32nd Annual ACM SIGUCCS Conference on User Services, 209-216.
Taxt, T. and Lundervold, A. (1994) “Multispectral Analysis of the Brain Using Magnetic Resonance Imaging”, IEEE Transactions on Medical Imaging, 13(3): 470-481.
Watson, I. and Marir, F. (1994) “Case-Based Reasoning: A Review”, The Knowledge Engineering Review, Cambridge University Press, 9(4): 327-354.
Weng, S. S. and Liu, C. K. (2004) “Using Text Classification and Multiple Concepts to Answer Emails”, Expert Systems with Applications, 26: 529-543.
Yang, Q., Kim, E. and Racine, K. (1997) “Caseadvisor: Supporting Interactive Problem Solving and Case Based Maintenance for Help Desk Applications”, Proceedings of the IJCAI'97 Workshop on Practical Use of CBR, 32–44.
