Wednesday, December 9, 2009

COLLABORATIVE WRITING

CROSS-LANGUAGE KNOWLEDGE SHARING MODEL BASED ON ONTOLOGIES AND LOGICAL INFERENCE
WEISEN GUO
Science Integration Program (Human), Department of Frontier Sciences and Science Integration, Division of Project Coordination, The University of Tokyo, 5-1-5 Kashiwa-No-Ha
Kashiwa-Shi, Chiba-Ken 277-8568, Japan
E-mail: gws@scint.dpc.u-tokyo.ac.jp
STEVEN B. KRAINES†
Science Integration Program (Human), Department of Frontier Sciences and Science Integration, Division of Project Coordination, The University of Tokyo, 5-1-5 Kashiwa-No-Ha
Kashiwa-Shi, Chiba-Ken 277-8568, Japan
†E-mail: sk@scint.dpc.u-tokyo.ac.jp
Vast amounts of new knowledge are created on the Internet in many different languages every day. How to share and search this knowledge across different languages efficiently is a critical problem for information science and knowledge management. Conventional cross-language knowledge sharing models are based on natural language processing (NLP) technologies. However, natural language ambiguity, which is a problem even for single language NLP, is exacerbated when dealing with multiple languages. Semantic web technologies can circumvent the problem of natural language ambiguity by enabling human authors to specify meaning in a computer-interpretable form. In particular, description logics ontologies provide a way for authors to describe specific relationships between conceptual entities in a way that computers can process to infer implied meaning. This paper presents a new cross-language knowledge sharing model, SEMCL, which uses semantic web technologies to provide a potential solution to the problem of ambiguity. We first describe the methods used to support searches at the semantic predicate level in our model. Next, we describe how our model realizes a cross-language approach. We present an implementation of the model for the general engineering domain and give a scenario describing how the model implementation handles semantic cross-language knowledge sharing. We conclude with a discussion of related work.
1. Introduction
We live in an age of knowledge explosion. Knowledge sharing can significantly increase social capital (Widen-Wulff and Ginman, 2004). But much of the knowledge on the Internet is represented in diverse languages, which limits our ability to share and search knowledge globally. The traditional approach of sharing knowledge across diverse languages by manually translating each knowledge resource from the original language to all of the other languages is too slow and costly for tasks such as sharing scientific findings between researchers.
Automated cross-language technologies have been developed that use natural language processing (NLP) technologies to extract keywords for matching knowledge resources between different languages. However, NLP-based approaches cannot produce accurate matching results because of the ambiguity of natural language (Hunter and Cohen, 2006). Even thesauri or classification schemata are insufficient (Goldschmidt and Krishnamoorthy, 2008) because they do not support expressions of semantic relationships between keywords or named entities in text. Furthermore, the need to handle multiple languages in cross-language knowledge sharing models exacerbates the problem of natural language ambiguity. Some approaches to decrease the ambiguity have been reported in the literature. For example, Littman et al. (1998) used a latent semantic indexing technique to implement cross-language information retrieval. However, even these sophisticated NLP technologies do not address the fundamental issue of ambiguity in representing knowledge with natural language, an issue that is particularly problematic in a multilingual knowledge sharing situation.
Semantic Web technologies can be used to express knowledge in a computer-interpretable form and to enable matching at a semantic predicate level, e.g. matching of both named entities and predicates stating the semantic relationships between them. Specifically, ontologies constructed in a language such as OWL-DL can represent domain knowledge within a description logic (DL) formalism (www.w3.org/TR/2004/REC-owl-features-20040210). Then DL-based inference can be used in knowledge search to find more useful matching results (Guo and Kraines, 2008).
We present a cross-language knowledge sharing model in this paper that is based on ontologies and logical inference. Using this model, knowledge providers can publish knowledge resources in their native languages, and knowledge seekers can search for knowledge in different languages, thereby enabling cross-language knowledge sharing. Furthermore, both the descriptors of the knowledge resources and the search queries are represented in a form that can be interpreted semantically by a computer, which enables the computer to infer embedded meaning that is implied but not explicitly expressed. Therefore, the knowledge system implementing this model returns matching results represented in diverse languages that should be more accurate than those of conventional keyword based systems because matching is done at the semantic predicate level.
The rest of the paper is organized as follows. In section 2, we review the state of the art of knowledge sharing on the Internet. In section 3, we present the cross-language knowledge sharing model and describe the cross-language method that we have developed to implement the model. We discuss related work in section 4 and conclude the paper in section 5.
2. Knowledge Sharing using Information Retrieval and Semantic Web Technologies
Knowledge sharing is an activity through which knowledge is exchanged among people and/or organizations (Lin, 2007). In this paper, we focus on knowledge existing in explicit digital form on the Internet. A knowledge sharing community consists of two main types of knowledge users: knowledge providers and knowledge seekers. Community members can be both types: each knowledge user may both provide knowledge resources and seek knowledge resources. The goal of a knowledge sharing system is to return the correct knowledge resources to the knowledge seeker. A global-scale knowledge sharing community will invariably include knowledge users from different countries speaking different languages.
Information Retrieval (IR) technologies can quickly find matching results by matching keywords provided by knowledge seekers with knowledge resources that are represented in natural language. In this approach, the matching system uses automatic techniques such as Natural Language Processing to determine which knowledge resources match the keywords based on the natural language representations of those resources. Conventional cross-language knowledge sharing models are based on these IR technologies. However, the problems of natural language ambiguity and grammatical complexity, which already make it difficult to determine matches with free-text in a single language, become even more serious when dealing with different languages, which results in a rapid decrease in matching precision and recall.
The problem of natural language ambiguity can be addressed by enabling people to create descriptions of knowledge resources in a computer-understandable format. For example, accuracy of matching knowledge resources can be increased by considering predicate-level semantics (Hunter et al., 2008). In particular, Semantic Web technologies, such as ontologies and logical inference, can be used to implement knowledge sharing systems that can match knowledge resources with search descriptions at the level of a grammatical sentence. These systems, such as EKOSS (Kraines et al., 2006) and Annotea (Kahan et al., 2001), are based on a semantic model for matching knowledge resources, which we call Model SEM. In this model, the knowledge providers describe their knowledge resources using computer-interpretable semantic statements instead of natural language. The knowledge seekers also input their queries in a semantic way, rather than just listing keywords. For example, the EKOSS system uses semantic matching methods based on description logics (Kraines et al., 2006; Guo and Kraines, 2008) to match the descriptions of knowledge resources and the search queries. Because the EKOSS implementation of Model SEM supports semantic matching, it can help the knowledge seekers find more correct matching results by reducing the ambiguity in both descriptions of knowledge resources and search queries.
We suggest that the use of Semantic Web technologies to enable people to create computer-interpretable semantic statements describing knowledge resources and requirements could address the issues of ambiguity and grammatical complexity in cross-language knowledge sharing. Based on Model SEM, this paper presents a new model for cross-language knowledge sharing, which we call Model SEMCL. In this model, the knowledge providers describe their knowledge resources by creating computer-interpretable semantic statements using their preferred language. In the same way, the knowledge seekers use their preferred language to describe their queries. The system is able to infer semantic matches between the descriptions and the queries, and it then displays the matching results in the preferred language of the knowledge seeker (Fig. 1).
The following section gives the details of our proposed Model SEMCL.
3. The Cross-Language Knowledge Sharing Model SEMCL
Fig. 2 shows the framework of Model SEMCL supporting three hypothetical languages: La, Lb, and Lc. Model SEMCL has four levels: the core semantic matching level, the language level, the user interface level, and the human level. At the center of Model SEMCL is the domain ontology, which is comprised of classes and properties together with labels in plain text. Translations of the class and property labels in the domain ontology to the other languages (Ontology(La), Ontology(Lb), and Ontology(Lc)) are made by a human or machine translator before running the knowledge sharing system.
At the human level, there are two types of users: Knowledge Providers and Knowledge Seekers. Knowledge Providers use the domain ontology to create computer-interpretable semantic descriptions of their knowledge resources, descriptions that are populated by instances of the ontology classes together with properties that describe specific relationships between those instances. Because these descriptions are free from the ambiguity of natural language and grounded in the logic supported by the ontology, they can be used by computers for inference (Guo and Kraines, 2008). The language level of Model SEMCL supports the cross-language sharing and searching. The user interface level of Model SEMCL provides multi-language graphic user interfaces (GUIs), which users can use to provide or seek knowledge in their preferred language.
In Fig. 2, a Knowledge Provider, who prefers to use language La, uses the GUI in language La (GUI(La)) to create a semantic description (Statement + La) of her/his knowledge resource in language La. A Knowledge Seeker, who prefers to use language Lc, uses the GUI in language Lc (GUI(Lc)) to create a semantic query (Query + Lc) in language Lc. The matching results produced by the Inference Engine at the core semantic matching level (Result) are augmented with the language information for Lc to create (Result + Lc), which is shown to the Knowledge Seeker in GUI(Lc).
To make this kind of cross-language searching possible, each knowledge description has two parts: semantic statement (Statement) and language information. Each search query also has two parts: semantic query (Query) and language information. The language information is maintained in the language level. The semantic statement and semantic query go into the core level to be matched by the Inference Engine, which uses reasoning in the supported logic as well as optional rule-based reasoning to match all the available semantic statements with each semantic query. When the Inference Engine finds some matching results (Result), it returns them to the Knowledge Seeker. Language information is added to the matching results when it goes through the language level to the user interface of the Knowledge Seeker. In summary, Model SEMCL uses ontologies to handle the ambiguity of natural language, logical inference for semantic matching, and the method of separating language from semantics to handle the cross-language issue. The following subsections give the details for each of these techniques.
3.1. Knowledge representation and search
In Model SEMCL, there are two kinds of knowledge. The first kind is the domain knowledge: the basic concepts and their relationships in the targeted knowledge domain. The second kind is the knowledge that the Knowledge Providers want to share, which is described using the first kind of knowledge. In Model SEMCL, the first kind of knowledge must be created prior to the operation of the knowledge sharing system and kept relatively stable. It should also have sufficient detail to represent the second kind of knowledge, which makes up the contents of the knowledge base in Model SEMCL.
We have created an implementation of Model SEMCL for the domain of engineering knowledge. In our implementation, the first kind of knowledge is represented by using an OWL-DL ontology that we have created for that domain. There are five main classes in the ontology – substances, activities, physical objects, events, and classes of activities (actors and spatial locations are special kinds of physical objects) – as well as several properties that can be used to specify relationships between the classes or instances of the classes (Fig. 3). For example, an instance of the class “activity” can have a relationship with an instance of the class “class of activity” using the property “has activity class”. Each main class is divided into subclasses to represent more specific concepts from the engineering domain.
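As an illustration only, the class structure and the property example just described can be sketched in Python. The domain and range assigned to "has activity class" are assumptions inferred from the example in the text, and the helper names is_subclass and is_valid are hypothetical. The actual ontology is written in OWL-DL, where domain and range act as inference axioms rather than as the simple validity check used here.

```python
# Parent of each class; None marks one of the five main classes.
# "actor" and "spatial location" are special kinds of physical objects.
PARENT = {
    "substance": None,
    "activity": None,
    "physical object": None,
    "event": None,
    "class of activity": None,
    "actor": "physical object",
    "spatial location": "physical object",
}

# Property -> (assumed domain class, assumed range class).
PROPERTIES = {
    "has activity class": ("activity", "class of activity"),
}


def is_subclass(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def is_valid(subject_cls, prop, object_cls):
    """Check a relationship against the assumed domain and range of prop."""
    domain, rng = PROPERTIES[prop]
    return is_subclass(subject_cls, domain) and is_subclass(object_cls, rng)


print(is_valid("activity", "has activity class", "class of activity"))
print(is_valid("actor", "has activity class", "class of activity"))
```

The first call is accepted and the second is rejected, because an actor is a physical object rather than an activity.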
The second kind of knowledge, knowledge shared by the Knowledge Providers, is represented using the classes and properties provided in the first kind of knowledge. Specifically, the entities described by each piece of shared knowledge are represented as instances of ontology classes, and the specific relationships that are described between those entities are represented using ontology properties. For example, consider the following accident report:
“The central region of Seongsu Bridge, which was built in Seoul, the capital city of Korea, suddenly collapsed on October 21, 1994. A diesel bus fell, and several people were killed. According to the investigation after the accident, the collapse was caused by fractures in the steel girders of the bridge.”
The knowledge that is expressed in this report can be represented using the domain ontology as shown in Fig. 4.
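Fig. 4 is not reproduced here, so the following Python sketch gives only one plausible encoding of the report as instances and property triples. The class names follow terms that appear elsewhere in this paper where possible. The instance names, the property names, and the exact choice of classes are assumptions made for illustration and may differ from the actual description.

```python
# Instance -> ontology class (assumed assignment, not taken from Fig. 4).
INSTANCES = {
    "collapse": "disaster event",
    "fracture": "mechanical failure",
    "steel_girder": "machine artifact",
    "seongsu_bridge": "road bridge",
    "diesel_bus": "transportation device",
    "seoul": "spatial location",
}

# (subject, property, object) triples; property names are hypothetical.
TRIPLES = [
    ("collapse", "caused by", "fracture"),
    ("fracture", "has participant", "steel_girder"),
    ("steel_girder", "is part of", "seongsu_bridge"),
    ("seongsu_bridge", "has location", "seoul"),
    ("collapse", "has participant", "diesel_bus"),
]


def describe(instance):
    """List the (property, object) pairs stated for one instance."""
    return [(p, o) for s, p, o in TRIPLES if s == instance]


for prop, obj in describe("collapse"):
    print(f"collapse --{prop}--> {obj} ({INSTANCES[obj]})")
```

Each entity in the report becomes a typed instance, and each stated relationship becomes a triple, so no free text has to be interpreted at search time.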
In Model SEMCL, all knowledge resources are represented as knowledge descriptions in this way. When the Knowledge Seekers want to find knowledge resources, they create semantic queries, also based on the domain ontology, and send them to the Inference Engine (an example is given in section 3.3).
Upon receiving a semantic query, the Inference Engine matches it with all the available semantic statements using a DL reasoner. First, the Inference Engine loads the domain ontology to the knowledge base. Then it loads the semantic statement for one knowledge resource. Finally, it evaluates the semantic query against the knowledge base that now contains the ontology and the statement. If each ontology class in the query can be mapped to an instance in the knowledge base subject to the properties specified for that class in the query, then the semantic statement that was loaded to the knowledge base is said to match with the query (an example is given in section 3.3).
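The matching condition can be sketched as follows. This is not the DL reasoner used by the Inference Engine: it substitutes a simple subclass lookup and a brute-force search over candidate bindings, which is enough to show the idea of mapping each query class to an instance subject to the query properties. The class hierarchy, the property names, and the function names is_a and match are all hypothetical, and the statement is an assumed encoding of the bridge report from section 3.1.

```python
# Assumed fragment of a class hierarchy (child -> parent).
SUBCLASS_OF = {
    "road bridge": "physical object",
    "transportation device": "physical object",
    "machine artifact": "physical object",
    "fracture": "mechanical failure",
}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or is one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def match(query_classes, query_props, inst_classes, triples):
    """Return a variable -> instance binding that satisfies the query, or None."""
    names = list(query_classes)

    def extend(binding, i):
        if i == len(names):
            # All variables are bound: every query property must be stated.
            ok = all((binding[s], p, binding[o]) in triples for s, p, o in query_props)
            return dict(binding) if ok else None
        var = names[i]
        for inst, cls in inst_classes.items():
            if is_a(cls, query_classes[var]):
                binding[var] = inst
                found = extend(binding, i + 1)
                if found is not None:
                    return found
                del binding[var]
        return None

    return extend({}, 0)


# Assumed semantic statement for the bridge report.
INST = {
    "collapse": "disaster event",
    "fracture_1": "fracture",
    "steel_girder": "machine artifact",
    "seongsu_bridge": "road bridge",
    "diesel_bus": "transportation device",
}
TRIPLES = {
    ("collapse", "caused by", "fracture_1"),
    ("fracture_1", "has participant", "steel_girder"),
    ("steel_girder", "is part of", "seongsu_bridge"),
    ("collapse", "has participant", "diesel_bus"),
}

PROPS = [("d", "caused by", "f"), ("f", "has participant", "m"), ("m", "is part of", "t")]
QUERY_OBJECT = {"d": "disaster event", "f": "mechanical failure",
                "m": "machine artifact", "t": "physical object"}
QUERY_VEHICLE = dict(QUERY_OBJECT, t="transportation device")

print(match(QUERY_OBJECT, PROPS, INST, TRIPLES))
print(match(QUERY_VEHICLE, PROPS, INST, TRIPLES))
```

The first query matches because a fracture is a kind of mechanical failure and a road bridge is a kind of physical object. The second query returns None: the statement does contain a transportation device, but the failed girder is part of the bridge, not of that device.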
3.2. Cross-language Knowledge Sharing
Model SEMCL handles the cross-language issue by separating the language information from the semantic statement in the knowledge description that is created by the Knowledge Provider. Only the semantic statement is used to obtain the matching result. The language information is added to the semantic statement of the matching results before showing them to the Knowledge Seeker. Because Model SEMCL uses a domain ontology instead of natural language to represent the knowledge, it is easy to separate the language information from the semantic statement.
Fig. 5 shows the overall Model SEMCL cross-language mechanism for two languages (La and Lb). In a real application, the number of languages can be more. The domain ontology and the user interface are created to form the infrastructure level and translated into each of the supported languages before running the knowledge sharing system. The interface language and ontology language are paired. In other words, if the interface language is changed to La, then the ontology language is changed to La automatically.
The Knowledge Provider uses the interface and ontology in her/his preferred language to create her/his knowledge descriptions. For example, if the preferred language is La, then “GUI(La)” and “Ontology(La)” are used, and the knowledge description “Statement + La” is created. The “Statement” is just the semantic information that remains when the language information in language La is removed. In the same way, the Knowledge Seeker uses the interface and ontology in her/his preferred language to create her/his search query. For example, if the preferred language is Lb, then “GUI(Lb)” and “Ontology(Lb)” are used, and “Query + Lb” is created. The “Query” is the part that remains when the language information in language Lb is removed. The Inference Engine evaluates matches between the Statements and Queries using the ontology. If “Query” matches with “Statement”, then the matching result “Result” is created by the Inference Engine. Because the preferred language of the Knowledge Seeker is Lb, the language information for Lb is added to create “Result + Lb”, which is displayed to the Knowledge Seeker.
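A minimal sketch of this separation, assuming that every ontology term has a language-independent identifier and one label per supported language. The identifiers, the Chinese labels, and the function names strip_language and add_language are illustrative assumptions, and instance names are omitted for brevity.

```python
# Language-independent term ID -> label, one table per language.
# Both the IDs and the translations are assumed for illustration.
LABELS = {
    "en": {
        "Disaster": "disaster",
        "MechanicalFailure": "mechanical failure",
        "Engine": "engine",
        "causedBy": "caused by",
        "hasParticipant": "has participant",
    },
    "zh": {
        "Disaster": "事故",
        "MechanicalFailure": "机械故障",
        "Engine": "引擎",
        "causedBy": "起因于",
        "hasParticipant": "涉及",
    },
}


def strip_language(labelled_triples, lang):
    """Turn "Statement + lang" into the bare Statement (term IDs only)."""
    reverse = {label: term for term, label in LABELS[lang].items()}
    return [tuple(reverse[label] for label in triple) for triple in labelled_triples]


def add_language(semantic_triples, lang):
    """Turn a bare Result into "Result + lang" for display."""
    return [tuple(LABELS[lang][term] for term in triple) for triple in semantic_triples]


# A provider writes in Chinese; a seeker reads the result in English.
statement_zh = [("事故", "起因于", "机械故障"), ("机械故障", "涉及", "引擎")]
statement = strip_language(statement_zh, "zh")
print(statement)
print(add_language(statement, "en"))
```

Because matching operates only on the term identifiers, adding a further language under this design would require a new label table but no change to the stored statements or to the matching step.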
3.3. Scenario
Here, we illustrate how Model SEMCL works by using a scenario involving three Knowledge Providers – Jane, Hideo, and Zhang – who are sharing knowledge on a knowledge sharing system that supports three languages: English, Japanese and Chinese.
In our scenario, Jane is the person who provided the knowledge for the article about the accident described in section 3.1, and she prefers using English. She accesses the Model SEMCL knowledge sharing system and selects the English user interface to create a description for this knowledge resource (see Fig. 4). She adds the URL of the original news report to the description so that anyone finding this description to be of interest can access the knowledge resource (the news report) for details.
Hideo is a scientist studying failure knowledge who prefers using Japanese. He created a video to explain the failure mechanism behind an airplane accident in Israel that he wants to share. He selects the Japanese user interface to create a description for this video, linked to the URL of the video. The corresponding English description is:
“On October 4, 1992, soon after the take-off of a Boeing 747 cargo air transport of El-Al Israel Airlines, the two engines on the right wing dropped off, and the air transport went out of control, finally colliding into the apartment building. Thirty-nine apartment habitants were killed in this accident. The cause of engine separation during take-off phase was fatigue failure of the pylon fuse pin.”
Zhang is a vehicle engineer and a fan of car racing who prefers using Chinese. After learning of the failure of the China A1 Racing Team vehicle in the Indonesian A1 Grand Prix finals station, he created some illustrations that he wants to share, which show the reason for the transmission failure from his professional perspective. He selects the Chinese user interface to create a description for these illustrations and links the description to the URL of the illustrations. His description of the content of the illustrations in Chinese is as follows:
“中国A1赛车队于2006年2月12日参加了A1大奖赛印尼站决赛。其间,江腾一驾驶的赛车失去控制停在了赛道上。事故原因被认为是经过长期超强度运转的引擎老化引起的。”
The corresponding English translation is:
“The China A1 racing team participated in the Indonesian A1 Grand Prix finals station on February 12, 2006. The vehicle of Jiang Tengyi stopped on the track. The cause of the malfunction was considered to be the aging of the engine after a long time of highly intense use.”
The description created by Zhang is shown in Fig. 6.
A Knowledge Seeker named Helen, who prefers using English, is composing her thesis about mechanical failures of transportation machines. In order to find some more knowledge, she wants to utilize the knowledge sharing system. She selects the English user interface to create her query for “a disaster caused by the mechanical failure of a machine artifact that is part of a transportation device” (see Fig. 7).
Helen sends her query to the Inference Engine of the Model SEMCL knowledge sharing system and waits for search results. The Inference Engine compares the semantic part of the query with the semantic statement part of Jane’s description, Hideo’s description and Zhang’s description. Hideo’s description and Zhang’s description match with Helen’s query (Fig. 8). However, Jane’s description does not match. Even though her description does mention a disaster event, an activity, a transportation device, a machine artifact, and a mechanical failure, the object with the mechanical failure is a part of the road bridge, not a transportation device. This example demonstrates how Model SEMCL supports cross-language knowledge sharing based on matching at the semantic predicate level.
4. Related Work
In section 2, we considered some other models for knowledge sharing. In this section, we compare Model SEMCL with some closely related work from the literature.
Littman et al. (1998) used Latent Semantic Indexing (LSI) to retrieve cross-language documents automatically. They treated a set of dual-language documents as training documents to create a dual-language semantic space in which terms from both languages are represented. Standard mono-lingual documents are represented as language-independent numerical vectors in this semantic space, so queries in either language can retrieve documents in either language without the need to translate the query. The LSI method is based on keywords. The semantic space contains the dual-language terms that form an index for speeding up the retrieval. However, the semantic space does not support classifications and relationships, so the LSI method cannot support retrieval based on matching at the semantic predicate level.
Diaz-Galiano et al. (2008) used the Medical Subject Headings (MeSH) to expand queries in the task of multilingual image retrieval. The expansion consists of searching for terms from the topic query in the MeSH vocabulary and adding similar terms. MeSH has a hierarchical structure that provides a consistent way to retrieve information using different terms for the same concepts. However, the MeSH structure does not contain typed relationships. Therefore, the MeSH-based method also does not support retrieval at the semantic predicate level.
Wang et al. (2004) used a Publish/Subscribe system to share knowledge. In their system, the knowledge descriptions, which they call Events, are represented with RDF. They also used a domain ontology as the basic domain knowledge, which specifies the concepts involved in the Events, the relations between them, and the constraints on them. Their knowledge sharing model can be considered as an example of Model SEM.
Kraines et al. (2006) used semantic web technologies to share expert knowledge. The basic knowledge of the domain is presented to knowledge users as domain ontologies. The Knowledge Providers create knowledge descriptions for their knowledge resources, and the Knowledge Seekers create queries to search for knowledge of interest to them. Therefore, this system is also an implementation of Model SEM.
In summary, while the first two related research works support cross-language knowledge sharing, because they do not handle semantics directly, the accuracy of the matching results is limited. The last two related research works support the semantic predicate level search but do not handle the cross-language issue. Model SEMCL is a new contribution that uses the Model SEM approach to support cross-language knowledge sharing at the semantic predicate level.
5. Discussions and Conclusion
In today’s age of information explosion, vast amounts of new knowledge are generated every day in a diversity of languages. How to share and search this knowledge efficiently is one of the most important problems in the information science community. Conventional cross-language knowledge sharing models that are based on natural language processing technologies suffer from the exacerbated effect of ambiguity and grammatical complexity over multiple languages. This paper began with an analysis of the task of knowledge sharing in the Internet environment. An approach to matching knowledge resources using Semantic Web Technologies, called Model SEM, was identified. A new model based on Model SEM, SEMCL, was then proposed for cross-language semantic sharing and searching of knowledge resources. We introduced the framework of our proposal for Model SEMCL, focusing on the knowledge representation and search aspects. We then used a scenario based on an implementation of Model SEMCL for general engineering knowledge to demonstrate how Model SEMCL supports disambiguation, semantic predicate level matching, and cross-language sharing. The original contribution of this work is the creation of a cross-language knowledge sharing model that uses Semantic Web technologies to enable searches across multiple languages at a semantic predicate level.
Our implementation of Model SEMCL is accessible on the EKOSS website (www.ekoss.org). Currently, the EKOSS knowledge sharing system supports three languages: English, Japanese and Chinese. To date, a number of different users of EKOSS have used their preferred languages to create semantic statements to describe their knowledge resources.
6. Acknowledgments
The authors thank the President’s Office of the University of Tokyo for funding support.
References
Diaz-Galiano, M.C., Garcia-Cumbreras, M.A., Martin-Valdivia, M.T., Montejo-Raez, A., and Urena-Lopez, A. (2008). “Integrating MeSH Ontology to Improve Medical Information Retrieval”, In Peters, C., et al. (Eds.): CLEF 2007, LNCS 5152, 601-606.
Goldschmidt, D.E., and Krishnamoorthy, M. (2008). “Comparing keyword search to semantic search: a case study in solving crossword puzzles using the Google™ API”, Software-Practice & Experience, 38(4), 417-445.
Guo, W., and Kraines, S. (2008). “Explicit Scientific Knowledge Comparison Based on Semantic Description Matching”, American Society for Information Science and Technology 2008 Annual Meeting, Columbus, Ohio.
Hunter, L., and Cohen, K.B. (2006). “Biomedical Language Processing: What’s Beyond PubMed?”, Molecular Cell, 21, 589-594.
Hunter, L., Lu, Z., Firby, J., Baumgartner Jr, W.A., Johnson, H.L., Ogren, P.V., and Cohen, K.B. (2008). “OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression”, BMC Bioinformatics, 9:78, doi:10.1186/1471-2105-9-78.
Kahan, J., Koivunen, M.R., Prud’Hommeaux, E., and Swick, R.R. (2001). “Annotea: An Open RDF Infrastructure for Shared Web Annotations”, Proceedings of the WWW10 International Conference, Hong Kong, 623-632.
Kraines, S., Guo, W., Kemper, B., and Nakamura, Y. (2006). “EKOSS: A Knowledge-User Centered Approach to Knowledge Sharing, Discovery, and Integration on the Semantic Web”, ISWC 2006, 5th International Semantic Web Conference, LNCS 4273, 833-846.
Lin, H.F. (2007). “Effects of extrinsic and intrinsic motivation on employee knowledge sharing intentions”, Journal of Information Science, 33(2), 135-149.
Littman, M.L., Dumais, S.T., and Landauer, T.K. (1998). “Automatic cross-language information retrieval using latent semantic indexing”, In Grefenstette, G., editor, Cross-Language Information Retrieval, chapter 5. Kluwer Academic Publishers, Boston.
Wang, J., Jin, B., and Li, J. (2004). “An Ontology-based Publish/Subscribe System”, In Jacobsen, H.A., (Ed.): Middleware 2004, LNCS 3231, 232-253.
Widen-Wulff, G., and Ginman, M. (2004). “Explaining knowledge sharing in organizations through the dimensions of social capital”, Journal of Information Science, 30(5), 448-458.
