Posted by whfcarter on the W3CHINA.ORG board 『 Semantic Web(语义Web)/描述逻辑/本体 』.
Our vision of semantic Web search

We have proposed a layer cake for semantic Web search. It comprises four layers, from bottom to top, as follows:

Knowledge Engineering Layer focuses on how to create semantic data. It includes knowledge annotation, knowledge extraction and knowledge fusion. In particular, we investigate collaborative annotation based on Wiki technologies. Moreover, we pay much attention to automatically extracting semantic data from Web 2.0 social corpora (e.g., Wikipedia, Del.icio.us).

Indexing and Search Layer focuses on semantic data management. It includes scalable triple store design for the data Web, as well as building suitable indices on top of those triple stores for fast lookup and query processing. Additionally, it integrates database and information retrieval perspectives to build efficient and effective search engines.

Query Interface and User Interaction Layer focuses on the usability of semantic search. It includes adapting different query interfaces (e.g., keyword interfaces, natural language interfaces) for semantic search, and it aims at interpreting user queries into potential system queries with respect to the underlying semantic data. Furthermore, it involves faceted browsing to ease the expression of complex information needs by end users.

These basic infrastructures enable us to build more intelligent applications. For example, we can provide semantic services for Wikipedia, and we can also exploit semantic technologies for e-tourism, semantic portals, the life sciences and personal information management.
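As a rough illustration, the layer cake can be sketched as a minimal pipeline. Every name and data item below is invented for illustration; this is a sketch of the architecture, not our implementation:

```python
# A minimal sketch of the semantic-search layer cake (illustrative names only).

def knowledge_engineering(raw_docs):
    """Bottom layer: turn raw Web content into (subject, predicate, object) triples."""
    triples = []
    for doc in raw_docs:
        # Real extraction (infoboxes, tags, free text) would happen here.
        triples.extend(doc.get("triples", []))
    return triples

def build_index(triples):
    """Middle layer: index triples for fast lookup (here, grouped by subject)."""
    index = {}
    for s, p, o in triples:
        index.setdefault(s, []).append((p, o))
    return index

def answer_query(index, keyword):
    """Top layer: interpret a user keyword against the indexed semantic data."""
    return {s: po for s, po in index.items() if keyword.lower() in s.lower()}

docs = [{"triples": [("Beijing", "country", "China"),
                     ("Beijing", "type", "City")]}]
idx = build_index(knowledge_engineering(docs))
print(answer_query(idx, "beijing"))
# {'Beijing': [('country', 'China'), ('type', 'City')]}
```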

In the Knowledge Engineering Layer, we have published the following work (2007 - 2008):

    Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring
    Published in the 6th International Semantic Web Conference (ISWC 2007)

    Abstract
Wikipedia, a killer application of Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality and well-structured content. However, the heavy burden of building up and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people, and many casual users may still find it difficult to write high-quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring and propose an integrated solution, based on RDF graph matching, to make Wikipedia authoring easier, with the aim of making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests internal links between Wikipedia articles; and 2) a category suggestion module that helps the user place her articles in the correct categories. A prototype system has been implemented, and experimental results show significant improvements over existing solutions to the link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden on professional editors, thus enhancing the current Wikipedia and making it an even better Semantic Web data source.
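To give a flavour of the graph-matching idea, here is a toy category-suggestion sketch based on set overlap between articles' RDF edges. The data and similarity measure are invented for illustration; the matching model in the paper is much richer than this:

```python
# Toy category suggestion via RDF-edge overlap (illustrative only).

def jaccard(a, b):
    """Overlap between two sets of (predicate, object) pairs."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Each known article is modelled by its outgoing RDF edges and its categories.
known = {
    "Beijing":  {"edges": {("country", "China"), ("type", "City")},
                 "cats": ["Cities in China"]},
    "Shanghai": {"edges": {("country", "China"), ("type", "City"),
                           ("region", "East China")},
                 "cats": ["Cities in China", "Port cities"]},
}

def suggest_categories(draft_edges, k=1):
    """Suggest categories taken from the k most similar known articles."""
    ranked = sorted(known.items(),
                    key=lambda kv: jaccard(draft_edges, kv[1]["edges"]),
                    reverse=True)
    cats = []
    for name, info in ranked[:k]:
        for c in info["cats"]:
            if c not in cats:
                cats.append(c)
    return cats

print(suggest_categories({("country", "China"), ("type", "City")}))
# ['Cities in China']
```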

    PORE: Positive-Only Relation Extraction from Wikipedia Text
    Published in the 6th International Semantic Web Conference (ISWC 2007)

    Abstract
Extracting semantic relations is of great importance for the creation of Semantic Web content, and it is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern-matching methods that rely on information redundancy do not work well here, since there is much less redundancy in Wikipedia than on the Web, and multi-class classification methods are not practical since no classification of relation types is available in Wikipedia. In this paper, we propose PORE (Positive-Only Relation Extraction) for relation extraction from Wikipedia text. The core algorithm, B-POL, extends a state-of-the-art positive-only learning algorithm with bootstrapping, strong negative identification and transductive inference so that it works with fewer positive training examples. We conducted experiments on several relations with different amounts of training data. The experimental results show that B-POL works effectively given only a small number of positive training examples, and that it significantly outperforms the original positive-only learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach to ontology population and can be adapted to other domains.
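The strong-negative-identification step can be illustrated with a toy sketch: unlabeled contexts that share nothing with the positive seeds are treated as strong negatives, and the rest are classified against both sets. The contexts and the 1-NN-style heuristic below are invented for illustration; B-POL itself builds on SVM-style positive-only learners:

```python
# Toy illustration of positive-only learning with strong-negative
# identification (hypothetical data; not the actual B-POL algorithm).

def overlap(a, b):
    return len(set(a) & set(b))

positives = [["born", "in"], ["was", "born"]]          # seed contexts
unlabeled = [["born", "in", "city"], ["album", "by"],
             ["released", "by"], ["was", "born", "on"]]

# Step 1: strong negatives = unlabeled items sharing no tokens with positives.
strong_neg = [u for u in unlabeled
              if all(overlap(u, p) == 0 for p in positives)]

# Step 2: classify remaining items by their closest labelled example.
def classify(x):
    pos_score = max(overlap(x, p) for p in positives)
    neg_score = max((overlap(x, n) for n in strong_neg), default=0)
    return "positive" if pos_score > neg_score else "negative"

print(strong_neg)                      # [['album', 'by'], ['released', 'by']]
print(classify(["born", "in", "city"]))   # positive
```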

    An Unsupervised Model for Exploring Hierarchical Semantics from Social Annotations
    Published in the 6th International Semantic Web Conference (ISWC 2007)

    Abstract
This paper deals with the problem of exploring hierarchical semantics from social annotations. Recently, social annotation services have become more and more popular on the Semantic Web. They allow users to annotate web resources arbitrarily and thus largely lower the barrier to cooperation. Furthermore, by providing abundant metadata, social annotation may become a key enabler for the development of the Semantic Web. On the other hand, social annotation has apparent limitations of its own, for instance: 1) ambiguity and synonymy, and 2) lack of hierarchical information. In this paper, we propose an unsupervised model to automatically derive hierarchical semantics from social annotations. Using the social bookmarking service Del.icio.us as an example, we demonstrate that the derived hierarchical semantics can compensate for those shortcomings. We further apply our model to another data set, from Flickr, to test its applicability in different environments. The experimental results demonstrate our model's efficiency.
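As a simple point of comparison, broader/narrower tag relations can be derived from co-occurrence counts with the classic subsumption heuristic (tag A is broader than tag B if B almost always co-occurs with A, but not vice versa). This is a well-known baseline, not the unsupervised model proposed in the paper; the bookmark data below is invented:

```python
# Subsumption baseline for deriving a tag hierarchy from co-occurrence.
from collections import Counter
from itertools import combinations

bookmarks = [
    {"programming", "python"},
    {"programming", "python", "web"},
    {"programming", "java"},
    {"programming"},
]

tag_count = Counter(t for b in bookmarks for t in b)
pair_count = Counter()
for b in bookmarks:
    for a, c in combinations(sorted(b), 2):
        pair_count[(a, c)] += 1
        pair_count[(c, a)] += 1

def subsumes(broad, narrow, t=0.8):
    """broad subsumes narrow if P(broad|narrow) >= t but P(narrow|broad) < t."""
    p_broad_given_narrow = pair_count[(broad, narrow)] / tag_count[narrow]
    p_narrow_given_broad = pair_count[(narrow, broad)] / tag_count[broad]
    return p_broad_given_narrow >= t and p_narrow_given_broad < t

print(subsumes("programming", "python"))  # True
print(subsumes("python", "programming"))  # False
```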

    Catriple: Extracting Triples from Wikipedia Categories
    Published in the 3rd Asian Semantic Web Conference (ASWC 2008)

    Abstract
As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization and rich knowledge. One important kind of triple describes a Wikipedia article through its non-isa properties, e.g. (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from low article coverage. In contrast, the category-based extraction methods exploit the widespread categories, but they rely on predefined properties, which is effort-consuming and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve wider article coverage with less manual effort. We realize this goal by utilizing the syntax and semantics provided by super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples at 12 confidence levels ranging from 47.0% to 96.4%, which cover 78.2% of Wikipedia articles; among them, 1.27M triples have a confidence of 96.4%. Applications can use the triples with suitable confidence on demand.
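One of the patterns such super-sub category pairs expose can be sketched with a small regex: when the super-category names the property ("Albums by artist") and the sub-category names the value ("Albums by The Beatles"), every article in the sub-category yields a triple. This is an illustrative sketch of the general idea; the actual extraction rules cover more syntax and attach confidence levels:

```python
import re

# Sketch of property/value extraction from an "X by <property>" super-category
# and an "X by <value>" sub-category (illustrative pattern only).

def extract(super_cat, sub_cat, articles):
    m_sup = re.fullmatch(r"(.+) by (\w+)", super_cat)
    m_sub = re.fullmatch(r"(.+) by (.+)", sub_cat)
    if not (m_sup and m_sub) or m_sup.group(1) != m_sub.group(1):
        return []
    prop, value = m_sup.group(2), m_sub.group(2)
    return [(a, prop, value) for a in articles]

triples = extract("Albums by artist", "Albums by The Beatles",
                  ["Abbey Road", "Let It Be"])
print(triples)
# [('Abbey Road', 'artist', 'The Beatles'), ('Let It Be', 'artist', 'The Beatles')]
```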

In the Indexing and Search Layer, we have published the following work (2007 - 2008):

    SOR: a practical system for ontology storage, reasoning and search
    Published in the 33rd International Conference on Very Large Data Bases (VLDB 2007)
    Abstract
An ontology, an explicit specification of a shared conceptualization, has been increasingly used to define formal data semantics and improve data reusability and interoperability in enterprise information systems. In this paper, we present and demonstrate SOR (Scalable Ontology Repository), a practical system for ontology storage, reasoning and search. SOR uses a relational DBMS to store ontologies, performs inference over them, and supports the SPARQL query language. Furthermore, faceted search with relationship navigation is designed and implemented for ontology search. This demonstration shows how to efficiently solve three key problems of practical ontology management in an RDBMS, namely storage, reasoning and search. Moreover, we show how the SOR system is used for semantic master data management.
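The storage idea — a triple table in a relational DBMS with indexes for lookup — can be sketched in a few lines. This uses SQLite for brevity and is not SOR's actual schema:

```python
import sqlite3

# Minimal relational triple store: one triple table plus lookup indexes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.execute("CREATE INDEX idx_sp ON triples (s, p)")   # subject-bound lookups
db.execute("CREATE INDEX idx_po ON triples (p, o)")   # object-bound lookups

db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Beijing", "type", "City"),
    ("Beijing", "country", "China"),
    ("Shanghai", "type", "City"),
])

# One triple pattern: all subjects with type City.
rows = db.execute(
    "SELECT s FROM triples WHERE p = ? AND o = ?", ("type", "City")
).fetchall()
print(sorted(r[0] for r in rows))   # ['Beijing', 'Shanghai']
```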

    Effective and Efficient Semantic Web Data Management on DB2
    Published in the 27th International Conference on Management of Data (SIGMOD 2008)
    Abstract
With the fast growth of the Semantic Web, more and more RDF data and ontologies are being created and widely used in Web applications and enterprise information systems. It is reported that the W3C Linking Open Data community project comprises over two billion RDF triples, interlinked by about three million RDF links. Recently, efficient RDF data management on top of relational databases has gained particular attention from both the Semantic Web community and the database community. In this paper, we present effective and efficient Semantic Web data management on DB2, including an efficient schema and index design for storage, practical ontology reasoning support, and an effective SPARQL-to-SQL translation method for RDF queries. Moreover, we show the performance and scalability of our system in an evaluation against well-known RDF stores and discuss future work.
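The core of a SPARQL-to-SQL translation can be sketched as follows: each triple pattern becomes one scan of the triple table, shared variables become join conditions, and constants become filters. This is a heavily simplified illustration (note the naive string quoting), not the translation method of the paper:

```python
# Sketch: translate a SPARQL basic graph pattern into a self-join over a
# triple table. Strings starting with '?' are variables. Constants are
# quoted naively here, for illustration only.

def bgp_to_sql(patterns):
    select, where, var_pos = [], [], {}
    for i, triple in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), triple):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in var_pos:                 # shared variable -> join
                    where.append(f"{ref} = {var_pos[term]}")
                else:
                    var_pos[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:                                   # constant -> filter
                where.append(f"{ref} = '{term}'")
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(select)} FROM {tables} WHERE {' AND '.join(where)}"

sql = bgp_to_sql([("?x", "type", "City"), ("?x", "country", "?c")])
print(sql)
# SELECT t0.s AS x, t1.o AS c FROM triples t0, triples t1
#   WHERE t0.p = 'type' AND t0.o = 'City' AND t1.s = t0.s AND t1.p = 'country'
# (printed on one line)
```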

    CE2 – Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
    Published in the 17th Conference on Information and Knowledge Management (CIKM 2008)

    Abstract
The Web contains a large number of documents and, increasingly, also semantic data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries are the principal means to retrieve semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle the challenges of hybrid search at large scale. For scalable storage, CE2 integrates databases with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures, which allow advanced ranking schemes to be integrated more tightly into the process. Experiments conducted on DBpedia and Wikipedia show that CE2 provides good performance in terms of both effectiveness and efficiency.

    Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data
    Published in the 6th International Semantic Web Conference (ISWC 2007)
    Abstract
As an extension of the current Web, the Semantic Web will contain not only structured data with machine-understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to help exploit textual information. It thus becomes very important to combine precise structured queries with imprecise keyword searches into a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, hybrid queries must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches. We show how existing information retrieval (IR) index structures and functions can be reused to index Semantic Web data and its textual information, and how hybrid queries are evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach, and lead us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used for searching Wikipedia and an IBM customer's product information.

    Efficient Index Maintenance for Frequently Updated Semantic Data
    Published in the 3rd Asian Semantic Web Conference (ASWC 2008)
    Abstract
Nowadays, the demand for querying and searching the Semantic Web is increasing. Some systems have adopted IR (information retrieval) approaches to index and search Semantic Web data because of their capability to handle Web-scale data and their efficiency in query answering. Additionally, the huge volumes of data on the Semantic Web are frequently updated, which requires effective update mechanisms for these systems to handle data change. However, existing update approaches focus only on documents; it remains a big challenge to update an IR index specially designed for semantic data in the form of finer-grained structured objects rather than unstructured documents. In this paper, we present a well-designed update mechanism for an IR index over triples. Our approach provides a flexible and effective update mechanism by dividing the index into blocks, which reduces the number of update operations during the insertion of triples while preserving efficient query processing and the capability to handle large-scale semantic data. Experimental results show that the index update time is only a fraction of that of complete reconstruction, proportional to the share of inserted triples, and that query response time is not notably affected. Thus, our approach makes newly arrived semantic data immediately searchable for users.
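The block-wise update idea can be sketched as follows: sealed blocks are never touched by inserts, which go into one small current block, so new data becomes searchable immediately without rebuilding the whole index. The class below is an invented illustration; the paper's block design for triples is more elaborate:

```python
# Sketch of block-wise index maintenance: postings live in sealed blocks
# plus one small current block; lookups merge postings across all blocks.

class BlockedIndex:
    def __init__(self, block_limit=2):
        self.sealed = []          # list of dicts: term -> posting list
        self.current = {}         # small, mutable block receiving inserts
        self.block_limit = block_limit
        self.size = 0

    def insert(self, term, triple_id):
        self.current.setdefault(term, []).append(triple_id)
        self.size += 1
        if self.size >= self.block_limit:     # seal the block, start fresh
            self.sealed.append(self.current)
            self.current, self.size = {}, 0

    def lookup(self, term):
        result = []
        for block in self.sealed + [self.current]:
            result.extend(block.get(term, []))
        return result

idx = BlockedIndex()
for i, term in enumerate(["city", "city", "country"]):
    idx.insert(term, i)
print(idx.lookup("city"))     # [0, 1]
print(idx.lookup("country"))  # [2]  (searchable immediately, no rebuild)
```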

In the Query Interface and User Interaction Layer, we have the following work (2007 - 2008):

    PANTO: A Portable Natural Language Interface to Ontologies
    Published in the 4th European Semantic Web Conference (ESWC 2007)

    Abstract
Providing a natural language interface to ontologies will not only offer ordinary users a convenient way of acquiring information from ontologies, but will consequently also expand the influence of ontologies and the Semantic Web. This paper presents PANTO, a Portable nAtural laNguage inTerface to Ontologies, which accepts generic natural language queries and outputs SPARQL queries. Based on a special treatment of nominal phrases, it adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser. Complex modifications in natural language queries, such as negations, superlatives and comparatives, are investigated. The experiments have shown that PANTO provides state-of-the-art results.

    SPARK: Adapting Keyword Query to Semantic Search
    Published in the 6th International Semantic Web Conference (ISWC 2007)

    Abstract
Semantic search promises to provide more accurate results than present-day keyword search. However, progress with semantic search has been delayed by the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the Semantic Web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named 'SPARK' has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In our experiments, SPARK achieved encouraging translation results.
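The three translation steps can be illustrated with a toy sketch: keywords are mapped to ontology elements, candidate queries are assembled from the mapped elements, and candidates are ranked by a score. The vocabulary, scores and query template below are invented; SPARK's probabilistic ranking model is far more involved:

```python
# Toy keyword-to-SPARQL translation: term mapping, query construction, ranking.

term_map = {               # keyword -> (kind, ontology element, score)
    "city":    [("class", "ex:City", 0.9)],
    "china":   [("instance", "ex:China", 0.8)],
    "country": [("property", "ex:country", 0.7)],
}

def translate(keywords):
    mapped = [term_map[k][0] for k in keywords if k in term_map]
    classes = [m for m in mapped if m[0] == "class"]
    insts = [m for m in mapped if m[0] == "instance"]
    props = [m for m in mapped if m[0] == "property"]
    candidates = []
    for c in classes:                       # combine elements into queries
        for p in props:
            for i in insts:
                score = c[2] * p[2] * i[2]  # naive product score
                q = (f"SELECT ?x WHERE {{ ?x rdf:type {c[1]} . "
                     f"?x {p[1]} {i[1]} }}")
                candidates.append((score, q))
    return sorted(candidates, reverse=True)

ranked = translate(["city", "country", "china"])
print(ranked[0][1])
# SELECT ?x WHERE { ?x rdf:type ex:City . ?x ex:country ex:China }
```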

    Q2Semantic: A Lightweight Keyword Interface to Semantic Search
    Published in the 5th European Semantic Web Conference (ESWC 2008)

    Abstract
The increasing amount of data on the Semantic Web offers opportunities for semantic search. However, formal queries hinder casual users in expressing their information needs, as such users might not be familiar with the query syntax or the underlying ontology. Because keyword interfaces are easier for casual users to handle, many approaches aim to translate keywords into formal queries. However, these approaches so far feature only very basic query ranking and do not scale to large repositories. We tackle the scalability problem by proposing a novel clustered-graph structure that corresponds to a summary of the original ontology. This reduced data space is then explored to compute the top-k queries. Additionally, we adopt several mechanisms for query ranking, which can consider many factors, such as query length, the relevance of ontology elements w.r.t. the query, and the importance of ontology elements. Experiments performed with our implemented system, Q2Semantic, show that we achieve good performance on many datasets of different sizes.
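The summary-graph idea can be sketched by collapsing all instances of a class into one cluster node, so that exploration runs on a much smaller graph than the original data. The data and the collapsing rule below are invented for illustration:

```python
# Sketch of a clustered summary graph: instance-level edges collapse to
# class-level edges, shrinking the space explored for query computation.

data_edges = [                      # (subject, predicate, object)
    ("ex:Beijing",  "ex:country", "ex:China"),
    ("ex:Shanghai", "ex:country", "ex:China"),
    ("ex:Beijing",  "rdf:type",  "ex:City"),
    ("ex:Shanghai", "rdf:type",  "ex:City"),
    ("ex:China",    "rdf:type",  "ex:Country"),
]

def summarize(edges):
    cls = {s: o for s, p, o in edges if p == "rdf:type"}
    summary = set()
    for s, p, o in edges:
        if p == "rdf:type":
            continue
        summary.add((cls.get(s, "ex:Thing"), p, cls.get(o, "ex:Thing")))
    return summary

print(summarize(data_edges))
# {('ex:City', 'ex:country', 'ex:Country')}  -- two data edges became one
```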

    Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data
    Published in the 25th International Conference on Data Engineering (ICDE 2009)

    Abstract
Keyword queries enjoy widespread usage as they represent an intuitive way of specifying information needs. Recently, answering keyword queries on graph-structured data has emerged as an important research topic. The prevalent approaches build on dedicated indexing techniques as well as search algorithms aimed at finding substructures that connect the data elements matching the keywords. In this paper, we introduce a novel keyword search paradigm for graph-structured data, focusing in particular on the RDF data model. Instead of computing answers directly, as in previous approaches, we first compute queries from the keywords, allow the user to choose the appropriate query, and finally process the query using the underlying database engine. Thereby, the full range of database optimization techniques can be leveraged for query processing. For the computation of queries, we propose a novel algorithm for the exploration of top-k matching subgraphs. While related techniques search for the best answer trees, our algorithm is guaranteed to compute all k subgraphs with the lowest costs, including cyclic graphs. By performing exploration only on a summary data structure derived from the data graph, we achieve promising performance improvements compared to other approaches.

    Snippet Generation for Semantic Web Search Engines
    Published in the 3rd Asian Semantic Web Conference (ASWC 2008)

    Abstract
With the development of the Semantic Web, more and more ontologies are available for exploitation by semantic search engines. However, while semantic search engines support the retrieval of candidate ontologies, the final selection of the most appropriate ontology is still difficult for end users. In this paper, we extend existing work on ontology summarization to support the presentation of ontology snippets. The proposed solution leverages a new semantic similarity measure to generate snippets that are tailored to the given query. Experimental results have shown the potential of our solution in this largely unexplored problem domain.

Putting them all together

    SearchWebDB: Searching the Billion Triples!
    Published in the 7th International Semantic Web Conference (ISWC 2008)

    Abstract
In recent years, the amount of structured data available on the Web in the form of triples has been increasing rapidly and has exceeded one billion. In this paper, we propose an infrastructure for searching these billion triples -- called SearchWebDB -- that integrates data sources publicly available on the Web in such a way that users can query the billion triples through a single interface. Approximate mappings between schemata as well as data elements are computed and stored in several indices. These indices are exploited by a query engine to perform query routing and result combination efficiently. As opposed to a standard distributed query engine, which requires the use of formal languages, users can pose keyword queries through SearchWebDB. These keywords are translated into possible interpretations presented as structured queries. Thus, complex information needs can be addressed without imposing too much of a burden on casual users.

Attached please find a document with a poster for each work. We hope you enjoy it.




Posted: 2009/1/13 20:51:00
     


