Online Demo:
Entity Search Engine:
Towards Agile Best-Effort Information Integration over the Web
Note: This demo must be viewed with a JavaScript-enabled browser. Firefox is recommended; the demo may have problems with certain versions of IE, e.g., IE7.
Overview
This online demo shows the work discussed in the following demo paper:
Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web. In Proceedings of the 2007 CIDR Conference. [PDF]
In particular, this demo shows two demo scenarios, one from the education domain regarding midwest CS departments and one from the book domain, supported by our Entity Search system.
Demo Interface
- Overview:
The core component of our Entity Search system, the Query Engine, is demonstrated in two different interactive demo scenarios listed
under Demo Scenarios below. Each demo scenario is built upon
the basic Query Engine query interface, with various example queries
demonstrating some possible "applications." Users are welcome to try out the example queries, modify them, and come up with their own queries.
- Query Interface:
This portion is the query interface of our Query Engine (as explained
in Section 5.1 of the paper). It consists of the following input fields. The first three fields correspond to our Constructor operator, which specifies various aspects of the query:
- Matching Pattern:
Specify the Join Pattern of the target relation, e.g.,
uw50(#professor fax #email), ow20(#title #author)
- Scoring Measure:
Specify the Scoring Function of the target relation, e.g.,
tf (tuple frequency), dtf (distance weighted tuple frequency), mi (mutual information), tscore, cprod (confidence product).
- Entity Filter:
Specify constraints on the attributes specified in the Matching Pattern, e.g.,
professor(equalto David DeWitt), title(contains Romeo and Juliet).
- Corpus Restriction:
Specify the domains of the target relation using a regular expression, e.g.,
*.uiuc.edu, cs.*.edu
- Links Per Answer:
For result presentation, specifies the number of supporting URLs to return
for each answer (i.e., tuple).
- Order By:
For result presentation, specifies the attribute by which to order
tuples, e.g., Order By "research". By default, results are ranked by score.
- Demo Scenarios:
- Midwest Computer Science Domain:
This scenario focuses on the surface Web, by collecting pages regarding CS departments in six midwest universities. We support queries over the following extracted attributes in this scenario: professor, research, university, email, and phone.
- Ecommerce Domain:
This scenario focuses on the eCommerce domain, by collecting pages regarding cell phones. We support queries over the following extracted attributes in this scenario: brand, provider, model, price, and dim (dimension).
- Yellowpage Domain:
This scenario focuses on the email and phone attributes. It runs on 2TB of data comprising over 90 million web pages and is powered exclusively by 35 servers.
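To make the Matching Pattern and Corpus Restriction fields of the query interface more concrete, here is a minimal Python sketch of how such inputs might be read. It is not the Query Engine's actual parser. The grammar is an assumption inferred only from the examples above: an operator name, an integer window, and a parenthesized token list in which `#` marks an entity attribute. The Corpus Restriction examples look like shell-style wildcards rather than strict regular expressions, so the sketch treats them as wildcards. The names `MatchingPattern`, `parse_matching_pattern`, and `host_matches` are hypothetical.

```python
import re
from fnmatch import fnmatchcase
from typing import NamedTuple

# Assumed grammar (inferred from the examples, not from the system itself):
#   <operator><window>(<token> <token> ...)
# A token with a leading '#' is an entity attribute; any other token is a
# literal keyword.
_PATTERN_RE = re.compile(r"^\s*([a-z]+)(\d+)\(([^()]*)\)\s*$")


class MatchingPattern(NamedTuple):
    operator: str   # e.g. "uw" or "ow", copied verbatim from the query
    window: int     # the integer that follows the operator
    entities: list  # tokens written with a leading '#', without the '#'
    keywords: list  # remaining literal tokens


def parse_matching_pattern(text: str) -> MatchingPattern:
    """Split one pattern such as 'uw50(#professor fax #email)' into parts."""
    m = _PATTERN_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized matching pattern: {text!r}")
    operator, window, body = m.groups()
    tokens = body.split()
    entities = [t[1:] for t in tokens if t.startswith("#")]
    keywords = [t for t in tokens if not t.startswith("#")]
    return MatchingPattern(operator, int(window), entities, keywords)


def host_matches(host: str, restrictions: str) -> bool:
    """True if host matches any comma-separated wildcard in restrictions.

    Assumption: '*' is a shell-style wildcard, so it may also span dots.
    """
    globs = [g.strip() for g in restrictions.split(",") if g.strip()]
    return any(fnmatchcase(host.lower(), g.lower()) for g in globs)


if __name__ == "__main__":
    print(parse_matching_pattern("uw50(#professor fax #email)"))
    print(host_matches("cs.uiuc.edu", "*.uiuc.edu, cs.*.edu"))
```

Keeping entity attributes separate from plain keywords mirrors the way the Entity Filter refers to attributes by name (e.g., professor, title), so a filter could be validated against the `entities` list of the parsed pattern.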
System Prototype Description
- The System:
- Behind this demo system, our query engine is built upon the Lemur
Information Retrieval engine in C++, on the platform of Red Hat Enterprise Linux 3 WS. The server
runs on a Pentium-4 2.6GHz PC with 1GB memory.
- The Datasets:
- For our Midwest Computer Science Domain scenario, we crawl and index a dataset consisting
of HTML pages from six US computer science departments, as provided
by the Stanford WebBase Project from their January 2004 crawl. For this scenario, our attribute extraction has extracted: professor, research, university, email, and phone. The dataset has the following statistics:
| # | University | # pages | Size (raw) |
|---|------------|---------|------------|
| 1 | IIT | 1305 | 10MB |
| 2 | Illinois | 26158 | 927MB |
| 3 | Indiana | 6265 | 56MB |
| 4 | Michigan | 18982 | 172MB |
| 5 | Purdue | 10002 | 121MB |
| 6 | Wisconsin | 19741 | 117MB |
- For our Shakespeare's Book Domain scenario, we obtain pages by querying three online bookstores: www.amazon.com, www.barnesandnoble.com, and www.buy.com.
For each site, we send a query with the Author input field set to Shakespeare. We then manually collect the result pages containing the top 100 results returned. For this scenario, our attribute extraction has extracted: title, author, image, price, and date.
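For a sense of the overall scale of the Midwest Computer Science dataset, the per-university statistics listed above can be totaled. The short Python sketch below copies the page counts and raw sizes from the statistics table; the names `DATASET` and `summarize` are hypothetical and exist only for this illustration.

```python
# (university, number of pages, raw size in MB), copied from the statistics table
DATASET = [
    ("IIT", 1305, 10),
    ("Illinois", 26158, 927),
    ("Indiana", 6265, 56),
    ("Michigan", 18982, 172),
    ("Purdue", 10002, 121),
    ("Wisconsin", 19741, 117),
]


def summarize(rows):
    """Return (total pages, total raw size in MB) over all universities."""
    total_pages = sum(pages for _, pages, _ in rows)
    total_mb = sum(size_mb for _, _, size_mb in rows)
    return total_pages, total_mb


if __name__ == "__main__":
    pages, size_mb = summarize(DATASET)
    print(f"{pages} pages, {size_mb} MB raw")
```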