Deakin University
File(s) under permanent embargo

OWDEAH: Online Web Data Extraction based on Access History

journal contribution
posted on 2004-01-01, 00:00 authored by Z Li, W K Ng, Kok-Leong Ong
Web data extraction systems are the kernel of information mediators between users and heterogeneous Web data resources. How to extract structured data from semi-structured documents has been a problem of active research. Supervised and unsupervised methods have been devised to learn extraction rules from training sets. However, preparing training sets (especially annotating them for supervised methods) is very time-consuming. We propose a framework for Web data extraction that logs users' access history and exploits it to assist automatic training set generation. We cluster accessed Web documents according to their structural details, define criteria to measure the importance of sub-structures, and then generate extraction rules. We also propose a method to adjust the rules according to historical data. Our experiments confirm the viability of our proposal.
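The abstract does not say how documents are compared structurally, so the following is only an illustrative sketch of the clustering step, not the authors' method. It assumes that a page's structure can be summarized as the set of root-to-element tag paths, that similarity is the Jaccard index of two such sets, and that a greedy single-pass grouping with a hypothetical threshold of 0.6 is sufficient. All function and parameter names are invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature (set of tag paths) of one page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.6):
    """Greedy single-pass clustering of accessed pages.

    `pages` maps a page id to its HTML. Each page joins the first cluster
    whose representative signature is at least `threshold` similar;
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_paths, [page ids])
    for page_id, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((paths, [page_id]))
    return [members for _, members in clusters]
```

In this sketch, two table-based pages with the same nesting would fall into one cluster while a page built from `div` and `ul` elements would start another. Each cluster could then serve as a candidate training set for rule generation.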

History

Journal

Lecture notes in computer science

Volume

3181

Pagination

269 - 278

Publisher

Springer-Verlag

Location

Heidelberg, Germany

ISSN

0302-9743

eISSN

1611-3349

Language

eng

Publication classification

C1 Refereed article in a scholarly journal

Copyright notice

2004, Springer-Verlag
