Leveraging visual features and hierarchical dependencies for conference information extraction

You, Y; Xu, G; Cao, J; Zhang, Y; Huang, Guangyan

File(s) under permanent embargo

conference contribution

Posted on 2013-04-10, 00:00; authored by Y You, G Xu, J Cao, Y Zhang, Guangyan Huang

Traditional information extraction methods rely mainly on visual features, but because they ignore the hierarchical dependencies within the paragraph structure, some important information is missed. This paper proposes an integrated approach for extracting academic information from conference Web pages. First, Web pages are segmented into text blocks by a new hybrid page segmentation algorithm that combines visual features with DOM structure. Then, these text blocks are labeled by a Tree-structured Random Fields model, which differentiates block functions using visual features, semantic features, and hierarchical dependencies. Finally, a post-processing step tunes the initial annotation results. Experimental results on real-world data sets demonstrate that the proposed method extracts the needed academic information from conference Web pages effectively and accurately. © 2013 Springer-Verlag.
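The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of the general idea of splitting a page into text blocks using both DOM structure and a visual cue. Everything in it is an assumption made for illustration: the choice of block-level tags, the use of bold and heading tags as a stand-in for visual emphasis, and the names `BlockSegmenter` and `segment`. It is not the authors' hybrid algorithm. Each block records its DOM depth, which is the kind of hierarchical feature a tree-structured labeling model could consume.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries in this sketch.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4"}
# Assumption: these tags act as a crude proxy for visual emphasis.
EMPHASIS_TAGS = {"b", "strong", "h1", "h2", "h3", "h4"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockSegmenter(HTMLParser):
    """Hypothetical helper: collects text blocks with simple features."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags (the DOM path)
        self.blocks = []   # finished blocks
        self._text = []    # text fragments of the block being built
        self._emph = False
        self._depth = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append(
                {"text": text, "depth": self._depth, "emphasized": self._emph}
            )
        self._text = []
        self._emph = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in BLOCK_TAGS:
            self.flush()
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return
        if not self._text:
            # DOM depth at which the block's first text appears.
            self._depth = len(self.stack)
        if any(t in EMPHASIS_TAGS for t in self.stack):
            self._emph = True
        self._text.append(data)


def segment(html):
    """Return a list of text blocks with DOM-depth and emphasis features."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


SAMPLE = (
    "<div><h2>Important Dates</h2><ul>"
    "<li>Submission: <b>Jan 5</b></li>"
    "<li>Notification: Feb 1</li></ul></div>"
)

for block in segment(SAMPLE):
    print(block)
```

In a fuller pipeline, blocks like these would be passed to a sequence or tree labeling model; the features here are deliberately simple and would need rendering information (font size, position) to count as true visual features.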

History

Event

Asia-Pacific Web Conference on Web Technologies and Applications (15th : 2013 : Sydney, N.S.W.)

Volume

7808

Series

Lecture Notes in Computer Science

Pagination

404 - 416

Publisher

Springer

Location

Sydney, N.S.W.

Place of publication

Berlin, Germany

Publisher DOI

https://doi.org/10.1007/978-3-642-37401-2_41

Start date

2013-04-04

End date

2013-04-06

ISSN

0302-9743

eISSN

1611-3349

ISBN-13

9783642374012

Language

eng

Publication classification

E Conference publication; E1.1 Full written paper - refereed

Copyright notice

2013, Springer

Editor/Contributor(s)

Y Ishikawa, J Li, W Wang, R Zhang

Title of proceedings

Web Technologies and Applications

Keywords

information extraction; visual feature; DOM structure; tree-structured conditional random fields
