File(s) under permanent embargo
Leveraging visual features and hierarchical dependencies for conference information extraction
conference contribution
posted on 2013-04-10, 00:00 authored by Y You, G Xu, J Cao, Y Zhang, Guangyan Huang
Traditional information extraction methods mainly rely on visual feature assisted techniques; but without considering the hierarchical dependencies within the paragraph structure, some important information is missing. This paper proposes an integrated approach for extracting academic information from conference Web pages. Firstly, Web pages are segmented into text blocks by applying a new hybrid page segmentation algorithm which combines visual feature and DOM structure together. Then, these text blocks are labeled by a Tree-structured Random Fields model, and the block functions are differentiated using various features such as visual features, semantic features and hierarchical dependencies. Finally, an additional post-processing is introduced to tune the initial annotation results. Our experimental results on real-world data sets demonstrated that the proposed method is able to effectively and accurately extract the needed academic information from conference Web pages. © 2013 Springer-Verlag.
Event
Asia-Pacific Web Conference on Web Technologies and Applications (15th : 2013 : Sydney, N.S.W.)
Volume
7808
Series
Lecture Notes in Computer Science
Pagination
404 - 416
Publisher
Springer
Location
Sydney, N.S.W.
Place of publication
Berlin, Germany
Start date
2013-04-04
End date
2013-04-06
ISSN
0302-9743
eISSN
1611-3349
ISBN-13
9783642374012
Language
eng
Publication classification
E Conference publication; E1.1 Full written paper - refereed
Copyright notice
2013, Springer
Editor/Contributor(s)
Y Ishikawa, J Li, W Wang, R Zhang
Title of proceedings
Web Technologies and Applications