Title: What is in a URL? Genre classification of webpages from URLs
Publication Type: Conference Paper
Year of Publication: 2012
Authors: Abramson, M, Aha, DW
Conference Name: AAAI Workshop on Intelligent Techniques for Web Personalization and Recommendation
Publisher: AAAI Press
Conference Location: Toronto (Ontario), Canada
Keywords: Behavioral web analysis, machine learning
Abstract

The importance of URLs in the representation of a document cannot be overstated. Shorthand mnemonics such as “wiki” or “blog” are often embedded in a URL to convey its functional purpose or genre. Other mnemonics have evolved from use (e.g., a WordPress particle is strongly suggestive of blogs). Can we leverage this predictive power to induce the genre of a document from the representation of a URL? This paper presents a methodology for webpage genre classification from URLs which, to our knowledge, has not been previously attempted. Experiments using machine learning techniques to evaluate this claim show promising results, and a novel algorithm for character n-gram decomposition is provided. Such a capability could be useful to improve personalized search results, disambiguate content, efficiently crawl the Web in search of relevant documents, and construct behavioral profiles from clickstream data without parsing the entire document.
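
The abstract does not spell out the paper's n-gram decomposition algorithm, so the following is only a minimal sketch of the general idea of turning a URL into character n-gram features. It is a plain sliding-window decomposition, not the novel algorithm the paper provides. The function names (`url_tokens`, `char_ngrams`, `url_features`), the n-gram range of 3 to 5, and the sample URL are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host, path, and query into lowercase alphanumeric tokens."""
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query]).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


def char_ngrams(token, n_min=3, n_max=5):
    """Return every character n-gram of the token for n in [n_min, n_max]."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def url_features(url, n_min=3, n_max=5):
    """Count character n-grams over all tokens of a URL (hypothetical feature map)."""
    counts = Counter()
    for token in url_tokens(url):
        counts.update(char_ngrams(token, n_min, n_max))
    return counts


# Hypothetical URL, used only to show the shape of the output.
features = url_features("http://example.wordpress.com/blog/2012/post")
print(features["blog"], features["press"])
```

A feature map of this kind could be passed to any standard classifier; mnemonics such as “blog” or “wiki” surface as n-grams without the page itself ever being fetched or parsed.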

Refereed Designation: Refereed
NRL Publication Release Number: 12-1231-1194