A Webpage Structure Processing Algorithm - Extending the Page Tailor Toolkit

Lars Andrén
Göteborg : Chalmers tekniska högskola, 2007. 43 pp. Master's thesis - Technical Communication, Centre for Digital Media and higher education, Chalmers University of Technology, ISSN 1652-7674; 2007:2, 2007.
[Degree project, advanced level]

Research into user-preference-based automatic processing on the web, web page content adaptation for small screens, and the informative value of web pages has resulted in the design and implementation of an algorithm called the Domain Heritage algorithm. This algorithm extends the functionality of the Page Tailor toolkit, a program that is the result of C-Y Tsai’s thesis “Web Page Tailoring Tool for Mobile Devices”. The extension enables automatic processing of web pages for which no preferences about which parts to display have been stored. The Domain Heritage algorithm will not work unless at least one web page of the visited domain has been personalised previously. The extended toolkit was tested on ten subjects and a number of web sites. The test results were largely in accordance with expectations, but the test subjects’ experience in using the Page Tailor toolkit was found to have a considerable influence on how often the algorithm ran successfully. Three major conclusions are drawn. First, too much editing of the appearance of web page content can result in loss of informative value, and successful, fully automatic extraction of web page content requires semantic processing. Second, XPaths were a good choice of data for the algorithm to process: the results of the Big O analysis of the running time were acceptable, and it was possible to implement the algorithm in the existing software. Finally, previous experience in using the Page Tailor toolkit, as well as more than one personalised web page, is essential to the successful running of the Domain Heritage algorithm.

Keywords: Informative value, Web page structures, Algorithms, XPath, DOM, Ruby, Domain heritage, Technical Communication

