Chapter 6

Implementation




Introduction

This chapter will describe the way in which the system design shown in chapter four was implemented in C++. It will describe the data structures used, and the changes that had to be made to the original specification during implementation. The chapter is divided into four sections:

6.1 Page Database
6.2 Data Structures
6.3 Implementation Difficulties
6.4 Changes to Specification


6.1. Page Database

The page database is central to the operation of the system. It needs to be as small as possible to minimise download time.

The following information is stored about each page:

Ignored sections are sections of the web page that are completely removed from the page that is included in the newspaper.

Table formatting options are used to specify how specific tables within the web page are displayed in the newspaper. Each table within the page is given a number, and these numbers are used to identify individual tables or ranges of tables.
There are three modes that can be applied to a table; these are Display, Ignore and Announce. Display mode is the default mode, and simply displays the table as if it is normal text. Ignore mode allows a table to be removed completely from the newspaper. Announce mode is for tables of information that need to be announced as being tables, so “Table” is inserted into the newspaper at the start of the table, and “Table End” is inserted after the end of the table. This could be useful for tables of football results for example.

One way to minimise the size of the file was to only include each piece of information only if it was different from the default setting. For example, for the table formatting information only tables that need to be formatted differently from the default method need be included in the file.

The page database file is a simple text file containing at least three lines for each page. At the start of each line is a single letter that specifies what information is contained on that line. For example, there must always be a line that begins with “P” that contains the ID number of the page, it’s description, and it’s URL.
Valid lines in the page file are:

Each “S” line represents a group of sub pages. There can be any number of these sub page groups. The sub pages are identified using a filter that is applied to the URL of the sub page to see if it should belong to that group or not.
After each “S” line can follow an “I” line that specifies the ignored sections for the sub pages in that group.

A page can have any number of table filtering options, any number of ignored sections, and any number of sub page groups. Similarly, any sub page group can have any number of table filtering options and any number of ignored sections. This allows a great deal of customisation for each page in the database.

To enable an administrator to maintain the system, a supervisor mode was added to the application. When turned on in the settings screen, this alters the way the page is parsed so that extra information is provided to the administrator to allow them to set up the page details. Ignored sections in the page are not removed, but are shown in a smaller italic type. The tables in the page are all numbered so that the administrator can choose which tables to Ignore or Announce. Some of the HTML comments are also included so that they can be used as triggers for ignored sections. The supervisor mode caused problems during implementation, and it still does not work as well as it should.

A more detailed description of how the page database is formatted can be found in Appendix E.


6.2. Data Structures

Most of the information that is used by the system is stored within one complex data structure. This data structure stores all the information from the page database, as well as which pages have been selected for download by the user.
The system was designed with a view to its future expansion. With this in mind, the data structure that stored all the page details could grow very large, and had to be implemented in a way that would conserve memory as much as possible.

Because it is not known in advance of executing the application how many pages exist in the page database, how many formatting options would be needed for each page, etc. storing the page details as an array is impractical. For this reason linked lists were chosen as a good data structure for storing the page details.

The main structure is a linked list containing the page id, description, URL, and an array storing the categories. Each list item also contains pointers to linked lists of table filtering options, ignored sections, and sub page options. Each item in the list of sub page options also contains pointers to linked lists of table filters and ignored sections.

This data structure is shown in figure 6.2a.

D:\stuff\html converters\RTF Report\Chapter600.gif
Figure 6.2a – Diagram of Main Data Structure

The diagram in figure 6.2a shows the list elements as rectangles and the pointers connecting the elements as arrows.

This data structure also has the advantage that each individual instance of the “page getter” object need only be provided pointers to the parts of the data structure containing the pages it is concerned with. This simplifies mutual exclusion problems caused by running multiple threads.

The settings for the program that are not related to the pages to be downloaded are stored in a different much simpler data structure. This is simply a collection of variables containing the following details:

These details are stored directly in the INI file, along with a list of page ID’s of the pages that the user has selected for download.


6.3. Implementation Difficulties

A number of aspects of implementing the system proved to be problematic. One problem that was encountered early on was getting the user interface to respond in the way specified in the design. There were problems with changing the text on the buttons when traversing the menu hierarchy. The code for this proved to be more complex than anticipated, and because of its complexity it was implemented as a separate object.

One of the biggest problems encountered was processing the URLs within links on web pages. The problems arose because links on web pages can use absolute URLs or relative URLs to point to the same page.

For example, if the main page of a site was at this address:
http://www.umist.ac.uk/
A link to a sub page could be referenced in the following ways:
  1. http://www.umist.ac.uk/courses/course1.html
  2. www.umist.ac.uk/courses/course1.html
  3. ./course1.htm
  4. /courses/course1.htm
Each of these addresses points to the same page, but only the top one can be used in calls to download the page.

The pseudo-code for the algorithm used is included in Appendix F. It works by splitting the URL into three parts – the domain, the path and the filename. When the main page is retrieved, its domain and path are stored. When links to sub-pages are processed, if they are relative links then the stored information is used to re-build a complete URL. If absolute links are found, they are split up as normal into domain, path and filename, and if the domain matches that of the main page it is downloaded, otherwise an error is generated. In this way, downloads can be limited to pages within the domain of the main page.


6.4. Changes to Specification

A few changes had to be made to the specification during implementation.
Support for image maps (‘area’ and ‘map’ tags) was not implemented, and the support for frames was not fully implemented. These changes were made because of lack of time. The support for frames was not essential because the pages within the frames are web pages in their own right, and the address of the inner page can be used instead of the address of the frames page. In almost all cases this is preferable anyway.

During implementation it became apparent that the link following policy need only be concerned with getting one level of sub-pages. This means that only links on the main page need be followed. This simplified the page download process without greatly compromising the amount of information included in the newspaper.

The supervisor mode was also a feature that was added during implementation, and not included in the original design. This feature provides essential information to allow an administrator to set up new pages and track down problems with existing pages.