Chapter
6
Implementation
Introduction
This
chapter will describe the way in which the system design shown in chapter four
was implemented in C++. It will describe the data structures used, and the
changes that had to be made to the original specification during implementation.
The chapter is divided into four sections:
6.1 Page Database
6.2 Data Structures
6.3 Implementation Difficulties
6.4 Changes to
Specification
6.1. Page
Database
The page database is central to
the operation of the system. It needs to be as small as possible to minimise
download time.
The following information is
stored about each page:
- ID number to identify the page
- Description of the page
- URL of the page
- Categories the page is in
- Table formatting information
- Ignored sections
- Sub page options
- Table formatting information for sub pages
- Ignored sections in sub
pages
Ignored sections are
sections of the web page that are completely removed from the page that is
included in the newspaper.
Table formatting
options are used to specify how specific tables within the web page are
displayed in the newspaper. Each table within the page is given a number, and
these numbers are used to identify individual tables or ranges of
tables.
There are three modes that can be applied
to a table; these are Display, Ignore and Announce. Display mode is the default
mode, and simply displays the table as if it is normal text. Ignore mode allows
a table to be removed completely from the newspaper. Announce mode is for tables
of information that need to be announced as being tables, so “Table”
is inserted into the newspaper at the start of the table, and “Table
End” is inserted after the end of the table. This could be useful for
tables of football results for example.
One
way to minimise the size of the file was to only include each piece of
information only if it was different from the default setting. For example, for
the table formatting information only tables that need to be formatted
differently from the default method need be included in the
file.
The page database file is a simple text
file containing at least three lines for each page. At the start of each line is
a single letter that specifies what information is contained on that line. For
example, there must always be a line that begins with “P” that
contains the ID number of the page, it’s description, and it’s
URL.
Valid lines in the page file
are:
- “P” Page details (this is not
optional)
- “C” Categories (this is not
optional)
- “F” Table filtering options (if omitted the
defaults are used)
- “I” Ignored sections (if omitted then none
of the page is ignored)
- “S” Sub page options, including table
formatting options for sub pages (if omitted then no sub pages are
downloaded)
Each “S”
line represents a group of sub pages. There can be any number of these sub page
groups. The sub pages are identified using a filter that is applied to the URL
of the sub page to see if it should belong to that group or
not.
After each “S” line can follow
an “I” line that specifies the ignored sections for the sub pages in
that group.
A page can have any number of
table filtering options, any number of ignored sections, and any number of sub
page groups. Similarly, any sub page group can have any number of table
filtering options and any number of ignored sections. This allows a great deal
of customisation for each page in the
database.
To enable an administrator to
maintain the system, a supervisor mode was added to the application. When turned
on in the settings screen, this alters the way the page is parsed so that extra
information is provided to the administrator to allow them to set up the page
details. Ignored sections in the page are not removed, but are shown in a
smaller italic type. The tables in the page are all numbered so that the
administrator can choose which tables to Ignore or Announce. Some of the HTML
comments are also included so that they can be used as triggers for ignored
sections. The supervisor mode caused problems during implementation, and it
still does not work as well as it should.
A
more detailed description of how the page database is formatted can be found in
Appendix E.
6.2. Data
Structures
Most of the information that
is used by the system is stored within one complex data structure. This data
structure stores all the information from the page database, as well as which
pages have been selected for download by the
user.
The system was designed with a view to its
future expansion. With this in mind, the data structure that stored all the page
details could grow very large, and had to be implemented in a way that would
conserve memory as much as possible.
Because
it is not known in advance of executing the application how many pages exist in
the page database, how many formatting options would be needed for each page,
etc. storing the page details as an array is impractical. For this reason linked
lists were chosen as a good data structure for storing the page
details.
The main structure is a linked list
containing the page id, description, URL, and an array storing the categories.
Each list item also contains pointers to linked lists of table filtering
options, ignored sections, and sub page options. Each item in the list of sub
page options also contains pointers to linked lists of table filters and ignored
sections.
This data structure is shown in
figure 6.2a.
Figure
6.2a – Diagram of Main Data Structure
The diagram in figure 6.2a shows the list elements
as rectangles and the pointers connecting the elements as
arrows.
This data structure also has the
advantage that each individual instance of the “page getter” object
need only be provided pointers to the parts of the data structure containing the
pages it is concerned with. This simplifies mutual exclusion problems caused by
running multiple threads.
The settings for
the program that are not related to the pages to be downloaded are stored in a
different much simpler data structure. This is simply a collection of variables
containing the following details:
- Proxy settings
- Timeout and Retry settings
- Font size of newspaper created
- MODEM autodial settings
- Supervisor mode
toggle
These details are stored
directly in the INI file, along with a list of page ID’s of the pages that
the user has selected for
download.
6.3. Implementation
Difficulties
A number of aspects of
implementing the system proved to be problematic. One problem that was
encountered early on was getting the user interface to respond in the way
specified in the design. There were problems with changing the text on the
buttons when traversing the menu hierarchy. The code for this proved to be more
complex than anticipated, and because of its complexity it was implemented as a
separate object.
One of the biggest problems
encountered was processing the URLs within links on web pages. The problems
arose because links on web pages can use absolute URLs or relative URLs to point
to the same page.
For example, if the main
page of a site was at this
address:
http://www.umist.ac.uk/
A
link to a sub page could be referenced in the following ways:
- http://www.umist.ac.uk/courses/course1.html
- www.umist.ac.uk/courses/course1.html
- ./course1.htm
- /courses/course1.htm
Each
of these addresses points to the same page, but only the top one can be used in
calls to download the page.
The pseudo-code
for the algorithm used is included in Appendix F. It works by splitting the URL
into three parts – the domain, the path and the filename. When the main
page is retrieved, its domain and path are stored. When links to sub-pages are
processed, if they are relative links then the stored information is used to
re-build a complete URL. If absolute links are found, they are split up as
normal into domain, path and filename, and if the domain matches that of the
main page it is downloaded, otherwise an error is generated. In this way,
downloads can be limited to pages within the domain of the main
page.
6.4. Changes to
Specification
A few changes had to be
made to the specification during
implementation.
Support for image maps
(‘area’ and ‘map’ tags) was not implemented, and the
support for frames was not fully implemented. These changes were made because of
lack of time. The support for frames was not essential because the pages within
the frames are web pages in their own right, and the address of the inner page
can be used instead of the address of the frames page. In almost all cases this
is preferable anyway.
During implementation
it became apparent that the link following policy need only be concerned with
getting one level of sub-pages. This means that only links on the main page need
be followed. This simplified the page download process without greatly
compromising the amount of information included in the
newspaper.
The supervisor mode was also a
feature that was added during implementation, and not included in the original
design. This feature provides essential information to allow an administrator to
set up new pages and track down problems with existing pages.