Appendix E: Page Database Format
The rules governing the layout of the page
database file are best demonstrated using an
example:
Web News Speak (Page File Version 6)
# This is a comment and is ignored
P 0000001,Altavista World News,http://world.altavista.com/
C News,Headlines
F 01-99I
I T,LATEST HEADLINES,Powered by
S ,.htm,01-21I,23-42I,44-99I
I T,DISPLAY TEMPLATE CATEGORY,BOTTOM TEMPLATE
END
In the above example, the page database file contains only one page definition.
Blank lines in the file are ignored, as are lines beginning with a ‘#’
symbol (these can be used for comments).
The first non-blank line in the file must contain version number information
for the page file. It must occupy exactly one line, but may contain any text.
This information will be included on the contents page of the compiled
newspaper.
The first character on each line indicates the kind of information that the
line contains. The beginning of a page definition is indicated by a line
beginning with a ‘P’, followed by a space.
The remainder of the line that begins with ‘P’ must contain the following
elements: an ID number, a description of the page, and the URL of the page.
Commas separate these elements, and all three must be present.
If any element on a line contains a comma (e.g. if the URL has a comma in it),
an escape character (a \ character) can be placed in front of the comma. The
comma will then be treated as part of the element and not as a separator
between elements.
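The escaping rule can be sketched in Python as follows. This is only an illustration, not the application's own code: the function name `split_fields` is hypothetical, and the sketch assumes that a backslash not followed by a comma is kept as an ordinary character, which the format description does not specify. The second URL in the usage note is invented for the example.

```python
def split_fields(line, escape="\\", sep=","):
    """Split a line on unescaped commas.

    An escaped comma (backslash + comma) becomes a literal comma inside
    the current element. Assumption: a backslash before any other
    character is kept unchanged.
    """
    fields = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line) and line[i + 1] == sep:
            current.append(sep)          # escaped comma: part of the element
            i += 2
        elif ch == sep:
            fields.append("".join(current))  # unescaped comma: end of element
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields
```

For instance, `split_fields("0000001,Altavista World News,http://world.altavista.com/")` yields the three elements of the ‘P’ line in the example file, while an element written as `News\, Sport` is returned as the single element `News, Sport`.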
The next line in the file above begins with ‘C’. This line defines the
categories that the main page belongs to. The line consists of a list of
categories separated by commas. The maximum number of categories allowed is
defined within the application (currently 10). If no category lines are
included for the page, that page will be at the top level of the menu
hierarchy.
The next line begins with ‘F’. This line defines the table formatting options
for the page. The ‘F’ line is optional; if it is omitted, all tables are
displayed normally.
The ‘F’ line consists of a comma-separated list of table formatting settings.
In the above example, the page has only one formatting option, i.e. 01-99I.
The numbers indicate that we are setting the formatting options for tables 1
to 99. The ‘I’ on the end indicates that the tables will be ignored. Either
single table numbers or ranges can be used, so 02I is also valid. Table
numbers must always be two digits, so numbers below 10 must have a leading 0.
As well as ‘I’, ‘A’ can be used to indicate that the tables should be
announced, or ‘D’ to indicate that the tables should be displayed normally
(this is the default). There can be any number of formatting options.
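A minimal sketch of how these table formatting settings might be interpreted is given below. The names `parse_table_settings` and `format_for` are hypothetical, and the sketch assumes that when two settings cover the same table the later one wins, which the description above does not state.

```python
import re

# Two digits, an optional "-" plus two more digits, then I, A or D.
_SETTING = re.compile(r"(\d{2})(?:-(\d{2}))?([IAD])")

def parse_table_settings(fields):
    """Map table number -> 'I' (ignore), 'A' (announce) or 'D' (display)."""
    settings = {}
    for field in fields:
        match = _SETTING.fullmatch(field.strip())
        if match is None:
            raise ValueError("bad table formatting setting: %r" % field)
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        if last < first:
            raise ValueError("range runs backwards: %r" % field)
        for table in range(first, last + 1):
            settings[table] = match.group(3)  # assumption: later settings override
    return settings

def format_for(settings, table):
    """Return the format for one table; 'D' (display normally) is the default."""
    return settings.get(table, "D")
```

With the settings `01-21I,23-42I,44-99I` from the example file, tables 22 and 43 are not covered by any range and therefore fall back to the default ‘D’.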
The next line begins with ‘I’. This line defines the ignored sections in the
page. After the ‘I’ and a space, the line has a single character, either ‘T’
or ‘F’, that indicates whether the page should be ignored from the start or
not. The rest of the line is a set of pieces of ‘trigger’ text, each of which
toggles the page between being ignored and not being ignored. When the page is
processed, if this character is ‘T’, the page will not be output to the
newspaper until the first piece of trigger text is found in the page. The text
will then be output normally until the next piece of trigger text is
encountered, at which point output will stop again. This process is repeated
until the end of the page is reached. There can be any number of pieces of
trigger text.
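The toggling behaviour can be illustrated with the short Python sketch below. It is a reading of the description rather than the application's actual code: the function name `apply_ignore` is hypothetical, triggers are assumed to be matched once each in the order listed, and the trigger text itself is assumed to belong to the section that it starts. None of these details is fixed by the description above.

```python
def apply_ignore(text, start_ignored, triggers):
    """Return the parts of `text` that are not ignored.

    start_ignored: True for an 'I T,...' line, False for 'I F,...'.
    triggers: pieces of trigger text, in the order given on the line.
    """
    output = []
    ignoring = start_ignored
    segment_start = 0   # where the current (ignored or output) section begins
    search_from = 0     # where to look for the next trigger
    for trigger in triggers:
        hit = text.find(trigger, search_from)
        if hit == -1:
            break                      # trigger not found: state stays as it is
        if not ignoring:
            output.append(text[segment_start:hit])
        ignoring = not ignoring        # each trigger toggles the state
        segment_start = hit            # assumption: trigger starts the new section
        search_from = hit + len(trigger)
    if not ignoring:
        output.append(text[segment_start:])
    return "".join(output)
```

Applied to the ‘I’ line of the example file (`T,LATEST HEADLINES,Powered by`), the sketch drops everything before "LATEST HEADLINES", keeps the text from there up to "Powered by", and drops the remainder of the page.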
The next line begins with ‘S’. This line defines settings for a group of sub
pages. The first element on the line defines a group of sub pages based on a
range of link numbers on the main page. This is useful if, for example, you
always want the first three links on a page to be processed in a certain way
and the rest differently. In the above file, this element has not been used,
and so is blank. The next element defines a filter that is applied to the URLs
of the links on the main page to extract specific pages.
In the example above, the filter ‘.htm’ was used. This means that the sub page
settings apply to any page that is linked to from the main page with a URL
containing ‘.htm’.
The rest of the
elements on this line are in the same format as the ‘F’ line
described above, and are used to set the table formatting options for the sub
pages in the selected group.
If there are no
‘S’ lines in a page definition, then no sub pages will be
downloaded. There can be any number of ‘S’ lines, each having any
number of table formatting options.
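As an illustration of how an ‘S’ line selects a group of sub pages, consider the Python sketch below. The names `parse_subpage_line` and `select_links` are hypothetical. The sketch assumes that links are numbered from 1 in page order and that, when both a link range and a URL filter are given, a link must satisfy both; escaped commas are not handled here. The URLs in the usage note are invented.

```python
def parse_subpage_line(rest):
    """Parse the text that follows 'S ' on a sub page line."""
    fields = rest.split(",")           # sketch only: no escape handling
    if len(fields) < 2:
        raise ValueError("an 'S' line needs a link range and a filter element")
    link_range = None
    if fields[0]:                      # e.g. "01-03"; blank means "no range"
        first, last = fields[0].split("-")
        link_range = (int(first), int(last))
    return {
        "link_range": link_range,
        "url_filter": fields[1],       # blank means "no URL filter"
        "table_settings": fields[2:],  # same format as the 'F' line
    }

def select_links(links, group):
    """Return the URLs from `links` that fall into the sub page group."""
    selected = []
    for number, url in enumerate(links, start=1):  # assumption: numbering from 1
        if group["link_range"] is not None:
            first, last = group["link_range"]
            if not first <= number <= last:
                continue
        if group["url_filter"] and group["url_filter"] not in url:
            continue
        selected.append(url)
    return selected
```

For the example line `S ,.htm,01-21I,23-42I,44-99I`, the link range is blank, so every link whose URL contains ‘.htm’ is selected: a link to `http://a.example/1.htm` would be included, and a link to `http://a.example/img.gif` would not.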
The next
line begins with ‘I’. This line works in exactly the same way as the
‘I’ line described above, except that it is applied to the sub pages
in the most recently defined sub page group and not to the main page. In this
case the ignore settings are applied to all the sub pages linked to from the
main page that have ‘.htm’ in their
URL.
The next line contains only ‘END’. This indicates the end of the page
definition and must not be omitted. If it is omitted and another page
definition begins, the new definition will overwrite the previous one.
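Putting the line-level rules together, a reader for the whole file might look like the following Python sketch. The function name `read_page_file` and the dictionary layout are hypothetical. Unlike the behaviour described above, where a missing ‘END’ causes the next definition to overwrite the previous one, this sketch chooses to report a missing ‘END’ as an error.

```python
SAMPLE = """Web News Speak (Page File Version 6)
# This is a comment and is ignored
P 0000001,Altavista World News,http://world.altavista.com/
C News,Headlines
F 01-99I
I T,LATEST HEADLINES,Powered by
S ,.htm,01-21I,23-42I,44-99I
I T,DISPLAY TEMPLATE CATEGORY,BOTTOM TEMPLATE
END
"""

def read_page_file(text):
    """Return (version_line, pages) for the contents of a page database file."""
    version = None
    pages = []
    current = None                      # page definition being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                    # blank lines are ignored
        if version is None:
            version = line              # first non-blank line: version info
            continue
        if line.startswith("#"):
            continue                    # comment lines are ignored
        if line == "END":
            if current is None:
                raise ValueError("'END' without a preceding 'P' line")
            pages.append(current)
            current = None
        elif line.startswith("P "):
            if current is not None:     # design choice of this sketch
                raise ValueError("missing 'END' before new page definition")
            current = {"P": line[2:], "lines": []}
        else:
            if current is None:
                raise ValueError("line outside a page definition: %r" % line)
            # Keep the line type (first character) and the rest of the line.
            current["lines"].append((line[0], line[2:]))
    if current is not None:
        raise ValueError("missing 'END' at end of file")
    return version, pages
```

Calling `read_page_file(SAMPLE)` returns the version line and a single page definition whose option lines have the types C, F, I, S and I, in that order.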
The format of the page database file is defined formally in the BNF grammar
below.
BNF Grammar for Page Database

Alphabet:

version_no   ::= String containing version number information
n            ::= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
format       ::= { ‘I’, ‘A’, ‘D’ }
id           ::= String containing ID of page
description  ::= String containing description of page
url          ::= String containing URL of page
nl           ::= new-line character
char         ::= Any character except new-line
category     ::= String containing the name of a category
trigger      ::= String containing trigger text for ignored sections
filter       ::= String containing filter for sub page URLs

Grammar:

<PAGE_FILE>    ::= version_no (<PAGE_DEF> | nl | <COMMENT>)*
<COMMENT>      ::= ‘#’ (char)* nl
<PAGE_DEF>     ::= <PAGE_LINE> (<PAGE_OPTS>)* ‘END’
<PAGE_LINE>    ::= ‘P ’ id ‘,’ description ‘,’ url nl
<PAGE_OPTS>    ::= (<TABLE_OPTS> | <IGNORE_OPTS>)*
                   <CATS>
                   (<TABLE_OPTS> | <IGNORE_OPTS>)*
                   (<SUBPAGE_OPTS>)*
<CATS>         ::= ‘C ’ category (‘,’ category)* nl
<TABLE_OPTS>   ::= ‘F ’ <FILTER> (‘,’ <FILTER>)* nl
<FILTER>       ::= ((n n ‘-’ n n format) | (n n format))
                   ((‘,’ n n ‘-’ n n format) | (‘,’ n n format))*
<IGNORE_OPTS>  ::= ‘I ’ trigger (‘,’ trigger)* nl
<SUBPAGE_OPTS> ::= ‘S ’ ((n n ‘-’ n n) | ε)
                   ‘,’ (filter | ε)
                   ‘,’ <FILTER> (‘,’ <FILTER>)* nl
                   (<IGNORE_OPTS>)*