Appendix E

Page Database Format




The rules governing the layout of the page database file are best demonstrated using an example:

Web News Speak (Page File Version 6)

# This is a comment and is ignored

P 0000001,Altavista World News,http://world.altavista.com/
C News,Headlines
F 01-99I
I T,LATEST HEADLINES,Powered by
S ,.htm,01-21I,23-42I,44-99I
I T,DISPLAY TEMPLATE CATEGORY,BOTTOM TEMPLATE
END

In the above example, the page database file contains only one page definition. Blank lines and lines beginning with a ‘#’ symbol are ignored (the latter can be used for comments).

The first non-blank line in the file must contain version number information for the page file. It must occupy exactly one line, but can contain any text. This information is included on the contents page of the compiled newspaper.
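The line-level rules described so far can be sketched in Python as follows. This is only an illustration and not the application's own code: the function name read_logical_lines is hypothetical, and it assumes that the version line is strictly the first non-blank line, even if that line happens to begin with ‘#’.

```python
def read_logical_lines(text):
    """Split a page database file into its version line and meaningful lines.

    Hypothetical helper, not part of the application. Assumes the first
    non-blank line is always the version line, even if it starts with '#'.
    """
    version = None
    body = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines are ignored everywhere in the file
        if version is None:
            version = raw  # first non-blank line: free-form version text
            continue
        if raw.startswith("#"):
            continue  # comment line
        body.append(raw)
    return version, body


sample = (
    "Web News Speak (Page File Version 6)\n"
    "\n"
    "# This is a comment and is ignored\n"
    "\n"
    "P 0000001,Altavista World News,http://world.altavista.com/\n"
    "END\n"
)
version, body = read_logical_lines(sample)
print(version)
print(body)
```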

The beginning of a page definition is indicated by a line beginning with a ‘P’, followed by a space. The first character on a line is used to indicate the information that the line contains.

The remainder of the line that begins with ‘P’ must have the following elements:
An ID number, a description of the page, and the URL of the page. Commas separate these elements and they must all be present.

If any element on a line contains a comma (e.g. if the URL has a comma in it), an escape character (a \ character) can be placed in front of the comma. The comma is then treated as part of the element and not as a separator between elements.
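The escaping rule can be illustrated with a short Python sketch. The function name split_elements is hypothetical, and the sketch assumes that a backslash which is not followed by a comma is kept as an ordinary character, since the format description does not cover that case.

```python
def split_elements(text):
    """Split a line's remainder on commas, honouring the backslash escape.

    Assumption: a backslash not followed by a comma is kept literally.
    """
    elements = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] == ",":
            current.append(",")  # escaped comma becomes part of the element
            i += 2
            continue
        if ch == ",":
            elements.append("".join(current))  # unescaped comma ends the element
            current = []
        else:
            current.append(ch)
        i += 1
    elements.append("".join(current))  # last element has no trailing comma
    return elements


# Remainder of the 'P' line from the example file
print(split_elements("0000001,Altavista World News,http://world.altavista.com/"))
# A made-up line whose description and URL both contain commas
print(split_elements(r"0000002,News\, Sport,http://example.com/a\,b"))
```

An empty element, such as the blank first element of the ‘S’ line in the example, is returned as an empty string.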

The next line in the file above begins with ‘C’. This line defines the categories that the main page belongs to. The line consists of a list of categories separated by commas. The number of categories that are allowed is defined within the application (currently 10). If no category lines are included for the page, that page will be at the top level of the menu hierarchy.

The next line begins with ‘F’. This line defines the table formatting options for the page. The ‘F’ line is optional; if it is omitted, all tables are displayed normally.
The ‘F’ line consists of a comma-separated list of table formatting settings.
In the above example, the page has only one formatting option, i.e. 01-99I. The numbers indicate that the option applies to tables 1 to 99. The ‘I’ at the end indicates that the tables will be ignored. Either single table numbers or ranges can be used, so 02I is also valid. The table numbers must always be two digits, so numbers below 10 must have a leading 0. As well as ‘I’, ‘A’ can be used to indicate that the tables should be announced, or ‘D’ to indicate that the tables should be displayed normally (this is the default). There can be any number of formatting options.
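One way to read these formatting options is sketched below in Python. The names parse_table_option and table_modes are hypothetical, and two behaviours are assumptions rather than documented facts: a later option overrides an earlier one for the same table, and a range may extend past the number of tables actually on the page.

```python
import re

# Two digits, an optional "-" plus two more digits, then one of I, A or D.
_OPTION = re.compile(r"(\d{2})(?:-(\d{2}))?([IAD])")


def parse_table_option(option):
    """Return (first_table, last_table, mode) for an option such as '01-99I'."""
    match = _OPTION.fullmatch(option)
    if match is None:
        raise ValueError(f"invalid table formatting option: {option!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    if last < first:
        raise ValueError(f"range runs backwards: {option!r}")
    return first, last, match.group(3)


def table_modes(options, table_count):
    """Map each table number (1-based) to 'I', 'A' or 'D'.

    Assumptions: 'D' is the default, later options win over earlier ones,
    and ranges reaching beyond table_count are simply clipped.
    """
    modes = {number: "D" for number in range(1, table_count + 1)}
    for option in options:
        first, last, mode = parse_table_option(option)
        for number in range(max(first, 1), min(last, table_count) + 1):
            modes[number] = mode
    return modes


print(parse_table_option("01-99I"))
print(parse_table_option("02I"))
# Options from the 'S' line of the example: tables 22 and 43 keep the default.
modes = table_modes(["01-21I", "23-42I", "44-99I"], 45)
print(modes[22], modes[43], modes[44])
```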

The next line begins with ‘I’. This line defines the ignored sections in the page. The ‘I’ line begins with a single character, either ‘T’ or ‘F’, which indicates whether or not the page should be ignored from the start. The rest of the line is a list of pieces of ‘trigger’ text, each of which toggles the page between being ignored and not being ignored. When the page is processed, if the first character on this line is ‘T’, the page will not be output to the newspaper until the first piece of trigger text is found in the page. The text will then be output normally until the next piece of trigger text is encountered, at which point output will stop again. This process is repeated until the end of the page is reached. There can be any number of pieces of trigger text.
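The toggling behaviour can be sketched in Python as shown below. The function apply_ignore_rules is hypothetical and makes three assumptions that the description above leaves open: triggers are searched for in the order listed, each trigger's own text belongs to the section that it starts, and processing of triggers stops at the first one that cannot be found.

```python
def apply_ignore_rules(page_text, start_ignored, triggers):
    """Return the parts of page_text that are not ignored.

    start_ignored corresponds to the 'T' (True) or 'F' (False) flag.
    Assumptions: triggers are matched in order, the trigger text itself is
    treated as part of the section it starts, and a missing trigger ends
    the toggling.
    """
    output = []
    ignoring = start_ignored
    segment_start = 0  # where the current (ignored or output) section began
    search_from = 0    # where to look for the next trigger
    for trigger in triggers:
        found = page_text.find(trigger, search_from)
        if found == -1:
            break
        if not ignoring:
            output.append(page_text[segment_start:found])
        ignoring = not ignoring  # each trigger flips the state
        segment_start = found
        search_from = found + len(trigger)
    if not ignoring:
        output.append(page_text[segment_start:])
    return "".join(output)


page = "menu junk LATEST HEADLINES story one story two Powered by footer"
# Equivalent to the line "I T,LATEST HEADLINES,Powered by" in the example.
print(apply_ignore_rules(page, True, ["LATEST HEADLINES", "Powered by"]))
```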

The next line begins with ‘S’. This line defines settings for a group of sub pages. The first element on the line is used to define a group of sub pages based on a range of link numbers on the main page. This is useful if, for example, you always want the first three links on a page to be processed in a certain way, and the rest differently. In the above file, this has not been used, and so is blank. The next element defines a filter that is applied to the URLs of the links on the main page to extract specific pages.
In the example above, the filter ‘.htm’ was used. This means that the sub page settings apply to any page that is linked to from the main page with a URL containing ‘.htm’.
The rest of the elements on this line are in the same format as the ‘F’ line described above, and are used to set the table formatting options for the sub pages in the selected group.
If there are no ‘S’ lines in a page definition, then no sub pages will be downloaded. There can be any number of ‘S’ lines, each having any number of table formatting options.
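The selection of a sub page group can be sketched in Python as follows. The function select_sub_pages is hypothetical. It assumes that link numbers start at 1 in page order, that a blank range means all links, that a blank filter matches every URL, and that a link must satisfy both the range and the filter to be selected. The URLs are made up for the illustration.

```python
def select_sub_pages(links, link_range, url_filter):
    """Pick the links on the main page that belong to one 'S' group.

    links is the list of link URLs in page order. link_range is text such
    as '01-03' or '' (blank, meaning every link). url_filter is a substring
    that the URL must contain; '' matches every URL.
    """
    if link_range:
        first_text, last_text = link_range.split("-")
        first, last = int(first_text), int(last_text)
    else:
        first, last = 1, len(links)
    selected = []
    for number, url in enumerate(links, start=1):  # assumed 1-based numbering
        if first <= number <= last and url_filter in url:
            selected.append(url)
    return selected


links = [
    "http://news.example/index.php",
    "http://news.example/story1.htm",
    "http://news.example/logo.gif",
    "http://news.example/story2.htm",
]
# Blank range and the '.htm' filter, as in the example file.
print(select_sub_pages(links, "", ".htm"))
# Only the first two links are considered.
print(select_sub_pages(links, "01-02", ".htm"))
```

Because the filter is a plain substring test, ‘.htm’ would also match a URL ending in ‘.html’.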

The next line begins with ‘I’. This line works in exactly the same way as the ‘I’ line described above, except that it is applied to the sub pages in the most recently defined sub page group and not to the main page. In this case the ignore settings are applied to all the sub pages linked to from the main page that have ‘.htm’ in their URL.

The next line contains only ‘END’. This indicates the end of the page definition, and must not be omitted. If it is omitted and another page definition begins, the new definition will overwrite the previous one.

The format of the page database file is defined formally using BNF overleaf.

BNF Grammar for Page Database
Alphabet:
version_no ::= String containing version number information
n ::= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
format ::= { ‘I’, ‘A’, ‘D’ }
id ::= String containing ID of page
description ::= String containing description of page
url ::= String containing URL of page
nl ::= new-line character
char ::= Any character except new-line
category ::= String containing the name of a category
trigger ::= String containing trigger text for ignored sections
filter ::= String containing filter for sub page URLs

Grammar:
<PAGE_FILE> ::= version_no (<PAGE_DEF> | nl | <COMMENT>)*

<COMMENT> ::= ‘#’ (char)* nl
<PAGE_DEF> ::= <PAGE_LINE> (<PAGE_OPTS>)* ‘END’
<PAGE_LINE> ::= ‘P ‘ id ‘,’ description ‘,’ url nl
<PAGE_OPTS> ::= (<TABLE_OPTS> | <IGNORE_OPTS>)* <CATS>
(<TABLE_OPTS> | <IGNORE_OPTS>)* (<SUBPAGE_OPTS>)*
<CATS> ::= ‘C ‘ category (‘,’ category)* nl
<TABLE_OPTS> ::= ‘F ‘ <FILTER> (‘,’ <FILTER>)* nl
<FILTER> ::= ((n n ‘-‘ n n format) | (n n format))
((‘,’ n n ‘-‘ n n format) | (‘,’ n n format))*
<IGNORE_OPTS> ::= ‘I ‘ (‘T’ | ‘F’) ‘,’ trigger (‘,’ trigger)* nl
<SUBPAGE_OPTS> ::= ‘S ‘ ((n n ‘-‘ n n) | ε ) ‘,’ ( filter | ε ) ‘,’
<FILTER> (‘,’ <FILTER>)* nl (<IGNORE_OPTS>)*