DATABASE MANAGEMENT SYSTEMS
Data Hierarchy:
- Fields represent a single data item such as name, address, balance etc.
- A record is made up of a related set of fields describing one instance of an entity e.g. your bank account details
- File is a set of related records so for the above it would be the set of everyone’s bank account details
- A database is a collection of related files
Databases control ‘data redundancy’ by ensuring that the same data cannot be entered twice. If you recall from Hans if you had a customer table and a sales order table, you don’t have to re-enter the customer’s details when you make an order. You use the primary key (customer id) to get the details from the database. If you had to key in the details and store it in the order table it would mean that the exact same data would be stored twice in the database. If this were the case a database would be unmanageable.
This brings up the point of ‘data inconsistency’. The fact that the user would be inputting the information into the sales order table without knowing exactly what is in the customer table could lead to errors in customer id, name, address etc. Also if the data in the customer table was updated then you would have to search the database (or order table) to find that customer id and update it there as well. In a relational database the user would only have to enter the customer id in the order table and as such would only be able to enter a customer id that exists in the customer table. If that particular customers record was updated then you don’t have to worry about changing any other records as it is only in the customer table.
Advantages of Databases:
- Data are independent from applications – Stored centrally so everyone can access it. The data is retrieved from the database as required. Think of a database as a filing cabinet
- Data are accessible to any new program – you only have to tell the program where the data is
- Data are not duplicated in different locations – example above of data redundancy
- Programmers do not have to write extensive descriptions of the files – Therefore easy enough to maintain
- These save enough money and time to offset the extra costs of setting and maintaining Databases
Disadvantages of Databases
- Data are more easily accessible so more easily abused
- Databases require expensive hardware and software
- Specialised personnel is often required to start with large databases
- People may object to ‘their’ data being widely available in a database – information is power
DATABASE MANAGEMENT SYSTEM (DBMS)
This is a program that makes it possible to:
- Create
- Use and
- Maintain a database
It provides a logical access to the data stored in the database, i.e. sql query; specifies what tables to draw the data from and under what conditions e.g. where balance > 500 etc. User/programmers do not have to worry about the physical aspects of the database.
There are two main functions of the DBMS:
- Query language – this is for people who are not programmers, not very difficult to learn. The sql is the greatest advantage of databases.
- Data manipulation language – this is for programmers who want to modify the links between data elements within the DB. Whilst we could simply go into Access and change the properties of tables or the database relationships using a GUI, database administrators use DML which is basically database code.
When creating a database you specify the links between data items or what links tables together; a primary key in a table and a foreign key in another table – the foreign key is the primary key of the first table. However different types of relationships can be specified i.e. different logical views. These views correspond to three main types of DBMS’s:
- Hierarchical Databases
- Network Databases
- Relational Databases – introduces in the 80’s and are the most frequent databases
HIERARCHICAL DATABASES
- Data items are related as ‘Parent’ and ‘Child’ in a tree-like structure
- ‘Parent’ means that data item is higher in the tree than the ‘child’ and connected to it.
- One ‘parent’ can have more than one ‘child’ but one ‘child’ can only have one ‘parent’
- The most common platform is IBM’s Information Management System (IMS).
- This type of database offers very fast retrieval of data
However there are undesirable side effects of this type of database:
- Insertion of a record – the dependent record cannot be added without a parent
- Deletion of a record – if you delete the parent you delete every child associated with it as well
- It is impossible to have two parents
NETWORK DATABASES
- Same as the parent and child schema in Hierchical database, but children can have more than one parent
- It is also possible to link items upwards to other items parents
- Practically, it means that the DBMS can deliver the information based on any data item
- The flexibility of data retrieval is the main advantage of Network Databases over Hierarchical databases
RELATIONAL DATABASES
- Data items stored in tables
- Specific fields from each table related to other fields in other tables (join) i.e. the primary key field (customer id) in the customer table linked to the foreign key field (customer id) in the order table.
- Infinite number of possible viewpoints on the data (queries) i.e. select * from customer, orders, supplier, etc where orders.orderid = ??
- Most flexible of all databases but slower for complex searches because of the fact that it has to take data from so many tables under certain conditions. It has to check each field to see if it matches those conditions before it will retrieve the data. For a large database that has a lot of data you can see why this delay in retrieval is so.
- Many different types of relational databases – Oracle, SQL Server, Sybase, etc.
RELATIONSHIPS
- Attempt at modeling the business elements (entities) and their relationships (links)
- Specifies dependencies between the data items
- Coded in an Entity Relationship Diagram (ERD)
- Different types of relationships
- One to one
- One to many
- Many to many
CREATING THE DATA STRUCTURE
- There are few rules for this procedure and a lot of it is down to experience
- It can get quite complex and you have to try to design it so that the database can operate at maximum speed.
- Tables must be normalised i.e. redundancy is limited to the strict minimum by an algorithm
- Normalisation is not always the best method though
NORMALISATION
- This is the process of simplifying the relationships amongst data items as much as possible
- Reasons for normalisation
- To simplify retrieval
- To simplify maintenance i.e. updates, deletion, etc.
- To reduce the need to restructure the data for each new application
FUNCTIONS OF DATABASE MANAGEMENT SYSTEMS
- Data storage retrieval and update facilities
- A user-accessible catalogue or data dictionary
- Support for shared update
- Backup and recovery services
- Security services
- Integrity services
- Services to promote the data independence
- Telecommunications
- Utilities
SUPPORT FOR LOGICAL TRANSACTIONS
- A logical transaction involves the reading, writing or updating of a record.
- If the transaction is interrupted before it’s completed then the ‘up-to-date’ data is sacrificed for consistent data i.e. it’s not saved.
- If the transaction is not interrupted then the transaction is said to be committed i.e. written to disk or successful
- A DBMS provides mechanisms that either Commit or Rollback transactions
SHARED UPDATES
- Occur if two or more users are updating the database at the same time
- For example: double update
- Customer balance is $418
- Correct
- Pat records sale of $100
- Reads, updates, writes – Balance = $518
- Joe records payment of $100
- Reads, updates, writes – Balance = $418
- Incorrect
- Pat reads and updates
- Joe reads and updates
- Pat writes – commit $518
- Joe writes - commit $318
- Value = $318 ?
- Shared update solutions - Avoidance
- Prohibit shared updates
- Allow access for retrieval only
- Record updates in transaction file and update database periodically using a batch program
- The problem with this is that the data in the database is temporarily out of date
- Customer may not be allowed credit because his balance had not been credited with last payment
- Shared update solutions – Locking
- Lock the table or record while its being updated
- Three types of lock
- Exclusive Lock – This is extreme and inflexible as other users can neither read or update the locked table/record
- Read Only Lock – Other users can read but not update the locked table/record
- Lock Time-Out – If a table/record is locked a person could be waiting a long time before it is unlocked. As such some DBMS’s detect lengthy locks and unlock them, undoing any updates made to any records during the transaction
- Lock granularity refers to the level of the lock i.e. what’s locked - field, record, page/block, table
- Deadlock – Users can have a lock on more than one record at a time. This poses problems when two users require each others locked records
RECOVERY
- Backups or saves – normal backup of the database files
- Journaling / Audit trail / Audit file
- Keep a log or journal of the activity which updates the database
- Recovery involves copying the backup of the database and running a special program to update the backup version of the database with the transaction in the log. If you make a backup of the database every night at twelve o’clock and the database crashes at two o’clock the next day you can use the backup and add the entries in the transaction file (audit file) from twelve o’clock to the time the system crashed. The database is now back up to date.
SECURITY
- Restriction of access to authorised users only
- Passwords
- Encryption
- Views – basically pre-saved sql queries. The user is viewing a copy of the sql query
- Authorisation levels
- Read only
- Edit
- Delete
- Create
Data integrity is where the user is forced to input data in a specific way i.e. customer id’s can only be numeric. DBMS’s must support the isolation of data structure from the programs. This is known as data independence. Users or application programs are not affected by changes to the database structure i.e. new records, records deleted. Logical and physical Data Independence is usually achieved through subschema or View type mechanisms. Views enable the database to function properly as if the user was using a select query the tables would be locked and so it would affect other users and application programs. A database Schema is a description of the overall logical structure of a database, expressed/programmed in Data Definition Language (DDL). It can be broken down into sub-schemas; logical description of a user’s view or programs view of the data used. DDL can be very sophisticated on a mainframe or trivial on a PC (queries/views).
As organisations are rarely single sites they use telecommunication. Flows of data do transcend the boundaries of the organisations; information systems also do. Databases can be used to support the distribution of information resources. Organisational data sources are varied and so all applications must be integrated to save time (i.e. exchange data). Databases can be used to enable this integration, one example is MFG/PRO. Portability/compatibility is paramount (e.g.ODBC drivers).
DATABASE UTILITIES
- Compact data files
- Index/re-index files
- Repair database (crash)
- Import/export data from and to other sources
- Enforce standards (integrity of standards)
- Associated data dictionary
- Access to remote computers (login, emulation)
DISTRIBUTED DATABASES
This is the next logical step in geographically dispersed organisations where the goal is to provide location transparency. The starting point is a set of decentralised databases located in different places, developed for the specific information needs of each site. The aim of distributed databases is to integrate these decentralised databases into one. Distributed databases have additional roles to play in that they must determine the location of data to be retrieved. They must also translate the request for this data retrieval into the language used by the local DBMS. Heterogeneous distributed DBMS’s means different DBMS’s running at each site. A master DBMS controls the interactions amongst the parts. However it’s not very practical today due to the need for compatibility. More often, each DBMS follows the same data architecture.
ADVANTAGES
- Increased reliability of systems and availability of data
- Local control preserved
- Modular growth possible at each site and at new site
- Optimised communication costs
- Faster response times
CONTROL IN NORMAL DATABASES
- Transaction control: ability of the DBMS to ensure the successful completion of transactions
- Commit transactions
- Rollbacks
- Concurrency control: ability of DBMS to arbitrate between concurrent uses of data
- Simultaneous access
- Simultaneous update
- Deletion
CONTROL IN DISTRIBUTED DATABASES
- Different portions of the overall database reside at different locations
- These portions are controlled by different processors running sometimes different DMBS’s
- Common schema means queries can involve any portion of the database residing at any location
OPTIONS FOR DISTRIBUTED DATABASES
- There is the issue of the actual physical design
- Performance of the database depends on good design
- There are a number of options:
- Data replication
- Store a separate copy of the full tables in each location
- If a copy is stored in every location then there is full replication
- Advantages – reliable and fast response
- Disadvantages – storage requirements increase and issue of complexity and cost of updating
- Horizontal Partitioning
- Some of the rows of the tables are stored in the location; others are stored at other locations
- E.g. customer banking out of a particular branch
- Advantages – efficient, local optimisation and secure
- Disadvantages – inconsistent speed access and backup vulnerability due to dispersement of tables
- Vertical partitioning
- Some columns are projected into base relationship at different sites
- All relations share a common domain so the full table can be reconstructed
- Advantages – tailor made support for functional areas and also same advantages as horizontal partitioning
- Disadvantages – some queries might be very slow and users must understand some design issues
- Combination of the three methods
- Most of the time, companies will use different methods
- Each method is efficient in certain situations with some other security requirements
PROBLEMS WITH GLOBAL TRANSACTIONS
DBMS’s can be radically different i.e. relational versus network and only some state of the art commercial products have translating capabilities. One alternative solution is to put some essential data and the directory of the data locations on a central server. Real distributed database systems solve these problems for the users with the help of the Networking Operating Systems (NOS). To ensure the integrity of the integrity of the data in update operations there is a Commit Protocol. This is a well-defined procedure based on the exchange of messages i.e. ‘ok’ or ‘not ok’. Each global transaction can be completed or aborted.
Two phase commit:
- Site originating the transaction sends request to all sites involved in the update
- All sites attempt to process their part of the transaction without committing the data (temporary files)
- They notify the first site whether it’s OK or not
- The first site collects all the OK’s and sends the order to commit the data
TIMESTAMPING
This is an alternative to locking as with locking there is the potential to get deadlocks. Timestamping ensures that transactions are processed in serial order so locking is not needed. All updated records carry the timestamp of the transactions that modified them. If a new transaction attempts to update a record with an earlier timestamp, that’s ok but if the new transaction has a later stamp, the update is denied, the transaction is re-stamped and is re-started. This could happen over a wide network, where transactions are not immediately modified.
EFFECT OF DESIGN ON SPEED
For example if you wanted to design a fast query that needs data located at two sites. The queries must be broken into components isolated at different sites to minimise communication time and traffic. Determine which site has the potential to yield fewer selected records and then move the preliminary results to the other site where the rest of the work can be performed.