Frequently Asked Questions

On the dwlist forum and in my consulting work I often hear the same questions over and over again. With the dual purposes of answering a lot of those questions and minimising the amount of typing I do in answering them, I have created this Frequently Asked Questions page. If there is too much here it might well turn into a separate document as well. Feel free to email me questions that you feel lots of people might want answered. I will generally put the question and answer onto the dwlist forum as well.
Questions
Q007. Where to find Data Models and who is selling Models?
Q006. I hear a lot about Inmon vs Kimball. Who is right? Which is the best way to build a DW?
Q005. What is the usual situation with Data Quality?
Q004. What ETL tool should I buy for my Data Warehouse?
Q003. What front end tools should I buy for my data warehouse?
Q002. What hardware should I buy for my data warehouse?
Q001. Which database should I buy for my data warehouse?

Answers

A001. Which database should I buy for my data warehouse?

Always a good question. This response is really only for those folks who are not going to build a large DW, like in excess of 200GB of raw data to store. Folks who have in excess of 200GB of raw data to store will have their own database people who can work out the best way to go forward for their companies. People who are in the 50GB or less raw data range, which is probably 95% of companies who build a DW, could benefit from this answer. And there will be plenty of database reps out there trying to sell you their database.

Of course, the answer is: it depends. Questions to ask yourself are: how many users will there be? How much data will there be? How will the database actually be queried? A lot of tools now cache reports so that the database itself is hit less and less.

Other general guidelines I offer people are as follows. The major players in the industry are Teradata, Oracle, IBM DB2 UDB/MVS and Microsoft SQL Server 2000. You also have Red Brick and Sybase IQ as data warehouse specific databases.

Teradata will always go faster than any of the other databases given a 3NF model and the same amount of hardware underneath them. It was built specifically for MPP query processing so it is no wonder it will go faster. It will also cost more. You can take a look at the TPC results to see how much faster. The drawback of Teradata is that, as far as I am aware, it has no smarts to query star schema databases. I am aware that there was a project to put this support into Teradata but I am not aware that it has been made public. Also, NCR do seem to criticise the star schema concept. And if you have 50GB or less of raw data you are probably going to overspend on a Teradata DW.

You can't go far wrong buying Oracle or IBM UDB as databases. And you'll pay plenty for the licenses. Both these databases will do just about everything you will want to do in a data warehouse. They are fighting the 'feature/function' battle and producing databases that do everything but make your coffee for you. Many of these features I look at and wonder 'How the hell would I use that?' These two are 'safe bets'.

I haven't had a chance to build a real DW on SQL Server 2000 but I have built one on SQL Server 7.0 and it works, mostly. We found query performance on a two-processor 733MHz Compaq with 2GB of memory for a 10M row fact table with 8 dimensions was in the 1-10 second range. We did have two fact tables, which we manually partitioned, that were 70M and 50M rows respectively. It was not possible to query partitioned fact tables in SQL Server 7.0 effectively. I believe this was fixed in 2000. In SQL Server 7.0 we found the optimiser did not work as advertised, and we found no-one who could make it work as advertised, so it did not resolve star joins that well. But SQL Server is very, very cheap. If you have a DW with less than 20GB of raw data that needs to be housed in your data warehouse then you can do it comfortably on SQL Server 7.0 and, by extension, SQL Server 2000.
In SQL Server 7.0 we found that you want to allocate at least 5x as much space in the database as raw data, and if you mirror the disk you will need to allocate 10x as much disk as raw data to get started (a small sizing sketch appears at the end of this answer). We also found the log file to be a real problem: it grows very large very fast. I don't know if that was fixed in 2000. The project I did on SQL Server 7.0 is in the process of moving to SQL Server 2000 and I hope to update some of these comments if I hear back how SQL Server 2000 goes in these specific areas. If you are in the 20GB of data and below range, take a good look at SQL Server 2000 and try it out on a spare server to see if it will suit your needs. The ease of maintenance and lower cost of ownership make it way cheaper to put your DW on than Oracle or DB2. Microsoft will tell you they can handle way larger volumes, and they might be able to, I'm only saying I haven't seen it with my own eyes. They have plenty of references and case studies on their web site.

Sybase IQ is quite an interesting database. In a word, it is fast, fast, fast. I've done two projects on this database now and when you ask it questions it's kind of strange when it throws the answer back at you so fast. My usual first response is that it can't be the right answer because I know what has to be done to calculate it and the computer could not have done that work in that time. (Of course, you do have to set up the appropriate 'indexes'.) I haven't had the opportunity to be around while the database grows big but, because of the way the database works, bigger should not be a problem. The other thing about IQ is that it is the easiest database I have ever seen to maintain as a 'DBA'. Even easier than SQL Server, which is saying something. All you really do is monitor how much space is used and throw more disk at it. There are no tablespaces or indexspaces to maintain. Everything is just a big collection of disk and the database figures out what to do with it. Very neat.

Also, Sybase's claim that you need less disk than the actual raw data you have is true. (Why a vendor's claim being true is a surprise is another story.) So, if you have 20GB of raw data you are very likely to be able to store that data in a DW of 40GB (being mirrored). You will need some separate work space and you might keep a staging area too, which will need to be added. But compared to every other database, you are going to use way less storage with IQ, which means less hardware is required all round because the amount of IO required is way down on the other databases. The downside is, of course, this is a DSS/DW only database. You just can't issue updates in any sensible amount of time. This does mean the ETL needs to perform delete-inserts and not updates, which means you need more temporary tables for the updates, which makes the ETL more complicated. The alternative is to always perform loads, which are blindingly fast, but loads of large tables work against minimising the elapsed time between application update and the data being available in the DW. There are some other features missing from IQ at the moment, like storage of BLOBs. But as a dedicated database to store numeric/character data for a star schema DW it is well worth a look if you want to step outside your current RDBMS box.

Red Brick, now owned by IBM, also deserves a mention. I haven't checked up to see what IBM is doing with this database, so you want to check before you buy it. However, Red Brick was built from the ground up to support star schema DWs.
It is faster than anything except IQ; it has an intelligent loader for maintenance of summaries; it uses less space than anything except IQ; and it has SQL extensions that allow it to perform tasks usually reserved for the MOLAP and ROLAP tools. Worth a look if you have very large fact tables to deal with, like 500M rows and above. Having said all that, in general I recommend:
A word on moving between databases. I remember being told in 1983 that Unix was going to take over the world because it would allow companies to seamlessly and easily move applications from one hardware/database vendor to another and break IBM's stranglehold on the hardware, operating system and database markets. Well, IBM's stranglehold on these markets is well and truly broken, but I see very little evidence that people are moving from one database vendor to another, or even from one hardware vendor to another keeping the same database, for their data warehouses. It seems to me once the DW is installed and running, it stays on whatever it was installed on. I see no reason for this to be the case except that when you buy the database you actually pay the whole price for it. You don't pay a monthly license fee, so there is little to gain by moving and a lot of money to pay for the next vendor's license fee. Simply put, once the vendor gets your money for the DW database license they have got it, unless you can re-deploy the license. The point? I recommend you ask for a 'loaner' of the database software during the development lifetime of the DW database. While you haven't given the vendor the money for the licenses they are more likely to be helpful. The vendor may or may not agree to giving you the loaner; after all, it delays their revenue.
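To put the storage rules of thumb above into numbers, here is a minimal sizing sketch in Python. The 5x/10x multipliers for SQL Server 7.0 and the 'raw data size or less, roughly 40GB mirrored for 20GB raw' figure for Sybase IQ come straight from the comments above; the function and the idea of expressing them as multipliers are my own illustration, not vendor guidance.

def estimate_dw_disk_gb(raw_gb: float, database: str, mirrored: bool = True) -> float:
    """Very rough disk estimate for the warehouse itself, excluding staging and work space."""
    # Rules of thumb from this answer, not vendor formulas:
    multipliers = {
        "sqlserver7": 5.0,   # data + indexes + summaries + working space
        "sybase_iq": 1.0,    # IQ stores the data in about the raw data size or less
    }
    base = raw_gb * multipliers[database]
    return base * 2 if mirrored else base

if __name__ == "__main__":
    for db in ("sqlserver7", "sybase_iq"):
        # For 20GB raw data this prints roughly 200GB and 40GB respectively.
        print(db, estimate_dw_disk_gb(20, db), "GB of disk for 20GB of raw data")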
A002. What hardware should I buy for my DW?

This response is really only for those folks who are not going to build a large DW, like in excess of 200GB of raw data to store. Folks who have in excess of 200GB of raw data to store will have their own hardware people who can work out the best way to go forward for their companies. Again, it very much depends on the usage profile of the database as well as the database you put on top of it. You are going to need more hardware to run Oracle/DB2/SQL Server than you are going to need for Teradata, Sybase IQ or Red Brick. Also, you can avoid a huge amount of database access by using an MDDB like Essbase, TM1, Analytical Services etc.

The choice of hardware is more driven by the choice of operating system. Will you use Win2000 or unix? And for those of you who know me, yes, you might even use OS/390. (Because I was building DWs for IBM in 1991, quite a lot of my early experience was MVS.) But there are so few DWs being built on OS/390 nowadays that it's become a non-discussion. When I was managing Ardent Professional Services for Asia Pacific I think we had in excess of 100 customers using Prism and DataStage across Asia Pacific and only one of them was on MVS. The MVS site did not have any unix at the time they built their DW and it was going to be too big for NT 4.0.

The only way to be sure that the database will support the load that you are going to put on it is to try it out. This is not always easy, but most of the hardware vendors now have centres available for customers to try out various data warehouse schemes and measure the performance. I generally recommend:
A good way to get an idea of what people are doing at any given point in time is to publish as much as you know about what you want to do on dwlist and have people feed back comments on what you propose. People on dwlist have proven quite helpful and can be more so if you provide them with enough information to go on.

Even having worked for vendors over the years, I would advise that you are always sceptical of vendor performance claims unless they are willing to make a performance guarantee in writing. I have had cases where I have been told all sorts of things that evaporate when I asked for it in writing on company letterhead. I have been on the receiving end of misinformation from vendors way too often to recommend that you take what they say at face value. It's a shame, and it's disappointing, because it means that the honest guys working for vendors get tarred with the same brush as the ones who are happy to 'stretch the truth'. This was one of my hardest lessons when I moved into marketing for IBM. I just could not believe what some sales people told some of my customers. The most amazing case I came across was one where a $US5M box was sold to one of our customers when publicly available information (product specs of the box) clearly indicated that the solution would not work. IBM knew it and the competitive sales representative knew it (he told me so). We tried to tell the customer and the customer didn't want to listen. The result was in excess of $US5M wasted and quite a few people losing their jobs. The whole debacle was 'hushed up' and the customer would never talk to anyone else about this experience. This world is a 'buyer beware' world. Your company should hold you responsible for the results you get from the DW and you should make sure you know what you are buying.

If you are in the region of <=10 concurrent query users and <20GB raw data you are well within the Win2000 range with any of the databases mentioned above. This covers a lot of companies. Also, if you are pushing most of the analytical queries to a separate multi-dimensional database, which will usually sit on a separate server, you can get away with a much larger database on Win2000 because the query load is so low. As you increase the number of people who are concurrently connecting to the database and increase the volume of raw data housed in the database you are pushing up against the limits of Win2000. Those limits are changing very quickly as new servers come out. Nowadays (2002), I'm more inclined to say that once you are over around 30 concurrently connected database users or above the 100GB raw data range (which is going to be 800-1000GB of disk) you are going into unix territory.

Then you need to decide between IBM, SUN and HP. I have had very good results working with SUN as a vendor, with HP closely following behind. I haven't dealt with IBM as a unix vendor. In one deal where I managed the selection process between SUN and HP, the two companies both distinguished themselves for the level of effort and professionalism displayed, despite the fact we were asking them to work quite unreasonable hours. Buying hardware in this range is quite an investment and you will want to perform some testing to make sure that what you are buying will work. You may not be able to easily re-deploy a large unix box within your company and there is no point buying wasted capacity. One thing not to forget when sizing the hardware is the batch processing load required and the batch window available.
So many Win2000/Unix people I meet seem to have no idea that loading/updating a DW can often require very substantial batch processing for the integration portion of the work. They seem to want to quote 'load times', which are less than 10% of what is going to happen in most cases. In some cases the batch processing requirements will dictate more hardware than the daily query requirements will dictate. This is often the case when files are arriving late, are required early, and you need to rebuild MDDB cubes.

Something that has surprised me over the years is that quite a few people don't know how much an empty CPU or IO slot costs. I've seen quite a few boxes bought where there are lots of spare slots and the buyer has been talked into buying 'spare slots' for 'later growth' by the hardware salesman. Given the rapid decline in machine prices, any spare slots sitting around for any length of time (like 12 months) are money wasted. My own take is to have as few spare slots as possible, even if it means some extra testing to do the sizing. In many cases, if you really don't know how big the box should be but can't afford to undersize the machine, then buy the extra processor or IO cards and make use of the capacity doing something useful. Spare computing capacity is something that is easily used up.
A003. What front end tools should I buy for my DW?

The best source of information to select a front end tool for your data warehouse is The OLAP Report. There is no way I can touch the detail provided by Nigel, so if you have budget for it, this is the site I recommend. He has recently published the 2001 market share numbers. I recommend browsing around the OLAP Report pages to get a feel for the background of many of the vendors. Note that some notable vendors are not in The OLAP Report because their products are reporting tools more than OLAP tools. These are Actuate, Hummingbird and IBI. These companies all have successful report deployment tools.

What front end tool should I buy for my DW is probably a harder question to answer than any of the previous questions. The main reason is that there are so many more tools out there than databases or machines, and the tools change so quickly. The advice I have been giving over the last few years was to take the end user tool decision as a 'tactical' decision. That is, make a decision that you do not have a problem bailing out of in the 2-3 year time frame. The reason being that the competition was so hot that there was no guarantee which tools would be around in the future. Today this has changed somewhat. There are still a lot of really good tools that may or may not be around in a few years' time, but there have emerged some clear 'winners' who I would be happy to put my money on being around for a while. Today the 'winners' for end user tools (in my opinion) are as follows. These are the 'safe bets' for buying end user tools. First, a blanket comment. Each of these vendors is well aware that they are the leaders in their area and the price tags of the products reflect this. As far as I am concerned these guys are pretty expensive.
So if they are the 'safe bets' what other tools are around that are worth looking at?
Why is SAS not in this list? I hear you ask. I have had good relationships with SAS for years. I wrote a data warehousing methodology for SAS way back in 1995 which evolved into what they decided to use around the world. My opinion: they have a 'database' which is really, really fast and stats capability to die for. Virtually unchallenged as far as I am concerned. In Australia, once they stopped trying to sell customers the idea that the SAS database could manage terabytes of data as the data warehouse, and started saying there was a place in the world for relational databases but that you should pull data into SAS for stats analysis, their 'data warehousing' sales kicked up quite a bit. In summary, SAS should be successful in this space. I don't know why they are not way more successful than they are.
Why is Gentia not in this list? I hear you ask. Gentia have been acquired by a company called 'Open Ratings'. If you go to www.gentia.com you will be re-directed to balancedscorecard.com, and all the documentation you will find there is about the implementation of balanced scorecards and the applications sold by 'Open Ratings'. It is not at all clear if you can buy Gentia separately from buying the applications, so I've left it out of this discussion. If you want to buy a balanced scorecard application, feel free to go to balancedscorecard.com.
A004. What ETL tool should I buy for my DW?

Let's start with what not to buy. Don't buy PL/SQL and think you can write all your ETL in it. I can't believe the number of Oracle DBAs who think that anything you need to do can be done in PL/SQL. I think that when people do the Oracle DBA classes they must get 'inoculated' with the PL/SQL needle so that they are allergic to anything else. No other database DBAs I meet are anything like as fixated on writing stored procedures in the database language. Stored procedures, today, are hard to write, hard to read, hard to maintain, destroy your ability to understand data heritage, and completely and totally lock you into the database engine for ever and ever amen. Not that anyone actually moves hardware or database once the system is installed, which was the whole 'purpose of open systems' as it was preached to me in the early 80s. You would be better off with Cobol, C, Perl or any popular programming language than you are with PL/SQL.

Let's go on to what to think about. All the ETL tool vendors want to sell you their tool. I did when I worked for Ardent. You can't blame a vendor for trying to sell you their wares. My own opinion is that it is only the 'lunatic fringe' who are going to build anything remotely complicated without some kind of ETL tool. Unfortunately, I think that still puts more than 50% of all installed data warehouses into the 'lunatic fringe' category. When I was working at Ardent the major competitor our sales reps had was not Informatica, it was the Oracle DBA who believed PL/SQL can do anything an ETL tool can do, better and faster. I personally believe that every sizable data warehouse project should be implemented using the tools available today, despite the fact that they still have their limitations.

That doesn't mean you have to buy the tool right up front though. You have to decide how you are going to approach the project. You need to decide if you are going to build your data warehouse 2 or more times or whether you are only going to build it once. That is not as strange as it sounds. If you plan to build it a couple of times, using prototypes and small subject areas, you might want to delay the purchase of an ETL tool until you know more about what it is you are going to build and learn how to build it. You can delay the learning curve on the tool so that you can focus on getting your pilot projects working. I don't recommend you believe any ETL tool vendor that says you can go on a class for a day or two and go back to your office and build a data warehouse with it. Most of these tools are very easy to use to do simple things and way more difficult to use to do something complicated. I know because I am well versed in both Informatica and DataStage. For example, on a recent project the most difficult mapping took a very knowledgeable Informatica developer 10 days to write and then we took another 5 days to test it and get it working. 15 days to write and test a very difficult mapping is still a long time. Without the level of skill that we were applying it would be unlikely that we could ever have written that mapping. If you plan to build your data warehouse only once, and you want to write correct code that will be scalable and robust into the future, don't plan on writing this code yourself. Get someone with good experience in whichever tool you select. Those people will cost you good money but the good ones are well worth it.
Setting up standards for a new site and implementing the early releases of the data warehouse was 'bread and butter' consulting for consultants at Ardent as well as at our business partners. Projects where we could set up the standards before a line of code was written always went better than the projects where the customers 'saved money' by reading the manuals, not going to the training and not getting help getting started. It is pretty much like buying a C++ compiler, reading the manual and believing you are ready to write applications. It just isn't like that, and never has been.

Having 'ranted' about using an ETL tool, the next question that comes up is 'which one?' Good question. Everyone is getting into this 'data integration' market. Four years ago I thought we were going to see a 'rationalisation' of this segment because the market was not paying high prices (remember Prism was quite expensive) for the tools and Microsoft had announced they were going to embed an ETL tool into SQL Server 7.0. Not long after, Carleton was bought by Oracle for $US8M, spare change for Larry. 'The market is only going to support 2-3 independent ETL vendors' was my opinion. Well, shows you how wrong I can be!!! Today there seems to be a growing number of ETL tools to choose from and the database vendors are spending large amounts of money putting more function into their ETL tools. Meanwhile the leading ETL tool vendors are already trying to get out of the way of the database vendor steamrollers. It's a pretty crazy market.

So what are my choices? The leaders are Ascential with DataStage and Informatica with PowerCenter. With so few people fluent in both tools, and even fewer people having implemented large complex data warehouses like Sybase IWS in each tool, I am constantly asked to compare and contrast these two and recommend one over the other. To date I have refused, and I will continue to refuse, to state any public opinion. I hope to work with both companies in the future and don't want to alienate either of them!! What I do say is that I can categorically guarantee that you can build large star schema data warehouses using either of these tools using the merchant databases Oracle, DB2 and SQL Server as well as Red Brick and Sybase IQ. DataStage has Teradata support that I have seen work. I have not seen Informatica work with Teradata; I expect it does. I also categorically state that you cannot make a mistake buying either of these two ETL tools. It's like buying Oracle or DB2 for the database engine. Either is eminently justifiable.

What is strange is that there are now more realistic competitors than there were four years ago. This is despite the fact that there is significant price pressure on ETL tools and the overall market is tiny compared to, say, database revenues. I mean, Informatica and Ascential only made $US80M between the two of them in 2002Q3. This is a rounding error at Oracle, Microsoft or IBM. Oracle Warehouse Builder is now worth a look if you are using Oracle. DTS is worth a look if you have simple requirements and want to do things on the cheap. IBM even has a thing called Warehouse Manager despite selling DataStage; I haven't even looked at Warehouse Manager so no opinion there. SAS still market their tool Warehouse Administrator and it is good enough to take data from a data warehouse and put it into SAS datasets for analysis.
It is arguable whether it is the best tool to choose to place data into the data warehouse unless you are of a size to use all SAS databases for your data warehouse. Cognos and Business Objects have both jumped into the ETL game with acquisitions. Both figure that if a sale of their analytical applications is going to generate ETL tool revenue then they might as well take that revenue themselves. It looks like if you buy the analytical applications you are going to buy their ETL tool, so you might as well take a look at it. Sagent is still out there. But with all this other activity in the marketplace I wonder if they might not be the next one bought by a deep-pocketed tools or database vendor, or might they be pushed out of the market by the larger vendors. For 2002Q3 they announced $US8.8M in revenue and a $US9.1M loss. I can't believe that will go on for long. It's not like they have a fortune in the bank like Ascential. It now looks like the ETL tool world is going to become a world where each of the major vendors, database and end user, has an ETL tool to sell. And there will be a few (two) 'best of breed' ETL tools out there making a living by being better than the tools the vendors have.

OK, you've said a lot, but I'm still no closer to knowing which one to select. Absolutely true. All I have said is that DataStage and Informatica will absolutely do what you want. The rest, well, you will just not find many people with good skills across many ETL tools to help you out in making a choice. It is unusual for people to gain experience across many of the ETL tools because of the time and effort it takes to learn them in detail. You really have to implement similar data warehouses with the tools to be able to compare them. And by the time you have done that it's two releases later and feature leapfrogging has taken place. In many ways you are very likely to be best off buying the tool that has the most licenses in your area and the most people available to work on your projects. Remember, the ETL work is still the largest single piece of work on the project, at 60-80% of the total work on most projects. There is no point buying a tool that you cannot get local people skilled in. If you are going to build yourself a sizable application-independent data warehouse the 'safe' bet is to go with one of these two. If you are going with a suite of analytical applications you will probably go with the tool that the vendor embeds with the application. It is much more important to make sure that you learn how to build the data warehouse using the tool that you finish up using.
A005. What is the usual situation with Data Quality?

'The quality of the data is near 100% perfect' is right up there with 'The check is in the mail.' It is very rarely true. The case that I find most often is that the operational system validates data from the perspective that the data needs to be 'valid' for operational system processing. This does not mean the data is 'valid' as in reflects the real world. Also, the operational system will usually maintain data as 'valid' at a point in time and provide little support for the changing values of data over time. When you put data into the data warehouse you will have to take care of changing data values over time. You should expect the data coming to you to be pretty bad. Then you will not be surprised if it is bad and you can be pleasantly surprised if it is good.

Should I then delay the data warehouse build until the data quality is up to scratch? Well, not if you want to actually build it, because it generally takes a long time to bring data quality up to scratch. What should I do then? As a minimum, when you perform the data integration tasks of key lookups for fact record keys from the dimension tables, if you do not find a value you should put a '0' key into the fact record. Don't put a null, because in most databases null=null is false. You then put a record into the dimension table with a 0 key that indicates the key was unknown at lookup time. At the detail level you should also put the key that you looked up into the fact record. This is something of a heresy but this is what I do. You can then use the real key to adjust the '0' key to the correct key at some future point in time. (A small sketch of this lookup appears at the end of this answer.) You should then also write reports in whatever query/reporting tool you are going to use to count the number of '0' keys for every dimension key on every fact table and list the unique values of the real keys. These reports let you find the real keys that are not being mapped to the dimension tables. Then you can do something about it. And what you usually need to do about it is to get people to type 'more correct' data into the operational system. This will take some time. These reports will also let you show the owner of the data the level of quality, in terms of data integration to keys at least.

The other solutions I have tried in the past are error files, which require the files to be re-cycled into the data warehouse, and automagically inserting dimension records for unknown dimension records coming in. I stopped doing either of these by the end of 1995. The main reasons for this being:
That's the minimum. What else might I do? Well, you can start from running reports to find data in error and move right on up to buying a product to measure data quality and perform some repair on the data in the ETL process. There are a number of DQ vendors out there and there are a number of 'gurus' on the subject. I am familiar with the Data Quality engine in DataStage; I know it works, so you could try that. There are others on the market. Larry P. English is one of the 'industry gurus' on Data Quality and if you want to read up on Data Quality he has published what looks like a pretty reasonable book on the topic. I have only browsed the book so I am not offering any opinion on it here. OK, if you have seen DQ issues so many times, what are the biggest headaches I'm likely to have? The biggest headaches you are likely to have are:
Of these, identifying the unique customer is the most problematic. You can buy tools to make intelligent guesses but they are never 100%. And when the customer base is large, what is left over often requires a lot of manual work to correct. In the end the best way to identify a unique customer is for the customer to have a customer number they know about and know to quote to your company every time they contact you. And the best way to do that is to give them yet another plastic card to put into their already bulging wallet/purse. Personally, I think any company who is not interested in me enough to give me a customer number that I know to quote just isn't interested in me. The company that I deal with that I like the most as far as customer number and card goes is Qantas, Australia's international airline. I am a silver frequent flyer, so I get a nice little silver card with my customer number on it. It has a phone number on the back for me to call, and the first thing I need to do when I call is to type in my customer number. When I call I rarely have to wait to get hold of someone, and when I do get hold of someone they are polite and helpful and they talk to me by name. And all through this process Qantas can see what calls I make, what flights I take, what changes I make. It's way easier than trying to figure out who I am after the fact. In summary, don't believe anyone who tells you 'the data quality is fine'. I haven't been to a site yet where it was.
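To make the '0' key approach above concrete, here is a minimal sketch in Python. It uses plain in-memory dictionaries to stand in for the dimension tables, and every name in it (the functions, the column names, the toy data) is my own illustration rather than anything from a particular ETL tool; a real implementation would do the same lookups and reports in your ETL tool or in SQL.

UNKNOWN_KEY = 0  # surrogate key of the 'unknown' row in each dimension table

def lookup_key(dim_lookup: dict, natural_key) -> int:
    """Return the surrogate key for a natural key, or 0 if it is not mapped yet."""
    return dim_lookup.get(natural_key, UNKNOWN_KEY)

def load_fact_row(source_row: dict, customer_lookup: dict, product_lookup: dict) -> dict:
    # Keep the real (natural) keys on the fact row as well, so the 0 keys can
    # be corrected later once the matching dimension rows turn up.
    return {
        "customer_key": lookup_key(customer_lookup, source_row["customer_id"]),
        "customer_id": source_row["customer_id"],
        "product_key": lookup_key(product_lookup, source_row["product_id"]),
        "product_id": source_row["product_id"],
        "amount": source_row["amount"],
    }

def unknown_key_report(fact_rows: list, key_col: str, natural_col: str):
    """Count the 0 keys for one dimension and list the unmapped natural keys."""
    unmapped = [r[natural_col] for r in fact_rows if r[key_col] == UNKNOWN_KEY]
    return len(unmapped), sorted(set(unmapped))

# Example usage with toy data: natural key -> surrogate key
customers = {"C001": 1, "C002": 2}
products = {"P9": 10}
facts = [load_fact_row(r, customers, products) for r in [
    {"customer_id": "C001", "product_id": "P9", "amount": 5.0},
    {"customer_id": "C999", "product_id": "P9", "amount": 7.5},  # unknown customer
]]
print(unknown_key_report(facts, "customer_key", "customer_id"))  # (1, ['C999'])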
A006. Inmon vs Kimball

Again, this is just my opinion and does not represent the opinions of Ralph or Bill. I publish this here because I am asked about this so often by people new to data warehousing. Firstly, I have had the benefit of meeting both Ralph and Bill and saying 'Thank You' to each of them for contributing so much to my family. They have both been pioneers in what we now call Data Warehousing and Business Intelligence. I have been one of the many who have directly benefited from their pioneering efforts. My understanding is that nowadays Bill and Ralph pretty much do their own thing and don't collaborate on articles or books. So let's, for the sake of this discussion, leave out the names Bill and Ralph and just look at some of their published ideas.

I first saw the time variant data with stability analysis model in Building the Data Warehouse, first edition. This was back in 1993. Around the same time I was also trained, briefly, by Metaphor Computer Systems on how to design star schema databases. Metaphor were the leading star schema design company back then. Having learned both modelling methods at the same time, there was never any contradiction between an archival data store and an analytical data store to me. Each model does things that the other model finds difficult or impossible. Most importantly for me, I had built this multi-level single wide table analytical database and we found that the star schema improved our lot in life a great deal. A star schema can store detailed level data, lightly summarised data and highly summarised data as outlined in Building the Data Warehouse. (Indeed, this is exactly what my C Data Warehouse Generator does.) A star schema can also store time variant data, as has been widely implemented in 'Type 2' dimensions. In those early days I tended to pick the type of data model that best supported the client's needs. I must say, it was almost always a star schema because I was building analytical applications.

In 1995, a little known figure in our industry, who should be better known, Steve Mongulla, came up with the idea of combining the published concepts of archival data and star schemas. That is, build an archival data store and propagate the data forward to analytical star schema databases. Steve and I talked about this for a while and we both thought it was a good idea so we tried it out. Steve and his work played no small part in PricewaterhouseCoopers' success in the data warehouse space. As far as I am concerned, since 1995 the combination of the archival data store and the star schema/analytical data store has been the 'best of breed' way to build the most sophisticated and robust data warehouse. It has also been the most expensive. Not all customers have gone for the combination of both design methods; many customers buy the less expensive star schema method. However, I have always presented the choice to the customer so they can make an informed decision about what they forgo when there is no archival data store. Of late, customers have become more sophisticated in understanding their needs and are seeing the value of the combination of the models. For some years after 1995 I was on somewhat of a 'search for the holy grail', trying to figure out how to put archival data into a star schema to avoid the high cost of the archival layer. I'm pleased to say that search has come to an end with Joe Oates and Sybase IWS. These are some names you are going to hear more about.
The method used to store archival data inside a star schema is proprietary information of Sybase, but I will say this: IWS design techniques make the archival layer almost redundant, and these design techniques are how we will all be building DWs in 5 years' time. It's just that Sybase has got the jump on the rest of the world at the moment.
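As a small aside on the 'Type 2' dimensions mentioned above, here is a hedged sketch in Python of how a Type 2 change is commonly handled: the current version of the dimension row is end-dated and a new row with a new surrogate key is added, so history is kept. The table layout, column names and in-memory list are my own illustration, not the IWS method or any vendor's implementation.

from datetime import date

HIGH_DATE = date(9999, 12, 31)  # stands for 'current row, no end date yet'

def apply_type2_change(dim_rows, natural_key, new_attrs, effective):
    """Close off the current version of a dimension row and add a new version."""
    next_key = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["end_date"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows              # nothing changed, keep the current row
            row["end_date"] = effective      # end-date the current version
    dim_rows.append({
        "surrogate_key": next_key,
        "natural_key": natural_key,
        **new_attrs,
        "start_date": effective,
        "end_date": HIGH_DATE,
    })
    return dim_rows

# Example: a customer moves city; the old row is closed and a new one added.
dim = [{"surrogate_key": 1, "natural_key": "C001", "city": "Sydney",
        "start_date": date(2001, 1, 1), "end_date": HIGH_DATE}]
apply_type2_change(dim, "C001", {"city": "Melbourne"}, date(2002, 6, 1))
print(len(dim))  # 2 rows, so the full history of the customer's city is kept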
A007. Where to find Data Models and who is selling Models?

There are some models available for 'free' and some models being sold as products by a variety of vendors. Also, 'analytical applications' is now a much-hyped area and all these products have some form of data model underneath them. The 'free' ones I know about are:
If you are thinking about paying real money for comprehensive models, the 'fee' ones I know about are:

Naturally the 'fee' ones are far more extensive than the 'free' ones. For example, with Sybase IWS it is not uncommon to have very significant numbers of fact tables and dimension tables. On a recent project I think we had 80+ dimension tables and 20+ fact tables.
Sybase have some features built into IWS that are very, very clever. A comment I would make here is this. I've found that a lot of people have tried to read all these books and then jump in and design the enterprise data warehouse for their company. This is fine for a small to medium enterprise. However, for a large company this usually produces a model with quite a few pieces in it that are less than optimal. Nowadays I recommend companies engage a consultant who has built quite a few data warehouses and knows how to deal with all the 'difficult' areas to model. Make training your local people part of the responsibility of the consultant. The main reason I recommend this is that designing a good data warehouse data model does not take that much time. Customising a good data model to your specific needs takes even less. Having someone who has done it over and over again is well worth the expense. Changing a data model that is in production takes a lot of time and money, because you have to change the ETL feeding the model and the reports built on top of the model.
Send mail to peter@peternolan.net with questions or comments about this web site.