Frequently Asked Questions

On the dwlist forum and in my consulting work I often hear the same questions over and over again. With the dual purposes of answering a lot of those questions and minimising the amount of typing I do in answering them, I have created this Frequently Asked Questions page. If there is too much here it might well turn into a separate document as well. Feel free to email me questions that you feel lots of people might want answered. I will generally put the question and answer onto the dwlist forum as well.
Questions
Q007. Where to find Data Models and who is selling Models?
Q006. I hear a lot about Inmon vs Kimball. Who is right? Which is the best way to build a DW?
Q005. What is the usual situation with Data Quality?
Q004. What ETL tool should I buy for my Data Warehouse?
Q003. What front end tools should I buy for my data warehouse?
Q002. What hardware should I buy for my data warehouse?
Q001. Which database should I buy for my data warehouse?

Answers

A001. Which database should I buy for my data warehouse?

Always a good question. This response is really only for those folks who are not going to build a large DW, like in excess of 200GB of raw data to store. Folks who have in excess of 200GB of raw data to store will have their own database people who can work out the best way to go forward for their companies. People who are in the 50GB or less raw data range, which is probably 95% of companies who build a DW, could benefit from this answer. And there will be plenty of database reps out there trying to sell you their database.

Of course, the answer is: it depends. Questions to ask yourself are: how many users will there be? How much data will there be? How will the database actually be queried? A lot of tools now cache reports so that the database itself is hit less and less.

Other general guidelines I offer people are as follows. The major players in the industry are Teradata, Oracle, IBM DB2 UDB/MVS and Microsoft SQL Server 2000. You also have Red Brick and Sybase IQ as data warehouse specific databases.

Teradata will always go faster than any of the other databases given a 3NF model and the same amount of hardware underneath them. It was built specifically for MPP query processing so it is no wonder it will go faster. It will also cost more. You can take a look at the TPC results to see how much faster. The drawback of Teradata is that, as far as I am aware, it has no smarts to query star schema databases. I am aware that there was a project to put this support into Teradata but I am not aware that it has been made public. Also, NCR do seem to criticise the star schema concept. And if you have 50GB or less of raw data you are probably going to overspend on a Teradata DW.

You can't go far wrong buying Oracle or IBM UDB as databases. And you'll pay plenty for the licenses. Both these databases will do just about everything you will want to do in a data warehouse. They are fighting the 'feature/function' battle and producing databases that do everything but make your coffee for you. Many of these features I look at and wonder 'How the hell would I use that?' These two are 'safe bets'.

I haven't had a chance to build a real DW on SQL Server 2000 but I have built one on SQL Server 7.0 and it works, mostly. We found query performance on a two-processor 733MHz Compaq with 2GB of memory for a 10M row fact table with 8 dimensions was in the 1-10 second range. We did have two fact tables, which we manually partitioned, that were 70M and 50M rows respectively. It was not possible to query partitioned fact tables in SQL Server 7.0 effectively. I believe this was fixed in 2000. In SQL Server 7.0 we found the optimiser did not work as advertised, and we found no-one who could make it work as advertised, so it did not resolve star joins that well. But SQL Server is very, very cheap. If you have a DW with less than 20GB of raw data that needs to be housed in your data warehouse then you can do it comfortably on SQL Server 7.0 and, by extension, SQL Server 2000.
In SQL Server 7.0 we found that you want to allocate at least 5x as much space in the database as raw data, and if you mirror the disk you will need to allocate 10x as much disk as raw data to get started (a small sizing sketch appears at the end of this answer). We also found the log file to be a real problem: it grows very large very fast. I don't know if that was fixed in 2000. The project I did on SQL Server 7.0 is in the process of moving to SQL Server 2000 and I hope to update some of these comments if I hear back how SQL Server 2000 goes in these specific areas. If you are in the 20GB of data and below range, take a good look at SQL Server 2000 and try it out on a spare server to see if it will suit your needs. The ease of maintenance and lower cost of ownership make it way cheaper to put your DW on than Oracle or DB2. Microsoft will tell you they can handle way larger volumes, and they might be able to, I'm only saying I haven't seen it with my own eyes. They have plenty of references and case studies on their web site.

Sybase IQ is quite an interesting database. In a word, it is fast, fast, fast. I've done two projects on this database now and when you ask it questions it's kind of strange when it throws the answer back at you so fast. My usual first response is that it can't be the right answer because I know what has to be done to calculate it and the computer could not have done that work in that time. (Of course, you do have to set up the appropriate 'indexes'.) I haven't had the opportunity to be around while the database grows big but, because of the way the database works, bigger should not be a problem. The other thing about IQ is that it is the easiest database I have ever seen to maintain as a 'DBA'. Even easier than SQL Server, which is saying something. All you really do is monitor how much space is used and throw more disk at it. There are no tablespaces or indexspaces to maintain. Everything is just a big collection of disk and the database figures out what to do with it. Very neat.

Also, Sybase's claim that you need less disk than the actual raw data you have is true. (Why a vendor's claim being true is a surprise is another story.) So, if you have 20GB of raw data you are very likely to be able to store that data in a DW of 40GB (being mirrored). You will need some separate work space and you might keep a staging area too, which will need to be added. But compared to every other database, you are going to use way less storage with IQ, which means less hardware is required all round because the amount of IO required is way down on the other databases. The downside is, of course, this is a DSS/DW only database. You just can't issue updates in any sensible amount of time. This does mean the ETL needs to perform delete-inserts and not updates, which means you need more temporary tables for the updates, which makes the ETL more complicated. The alternative is to always perform loads, which are blindingly fast, but loads of large tables work against minimising the elapsed time between application update and the data being available in the DW. There are some other features missing from IQ at the moment, like storage of BLOBs. But as a dedicated database to store numeric/character data for a star schema DW it is well worth a look if you want to step outside your current RDBMS box.

Red Brick, now owned by IBM, also deserves a mention. I haven't checked up to see what IBM is doing with this database, so you want to check before you buy it. However, Red Brick was built from the ground up to support star schema DWs.
It is faster than anything except IQ; it has an intelligent loader for maintenance of summaries; it uses less space than anything except IQ; and it has SQL extensions that allow it to perform tasks usually reserved for the MOLAP and ROLAP tools. Worth a look if you have very large fact tables to deal with, like 500M rows and above. Having said all that, in general I recommend:
A word on moving between databases. I remember being told in 1983 that Unix was going to take over the world because it would allow companies to seamlessly and easily move applications from one hardware/database vendor to another and break IBM's stranglehold on the hardware, operating system and database markets. Well, IBM's stranglehold on these markets is well and truly broken, but I see very little evidence that people are moving from one database vendor to another, or even from one hardware vendor to another keeping the same database, for their data warehouses. It seems to me once the DW is installed and running, it stays on whatever it was installed on. I see no reason for this to be the case except that when you buy the database you actually pay the whole price for it. You don't pay a monthly license fee, so there is little to gain by moving and a lot of money to pay for the next vendor's license fee. Simply put, once the vendor gets your money for the DW database license they have got it, unless you can re-deploy the license. The point? I recommend you ask for a 'loaner' of the database software during the development lifetime of the DW database. While you haven't given the vendor the money for the licenses they are more likely to be helpful. The vendor may or may not agree to giving you the loaner; after all, it delays their revenue.
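To put the storage rules of thumb above into numbers, here is a minimal sizing sketch in Python. The 5x/10x multipliers for SQL Server 7.0 and the 'raw data size or less, roughly 40GB mirrored for 20GB raw' figure for Sybase IQ come straight from the comments above; the function and the idea of expressing them as multipliers are my own illustration, not vendor guidance.

def estimate_dw_disk_gb(raw_gb: float, database: str, mirrored: bool = True) -> float:
    """Very rough disk estimate for the warehouse itself, excluding staging and work space."""
    # Rules of thumb from this answer, not vendor formulas:
    multipliers = {
        "sqlserver7": 5.0,   # data + indexes + summaries + working space
        "sybase_iq": 1.0,    # IQ stores the data in about the raw data size or less
    }
    base = raw_gb * multipliers[database]
    return base * 2 if mirrored else base

if __name__ == "__main__":
    for db in ("sqlserver7", "sybase_iq"):
        # For 20GB raw data this prints roughly 200GB and 40GB respectively.
        print(db, estimate_dw_disk_gb(20, db), "GB of disk for 20GB of raw data")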
A002. What hardware should I buy for my DW?

This response is really only for those folks who are not going to build a large DW, like in excess of 200GB of raw data to store. Folks who have in excess of 200GB of raw data to store will have their own hardware people who can work out the best way to go forward for their companies. Again, it very much depends on the usage profile of the database as well as the database you put on top of it. You are going to need more hardware to run Oracle/DB2/SQL Server than you are going to need for Teradata, Sybase IQ or Red Brick. Also, you can avoid a huge amount of database access by using an MDDB like Essbase, TM1, Analytical Services etc.

The choice of hardware is more driven by the choice of operating system. Will you use Win2000 or unix? And for those of you who know me, yes, you might even use OS/390. (Because I was building DWs for IBM in 1991, quite a lot of my early experience was MVS.) But there are so few DWs being built on OS/390 nowadays that it's become a non-discussion. When I was managing Ardent Professional Services for Asia Pacific I think we had in excess of 100 customers using Prism and DataStage across Asia Pacific and only one of them was on MVS. The MVS site did not have any unix at the time they built their DW and it was going to be too big for NT 4.0.

The only way to be sure that the database will support the load that you are going to put on it is to try it out. This is not always easy, but most of the hardware vendors now have centres available for customers to try out various data warehouse schemes and measure the performance. I generally recommend:
A good way to get an idea of what people are doing at any given point in time is to publish as much as you know about what you want to do on dwlist and have people feed back comments on what you propose. People on dwlist have proven quite helpful and can be more so if you provide them with enough information to go on.

Even having worked for vendors over the years, I would advise that you are always sceptical of vendor performance claims unless they are willing to make a performance guarantee in writing. I have had cases where I have been told all sorts of things that evaporate when I asked for it in writing on company letterhead. I have been on the receiving end of misinformation from vendors way too often to recommend that you take what they say at face value. It's a shame, and it's disappointing, because it means that the honest guys working for vendors get tarred with the same brush as the ones who are happy to 'stretch the truth'. This was one of my hardest lessons when I moved into marketing for IBM. I just could not believe what some sales people told some of my customers. The most amazing case I came across was one where a $US5M box was sold to one of our customers when publicly available information (product specs of the box) clearly indicated that the solution would not work. IBM knew it and the competitive sales representative knew it (he told me so). We tried to tell the customer and the customer didn't want to listen. The result was in excess of $US5M wasted and quite a few people losing their jobs. The whole debacle was 'hushed up' and the customer would never talk to anyone else about this experience. This world is a 'buyer beware' world. Your company should hold you responsible for the results you get from the DW and you should make sure you know what you are buying.

If you are in the region of <=10 concurrent query users and <20GB raw data you are well within the Win2000 range with any of the databases mentioned above. This covers a lot of companies. Also, if you are pushing most of the analytical queries to a separate multi-dimensional database, which will usually sit on a separate server, you can get away with a much larger database on Win2000 because the query load is so low. As you increase the number of people who are concurrently connecting to the database and increase the volume of raw data housed in the database you are pushing up against the limits of Win2000. Those limits are changing very quickly as new servers come out. Nowadays (2002), I'm more inclined to say that once you are over around 30 concurrently connected database users or above the 100GB raw data range (which is going to be 800-1000GB of disk) you are going into unix territory.

Then you need to decide between IBM, SUN and HP. I have had very good results working with SUN as a vendor, with HP closely following behind. I haven't dealt with IBM as a unix vendor. In one deal where I managed the selection process between SUN and HP, the two companies both distinguished themselves for the level of effort and professionalism displayed, despite the fact we were asking them to work quite unreasonable hours. Buying hardware in this range is quite an investment and you will want to perform some testing to make sure that what you are buying will work. You may not be able to easily re-deploy a large unix box within your company and there is no point buying wasted capacity. One thing not to forget when sizing the hardware is the batch processing load required and the batch window available.
So many Win2000/Unix people I meet seem to have no idea that loading/updating a DW can often require very substantial batch processing for the integration portion of the work. They seem to want to quote 'load times', which are less than 10% of what is going to happen in most cases. In some cases the batch processing requirements will dictate more hardware than the daily query requirements will dictate. This is often the case when files are arriving late, are required early, and you need to rebuild MDDB cubes.

Something that has surprised me over the years is that quite a few people don't know how much an empty CPU or IO slot costs. I've seen quite a few boxes bought where there are lots of spare slots and the buyer has been talked into buying 'spare slots' for 'later growth' by the hardware salesman. Given the rapid decline in machine prices, any spare slots sitting around for any length of time (like 12 months) are money wasted. My own take is to have as few spare slots as possible, even if it means some extra testing to do the sizing. In many cases, if you really don't know how big the box should be but can't afford to undersize the machine, then buy the extra processor or IO cards and make use of the capacity doing something useful. Spare computing capacity is something that is easily used up.
A003. What front end tools should I buy for my DW?

The best source of information to select a front end tool for your data warehouse is The OLAP Report. There is no way I can touch the detail provided by Nigel, so if you have budget for it, this is the site I recommend. He has recently published the 2001 market share numbers. I recommend browsing around the OLAP Report pages to get a feel for the background of many of the vendors. Note that some notable vendors are not in The OLAP Report because their products are reporting tools more than OLAP tools. These are Actuate, Hummingbird and IBI. These companies all have successful report deployment tools.

What front end tool should I buy for my DW is probably a harder question to answer than any of the previous questions. The main reason is that there are so many more tools out there than databases or machines, and the tools change so quickly. The advice I have been giving over the last few years was to take the end user tool decision as a 'tactical' decision. That is, make a decision that you do not have a problem bailing out of in the 2-3 year time frame. The reason being that the competition was so hot that there was no guarantee which tools would be around in the future. Today this has changed somewhat. There are still a lot of really good tools that may or may not be around in a few years' time, but there have emerged some clear 'winners' who I would be happy to put my money on being around for a while. Today the 'winners' for end user tools (in my opinion) are as follows. These are the 'safe bets' for buying end user tools. First, a blanket comment. Each of these vendors is well aware that they are the leaders in their area and the price tags of the products reflect this. As far as I am concerned these guys are pretty expensive.
So if they are the 'safe bets' what other tools are around that are worth looking at?
Why is SAS not in this list? I hear you ask. I have had good relationships with SAS for years. I wrote a data warehousing methodology for SAS way back in 1995 which evolved into what they decided to use around the world. My opinion: they have a 'database' which is really, really fast and stats capability to die for. Virtually unchallenged as far as I am concerned. In Australia, once they stopped trying to sell customers the idea that the SAS database could manage terabytes of data as the data warehouse, and started saying there was a place in the world for relational databases but that you should pull data into SAS for stats analysis, their 'data warehousing' sales kicked up quite a bit. In summary, SAS should be successful in this space. I don't know why they are not way more successful than they are.
Why is Gentia not in this list? I hear you ask. Gentia have been acquired by a company called 'Open Ratings'. If you go to www.gentia.com you will be re-directed to balancedscorecard.com, and all the documentation you will find there is about the implementation of balanced scorecards and the applications sold by 'Open Ratings'. It is not at all clear if you can buy Gentia separately from buying the applications, so I've left it out of this discussion. If you want to buy a balanced scorecard application, feel free to go to balancedscorecard.com.
A004. What ETL tool should I buy for my DW?

Let's start with what not to buy. Don't buy PL/SQL and think you can write all your ETL in it. I can't believe the number of Oracle DBAs who think that anything you need to do can be done in PL/SQL. I think that when people do the Oracle DBA classes they must get 'inoculated' with the PL/SQL needle so that they are allergic to anything else. No other database DBAs I meet are anything like as fixated on writing stored procedures in the database language. Stored procedures, today, are hard to write, hard to read, hard to maintain, destroy your ability to understand data heritage, and completely and totally lock you into the database engine for ever and ever amen. Not that anyone actually moves hardware or database once the system is installed, which was the whole 'purpose of open systems' as it was preached to me in the early 80s. You would be better off with Cobol, C, Perl or any popular programming language than you are with PL/SQL.

Let's go on to what to think about. All the ETL tool vendors want to sell you their tool. I did when I worked for Ardent. You can't blame a vendor for trying to sell you their wares. My own opinion is that it is only the 'lunatic fringe' who are going to build anything remotely complicated without some kind of ETL tool. Unfortunately, I think that still puts more than 50% of all installed data warehouses into the 'lunatic fringe' category. When I was working at Ardent the major competitor our sales reps had was not Informatica, it was the Oracle DBA who believed PL/SQL can do anything an ETL tool can do, better and faster. I personally believe that every sizable data warehouse project should be implemented using the tools available today, despite the fact that they still have their limitations.

That doesn't mean you have to buy the tool right up front though. You have to decide how you are going to approach the project. You need to decide if you are going to build your data warehouse 2 or more times or whether you are only going to build it once. That is not as strange as it sounds. If you plan to build it a couple of times, using prototypes and small subject areas, you might want to delay the purchase of an ETL tool until you know more about what it is you are going to build and learn how to build it. You can delay the learning curve on the tool so that you can focus on getting your pilot projects working. I don't recommend you believe any ETL tool vendor that says you can go on a class for a day or two and go back to your office and build a data warehouse with it. Most of these tools are very easy to use to do simple things and way more difficult to use to do something complicated. I know because I am well versed in both Informatica and DataStage. For example, on a recent project the most difficult mapping took a very knowledgeable Informatica developer 10 days to write and then we took another 5 days to test it and get it working. 15 days to write and test a very difficult mapping is still a long time. Without the level of skill that we were applying it would be unlikely that we could ever have written that mapping. If you plan to build your data warehouse only once, and you want to write correct code that will be scalable and robust into the future, don't plan on writing this code yourself. Get someone with good experience in whichever tool you select. Those people will cost you good money but the good ones are well worth it.
Setting up standards for a new site and implementing the early releases of the data warehouse was 'bread and butter' consulting for consultants at Ardent as well as at our business partners. Projects where we could set up the standards before a line of code was written always went better than the projects where the customers 'saved money' by reading the manuals, not going to the training and not getting help getting started. It is pretty much like buying a C++ compiler, reading the manual and believing you are ready to write applications. It just isn't like that, and never has been.

Having 'ranted' about using an ETL tool, the next question that comes up is 'which one?' Good question. Everyone is getting into this 'data integration' market. Four years ago I thought we were going to see a 'rationalisation' of this segment because the market was not paying high prices (remember Prism was quite expensive) for the tools and Microsoft had announced they were going to embed an ETL tool into SQL Server 7.0. Not long after, Carleton was bought by Oracle for $US8M, spare change for Larry. 'The market is only going to support 2-3 independent ETL vendors' was my opinion. Well, shows you how wrong I can be!!! Today there seems to be a growing number of ETL tools to choose from and the database vendors are spending large amounts of money putting more function into their ETL tools. Meanwhile the leading ETL tool vendors are already trying to get out of the way of the database vendor steamrollers. It's a pretty crazy market.

So what are my choices? The leaders are Ascential with DataStage and Informatica with PowerCenter. With so few people fluent in both tools, and even fewer people having implemented large complex data warehouses like Sybase IWS in each tool, I am constantly asked to compare and contrast these two and recommend one over the other. To date I have refused, and I will continue to refuse, to state any public opinion. I hope to work with both companies in the future and don't want to alienate either of them!! What I do say is that I can categorically guarantee that you can build large star schema data warehouses using either of these tools using the merchant databases Oracle, DB2 and SQL Server as well as Red Brick and Sybase IQ. DataStage has Teradata support that I have seen work. I have not seen Informatica work with Teradata; I expect it does. I also categorically state that you cannot make a mistake buying either of these two ETL tools. It's like buying Oracle or DB2 for the database engine. Either is eminently justifiable.

What is strange is that there are now more realistic competitors than there were four years ago. This is despite the fact that there is significant price pressure on ETL tools and the overall market is tiny compared to, say, database revenues. I mean, Informatica and Ascential only made $US80M between the two of them in 2002Q3. This is a rounding error at Oracle, Microsoft or IBM. Oracle Warehouse Builder is now worth a look if you are using Oracle. DTS is worth a look if you have simple requirements and want to do things on the cheap. IBM even has a thing called Warehouse Manager despite selling DataStage; I haven't even looked at Warehouse Manager so no opinion there. SAS still market their tool Warehouse Administrator and it is good enough to take data from a data warehouse and put it into SAS datasets for analysis.
It is arguable whether it is the best tool to choose to place data into the data warehouse unless you are of a size to use all SAS databases for your data warehouse. Cognos and Business Objects have both jumped into the ETL game with acquisitions. Both figure that if a sale of their analytical applications is going to generate ETL tool revenue then they might as well take that revenue themselves. It looks like if you buy the analytical applications you are going to buy their ETL tool, so you might as well take a look at it. Sagent is still out there. But with all this other activity in the marketplace I wonder if they might not be the next one bought by a deep-pocketed tools or database vendor, or might they be pushed out of the market by the larger vendors. For 2002Q3 they announced $US8.8M in revenue and a $US9.1M loss. I can't believe that will go on for long. It's not like they have a fortune in the bank like Ascential. It now looks like the ETL tool world is going to become a world where each of the major vendors, database and end user, has an ETL tool to sell. And there will be a few (two) 'best of breed' ETL tools out there making a living by being better than the tools the vendors have.

OK, you've said a lot, but I'm still no closer to knowing which one to select. Absolutely true. All I have said is that DataStage and Informatica will absolutely do what you want. The rest, well, you will just not find many people with good skills across many ETL tools to help you out in making a choice. It is unusual for people to gain experience across many of the ETL tools because of the time and effort it takes to learn them in detail. You really have to implement similar data warehouses with the tools to be able to compare them. And by the time you have done that it's two releases later and feature leapfrogging has taken place. In many ways you are very likely to be best off buying the tool that has the most licenses in your area and the most people available to work on your projects. Remember, the ETL work is still the largest single piece of work on the project, at 60-80% of the total work on most projects. There is no point buying a tool that you cannot get local people skilled in. If you are going to build yourself a sizable application-independent data warehouse the 'safe' bet is to go with one of these two. If you are going with a suite of analytical applications you will probably go with the tool that the vendor embeds with the application. It is much more important to make sure that you learn how to build the data warehouse using the tool that you finish up using.
A005. What is the usual situation with Data Quality?

'The quality of the data is near 100% perfect' is right up there with 'The check is in the mail.' It is very rarely true. The case that I find most often is that the operational system validates data from the perspective that the data needs to be 'valid' for operational system processing. This does not mean the data is 'valid' as in reflects the real world. Also, the operational system will usually maintain data as 'valid' at a point in time and provide little support for the changing values of data over time. When you put data into the data warehouse you will have to take care of changing data values over time. You should expect the data coming to you to be pretty bad. Then you will not be surprised if it is bad and you can be pleasantly surprised if it is good.

Should I then delay the data warehouse build until the data quality is up to scratch? Well, not if you want to actually build it, because it generally takes a long time to bring data quality up to scratch. What should I do then? As a minimum, when you perform the data integration tasks of key lookups for fact record keys from the dimension tables, if you do not find a value you should put a '0' key into the fact record. Don't put a null, because in most databases null=null is false. You then put a record into the dimension table with a 0 key that indicates the key was unknown at lookup time. At the detail level you should also put the key that you looked up into the fact record. This is something of a heresy but this is what I do. You can then use the real key to adjust the '0' key to the correct key at some future point in time. (A small sketch of this lookup appears at the end of this answer.) You should then also write reports in whatever query/reporting tool you are going to use to count the number of '0' keys for every dimension key on every fact table and list the unique values of the real keys. These reports let you find the real keys that are not being mapped to the dimension tables. Then you can do something about it. And what you usually need to do about it is to get people to type 'more correct' data into the operational system. This will take some time. These reports will also let you show the owner of the data the level of quality, in terms of data integration to keys at least.

The other solutions I have tried in the past are error files, which require the files to be re-cycled into the data warehouse, and automagically inserting dimension records for unknown dimension records coming in. I stopped doing either of these by the end of 1995. The main reasons for this being:
That's the minimum. What else might I do? Well, you can start from running reports to find data in error and move right on up to buying a product to measure data quality and perform some repair on the data in the ETL process. There are a number of DQ vendors out there and there are a number of 'gurus' on the subject. I am familiar with the Data Quality engine in DataStage; I know it works, so you could try that. There are others on the market. Larry P. English is one of the 'industry gurus' on Data Quality and if you want to read up on Data Quality he has published what looks like a pretty reasonable book on the topic. I have only browsed the book so I am not offering any opinion on it here. OK, if you have seen DQ issues so many times, what are the biggest headaches I'm likely to have? The biggest headaches you are likely to have are:
Of these, identifying the unique customer is the most problematic. You can buy tools to make intelligent guesses but they are never 100%. And when the customer base is large, what is left over often requires a lot of manual work to correct. In the end the best way to identify a unique customer is for the customer to have a customer number they know about and know to quote to your company every time they contact you. And the best way to do that is to give them yet another plastic card to put into their already bulging wallet/purse. Personally, I think any company who is not interested in me enough to give me a customer number that I know to quote just isn't interested in me. The company that I deal with that I like the most as far as customer number and card goes is Qantas, Australia's international airline. I am a silver frequent flyer, so I get a nice little silver card with my customer number on it. It has a phone number on the back for me to call, and the first thing I need to do when I call is to type in my customer number. When I call I rarely have to wait to get hold of someone, and when I do get hold of someone they are polite and helpful and they talk to me by name. And all through this process Qantas can see what calls I make, what flights I take, what changes I make. It's way easier than trying to figure out who I am after the fact. In summary, don't believe anyone who tells you 'the data quality is fine'. I haven't been to a site yet where it was.
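To make the '0' key approach above concrete, here is a minimal sketch in Python. It uses plain in-memory dictionaries to stand in for the dimension tables, and every name in it (the functions, the column names, the toy data) is my own illustration rather than anything from a particular ETL tool; a real implementation would do the same lookups and reports in your ETL tool or in SQL.

UNKNOWN_KEY = 0  # surrogate key of the 'unknown' row in each dimension table

def lookup_key(dim_lookup: dict, natural_key) -> int:
    """Return the surrogate key for a natural key, or 0 if it is not mapped yet."""
    return dim_lookup.get(natural_key, UNKNOWN_KEY)

def load_fact_row(source_row: dict, customer_lookup: dict, product_lookup: dict) -> dict:
    # Keep the real (natural) keys on the fact row as well, so the 0 keys can
    # be corrected later once the matching dimension rows turn up.
    return {
        "customer_key": lookup_key(customer_lookup, source_row["customer_id"]),
        "customer_id": source_row["customer_id"],
        "product_key": lookup_key(product_lookup, source_row["product_id"]),
        "product_id": source_row["product_id"],
        "amount": source_row["amount"],
    }

def unknown_key_report(fact_rows: list, key_col: str, natural_col: str):
    """Count the 0 keys for one dimension and list the unmapped natural keys."""
    unmapped = [r[natural_col] for r in fact_rows if r[key_col] == UNKNOWN_KEY]
    return len(unmapped), sorted(set(unmapped))

# Example usage with toy data: natural key -> surrogate key
customers = {"C001": 1, "C002": 2}
products = {"P9": 10}
facts = [load_fact_row(r, customers, products) for r in [
    {"customer_id": "C001", "product_id": "P9", "amount": 5.0},
    {"customer_id": "C999", "product_id": "P9", "amount": 7.5},  # unknown customer
]]
print(unknown_key_report(facts, "customer_key", "customer_id"))  # (1, ['C999'])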
A006. Inmon vs Kimball

Again, this is just my opinion and does not represent the opinions of Ralph or Bill. I publish this here because I am asked about this so often by people new to data warehousing. Firstly, I have had the benefit of meeting both Ralph and Bill and saying 'Thank You' to each of them for contributing so much to my family. They have both been pioneers in what we now call Data Warehousing and Business Intelligence. I have been one of the many who have directly benefited from their pioneering efforts. My understanding is that nowadays Bill and Ralph pretty much do their own thing and don't collaborate on articles or books. So let's, for the sake of this discussion, leave out the names Bill and Ralph and just look at some of their published ideas.

I first saw the time variant data with stability analysis model in Building the Data Warehouse, first edition. This was back in 1993. Around the same time I was also trained, briefly, by Metaphor Computer Systems on how to design star schema databases. Metaphor were the leading star schema design company back then. Having learned both modelling methods at the same time, there was never any contradiction between an archival data store and an analytical data store to me. Each model does things that the other model finds difficult or impossible. Most importantly for me, I had built this multi-level single wide table analytical database and we found that the star schema improved our lot in life a great deal. A star schema can store detailed level data, lightly summarised data and highly summarised data as outlined in Building the Data Warehouse. (Indeed, this is exactly what my C Data Warehouse Generator does.) A star schema can also store time variant data, as has been widely implemented in 'Type 2' dimensions. In those early days I tended to pick the type of data model that best supported the client's needs. I must say, it was almost always a star schema because I was building analytical applications.

In 1995, a little known figure in our industry, who should be better known, Steve Mongulla, came up with the idea of combining the published concepts of archival data and star schemas. That is, build an archival data store and propagate the data forward to analytical star schema databases. Steve and I talked about this for a while and we both thought it was a good idea so we tried it out. Steve and his work played no small part in PricewaterhouseCoopers' success in the data warehouse space. As far as I am concerned, since 1995 the combination of the archival data store and the star schema/analytical data store has been the 'best of breed' way to build the most sophisticated and robust data warehouse. It has also been the most expensive. Not all customers have gone for the combination of both design methods; many customers buy the less expensive star schema method. However, I have always presented the choice to the customer so they can make an informed decision about what they forgo when there is no archival data store. Of late, customers have become more sophisticated in understanding their needs and are seeing the value of the combination of the models. For some years after 1995 I was on somewhat of a 'search for the holy grail', trying to figure out how to put archival data into a star schema to avoid the high cost of the archival layer. I'm pleased to say that search has come to an end with Joe Oates and Sybase IWS. These are some names you are going to hear more about.
The method used to store archival data inside a star schema is proprietary information of Sybase, but I will say this: IWS design techniques make the archival layer almost redundant, and these design techniques are how we will all be building DWs in 5 years' time. It's just that Sybase has got the jump on the rest of the world at the moment.
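As a small aside on the 'Type 2' dimensions mentioned above, here is a hedged sketch in Python of how a Type 2 change is commonly handled: the current version of the dimension row is end-dated and a new row with a new surrogate key is added, so history is kept. The table layout, column names and in-memory list are my own illustration, not the IWS method or any vendor's implementation.

from datetime import date

HIGH_DATE = date(9999, 12, 31)  # stands for 'current row, no end date yet'

def apply_type2_change(dim_rows, natural_key, new_attrs, effective):
    """Close off the current version of a dimension row and add a new version."""
    next_key = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["end_date"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows              # nothing changed, keep the current row
            row["end_date"] = effective      # end-date the current version
    dim_rows.append({
        "surrogate_key": next_key,
        "natural_key": natural_key,
        **new_attrs,
        "start_date": effective,
        "end_date": HIGH_DATE,
    })
    return dim_rows

# Example: a customer moves city; the old row is closed and a new one added.
dim = [{"surrogate_key": 1, "natural_key": "C001", "city": "Sydney",
        "start_date": date(2001, 1, 1), "end_date": HIGH_DATE}]
apply_type2_change(dim, "C001", {"city": "Melbourne"}, date(2002, 6, 1))
print(len(dim))  # 2 rows, so the full history of the customer's city is kept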
A007. Where to find Data Models and who is selling Models?

There are some models available for 'free' and some models being sold as products by a variety of vendors. Also, 'analytical applications' is now a much-hyped area and all these products have some form of data model underneath them. The 'free' ones I know about are:
If you are thinking about paying real money for comprehensive models, the 'fee' ones I know about are:

Naturally the 'fee' ones are far more extensive than the 'free' ones. For example, with Sybase IWS it is not uncommon to have very significant numbers of fact tables and dimension tables. On a recent project I think we had 80+ dimension tables and 20+ fact tables.
Sybase have some features built into IWS that are very, very clever. A comment I would make here is this. I've found that a lot of people have tried to read all these books and then jump in and design the enterprise data warehouse for their company. This is fine for a small to medium enterprise. However, for a large company this usually produces a model with quite a few pieces in it that are less than optimal. Nowadays I recommend companies engage a consultant who has built quite a few data warehouses and knows how to deal with all the 'difficult' areas to model. Make training your local people part of the responsibility of the consultant. The main reason I recommend this is that designing a good data warehouse data model does not take that much time. Customising a good data model to your specific needs takes even less. Having someone who has done it over and over again is well worth the expense. Changing a data model that is in production takes a lot of time and money, because you have to change the ETL feeding the model and the reports built on top of the model.
Send mail to peter@peternolan.net with questions or comments about this web site.